I first played with GPT-2 back in 2019, when OpenAI released the 1558M parameter model and the whole NLP world lost its mind. I ran it in a Colab notebook, typed in a prompt about Montevideo, and watched it hallucinate streets that don't exist. It was messy, it was slow, and I was hooked.

Fast forward to 2026, and everyone's paying per token to chat with cloud APIs. Which is fine — until you realize you've spent $300 in a month on API calls, your data is sitting on someone else's GPU, and you can't run anything when the internet goes out. Which, in Montevideo during a thunderstorm, happens more often than I'd like.

So I started running LLMs locally. And honestly? It changed how I think about AI. Here's what I learned.

[Image: code on a developer screen]
Running models locally means you own the entire pipeline — no API keys, no rate limits, no surprise bills.

The Hardware Question: What Do You Actually Need?

Let me be straight: you don't need a data centre. But you do need to be honest about what you want to run.

Here's what I've tested on my hardware, and what actually works:

| Model | Parameters | Quantisation | RAM needed | My experience |
|-------|------------|--------------|------------|---------------|
| Phi-4-mini | 3.8B | Q4_K_M | ~3 GB | Runs on anything. Surprisingly good at code. |
| Llama 3.1 8B | 8B | Q4_K_M | ~6 GB | The sweet spot for quality vs speed on consumer hardware. |
| Mistral 7B | 7B | Q4_K_M | ~5 GB | Great for reasoning tasks. Fast inference. |
| Qwen 2.5 14B | 14B | Q4_K_M | ~10 GB | Needs 16 GB VRAM to breathe. Worth it for complex tasks. |
| Llama 3.1 70B | 70B | Q4_K_M | ~40 GB | Not happening on a single GPU. Mac Studio or multi-GPU only. |

I run the 8B models on a single GPU with 12 GB VRAM and they fly. The 14B models are usable but you notice the latency. The 70B models? Forget it on consumer hardware — that's what cloud APIs are still for.
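Those RAM figures follow a rule of thumb you can compute yourself: bytes per weight times parameter count, plus headroom for the KV cache and runtime buffers. Here's a minimal sketch; the 4.5 bits per weight and 20% overhead are my own rough approximations, not exact Q4_K_M numbers:

```python
def approx_ram_gb(params_billion: float,
                  bits_per_weight: float = 4.5,
                  overhead: float = 1.2) -> float:
    """Rough RAM estimate for a quantised model: weight bytes
    plus ~20% headroom for the KV cache and runtime buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

print(approx_ram_gb(8))   # → 5.4, close to the ~6 GB in the table
```

It errs a little high for big models (real Q4_K_M files for 70B come in closer to 40 GB), but it's good enough to answer "will this fit?" before you start a multi-gigabyte download.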

Ollama: The Easiest Way to Get Started

I've tried a lot of local inference setups — llama.cpp, text-generation-webui, vLLM — and for most people, Ollama is where you should start. It's a single binary, it handles model downloading and quantisation for you, and it has a dead-simple API.

# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model — that's it
ollama run llama3.1:8b

# Or pin a specific quantisation (exact tag names vary — check the Ollama model library)
ollama run mistral:7b-q4_K_M

Behind the scenes, Ollama is using llama.cpp with optimised builds for your hardware. It automatically picks the right GPU backend (CUDA, ROCm, Metal). You don't need to compile anything or fight with CMake for three hours. Trust me — I've done that, and it's not how I want to spend a Saturday anymore.
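Once the server is running, everything goes through a small HTTP API on port 11434. A quick way to sanity-check the install from Python, using only the standard library and Ollama's /api/tags endpoint (which lists the models you've pulled):

```python
import json
import urllib.request

def model_names(tags_payload: dict) -> list[str]:
    """Pull the model names out of Ollama's /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask the local Ollama server which models are already pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.loads(resp.read()))
```

If `list_local_models()` comes back empty, you haven't pulled anything yet; if it raises a connection error, the server isn't up.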

Docker Makes It Reproducible

If you've read my Docker posts before, you know I containerise everything. Local LLMs are no different. Here's my setup:

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:

Two containers. One command to start. And Open WebUI gives you a ChatGPT-like interface running entirely on your machine. No API keys, no monthly bills, no data leaving your network.

This is the same pattern I've been using since 2017 when I wrote about Docker Compose for Hadoop/Spark/Kafka clusters. The tools change, but the approach doesn't — containerise it, make it reproducible, move on.

[Image: developer workspace with code on screen]
Running LLMs locally means your prompts, your code, and your data never leave your machine.

When Local Beats Cloud

Here's where local models genuinely outperform cloud APIs for day-to-day work:

Code Assistance

I use local LLMs as coding companions. Not for writing entire applications — that's still hit or miss — but for the stuff I used to open Stack Overflow for:

# Ask the model to explain a regex
prompt = r"""Explain this regex and suggest improvements:
^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$
"""

Response time? Under 2 seconds on my hardware. Try getting that from an API when you're in the middle of a flow state and the endpoint is rate-limiting you.

Document Processing

Summarising PDFs, extracting structured data from unstructured text, translating technical docs — these are tasks where I'm processing potentially sensitive data. Client contracts, API documentation with proprietary info, internal runbooks. With a local model, that data never leaves my machine.
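Here's the shape of my summarisation helper, sketched against Ollama's /api/generate endpoint. The prompt wording and the 150-word cap are just my defaults; swap in whatever instruction suits your documents:

```python
import json
import urllib.request

def build_summary_prompt(text: str, max_words: int = 150) -> str:
    """Wrap a document in a summarisation instruction for the local model."""
    return f"Summarise the following document in at most {max_words} words:\n\n{text}"

def summarise_locally(text: str, model: str = "llama3.1:8b") -> str:
    """Send the document to the local Ollama server; nothing leaves the machine."""
    payload = {"model": model, "prompt": build_summary_prompt(text), "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

For PDFs I extract the text first and feed it in as plain text; the model never sees the file, and the file never sees the internet.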

Offline Work

Between the thunderstorms in Montevideo and my habit of working from cafés with questionable Wi-Fi, having AI that works without internet is a genuine productivity boost, not a gimmick.

When Cloud Still Wins

Let me be honest about where local falls short:

  • Complex reasoning — the 70B+ models are just better at multi-step logic. Local 8B models will give you an answer, but it might be wrong in subtle ways.
  • Long context windows — cloud models handle 128k+ tokens. Local models on consumer hardware choke past 8-16k.
  • Multimodal tasks — vision and image generation are still better served by cloud APIs, unless you have serious GPU memory.

I use both. Local for 80% of daily tasks, cloud for the remaining 20% that need more firepower. The cost difference is significant — I went from $200-300/month in API calls to roughly $0 for local + $30-50 for cloud on the hard stuff.

Practical Tips from Running This Setup for Months

A few things I wish I'd known earlier:

  1. Watch your VRAM — if the model doesn't fit in GPU memory, it spills to system RAM and performance tanks. Q4_K_M quantisation is your friend — the quality loss is minimal and you cut memory usage by 70%.
  2. Use Modelfiles for customisation — Ollama lets you create custom models with system prompts, temperature settings, and parameter tuning, so you don't repeat yourself:
# Create a Modelfile
FROM llama3.1:8b

PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM """You are a coding assistant focused on Python and JavaScript.
You explain concepts clearly and provide working code examples.
When the user asks about architecture, you consider scalability and simplicity."""

# Build and run
ollama create dev-helper -f Modelfile
ollama run dev-helper
  3. Keep models updated but don't chase every release — the model landscape moves fast. I update monthly, not weekly. The improvement between minor versions is rarely worth the download time.
  4. Combine local and cloud in your scripts — here's a pattern I use in Python:
import httpx

LOCAL_LLM = "http://localhost:11434/api/generate"
CLOUD_LLM = "https://api.anthropic.com/v1/messages"  # for the hard stuff

def ask_llm(prompt: str, use_cloud: bool = False) -> str:
    """Route to local or cloud based on task complexity."""
    if use_cloud:
        # Cloud for complex reasoning; call_cloud_api() stands in for your provider's SDK call
        return call_cloud_api(prompt)
    else:
        # Local for everything else — fast, free, private
        response = httpx.post(
            LOCAL_LLM,
            json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
            timeout=60.0
        )
        return response.json()["response"]

# Daily driver — local and fast
code_review = ask_llm("Review this function for edge cases: ...")

# When you need the big guns
complex_analysis = ask_llm(
    "Analyse this multi-service architecture for bottlenecks: ...",
    use_cloud=True
)
  5. Monitor GPU temperatures — running inference for hours will heat up your GPU. I use nvidia-smi -l 5 in a separate terminal to keep an eye on it. If you're on a Mac, asitop does the same for Apple Silicon.
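The `use_cloud` flag in that routing pattern needs a decision rule. Here's a crude sketch of the heuristic I use; the 8,000-token threshold and the marker words are purely illustrative, so tune them for your own hardware and workload:

```python
def needs_cloud(prompt: str, context_tokens: int = 0) -> bool:
    """Crude router: long contexts and heavy multi-step reasoning
    go to the cloud; everything else stays local."""
    reasoning_markers = ("step by step", "prove", "architecture", "trade-off")
    if context_tokens > 8000:  # consumer-hardware models struggle past ~8-16k
        return True
    return any(marker in prompt.lower() for marker in reasoning_markers)

needs_cloud("Review this function for edge cases")         # False: stays local
needs_cloud("Analyse this multi-service architecture")     # True: reasoning marker
```

It's deliberately dumb. In practice a keyword list plus a context-length cutoff catches most of the 20% of tasks worth paying for, and a wrong guess just means one slower (or pricier) answer.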

The Bigger Picture

When I first ran GPT-2 in a Colab notebook in 2019, it felt like magic. When I run Llama 3.1 on my own hardware today, it still feels like magic — but it's magic I actually understand and control.

There's something satisfying about knowing your AI stack end to end. The model lives on your disk, the inference runs on your GPU, the data stays on your network. No API keys to rotate (or get stolen — I learned that one the hard way), no surprise bills, no rate limits at 2 AM when you're deep in a project.

If you're paying for API calls and haven't tried running a model locally yet, start with Ollama and an 8B model. It takes 5 minutes to set up, and you'll immediately understand whether local AI works for your use case. For most day-to-day coding and text tasks, it genuinely does.

And if you're in Montevideo and the internet goes out during a storm — well, your local LLM doesn't care. It'll still be there, ready to help you debug that regex at 2 AM.