Developing Local AI Systems: Deploying Ollama and Llama 3 with Docker

Running heavy language models on your own hardware used to be a nightmare of dependency hell and mismatched CUDA versions. With Ollama and Docker, we’ve finally hit a point where deploying a local LLM feels like spinning up any other microservice. I’ve been building internal tooling for local RAG (Retrieval-Augmented Generation) pipelines, and containerizing the inference engine is the only way to keep development environments consistent across the team.

The Architectural Shift

When I deploy models locally, I treat Ollama as a sidecar container. The primary advantage here is isolation. By keeping the model runtime in a Docker container, I don't pollute my host OS with Python environments or specific library versions.

From an operational standpoint, the biggest bottleneck is GPU passthrough. You need the NVIDIA Container Toolkit installed on your host. Without it, your container won't see the VRAM, and you’ll be stuck running inference on your CPU, which is painfully slow for anything beyond a simple chat interface.

Setting Up the Infrastructure

I prefer using a docker-compose.yml file to manage the lifecycle of the Ollama service. It makes scaling and updating the model as simple as changing a tag and running a compose command.

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama_container
    ports:
      - "11434:11434"
    volumes:
      # Persist models so you don't re-download 5GB+ every restart
      - ./ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  # Simple web UI for testing the Llama 3 instance
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

Automating Model Loading

One common pitfall I see is assuming the model is ready as soon as the container starts. Ollama needs a moment to initialize. If your application sends a request immediately, it will fail. I use a simple Bash script to verify the service is ready before I trigger the model pull.

#!/bin/bash
# wait-for-ollama.sh

echo "Checking if Ollama is responsive..."
until curl -s http://localhost:11434/api/tags | grep -q "llama3"; do
  echo "Model not found. Pulling Llama 3..."
  docker exec ollama_container ollama pull llama3
  sleep 2
done

echo "Llama 3 is ready for inference."

Performance and Debugging Insights

If you’re running into performance issues, check your shared memory. Docker defaults to a 64MB /dev/shm, which is often too small for large language models. If you see inference crashes or "out of memory" errors despite having free VRAM, add shm_size: '1gb' to your docker-compose.yml service definition.

Another thing I’ve noticed: watch your host's disk space. Each model layer is cached. If you’re experimenting with different models (e.g., Llama 3, Mistral, Phi-3), your ollama_data directory will grow rapidly. I run a cleanup cron job on my dev machine to prune unused images to keep the local registry tidy.

Why This Works for Development

By decoupling the inference engine from the application logic, I can swap out the model backend without touching my core application code. If I want to test a newer model, I just update the environment variable or the pull command. This setup gives me a production-like environment on my laptop, ensuring that when I eventually push to a server with a dedicated A100 or H100, the API calls and logic behave exactly as they did during local development.

The Architectural Shift

Setting Up the Infrastructure

Automating Model Loading

Performance and Debugging Insights

Why This Works for Development

Aditya Shenvi