Developing Local AI Systems: Deploying Ollama and Llama 3 with Docker
Running heavy language models on your own hardware used to be a nightmare of dependency hell and mismatched CUDA versions. With Ollama and Docker, we’ve finally hit a point where deploying a local LLM feels like spinning up any other microservice. I’ve been building internal tooling for local RAG (Retrieval-Augmented Generation) pipelines, and containerizing the inference engine is the only way to keep development environments consistent across the team.
The Architectural Shift
When I deploy models locally, I treat Ollama as a sidecar container. The primary advantage here is isolation. By keeping the model runtime in a Docker container, I don't pollute my host OS with Python environments or specific library versions.
From an operational standpoint, the biggest hurdle is GPU passthrough. You need the NVIDIA driver and the NVIDIA Container Toolkit installed on your host. Without them, your container won't see the GPU or its VRAM, and inference falls back to the CPU, which is painfully slow for anything beyond a simple chat interface.
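A quick way to confirm passthrough is to ask the container itself how many GPUs it can see. The helper below is a minimal sketch: the function name is mine, and it assumes nvidia-smi is available inside the container (the NVIDIA runtime normally injects it when the "utility" driver capability is enabled).

```shell
#!/bin/bash
# check-gpu.sh (hypothetical helper, not part of Ollama or Docker)

# Print the number of GPUs visible inside a running container.
gpu_count_in_container() {
  local container="$1"
  # `nvidia-smi -L` prints one "GPU <n>: ..." line per device.
  # If the command fails (no toolkit, no GPU), grep sees no input
  # and the function prints 0.
  docker exec "$container" nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
}
```

Calling gpu_count_in_container with the name of the running Ollama container should print 1 or more on a working setup; a 0 means inference will run on the CPU.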
Setting Up the Infrastructure
I prefer using a docker-compose.yml file to manage the lifecycle of the Ollama service. It makes updating the Ollama runtime as simple as changing the image tag and running a compose command.
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama_container
    ports:
      - "11434:11434"
    volumes:
      # Persist models so you don't re-download 5GB+ every restart
      - ./ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
  # Simple web UI for testing the Llama 3 instance
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
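Once the stack is up, I like a one-line sanity check that the published port really reaches Ollama. The sketch below reads the version from the /api/version endpoint; the function name and the sed-based JSON extraction are my own shortcuts, and a real JSON parser would be more robust.

```shell
#!/bin/bash
# Print the Ollama version reported by the API, or nothing if the
# service is unreachable. The port defaults to the one published
# in the compose file.
ollama_version() {
  local port="${1:-11434}"
  # GET /api/version returns JSON of the form {"version":"x.y.z"}
  curl -sf "http://localhost:${port}/api/version" \
    | sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

A typical sequence is to run docker compose up -d and then call ollama_version; empty output means the API is not answering yet.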
Automating Model Loading
One common pitfall I see is assuming the model is ready as soon as the container starts. Ollama needs a moment to initialize. If your application sends a request immediately, it will fail. I use a simple Bash script to verify the service is ready before I trigger the model pull.
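#!/bin/bash
# wait-for-ollama.sh
set -eu
echo "Checking if Ollama is responsive..."
# Block until the API answers; -f makes curl fail on HTTP errors
until curl -sf http://localhost:11434/api/tags > /dev/null; do
  sleep 2
done
# Pull only if the model is not already in the local store
if ! curl -sf http://localhost:11434/api/tags | grep -q "llama3"; then
  echo "Model not found. Pulling Llama 3..."
  docker exec ollama_container ollama pull llama3
fi
echo "Llama 3 is ready for inference."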
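In CI or on a shared machine, an unbounded wait can hang forever if the container never comes up. A minimal sketch of a bounded readiness check follows; the function name, the attempt budget, and the delay defaults are my own choices, not anything Ollama prescribes.

```shell
#!/bin/bash
# Poll a URL until it answers or the attempt budget runs out.
# Usage: wait_for_http URL [MAX_ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the URL responds, 1 if every attempt fails.
wait_for_http() {
  local url="$1" max_attempts="${2:-30}" delay="${3:-2}"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -f turns HTTP errors into a non-zero exit status
    if curl -sf "$url" > /dev/null; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_http http://localhost:11434/api/tags 30 2 gives the service roughly a minute before the script gives up with a non-zero status.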
Performance and Debugging Insights
If you’re running into performance issues, check your shared memory. Docker defaults to a 64MB /dev/shm, which can be too small for some inference workloads. If you see inference crashes or "out of memory" errors despite having free VRAM, try adding shm_size: '1gb' to the ollama service definition in your docker-compose.yml.
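Before changing anything, it helps to confirm what the container actually has. This is a small sketch that reads the size of /dev/shm from inside the container; the function name is hypothetical and it assumes df is present in the image.

```shell
#!/bin/bash
# Print the size of /dev/shm inside a container, in whole megabytes.
shm_size_mb() {
  local container="$1"
  # df -Pk prints a header line, then one line whose second column
  # is the filesystem size in 1K blocks.
  docker exec "$container" df -Pk /dev/shm | awk 'NR == 2 { print int($2 / 1024) }'
}
```

With Docker's default the function prints 64; after setting a larger shm_size and recreating the container, the number should reflect the new value.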
Another thing I’ve noticed: watch your host's disk space. Every model you pull is stored as a set of layers in the mounted volume. If you’re experimenting with different models (e.g., Llama 3, Mistral, Phi-3), your ollama_data directory will grow rapidly. I run a cleanup cron job on my dev machine to prune unused Docker images, but that only tidies the image cache: the models themselves live in ollama_data and have to be removed with ollama rm.
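For the model side of the cleanup, here is a minimal sketch built on the ollama list and ollama rm commands. The two function names are mine, and the parsing assumes ollama list prints a header row followed by one row per model with the name in the first column.

```shell
#!/bin/bash
# List the models stored in the Ollama volume, one name per line.
list_models() {
  local container="$1"
  # Skip the header row and any blank lines; column 1 is name:tag.
  docker exec "$container" ollama list | awk 'NR > 1 && NF > 0 { print $1 }'
}

# Remove every stored model except the ones listed after the container name.
# Usage: prune_models_except CONTAINER KEEP_MODEL...
prune_models_except() {
  local container="$1"
  shift
  local model wanted keep
  for model in $(list_models "$container"); do
    keep=no
    for wanted in "$@"; do
      if [ "$model" = "$wanted" ]; then
        keep=yes
      fi
    done
    if [ "$keep" = no ]; then
      docker exec "$container" ollama rm "$model"
    fi
  done
}
```

Something like prune_models_except ollama_container llama3:latest keeps only the model I'm actively using. It deletes data, so it is worth running list_models first to see what would go.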
Why This Works for Development
By decoupling the inference engine from the application logic, I can swap out the model backend without touching my core application code. If I want to test a newer model, I just update the environment variable or the pull command. This setup gives me a production-like environment on my laptop, ensuring that when I eventually push to a server with a dedicated A100 or H100, the API calls and logic behave exactly as they did during local development.
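To make the swap concrete, here is a minimal sketch of a client that takes both the endpoint and the model from the environment. OLLAMA_MODEL is a variable name I made up for illustration, and OLLAMA_BASE_URL is reused here as a plain script variable; the request shape follows Ollama's /api/generate endpoint with streaming turned off.

```shell
#!/bin/bash
# Endpoint and model come from the environment, with local defaults.
# OLLAMA_MODEL is a hypothetical variable name, not an Ollama setting.
OLLAMA_BASE_URL="${OLLAMA_BASE_URL:-http://localhost:11434}"
OLLAMA_MODEL="${OLLAMA_MODEL:-llama3}"

# Build the JSON body for /api/generate from a prompt string.
build_generate_payload() {
  local prompt="$1" escaped
  # Escape backslashes and double quotes only. Prompts containing
  # newlines or control characters need a real JSON encoder.
  escaped=$(printf '%s' "$prompt" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"%s","prompt":"%s","stream":false}' "$OLLAMA_MODEL" "$escaped"
}

# Send the prompt and print the raw JSON response.
generate() {
  curl -sf "$OLLAMA_BASE_URL/api/generate" -d "$(build_generate_payload "$1")"
}
```

Switching models is then a matter of exporting OLLAMA_MODEL=mistral before the call, and pointing OLLAMA_BASE_URL at a remote server moves the same script from the laptop to the GPU box.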
Aditya Shenvi
AI Engineer & Full-Stack Architect. Passionate about building intelligent systems, elegant UIs, and scaling web infrastructure. Open to exciting engineering opportunities in April 2026 and beyond.