Scaling LLMs in Kubernetes: Setting Up GPU Orchestration with KEDA
Running LLMs in production is a constant battle between keeping GPU costs down and ensuring latency doesn't spike when traffic hits. If you leave your pods running 24/7, your cloud bill will burn through your budget before the end of the month. If you scale too slowly, your users sit staring at a loading spinner.
I’ve spent the last few months refining a pattern using KEDA (Kubernetes Event-driven Autoscaling) to handle GPU-accelerated inference. By moving away from static replica counts and using custom metrics, I’ve managed to keep GPU utilization high while keeping costs predictable.
The Architectural Shift
Standard Horizontal Pod Autoscalers (HPA) look at CPU or memory, which are terrible proxies for LLM load. A GPU-bound model might be sitting at 10% CPU usage while the VRAM is fully saturated or the inference queue is backed up.
Instead, I use KEDA with a Prometheus scaler. I track tgi_request_queue_size (if using Text Generation Inference) or custom metrics from the NVIDIA DCGM exporter, so the cluster scales on actual demand rather than system-level noise. Confirm the exact metric name against your server's /metrics output before wiring it into a query, since names vary by server and version.
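For the DCGM route, a trigger could look roughly like the sketch below. It is a fragment for the triggers list of a KEDA ScaledObject, not a complete manifest. DCGM_FI_DEV_GPU_UTIL is the exporter's GPU utilization gauge; the 80% target and the namespace and pod label names are assumptions that depend on how your exporter and Prometheus relabeling are set up.

```yaml
# Fragment: one entry under spec.triggers of a KEDA ScaledObject.
triggers:
  - type: prometheus
    # Value compares the query result to the threshold directly, instead of
    # dividing it by the replica count first (the AverageValue default).
    metricType: Value
    metadata:
      serverAddress: http://prometheus-service.monitoring.svc.cluster.local
      # Mean GPU utilization (0-100) across the inference pods.
      # Assumes the series carry `namespace` and `pod` labels for the workload.
      query: |
        avg(DCGM_FI_DEV_GPU_UTIL{namespace="inference-prod", pod=~"llama3-inference-service-.*"})
      threshold: '80'   # assumed target; tune for your workload
```

Utilization is already a per-pod ratio, which is why this sketch averages it and uses metricType: Value rather than summing it.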
Configuring the KEDA Scaler
To get this working, you need the NVIDIA device plugin installed and your Prometheus instance scraping the GPU metrics. Here is the YAML configuration I use for a production-grade deployment.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-gpu-scaler
  namespace: inference-prod
spec:
  scaleTargetRef:
    name: llama3-inference-service   # Deployment to scale
  minReplicaCount: 0   # Scale to zero when idle to save costs
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-service.monitoring.svc.cluster.local
        # Deprecated since KEDA 2.10; drop it on newer releases
        metricName: tgi_request_queue_size
        # Target: about 5 pending requests per replica. KEDA's default
        # metricType (AverageValue) already divides the query result by the
        # replica count, so the query returns the total queue depth rather
        # than a per-pod average.
        threshold: '5'
        query: |
          sum(tgi_request_queue_size{pod=~"llama3-inference-service-.*"})
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # Prevent flapping
          policies:
            - type: Percent
              value: 100
              periodSeconds: 15
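With minReplicaCount: 0 the deployment can drop to zero replicas. If a warm replica is wanted during set hours, one option is KEDA's cron scaler added as a second trigger next to the Prometheus one. The sketch below is a fragment only; the time zone and the 08:00 to 18:00 weekday window are assumptions to adjust.

```yaml
# Fragment: an extra entry for spec.triggers, alongside the prometheus trigger.
triggers:
  - type: cron
    metadata:
      timezone: America/New_York   # assumed; any IANA time zone name
      start: "0 8 * * 1-5"         # 08:00, Monday to Friday
      end: "0 18 * * 1-5"          # 18:00, Monday to Friday
      desiredReplicas: "1"         # floor held between start and end
```

When several triggers are defined, the resulting HPA follows whichever one asks for the most replicas, so the queue-based trigger can still scale above this floor.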
Operational Trade-offs
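When you scale to zero, you introduce "cold start" latency. Loading a 70B-parameter model into VRAM isn't instantaneous. There is also a bootstrapping catch: a queue metric exported by the inference pods themselves disappears once no pods are running, so the signal that wakes the deployment from zero has to come from something upstream, such as a gateway or an external request queue. Here is how I handle the trade-offs: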
- Model Caching: I use a ReadWriteMany persistent volume (or S3-backed storage like JuiceFS) to cache model weights on the node-local SSDs. This drastically reduces the time it takes for a pod to become "Ready."
- Readiness Probes: Don't just check if the container is running. Set your readiness probe to hit the /health endpoint of your inference server. If the model isn't loaded into VRAM, the pod shouldn't receive traffic.
- The "Minimum Floor": Even if you want to scale to zero, keep at least one replica running during business hours. The latency hit of a cold start is often unacceptable for user-facing chat interfaces.
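The caching and readiness points above can be sketched as a Deployment. This is a minimal sketch, not the exact manifest behind this article: the image, port, mount path, app label, and the model-cache-pvc claim name are all assumptions, and the model arguments are omitted.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama3-inference-service
  namespace: inference-prod
spec:
  # No replicas field: the autoscaler owns the replica count.
  selector:
    matchLabels:
      app: llama3-inference-service      # assumed label
  template:
    metadata:
      labels:
        app: llama3-inference-service
    spec:
      containers:
        - name: inference
          image: ghcr.io/huggingface/text-generation-inference:latest   # assumed image and tag
          ports:
            - containerPort: 80          # assumed serving port
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-cache
              mountPath: /data           # assumed weight cache directory
          # Allows up to 60 x 10s = 10 minutes for the model to load
          # before the readiness probe takes over.
          startupProbe:
            httpGet:
              path: /health
              port: 80
            periodSeconds: 10
            failureThreshold: 60
          # Keeps the pod out of the Service endpoints until /health succeeds.
          readinessProbe:
            httpGet:
              path: /health
              port: 80
            periodSeconds: 5
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc   # hypothetical ReadWriteMany claim
```

The startup probe is a design choice rather than a requirement: it gives a large model a generous loading budget without loosening the readiness check that runs afterwards.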
Debugging Common Issues
I hit a few walls while setting this up. Knowing about them might save you a Friday night of debugging:
- GPU Fragmentation: Sometimes Kubernetes schedules a pod to a node that has enough CPU/RAM but not enough VRAM. Ensure you have nvidia.com/gpu: 1 explicitly set in your resource limits (GPUs are an extended resource, so a request without a matching limit is rejected). If your KEDA scaler triggers a scale-up but the pod stays in Pending, check your node affinity and taints.
- Metric Lag: Prometheus scrape intervals can be slow. If your autoscaler is lagging, check your scrape_interval. I keep mine at 15 seconds for inference namespaces.
- Resource Quotas: If you are using GKE or EKS, make sure your project- or account-level GPU quota isn't lower than your maxReplicaCount. I once had a scaling event fail because I hit my regional quota, resulting in a series of Pending pods while users were timing out.
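For the metric lag point, a per-job scrape interval might look like the sketch below. It assumes a plain Prometheus configuration file with Kubernetes pod discovery; the job name and the app pod label are hypothetical. With the Prometheus Operator the same interval would be set on a PodMonitor or ServiceMonitor instead.

```yaml
scrape_configs:
  - job_name: llm-inference              # hypothetical job name
    scrape_interval: 15s                 # overrides the global interval for this job
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - inference-prod
    relabel_configs:
      # Keep only pods labeled app=llama3-inference-service (assumed label).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: llama3-inference-service
        action: keep
      # Expose the pod name as the `pod` label used in the scaler query.
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```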
Scaling LLMs effectively isn't just about the model; it's about how you manage the infrastructure around it. Start small, monitor your queue depths, and always have a fallback mechanism for when the autoscaler fails to keep up with a sudden spike.
Aditya Shenvi
AI Engineer & Full-Stack Architect. Passionate about building intelligent systems, elegant UIs, and scaling web infrastructure. Open to exciting engineering opportunities in April 2026 and beyond.