DEVOPS · 2026-02-18 · 3 MIN READ

Scaling LLMs in Kubernetes: Setting up GPU Orchestrations with KEDA


Running LLMs in production is a constant battle between keeping GPU costs down and ensuring latency doesn't spike when traffic hits. If you leave your pods running 24/7, your cloud bill will burn through your budget before the end of the month. If you scale too slowly, your users sit staring at a loading spinner.

I’ve spent the last few months refining a pattern using KEDA (Kubernetes Event-driven Autoscaling) to handle GPU-accelerated inference. By moving away from static replica counts and using custom metrics, I’ve managed to keep GPU utilization high while keeping costs predictable.

The Architectural Shift

Standard Horizontal Pod Autoscalers (HPA) scale on CPU or memory by default, and both are poor proxies for LLM load. A GPU-bound model can sit at 10% CPU usage while its VRAM is fully saturated or its inference queue is backing up.

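Instead, I use KEDA with a Prometheus scaler and track tgi_request_queue_size (if using Text Generation Inference) or custom metrics from the NVIDIA DCGM exporter. Check the exact metric name against your server's /metrics endpoint, since names vary between versions; TGI's own documentation lists the queue gauge as tgi_queue_size. This lets the cluster scale on actual demand rather than system-level noise.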

Configuring the KEDA Scaler

To get this working, you need the NVIDIA device plugin installed and your Prometheus instance scraping the GPU metrics. Here is the YAML configuration I use for a production-grade deployment.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-gpu-scaler
  namespace: inference-prod
spec:
  scaleTargetRef:
    name: llama3-inference-service
  # Scale to zero when idle to save costs. Caveat: with zero pods nothing
  # exports the queue metric, so waking up again needs a trigger fed from
  # outside the pods (for example, request rate at the ingress).
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-service.monitoring.svc.cluster.local
      # metricName is deprecated in recent KEDA releases; kept for older ones
      metricName: tgi_request_queue_size
      # Target: 5 pending requests per replica
      threshold: '5'
      # Total pending requests across all pods. KEDA's default metric type
      # (AverageValue) divides this total by the replica count, so the query
      # must return the sum rather than a per-pod average.
      query: |
        sum(tgi_request_queue_size{pod=~"llama3-inference-service-.*"})
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300 # Prevent flapping
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
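
If your inference server does not expose a queue metric, the DCGM exporter mentioned earlier can drive the same ScaledObject. The fragment below is a minimal sketch of such a trigger, not a tested production config. It assumes the exporter publishes DCGM_FI_DEV_GPU_UTIL with a pod label (depending on your scrape setup the label may appear as exported_pod instead), and the 80% target is an illustrative value.

```yaml
spec:
  triggers:
  - type: prometheus
    # Value makes the HPA scale the current replica count by
    # (query result / threshold), which suits an averaged percentage.
    # The default AverageValue would instead treat the result as a total
    # to be split across replicas.
    metricType: Value
    metadata:
      serverAddress: http://prometheus-service.monitoring.svc.cluster.local
      threshold: '80' # assumed target: 80% average GPU utilization
      query: |
        avg(DCGM_FI_DEV_GPU_UTIL{pod=~"llama3-inference-service-.*"})
```

GPU utilization only exists while pods are running, so a trigger like this can resize a running deployment but cannot wake it from zero replicas.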

Operational Trade-offs

When you scale to zero, you introduce "cold start" latency. Loading a 70B parameter model into VRAM isn't instantaneous. Here is how I handle the trade-offs:

  1. Model Caching: I keep model weights on a ReadWriteMany persistent volume (or S3-backed storage like JuiceFS) and cache them on the node-local SSDs, so a new pod reads weights from disk instead of downloading them again. This drastically reduces the time it takes for a pod to become "Ready."
  2. Readiness Probes: Don't just check if the container is running. Set your readiness probe to hit the /health endpoint of your inference server. If the model isn't loaded into VRAM, the pod shouldn't receive traffic.
  3. The "Minimum Floor": Even if you want to scale to zero, keep at least one replica running during business hours. The latency hit of a cold start is often unacceptable for user-facing chat interfaces.
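
Points 2 and 3 can be sketched in YAML. The first fragment is a hedged example of the probes on the inference container; the container name, port 8080, and the timing values are assumptions, so adjust them to your server and model size. The startup probe is an addition to what the list describes: it gives the model time to load before the readiness probe takes over.

```yaml
# Container fragment for the inference Deployment (sketch)
containers:
- name: inference            # hypothetical container name
  ports:
  - containerPort: 8080      # assumed port
  startupProbe:              # other probes are held off until this succeeds
    httpGet:
      path: /health
      port: 8080
    periodSeconds: 10
    failureThreshold: 60     # allows up to 10 minutes of model loading
  readinessProbe:
    httpGet:
      path: /health
      port: 8080
    periodSeconds: 5
    failureThreshold: 3
```

For the business-hours floor, one option is KEDA's cron scaler added as a second trigger on the same ScaledObject. The time zone and hours here are placeholders.

```yaml
spec:
  triggers:
  - type: cron
    metadata:
      timezone: Europe/London     # assumed; use your users' time zone
      start: "0 8 * * 1-5"        # 08:00, Monday to Friday
      end: "0 18 * * 1-5"         # 18:00, Monday to Friday
      desiredReplicas: "1"
```

When several triggers are defined, the highest replica count any of them asks for wins, so the cron trigger holds one replica during the window while the queue trigger can still scale beyond it.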

Debugging Common Issues

I hit a few walls while setting this up. Knowing about them in advance might save you a Friday night of debugging:

  • GPU Fragmentation: The scheduler does not see VRAM. Without an explicit GPU resource, it can place a pod on a node that has enough CPU/RAM but no free GPU. Ensure you have nvidia.com/gpu: 1 explicitly set in your container's resource limits (Kubernetes requires a limit for extended resources and uses it as the request). If your KEDA scaler triggers a scale-up but the pod stays in Pending, check your node affinity and taints.
  • Metric Lag: Prometheus scrape intervals can be slow. If your autoscaler is lagging, check your scrape_interval. I keep mine at 15 seconds for inference namespaces. KEDA also polls each trigger on its own pollingInterval (30 seconds by default), so the worst-case reaction time is roughly the sum of the two.
  • Resource Quotas: If you are using GKE or EKS, make sure your project-level (GKE) or account-level (EKS) GPU quota isn't lower than the number of GPUs needed at maxReplicaCount. I once had a scaling event fail because I hit my regional quota, resulting in a series of Pending pods while users were timing out.
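
The first two items above can be sketched as config. This pod spec fragment requests one GPU and tolerates a GPU-node taint. The taint key and container name are assumptions; match them to how your GPU nodes are actually tainted.

```yaml
# Pod template fragment for the inference Deployment (sketch)
spec:
  tolerations:
  - key: nvidia.com/gpu      # assumed taint key on GPU nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: inference          # hypothetical container name
    resources:
      limits:
        nvidia.com/gpu: 1    # extended resources are set under limits
```

For metric lag, both Prometheus's scrape_interval and KEDA's pollingInterval can be tightened. The job name below is hypothetical, and the service discovery block is only one of several ways to scrape pods.

```yaml
# Prometheus scrape job (sketch)
scrape_configs:
- job_name: inference-pods   # hypothetical job name
  scrape_interval: 15s
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - inference-prod
---
# ScaledObject fragment: how often KEDA queries Prometheus, in seconds
spec:
  pollingInterval: 15
```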

Scaling LLMs effectively isn't just about the model; it's about how you manage the infrastructure around it. Start small, monitor your queue depths, and always have a fallback mechanism for when the autoscaler fails to keep up with a sudden spike.



Aditya Shenvi

AI Engineer & Full-Stack Architect. Passionate about building intelligent systems, elegant UIs, and scaling web infrastructure. Open to exciting engineering opportunities in April 2026 and beyond.
