
Scaling LLMs with Kubernetes: Production Deployment
- kubernetes
- July 29, 2025
Scaling Large Language Models (LLMs) in production requires a robust infrastructure that can handle dynamic workloads, provide high availability, and optimize costs through intelligent autoscaling.
The LLM Scaling Challenge
Modern LLM deployments face several critical challenges:
1. Dynamic Workload Patterns
- Peak hours: Online shopping, customer support, content generation
- Non-peak hours: Reduced traffic requiring cost optimization
- Unpredictable spikes: Viral content, marketing campaigns, seasonal events
2. Resource Management
- GPU utilization: Expensive hardware must be used efficiently
- Memory constraints: Large models require careful memory management
- Network bottlenecks: Distributed inference across multiple nodes
3. Cost Optimization
- Infrastructure costs: GPU instances are expensive
- Operational overhead: Manual scaling is inefficient
- Resource waste: Over-provisioning during low-traffic periods
The solution: Kubernetes-based LLM deployment with intelligent autoscaling.
LLM Optimization with TensorRT-LLM
Before scaling, you need optimized models. NVIDIA TensorRT-LLM provides the foundation for efficient LLM inference.
Key Optimizations
1. Kernel Fusion
- Combines multiple operations into single GPU kernels
- Reduces memory bandwidth requirements
- Improves overall throughput by 20-40%
2. Quantization
- INT8 quantization for faster inference
- FP16 for balance of speed and accuracy
- Dynamic quantization for optimal performance
3. Advanced Attention Mechanisms
- Paged KV cache for efficient memory usage
- Multi-head attention optimization
- Flash attention for reduced memory footprint
4. In-flight Batching
- Processes multiple requests simultaneously
- Improves GPU utilization
- Reduces per-request latency
Building TensorRT-LLM Engines
The build.py script shown here comes from the TensorRT-LLM examples; newer releases expose the same options through the unified trtllm-build CLI, so verify flag names against your installed version.
# Example: Building a GPT-2 engine with optimizations
python build.py \
  --model_dir gpt2 \
  --output_dir ./engines/gpt2 \
  --dtype float16 \
  --use_gpt_attention_plugin \
  --use_gemm_plugin \
  --paged_kv_cache \
  --max_batch_size 8 \
  --max_input_len 512 \
  --max_output_len 128
Key build-time parallelism parameters:
- Tensor Parallelism (TP): splits each layer's weights across multiple GPUs
- Pipeline Parallelism (PP): splits groups of layers across devices
- Minimum GPUs required: TP × PP (e.g., TP = 4 with PP = 2 needs at least 8 GPUs)
Kubernetes Infrastructure Setup
Required Components
1. NVIDIA Device Plugin
# Enables GPU discovery in Kubernetes
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
2. Node Feature Discovery
# Discovers GPU capabilities on nodes
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/master/deployment/node-feature-discovery.yaml
3. GPU Feature Discovery
# Provides detailed GPU metrics
kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/master/deployments/static/nvidia-gpu-feature-discovery-daemonset.yaml
4. NVIDIA DCGM Exporter
# Exports GPU metrics to Prometheus
kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/deployment/dcgm-exporter.yaml
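Before moving on, confirm that the cluster actually advertises GPUs and that the discovery and metrics pods are healthy; the checks below assume only the components installed above:
# Confirm the device plugin exposes nvidia.com/gpu as an allocatable resource
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
# Confirm the plugin, discovery, and exporter pods are running
kubectl get pods -A | grep -E "nvidia|node-feature-discovery|dcgm"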
Monitoring Stack
Prometheus Configuration:
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'triton-metrics'
        static_configs:
          - targets: ['triton-service:8002']
      - job_name: 'dcgm-exporter'
        static_configs:
          - targets: ['dcgm-exporter:9400']
Deploying LLMs with Triton Inference Server
Triton Server Deployment
# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
          args: ["tritonserver", "--model-repository=/models"]
          ports:
            - containerPort: 8000  # HTTP
            - containerPort: 8001  # gRPC
            - containerPort: 8002  # Metrics
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "compute,utility"
          volumeMounts:
            - name: model-repository
              mountPath: /models
            - name: engine-files
              mountPath: /engines
          resources:
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: model-repository
          persistentVolumeClaim:
            claimName: model-repo-pvc
        - name: engine-files
          hostPath:
            path: /opt/triton/engines
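Triton exposes standard HTTP health endpoints (/v2/health/live and /v2/health/ready), which makes it easy to keep traffic away from replicas that are still loading engines. A sketch of probes to add to the triton container above; the timing values are illustrative, not prescriptive:
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 15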
Service Configuration
# triton-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  selector:
    app: triton-server
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002
  type: ClusterIP
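The Deployment above mounts a PersistentVolumeClaim named model-repo-pvc that is not defined elsewhere in this guide. A minimal claim could look like the following; the size is illustrative and the cluster's default storage class is assumed:
# model-repo-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-repo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi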
Autoscaling Strategies
Horizontal Pod Autoscaler (HPA)
Custom Metrics:
Triton-specific signals, such as the ratio of time requests spend queuing to time spent computing, make better scaling triggers than CPU alone. Note that MetricValueList is not a manifest you apply: it is the response format of the custom metrics API (custom.metrics.k8s.io) once an adapter exposes the metric. Querying that API for the ratio returns something like:
# Example response from the custom metrics API (not a resource to apply)
apiVersion: custom.metrics.k8s.io/v1beta1
kind: MetricValueList
items:
  - describedObject:
      apiVersion: apps/v1
      kind: Deployment
      name: triton-server
    metricName: triton_queue_compute_ratio
    timestamp: "2024-01-01T00:00:00Z"
    value: "0.5"
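A common way to expose this metric is the Prometheus Adapter. The rule below is a sketch in the adapter's config format, deriving the queue-to-compute ratio from Triton's built-in counters nv_inference_queue_duration_us and nv_inference_compute_infer_duration_us; it assumes Prometheus scrapes Triton via Kubernetes service discovery so the series carry namespace and pod labels (the static_configs scrape above would need relabeling for that), and the exact rule shape should be checked against your adapter version:
# prometheus-adapter config excerpt (sketch; verify metric names, labels, and rule syntax)
rules:
  - seriesQuery: 'nv_inference_queue_duration_us{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^nv_inference_queue_duration_us$"
      as: "triton_queue_compute_ratio"
    metricsQuery: >-
      sum(rate(nv_inference_queue_duration_us{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
      /
      sum(rate(nv_inference_compute_infer_duration_us{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
If the HPA references the metric on the triton-service Service, as in the configuration below, map the service label in the overrides instead of pod.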
HPA Configuration:
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Object
      object:
        metric:
          name: triton_queue_compute_ratio
        describedObject:
          apiVersion: v1
          kind: Service
          name: triton-service
        target:
          type: AverageValue
          averageValue: "800m"
Advanced Autoscaling with NIM Operator
The NVIDIA NIM Operator provides enterprise-grade LLM deployment and scaling through custom resources. The exact CRD kinds and fields vary by operator release (recent versions use resources such as NIMCache and NIMService), so treat the manifest below as a sketch of the overall shape and check the NIM Operator documentation for your version:
# nim-deployment.yaml
apiVersion: nim.nvidia.com/v1alpha1
kind: NIMDeployment
metadata:
  name: llm-deployment
spec:
  replicas: 2
  model:
    name: gpt2
    version: "1.0"
    format: tensorrt
  resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
  monitoring:
    prometheus:
      enabled: true
      port: 8002
Multi-Node Scaling Strategies
Load Balancing
Ingress Configuration:
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-ingress
  annotations:
    nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"
spec:
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: triton-service
                port:
                  number: 8000
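Once DNS (or a local hosts entry) points llm.example.com at the ingress controller, a quick smoke test against Triton's health endpoint confirms the full path through ingress, service, and pods:
# Expect HTTP 200 when at least one Triton replica is ready
curl -s -o /dev/null -w "%{http_code}\n" http://llm.example.com/v2/health/ready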
StatefulSet for Persistent Models
# triton-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: triton-statefulset
spec:
  serviceName: triton-service
  replicas: 3
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
          volumeMounts:
            - name: model-storage
              mountPath: /models
  volumeClaimTemplates:
    - metadata:
        name: model-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
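With several replicas spread across nodes, a PodDisruptionBudget keeps a minimum number of servers available during voluntary disruptions such as node drains and upgrades. A minimal sketch matching the app: triton-server label used above:
# triton-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triton-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: triton-server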
Performance Monitoring and Optimization
Key Metrics to Monitor
1. GPU Utilization
- Target: 70-90% utilization
- Monitor: NVIDIA DCGM metrics
- Alert: Below 50% or above 95% (see the alerting rules after this list)
2. Queue Depth
- Target: Minimal queue buildup
- Monitor: Triton queue metrics
- Scale up: When queue depth > 5
3. Response Latency
- Target: < 200 ms time to first token (TTFT), < 50 ms inter-token latency (ITL)
- Monitor: End-to-end request timing
- Alert: P95 latency > 500ms
4. Throughput
- Target: Maximize tokens/second
- Monitor: Requests per second
- Optimize: Batch size and concurrency
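The GPU utilization thresholds above can be encoded as standard Prometheus alerting rules; with the ConfigMap-based setup shown earlier they would be loaded via rule_files (or wrapped in a PrometheusRule object if you run the Prometheus Operator). A sketch:
# gpu-alerts.yaml (Prometheus alerting rules; thresholds mirror the list above)
groups:
  - name: gpu-utilization
    rules:
      - alert: GPUUnderutilized
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Average GPU utilization below 50% for 15 minutes"
      - alert: GPUSaturated
        expr: avg(DCGM_FI_DEV_GPU_UTIL) > 95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average GPU utilization above 95% for 5 minutes"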
Grafana Dashboard
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "LLM Performance Dashboard",
        "panels": [
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_GPU_UTIL",
                "legendFormat": "GPU {{gpu}}"
              }
            ]
          },
          {
            "title": "Queue Depth",
            "type": "graph",
            "targets": [
              {
                "expr": "triton_queue_size",
                "legendFormat": "Queue Size"
              }
            ]
          }
        ]
      }
    }
Cost Optimization Strategies
1. Spot Instances for Non-Critical Workloads
Spot (preemptible) GPU nodes can cut costs substantially for interruptible workloads. The taint key and instance-type label are cloud-specific: the snippet below pairs an AWS instance type with the AKS spot taint purely for illustration, so adjust both to your provider.
# spot-instance-deployment.yaml
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: g4dn.xlarge
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
2. Mixed Precision for Cost Savings
With TensorRT-LLM, precision is decided when the engine is built (the --dtype float16 flag in the build step above), not through runtime environment variables. FP16 engines roughly halve weight memory compared with FP32 and typically improve throughput, so the same traffic can be served with fewer or smaller GPU instances; INT8 quantization pushes memory and cost down further at some accuracy risk. The cost lever here is building and deploying lower-precision engines rather than tuning server flags.
3. Intelligent Scaling Policies
# scaling-policy.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: intelligent-hpa
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
Production Best Practices
1. Resource Planning
- GPU Memory: Plan for roughly 1.5x the model size (a 7B model in FP16 is about 14 GB of weights, so budget around 21 GB to leave headroom for KV cache and activations)
- CPU: 4-8 cores per GPU (see the example requests after this list)
- Network: 10Gbps minimum for multi-node
- Storage: NVMe SSDs for model loading
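Translated into a pod spec, those rules of thumb look roughly like the following; the CPU and memory figures are illustrative placeholders to tune per model and instance type:
# Example container resources (illustrative values)
resources:
  requests:
    cpu: "4"
    memory: 32Gi
    nvidia.com/gpu: 1
  limits:
    cpu: "8"
    memory: 48Gi
    nvidia.com/gpu: 1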
2. Security Considerations
# security-context.yaml
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: triton
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
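With readOnlyRootFilesystem enabled, the container still needs writable scratch space if the server writes to paths such as /tmp; whether Triton needs this depends on your configuration, but a common pattern is an emptyDir mount:
# Writable scratch volume to pair with a read-only root filesystem
spec:
  template:
    spec:
      containers:
        - name: triton
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}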
3. Backup and Recovery
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: backup-tool:latest
              command: ["/backup-models.sh"]
Real-World Performance Expectations
Single GPU Performance (A100)
- 7B Models: 100-300 tokens/second
- 13B Models: 50-150 tokens/second
- 70B Models: 20-80 tokens/second (multi-GPU)
Multi-Node Scaling
- Linear scaling: Up to 8 nodes
- Network overhead: 5-15% performance loss
- Memory efficiency: 90-95% utilization
Cost Optimization Results
- Spot instances: 60-80% cost savings
- Mixed precision: 20-40% performance improvement
- Autoscaling: 30-50% infrastructure cost reduction
Troubleshooting Common Issues
1. GPU Memory Issues
# Check GPU memory usage
kubectl exec -it triton-pod -- nvidia-smi
# Monitor memory allocation
kubectl logs triton-pod | grep "CUDA out of memory"
2. Network Bottlenecks
# Check raw network throughput between nodes (requires iperf3 in the image or an ephemeral debug container)
kubectl exec -it triton-pod -- iperf3 -c other-node
# Check overall node resource pressure; network throughput itself comes from node-exporter/DCGM dashboards
kubectl top nodes
3. Scaling Issues
# Check HPA status
kubectl describe hpa triton-hpa
# Verify custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/deployments/triton-server/triton_queue_compute_ratio"
Conclusion
Scaling LLMs with Kubernetes requires a comprehensive approach that combines:
- Model Optimization: TensorRT-LLM for efficient inference
- Infrastructure Setup: Proper GPU support and monitoring
- Intelligent Autoscaling: HPA with custom metrics
- Cost Optimization: Spot instances and mixed precision
- Production Readiness: Security, monitoring, and backup
The key to success is starting with a solid foundation and iteratively optimizing based on real-world metrics. Remember that LLM scaling is not just about handling more requests - it’s about doing so efficiently while maintaining quality and controlling costs.
For production deployments, consider using the NVIDIA NIM Operator for enterprise-grade LLM management, or build custom solutions using the patterns outlined in this guide.
The future of LLM deployment lies in intelligent, automated scaling that responds to real-time demand while optimizing for both performance and cost. Kubernetes provides the foundation, but success requires careful planning, monitoring, and continuous optimization.
For more detailed information about specific tools and methodologies, check out the NVIDIA Triton documentation and the Kubernetes autoscaling guides.