
Scaling LLMs with Kubernetes: Production Deployment
- kubernetes
- July 29, 2025
Scaling Large Language Models (LLMs) in production requires a robust infrastructure that can handle dynamic workloads, provide high availability, and optimize costs through intelligent autoscaling.
The LLM Scaling Challenge
Modern LLM deployments face several critical challenges:
1. Dynamic Workload Patterns
- Peak hours: Online shopping, customer support, content generation
- Non-peak hours: Reduced traffic requiring cost optimization
- Unpredictable spikes: Viral content, marketing campaigns, seasonal events
2. Resource Management
- GPU utilization: Expensive hardware must be used efficiently
- Memory constraints: Large models require careful memory management
- Network bottlenecks: Distributed inference across multiple nodes
3. Cost Optimization
- Infrastructure costs: GPU instances are expensive
- Operational overhead: Manual scaling is inefficient
- Resource waste: Over-provisioning during low-traffic periods
The solution: Kubernetes-based LLM deployment with intelligent autoscaling.
LLM Optimization with TensorRT-LLM
Before scaling, you need optimized models. NVIDIA TensorRT-LLM provides the foundation for efficient LLM inference.
Key Optimizations
1. Kernel Fusion
- Combines multiple operations into single GPU kernels
- Reduces memory bandwidth requirements
- Improves overall throughput by 20-40%
2. Quantization
- INT8 quantization for faster inference
- FP16 for balance of speed and accuracy
- Dynamic quantization for optimal performance
3. Advanced Attention Mechanisms
- Paged KV cache for efficient memory usage
- Multi-head attention optimization
- Flash attention for reduced memory footprint
4. In-flight Batching
- Processes multiple requests simultaneously
- Improves GPU utilization
- Reduces per-request latency
Building TensorRT-LLM Engines
The build.py script shown here comes from the TensorRT-LLM examples; newer releases expose the same options through the unified trtllm-build CLI, so verify flag names against your installed version.
# Example: Building a GPT-2 engine with optimizations
python build.py \
  --model_dir gpt2 \
  --output_dir ./engines/gpt2 \
  --dtype float16 \
  --use_gpt_attention_plugin \
  --use_gemm_plugin \
  --paged_kv_cache \
  --max_batch_size 8 \
  --max_input_len 512 \
  --max_output_len 128
Key build-time parallelism parameters:
- Tensor Parallelism (TP): splits each layer's weights across multiple GPUs
- Pipeline Parallelism (PP): splits groups of layers across devices
- Minimum GPUs required: TP × PP (e.g., TP = 4 with PP = 2 needs at least 8 GPUs)
Kubernetes Infrastructure Setup
Required Components
1. NVIDIA Device Plugin
# Enables GPU discovery in Kubernetes
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
2. Node Feature Discovery
# Discovers GPU capabilities on nodes
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/master/deployment/node-feature-discovery.yaml
3. GPU Feature Discovery
# Provides detailed GPU metrics
kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/master/deployments/static/nvidia-gpu-feature-discovery-daemonset.yaml
4. NVIDIA DCGM Exporter
# Exports GPU metrics to Prometheus
kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/deployment/dcgm-exporter.yaml
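Before moving on, confirm that the cluster actually advertises GPUs and that the discovery and metrics pods are healthy; the checks below assume only the components installed above:
# Confirm the device plugin exposes nvidia.com/gpu as an allocatable resource
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
# Confirm the plugin, discovery, and exporter pods are running
kubectl get pods -A | grep -E "nvidia|node-feature-discovery|dcgm"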
Monitoring Stack
Prometheus Configuration:
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'triton-metrics'
        static_configs:
          - targets: ['triton-service:8002']
      - job_name: 'dcgm-exporter'
        static_configs:
          - targets: ['dcgm-exporter:9400']
Deploying LLMs with Triton Inference Server
Triton Server Deployment
# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
          args: ["tritonserver", "--model-repository=/models"]
          ports:
            - containerPort: 8000  # HTTP
            - containerPort: 8001  # gRPC
            - containerPort: 8002  # Metrics
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "compute,utility"
          volumeMounts:
            - name: model-repository
              mountPath: /models
            - name: engine-files
              mountPath: /engines
          resources:
            limits:
              nvidia.com/gpu: 1
      volumes:
        - name: model-repository
          persistentVolumeClaim:
            claimName: model-repo-pvc
        - name: engine-files
          hostPath:
            path: /opt/triton/engines
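Triton exposes standard HTTP health endpoints (/v2/health/live and /v2/health/ready), which makes it easy to keep traffic away from replicas that are still loading engines. A sketch of probes to add to the triton container above; the timing values are illustrative, not prescriptive:
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 15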
Service Configuration
# triton-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  selector:
    app: triton-server
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002
  type: ClusterIP
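The Deployment above mounts a PersistentVolumeClaim named model-repo-pvc that is not defined elsewhere in this guide. A minimal claim could look like the following; the size is illustrative and the cluster's default storage class is assumed:
# model-repo-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-repo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi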
Autoscaling Strategies
Horizontal Pod Autoscaler (HPA)
Custom Metrics:
Triton-specific signals, such as the ratio of time requests spend queuing to time spent computing, make better scaling triggers than CPU alone. Note that MetricValueList is not a manifest you apply: it is the response format of the custom metrics API (custom.metrics.k8s.io) once an adapter exposes the metric. Querying that API for the ratio returns something like:
# Example response from the custom metrics API (not a resource to apply)
apiVersion: custom.metrics.k8s.io/v1beta1
kind: MetricValueList
items:
  - describedObject:
      apiVersion: apps/v1
      kind: Deployment
      name: triton-server
    metricName: triton_queue_compute_ratio
    timestamp: "2024-01-01T00:00:00Z"
    value: "0.5"
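A common way to expose this metric is the Prometheus Adapter. The rule below is a sketch in the adapter's config format, deriving the queue-to-compute ratio from Triton's built-in counters nv_inference_queue_duration_us and nv_inference_compute_infer_duration_us; it assumes Prometheus scrapes Triton via Kubernetes service discovery so the series carry namespace and pod labels (the static_configs scrape above would need relabeling for that), and the exact rule shape should be checked against your adapter version:
# prometheus-adapter config excerpt (sketch; verify metric names, labels, and rule syntax)
rules:
  - seriesQuery: 'nv_inference_queue_duration_us{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^nv_inference_queue_duration_us$"
      as: "triton_queue_compute_ratio"
    metricsQuery: >-
      sum(rate(nv_inference_queue_duration_us{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
      /
      sum(rate(nv_inference_compute_infer_duration_us{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
If the HPA references the metric on the triton-service Service, as in the configuration below, map the service label in the overrides instead of pod.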
HPA Configuration:
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Object
      object:
        metric:
          name: triton_queue_compute_ratio
        describedObject:
          apiVersion: v1
          kind: Service
          name: triton-service
        target:
          type: AverageValue
          averageValue: "800m"
Advanced Autoscaling with NIM Operator
The NVIDIA NIM Operator provides enterprise-grade LLM deployment and scaling through custom resources. The exact CRD kinds and fields vary by operator release (recent versions use resources such as NIMCache and NIMService), so treat the manifest below as a sketch of the overall shape and check the NIM Operator documentation for your version:
# nim-deployment.yaml
apiVersion: nim.nvidia.com/v1alpha1
kind: NIMDeployment
metadata:
  name: llm-deployment
spec:
  replicas: 2
  model:
    name: gpt2
    version: "1.0"
    format: tensorrt
  resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
  monitoring:
    prometheus:
      enabled: true
      port: 8002
Multi-Node Scaling Strategies
Load Balancing
Ingress Configuration:
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-ingress
  annotations:
    nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"
spec:
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: triton-service
                port:
                  number: 8000
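Once DNS (or a local hosts entry) points llm.example.com at the ingress controller, a quick smoke test against Triton's health endpoint confirms the full path through ingress, service, and pods:
# Expect HTTP 200 when at least one Triton replica is ready
curl -s -o /dev/null -w "%{http_code}\n" http://llm.example.com/v2/health/ready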
StatefulSet for Persistent Models
# triton-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: triton-statefulset
spec:
  serviceName: triton-service
  replicas: 3
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
          volumeMounts:
            - name: model-storage
              mountPath: /models
  volumeClaimTemplates:
    - metadata:
        name: model-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
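With several replicas spread across nodes, a PodDisruptionBudget keeps a minimum number of servers available during voluntary disruptions such as node drains and upgrades. A minimal sketch matching the app: triton-server label used above:
# triton-pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triton-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: triton-server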
Performance Monitoring and Optimization
Key Metrics to Monitor
1. GPU Utilization
- Target: 70-90% utilization
- Monitor: NVIDIA DCGM metrics
- Alert: Below 50% or above 95% (see the alerting rules after this list)
2. Queue Depth
- Target: Minimal queue buildup
- Monitor: Triton queue metrics
- Scale up: When queue depth > 5
3. Response Latency
- Target: < 200 ms time to first token (TTFT), < 50 ms inter-token latency (ITL)
- Monitor: End-to-end request timing
- Alert: P95 latency > 500ms
4. Throughput
- Target: Maximize tokens/second
- Monitor: Requests per second
- Optimize: Batch size and concurrency
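The GPU utilization thresholds above can be encoded as standard Prometheus alerting rules; with the ConfigMap-based setup shown earlier they would be loaded via rule_files (or wrapped in a PrometheusRule object if you run the Prometheus Operator). A sketch:
# gpu-alerts.yaml (Prometheus alerting rules; thresholds mirror the list above)
groups:
  - name: gpu-utilization
    rules:
      - alert: GPUUnderutilized
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Average GPU utilization below 50% for 15 minutes"
      - alert: GPUSaturated
        expr: avg(DCGM_FI_DEV_GPU_UTIL) > 95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average GPU utilization above 95% for 5 minutes"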
Grafana Dashboard
# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "LLM Performance Dashboard",
        "panels": [
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_GPU_UTIL",
                "legendFormat": "GPU {{gpu}}"
              }
            ]
          },
          {
            "title": "Queue Depth",
            "type": "graph",
            "targets": [
              {
                "expr": "triton_queue_size",
                "legendFormat": "Queue Size"
              }
            ]
          }
        ]
      }
    }
Cost Optimization Strategies
1. Spot Instances for Non-Critical Workloads
Spot (preemptible) GPU nodes can cut costs substantially for interruptible workloads. The taint key and instance-type label are cloud-specific: the snippet below pairs an AWS instance type with the AKS spot taint purely for illustration, so adjust both to your provider.
# spot-instance-deployment.yaml
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: g4dn.xlarge
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
2. Mixed Precision for Cost Savings
With TensorRT-LLM, precision is decided when the engine is built (the --dtype float16 flag in the build step above), not through runtime environment variables. FP16 engines roughly halve weight memory compared with FP32 and typically improve throughput, so the same traffic can be served with fewer or smaller GPU instances; INT8 quantization pushes memory and cost down further at some accuracy risk. The cost lever here is building and deploying lower-precision engines rather than tuning server flags.
3. Intelligent Scaling Policies
# scaling-policy.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: intelligent-hpa
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
Production Best Practices
1. Resource Planning
- GPU Memory: Plan for roughly 1.5x the model size (a 7B model in FP16 is about 14 GB of weights, so budget around 21 GB to leave headroom for KV cache and activations)
- CPU: 4-8 cores per GPU (see the example requests after this list)
- Network: 10Gbps minimum for multi-node
- Storage: NVMe SSDs for model loading
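Translated into a pod spec, those rules of thumb look roughly like the following; the CPU and memory figures are illustrative placeholders to tune per model and instance type:
# Example container resources (illustrative values)
resources:
  requests:
    cpu: "4"
    memory: 32Gi
    nvidia.com/gpu: 1
  limits:
    cpu: "8"
    memory: 48Gi
    nvidia.com/gpu: 1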
2. Security Considerations
# security-context.yaml
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: triton
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
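With readOnlyRootFilesystem enabled, the container still needs writable scratch space if the server writes to paths such as /tmp; whether Triton needs this depends on your configuration, but a common pattern is an emptyDir mount:
# Writable scratch volume to pair with a read-only root filesystem
spec:
  template:
    spec:
      containers:
        - name: triton
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}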
3. Backup and Recovery
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: backup-tool:latest
              command: ["/backup-models.sh"]
Real-World Performance Expectations
Single GPU Performance (A100)
- 7B Models: 100-300 tokens/second
- 13B Models: 50-150 tokens/second
- 70B Models: 20-80 tokens/second (multi-GPU)
Multi-Node Scaling
- Linear scaling: Up to 8 nodes
- Network overhead: 5-15% performance loss
- Memory efficiency: 90-95% utilization
Cost Optimization Results
- Spot instances: 60-80% cost savings
- Mixed precision: 20-40% performance improvement
- Autoscaling: 30-50% infrastructure cost reduction
Troubleshooting Common Issues
1. GPU Memory Issues
# Check GPU memory usage
kubectl exec -it triton-pod -- nvidia-smi
# Monitor memory allocation
kubectl logs triton-pod | grep "CUDA out of memory"
2. Network Bottlenecks
# Check raw network throughput between nodes (requires iperf3 in the image or an ephemeral debug container)
kubectl exec -it triton-pod -- iperf3 -c other-node
# Check overall node resource pressure; network throughput itself comes from node-exporter/DCGM dashboards
kubectl top nodes
3. Scaling Issues
# Check HPA status
kubectl describe hpa triton-hpa
# Verify custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/deployments/triton-server/triton_queue_compute_ratio"
Conclusion
Scaling LLMs with Kubernetes requires a comprehensive approach that combines:
- Model Optimization: TensorRT-LLM for efficient inference
- Infrastructure Setup: Proper GPU support and monitoring
- Intelligent Autoscaling: HPA with custom metrics
- Cost Optimization: Spot instances and mixed precision
- Production Readiness: Security, monitoring, and backup
The key to success is starting with a solid foundation and iteratively optimizing based on real-world metrics. Remember that LLM scaling is not just about handling more requests - it’s about doing so efficiently while maintaining quality and controlling costs.
For production deployments, consider using the NVIDIA NIM Operator for enterprise-grade LLM management, or build custom solutions using the patterns outlined in this guide.
The future of LLM deployment lies in intelligent, automated scaling that responds to real-time demand while optimizing for both performance and cost. Kubernetes provides the foundation, but success requires careful planning, monitoring, and continuous optimization.
For more detailed information about specific tools and methodologies, check out the NVIDIA Triton documentation and the Kubernetes autoscaling guides.