Scaling LLMs with Kubernetes: Production Deployment

Scaling Large Language Models (LLMs) in production requires a robust infrastructure that can handle dynamic workloads, provide high availability, and optimize costs through intelligent autoscaling.

The LLM Scaling Challenge

Modern LLM deployments face several critical challenges:

1. Dynamic Workload Patterns

  • Peak hours: Online shopping, customer support, content generation
  • Non-peak hours: Reduced traffic requiring cost optimization
  • Unpredictable spikes: Viral content, marketing campaigns, seasonal events

2. Resource Management

  • GPU utilization: Expensive hardware must be used efficiently
  • Memory constraints: Large models require careful memory management
  • Network bottlenecks: Distributed inference across multiple nodes

3. Cost Optimization

  • Infrastructure costs: GPU instances are expensive
  • Operational overhead: Manual scaling is inefficient
  • Resource waste: Over-provisioning during low-traffic periods

The solution: Kubernetes-based LLM deployment with intelligent autoscaling.

LLM Optimization with TensorRT-LLM

Before scaling, you need optimized models. NVIDIA TensorRT-LLM provides the foundation for efficient LLM inference.

Key Optimizations

1. Kernel Fusion

  • Combines multiple operations into single GPU kernels
  • Reduces memory bandwidth requirements
  • Improves overall throughput by 20-40%

2. Quantization

  • INT8 quantization for faster inference
  • FP16 for balance of speed and accuracy
  • Dynamic quantization for optimal performance

3. Advanced Attention Mechanisms

  • Paged KV cache for efficient memory usage
  • Multi-head attention optimization
  • Flash attention for reduced memory footprint

4. In-flight Batching

  • Processes multiple requests simultaneously
  • Improves GPU utilization
  • Reduces per-request latency

Building TensorRT-LLM Engines

# Example: Building a GPT-2 engine with optimizations
# (exact flag names vary across TensorRT-LLM releases; check your version's build docs)
python build.py \
    --model_dir gpt2 \
    --output_dir ./engines/gpt2 \
    --dtype float16 \
    --use_gpt_attention_plugin \
    --use_gemm_plugin \
    --use_paged_kv_cache \
    --max_batch_size 8 \
    --max_input_len 512 \
    --max_output_len 128

Key parallelism parameters (set at engine build time; not shown in the command above):

  • Tensor Parallelism (TP): Distributes model weights across multiple GPUs
  • Pipeline Parallelism (PP): Splits groups of model layers across devices
  • Minimum GPUs required: TP × PP (see the resource sketch below)
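
For example, an engine built with TP = 2 and PP = 2 needs at least 2 × 2 = 4 GPUs, and the serving pod has to request that many. A minimal sketch of the corresponding resource block, with illustrative values:

# Pod resources for an engine built with TP=2, PP=2 (illustrative)
resources:
  limits:
    nvidia.com/gpu: 4  # must be at least TP x PP for the engine to load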

Kubernetes Infrastructure Setup

Required Components

1. NVIDIA Device Plugin

# Enables GPU discovery in Kubernetes
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

2. Node Feature Discovery

# Discovers GPU capabilities on nodes
kubectl create -f https://raw.githubusercontent.com/kubernetes-sigs/node-feature-discovery/master/deployment/node-feature-discovery.yaml

3. GPU Feature Discovery

# Provides detailed GPU metrics
kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/master/deployments/static/nvidia-gpu-feature-discovery-daemonset.yaml

4. NVIDIA DCGM Exporter

# Exports GPU metrics to Prometheus
kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/deployment/dcgm-exporter.yaml

Monitoring Stack

Prometheus Configuration:

# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'triton-metrics'
        static_configs:
          - targets: ['triton-service:8002']
      - job_name: 'dcgm-exporter'
        static_configs:
          - targets: ['dcgm-exporter:9400']    
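
The custom metric triton_queue_compute_ratio used for autoscaling later in this guide is not exported by Triton directly; it can be derived from Triton's raw counters (nv_inference_queue_duration_us and nv_inference_compute_infer_duration_us). A minimal sketch of a Prometheus recording rule that could produce it; the group name and the 1m window are assumptions:

# triton-recording-rules.yaml (referenced via rule_files in prometheus.yml)
groups:
- name: triton-autoscaling
  rules:
  - record: triton_queue_compute_ratio
    # time requests spend queued vs. time spent in compute over the last minute;
    # clamp_min avoids division by zero when the server is idle
    expr: |
      rate(nv_inference_queue_duration_us[1m])
        / clamp_min(rate(nv_inference_compute_infer_duration_us[1m]), 1)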

Deploying LLMs with Triton Inference Server

Triton Server Deployment

# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
        ports:
        - containerPort: 8000  # HTTP
        - containerPort: 8001  # gRPC
        - containerPort: 8002  # Metrics
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        volumeMounts:
        - name: model-repository
          mountPath: /models
        - name: engine-files
          mountPath: /engines
        resources:
          limits:
            nvidia.com/gpu: 1
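        # Suggested health probes: Triton serves /v2/health/ready and
        # /v2/health/live on its HTTP port (values below are illustrative)
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 15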
      volumes:
      - name: model-repository
        persistentVolumeClaim:
          claimName: model-repo-pvc
      - name: engine-files
        hostPath:
          path: /opt/triton/engines

Service Configuration

# triton-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  selector:
    app: triton-server
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
  type: ClusterIP

Autoscaling Strategies

Horizontal Pod Autoscaler (HPA)

Custom Metrics: the HPA consumes application-level metrics through the Kubernetes custom metrics API. The object below shows how triton_queue_compute_ratio appears when queried through that API; it is served by a metrics adapter rather than applied with kubectl:

# Example response from the custom metrics API
apiVersion: custom.metrics.k8s.io/v1beta1
kind: MetricValueList
metadata:
  name: triton-queue-compute-ratio
items:
- describedObject:
    kind: Deployment
    name: triton-server
  metricName: triton_queue_compute_ratio
  timestamp: "2024-01-01T00:00:00Z"
  value: "0.5"

HPA Configuration:

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Object
    object:
      metric:
        name: triton_queue_compute_ratio
      describedObject:
        apiVersion: v1
        kind: Service
        name: triton-service
      target:
        type: AverageValue
        averageValue: 0.8

Advanced Autoscaling with NIM Operator

The NVIDIA NIM Operator provides enterprise-grade LLM deployment and scaling. The manifest below is illustrative; the exact CRD kinds and fields depend on the operator version, so check the operator documentation for the current schema:

# nim-deployment.yaml
apiVersion: nim.nvidia.com/v1alpha1
kind: NIMDeployment
metadata:
  name: llm-deployment
spec:
  replicas: 2
  model:
    name: gpt2
    version: "1.0"
    format: tensorrt
  resources:
    limits:
      nvidia.com/gpu: 1
    requests:
      nvidia.com/gpu: 1
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
  monitoring:
    prometheus:
      enabled: true
      port: 8002

Multi-Node Scaling Strategies

Load Balancing

Ingress Configuration:

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-ingress
  annotations:
    nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"
spec:
  rules:
  - host: llm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: triton-service
            port:
              number: 8000
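
The ingress above exposes only Triton's HTTP port. If clients call the gRPC endpoint (port 8001) through ingress-nginx, the backend protocol has to be declared; a minimal sketch with an illustrative hostname:

# grpc-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-grpc-ingress
  annotations:
    # tell ingress-nginx to proxy HTTP/2 (gRPC) to the backend;
    # note that gRPC through ingress-nginx generally also requires TLS on the ingress
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
spec:
  rules:
  - host: llm-grpc.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: triton-service
            port:
              number: 8001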

StatefulSet for Persistent Models

# triton-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: triton-statefulset
spec:
  serviceName: triton-service
  replicas: 3
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
        volumeMounts:
        - name: model-storage
          mountPath: /models
  volumeClaimTemplates:
  - metadata:
      name: model-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
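
Kubernetes expects a StatefulSet's serviceName to reference a headless service so that each replica gets a stable DNS identity. A minimal sketch of such a service, which serviceName would point to (the name is illustrative; the earlier ClusterIP service can remain for client traffic):

# triton-headless-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-headless   # set serviceName in the StatefulSet to this name
spec:
  clusterIP: None          # headless: per-pod DNS names instead of a virtual IP
  selector:
    app: triton-server
  ports:
  - name: grpc
    port: 8001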

Performance Monitoring and Optimization

Key Metrics to Monitor

1. GPU Utilization

  • Target: 70-90% utilization
  • Monitor: NVIDIA DCGM metrics
  • Alert: Below 50% or above 95% (see the example alert rules below)

2. Queue Depth

  • Target: Minimal queue buildup
  • Monitor: Triton queue metrics
  • Scale up: When queue depth > 5

3. Response Latency

  • Target: < 200 ms time to first token (TTFT), < 50 ms inter-token latency (ITL)
  • Monitor: End-to-end request timing
  • Alert: P95 latency > 500ms

4. Throughput

  • Target: Maximize tokens/second
  • Monitor: Requests per second
  • Optimize: Batch size and concurrency
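
The GPU utilization and queue targets above can be encoded as Prometheus alerting rules. A hedged sketch, assuming the DCGM exporter metric DCGM_FI_DEV_GPU_UTIL and the triton_queue_compute_ratio recording rule from earlier; thresholds and durations are illustrative:

# llm-alerts.yaml (loaded via rule_files in prometheus.yml)
groups:
- name: llm-serving-alerts
  rules:
  - alert: GPUUnderutilized
    expr: avg(DCGM_FI_DEV_GPU_UTIL) < 50
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Average GPU utilization below 50% for 15 minutes"
  - alert: GPUSaturated
    expr: avg(DCGM_FI_DEV_GPU_UTIL) > 95
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Average GPU utilization above 95% for 5 minutes"
  - alert: TritonQueueBuildup
    expr: max(triton_queue_compute_ratio) > 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Requests are spending more time queued than computing"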

Grafana Dashboard

# grafana-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "LLM Performance Dashboard",
        "panels": [
          {
            "title": "GPU Utilization",
            "type": "graph",
            "targets": [
              {
                "expr": "DCGM_FI_DEV_GPU_UTIL",
                "legendFormat": "GPU {{gpu}}"
              }
            ]
          },
          {
            "title": "Queue Depth",
            "type": "graph",
            "targets": [
              {
                "expr": "triton_queue_size",
                "legendFormat": "Queue Size"
              }
            ]
          }
        ]
      }
    }    

Cost Optimization Strategies

1. Spot Instances for Non-Critical Workloads

# spot-instance-deployment.yaml (AKS spot node pool example; labels, taints,
# and instance types are cloud-specific, so adapt them to your provider)
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: Standard_NC4as_T4_v3  # Azure T4 GPU VM size
        kubernetes.azure.com/scalesetpriority: spot
      tolerations:
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"

2. Mixed Precision for Cost Savings

For TensorRT-LLM engines, precision is fixed when the engine is built (the --dtype float16 flag in the build example above); FP16 roughly halves weight memory versus FP32 and typically improves throughput. The environment variables below are illustrative deployment conventions rather than standard Triton settings:

# fp16-deployment.yaml
spec:
  template:
    spec:
      containers:
      - name: triton
        env:
        # illustrative values; effective precision is baked into the engine at build time
        - name: TRITON_OPTIMIZATION_LEVEL
          value: "1"
        - name: TRITON_PRECISION
          value: "FP16"

3. Intelligent Scaling Policies

# scaling-policy.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: intelligent-hpa
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15

Production Best Practices

1. Resource Planning

  • GPU Memory: Plan for roughly 1.5x the model's weight footprint (a 7B-parameter FP16 model is about 14 GB of weights, so budget around 21 GB to leave headroom for KV cache and activations)
  • CPU: 4-8 cores per GPU
  • Network: 10Gbps minimum for multi-node
  • Storage: NVMe SSDs for model loading

2. Security Considerations

# security-context.yaml
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
      - name: triton
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true

3. Backup and Recovery

# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: model-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: backup-tool:latest
            command: ["/backup-models.sh"]

Real-World Performance Expectations

Single GPU Performance (A100)

  • 7B Models: 100-300 tokens/second
  • 13B Models: 50-150 tokens/second
  • 70B Models: 20-80 tokens/second (multi-GPU)

Multi-Node Scaling

  • Linear scaling: Up to 8 nodes
  • Network overhead: 5-15% performance loss
  • Memory efficiency: 90-95% utilization

Cost Optimization Results

  • Spot instances: 60-80% cost savings
  • Mixed precision: 20-40% performance improvement
  • Autoscaling: 30-50% infrastructure cost reduction

Troubleshooting Common Issues

1. GPU Memory Issues

# Check GPU memory usage
kubectl exec -it triton-pod -- nvidia-smi

# Monitor memory allocation
kubectl logs triton-pod | grep "CUDA out of memory"

2. Network Bottlenecks

# Check network throughput between nodes (assumes iperf3 is available in the
# image and an iperf3 server is listening on the target node)
kubectl exec -it triton-pod -- iperf3 -c other-node

# Check overall node resource pressure (CPU/memory)
kubectl top nodes

3. Scaling Issues

# Check HPA status
kubectl describe hpa triton-hpa

# Verify custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/deployments/triton-server/triton_queue_compute_ratio"

Conclusion

Scaling LLMs with Kubernetes requires a comprehensive approach that combines:

  1. Model Optimization: TensorRT-LLM for efficient inference
  2. Infrastructure Setup: Proper GPU support and monitoring
  3. Intelligent Autoscaling: HPA with custom metrics
  4. Cost Optimization: Spot instances and mixed precision
  5. Production Readiness: Security, monitoring, and backup

The key to success is starting with a solid foundation and iteratively optimizing based on real-world metrics. Remember that LLM scaling is not just about handling more requests - it’s about doing so efficiently while maintaining quality and controlling costs.

For production deployments, consider using the NVIDIA NIM Operator for enterprise-grade LLM management, or build custom solutions using the patterns outlined in this guide.

The future of LLM deployment lies in intelligent, automated scaling that responds to real-time demand while optimizing for both performance and cost. Kubernetes provides the foundation, but success requires careful planning, monitoring, and continuous optimization.

For more detailed information about specific tools and methodologies, check out the NVIDIA Triton documentation and the comprehensive Kubernetes autoscaling guides.
