
LLM Benchmarking: Performance Measurement
- ai
- July 29, 2025
Benchmarking LLMs is more complex than it appears - different tools measure the same metrics differently, making comparisons challenging.
This guide will walk you through the fundamental concepts of LLM benchmarking, the key metrics you need to understand, and how to properly measure performance for your specific use case.
The Two Types of LLM Evaluation
Before diving into metrics, it’s important to understand the two distinct approaches to evaluating LLM performance:
1. Load Testing
Load testing focuses on simulating real-world traffic patterns to assess how your LLM deployment handles scale. This approach helps identify:
- Server capacity limitations
- Autoscaling effectiveness
- Network latency issues
- Resource utilization patterns
Load testing answers the question: “Can my system handle the expected traffic?”
2. Performance Benchmarking
Performance benchmarking measures the actual model efficiency and optimization. This approach focuses on:
- Model throughput and latency
- Token-level performance metrics
- Memory and compute efficiency
- Model configuration optimization
Performance benchmarking answers the question: “How efficiently does my model process requests?”
The key insight: You need both approaches. Load testing ensures your infrastructure can handle the load, while performance benchmarking ensures your model is optimized for efficiency.
Core LLM Benchmarking Metrics
1. Time to First Token (TTFT)
TTFT measures the time from when a request is sent until the first token is received. This is crucial for user experience - users expect to see a response quickly.
Why TTFT matters:
- Determines perceived responsiveness
- Affected by model size and input sequence length
- Critical for real-time applications like chatbots
Rule of thumb: Most interactive applications target TTFT under 200ms for a good user experience, though the right threshold depends on your use case.
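Below is a minimal sketch of measuring TTFT against an OpenAI-compatible streaming endpoint. The base_url, api_key, and model name are placeholders for your own deployment, and the measurement includes client-side network latency; adjust for your serving stack.

```python
import time

from openai import OpenAI  # assumes an OpenAI-compatible serving endpoint

# Placeholder endpoint and credentials -- point these at your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def measure_ttft(prompt: str, model: str = "my-model") -> float:
    """Return seconds from request submission to the first streamed token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=8,  # a handful of tokens is enough to observe TTFT
    )
    for chunk in stream:
        # The first chunk that carries content marks the first token.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any tokens")

print(f"TTFT: {measure_ttft('Explain KV caching in one sentence.'):.3f}s")
```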
2. Inter-Token Latency (ITL)
ITL measures the time between consecutive tokens during generation. This metric reveals how efficiently the model generates text.
Key characteristics:
- Should be consistent throughout generation
- Affected by KV cache management
- Grows with sequence length due to attention computation
Formula: ITL = (E2E_Latency - TTFT) / (output_tokens - 1)
Note: The numerator is the generation time after the first token arrives (its latency is already captured by TTFT), and the denominator is output_tokens - 1 because N tokens have only N - 1 gaps between them.
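To make the minus-one concrete, here is a small sketch that derives ITL from per-token arrival timestamps, which you would collect in a streaming loop like the TTFT example above:

```python
def inter_token_latency(token_timestamps: list[float]) -> float:
    """Mean gap between consecutive token arrivals, in seconds.

    N tokens have only N - 1 gaps, hence the N - 1 denominator; the
    first token's latency is already accounted for by TTFT.
    """
    if len(token_timestamps) < 2:
        raise ValueError("need at least two tokens to measure a gap")
    generation_time = token_timestamps[-1] - token_timestamps[0]
    return generation_time / (len(token_timestamps) - 1)

# Tokens arriving 0.12s, 0.15s, 0.18s, 0.21s after the request was sent:
print(inter_token_latency([0.12, 0.15, 0.18, 0.21]))  # 0.03 -> 30ms ITL
```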
3. End-to-End Latency
The total time from request submission to final token reception:
Formula: E2E_Latency = TTFT + (ITL × (output_tokens - 1))
Important considerations:
- Different tools measure this differently
- Some include network overhead, others don’t
- GenAI-Perf excludes the final “done” signal
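As a quick sanity check of the formula: with a TTFT of 120ms, an ITL of 35ms, and 128 output tokens, E2E_Latency = 120 + 35 × (128 - 1) = 4,565ms, or roughly 4.6 seconds for the complete response.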
4. Throughput (Tokens Per Second - TPS)
TPS measures how many tokens the system can generate per second across all concurrent requests.
Two common definitions:
GenAI-Perf approach:
TPS = total_output_tokens / (last_response_time - first_request_time)
LLMPerf approach:
TPS = total_output_tokens / total_benchmark_duration
Key difference: LLMPerf includes setup/teardown overhead, which can account for up to 33% of benchmark duration in single-concurrency scenarios.
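The difference is easy to see in code. This sketch computes both definitions from the same set of request records; the RequestRecord fields are illustrative names, not taken from either tool:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    send_time: float      # when this request was sent (seconds)
    done_time: float      # when its final token arrived (seconds)
    output_tokens: int    # number of tokens generated for this request

def tps_genai_perf_style(records: list[RequestRecord]) -> float:
    """Tokens over the span from first request sent to last response done."""
    total_tokens = sum(r.output_tokens for r in records)
    span = max(r.done_time for r in records) - min(r.send_time for r in records)
    return total_tokens / span

def tps_llmperf_style(records: list[RequestRecord],
                      benchmark_start: float, benchmark_end: float) -> float:
    """Tokens over the entire benchmark run, setup/teardown included."""
    total_tokens = sum(r.output_tokens for r in records)
    return total_tokens / (benchmark_end - benchmark_start)
```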
Critical Benchmarking Parameters
Sequence Length Parameters
Input Sequence Length (ISL):
- Affects memory requirements for KV cache
- Longer ISL increases TTFT
- Should match your application’s typical input length
Output Sequence Length (OSL):
- Affects generation time and memory usage
- Longer OSL increases ITL
- Should reflect your application’s output requirements
Common ISL/OSL pairs by use case:
- Chatbots: 512/128 tokens
- Document summarization: 2048/256 tokens
- Code generation: 1024/512 tokens
- Creative writing: 256/1024 tokens
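One way to keep these pairs straight is to encode them as data and sweep them. In this sketch, run_benchmark is a hypothetical placeholder standing in for whatever actually drives your tool of choice (GenAI-Perf, LLMPerf, or a custom script):

```python
# Hypothetical scenario sweep; run_benchmark is a placeholder to be replaced
# with a call into your actual benchmarking tool.
SCENARIOS = {
    "chatbot":       {"isl": 512,  "osl": 128},
    "summarization": {"isl": 2048, "osl": 256},
    "code_gen":      {"isl": 1024, "osl": 512},
    "creative":      {"isl": 256,  "osl": 1024},
}

def run_benchmark(isl: int, osl: int, concurrency: int) -> dict:
    """Placeholder: invoke your tool here and return its reported metrics."""
    return {"ttft_ms": None, "itl_ms": None, "tps": None}

for name, cfg in SCENARIOS.items():
    for concurrency in (1, 4, 16, 64):
        metrics = run_benchmark(cfg["isl"], cfg["osl"], concurrency)
        print(f"{name}: concurrency={concurrency} -> {metrics}")
```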
Load Control Parameters
Concurrency:
- Number of simultaneous requests
- Should range from 1 to maximum batch size
- Higher concurrency increases system TPS but decreases per-request TPS
Request Rate:
- Requests per second
- If the rate exceeds the system’s processing capacity, the request queue grows without bound and latency numbers become meaningless
- Recommendation: Prefer concurrency over request rate; a minimal sketch follows below
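Here is a minimal asyncio sketch of concurrency-based load control; send_request is a placeholder for a real client call:

```python
import asyncio

async def send_request(prompt: str) -> None:
    """Placeholder for a real streaming request to your endpoint."""
    await asyncio.sleep(0.1)  # simulate network + generation time

async def run_with_concurrency(prompts: list[str], concurrency: int) -> None:
    # The semaphore caps in-flight requests, so at most `concurrency`
    # requests are ever outstanding -- unlike a fixed request rate, which
    # can outpace the server and let the queue back up without bound.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(prompt: str) -> None:
        async with sem:
            await send_request(prompt)

    await asyncio.gather(*(bounded(p) for p in prompts))

asyncio.run(run_with_concurrency(["hello"] * 100, concurrency=8))
```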
Model Parameters
Precision:
- FP16: Good balance of speed and accuracy
- INT8: Faster but may reduce quality
- FP32: Highest accuracy but slowest
Batch Size:
- Affects memory utilization and throughput
- Larger batches = higher throughput but higher latency
- Should match your concurrency settings
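As one concrete example, vLLM exposes both knobs when constructing an engine: dtype selects precision and max_num_seqs caps how many sequences are batched together. Argument names here are from recent vLLM releases, so verify them against the version you have installed:

```python
# Hedged vLLM example: dtype selects precision, max_num_seqs caps the batch.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",  # small placeholder model for illustration
    dtype="float16",            # FP16: the usual speed/accuracy balance
    max_num_seqs=64,            # upper bound on concurrently batched sequences
)
params = SamplingParams(max_tokens=128)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```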
Industry-Standard Benchmarking Tools
1. NVIDIA GenAI-Perf
NVIDIA’s official benchmarking tool with several advantages:
Strengths:
- Precise metric definitions
- Sliding window technique for stable measurements
- Excludes warm-up/cool-down periods
- Supports both concurrency and request rate modes
Key features:
- Measures actual model performance, not infrastructure
- Provides detailed token-level metrics
- Optimized for NVIDIA hardware
2. LLMPerf
Open-source benchmarking tool with different methodology:
Strengths:
- Includes infrastructure overhead in measurements
- Batch-based request sending
- Good for end-to-end system evaluation
Key differences:
- Includes setup/teardown time in TPS calculations
- Uses batch processing with drain periods
- More realistic for production scenarios
3. Custom Benchmarking
For production systems, you might need custom benchmarks that match your specific:
- Traffic patterns
- Request distributions
- Quality requirements
Best Practices for Accurate Benchmarking
1. Use Representative Workloads
Your benchmark should reflect your actual use case:
- Real input/output sequence lengths
- Typical request patterns
- Expected concurrency levels
2. Measure Multiple Metrics
Don’t focus on a single metric:
- TTFT for user experience
- ITL for generation efficiency
- TPS for system capacity
- Memory usage for cost optimization
3. Test Across Different Conditions
Benchmark under various scenarios:
- Different sequence lengths
- Multiple concurrency levels
- Various model configurations
- Peak vs. average load conditions
4. Account for Overhead
Understand what your benchmarking tool includes:
- Network latency
- Serialization overhead
- Framework startup time
- Memory allocation costs
5. Use Stable Measurements
- Exclude warm-up and cool-down periods
- Use sliding windows for consistent results
- Run multiple iterations
- Report confidence intervals
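A small sketch of the last two points: run the measurement several times, discard warm-up iterations, and report a mean with a confidence interval (a normal approximation is used here for simplicity; a t-distribution is more appropriate for very few samples):

```python
import statistics

def summarize_runs(samples_ms: list[float], warmup: int = 2) -> tuple[float, float]:
    """Mean and ~95% confidence half-width, excluding warm-up iterations."""
    stable = samples_ms[warmup:]
    mean = statistics.mean(stable)
    # 1.96 * standard error (normal approximation).
    half_width = 1.96 * statistics.stdev(stable) / len(stable) ** 0.5
    return mean, half_width

# First two runs are warm-up and visibly slower; exclude them.
ttft_runs_ms = [310.0, 280.0, 152.0, 149.0, 151.0, 148.0, 153.0, 150.0]
mean, ci = summarize_runs(ttft_runs_ms)
print(f"TTFT: {mean:.1f} ms ± {ci:.1f} ms (95% CI)")
```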
Real-World Performance Expectations
The figures below are ballpark ranges based on published industry benchmarks with modern datacenter GPUs (A100/H100); actual numbers vary with the inference engine, precision, and batch size:
7B Parameter Models
- TTFT: 50-150ms
- ITL: 20-50ms per token
- Throughput: 100-300 tokens/second per GPU
- Memory: 12-16GB VRAM
13B Parameter Models
- TTFT: 100-250ms
- ITL: 30-70ms per token
- Throughput: 50-150 tokens/second per GPU
- Memory: 24-32GB VRAM
70B Parameter Models
- TTFT: 300-800ms
- ITL: 80-150ms per token
- Throughput: 20-80 tokens/second per GPU
- Memory: 80-140GB VRAM (multi-GPU)
Cost Optimization Through Benchmarking
Understanding the Cost-Performance Trade-off
- Higher throughput = lower cost per token
- Lower latency = better user experience
- Memory efficiency = lower infrastructure costs
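As a quick worked example: a hypothetical GPU billed at $2/hour that sustains 200 tokens/second produces 720,000 tokens per hour, which works out to about $2.78 per million tokens; doubling throughput at the same price halves that cost.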
Optimization Strategies
- Right-size your model: Use the smallest model that meets quality requirements
- Optimize batch sizes: Balance throughput and latency
- Choose appropriate precision: FP16 often provides good balance
- Scale horizontally: Add more GPUs for higher throughput
- Use efficient inference engines: TensorRT-LLM, vLLM, or Triton
Common Benchmarking Pitfalls
1. Comparing Apples to Oranges
- Different tools measure metrics differently
- Hardware configurations vary significantly
- Model versions and configurations matter
2. Ignoring Real-World Factors
- Network latency in distributed systems
- Memory bandwidth limitations
- CPU bottlenecks in preprocessing
3. Focusing on Single Metrics
- High throughput doesn’t guarantee low latency
- Good TTFT doesn’t ensure consistent ITL
- Memory efficiency affects long-term costs
4. Insufficient Testing
- Single data points aren’t reliable
- Need multiple iterations and confidence intervals
- Should test edge cases and failure scenarios
Tools and Resources
Open-Source Benchmarking Tools
- GenAI-Perf: NVIDIA’s official tool
- LLMPerf: Open-source alternative
- vLLM: Includes built-in benchmarking
- Triton Inference Server: ships with the perf_analyzer benchmarking utility
Industry and Commercial Solutions
- MLPerf: industry-standard benchmark suite from MLCommons
- Cloud provider tools: AWS, GCP, Azure benchmarking
- Model-specific tools: Hugging Face, Anthropic, OpenAI
Conclusion
Proper LLM benchmarking requires understanding both the metrics and the methodology. The key is to:
- Choose the right metrics for your use case
- Use consistent methodology across comparisons
- Test representative workloads that match production
- Account for all overhead in your measurements
- Optimize for your specific requirements (latency vs. throughput vs. cost)
Remember: Benchmarking is not a one-time activity. As your models, hardware, and requirements evolve, regular benchmarking helps ensure optimal performance and cost efficiency.
The most successful LLM deployments combine both load testing (for infrastructure validation) and performance benchmarking (for model optimization) to achieve the best results for their specific use cases.
For more detailed information about specific benchmarking tools and methodologies, check out the NVIDIA GenAI-Perf documentation and the comprehensive benchmarking guides from major cloud providers.