
LLM Benchmarking: Performance Measurement
- ai
- July 29, 2025
Benchmarking LLMs is more complex than it appears - different tools measure the same metrics differently, making comparisons challenging.
This guide will walk you through the fundamental concepts of LLM benchmarking, the key metrics you need to understand, and how to properly measure performance for your specific use case.
The Two Types of LLM Evaluation
Before diving into metrics, it’s important to understand the two distinct approaches to evaluating LLM performance:
1. Load Testing
Load testing focuses on simulating real-world traffic patterns to assess how your LLM deployment handles scale. This approach helps identify:
- Server capacity limitations
- Autoscaling effectiveness
- Network latency issues
- Resource utilization patterns
Load testing answers the question: “Can my system handle the expected traffic?”
2. Performance Benchmarking
Performance benchmarking measures the actual model efficiency and optimization. This approach focuses on:
- Model throughput and latency
- Token-level performance metrics
- Memory and compute efficiency
- Model configuration optimization
Performance benchmarking answers the question: “How efficiently does my model process requests?”
The key insight: You need both approaches. Load testing ensures your infrastructure can handle the load, while performance benchmarking ensures your model is optimized for efficiency.
Core LLM Benchmarking Metrics
1. Time to First Token (TTFT)
TTFT measures the time from when a request is sent until the first token is received. This is crucial for user experience - users expect to see a response quickly.
Why TTFT matters:
- Determines perceived responsiveness
- Affected by model size and input sequence length
- Critical for real-time applications like chatbots
Rule of thumb: Most interactive applications target TTFT under 200ms for a good user experience, though the right threshold depends on your use case.
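Below is a minimal sketch of measuring TTFT against an OpenAI-compatible streaming endpoint. The base_url, api_key, and model name are placeholders for your own deployment, and the measurement includes client-side network latency; adjust for your serving stack.

```python
import time

from openai import OpenAI  # assumes an OpenAI-compatible serving endpoint

# Placeholder endpoint and credentials -- point these at your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def measure_ttft(prompt: str, model: str = "my-model") -> float:
    """Return seconds from request submission to the first streamed token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=8,  # a handful of tokens is enough to observe TTFT
    )
    for chunk in stream:
        # The first chunk that carries content marks the first token.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any tokens")

print(f"TTFT: {measure_ttft('Explain KV caching in one sentence.'):.3f}s")
```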
2. Inter-Token Latency (ITL)
ITL measures the time between consecutive tokens during generation. This metric reveals how efficiently the model generates text.
Key characteristics:
- Should be consistent throughout generation
- Affected by KV cache management
- Grows with sequence length due to attention computation
Formula: ITL = (E2E_Latency - TTFT) / (output_tokens - 1)
Note: The numerator is the generation time after the first token arrives (its latency is already captured by TTFT), and the denominator is output_tokens - 1 because N tokens have only N - 1 gaps between them.
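To make the minus-one concrete, here is a small sketch that derives ITL from per-token arrival timestamps, which you would collect in a streaming loop like the TTFT example above:

```python
def inter_token_latency(token_timestamps: list[float]) -> float:
    """Mean gap between consecutive token arrivals, in seconds.

    N tokens have only N - 1 gaps, hence the N - 1 denominator; the
    first token's latency is already accounted for by TTFT.
    """
    if len(token_timestamps) < 2:
        raise ValueError("need at least two tokens to measure a gap")
    generation_time = token_timestamps[-1] - token_timestamps[0]
    return generation_time / (len(token_timestamps) - 1)

# Tokens arriving 0.12s, 0.15s, 0.18s, 0.21s after the request was sent:
print(inter_token_latency([0.12, 0.15, 0.18, 0.21]))  # 0.03 -> 30ms ITL
```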
3. End-to-End Latency
The total time from request submission to final token reception:
Formula: E2E_Latency = TTFT + (ITL × (output_tokens - 1))
Important considerations:
- Different tools measure this differently
- Some include network overhead, others don’t
- GenAI-Perf excludes the final “done” signal
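As a quick sanity check of the formula: with a TTFT of 120ms, an ITL of 35ms, and 128 output tokens, E2E_Latency = 120 + 35 × (128 - 1) = 4,565ms, or roughly 4.6 seconds for the complete response.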
4. Throughput (Tokens Per Second - TPS)
TPS measures how many tokens the system can generate per second across all concurrent requests.
Two common definitions:
GenAI-Perf approach:
TPS = total_output_tokens / (last_response_time - first_request_time)
LLMPerf approach:
TPS = total_output_tokens / total_benchmark_duration
Key difference: LLMPerf includes setup/teardown overhead, which can account for up to 33% of benchmark duration in single-concurrency scenarios.
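The difference is easy to see in code. This sketch computes both definitions from the same set of request records; the RequestRecord fields are illustrative names, not taken from either tool:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    send_time: float      # when this request was sent (seconds)
    done_time: float      # when its final token arrived (seconds)
    output_tokens: int    # number of tokens generated for this request

def tps_genai_perf_style(records: list[RequestRecord]) -> float:
    """Tokens over the span from first request sent to last response done."""
    total_tokens = sum(r.output_tokens for r in records)
    span = max(r.done_time for r in records) - min(r.send_time for r in records)
    return total_tokens / span

def tps_llmperf_style(records: list[RequestRecord],
                      benchmark_start: float, benchmark_end: float) -> float:
    """Tokens over the entire benchmark run, setup/teardown included."""
    total_tokens = sum(r.output_tokens for r in records)
    return total_tokens / (benchmark_end - benchmark_start)
```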
Critical Benchmarking Parameters
Sequence Length Parameters
Input Sequence Length (ISL):
- Affects memory requirements for KV cache
- Longer ISL increases TTFT
- Should match your application’s typical input length
Output Sequence Length (OSL):
- Affects generation time and memory usage
- Longer OSL increases ITL
- Should reflect your application’s output requirements
Common ISL/OSL pairs by use case:
- Chatbots: 512/128 tokens
- Document summarization: 2048/256 tokens
- Code generation: 1024/512 tokens
- Creative writing: 256/1024 tokens
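One way to keep these pairs straight is to encode them as data and sweep them. In this sketch, run_benchmark is a hypothetical placeholder standing in for whatever actually drives your tool of choice (GenAI-Perf, LLMPerf, or a custom script):

```python
# Hypothetical scenario sweep; run_benchmark is a placeholder to be replaced
# with a call into your actual benchmarking tool.
SCENARIOS = {
    "chatbot":       {"isl": 512,  "osl": 128},
    "summarization": {"isl": 2048, "osl": 256},
    "code_gen":      {"isl": 1024, "osl": 512},
    "creative":      {"isl": 256,  "osl": 1024},
}

def run_benchmark(isl: int, osl: int, concurrency: int) -> dict:
    """Placeholder: invoke your tool here and return its reported metrics."""
    return {"ttft_ms": None, "itl_ms": None, "tps": None}

for name, cfg in SCENARIOS.items():
    for concurrency in (1, 4, 16, 64):
        metrics = run_benchmark(cfg["isl"], cfg["osl"], concurrency)
        print(f"{name}: concurrency={concurrency} -> {metrics}")
```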
Load Control Parameters
Concurrency:
- Number of simultaneous requests
- Should range from 1 to maximum batch size
- Higher concurrency increases system TPS but decreases per-request TPS
Request Rate:
- Requests per second
- If the rate exceeds the system’s processing capacity, the request queue grows without bound and latency numbers become meaningless
- Recommendation: Prefer concurrency over request rate; a minimal sketch follows below
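Here is a minimal asyncio sketch of concurrency-based load control; send_request is a placeholder for a real client call:

```python
import asyncio

async def send_request(prompt: str) -> None:
    """Placeholder for a real streaming request to your endpoint."""
    await asyncio.sleep(0.1)  # simulate network + generation time

async def run_with_concurrency(prompts: list[str], concurrency: int) -> None:
    # The semaphore caps in-flight requests, so at most `concurrency`
    # requests are ever outstanding -- unlike a fixed request rate, which
    # can outpace the server and let the queue back up without bound.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(prompt: str) -> None:
        async with sem:
            await send_request(prompt)

    await asyncio.gather(*(bounded(p) for p in prompts))

asyncio.run(run_with_concurrency(["hello"] * 100, concurrency=8))
```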
Model Parameters
Precision:
- FP16: Good balance of speed and accuracy
- INT8: Faster but may reduce quality
- FP32: Highest accuracy but slowest
Batch Size:
- Affects memory utilization and throughput
- Larger batches = higher throughput but higher latency
- Should match your concurrency settings
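As one concrete example, vLLM exposes both knobs when constructing an engine: dtype selects precision and max_num_seqs caps how many sequences are batched together. Argument names here are from recent vLLM releases, so verify them against the version you have installed:

```python
# Hedged vLLM example: dtype selects precision, max_num_seqs caps the batch.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",  # small placeholder model for illustration
    dtype="float16",            # FP16: the usual speed/accuracy balance
    max_num_seqs=64,            # upper bound on concurrently batched sequences
)
params = SamplingParams(max_tokens=128)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```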
Industry-Standard Benchmarking Tools
1. NVIDIA GenAI-Perf
NVIDIA’s official benchmarking tool with several advantages:
Strengths:
- Precise metric definitions
- Sliding window technique for stable measurements
- Excludes warm-up/cool-down periods
- Supports both concurrency and request rate modes
Key features:
- Measures actual model performance, not infrastructure
- Provides detailed token-level metrics
- Optimized for NVIDIA hardware
2. LLMPerf
Open-source benchmarking tool with different methodology:
Strengths:
- Includes infrastructure overhead in measurements
- Batch-based request sending
- Good for end-to-end system evaluation
Key differences:
- Includes setup/teardown time in TPS calculations
- Uses batch processing with drain periods
- More realistic for production scenarios
3. Custom Benchmarking
For production systems, you might need custom benchmarks that match your specific:
- Traffic patterns
- Request distributions
- Quality requirements
Best Practices for Accurate Benchmarking
1. Use Representative Workloads
Your benchmark should reflect your actual use case:
- Real input/output sequence lengths
- Typical request patterns
- Expected concurrency levels
2. Measure Multiple Metrics
Don’t focus on a single metric:
- TTFT for user experience
- ITL for generation efficiency
- TPS for system capacity
- Memory usage for cost optimization
3. Test Across Different Conditions
Benchmark under various scenarios:
- Different sequence lengths
- Multiple concurrency levels
- Various model configurations
- Peak vs. average load conditions
4. Account for Overhead
Understand what your benchmarking tool includes:
- Network latency
- Serialization overhead
- Framework startup time
- Memory allocation costs
5. Use Stable Measurements
- Exclude warm-up and cool-down periods
- Use sliding windows for consistent results
- Run multiple iterations
- Report confidence intervals
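A small sketch of the last two points: run the measurement several times, discard warm-up iterations, and report a mean with a confidence interval (a normal approximation is used here for simplicity; a t-distribution is more appropriate for very few samples):

```python
import statistics

def summarize_runs(samples_ms: list[float], warmup: int = 2) -> tuple[float, float]:
    """Mean and ~95% confidence half-width, excluding warm-up iterations."""
    stable = samples_ms[warmup:]
    mean = statistics.mean(stable)
    # 1.96 * standard error (normal approximation).
    half_width = 1.96 * statistics.stdev(stable) / len(stable) ** 0.5
    return mean, half_width

# First two runs are warm-up and visibly slower; exclude them.
ttft_runs_ms = [310.0, 280.0, 152.0, 149.0, 151.0, 148.0, 153.0, 150.0]
mean, ci = summarize_runs(ttft_runs_ms)
print(f"TTFT: {mean:.1f} ms ± {ci:.1f} ms (95% CI)")
```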
Real-World Performance Expectations
The figures below are ballpark ranges based on published industry benchmarks with modern datacenter GPUs (A100/H100); actual numbers vary with the inference engine, precision, and batch size:
7B Parameter Models
- TTFT: 50-150ms
- ITL: 20-50ms per token
- Throughput: 100-300 tokens/second per GPU
- Memory: 12-16GB VRAM
13B Parameter Models
- TTFT: 100-250ms
- ITL: 30-70ms per token
- Throughput: 50-150 tokens/second per GPU
- Memory: 24-32GB VRAM
70B Parameter Models
- TTFT: 300-800ms
- ITL: 80-150ms per token
- Throughput: 20-80 tokens/second per GPU
- Memory: 80-140GB VRAM (multi-GPU)
Cost Optimization Through Benchmarking
Understanding the Cost-Performance Trade-off
- Higher throughput = lower cost per token
- Lower latency = better user experience
- Memory efficiency = lower infrastructure costs
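As a quick worked example: a hypothetical GPU billed at $2/hour that sustains 200 tokens/second produces 720,000 tokens per hour, which works out to about $2.78 per million tokens; doubling throughput at the same price halves that cost.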
Optimization Strategies
- Right-size your model: Use the smallest model that meets quality requirements
- Optimize batch sizes: Balance throughput and latency
- Choose appropriate precision: FP16 often provides good balance
- Scale horizontally: Add more GPUs for higher throughput
- Use efficient inference engines: TensorRT-LLM, vLLM, or Triton
Common Benchmarking Pitfalls
1. Comparing Apples to Oranges
- Different tools measure metrics differently
- Hardware configurations vary significantly
- Model versions and configurations matter
2. Ignoring Real-World Factors
- Network latency in distributed systems
- Memory bandwidth limitations
- CPU bottlenecks in preprocessing
3. Focusing on Single Metrics
- High throughput doesn’t guarantee low latency
- Good TTFT doesn’t ensure consistent ITL
- Memory efficiency affects long-term costs
4. Insufficient Testing
- Single data points aren’t reliable
- Need multiple iterations and confidence intervals
- Should test edge cases and failure scenarios
Tools and Resources
Open-Source Benchmarking Tools
- GenAI-Perf: NVIDIA’s official tool
- LLMPerf: Open-source alternative
- vLLM: Includes built-in benchmarking
- Triton Inference Server: ships with the perf_analyzer benchmarking utility
Industry and Commercial Solutions
- MLPerf: industry-standard benchmark suite from MLCommons
- Cloud provider tools: AWS, GCP, Azure benchmarking
- Model-specific tools: Hugging Face, Anthropic, OpenAI
Conclusion
Proper LLM benchmarking requires understanding both the metrics and the methodology. The key is to:
- Choose the right metrics for your use case
- Use consistent methodology across comparisons
- Test representative workloads that match production
- Account for all overhead in your measurements
- Optimize for your specific requirements (latency vs. throughput vs. cost)
Remember: Benchmarking is not a one-time activity. As your models, hardware, and requirements evolve, regular benchmarking helps ensure optimal performance and cost efficiency.
The most successful LLM deployments combine both load testing (for infrastructure validation) and performance benchmarking (for model optimization) to achieve the best results for their specific use cases.
For more detailed information about specific benchmarking tools and methodologies, check out the NVIDIA GenAI-Perf documentation and the comprehensive benchmarking guides from major cloud providers.