"You can benchmark accuracy all day, but if your model takes 10 seconds to produce the first token, users will leave before seeing how smart it is."
Eval, Latency-Conscious AI Agent
Previous sections in this chapter covered what to measure for model quality. This section covers how to measure system performance: the speed, throughput, and hardware efficiency of LLM training and inference. Performance benchmarking is where model evaluation meets systems engineering. MLPerf provides standardized benchmarks for comparing hardware and software stacks. Inference benchmarking tools measure the metrics that directly affect user experience: Time to First Token (TTFT), Time Per Output Token (TPOT), and throughput under concurrent load. Cross-hardware portability determines whether your training or serving workload can run on TPUs, AMD GPUs, or Intel accelerators, each with its own software stack and trade-offs. Advanced inference scheduling (chunked prefill, cross-instance migration) and KV cache management (compression, reuse, tiered storage) push the boundaries of serving efficiency. Together, these topics complete the evaluation picture: not just "how good is the model?" but "how fast and cost-effective is the system?"
Prerequisites
This section builds on the evaluation fundamentals from Section 29.1 and the evaluation harness ecosystems from Section 29.9. Understanding of inference optimization techniques from Chapter 9 and the production serving architectures from Section 31.1 provides essential context. Familiarity with GPU hardware concepts from Appendix G is helpful but not required.
1. MLPerf Training and Inference Suites
MLPerf, managed by MLCommons, is the industry-standard benchmark suite for comparing ML hardware and software performance. It provides reproducible, audited results that enable fair comparisons across vendors and configurations. For LLM workloads, MLPerf includes both training and inference benchmarks with specific rules and scenarios.
1.1 MLPerf Training for LLMs
The MLPerf Training benchmark measures the time to train a model to a specified quality target. For LLMs, the benchmark uses GPT-3 175B (or equivalent) trained on a fixed dataset to a target validation loss. Key aspects of the benchmark:
- Metric: Time to train (TtT) in minutes, from random initialization to the quality target.
- Rules: Submissions must use the reference model architecture and dataset. Optimizations to the training loop (parallelism strategies, kernel optimizations, communication overlap) are allowed and encouraged. Submissions are categorized as "closed" (strict rules) or "open" (any architecture allowed).
- Interpreting results: Look at both absolute TtT and the number of accelerators used. A system that trains in 5 minutes using 16,384 GPUs is less cost-efficient than one that trains in 20 minutes using 2,048 GPUs. The "performance per accelerator" metric is often more informative than raw TtT.
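The comparison in that bullet reduces to a few lines of arithmetic. A minimal sketch, using the illustrative TtT figures from the text (not actual MLPerf submissions):

```python
def accelerator_minutes_to_target(time_to_train_min, num_accelerators):
    """Accelerator-minutes consumed to reach the quality target (lower is better)."""
    return time_to_train_min * num_accelerators

# Illustrative figures from the text, not real MLPerf results
large_system = accelerator_minutes_to_target(5, 16_384)   # 81,920 accelerator-minutes
small_system = accelerator_minutes_to_target(20, 2_048)   # 40,960 accelerator-minutes

# The smaller system takes 4x the wall-clock time but consumes half
# the accelerator-minutes, making it the more cost-efficient submission.
print(large_system, small_system)
```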
1.2 MLPerf Inference for LLMs
MLPerf Inference defines four scenarios that capture different deployment patterns:
| Scenario | Description | LLM Relevance | Key Metric |
|---|---|---|---|
| Single Stream | One request at a time, measure latency | Edge/mobile LLM inference | 90th percentile latency |
| Multi-Stream | Multiple concurrent queries | Batch processing pipelines | Latency per stream |
| Server | Variable arrival rate, meet latency SLA | Production API serving | Max QPS within latency SLA |
| Offline | Process all samples as fast as possible | Batch inference, evaluation runs | Samples per second |
For production LLM serving, the Server scenario is most relevant. It measures the maximum queries per second (QPS) a system can handle while keeping the 99th percentile latency below a specified threshold. This directly maps to the production serving requirements discussed in Section 31.3.
Who: An infrastructure architect at a mid-size SaaS company planning GPU procurement for a new Llama 2 70B serving cluster.
Situation: The company needed to serve 500 concurrent users with sub-2-second time-to-first-token. The budget allowed for either 8 NVIDIA H100 GPUs at $30,000 each or 16 AMD MI300X GPUs at $15,000 each, both totaling $240,000.
Problem: Vendor benchmarks were not directly comparable. The team needed a standardized source to compare throughput, latency, and cost efficiency for their specific workload profile.
Decision: They consulted MLPerf Inference v4.1 (2024) Server scenario results. The H100 achieved 14,200 tokens/sec on Llama 2 70B; the MI300X achieved approximately 10,800 tokens/sec. On a cost-per-token basis, the MI300X delivered 72% of the H100's throughput at 50% of the price. However, the H100 showed lower P99 TTFT, critical for the team's latency-sensitive chat application.
Result: The team chose 8 H100s for the latency-sensitive chat tier and earmarked MI300X for a planned batch processing tier where throughput per dollar mattered more than tail latency. The standardized MLPerf data let them justify the split strategy to finance with concrete numbers.
Lesson: Always normalize MLPerf results by cost for your specific scenario. A system that achieves 2x throughput at 3x the price is not a better deal. Calculate dollars-per-million-tokens for your target workload and factor in whether latency or throughput is the binding constraint.
Beyond the raw price-performance calculation, factor in the total cost of ownership: power consumption, cooling, and networking infrastructure all shift the dollars-per-million-tokens figure for your target scenario (Server or Offline).
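The dollars-per-million-tokens normalization is simple to script. A sketch with hypothetical prices and throughputs (placeholders, not quotes or benchmark results):

```python
def dollars_per_million_tokens(hourly_cost_usd, tokens_per_sec):
    """Serving cost normalized per million generated tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost_usd * 1_000_000 / tokens_per_hour

# Hypothetical systems: A is faster but pricier than B
system_a = dollars_per_million_tokens(hourly_cost_usd=98.0, tokens_per_sec=14_200)
system_b = dollars_per_million_tokens(hourly_cost_usd=45.0, tokens_per_sec=10_800)

# B wins on cost per token despite lower raw throughput
print(f"A: ${system_a:.2f}/M tokens, B: ${system_b:.2f}/M tokens")
```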
MLPerf results should be interpreted with care. Submissions are heavily optimized by hardware vendors for the specific benchmark workloads, and real-world performance often differs significantly. A system's MLPerf score reflects peak performance under ideal conditions. Production workloads introduce variable request sizes, mixed model versions, and shared infrastructure that reduce effective throughput. Use MLPerf as a directional comparison tool, not as a production capacity planning number. Always validate with your own workload.
2. Benchmarking LLM Inference
While MLPerf provides standardized cross-vendor comparisons, production teams also need to benchmark their specific model, hardware, and serving configuration. This section covers the key metrics, tools, and methodologies for LLM inference benchmarking.
2.1 Key Inference Metrics
LLM inference has a two-phase structure (prefill and decode) that produces distinct metrics:
- Time to First Token (TTFT): The latency from request arrival to the first output token. Dominated by the prefill phase, which processes the entire input prompt. TTFT is the primary metric for perceived responsiveness in streaming applications.
- Time Per Output Token (TPOT): The average time between successive output tokens during the decode phase. Determines the "typing speed" of the model's response. For streaming chat interfaces, TPOT below 50ms (20 tokens/sec) is generally perceived as "instant."
- End-to-End Latency: Total time from request to complete response. Equals TTFT + (output_tokens * TPOT). Important for non-streaming applications and batch processing.
- Throughput: Total tokens generated per second across all concurrent requests. The primary metric for cost efficiency. Higher throughput means more requests served per GPU-hour.
- P50/P99 Latency: Percentile latencies capture tail behavior. A system with 100ms P50 TTFT but 2-second P99 TTFT has a "long tail" problem that affects 1 in 100 users. Production SLAs are typically defined on P99.
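These definitions translate directly into code. A minimal sketch, assuming a streaming client records per-token arrival timestamps:

```python
import math
import statistics

def request_metrics(request_start, token_times):
    """TTFT, mean TPOT, and end-to-end latency from token arrival times (seconds)."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = statistics.mean(gaps) if gaps else 0.0
    e2e = token_times[-1] - request_start  # equals TTFT + decode_tokens * TPOT
    return ttft, tpot, e2e

def percentile(values, p):
    """Nearest-rank percentile, as used for P50/P99 reporting."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# One request: first token at 100ms, then a token every 20ms
ttft, tpot, e2e = request_metrics(0.0, [0.10, 0.12, 0.14, 0.16])
print(ttft, tpot, e2e)
```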
2.2 Benchmarking with Standard Tools
Several tools provide structured LLM inference benchmarking. The choice depends on whether you need quick local benchmarks or rigorous production-representative tests.
# Code Fragment 29.14.2: Benchmarking vLLM with its built-in benchmark tool
# This tool sends concurrent requests and measures TTFT, TPOT, throughput
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 1 &
# Run the benchmark with a realistic request distribution
# (benchmark_serving.py ships in the vLLM repo's benchmarks/ directory)
python benchmarks/benchmark_serving.py \
--backend openai \
--model meta-llama/Llama-3.1-8B-Instruct \
--base-url http://localhost:8000 \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered.json \
--num-prompts 1000 \
--request-rate 10 \
--seed 42
# Output includes:
# - Successful requests: 1000/1000
# - Throughput: 3,245 tokens/sec
# - Mean TTFT: 87ms, P50: 72ms, P99: 312ms
# - Mean TPOT: 18ms, P50: 15ms, P99: 45ms
# - Mean E2E latency: 1.23s, P50: 0.98s, P99: 3.45s
AIPerf is a more recent benchmarking harness designed for modern LLM inference workloads. It supports configurable request distributions, mixed model serving, and real-world traffic patterns (bursty arrivals, variable prompt/completion lengths):
# Code Fragment 29.14.3: Using AIPerf for comprehensive inference benchmarking
# AIPerf supports multiple backends, realistic traffic patterns,
# and detailed latency breakdowns
aiperf benchmark \
--backend vllm \
--endpoint http://localhost:8000/v1/chat/completions \
--model meta-llama/Llama-3.1-70B-Instruct \
--concurrency-levels 1,4,8,16,32,64 \
--prompt-length-distribution "normal:2048:512" \
--completion-length-distribution "normal:256:128" \
--duration 300 \
--warmup 30 \
--output results.json
# Generates a concurrency vs. throughput/latency report:
# Concurrency | QPS | TTFT P50 | TTFT P99 | TPOT P50 | Throughput
# -----------|-------|----------|----------|----------|----------
# 1 | 0.8 | 45ms | 52ms | 12ms | 245 t/s
# 4 | 3.1 | 58ms | 95ms | 14ms | 920 t/s
# 8 | 5.8 | 82ms | 180ms | 16ms | 1,680 t/s
# 16 | 9.2 | 145ms | 420ms | 22ms | 2,580 t/s
# 32 | 12.1 | 310ms | 980ms | 35ms | 3,100 t/s
# 64 | 13.5 | 680ms | 2,100ms | 52ms | 3,250 t/s
The --prompt-length-distribution and --completion-length-distribution flags generate normally distributed token counts that match real-world traffic patterns. The tabular output reveals the throughput-latency curve: throughput plateaus at 32 concurrent requests while P99 TTFT crosses 1 second at 64.
Every LLM serving system has a characteristic throughput-latency curve. At low concurrency, latency is minimal but throughput is wasted (GPUs sit idle between requests). As concurrency increases, throughput grows but latency also rises. At some point, the system saturates: throughput plateaus while latency grows rapidly. The optimal operating point is just before saturation, where throughput is near-maximum and latency is still within SLA. Benchmarking at multiple concurrency levels reveals this curve and tells you the maximum QPS your deployment can handle while meeting your latency SLA.
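Picking that operating point from a concurrency sweep is a one-liner. The rows below copy the illustrative report above, with a hypothetical 1-second P99 TTFT SLA:

```python
# (concurrency, ttft_p99_ms, throughput_tokens_per_sec), from the report above
sweep = [
    (1, 52, 245),
    (4, 95, 920),
    (8, 180, 1680),
    (16, 420, 2580),
    (32, 980, 3100),
    (64, 2100, 3250),
]

SLA_MS = 1000  # hypothetical P99 TTFT budget

# Highest-throughput configuration that still meets the latency SLA
feasible = [row for row in sweep if row[1] <= SLA_MS]
concurrency, p99, throughput = max(feasible, key=lambda r: r[2])
print(concurrency, p99, throughput)  # 32 980 3100
```

At concurrency 64 the system is past saturation: 5% more throughput for more than double the tail latency, which violates the SLA.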
3. Cross-Hardware Portability
NVIDIA GPUs dominate LLM training and inference, but alternative hardware platforms offer competitive performance at different price points. Understanding the software stacks, trade-offs, and kernel gaps of each platform is essential for teams evaluating hardware options or building portable training and serving pipelines.
3.1 Google TPUs with JAX and MaxText
Google's Tensor Processing Units (TPUs) are custom accelerators designed for matrix-heavy ML workloads. TPU v5p offers 459 TFLOPS of BF16 compute and 95 GB of HBM2e per chip, connected via a high-bandwidth inter-chip interconnect (ICI) that enables efficient multi-host parallelism without InfiniBand.
MaxText is Google's reference implementation for LLM training on TPUs. It demonstrates best practices for JAX-based distributed training, including FSDP, tensor parallelism, and XLA compiler optimization:
# Code Fragment 29.14.4: MaxText configuration for Llama-style model on TPU v5p
# MaxText uses a YAML config + Python to define the model and training loop
# maxtext_config.yaml
"""
model_name: "llama-3-8b"
base_output_directory: "gs://my-bucket/maxtext-runs/"
dataset_path: "gs://my-bucket/tokenized-data/"
# Model architecture (Llama 3 8B equivalent)
num_layers: 32
num_heads: 32
num_kv_heads: 8 # GQA
head_dim: 128
mlp_dim: 14336
vocab_size: 128256
max_target_length: 8192
# Training configuration
per_device_batch_size: 4
learning_rate: 3e-4
adam_b1: 0.9
adam_b2: 0.95
weight_decay: 0.1
# Parallelism (for a v5p-256 pod: 256 chips)
ici_fsdp_parallelism: 64 # FSDP across 64 chips on ICI
ici_tensor_parallelism: 4 # TP across 4 chips
dcn_data_parallelism: 1 # No cross-pod DP
# Precision
dtype: "bfloat16"
quantization: "" # Options: "int8", "fp8"
"""
# Launch training on TPU v5p-256
# python3 MaxText/train.py MaxText/configs/base.yml \
# run_name=llama-8b-tpu \
# load_parameters_path="" \
# steps=100000
Trade-offs of TPU vs. NVIDIA GPU:
- Advantages: Tight integration with JAX/XLA compiler produces highly optimized code. ICI interconnect avoids the need for InfiniBand. TPU pods scale to thousands of chips with uniform topology. Cost-per-FLOP can be 30-50% lower than equivalent NVIDIA hardware on Google Cloud.
- Disadvantages: JAX ecosystem is smaller than PyTorch (fewer libraries, less community support). Custom CUDA kernels (FlashAttention, Triton kernels) must be rewritten in Pallas (JAX's kernel language). Debugging is harder because XLA compilation abstracts away the hardware. TPU is only available on Google Cloud.
3.2 AMD GPUs with ROCm
AMD's Instinct MI300X (192 GB HBM3, 1,307 TFLOPS BF16) competes directly with the NVIDIA H100. The ROCm (Radeon Open Compute) software stack provides a CUDA-compatible interface, and many CUDA applications can be ported with the hipify tool.
vLLM on ROCm provides production-ready LLM serving on AMD GPUs. As of 2025, vLLM supports MI250X, MI300X, and MI300A on ROCm 6.x:
# Code Fragment 29.14.5: Running vLLM on AMD MI300X with ROCm
# Pull the ROCm-specific vLLM image
docker pull vllm/vllm-rocm:v0.6.4
# Launch vLLM serving on MI300X
docker run --device=/dev/kfd --device=/dev/dri \
--group-add video \
--shm-size 32g \
-p 8000:8000 \
vllm/vllm-rocm:v0.6.4 \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--dtype bfloat16
# Benchmark on MI300X (benchmark_serving.py from the vLLM repo)
python benchmarks/benchmark_serving.py \
--backend openai \
--model meta-llama/Llama-3.1-70B-Instruct \
--base-url http://localhost:8000 \
--dataset-name sharegpt \
--num-prompts 500 \
--request-rate 8
The --device=/dev/kfd and --device=/dev/dri flags expose the AMD GPU kernel driver and render nodes to the container. The same benchmark_serving tool works identically on ROCm, enabling direct cross-hardware performance comparison.
Kernel gaps on ROCm: While most PyTorch operations work on ROCm via HIP translation, some specialized kernels have reduced performance or are missing entirely:
| Kernel/Feature | CUDA/NVIDIA | ROCm/AMD | Performance Gap |
|---|---|---|---|
| FlashAttention-2 | Native, optimized | CK (Composable Kernel) port | ~15% slower |
| FP8 tensor cores | Transformer Engine | Partial support (MI300X) | ~30% slower |
| Triton kernels | Native | ROCm Triton backend | Varies, 0-25% |
| NCCL collectives | NCCL | RCCL (ROCm fork) | ~10% slower |
| PagedAttention (vLLM) | Optimized | Supported | ~10% slower |
| AWQ/GPTQ quantization | Native | Supported | ~5-15% slower |
3.3 Intel Gaudi with SynapseAI
Intel Gaudi (formerly Habana Labs) accelerators target cost-effective LLM training and inference. The Gaudi 3 chip offers 1,835 TFLOPS BF16 and 128 GB HBM2e. The SynapseAI software stack provides a PyTorch-compatible interface through the Habana PyTorch Bridge.
Trade-offs: Gaudi offers competitive price-performance for training workloads (available as AWS DL1/DL2q instances), but the ecosystem is much smaller than NVIDIA's. Custom kernels are written in TPC (Tensor Processing Core) ISA, which has a steep learning curve. Model support in the optimum-habana library covers major architectures (Llama, Mistral, Falcon) but lags behind NVIDIA's ecosystem for newer model families.
The cross-hardware portability landscape changes remarkably fast. In 2023, running vLLM on AMD GPUs required significant manual patching. By mid-2024, vLLM's ROCm support was production-ready with CI/CD pipelines running on MI300X. By 2025, AMD GPUs were being used in production LLM serving at several hyperscalers. The pace of improvement in alternative hardware ecosystems means that benchmarks and portability assessments have a shelf life of about 6 months.
4. Advanced Inference Scheduling
Standard LLM serving processes requests sequentially through the prefill and decode phases. Advanced scheduling techniques optimize how requests share GPU resources, reducing latency and improving throughput.
4.1 Sarathi-Serve: Chunked Prefill
The core insight of Sarathi-Serve (Agrawal et al., 2024) is that long prefill requests (processing thousands of input tokens) block the GPU for extended periods, creating "head-of-line blocking" that starves decode requests of GPU time. Chunked prefill breaks the prefill phase into smaller chunks that are interleaved with decode iterations from other requests.
Without chunked prefill, a request with 8,000 input tokens monopolizes the GPU for the entire prefill duration (potentially hundreds of milliseconds). Concurrent decode requests must wait, causing their TPOT to spike. With chunked prefill, the 8,000-token prefill is split into 8 chunks of 1,000 tokens each, and each chunk is interleaved with a decode iteration for waiting requests. This bounds the maximum interruption to the decode phase and dramatically reduces TPOT tail latency.
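The benefit shows up in a back-of-envelope model. The prefill rate below is an assumed figure for illustration, not a measurement:

```python
def worst_case_decode_stall_ms(prompt_tokens, chunk_tokens,
                               prefill_tok_per_s=20_000):
    """Longest time a decode step can wait behind one prefill unit (ms).

    prefill_tok_per_s is an assumed prefill rate for illustration.
    """
    unit = min(prompt_tokens, chunk_tokens)
    return unit / prefill_tok_per_s * 1000

# Monolithic prefill: the whole 8,000-token prompt blocks decode at once
monolithic = worst_case_decode_stall_ms(8000, chunk_tokens=8000)
# Chunked prefill: decode steps are interleaved between 1,000-token chunks
chunked = worst_case_decode_stall_ms(8000, chunk_tokens=1000)

print(monolithic, chunked)  # 400.0 50.0
```

The total prefill work is unchanged; chunking only bounds the longest interruption any concurrent decode request experiences, which is exactly what tames TPOT tail latency.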
vLLM (v0.5+) implements chunked prefill with the --enable-chunked-prefill flag. The chunk size is configurable via --max-num-batched-tokens:
# Enable chunked prefill in vLLM for reduced decode latency
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--enable-chunked-prefill \
--max-num-batched-tokens 2048 # Chunk size for prefill
4.2 Llumnix: Cross-Instance Scheduling and Request Migration
Llumnix (2024) addresses the scheduling problem across multiple vLLM instances behind a load balancer. In a typical deployment, a load balancer assigns each request to an instance at arrival time. If the assignment is suboptimal (one instance becomes overloaded while another is idle), there is no way to rebalance without dropping and retrying the request.
Llumnix enables live request migration: transferring an in-progress request (including its KV cache) from one instance to another without interrupting the generation. This requires serializing the KV cache state, transferring it over the network, and deserializing it on the target instance, all while the generation continues. The overhead is proportional to the KV cache size, which can be significant for long contexts.
The scheduling algorithm considers both current load (queue depth, GPU utilization) and predicted load (estimated remaining tokens for in-flight requests) to make migration decisions. In benchmarks, Llumnix reduces P99 TTFT by 30-50% compared to static load balancing under bursty traffic.
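A toy version of that load-aware placement decision (the decode rate and weighting here are invented for illustration; Llumnix's actual policy is more sophisticated):

```python
def estimated_seconds_of_work(queue_depth, remaining_tokens,
                              decode_tok_per_s=2_000,
                              per_queued_request_s=0.5):
    """Predicted backlog: queued requests plus in-flight decode work.

    decode_tok_per_s and per_queued_request_s are illustrative constants.
    """
    return queue_depth * per_queued_request_s + remaining_tokens / decode_tok_per_s

def pick_instance(instances):
    """Route (or migrate) a request to the least-loaded instance."""
    return min(instances, key=lambda i: estimated_seconds_of_work(i[1], i[2]))

instances = [
    # (name, queue_depth, estimated remaining tokens across in-flight requests)
    ("vllm-0", 12, 40_000),
    ("vllm-1", 2, 6_000),
    ("vllm-2", 5, 25_000),
]
print(pick_instance(instances)[0])  # vllm-1
```

Using predicted remaining work, rather than queue depth alone, is what lets the scheduler avoid sending a request to an instance whose short queue hides several long-context generations.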
5. KV Cache as a Distributed Resource
The KV cache is the dominant memory consumer during LLM inference (see Chapter 9). For a 70B model with 128K context, the KV cache alone can consume over 40 GB per request. Managing the KV cache as a distributed, shareable resource rather than a per-request allocation opens significant efficiency opportunities.
5.1 CacheGen: KV Cache Compression
CacheGen (Zheng et al., 2024) compresses KV cache tensors using a learned codec that exploits the statistical properties of attention key and value vectors. The compressed representations are 3-5x smaller than the original tensors, enabling more concurrent requests per GPU and faster KV cache transfer between instances.
The compression operates on a per-layer, per-head basis, applying a lightweight quantization scheme that preserves the attention distribution with minimal quality loss. Empirically, CacheGen achieves 3.7x compression with less than 1% quality degradation on standard benchmarks.
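CacheGen's codec is learned, but the mechanics of per-head quantization can be illustrated with a plain symmetric int8 scheme. This is a stand-in for intuition, not CacheGen's actual codec:

```python
def quantize_int8(values):
    """Symmetric int8 quantization of one head's K or V vector."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # guard all-zero vectors
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

head_vector = [0.12, -0.53, 0.88, -0.07]  # made-up activations
q, scale = quantize_int8(head_vector)
recovered = dequantize(q, scale)

# Quantization error is bounded by the scale of the largest element.
# BF16 -> int8 alone halves storage; a learned codec like CacheGen's
# reaches 3-5x by also exploiting redundancy across tokens and layers.
assert all(abs(a - b) <= scale for a, b in zip(head_vector, recovered))
```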
5.2 LMCache: KV Cache Reuse Across Requests
Many LLM applications use a shared system prompt or context prefix across all requests. Without cache reuse, the KV cache for this shared prefix is recomputed for every request, wasting GPU compute. LMCache (and similar approaches like SGLang's RadixAttention) stores and reuses KV cache entries across requests that share a common prefix.
# Code Fragment 29.14.6: Conceptual KV cache reuse with prefix caching
# Most serving engines now support this natively
# In vLLM, prefix caching is enabled by default (v0.5+)
# The engine automatically detects shared prefixes and reuses KV cache
# Example: 100 requests sharing a 2000-token system prompt
# Without prefix caching: 100 * prefill(2000 tokens) = 200K prefill tokens
# With prefix caching: 1 * prefill(2000 tokens) + 100 * prefill(user_tokens)
# Savings: ~90% reduction in prefill compute for the shared prefix
# SGLang provides explicit prefix caching via RadixAttention:
import sglang as sgl
@sgl.function
def chat_with_context(s, system_prompt, user_message):
s += sgl.system(system_prompt) # KV cache shared across calls
s += sgl.user(user_message) # Only this part is recomputed
s += sgl.assistant(sgl.gen("response", max_tokens=512))
5.3 InfiniGen: Predictive KV Cache Prefetch
InfiniGen (2024) addresses the memory limitation of KV caches by storing inactive cache entries in CPU memory or disk, and predictively prefetching entries back to GPU memory before they are needed. The prediction is based on the attention patterns of previous layers: by observing which KV entries receive high attention scores in early layers, InfiniGen predicts which entries will be needed in later layers and prefetches them during the compute phase.
This tiered storage approach enables serving much longer contexts than GPU memory alone would allow:
| Tier | Capacity | Bandwidth | Latency | Cost per GB |
|---|---|---|---|---|
| GPU HBM | 80 GB (H100) | 3.35 TB/s | < 1 us | Highest |
| CPU DRAM | 512 GB - 2 TB | ~200 GB/s local; ~64 GB/s from GPU over PCIe 5.0 | ~1 us | Medium |
| NVMe SSD | 4-16 TB | ~7 GB/s | ~100 us | Low |
With InfiniGen's predictive prefetching, the effective KV cache capacity extends to CPU DRAM size (hundreds of gigabytes) while maintaining near-HBM access patterns for the active cache entries. The decode latency overhead is typically 5-15% compared to an all-HBM cache, which is acceptable for most serving scenarios.
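InfiniGen's selection step described above can be sketched simply: rank KV entries by the attention mass they received at an earlier layer and prefetch only the top entries within a memory budget. The scores below are fabricated for illustration:

```python
def kv_indices_to_prefetch(early_layer_attention, budget):
    """Top-`budget` KV positions by observed early-layer attention mass."""
    ranked = sorted(range(len(early_layer_attention)),
                    key=lambda i: early_layer_attention[i],
                    reverse=True)
    return sorted(ranked[:budget])  # return in position order for contiguous copies

# Fabricated per-position attention mass for a 6-token context
scores = [0.01, 0.40, 0.02, 0.30, 0.05, 0.22]
print(kv_indices_to_prefetch(scores, budget=3))  # [1, 3, 5]
```

Only the selected positions are copied from CPU DRAM to HBM before the later layer runs; entries predicted to receive negligible attention stay in the slower tier.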
The KV cache management landscape is converging toward treating the cache as a distributed, multi-tier resource rather than a per-request GPU allocation. This shift parallels the evolution of CPU memory management from per-process physical allocation to virtual memory with demand paging. The "virtual KV cache" paradigm (PagedAttention in vLLM was the first step) is extending to cross-instance sharing (LMCache), compression (CacheGen), and tiered storage (InfiniGen). Together, these techniques can increase the effective serving capacity of a GPU cluster by 3-10x for long-context workloads, without any changes to the model itself.
Readers often conflate throughput (tokens per second for the system) with latency (time per token experienced by a single user). A system can achieve high throughput by batching many requests, while each individual user experiences poor latency. When evaluating performance, always report both metrics at your target concurrency level. A benchmark showing 10,000 tokens/second is meaningless if per-request TPOT exceeds your SLA at production load.
✅ Key Takeaways
- MLPerf provides standardized benchmarks for comparing LLM training and inference across hardware platforms. The Server scenario is most relevant for production LLM serving, measuring throughput within a latency SLA.
- LLM inference benchmarking requires measuring TTFT, TPOT, and throughput at multiple concurrency levels to reveal the throughput-latency curve and find the optimal operating point.
- Cross-hardware portability is increasingly viable: TPU/JAX for cost-effective training at Google Cloud scale, AMD MI300X with ROCm for competitive inference, and Intel Gaudi for budget-conscious training. Each platform has kernel gaps that narrow with each release.
- Chunked prefill (Sarathi-Serve) interleaves prefill and decode phases to eliminate head-of-line blocking, reducing TPOT tail latency by 30-50% for mixed workloads.
- Llumnix enables live request migration between serving instances for dynamic load balancing, addressing the fundamental limitation of static request assignment.
- KV cache management (compression, prefix reuse, tiered storage with predictive prefetch) extends effective serving capacity by 3-10x, making the KV cache a first-class distributed resource.
Several research directions are pushing the boundaries of LLM performance and portability. Disaggregated inference (Splitwise, DistServe) separates the prefill and decode phases onto different hardware, using high-compute GPUs for prefill and high-memory GPUs for decode. Speculative decoding with hardware co-design explores using small, fast accelerators (FPGAs or NPUs) to run draft models while large GPUs verify and accept tokens. Compiler-level portability through OpenXLA and MLIR aims to provide a single compilation path that targets NVIDIA, AMD, TPU, and custom accelerators from a single source. KV cache disaggregation (Mooncake, MemServe) moves the KV cache to a dedicated memory pool accessible by any GPU in the cluster, fully decoupling compute from cache storage.
Exercises
Set up a vLLM server with a 7B model and run the benchmark_serving tool at concurrency levels 1, 4, 8, 16, and 32. Plot the throughput-latency curve (TTFT P99 vs. throughput). Identify the concurrency level at which the system saturates and explain what limits further throughput.
Answer Sketch
Launch vLLM with Llama-3.1-8B-Instruct and run benchmark_serving with request-rate values of 1, 4, 8, 16, and 32. The plot shows throughput increasing roughly linearly to ~8 RPS, then plateauing while TTFT P99 grows sharply. Saturation occurs when the GPU's compute capacity is fully utilized (the decode phase becomes the bottleneck) or when KV cache memory is exhausted. The limiting factor is usually GPU compute at short contexts and memory at long contexts.
Compare the cost-per-million-tokens for serving Llama 3.1 70B on: (a) 4x NVIDIA H100 on AWS p5, (b) 4x AMD MI300X on a cloud provider, (c) a single TPU v5p pod-8 via GCP. Use published benchmarks or estimates for throughput on each platform and current cloud pricing to calculate the cost.
Answer Sketch
Research current pricing: AWS p5.24xlarge (8x H100) ~$98/hr, estimate ~14K tokens/sec for 70B on 4 GPUs. AMD MI300X: estimate ~10.8K tokens/sec, ~$45/hr equivalent. TPU v5p pod-8: ~$40/hr, ~8K tokens/sec. Calculate tokens per dollar: H100 ~514K tokens/$, MI300X ~864K tokens/$, TPU ~720K tokens/$. MI300X wins on cost-efficiency for throughput, H100 wins on latency, TPU wins on total cost for sustained workloads. Note: actual numbers vary significantly; the exercise teaches the methodology.
For a Llama 3.1 70B model (80 layers, 8 KV heads, head_dim=128, BF16), calculate the KV cache size per token, per request at 4K context, and the maximum concurrent requests on an 8x H100 system (640 GB total HBM) assuming the model weights consume 140 GB. How does CacheGen's 3.7x compression change the concurrency limit?
Answer Sketch
KV cache per token: 80 layers * 2 (K+V) * 8 heads * 128 dim * 2 bytes (BF16) = 327,680 bytes ≈ 328 KB. Per request at 4K context: 327,680 bytes * 4,096 ≈ 1.34 GB. Available memory: 640 - 140 = 500 GB. Max concurrent: 500 / 1.34 ≈ 372 requests. With CacheGen's 3.7x compression: ≈ 0.36 GB per request, so ≈ 1,378 concurrent requests. The higher concurrency ceiling raises throughput proportionally only while serving is memory-bound; once GPU compute becomes the bottleneck, further concurrency stops helping.
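The answer-sketch arithmetic is easy to script (decimal GB throughout, so the figures may differ slightly from hand-rounded ones):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV cache footprint: K and V across all layers."""
    return layers * 2 * kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)  # 327,680 B
per_request_gb = per_token * 4096 / 1e9        # ~1.34 GB at 4K context
free_gb = 8 * 80 - 140                         # HBM remaining after weights

baseline = int(free_gb / per_request_gb)                 # uncompressed
compressed = int(free_gb / (per_request_gb / 3.7))       # with CacheGen 3.7x

print(per_token, baseline, compressed)
```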
Look up the latest MLPerf Inference results for an LLM benchmark. Compare two submissions on different hardware (e.g., NVIDIA H100 vs. H200, or NVIDIA vs. AMD). Discuss: What scenarios were tested? How do the results differ on throughput vs. latency? What can and cannot be concluded from these results for a production deployment?
Answer Sketch
Access results at mlcommons.org. Compare Server and Offline scenarios. Note that submissions use optimized software stacks that may not reflect default vLLM/TGI performance. Can conclude: relative hardware throughput, approximate performance scaling. Cannot conclude: actual production latency (depends on serving software, request distribution, batch size), cost-per-token (depends on pricing and utilization), quality differences (MLPerf does not measure model quality for same-model benchmarks).
Lab: Build an LLM Evaluation Suite
Objective
Build a complete LLM evaluation toolkit from the ground up. First, implement BLEU and ROUGE scoring by hand to understand what the metrics actually measure (the "right tool" baseline). Then, use the evaluate library for production-grade scoring. Finally, add an LLM-as-judge evaluator using a simple grading prompt.
What You'll Practice
- Implementing BLEU n-gram precision from scratch
- Implementing ROUGE-1 and ROUGE-L recall from scratch
- Using the Hugging Face evaluate library for standardized metric computation
- Building an LLM-as-judge evaluator with structured scoring prompts
Setup
Install the evaluation libraries.
pip install evaluate rouge-score nltk transformers torch
evaluate library and its metric backends.Steps
Step 1: Implement BLEU from scratch
BLEU measures n-gram precision: what fraction of the candidate's n-grams appear in the reference. Implementing it by hand reveals how the metric penalizes omissions and rewards lexical overlap.
from collections import Counter
import math
def compute_ngrams(tokens, n):
"""Extract n-grams from a token list."""
return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
def bleu_score(reference, candidate, max_n=4):
"""Compute BLEU score with brevity penalty (from scratch)."""
ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()
# Brevity penalty
bp = min(1.0, math.exp(1 - len(ref_tokens) / max(len(cand_tokens), 1)))
# N-gram precisions
precisions = []
for n in range(1, max_n + 1):
ref_ngrams = Counter(compute_ngrams(ref_tokens, n))
cand_ngrams = Counter(compute_ngrams(cand_tokens, n))
# Clipped counts: each n-gram counts at most as many times as in ref
clipped = sum(min(count, ref_ngrams[ng])
for ng, count in cand_ngrams.items())
total = max(sum(cand_ngrams.values()), 1)
precisions.append(clipped / total)
# Geometric mean of precisions (skip zeros to avoid log(0))
positive = [p for p in precisions if p > 0]
if not positive:
return 0.0
log_avg = sum(math.log(p) for p in positive) / len(positive)
return bp * math.exp(log_avg)
# Test data
reference = "The cat sat on the mat and looked out the window"
candidate = "The cat was sitting on the mat"
score = bleu_score(reference, candidate)
print(f"Manual BLEU: {score:.4f}")
Step 2: Implement ROUGE from scratch
ROUGE measures recall: what fraction of the reference's content appears in the candidate. Implement ROUGE-1 (unigram recall) and ROUGE-L (longest common subsequence).
def rouge_1(reference, candidate):
"""Compute ROUGE-1 (unigram) F1 score."""
ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()
ref_counts = Counter(ref_tokens)
cand_counts = Counter(cand_tokens)
overlap = sum(min(ref_counts[t], cand_counts[t])
for t in ref_counts)
precision = overlap / max(len(cand_tokens), 1)
recall = overlap / max(len(ref_tokens), 1)
f1 = 2 * precision * recall / max(precision + recall, 1e-8)
return {"precision": precision, "recall": recall, "f1": f1}
def lcs_length(x, y):
"""Compute length of longest common subsequence."""
m, n = len(x), len(y)
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if x[i-1] == y[j-1]:
dp[i][j] = dp[i-1][j-1] + 1
else:
dp[i][j] = max(dp[i-1][j], dp[i][j-1])
return dp[m][n]
def rouge_l(reference, candidate):
    """Compute ROUGE-L F1 score using longest common subsequence."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    precision = lcs / max(len(cand_tokens), 1)
    recall = lcs / max(len(ref_tokens), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return {"precision": precision, "recall": recall, "f1": f1}
# Test
r1 = rouge_1(reference, candidate)
rl = rouge_l(reference, candidate)
print(f"Manual ROUGE-1 F1: {r1['f1']:.4f}")
print(f"Manual ROUGE-L F1: {rl['f1']:.4f}")
Step 3: Verify with the evaluate library
Now compare your manual implementations against the production evaluate library to confirm correctness and see what additional features the library provides.
import evaluate
# Load metrics from the evaluate library
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")
# Compute library BLEU
lib_bleu = bleu_metric.compute(
    predictions=[candidate],
    references=[[reference]],
)
# Compute library ROUGE
lib_rouge = rouge_metric.compute(
    predictions=[candidate],
    references=[reference],
)
print("Library BLEU:", f"{lib_bleu['bleu']:.4f}")
print("Library ROUGE-1:", f"{lib_rouge['rouge1']:.4f}")
print("Library ROUGE-L:", f"{lib_rouge['rougeL']:.4f}")
print()
print("Manual BLEU:", f"{score:.4f}")
print("Manual ROUGE-1 F1:", f"{r1['f1']:.4f}")
print("Manual ROUGE-L F1:", f"{rl['f1']:.4f}")
print()
print("(Small differences are expected due to tokenization and smoothing.)")
Step 4: Build an LLM-as-judge evaluator
Automated metrics miss semantic equivalence. An LLM-as-judge uses a language model to grade outputs on criteria like relevance, fluency, and factual accuracy. Build a simple judge using a local model.
from transformers import pipeline
judge = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype="auto",
    device_map="auto",
)
def llm_judge(question, reference_answer, candidate_answer):
    """Score a candidate answer on a 1-5 scale using an LLM judge."""
    prompt = f"""You are an expert evaluator. Score the candidate answer \
on a scale of 1 to 5 for relevance and accuracy compared to the reference.
Question: {question}
Reference answer: {reference_answer}
Candidate answer: {candidate_answer}
Respond with ONLY a JSON object: {{"score": N, "reason": "brief explanation"}}
"""
    messages = [{"role": "user", "content": prompt}]
    result = judge(messages, max_new_tokens=100, do_sample=False)
    return result[0]["generated_text"][-1]["content"]
# Example evaluation
question = "What is the cat doing?"
ref_answer = "The cat sat on the mat and looked out the window."
cand_answer = "The cat was sitting on the mat."
judgment = llm_judge(question, ref_answer, cand_answer)
print("LLM Judge verdict:")
print(f" {judgment}")
Hint
If you lack GPU memory for a 7B model, substitute TinyLlama/TinyLlama-1.1B-Chat-v1.0 or use an API-based model. The key concept is the structured judging prompt, not the specific model.
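The judge is instructed to return JSON, but model output is often wrapped in extra text or malformed. A defensive parsing sketch (parse_judge_score is a hypothetical helper, not part of the lab's code above):

```python
import json
import re

def parse_judge_score(raw_output, default=None):
    """Extract the numeric score from a judge response that should be JSON.

    Falls back to a regex scan when the model wraps the JSON in extra text;
    returns `default` if no score can be recovered.
    """
    # First try: the whole response is valid JSON
    try:
        return int(json.loads(raw_output)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        pass
    # Second try: grab the first {...} block embedded in the text
    match = re.search(r'\{.*?\}', raw_output, re.DOTALL)
    if match:
        try:
            return int(json.loads(match.group(0))["score"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            pass
    # Last resort: look for a bare "score": N pattern
    match = re.search(r'"score"\s*:\s*([1-5])', raw_output)
    return int(match.group(1)) if match else default

print(parse_judge_score('{"score": 4, "reason": "mostly accurate"}'))  # 4
print(parse_judge_score('Sure! {"score": 3, "reason": "ok"}'))         # 3
```

Returning a sentinel instead of raising lets an evaluation loop skip or retry unparseable judgments rather than crash mid-run.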
Extensions
- Implement BERTScore using contextual embeddings to measure semantic similarity beyond lexical overlap.
- Create a multi-judge panel (3 LLM judges with different temperatures) and compute inter-judge agreement using Cohen's kappa.
- Build an evaluation harness that runs all metrics (BLEU, ROUGE, BERTScore, LLM-judge) on a dataset of 50 question-answer pairs and produces a comparative report.
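As a starting point for the multi-judge extension, inter-judge agreement can be sketched from the definition of Cohen's kappa: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement. The helper cohens_kappa below is written for this sketch; in practice sklearn.metrics.cohen_kappa_score computes the same statistic.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa between two judges' categorical ratings."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of items where the judges match
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each judge's marginal label distribution
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_e = sum(counts_a[label] * counts_b[label]
              for label in counts_a) / (n * n)
    if p_e == 1.0:  # degenerate case: both judges always give one label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two judges scoring the same six answers on the 1-5 scale
judge_1 = [5, 4, 3, 5, 2, 4]
judge_2 = [5, 4, 4, 5, 2, 3]
print(f"kappa = {cohens_kappa(judge_1, judge_2):.3f}")
```

Kappa of 1.0 means perfect agreement, 0 means agreement no better than chance; for a three-judge panel, compute it for each pair and report the mean.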
MLCommons (2024). MLPerf Training and Inference Benchmarks.
The industry-standard benchmarking suite for ML hardware and software performance. Includes LLM-specific benchmarks with standardized rules, scenarios, and audited results across hardware platforms.
Introduces chunked prefill scheduling for LLM serving, which interleaves prefill and decode phases to eliminate head-of-line blocking. Demonstrates 30-50% reduction in decode tail latency for mixed workloads.
Proposes cross-instance request migration for LLM serving, enabling dynamic load balancing by transferring in-progress requests (including KV cache state) between serving instances.
Introduces learned compression for KV cache tensors, achieving 3.7x compression with minimal quality degradation. Enables higher concurrency and faster cache transfer between serving instances.
Proposes predictive KV cache prefetching between GPU, CPU, and disk storage tiers. Uses attention pattern prediction from early layers to prefetch entries for later layers during compute.
Google (2024). MaxText: A Simple, Performant, and Scalable Jax LLM.
Google's reference implementation for LLM training on TPUs using JAX. Demonstrates best practices for distributed training, FSDP, and XLA compiler optimization on TPU v5p hardware.
AMD (2024). ROCm Documentation: Open Software Platform for GPU Computing.
Documentation for AMD's ROCm software stack, covering HIP (CUDA compatibility layer), MIOpen (cuDNN equivalent), RCCL (NCCL equivalent), and integration with PyTorch and vLLM.
What Comes Next
In this section we covered MLPerf training and inference suites, LLM inference benchmarking, and related topics. This concludes the current chapter. Return to the chapter overview to review the material or explore related chapters.
