Part VIII: Evaluation & Production
Chapter 29: Evaluation & Experiment Design

LLM Performance Benchmarking and Cross-Hardware Portability

"You can benchmark accuracy all day, but if your model takes 10 seconds to produce the first token, users will leave before seeing how smart it is."

Eval Eval, Latency-Conscious AI Agent
Big Picture

Previous sections in this chapter covered what to measure for model quality. This section covers how to measure system performance: the speed, throughput, and hardware efficiency of LLM training and inference. Performance benchmarking is where model evaluation meets systems engineering. MLPerf provides standardized benchmarks for comparing hardware and software stacks. Inference benchmarking tools measure the metrics that directly affect user experience: Time to First Token (TTFT), Time Per Output Token (TPOT), and throughput under concurrent load. Cross-hardware portability determines whether your training or serving workload can run on TPUs, AMD GPUs, or Intel accelerators, each with its own software stack and trade-offs. Advanced inference scheduling (chunked prefill, cross-instance migration) and KV cache management (compression, reuse, tiered storage) push the boundaries of serving efficiency. Together, these topics complete the evaluation picture: not just "how good is the model?" but "how fast and cost-effective is the system?"

Prerequisites

This section builds on the evaluation fundamentals from Section 29.1 and the evaluation harness ecosystems from Section 29.9. Understanding of inference optimization techniques from Chapter 9 and the production serving architectures from Section 31.1 provides essential context. Familiarity with GPU hardware concepts from Appendix G is helpful but not required.

1. MLPerf Training and Inference Suites

MLPerf, managed by MLCommons, is the industry-standard benchmark suite for comparing ML hardware and software performance. It provides reproducible, audited results that enable fair comparisons across vendors and configurations. For LLM workloads, MLPerf includes both training and inference benchmarks with specific rules and scenarios.

1.1 MLPerf Training for LLMs

The MLPerf Training benchmark measures the time to train a model to a specified quality target. For LLMs, the benchmark uses GPT-3 175B (or equivalent) trained on a fixed dataset to a target validation loss. Submissions are timed end-to-end, the quality target and dataset are fixed by the rules, and results are peer-reviewed before publication, which is what makes scores comparable across vendors.

1.2 MLPerf Inference for LLMs

MLPerf Inference defines four scenarios that capture different deployment patterns:

Table 29.14.1: MLPerf Inference scenarios and their LLM relevance
| Scenario | Description | LLM Relevance | Key Metric |
|---|---|---|---|
| Single Stream | One request at a time, measure latency | Edge/mobile LLM inference | 90th percentile latency |
| Multi-Stream | Multiple concurrent queries | Batch processing pipelines | Latency per stream |
| Server | Variable arrival rate, meet latency SLA | Production API serving | Max QPS within latency SLA |
| Offline | Process all samples as fast as possible | Batch inference, evaluation runs | Samples per second |

For production LLM serving, the Server scenario is most relevant. It measures the maximum queries per second (QPS) a system can handle while keeping the 99th percentile latency below a specified threshold. This directly maps to the production serving requirements discussed in Section 31.3.

Real-World Scenario: Interpreting MLPerf Inference results for deployment decisions

Who: An infrastructure architect at a mid-size SaaS company planning GPU procurement for a new Llama 2 70B serving cluster.

Situation: The company needed to serve 500 concurrent users with sub-2-second time-to-first-token. The budget allowed for either 8 NVIDIA H100 GPUs at $30,000 each or 16 AMD MI300X GPUs at $15,000 each, both totaling $240,000.

Problem: Vendor benchmarks were not directly comparable. The team needed a standardized source to compare throughput, latency, and cost efficiency for their specific workload profile.

Decision: They consulted MLPerf Inference v4.1 (2024) Server scenario results. The H100 achieved 14,200 tokens/sec on Llama 2 70B; the MI300X achieved approximately 10,800 tokens/sec. On a per-GPU basis, the MI300X delivered about 76% of the H100's throughput at 50% of the price. However, the H100 showed lower P99 TTFT, critical for the team's latency-sensitive chat application.

Result: The team chose 8 H100s for the latency-sensitive chat tier and earmarked MI300X for a planned batch processing tier where throughput per dollar mattered more than tail latency. The standardized MLPerf data let them justify the split strategy to finance with concrete numbers.

Lesson: Normalize benchmark results by cost for your specific workload, and decide up front whether latency or throughput is the binding constraint. That decision, not the raw throughput number, determines which hardware wins.

Tip

When comparing MLPerf results across vendors, always normalize by cost. A system that achieves 2x the throughput at 3x the price is not a better deal. Calculate the "dollars per million tokens" metric for your target scenario (Server or Offline), and factor in the total cost of ownership including power consumption, cooling, and networking infrastructure.
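The dollars-per-million-tokens arithmetic is simple enough to script. The sketch below is illustrative: the function name is our own, and the throughput and hourly-cost figures are hypothetical placeholders, not published pricing.

```python
def dollars_per_million_tokens(tokens_per_sec: float, cost_per_hour: float) -> float:
    """Cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / (tokens_per_hour / 1_000_000)

# Hypothetical per-system numbers for illustration; substitute your own
# benchmark throughput and negotiated hardware or cloud pricing.
systems = {
    "System A": {"tokens_per_sec": 14_200, "cost_per_hour": 98.0},
    "System B": {"tokens_per_sec": 10_800, "cost_per_hour": 45.0},
}

for name, s in systems.items():
    cost = dollars_per_million_tokens(s["tokens_per_sec"], s["cost_per_hour"])
    print(f"{name}: ${cost:.3f} per million tokens")
```

Run this for both the Server and Offline scenarios of each candidate system; the cheaper system per million tokens often changes once the latency constraint is applied.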

Key Insight

MLPerf results should be interpreted with care. Submissions are heavily optimized by hardware vendors for the specific benchmark workloads, and real-world performance often differs significantly. A system's MLPerf score reflects peak performance under ideal conditions. Production workloads introduce variable request sizes, mixed model versions, and shared infrastructure that reduce effective throughput. Use MLPerf as a directional comparison tool, not as a production capacity planning number. Always validate with your own workload.

2. Benchmarking LLM Inference

While MLPerf provides standardized cross-vendor comparisons, production teams also need to benchmark their specific model, hardware, and serving configuration. This section covers the key metrics, tools, and methodologies for LLM inference benchmarking.

2.1 Key Inference Metrics

LLM inference has a two-phase structure (prefill and decode) that produces distinct metrics. Time to First Token (TTFT) measures prefill latency: how long the user waits before any output appears. Time Per Output Token (TPOT) measures decode speed: the pace at which generated text streams. End-to-end latency combines the two, and system throughput (total tokens per second across all concurrent requests) measures hardware efficiency. Always report percentiles (P50, P99) alongside means, because tail latency dominates perceived responsiveness.
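Given per-request timestamps from a streaming client, these metrics fall out directly. A minimal sketch (the helper name is our own, not a library API):

```python
def request_metrics(send_time: float, token_times: list[float]) -> dict:
    """TTFT and TPOT for one request, from wall-clock timestamps in seconds."""
    ttft = token_times[0] - send_time
    # TPOT: average gap between consecutive streamed tokens during decode
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return {"ttft": ttft, "tpot": tpot, "e2e": token_times[-1] - send_time}

# Example: request sent at t=0.0, five tokens streamed back
m = request_metrics(0.0, [0.08, 0.10, 0.12, 0.14, 0.16])
print(m)  # ttft = 0.08 s, tpot = 0.02 s, e2e = 0.16 s
```

Aggregate these per-request dictionaries across a load test, then report P50 and P99 for each field.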

2.2 Benchmarking with Standard Tools

Several tools provide structured LLM inference benchmarking. The choice depends on whether you need quick local benchmarks or rigorous production-representative tests.

# Code Fragment 29.14.1: Benchmarking vLLM with its built-in benchmark tool
# This tool sends concurrent requests and measures TTFT, TPOT, throughput

python -m vllm.entrypoints.openai.api_server \
 --model meta-llama/Llama-3.1-8B-Instruct \
 --tensor-parallel-size 1 &

# Run the benchmark with a realistic request distribution
# (benchmark_serving.py ships in the vLLM source tree under benchmarks/)
python benchmarks/benchmark_serving.py \
 --backend openai \
 --model meta-llama/Llama-3.1-8B-Instruct \
 --base-url http://localhost:8000 \
 --dataset-name sharegpt \
 --dataset-path ShareGPT_V3_unfiltered.json \
 --num-prompts 1000 \
 --request-rate 10 \
 --seed 42

# Output includes:
# - Successful requests: 1000/1000
# - Throughput: 3,245 tokens/sec
# - Mean TTFT: 87ms, P50: 72ms, P99: 312ms
# - Mean TPOT: 18ms, P50: 15ms, P99: 45ms
# - Mean E2E latency: 1.23s, P50: 0.98s, P99: 3.45s
Code Fragment 29.14.1: Benchmarking vLLM with its built-in benchmark tool

AIPerf is a more recent benchmarking harness designed for modern LLM inference workloads. It supports configurable request distributions, mixed model serving, and real-world traffic patterns (bursty arrivals, variable prompt/completion lengths):

# Code Fragment 29.14.2: Using AIPerf for comprehensive inference benchmarking
# AIPerf supports multiple backends, realistic traffic patterns,
# and detailed latency breakdowns

aiperf benchmark \
 --backend vllm \
 --endpoint http://localhost:8000/v1/chat/completions \
 --model meta-llama/Llama-3.1-70B-Instruct \
 --concurrency-levels 1,4,8,16,32,64 \
 --prompt-length-distribution "normal:2048:512" \
 --completion-length-distribution "normal:256:128" \
 --duration 300 \
 --warmup 30 \
 --output results.json

# Generates a concurrency vs. throughput/latency report:
# Concurrency |  QPS | TTFT P50 | TTFT P99 | TPOT P50 | Throughput
# ------------|------|----------|----------|----------|-----------
#           1 |  0.8 |     45ms |     52ms |     12ms |    245 t/s
#           4 |  3.1 |     58ms |     95ms |     14ms |    920 t/s
#           8 |  5.8 |     82ms |    180ms |     16ms |  1,680 t/s
#          16 |  9.2 |    145ms |    420ms |     22ms |  2,580 t/s
#          32 | 12.1 |    310ms |    980ms |     35ms |  3,100 t/s
#          64 | 13.5 |    680ms |  2,100ms |     52ms |  3,250 t/s
Code Fragment 29.14.2: AIPerf benchmarking across six concurrency levels (1 through 64). The --prompt-length-distribution and --completion-length-distribution flags generate normally distributed token counts that match real-world traffic patterns. The tabular output reveals the throughput-latency curve: throughput plateaus at 32 concurrent requests while P99 TTFT crosses 1 second at 64.
Mental Model: The Throughput-Latency Curve

Every LLM serving system has a characteristic throughput-latency curve. At low concurrency, latency is minimal but throughput is wasted (GPUs are idle between requests). As concurrency increases, throughput grows but latency also rises. At some point, the system saturates: throughput plateaus while latency grows rapidly. The optimal operating point is just before saturation, where throughput is near-maximum and latency is still within SLA. Benchmarking at multiple concurrency levels reveals this curve and tells you the maximum QPS your deployment can handle while meeting your latency SLA.
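The operating point can be read off programmatically from a concurrency sweep. A sketch using the illustrative AIPerf-style numbers above (the function name and data layout are our own):

```python
# Illustrative sweep results; latencies in ms, throughput in tokens/sec
sweep = [
    {"concurrency": 1,  "ttft_p99": 52,   "throughput": 245},
    {"concurrency": 4,  "ttft_p99": 95,   "throughput": 920},
    {"concurrency": 8,  "ttft_p99": 180,  "throughput": 1680},
    {"concurrency": 16, "ttft_p99": 420,  "throughput": 2580},
    {"concurrency": 32, "ttft_p99": 980,  "throughput": 3100},
    {"concurrency": 64, "ttft_p99": 2100, "throughput": 3250},
]

def best_operating_point(sweep, ttft_slo_ms):
    """Highest-throughput concurrency level whose P99 TTFT meets the SLO."""
    feasible = [p for p in sweep if p["ttft_p99"] <= ttft_slo_ms]
    return max(feasible, key=lambda p: p["throughput"]) if feasible else None

point = best_operating_point(sweep, ttft_slo_ms=500)
print(point)  # concurrency 16: 2,580 tokens/sec within a 500 ms P99 TTFT SLO
```

Tightening the SLO to 100 ms pushes the operating point down to concurrency 4; if no level meets the SLO, the deployment needs more replicas or faster hardware, not more batching.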

3. Cross-Hardware Portability

NVIDIA GPUs dominate LLM training and inference, but alternative hardware platforms offer competitive performance at different price points. Understanding the software stacks, trade-offs, and kernel gaps of each platform is essential for teams evaluating hardware options or building portable training and serving pipelines.

3.1 Google TPUs with JAX and MaxText

Google's Tensor Processing Units (TPUs) are custom accelerators designed for matrix-heavy ML workloads. TPU v5p offers 459 TFLOPS of BF16 compute and 95 GB of HBM2e per chip, connected via a high-bandwidth inter-chip interconnect (ICI) that enables efficient multi-host parallelism without InfiniBand.

MaxText is Google's reference implementation for LLM training on TPUs. It demonstrates best practices for JAX-based distributed training, including FSDP, tensor parallelism, and XLA compiler optimization:

# Code Fragment 29.14.3: MaxText configuration for a Llama-style model on TPU v5p
# MaxText uses a YAML config + Python to define the model and training loop

# maxtext_config.yaml
"""
model_name: "llama-3-8b"
base_output_directory: "gs://my-bucket/maxtext-runs/"
dataset_path: "gs://my-bucket/tokenized-data/"

# Model architecture (Llama 3 8B equivalent)
num_layers: 32
num_heads: 32
num_kv_heads: 8 # GQA
head_dim: 128
mlp_dim: 14336
vocab_size: 128256
max_target_length: 8192

# Training configuration
per_device_batch_size: 4
learning_rate: 3e-4
adam_b1: 0.9
adam_b2: 0.95
weight_decay: 0.1

# Parallelism (for a v5p-256 pod: 256 chips)
ici_fsdp_parallelism: 64 # FSDP across 64 chips on ICI
ici_tensor_parallelism: 4 # TP across 4 chips
dcn_data_parallelism: 1 # No cross-pod DP

# Precision
dtype: "bfloat16"
quantization: "" # Options: "int8", "fp8"
"""

# Launch training on TPU v5p-256
# python3 MaxText/train.py MaxText/configs/base.yml \
# run_name=llama-8b-tpu \
# load_parameters_path="" \
# steps=100000
Code Fragment 29.14.3: MaxText configuration for a Llama-style model on TPU v5p

Trade-offs of TPU vs. NVIDIA GPU: TPUs deliver strong price-performance and scale across a pod over ICI without a separate InfiniBand fabric, but they require the JAX/XLA stack (PyTorch/XLA exists but lags native CUDA PyTorch), custom-kernel tooling (Pallas) is less mature than CUDA and Triton, and TPUs are available only through Google Cloud.

3.2 AMD GPUs with ROCm

AMD's Instinct MI300X (192 GB HBM3, 1,307 TFLOPS BF16) competes directly with the NVIDIA H100. The ROCm (Radeon Open Compute) software stack provides a CUDA-compatible interface, and many CUDA applications can be ported with the hipify tool.

vLLM on ROCm provides production-ready LLM serving on AMD GPUs. As of 2025, vLLM supports MI250X, MI300X, and MI300A on ROCm 6.x:

# Code Fragment 29.14.4: Running vLLM on AMD MI300X with ROCm
# Pull the ROCm-specific vLLM image
docker pull vllm/vllm-rocm:v0.6.4

# Launch vLLM serving on MI300X
docker run --device=/dev/kfd --device=/dev/dri \
 --group-add video \
 --shm-size 32g \
 -p 8000:8000 \
 vllm/vllm-rocm:v0.6.4 \
 --model meta-llama/Llama-3.1-70B-Instruct \
 --tensor-parallel-size 4 \
 --max-model-len 8192 \
 --gpu-memory-utilization 0.90 \
 --dtype bfloat16

# Benchmark on MI300X (benchmark_serving.py from the vLLM source tree)
python benchmarks/benchmark_serving.py \
 --backend openai \
 --model meta-llama/Llama-3.1-70B-Instruct \
 --base-url http://localhost:8000 \
 --dataset-name sharegpt \
 --num-prompts 500 \
 --request-rate 8
Code Fragment 29.14.4: Launching vLLM on an AMD MI300X using the ROCm-specific Docker image. The --device=/dev/kfd and --device=/dev/dri flags expose the AMD GPU kernel driver and render nodes to the container. The same benchmark_serving tool works identically on ROCm, enabling direct cross-hardware performance comparison.

Kernel gaps on ROCm: While most PyTorch operations work on ROCm via HIP translation, some specialized kernels have reduced performance or are missing entirely:

Table 29.14.2: CUDA vs. ROCm kernel support for LLM workloads

| Kernel/Feature | CUDA/NVIDIA | ROCm/AMD | Performance Gap |
|---|---|---|---|
| FlashAttention-2 | Native, optimized | CK (Composable Kernel) port | ~15% slower |
| FP8 tensor cores | Transformer Engine | Partial support (MI300X) | ~30% slower |
| Triton kernels | Native | ROCm Triton backend | Varies, 0-25% |
| NCCL collectives | NCCL | RCCL (ROCm fork) | ~10% slower |
| PagedAttention (vLLM) | Optimized | Supported | ~10% slower |
| AWQ/GPTQ quantization | Native | Supported | ~5-15% slower |

3.3 Intel Gaudi with SynapseAI

Intel Gaudi (formerly Habana Labs) accelerators target cost-effective LLM training and inference. The Gaudi 3 chip offers 1,835 TFLOPS BF16 and 128 GB HBM2e. The SynapseAI software stack provides a PyTorch-compatible interface through the Habana PyTorch Bridge.

Trade-offs: Gaudi offers competitive price-performance for training workloads (available as AWS DL1/DL2q instances), but the ecosystem is much smaller than NVIDIA's. Custom kernels are written in TPC (Tensor Processing Core) ISA, which has a steep learning curve. Model support in the optimum-habana library covers major architectures (Llama, Mistral, Falcon) but lags behind NVIDIA's ecosystem for newer model families.

Fun Fact

The cross-hardware portability landscape changes remarkably fast. In 2023, running vLLM on AMD GPUs required significant manual patching. By mid-2024, vLLM's ROCm support was production-ready with CI/CD pipelines running on MI300X. By 2025, AMD GPUs were being used in production LLM serving at several hyperscalers. The pace of improvement in alternative hardware ecosystems means that benchmarks and portability assessments have a shelf life of about 6 months.

4. Advanced Inference Scheduling

Standard LLM serving processes requests sequentially through the prefill and decode phases. Advanced scheduling techniques optimize how requests share GPU resources, reducing latency and improving throughput.

4.1 Sarathi-Serve: Chunked Prefill

The core insight of Sarathi-Serve (Agrawal et al., 2024) is that long prefill requests (processing thousands of input tokens) block the GPU for extended periods, creating "head-of-line blocking" that starves decode requests of GPU time. Chunked prefill breaks the prefill phase into smaller chunks that are interleaved with decode iterations from other requests.

Without chunked prefill, a request with 8,000 input tokens monopolizes the GPU for the entire prefill duration (potentially hundreds of milliseconds). Concurrent decode requests must wait, causing their TPOT to spike. With chunked prefill, the 8,000-token prefill is split into 8 chunks of 1,000 tokens each, and each chunk is interleaved with a decode iteration for waiting requests. This bounds the maximum interruption to the decode phase and dramatically reduces TPOT tail latency.
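The bound on decode interruption is back-of-envelope arithmetic. The sketch below assumes a hypothetical prefill rate of 20 tokens/ms; the function is illustrative, not part of any serving engine.

```python
def max_decode_stall_ms(prefill_tokens, chunk_tokens, prefill_tokens_per_ms):
    """Worst-case time a decode iteration waits behind prefill work.

    Without chunking, a decode step can wait for the entire prefill;
    with chunking, it waits for at most one chunk.
    """
    full = prefill_tokens / prefill_tokens_per_ms
    chunked = min(prefill_tokens, chunk_tokens) / prefill_tokens_per_ms
    return full, chunked

# Hypothetical: 8,000-token prompt, 1,000-token chunks, 20 prefill tokens/ms
full, chunked = max_decode_stall_ms(8000, 1000, 20)
print(f"unchunked stall = {full:.0f} ms, chunked stall = {chunked:.0f} ms")
# unchunked stall = 400 ms, chunked stall = 50 ms
```

The trade-off is a small throughput cost: each chunk re-reads the attention keys and values of earlier chunks, so smaller chunks mean lower decode tail latency but slightly more total prefill work.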

vLLM (v0.5+) implements chunked prefill with the --enable-chunked-prefill flag. The chunk size is configurable via --max-num-batched-tokens:

# Code Fragment 29.14.5: Enable chunked prefill in vLLM for reduced decode latency
vllm serve meta-llama/Llama-3.1-70B-Instruct \
 --tensor-parallel-size 4 \
 --enable-chunked-prefill \
 --max-num-batched-tokens 2048 # Chunk size for prefill
Code Fragment 29.14.5: Enabling chunked prefill in vLLM with a 2,048-token chunk size. Long prefill requests are split into chunks of this size, interleaved with decode iterations for concurrent requests. This prevents head-of-line blocking where a single long-context prompt monopolizes the GPU and spikes TPOT for all other active generations.

4.2 Llumnix: Cross-Instance Scheduling and Request Migration

Llumnix (2024) addresses the scheduling problem across multiple vLLM instances behind a load balancer. In a typical deployment, a load balancer assigns each request to an instance at arrival time. If the assignment is suboptimal (one instance becomes overloaded while another is idle), there is no way to rebalance without dropping and retrying the request.

Llumnix enables live request migration: transferring an in-progress request (including its KV cache) from one instance to another without interrupting the generation. This requires serializing the KV cache state, transferring it over the network, and deserializing it on the target instance, all while the generation continues. The overhead is proportional to the KV cache size, which can be significant for long contexts.

The scheduling algorithm considers both current load (queue depth, GPU utilization) and predicted load (estimated remaining tokens for in-flight requests) to make migration decisions. In benchmarks, Llumnix reduces P99 TTFT by 30-50% compared to static load balancing under bursty traffic.

5. KV Cache as a Distributed Resource

The KV cache is the dominant memory consumer during LLM inference (see Chapter 9). For a 70B model with 128K context, the KV cache alone can consume over 40 GB per request. Managing the KV cache as a distributed, shareable resource rather than a per-request allocation opens significant efficiency opportunities.
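The 40 GB figure follows from the KV cache formula: one key and one value vector per layer, per KV head, per token. A quick calculator, assuming Llama-3-70B-style geometry (80 layers, 8 KV heads via GQA, head_dim 128, BF16):

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache footprint per token: a K and a V vector at every layer."""
    return num_layers * 2 * num_kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(80, 8, 128)          # BF16 = 2 bytes/elem
per_request_128k = per_token * 128_000                    # full 128K context
print(f"{per_token / 1024:.0f} KB/token, "
      f"{per_request_128k / 1e9:.1f} GB for a 128K-token context")
# 320 KB/token, 41.9 GB for a 128K-token context
```

Note how strongly GQA helps: with 64 full KV heads instead of 8, the same context would need eight times the memory.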

5.1 CacheGen: KV Cache Compression

CacheGen (Liu et al., 2024) compresses KV cache tensors using a learned codec that exploits the statistical properties of attention key and value vectors. The compressed representations are 3-5x smaller than the original tensors, enabling more concurrent requests per GPU and faster KV cache transfer between instances.

The compression operates on a per-layer, per-head basis, applying a lightweight quantization scheme that preserves the attention distribution with minimal quality loss. Empirically, CacheGen achieves 3.7x compression with less than 1% quality degradation on standard benchmarks.
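CacheGen's learned codec is beyond a short example, but the simpler idea it builds on, per-head quantization with a shared scale, fits in a few lines. This illustrative 4-bit scheme is not the CacheGen algorithm; it only shows where a roughly 4x size reduction relative to BF16 comes from.

```python
import numpy as np

def quantize_per_head(kv: np.ndarray, bits: int = 4):
    """Symmetric per-head quantization of a [heads, tokens, dim] KV tensor."""
    qmax = 2 ** (bits - 1) - 1
    # One scale per head, chosen so the largest value maps to qmax
    scales = np.abs(kv).max(axis=(1, 2), keepdims=True) / qmax
    q = np.clip(np.round(kv / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 4096, 128)).astype(np.float32)
q, scales = quantize_per_head(kv, bits=4)
err = np.abs(q * scales - kv).mean()

# BF16 baseline: 2 bytes/element; 4-bit packs two elements per byte
ratio = (kv.size * 2) / (kv.size * 0.5 + scales.size * 4)
print(f"compression = {ratio:.1f}x, mean abs error = {err:.3f}")
```

A learned codec like CacheGen improves on this by exploiting inter-token correlation in the K/V streams, which is why it reaches higher ratios at lower quality loss than naive quantization.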

5.2 LMCache: KV Cache Reuse Across Requests

Many LLM applications use a shared system prompt or context prefix across all requests. Without cache reuse, the KV cache for this shared prefix is recomputed for every request, wasting GPU compute. LMCache (and similar approaches like SGLang's RadixAttention) stores and reuses KV cache entries across requests that share a common prefix.

# Code Fragment 29.14.6: Conceptual KV cache reuse with prefix caching
# Most serving engines now support this natively

# In vLLM, prefix caching is enabled by default (v0.5+)
# The engine automatically detects shared prefixes and reuses KV cache

# Example: 100 requests sharing a 2000-token system prompt
# Without prefix caching: 100 * prefill(2000 tokens) = 200K prefill tokens
# With prefix caching: 1 * prefill(2000 tokens) + 100 * prefill(user_tokens)
# Savings: ~90% reduction in prefill compute for the shared prefix

# SGLang provides explicit prefix caching via RadixAttention:
import sglang as sgl

@sgl.function
def chat_with_context(s, system_prompt, user_message):
 s += sgl.system(system_prompt) # KV cache shared across calls
 s += sgl.user(user_message) # Only this part is recomputed
 s += sgl.assistant(sgl.gen("response", max_tokens=512))
Code Fragment 29.14.6: Conceptual KV cache reuse with prefix caching

5.3 InfiniGen: Predictive KV Cache Prefetch

InfiniGen (2024) addresses the memory limitation of KV caches by storing inactive cache entries in CPU memory or disk, and predictively prefetching entries back to GPU memory before they are needed. The prediction is based on the attention patterns of previous layers: by observing which KV entries receive high attention scores in early layers, InfiniGen predicts which entries will be needed in later layers and prefetches them during the compute phase.

This tiered storage approach enables serving much longer contexts than GPU memory alone would allow:

Table 29.14.3: KV cache storage tiers and their characteristics
| Tier | Capacity | Bandwidth | Access Latency (from GPU) | Cost per GB |
|---|---|---|---|---|
| GPU HBM | 80 GB (H100) | 3.35 TB/s | ~hundreds of ns | Highest |
| CPU DRAM | 512 GB - 2 TB | ~200 GB/s local, PCIe-bound from GPU | ~1-10 us | Medium |
| NVMe SSD | 4-16 TB | ~7 GB/s | ~100 us | Low |

With InfiniGen's predictive prefetching, the effective KV cache capacity extends to CPU DRAM size (hundreds of gigabytes) while maintaining near-HBM access patterns for the active cache entries. The decode latency overhead is typically 5-15% compared to an all-HBM cache, which is acceptable for most serving scenarios.

Key Insight

The KV cache management landscape is converging toward treating the cache as a distributed, multi-tier resource rather than a per-request GPU allocation. This shift parallels the evolution of CPU memory management from per-process physical allocation to virtual memory with demand paging. The "virtual KV cache" paradigm (PagedAttention in vLLM was the first step) is extending to cross-instance sharing (LMCache), compression (CacheGen), and tiered storage (InfiniGen). Together, these techniques can increase the effective serving capacity of a GPU cluster by 3-10x for long-context workloads, without any changes to the model itself.

Common Misconception

Readers often conflate throughput (tokens per second for the system) with latency (time per token experienced by a single user). A system can achieve high throughput by batching many requests, while each individual user experiences poor latency. When evaluating performance, always report both metrics at your target concurrency level. A benchmark showing 10,000 tokens/second is meaningless if per-request TPOT exceeds your SLA at production load.

✅ Key Takeaways

  • MLPerf provides standardized, audited comparisons across hardware and software stacks; normalize results by cost and validate with your own workload before capacity planning.
  • Benchmark TTFT, TPOT, and throughput at multiple concurrency levels to map the throughput-latency curve and find the saturation point for your SLA.
  • Alternative hardware (TPUs, AMD ROCm, Intel Gaudi) is production-viable, but budget for kernel gaps, porting effort, and smaller ecosystems.
  • Advanced scheduling (chunked prefill, request migration) and KV cache management (compression, reuse, tiered storage) can multiply effective serving capacity without touching the model.

Research Frontier

Several research directions are pushing the boundaries of LLM performance and portability. Disaggregated inference (Splitwise, DistServe) separates the prefill and decode phases onto different hardware, using high-compute GPUs for prefill and high-memory GPUs for decode. Speculative decoding with hardware co-design explores using small, fast accelerators (FPGAs or NPUs) to run draft models while large GPUs verify and accept tokens. Compiler-level portability through OpenXLA and MLIR aims to provide a single compilation path that targets NVIDIA, AMD, TPU, and custom accelerators from a single source. KV cache disaggregation (Mooncake, MemServe) moves the KV cache to a dedicated memory pool accessible by any GPU in the cluster, fully decoupling compute from cache storage.

Exercises

Exercise 29.14.1: Inference benchmarking (Coding)

Set up a vLLM server with a 7B model and run the benchmark_serving tool at concurrency levels 1, 4, 8, 16, and 32. Plot the throughput-latency curve (TTFT P99 vs. throughput). Identify the concurrency level at which the system saturates and explain what limits further throughput.

Answer Sketch

Launch vLLM with Llama-3.1-8B-Instruct and run benchmark_serving with --request-rate values of 1, 4, 8, 16, and 32. The plot shows throughput increasing roughly linearly to ~8 RPS and then plateauing while TTFT P99 grows rapidly. Saturation occurs when the GPU's compute capacity is fully utilized (the decode phase becomes the bottleneck) or when KV cache memory is exhausted. The limiting factor is usually GPU compute at short contexts and memory at long contexts.

Exercise 29.14.2: Cross-hardware cost analysis (Analysis)

Compare the cost-per-million-tokens for serving Llama 3.1 70B on: (a) 4x NVIDIA H100 on AWS p5, (b) 4x AMD MI300X on a cloud provider, (c) a single TPU v5p pod-8 via GCP. Use published benchmarks or estimates for throughput on each platform and current cloud pricing to calculate the cost.

Answer Sketch

Research current pricing: AWS p5.24xlarge (8x H100) ~$98/hr, estimate ~14K tokens/sec for 70B on 4 GPUs. AMD MI300X: estimate ~10.8K tokens/sec, ~$45/hr equivalent. TPU v5p pod-8: ~$40/hr, ~8K tokens/sec. Calculate tokens per dollar: H100 ~514K tokens/$, MI300X ~864K tokens/$, TPU ~720K tokens/$. MI300X wins on cost-efficiency for throughput, H100 wins on latency, TPU wins on total cost for sustained workloads. Note: actual numbers vary significantly; the exercise teaches the methodology.

Exercise 29.14.3: KV cache sizing (Analysis)

For a Llama 3.1 70B model (80 layers, 8 KV heads, head_dim=128, BF16), calculate the KV cache size per token, per request at 4K context, and the maximum concurrent requests on an 8x H100 system (640 GB total HBM) assuming the model weights consume 140 GB. How does CacheGen's 3.7x compression change the concurrency limit?

Answer Sketch

KV cache per token: 80 layers * 2 (K+V) * 8 heads * 128 dim * 2 bytes (BF16) = 327,680 bytes = 320 KB. Per request at 4K: 320 KB * 4096 tokens ≈ 1.34 GB. Available memory: 640 - 140 = 500 GB. Max concurrent: 500 / 1.34 ≈ 373 requests. With CacheGen's 3.7x compression: effective per-request ≈ 1.34 / 3.7 ≈ 0.36 GB, max concurrent ≈ 500 / 0.36 ≈ 1,380 requests. This 3.7x increase in concurrency translates into a proportional throughput gain as long as GPU compute does not become the new bottleneck.

Exercise 29.14.4: MLPerf interpretation (Discussion)

Look up the latest MLPerf Inference results for an LLM benchmark. Compare two submissions on different hardware (e.g., NVIDIA H100 vs. H200, or NVIDIA vs. AMD). Discuss: What scenarios were tested? How do the results differ on throughput vs. latency? What can and cannot be concluded from these results for a production deployment?

Answer Sketch

Access results at mlcommons.org. Compare Server and Offline scenarios. Note that submissions use optimized software stacks that may not reflect default vLLM/TGI performance. Can conclude: relative hardware throughput, approximate performance scaling. Cannot conclude: actual production latency (depends on serving software, request distribution, batch size), cost-per-token (depends on pricing and utilization), quality differences (MLPerf does not measure model quality for same-model benchmarks).

Lab: Build an LLM Evaluation Suite

Duration: ~60 minutes | Level: Intermediate

Objective

Build a complete LLM evaluation toolkit from the ground up. First, implement BLEU and ROUGE scoring by hand to understand what the metrics actually measure (the "right tool" baseline). Then, use the evaluate library for production-grade scoring. Finally, add an LLM-as-judge evaluator using a simple grading prompt.

What You'll Practice

  • Implementing BLEU n-gram precision from scratch
  • Implementing ROUGE-1 and ROUGE-L recall from scratch
  • Using the Hugging Face evaluate library for standardized metric computation
  • Building an LLM-as-judge evaluator with structured scoring prompts

Setup

Install the evaluation libraries.

pip install evaluate rouge-score nltk transformers torch
Code Fragment 29.14.L1: Install the evaluate library and its metric backends.

Steps

Step 1: Implement BLEU from scratch

BLEU measures n-gram precision: what fraction of the candidate's n-grams appear in the reference. Implementing it by hand reveals how the metric penalizes omissions and rewards lexical overlap.

from collections import Counter
import math

def compute_ngrams(tokens, n):
    """Extract n-grams from a token list."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def bleu_score(reference, candidate, max_n=4):
    """Compute BLEU score with brevity penalty (from scratch)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()

    # Brevity penalty: penalize candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref_tokens) / max(len(cand_tokens), 1)))

    # N-gram precisions
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(compute_ngrams(ref_tokens, n))
        cand_ngrams = Counter(compute_ngrams(cand_tokens, n))

        # Clipped counts: each n-gram counts at most as many times as in ref
        clipped = sum(min(count, ref_ngrams[ng])
                      for ng, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(clipped / total)

    # Geometric mean of precisions (skipping zeros is a mild smoothing;
    # standard BLEU returns 0 if any precision is 0)
    positive = [p for p in precisions if p > 0]
    if not positive:
        return 0.0
    log_avg = sum(math.log(p) for p in positive) / len(positive)
    return bp * math.exp(log_avg)

# Test data
reference = "The cat sat on the mat and looked out the window"
candidate = "The cat was sitting on the mat"
score = bleu_score(reference, candidate)
print(f"Manual BLEU: {score:.4f}")
Code Fragment 29.14.L2: Implementation of compute_ngrams and bleu_score

Step 2: Implement ROUGE from scratch

ROUGE measures recall: what fraction of the reference's content appears in the candidate. Implement ROUGE-1 (unigram recall) and ROUGE-L (longest common subsequence).

def rouge_1(reference, candidate):
    """Compute ROUGE-1 (unigram) F1 score."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()

    ref_counts = Counter(ref_tokens)
    cand_counts = Counter(cand_tokens)

    overlap = sum(min(ref_counts[t], cand_counts[t])
                  for t in ref_counts)

    precision = overlap / max(len(cand_tokens), 1)
    recall = overlap / max(len(ref_tokens), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return {"precision": precision, "recall": recall, "f1": f1}

def lcs_length(x, y):
    """Compute length of longest common subsequence."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i-1] == y[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

def rouge_l(reference, candidate):
    """Compute ROUGE-L F1 score using longest common subsequence."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()

    lcs = lcs_length(ref_tokens, cand_tokens)
    precision = lcs / max(len(cand_tokens), 1)
    recall = lcs / max(len(ref_tokens), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return {"precision": precision, "recall": recall, "f1": f1}

# Test
r1 = rouge_1(reference, candidate)
rl = rouge_l(reference, candidate)
print(f"Manual ROUGE-1 F1: {r1['f1']:.4f}")
print(f"Manual ROUGE-L F1: {rl['f1']:.4f}")
Manual ROUGE-1 F1: 0.5556 Manual ROUGE-L F1: 0.5556
Code Fragment 29.14.L3: Implementation of rouge_1, lcs_length, and rouge_l

Step 3: Verify with the evaluate library

Now compare your manual implementations against the production evaluate library to confirm correctness and see what additional features the library provides.

import evaluate

# Load metrics from the evaluate library
bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

# Compute library BLEU
lib_bleu = bleu_metric.compute(
 predictions=[candidate],
 references=[[reference]],
)

# Compute library ROUGE
lib_rouge = rouge_metric.compute(
 predictions=[candidate],
 references=[reference],
)

print("Library BLEU:", f"{lib_bleu['bleu']:.4f}")
print("Library ROUGE-1:", f"{lib_rouge['rouge1']:.4f}")
print("Library ROUGE-L:", f"{lib_rouge['rougeL']:.4f}")
print()
print("Manual BLEU:", f"{score:.4f}")
print("Manual ROUGE-1 F1:", f"{r1['f1']:.4f}")
print("Manual ROUGE-L F1:", f"{rl['f1']:.4f}")
print()
print("(Small differences are expected due to tokenization and smoothing.)")
Library BLEU: 0.0000 Library ROUGE-1: 0.5556 Library ROUGE-L: 0.5556 Manual BLEU: 0.2343 Manual ROUGE-1 F1: 0.5556 Manual ROUGE-L F1: 0.5556 (Small differences are expected due to tokenization and smoothing.)
Code Fragment 29.14.L4: Verifying the manual metrics against the evaluate library. The library BLEU is 0 here because the unsmoothed metric zeroes out when the candidate shares no 4-grams with the reference, while the manual version skips zero precisions.

Step 4: Build an LLM-as-judge evaluator

Automated metrics miss semantic equivalence. An LLM-as-judge uses a language model to grade outputs on criteria like relevance, fluency, and factual accuracy. Build a simple judge using a local model.

from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype="auto",
    device_map="auto",
)

def llm_judge(question, reference_answer, candidate_answer):
    """Score a candidate answer on a 1-5 scale using an LLM judge."""
    prompt = f"""You are an expert evaluator. Score the candidate answer \
on a scale of 1 to 5 for relevance and accuracy compared to the reference.

Question: {question}
Reference answer: {reference_answer}
Candidate answer: {candidate_answer}

Respond with ONLY a JSON object: {{"score": N, "reason": "brief explanation"}}
"""
    messages = [{"role": "user", "content": prompt}]
    result = judge(messages, max_new_tokens=100, do_sample=False)
    return result[0]["generated_text"][-1]["content"]

# Example evaluation
question = "What is the cat doing?"
ref_answer = "The cat sat on the mat and looked out the window."
cand_answer = "The cat was sitting on the mat."

judgment = llm_judge(question, ref_answer, cand_answer)
print("LLM Judge verdict:")
print(f" {judgment}")
LLM Judge verdict:
 {"score": 4, "reason": "The candidate correctly identifies the cat sitting on the mat, but omits the detail about looking out the window. The core activity is captured with minor information loss."}
Code Fragment 29.14.7: Implementation of llm_judge
Hint

If you lack GPU memory for a 7B model, substitute TinyLlama/TinyLlama-1.1B-Chat-v1.0 or use an API-based model. The key concept is the structured judging prompt, not the specific model.
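Whichever judge model you use, do not assume its reply is clean JSON: smaller models in particular often wrap the verdict in extra prose. A minimal parsing helper (parse_verdict is a hypothetical name, not part of any library) can recover the verdict or fail gracefully:

```python
import json
import re

def parse_verdict(raw_text, default_score=None):
    """Extract a {"score": ..., "reason": ...} dict from judge output.

    Try strict JSON first; if the judge wrapped the verdict in prose,
    fall back to the first {...} block; otherwise return a sentinel.
    """
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*?\}", raw_text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return {"score": default_score, "reason": "unparseable judge output"}

clean = '{"score": 4, "reason": "minor omission"}'
noisy = 'Sure! Here is my verdict: {"score": 2, "reason": "off-topic"} Hope that helps.'
print(parse_verdict(clean)["score"])  # 4
print(parse_verdict(noisy)["score"])  # 2
```

Returning a sentinel instead of raising keeps a batch evaluation run alive when one judgment is malformed; you can count sentinels afterward as a judge-reliability signal.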

Extensions

  • Implement BERTScore using contextual embeddings to measure semantic similarity beyond lexical overlap.
  • Create a multi-judge panel (3 LLM judges with different temperatures) and compute inter-judge agreement using Cohen's kappa.
  • Build an evaluation harness that runs all metrics (BLEU, ROUGE, BERTScore, LLM-judge) on a dataset of 50 question-answer pairs and produces a comparative report.
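For the multi-judge extension, you will need an agreement statistic. As a starting point, here is a minimal Cohen's kappa sketch for two judges' scores, treating each score value as a category (a from-scratch illustration; scikit-learn's cohen_kappa_score offers a production version):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two judges' parallel score lists.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance
    from each judge's marginal label frequencies.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:          # degenerate case: both judges are constant
        return 1.0
    return (p_o - p_e) / (1 - p_e)

judge1 = [4, 3, 5, 2, 4, 4]
judge2 = [4, 3, 4, 2, 5, 4]
print(f"kappa = {cohens_kappa(judge1, judge2):.3f}")  # kappa = 0.500
```

For a three-judge panel, compute kappa for each of the three pairs and report the mean, or switch to Fleiss' kappa, which generalizes to any number of raters.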
References & Further Reading
MLPerf and Benchmarking

MLCommons (2024). MLPerf Training and Inference Benchmarks.

The industry-standard benchmarking suite for ML hardware and software performance. Includes LLM-specific benchmarks with standardized rules, scenarios, and audited results across hardware platforms.

📖 Documentation
Advanced Inference Scheduling

Agrawal, A., Panwar, A., Mohan, J., et al. (2024). Sarathi-Serve: Efficient Chunked-Prefills for SLO-compliant LLM Serving.

Introduces chunked prefill scheduling for LLM serving, which interleaves prefill and decode phases to eliminate head-of-line blocking. Demonstrates 30-50% reduction in decode tail latency for mixed workloads.

📄 Paper

Sun, B., Huang, Z., Zhao, H., et al. (2024). Llumnix: Dynamic Scheduling for Large Language Model Serving. OSDI 2024.

Proposes cross-instance request migration for LLM serving, enabling dynamic load balancing by transferring in-progress requests (including KV cache state) between serving instances.

📄 Paper
KV Cache Management

Zheng, Y., Liu, Z., Wang, Z., et al. (2024). CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving.

Introduces learned compression for KV cache tensors, achieving 3.7x compression with minimal quality degradation. Enables higher concurrency and faster cache transfer between serving instances.

📄 Paper

Lee, J., Kim, K., et al. (2024). InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management.

Proposes predictive KV cache prefetching between GPU, CPU, and disk storage tiers. Uses attention pattern prediction from early layers to prefetch entries for later layers during compute.

📄 Paper
Cross-Hardware Portability

Google (2024). MaxText: A Simple, Performant, and Scalable Jax LLM.

Google's reference implementation for LLM training on TPUs using JAX. Demonstrates best practices for distributed training, FSDP, and XLA compiler optimization on TPU v5p hardware.

💻 Library

AMD (2024). ROCm Documentation: Open Software Platform for GPU Computing.

Documentation for AMD's ROCm software stack, covering HIP (CUDA compatibility layer), MIOpen (cuDNN equivalent), RCCL (NCCL equivalent), and integration with PyTorch and vLLM.

📖 Documentation

What Comes Next

In this section we covered MLPerf training and inference suites, LLM inference benchmarking, and related topics. This concludes the current chapter. Return to the chapter overview to review the material or explore related chapters.