"If your only evaluation metric is vibes, your only production guarantee is also vibes."
Eval, Vibe-Averse AI Agent
You cannot improve what you cannot measure, and measuring LLM quality is surprisingly hard. Unlike classification tasks where accuracy tells the whole story, LLM outputs are open-ended, subjective, and context-dependent. A single question can have many correct answers, each with different levels of helpfulness, style, and factual precision. This section builds your evaluation toolkit from low-level language modeling metrics (perplexity, bits-per-byte) through reference-based text metrics (BLEU, ROUGE, BERTScore), to modern approaches like LLM-as-Judge and structured human evaluation. It also covers the major benchmarks that the research community uses to track model capability over time. The decoding strategies from Section 05.2 (temperature, top-k, top-p) directly affect evaluation scores, so understanding their impact is essential for fair comparisons.
Prerequisites
This section assumes familiarity with LLM API patterns from Section 10.1 and prompt engineering from Section 11.1. Understanding basic ML evaluation concepts (precision, recall, F1) from Section 00.3 provides useful context for the evaluation metrics discussed here.
1. Intrinsic Language Modeling Metrics
Intrinsic metrics measure how well a model captures the statistical properties of language itself. These metrics are computed directly from the model's probability distributions over tokens, without requiring any downstream task. They are most useful for comparing base (pretrained) models and for monitoring training progress.
BLEU score was invented in 2002 for machine translation and is still the most cited evaluation metric in NLP. It essentially counts matching n-grams, which means a random word salad that happens to contain the right phrases can score surprisingly well.
Perplexity
Perplexity measures how "surprised" a model is by a sequence of text. Formally, it is the exponentiated average negative log-likelihood per token. A model that assigns high probability to the correct next token at every position will have low perplexity. Lower perplexity means the model is a better predictor of natural language.
For a sequence of N tokens, perplexity is defined as:
$$\mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_{i} \mid x_{<i}) \right)$$where p(x_i | x_<i) is the probability the model assigns to token x_i given the preceding tokens. In other words, perplexity is the exponential of the average per-token cross-entropy loss.
Perplexity depends heavily on the tokenizer. A model that uses byte-pair encoding (Chapter 02) with a 50K vocabulary will have a different perplexity from one using a 100K vocabulary, even if both models are equally capable. This makes direct perplexity comparisons across model families unreliable.
Bits-per-Byte (BPB)
Bits-per-byte normalizes the language modeling loss by the number of UTF-8 bytes rather than tokens, making it comparable across tokenizers and vocabularies. BPB measures how many bits the model needs, on average, to encode each byte of the original text. In plain terms, a lower BPB means the model compresses text more efficiently, which reflects better language understanding. This is the preferred metric when comparing models with different tokenization schemes. Code Fragment 29.1.8 below puts this into practice.
# Implement compute_perplexity_and_bpb
# Key operations: forward pass with labels, loss-to-perplexity conversion, BPB normalization
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity_and_bpb(model_name: str, text: str):
    """Compute perplexity and bits-per-byte for a text sample."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(model.device)
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
    neg_log_likelihood = outputs.loss  # average NLL per token, in nats
    num_tokens = input_ids.size(1)
    perplexity = torch.exp(neg_log_likelihood).item()
    # Bits-per-byte: convert total nats to bits, normalize by UTF-8 bytes.
    # The causal-LM loss averages over the num_tokens - 1 shifted positions.
    total_nll_bits = neg_log_likelihood.item() * (num_tokens - 1) / math.log(2)
    num_bytes = len(text.encode("utf-8"))
    bpb = total_nll_bits / num_bytes
    return {
        "perplexity": round(perplexity, 2),
        "bits_per_byte": round(bpb, 4),
        "num_tokens": num_tokens,
        "num_bytes": num_bytes,
    }

# Example usage
text = "The transformer architecture revolutionized natural language processing."
result = compute_perplexity_and_bpb("gpt2", text)
print(result)
Perplexity and BPB only measure how well a model predicts text, not how useful or safe its generations are. A model with excellent perplexity may still produce hallucinations, refuse safe requests, or generate harmful content. These metrics are necessary but far from sufficient for evaluating instruction-tuned or chat models. Figure 29.1.1 presents this taxonomy.
2. Reference-Based Text Metrics
Reference-based metrics compare model output against one or more "gold standard" reference texts. They originated in machine translation and summarization, where human-written references provide a ground truth. These metrics are fast, deterministic, and cheap to compute, but they share a fundamental limitation: they assume the reference text captures the space of acceptable answers, which is rarely true for open-ended generation. Code Fragment 29.1.7 below puts this into practice.
BLEU, ROUGE, and BERTScore
| Metric | What It Measures | Best For | Limitation |
|---|---|---|---|
| BLEU | N-gram precision (how many n-grams in the output appear in the reference) | Machine translation | Ignores recall; penalizes valid paraphrases |
| ROUGE-L | Longest common subsequence between output and reference | Summarization | Surface-level overlap; misses semantic equivalence |
| ROUGE-N | N-gram recall (how many reference n-grams appear in the output) | Summarization | Same surface-level limitations as ROUGE-L |
| BERTScore | Cosine similarity of contextualized token embeddings | Any text comparison | Computationally expensive; correlation varies by task |
| METEOR | Unigram matching with stemming, synonyms, and paraphrase support | Machine translation | Complex; still surface-level for LLM outputs |
Among these metrics, BLEU is the most established. Understanding its formula clarifies both its strengths (penalizing short or repetitive outputs) and its limitations (ignoring meaning entirely):
BLEU Score.
BLEU measures modified n-gram precision with a brevity penalty. For candidate c against reference r:
$$\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_{n} \log p_{n} \right)$$where p_n is the modified n-gram precision (clipped to reference counts), w_n = 1/N are uniform weights (typically N = 4), and the brevity penalty is:
$$BP = \min\left(1, \exp\left(1 - \frac{|r|}{|c|}\right)\right)$$The brevity penalty penalizes candidates shorter than the reference, preventing trivially high precision from very short outputs.
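The formula can be sketched directly in Python. This is an illustrative, unsmoothed single-reference version; real implementations such as sacrebleu add smoothing and corpus-level aggregation:

```python
# Illustrative single-reference BLEU, unsmoothed, mirroring the formula above.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip candidate counts to reference counts so repetition is not rewarded
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # any zero precision zeroes the geometric mean
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * geo_mean

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the mat".split()))  # 1.0 for an exact match
```

Note that a valid paraphrase with no shared 4-grams scores 0 under this unsmoothed version, which is exactly the failure mode discussed above.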
A complementary recall-oriented metric captures how much of the reference content appears in the candidate:
ROUGE-L (Longest Common Subsequence).
ROUGE-L uses the LCS between candidate c and reference r:
$$R_{\mathrm{lcs}} = \frac{\mathrm{LCS}(r, c)}{|r|}, \quad P_{\mathrm{lcs}} = \frac{\mathrm{LCS}(r, c)}{|c|}, \quad F_{\mathrm{lcs}} = \frac{(1 + \beta^{2}) \, R_{\mathrm{lcs}} \, P_{\mathrm{lcs}}}{\beta^{2} P_{\mathrm{lcs}} + R_{\mathrm{lcs}}}$$where β controls the balance between precision and recall (typically β = 1.2, favoring recall).
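As a sketch, ROUGE-L reduces to a standard LCS dynamic program plus the β-weighted F-measure above. This token-level version is illustrative only; the rouge-score library adds stemming and sentence-level variants:

```python
# ROUGE-L from the formulas above: an LCS dynamic program plus the
# beta-weighted F-measure.
def lcs_len(a, b):
    # Classic O(|a||b|) dynamic program for longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, beta=1.2):
    lcs = lcs_len(reference, candidate)
    if lcs == 0:
        return 0.0
    r = lcs / len(reference)   # recall
    p = lcs / len(candidate)   # precision
    return (1 + beta ** 2) * r * p / (beta ** 2 * p + r)

cand = "the cat was found under the bed".split()
ref = "the cat was under the bed".split()
print(round(rouge_l(cand, ref), 3))  # 0.936
```

Because LCS only requires tokens in the same relative order, the inserted word "found" costs precision but not recall.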
Moving beyond surface-level matching, semantic similarity metrics use neural embeddings to capture meaning:
BERTScore.
BERTScore computes soft token matching using contextual embeddings. Given candidate tokens {ci} and reference tokens {rj} with their BERT embeddings:
$$P_{\mathrm{BERT}} = \frac{1}{|c|} \sum_{i} \max_{j} \cos(c_{i}, r_{j}), \quad R_{\mathrm{BERT}} = \frac{1}{|r|} \sum_{j} \max_{i} \cos(c_{i}, r_{j}), \quad F1_{\mathrm{BERT}} = \frac{2 \, P_{\mathrm{BERT}} \, R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$$Each candidate token is matched to its most similar reference token (and vice versa) using cosine similarity of contextualized embeddings, capturing semantic equivalence that surface-level n-gram metrics miss.
# Compare ROUGE, BLEU, and BERTScore on a paraphrase pair
# Key operations: metric computation, results display
from rouge_score import rouge_scorer
from bert_score import score as bert_score
import evaluate

# Reference and candidate texts
reference = "The cat sat on the mat and watched the birds outside."
candidate = "A cat was sitting on a mat, observing birds through the window."

# ROUGE scores
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_results = scorer.score(reference, candidate)
for key, value in rouge_results.items():
    print(f"{key}: precision={value.precision:.3f}, recall={value.recall:.3f}, f1={value.fmeasure:.3f}")

# BLEU score
bleu = evaluate.load("bleu")
bleu_result = bleu.compute(
    predictions=[candidate],
    references=[[reference]]
)
print(f"BLEU: {bleu_result['bleu']:.4f}")

# BERTScore
P, R, F1 = bert_score([candidate], [reference], lang="en", verbose=False)
print(f"BERTScore: precision={P[0]:.4f}, recall={R[0]:.4f}, f1={F1[0]:.4f}")
Notice how BERTScore (0.94 F1) captures the semantic similarity between these paraphrases far better than BLEU (0.12) or ROUGE-2 (0.11). When outputs are valid paraphrases of references, embedding-based metrics like BERTScore provide much more meaningful signal. However, no single metric tells the full story. Use multiple metrics together and always validate against human judgment.
Why evaluation is harder for LLMs than classical ML. In classical ML, you measure a model's performance against a known ground truth: the cat is either correctly classified or it is not. LLM evaluation has no equivalent. A question like "Explain quantum computing" has thousands of valid answers that differ in depth, style, framing, and emphasis. Two responses can both be excellent yet share almost no words in common. This is why n-gram metrics like BLEU fail for open-ended generation: they reward surface overlap, not semantic quality. The field's migration from BLEU to BERTScore to LLM-as-Judge reflects a steady attempt to close this gap between what metrics measure and what users actually care about.
3. LLM-as-Judge
The LLM-as-Judge paradigm uses a strong language model (typically GPT-4, Claude, or a specialized judge model) to evaluate the outputs of other models. This approach has rapidly become the most popular evaluation method for instruction-tuned models because it can assess open-ended qualities like helpfulness, safety, and reasoning quality that reference-based metrics cannot capture.
When setting up LLM-as-Judge evaluation, always calibrate against human judgments first. Take 50 examples, have two humans rate them independently, then compare the judge model's ratings to both. If the judge agrees with humans less often than the humans agree with each other, your judge prompt needs revision before you scale up. This calibration step takes one afternoon and prevents weeks of misleading evaluation results.
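That calibration loop can be scripted. The ratings below are hypothetical, and simple percent agreement stands in for the chance-corrected kappa you would use in practice:

```python
# Compare judge-human agreement to human-human agreement on shared examples.
# Ratings are hypothetical; percent agreement is a simple stand-in for
# Cohen's kappa.
def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

human_1 = [5, 4, 3, 4, 2, 5, 3, 4, 4, 3]
human_2 = [5, 4, 3, 3, 2, 5, 3, 4, 5, 3]
judge   = [5, 5, 3, 4, 2, 4, 3, 4, 4, 3]

hh = percent_agreement(human_1, human_2)                # human-human baseline
jh = min(percent_agreement(judge, human_1),
         percent_agreement(judge, human_2))             # worst judge-human pair
print(f"human-human: {hh:.2f}, judge-human: {jh:.2f}")  # 0.80 vs 0.60
if jh < hh:
    print("Judge agrees with humans less than they agree with each other; revise the judge prompt.")
```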
Judge Prompt Design
The quality of LLM-as-Judge evaluation depends entirely on the judge prompt. A well-designed prompt specifies clear evaluation criteria, provides a rubric with concrete examples at each score level, and asks the judge to produce a structured output (typically a score plus reasoning). The reasoning-before-score pattern (where the judge explains its rationale before assigning a numeric score) has been shown to produce more reliable evaluations. Code Fragment 29.1.9 below puts this into practice.
# Implement llm_judge_evaluate: LLM-as-Judge with a rubric and structured output
# Key operations: rubric prompt construction, JSON-mode API call, result parsing
from openai import OpenAI
import json

client = OpenAI()

def llm_judge_evaluate(question: str, answer: str, criteria: str) -> dict:
    """Use an LLM as a judge to evaluate answer quality."""
    judge_prompt = f"""You are an expert evaluator. Assess the following answer to a question.

QUESTION: {question}

ANSWER: {answer}

EVALUATION CRITERIA: {criteria}

RUBRIC:
5 = Excellent: Fully addresses the question with accurate, comprehensive, well-organized content
4 = Good: Addresses the question well with minor gaps or imprecisions
3 = Adequate: Partially addresses the question but has notable gaps
2 = Poor: Mostly fails to address the question or contains significant errors
1 = Very Poor: Completely off-topic, factually wrong, or incoherent

Provide your evaluation as JSON with these fields:
- "reasoning": Your step-by-step analysis (2-3 sentences)
- "score": Integer from 1 to 5
- "strengths": List of strengths
- "weaknesses": List of weaknesses"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
        temperature=0.0
    )
    return json.loads(response.choices[0].message.content)

# Example evaluation
result = llm_judge_evaluate(
    question="What causes seasons on Earth?",
    answer="Seasons happen because Earth's axis is tilted at 23.5 degrees relative to its orbital plane. This means different hemispheres receive more direct sunlight at different times of year.",
    criteria="Factual accuracy, completeness, and clarity"
)
print(json.dumps(result, indent=2))
DeepEval achieves the same result in a few lines:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What causes seasons on Earth?",
    actual_output="Seasons happen because Earth's axis is tilted...",
    retrieval_context=["Earth's axis is tilted 23.5 degrees..."]
)
relevancy = AnswerRelevancyMetric(threshold=0.7)
relevancy.measure(test_case)
print(f"Score: {relevancy.score}, Reason: {relevancy.reason}")
Common LLM-as-Judge Biases
LLM judges are powerful but carry systematic biases that evaluators must account for. Understanding these biases is essential for interpreting judge results correctly. Figure 29.1.3 shows the LLM-as-Judge evaluation pipeline.
- Position bias: Judges tend to prefer the first option in pairwise comparisons. Mitigation: randomize order and average scores across both orderings.
- Verbosity bias: Judges often favor longer, more detailed answers even when the extra content adds no value. Mitigation: include "conciseness is a virtue" in the rubric.
- Self-preference bias: Models tend to rate their own outputs higher than outputs from other models. Mitigation: use a judge model different from the models being evaluated.
- Authority bias: Confident, well-formatted answers receive higher scores even when factually wrong. Mitigation: explicitly instruct the judge to verify factual claims.
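The position-bias mitigation above can be sketched as a wrapper that queries the judge in both orders and averages; `judge_fn` here is a hypothetical callable that returns "first" or "second" for the pair as presented:

```python
# Position-debiased pairwise judging: ask the judge twice with the order
# swapped and average the verdicts. `judge_fn` is a hypothetical callable.
def debiased_pairwise_judge(judge_fn, answer_a, answer_b):
    """Return answer_a's win score in [0, 1], averaged over both orders."""
    wins_a = 0.0
    if judge_fn(answer_a, answer_b) == "first":   # order 1: A shown first
        wins_a += 0.5
    if judge_fn(answer_b, answer_a) == "second":  # order 2: A shown second
        wins_a += 0.5
    return wins_a  # 1.0 = A wins both orders, 0.5 = split, 0.0 = B wins both

# A judge with pure position bias always picks whatever comes first;
# averaging over both orders neutralizes this to a 0.5 split.
always_first = lambda a, b: "first"
print(debiased_pairwise_judge(always_first, "answer A", "answer B"))  # 0.5
```

A judge whose verdict actually depends on content gives the same answer in both orders, so its signal survives the averaging while pure position bias cancels out.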
Intuitively, LLM-as-Judge works because strong models have internalized enough of human preferences through alignment training that they can approximate the judgment of a human annotator. The main risk is that the judge model shares the same biases as the model being evaluated (both prefer verbose, formal answers) creating a feedback loop that rewards style over substance. This is why calibrating LLM judges against human judgments, as described in the arena methodology of Section 29.8, remains essential for high-stakes evaluation.
4. Human Evaluation
Human evaluation remains the gold standard for assessing LLM quality, especially for subjective dimensions like helpfulness, creativity, and conversational naturalness. However, human evaluation is expensive, slow, and introduces its own variability. The key challenge is designing evaluation protocols that are reliable (different annotators agree) and valid (they measure what you care about).
Evaluation Protocol Design
Effective human evaluation protocols share several characteristics. First, they define clear, specific criteria with concrete examples at each rating level. Second, they include calibration rounds where annotators evaluate the same examples and discuss disagreements before the main annotation begins. Third, they compute inter-annotator agreement metrics (like Cohen's kappa or Krippendorff's alpha) to verify that the task is well-defined enough for consistent human judgment. Code Fragment 29.1.5 below puts this into practice.
# Implement compute_inter_annotator_agreement: pairwise Cohen's kappa
# Key operations: pairwise kappa computation, aggregation, results display
import numpy as np
from sklearn.metrics import cohen_kappa_score

def compute_inter_annotator_agreement(annotations: dict) -> dict:
    """Compute pairwise Cohen's kappa between annotators.

    Args:
        annotations: dict mapping annotator_id to list of scores
    """
    annotator_ids = list(annotations.keys())
    kappas = {}
    for i in range(len(annotator_ids)):
        for j in range(i + 1, len(annotator_ids)):
            a1, a2 = annotator_ids[i], annotator_ids[j]
            kappa = cohen_kappa_score(
                annotations[a1], annotations[a2], weights="quadratic"
            )
            kappas[f"{a1}_vs_{a2}"] = round(kappa, 3)
    avg_kappa = np.mean(list(kappas.values()))
    return {"pairwise_kappas": kappas, "mean_kappa": round(avg_kappa, 3)}

# Three annotators rating 10 examples on a 1-5 scale
annotations = {
    "annotator_A": [4, 3, 5, 2, 4, 3, 5, 4, 3, 4],
    "annotator_B": [4, 3, 4, 2, 5, 3, 5, 3, 3, 4],
    "annotator_C": [5, 3, 5, 3, 4, 2, 4, 4, 3, 5],
}
agreement = compute_inter_annotator_agreement(annotations)
print(f"Pairwise kappas: {agreement['pairwise_kappas']}")
print(f"Mean kappa: {agreement['mean_kappa']}")
# Interpretation: >0.8 = excellent, 0.6-0.8 = good, 0.4-0.6 = moderate
A mean kappa of 0.67 indicates "good" agreement, but the wide range (0.57 to 0.81) across annotator pairs suggests that annotator C may be interpreting the rubric differently. Before proceeding with annotation, run additional calibration rounds and review disagreement examples with all annotators to align their understanding of the scoring criteria.
5. Standard Benchmarks
Benchmarks provide standardized test sets that enable comparison across models. The LLM community has developed dozens of benchmarks, each targeting different capabilities. Understanding what each benchmark measures (and what it does not measure) is essential for interpreting model comparison tables. Figure 29.1.4 maps the benchmark landscape by capability domain. Code Fragment 29.1.7 below puts this into practice.
Benchmark Comparison
| Benchmark | Tasks | Metric | Format | Notes |
|---|---|---|---|---|
| MMLU | 57 subjects (STEM, humanities, social science) | Accuracy | Multiple choice | Most widely cited; saturation concern |
| HumanEval | 164 Python coding problems | pass@k | Code generation | Tests functional correctness via unit tests |
| MT-Bench | 80 multi-turn questions across 8 categories | LLM-as-Judge (1-10) | Open-ended | Tests multi-turn instruction following |
| Chatbot Arena | Crowdsourced pairwise comparisons | Elo rating (skill ranking derived from pairwise win/loss outcomes) | Side-by-side | Most ecologically valid; slow to update |
| GSM8K | 1,319 grade-school math word problems | Exact match | Free-form numeric | Nearly saturated by frontier models |
| SWE-Bench | Real GitHub issues requiring code fixes | % resolved | Repository-level code | Tests real-world software engineering |
Chatbot Arena ranks models using pairwise human comparisons rather than fixed benchmarks. The underlying Elo rating system, borrowed from chess, updates each model's score after every head-to-head comparison:
Elo Rating Update (Chatbot Arena).
After a comparison where model A (rating RA) plays against model B (rating RB), the expected win probability is:
$$E_{A} = \frac{1}{1 + 10^{(R_{B} - R_{A}) / 400}}$$After observing the actual outcome S_A (1 for a win, 0.5 for a tie, 0 for a loss), the rating updates as:
$$R_{A}^{\mathrm{new}} = R_{A} + K \cdot (S_{A} - E_{A})$$where K is a sensitivity constant (typically 4 to 32). Chatbot Arena uses this to rank models from thousands of crowdsourced pairwise comparisons.
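The update rule takes only a few lines to implement; the K value here is illustrative:

```python
# Minimal Elo update following the formulas above. K = 32 is illustrative;
# Chatbot Arena uses a smaller K for ranking stability.
def elo_update(r_a, r_b, outcome_a, k=32):
    """outcome_a: 1.0 if A wins, 0.5 for a tie, 0.0 if A loses."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (outcome_a - expected_a)
    new_b = r_b + k * ((1 - outcome_a) - (1 - expected_a))  # zero-sum update
    return new_a, new_b

# Equal ratings, A wins: expected score is 0.5, so A gains K/2 = 16 points
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)
```

Notice the update is proportional to surprise: beating a much stronger opponent (low expected score) moves the rating far more than beating a weaker one.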
# Implement run_mmlu_sample: zero-shot MMLU accuracy on a random sample
# Key operations: prompt construction, answer extraction, per-subject accuracy
from datasets import load_dataset

def run_mmlu_sample(model_fn, num_samples: int = 20):
    """Evaluate a model on a sample of MMLU questions.

    Args:
        model_fn: callable that takes a prompt and returns A/B/C/D
        num_samples: number of questions to evaluate
    """
    dataset = load_dataset("cais/mmlu", "all", split="test")
    dataset = dataset.shuffle(seed=42).select(range(num_samples))
    correct = 0
    results_by_subject = {}
    for example in dataset:
        choices = example["choices"]
        prompt = f"""Question: {example['question']}
A) {choices[0]}
B) {choices[1]}
C) {choices[2]}
D) {choices[3]}
Answer with just the letter (A, B, C, or D):"""
        prediction = model_fn(prompt).strip().upper()
        answer_map = {0: "A", 1: "B", 2: "C", 3: "D"}
        correct_letter = answer_map[example["answer"]]
        is_correct = prediction[0] == correct_letter if prediction else False
        subject = example["subject"]
        if subject not in results_by_subject:
            results_by_subject[subject] = {"correct": 0, "total": 0}
        results_by_subject[subject]["total"] += 1
        if is_correct:
            results_by_subject[subject]["correct"] += 1
            correct += 1
    accuracy = correct / num_samples
    return {
        "overall_accuracy": round(accuracy, 3),
        "num_correct": correct,
        "num_total": num_samples,
        "by_subject": results_by_subject
    }
A growing concern with standard benchmarks is data contamination: the possibility that benchmark questions appeared in the model's training data. When a model has "seen" the test questions during training, its benchmark score overestimates true capability. Always check whether a model's technical report discusses contamination analysis, and consider using dynamic benchmarks (like LiveCodeBench or Chatbot Arena) that continuously add new questions.
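A crude but useful first screening step is n-gram overlap against the training corpus. This hypothetical sketch flags any eval item sharing an n-gram with training text; real pipelines normalize text, use longer n-grams (several technical reports use 13-grams), and scale the lookup with hashing:

```python
# Hypothetical contamination screen: flag eval items that share any n-gram
# with the training corpus. n = 5 below is an illustrative choice.
def ngram_set(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_item: str, train_ngrams: set, n: int = 8) -> bool:
    """True if any n-gram of the eval item appears in the training corpus."""
    return bool(ngram_set(eval_item, n) & train_ngrams)

# Toy training corpus that happens to contain a benchmark question verbatim
train_text = (
    "background filler text what is the capital of france "
    "answer the capital of france is paris more filler text"
)
train_ngrams = ngram_set(train_text, n=5)
print(is_contaminated("What is the capital of France answer please",
                      train_ngrams, n=5))  # True: the question leaked
```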
The lm-eval-harness (pip install lm-eval) from EleutherAI is the standard framework for running reproducible LLM benchmarks. It supports 200+ tasks with consistent prompting, few-shot formatting, and scoring, and it is the engine behind the Open LLM Leaderboard.
# pip install lm-eval
# Command-line (most common):
#   lm_eval --model hf \
#     --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
#     --tasks mmlu,hellaswag,arc_challenge \
#     --num_fewshot 5 \
#     --batch_size auto \
#     --output_path ./eval_results/

# Programmatic usage:
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["mmlu", "hellaswag"],
    num_fewshot=5,
    batch_size="auto",
)
for task, metrics in results["results"].items():
    acc = metrics.get("acc,none", metrics.get("acc_norm,none"))
    if acc is not None:
        print(f"{task}: {acc:.3f}")
6. Building an Evaluation Harness
In practice, you rarely evaluate with a single metric. A well-designed evaluation harness combines multiple metrics, runs them across a curated test set, and produces a structured report that tracks performance over time. The following example shows how to build a simple but extensible evaluation framework.
# Define EvalCase, EvalResult, EvalHarness; implement __init__, evaluate, summary
# Key operations: model invocation, per-case scoring, aggregate reporting
from dataclasses import dataclass, field
from typing import Callable
import time

import numpy as np

@dataclass
class EvalCase:
    """A single evaluation test case."""
    question: str
    reference: str = ""
    metadata: dict = field(default_factory=dict)

@dataclass
class EvalResult:
    """Result of evaluating one test case."""
    case: EvalCase
    model_output: str
    scores: dict
    latency_ms: float

class EvalHarness:
    """Lightweight evaluation harness for LLM applications."""

    def __init__(self, model_fn: Callable, scorers: dict[str, Callable]):
        self.model_fn = model_fn  # callable: prompt -> response
        self.scorers = scorers    # dict of name -> scoring function

    def evaluate(self, cases: list[EvalCase]) -> list[EvalResult]:
        """Run all test cases through the model and score them."""
        results = []
        for case in cases:
            start = time.time()
            output = self.model_fn(case.question)
            latency = (time.time() - start) * 1000
            scores = {}
            for name, scorer in self.scorers.items():
                scores[name] = scorer(
                    question=case.question,
                    output=output,
                    reference=case.reference
                )
            results.append(EvalResult(
                case=case, model_output=output,
                scores=scores, latency_ms=round(latency, 1)
            ))
        return results

    def summary(self, results: list[EvalResult]) -> dict:
        """Aggregate results into a summary report."""
        metric_names = list(results[0].scores.keys())
        summary = {}
        for metric in metric_names:
            values = [r.scores[metric] for r in results]
            summary[metric] = {
                "mean": round(np.mean(values), 3),
                "std": round(np.std(values), 3),
                "min": round(min(values), 3),
                "max": round(max(values), 3),
            }
        summary["mean_latency_ms"] = round(
            np.mean([r.latency_ms for r in results]), 1
        )
        return summary
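Any function with the signature (question, output, reference) -> float plugs into the harness as a scorer. Two minimal examples, illustrative only; swap in ROUGE, BERTScore, or an LLM judge for real use:

```python
# Example scorers matching the harness signature (question, output, reference).
def exact_match_scorer(question: str, output: str, reference: str) -> float:
    """1.0 if the normalized output equals the reference, else 0.0."""
    return float(output.strip().lower() == reference.strip().lower())

def token_f1_scorer(question: str, output: str, reference: str) -> float:
    """Token-overlap F1 over unique tokens (a simplified SQuAD-style score)."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    common = set(out_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(out_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1_scorer("", "the cat sat", "the cat sat on the mat"))
```

Registering them is just `EvalHarness(model_fn, {"exact": exact_match_scorer, "f1": token_f1_scorer})`, which keeps scoring decoupled from model invocation.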
The evaluation harness above builds scoring and test management from scratch for pedagogical clarity. In production, use DeepEval (install: pip install deepeval), which provides 14+ built-in metrics, pytest integration, and CI/CD support:
# Production evaluation using DeepEval
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is Python?",
    actual_output=model_output,  # the response from the model under test
    retrieval_context=["Python is a programming language..."]
)
evaluate([test_case], [AnswerRelevancyMetric(), FaithfulnessMetric()])
For CLI-driven evaluation with side-by-side comparisons, see promptfoo (install: npx promptfoo@latest init), which supports YAML-based test definitions and model comparison tables.
The best evaluation harnesses separate three concerns: (1) test case management (what to evaluate), (2) model invocation (how to get outputs), and (3) scoring (how to measure quality). This separation makes it easy to swap models, add new metrics, or reuse test cases across experiments. Start simple, then extend as your evaluation needs grow.
Who: ML engineering team at a SaaS company deploying an LLM-powered customer support assistant
Situation: The team had fine-tuned a model for customer support but lacked confidence in its readiness for production. Stakeholders asked for metrics, but the team was unsure which evaluation approach to trust.
Problem: BLEU scores against reference answers were misleading because valid responses could be phrased in many ways. A single benchmark score did not capture the multiple dimensions that mattered: helpfulness, factual accuracy, tone appropriateness, and safety.
Dilemma: Human evaluation was accurate but too slow and expensive for continuous testing. Automated metrics were fast but each had known biases (BERTScore favored verbose responses; LLM-as-Judge showed self-preference).
Decision: The team built a multi-layer evaluation harness combining automated metrics for fast screening, LLM-as-Judge with bias mitigation for quality assessment, and periodic human evaluation for calibration.
How: They created 500 test cases spanning 10 categories. Automated checks ran on every commit (response format, length, keyword presence). Weekly LLM-as-Judge evaluation used randomized answer ordering and multi-judge consensus to mitigate position and verbosity bias. Monthly human evaluation on 50 samples calibrated the automated scores.
Result: The harness caught a regression where a model update improved helpfulness scores but degraded safety responses, something no single metric would have detected. Time to evaluate went from 2 days (pure human) to 15 minutes (automated) with monthly calibration taking 4 hours.
Lesson: No single evaluation metric tells the full story; a layered approach combining fast automated checks, bias-mitigated LLM judges, and periodic human calibration provides both speed and reliability.
Automated metrics (BLEU, ROUGE, exact match) catch regressions quickly but miss nuance. Pair them with periodic human evaluation on a rotating sample. Even reviewing 20 outputs per week gives you a ground-truth signal that metrics alone cannot provide.
- Perplexity measures prediction, not usefulness. Low perplexity is necessary but not sufficient for a good LLM. Instruction-tuned models may have higher perplexity than base models while being far more useful in practice.
- Reference-based metrics fail on open-ended tasks. BLEU and ROUGE penalize valid paraphrases. BERTScore captures semantics better but still cannot assess helpfulness, safety, or reasoning quality.
- LLM-as-Judge is powerful but biased. Account for position bias, verbosity bias, and self-preference bias through careful prompt design and evaluation protocol (order randomization, multi-judge consensus).
- Human evaluation requires rigorous protocol design. Define clear rubrics with concrete examples, run calibration rounds, and measure inter-annotator agreement before trusting human ratings.
- No single benchmark tells the full story. Evaluate across multiple dimensions (knowledge, reasoning, code, chat quality) and prefer dynamic benchmarks that resist contamination.
- Build evaluation harnesses early. Separating test cases, model invocation, and scoring into modular components makes your evaluation infrastructure reusable and extensible as your project evolves.
Open Questions in LLM Evaluation (2024-2026):
- Multi-dimensional evaluation: How can we build evaluation frameworks that simultaneously assess helpfulness, harmlessness, honesty, and instruction-following without collapsing these into a single score? Recent work on reward model ensembles (2024-2025) shows promise but faces scaling challenges.
- Evaluation of reasoning chains: As reasoning models (o1, DeepSeek-R1) become standard, evaluating the quality of intermediate reasoning steps, not just final answers, remains an open problem. Process reward models offer one approach, but calibrating them is difficult.
- Cross-lingual evaluation gaps: Most benchmarks are English-centric. Multilingual evaluation suites like MEGA (2024) and Global MMLU are expanding coverage, but evaluation quality for low-resource languages remains poor.
Explore Further: Try implementing a multi-judge evaluation pipeline where three different LLMs score the same outputs and you measure inter-judge agreement, comparing it against human annotator agreement on the same examples.
Exercises
Explain why perplexity scores cannot be directly compared between two models that use different tokenizers. What alternative metric avoids this problem?
Answer Sketch
Perplexity measures surprise per token, but different tokenizers split text into different numbers of tokens. A model with a larger vocabulary produces fewer tokens per sentence, so its per-token perplexity is computed over a different denominator. Bits-per-byte normalizes by the number of bytes in the original text, making it tokenizer-independent and enabling fair cross-model comparisons.
A model generates the sentence "The feline sat upon the mat" for the reference "The cat sat on the mat." Explain how BLEU and BERTScore would handle this case differently and which would give a higher score.
Answer Sketch
BLEU relies on exact n-gram overlap. "feline" and "upon" do not match "cat" and "on," so the unigram precision drops and higher-order n-gram matches are also lost. BERTScore computes cosine similarity between contextual embeddings, so "feline" and "cat" (as well as "upon" and "on") would have high embedding similarity. BERTScore would give a substantially higher score because it captures semantic equivalence rather than requiring lexical identity.
You use GPT-4 as a judge to evaluate outputs from GPT-4 and Claude. Identify at least three sources of bias in this setup and propose a mitigation for each.
Answer Sketch
(1) Self-enhancement bias: GPT-4 may prefer its own style. Mitigation: use a different judge model or multiple judges. (2) Position bias: the first response shown tends to be preferred. Mitigation: randomize presentation order or swap positions and average. (3) Verbosity bias: longer responses are often rated higher regardless of quality. Mitigation: include "prefer conciseness" in the rubric or normalize for length.
You are evaluating a customer support chatbot. Explain why MMLU would be a poor benchmark choice and propose three task-specific evaluation dimensions with example metrics for each.
Answer Sketch
MMLU tests broad factual knowledge through multiple-choice questions, which does not reflect customer support tasks (handling complaints, following policies, showing empathy). Better dimensions: (1) Accuracy: does the response correctly apply company policy? Metric: human-labeled correctness rate. (2) Helpfulness: does it resolve the user's issue? Metric: task completion rate. (3) Tone: is it empathetic and professional? Metric: LLM-as-judge rubric score on tone dimensions.
Write a Python function that takes a list of (generated_text, reference_text) pairs and computes ROUGE-L, BLEU, and a simple LLM-as-judge score (using an API call with a scoring rubric). Return a dictionary of aggregated metrics.
Answer Sketch
Use the rouge-score library for ROUGE-L and nltk.translate.bleu_score for BLEU. For LLM-as-judge, send each pair to an LLM with a rubric prompt asking for a 1-5 score, parse the integer from the response, and average across all pairs. Handle API errors gracefully and include the 95% confidence interval for each metric using bootstrap resampling.
What Comes Next
The next section, Section 29.2: Experimental Design & Statistical Rigor, shows how to design experiments and apply statistical tests so that your evaluation results are reliable and actionable.
Bibliography
Zheng, L., Chiang, W.L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685
Hendrycks, D., Burns, C., Basart, S., et al. (2021). "Measuring Massive Multitask Language Understanding (MMLU)." arXiv:2009.03300
Cobbe, K., Kosaraju, V., Bavarian, M., et al. (2021). "Training Verifiers to Solve Math Word Problems (GSM8K)." arXiv:2110.14168
Srivastava, A., Rastogi, A., Rao, A., et al. (2023). "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-Bench)." arXiv:2206.04615
Zhang, T., Kishore, V., Wu, F., et al. (2020). "BERTScore: Evaluating Text Generation with BERT." arXiv:1904.09675
