Building Conversational AI with LLMs and Agents
Appendix B: Machine Learning Essentials

B.4 Evaluation Metrics

Covered in Detail

For a comprehensive discussion of evaluation methodology, train/validation/test splits, and cross-validation, see Section 0.1: ML Basics: Features, Optimization & Generalization. For LLM-specific evaluation frameworks, see Section 29.1.

This page collects the most commonly referenced metrics in a single lookup location. For discussion of when and how to apply these metrics in practice, see the main text references above.

Classification Metrics

Classification Metrics Quick Reference
| Metric | Formula / Definition | Watch Out |
| --- | --- | --- |
| Accuracy | Fraction of correct predictions | Misleading on imbalanced data (a 99% "not spam" baseline looks strong) |
| Precision | TP / (TP + FP) | High precision = few false positives; says nothing about missed positives |
| Recall | TP / (TP + FN) | High recall = few false negatives; says nothing about false alarms |
| F1 Score | $2 \cdot (P \cdot R) / (P + R)$ | Harmonic mean; balances precision and recall |
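As a minimal illustration of the formulas above, the metrics can be computed directly from confusion-matrix counts in plain Python (the counts here are made-up example values):

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, fp, fn, tn = 90, 10, 30, 870  # true/false positives, false/true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.960
print(f"Precision: {precision:.3f}")  # 0.900
print(f"Recall:    {recall:.3f}")     # 0.750
print(f"F1:        {f1:.3f}")         # 0.818
```

Note how a high accuracy (0.960) coexists with a mediocre recall (0.750): the large "negative" class dominates the accuracy figure, which is exactly the imbalance pitfall flagged in the table.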

Language Generation Metrics

Language Generation Metrics Comparison
| Metric | Measures | Range | Used For |
| --- | --- | --- | --- |
| Perplexity | How surprised the model is by the data | 1 to ∞ (lower is better) | Language model evaluation |
| BLEU | N-gram overlap with reference text | 0 to 1, often reported ×100 (higher is better) | Machine translation |
| ROUGE | Recall-oriented n-gram overlap | 0 to 1 (higher is better) | Summarization |
| BERTScore | Semantic similarity via embeddings | 0 to 1 (higher is better) | General text generation |
| METEOR | Alignment with synonyms and stemming | 0 to 1 (higher is better) | Machine translation, captioning |
evaluate in Practice

Compute BLEU and ROUGE scores using Hugging Face's evaluate library.

# pip install evaluate rouge-score
import evaluate

bleu = evaluate.load("bleu")
result = bleu.compute(
    predictions=["The cat sat on the mat"],
    references=[["The cat is sitting on the mat"]]  # one list of references per prediction
)
print(f"BLEU: {result['bleu']:.3f}")

rouge = evaluate.load("rouge")
result = rouge.compute(
    predictions=["The cat sat on the mat"],
    references=["The cat is sitting on the mat"]
)
print(f"ROUGE-L: {result['rougeL']:.3f}")
Code Fragment 1: Computing BLEU and ROUGE scores with Hugging Face's evaluate library. BLEU measures n-gram precision between prediction and reference, while ROUGE measures recall-oriented overlap. Both return scores between 0 and 1.
BERTScore in Practice

Compute semantic similarity between generated and reference text using BERTScore.

# pip install evaluate bert-score
import evaluate

bertscore = evaluate.load("bertscore")
result = bertscore.compute(
    predictions=["The cat sat on the mat"],
    references=["The cat is sitting on the mat"],
    lang="en"
)
print(f"BERTScore F1: {result['f1'][0]:.3f}")
Code Fragment 2: BERTScore computes semantic similarity using contextual embeddings rather than surface-level n-gram overlap. The lang="en" parameter selects the appropriate pretrained model. The F1 score balances precision and recall of matched embedding tokens.
Key Insight: Automated Metrics Have Limits

BLEU and ROUGE measure surface-level text overlap, not meaning. A paraphrase that perfectly captures the intended meaning but uses different words will score poorly. This is why the field is increasingly moving toward LLM-as-judge evaluation (Chapter 25), where a powerful language model rates the quality of generated text. Human evaluation remains the gold standard for open-ended generation tasks, but it is expensive and slow.
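To see the surface-overlap problem concretely, here is a toy unigram-precision score, a simplified stand-in for BLEU-1 (no clipping, no brevity penalty), applied to a faithful paraphrase:

```python
def unigram_precision(prediction: str, reference: str) -> float:
    """Fraction of predicted tokens that appear in the reference
    (a toy stand-in for BLEU-1; real BLEU adds clipping and a brevity penalty)."""
    pred = prediction.lower().split()
    ref = set(reference.lower().split())
    return sum(tok in ref for tok in pred) / len(pred)

reference = "The cat sat on the mat"
print(unigram_precision("The cat sat on the mat", reference))       # 1.0 (exact match)
print(unigram_precision("A feline rested atop the rug", reference)) # ~0.167, despite identical meaning
```

Only "the" overlaps in the paraphrase, so the score collapses even though the meaning is preserved; embedding-based metrics such as BERTScore exist precisely to close this gap.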

Perplexity: A Deeper Look

Perplexity deserves special attention because it is the most commonly reported metric for language models. It is defined as:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(\mathrm{token}_i \mid \mathrm{context}_i)\right)$$

Intuitively, a perplexity of $k$ means the model is as uncertain as if it were choosing uniformly among $k$ options at each step. A model with perplexity 10 is much more confident (and presumably more accurate) than one with perplexity 100. Code Fragment B.4.1 below puts this into practice.
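As a quick sanity check on this intuition (a standalone sketch, not tied to any particular model), a model that assigns uniform probability $1/k$ to every token has perplexity exactly $k$:

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token probabilities: exp of the mean negative log-likelihood."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

k = 50
uniform = [1 / k] * 10  # ten tokens, each predicted with probability 1/50
print(perplexity(uniform))    # 50.0 (up to floating-point rounding)

confident = [0.9] * 10        # a model that assigns 0.9 to every correct token
print(perplexity(confident))  # ~1.111, i.e. nearly certain at each step
```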


# PyTorch implementation: tokenize, run a forward pass with labels,
# and exponentiate the cross-entropy loss to obtain perplexity
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

text = "The transformer architecture revolutionized natural language processing."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
    perplexity = torch.exp(outputs.loss)
    print(f"Perplexity: {perplexity.item():.2f}")
Perplexity: 52.47
Code Fragment B.4.1: Computing perplexity for GPT-2 on a sample sentence using PyTorch. The model's cross-entropy loss is exponentiated to yield perplexity (here 52.47), meaning the model is roughly as uncertain as choosing among 52 tokens at each step.
Fun Fact: The Perplexity Hall of Fame

GPT-2 (2019) achieved about 35 perplexity on WikiText-103. GPT-3 brought it below 20. Modern open models like LLaMA-3 push even lower. Each generation of models is, in a measurable sense, less surprised by human language. The trajectory suggests we are approaching fundamental limits of predictability in natural text.


What Comes Next

Continue to Appendix C: Python Libraries and Patterns for LLM Development for the next reference appendix in this collection.

References and Further Reading
Textbooks

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

The classic ML reference covering probabilistic models, optimization, and evaluation methodology.


Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.

Comprehensive treatment of supervised learning, regularization, and model selection. Free PDF available.

Metrics and Evaluation

Papineni, K. et al. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation." ACL 2002.

The original BLEU paper, still one of the most cited works in NLP evaluation.


Zhang, T. et al. (2020). "BERTScore: Evaluating Text Generation with BERT." ICLR 2020.

Embedding-based evaluation that better captures semantic similarity than n-gram methods.
