For a comprehensive discussion of evaluation methodology, train/validation/test splits, and cross-validation, see Section 0.1: ML Basics: Features, Optimization & Generalization. For LLM-specific evaluation frameworks, see Section 29.1.
This page collects the most commonly referenced metrics in a single lookup location. For discussion of when and how to apply these metrics in practice, see the main text references above.
Classification Metrics
| Metric | Formula / Definition | Watch Out |
|---|---|---|
| Accuracy | Fraction of correct predictions | Misleading on imbalanced data (99% "not spam" baseline) |
| Precision | TP / (TP + FP) | Ignores false negatives; can look high if the model predicts positive only when very confident |
| Recall | TP / (TP + FN) | Ignores false positives; trivially maximized by predicting everything positive |
| F1 Score | $2 \cdot (P \cdot R) / (P + R)$ | Harmonic mean of precision and recall; low if either is low |
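The table above can be made concrete with a short sketch in plain Python (the function name and the confusion-matrix counts are invented for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Imbalanced example: 990 negatives, 10 positives; the model finds 6 of the 10
acc, p, r, f1 = classification_metrics(tp=6, fp=2, fn=4, tn=988)
print(f"accuracy={acc:.3f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Note how accuracy comes out at 0.994 while recall is only 0.6: exactly the imbalanced-data trap flagged in the table.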
Language Generation Metrics
| Metric | Measures | Range | Used For |
|---|---|---|---|
| Perplexity | How surprised the model is by the data | 1 to ∞ (lower is better) | Language model evaluation |
| BLEU | N-gram overlap with reference text | 0 to 1, often reported ×100 (higher is better) | Machine translation |
| ROUGE | Recall-oriented n-gram overlap | 0 to 1 (higher is better) | Summarization |
| BERTScore | Semantic similarity via embeddings | 0 to 1 (higher is better) | General text generation |
| METEOR | Alignment with synonyms and stemming | 0 to 1 (higher is better) | Machine translation, captioning |
Compute BLEU and ROUGE scores using Hugging Face's evaluate library.
# pip install evaluate rouge-score
import evaluate
bleu = evaluate.load("bleu")
result = bleu.compute(
predictions=["The cat sat on the mat"],
references=[["The cat is sitting on the mat"]]
)
print(f"BLEU: {result['bleu']:.3f}")
rouge = evaluate.load("rouge")
result = rouge.compute(
predictions=["The cat sat on the mat"],
references=["The cat is sitting on the mat"]
)
print(f"ROUGE-L: {result['rougeL']:.3f}")
BLEU measures n-gram precision between prediction and reference, while ROUGE measures recall-oriented overlap; as computed by the evaluate library, both return scores between 0 and 1.
Compute semantic similarity between generated and reference text using BERTScore.
# pip install evaluate bert-score
import evaluate
bertscore = evaluate.load("bertscore")
result = bertscore.compute(
predictions=["The cat sat on the mat"],
references=["The cat is sitting on the mat"],
lang="en"
)
print(f"BERTScore F1: {result['f1'][0]:.3f}")
The lang="en" parameter selects the appropriate pretrained model. The F1 score balances precision and recall of matched embedding tokens.
BLEU and ROUGE measure surface-level text overlap, not meaning. A paraphrase that perfectly captures the intended meaning but uses different words will score poorly. This is why the field is increasingly moving toward LLM-as-judge evaluation (Chapter 25), where a powerful language model rates the quality of generated text. Human evaluation remains the gold standard for open-ended generation tasks, but it is expensive and slow.
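The surface-overlap problem is easy to demonstrate with a toy unigram-precision calculation (a deliberate simplification of BLEU; the helper name and example sentences are invented for illustration):

```python
def unigram_precision(prediction, reference):
    """Fraction of predicted tokens that also appear in the reference
    (real BLEU additionally clips counts and uses higher-order n-grams)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    matches = sum(1 for tok in pred_tokens if tok in ref_tokens)
    return matches / len(pred_tokens)

reference = "the cat is sitting on the mat"
literal = "the cat sat on the mat"          # near-copy of the reference
paraphrase = "a feline rests upon the rug"  # same meaning, different words

print(f"literal:    {unigram_precision(literal, reference):.3f}")
print(f"paraphrase: {unigram_precision(paraphrase, reference):.3f}")
```

The near-copy scores 5/6 while the equally valid paraphrase scores 1/6, even though both convey the same meaning.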
Perplexity: A Deeper Look
Perplexity deserves special attention because it is the most commonly reported metric for language models. For a sequence of $N$ tokens, it is defined as the exponentiated average negative log-likelihood:
$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$$
Intuitively, a perplexity of $k$ means the model is as uncertain as if it were choosing uniformly among $k$ options at each step. A model with perplexity 10 is much more confident (and presumably more accurate) than one with perplexity 100. Code Fragment B.4.1 below puts this into practice.
# Compute perplexity of a text under GPT-2 with Hugging Face Transformers
# Key operations: forward pass with labels, cross-entropy loss, exponentiation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
text = "The transformer architecture revolutionized natural language processing."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs, labels=inputs["input_ids"])
perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
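A quick sanity check of the "uniform among $k$ options" intuition: a model that assigns every token probability $1/k$ has perplexity exactly $k$. The sketch below verifies this with plain Python (the variable names are invented for illustration):

```python
import math

k = 10            # pretend vocabulary size
p = 1.0 / k       # uniform probability assigned to every token
n_tokens = 50     # length of the hypothetical evaluation text

# Cross-entropy per token is -log p; perplexity is its exponential
cross_entropy = -sum(math.log(p) for _ in range(n_tokens)) / n_tokens
perplexity = math.exp(cross_entropy)
print(perplexity)  # ≈ 10.0, i.e. exactly k
```

This is the same computation as `torch.exp(outputs.loss)` above, since the model's loss is the average cross-entropy over the token positions.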
GPT-2 (2019) achieved about 35 perplexity on WikiText-103. GPT-3 brought it below 20. Modern open models like LLaMA-3 push even lower. Each generation of models is, in a measurable sense, less surprised by human language, though the intrinsic entropy of natural text places a floor on how low perplexity can ultimately go.
Quick Reference Cheat Sheet
- Supervised learning = labeled data; self-supervised = labels from structure; RL = learn from rewards.
- Cross-entropy is the standard LLM loss function. AdamW is the standard optimizer.
- Fight overfitting with weight decay, early stopping, and good data practices.
- Use perplexity for language models, F1 for classification, ROUGE for summarization.
- Automated metrics are useful but imperfect. Human evaluation and LLM-as-judge fill the gap.
- Always check for data contamination before trusting benchmark results.
What Comes Next
Continue to Appendix C: Python Libraries and Patterns for LLM Development for the next reference appendix in this collection.
Further Reading
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
The classic ML reference covering probabilistic models, optimization, and evaluation methodology.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.
Comprehensive treatment of supervised learning, regularization, and model selection. Free PDF available.
Papineni, K. et al. (2002). "BLEU: a Method for Automatic Evaluation of Machine Translation." ACL 2002.
The original BLEU paper, still one of the most cited works in NLP evaluation.
Zhang, T. et al. (2020). "BERTScore: Evaluating Text Generation with BERT." ICLR 2020.
Embedding-based evaluation that better captures semantic similarity than n-gram methods.