Appendices
Appendix J: Datasets, Benchmarks, and Leaderboards

Evaluation Benchmarks

Benchmarks provide standardized measurements of model capabilities. No single benchmark tells the full story; each measures a specific aspect of intelligence. The following covers the most widely referenced benchmarks in LLM evaluation.

Knowledge and Reasoning Benchmarks

MMLU (Massive Multitask Language Understanding)

What It Measures: Broad knowledge across 57 subjects: STEM, humanities, social sciences, and professional domains (law, medicine, accounting). Tests factual recall and reasoning.
Format: Multiple-choice questions (4 options); 14,042 questions total
Evaluation: Accuracy (% correct). Reported as an overall average and per-subject scores. Few-shot (5-shot) is the standard evaluation protocol.
Access: Hugging Face (cais/mmlu); also available via lm-evaluation-harness
Limitations: Significant data contamination risk (many questions appear on the open web). Some questions have disputed correct answers. The multiple-choice format limits what it can measure. MMLU-Pro (a harder, 10-choice variant) was introduced to address ceiling effects.
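
The 5-shot protocol prepends five solved questions from the subject's dev split to the unsolved test question. A minimal sketch of that prompt construction (the helper names and the inline example data are illustrative, not drawn from the real dataset):

```python
# Illustrative sketch of standard MMLU few-shot prompt formatting.

CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer_idx=None):
    """Render one question in the usual MMLU multiple-choice layout.
    If answer_idx is given, the answer letter is filled in (a solved shot);
    otherwise the prompt ends at 'Answer:' for the model to complete."""
    lines = [question]
    for letter, option in zip(CHOICES, options):
        lines.append(f"{letter}. {option}")
    suffix = f" {CHOICES[answer_idx]}" if answer_idx is not None else ""
    lines.append("Answer:" + suffix)
    return "\n".join(lines)

def build_fewshot_prompt(dev_examples, test_question, test_options, subject):
    """Concatenate k solved dev examples before the unsolved test question."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    shots = "\n\n".join(
        format_example(ex["question"], ex["options"], ex["answer"])
        for ex in dev_examples
    )
    return header + shots + "\n\n" + format_example(test_question, test_options)
```

The model's next-token prediction after the final "Answer:" (A, B, C, or D) is compared against the gold letter to score the question.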

ARC (AI2 Reasoning Challenge)

What It Measures: Science reasoning at the grade-school level. The Challenge set (ARC-C) contains questions answered incorrectly by simple retrieval and co-occurrence methods, requiring genuine reasoning.
Format: Multiple-choice science questions; 7,787 questions (Easy + Challenge sets)
Evaluation: Accuracy on the Challenge set (ARC-C) is the standard reported metric
Access: Hugging Face (allenai/ai2_arc)
Limitations: Relatively small dataset. Grade-school level means frontier models now achieve near-perfect scores, reducing its discriminative power.

HellaSwag

What It Measures: Commonsense reasoning through sentence completion. Given the beginning of a scenario, the model must select the most plausible continuation from four options.
Format: Multiple-choice (4 options); ~10,000 questions
Evaluation: Accuracy. Designed to be easy for humans (~95%) but hard for models at the time of creation (2019).
Access: Hugging Face (Rowan/hellaswag)
Limitations: Frontier models now exceed 95% accuracy, approaching the ceiling. Adversarial filtering was applied during creation, which means the wrong answers are specifically designed to fool older models, not necessarily current ones.
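
Multiple-choice benchmarks like HellaSwag are typically scored by computing the model's log-likelihood of each candidate continuation and picking the highest, often normalized by continuation length (lm-evaluation-harness reports the normalized variant as acc_norm). A minimal sketch of the selection rule, assuming the per-option log-likelihoods have already been computed by the model:

```python
def pick_continuation(loglikelihoods, lengths):
    """Select the continuation with the highest per-token log-likelihood.
    loglikelihoods: summed token log-probs, one per candidate ending;
    lengths: token count of each candidate, used for length normalization
    so longer endings are not penalized merely for having more tokens."""
    scores = [ll / max(n, 1) for ll, n in zip(loglikelihoods, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Accuracy is then the fraction of examples where the selected index matches the gold continuation.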

TruthfulQA

What It Measures: Whether models generate truthful answers rather than reproducing common misconceptions. Questions are designed so that popular but incorrect answers are likely to be learned from web text.
Format: 817 questions spanning 38 categories (health, law, finance, conspiracies, etc.)
Evaluation: Percentage of responses judged truthful and informative (by a fine-tuned GPT judge or human evaluation)
Access: Hugging Face (truthful_qa)
Limitations: Small dataset size. Automated truthfulness evaluation via a GPT judge introduces its own biases. Some questions are culturally specific.

Mathematics Benchmarks

GSM8K (Grade School Math 8K)

What It Measures: Multi-step mathematical reasoning at the grade-school level. Problems require 2 to 8 sequential arithmetic operations with natural language reasoning.
Format: 1,319 test problems; each has a step-by-step solution and a final numerical answer
Evaluation: Accuracy (exact match on the final numerical answer). Chain-of-thought prompting is standard.
Access: Hugging Face (gsm8k)
Limitations: Frontier models now score above 95%, reaching the ceiling. High data contamination risk due to widespread internet availability. GSM-Plus and GSM-Hard extend the difficulty.
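
Reference solutions in GSM8K end with a "#### <answer>" line, so exact-match scoring reduces to extracting and normalizing that final number. A minimal sketch (the helper names are illustrative):

```python
import re

def extract_gsm8k_answer(solution_text):
    """GSM8K solutions end with '#### <answer>'; pull out that final number
    and strip thousands separators so '1,200' and '1200' compare equal."""
    match = re.search(r"####\s*([\-0-9.,]+)", solution_text)
    if match is None:
        return None
    return match.group(1).replace(",", "").strip()

def exact_match(model_text, reference_text):
    """Score one example: both answers must extract and be identical."""
    predicted = extract_gsm8k_answer(model_text)
    gold = extract_gsm8k_answer(reference_text)
    return predicted is not None and predicted == gold
```

The same extraction is applied to the model's chain-of-thought output, which is why evaluation prompts instruct the model to emit its final answer in the "#### " format.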

MATH

What It Measures: Competition-level mathematics spanning algebra, geometry, number theory, combinatorics, and probability. Significantly harder than GSM8K.
Format: 12,500 problems (5,000 test) with LaTeX solutions; difficulty levels 1 through 5
Evaluation: Accuracy (exact match on the final answer after normalization)
Access: Hugging Face (lighteval/MATH); original at hendrycks/math
Limitations: LaTeX formatting of answers makes exact matching tricky; equivalent expressions may be scored as incorrect. Contamination concerns exist for problems sourced from well-known competitions.
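
Because final answers are LaTeX strings, graders normalize them before comparing. The sketch below handles only a few trivial equivalences and is far less thorough than real graders such as the one shipped with hendrycks/math:

```python
def normalize_latex_answer(ans):
    """Very rough LaTeX answer normalization so trivially equivalent
    strings compare equal (illustrative, not a complete grader)."""
    ans = ans.strip().strip("$")                      # drop math-mode dollars
    ans = ans.replace(" ", "")                        # whitespace is insignificant
    ans = ans.replace("\\left", "").replace("\\right", "")
    ans = ans.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    if ans.startswith("\\frac") and not ans.startswith("\\frac{") and len(ans) >= 7:
        # \frac12 -> \frac{1}{2} for single-character numerator/denominator
        ans = "\\frac{" + ans[5] + "}{" + ans[6] + "}"
    return ans

def latex_exact_match(predicted, gold):
    return normalize_latex_answer(predicted) == normalize_latex_answer(gold)
```

Even with aggressive normalization, algebraically equivalent answers in different forms (e.g. a decimal versus a fraction) can still be scored incorrect, which is the limitation noted above.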

Code Generation Benchmarks

HumanEval / HumanEval+

What It Measures: Functional code generation: given a function signature and docstring, generate a correct Python implementation that passes all unit tests.
Format: 164 problems (HumanEval); HumanEval+ adds 80x more tests per problem to reduce false positives
Evaluation: pass@k: the probability that at least one of k generated solutions passes all tests. pass@1 is the most commonly reported metric.
Access: GitHub (openai/human-eval); HumanEval+ via evalplus
Limitations: Python only. Small dataset (164 problems). Many problems are simple algorithmic tasks. Significant contamination risk since the problems are widely known. SWE-bench is emerging as a more realistic alternative.
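
pass@k is estimated with the unbiased formula from the HumanEval paper: generate n >= k samples per problem, count the c samples that pass all tests, and compute 1 - C(n-c, k) / C(n, k):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the HumanEval paper.
    n: total samples generated for a problem
    c: number of those samples that pass all unit tests
    k: the k in pass@k (requires k <= n)"""
    if n - c < k:
        # fewer than k failing samples exist, so any k-subset
        # must contain at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is this estimate averaged over all 164 problems; sampling n > k and using the estimator gives much lower variance than literally drawing k samples per problem.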

SWE-bench / SWE-bench Verified

What It Measures: End-to-end software engineering: given a real GitHub issue from a popular Python repository, generate a patch that resolves the issue and passes the repository's existing test suite.
Format: SWE-bench contains 2,294 tasks drawn from 12 repositories (Django, Flask, scikit-learn, sympy, etc.). SWE-bench Verified is a human-validated 500-task subset with confirmed solvability and unambiguous test assertions.
Evaluation: Percentage of issues resolved: the generated patch must apply cleanly and pass all relevant tests. Unlike HumanEval's isolated functions, SWE-bench tasks require navigating multi-file repositories, understanding existing codebases, and writing patches that respect project conventions.
Access: swebench.com; GitHub (princeton-nlp/SWE-bench)
Limitations: Python only. Requires sandboxed execution environments for safe evaluation. Agent scaffolding (file navigation, tool use) significantly affects scores, making it hard to isolate model capability from system design. Cost per evaluation run is high due to repository setup and test execution.

Conversational and Preference Benchmarks

MT-Bench

What It Measures: Multi-turn conversational ability across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities).
Format: 80 two-turn questions; GPT-4 judges model responses on a 1-10 scale
Evaluation: Average GPT-4 score across all questions (max 10.0)
Access: GitHub (lm-sys/FastChat); part of the lmsys ecosystem
Limitations: Very small dataset (80 questions). GPT-4 as judge introduces bias toward GPT-4-like responses. Scores tend to cluster in a narrow range (6 to 9), making it hard to distinguish similar models. The 1-10 scale is subjective.

Chatbot Arena (LMSYS)

What It Measures: Overall model quality as perceived by real users. Users submit any prompt, receive anonymized responses from two random models, and vote for the better one.
Format: Pairwise human preference judgments, continuously collected via a public web interface
Evaluation: Elo rating (and Bradley-Terry model coefficients) derived from pairwise win rates. Higher Elo indicates stronger perceived quality.
Access: Live leaderboard at chat.lmsys.org; vote data periodically released
Limitations: Scores reflect the preferences of the user population, which skews toward tech-savvy English speakers. Prompt distribution is uncontrolled and evolves over time. Models that produce longer, more formatted responses may receive a "verbosity bias" advantage.
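
The leaderboard fits a Bradley-Terry model over the full vote history, but the classic online Elo update conveys the core idea of converting pairwise votes into ratings. A minimal sketch:

```python
def elo_update(rating_a, rating_b, outcome, k=32):
    """One Elo update from a single pairwise vote.
    outcome: 1.0 if model A wins, 0.0 if model B wins, 0.5 for a tie.
    k controls how far a single vote moves the ratings."""
    # expected score of A under the Elo logistic model
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (outcome - expected_a)
    return rating_a + delta, rating_b - delta
```

An upset (a low-rated model beating a high-rated one) moves both ratings much more than an expected win, which is what lets the ranking converge from noisy individual votes.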

lm-evaluation-harness in Practice

Run standardized benchmarks on any Hugging Face model from the command line.

# pip install lm-eval
# Evaluate a model on MMLU and HellaSwag
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
    --tasks mmlu,hellaswag \
    --num_fewshot 5 \
    --batch_size 8 \
    --output_path ./eval_results/
Code Fragment 1: Running lm-evaluation-harness from the command line on MMLU and HellaSwag with 5-shot prompting. The --model_args flag accepts any Hugging Face model ID, and results are saved as JSON to the output path for comparison across models.

Loading Benchmarks with datasets

Load individual benchmark datasets programmatically for custom evaluation.

# pip install datasets
from datasets import load_dataset

# Load GSM8K math benchmark
gsm8k = load_dataset("gsm8k", "main", split="test")
print(f"GSM8K test size: {len(gsm8k)}")
print(f"Sample question: {gsm8k[0]['question'][:100]}...")

# Load MMLU (a specific subject)
mmlu = load_dataset("cais/mmlu", "college_physics", split="test")
print(f"MMLU college_physics: {len(mmlu)} questions")
Code Fragment 2: Loading benchmark datasets programmatically with datasets. GSM8K is loaded as a single split, while MMLU requires specifying the subject as a configuration name. This approach enables custom evaluation pipelines beyond what the harness supports.