Benchmarks provide standardized measurements of model capabilities. No single benchmark tells the full story; each measures a specific aspect of intelligence. The following covers the most widely referenced benchmarks in LLM evaluation.
Knowledge and Reasoning Benchmarks
MMLU (Massive Multitask Language Understanding)
What It Measures
Broad knowledge across 57 subjects: STEM, humanities, social sciences, professional domains (law, medicine, accounting). Tests factual recall and reasoning.
Format
Multiple-choice questions (4 options), 14,042 questions total
Evaluation
Accuracy (% correct). Reported as overall average and per-subject scores. Few-shot (5-shot) is the standard evaluation protocol.
Limitations
Significant data contamination risk (many questions appear on the open web). Some questions have disputed correct answers. The multiple-choice format limits what it can measure. MMLU-Pro (a harder, 10-choice variant) was introduced to address ceiling effects.
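The 5-shot protocol prepends five solved examples to the target question so the model can infer the answer format. A minimal sketch of the prompt construction, assuming the cais/mmlu field names (question, choices, answer):

```python
# Sketch of 5-shot MMLU prompt construction. Field names assume the
# cais/mmlu schema: "question" (str), "choices" (list of 4 str),
# "answer" (int index into choices).
LETTERS = "ABCD"

def format_example(ex, include_answer=True):
    """Render one MMLU item as a question/choices/answer block."""
    lines = [ex["question"]]
    for letter, choice in zip(LETTERS, ex["choices"]):
        lines.append(f"{letter}. {choice}")
    answer = LETTERS[ex["answer"]] if include_answer else ""
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

def build_prompt(few_shot_examples, target):
    """5-shot prompt: five solved examples, then the unanswered target."""
    blocks = [format_example(ex) for ex in few_shot_examples[:5]]
    blocks.append(format_example(target, include_answer=False))
    return "\n\n".join(blocks)
```

The model's next-token prediction after the trailing "Answer:" is then compared against the gold letter.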
ARC (AI2 Reasoning Challenge)
What It Measures
Science reasoning at the grade-school level. The Challenge set (ARC-C) contains questions that were answered incorrectly by simple retrieval and co-occurrence methods, requiring genuine reasoning.
Limitations
Relatively small dataset. Grade-school level means frontier models now achieve near-perfect scores, reducing its discriminative power.
HellaSwag
What It Measures
Commonsense reasoning through sentence completion. Given the beginning of a scenario, the model must select the most plausible continuation from four options.
Format
Multiple-choice (4 options), ~10,000 questions
Evaluation
Accuracy. Designed to be easy for humans (~95%) but hard for models at the time of creation (2019).
Limitations
Frontier models now exceed 95% accuracy, approaching the ceiling. Adversarial filtering was applied during creation, which means the wrong answers are specifically designed to fool older models, not necessarily current ones.
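Multiple-choice benchmarks like HellaSwag are usually scored by likelihood rather than generation: the model assigns a log-probability to each candidate ending, normalized by length so longer endings are not penalized (lm-evaluation-harness reports this as acc_norm, using byte length; token count is used here for illustration):

```python
# Sketch of likelihood-based multiple-choice scoring. Each candidate
# ending has a total log-probability under the model; dividing by its
# length gives the per-unit score used for length-normalized accuracy.
# The numbers below are made up for illustration.

def pick_ending(logprobs, lengths):
    """Return the index of the ending with the best per-token log-prob."""
    normalized = [lp / n for lp, n in zip(logprobs, lengths)]
    return max(range(len(normalized)), key=normalized.__getitem__)

# Four candidate endings: (total log-prob, token count) for each.
choice = pick_ending([-12.0, -9.0, -20.0, -15.0], [6, 3, 8, 5])
```

Without normalization the second ending (total log-prob -9.0) would win; with it, the first ending's -2.0 per token is best.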
TruthfulQA
What It Measures
Whether models generate truthful answers rather than reproducing common misconceptions. Questions are designed so that popular but incorrect answers are likely to be learned from web text.
Format
817 questions; scored either in a free-form generation setting or as multiple choice (the MC1/MC2 variants)
Limitations
High data contamination risk: the questions and their intended answers are widely available online, so high scores may reflect memorized answers rather than improved truthfulness.
GSM8K
What It Measures
Grade-school math word problems requiring multi-step arithmetic reasoning expressed in natural language.
Format
~8,500 problems (1,319 test); each gold solution is a short worked explanation ending in a final numeric answer
Evaluation
Accuracy (exact match on the final numeric answer)
Limitations
Frontier models now score above 95%, reaching ceiling. High data contamination risk due to widespread internet availability. GSM-Plus and GSM-Hard extend the difficulty.
MATH
What It Measures
Competition-level mathematics spanning algebra, geometry, number theory, combinatorics, and probability. Significantly harder than GSM8K.
Format
12,500 problems (5,000 test) with LaTeX solutions; difficulty levels 1 through 5
Evaluation
Accuracy (exact match on the final answer after normalization)
Access
Hugging Face (lighteval/MATH); original at hendrycks/math
Limitations
LaTeX formatting of answers makes exact matching tricky; equivalent expressions may be scored as incorrect. Contamination concerns exist for problems sourced from well-known competitions.
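Exact-match scoring on MATH typically extracts the final answer from the last \boxed{...} expression and normalizes it before comparison. A minimal, illustrative normalizer (production evaluation code handles many more LaTeX equivalences than this sketch):

```python
# Sketch of MATH-style answer extraction and comparison. Real scorers
# normalize far more LaTeX variation (fractions vs. decimals, \dfrac,
# unit annotations, etc.); this shows only the basic mechanism.

def extract_boxed(solution: str) -> str:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = r"\boxed{"
    i = solution.rfind(marker)
    if i == -1:
        return ""
    depth, out = 1, []
    for ch in solution[i + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
    return "".join(out)

def normalize(ans: str) -> str:
    """Light normalization so trivially different forms compare equal."""
    ans = ans.replace(r"\left", "").replace(r"\right", "")
    return ans.replace("$", "").replace(" ", "")

def exact_match(pred_solution: str, gold: str) -> bool:
    return normalize(extract_boxed(pred_solution)) == normalize(gold)
```

Even with such normalization, genuinely equivalent expressions (e.g. 0.5 vs. \frac{1}{2}) are scored as different, which is exactly the limitation noted above.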
Code Generation Benchmarks
HumanEval / HumanEval+
What It Measures
Functional code generation: given a function signature and docstring, generate a correct Python implementation that passes all unit tests.
Format
164 problems (HumanEval); HumanEval+ adds 80x more tests per problem to reduce false positives
Evaluation
pass@k: probability that at least one of k generated solutions passes all tests. pass@1 is the most commonly reported metric.
Limitations
Python only. Small dataset (164 problems). Many problems are simple algorithmic tasks. Significant contamination risk since the problems are widely known. SWE-bench is emerging as a more realistic alternative.
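pass@k is normally computed with the unbiased estimator from the HumanEval paper: generate n ≥ k samples per problem, count the c samples that pass all tests, and estimate the probability that a random size-k subset contains at least one passing sample:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total samples generated for the problem
    c: number of samples that passed all unit tests
    k: evaluation budget
    Returns 1 - C(n - c, k) / C(n, k): the probability that at least
    one of k randomly drawn samples is correct.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-problem values are averaged across the benchmark; reporting pass@1 from n > 1 samples this way is lower-variance than generating a single sample per problem.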
SWE-bench / SWE-bench Verified
What It Measures
End-to-end software engineering: given a real GitHub issue from a popular Python repository, generate a patch that resolves the issue and passes the repository's existing test suite.
Format
SWE-bench contains 2,294 tasks drawn from 12 repositories (Django, Flask, scikit-learn, sympy, etc.). SWE-bench Verified is a human-validated 500-task subset with confirmed solvability and unambiguous test assertions.
Evaluation
Percentage of issues resolved: the generated patch must apply cleanly and pass all relevant tests. Unlike HumanEval's isolated functions, SWE-bench tasks require navigating multi-file repositories, understanding existing codebases, and writing patches that respect project conventions.
Limitations
Python only. Requires sandboxed execution environments for safe evaluation. Agent scaffolding (file navigation, tool use) significantly affects scores, making it hard to isolate model capability from system design. Cost per evaluation run is high due to repository setup and test execution.
Conversational and Preference Benchmarks
MT-Bench
What It Measures
Multi-turn conversational ability across 8 categories: writing, roleplay, extraction, reasoning, math, coding, knowledge (STEM), and knowledge (humanities).
Format
80 two-turn questions; GPT-4 judges model responses on a 1-10 scale
Evaluation
Average GPT-4 score across all questions (max 10.0)
Access
GitHub (lm-sys/FastChat); part of the lmsys ecosystem
Limitations
Very small dataset (80 questions). GPT-4 as judge introduces bias toward GPT-4-like responses. Scores tend to cluster in a narrow range (6 to 9), making it hard to distinguish similar models. The 1-10 scale is subjective.
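In the MT-Bench reference implementation, the judge is prompted to end its critique with a verdict like "Rating: [[8]]", which is then parsed out of the free-form response. A sketch of that parsing step (the regex and fallback behavior here are illustrative, not the exact reference code):

```python
import re

# Judge responses embed the score in a [[rating]] marker. This
# illustrative parser accepts integer or decimal ratings and returns
# None when no verdict marker is present (e.g. a truncated judgment).
RATING_RE = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")

def parse_judge_rating(judgment: str):
    """Extract the numeric score from a judge response, or None."""
    match = RATING_RE.search(judgment)
    return float(match.group(1)) if match else None
```

Runs where the judge omits the marker have to be retried or discarded, one more source of noise on top of the small question count.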
Chatbot Arena (LMSYS)
What It Measures
Overall model quality as perceived by real users. Users submit any prompt, receive anonymized responses from two random models, and vote for the better one.
Format
Pairwise human preference judgments, continuously collected via a public web interface
Evaluation
Elo rating (and Bradley-Terry model coefficients) derived from pairwise win rates. Higher Elo indicates stronger perceived quality.
Access
Live leaderboard at chat.lmsys.org; vote data periodically released
Limitations
Scores reflect the preferences of the user population, which skews toward tech-savvy English speakers. Prompt distribution is uncontrolled and evolves over time. Models that produce longer, more formatted responses may receive a "verbosity bias" advantage.
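An Elo rating is updated after each pairwise vote: the winner gains and the loser loses points in proportion to how surprising the outcome was. A minimal sketch of the update rule (the K-factor and starting rating are conventional chess-style choices, not Arena's exact parameters, and Arena's published rankings come from a Bradley-Terry fit over all votes rather than sequential updates):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one head-to-head vote.

    The winner's gain equals the loser's loss, scaled by how
    unexpected the result was given the current ratings.
    """
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta
```

Beating an equally rated opponent moves both ratings by K/2; upsetting a much stronger opponent moves them by nearly K.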
lm-evaluation-harness in Practice
Run standardized benchmarks on any Hugging Face model from the command line.
# pip install lm-eval
# Evaluate a model on MMLU and HellaSwag
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
--tasks mmlu,hellaswag \
--num_fewshot 5 \
--batch_size 8 \
--output_path ./eval_results/
Code Fragment 1: Running lm-evaluation-harness from the command line on MMLU and HellaSwag with 5-shot prompting. The --model_args flag accepts any Hugging Face model ID, and results are saved as JSON to the output path for comparison across models.
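Once a run completes, the harness writes a results JSON under the output path whose top-level "results" key maps each task to its metrics. A sketch for pulling out the numeric scores for cross-model comparison (the metric key names, e.g. "acc,none", and the file layout vary across harness versions, so treat this as an assumption to verify against your installed version):

```python
import json
from pathlib import Path

def summarize(results_file: str) -> dict:
    """Flatten a lm-evaluation-harness results file to {task: {metric: value}}.

    Assumes the top-level "results" key present in the harness's JSON
    output; non-numeric entries (aliases, version strings) are dropped.
    """
    data = json.loads(Path(results_file).read_text())
    return {
        task: {m: v for m, v in metrics.items() if isinstance(v, (int, float))}
        for task, metrics in data["results"].items()
    }
```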
Loading Benchmarks with datasets
Load individual benchmark datasets programmatically for custom evaluation.
# pip install datasets
from datasets import load_dataset
# Load GSM8K math benchmark
gsm8k = load_dataset("gsm8k", "main", split="test")
print(f"GSM8K test size: {len(gsm8k)}")
print(f"Sample question: {gsm8k[0]['question'][:100]}...")
# Load MMLU (a specific subject)
mmlu = load_dataset("cais/mmlu", "college_physics", split="test")
print(f"MMLU college_physics: {len(mmlu)} questions")
Code Fragment 2: Loading benchmark datasets programmatically with datasets. GSM8K is loaded as a single split, while MMLU requires specifying the subject as a configuration name. This approach enables custom evaluation pipelines beyond what the harness supports.
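GSM8K gold solutions end with a "#### <number>" marker, so a custom evaluation loop built on the loaded dataset typically extracts that final number for exact-match comparison:

```python
import re

def extract_gsm8k_answer(answer_text: str) -> str:
    """Pull the final numeric answer after GSM8K's '####' marker.

    Gold solutions end with e.g. '#### 42'; thousands separators
    ('1,234') are stripped so formatting differences don't miss.
    Returns '' when no marker is found.
    """
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", answer_text)
    return match.group(1).replace(",", "") if match else ""
```

The same extraction can be applied to model output if the prompt instructs the model to finish with the "####" convention, reducing scoring to a string comparison.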