Section 45.3: Datasets & Benchmarks

Big Picture

Choosing the right evaluation benchmarks is the difference between knowing how your model performs and pretending to know. The 2026 landscape splits into three buckets. Knowledge and reasoning benchmarks (MMLU, GPQA, AIME, HLE, ARC-AGI-2, FrontierMath) measure whether the model can recall facts and chain inferences. Capability and agentic benchmarks (BFCL, tau-bench, GAIA, WebArena, OSWorld, LMArena) measure whether it can call tools, navigate browsers, and produce outputs humans actually prefer. Safety and bias benchmarks (HarmBench, AdvBench, TruthfulQA, Anthropic Global Opinions) measure whether it refuses harmful requests and resists known jailbreaks. This section is the catalog: what each benchmark measures, who built it, when to use it, and which ones have already saturated. Pick a small portfolio across all three buckets; reporting only one bucket gives a misleading model card.

Eval datasets fall into three buckets: knowledge benchmarks, capability benchmarks, and safety / alignment benchmarks. The list below covers what most production teams test against in 2026.

45.3.1 Knowledge and reasoning

MMLU (Hendrycks et al., 2020) and MMLU-Pro (TIGER-Lab, 2024) are the canonical multitask knowledge benchmarks: MMLU covers 57 subjects from high school through professional level with multiple-choice questions; MMLU-Pro raises the bar with harder distractors, 10-way questions, and reasoning-heavy items. Their objective is to measure broad cross-domain knowledge with a single number, which mattered as the leaderboard everyone reported. The core concept is multiple-choice questions across subjects with accuracy as the metric. Pick MMLU for backward-compatible historical comparisons; pick MMLU-Pro for current evaluation because plain MMLU saturated above 90%.
GPQA (Rein et al., 2023) is the graduate-level Google-Proof QA benchmark: questions written by domain PhDs in biology, chemistry, and physics, designed to resist surface web search. Its objective is to evaluate genuinely hard reasoning where the model cannot lookup the answer, which matters when MMLU-saturated models still fail real expert problems. The core concept is "Google-proof" curation: PhD-validated questions whose answers are not findable via simple search. Pick GPQA for testing whether your model has real scientific reasoning; approaching saturation by mid-2026 frontier models.
AIME (American Invitational Mathematics Exam, 2024-2025) is the competition-math benchmark using AIME problems (15 problems per year, integer answers 0-999). Its objective is to test hard mathematical reasoning at high-school competition level, which matters because GSM8K saturated long ago. The core concept is fresh per-year problems (limiting contamination) plus integer-answer exact-match scoring. Pick AIME 2024 / 2025 for current math reasoning eval; expect contamination concerns for older years.
GSM8K (OpenAI, 2021) is the 8.5K grade-school math word problems dataset with step-by-step solutions. Its objective is to measure multi-step arithmetic reasoning that single-step heuristics cannot solve, which mattered as the canonical chain-of-thought benchmark. Pick GSM8K as a smoke test; expect contamination and saturation, so report it alongside MATH or AIME.
HumanEval (OpenAI, 2021) and SWE-bench (Princeton, 2023) are the canonical code benchmarks: HumanEval is 164 simple programming problems with unit tests; SWE-bench is full GitHub-issue resolution requiring multi-file edits. Their objective is to test code generation at progressively realistic scales, which matters because HumanEval saturated quickly and SWE-bench became the standard. Pick HumanEval for fast iteration (the model generates a function); pick SWE-bench Verified (the 500-task OpenAI-curated subset released August 2024 that removed mislabeled examples) for realistic end-to-end coding agents.
Humanity's Last Exam (HLE) (CAIS / Scale AI, January 2025) is the 2,500-question expert-curated benchmark spanning 100+ subjects, built specifically to remain hard for frontier models. Its objective is to be the MMLU successor for the post-saturation era, which matters because MMLU and GPQA-Diamond are now too easy. The core concept is questions from domain experts that even frontier models score 20-30% on as of mid-2025. Pick HLE as the modern hard-knowledge benchmark.
ARC-AGI-2 (Chollet et al., March 2025) is the second-generation abstract-reasoning benchmark released after o3 cracked ARC-AGI-1 in December 2024. Its objective is to test general fluid intelligence by visual pattern completion that resists memorization, which matters because pattern-matching benchmarks are easily gamed. The core concept is novel grid-puzzle reasoning that frontier models still score under 10% on as of mid-2025. Pick ARC-AGI-2 for AGI-frontier reasoning; not relevant for most production eval.
FrontierMath (Epoch AI, November 2024) is the expert-mathematician benchmark whose problems take trained mathematicians hours to solve. Its objective is to put a ceiling on mathematical reasoning above AIME and MATH, which matters because those are saturated. Pick FrontierMath for the math-frontier signal; frontier models score under 30% in 2025.

45.3.2 Capability and agentic

BFCL (UC Berkeley, 2023; v3 2024) is the Berkeley Function-Calling Leaderboard testing tool-call selection, parameter parsing, parallel calls, and error handling. Its objective is to evaluate function-calling mechanics specifically, which matters because tool-use failures cascade silently in agents. Pick BFCL for picking a function-calling-capable model.
tau-bench (Sierra Research, 2024) is the agentic-dialog benchmark across airline and retail customer-service domains with user-simulator interactions. Its objective is to test multi-turn agent behavior under realistic policies, which matters because BFCL's single-turn function calls miss real production complexity.
GAIA (Mialon et al., Hugging Face / Meta / AutoGPT, 2023) is the General AI Assistant benchmark with 466 real-world tasks requiring web browsing, multi-modal understanding, and tool use. Its objective is to test "what an AI assistant should do for a human" with tasks that humans solve at 92% but baseline agents at 15%, which matters as a holistic agent benchmark. Pick GAIA for browser-and-tool-using agents; the gap between frontier agent systems and humans remains substantial in 2025.
WebArena (Carnegie Mellon, 2023) and OSWorld (HKUST, 2024) are the realistic-environment agent benchmarks: WebArena runs in self-hosted reproductions of Reddit, GitLab, e-commerce, and content management; OSWorld runs in a real Ubuntu desktop with 369 tasks across office apps, browsers, and the file system. Their objective is to measure browser- and computer-use agents in environments that resemble real production targets. Pick WebArena for browser-only agents, OSWorld for full computer-use evaluation as deployed by Anthropic Computer Use and OpenAI Operator.
LMArena (LMSYS, 2023; rebranded from Chatbot Arena) is the public human-preference leaderboard where users submit prompts and vote on blind A/B model comparisons, with an Elo rating per model. Its objective is to capture real human preferences across diverse prompts, which matters because synthetic benchmarks correlate weakly with what users actually like. Pick LMArena Elo as your "vibes-based" cross-check on benchmark numbers; treat it cautiously because prompt distribution skews toward tech-user interests.
Artificial Analysis (Artificial Analysis, 2024) is the latency, cost, and quality dashboard tracking every major model on every major API. Its objective is to give you the cost-vs-quality-vs-latency tradeoff in real time, which matters for procurement decisions. The core concept is a third-party measurement service running standardized prompts against every provider. Pick Artificial Analysis to compare cost-per-token across providers and to see real latency numbers.

45.3.3 Safety and bias

HarmBench (Mazeika et al., 2024) is the harmful-output test suite covering 510 attack prompts across hate speech, weapons, illegal activity, malware, and bioweapons. Its objective is to standardize red-team evaluation across labs so safety claims are comparable, which matters because every lab previously used different attack sets. Pick HarmBench for cross-model safety comparison; for production, supplement with domain-specific harms relevant to your product.
Anthropic Global Opinions (Anthropic, 2023) is the survey-based bias benchmark that measures how closely model opinions align with different country populations on contested topics. Its objective is to make opinion alignment measurable and comparable, which matters because invisible biases shape what models will say. Pick it for measuring whose worldview your model defaults to; remediation requires careful steering.
AdvBench (Zou et al., 2023) is the adversarial-prompt benchmark of 520 harmful behaviors used to evaluate jailbreak robustness. Its objective is to standardize jailbreak evaluation so safety researchers can compare defenses, which matters when adversaries iterate on prompts faster than benchmarks update. Pick AdvBench as the cross-paper standard; supplement with GCG-style automated attacks for serious red-teaming.
TruthfulQA (Lin et al., 2021) is the truthfulness benchmark with 817 questions designed to elicit common misconceptions. Its objective is to measure how often models repeat plausible-but-false information, which matters because models trained on web text inherit many myths. Pick TruthfulQA for hallucination-resistance evaluation; treat the absolute numbers cautiously because the question selection biases toward easy-to-trick patterns.

45.3.4 Comparing the benchmarks

Table 45.3.1: 36.3.1 Benchmark coverage for production eval.

Benchmark	What it measures	Status in 2026
MMLU	General knowledge	Saturated
GPQA	Hard reasoning	Approaching saturation
SWE-bench Verified	Coding agent	Active
BFCL	Function calling	Active
tau-bench	Agentic dialog	Active
LMArena	Human preference	Live
HLE	Expert knowledge frontier	Active (2025)
ARC-AGI-2	Abstract reasoning frontier	Active (2025)
GAIA / WebArena / OSWorld	Browser and computer-use agents	Active

Warning: Public benchmarks are a starting point

Your production eval must include a domain-specific test set. Public benchmarks are necessary (for cross-team and cross-vendor comparison) but never sufficient (because your traffic is not the benchmark distribution).

What's Next?

In the next section, Section 45.4: Models, we build on the material covered here.

Further Reading

Eval Benchmarks

Hendrycks, D., Burns, C., Basart, S., et al. (2021). "Measuring Massive Multitask Language Understanding" (MMLU). ICLR 2021. arXiv:2009.03300. The canonical multi-domain knowledge benchmark; the LLM leaderboard standard for general capability.

Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. The foundational paper for MT-Bench and LLM-as-judge evaluation methodology.

Chiang, W.-L., Zheng, L., Sheng, Y., et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." ICML 2024. arXiv:2403.04132. Reference human-preference leaderboard.

Rein, D., Hou, B. L., Stickland, A. C., et al. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022. PhD-level science benchmark resistant to web search; the canonical reference for frontier-model reasoning evaluation in 2024 to 2026.

Liang, P., Bommasani, R., Lee, T., et al. (2023). "Holistic Evaluation of Language Models" (HELM). TMLR 2023. arXiv:2211.09110. Stanford's multi-metric framework that pioneered evaluating accuracy alongside calibration, robustness, and bias; the methodological reference for any production eval suite.

Srivastava, A., Rastogi, A., Rao, A., et al. (2023). "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models" (BIG-Bench). TMLR 2023. arXiv:2206.04615. The massive collaborative benchmark of 200+ tasks designed to probe capabilities beyond standard suites; the reference for measuring qualitatively diverse capability scaling.