Datasets & Benchmarks

Section 45.3
Big Picture

Choosing the right evaluation benchmarks is the difference between knowing how your model performs and pretending to know. The 2026 landscape splits into three buckets. Knowledge and reasoning benchmarks (MMLU, GPQA, AIME, HLE, ARC-AGI-2, FrontierMath) measure whether the model can recall facts and chain inferences. Capability and agentic benchmarks (BFCL, tau-bench, GAIA, WebArena, OSWorld, LMArena) measure whether it can call tools, operate browsers and desktop environments, and produce outputs humans actually prefer. Safety and bias benchmarks (HarmBench, AdvBench, TruthfulQA, Anthropic Global Opinions) measure whether it refuses harmful requests, resists known jailbreaks, avoids repeating common falsehoods, and whose opinions its answers reflect. This section is the catalog: what each benchmark measures, who built it, when to use it, and which ones have already saturated. Pick a small portfolio across all three buckets; a model card that reports only one bucket is misleading.

Evaluation datasets fall into three buckets: knowledge and reasoning benchmarks, capability and agentic benchmarks, and safety and bias benchmarks. The subsections below cover the benchmarks that production teams commonly test against in 2026.
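The advice to pick a portfolio across all three buckets can be checked mechanically. The following is a minimal Python sketch, not a standard tool: the `BUCKETS` mapping only restates the groupings named in this section, and `bucket_of` and `missing_buckets` are hypothetical helper names introduced for illustration.

```python
# Bucket membership as listed in this section. The dictionary layout and
# the function names below are illustrative assumptions, not a library API.
BUCKETS = {
    "knowledge_reasoning": {"MMLU", "GPQA", "AIME", "HLE", "ARC-AGI-2", "FrontierMath"},
    "capability_agentic": {"BFCL", "tau-bench", "GAIA", "WebArena", "OSWorld", "LMArena"},
    "safety_bias": {"HarmBench", "AdvBench", "TruthfulQA", "Anthropic Global Opinions"},
}


def bucket_of(benchmark):
    """Return the bucket a benchmark belongs to, or None if it is not listed."""
    for bucket, members in BUCKETS.items():
        if benchmark in members:
            return bucket
    return None


def missing_buckets(portfolio):
    """Return the sorted names of buckets that no benchmark in the portfolio covers."""
    covered = {bucket_of(name) for name in portfolio}
    return sorted(set(BUCKETS) - covered)


if __name__ == "__main__":
    one_sided = ["MMLU", "GPQA", "HLE"]
    balanced = ["GPQA", "BFCL", "tau-bench", "HarmBench"]
    print("one-sided portfolio is missing:", missing_buckets(one_sided))
    print("balanced portfolio is missing:", missing_buckets(balanced))
```

A check like this could run before a model card is published, flagging any report that draws on a single bucket.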

45.3.1 Knowledge and reasoning

45.3.2 Capability and agentic

45.3.3 Safety and bias

45.3.4 Comparing the benchmarks

Table 45.3.1: Benchmark coverage for production eval.

| Benchmark | What it measures | Status in 2026 |
| --- | --- | --- |
| MMLU | General knowledge | Saturated |
| GPQA | Hard reasoning | Approaching saturation |
| SWE-bench Verified | Coding agent | Active |
| BFCL | Function calling | Active |
| tau-bench | Agentic dialog | Active |
| LMArena | Human preference | Live |
| HLE | Expert knowledge frontier | Active (2025) |
| ARC-AGI-2 | Abstract reasoning frontier | Active (2025) |
| GAIA / WebArena / OSWorld | Browser and computer-use agents | Active |

Warning: Public benchmarks are a starting point

Your production eval must include a domain-specific test set. Public benchmarks are necessary (for cross-team and cross-vendor comparison) but never sufficient (because your traffic is not the benchmark distribution).
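One way to put a domain-specific test set next to public scores is sketched below in Python. This is an assumption-laden illustration rather than a prescribed harness: the example prompts, the stand-in model, and the helper names `exact_match_accuracy` and `wilson_interval` are all hypothetical. Exact match is only one possible scoring rule, and the Wilson score interval is included because domain sets tend to be small, so a bare accuracy number can overstate certainty.

```python
import math


def exact_match_accuracy(model_fn, examples):
    """Fraction of examples whose normalized model output equals the expected answer."""
    if not examples:
        raise ValueError("domain test set is empty")
    hits = sum(
        model_fn(ex["prompt"]).strip().lower() == ex["expected"].strip().lower()
        for ex in examples
    )
    return hits / len(examples)


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical domain examples; a real set would be sampled from production traffic.
    domain_set = [
        {"prompt": "Refund window for annual plans?", "expected": "30 days"},
        {"prompt": "Default API rate limit tier?", "expected": "standard"},
        {"prompt": "Supported export format?", "expected": "csv"},
        {"prompt": "Maximum seats on the team plan?", "expected": "25"},
    ]

    # Stand-in for a real model call: a canned lookup with one wrong answer.
    canned = {
        "Refund window for annual plans?": "30 days",
        "Default API rate limit tier?": "Standard",
        "Supported export format?": "json",
        "Maximum seats on the team plan?": "25",
    }

    accuracy = exact_match_accuracy(lambda prompt: canned[prompt], domain_set)
    low, high = wilson_interval(accuracy, len(domain_set))
    print(f"domain accuracy = {accuracy:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

With only a handful of examples the interval is very wide, which is a practical argument for growing the domain set before comparing it against public benchmark scores.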

What's Next?

In the next section, Section 45.4: Models, we build on the material covered here.

Further Reading

Eval Benchmarks

Hendrycks, D., Burns, C., Basart, S., et al. (2021). "Measuring Massive Multitask Language Understanding" (MMLU). ICLR 2021. arXiv:2009.03300. The canonical multi-domain knowledge benchmark; the LLM leaderboard standard for general capability.
Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. The foundational paper for MT-Bench and LLM-as-judge evaluation methodology.
Chiang, W.-L., Zheng, L., Sheng, Y., et al. (2024). "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference." ICML 2024. arXiv:2403.04132. The reference paper for the human-preference leaderboard now known as LMArena.
Rein, D., Hou, B. L., Stickland, A. C., et al. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022. PhD-level science benchmark resistant to web search; the canonical reference for frontier-model reasoning evaluation in 2024 to 2026.
Liang, P., Bommasani, R., Lee, T., et al. (2023). "Holistic Evaluation of Language Models" (HELM). TMLR 2023. arXiv:2211.09110. Stanford's multi-metric framework that pioneered evaluating accuracy alongside calibration, robustness, and bias; the methodological reference for any production eval suite.
Srivastava, A., Rastogi, A., Rao, A., et al. (2023). "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models" (BIG-Bench). TMLR 2023. arXiv:2206.04615. The massive collaborative benchmark of 200+ tasks designed to probe capabilities beyond standard suites; the reference for measuring qualitatively diverse capability scaling.