Datasets & Benchmarks

Section 78.3

"Benchmarks are the field's empirical anchor. Their leaderboards are the field's scoreboard. Their contamination is the field's recurring scandal."

EvalEval, Leaderboard-Skeptic AI Agent
Note: Learning Objectives
Big Picture

Benchmarks are the field's empirical anchor. A new model's leaderboard delta is the closest thing to a controlled experiment that the field has at scale. Knowing which benchmarks still discriminate frontier capability and which have already saturated answers most of the question "is this announcement real progress?". For LLM and agent practitioners, this matters because every model-selection, fine-tuning, or RAG-architecture decision rests on a benchmark claim. Reading the leaderboard the way a builder reads an evaluation suite is the difference between picking the right LLM for a task and chasing a saturated headline number.

Prerequisites

This section assumes the LLM evaluation methodology from Section 42.1 and the frontier-benchmark vocabulary from Section 77.1.

Frontier benchmark saturation timeline 2024 to 2028
Figure 78.3.1: Frontier benchmarks plotted by current top score (vertical) and expected saturation year (horizontal). Green-shaded benchmarks (MMLU at ~92%, GSM8K at ~95%) saturated in 2024 and no longer discriminate frontier models. Amber benchmarks (AIME 2024 at ~85%, GPQA-Diamond at ~75%) are within one to two years of the 90% saturation line. Blue benchmarks (SWE-bench Verified at ~70%, ARC-AGI-2 at ~60%, HLE at ~45%) currently discriminate frontier capability. FrontierMath at ~29% on Tier-4 is the deepest reserve, with research-grade math problems by Epoch AI projected to take five-plus years to saturate. The METR long-horizon time-doubling claim (the "Moore's law for agents" curve) puts agentic benchmarks on a separate, faster doubling trajectory.
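
The time-doubling claim in the caption describes an exponential trend: if the task length an agent can complete doubles every T months, a starting horizon h0 grows to h0 * 2^(t/T) after t months. The sketch below works through that arithmetic. The function names, the 30-minute starting horizon, and the 7-month doubling period are illustrative assumptions, not figures established in this section; swap in whatever doubling period the source you trust reports.

```python
import math


def projected_horizon_minutes(h0_minutes: float,
                              months_elapsed: float,
                              doubling_months: float) -> float:
    """Extrapolate a task-length horizon that doubles every `doubling_months`."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)


def months_to_reach(h0_minutes: float,
                    target_minutes: float,
                    doubling_months: float) -> float:
    """Invert the trend: months until the horizon reaches `target_minutes`."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


# Assumed inputs (hypothetical): 30-minute horizon today, 7-month doubling.
horizon_after_14_months = projected_horizon_minutes(30, 14, 7)
months_to_8_hours = months_to_reach(30, 8 * 60, 7)
```

Under these assumed inputs, two doublings take the horizon from 30 minutes to 2 hours, and an 8-hour horizon is four doublings away. The result is only as good as the doubling period fed into it, so it is best read as a sensitivity check rather than a forecast.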

78.3.1 Frontier reasoning benchmarks

78.3.2 Frontier agentic / long-horizon benchmarks

78.3.3 Comparing the frontier benchmarks

Table 78.3.1a: Frontier benchmarks (2026).

Benchmark         | Domain               | Frontier score | Saturation timeline
ARC-AGI           | Pattern reasoning    | ~75%           | Unclear
GPQA-Diamond      | Hard science         | ~75%           | 2-3 years
AIME 2024         | Competition math     | ~85%           | 1-2 years
FrontierMath      | Research math        | ~25%           | Likely 5+ years
METR long-horizon | Agentic time-horizon | ~30 min        | Unclear

Note: Saturation is the constant

Every frontier benchmark eventually saturates. The interesting metric is how fast: ARC-AGI was thought to be far from solved when it was introduced in 2019, then scores jumped suddenly in 2024. Treat benchmarks as snapshots, not as fixed milestones.
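
One way to treat a score as a snapshot is to track its headroom, the distance left to the saturation line, instead of the raw number. The sketch below is a minimal illustration using the percentage scores from Table 78.3.1a and the 90% saturation line from Figure 78.3.1. The helper names are hypothetical, and the METR row is left out because it is measured in minutes rather than percent.

```python
SATURATION_LINE = 90.0  # percent; the saturation line drawn in Figure 78.3.1

# Frontier scores in percent, copied from Table 78.3.1a.
frontier_scores = {
    "ARC-AGI": 75.0,
    "GPQA-Diamond": 75.0,
    "AIME 2024": 85.0,
    "FrontierMath": 25.0,
}


def headroom(score: float, line: float = SATURATION_LINE) -> float:
    """Percentage points left before the benchmark hits the saturation line."""
    return max(0.0, line - score)


def rank_by_headroom(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Return (benchmark, headroom) pairs, most headroom first."""
    pairs = [(name, headroom(score)) for name, score in scores.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_headroom(frontier_scores)
```

Headroom says how much room a benchmark has left to discriminate, not how quickly that room will close; the ARC-AGI history above is a reminder that the closing speed can change abruptly.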

Real-World Scenario
The SWE-bench Verified leaderboard, Q1-Q2 2026

SWE-bench Verified is a 500-task, human-validated subset of SWE-bench, the Princeton benchmark built from real GitHub issues. Through 2024 the best agent scored around 25%; by Q1 2025 Claude 3.5 Sonnet + Cursor hit 49%. In Q4 2025, Claude Opus 4.6 with an agentic harness crossed 70% on the Verified set; OpenAI's o3 + Operator hit 68% the same month. By Q2 2026 the top entries were within 3 points of each other, all in the 70-73% band.

What this taught the field: (1) the harness matters as much as the model (Cursor, Aider, OpenHands, Claude Code, and Operator all post different scores with the same underlying model); (2) the inference-time compute budget is now a primary parameter (high-thinking mode adds 5-15 points); and (3) the next-tier benchmark (SWE-Lancer, with longer tasks and monetary stakes) is what discriminates further. SWE-bench Verified remains the cleanest 2026 reference for where the agent-capability frontier actually sits in real engineering work.
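
A 3-point spread on a 500-task benchmark deserves a sampling-noise check before it is read as a ranking. The sketch below uses the normal approximation to the binomial: at 70% accuracy on 500 tasks the standard error of a single score is about 2 percentage points. The function names are hypothetical, and treating the two runs as independent samples is a simplifying assumption. Because both systems are scored on the same 500 tasks, a paired test such as McNemar's would be the tighter comparison.

```python
import math


def accuracy_standard_error(accuracy: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks tasks."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)


def gap_is_distinguishable(acc_a: float, acc_b: float,
                           n_tasks: int, z: float = 1.96) -> bool:
    """True if the gap exceeds z standard errors of the difference.

    Assumes the two scores are independent samples (a rough approximation).
    """
    se_gap = math.sqrt(accuracy_standard_error(acc_a, n_tasks) ** 2
                       + accuracy_standard_error(acc_b, n_tasks) ** 2)
    return abs(acc_a - acc_b) > z * se_gap


# The 70-73% band from the scenario, on the 500-task Verified set.
top_band_separable = gap_is_distinguishable(0.70, 0.73, 500)
# The earlier jump from about 25% to 49%.
early_jump_separable = gap_is_distinguishable(0.25, 0.49, 500)
```

Under this approximation the 70% versus 73% gap is not distinguishable from noise, while the 25% to 49% jump clearly is. That fits the scenario's point that harness and compute budget, rather than a few leaderboard points, are what to compare once the top entries cluster.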

Warning: saturated benchmarks are not solved problems

When MMLU saturated at around 90% in 2024, the news framing was "models match expert humans on graduate knowledge". That framing was wrong in two ways: (1) MMLU's remaining 10% includes questions the benchmark designers themselves cannot answer confidently, so 90% may be near the noise ceiling rather than the human ceiling; (2) the underlying competence varies widely within the 90%: models get easy and hard questions right via different mechanisms (pattern matching versus reasoning), and "90% MMLU" averages over both. The same caveat will apply to ARC-AGI-2, FrontierMath, and SWE-bench when they eventually saturate. Saturation means "this benchmark has stopped discriminating", not "this capability is mastered". Always read the unsaturated benchmarks first.
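
The averaging problem in point (2) is easy to see with a worked example. The sketch below computes an aggregate score as a question-weighted mean over difficulty strata. The 800/200 easy-to-hard split and both models' per-stratum accuracies are invented for illustration; they are not MMLU measurements.

```python
def aggregate_accuracy(strata: list[tuple[int, float]]) -> float:
    """Question-weighted mean accuracy over (n_questions, accuracy) strata."""
    total_questions = sum(n for n, _ in strata)
    total_correct = sum(n * acc for n, acc in strata)
    return total_correct / total_questions


# Hypothetical benchmark: 800 easy questions, 200 hard questions.
# Each entry is (n_questions, accuracy on that stratum).
model_a = [(800, 0.95), (200, 0.70)]   # stronger on the hard stratum
model_b = [(800, 0.975), (200, 0.60)]  # stronger on the easy stratum

score_a = aggregate_accuracy(model_a)
score_b = aggregate_accuracy(model_b)
hard_gap = model_a[1][1] - model_b[1][1]
```

Both hypothetical models land on the same 90% headline, yet they differ by 10 points on the hard stratum. Reporting per-stratum accuracy alongside the aggregate is one way to keep a saturated headline number from hiding that difference.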

Key Takeaways

What's Next?

In the next section, Section 78.4: Models, we build on the material covered here.

Further Reading
ARC-AGI / ARC Prize (Chollet et al., 2019-26).
GPQA-Diamond (Rein et al., 2023).
FrontierMath (Glazer et al., Epoch AI).
SWE-bench Verified (Princeton).
SWE-Lancer (OpenAI).