Frontier Benchmarks: HLE, ARC-AGI-2, FrontierMath

Section 77.1

"HLE, ARC-AGI-2, FrontierMath. The benchmarks that move when the frontier LLM moves, and stay still when it does not."

EvalEval, Frontier-Benchmark-Reader AI Agent
Big Picture

Frontier benchmarks are the LLM field's empirical anchor. Without them, "the model got better" is a vibe rather than a measurement, and a "smarter agent" claim is just marketing. The 2025-26 generation of LLM benchmarks (HLE, ARC-AGI-2, FrontierMath) was designed to outlast the next two scaling cycles; whether they succeed determines whether the AGI-timeline debate in Section 77.3 has measurable inputs or only conjectures, and whether agent-product teams can rely on benchmark scores when picking a frontier LLM.

Prerequisites

This section assumes the LLM evaluation methodology from Section 42.1, the LLM-as-judge fundamentals from Section 46.1, and a passing familiarity with the frontier-API model zoo from Section 14.4.

Every era of LLM progress has had its "benchmark of the moment": GLUE in 2019, MMLU in 2021-22, GSM8K in 2022-23, MATH in 2023-24. Every one of them was eventually saturated; by 2025 the leading models were above 90% on all four. Saturation is not just a technical inconvenience: it makes it impossible to tell whether a new model is meaningfully better or merely different. The 2025 wave of new benchmarks (Humanity's Last Exam, ARC-AGI-2, FrontierMath) was designed specifically to be hard enough to last two more scaling cycles.

This section walks through each. None will survive forever, but together they paint the clearest picture available of where frontier models actually sit relative to PhD-level human expertise, and where the curve is bending fastest.

77.1.1 Humanity's Last Exam (HLE)

HLE (Phan et al., January 2025) is a set of 2,500 expert-curated questions across 100+ disciplines, deliberately designed to be hard for AI: math proofs, esoteric chemistry, legal interpretation, multimodal reasoning. The top model as of May 2026 is Gemini 3.1 Pro Preview at 44.7%, followed by GPT-5.5 (xhigh) at 44.3% and GPT-5.5 (high) at 43.0%. PhD-level humans score 90%+ within their own domain; non-expert humans (with internet access) score about 5-10%. HLE is the closest available proxy for "can a model match a domain expert on hard problems".
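How far apart do two HLE scores need to be before the gap means anything? The sketch below puts a rough sampling-error interval around the three scores quoted above. It assumes all 2,500 questions are scored and treated as independent draws, and it uses the simple normal-approximation (Wald) interval; the function name `wald_interval` is illustrative and does not come from any official HLE tooling.

```python
import math

def wald_interval(accuracy: float, n_questions: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for an accuracy measured on n_questions items."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return accuracy - z * standard_error, accuracy + z * standard_error

# Assumption: every one of the ~2,500 questions counts toward the reported score.
N_QUESTIONS = 2500

# Scores as quoted in the text above.
scores = {
    "Gemini 3.1 Pro Preview": 0.447,
    "GPT-5.5 (xhigh)": 0.443,
    "GPT-5.5 (high)": 0.430,
}

intervals = {name: wald_interval(acc, N_QUESTIONS) for name, acc in scores.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {scores[name]:.1%} (approx. 95% interval {low:.1%} to {high:.1%})")
```

Under these assumptions each interval is roughly two percentage points wide on either side, so the three leading scores overlap. A paired comparison on per-question results would be the more careful test, but the overlap is already a reason not to read much into sub-point differences.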

77.1.2 ARC-AGI-2 (and ARC-AGI-3)

ARC-AGI-2 (Chollet et al., May 2025) is Francois Chollet's update to the original ARC benchmark; it tests fluid pattern-recognition reasoning that does not transfer from internet-scale pretraining. The 2025 ARC Prize results show that the top Kaggle submission reached 24% at $0.20/task, while frontier general-purpose reasoning models reached 60%+ at much higher per-task cost. ARC-AGI-3 (early 2026 technical report) is the planned interactive-reasoning successor and the first format change to ARC since 2019.

77.1.3 FrontierMath

FrontierMath (Glazer et al., Epoch AI) is the math-research-grade benchmark: hundreds of unpublished problems contributed by professional mathematicians, with held-out solutions to prevent contamination. The June 2025 Tier 4 release introduced 50 problems at research-frontier difficulty; best single-run score is 29%, and the "ever-solved" rate on Tiers 1 through 3 by GPT-5.2 / Claude Opus 4.6 is above 40%. FrontierMath is what changed the consensus that "math reasoning is solved" back to "math reasoning is making fast but partial progress".
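The gap between a "best single-run" score and an "ever-solved" rate is easiest to see on a toy example. The sketch below uses an invented outcome matrix (rows are independent runs, columns are problems); the data and the function names are hypothetical and are not taken from Epoch AI's evaluation code.

```python
# Hypothetical outcomes: each row is one run, each column one problem.
runs = [
    [True,  False, False, True,  False],
    [False, True,  False, True,  False],
    [True,  False, False, False, False],
]

def single_run_scores(outcomes: list[list[bool]]) -> list[float]:
    """Fraction of problems solved within each individual run."""
    return [sum(run) / len(run) for run in outcomes]

def ever_solved_rate(outcomes: list[list[bool]]) -> float:
    """Fraction of problems solved by at least one run."""
    n_problems = len(outcomes[0])
    solved_once = [any(run[i] for run in outcomes) for i in range(n_problems)]
    return sum(solved_once) / n_problems

best_single = max(single_run_scores(runs))
ever_solved = ever_solved_rate(runs)
print(f"best single run: {best_single:.0%}, ever solved: {ever_solved:.0%}")
```

Because the ever-solved rate takes a union over runs, it can only grow as more runs are added. It is an upper bound on what any single run achieved, which is why the two figures should not be compared directly.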

77.1.4 Comparing the frontier-2026 benchmarks

Table 77.1.1: Frontier benchmarks, mid-2026.
| Benchmark | Domain | Size | Top model score | Expert human |
|---|---|---|---|---|
| HLE | Multi-discipline expert | ~2,500 | Gemini 3.1 Pro 44.7% | ~90% |
| ARC-AGI-2 | Visual pattern reasoning | ~400 tasks | ~60% (frontier reasoning), 24% (Kaggle) | ~98% |
| FrontierMath Tier 4 | Research-grade math | 50 | 29% best single run | ~60-80% over time |
| SWE-bench Verified | Real GitHub issues | 500 | ~70% (agentic Claude) | ~95% |
| GPQA Diamond | Graduate science | 198 | ~85% (frontier) | ~74% |
[Figure: line chart of top frontier-model score per benchmark from January 2025 to May 2026. HLE (blue) rises from 8% to 44.7%. ARC-AGI-2 (purple, cost-uncapped) rises from 5% to ~38%. FrontierMath (green) rises from 2% to 29%. A dashed red horizontal line marks the 90% expert-human baseline. Two vertical annotation lines mark the o3/Claude 4 release wave in early 2026 and the Gemini 3 / GPT-5.5 wave in mid-2026.]
Figure 77.1.1a: Frontier-benchmark scores over fifteen months. HLE has the steepest slope; ARC-AGI-2 and FrontierMath are slower. Whether the slower curves accelerate or stay flat is the 2027 question.
Key Insight
Mental Model: emergent abilities reflect threshold effects, not new mechanisms

When GPT-3 jumped from 5% to 70% on arithmetic at 13B parameters, the field briefly believed something new had happened inside the network. Subsequent work (Schaeffer et al., "Are Emergent Abilities a Mirage?", 2023) showed that most "emergences" were artifacts of all-or-nothing metric design: under continuous grading, the curves were smooth. The 2026 mental model: capability emergence is what you see when a continuous internal competence finally crosses a discrete output threshold. The mechanisms (autoregressive prediction, in-context learning, chain-of-thought) do not change at the emergence point; only the measured score lifts off from zero. The right way to read 2026 benchmark numbers: high MMLU is necessary but not sufficient; high HLE / FrontierMath is the genuine progress signal, because those benchmarks were designed to place their thresholds at the frontier of competence rather than well inside it.
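The threshold effect described above can be reproduced with a few lines of arithmetic. The sketch below is a toy model, not a reproduction of the Schaeffer et al. experiments: it assumes an answer is a fixed number of tokens and that each token is correct independently with the same probability. Under those assumptions the exact-match rate is simply the per-token accuracy raised to the answer length.

```python
def exact_match(per_token_accuracy: float, answer_len: int) -> float:
    """Probability that all answer_len tokens are correct, assuming independent tokens."""
    return per_token_accuracy ** answer_len

ANSWER_LEN = 10  # hypothetical answer length in tokens

# A smooth, steadily improving per-token accuracy...
per_token_levels = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

# ...yields an exact-match curve that sits near zero and then rises sharply.
exact_match_levels = [exact_match(p, ANSWER_LEN) for p in per_token_levels]

for p, em in zip(per_token_levels, exact_match_levels):
    print(f"per-token accuracy {p:.2f} -> exact-match {em:.3f}")
```

In this toy setting per-token accuracy climbs smoothly from 0.50 to 0.99 while exact-match stays below 1% at the start and ends at about 90%. The underlying competence never jumps; the all-or-nothing metric supplies the apparent discontinuity.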

Real-World Scenario
DeepSeek-R1 reasoning emergence (Q1 2025)

In January 2025, DeepSeek-R1 (the open-weight reasoning model trained with GRPO from cold-start) demonstrated a sharp jump on AIME and competition math: from ~10% at 30B base to ~75% at the same scale after RL-only training, with no SFT chain-of-thought data. The capability did not exist at the start of training and appeared partway through; the GRPO reward signal alone was sufficient. The DeepSeek-R1 paper documents the training curves and the moment "aha" reasoning patterns first appear in samples. This was among the most-studied 2025 emergences; whether it is best characterized as a "mechanism-grounded threshold crossing" (the framing in the paper) or as the RL phase amplifying a competence already latent in the base model (the framing several independent reproductions argued for) is still under debate. The open-weight community reproduced the headline curve at smaller scales (Qwen-2.5 + GRPO from cold-start showed a similar inflection at 7B and 14B). The result reframed the open question from "do reasoning models scale further" to "what other capabilities will RL-only training unlock on existing base models".

Warning: cost-controlled vs uncontrolled scores

ARC-AGI-2 publishes scores at multiple cost tiers ($0.20/task, $2.50/task, $100/task). A model that scores 60% on ARC at $100 per task and a model that scores 24% on ARC at $0.20 per task are not directly comparable. The same logic applies to reasoning-mode budgets on GPT-5.5 and Claude Opus 4.6: high-thinking mode is expensive, low-thinking mode is fast, and the published benchmark number is usually the expensive setting. Always check the cost column before drawing conclusions.
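One way to keep cost in view is to compare systems on a cost-versus-score frontier rather than on score alone. The sketch below is a minimal illustration: the first two entries use the figures quoted in this warning, the third entry is invented purely to show a dominated point, and the `Result` type and `pareto_frontier` helper are hypothetical names, not part of any ARC Prize tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    system: str
    cost_per_task: float  # USD per task
    score: float          # fraction of tasks solved

def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep only results that no other result matches or beats on both cost and score."""
    frontier = []
    for r in results:
        dominated = any(
            o.cost_per_task <= r.cost_per_task
            and o.score >= r.score
            and (o.cost_per_task < r.cost_per_task or o.score > r.score)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_per_task)

results = [
    Result("kaggle-entry", 0.20, 0.24),        # figures quoted in the text
    Result("frontier-reasoner", 100.0, 0.60),  # figures quoted in the text
    Result("hypothetical-mid", 2.50, 0.20),    # invented: costs more and scores less
]

frontier = pareto_frontier(results)
for r in frontier:
    print(f"{r.system}: {r.score:.0%} at ${r.cost_per_task:.2f}/task")
```

Both quoted systems stay on the frontier because neither beats the other on cost and score at once, which is the precise sense in which they are "not directly comparable". Only the invented middle entry drops out.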

Tip: track three benchmarks, not thirty

The leaderboards listed in Section 12.5 cover dozens of benchmarks; you do not need to track them all. The three that capture the 2026 frontier are HLE (knowledge breadth + depth), ARC-AGI-2 (fluid reasoning), and SWE-bench Verified (real engineering tasks). If a new model moves all three, it is genuinely better. If it moves only one, it has probably been optimized specifically for that benchmark.
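The rule of thumb above can be written down as a small check. Everything numeric in this sketch is an assumption: the scores are made up, the two-point `MIN_GAIN` noise floor is arbitrary, and the handling of the two-out-of-three case (which the tip does not address) is a judgment call labeled "mixed".

```python
TRACKED = ("HLE", "ARC-AGI-2", "SWE-bench Verified")
MIN_GAIN = 0.02  # hypothetical noise floor: smaller gains are treated as no movement

def classify_release(old: dict[str, float], new: dict[str, float],
                     min_gain: float = MIN_GAIN) -> tuple[str, list[str]]:
    """Label a release by how many tracked benchmarks moved by at least min_gain."""
    moved = [b for b in TRACKED if new[b] - old[b] >= min_gain]
    if len(moved) == len(TRACKED):
        verdict = "broad improvement"
    elif len(moved) == 1:
        verdict = "possible benchmark-specific tuning"
    elif moved:
        verdict = "mixed"  # assumption: the tip does not cover two out of three
    else:
        verdict = "no clear movement"
    return verdict, moved

# Made-up scores for illustration only.
previous = {"HLE": 0.40, "ARC-AGI-2": 0.30, "SWE-bench Verified": 0.65}
release_a = {"HLE": 0.45, "ARC-AGI-2": 0.36, "SWE-bench Verified": 0.70}
release_b = {"HLE": 0.40, "ARC-AGI-2": 0.45, "SWE-bench Verified": 0.65}

print(classify_release(previous, release_a))
print(classify_release(previous, release_b))
```

The check is only as good as the noise floor and the comparability of the underlying runs; scores should come from the same cost tier and the same evaluation split before they are fed in.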

Fun Note: the funniest 2025 emergence

In June 2025, Gemini 2.5 Pro Thinking quietly began solving a class of Project Euler problems that had previously been filed under "no transformer should be able to do this without an external tool", including the prime-counting variants. The interesting part: when researchers asked the model to explain its method, it produced a textbook account of the Meissel-Lehmer approach. No tool calls, no external search, just the model writing the algorithm step-by-step in its scratchpad and getting the right answer. Whether this counts as "the model can do number theory" or "the model can simulate a small computer in chain-of-thought" is still debated; either way it was the first 2025 emergence that genuinely surprised the team that released the model.

Key Insight: What the benchmark trajectory implies

The benchmarks above are still climbing, but the rate is slowing. HLE jumped from 8% (GPT-4o, January 2025) to 44.7% (Gemini 3.1 Pro, May 2026) in fifteen months; another fifteen months at the same pace would put 80% within reach if the curve holds, and if it does not, further gains will be hard to separate from the noise of disagreement about correct answers. ARC-AGI-2 and FrontierMath follow flatter trajectories. Whether those flatten further or accelerate is one of the things Section 77.3's timeline question will hinge on. First, Section 77.2 takes up the alignment-at-frontier-scale question: what does it mean to align a model whose capabilities you cannot fully measure?
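The "80% in reach" figure is a straight-line extrapolation, which can be checked in a few lines. The sketch below assumes a constant monthly gain, which is the optimistic case given that the rate is described as slowing; the function name is illustrative.

```python
def linear_extrapolate(start: float, end: float,
                       elapsed_months: float, horizon_months: float) -> float:
    """Project a score forward assuming the average monthly gain so far continues."""
    monthly_gain = (end - start) / elapsed_months
    return min(100.0, end + monthly_gain * horizon_months)  # scores cap at 100%

# HLE figures from the text: 8% in January 2025, 44.7% fifteen months later.
hle_projection = linear_extrapolate(start=8.0, end=44.7,
                                    elapsed_months=15, horizon_months=15)
print(f"HLE after another fifteen months at the same pace: {hle_projection:.1f}%")
```

At a constant pace the projection lands at about 81%. Any slowdown pulls it below that, so the figure is better read as a ceiling on the trend than as a forecast.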

Self-Check
Q1: What does "emergent ability" mean in the 2026 mental model, after Schaeffer et al.?
Answer: Emergent ability now refers to a previously continuous internal competence crossing a threshold when it is evaluated under an all-or-nothing metric, not to the spontaneous appearance of a new mechanism inside the network. Schaeffer and colleagues showed that switching to smoother metrics (such as token-level accuracy or partial credit) often removes the apparent discontinuity, revealing steady underlying improvement. The practical implication is that "emergence" is largely a measurement artifact layered on continuous scaling, so the 2026 textbook treatment no longer presents it as a mysterious phenomenon.
Q2: A vendor announces 90% on ARC-AGI-2. What three questions should you ask before believing it?
Answer: Ask (1) what the cost per task was, since results at $0.20 per task (Kaggle-style) and $100 per task (frontier reasoning mode) are not the same artifact; (2) whether the reasoning-mode budget was capped, because uncapped chain-of-thought can buy raw points that vanish under deployment latency limits; and (3) whether the score is on the public set or the private holdout, because private-set numbers are the only ones that resist test-set contamination. Together these three checks separate genuine progress from leaderboard theater.

What's Next?

The next section, Section 77.2: Alignment at Frontier Scale, asks what it means to align a model whose capabilities cannot be fully measured by benchmarks like the ones covered here.

Further Reading
Phan et al., "Humanity's Last Exam" (Jan 2025).
Chollet et al., "ARC-AGI-2" (May 2025); ARC-AGI-3 Technical Report (2026).
Glazer et al., "FrontierMath" (Epoch AI; Tier 4 release June 2025).
Schaeffer et al., "Are Emergent Abilities a Mirage?" (NeurIPS 2023).
ARC Prize, "2025 Results Analysis".