Chapter 77: AGI Trajectories & Open Questions

"Where are we on the curve? The honest answer is that the curve has more dimensions than the question assumes."
Frontier, AGI-Forecasting AI Agent

Looking Back

Chapter 76 covered the theory we have. This chapter covers the theory we wish we had: forecasting capabilities, timelines, scaling-vs-data-vs-compute scenarios, the safety landscape ahead, and the open questions that decide whether the next decade is exciting or terrifying.

Big Picture

This chapter closes the book on the question every LLM textbook eventually has to address: when, if ever, do language models cross into something we would call general intelligence, and what happens to the world if they do? The 2025-26 evidence is unusually rich. Humanity's Last Exam is the first benchmark deliberately designed to outlast the next two scaling cycles; ARC-AGI-2 compressed the curve on which observers in 2024 expected a major frontier-AI question to be settled; and FrontierMath Tier 4 introduced 50 problems of which even leading 2026 reasoning models solve only about 29%. At the same time, Anthropic's labor-market study documented that 35.9% of U.S. workers had used generative AI by December 2025 and that 78.7% of measured AI interactions were augmentation rather than automation. Whether this is a slow-burn productivity story or a fast-burn displacement story is the question the next two years will answer.

Chapter Overview

This chapter is the engineering answer to the AGI question. It covers the frontier benchmarks that anchor empirical claims (HLE, ARC-AGI-2, FrontierMath), alignment at frontier scale and whether 2020s techniques scale, the AGI timeline spectrum from 2027 to 2033, the economic and labor-market implications, and what 2026 actually settled versus what remains open.

AGI debates went from "unfalsifiable speculation" to "measurable disagreements about specific benchmarks and timelines" between 2023 and 2026. This chapter is the practitioner's syllabus for engaging with them seriously.

Learning Objectives

Evaluate frontier benchmarks (HLE, ARC-AGI-2, FrontierMath) and what saturation does and does not imply.
Diagnose whether RLHF, DPO, and Constitutional AI scale to frontier-capability models.
Reason about the 2027 to 2033 AGI timeline spectrum and the cruxes that distinguish positions.
Apply labor-market data to economic implications of LLM capability growth.
Identify what 2026 settled empirically and what remains genuinely open.

Library Shortcut

The benchmarks in this chapter are public and almost all expose evaluation harnesses:

pip install lm-eval               # MMLU, GPQA, AI2 ARC (not ARC-AGI), etc. via lm-evaluation-harness
git clone https://github.com/centerforaisafety/hle  # Humanity's Last Exam
git clone https://github.com/arcprize/ARC-AGI-2     # ARC-AGI-2 task set

FrontierMath problems are held out and not publicly released, so the benchmark is not directly runnable; Epoch AI runs the official evaluation.
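
A practical caution when reading scores from small task sets such as the 50-problem FrontierMath Tier 4 mentioned in the Big Picture: the sampling uncertainty is large. The sketch below is illustrative only. It computes a Wilson score interval for a pass rate, and the helper name `wilson_interval` and the count of 14 solved problems (28%, chosen as a whole number close to the reported "roughly 29%") are assumptions made for the example, not figures from any official evaluation.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 gives an approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    p = correct / total
    denom = 1 + z**2 / total
    # Centre of the interval is pulled slightly toward 0.5 for small samples.
    centre = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical count: 14 of 50 problems solved (an assumption, see lead-in).
    low, high = wilson_interval(14, 50)
    print(f"pass rate 14/50 = {14 / 50:.0%}, ~95% interval: {low:.1%} to {high:.1%}")
```

With only 50 items the interval spans well over twenty percentage points, so a few points of difference between two models on a benchmark of this size is weak evidence on its own.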

Sections in This Chapter

Prerequisites

Frontier theory from Chapter 76
Agent foundations from Chapter 26
Reasoning models from Chapter 8

What Comes Next

This is the book's final main chapter. Chapter 78 wraps Part XII with the reading list, leaderboards, and community spaces where the open questions catalogued here will get answered first, before the capstone project sends you back to the agent you built earlier and asks what you would change today.

Further Reading

What AGI Means & How to Measure It

Morris, M. R., Sohl-Dickstein, J., Fiedel, N., Warkentin, T., Dafoe, A., Faust, A., et al. (2024). "Levels of AGI: Operationalizing Progress on the Path to AGI." arXiv preprint. arXiv:2311.02462. The DeepMind levels-of-AGI framework, the most-cited operational taxonomy for the AGI debate this chapter engages.

Chollet, F., Knoop, M., Kamradt, G., & Landers, B. (2024). "ARC Prize 2024: Technical Report." ARC Prize Foundation. arXiv:2412.04604. The 2024 ARC results, the AGI benchmark whose persistent difficulty is a central evidentiary point in any AGI-trajectories analysis.

Capabilities, Agency & Safety

Kinniment, M., Sato, L. J. K., Du, H., Goodrich, B., Hasin, M., Chan, L., et al. (2024). "Evaluating Language-Model Agents on Realistic Autonomous Tasks." METR Technical Report. arXiv:2312.11671. The METR autonomy evaluation, the empirical instrument for "agency thresholds" that frontier policy uses to ground claims about AGI-ish capability.

Hendrycks, D., Mazeika, M., & Woodside, T. (2023). "An Overview of Catastrophic AI Risks." arXiv preprint. arXiv:2306.12001. The reference catalogue of AGI-trajectory risks (malicious use, AI race, organizational risks, rogue AIs), the structure most policy and corporate-strategy documents now use.