Chapter 76: Frontier Theory & Cognition

Chapter opener illustration: Frontier Theory and Cognition.

"We built the machines that reason. We are still figuring out what reasoning is."
Frontier, Theoretically-Inclined AI Agent

Looking Back

Chapter 75 surveyed new architectures. This chapter looks deeper: what theory we have for why transformers learn, why they generalize, why scaling laws hold, and where mechanistic interpretability, in-context learning, and emergence sit in our scientific understanding.

Big Picture

The empirical results of 2024-2026, from Apple's "The Illusion of Thinking" paper to Anthropic's attribution-graph studies, force a question the field had been deferring: do we have a theory of what LLMs do, or only a growing list of things they can be made to do? Empirical scaling laws (Kaplan, then Chinchilla, then the 2024 updates) are descriptive, not explanatory. Mechanistic interpretability has gone from toy circuits in 2-layer transformers to multi-million-feature crosscoders on production models, and the picture it reveals is stranger than the textbook account of attention as soft lookup. This chapter pulls four threads that, taken together, are the closest thing the field has to a 2026 theory of cognition in LLMs: a formal account of reasoning, a theory of memory as a first-class computational primitive, mechanistic interpretability at production scale, and a working definition of agency that does not collapse into either "just prompt engineering" or "already AGI."
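To make "descriptive, not explanatory" concrete, here is the parametric loss form fitted in the Chinchilla paper (Hoffmann et al., 2022), where N is the parameter count, D is the number of training tokens, E is the irreducible loss, and A, B, \alpha, \beta are constants estimated from training runs:

```latex
\[
  L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
\]
```

The expression is a curve fit. It predicts loss inside the regime where the constants were estimated, but nothing in it says why the terms are power laws or how far the fit can be extrapolated.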

Chapter Overview

Frontier theory is where the engineering questions become research questions. This chapter walks through four topics: formal theories of reasoning (chain-of-thought as a computational primitive, process reward models, compositional reasoning limits), memory as a computational primitive (working vs long-term memory in transformer agents, external memory and Turing-completeness), mechanistic interpretability at scale (sparse autoencoders, circuit analysis, superposition, polysemanticity), and the nature of agency (when does a model become an agent, and how do you tell?).

These four topics are the research frontier most likely to reshape the engineering picture by 2030. This chapter is the practitioner's bridge from production work into the open theoretical questions.

Note: Learning Objectives

Explain formal theories of reasoning in LLMs, including process reward models and compositional limits.
Architect memory primitives (working, long-term, external) for transformer agents.
Apply sparse autoencoders and circuit analysis to a mechanistic interpretability problem.
Diagnose superposition and polysemanticity in feature directions.
Reason about when a model becomes an agent and what the boundary depends on.

Key Insight: Why this matters

None of the four sections below are settled science. The reason to read them anyway is that the practitioner who knows the open questions makes better engineering decisions than the practitioner who treats LLMs as black boxes. Reasoning theories tell you when chain-of-thought prompting will help and when it is theatre; memory primitives tell you why long-context models still fail at multi-step retrieval; interpretability research tells you which features are real and which are convenient post-hoc stories; agency definitions tell you what your "agent" product actually is and is not.

Prerequisites

Interpretability from Chapter 10
Pretraining and scaling laws from Chapter 6
Reasoning models from Chapter 8

Warning: Frontier Theory Is Provisional, Not Settled

Every claim in this chapter is a snapshot of a research field that is genuinely moving month-by-month. The superposition hypothesis, the scaling-law extrapolations, the agentic-capability definitions, and the agency thresholds in Section 76.4 are all working positions, not consensus. Read the cited papers, watch the next conference cycle, and recalibrate. Treating any one framework here as load-bearing for a production decision is the failure mode this chapter warns against most strongly.

Lab 76: Reproduce a superposition probe on an open SAE

Objective

Use a published sparse-autoencoder (SAE) checkpoint to extract features from a small open LLM, then probe whether the features behave as the superposition hypothesis predicts (sparsely active, semantically coherent, recoverable across prompts). The goal is to feel where mech-interp theory becomes empirical rather than rhetorical.
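Before loading a real checkpoint, it can help to see what "sparsely active" means operationally. The sketch below is a toy ReLU sparse autoencoder in NumPy with random weights, not a SAELens API call and not the architecture of any particular published checkpoint (released SAEs may use top-k, JumpReLU, or other variants). The names sae_encode and sparsity_report and all dimensions are illustrative assumptions.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Toy ReLU SAE encoder: f = ReLU((x - b_dec) @ W_enc + b_enc).

    x: (n_tokens, d_model) residual-stream activations.
    Returns (n_tokens, d_sae) non-negative feature activations.
    """
    return np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)

def sparsity_report(features):
    """Summarize how sparse the feature activations are."""
    active = features > 0
    return {
        # Average number of features firing per token (the "L0" of the code).
        "mean_l0": float(active.sum(axis=1).mean()),
        # Fraction of features that never fire on this batch.
        "frac_dead": float((~active.any(axis=0)).mean()),
    }

# Synthetic stand-ins for a model's residual stream and an SAE's weights.
rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 16, 64, 200
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.full(d_sae, -1.5)  # negative bias keeps most pre-activations below zero
b_dec = np.zeros(d_model)
x = rng.normal(size=(n_tokens, d_model))

f = sae_encode(x, W_enc, b_enc, b_dec)
report = sparsity_report(f)
print(report)
```

On a real checkpoint, the same two statistics give a quick sanity check: a mean L0 that is small relative to the dictionary size is what the superposition picture predicts, while a large fraction of dead features suggests the probe set is too narrow or the checkpoint is being applied at the wrong layer.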

Steps

Pull an open SAE checkpoint from SAELens trained on GPT-2 small or Pythia-70m.
Pick 3 to 5 candidate semantic concepts (e.g., "Python keyword", "negation", "capital city", "month name"). Generate 50 prompts per concept that should activate the concept and 50 controls that should not.
Run the SAE over the model's residual stream on each prompt and record per-feature activations.
For each concept, identify the top-5 features by activation gap (concept minus control). Inspect their max-activating examples and judge: is each feature monosemantic?
Report: how many of your 3 to 5 concepts have a clearly monosemantic feature, and how many of the activated features split across multiple senses (the polysemanticity failure mode).
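The activation-gap ranking in the steps above can be sketched as follows. This is a minimal illustration on synthetic data, assuming you have already reduced each prompt to one activation value per feature (for example, the maximum over token positions). The function name top_features_by_gap and the planted feature index are hypothetical.

```python
import numpy as np

def top_features_by_gap(concept_acts, control_acts, k=5):
    """Rank SAE features by mean activation on concept prompts minus controls.

    concept_acts, control_acts: (n_prompts, n_features) arrays holding one
    activation value per prompt per feature.
    Returns a list of (feature_id, gap) pairs, largest gap first.
    """
    gap = concept_acts.mean(axis=0) - control_acts.mean(axis=0)
    top = np.argsort(gap)[::-1][:k]
    return [(int(i), float(gap[i])) for i in top]

# Synthetic data: 50 concept prompts, 50 controls, 512 features.
rng = np.random.default_rng(0)
n_prompts, n_features = 50, 512
control = rng.exponential(0.05, size=(n_prompts, n_features))
concept = rng.exponential(0.05, size=(n_prompts, n_features))
concept[:, 42] += 3.0  # plant one feature that fires only on the concept

ranked = top_features_by_gap(concept, control)
for feature_id, gap in ranked:
    print(f"feature {feature_id:4d}  gap {gap:.3f}")
```

A large gap only says a feature separates your two prompt sets. Whether it is monosemantic still has to be judged from its max-activating examples, which is why the lab asks for both the table and the narrative paragraph.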

Expected Output

A short notebook with one table (concept x top-feature-id x activation gap), and one narrative paragraph per concept describing the max-activating-example pattern. Time: 4 to 6 hours. Difficulty: intermediate; CPU works for Pythia-70m, GPU helps for GPT-2 small.

What Comes Next

With the conceptual spine laid down, Chapter 77 grounds these theoretical questions in the hardware reality: where the megawatts and silicon are going, and how training-inference co-design is reshaping what counts as a frontier model.

Further Reading

Interpretability & Superposition

Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., et al. (2022). "Toy Models of Superposition." Anthropic / Transformer Circuits. arXiv:2209.10652. The Anthropic paper that established the superposition hypothesis, the conceptual frame for the entire modern mechanistic-interpretability program.

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic. Transformer Circuits. The reference paper for sparse-autoencoder feature extraction at production-LLM scale, the empirical state-of-the-art for the interpretability theory in this chapter.

Scaling Laws & Emergence

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. arXiv:2203.15556. The Chinchilla paper that corrected the data-vs-parameters tradeoff and remains the canonical scaling-law reference.

Schaeffer, R., Miranda, B., & Koyejo, S. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS. arXiv:2304.15004. The reference paper challenging the emergence claim by attributing it to metric choice, a central debate any chapter on frontier theory must engage with.