
"The question is no longer how big can we build it, but how long should we let it think."
Frontier, Patiently Pondering AI Agent
The model landscape from Chapter 7 (Modern LLM Landscape) includes a new category that didn't exist two years ago: reasoning models. o1, o3, DeepSeek-R1, and QwQ trade tokens for IQ: they "think out loud" before answering, and the extra computation buys real accuracy on hard problems. This chapter explains the paradigm (test-time compute), how these models are trained (RLVR, GRPO, PRMs), and when paying for thinking actually pays off.
Chapter Overview
For years, the recipe for better language models was straightforward: train bigger models on more data. The scaling laws of Kaplan et al. and Hoffmann et al. formalized this, showing smooth, predictable improvement as training compute increased. Then, in late 2024, a new paradigm arrived. OpenAI's o1 demonstrated that investing compute at inference time, letting a model "think longer" before answering, could match or surpass models trained with orders of magnitude more compute on hard reasoning problems. DeepSeek followed with R1, an open-weight reasoning model. Its precursor, R1-Zero, revealed how reinforcement learning alone, without any supervised chain-of-thought data, could teach a model to reason step by step.
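To make "investing compute at inference time" concrete, here is a minimal sketch of one simple test-time compute technique: sample several answers and take a majority vote (self-consistency). It is a toy simulation, not a description of how o1 or R1 work internally. The `sample_answer` function is a hypothetical stand-in for one sampled reasoning chain, and the 40% per-sample accuracy and the pool of wrong answers are assumptions chosen for illustration.

```python
import random
from collections import Counter

def sample_answer(rng, p_correct, correct="42", wrong_pool=("41", "43", "44", "45")):
    """Hypothetical stand-in for one sampled reasoning chain.

    Returns the correct answer with probability p_correct, otherwise one of
    several distinct wrong answers chosen uniformly at random.
    """
    if rng.random() < p_correct:
        return correct
    return rng.choice(wrong_pool)

def majority_vote(answers):
    """Return the most frequent answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]

def accuracy(n_samples, p_correct=0.4, trials=2000, seed=0):
    """Estimate how often majority voting over n_samples picks the correct answer."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = [sample_answer(rng, p_correct) for _ in range(n_samples)]
        hits += majority_vote(answers) == "42"
    return hits / trials

if __name__ == "__main__":
    # More samples per question means more inference compute per question.
    for n in (1, 5, 25):
        print(f"samples per question: {n:>2}  accuracy: {accuracy(n):.3f}")
```

In this toy setting accuracy rises with the number of samples because wrong answers scatter while correct answers agree. The simulation assumes samples are independent, which is optimistic: real model errors are often correlated, so real gains tend to flatten sooner.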
This chapter consolidates and expands on the reasoning model material introduced in Chapter 7 (Modern LLM Landscape), providing a dedicated, deep treatment of the test-time compute paradigm. We begin with the conceptual shift from train-time to test-time scaling (Section 8.1), then survey the major reasoning model architectures (Section 8.2). Section 8.3 dives into the training techniques that make reasoning models possible, including RLVR, GRPO, and process reward models. Section 8.4 provides practical guidance for prompting and deploying reasoning models in production. Section 8.5 addresses the compute-optimal inference problem and the benchmarks used to evaluate reasoning capabilities. Finally, Section 8.6 turns to formal, verifiable reasoning with proof assistants.
Recent breakthroughs show that LLMs can improve their outputs by "thinking longer" at inference time. Understanding chain-of-thought reasoning, test-time compute scaling, and verification strategies is increasingly central to building reliable AI systems, especially the agent architectures covered in Part VI.
Learning Objectives
- Explain the paradigm shift from train-time to test-time compute scaling and identify the conditions under which each strategy is preferable
- Compare the architectures and training methods of major reasoning models: OpenAI o1/o3/o4-mini, DeepSeek R1, Gemini 2.5, and QwQ
- Describe RLVR, GRPO, and their roles in training reasoning behavior without supervised chain-of-thought data
- Distinguish Process Reward Models (PRMs) from Outcome Reward Models (ORMs) and explain when each is appropriate
- Apply effective prompting strategies for reasoning models, including budget control and structured output extraction
- Evaluate the cost/benefit trade-offs of test-time compute using the compute-optimal inference framework
- Navigate reasoning model benchmarks (AIME, MATH-500, ARC-AGI, SWE-bench) and interpret results critically
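As a preview of the RLVR and GRPO objectives above, the sketch below shows the group-relative advantage step: sample a group of completions for one prompt, score each with a verifiable reward, and normalize the rewards within the group. This covers the advantage computation only. The full GRPO objective, with its clipped policy ratio and KL penalty, is left to Section 8.3. The exact-match `verifiable_reward`, the example completions, and the use of the population standard deviation with a small `eps` are assumptions made for illustration.

```python
import statistics

def verifiable_reward(model_answer, reference):
    """Hypothetical verifier: 1.0 for an exact match after stripping whitespace, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    No learned value function is needed; the group mean acts as the baseline.
    eps keeps the division finite when every completion gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Four sampled completions for one prompt whose reference answer is "12".
    completions = ["12", "15", "12 ", "7"]
    rewards = [verifiable_reward(c, "12") for c in completions]
    advantages = group_relative_advantages(rewards)
    for c, r, a in zip(completions, rewards, advantages):
        print(f"answer={c!r:6} reward={r:.1f} advantage={a:+.3f}")
```

Completions that beat the group average receive a positive advantage and are reinforced; those below it are pushed down. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.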
Prerequisites
- Chapter 3: Transformer architecture (attention mechanism, feed-forward layers)
- Chapter 4: Decoding strategies (greedy, beam search, sampling methods)
- Chapter 6: Pretraining, scaling laws, and compute-optimal training
- Chapter 7: Modern LLM landscape (recommended but not strictly required)
- Basic familiarity with reinforcement learning concepts (reward, policy, optimization)
Sections
- 8.1 Trading FLOPs for IQ: The Test-Time Compute Bet. Why does test-time compute matter? [Entry]
- 8.1a KV Cache Growth, PRMs vs ORMs & Exercises. Memory arithmetic for long thinking traces, process vs outcome reward models, and exercises on test-time compute design. [Intermediate]
- 8.2 Reasoning Model Architectures: o1, o3, R1, QwQ. A survey of reasoning architectures. [Entry]
- 8.3 Training Reasoning Models: RLVR, GRPO, PRM. How do you train a model to reason? [Intermediate]
- 8.4 Prompting and Using Reasoning Models. Reasoning models require different prompting strategies than standard models. [Intermediate]
- 8.5 Compute-Optimal Inference and Evaluation. Making test-time compute pay off. [Advanced]
- 8.6 Formal and Verifiable Reasoning with Proof Assistants. Formal theorem proving represents the gold standard for verifiable reasoning. [Advanced]
- 8.6a AlphaProof, Self-Play RL, and Evaluation for Formal Proving. AlphaProof's IMO silver medal, dataset extraction, the three RL paradigms that exploit a perfect verifier, and pass@k metrics. [Advanced]
What's Next?
Next: Chapter 9: Inference Optimization & Efficient Serving. Reasoning models burn 10× to 100× the tokens of a normal completion. That makes the next question a matter of survival: how do you serve those tokens fast enough and cheap enough to not bankrupt the product? Chapter 9 covers quantization (INT4, NF4, FP8), KV cache tricks, speculative decoding, and the serving stacks (vLLM, SGLang, TensorRT-LLM) that put it all together.
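To see why long thinking traces put pressure on serving, here is a rough sketch of the standard KV cache size calculation for a decoder-only transformer. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) is a hypothetical configuration chosen for illustration, not the specification of any model named in this chapter.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for seq_len tokens per sequence."""
    # Factor of 2: one key vector and one value vector per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch

# Hypothetical model shape, assumed only for this example.
HYPOTHETICAL_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for label, tokens in [("short answer", 1_000), ("long thinking trace", 100_000)]:
        gib = kv_cache_bytes(tokens, **HYPOTHETICAL_MODEL) / 2**30
        print(f"{label:>20}: {tokens:>7,} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumptions each token costs 128 KiB of cache, so a 100× longer trace needs 100× the memory for a single sequence: roughly 0.12 GiB grows to about 12 GiB. That linear growth is one reason the KV cache techniques in Chapter 9 matter for reasoning workloads.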