Part II: Understanding LLMs

Chapter 8: Reasoning Models & Test-Time Compute

"The question is no longer how big can we build it, but how long should we let it think."

— Frontier, Patiently Pondering AI Agent

Chapter Overview

For years, the recipe for better language models was straightforward: train bigger models on more data. The scaling laws of Kaplan et al. and Hoffmann et al. formalized this, showing smooth, predictable improvement as training compute increased. Then, in late 2024, a new paradigm arrived. OpenAI's o1 demonstrated that investing compute at inference time, letting a model "think longer" before answering, could match or surpass models trained with orders of magnitude more compute. DeepSeek followed with R1, an open-weight reasoning model whose R1-Zero variant revealed how reinforcement learning alone, without supervised chain-of-thought data, could teach a model to reason step by step.

This chapter consolidates and expands on the reasoning model material introduced in Section 07.3, providing a dedicated, deep treatment of the test-time compute paradigm. We begin with the conceptual shift from train-time to test-time scaling (Section 8.1), then survey the major reasoning model architectures (Section 8.2). Section 8.3 dives into the training techniques that make reasoning models possible, including RLVR, GRPO, and process reward models. Section 8.4 provides practical guidance for prompting and deploying reasoning models in production. Finally, Section 8.5 addresses the compute-optimal inference problem and the benchmarks used to evaluate reasoning capabilities.

Big Picture

Recent breakthroughs show that LLMs can improve their outputs by "thinking longer" at inference time. Understanding chain-of-thought reasoning, test-time compute scaling, and verification strategies is increasingly central to building reliable AI systems, especially the agent architectures covered in Part VI.
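One of the simplest test-time compute strategies mentioned above, sampling several reasoning chains and keeping the majority answer (self-consistency), can be sketched in a few lines. This is a hedged illustration, not a production implementation: `sample_chain` is a hypothetical stub standing in for a real LLM call, deterministic here so the example is reproducible.

```python
from collections import Counter

def sample_chain(prompt: str, i: int) -> str:
    """Stand-in for one sampled chain of thought from an LLM.

    Hypothetical stub: most chains reach the right answer, some do not.
    A real system would call a model API with a nonzero temperature.
    """
    return "42" if i % 4 != 0 else "41"  # 3 of every 4 chains agree

def self_consistency(prompt: str, n: int = 16) -> str:
    """Scale test-time compute: sample n reasoning chains, majority-vote the final answer."""
    answers = [sample_chain(prompt, i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # -> 42
```

The key design point is that quality improves by spending more inference-time samples (larger `n`), not by changing the model; verification strategies covered later in the chapter replace the majority vote with a learned verifier or reward model.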

What's Next?

In the next chapter, Chapter 09: Inference Optimization, we cover the quantization, batching, and serving techniques that make LLMs practical in production.