Section 8.1: Trading FLOPs for IQ: The Test-Time Compute Bet

A wise person does not give the same amount of thought to choosing lunch as to choosing a career. Neither should a language model.
Chinchilla, Wisely Allocating AI Agent

Big Picture

Why does test-time compute matter? Since 2020, the primary recipe for better language models has been to scale up training: more parameters, more data, more GPU hours. The Kaplan scaling laws (2020) and the Chinchilla correction (2022) formalized this, predicting performance as a smooth function of training compute. But by 2024, the frontier labs were running into practical ceilings: training runs costing hundreds of millions of dollars, energy consumption rivaling small cities, and data supplies approaching exhaustion. Test-time compute offers a complementary scaling axis. Instead of making the model bigger, you let it think longer on each problem. This chapter explores what that means, how it works, and when it is worth the cost. For the train-time scaling baseline, see Section 6.3.

Prerequisites

This section assumes familiarity with scaling laws from Section 6.3 (Kaplan and Chinchilla scaling) and basic decoding strategies from Section 4.2 (sampling, temperature). No prior knowledge of reinforcement learning is required.

Figure 8.1.1: Reasoning models allocate more thinking time to harder problems, just as skilled players deliberate longer on critical moves. (Illustration: two chess players side by side, one making instant reflexive moves while the other pauses to examine a thought bubble of branching move sequences before carefully placing a piece.)

8.1.1 Two Kinds of Compute

Key Insight

Why: chain-of-thought is a computational extension, not just a prompting trick

Each generated reasoning token expands the model's effective compute by one forward pass without adding any parameters. The model is using its context window as scratch space, which means it can perform multi-step computations that cannot be carried out in a single forward pass, because the depth of the network is fixed. This is the result of Merrill and Sabharwal (2024), "The Expressive Power of Transformers with Chain of Thought": with T CoT tokens, a transformer can simulate a Turing machine for T steps. Without CoT, transformers are confined to a shallow uniform circuit class ($\mathsf{TC}^0$). CoT is not just a prompting trick; it is a computational extension that turns a fixed-depth network into a variable-depth one.

Every language model consumes compute in two distinct phases. Understanding the distinction is essential for grasping why reasoning models represent a paradigm shift.

Train-time compute is the total processing power consumed during pretraining and post-training. This includes the GPU hours spent on next-token prediction over trillions of tokens, followed by supervised fine-tuning and alignment (see Chapter 18). Train-time compute is a one-time investment: once the model weights are set, the cost is amortized across every future inference request. Larger models with more parameters require proportionally more training compute, but serve every user with the same fixed set of weights.

Test-time compute (also called inference-time compute) is the processing power consumed each time the model generates a response. In a standard LLM, this cost is approximately proportional to the number of output tokens: each token requires one forward pass through the network. In a reasoning model, the cost per query can vary enormously. A simple factual question might consume 50 tokens of thinking, while a hard mathematical proof might consume 10,000 or more.

Key Insight

The fundamental innovation of reasoning models is adaptive compute allocation. A standard LLM spends roughly the same amount of computation per output token regardless of problem difficulty. A reasoning model allocates more "thinking tokens" to harder problems and fewer to easy ones. This mirrors how humans operate: you do not spend ten minutes deliberating over what to eat for breakfast, but you might spend hours reasoning through a complex legal contract.

8.1.1.1 The Classical Scaling Regime

The Kaplan scaling laws (2020) established that language model loss decreases as a power law in three variables: number of parameters $N$, dataset size $D$, and total training compute $C$. The Chinchilla correction (Hoffmann et al., 2022) refined this by showing that models should scale parameters and data equally: a model with $N$ parameters should be trained on approximately $20N$ tokens for compute-optimal training.

These laws predict diminishing returns: doubling training compute yields a roughly 10% reduction in loss. To cut loss in half, you need to increase compute by roughly 100x. At the frontier, where training runs already cost $100M+, this creates severe economic pressure. (For the full derivation and historical context, see Section 6.3.)
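The two figures in this paragraph are consistent with each other, which a few lines of arithmetic can confirm. The sketch below assumes a pure power law $L \propto C^{-\alpha}$ with no irreducible loss term (a simplification) and takes the 10% per doubling figure as given; the function names are illustrative only.

```python
import math

def exponent_from_doubling(reduction_per_doubling: float) -> float:
    """Power-law exponent alpha implied by a fractional loss drop per 2x compute."""
    # L(2C) / L(C) = 2 ** (-alpha) = 1 - reduction
    return -math.log2(1.0 - reduction_per_doubling)

def compute_multiplier_for(loss_ratio: float, alpha: float) -> float:
    """Factor by which compute must grow to scale the loss by loss_ratio (< 1)."""
    # loss_ratio = k ** (-alpha)  =>  k = loss_ratio ** (-1 / alpha)
    return loss_ratio ** (-1.0 / alpha)

alpha = exponent_from_doubling(0.10)
multiplier = compute_multiplier_for(0.5, alpha)
print(f"alpha = {alpha:.3f}, compute needed to halve loss = {multiplier:.0f}x")
```

Under these assumptions, a 10% drop per doubling means between six and seven doublings are needed to halve the loss, which is where the roughly 100x figure comes from.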

8.1.1.2 The Test-Time Scaling Regime

Test-time compute scaling offers a fundamentally different trade-off. Instead of spending more at training to improve the model for all queries equally, you spend more at inference to improve the model's response to a specific query. This has several important properties:

Per-query cost: Unlike training, which is amortized, test-time compute is a direct per-query cost. A reasoning model that generates 5,000 thinking tokens costs roughly 50x more than a standard model generating 100 tokens.
Adaptive allocation: The compute can be allocated proportional to difficulty. Easy queries receive minimal thinking; hard queries receive extensive deliberation. This means the average cost per query can be much lower than the worst case.
Complementary to train-time scaling: A model can be both well-trained and equipped with test-time reasoning. The two forms of scaling are complements, not substitutes. A reasoning model trained with more compute will reason better than a smaller one.
Immediate deployment: Improving test-time scaling does not require retraining the model. Techniques like best-of-N sampling or tree search can be applied to any model at serving time (though purpose-trained reasoning models perform best).

Under the Hood: Budget forcing

Budget forcing (Muennighoff et al., 's1: Simple test-time scaling', 2025, arXiv:2501.19393) controls a reasoning model's thinking length purely at decode time. To force MORE thinking, the decoder suppresses the end-of-thinking delimiter whenever the model tries to stop and appends the token 'Wait', which makes the model second-guess and extend its chain; to force LESS, it injects the end-of-thinking delimiter once a token budget is hit. Trained on only ~1,000 curated reasoning traces, s1-32B rode this trick to a near-linear accuracy-versus-thinking-tokens curve, matching far larger systems on competition math. It is the open counterpart to API budget parameters: same goal, but implemented in the sampling loop rather than behind a vendor flag.
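The decode-time control described above can be sketched as a small loop. This is a toy illustration, not the s1 implementation: `next_token` is a hypothetical stand-in for a model's sampling step, `</think>` is an assumed end-of-thinking delimiter, and the minimum-token rule is a simplification chosen for clarity.

```python
from typing import Callable, List

END_THINK = "</think>"  # assumed end-of-thinking delimiter
NUDGE = "Wait"          # token appended to make the model keep thinking

def budget_forced_decode(
    next_token: Callable[[List[str]], str],
    min_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Return a thinking trace whose length is held between two budgets."""
    trace: List[str] = []
    while True:
        if len(trace) >= max_tokens:
            # Force LESS thinking: inject the delimiter once the budget is hit.
            trace.append(END_THINK)
            return trace
        token = next_token(trace)
        if token == END_THINK:
            if len(trace) < min_tokens:
                # Force MORE thinking: suppress the stop and nudge the model.
                trace.append(NUDGE)
                continue
            trace.append(END_THINK)
            return trace
        trace.append(token)

def toy_model(trace: List[str]) -> str:
    """Hypothetical model stub: tries to stop whenever len(trace) % 4 == 3."""
    return END_THINK if len(trace) % 4 == 3 else f"step{len(trace)}"

long_trace = budget_forced_decode(toy_model, min_tokens=6, max_tokens=10)
short_trace = budget_forced_decode(toy_model, min_tokens=0, max_tokens=2)
print(long_trace)
print(short_trace)
```

In the first call the stub tries to stop after three tokens, the stop is suppressed, and "Wait" is appended instead. In the second call the delimiter is injected as soon as the two-token budget is reached.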

Figure 8.1.2: Performance as a function of total compute under three scaling strategies. On hard reasoning tasks, test-time scaling (green) achieves higher accuracy per FLOP than simply using a larger model (purple). The combined approach (red dashed) pushes the frontier further.

8.1.2 The Snell et al. Framework

The foundational paper for understanding test-time compute is Snell et al. (2024), "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." This work provides the theoretical and empirical framework for answering a simple question: given a fixed compute budget for a single inference query, how should you allocate that budget between model size and test-time reasoning?

8.1.2.1 The Compute-Optimal Inference Problem

Consider a scenario where you have a fixed inference compute budget $C$ FLOPs per query. You can spend this budget in several ways:

Single pass through a large model: Use all $C$ FLOPs for a single generation from the largest model that fits within the budget. This is the classical approach.
Multiple passes through a smaller model: Use a model that requires $C/N$ FLOPs per pass and generate $N$ independent solutions, then select the best one using a reward model or verifier.
Extended chain-of-thought with a medium model: Use a model that generates a long reasoning trace, spending extra tokens to "think through" the problem before committing to an answer.
Tree search with a small model: Use a small model to explore a tree of partial solutions, using a reward model to guide the search toward promising branches.

Snell et al. show that the optimal strategy depends critically on task difficulty. Their key finding can be summarized in one sentence: on problems that are hard but still within reach of a smaller base model (it attains a non-trivial success rate, though below 50%), test-time compute scaling with that smaller model can match or exceed a 14x larger model in a FLOPs-matched comparison.

8.1.2.2 Difficulty-Dependent Optimality

Table 8.1.1a: Difficulty-Dependent Optimality Comparison (as of 2026).

Task Difficulty	Base Model Accuracy	Optimal Strategy	Rationale
Easy	>90%	Single pass, large model	Most samples are already correct; extra compute is wasted
Medium	50% to 90%	Best-of-N (N=8 to 32)	Sampling diversity finds the right answer; reward model selects it
Hard	10% to 50%	Extended CoT or tree search	Structured reasoning explores the solution space more efficiently than independent samples
Beyond capability	<1%	Larger model required	If the model lacks necessary knowledge, no amount of thinking helps

Fun Fact

The "think longer vs. be bigger" trade-off has a nice real-world analogy. Imagine you are trying to solve a crossword puzzle. If you are a native English speaker, a hard crossword benefits from thinking longer: more time lets you find connections between clues. But if the crossword is in Japanese and you do not speak Japanese, no amount of extra time will help. You need a fundamentally different skill set (a bigger model, metaphorically speaking). Reasoning models face the same constraint: they can only reason about knowledge they already possess.

8.1.3 Mechanisms of Test-Time Compute

Bridge: The Generator-Verifier Asymmetry

Why does spending more test-time compute help at all? The deep answer is that verification is almost always cheaper than generation. This asymmetry has a name in computational complexity theory (it is the P vs. NP distinction restated for neural networks): it is generally faster to check whether a candidate answer is correct than to produce the correct answer from scratch. Every test-time compute mechanism in this section exploits that asymmetry:

Best-of-N sampling works because it is cheaper to score N candidates than to generate one perfect answer.
Reward models in RLHF (Section 20.1) work for the same reason: a 7B reward model can rank outputs from a 70B policy because ranking is easier than generating.
Process reward models (PRMs) push this further by verifying intermediate reasoning steps.
Self-reflection loops (Section 14.3) work because the same model can critique its own output more reliably than it can produce it correctly the first time.
RAG faithfulness checking (Section 23.9) is verification of generated text against retrieved sources.

When the asymmetry breaks down (e.g., open-ended creative tasks where verification is as hard as generation), test-time compute helps less. When the asymmetry is strongest (math, code, formal proofs where verification is cheap), test-time compute is transformative, which is why the o-series and DeepSeek-R1 invest heavily in those domains. This is also why RLVR (Section 20.4) works well on math and code but struggles on essay writing, where no cheap verifier exists.

There are three primary mechanisms through which a model can invest additional compute at inference time. Each represents a different approach to the same goal: generating better answers by spending more processing per query.

8.1.3.1 Extended Chain-of-Thought

The most visible mechanism in modern reasoning models is extended chain-of-thought (CoT). The model generates a long sequence of "thinking tokens" before producing its final answer. These tokens represent the model's internal deliberation: breaking the problem into steps, considering alternatives, checking intermediate results, and sometimes backtracking when it detects an error.

In OpenAI's o1 and o3, these thinking tokens are hidden from the user (the API returns only a summary). In DeepSeek R1, the thinking process is visible, enclosed in <think>...</think> tags. The number of thinking tokens varies adaptively: a simple arithmetic question might generate 50 thinking tokens, while a competition-level math problem might generate 10,000+.

Why thinking tokens help, mechanically. A transformer is a fixed-depth computation graph: each input token passes through the same number of layers regardless of problem difficulty. By generating intermediate reasoning tokens, the model effectively gives itself additional "layers" of processing. Each thinking token feeds back into the context, allowing subsequent computations to build on earlier ones. This converts a fixed-depth computation into an adaptive-depth one, which is necessary for problems that require compositional reasoning (multi-step math, logical deduction, code debugging). The cost is linear in the number of thinking tokens, but the problems it unlocks are qualitatively different from what a single forward pass can solve. For deployment, this means the KV cache grows proportionally with thinking tokens, making efficient memory management crucial.
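To make the KV cache growth concrete, the following sketch estimates cache size for one long reasoning trace. The configuration (80 layers, 8 key-value heads, head dimension 128, 16-bit values) is an assumption typical of a 70B model with grouped-query attention, not a figure taken from any specific model card.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # fp16 / bf16
) -> int:
    """Bytes of KV cache held for one sequence of num_tokens tokens."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens

# Assumed configuration for a 70B-class model with grouped-query attention.
cache = kv_cache_bytes(num_tokens=50_000, num_layers=80, num_kv_heads=8, head_dim=128)
print(f"{cache / 1e9:.1f} GB per request")
```

Because the cache grows linearly with the trace, a thinking budget cap also acts as a per-request memory cap.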

# Comparing standard vs. reasoning model outputs on the same problem
# Problem: "What is the sum of all prime numbers less than 20?"
# === Standard Model (GPT-4o) ===
# Output: "The prime numbers less than 20 are: 2, 3, 5, 7, 11, 13, 17, 19.
# Their sum is 2 + 3 + 5 + 7 + 11 + 13 + 17 + 19 = 77."
# Tokens generated: ~40
# Time: ~0.5s
# === Reasoning Model (o3-mini) ===
# Hidden thinking tokens (~200 tokens):
# "I need to find all primes less than 20.
# Starting from 2: 2 is prime.
# 3 is prime (not divisible by 2).
# 4 = 2x2, not prime.
# 5 is prime (not divisible by 2 or 3).
# 6 = 2x3, not prime.
# 7 is prime (not divisible by 2, 3, or 5). Actually I only need
# to check up to sqrt(7) ~ 2.6, so just check 2. Not divisible. Prime.
# ...
# [continues for all numbers up to 19]
# Now summing: 2+3=5, 5+5=10, 10+7=17, 17+11=28, 28+13=41,
# 41+17=58, 58+19=77.
# Let me verify: 2,3,5,7,11,13,17,19. Count: 8 primes. Sum: 77."
# Visible output: "The sum of all prime numbers less than 20 is 77."
# Total tokens (thinking + output): ~250
# Time: ~2s
# The reasoning model consumed ~6x more tokens but used a more
# systematic verification process. For this easy problem, both
# get the right answer. The difference becomes dramatic on harder
# problems where the standard model would make errors.

Code Fragment 8.1.1b: Pseudocode comparison of standard vs. reasoning model output on a simple math problem. The reasoning model generates extensive hidden thinking tokens before producing a concise final answer.

8.1.3.2 Best-of-N Sampling

A conceptually simpler approach is to generate $N$ independent solutions and select the best one. This requires a reward model (or verifier) that can score each candidate solution. The approach is "embarrassingly parallel" since all N candidates can be generated simultaneously on separate GPUs.

Performance scales logarithmically with $N$: doubling the number of samples yields a roughly constant improvement in accuracy. This means that going from N=1 to N=8 provides a large boost, but going from N=64 to N=128 provides a much smaller one. Practical systems typically use N between 8 and 64, balancing cost against accuracy.

Algorithm 8.1.1: Self-Consistency Decoding (Majority Vote over CoT samples)

Algorithm: Self-Consistency
Input:  prompt x, model M, number of samples N, temperature T > 0,
        answer-extraction function extract(.)
Output: predicted answer y_hat

  // 1. Sample N independent chain-of-thought reasoning traces
  S := []
  For i = 1..N:
      r_i ~ M(. | x, temperature = T)           // each r_i is "Let's think step by step ... Answer: a"
      a_i := extract(r_i)                       // pull the final answer span
      S := S + [a_i]

  // 2. Marginalize over reasoning by majority vote on the final answer
  counts := {}
  For a in S:
      counts[a] := counts.get(a, 0) + 1
  y_hat := argmax_a counts[a]                   // most frequent extracted answer
  Return y_hat

Interpretation. If P(a | x) = sum_r P(a, r | x) is the marginal probability of answer a,
self-consistency approximates argmax_a P(a | x) by sampling traces r and taking the mode of the extracted answers,
rather than returning the answer of a single greedy trace r*. This is why sampling with T > 0
is required (diverse traces turn the vote into a Monte Carlo estimate of the marginal), while greedy decoding (T = 0) collapses to one trace.

Source: Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models," ICLR 2023 (arXiv:2203.11171). Self-consistency is the conceptual ancestor of best-of-N: where best-of-N uses an external verifier to score chains, self-consistency uses agreement among chains as an implicit verifier. It needs no extra model but only works when (a) the answer space is small enough that votes concentrate and (b) the model's marginal distribution over answers is better calibrated than its greedy single-trace answer.
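Algorithm 8.1.1 translates almost line for line into Python. In this minimal sketch, `sample_trace` is a hypothetical stand-in for one call to the model at temperature T > 0, and the canned traces are invented so that the example runs deterministically without an API.

```python
import re
from collections import Counter
from itertools import cycle
from typing import Callable, Optional

def extract_answer(trace: str) -> Optional[str]:
    """Pull the final answer from a trace that ends in 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(\S+)\s*$", trace)
    return match.group(1) if match else None

def self_consistency(sample_trace: Callable[[], str], n: int) -> Optional[str]:
    """Majority vote over the answers extracted from n sampled traces."""
    answers = [extract_answer(sample_trace()) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None

# Invented traces standing in for samples from M(. | x, temperature = T).
canned = cycle([
    "2+3+5+7+11+13+17+19 ... Answer: 77",
    "I forgot 2, so ... Answer: 75",
    "Summing in pairs ... Answer: 77",
    "Including 21 by mistake ... Answer: 98",
    "Checking each number ... Answer: 77",
])

prediction = self_consistency(lambda: next(canned), n=5)
print(prediction)
```

Traces with no extractable answer are dropped before voting. Whether to count them as abstentions instead is a design choice this sketch does not explore.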

8.1.3.3 Tree Search

Tree search methods, including Monte Carlo Tree Search (MCTS) and beam search variants, structure the reasoning process as an explicit search over partial solutions. Each node in the tree represents a partial chain of thought, and edges represent possible next reasoning steps. A reward model evaluates the promise of each partial solution, allowing the search to focus compute on the most promising reasoning paths.

Tree search is more compute-efficient than best-of-N for hard problems because it avoids wasting compute on completely wrong solution paths. If the first reasoning step is clearly wrong, tree search can detect this early (via the reward model) and redirect compute to better branches. Best-of-N, by contrast, generates each solution independently and can repeat the same mistakes across many of its N attempts.

We cover tree search in detail in Section 8.5, including AlphaProof's application of MCTS to formal mathematical reasoning.
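The pruning idea can be illustrated with step-level beam search, a simpler relative of MCTS. In this sketch, `propose` is a hypothetical stand-in for the model suggesting next reasoning steps and `score` for a process reward model. The toy task, picking numbers that sum to 10, is invented purely so the code runs on its own.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]

def step_beam_search(
    propose: Callable[[Path], List[str]],
    score: Callable[[Path], float],
    beam_width: int,
    max_depth: int,
) -> Path:
    """Keep the beam_width best partial chains at each depth."""
    beam: List[Path] = [()]
    for _ in range(max_depth):
        candidates = [path + (step,) for path in beam for step in propose(path)]
        if not candidates:
            break
        # Prune: low-scoring partial chains are dropped before being extended.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=score)

TARGET = 10

def toy_propose(path: Path) -> List[str]:
    """Hypothetical step generator: every step adds 1, 2, or 3."""
    return ["1", "2", "3"]

def toy_score(path: Path) -> float:
    """Hypothetical step-level scorer: closer to the target sum is better."""
    return -abs(TARGET - sum(int(step) for step in path))

best = step_beam_search(toy_propose, toy_score, beam_width=2, max_depth=4)
print(best, toy_score(best))
```

With a beam width of 2, only two partial chains are extended at each depth, so at most six candidates are scored per level, compared with the 81 complete four-step chains in the full tree.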

8.1.3.4 Reasoning Architecture Taxonomy

The term "reasoning model" covers at least four meaningfully different architectural choices. They share the goal of allocating more compute to harder problems, but differ in where the reasoning lives, who can see it, and how it is structured. Understanding these distinctions matters for choosing the right model in production and for interpreting benchmark comparisons.

Figure 8.1.3: The four major reasoning architecture approaches compared. Extended CoT adds no new architecture but trains the model to reason longer. Hidden thinking (o-series) uses RL to produce internal reasoning tokens invisible to the user. Explicit thinking (DeepSeek R1) exposes the reasoning trace via <think> tags. Tree search (MCTS) generates and scores multiple reasoning branches, requiring a process reward model.

Extended Chain-of-Thought is the simplest approach architecturally: the model is the same transformer, but it is trained (or prompted) to produce long reasoning traces before the answer (the prompting techniques are covered in Section 12.2). No structural change to the model is required. The reasoning trace is visible in the output and can be inspected, which makes it useful for debugging. The limitation is that quality depends heavily on training: a model that is only prompted to reason (rather than trained with RL) may not learn when to stop reasoning.

Hidden Thinking is the approach used by OpenAI's o-series (o1, o3, o4-mini). The model is trained with reinforcement learning to produce internal reasoning tokens that are not returned in the API response. The thinking tokens are generated autoregressively and consume full KV cache memory during generation, but are stripped before the response is returned. From the user's perspective, the model appears to pause and then return a concise answer. The tradeoff is interpretability: you cannot inspect the reasoning, which makes debugging failures harder. (See Section 8.2 for the o-series architecture in detail.)

Explicit Thinking is the approach used by DeepSeek R1. The model outputs its reasoning trace enclosed in <think>...</think> XML tags, followed by the final answer. The trace is part of the response and can be read, cached, and used for fine-tuning or distillation. DeepSeek-AI (2025) showed that this behavior can emerge from pure reinforcement learning, without supervised chain-of-thought data, using their GRPO training algorithm. Because the trace is visible, it is significantly easier to debug why the model failed on a given problem. The tradeoff is that the full trace appears in the response, which increases output token count and therefore cost.

Tree Search (MCTS-style) represents a fundamentally different approach. Instead of generating one linear reasoning trace, the model generates multiple reasoning branches, scores each with a reward model, and expands the most promising ones. This is similar to Monte Carlo Tree Search as used in game-playing AI (AlphaGo, AlphaProof). The compute cost is substantially higher than linear CoT because multiple partial solutions are generated and evaluated. However, for problems that require finding a specific correct reasoning path among many plausible-looking wrong paths (olympiad mathematics, formal proofs, complex constraint satisfaction), tree search can find solutions that linear CoT misses entirely. A process reward model (Section 7) is essential for tree search to work: you need step-level scoring to prune bad branches early.

8.1.3.5 Best-of-N Sampling: Depth vs. Breadth

Best-of-N (also called rejection sampling) occupies the middle ground between simple extended CoT and expensive tree search. Generate $N$ complete independent reasoning chains from the same model, score each with a reward model, and return the highest-scoring response. The approach is "embarrassingly parallel" since all N chains can be generated simultaneously.

Compute cost is linear in N, but quality improvement is roughly logarithmic. If each chain is correct independently with probability $p$, then the probability that at least one of N samples is correct is $1 - (1-p)^N$. This is a ceiling: it is reached only if the scorer always picks a correct chain when one exists. It grows quickly at first (N=1 to N=8 gives a large boost) and then saturates (N=64 to N=128 gives a small boost). In practice, N between 8 and 64 covers most of the achievable gain.

import re

import anthropic


def best_of_n(prompt: str, n: int, score_fn) -> str:
    """
    Generate N independent responses and return the highest-scoring one.
    score_fn: callable that takes a response string and returns a float score.
    This could be a rule-based verifier (e.g., check if math answer is numeric
    and matches ground truth format) or a separate reward model API call.
    """
    client = anthropic.Anthropic()
    # Generate N independent responses (in practice, batch these in parallel)
    responses = []
    for _ in range(n):
        msg = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(msg.content[0].text)
    # Score the responses only after all N have been generated, then pick the best
    scored = [(score_fn(r), r) for r in responses]
    best_score, best_response = max(scored, key=lambda x: x[0])
    return best_response


# Example: verifiable math problem
def math_verifier(response: str) -> float:
    """Simple rule-based verifier: does the response contain a numeric answer?"""
    # Look for a final answer pattern like "= 42" or "answer is 42"
    matches = re.findall(r'(?:=|answer is|result is)\s*([\d,\.]+)', response.lower())
    return 1.0 if matches else 0.0


# Expected accuracy improvement with best-of-N
# If single-sample accuracy = 30%, best-of-N accuracy (theoretical ceiling):
p = 0.30
for n in [1, 4, 8, 16, 32, 64]:
    ceiling = 1 - (1 - p) ** n
    print(f"N={n:3d}: theoretical ceiling = {ceiling:.1%}")

Code Fragment 8.1.2a: Best-of-N sampling with a simple rule-based verifier. For production, the score_fn would call a dedicated reward model. The theoretical ceiling calculation shows that gains saturate quickly: at p = 0.30 the ceiling is already about 94% at N=8 and above 99% at N=16.

Relationship to rejection sampling in RLHF training. The same best-of-N principle underlies a common data-generation step in post-training: the model generates multiple responses to each prompt, and the best response (as scored by a reward model or verifier) is kept as a supervised fine-tuning target. This is called rejection sampling fine-tuning (RFT), and DeepSeek-R1 used it to build a large part of its supervised fine-tuning data. For full RLHF context, see Section 18.1.

When to prefer best-of-N over tree search. Best-of-N is simpler to implement (no partial-solution infrastructure), works with any model without modification, and has predictable cost (exactly N full generations, each of which takes one forward pass per generated token). Tree search has higher overhead (partial solutions must be tracked, the tree must be maintained, and a PRM is required) but is more compute-efficient on very hard problems. A practical rule: start with best-of-N at N=8 to 32; only move to tree search if best-of-N is still failing.

These reasoning strategies are powerful, but they are not free. Every additional thinking token costs compute, latency, and memory. Before deploying any of these approaches, you need to understand the cost structure and learn how to budget reasoning compute effectively.

8.1.4 The Economics of Thinking

Test-time compute introduces a new dimension to the economics of LLM inference. The cost per query is no longer approximately fixed; it varies with problem difficulty and the chosen reasoning budget.

8.1.4.1 Token Costs

Reasoning models typically charge for both thinking tokens and output tokens, though the pricing structure varies. As of early 2025:

Table 8.1.2b: Token Costs Comparison (as of early 2025).

Model	Input ($/1M tokens)	Output ($/1M tokens)	Thinking Tokens	Notes
GPT-4o	$2.50	$10.00	N/A (no thinking)	Standard model baseline
o3-mini	$1.10	$4.40	Included in output	Low/medium/high reasoning effort
o3	$10.00	$40.00	Included in output	Highest capability reasoning
Claude 3.5 Sonnet	$3.00	$15.00	N/A (no extended thinking)	Standard model baseline
DeepSeek R1	$0.55	$2.19	Visible in output	Open-weight alternative

A critical detail: reasoning models generate far more tokens per query than standard models. A query that costs $0.001 with GPT-4o (100 output tokens) might cost $0.04 to $0.20 with o3 (1,000 to 5,000 thinking + output tokens). The per-token price of reasoning models is sometimes lower, but the total cost per query is typically 5x to 50x higher because of the additional thinking tokens.
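The per-query arithmetic is easy to reproduce. This sketch uses the GPT-4o and o3-mini output prices from Table 8.1.2b and the token counts quoted above. It ignores input-token cost for simplicity, and `query_cost` is an illustrative helper, not part of any vendor SDK.

```python
def query_cost(output_tokens: int, price_per_million: float) -> float:
    """Dollar cost of the generated tokens (thinking + visible) for one query."""
    return output_tokens * price_per_million / 1_000_000

baseline = query_cost(100, 10.00)         # GPT-4o, 100 output tokens
reasoning_low = query_cost(1_000, 4.40)   # o3-mini, 1,000 thinking + output tokens
reasoning_high = query_cost(5_000, 4.40)  # o3-mini, 5,000 thinking + output tokens

print(f"baseline:  ${baseline:.4f}")
print(f"reasoning: ${reasoning_low:.4f} to ${reasoning_high:.4f}")
print(f"ratio:     {reasoning_low / baseline:.1f}x to {reasoning_high / baseline:.1f}x")
```

Even though o3-mini's per-token output price is less than half of GPT-4o's, the extra thinking tokens make the query roughly 4x to 22x more expensive in this example.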

8.1.4.2 The Routing Optimization

The wide variance in per-query cost creates a strong incentive for intelligent routing: sending easy questions to cheap, fast models and reserving expensive reasoning models for hard questions. This is not merely a cost optimization; it also improves latency. A simple factual query answered by GPT-4o-mini in 200ms is a better user experience than the same query sent to o3, which might take 5 to 15 seconds as it "thinks" about something that needs no thought.

Effective routing requires a difficulty estimator. This can be as simple as a logistic regression classifier trained on labeled query difficulty, or as sophisticated as a small language model fine-tuned to predict whether a reasoning model would produce a better answer than a standard model. The difficulty estimator itself is cheap to run (typically a single forward pass through a small model) and acts as a gatekeeper that protects the reasoning model from wasting compute on trivial queries.
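A router built around such an estimator can be very small. In the sketch below, the model names, the 0.5 threshold, and the keyword heuristic are all placeholders. In practice `estimate_difficulty` would be a trained classifier or a small fine-tuned model, as described above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    model: str
    difficulty: float

def route_query(
    query: str,
    estimate_difficulty: Callable[[str], float],
    threshold: float = 0.5,
    cheap_model: str = "standard-model",
    reasoning_model: str = "reasoning-model",
) -> Route:
    """Send a query to the reasoning model only if its estimated difficulty is high."""
    difficulty = estimate_difficulty(query)
    model = reasoning_model if difficulty >= threshold else cheap_model
    return Route(model=model, difficulty=difficulty)

# Hypothetical stand-in for a trained difficulty classifier: a keyword heuristic.
HARD_HINTS = ("prove", "debug", "derive", "optimize")

def keyword_difficulty(query: str) -> float:
    text = query.lower()
    return 0.9 if any(hint in text for hint in HARD_HINTS) else 0.1

easy = route_query("What is the capital of France?", keyword_difficulty)
hard = route_query("Prove that the sum of two odd numbers is even.", keyword_difficulty)
print(easy.model, hard.model)
```

Keeping the estimator behind a plain callable means the heuristic can later be swapped for a learned model without touching the routing logic.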

Warning: The Hidden Cost of Over-Thinking

Reasoning models can dramatically over-think simple questions. If you ask o3 "What is the capital of France?", it will still generate thinking tokens (perhaps 50 to 100) as it considers and verifies the answer. On a per-query basis, this waste is small. But at scale, if 80% of your traffic is simple queries, sending everything to a reasoning model means paying a thinking surcharge, plus added latency, on the 80% of requests that gain nothing from it. Always implement routing.

8.1.5 Historical Context: From Chain-of-Thought to Reasoning Models

The reasoning model paradigm did not emerge in a vacuum. It builds on several years of research into improving language model reasoning, with each advance building on the last.

2021: Scratchpad method. Nye et al. showed that letting models write intermediate computation steps (a "scratchpad") improved performance on algorithmic tasks. This was early evidence that visible intermediate reasoning helps.
2022: Chain-of-Thought prompting. Wei et al. demonstrated that including worked, step-by-step examples in the prompt dramatically improved reasoning accuracy, and Kojima et al. showed that even the zero-shot instruction "Let's think step by step" helps. This required no model changes, just a prompt modification.
2022: Self-consistency. Wang et al. showed that sampling multiple chain-of-thought solutions and taking a majority vote improved accuracy beyond a single CoT sample. This introduced the idea that more inference-time samples yield better answers.
2022: STaR. Zelikman et al. proposed Self-Taught Reasoner, which iteratively fine-tunes a model on its own correct reasoning traces. This bootstrapping approach was an early precursor to training models specifically for reasoning.
2023: Process reward models. Lightman et al. (PRM800K) demonstrated that rewarding individual reasoning steps outperforms rewarding only final answers, laying the groundwork for step-level supervision in reasoning models.
2024: OpenAI o1. The first commercially deployed reasoning model, trained with reinforcement learning to produce extended chains of thought. Hidden reasoning tokens, RL-trained rather than prompt-elicited.
2025: DeepSeek R1. Open-weight reasoning model demonstrating that reasoning behavior emerges from pure RL (R1-Zero) without supervised CoT data. Trained with GRPO (introduced earlier in DeepSeekMath) and popularized <think> token delimiters.
2025: Reasoning proliferation. Google Gemini 2.5 (thinking mode), Qwen QwQ, Kimi k1.5, o3, o4-mini. Reasoning capabilities become a standard feature rather than a niche capability.

8.1.6 When to Use Test-Time Compute

Reasoning models earn their cost on some tasks and waste it on others. The two short lists below tell you which side you are on.

Key Insight

Aha Moment: The Same Model, Two Different Verdicts

OpenAI's o1-preview pricing is roughly $15/M input and $60/M output tokens, against GPT-4o's $5/M and $15/M. On AIME 2024 (the American Invitational Mathematics Examination), o1-preview scores 74.4 percent and GPT-4o scores 13.4 percent. Same provider, same week, same prompt: a 61-point accuracy gap. On TriviaQA (factoid retrieval), o1-preview scores 81.2 percent and GPT-4o scores 82.6 percent: a 1.4-point gap in the wrong direction. The cost premium of 3x on input and 4x on output buys you 61 points on competition math and nothing on Wikipedia trivia. That single contrast is the entire chapter compressed to two data points: test-time compute buys multi-step reasoning depth, not knowledge breadth, and routing to it for the wrong task is a 3x to 4x bill for a 1.4-point regression.

Use reasoning models when:

The task requires multi-step logical reasoning (mathematical proofs, code debugging, legal analysis)
The problem has a verifiable answer (math, code execution, constraint satisfaction)
Accuracy is more important than latency (batch processing, offline analysis)
The base model achieves 20% to 80% accuracy (the "sweet spot" for test-time scaling)
Errors are expensive (medical reasoning, financial analysis, safety-critical systems)

Use standard models when:

The task is primarily retrieval or summarization (factual lookups, document summarization)
Latency matters more than marginal accuracy (real-time chat, autocomplete)
The base model already achieves >90% accuracy (little room left for improvement via thinking)
The task is creative or subjective (writing, brainstorming) where there is no single "right" answer
Cost per query is tightly constrained
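The two checklists can be folded into a simple routing rule. What follows is a minimal sketch, not a production router: the `Task` fields, the `choose_model` function, and the returned labels are hypothetical names introduced for illustration. The thresholds are the 20% to 80% and >90% bands from the lists. How to treat tasks that fall outside those bands is an assumption of this sketch (it defaults to the cheaper standard model).

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of a query type; all fields are illustrative."""
    multi_step: bool               # needs multi-step logical reasoning
    verifiable: bool               # answer can be checked (math, code execution)
    latency_sensitive: bool        # real-time chat, autocomplete
    base_accuracy: float           # standard model's measured accuracy, in [0, 1]
    cost_constrained: bool = False


def choose_model(task: Task) -> str:
    """Return 'reasoning' or 'standard' following the two checklists."""
    # Vetoes from the "use standard models" list.
    if task.latency_sensitive or task.cost_constrained:
        return "standard"
    if task.base_accuracy > 0.90:
        return "standard"  # little headroom left for extra thinking

    # Positive signals from the "use reasoning models" list.
    in_sweet_spot = 0.20 <= task.base_accuracy <= 0.80
    if in_sweet_spot and (task.multi_step or task.verifiable):
        return "reasoning"

    # Assumption of this sketch: when in doubt, take the cheaper path.
    return "standard"


if __name__ == "__main__":
    debugging = Task(multi_step=True, verifiable=True,
                     latency_sensitive=False, base_accuracy=0.45)
    lookup = Task(multi_step=False, verifiable=False,
                  latency_sensitive=False, base_accuracy=0.95)
    print(choose_model(debugging), choose_model(lookup))
```

In practice `base_accuracy` would have to come from an offline evaluation of the standard model on a sample of each task type; the sketch takes it as given.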

Key Takeaways

Two dimensions of scaling: Train-time compute (bigger models, more data) and test-time compute (more thinking per query) are complementary axes for improving LLM performance.
Adaptive compute allocation is the core innovation: reasoning models invest more tokens on hard problems and fewer on easy ones, unlike standard models that spend roughly equal compute per token regardless of difficulty.
Difficulty-dependent optimality: On hard problems (10% to 50% base accuracy), a smaller model with test-time compute can match a 14x larger model. On easy problems (>90% accuracy), test-time compute is wasted.
Four architectures, one goal: Extended CoT, hidden thinking (o-series), explicit thinking (DeepSeek R1), and tree search (MCTS) all allocate more compute to harder problems but differ in where reasoning lives, who can see it, and how it is structured.
KV cache is the memory bottleneck: A 70B GQA model generating 50,000 thinking tokens consumes roughly 16 GB of KV cache per request. Budget caps and routing are essential mitigations in production.
PRMs outperform ORMs for tree search: Process reward models score each reasoning step, enabling early branch pruning. Outcome reward models score only the final answer and cannot prune bad branches before completion.
Best-of-N is the practical starting point: Logarithmic quality improvement, linear compute cost, no additional infrastructure. Start here before investing in tree search.
Economics favor routing: Because reasoning models cost 5x to 50x more per query, intelligent routing of easy queries to cheap models and hard queries to reasoning models is essential for production systems.
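The KV cache figure in the takeaways can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a Llama-style 70B configuration (80 layers, 8 KV heads under GQA, head dimension 128) with keys and values stored in 16-bit precision. These values are assumptions chosen for illustration and are not stated in this section; other 70B models will differ.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int = 80,          # assumed: Llama-style 70B depth
    kv_heads: int = 8,         # assumed: grouped-query attention with 8 KV heads
    head_dim: int = 128,       # assumed: per-head dimension
    bytes_per_value: int = 2,  # assumed: fp16 / bf16 storage
) -> int:
    """KV cache size in bytes for one request holding `tokens` tokens."""
    # The factor of 2 covers one key vector and one value vector
    # per layer and per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


if __name__ == "__main__":
    size = kv_cache_bytes(50_000)
    print(f"{size / 1e9:.1f} GB for a 50,000-token thinking trace")
```

Under these assumptions each token costs 327,680 bytes, so the cache grows linearly with the length of the thinking trace. Raising `kv_heads` to 64 (one KV head per query head in this hypothetical configuration, i.e. no GQA) multiplies the total by eight, which is why GQA matters for long reasoning traces.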

Tip: Use Chain-of-Thought for Reasoning Tasks

For math, logic, and multi-step reasoning, simply adding "Let's think step by step" to your prompt can improve accuracy by 10 to 40 percentage points on some benchmarks. For production, use structured chain-of-thought with explicit step numbering for more consistent results.
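The difference between the two prompting styles can be sketched as plain string templates. Only the phrase "Let's think step by step" comes from the text. The function names, the wording of the structured template, and the "Final answer:" convention are hypothetical choices made for this example.

```python
from typing import Optional


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase and let the model free-form."""
    return f"Q: {question}\nA: Let's think step by step."


def structured_cot(question: str, max_steps: int = 5) -> str:
    """Structured CoT: request numbered steps and a clearly marked final answer."""
    return (
        f"Question: {question}\n"
        f"Solve this in at most {max_steps} numbered steps.\n"
        "Format each step as 'Step N: <reasoning>'.\n"
        "Finish with a single line 'Final answer: <answer>'.\n"
        "Step 1:"
    )


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' line, or None if absent."""
    for line in reversed(completion.splitlines()):
        if line.strip().lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    print(structured_cot("A train travels 120 km in 1.5 hours. What is its speed?"))
    print(extract_final_answer("Step 1: 120 / 1.5 = 80\nFinal answer: 80 km/h"))
```

The fixed output format is what makes the structured variant easier to use in production: the final answer can be parsed mechanically, and a missing "Final answer:" line is an explicit signal that the completion should be retried or rejected.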

Warning

Common Misconception: "Chain-of-Thought Always Improves Accuracy"

The "always think step by step" advice spreads as if CoT were free. Liu et al. (2024) and Sprague et al. (2024) show CoT helps on math, symbolic reasoning, and multi-hop logic, but on tasks the model already knows well (factual lookup, sentiment classification, instruction following) CoT can hurt accuracy by 1 to 5 points; the extra reasoning tokens give the model room to second-guess a correct intuition and rationalize the wrong answer. Use CoT where verification is possible and the base model is in the 20 to 80 percent accuracy band; for high-confidence retrieval tasks, force a direct answer.

What's Next?

The discussion continues in Section 8.1a: KV Cache Growth, PRMs vs ORMs & Exercises, which covers the memory arithmetic that governs deployment of reasoning models under long thinking traces, the choice between Process and Outcome Reward Models for scoring those traces, and exercises that consolidate the test-time-compute picture. After that, Section 8.2 surveys the specific reasoning models (o-series, R1, QwQ).

Foundational Works

Snell, C. et al. (2024). "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." arXiv preprint arXiv:2408.03314. The central paper for this section, establishing when and how test-time compute scaling outperforms train-time scaling. Useful for understanding the compute-optimal inference problem.

OpenAI. (2024). "Learning to Reason with LLMs." OpenAI Blog. September 2024. The o1 system announcement, describing the high-level approach of RL-training for hidden chain-of-thought reasoning. While not a full technical paper, it establishes the hidden-thinking architecture and summarizes benchmark results that opened the reasoning model era.

Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." arXiv preprint arXiv:2001.08361. The original scaling laws paper that established the train-time compute paradigm. Understanding these laws provides the baseline against which test-time scaling is measured.

Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022. The Chinchilla paper that corrected Kaplan's scaling laws, showing that compute-optimal training uses roughly 20 tokens per model parameter.

Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. The chain-of-thought prompting paper that demonstrated the value of step-by-step reasoning, laying the conceptual foundation for reasoning models.

Wang, X. et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR 2023. Introduced self-consistency decoding (majority vote over multiple CoT samples), the conceptual precursor to best-of-N sampling with reward models.

Nye, M. et al. (2021). "Show Your Work: Scratchpads for Intermediate Computation with Language Models." arXiv preprint arXiv:2112.00114. Early work showing that allowing models to write intermediate computation steps improves performance on algorithmic tasks.