Part 2: Understanding LLMs
Chapter 9: Inference Optimization

Test-Time Compute & Reasoning Models

"A wise person does not always answer faster. Sometimes wisdom means pausing to think."

Quant Quant, Thoughtfully Pausing AI Agent

Prerequisites

This section builds on the inference optimization techniques covered in Section 9.1 (Quantization) through Section 9.5 (Pruning), and assumes familiarity with autoregressive decoding from Section 5.1 and sampling strategies from Section 5.2. Understanding reinforcement learning from human feedback (Section 17.1) will provide helpful context for process reward models.

Big Picture

Every optimization technique in this chapter so far has focused on making inference faster and cheaper: quantization reduces memory, pruning removes weights, speculative decoding parallelizes token generation, and serving frameworks maximize throughput. This section reverses the question entirely. Instead of asking "how do we spend less compute at inference time?", reasoning models ask: "what if we spend more compute at inference time to get better answers?" This paradigm shift, known as test-time compute scaling, represents one of the most important developments in LLM capabilities since the original scaling laws covered in Section 6.2. The previous sections taught you how to make inference efficient; this section teaches you when and why to make it deliberately expensive. For a comprehensive treatment of reasoning model architectures, training methods, and evaluation, see Chapter 8: Reasoning Models and Test-Time Compute.

1. The Test-Time Compute Paradigm

For years, the dominant recipe for improving LLM performance was straightforward: train a bigger model on more data. The scaling laws formalized by Kaplan et al. (2020) and Hoffmann et al. (2022) showed that loss decreases predictably as you increase model parameters and training tokens. But this approach has diminishing returns: doubling model size requires roughly doubling both training compute and serving costs, while the performance gain follows a power law with a small exponent.
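For reference, the loss-versus-parameters fit from Kaplan et al. (2020) takes a power-law form (constants as reported in the original paper; N is non-embedding parameter count):

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}
```

The small exponent is exactly why returns diminish: doubling N multiplies loss by only $2^{-0.076} \approx 0.95$, about a 5% reduction per doubling of model size.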

Test-time compute scaling offers a different tradeoff. Instead of making the model permanently larger (which increases cost for every query, whether easy or hard), you let the model "think longer" only when it needs to. The key insight is that many problems benefit from extended reasoning: breaking a problem into steps, checking intermediate results, exploring alternative approaches, and self-correcting errors. By generating additional "thinking tokens" at inference time, a smaller model can match or exceed the performance of a much larger model on challenging tasks, while remaining fast and cheap on easy ones.

Key Insight

The traditional scaling paradigm treats every query identically: a 70B model uses 70B parameters whether you ask "What is 2+2?" or "Prove that there are infinitely many primes." Test-time compute scaling breaks this symmetry. A reasoning model might spend 50 tokens of internal thought on the first question and 5,000 tokens on the second. This adaptive compute allocation is analogous to how humans spend seconds on easy questions and hours on hard ones. The practical consequence is profound: you can trade inference FLOPs for better answers on hard problems, and the exchange rate is often favorable. A 70B reasoning model thinking for 30 seconds can outperform a 400B standard model that answers in 2 seconds, at a fraction of the training cost.

This paradigm shift was first demonstrated at scale by OpenAI's o1 model in late 2024, which showed that a model trained with reinforcement learning to produce extended chain-of-thought reasoning could achieve dramatic improvements on mathematics, coding, and scientific reasoning benchmarks. The results were striking: on the American Invitational Mathematics Examination (AIME), o1 scored comparably to top human competitors, while standard GPT-4 level models struggled with even basic problems from the same exam.

2. Reasoning Models: Architecture and Training

2.1 The "Thinking Tokens" Paradigm

Reasoning models generate two distinct types of output: thinking tokens (internal reasoning that the user may or may not see) and answer tokens (the final response). During inference, the model first generates a potentially lengthy chain of thought, exploring the problem space, checking its own work, and refining its approach. Only after this thinking phase does it produce the visible answer. The thinking tokens are typically enclosed in special delimiters (such as <think>...</think>) and may be hidden from the end user or shown as a collapsible "reasoning trace."

This is not simply prompt engineering with "think step by step" instructions. Reasoning models are specifically trained (typically through reinforcement learning) to produce high-quality internal reasoning. The RL training reward signal comes from verifiable outcomes: did the model get the math problem right? Did the generated code pass all test cases? This outcome-based training teaches the model to develop effective reasoning strategies organically, rather than mimicking the surface form of chain-of-thought demonstrations.
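A minimal sketch of such an outcome-based reward, assuming R1-style `<think>...</think>` delimiters and a simple exact-match check (the function names and matching rule here are illustrative, not the actual RL pipeline):

```python
# Sketch of an outcome-based (verifiable) reward for reasoning RL.
# Reward depends only on the final answer being correct; the thinking
# tokens themselves are never scored directly.

def extract_final_answer(response: str) -> str:
    """Strip the <think>...</think> block and return the visible answer."""
    if "</think>" in response:
        response = response.split("</think>", 1)[1]
    return response.strip()

def outcome_reward(response: str, ground_truth: str) -> float:
    """1.0 for a correct final answer, 0.0 otherwise (exact match)."""
    return 1.0 if extract_final_answer(response) == ground_truth else 0.0

rollout = "<think>n^2+2n+3 = (n+1)^2 + 2, so (n+1) | 2 ...</think>n = 1"
print(outcome_reward(rollout, "n = 1"))  # 1.0
```

Real systems use more robust answer extraction (and, for code, actual test execution), but the principle is the same: the reward is verifiable, so the model is free to discover whatever reasoning strategy maximizes it.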

2.2 Key Reasoning Models

OpenAI o1, o3, and o4-mini. The o-series models pioneered commercial reasoning at scale. Released starting in late 2024, o1 demonstrated that reinforcement learning could train models to produce extended internal reasoning chains. The model generates a hidden chain of thought before every response, with the thinking process sometimes consuming thousands of tokens. The o3 and o4-mini models refined this approach with improved training recipes and more efficient architectures. o4-mini, in particular, showed that smaller models trained for reasoning can outperform much larger standard models on hard benchmarks, while keeping costs manageable for production workloads.

DeepSeek R1. Released in January 2025, DeepSeek R1 is an open-weight reasoning model that demonstrated a remarkably simple training recipe. The team applied group relative policy optimization (GRPO) directly to a base language model, rewarding correct answers on math and coding tasks, without any supervised fine-tuning on human-written chain-of-thought examples. The model spontaneously developed reasoning behaviors (self-verification, backtracking, exploring alternatives) purely through RL training. DeepSeek R1 matched or exceeded o1-preview on several benchmarks while being fully open-weight, enabling the research community to study and build upon its approach. The training recipe is described in detail in DeepSeek-AI (2025).

Google Gemini 2.5 Thinking Mode. Google's Gemini 2.5 models include an optional "thinking" mode that enables extended reasoning. When activated, the model generates a visible thinking trace before its answer, similar to the o-series approach. The Gemini implementation is notable for its integration with multimodal inputs (discussed in Section 27.1): the model can reason over images, charts, and documents, not just text.

Anthropic Claude with extended thinking. Claude models also support an extended thinking mode where the model produces a structured reasoning trace before answering. The thinking budget can be configured by the API caller, allowing fine-grained control over the compute/quality tradeoff.

Dimension Comparison
Table 9.6.1: Comparison of Standard LLM vs. Reasoning Model Behavior

| Dimension | Standard LLM | Reasoning Model |
|---|---|---|
| Response latency | 1 to 5 seconds (typical) | 5 to 120+ seconds (depends on problem difficulty) |
| Output tokens per query | 100 to 2,000 (answer only) | 500 to 50,000+ (thinking + answer) |
| Cost per query | $0.001 to $0.05 | $0.01 to $2.00+ (proportional to thinking tokens) |
| Accuracy on easy tasks | High | High (similar, sometimes slightly lower due to overthinking) |
| Accuracy on hard reasoning | Low to moderate | High (often 2x to 5x improvement on math/code benchmarks) |
| Streaming experience | Immediate token flow | Pause during thinking, then answer streams |
| Determinism | Controllable via temperature | Less controllable; thinking paths vary across runs |

3. Process Reward Models (PRMs)

A standard reward model (often called an Outcome Reward Model, or ORM) evaluates only the final answer. Given a question and a complete response, the ORM assigns a single scalar score. This is the approach used in standard RLHF, as covered in Section 17.1. However, for multi-step reasoning, outcome-level reward has a critical limitation: it provides no signal about where the reasoning went wrong. A model that makes a correct guess for the wrong reasons receives the same reward as one that reasons flawlessly.

Process Reward Models (PRMs) address this by scoring each step of the reasoning chain independently. Given a partial solution up to step k, the PRM predicts the probability that continuing from this step will eventually reach a correct answer. This provides dense, step-level feedback that can guide both training and inference-time search.

3.1 Training PRMs

The seminal work on PRMs is Lightman et al. (2023), which introduced the PRM800K dataset: 800,000 step-level human annotations on mathematical reasoning chains. Each step in a solution is labeled as "correct," "incorrect," or "neutral." Training a reward model on these annotations produces a PRM that can evaluate reasoning quality at each intermediate step.

A key finding from this work is that PRMs substantially outperform ORMs when used to select among multiple candidate solutions. Given 100 candidate solutions to a math problem, ranking them by the PRM's minimum step-level score (the "worst step" heuristic) identifies correct solutions far more reliably than ranking by a single outcome score. The Math-Shepherd project (Wang et al., 2024) extended this approach by automating the collection of step-level labels, using Monte Carlo rollouts to estimate whether each step leads to a correct answer. This automation makes PRM training scalable without requiring expensive human annotations for every step.
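The worst-step heuristic can be sketched in a few lines; the per-step scores below are stand-ins for actual PRM outputs:

```python
# Sketch of the "worst step" heuristic for ranking candidate solutions.
# Each candidate carries per-step PRM scores (probability the step is on
# a path to a correct answer); a chain is only as strong as its weakest step.

def worst_step_score(step_scores: list[float]) -> float:
    """A reasoning chain is scored by its weakest step."""
    return min(step_scores)

def rank_by_prm(candidates: dict[str, list[float]]) -> list[str]:
    """Return candidate ids, best first, by minimum step-level score."""
    return sorted(candidates, key=lambda c: worst_step_score(candidates[c]),
                  reverse=True)

candidates = {
    "A": [0.95, 0.90, 0.20, 0.99],  # one shaky step sinks the chain
    "B": [0.80, 0.85, 0.75, 0.82],  # uniformly solid reasoning
}
print(rank_by_prm(candidates))  # ['B', 'A']
```

Note how candidate A, despite three near-perfect steps, ranks below B: an outcome-level score would miss the single weak step entirely.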

3.2 How PRMs Enable Better Reasoning

PRMs unlock several powerful inference-time strategies:

  1. Candidate ranking: score multiple complete solutions and select the one whose weakest step is strongest (the "worst step" heuristic from Section 3.1).
  2. Step-level search: prune unpromising reasoning branches early, before compute is wasted completing them; this underlies the tree search methods in Section 4.
  3. Dense RL reward: provide per-step training signal during reinforcement learning, which is more sample-efficient than a single outcome-level reward.

4. Search at Inference Time

The combination of a reasoning model and a verifier (PRM or ORM) naturally leads to search at inference time. Rather than generating a single reasoning chain and hoping it is correct, the system generates multiple candidates and uses the verifier to select or guide the best one. This is conceptually similar to how AlphaGo combined a neural network with Monte Carlo Tree Search; the model provides intuition and the search procedure provides reliability.

4.1 Best-of-N Sampling

The simplest search strategy is best-of-N: generate N independent solutions to a problem, score each one with a verifier, and return the highest-scoring solution. Despite its simplicity, best-of-N is remarkably effective. On the MATH benchmark, generating 256 solutions and selecting the best one (as scored by a PRM) can boost accuracy from roughly 50% (single sample) to over 85%. The cost scales linearly with N, but the returns diminish logarithmically, so there is a practical sweet spot (typically N = 16 to 64 for most applications).
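A sketch of best-of-N selection and the coverage arithmetic behind it; `generate` and `verifier_score` are placeholders for a real sampler and PRM/ORM:

```python
# Sketch of best-of-N sampling with a verifier, plus the coverage math
# that explains why it works.

def best_of_n(generate, verifier_score, prompt: str, n: int = 16) -> str:
    """Generate n candidate solutions and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

def coverage(p_correct: float, n: int) -> float:
    """P(at least one of n independent samples is correct)."""
    return 1.0 - (1.0 - p_correct) ** n

# With a 30% per-sample success rate, 16 samples almost always
# contain at least one correct solution:
print(round(coverage(0.30, 16), 4))  # 0.9967
```

The log-shaped coverage curve is why returns diminish: once coverage is near 1, additional samples buy almost nothing, and the binding constraint becomes whether the verifier can pick out the correct candidate.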

4.2 Monte Carlo Tree Search (MCTS) for Reasoning

MCTS extends best-of-N by introducing structured exploration. Instead of generating complete solutions independently, MCTS builds a tree where each node represents a partial reasoning state and each edge represents a reasoning step. The algorithm iterates through four phases: (1) selection, using an upper confidence bound to balance exploration and exploitation; (2) expansion, generating new reasoning steps from promising nodes; (3) simulation, rolling out the reasoning to completion; and (4) backpropagation, updating node values based on the outcome. A PRM provides the value estimates at each node, replacing the random rollouts used in traditional MCTS with informed step-level evaluations.
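Phase (1) commonly uses the UCT (upper confidence bound for trees) rule. A minimal sketch, with node values standing in for PRM estimates (the exploration constant c = 1.4 is a conventional choice, not a setting quoted from any paper):

```python
import math

# Sketch of the UCT selection rule used in phase (1) of MCTS.
# A child's score balances its value estimate (exploitation) against
# how rarely it has been visited (exploration).

def uct_score(value: float, visits: int, parent_visits: int,
              c: float = 1.4) -> float:
    if visits == 0:
        return float("inf")  # always try unvisited steps first
    return value + c * math.sqrt(math.log(parent_visits) / visits)

# Three candidate reasoning steps under a node visited 30 times:
children = [
    {"value": 0.8, "visits": 20},  # strong, well-explored
    {"value": 0.5, "visits": 2},   # weak, barely explored
    {"value": 0.0, "visits": 0},   # unvisited
]
best = max(range(len(children)),
           key=lambda i: uct_score(children[i]["value"],
                                   children[i]["visits"], 30))
print(best)  # 2: unvisited nodes are expanded first
```

With a PRM supplying the value estimates, the search naturally concentrates rollouts on reasoning branches whose intermediate steps look sound.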

4.3 The Compute-Optimal Frontier

A central question in test-time compute scaling is: when should you think more versus use a bigger model? Snell et al. (2024) systematically studied this question and found that the answer depends on problem difficulty. For easy problems, a standard model answers correctly on the first attempt, so additional thinking wastes compute. For moderately hard problems, best-of-N sampling with a small model can match a much larger model at lower total cost. For extremely hard problems, even extensive search with a small model cannot compensate for fundamental capability gaps, and a larger model is necessary.

The practical implication is a routing strategy: classify incoming queries by difficulty and route them to the appropriate compute tier. Easy queries go to a fast, cheap model with no extra thinking. Medium queries go to a reasoning model with moderate thinking budget. Hard queries go to a large reasoning model with extensive search. This adaptive allocation can reduce overall inference costs by 3x to 10x compared to sending every query to the most capable model. We revisit routing architectures in the context of Section 29.4.
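A sketch of such a router; the keyword heuristic, tier names, and budgets are illustrative stand-ins for a trained difficulty classifier and real model endpoints:

```python
# Sketch of difficulty-based query routing across compute tiers.
# The keyword heuristic stands in for a fine-tuned classifier.

TIERS = {
    "easy":   {"model": "small-fast-model", "thinking_budget": 0},
    "medium": {"model": "reasoning-model",  "thinking_budget": 4_000},
    "hard":   {"model": "large-reasoning",  "thinking_budget": 32_000},
}

def classify_difficulty(query: str) -> str:
    """Toy heuristic; replace with a trained classifier in practice."""
    q = query.lower()
    if any(m in q for m in ("prove", "optimize", "derive")):
        return "hard"
    if any(m in q for m in ("calculate", "compare", "explain why")):
        return "medium"
    return "easy"

def route(query: str) -> dict:
    return TIERS[classify_difficulty(query)]

print(route("What is 2+2?")["model"])                            # small-fast-model
print(route("Prove there are infinitely many primes")["model"])  # large-reasoning
```

The essential design point is that the classifier runs in milliseconds and costs almost nothing, so even a mediocre router pays for itself as long as it keeps easy queries off the expensive tier.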

Why test-time compute changes the optimization calculus. Every technique in this chapter so far (quantization, KV cache optimization, speculative decoding, pruning) optimizes for the same objective: minimize the cost of producing each token. Reasoning models invert this objective. They deliberately spend more tokens (and more compute) per query because the marginal value of a correct answer on hard problems far exceeds the marginal cost of extra tokens. This means the serving infrastructure you build must support both modes: high-throughput, low-latency serving for easy queries (using all the techniques from Sections 9.1 through 9.5) and high-compute, variable-latency serving for reasoning queries. The frameworks discussed in Section 9.4 (vLLM, SGLang) are evolving to support this dual mode, but the optimal architecture for mixed reasoning/standard workloads remains an active area of engineering.

5. Using Reasoning Models in Practice

5.1 API Usage

Reasoning model APIs follow the same chat completion interface as standard models, but with additional parameters to control thinking behavior. Code Fragment 9.6.1 demonstrates how to call a reasoning model and inspect both the thinking trace and the final answer.


# Call a reasoning model (o1/o3) with extended thinking enabled.
# The response includes the internal reasoning trace alongside the answer.
from openai import OpenAI

client = OpenAI()

# Call a reasoning model with extended thinking
response = client.chat.completions.create(
    model="o4-mini",  # reasoning model
    messages=[
        {
            "role": "user",
            "content": (
                "Find all positive integers n such that "
                "n^2 + 2n + 3 is divisible by n + 1."
            )
        }
    ],
    # Some APIs allow controlling thinking budget:
    # max_completion_tokens includes both thinking + answer tokens
    max_completion_tokens=16000
)

# The response includes both reasoning and answer
message = response.choices[0].message
print("=== Final Answer ===")
print(message.content)

# Access thinking tokens (if exposed by the API)
# Note: OpenAI o-series models hide thinking tokens by default;
# other providers (DeepSeek, Anthropic) may expose them.
if hasattr(message, "reasoning_content") and message.reasoning_content:
    print("\n=== Thinking Trace (first 500 chars) ===")
    print(message.reasoning_content[:500])

# Cost awareness: reasoning models consume many more tokens
usage = response.usage
print(f"\nTokens used: {usage.total_tokens}")
print(f"  Prompt tokens: {usage.prompt_tokens}")
print(f"  Completion tokens: {usage.completion_tokens}")
# For o-series, completion_tokens includes hidden thinking tokens
# A simple math question might use 2,000+ thinking tokens
# A hard problem might use 20,000+ thinking tokens

=== Final Answer ===
We need n + 1 to divide n^2 + 2n + 3. Since n^2 + 2n + 3 = (n+1)^2 + 2, we need (n+1) | 2. So n + 1 is in {1, 2}, giving n in {1}. The answer is n = 1.

Tokens used: 2847
  Prompt tokens: 38
  Completion tokens: 2809
Code Fragment 9.6.1: Calling a reasoning model via the OpenAI API. Note the high completion token count, which includes hidden thinking tokens.

5.2 DeepSeek R1: Open-Weight Reasoning

For teams that need on-premise deployment or want to customize reasoning behavior, DeepSeek R1 provides a fully open-weight alternative. The model can be served using the same infrastructure covered in Section 9.4 (vLLM, SGLang, TGI) and exposes its thinking trace directly in the output. Code Fragment 9.6.2 shows how to use R1 with visible reasoning.


# Run DeepSeek R1 locally: generate with a thinking prompt template
# and parse the <think>...</think> tags to extract the reasoning chain.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load DeepSeek R1 (or a distilled variant for testing)
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = "How many prime numbers are there between 100 and 130?"
messages = [{"role": "user", "content": prompt}]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate with enough tokens for thinking + answer
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.6,
    do_sample=True
)

full_response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)

# R1 outputs thinking inside <think>...</think> tags
if "<think>" in full_response and "</think>" in full_response:
    think_start = full_response.index("<think>") + len("<think>")
    think_end = full_response.index("</think>")
    thinking = full_response[think_start:think_end].strip()
    answer = full_response[think_end + len("</think>"):].strip()
    print(f"Thinking ({len(thinking.split())} words):")
    print(thinking[:300] + "...")
    print(f"\nFinal answer: {answer}")
else:
    print(full_response)

Thinking (187 words):
Let me work through this systematically. I need to check each number from 101 to 129 for primality. 101: not divisible by 2, 3, 5, 7. sqrt(101) is about 10, so I only need to check up to 10. 101/7 = 14.4... So 101 is prime. 103: similar check... 103 is prime. 107: prime. 109: prime. 113: prime...

Final answer: There are 6 prime numbers between 100 and 130: 101, 103, 107, 109, 113, and 127.
Code Fragment 9.6.2: Using DeepSeek R1 locally with visible thinking trace. The model generates explicit reasoning before answering.
Routing Easy vs. Hard Queries at a Financial Services Firm

Who: A quantitative analysis team at a mid-size asset management firm building an AI assistant for portfolio analysts.

Situation: The team initially deployed a reasoning model (o3) for all queries, achieving excellent accuracy on complex financial calculations and regulatory interpretation.

Problem: Monthly API costs reached $18,000 because every query, including simple ones like "What is the current P/E ratio of AAPL?" and "Summarize this earnings call," was being processed with full extended thinking. Average response latency was 12 seconds, frustrating analysts who needed quick lookups.

Decision: The team implemented a difficulty-based router. A lightweight classifier (fine-tuned from a small embedding model) categorized incoming queries into three tiers: (1) factual lookups and simple summaries sent to GPT-4o-mini; (2) multi-step calculations and comparisons sent to GPT-4o; and (3) complex reasoning tasks (portfolio optimization, risk scenario analysis, regulatory interpretation) sent to o3.

Result: Monthly costs dropped to $4,200 (a 77% reduction). Average latency fell to 2.8 seconds. Accuracy on hard reasoning tasks remained unchanged, while simple query accuracy actually improved slightly because the fast model was less likely to overthink straightforward questions.

Lesson: Reasoning models are a precision tool, not a universal replacement. The highest-ROI deployment routes hard problems to reasoning models and everything else to fast, cheap standard models. Spending $0.50 in thinking tokens on a question that a $0.001 call can answer correctly is waste, not intelligence.

5.3 Cost and Latency Considerations

Thinking tokens are not free. Each thinking token incurs the same per-token cost as an output token (and sometimes more, depending on the provider). A reasoning model that generates 10,000 thinking tokens before producing a 200-token answer costs roughly 50x more than a standard model that produces the same 200-token answer directly. The latency impact is similarly significant: those 10,000 thinking tokens must be generated sequentially before the answer begins streaming, creating a noticeable delay.
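The arithmetic above, made concrete at an illustrative per-token price (the rate below is a placeholder, not any provider's actual pricing):

```python
# Back-of-the-envelope cost comparison: 10,000 thinking tokens plus a
# 200-token answer vs. the same 200-token answer alone, at one
# illustrative output-token rate.

PRICE_PER_1K_OUTPUT = 0.01  # $/1K tokens, placeholder rate

def query_cost(thinking_tokens: int, answer_tokens: int) -> float:
    return (thinking_tokens + answer_tokens) / 1000 * PRICE_PER_1K_OUTPUT

standard = query_cost(0, 200)
reasoning = query_cost(10_000, 200)
print(f"standard:  ${standard:.4f}")          # $0.0020
print(f"reasoning: ${reasoning:.4f}")         # $0.1020
print(f"ratio: {reasoning / standard:.0f}x")  # 51x
```

The ratio depends only on token counts, not the rate, which is why the "roughly 50x" multiplier holds across providers whose thinking tokens are billed at the output rate.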

For production systems, the key practices are:

  1. Cap the thinking budget: most reasoning APIs let you bound total completion tokens (thinking plus answer); set ceilings appropriate to each task tier.
  2. Route by difficulty: send lookups and simple summaries to fast standard models and reserve reasoning models for queries that genuinely need multi-step reasoning.
  3. Monitor completion token usage: for o-series-style APIs, reported completion tokens include hidden thinking tokens, so per-query token tracking is the most reliable cost signal.
  4. Manage the latency experience: surface a progress indicator or a streamed thinking trace during the pause so users know the model is working.

6. The Compute-Optimal Frontier: Think More or Use a Bigger Model?

Snell et al. (2024) formalized the tradeoff between test-time compute and model size. Their key finding is that the optimal strategy depends on a "difficulty curve": for each problem, there exists a crossover point where additional test-time compute on a smaller model becomes more expensive than simply querying a larger model once. Below this crossover, test-time compute wins. Above it, model scale wins.

The practical framework for deciding between test-time compute and model scale involves three factors:

  1. Problem difficulty distribution: If most of your queries are hard (e.g., competitive programming, advanced mathematics), investing in a larger model may be more cost-effective than scaling test-time compute on a small one. If difficulty varies widely, adaptive routing yields the best results.
  2. Latency requirements: Test-time compute introduces variable latency proportional to problem difficulty. For real-time applications (chatbots, autocomplete), this variability can be unacceptable. For batch or offline applications, it is irrelevant.
  3. Verifier quality: The effectiveness of test-time search depends critically on having a good verifier. For domains where correctness can be checked automatically (math, code, formal logic), test-time compute is highly effective. For open-ended tasks (creative writing, summarization), verifiers are weaker, and the returns from additional thinking diminish rapidly.
Self-Check
1. What is the fundamental difference between traditional scaling (training bigger models) and test-time compute scaling?
Answer:
Traditional scaling increases model parameters and training data, which raises the cost of every query equally regardless of difficulty. Test-time compute scaling keeps the model size fixed but allows the model to spend variable amounts of inference compute per query. Easy questions get fast, cheap answers; hard questions get more thinking time. This adaptive allocation means you only pay extra compute when the problem warrants it.
2. How does a Process Reward Model (PRM) differ from an Outcome Reward Model (ORM), and why does this matter for reasoning?
Answer:
An ORM scores only the final answer, providing a single reward signal for the entire reasoning chain. A PRM scores each intermediate step, providing dense feedback about where reasoning is correct or goes wrong. This matters because PRM scores enable step-level search (pruning bad reasoning branches early), more reliable selection among candidate solutions (catching right-answer-wrong-reasoning cases), and more sample-efficient RL training through dense reward signals.
3. Why is best-of-N sampling with a verifier effective despite its simplicity?
Answer:
Best-of-N works because generating N independent solutions explores N different reasoning paths. Even if each individual path has a moderate probability of being correct (say 30%), the probability that at least one of N = 16 samples is correct is very high (approximately 99%). The verifier (PRM or ORM) then identifies which sample is most likely correct. The cost scales linearly with N, but accuracy improves rapidly for small N, creating a favorable cost/accuracy tradeoff up to a practical ceiling.
4. When should you prefer a standard (non-reasoning) model over a reasoning model?
Answer:
Prefer a standard model when: (1) the task is straightforward and does not require multi-step reasoning (factual lookups, simple summaries, classification); (2) latency is critical and users cannot wait for a thinking phase; (3) cost sensitivity is high and the task does not justify 10x to 50x higher per-query costs; or (4) the task is inherently open-ended (creative writing, brainstorming) where additional thinking provides diminishing returns because there is no clear "correct" answer to verify against.

Key Takeaways

  1. Test-time compute scaling trades extra inference tokens for better answers: a smaller model that thinks longer can match or exceed a much larger model on hard problems.
  2. Reasoning models are trained with reinforcement learning on verifiable outcomes (correct math answers, passing test cases) to produce extended thinking before answering.
  3. Process reward models score each reasoning step, enabling inference-time search strategies such as best-of-N sampling and MCTS.
  4. The choice between thinking more and using a bigger model depends on problem difficulty, latency requirements, and verifier quality; in production, route queries by difficulty rather than sending everything to the most capable model.

Research Frontier

Test-time compute scaling is one of the fastest-moving areas in LLM research. Several directions are particularly active as of early 2026. Learned search policies aim to replace fixed search algorithms (best-of-N, MCTS) with models that learn when and how to search, dynamically allocating compute based on problem features. Self-verification trains the reasoning model to also serve as its own verifier, eliminating the need for a separate PRM. Efficient thinking investigates compression techniques for thinking tokens: can the model reason in a more compact latent space rather than generating verbose natural-language reasoning chains? Multi-agent reasoning (explored further in Section 24.1) uses multiple model instances that debate and verify each other's reasoning, combining test-time compute scaling with ensemble diversity. Finally, distilling reasoning into standard models (training a non-reasoning model on the thinking traces of a reasoning model) aims to capture some benefits of test-time compute at standard inference cost, connecting to the distillation techniques in Section 16.1.

Exercises

Exercise 9.6.1: End-to-end optimization priorities Conceptual

You have been asked to reduce LLM serving costs by 3x. Rank these optimization options by expected impact: quantization, prompt engineering (shorter prompts), model distillation, caching, and hardware upgrade. Justify your ranking.

Answer Sketch

  1. Model routing/distillation (highest impact): using a 7B model for 70% of queries instead of 70B saves ~7x on those queries, for overall savings of ~5x.
  2. Quantization (INT4): reduces memory and compute by ~3x.
  3. Caching (prompt + semantic): eliminates compute for repeated or similar queries, with typical hit rates of 20 to 40%.
  4. Prompt engineering: shorter prompts reduce prefill cost but can degrade response quality.
  5. Hardware upgrade: newer GPUs (H100 vs. A100) give 2 to 3x improvement but require capital investment.

The best strategy combines routing (the biggest lever) with quantization (low effort, high return) to comfortably achieve the 3x reduction.

Exercise 9.6.2: Profiling LLM inference Coding

Write a script that profiles LLM inference, measuring: time-to-first-token, tokens-per-second during generation, total latency, and peak GPU memory usage. Use torch.cuda.Event for precise timing.

Answer Sketch

Use torch.cuda.Event(enable_timing=True) to create start/end events. Record before prefill, after first token (TTFT), and at each subsequent token. Peak memory: torch.cuda.max_memory_allocated(). Key measurements: TTFT reveals prefill efficiency, tokens/second reveals decode speed, and memory reveals how much headroom exists for batching. For a 7B model on A100: expect TTFT ~50ms for a 100-token prompt, ~30 tokens/second for decode, and ~14 GB peak memory in FP16.

Exercise 9.6.3: Inference optimization for edge devices Analysis

Compare the inference optimization requirements for three deployment targets: (1) cloud GPU (A100/H100), (2) consumer GPU (RTX 4090), and (3) mobile device (phone with NPU). What techniques are essential for each?

Answer Sketch

Cloud GPU: focus on throughput (continuous batching, large batch sizes, PagedAttention). Quantization to INT8 for memory efficiency. Speculative decoding for latency-sensitive applications.

Consumer GPU: limited to 24 GB VRAM, so quantization (INT4) is essential. Single-request latency matters more than throughput. FlashAttention and efficient KV-cache management are critical.

Mobile NPU: extremely limited memory (4 to 8 GB shared with the OS), so INT4 or even INT3 quantization is mandatory. Model size is limited to 1 to 3B parameters. Operator-level optimization (custom kernels for NPU hardware) and model architecture co-design (fewer layers, shared weights) are necessary. Battery constraints limit sustained inference duration.

Exercise 9.6.4: Benchmark design for inference Coding

Design and implement an inference benchmark that tests: (1) prefill throughput (tokens/second for prompt processing), (2) decode throughput (tokens/second for generation), (3) concurrent request handling (throughput at batch sizes 1, 4, 16, 64), and (4) memory utilization at each batch size.

Answer Sketch

Send requests with varying prompt lengths (128, 512, 2048, 8192 tokens) and generation lengths (64, 256, 1024 tokens) at different batch sizes. Measure wall-clock time, GPU utilization (nvidia-smi), and memory. Key metrics to report: tokens/second/GPU for both prefill and decode, TTFT at each batch size, and the maximum batch size before OOM. Present results as a table showing the latency-throughput tradeoff curve. Good benchmarks also test mixed workloads (varying prompt and response lengths within a batch) to simulate realistic traffic.

Exercise 9.6.5: Future of inference optimization Discussion

Predict three major advances in LLM inference optimization over the next two years. Consider hardware trends (chip architectures), algorithmic improvements, and system-level innovations.

Answer Sketch

  1. Hardware: custom LLM inference chips (like Groq's LPU and Cerebras' wafer-scale engine) designed for the specific memory access patterns of autoregressive generation, potentially achieving 10x throughput/dollar over GPUs.
  2. Algorithmic: advances in linear attention or state-space models (Mamba-like architectures) that reduce the fundamental O(n^2) attention bottleneck, enabling million-token contexts without approximation.
  3. System-level: intelligent request routing and model cascading that automatically selects the cheapest model capable of handling each query, with speculative execution across model tiers.

Together, these could reduce inference costs by 10 to 100x, making current frontier-model capabilities available at today's small-model prices.

What Comes Next

This concludes Chapter 9 on inference optimization. In the next chapter, Chapter 10: LLM APIs, we shift from optimizing inference to using LLMs through their API interfaces, covering the practical engineering of building applications on top of hosted models.

References and Further Reading
Test-Time Compute and Reasoning

Snell, C. et al. (2024). Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.

The foundational study on test-time compute scaling. Demonstrates that optimally allocating inference compute can outperform 14x larger models on challenging tasks, and provides a framework for deciding when to think more versus use a bigger model.

Paper

OpenAI. (2024). Learning to Reason with LLMs. OpenAI Blog.

OpenAI's technical overview of the o1 model, describing how reinforcement learning trains models to produce extended chains of thought. Includes benchmark results on AIME, Codeforces, and GPQA.

Blog Post

OpenAI. (2024). OpenAI o1 System Card.

The safety and capability evaluation for o1, documenting reasoning capabilities, failure modes, and safety mitigations. Essential reading for understanding the alignment considerations specific to reasoning models (see also Chapter 17).

Technical Report

DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

Describes the training recipe for DeepSeek R1, demonstrating that pure RL (without supervised chain-of-thought data) can produce strong reasoning models. The open-weight release enabled widespread research into reasoning model training.

Paper
Process Reward Models

Lightman, H. et al. (2023). Let's Verify Step by Step. ICLR 2024.

Introduces PRM800K and demonstrates that process-level reward models substantially outperform outcome-level reward models for selecting correct mathematical reasoning chains. The dataset and methodology have become foundational for PRM research.

Paper

Wang, P. et al. (2024). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. ACL 2024.

Proposes automated collection of step-level labels using Monte Carlo estimation, making PRM training scalable without human annotation. Demonstrates competitive results with PRM800K-trained models at a fraction of the labeling cost.

Paper
Inference-Time Search

Brown, B. et al. (2024). Large Language Monkeys: Scaling Inference Compute with Repeated Sampling.

Systematic study of best-of-N sampling at scale. Shows that coverage (the probability that at least one sample is correct) improves log-linearly with the number of samples, providing practical guidelines for compute allocation.

Paper