Section 13.7: Synthetic Reasoning Data

"The surprising discovery was not that large models can reason, but that small models can learn to reason when trained on the right data."
Synth, Surprisingly Optimistic AI Agent

Big Picture

Think of synthetic reasoning data like training an apprentice chef. You do not simply hand them a cookbook of final recipes (input/output pairs). Instead, you let a master chef narrate each decision while cooking: "I'm adding acid now because the sauce is too rich, and lemon works better than vinegar here because of the fish." That narrated process is the chain-of-thought trace. The apprentice learns not just what to cook but how to think through cooking decisions. Synthetic reasoning data captures that narration at scale.

Prerequisites

This section builds on Section 13.1 (foundations of synthetic data generation), Section 13.2 (prompt-based data generation techniques), Section 05.3 (sampling strategies including temperature and top-p), and Section 17.1 (alignment fundamentals, helpful for understanding reward signals).

A chain of dominos falling in sequence, each representing a step in a reasoning chain — **Figure 13.7.1**: Reasoning chains are like dominos: each step triggers the next, and if one falls wrong, the whole conclusion topples.

Related: Chapter 08

This section focuses on the data generation side of reasoning: how to produce, filter, and format synthetic reasoning traces for training. For the training algorithms that consume this data (RLVR, GRPO, process reward models, and the full DeepSeek-R1 training pipeline), see Section 08.3: Training Reasoning Models. For the broader reasoning model landscape and test-time compute paradigm, see Chapter 08.

A prospector panning for gold nuggets in a stream of generated reasoning traces — **Figure 13.7.2**: Rejection sampling for reasoning data is just gold panning with extra steps. Generate many traces, keep only the nuggets that actually reach the correct answer.

1. Why Reasoning Data Is Different

Standard synthetic data generation produces input/output pairs: a question and its answer, a prompt and its completion. Reasoning data adds an intermediate layer: the thinking process that connects the question to the answer. This distinction matters because models trained on reasoning traces develop qualitatively different capabilities. They learn to decompose problems, verify intermediate steps, and recover from mistakes within a single generation.

Key Insight

The most counterintuitive finding in reasoning data research is that incorrect reasoning traces are almost as valuable as correct ones, as long as they are labeled correctly. Models trained on a mixture of correct traces (labeled "correct") and incorrect traces (labeled "incorrect") develop stronger self-verification skills than models trained only on correct examples. The incorrect traces teach the model what mistakes look like, which is exactly the skill it needs to check its own work.

The shift toward reasoning-focused training data was catalyzed by three developments. First, OpenAI's o1 model (2024) demonstrated that extended chain-of-thought reasoning during inference dramatically improves performance on math, science, and coding tasks. Second, DeepSeek-R1 (2025) showed that open models can develop strong reasoning through reinforcement learning on verifiable tasks, then distill that capability into smaller models via synthetic reasoning traces. Third, the community discovered that training on reasoning traces transfers across domains: a model trained to reason through math problems also reasons better about code, logic puzzles, and multi-step planning tasks.

Two LLMs in a boxing ring, one generating adversarial examples while the other tries to defend against them — **Figure 13.7.3**: Red-teaming with synthetic data: one LLM throws punches while the other learns to dodge. The best defense is a good sparring partner.

What Makes a Good Reasoning Trace

Not all chain-of-thought data is equally useful. High-quality reasoning traces share several properties:

Step decomposition: The trace breaks the problem into clear, sequential sub-problems rather than jumping to the answer.
Verification steps: The trace includes self-checks ("Let me verify this by...") that catch and correct errors mid-reasoning.
Exploration and backtracking: Strong traces occasionally consider wrong approaches, recognize the error, and redirect. This teaches the model that reasoning is not always linear.
Correct final answers: The trace must arrive at the right answer. Traces with elegant reasoning but wrong conclusions are actively harmful for training.

Key Insight

The critical difference between standard synthetic data and reasoning data is the verification requirement. For factual Q&A pairs, you might accept an LLM's output if it "looks right." For reasoning traces, you need a way to verify that the final answer is correct, because a beautifully written but incorrect chain of thought teaches the model to reason confidently toward wrong conclusions.

Common Mistake: Confusing Correct Answers with Correct Reasoning

A model can arrive at the right answer through flawed reasoning. For example, a math problem might have a correct final answer of 42, but the chain-of-thought trace might contain an arithmetic error that happens to cancel out. If you filter reasoning traces only by checking the final answer, you will include these "right for the wrong reason" examples in your training data. Models trained on such data learn to mimic plausible-sounding reasoning steps without developing genuine problem-solving capability. Always verify intermediate steps (not just final answers) when curating reasoning data. For math problems, use symbolic verification of each step. For code, execute intermediate assertions. For logic problems, check that each inference follows from its premises.

2. The Reasoning Data Pipeline

Generating synthetic reasoning data follows a pipeline with distinct stages. Each stage introduces opportunities for quality control that do not exist in standard data generation. Figure 13.7.1 illustrates this pipeline.

Figure 13.7.4: The synthetic reasoning data pipeline. Verification after generation is the critical quality gate that distinguishes reasoning data from standard synthetic data.

Stage 1: Curating Seed Problems

The pipeline begins with a bank of problems that have verifiable answers. This is the strongest constraint: you need problems where you can programmatically check whether the final answer is correct. The most common sources include:

Math competition problems with numerical answers (MATH, GSM8K, AMC/AIME datasets)
Coding challenges with unit tests (LeetCode, CodeForces, Project Euler)
Logic puzzles with deterministic solutions
Science questions with known quantitative answers

For domains without easy verification (creative writing, open-ended analysis), you can use LLM-as-judge evaluation, but the signal is noisier. The strongest reasoning data pipelines stick to verifiable domains where correctness is unambiguous.

Stage 2: Generating Multiple Traces via Rejection Sampling

The core technique for producing high-quality reasoning data is rejection sampling: generate many candidate reasoning traces for each problem, then keep only the ones that arrive at the correct answer. This approach leverages a fundamental property of language models. Even when a model's pass@1 accuracy on a problem is only 30%, its pass@64 (getting it right in at least one of 64 attempts) might be 95%. By sampling many completions and filtering for correctness, you extract the model's best reasoning even on problems it usually gets wrong. Code Fragment 13.7.2 shows this approach in practice.

# Generate reasoning traces with rejection sampling
# Only traces that produce the correct final answer are kept for training
import openai
import re
from typing import List, Dict, Optional

client = openai.OpenAI()

def extract_final_answer(trace: str) -> Optional[str]:
 """Extract the boxed answer from a reasoning trace."""
 # Look for common answer formats
 patterns = [
 r"\\boxed\{(.+?)\}",
 r"[Ff]inal [Aa]nswer:\s*(.+?)(?:\n|$)",
 r"[Tt]he answer is\s*(.+?)(?:\.|$)",
 ]
 for pattern in patterns:
 match = re.search(pattern, trace)
 if match:
 return match.group(1).strip()
 return None

def rejection_sample(
 problem: str,
 correct_answer: str,
 n_samples: int = 64,
 temperature: float = 0.7,
 model: str = "o3-mini",
) -> List[Dict]:
 """
 Generate multiple reasoning traces and keep only correct ones.

 Returns list of dicts with 'trace' and 'answer' keys.
 """
 correct_traces = []

 # Generate in batches to manage API costs
 batch_size = min(n_samples, 16)
 for batch_start in range(0, n_samples, batch_size):
 responses = client.chat.completions.create(
 model=model,
 messages=[
 {"role": "system", "content": (
 "Solve this problem step by step. Show your reasoning "
 "clearly. End with 'Final Answer: [your answer]'."
 )},
 {"role": "user", "content": problem},
 ],
 n=batch_size,
 temperature=temperature,
 )

 for choice in responses.choices:
 trace = choice.message.content
 extracted = extract_final_answer(trace)

 if extracted and normalize(extracted) == normalize(correct_answer):
 correct_traces.append({
 "trace": trace,
 "answer": extracted,
 "problem": problem,
 })

 return correct_traces

def normalize(answer: str) -> str:
 """Normalize answer for comparison (strip whitespace, lowercase)."""
 return answer.strip().lower().replace(" ", "")

# Example usage
problem = "If 3x + 7 = 22, what is the value of x?"
correct = "5"

traces = rejection_sample(problem, correct, n_samples=32)
print(f"Generated {len(traces)} correct traces out of 32 samples")
print(f"Pass rate: {len(traces)/32:.1%}")
print(f"\nSample trace (first 200 chars):\n{traces[0]['trace'][:200]}...")

Generated 28 correct traces out of 32 samples Pass rate: 87.5% Sample trace (first 200 chars): Let me solve this step by step. Given: 3x + 7 = 22 Step 1: Subtract 7 from both sides 3x + 7 - 7 = 22 - 7 3x = 15 Step 2: Divide both sides by 3 3x / 3 = 15 / 3 x = 5 Let me verify: 3(5) + 7 = 15 + 7 = 22 ✓ Final Answer: 5...

Code Fragment 13.7.1: Generate reasoning traces with rejection sampling

Stage 3: Verification Strategies

Verification is what separates useful reasoning data from hallucinated reasoning data. The verification strategy depends on the domain:

Stage 3: Verification Strategies Comparison

Domain	Verification Method	Reliability
Math (numerical)	Compare extracted answer to ground truth	Very high
Code generation	Execute code against unit tests	Very high
Formal logic	Automated theorem prover	High
Science (quantitative)	Numerical answer matching with tolerance	High
Open-ended reasoning	LLM-as-judge with rubric	Moderate
Creative/subjective	Human evaluation	Moderate (expensive)

Warning

Skipping verification is the most common and most damaging shortcut in reasoning data generation. A model that produces fluent, confident, wrong reasoning traces will teach the student to be fluently wrong. In experiments by the DeepSeek team, training on unverified traces produced models that scored 15 to 20 percentage points lower on math benchmarks compared to models trained on verified traces of identical size. The cost of verification (running unit tests, checking numerical answers) is negligible compared to the generation cost.

3. The R1-Style Pipeline: From RL to Distillation

DeepSeek-R1 introduced a two-phase approach to reasoning capability that has become a template for the field. Understanding this pipeline clarifies why synthetic reasoning data is so powerful.

Phase 1: Reinforcement Learning on the Teacher

The teacher model (DeepSeek-R1) was trained using Group Relative Policy Optimization (GRPO) on problems with verifiable answers. The model received a reward of +1 for correct answers and 0 for incorrect ones, with no supervision of the reasoning process itself. The key insight: given only outcome-based rewards, the model spontaneously developed chain-of-thought reasoning, self-verification, and backtracking behaviors. Nobody told the model to "think step by step"; it discovered that explicit reasoning improved its reward.

Phase 2: Distillation into Smaller Models

Once the teacher develops strong reasoning, its traces become training data for smaller student models via supervised fine-tuning (SFT). The DeepSeek team distilled R1's reasoning into models as small as 1.5B parameters, and remarkably, the 14B distilled model outperformed much larger models that had not been trained on reasoning data. This demonstrates that the data matters more than the model size for reasoning capabilities.

Key Insight

The R1 pipeline reveals a powerful recipe: (1) use RL to develop reasoning in a large model, (2) collect the large model's reasoning traces as synthetic data, (3) train smaller models on those traces via SFT. The smaller models cannot discover reasoning through RL alone (they lack the capacity), but they can learn to imitate reasoning from high-quality traces. This is reasoning distillation, and it is currently the most practical path to building small, capable reasoning models.

4. Formatting Reasoning Data for Training

Reasoning traces need specific formatting to work well as training data. The format must clearly separate the thinking process from the final answer, so the model learns when to reason and when to produce output. Code Fragment 13.7.2 shows this approach in practice.

# Structure reasoning traces as JSON with step-level annotations
# Each trace includes the problem, reasoning chain, and verified answer
import json
from typing import List, Dict

def format_reasoning_sft(
 traces: List[Dict],
 system_prompt: str = "You are a helpful assistant that thinks step by step.",
 thinking_tags: bool = True,
) -> List[Dict]:
 """
 Format verified reasoning traces as chat-format SFT examples.

 Args:
 traces: List of dicts with 'problem', 'trace', 'answer' keys
 system_prompt: System message for the chat template
 thinking_tags: Whether to wrap reasoning in <think> tags

 Returns:
 List of conversation dicts ready for SFT training
 """
 formatted = []

 for item in traces:
 if thinking_tags:
 # Format with explicit thinking delimiters
 # This teaches the model to separate reasoning from output
 assistant_content = (
 f"<think>\n{item['trace']}\n</think>\n\n"
 f"The answer is {item['answer']}."
 )
 else:
 # Plain format: reasoning flows directly into answer
 assistant_content = (
 f"{item['trace']}\n\n"
 f"Final Answer: {item['answer']}"
 )

 conversation = {
 "messages": [
 {"role": "system", "content": system_prompt},
 {"role": "user", "content": item["problem"]},
 {"role": "assistant", "content": assistant_content},
 ]
 }
 formatted.append(conversation)

 return formatted

# Example: format traces and save as JSONL
traces = [
 {
 "problem": "A train travels 120 km in 1.5 hours. What is its speed in km/h?",
 "trace": (
 "I need to find speed given distance and time.\n"
 "Speed = Distance / Time\n"
 "Speed = 120 km / 1.5 hours\n"
 "Speed = 80 km/h\n"
 "Let me verify: 80 km/h * 1.5 h = 120 km. Correct."
 ),
 "answer": "80 km/h",
 },
 {
 "problem": "What is the sum of the first 10 positive integers?",
 "trace": (
 "I can use the formula: n(n+1)/2\n"
 "With n = 10: 10 * 11 / 2 = 110 / 2 = 55\n"
 "Let me verify by adding: 1+2+3+4+5+6+7+8+9+10\n"
 "= (1+10) + (2+9) + (3+8) + (4+7) + (5+6)\n"
 "= 11 + 11 + 11 + 11 + 11 = 55. Confirmed."
 ),
 "answer": "55",
 },
]

sft_data = format_reasoning_sft(traces, thinking_tags=True)

# Write to JSONL for training
output_path = "reasoning_sft_data.jsonl"
with open(output_path, "w") as f:
 for example in sft_data:
 f.write(json.dumps(example) + "\n")

print(f"Wrote {len(sft_data)} examples to {output_path}")
print(f"\nSample formatted message:")
print(json.dumps(sft_data[0]["messages"][2], indent=2))

Wrote 2 examples to reasoning_sft_data.jsonl Sample formatted message: { "role": "assistant", "content": "<think>\nI need to find speed given distance and time.\nSpeed = Distance / Time\nSpeed = 120 km / 1.5 hours\nSpeed = 80 km/h\nLet me verify: 80 km/h * 1.5 h = 120 km. Correct.\n</think>\n\nThe answer is 80 km/h." }

Code Fragment 13.7.2: Structure reasoning traces as JSON with step-level annotations

Choosing Between SFT and RL for Training

Once you have reasoning data, there are two main approaches to training:

Supervised Fine-Tuning (SFT) on traces: The simplest approach. Train the student model to predict the reasoning trace token by token. Works well for distillation where you have high-quality traces from a strong teacher. The student learns to imitate the teacher's reasoning patterns.
GRPO with outcome-based rewards: The student generates its own traces and receives rewards based on answer correctness. This is harder to set up but allows the student to develop its own reasoning style rather than mimicking the teacher. GRPO requires only a reward function, not curated traces.

In practice, the most effective approach combines both: start with SFT on teacher traces to initialize the student's reasoning ability, then refine with a short RL phase using GRPO. The SFT phase provides a warm start so the RL phase converges faster. DeepSeek found that SFT-then-GRPO outperformed either approach alone.

5. Quality vs. Quantity Tradeoffs

A persistent question in reasoning data generation is whether to invest in generating more traces or in improving trace quality. Empirical evidence from multiple teams suggests the following guidelines:

5. Quality vs. Quantity Tradeoffs Intermediate

Factor	More Data	Higher Quality
Problem diversity	More problems from more domains	Harder problems requiring deeper reasoning
Traces per problem	Generate 128+ and keep all correct	Generate 64, keep best by length/clarity
Teacher model	Use a cheaper model, generate more	Use the strongest model available
Typical sweet spot	50K to 200K traces for SFT	10K to 50K carefully curated traces

The DeepSeek-R1 distillation used approximately 800K reasoning examples. However, ablation studies showed that 80% of the benefit came from the first 100K examples. Beyond that, the returns diminished steeply. For most practitioners working with smaller budgets, 50K to 100K verified reasoning traces across diverse problem types represents the practical optimum.

Note

Generating 100K traces from 1,000 diverse problems is more valuable than 100K traces from 10,000 easy problems. The model benefits most from seeing varied reasoning strategies applied to problems at the edge of its capability. Easy problems where the model already achieves near-100% pass@1 contribute little to training. Focus your rejection sampling budget on problems where the model's pass rate is between 10% and 70%, because that is where the correct traces teach the most.

6. Common Pitfalls and Mitigations

Reward Hacking in Verification

Models can learn to game simple verification. For math problems, a model might generate a trace with incorrect reasoning but guess the correct numerical answer. Mitigation: verify not just the final answer but also key intermediate steps. For coding tasks, use diverse test cases rather than a single test.

Reasoning Collapse

When training on traces from a single teacher model, the student may learn a narrow set of reasoning patterns and fail on problems requiring different approaches. Mitigation: use multiple teacher models (for example, both o3-mini and DeepSeek-R1) and include traces with different reasoning styles for the same problem.

Length Bias

Rejection sampling naturally favors longer traces because they explore more possibilities and are more likely to stumble on the right answer. If uncontrolled, this produces a student that generates unnecessarily verbose reasoning. Mitigation: among correct traces for a given problem, prefer shorter ones. Alternatively, apply a length penalty during selection.

Fun Fact

When DeepSeek trained R1 with pure RL (no SFT warm-start), the model went through a "readability crisis" around step 5,000 of training. It discovered that mixing Chinese and English in its reasoning traces improved its reward, producing bilingual chain-of-thought that was effective but nearly impossible for humans to follow. The team had to add a formatting reward to encourage readable reasoning.

7. Practical Considerations for Practitioners

Cost Estimation

Generating reasoning data is expensive because each problem requires many samples. A rough cost model: generating 64 traces per problem with o3-mini at approximately 4,000 output tokens per trace costs about $0.50 per problem (at early 2025 pricing). For a 10,000-problem dataset, that is roughly $5,000 in API costs. Using open models like DeepSeek-R1 on your own hardware reduces this to compute costs only, but requires significant GPU resources.

When to Generate Reasoning Data

Synthetic reasoning data is most valuable when:

You need a small model (1B to 14B) that reasons well on a specific domain
Your task involves multi-step problem solving (math, code, planning, analysis)
You have a way to verify correctness automatically
You have access to a strong reasoning model as a teacher

It is less valuable when your task is primarily about style, tone, or creative generation, where "correctness" is subjective and verification is unreliable.

Self-Check

Q1: Why is rejection sampling the preferred method for generating reasoning data?

Show Answer

Rejection sampling generates many candidate traces per problem and keeps only those with verified-correct final answers. This works because a model's pass@K accuracy (probability of getting at least one correct answer in K attempts) is much higher than its pass@1 accuracy. For example, a model with 30% pass@1 might have 95% pass@64. By sampling many completions and filtering, you extract the model's best reasoning on problems where it usually fails, producing higher-quality training data than a single generation per problem.

Q2: What is the two-phase R1-style reasoning pipeline?

Show Answer

Phase 1 uses reinforcement learning (specifically GRPO) to develop reasoning capability in a large teacher model. The model receives outcome-based rewards (correct/incorrect) and spontaneously develops chain-of-thought, self-verification, and backtracking behaviors. Phase 2 distills this capability into smaller student models via supervised fine-tuning on the teacher's reasoning traces. The student cannot develop reasoning through RL alone (too small), but can learn to imitate it from high-quality traces.

Q3: Why does training on unverified reasoning traces produce worse models than verified traces?

Show Answer

Unverified traces include examples where the model reasons fluently but arrives at incorrect answers. Training on these examples teaches the student to produce confident, well-structured reasoning that reaches wrong conclusions. The model learns the form of reasoning without the substance. Experiments show a 15 to 20 percentage point gap on math benchmarks between models trained on verified vs. unverified traces of the same size.

Q4: What is the "diversity vs. volume" tradeoff in reasoning data?

Show Answer

Generating many traces from a small number of diverse, challenging problems is more valuable than generating traces from many easy problems. The model benefits most from problems at the edge of its capability (10% to 70% pass rate), where correct traces teach new reasoning strategies. Easy problems with near-100% pass rates contribute little. Approximately 80% of the benefit comes from the first 100K examples, with steep diminishing returns beyond that, so investing in problem diversity and difficulty has higher returns than pure volume.

Key Insight

Why reasoning data requires verification, not just generation. Unlike instruction-following data where plausible outputs suffice, reasoning data demands verifiable correctness. A fluent but wrong chain-of-thought actively harms the trained model by teaching it to produce confident mistakes. This is why the R1-style pipeline emphasizes verifiable rewards: math has calculators, code has test suites, and logic has formal checkers. For domains without automatic verifiers, you need stronger human review or RLVR techniques that can learn from outcome-based signals.

Key Takeaways

Reasoning data differs from standard synthetic data by including explicit thinking processes (chain-of-thought traces) that teach models how to decompose and verify multi-step problems.
Rejection sampling is the core technique: generate many traces per problem and keep only those with verified-correct answers. Pass@64 is dramatically higher than pass@1.
Verification is mandatory, not optional. Training on unverified traces teaches models to reason confidently toward wrong answers.
The R1-style pipeline (RL on a large teacher, then SFT distillation into smaller students) is the current state of the art for building small reasoning models.
SFT then GRPO outperforms either approach alone. Use teacher traces as a warm start, then refine with outcome-based RL.
Problem diversity and difficulty matter more than volume. Focus generation on problems where the model's pass rate is 10% to 70%.
Watch for pitfalls: reward hacking (correct answers from wrong reasoning), reasoning collapse (narrow strategies), and length bias (unnecessarily verbose traces).

Real-World Scenario: Building a Math Reasoning Dataset for a 7B Model

Who: An AI research team at an education technology company building a math tutoring assistant based on a 7B parameter open model.

Situation: The base model achieved only 28% on GSM8K (grade school math) and 12% on MATH (competition-level problems). They needed reasoning performance comparable to GPT-4o-mini for their product to be viable.

Problem: Human-written reasoning traces at the quality level needed would cost $15 to $25 per problem. At 50,000 problems, that amounted to $750,000 or more, far exceeding their budget.

Decision: They built a rejection sampling pipeline using DeepSeek-R1 as the teacher model. For each of 8,000 curated math problems spanning difficulty levels 1 through 5, they generated 64 reasoning traces and kept only those with verified correct final answers.

How: Problems were sourced from open datasets (GSM8K, MATH, AoPS) and filtered for difficulty using the base model's pass@1 rate, keeping only problems where pass@1 was between 5% and 60%. Verification used symbolic math checkers for numerical answers and unit test execution for code-based solutions. Among correct traces for each problem, the shortest trace was selected to avoid length bias.

Result: From 512,000 generated traces (8,000 problems times 64 samples), 89,000 passed verification. After deduplication and length filtering, 62,000 traces remained. The fine-tuned 7B model achieved 72% on GSM8K and 38% on MATH, outperforming the base model by 44 and 26 percentage points respectively. Total API cost was $4,800.

Lesson: Rejection sampling with strong verification produces reasoning data that transfers real capability; the keys are curating problems at the right difficulty level, using a strong teacher model, and selecting shorter correct traces to avoid teaching verbose reasoning.

Research Frontier

Several open research directions are shaping the next generation of synthetic reasoning data. Process reward models (PRMs) aim to verify individual reasoning steps rather than just the final answer, enabling finer-grained filtering of traces. Self-play reasoning generates problems and solutions simultaneously, allowing models to create their own curricula of increasingly difficult challenges. Multi-turn reasoning synthesis extends the pipeline to conversational settings where the model must reason across multiple exchanges.

Perhaps most exciting, early work on reasoning in non-verifiable domains (legal analysis, medical diagnosis, strategic planning) is exploring how to use structured rubrics and multi-judge panels to approximate the strong verification signal that math and code provide naturally.

Exercises

Exercise 13.7.1: Reasoning data characteristics Conceptual

What makes synthetic reasoning data different from generic synthetic text? Identify three key properties that reasoning training data must have.

Answer Sketch

1. Verifiability: each reasoning chain must lead to a provably correct answer (especially for math and code). 2. Step-level correctness: every intermediate step must be valid, not just the final answer. 3. Difficulty calibration: examples must span a range from simple to complex, with harder examples providing more reasoning steps. Unlike generic text, errors in reasoning data directly teach wrong reasoning patterns.

Exercise 13.7.2: Rejection sampling for reasoning Coding

Implement a rejection sampling loop that generates reasoning chains for math problems. Generate N candidate solutions per problem, verify the final answer against ground truth, and keep only correct chains.

Answer Sketch

For each problem with known answer: generate N=10 candidate solutions at temperature=0.7. Parse the final numerical answer from each. Keep only solutions where the parsed answer matches ground truth. If 0 solutions are correct, skip the problem. If multiple are correct, optionally keep the shortest (most efficient reasoning). This filters for correctness automatically, producing a dataset of verified reasoning chains.

Exercise 13.7.3: Distillation from reasoning models Conceptual

Explain how teams distill reasoning capabilities from large models (like o3 or DeepSeek-R1) into smaller models using synthetic reasoning traces. What is the typical pipeline?

Answer Sketch

Pipeline: (1) Curate a set of challenging problems (math, code, logic). (2) Have the large reasoning model generate detailed reasoning traces for each problem. (3) Filter traces for correctness (verify final answers). (4) Fine-tune a smaller model on the (problem, reasoning_trace, answer) tuples using SFT. The smaller model learns to mimic the reasoning patterns of the larger model. Key insight: the reasoning trace is the training signal, not just the final answer.

Exercise 13.7.4: GRPO and reward signals Analysis

Compare two approaches for training reasoning models: (a) SFT on filtered correct traces, and (b) GRPO with a verifiable reward signal. What are the advantages of each?

Answer Sketch

(a) SFT on correct traces: simpler to implement, produces consistent outputs, but the model only learns from successful examples and may not learn to recover from reasoning errors. (b) GRPO: the model explores different reasoning paths and receives binary rewards (correct/incorrect). It learns which strategies lead to correct answers and, crucially, learns to avoid wrong paths. GRPO typically produces stronger reasoners but requires more compute and careful hyperparameter tuning.

Exercise 13.7.5: Difficulty progression Coding

Write a function that categorizes math problems by difficulty level (easy/medium/hard) based on: number of reasoning steps needed, types of operations involved, and number of variables. Use this to create a balanced training curriculum.

Answer Sketch

Score each problem: steps = count sentences in the solution. Operations: +1 per unique operation type (addition, multiplication, fractions, exponents). Variables: count distinct variable names. Difficulty = weighted_sum(steps*0.4, operations*0.3, variables*0.3). Bin into thirds: bottom third = easy, middle = medium, top = hard. Create a balanced dataset by sampling equally from each bin. This ensures the model trains on the full difficulty spectrum.

What Comes Next

In the next chapter, Chapter 14: Fine-Tuning Fundamentals, we move from data creation to model adaptation, exploring how to take these curated datasets and use them to fine-tune LLMs for specific tasks and behaviors.

References and Further Reading

Foundational Reasoning Models

DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

The paper that introduced the RL-then-distillation pipeline for reasoning. Demonstrates that pure outcome-based RL produces emergent chain-of-thought, and that distilling these traces into small models yields surprisingly strong reasoning. Essential reading for this section's core concepts.

Paper

OpenAI. (2024). Learning to Reason with LLMs. OpenAI Blog.

OpenAI's technical overview of the o1 model family, which pioneered extended chain-of-thought reasoning at inference time. Provides context for why reasoning traces are valuable training data.

Blog

Synthetic Data and Rejection Sampling

Yuan, Z. et al. (2023). Scaling Relationship on Learning Mathematical Reasoning with Large Language Models.

Systematic study of how rejection sampling quality and quantity affect downstream reasoning performance. Provides the empirical basis for the quality vs. quantity tradeoffs discussed in this section.

Paper

Zelikman, E. et al. (2022). STaR: Bootstrapping Reasoning With Reasoning. NeurIPS 2022.

Introduces the Self-Taught Reasoner approach where a model generates rationales, filters for correct ones, and trains on them iteratively. A precursor to the rejection sampling pipelines described here.

Paper

Singh, A. et al. (2023). Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models.

Google DeepMind's work on using rejection sampling and self-training to improve mathematical reasoning. Shows that synthetic data generated from the model itself can push performance beyond what human-authored data provides.

Paper

Process Rewards and Verification

Lightman, H. et al. (2023). Let's Verify Step by Step. ICLR 2024.

Introduces process reward models that evaluate each reasoning step rather than just the final answer. Demonstrates that process-level supervision produces more reliable reasoning than outcome-only rewards. Directly relevant to the verification strategies discussed in this section.

Paper

Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.

Details GRPO (Group Relative Policy Optimization), the RL algorithm used in the R1 pipeline. Important for understanding the training methodology behind reasoning capability development.

Paper