Part 4: Training and Adapting
Chapter 17: Alignment, RLHF, and DPO

RLVR: Reinforcement Learning with Verifiable Rewards

"If the answer can be checked by a compiler or a proof assistant, why bother asking a human? Let the math grade itself."

— A Compiler-Trusting AI Agent
Big Picture

RLVR removes humans from the reward loop entirely by using objectively verifiable correctness signals. For domains like mathematics, programming, and formal proofs, the correctness of an answer can be checked automatically: a math solution is either right or wrong, code either passes tests or fails, and a proof either verifies or does not. RLVR exploits this property to train reasoning models at massive scale without any human preference data. This paradigm powered DeepSeek-R1 and has sparked an open ecosystem of reasoning models that achieve frontier-level performance on mathematical and coding benchmarks. The test-time compute scaling paradigm from Section 07.3 provides the broader context for why reasoning models are so valuable.

Prerequisites

Before starting, make sure you are familiar with alignment fundamentals as covered in Section 17.1: RLHF: Reinforcement Learning from Human Feedback.

Common Mistake: Assuming RLVR Transfers to Open-Ended Tasks

RLVR works because it has access to an objective correctness oracle: a math checker, a test suite, or a proof verifier. Some teams see the impressive reasoning gains from RLVR and attempt to apply the same approach to open-ended tasks (summarization, creative writing, customer service) by constructing proxy verification functions. These proxy verifiers are just reward models by another name, and they inherit all the reward hacking problems that RLVR was designed to avoid. If your "verifier" is a heuristic or another LLM judging quality, you are doing RLHF with extra steps, not RLVR. Reserve RLVR for domains where correctness is truly binary and machine-checkable.

1. The Verifiable Reward Paradigm

Standard RLHF relies on a learned reward model trained on human preferences. This reward model is imperfect: it can be gamed, it introduces noise, and it caps the alignment quality at the level of annotator agreement. For domains where correctness is objectively verifiable, we can bypass the reward model entirely and use the ground truth as the reward signal.

The key insight behind RLVR is simple: if you can write a function that checks whether an answer is correct, you have a perfect reward signal. No reward model training, no human annotation, no preference noise. The reward is binary (correct or incorrect) or graded (partially correct), and it is always accurate. Figure 17.4.1 contrasts the RLHF and RLVR reward signal paths.

Key Insight

RLVR flips the hardest problem in alignment on its head. In RLHF, the reward signal is expensive, noisy, and biased (human annotators disagree). In RLVR, the reward signal is free, exact, and infinitely scalable. The catch is that RLVR only works for domains where correctness can be verified automatically: math (check the answer), code (run the tests), formal logic (verify the proof). For open-ended tasks like creative writing or ethical reasoning, there is no verification oracle, so RLHF and DPO remain necessary.

Mental Model: The Auto-Graded Math Exam

Think of RLVR as training with an auto-graded math exam instead of essay grading by humans. The model generates a solution (the exam answer), and a verifier checks whether the final answer matches the known correct answer. No human preferences are needed: the reward is binary (correct or incorrect) and objective. This works beautifully for math, code, and logic puzzles where correctness is verifiable, but it cannot directly train for subjective qualities like helpfulness or tone.

Figure 17.4.1: RLHF uses a learned (noisy) reward model. RLVR uses verifiable correctness checks, producing exact reward signals without human annotation.

1.1 Types of Verifiable Rewards

| Domain | Verification Method | Reward Type | Dataset Examples |
|---|---|---|---|
| Mathematics | Compare to known answer; symbolic checking | Binary (correct/wrong) | GSM8K, MATH, AIME |
| Code generation | Execute against test suite | Graded (pass@k tests) | HumanEval, MBPP, SWE-bench |
| Formal proofs | Lean/Coq/Isabelle type checker | Binary (verifies/fails) | miniF2F, ProofNet |
| Logic puzzles | Constraint satisfaction check | Binary (satisfies/violates) | FOLIO, ProntoQA |
| Format compliance | Regex, JSON schema validation | Binary (valid/invalid) | Custom structured output tasks |

The following code demonstrates verifiable reward functions for each of these domains. Code Fragment 17.4.1 implements binary math checking, graded code evaluation via test suites, and formal proof verification using the Lean 4 type checker.

# Verifiable reward functions for different domains
import re
import subprocess
from typing import Optional

def math_reward(
    model_answer: str,
    ground_truth: str,
    tolerance: float = 1e-6,
) -> float:
    """Binary reward for math: 1.0 if correct, 0.0 otherwise."""
    # Extract the final answer from the model output
    extracted = extract_final_answer(model_answer)
    if extracted is None:
        return 0.0

    try:
        model_val = float(extracted)
        truth_val = float(ground_truth)
        return 1.0 if abs(model_val - truth_val) < tolerance else 0.0
    except ValueError:
        # String comparison for symbolic answers
        return 1.0 if extracted.strip() == ground_truth.strip() else 0.0

def code_reward(
    generated_code: str,
    test_cases: list,
    timeout: int = 10,
) -> float:
    """Graded reward for code: fraction of tests passed."""
    passed = 0
    for test in test_cases:
        full_code = generated_code + "\n" + test["test_code"]
        try:
            result = subprocess.run(
                ["python", "-c", full_code],
                capture_output=True,
                timeout=timeout,
                text=True,
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            continue

    return passed / len(test_cases) if test_cases else 0.0

def proof_reward(
    proof_text: str,
    theorem_statement: str,
    lean_project_path: str,
) -> float:
    """Binary reward for Lean 4 proofs."""
    # Write the proof to a temporary Lean file
    lean_code = f"{theorem_statement}\n{proof_text}"
    with open(f"{lean_project_path}/Temp.lean", "w") as f:
        f.write(lean_code)

    result = subprocess.run(
        ["lake", "build", "Temp"],
        capture_output=True,
        cwd=lean_project_path,
        timeout=60,
    )
    return 1.0 if result.returncode == 0 else 0.0

def extract_final_answer(text: str) -> Optional[str]:
    """Extract the boxed answer from a math response."""
    # Look for \boxed{...} first, then "the answer is ..." phrasings
    boxed = re.findall(r"\\boxed\{([^}]+)\}", text)
    if boxed:
        return boxed[-1]

    answer_pattern = re.findall(
        r"(?:the answer is|therefore|thus)[:\s]+([^\n.]+)",
        text, re.IGNORECASE,
    )
    if answer_pattern:
        return answer_pattern[-1].strip()

    return None
Code Fragment 17.4.1: Verifiable reward functions for different domains
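As a quick sanity check, the math-checking logic can be exercised in isolation. The snippet below is a stripped-down, self-contained re-implementation of the boxed-answer check (a toy sketch for illustration, not the fragment's exact code):

```python
import re

def check_math_answer(response: str, ground_truth: str, tol: float = 1e-6) -> float:
    """Toy binary math reward: extract the last \\boxed{...} answer and compare."""
    boxed = re.findall(r"\\boxed\{([^}]+)\}", response)
    if not boxed:
        return 0.0
    try:
        return 1.0 if abs(float(boxed[-1]) - float(ground_truth)) < tol else 0.0
    except ValueError:
        # Fall back to exact string match for symbolic answers
        return 1.0 if boxed[-1].strip() == ground_truth.strip() else 0.0

print(check_math_answer(r"Thus the answer is \boxed{48}.", "48"))    # 1.0
print(check_math_answer(r"Thus the answer is \boxed{47.9}.", "48"))  # 0.0
```

Note that a missing boxed answer yields reward 0.0 rather than an error: during RL training, a solution that fails to state its answer in the expected format is simply treated as incorrect.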

Code Fragment 17.4.2 shows how GRPO combines verifiable rewards with group-level normalization for math reasoning training.

# GRPO with Verifiable Rewards for Math Reasoning
import torch
from typing import List

def grpo_math_training_step(
    policy_model,
    ref_model,
    tokenizer,
    math_problems: List[dict],  # {"prompt": ..., "answer": ...}
    group_size: int = 16,
    beta: float = 0.04,
    clip_range: float = 0.2,
):
    """
    One GRPO training step using verifiable math rewards.

    For each problem, generate group_size solutions,
    check correctness, normalize rewards within the group,
    and compute the policy gradient.

    Assumes a helper compute_logprobs(model, input_ids, sequence)
    that returns the summed token log-probabilities of the sequence.
    """
    total_loss = 0.0
    total_correct = 0
    total_generated = 0

    for problem in math_problems:
        prompt = problem["prompt"]
        ground_truth = problem["answer"]
        input_ids = tokenizer.encode(prompt, return_tensors="pt")

        # Generate a group of solutions
        solutions = []
        for _ in range(group_size):
            output = policy_model.generate(
                input_ids,
                max_new_tokens=1024,
                do_sample=True,
                temperature=0.7,
                top_p=0.95,
            )
            solutions.append(output[0])

        # Compute verifiable rewards (binary: correct or not)
        rewards = torch.tensor([
            math_reward(tokenizer.decode(s), ground_truth)
            for s in solutions
        ])

        total_correct += rewards.sum().item()
        total_generated += group_size

        # Skip if all correct or all wrong (no learning signal)
        if rewards.sum() == 0 or rewards.sum() == group_size:
            continue

        # Group normalization (GRPO core idea)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Compute the policy gradient for each solution
        for solution, advantage in zip(solutions, advantages):
            policy_logprobs = compute_logprobs(
                policy_model, input_ids, solution
            )
            with torch.no_grad():
                ref_logprobs = compute_logprobs(
                    ref_model, input_ids, solution
                )

            # PPO-style clipped objective
            ratio = torch.exp(policy_logprobs - ref_logprobs)
            surr1 = ratio * advantage
            surr2 = torch.clamp(
                ratio, 1 - clip_range, 1 + clip_range
            ) * advantage
            policy_loss = -torch.min(surr1, surr2).mean()

            # KL penalty keeps the policy close to the reference model
            kl = (ref_logprobs - policy_logprobs).mean()
            total_loss += policy_loss + beta * kl

    accuracy = total_correct / total_generated if total_generated > 0 else 0
    return total_loss / len(math_problems), accuracy
Code Fragment 17.4.2: GRPO training step with verifiable math rewards. For each problem, the function generates a group of solutions, checks each against the ground truth, and normalizes rewards within the group. Solutions that are correct receive positive advantage; incorrect ones receive negative advantage.
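The group-normalization step at the heart of GRPO is easy to verify by hand. The following dependency-free sketch uses plain Python rather than torch (statistics.stdev computes the sample standard deviation, matching torch.Tensor.std's default):

```python
from statistics import mean, stdev

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style normalization: advantage = (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of 8 solutions where 2 are verifiably correct
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
# Correct solutions receive a positive advantage, incorrect ones negative,
# and the advantages sum to (approximately) zero — the group is its own baseline.
```

This is why no learned value network is needed: with binary verifiable rewards, the group mean already tells you how much better than "typical" each solution is.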

Code Fragment 17.4.3 demonstrates the full four-stage DeepSeek-R1 style RLVR training pipeline.

# Simplified DeepSeek-R1 style training pipeline
from dataclasses import dataclass

@dataclass
class RLVRPipelineConfig:
    """Configuration for a full RLVR training pipeline."""
    base_model: str = "deepseek-ai/DeepSeek-V3-Base"
    cold_start_data: str = "path/to/reasoning_examples.json"
    math_dataset: str = "hendrycks/MATH"
    code_dataset: str = "openai/humaneval"

    # GRPO hyperparameters
    group_size: int = 16
    beta: float = 0.04
    clip_range: float = 0.2
    learning_rate: float = 1e-6

    # Training schedule
    cold_start_epochs: int = 2
    rl_steps: int = 10000
    rejection_sampling_n: int = 64

def run_rlvr_pipeline(config: RLVRPipelineConfig):
    """Run the full RLVR pipeline (simplified).

    Assumes helpers (load_json, sft_train, load_dataset, sample_mixed_batch,
    load_diverse_prompts, score_response, final_alignment_rl) plus ref_model
    and tokenizer are defined elsewhere.
    """

    # Stage 1: Cold-start SFT
    print("Stage 1: Cold-start SFT on reasoning examples")
    cold_start_data = load_json(config.cold_start_data)
    model = sft_train(
        config.base_model,
        cold_start_data,
        epochs=config.cold_start_epochs,
    )

    # Stage 2: RLVR with GRPO
    print("Stage 2: RLVR training with verifiable rewards")
    math_problems = load_dataset(config.math_dataset)
    code_problems = load_dataset(config.code_dataset)

    for step in range(config.rl_steps):
        # Sample a batch of problems (mix of math and code)
        batch = sample_mixed_batch(math_problems, code_problems)

        # GRPO step with verifiable rewards
        loss, accuracy = grpo_math_training_step(
            policy_model=model,
            ref_model=ref_model,
            tokenizer=tokenizer,
            math_problems=batch,
            group_size=config.group_size,
            beta=config.beta,
        )

        if step % 100 == 0:
            print(f"  Step {step}: loss={loss:.4f}, acc={accuracy:.3f}")

    # Stage 3: Rejection sampling + SFT
    print("Stage 3: Rejection sampling on diverse prompts")
    diverse_prompts = load_diverse_prompts()
    best_outputs = []
    for prompt in diverse_prompts:
        candidates = [
            model.generate(prompt) for _ in range(config.rejection_sampling_n)
        ]
        # Score the candidates and keep the best one
        scored = [(c, score_response(c)) for c in candidates]
        best = max(scored, key=lambda x: x[1])
        best_outputs.append({"prompt": prompt, "response": best[0]})

    model = sft_train(model, best_outputs, epochs=1)

    # Stage 4: Final alignment RL
    print("Stage 4: General alignment with helpfulness/safety rewards")
    # Mix verifiable rewards with general preference rewards
    model = final_alignment_rl(model)

    return model
Code Fragment 17.4.3: Simplified DeepSeek-R1 style four-stage RLVR training pipeline: cold-start SFT, GRPO with verifiable rewards, rejection sampling plus SFT, and final alignment RL
Key Insight

Verifiable rewards bypass the fundamental limitation of human preference data. In standard RLHF, the quality ceiling is the reward model, which is only as good as the human annotations it was trained on. For math and code, RLVR replaces this noisy human signal with a perfect one: the answer is either correct or incorrect, verifiable by a calculator or test suite. This is why DeepSeek-R1 achieved such remarkable reasoning gains; it could train on millions of verifiable examples without any human annotation bottleneck. The open research question is how to extend verifiable rewards to domains like creative writing, ethical reasoning, and open-ended problem-solving where "correct" has no formal definition.

4. Extensions Beyond Math and Code

While RLVR has been most successful in mathematics and coding, researchers are actively exploring extensions to other domains. The key challenge is constructing reliable verifiers for domains where correctness is less clearly defined.

Warning

Extending RLVR beyond math and code is challenging because most real-world tasks lack clean verifiable signals. A customer service response cannot be automatically graded as "correct" or "incorrect" in the same way a math solution can. Hybrid approaches that combine verifiable rewards for structured components (factual claims, format compliance) with learned rewards for subjective quality are a promising research direction.

Domain Comparison

| Domain | Verification Approach | Feasibility | Challenge |
|---|---|---|---|
| Fact-checking | Cross-reference with knowledge base | Medium | Knowledge base completeness |
| Translation | Round-trip translation consistency | Medium | Many valid translations exist |
| Structured output | Schema validation (JSON, XML) | High | Format correctness is not content correctness |
| Tool use | Execute tool call, check result | High | Side effects, environment setup |
| Scientific reasoning | Dimensional analysis, unit checking | Medium | Many steps resist automation |

5. The Open Reasoning Ecosystem

DeepSeek-R1's success with RLVR sparked rapid reproduction efforts across the open-source community. Several projects have demonstrated that the core approach can be replicated at smaller scales with impressive results. Code Fragment 17.4.4 demonstrates inference with an open reasoning model that exposes its chain-of-thought tokens.

# Using open reasoning models for inference
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an open reasoning model
model_name = "Qwen/QwQ-32B"  # Open reasoning model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="bfloat16", device_map="auto"
)

# Reasoning models produce extended thinking chains
prompt = """Solve this step by step:
A train travels from City A to City B at 60 km/h. The return trip
is made at 40 km/h. What is the average speed for the round trip?"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,
    top_p=0.95,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
# The model will produce a detailed reasoning chain:
# "Let me think about this carefully...
# Let the distance be d km.
# Time for A to B: d/60 hours
# Time for B to A: d/40 hours
# Total distance: 2d
# Total time: d/60 + d/40 = d(2+3)/120 = 5d/120 = d/24
# Average speed: 2d / (d/24) = 48 km/h
# Wait, let me verify: ..."
Code Fragment 17.4.4: Using open reasoning models for inference
Key Insight

The open reasoning ecosystem demonstrates a remarkable pattern: once a training recipe is published, the community can reproduce and extend it rapidly. Sky-T1 (from the NovaSky team) reproduced strong reasoning performance for under $500 of compute. Open-source reasoning models now match or exceed GPT-4o on mathematical reasoning benchmarks, showing that RLVR combined with sufficient base model quality and training data can achieve frontier performance without proprietary infrastructure.

5.1 Open Reasoning Models Comparison

| Model | Base Size | Training Method | MATH Score | AIME 2024 |
|---|---|---|---|---|
| DeepSeek-R1 | 671B MoE | GRPO + RLVR (4 stages) | 97.3% | 79.8% |
| QwQ-32B | 32B | RLVR + SFT | 90.6% | 50.0% |
| DeepSeek-R1-Distill-32B | 32B (distilled) | SFT from R1 outputs | 94.3% | 72.6% |
| Sky-T1-32B | 32B | SFT on QwQ reasoning traces | 82.4% | 43.3% |
| OpenAI o1 | Unknown | Proprietary RL | 96.4% | 83.3% |
Real-World Scenario: Building a math tutoring assistant with RLVR

An edtech company trains a 7B-parameter model to solve algebra and calculus problems. They use GRPO with a verifier that checks numerical answers against a solution key (binary reward) and a format verifier that ensures the model shows its work in a step-by-step chain-of-thought format. After 3 epochs of RLVR training on 50,000 problems from GSM8K and MATH, the model improves from 45% to 78% accuracy on held-out problems. The total training cost is approximately $200 on rented A100 GPUs, compared to the $50,000+ estimated cost of collecting equivalent human preference data for RLHF-based training.
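A reward like the one in this scenario can be composed from two verifiable checks: one for the numeric answer and one for step-by-step formatting. The sketch below is hypothetical; the 0.8/0.2 weighting and the "Step N:" convention are illustrative choices, not from any published recipe:

```python
import re

def answer_reward(response: str, ground_truth: float, tol: float = 1e-6) -> float:
    """1.0 if the final \\boxed{...} answer matches the solution key."""
    boxed = re.findall(r"\\boxed\{([^}]+)\}", response)
    try:
        return 1.0 if boxed and abs(float(boxed[-1]) - ground_truth) < tol else 0.0
    except ValueError:
        return 0.0

def format_reward(response: str, min_steps: int = 2) -> float:
    """1.0 if the response shows its work as numbered steps."""
    steps = re.findall(r"^Step \d+:", response, re.MULTILINE)
    return 1.0 if len(steps) >= min_steps else 0.0

def combined_reward(response: str, ground_truth: float) -> float:
    # Correctness dominates; format compliance is a smaller shaping term
    return 0.8 * answer_reward(response, ground_truth) + 0.2 * format_reward(response)

solution = "Step 1: 2x = 10\nStep 2: x = 5\nThe answer is \\boxed{5}"
print(combined_reward(solution, 5.0))  # 1.0
```

Keeping the format term small matters: if format compliance were weighted too heavily, the policy could collect reward by producing well-structured but wrong solutions.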

Note

Distillation from reasoning models (like DeepSeek-R1-Distill) offers a practical shortcut: instead of running the full RLVR pipeline, you can fine-tune a smaller model on the reasoning traces produced by a larger one. This approach achieves strong results at a fraction of the training cost, though the distilled model may struggle on problems significantly harder than those in the training set.
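The distillation shortcut described in the note amounts to turning (prompt, reasoning trace) pairs from the teacher into ordinary SFT examples. A minimal sketch of the data-formatting step (the field names and record layout are illustrative assumptions, not a standard format):

```python
def build_distillation_dataset(traces: list) -> list:
    """Convert teacher reasoning traces into SFT records for a student model.

    Each trace is assumed to look like:
    {"prompt": ..., "reasoning": ..., "answer": ...}
    """
    records = []
    for t in traces:
        # Train the student to reproduce the full reasoning chain, not just the answer
        target = f"{t['reasoning']}\n\nFinal answer: {t['answer']}"
        records.append({"prompt": t["prompt"], "completion": target})
    return records

teacher_traces = [
    {"prompt": "What is 12 * 7?", "reasoning": "12 * 7 = 84.", "answer": "84"},
]
dataset = build_distillation_dataset(teacher_traces)
```

The resulting records feed directly into a standard SFT loop; no verifier or RL machinery is needed, which is where the cost savings come from.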

Self-Check
Q1: What makes RLVR fundamentally different from RLHF?
Show Answer
RLVR uses objectively verifiable correctness signals (math answer checking, code test execution, proof verification) instead of a learned reward model trained on human preferences. This eliminates reward model noise, removes the need for human annotation, and provides a perfect reward signal for domains where correctness can be automatically checked.
Q2: Why is GRPO particularly well-suited for RLVR?
Show Answer
GRPO generates a group of responses and normalizes rewards within the group, eliminating the need for a learned value network. With verifiable rewards, the group contains solutions that are definitively correct or incorrect. The group statistics provide a natural baseline: in a group of 16 math solutions where 4 are correct, the correct ones get positive advantage and incorrect ones get negative advantage, without any learned approximation.
Q3: What was the "aha moment" in DeepSeek-R1 training?
Show Answer
During RLVR training, the model spontaneously began producing extended chain-of-thought reasoning, including self-correction and backtracking, despite never being trained on chain-of-thought examples. The RL objective (maximize correctness) provided sufficient signal for the model to discover that explicit step-by-step reasoning improves accuracy. This emergent behavior was not programmed but arose naturally from the optimization pressure.
Q4: Why is extending RLVR to open-ended tasks difficult?
Show Answer
Most real-world tasks (customer service, creative writing, general conversation) lack objectively verifiable correctness signals. You cannot write a function that definitively grades whether a customer service response is "correct." Hybrid approaches combining verifiable components (fact-checking, format compliance) with learned reward models for subjective quality are a promising research direction.
Q5: How does distillation from reasoning models compare to full RLVR training?
Show Answer
Distillation fine-tunes a smaller model on the reasoning traces of a larger RLVR-trained model. It achieves strong results (e.g., DeepSeek-R1-Distill-32B reaches 94.3% on MATH) at a fraction of the training cost. However, distilled models may generalize less well to problems harder than those in the training set, since they learn to imitate reasoning patterns rather than discovering them through RL optimization pressure.

Fun Fact

DeepSeek-R1 demonstrated that RLVR can teach models to reason through complex math problems by rewarding correct final answers. The model spontaneously developed chain-of-thought reasoning without being explicitly taught to do so, which is either impressive emergent behavior or mildly unsettling, depending on your disposition.

Research Frontier

RLVR (Reinforcement Learning with Verifiable Rewards) is emerging as a powerful paradigm for domains with automated evaluation, where formal verification replaces human preference judgments. Research on outcome-based RL for code and math (as demonstrated by DeepSeek-R1 and OpenAI's o1) shows that verifiable reward signals can teach models sophisticated reasoning strategies, some of which appear to transfer to unverifiable domains.

The open frontier is extending RLVR beyond code and mathematics to domains like scientific reasoning, where partial verification is possible through simulation or experimental validation.

Exercises

Exercise 17.4.1: Verifiable rewards concept Conceptual

Explain what makes a reward 'verifiable' in RLVR. Give three examples of tasks with verifiable rewards and three examples where rewards are not easily verifiable.

Answer Sketch

A verifiable reward can be checked automatically with 100% accuracy. Verifiable: (1) math problems (compare to known answer), (2) code generation (run test cases), (3) formal proofs (check with a proof verifier). Not verifiable: (1) essay quality (subjective), (2) helpfulness (depends on user context), (3) creative writing (no objective standard). RLVR is only applicable to domains where correctness is binary and checkable by a program.

Exercise 17.4.2: GRPO algorithm Conceptual

Describe the GRPO (Group Relative Policy Optimization) algorithm used in DeepSeek-R1. How does it differ from PPO in handling reward signals?

Answer Sketch

GRPO samples a group of N completions for each prompt, scores each with the verifiable reward (correct/incorrect), and updates the policy to increase the probability of correct solutions relative to incorrect ones within the group. Unlike PPO, GRPO does not need a value network (critic) because it uses the group's average reward as the baseline. This simplifies training (no critic training) and scales better because the reward signal is binary and exact, not estimated.

Exercise 17.4.3: Reward function implementation Coding

Write a reward function for a math reasoning task that: extracts the final answer from the model's response, compares it to ground truth, and returns 1.0 for correct and 0.0 for incorrect.

Answer Sketch

Parse the answer: def extract_answer(response): match = re.search(r'\\boxed\{(.+?)\}', response); return match.group(1) if match else None. Reward: def reward_fn(response, ground_truth): predicted = extract_answer(response); return 1.0 if predicted and predicted.strip() == ground_truth.strip() else 0.0. For robustness, also handle alternative answer formats and normalize numbers (e.g., '0.50' == '0.5').

Exercise 17.4.4: Emergent reasoning Analysis

Explain how DeepSeek-R1-Zero demonstrated that reasoning capabilities can emerge from pure RL training without any SFT. What behaviors emerged, and why was this surprising?

Answer Sketch

DeepSeek-R1-Zero was trained with RLVR directly on the base model (no SFT stage). Surprisingly, it developed chain-of-thought reasoning, self-verification (checking its own answers), and problem decomposition spontaneously. This was surprising because these behaviors were not demonstrated in any training data. The model discovered that generating reasoning steps before answering increased its reward rate. This suggests that reasoning is a naturally advantageous strategy that RL can discover, not something that must be explicitly taught.

Exercise 17.4.5: RLVR limitations Discussion

RLVR works well for math and code but faces challenges in other domains. Identify three domains where creating verifiable rewards is difficult and propose a partial solution for one.

Answer Sketch

Difficult domains: (1) Summarization: no single correct summary exists. (2) Translation: multiple valid translations, quality is a spectrum. (3) Dialogue: helpfulness depends on user intent, which is ambiguous. Partial solution for summarization: combine verifiable sub-rewards (factual accuracy via NLI checks, coverage via keyword overlap) with a learned reward model for fluency. The verifiable components constrain the reward model, reducing the space for reward hacking while still capturing subjective quality aspects.

What Comes Next

In the next chapter, Chapter 18: Interpretability & Mechanistic Understanding, we learn to look inside the black box of neural networks.

References & Further Reading
RLVR & Reasoning Models

Guo, D., Yang, D., Zhang, H., et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

The paper that demonstrated RLVR can induce emergent chain-of-thought reasoning in LLMs. Describes the four-stage training pipeline (cold-start SFT, GRPO with verifiable rewards, rejection sampling, final alignment) that became the standard approach for training reasoning models.

📄 Paper

Shao, Z., Wang, P., Zhu, Q., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.

Introduces GRPO (Group Relative Policy Optimization) for mathematical reasoning. Demonstrates that replacing the value network with group-level reward normalization cuts memory costs while maintaining training quality. A key building block for subsequent RLVR work.

📄 Paper
Process Reward Models & Verification

Lightman, H., Kosaraju, V., Burda, Y., et al. (2024). Let's Verify Step by Step. ICLR 2024.

OpenAI's landmark paper on process-based supervision for mathematical reasoning. Shows that per-step verification substantially outperforms outcome-only verification, establishing the foundation for RLVR's approach to credit assignment in multi-step reasoning.

📄 Paper

Wang, P., Li, L., Shao, Z., et al. (2024). Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. ACL 2024.

Proposes automated step-level verification for mathematical reasoning without requiring human annotations. Demonstrates that process rewards can be generated automatically using outcome-based signals, bridging the gap between outcome and process supervision.

📄 Paper

Cobbe, K., Kosaraju, V., Bavarian, M., et al. (2021). Training Verifiers to Solve Math Word Problems.

Early work on training verifiers for mathematical problem solving. Introduces the GSM8K benchmark and the concept of using verification models to filter candidate solutions, a precursor to the RLVR paradigm.

📄 Paper

Uesato, J., Kushman, N., Kumar, R., et al. (2022). Solving math word problems with process- and outcome-based feedback.

Systematic comparison of process-based versus outcome-based reward signals for mathematical reasoning. Provides the theoretical and empirical foundation for understanding when step-level verification outperforms final-answer checking.

📄 Paper