AlphaProof did not learn mathematics from a textbook. It learned mathematics from itself, one self-play match at a time, until it could prove things that humans only suspected.
Sage, Theorem-Curious AI Agent
This section continues from Section 8.6, which introduced the intersection of LLMs and formal logic, LeanDojo's retrieval-augmented proving, and the standard formal-math benchmarks (miniF2F, ProofNet, LeanDojo Bench). Here we turn to the result that put formal proving on the front page: DeepMind's AlphaProof at IMO 2024. We then look at the dataset extraction workflows, the RL paradigms that make self-play possible against a perfect verifier, and the evaluation metrics that distinguish formal proving from informal reasoning.
Prerequisites
This section continues from Section 8.6. Familiarity with LeanDojo, the miniF2F and ProofNet benchmarks, and the RLVR / process reward model discussion from Section 8.3 is assumed.
AlphaProof's silver-medal IMO performance in 2024 was achieved with one tiny asterisk: the system was given three days of compute per problem, while human contestants get 4.5 hours. The math is correct, the proofs are valid, and the model genuinely solved problems it had never seen. But anyone tempted to declare automated theorem proving solved should remember that the gold-medal humans did the same job 24 times faster. Compute is a tax, and AlphaProof paid it generously.
Building on the proof-assistant infrastructure from Section 8.6, we now look at how reinforcement learning against a perfect verifier produces a system that can compete on the IMO, and at what to measure when comparing formal provers.
8.6.4 AlphaProof: Reinforcement Learning for Formal Mathematics
DeepMind's AlphaProof (2024) demonstrated that combining LLMs with reinforcement learning and formal verification can solve competition-level mathematics. At the 2024 International Mathematical Olympiad (IMO), AlphaProof solved 4 out of 6 problems, achieving a score equivalent to a silver medal. This was a landmark result: the first time an AI system performed competitively on the full IMO, widely considered one of the hardest mathematical competitions in the world.
AlphaProof's architecture combines three components. First, a language model (Gemini) translates informal problem statements into formal Lean 4 specifications. Second, a value network (trained via self-play) estimates the probability of completing a proof from any given state. Third, an MCTS-style search (building on the AlphaZero framework) explores the space of possible tactic applications, guided by the value network and a policy network that suggests promising tactics.
The training loop follows a self-play paradigm. AlphaProof generates proof attempts for a large library of formalized problems, the Lean verifier checks which proofs are valid, and the model is updated via reinforcement learning on the successful proofs. Over many iterations, the model learns increasingly sophisticated proving strategies. Crucially, because the verifier is perfect (a valid Lean proof is mathematically correct by definition), there is no reward hacking: every proof the model learns from is genuinely correct.
# AlphaProof-style self-play training loop (conceptual)
# Combines MCTS search with formal verification
from dataclasses import dataclass
from typing import List, Optional, Tuple
import random
@dataclass
class ProofNode:
"""A node in the proof search tree."""
state: str # Current Lean proof state
tactic: str # Tactic that led to this state
visits: int = 0
value: float = 0.0 # Estimated probability of proof completion
children: list = None
def __post_init__(self):
if self.children is None:
self.children = []
class AlphaProofSearch:
"""MCTS-guided proof search with formal verification."""
def __init__(self, policy_net, value_net, lean_verifier):
self.policy = policy_net # Suggests tactics
self.value = value_net # Estimates proof completion prob
self.verifier = lean_verifier # Lean 4 type checker
self.c_puct = 1.5 # Exploration constant
def search(self, root_goal: str, simulations: int = 800) -> Optional[str]:
"""Run MCTS to find a proof for the given goal."""
root = ProofNode(state=root_goal, tactic="")
for _ in range(simulations):
# Selection: traverse tree using UCB
node, path = self._select(root)
# Expansion: generate candidate tactics
tactics = self.policy.suggest_tactics(node.state, k=32)
for tactic in tactics:
# Apply tactic in Lean and get resulting state
result = self.verifier.apply_tactic(node.state, tactic)
if result.is_proof_complete:
return self._extract_proof(path, tactic)
if result.is_valid:
child = ProofNode(
state=result.new_goal, tactic=tactic
)
child.value = self.value.estimate(result.new_goal)
node.children.append(child)
# Backpropagation: update values along the path
self._backpropagate(path)
return None # No proof found within budget
def _select(self, root: ProofNode) -> Tuple[ProofNode, list]:
"""Select a leaf node using UCB1."""
node = root
path = [node]
while node.children:
node = max(node.children, key=lambda c: self._ucb(c, node))
path.append(node)
return node, path
def _ucb(self, child: ProofNode, parent: ProofNode) -> float:
exploitation = child.value
exploration = self.c_puct * (parent.visits ** 0.5) / (1 + child.visits)
return exploitation + exploration
def _backpropagate(self, path: list):
for node in reversed(path):
node.visits += 1
def _extract_proof(self, path: list, final_tactic: str) -> str:
tactics = [n.tactic for n in path if n.tactic] + [final_tactic]
return "\n".join(tactics)
AlphaProof's architecture is not publicly available in full detail. The description here is based on DeepMind's blog post and the clear parallels with AlphaZero. The key architectural decisions (MCTS + value network + formal verifier) are well established; the proprietary details concern the specific model architecture, training data, and search hyperparameters. Reproducing the full system would require significant compute, but the conceptual framework can be applied at smaller scale using open tools like LeanDojo and ReProver.
8.6.5 Dataset Extraction and Proof Assistant Workflows
Building a formal proving system requires extracting structured data from proof assistant libraries. The workflow involves three stages: compiling the library to capture proof states, parsing tactic applications into (state, action, result) triples, and splitting data to avoid leakage between train and test sets.
# Lean 4 data extraction workflow (simplified)
# In practice, LeanDojo handles the Lean compiler integration
import json
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import List
@dataclass
class TacticStep:
"""One step in a Lean proof, extracted from the compiler trace."""
theorem_name: str
step_index: int
goal_before: str
tactic_text: str
goal_after: str
premises_used: List[str]
def extract_proof_data(lean_project_path: str) -> List[TacticStep]:
"""
Extract tactic-level proof data from a Lean 4 project.
In practice, this uses LeanDojo's tracing infrastructure
which instruments the Lean compiler to capture proof states.
"""
steps = []
# LeanDojo traces through the Lean compilation process
# and captures every tactic application with its context
# Pseudocode for the extraction loop:
traced_files = trace_lean_project(lean_project_path)
for file_data in traced_files:
for theorem in file_data["theorems"]:
for i, step in enumerate(theorem["proof_steps"]):
steps.append(TacticStep(
theorem_name=theorem["name"],
step_index=i,
goal_before=step["goal_before"],
tactic_text=step["tactic"],
goal_after=step["goal_after"],
premises_used=step["premises"],
))
return steps
def split_by_theorem(
steps: List[TacticStep], test_fraction: float = 0.1
) -> dict:
"""
Split data by theorem (not by step) to prevent leakage.
All steps from one theorem go to the same split.
"""
theorem_names = list(set(s.theorem_name for s in steps))
# Deterministic shuffle for reproducibility
theorem_names.sort()
split_point = int(len(theorem_names) * (1 - test_fraction))
train_theorems = set(theorem_names[:split_point])
train_steps = [s for s in steps if s.theorem_name in train_theorems]
test_steps = [s for s in steps if s.theorem_name not in train_theorems]
return {"train": train_steps, "test": test_steps}
# Example usage
print("Extraction pipeline stages:")
print(" 1. Instrument Lean compiler with LeanDojo tracing")
print(" 2. Compile mathlib to capture all proof states")
print(" 3. Parse (goal, tactic, premises) triples")
print(" 4. Split by theorem to avoid train/test leakage")
print(" 5. Build premise retrieval index over all lemma statements")
8.6.6 RL and Self-Play Approaches to Formal Reasoning
The formal proving setting is uniquely well suited to reinforcement learning because the proof assistant provides a perfect reward signal. Three RL paradigms have been applied to formal theorem proving:
- Expert iteration (ExIt): The model generates proof attempts, the verifier filters for valid proofs, and the model is fine-tuned on the successful proofs. This is repeated iteratively, with each generation producing harder proofs that serve as training data for the next round. This is conceptually similar to the STaR bootstrapping method described in Section 8.3.
- MCTS with learned value functions: As in AlphaProof, a value network estimates proof completion probability from any state, guiding tree search toward promising tactic sequences. The value network is trained from the outcomes of previous searches (proofs found vs. search failures).
- Online RL with proof feedback: The model receives reward only when a complete proof is verified by Lean. This sparse reward signal is challenging to optimize, but curriculum learning (starting with easy theorems and progressing to harder ones) helps. GRPO (from Section 8.3) can be applied directly, with the proof verifier as the reward function.
# Expert iteration for formal theorem proving
from typing import List, Optional
class ExpertIterationTrainer:
"""
Train a theorem prover via expert iteration.
Each round: generate proofs, verify, fine-tune on successes.
"""
def __init__(self, prover_model, verifier, theorem_set):
self.model = prover_model
self.verifier = verifier
self.theorems = theorem_set
self.proof_buffer = [] # Accumulated successful proofs
def run_iteration(self, num_attempts: int = 64) -> dict:
"""One round of expert iteration."""
new_proofs = []
for theorem in self.theorems:
for attempt in range(num_attempts):
# Generate a proof attempt
proof = self.model.generate_proof(
theorem.statement,
temperature=0.8,
max_tactics=50
)
# Verify with the proof assistant
if self.verifier.check(theorem.name, proof):
new_proofs.append({
"theorem": theorem.statement,
"proof": proof,
"difficulty": theorem.difficulty,
})
break # Move to next theorem
# Fine-tune on newly discovered proofs
self.proof_buffer.extend(new_proofs)
self.model.fine_tune(self.proof_buffer)
return {
"theorems_attempted": len(self.theorems),
"proofs_found": len(new_proofs),
"success_rate": len(new_proofs) / len(self.theorems),
"buffer_size": len(self.proof_buffer),
}
# Typical expert iteration progression:
# Round 1: 20% of theorems proved (easy ones)
# Round 2: 35% proved (medium difficulty unlocked)
# Round 3: 45% proved (harder proofs from composed tactics)
# Round 4: 52% proved (diminishing returns begin)
Formal theorem proving with LLMs requires significant compute even at small scale. Compiling mathlib takes several hours, and each proof search involves multiple calls to the Lean type checker. For experimentation, start with a small subset of theorems (miniF2F's validation split has 244 problems) and limit search budget to 100 tactic attempts per problem. Scale up only after validating your pipeline on this smaller set.
8.6.7 Evaluation Metrics for Formal Proving
Evaluating formal provers uses metrics distinct from those for informal reasoning. The key metrics are:
- pass@k: The probability that at least one of k independent proof attempts succeeds. This is the standard metric, directly analogous to pass@k in code generation (Section 34.2). Typical values reported: pass@1, pass@10, pass@100.
- Proof validity: Every reported proof must be machine-checked by the target proof assistant. Unlike informal math evaluation, there are no "approximately correct" results. The proof either type-checks or it does not.
- Premise retrieval accuracy: For retrieval-augmented provers, the recall of relevant premises among the top-k retrieved. High premise recall is a prerequisite for proof success, since missing a critical lemma makes the proof impossible.
- Search efficiency: The number of tactic attempts or wall-clock time required to find a proof. Two systems with the same pass@100 may differ dramatically in how quickly they find proofs; the more efficient system is preferable for interactive use.
| System | pass@1 | pass@100 | Approach |
|---|---|---|---|
| ReProver (LeanDojo) | 26.5% | ~48% | Retrieval-augmented tactic generation |
| GPT-4 + Lean | ~29% | ~52% | Few-shot prompting with search |
| AlphaProof | N/A | ~65% (est.) | MCTS + RL + formal verification |
| DeepSeek-Prover-V2 | ~35% | ~60% | Expert iteration + subgoal decomposition |
The numbers above should be treated as approximate. Formal proving benchmarks are sensitive to search budget (more attempts always help when the verifier is perfect), the version of mathlib used (which affects available premises), and whether the system has seen the test problems during training. Direct comparisons across papers should account for these variables. AlphaProof's results are estimated from its IMO performance, as detailed benchmark numbers have not been publicly released.
Formal proving with LLMs remains an active research frontier with several open challenges:
- Autoformalization: Translating informal mathematical text into formal Lean statements is a prerequisite for applying formal provers to real mathematical problems. Current systems achieve roughly 25% accuracy on ProofNet's autoformalization task, far below what is needed for practical use.
- Scaling to research-level mathematics: IMO problems, while difficult, follow well-established patterns. Open research problems require novel proof strategies that go beyond pattern matching on existing mathlib proofs.
- Integration with informal reasoning: The most promising direction may be hybrid systems that use informal chain-of-thought reasoning to develop a proof sketch, then formalize and verify each step. This mirrors how human mathematicians work: intuition first, rigor second.
- Applications beyond mathematics: Formal verification of software (using systems like Coq, Agda, or F*) shares the same proof search structure. LLM-based provers could accelerate the verification of critical software systems, from cryptographic protocols to operating system kernels.
Open Questions:
- Can LLM-based provers discover genuinely novel mathematical results, or are they limited to reproving known theorems using known techniques?
- How can autoformalization accuracy be improved to make formal proving accessible to mathematicians who do not know Lean?
- Will formal verification become a standard component of LLM training, providing ground-truth reward signals for reasoning beyond mathematics (code correctness, logical consistency)?
Explore Further: Install Lean 4 and LeanDojo, extract the proof data for a small mathlib file, and train a simple tactic predictor. Measure pass@1 and pass@10 on the miniF2F validation split to establish a baseline for your own experiments.
Objective
Implement greedy decoding and nucleus (top-p) sampling from scratch using raw GPT-2 logits, then compare your results with the transformers.pipeline("text-generation") shortcut to see how a few lines of library code replace dozens of manual steps.
What You'll Practice
- Loading a pretrained language model and extracting raw logits
- Implementing greedy decoding with an autoregressive loop
- Implementing nucleus (top-p) sampling with temperature scaling
- Comparing manual decoding against the Hugging Face pipeline abstraction
Setup
Install the transformers library and PyTorch. A GPU is not required; GPT-2 (124M parameters) runs comfortably on CPU.
pip install transformers torch
Steps
Step 1: Load GPT-2 and implement greedy decoding
Load the model, feed a prompt, and repeatedly select the highest-probability next token until you reach a length limit.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
def greedy_decode(prompt, max_new_tokens=50):
input_ids = tokenizer.encode(prompt, return_tensors="pt")
for _ in range(max_new_tokens):
with torch.no_grad():
outputs = model(input_ids)
logits = outputs.logits[:, -1, :] # logits for last position
next_token = logits.argmax(dim=-1, keepdim=True)
input_ids = torch.cat([input_ids, next_token], dim=-1)
if next_token.item() == tokenizer.eos_token_id:
break
return tokenizer.decode(input_ids[0], skip_special_tokens=True)
print("=== Greedy Decoding ===")
print(greedy_decode("The future of artificial intelligence is"))
Hint
Greedy decoding always picks the single most likely token. This produces deterministic output but often leads to repetitive text. Notice how the model tends to loop on the same phrases.
Step 2: Implement nucleus (top-p) sampling
Add temperature scaling and top-p filtering to produce more diverse, natural text.
import torch
import torch.nn.functional as F
def nucleus_sample(prompt, max_new_tokens=50, temperature=0.8, top_p=0.9):
input_ids = tokenizer.encode(prompt, return_tensors="pt")
for _ in range(max_new_tokens):
with torch.no_grad():
outputs = model(input_ids)
logits = outputs.logits[:, -1, :] / temperature
# Sort logits and compute cumulative probabilities
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens with cumulative probability above top_p
sorted_mask = cumulative_probs - F.softmax(sorted_logits, dim=-1) >= top_p
sorted_logits[sorted_mask] = float("-inf")
# Sample from the filtered distribution
probs = F.softmax(sorted_logits, dim=-1)
next_index = torch.multinomial(probs, num_samples=1)
next_token = sorted_indices.gather(-1, next_index)
input_ids = torch.cat([input_ids, next_token], dim=-1)
if next_token.item() == tokenizer.eos_token_id:
break
return tokenizer.decode(input_ids[0], skip_special_tokens=True)
print("\n=== Nucleus Sampling (p=0.9, temp=0.8) ===")
for i in range(3):
print(f"Sample {i+1}: {nucleus_sample('The future of artificial intelligence is')}")
print()
Hint
Nucleus sampling keeps the smallest set of tokens whose cumulative probability exceeds top_p, then samples from that set. Lower top_p values (0.5) produce more focused text; higher values (0.95) allow more variety.
Step 3: Compare with the Hugging Face pipeline
Achieve the same results with the high-level pipeline API in just a few lines.
from transformers import pipeline
# The library way: 3 lines
generator = pipeline("text-generation", model="gpt2")
results = generator(
"The future of artificial intelligence is",
max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.8,
num_return_sequences=3
)
print("=== Pipeline Output ===")
for i, r in enumerate(results):
print(f"Sample {i+1}: {r['generated_text']}\n")
Hint
The pipeline handles tokenization, the autoregressive loop, sampling logic, and decoding internally. Under the hood it uses the same model.generate() method with configurable strategies including greedy, beam search, top-k, and top-p sampling.
Expected Output
- Greedy decoding produces deterministic, often repetitive text
- Nucleus sampling produces varied outputs across runs, with coherent prose
- Pipeline output matches the quality of manual nucleus sampling with far less code
Stretch Goals
- Implement top-k sampling and compare output diversity against nucleus sampling
- Add a repetition penalty that reduces the logit of any token already generated
- Plot token probability distributions at each step to visualize how temperature reshapes the distribution
- AlphaProof demonstrated that MCTS combined with RL and formal verification can solve IMO-level problems, achieving silver medal performance at the 2024 competition.
- Expert iteration and self-play are effective training paradigms for formal proving, bootstrapping from easy proofs to progressively harder ones over multiple rounds.
- Three RL paradigms apply to formal proving: expert iteration (filter-then-finetune), MCTS with learned value functions, and online RL with proof feedback. All exploit the perfect verifier.
- Dataset extraction for proof data must split by theorem (not by step) to prevent leakage; LeanDojo handles the Lean compiler instrumentation.
- pass@k is the standard metric; results are sensitive to search budget, mathlib version, and possible test contamination.
- The key open challenges are autoformalization (translating informal to formal math), scaling to research-level problems, and extending formal verification beyond pure mathematics to software and logical reasoning.
Show Answer
Show Answer
Exercises
Lean and Coq have existed for decades and can verify any proof you give them. (a) Why have they remained niche tools used mostly by a small community of mathematicians? (b) What specific bottleneck do LLMs remove? (c) Why is the verifier ultimately what makes the LLM useful here, rather than the other way around?
Answer Sketch
(a) Writing proofs in a proof assistant is enormously labor-intensive: each step requires picking the right tactic from thousands of options and shaping the goal state correctly. The skill barrier is high and the productivity gain over informal proofs has historically been low for everyday math. (b) LLMs remove the tactic-search bottleneck: given a goal state, an LLM can propose plausible next tactics with hit rates that beat heuristic search, much like an experienced human's intuition. (c) The verifier guarantees correctness, which lets the LLM hallucinate freely (most proposed tactics fail and are discarded) without poisoning the output. This asymmetry is unusual: for most LLM applications hallucinations are costs, but here they are a controlled cost paid against an oracle.
You use an LLM to suggest tactics in a Lean proof. The model's top-1 tactic succeeds 25% of the time at any given step, and a typical theorem requires 8 sequential tactics. Predict: (a) the success rate of greedy top-1 search; (b) how many sampled tactic sequences (best-of-N at the sequence level) you need to hit 90% success; (c) what changes if you do tree search instead of independent samples?
Answer Sketch
(a) Greedy top-1: 0.25^8 = 1.5e-5 (~0.0015%), essentially zero. (b) For 90% sequence-level success: need 1 - (1 - p_seq)^N ≥ 0.9, with p_seq = 1.5e-5; so N ~= ln(0.1) / ln(1 - 1.5e-5) ~= 154,000 samples. Wildly impractical. (c) Tree search lets you reuse partial successes: each correct prefix is shared across many candidate suffixes, so the effective branching factor is much smaller. With beam-style search keeping the top 64 candidates per step, total proof states explored is ~64 * 8 = 512, comfortably tractable. This is exactly why systems like AlphaProof use MCTS rather than independent sampling.
Formal verification is powerful but a poor fit for many real LLM tasks. List four use cases where adding a proof-assistant verifier is the wrong tool, and explain why for each.
Answer Sketch
(1) Open-domain question answering: there is no formal specification of "is this answer correct"; correctness is judged by humans against fuzzy intent. (2) Creative writing: by definition there's no ground truth to verify. (3) Conversational tone and helpfulness: these are preference dimensions, not logical claims. (4) Empirical scientific claims: a proof assistant can verify mathematical reasoning but not whether an experimental result generalizes. The general principle: formal methods help when correctness can be specified precisely (math, sorting algorithms, type-safe code transformations) and fail when it cannot. For most NLP, you need empirical eval, not proof.
What's Next?
Now that we have explored how models reason, from chain-of-thought prompting to formal theorem proving, we shift focus to making these models run efficiently. In Chapter 9: Inference Optimization and Efficient Serving, we cover quantization, KV caching, speculative decoding, and serving frameworks that let you deploy reasoning models in production without breaking the compute budget.