Section 8.6a: AlphaProof, Self-Play RL, and Evaluation for Formal Proving

AlphaProof did not learn mathematics from a textbook. It learned mathematics from itself, one self-play match at a time, until it could prove things that humans only suspected.
Sage, Theorem-Curious AI Agent

Big Picture

This section continues from Section 8.6, which introduced the intersection of LLMs and formal logic, LeanDojo's retrieval-augmented proving, and the standard formal-math benchmarks (miniF2F, ProofNet, LeanDojo Bench). Here we turn to the result that put formal proving on the front page: DeepMind's AlphaProof at IMO 2024. We then look at the dataset extraction workflows, the RL paradigms that make self-play possible against a perfect verifier, and the evaluation metrics that distinguish formal proving from informal reasoning.

Prerequisites

This section continues from Section 8.6. Familiarity with LeanDojo, the miniF2F and ProofNet benchmarks, and the RLVR / process reward model discussion from Section 8.3 is assumed.

Fun Fact: The IMO Problem That Took Three Days

Three boxes (Prover, Verifier, Training Buffer) connected in a feedback loop — Self-play loop for formal proving: a learner proposes proofs, a verifier (Lean) accepts or rejects each, and accepted proofs feed back into the next round of training data.

AlphaProof's silver-medal IMO performance in 2024 was achieved with one tiny asterisk: the system was given three days of compute per problem, while human contestants get 4.5 hours. The math is correct, the proofs are valid, and the model genuinely solved problems it had never seen. But anyone tempted to declare automated theorem proving solved should remember that the gold-medal humans did the same job 24 times faster. Compute is a tax, and AlphaProof paid it generously.

Building on the proof-assistant infrastructure from Section 8.6, we now look at how reinforcement learning against a perfect verifier produces a system that can compete on the IMO, and at what to measure when comparing formal provers.

8.6.4 AlphaProof: Reinforcement Learning for Formal Mathematics

DeepMind's AlphaProof (2024) demonstrated that combining LLMs with reinforcement learning and formal verification can solve competition-level mathematics. At the 2024 International Mathematical Olympiad (IMO), AlphaProof solved 4 out of 6 problems, achieving a score equivalent to a silver medal. This was a landmark result: the first time an AI system performed competitively on the full IMO, widely considered one of the hardest mathematical competitions in the world.

AlphaProof's architecture combines three components. First, a language model (Gemini) translates informal problem statements into formal Lean 4 specifications. Second, a value network (trained via self-play) estimates the probability of completing a proof from any given state. Third, an MCTS-style search (building on the AlphaZero framework) explores the space of possible tactic applications, guided by the value network and a policy network that suggests promising tactics.

The training loop follows a self-play paradigm. AlphaProof generates proof attempts for a large library of formalized problems, the Lean verifier checks which proofs are valid, and the model is updated via reinforcement learning on the successful proofs. Over many iterations, the model learns increasingly sophisticated proving strategies. Crucially, because the verifier is perfect (a valid Lean proof is mathematically correct by definition), there is no reward hacking: every proof the model learns from is genuinely correct.

# AlphaProof-style self-play training loop (conceptual)
# Combines MCTS search with formal verification
from dataclasses import dataclass
from typing import List, Optional, Tuple
import random
@dataclass
class ProofNode:
    """A node in the proof search tree."""
    state: str # Current Lean proof state
    tactic: str # Tactic that led to this state
    visits: int = 0
    value: float = 0.0 # Estimated probability of proof completion
    children: list = None
    def __post_init__(self):
        if self.children is None:
            self.children = []
    class AlphaProofSearch:
        """MCTS-guided proof search with formal verification."""
    def __init__(self, policy_net, value_net, lean_verifier):
        self.policy = policy_net # Suggests tactics
        self.value = value_net # Estimates proof completion prob
        self.verifier = lean_verifier # Lean 4 type checker
        self.c_puct = 1.5 # Exploration constant
    def search(self, root_goal: str, simulations: int = 800) -> Optional[str]:
        """Run MCTS to find a proof for the given goal."""
        root = ProofNode(state=root_goal, tactic="")
        for _ in range(simulations):
            # Selection: traverse tree using UCB
            node, path = self._select(root)
            # Expansion: generate candidate tactics
            tactics = self.policy.suggest_tactics(node.state, k=32)
            for tactic in tactics:
                # Apply tactic in Lean and get resulting state
                result = self.verifier.apply_tactic(node.state, tactic)
                if result.is_proof_complete:
                    return self._extract_proof(path, tactic)
                    if result.is_valid:
                        child = ProofNode(
                            state=result.new_goal, tactic=tactic
                            )
                        child.value = self.value.estimate(result.new_goal)
                        node.children.append(child)
                        # Backpropagation: update values along the path
                        self._backpropagate(path)
                        return None # No proof found within budget
        def _select(self, root: ProofNode) -> Tuple[ProofNode, list]:
            """Select a leaf node using UCB1."""
            node = root
            path = [node]
            while node.children:
                node = max(node.children, key=lambda c: self._ucb(c, node))
                path.append(node)
                return node, path
                def _ucb(self, child: ProofNode, parent: ProofNode) -> float:
                    exploitation = child.value
                    exploration = self.c_puct * (parent.visits ** 0.5) / (1 + child.visits)
                    return exploitation + exploration
                def _backpropagate(self, path: list):
                    for node in reversed(path):
                        node.visits += 1
                        def _extract_proof(self, path: list, final_tactic: str) -> str:
                            tactics = [n.tactic for n in path if n.tactic] + [final_tactic]
                            return "\n".join(tactics)

Code Fragment 8.6.4: Conceptual illustration of the AlphaProof search algorithm. MCTS explores the space of tactic applications, guided by a policy network (which suggests promising tactics) and a value network (which estimates proof completion probability). The Lean verifier provides ground-truth feedback at every step, eliminating the need for approximate reward models.

Note

AlphaProof's architecture is not publicly available in full detail. The description here is based on DeepMind's blog post and the clear parallels with AlphaZero. The key architectural decisions (MCTS + value network + formal verifier) are well established; the proprietary details concern the specific model architecture, training data, and search hyperparameters. Reproducing the full system would require significant compute, but the conceptual framework can be applied at smaller scale using open tools like LeanDojo and ReProver.

8.6.5 Dataset Extraction and Proof Assistant Workflows

Building a formal proving system requires extracting structured data from proof assistant libraries. The workflow involves three stages: compiling the library to capture proof states, parsing tactic applications into (state, action, result) triples, and splitting data to avoid leakage between train and test sets.

# Lean 4 data extraction workflow (simplified)
# In practice, LeanDojo handles the Lean compiler integration
import json
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import List
@dataclass
class TacticStep:
    """One step in a Lean proof, extracted from the compiler trace."""
    theorem_name: str
    step_index: int
    goal_before: str
    tactic_text: str
    goal_after: str
    premises_used: List[str]
def extract_proof_data(lean_project_path: str) -> List[TacticStep]:
    """
    Extract tactic-level proof data from a Lean 4 project.
    In practice, this uses LeanDojo's tracing infrastructure
    which instruments the Lean compiler to capture proof states.
    """
    steps = []
    # LeanDojo traces through the Lean compilation process
    # and captures every tactic application with its context
    # Pseudocode for the extraction loop:
    traced_files = trace_lean_project(lean_project_path)
    for file_data in traced_files:
        for theorem in file_data["theorems"]:
            for i, step in enumerate(theorem["proof_steps"]):
                steps.append(TacticStep(
                        theorem_name=theorem["name"],
                        step_index=i,
                        goal_before=step["goal_before"],
                        tactic_text=step["tactic"],
                        goal_after=step["goal_after"],
                        premises_used=step["premises"],
                        ))
                return steps
def split_by_theorem(
    steps: List[TacticStep], test_fraction: float = 0.1
    ) -> dict:
    """
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Split data by theorem (not by step) to prevent leakage.
    All steps from one theorem go to the same split.
    """
    theorem_names = list(set(s.theorem_name for s in steps))
    # Deterministic shuffle for reproducibility
    theorem_names.sort()
    split_point = int(len(theorem_names) * (1 - test_fraction))
    train_theorems = set(theorem_names[:split_point])
    train_steps = [s for s in steps if s.theorem_name in train_theorems]
    test_steps = [s for s in steps if s.theorem_name not in train_theorems]
    return {"train": train_steps, "test": test_steps}
# Example usage
print("Extraction pipeline stages:")
print(" 1. Instrument Lean compiler with LeanDojo tracing")
print(" 2. Compile mathlib to capture all proof states")
print(" 3. Parse (goal, tactic, premises) triples")
print(" 4. Split by theorem to avoid train/test leakage")
print(" 5. Build premise retrieval index over all lemma statements")

Output: Extraction pipeline stages: 1. Instrument Lean compiler with LeanDojo tracing 2. Compile mathlib to capture all proof states 3. Parse (goal, tactic, premises) triples 4. Split by theorem to avoid train/test leakage 5. Build premise retrieval index over all lemma statements

Code Fragment 8.6.5: Simplified data extraction workflow for Lean 4 proof data. LeanDojo instruments the Lean compiler to capture proof states at every tactic application. The critical design choice is splitting by theorem rather than by step, since steps within the same proof are highly correlated and splitting by step would cause severe data leakage.

8.6.6 RL and Self-Play Approaches to Formal Reasoning

The formal proving setting is uniquely well suited to reinforcement learning because the proof assistant provides a perfect reward signal. Three RL paradigms have been applied to formal theorem proving:

Expert iteration (ExIt): The model generates proof attempts, the verifier filters for valid proofs, and the model is fine-tuned on the successful proofs. This is repeated iteratively, with each generation producing harder proofs that serve as training data for the next round. This is conceptually similar to the STaR bootstrapping method described in Section 8.3.
MCTS with learned value functions: As in AlphaProof, a value network estimates proof completion probability from any state, guiding tree search toward promising tactic sequences. The value network is trained from the outcomes of previous searches (proofs found vs. search failures).
Online RL with proof feedback: The model receives reward only when a complete proof is verified by Lean. This sparse reward signal is challenging to optimize, but curriculum learning (starting with easy theorems and progressing to harder ones) helps. GRPO (from Section 8.3) can be applied directly, with the proof verifier as the reward function.

# Expert iteration for formal theorem proving
from typing import List, Optional
class ExpertIterationTrainer:
    """
    Train a theorem prover via expert iteration.
    Each round: generate proofs, verify, fine-tune on successes.
    """
    def __init__(self, prover_model, verifier, theorem_set):
        self.model = prover_model
        self.verifier = verifier
        self.theorems = theorem_set
        self.proof_buffer = [] # Accumulated successful proofs
    def run_iteration(self, num_attempts: int = 64) -> dict:
        """One round of expert iteration."""
        new_proofs = []
        for theorem in self.theorems:
            for attempt in range(num_attempts):
                # Generate a proof attempt
                proof = self.model.generate_proof(
                    theorem.statement,
                    temperature=0.8,
                    max_tactics=50
                    )
                # Verify with the proof assistant
                if self.verifier.check(theorem.name, proof):
                    new_proofs.append({
                        "theorem": theorem.statement,
                        "proof": proof,
                        "difficulty": theorem.difficulty,
                        })
                    break # Move to next theorem
                    # Fine-tune on newly discovered proofs
                    self.proof_buffer.extend(new_proofs)
                    self.model.fine_tune(self.proof_buffer)
                    return {
                        "theorems_attempted": len(self.theorems),
                        "proofs_found": len(new_proofs),
                        "success_rate": len(new_proofs) / len(self.theorems),
                        "buffer_size": len(self.proof_buffer),
                        }
                    # Typical expert iteration progression:
                        # Round 1: 20% of theorems proved (easy ones)
                        # Round 2: 35% proved (medium difficulty unlocked)
                        # Round 3: 45% proved (harder proofs from composed tactics)
                        # Round 4: 52% proved (diminishing returns begin)

Code Fragment 8.6.6: Expert iteration training loop for formal theorem proving. Each round generates proof attempts for all theorems, verifies them with the proof assistant, and fine-tunes on the successful proofs. Over multiple rounds, the model bootstraps from easy proofs to progressively harder ones.

Warning

Formal theorem proving with LLMs requires significant compute even at small scale. Compiling mathlib takes several hours, and each proof search involves multiple calls to the Lean type checker. For experimentation, start with a small subset of theorems (miniF2F's validation split has 244 problems) and limit search budget to 100 tactic attempts per problem. Scale up only after validating your pipeline on this smaller set.

8.6.7 Evaluation Metrics for Formal Proving

Evaluating formal provers uses metrics distinct from those for informal reasoning. The key metrics are:

pass@k: The probability that at least one of k independent proof attempts succeeds. This is the standard metric, directly analogous to pass@k in code generation (Section 34.2). Typical values reported: pass@1, pass@10, pass@100.
Proof validity: Every reported proof must be machine-checked by the target proof assistant. Unlike informal math evaluation, there are no "approximately correct" results. The proof either type-checks or it does not.
Premise retrieval accuracy: For retrieval-augmented provers, the recall of relevant premises among the top-k retrieved. High premise recall is a prerequisite for proof success, since missing a critical lemma makes the proof impossible.
Search efficiency: The number of tactic attempts or wall-clock time required to find a proof. Two systems with the same pass@100 may differ dramatically in how quickly they find proofs; the more efficient system is preferable for interactive use.

Table 8.6.2c: State-of-the-Art Results on miniF2F (as of early 2026).

System	pass@1	pass@100	Approach
ReProver (LeanDojo)	26.5%	~48%	Retrieval-augmented tactic generation
GPT-4 + Lean	~29%	~52%	Few-shot prompting with search
AlphaProof	N/A	~65% (est.)	MCTS + RL + formal verification
DeepSeek-Prover-V2	~35%	~60%	Expert iteration + subgoal decomposition

Note

The numbers above should be treated as approximate. Formal proving benchmarks are sensitive to search budget (more attempts always help when the verifier is perfect), the version of mathlib used (which affects available premises), and whether the system has seen the test problems during training. Direct comparisons across papers should account for these variables. AlphaProof's results are estimated from its IMO performance, as detailed benchmark numbers have not been publicly released.

Research Frontier: Open Frontiers and Practical Implications

Formal proving with LLMs remains an active research frontier with several open challenges:

Autoformalization: Translating informal mathematical text into formal Lean statements is a prerequisite for applying formal provers to real mathematical problems. Current systems achieve roughly 25% accuracy on ProofNet's autoformalization task, far below what is needed for practical use.
Scaling to research-level mathematics: IMO problems, while difficult, follow well-established patterns. Open research problems require novel proof strategies that go beyond pattern matching on existing mathlib proofs.
Integration with informal reasoning: The most promising direction may be hybrid systems that use informal chain-of-thought reasoning to develop a proof sketch, then formalize and verify each step. This mirrors how human mathematicians work: intuition first, rigor second.
Applications beyond mathematics: Formal verification of software (using systems like Coq, Agda, or F*) shares the same proof search structure. LLM-based provers could accelerate the verification of critical software systems, from cryptographic protocols to operating system kernels.

Open Questions:

Can LLM-based provers discover genuinely novel mathematical results, or are they limited to reproving known theorems using known techniques?
How can autoformalization accuracy be improved to make formal proving accessible to mathematicians who do not know Lean?
Will formal verification become a standard component of LLM training, providing ground-truth reward signals for reasoning beyond mathematics (code correctness, logical consistency)?

Explore Further: Install Lean 4 and LeanDojo, extract the proof data for a small mathlib file, and train a simple tactic predictor. Measure pass@1 and pass@10 on the miniF2F validation split to establish a baseline for your own experiments.

Lab: Generate Text with GPT-2

Duration: ~45 minutes Intermediate

Objective

Implement greedy decoding and nucleus (top-p) sampling from scratch using raw GPT-2 logits, then compare your results with the transformers.pipeline("text-generation") shortcut to see how a few lines of library code replace dozens of manual steps.

What You'll Practice

Loading a pretrained language model and extracting raw logits
Implementing greedy decoding with an autoregressive loop
Implementing nucleus (top-p) sampling with temperature scaling
Comparing manual decoding against the Hugging Face pipeline abstraction

Setup

Install the transformers library and PyTorch. A GPU is not required; GPT-2 (124M parameters) runs comfortably on CPU.

pip install transformers torch

Code Fragment 8.6.7: Install transformers (Hugging Face) and torch (PyTorch backend), the two dependencies the lab needs to load GPT-2 and run the decoding experiments below. CPU inference is enough for GPT-2 small (124M); the lab finishes in seconds.

Steps

Step 1: Load GPT-2 and implement greedy decoding

Load the model, feed a prompt, and repeatedly select the highest-probability next token until you reach a length limit.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
def greedy_decode(prompt, max_new_tokens=50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            outputs = model(input_ids)
            logits = outputs.logits[:, -1, :] # logits for last position
            next_token = logits.argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
                return tokenizer.decode(input_ids[0], skip_special_tokens=True)
                print("=== Greedy Decoding ===")
                print(greedy_decode("The future of artificial intelligence is"))

Output: === Greedy Decoding === The future of artificial intelligence is in the hands of the people who are building it. The people who are building it are the people who are building it. The people who are building it are the people who are...

Code Fragment 8.6.8: Manual greedy decoding with GPT-2, selecting the highest-probability token at each step. The output illustrates greedy decoding's core weakness: deterministic selection leads to repetitive loops because the model keeps choosing the same high-probability continuations.

Hint

Greedy decoding always picks the single most likely token. This produces deterministic output but often leads to repetitive text. Notice how the model tends to loop on the same phrases.

Step 2: Implement nucleus (top-p) sampling

Add temperature scaling and top-p filtering to produce more diverse, natural text.

import torch
import torch.nn.functional as F
def nucleus_sample(prompt, max_new_tokens=50, temperature=0.8, top_p=0.9):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            outputs = model(input_ids)
            logits = outputs.logits[:, -1, :] / temperature
            # Sort logits and compute cumulative probabilities
            sorted_logits, sorted_indices = torch.sort(logits, descending=True)
            cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
            # Remove tokens with cumulative probability above top_p
            sorted_mask = cumulative_probs - F.softmax(sorted_logits, dim=-1) >= top_p
            sorted_logits[sorted_mask] = float("-inf")
            # Sample from the filtered distribution
            probs = F.softmax(sorted_logits, dim=-1)
            next_index = torch.multinomial(probs, num_samples=1)
            next_token = sorted_indices.gather(-1, next_index)
            input_ids = torch.cat([input_ids, next_token], dim=-1)
            if next_token.item() == tokenizer.eos_token_id:
                break
            return tokenizer.decode(input_ids[0], skip_special_tokens=True)
        print("\n=== Nucleus Sampling (p=0.9, temp=0.8) ===")
        for i in range(3):
            print(f"Sample {i+1}: {nucleus_sample('The future of artificial intelligence is')}")
            print()

Output: === Nucleus Sampling (p=0.9, temp=0.8) === Sample 1: The future of artificial intelligence is likely to be shaped by advances in deep learning and the ability of machines to process vast amounts of unstructured data. Sample 2: The future of artificial intelligence is not just about building smarter machines; it involves rethinking how we collaborate with technology across every domain. Sample 3: The future of artificial intelligence is one where specialized models work alongside general-purpose systems, each handling what they do best.

Code Fragment 8.6.9: Nucleus (top-p) sampling with temperature scaling, keeping only the smallest set of tokens whose cumulative probability exceeds the threshold. Compared to greedy decoding, the three sample outputs are diverse and free of repetition, demonstrating the quality gain from stochastic sampling.

Hint

Nucleus sampling keeps the smallest set of tokens whose cumulative probability exceeds top_p, then samples from that set. Lower top_p values (0.5) produce more focused text; higher values (0.95) allow more variety.

Step 3: Compare with the Hugging Face pipeline

Achieve the same results with the high-level pipeline API in just a few lines.

from transformers import pipeline
# The library way: 3 lines
generator = pipeline("text-generation", model="gpt2")
results = generator(
    "The future of artificial intelligence is",
    max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.8,
    num_return_sequences=3
    )
print("=== Pipeline Output ===")
for i, r in enumerate(results):
    print(f"Sample {i+1}: {r['generated_text']}\n")

Output: === Pipeline Output === Sample 1: The future of artificial intelligence is increasingly intertwined with questions of governance, transparency, and societal impact. Sample 2: The future of artificial intelligence is being defined by researchers and engineers who are pushing the boundaries of what neural networks can learn. Sample 3: The future of artificial intelligence is not predetermined; the choices made today in model design, data curation, and deployment will shape its trajectory.

Code Fragment 8.6.10: The Hugging Face pipeline("text-generation") in 3 lines collapses tokenization, batching, sampling, and decoding behind a single .generate() call. The same model and same default parameters produce output matching the from-scratch implementations above; library shortcuts trade pedagogical insight for production readiness.

Hint

The pipeline handles tokenization, the autoregressive loop, sampling logic, and decoding internally. Under the hood it uses the same model.generate() method with configurable strategies including greedy, beam search, top-k, and top-p sampling.

Expected Output

Greedy decoding produces deterministic, often repetitive text
Nucleus sampling produces varied outputs across runs, with coherent prose
Pipeline output matches the quality of manual nucleus sampling with far less code

Stretch Goals

Implement top-k sampling and compare output diversity against nucleus sampling
Add a repetition penalty that reduces the logit of any token already generated
Plot token probability distributions at each step to visualize how temperature reshapes the distribution

Key Takeaways

AlphaProof demonstrated that MCTS combined with RL and formal verification can solve IMO-level problems, achieving silver medal performance at the 2024 competition.
Expert iteration and self-play are effective training paradigms for formal proving, bootstrapping from easy proofs to progressively harder ones over multiple rounds.
Three RL paradigms apply to formal proving: expert iteration (filter-then-finetune), MCTS with learned value functions, and online RL with proof feedback. All exploit the perfect verifier.
Dataset extraction for proof data must split by theorem (not by step) to prevent leakage; LeanDojo handles the Lean compiler instrumentation.
pass@k is the standard metric; results are sensitive to search budget, mathlib version, and possible test contamination.
The key open challenges are autoformalization (translating informal to formal math), scaling to research-level problems, and extending formal verification beyond pure mathematics to software and logical reasoning.

Self-Check

1. Why is formal theorem proving described as an "ideal RL environment" for LLMs?

Show Answer

The proof assistant provides a binary, ground-truth reward signal (proof valid or invalid) with zero false positives. Unlike informal math evaluation where a reward model might be fooled by plausible but incorrect reasoning, a formal verifier cannot be gamed. This means more search always helps, and the model can safely learn from its own successes without reward hacking.

2. Why must proof-data splits be done by theorem rather than by step?

Show Answer

Steps within the same proof are highly correlated: knowing earlier steps leaks the strategy used for later ones. A by-step split would put related (goal, tactic) pairs into both train and test, dramatically overstating generalization. By-theorem splitting keeps each proof's full trajectory on one side of the split, so test performance reflects how well the model handles genuinely new theorems.

Exercises

Exercise 8.6a.1: Why Formal Proofs Need LLMs Conceptual

Lean and Coq have existed for decades and can verify any proof you give them. (a) Why have they remained niche tools used mostly by a small community of mathematicians? (b) What specific bottleneck do LLMs remove? (c) Why is the verifier ultimately what makes the LLM useful here, rather than the other way around?

Answer Sketch

(a) Writing proofs in a proof assistant is enormously labor-intensive: each step requires picking the right tactic from thousands of options and shaping the goal state correctly. The skill barrier is high and the productivity gain over informal proofs has historically been low for everyday math. (b) LLMs remove the tactic-search bottleneck: given a goal state, an LLM can propose plausible next tactics with hit rates that beat heuristic search, much like an experienced human's intuition. (c) The verifier guarantees correctness, which lets the LLM hallucinate freely (most proposed tactics fail and are discarded) without poisoning the output. This asymmetry is unusual: for most LLM applications hallucinations are costs, but here they are a controlled cost paid against an oracle.

Exercise 8.6a.2: Predict the Search Cost Predictive

You use an LLM to suggest tactics in a Lean proof. The model's top-1 tactic succeeds 25% of the time at any given step, and a typical theorem requires 8 sequential tactics. Predict: (a) the success rate of greedy top-1 search; (b) how many sampled tactic sequences (best-of-N at the sequence level) you need to hit 90% success; (c) what changes if you do tree search instead of independent samples?

Answer Sketch

(a) Greedy top-1: 0.25^8 = 1.5e-5 (~0.0015%), essentially zero. (b) For 90% sequence-level success: need 1 - (1 - p_seq)^N ≥ 0.9, with p_seq = 1.5e-5; so N ~= ln(0.1) / ln(1 - 1.5e-5) ~= 154,000 samples. Wildly impractical. (c) Tree search lets you reuse partial successes: each correct prefix is shared across many candidate suffixes, so the effective branching factor is much smaller. With beam-style search keeping the top 64 candidates per step, total proof states explored is ~64 * 8 = 512, comfortably tractable. This is exactly why systems like AlphaProof use MCTS rather than independent sampling.

Exercise 8.6a.3: Where Formal Reasoning Doesn't Help Failure Mode

Formal verification is powerful but a poor fit for many real LLM tasks. List four use cases where adding a proof-assistant verifier is the wrong tool, and explain why for each.

Answer Sketch

(1) Open-domain question answering: there is no formal specification of "is this answer correct"; correctness is judged by humans against fuzzy intent. (2) Creative writing: by definition there's no ground truth to verify. (3) Conversational tone and helpfulness: these are preference dimensions, not logical claims. (4) Empirical scientific claims: a proof assistant can verify mathematical reasoning but not whether an experimental result generalizes. The general principle: formal methods help when correctness can be specified precisely (math, sorting algorithms, type-safe code transformations) and fail when it cannot. For most NLP, you need empirical eval, not proof.

What's Next?

Now that we have explored how models reason, from chain-of-thought prompting to formal theorem proving, we shift focus to making these models run efficiently. In Chapter 9: Inference Optimization and Efficient Serving, we cover quantization, KV caching, speculative decoding, and serving frameworks that let you deploy reasoning models in production without breaking the compute budget.

Further Reading

AlphaProof and Self-Play

AlphaProof Team, Google DeepMind (2024). AI Achieves Silver-Medal Standard Solving International Mathematical Olympiad Problems. Google DeepMind Blog. Announcement of AlphaProof's IMO results. Demonstrates the power of combining MCTS, RL, and formal verification for mathematical reasoning. The most significant milestone in AI theorem proving to date.

Polu, S. & Sutskever, I. (2020). Generative Language Modeling for Automated Theorem Proving. arXiv:2009.03393. Early work on applying GPT-style language models to formal theorem proving in Metamath. Established the viability of the approach and inspired subsequent work including LeanDojo and AlphaProof.

Xin, H., Guo, D., Shao, Z., Ren, Z., Zhu, Q., Liu, B., et al. (2024). DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search. arXiv:2408.08152. Combines MCTS with expert iteration for Lean theorem proving, achieving strong results on miniF2F. Demonstrates that open-weight models can compete with proprietary systems on formal proving tasks.