Part VI: Agentic AI
Chapter 22: AI Agent Foundations

Research Replication Benchmarks and ML Engineering Agent Evaluation

"The real test of an AI research agent is not whether it can read a paper, but whether it can reproduce the results."

Probe, Reproducibility-Obsessed AI Agent
Big Picture

Can an AI agent reproduce a research paper from scratch? This question, once hypothetical, is now the subject of rigorous benchmarks. PaperBench evaluates agents that attempt to replicate the experiments in published research papers, scoring their work against expert-written rubrics. CORE-Bench measures computational reproducibility by testing whether agents can re-execute existing research code and match published results. MLE-bench evaluates end-to-end ML engineering capability through Kaggle-style competitions. Together, these benchmarks measure something fundamentally different from standard coding benchmarks like HumanEval: they test the agent's ability to manage complex, multi-step workflows that involve reading documentation, installing dependencies, handling nondeterminism, debugging failures, and producing validated results. This section covers the design, methodology, failure modes, and practical applications of all three benchmarks.

Prerequisites

This section builds on the agent architecture concepts from Section 22.6, tool use from Chapter 23, and the evaluation frameworks from Chapter 29. Familiarity with the sandboxed execution environments from Section 26.2 is helpful, as all three benchmarks discussed here rely on containerized execution. Prior experience with ML workflows (training, evaluation, hyperparameter tuning) is assumed.

1. PaperBench: Reproducing Research Papers

PaperBench, introduced by OpenAI in 2025, is a benchmark for evaluating AI agents that attempt to reproduce the results of machine learning research papers. The benchmark consists of 20 papers from ICML 2024, each accompanied by a detailed rubric that decomposes the reproduction task into a hierarchy of sub-tasks. A complete rubric may contain 100 to 300 individual checkpoints organized in a tree structure, from high-level goals ("reproduce Figure 3") down to specific implementation details ("use AdamW optimizer with learning rate 3e-4").

The PaperBench evaluation pipeline works as follows. The agent receives the paper's PDF, any supplementary materials, and access to a sandboxed compute environment with internet access. The agent must read the paper, understand the methodology, write the code to reproduce the experiments, execute the code, and produce outputs that match the paper's reported results. A separate judge agent then evaluates the agent's outputs against the rubric, scoring each checkpoint as passed or failed. The final score is a weighted sum of rubric checkpoints, where weights reflect the relative importance and difficulty of each checkpoint.

# Code Fragment 22.8.1: PaperBench rubric structure and scoring

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RubricNode:
    """A single node in the PaperBench rubric tree."""
    node_id: str
    description: str
    weight: float
    children: list["RubricNode"] = field(default_factory=list)
    is_leaf: bool = False
    verification_method: str = "judge_model"  # or "exact_match", "numeric_tolerance"
    tolerance: Optional[float] = None  # for numeric verification

@dataclass
class PaperBenchTask:
    """A complete PaperBench task for one research paper."""
    paper_id: str
    paper_title: str
    paper_pdf_path: str
    supplementary_paths: list[str]
    rubric_root: RubricNode
    total_checkpoints: int
    compute_budget_hours: float
    required_packages: list[str]

def build_sample_rubric() -> RubricNode:
    """Example rubric for a simple paper reproduction."""
    return RubricNode(
        node_id="root",
        description="Reproduce all experiments from the paper",
        weight=1.0,
        children=[
            RubricNode(
                node_id="setup",
                description="Environment and data preparation",
                weight=0.2,
                children=[
                    RubricNode(
                        node_id="setup.deps",
                        description="All dependencies installed correctly",
                        weight=0.5, is_leaf=True,
                    ),
                    RubricNode(
                        node_id="setup.data",
                        description="Dataset downloaded and preprocessed",
                        weight=0.5, is_leaf=True,
                    ),
                ],
            ),
            RubricNode(
                node_id="training",
                description="Model training pipeline",
                weight=0.4,
                children=[
                    RubricNode(
                        node_id="training.arch",
                        description="Model architecture matches paper description",
                        weight=0.3, is_leaf=True,
                    ),
                    RubricNode(
                        node_id="training.optim",
                        description="Optimizer and hyperparameters match paper",
                        weight=0.3, is_leaf=True,
                    ),
                    RubricNode(
                        node_id="training.converge",
                        description="Training converges within stated epochs",
                        weight=0.4, is_leaf=True,
                        verification_method="numeric_tolerance",
                        tolerance=0.02,
                    ),
                ],
            ),
            RubricNode(
                node_id="results",
                description="Reproduce reported results",
                weight=0.4,
                children=[
                    RubricNode(
                        node_id="results.table1",
                        description="Table 1 results within tolerance",
                        weight=0.5, is_leaf=True,
                        verification_method="numeric_tolerance",
                        tolerance=0.01,
                    ),
                    RubricNode(
                        node_id="results.fig3",
                        description="Figure 3 qualitatively matches",
                        weight=0.5, is_leaf=True,
                        verification_method="judge_model",
                    ),
                ],
            ),
        ],
    )

def score_rubric(root: RubricNode, results: dict[str, bool]) -> float:
    """
    Recursively score a rubric tree given leaf-level pass/fail results.

    Parameters
    ----------
    root : RubricNode
        The root of the rubric tree.
    results : dict
        Mapping from leaf node_id to True (pass) or False (fail).

    Returns
    -------
    float
        Weighted score between 0.0 and 1.0.
    """
    if root.is_leaf:
        return root.weight if results.get(root.node_id, False) else 0.0

    child_scores = sum(score_rubric(child, results) for child in root.children)
    max_child_weight = sum(child.weight for child in root.children)

    if max_child_weight == 0:
        return 0.0

    normalized = child_scores / max_child_weight
    return root.weight * normalized
Code Fragment 22.8.1: The PaperBench rubric structure. Each paper's reproduction task is decomposed into a tree of checkpoints with weights reflecting importance. Leaf nodes are scored as pass/fail, and scores propagate up through the tree. An agent attempting a paper with 200 leaf checkpoints might score 0.35, meaning roughly 35% of the weighted rubric was satisfied.
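To make the weight propagation concrete, here is a condensed, self-contained version of the scorer applied to a toy tree. `MiniNode` is a trimmed-down stand-in for `RubricNode`, keeping only the fields that scoring uses:

```python
from dataclasses import dataclass, field

@dataclass
class MiniNode:
    # trimmed-down rubric node: just the fields the scorer needs
    node_id: str
    weight: float
    children: list["MiniNode"] = field(default_factory=list)
    is_leaf: bool = False

def score(root: MiniNode, results: dict[str, bool]) -> float:
    """Same recursion as score_rubric: leaves earn their weight on a pass,
    internal nodes scale their weight by the normalized child score."""
    if root.is_leaf:
        return root.weight if results.get(root.node_id, False) else 0.0
    total = sum(score(c, results) for c in root.children)
    max_w = sum(c.weight for c in root.children)
    return root.weight * (total / max_w) if max_w else 0.0

# Toy tree: one leaf worth 0.6, one subtree worth 0.4 with two equal leaves
tree = MiniNode("root", 1.0, children=[
    MiniNode("a", 0.6, is_leaf=True),
    MiniNode("b", 0.4, children=[
        MiniNode("b.1", 0.5, is_leaf=True),
        MiniNode("b.2", 0.5, is_leaf=True),
    ]),
])

# Passing "a" and "b.1" but failing "b.2" yields 0.6 + 0.4 * 0.5 = 0.8
final = score(tree, {"a": True, "b.1": True, "b.2": False})
```

Failing half of an equally weighted subtree costs exactly half of that subtree's weight, which is why partial progress on a paper still earns partial credit.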

PaperBench includes a dev split (a small set of papers for testing and debugging) and a test split (the full evaluation set). The dev split is publicly available and is the recommended starting point for researchers and practitioners who want to evaluate their agents. Current state-of-the-art agents achieve scores between 0.20 and 0.40 on the full benchmark, indicating that fully automated paper reproduction remains an open challenge. The most common failure modes are dependency installation errors, incorrect hyperparameter settings, and inability to handle papers that omit key implementation details.

Key Insight

PaperBench reveals that reading comprehension is not the bottleneck for research agents. Modern LLMs can accurately summarize papers and extract methodological details. The failures occur downstream: during environment setup (conflicting package versions, missing system libraries), code generation (subtle bugs in matrix operations, incorrect loss function implementations), and execution management (GPU memory errors, training instability, nondeterministic results). This means that improving research agents requires advances in tool use, error recovery, and iterative debugging, not just better language understanding.

2. CORE-Bench: Computational Reproducibility Evaluation

CORE-Bench (Computational Reproducibility Benchmark) takes a complementary approach to PaperBench. Instead of asking agents to write code from scratch based on a paper, CORE-Bench provides the paper's original code and asks the agent to execute it and reproduce the published results. This isolates the computational reproducibility challenge: can the agent set up the environment, resolve dependency conflicts, fix broken paths, handle missing data files, and successfully run the provided code?

CORE-Bench includes 270 tasks drawn from papers across computer science, social science, and medicine. Each task is classified into one of three difficulty tiers. Easy tasks require running a single script with minimal environment setup. Medium tasks require running multiple scripts in sequence, resolving dependency conflicts, or handling missing data. Hard tasks require significant debugging, code modification, or creative problem-solving to get the original code running.

# Code Fragment 22.8.2: CORE-Bench task structure and evaluation

from dataclasses import dataclass
from enum import Enum
from pathlib import Path

class CoreBenchDifficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"

@dataclass
class CoreBenchTask:
    """A single CORE-Bench reproducibility task."""
    task_id: str
    paper_title: str
    paper_doi: str
    code_repo_path: str
    expected_outputs: dict[str, str]  # filename -> expected content/hash
    difficulty: CoreBenchDifficulty
    domain: str  # "cs", "social_science", "medicine"
    known_issues: list[str]  # documented reproducibility barriers

def evaluate_core_bench_task(
    task: CoreBenchTask,
    agent_outputs_dir: str,
    tolerance: float = 0.05,
) -> dict:
    """
    Evaluate whether the agent successfully reproduced the task outputs.

    Compares agent-generated outputs against expected outputs using
    appropriate comparison methods (exact match for text, numeric
    tolerance for data, visual similarity for plots).
    """
    results = {
        "task_id": task.task_id,
        "difficulty": task.difficulty.value,
        "outputs_checked": 0,
        "outputs_matched": 0,
        "failures": [],
    }

    for filename, expected in task.expected_outputs.items():
        agent_file = Path(agent_outputs_dir) / filename
        results["outputs_checked"] += 1

        if not agent_file.exists():
            results["failures"].append({
                "file": filename,
                "reason": "output_file_missing",
            })
            continue

        # Compare outputs
        match = compare_outputs(
            agent_output=agent_file.read_text(),
            expected_output=expected,
            tolerance=tolerance,
        )

        if match:
            results["outputs_matched"] += 1
        else:
            results["failures"].append({
                "file": filename,
                "reason": "output_mismatch",
            })

    results["score"] = (
        results["outputs_matched"] / max(results["outputs_checked"], 1)
    )
    return results

def compare_outputs(
    agent_output: str,
    expected_output: str,
    tolerance: float,
) -> bool:
    """Compare two outputs with appropriate tolerance."""
    # Try numeric comparison first
    try:
        agent_val = float(agent_output.strip())
        expected_val = float(expected_output.strip())
        return abs(agent_val - expected_val) / max(abs(expected_val), 1e-10) <= tolerance
    except ValueError:
        pass

    # Fall back to exact string match (ignoring whitespace)
    return agent_output.strip() == expected_output.strip()
Code Fragment 22.8.2: The CORE-Bench evaluation pipeline. Each task provides original code and expected outputs. The agent must execute the code (resolving any environment or data issues) and produce matching outputs. Numeric outputs are compared with tolerance; text outputs require exact matches.
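The comparator's two code paths are easy to exercise in isolation. The function below repeats the fragment's logic verbatim so the examples are self-contained:

```python
def compare_outputs(agent_output: str, expected_output: str, tolerance: float) -> bool:
    """Relative-tolerance numeric comparison, falling back to exact string match."""
    try:
        agent_val = float(agent_output.strip())
        expected_val = float(expected_output.strip())
        return abs(agent_val - expected_val) / max(abs(expected_val), 1e-10) <= tolerance
    except ValueError:
        pass
    return agent_output.strip() == expected_output.strip()

numeric_close = compare_outputs("0.912", "0.915", tolerance=0.05)  # ~0.3% off: match
numeric_far = compare_outputs("0.80", "0.915", tolerance=0.05)     # ~12.6% off: mismatch
text_exact = compare_outputs("converged\n", "converged", tolerance=0.05)  # stripped match
```

Note that a 5% tolerance on a metric like accuracy is generous; the tolerance is a per-task parameter precisely because an appropriate value depends on the metric's scale and variability.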

3. MLE-bench: End-to-End ML Engineering Evaluation

MLE-bench evaluates a broader set of ML engineering skills by presenting agents with Kaggle-style competitions. Each task provides a dataset, a problem description, an evaluation metric, and a submission format. The agent must explore the data, engineer features, select and train models, tune hyperparameters, and produce a submission file that is scored against a held-out test set. The benchmark includes 75 competitions spanning tabular data, natural language processing, computer vision, and time series forecasting.

The key differentiator of MLE-bench is that it measures practical ML engineering competence rather than isolated coding ability. Success requires the agent to make strategic decisions: which algorithm to try first, how to allocate a limited compute budget across experiments, when to stop tuning and submit, and how to diagnose and recover from common ML failures (overfitting, data leakage, evaluation metric mismatches). The benchmark scores submissions using the same evaluation metric as the original Kaggle competition, and reports the agent's percentile rank relative to the original competition leaderboard.

# Code Fragment 22.8.3: MLE-bench task runner

from dataclasses import dataclass
from typing import Optional

@dataclass
class MLEBenchTask:
    """A single MLE-bench competition task."""
    competition_id: str
    title: str
    description: str
    train_data_path: str
    test_data_path: str
    sample_submission_path: str
    evaluation_metric: str  # "rmse", "accuracy", "auc", "f1", etc.
    higher_is_better: bool
    time_budget_minutes: int
    leaderboard_scores: list[float]  # original competition scores

@dataclass
class MLEBenchResult:
    """Result of an agent's submission to an MLE-bench task."""
    competition_id: str
    agent_score: float
    percentile_rank: float  # 0 to 100, higher is better
    medal: Optional[str]  # "gold", "silver", "bronze", or None
    time_used_minutes: float
    num_submissions: int
    strategies_tried: list[str]

def score_mle_submission(
    task: MLEBenchTask,
    agent_score: float,
) -> MLEBenchResult:
    """
    Score an agent's submission against the original competition leaderboard.

    The percentile rank indicates what fraction of human competitors
    the agent would have outperformed.
    """
    leaderboard = sorted(
        task.leaderboard_scores,
        reverse=task.higher_is_better,
    )

    # Compute percentile rank
    if task.higher_is_better:
        beaten = sum(1 for s in leaderboard if agent_score > s)
    else:
        beaten = sum(1 for s in leaderboard if agent_score < s)

    percentile = (beaten / len(leaderboard)) * 100 if leaderboard else 0

    # Determine medal (top 10% gold, top 25% silver, top 40% bronze)
    medal = None
    if percentile >= 90:
        medal = "gold"
    elif percentile >= 75:
        medal = "silver"
    elif percentile >= 60:
        medal = "bronze"

    return MLEBenchResult(
        competition_id=task.competition_id,
        agent_score=agent_score,
        percentile_rank=percentile,
        medal=medal,
        time_used_minutes=0,  # filled by the harness
        num_submissions=0,
        strategies_tried=[],
    )

# Example: evaluating an agent on a tabular regression task
# task = MLEBenchTask(
#     competition_id="house-prices",
#     title="House Prices: Advanced Regression",
#     evaluation_metric="rmse",
#     higher_is_better=False,
#     leaderboard_scores=[0.12, 0.13, 0.14, 0.15, ...],
#     ...
# )
# result = score_mle_submission(task, agent_score=0.128)
# print(f"Percentile: {result.percentile_rank:.1f}%, Medal: {result.medal}")
Code Fragment 22.8.3: MLE-bench task runner and scoring. The agent's submission is scored using the original competition metric and ranked against the historical leaderboard. A percentile rank of 85 means the agent outperformed 85% of human competitors. Medal thresholds follow Kaggle conventions.
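The percentile arithmetic can be checked by hand with a toy leaderboard. The helper below isolates the ranking logic from the fragment above, here on an RMSE metric where lower is better:

```python
def percentile_rank(leaderboard: list[float], agent_score: float,
                    higher_is_better: bool) -> float:
    """Fraction of leaderboard entries the agent beats, as a percentage."""
    if higher_is_better:
        beaten = sum(1 for s in leaderboard if agent_score > s)
    else:
        beaten = sum(1 for s in leaderboard if agent_score < s)
    return (beaten / len(leaderboard)) * 100 if leaderboard else 0.0

# RMSE leaderboard (lower is better). An agent RMSE of 0.128 beats
# 0.13, 0.14, and 0.15 but not 0.12 -> 3 of 4 entries -> 75th percentile,
# which crosses the silver threshold (>= 75) but not gold (>= 90).
rank = percentile_rank([0.12, 0.13, 0.14, 0.15], agent_score=0.128,
                       higher_is_better=False)
```

Real leaderboards have hundreds or thousands of entries, so the percentile is far less sensitive to a single competitor than in this four-entry toy.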

4. Failure Taxonomies for Research and ML Agents

Analyzing agent failures across PaperBench, CORE-Bench, and MLE-bench reveals a recurring set of failure categories. Understanding these categories is essential for improving agent performance and for designing more informative benchmarks. The taxonomy below organizes failures by their root cause rather than their surface symptoms.

Data failures occur when the agent cannot acquire, parse, or preprocess the required data. Common examples include: datasets that have been removed from their original hosting location, CSV files with inconsistent encoding, image datasets that require specific directory structures, and preprocessing steps that are described ambiguously in the paper.

Infrastructure failures occur during environment setup and execution. These include: conflicting Python package versions (the paper requires PyTorch 1.13 but the code uses features from PyTorch 2.0), missing system-level dependencies (CUDA version mismatches, missing C compilers), and resource constraints (the agent's sandbox has less GPU memory than the original experiments required).

Modelling failures occur when the agent's implementation differs from the paper in ways that affect results. These are the most subtle failures because the code may run without errors but produce incorrect results. Examples include: using the wrong activation function, implementing attention with the wrong scaling factor, or applying regularization at the wrong stage of the pipeline.

Scoring failures occur during evaluation. The agent may train a model correctly but compute the evaluation metric incorrectly, use the wrong data split for evaluation, or produce output in a format that the scoring pipeline cannot parse. These failures are particularly frustrating because the core work is correct but the final step is wrong.

# Code Fragment 22.8.4: Classifying agent failures for postmortem analysis

from dataclasses import dataclass
from enum import Enum

class FailureCategory(Enum):
    DATA_ACQUISITION = "data_acquisition"
    DATA_PREPROCESSING = "data_preprocessing"
    DATA_FORMAT = "data_format"
    INFRA_DEPENDENCY = "infra_dependency"
    INFRA_RESOURCE = "infra_resource"
    INFRA_ENVIRONMENT = "infra_environment"
    MODEL_ARCHITECTURE = "model_architecture"
    MODEL_HYPERPARAMETER = "model_hyperparameter"
    MODEL_TRAINING = "model_training"
    SCORING_METRIC = "scoring_metric"
    SCORING_FORMAT = "scoring_format"
    SCORING_SPLIT = "scoring_split"

@dataclass
class FailureRecord:
    """A structured record of an agent failure."""
    task_id: str
    category: FailureCategory
    description: str
    error_message: str
    agent_action_before_failure: str
    recovery_attempted: bool
    recovery_succeeded: bool

def analyze_failure_distribution(
    failures: list[FailureRecord],
) -> dict:
    """
    Analyze the distribution of failures across categories.

    Returns per-category counts and recovery success rates.
    """
    category_stats = {}

    for failure in failures:
        cat = failure.category.value
        if cat not in category_stats:
            category_stats[cat] = {
                "count": 0,
                "recovery_attempted": 0,
                "recovery_succeeded": 0,
            }

        category_stats[cat]["count"] += 1
        if failure.recovery_attempted:
            category_stats[cat]["recovery_attempted"] += 1
        if failure.recovery_succeeded:
            category_stats[cat]["recovery_succeeded"] += 1

    # Compute recovery rates
    for cat, stats in category_stats.items():
        attempted = stats["recovery_attempted"]
        stats["recovery_rate"] = (
            stats["recovery_succeeded"] / attempted if attempted > 0 else 0.0
        )

    return {
        "total_failures": len(failures),
        "by_category": category_stats,
        "top_category": max(
            category_stats, key=lambda k: category_stats[k]["count"]
        ) if category_stats else None,
    }
Code Fragment 22.8.4: A failure classification system for research agent postmortems. Each failure is categorized by root cause (data, infrastructure, modelling, or scoring) and further classified by sub-type. The analysis function computes failure distributions and recovery rates, identifying which failure categories agents handle well and which require improvement.

5. Task Design Challenges: Nondeterminism, Missing Data, and Dependencies

Designing tasks for research replication benchmarks requires careful attention to three challenges that do not arise in standard coding benchmarks. Nondeterminism is inherent in ML training: different random seeds, GPU non-determinism, and floating-point ordering can cause results to vary between runs even with identical code. Benchmarks must define appropriate tolerances (PaperBench uses rubric-based judgment; CORE-Bench uses numeric tolerance; MLE-bench uses leaderboard ranking, which is inherently tolerant of small variations).

Missing datasets are a common problem for replication benchmarks. Papers may reference datasets that are no longer publicly available, require registration or approval to access, or are hosted on services with rate limits. Benchmarks address this by either pre-downloading all required data into the sandbox, providing synthetic substitutes, or restricting the task set to papers with permanently available data.
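As a sketch of the pre-downloading mitigation, the hypothetical `fetch_dataset` helper below prefers a local cache staged inside the sandbox and touches the network only on a cache miss. The function name and cache layout are illustrative, not part of any benchmark harness:

```python
from pathlib import Path
import urllib.request

def fetch_dataset(url: str, cache_path: str) -> bytes:
    """Return dataset bytes from a pre-staged cache, downloading only on a miss."""
    cache = Path(cache_path)
    if cache.exists():
        # Sandbox images pre-stage data here, so tasks survive dead links,
        # gated downloads, and rate limits on the original host.
        return cache.read_bytes()
    # Network fallback: may fail if the dataset has moved or requires approval
    data = urllib.request.urlopen(url, timeout=30).read()
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_bytes(data)
    return data
```

Benchmarks that pre-stage all data effectively guarantee the cache-hit path, turning data availability from a per-run hazard into a one-time curation step.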

Dependency management is perhaps the most underappreciated challenge. A paper published in 2023 may depend on package versions that conflict with the Python version available in 2025. The agent must either find compatible package versions, use containerized environments with specific Python versions, or modify the code to work with newer packages. This is a skill that many human ML engineers struggle with, and it is a significant source of failures for agents as well.
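A minimal sketch of the pre-flight version audit an agent might run before executing anything. `parse_pins` and `check_pins` are hypothetical helpers built only on the standard library's `importlib.metadata`:

```python
from importlib import metadata

def parse_pins(requirements: list[str]) -> dict[str, str]:
    """Parse 'package==version' pins into {name: version}, skipping comments."""
    pins = {}
    for line in requirements:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, version = line.partition("==")
        pins[name.strip().lower()] = version.strip()
    return pins

def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Report each pin as 'ok', 'missing', or a mismatch against the live environment."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed == wanted else f"mismatch: installed {installed}"
    return report
```

Running this audit first lets the agent decide up front whether to rebuild the environment or patch the code for newer package versions, instead of discovering the conflict mid-training.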

# Code Fragment 22.8.5: Handling nondeterminism in benchmark scoring

import numpy as np
from typing import Optional

def score_with_nondeterminism_tolerance(
    agent_results: list[float],
    reference_results: list[float],
    relative_tolerance: float = 0.05,
    absolute_tolerance: float = 0.01,
    reference_std: Optional[float] = None,
) -> dict:
    """
    Score agent results against a reference, accounting for
    nondeterminism in ML training.

    Uses a two-tier tolerance: results within the reference's
    own variability range (measured across multiple reference runs)
    are considered a match. Results within a broader absolute or
    relative tolerance are considered a partial match.
    """
    agent_arr = np.array(agent_results)
    ref_arr = np.array(reference_results)

    # If no reference standard deviation is provided, estimate one
    # from the relative tolerance
    if reference_std is None:
        reference_std = np.mean(np.abs(ref_arr)) * relative_tolerance

    diffs = np.abs(agent_arr - ref_arr)
    relative_diffs = diffs / np.maximum(np.abs(ref_arr), 1e-10)

    # Tier 1: within reference variability (strong match)
    strong_matches = np.sum(diffs <= 2 * reference_std)

    # Tier 2: within tolerance (acceptable match)
    acceptable_matches = np.sum(
        (relative_diffs <= relative_tolerance) |
        (diffs <= absolute_tolerance)
    )

    # Tier 3: outside tolerance (mismatch)
    mismatches = len(agent_arr) - acceptable_matches

    return {
        "total_metrics": len(agent_arr),
        "strong_matches": int(strong_matches),
        "acceptable_matches": int(acceptable_matches),
        "mismatches": int(mismatches),
        "match_rate": float(acceptable_matches / max(len(agent_arr), 1)),
        "mean_relative_error": float(np.mean(relative_diffs)),
    }
Code Fragment 22.8.5: A scoring function that accounts for nondeterminism in ML experiments. Results are classified into three tiers: strong matches (within the reference's own variability), acceptable matches (within a broader tolerance), and mismatches. This prevents false negatives where the agent's implementation is correct but produces slightly different numbers due to random initialization or hardware differences.
Real-World Scenario: PaperBench Dev Split Postmortem

Who: An ML infrastructure team at an AI lab evaluating their research replication agent.

Situation: The team ran the PaperBench dev split against their agent to establish a baseline before a customer demo. The agent attempted five papers and achieved rubric scores of 0.42, 0.18, 0.31, 0.05, and 0.38.

Problem: The average score of 0.27 was well below the team's target of 0.50. Worse, the failures were scattered across different categories, making it unclear where to focus improvement effort.

Decision: The team conducted a structured postmortem, mapping each paper's failure to a taxonomy: Paper 1 (0.42) failed on evaluation due to a custom metric not in scikit-learn (scoring failure). Paper 2 (0.18) could not access a gated dataset (data failure). Paper 3 (0.31) used cosine scheduling instead of linear warmup (modelling failure). Paper 4 (0.05) required a specific CUDA version (infrastructure failure). Paper 5 (0.38) produced slightly misformatted figures (scoring/formatting failure).

Result: The taxonomy revealed that infrastructure and data access failures (Papers 2 and 4) were complete blockers, while modelling and scoring failures allowed partial progress. The team prioritized dependency management and dataset credential handling, raising average scores to 0.41 within two weeks.

Lesson: Structured failure taxonomies transform opaque benchmark scores into prioritized engineering roadmaps by distinguishing total blockers from partial degradations.

Key Insight

Research replication benchmarks measure a fundamentally different capability than coding benchmarks. HumanEval and SWE-bench test whether an agent can write correct code for a well-specified task. PaperBench, CORE-Bench, and MLE-bench test whether an agent can manage the full lifecycle of a research or ML project: reading underspecified instructions, making design decisions under uncertainty, debugging failures iteratively, and producing validated results. The gap between coding benchmark scores and research benchmark scores for current agents suggests that project management and iterative problem-solving remain significant challenges for AI agents.

Lab: Run PaperBench Dev Split and Write a Postmortem

This section outlines a hands-on lab exercise for running the PaperBench dev split and conducting a structured failure analysis. The lab is designed to give practitioners direct experience with the benchmark and to develop the analytical skills needed to interpret agent evaluation results.

# Code Fragment 22.8.6: Setting up and running PaperBench dev split

# Clone the PaperBench repository
git clone https://github.com/openai/preparedness.git
cd preparedness/paperbench

# Install dependencies
pip install -e ".[dev]"

# Configure your agent (example: using an OpenAI agent)
export OPENAI_API_KEY="your-key-here"

# Run the dev split (subset of papers for testing)
python run_benchmark.py \
    --split dev \
    --agent openai \
    --model gpt-4o \
    --output-dir results/dev_run_001 \
    --timeout-per-paper 3600

# The benchmark will:
# 1. Download each paper's PDF and supplementary materials
# 2. Launch a sandboxed environment for the agent
# 3. Let the agent attempt reproduction
# 4. Score the results against the rubric
# 5. Generate a detailed report

# View results summary
python analyze_results.py --results-dir results/dev_run_001
Code Fragment 22.8.6: Setup and execution commands for the PaperBench dev split. The benchmark runs each paper in an isolated sandbox with a configurable timeout. Results include per-paper rubric scores, detailed checkpoint breakdowns, and the agent's execution trace (all tool calls, code generated, and errors encountered).

After running the benchmark, write a structured postmortem for each paper using the following template: (1) Summary: overall rubric score and which major sections passed or failed. (2) Failure classification: map each failure to the taxonomy from Section 4 (data, infrastructure, modelling, or scoring). (3) Root cause analysis: for the most impactful failure, trace the agent's decision history to identify where it went wrong. (4) Proposed improvements: specific changes to the agent's prompts, tools, or architecture that would address the identified failures. This postmortem practice builds the diagnostic skills needed to systematically improve research agents.
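The four-part template can also be captured as a small data structure so postmortems stay machine-readable across runs. The `Postmortem` class and its field names below are an illustrative choice, not part of the PaperBench tooling:

```python
from dataclasses import dataclass

@dataclass
class Postmortem:
    paper_id: str
    rubric_score: float
    failure_categories: list[str]  # taxonomy labels: data / infrastructure / modelling / scoring
    root_cause: str
    proposed_improvements: list[str]

    def to_report(self) -> str:
        """Render the postmortem as plain text following the lab template."""
        lines = [
            f"Postmortem: {self.paper_id} (score {self.rubric_score:.2f})",
            "Failure classification: " + ", ".join(self.failure_categories),
            "Root cause: " + self.root_cause,
            "Proposed improvements:",
        ]
        lines += [f"  - {item}" for item in self.proposed_improvements]
        return "\n".join(lines)

pm = Postmortem(
    paper_id="paper-04",
    rubric_score=0.05,
    failure_categories=["infrastructure"],
    root_cause="required CUDA version unavailable in sandbox image",
    proposed_improvements=["pre-build images per CUDA version",
                           "add dependency pre-flight check"],
)
report = pm.to_report()
```

Structured records like this make it trivial to aggregate failure distributions across many runs, which is exactly what the postmortem analysis in Section 4 consumes.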

Warning

PaperBench requires significant compute resources. Running the full benchmark requires GPU access (for training ML models) and can consume hundreds of dollars in API credits (for the agent's LLM calls). The dev split is much cheaper but still requires a capable GPU for some papers. Budget approximately $20 to $50 in API credits and 2 to 4 GPU-hours for a dev split run. Always set a per-paper timeout to prevent runaway costs from agents stuck in retry loops.

Exercises

Exercise 22.8.1: Rubric Design (Conceptual)

Choose a recent ML paper you are familiar with. Design a PaperBench-style rubric tree with at least 20 leaf checkpoints organized into four to five top-level categories. Assign weights to each checkpoint reflecting its importance to the reproduction. Explain your weighting decisions.

Answer Sketch

A rubric for a paper on "BERT fine-tuning for sentiment analysis" might include: Environment (10%): correct Python version, transformers library version, dataset download. Data Preparation (15%): tokenization matches paper, train/val/test splits match, data augmentation if specified. Model (25%): correct base model, correct classification head architecture, correct freezing/unfreezing schedule. Training (25%): correct optimizer, learning rate, batch size, number of epochs, early stopping criteria. Evaluation (25%): correct metric (F1 vs accuracy), correct test set, results within tolerance. Higher weights on training and evaluation because those directly determine whether results match.

Exercise 22.8.2: Failure Taxonomy Application (Conceptual)

Review the agent trace from a failed PaperBench or CORE-Bench run (you may use a hypothetical trace if you do not have access to a real one). Classify each error into the failure taxonomy from Section 4. Identify which failure was the "blocking" failure (the one that prevented further progress) and which were "downstream" failures caused by the blocking failure.

Answer Sketch

Example trace: (1) Agent installs torch==1.13.0, gets dependency conflict with numpy 2.0 (INFRA_DEPENDENCY, blocking). (2) Agent falls back to torch==2.0.0, but the paper's custom CUDA kernel does not compile (INFRA_ENVIRONMENT, downstream). (3) Agent skips the custom kernel and uses a pure Python fallback, but training is 10x slower and times out (INFRA_RESOURCE, downstream). The blocking failure is the initial dependency conflict. Fixing that single issue would likely resolve the cascade.

Exercise 22.8.3: MLE-bench Strategy Analysis (Coding)

Design an agent strategy for MLE-bench that allocates a fixed compute budget across exploration (trying different model types) and exploitation (tuning the best model). Implement a simple budget allocator that decides, at each step, whether to explore a new model family or tune the current best. Test it on a simulated competition with known ground truth.

Answer Sketch

Use an explore-then-exploit strategy: spend the first 40% of the budget trying at least three model families (linear, tree-based, neural), each with default hyperparameters. Record validation scores. Spend the remaining 60% on hyperparameter tuning of the top model. Implement as a state machine with EXPLORE and EXPLOIT phases. The transition trigger is either budget exhaustion for the explore phase or a plateau in validation score improvement. For the simulated competition, generate a regression dataset where gradient boosting is the best model family, and verify that the agent discovers this during exploration.
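One possible shape for this allocator, as a toy simulation. The validation-score model is synthetic and the family means are chosen so that gradient boosting wins by construction; in the real exercise the scores would come from actual training runs:

```python
import random

def run_explore_exploit(total_budget: int = 30, explore_frac: float = 0.4,
                        seed: int = 0) -> tuple[str, float]:
    """Toy EXPLORE/EXPLOIT allocator on a simulated competition where
    gradient boosting is the best family by construction."""
    rng = random.Random(seed)
    true_means = {"linear": 0.70, "gradient_boosting": 0.85, "neural": 0.78}
    scores: dict[str, list[float]] = {f: [] for f in true_means}

    # EXPLORE phase: round-robin over families with noisy simulated scores
    explore_steps = int(total_budget * explore_frac)
    families = list(true_means)
    for step in range(explore_steps):
        fam = families[step % len(families)]
        scores[fam].append(true_means[fam] + rng.gauss(0, 0.02))

    # EXPLOIT phase: tune the best family; each step is a small noisy improvement
    best_family = max(scores, key=lambda f: max(scores[f], default=float("-inf")))
    best_score = max(scores[best_family])
    for _ in range(total_budget - explore_steps):
        best_score = max(best_score, best_score + rng.gauss(0.002, 0.005))
    return best_family, best_score
```

A plateau-based transition (switching to EXPLOIT once exploration stops improving) is a natural refinement of the fixed 40/60 split used here.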

Exercise 22.8.4: Nondeterminism Analysis (Coding)

Run a simple ML training script (e.g., MNIST classification with a small CNN) five times with different random seeds. Record the test accuracy for each run. Compute the mean, standard deviation, and 95% confidence interval. Then use the scoring function from Code Fragment 22.8.5 to determine what tolerance would be needed to accept all five runs as "matching" the mean.

Answer Sketch

Typical MNIST CNN results: 99.1%, 99.2%, 99.0%, 99.15%, 99.05%. Mean: 99.1%, std: 0.07%. A relative tolerance of 0.001 (0.1%) would accept all runs. The 95% confidence interval is approximately 99.1% plus or minus 0.09%. This exercise demonstrates that even simple tasks have nontrivial nondeterminism, and benchmark tolerances must be calibrated empirically rather than set arbitrarily.
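The arithmetic in this sketch can be verified directly with the standard library. The accuracies are the illustrative values above (in percent), and 2.776 is the t critical value for 4 degrees of freedom:

```python
import statistics
import math

runs = [99.1, 99.2, 99.0, 99.15, 99.05]  # illustrative test accuracies (%)
mean = statistics.mean(runs)
std = statistics.pstdev(runs)  # std over the five runs, as in the sketch

# 95% CI half-width using the t-distribution (t = 2.776 for df = 4)
ci_half_width = 2.776 * std / math.sqrt(len(runs))
```

This yields a mean of 99.1, a standard deviation of about 0.071, and a half-width of about 0.088, consistent with the roughly plus or minus 0.09 quoted above.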

Exercise 22.8.5: Postmortem Lab (Coding)

Run the PaperBench dev split (or simulate it using the rubric scorer from Code Fragment 22.8.1 with synthetic pass/fail results). Write a structured postmortem following the template from the lab above: summary, failure classification, root cause analysis, and proposed improvements. Present your findings as a one-page report with a failure distribution chart.

Answer Sketch

Simulate five papers with rubric trees of 50 checkpoints each. Randomly assign 60% of checkpoints as passed, 40% as failed, with failures clustered in infrastructure (50% of failures), modelling (25%), data (15%), and scoring (10%). The postmortem should note that infrastructure failures are the dominant category and propose: (1) pre-building Docker images for common ML frameworks, (2) adding a dependency resolver tool that checks compatibility before installation, (3) implementing a "fallback environment" strategy that tries multiple Python versions. The failure distribution chart is a bar chart showing counts per failure category.
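The simulation described here takes only a few lines. The weighted category list encodes the stated failure clustering (50/25/15/10), and all numbers are synthetic by design:

```python
import random

def simulate_failure_counts(num_papers: int = 5, checkpoints: int = 50,
                            pass_rate: float = 0.6, seed: int = 0) -> dict[str, int]:
    """Simulate failed checkpoints and assign each one a category with the
    stated clustering: infrastructure 50%, modelling 25%, data 15%, scoring 10%."""
    rng = random.Random(seed)
    weighted = (["infrastructure"] * 50 + ["modelling"] * 25 +
                ["data"] * 15 + ["scoring"] * 10)
    counts = {"infrastructure": 0, "modelling": 0, "data": 0, "scoring": 0}
    for _ in range(num_papers * checkpoints):
        if rng.random() >= pass_rate:  # this checkpoint failed
            counts[rng.choice(weighted)] += 1
    return counts
```

The resulting counts feed directly into the failure distribution bar chart the report asks for; with five papers of 50 checkpoints at a 60% pass rate, expect roughly 100 failures, dominated by infrastructure.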

Self-Check
Q1: What does PaperBench evaluate, and why is reproducing research papers a strong test of agent capability?
Show Answer

PaperBench evaluates an agent's ability to reproduce the experiments and results from published ML research papers. This tests reading comprehension, code generation, environment setup, debugging, and scientific reasoning, making it one of the most comprehensive agent benchmarks.

Q2: How does MLE-bench differ from SWE-bench in what it measures about coding agents?
Show Answer

SWE-bench measures bug-fixing ability (resolving GitHub issues with patches), while MLE-bench measures ML engineering ability (competing on Kaggle-style data science challenges), testing model selection, feature engineering, hyperparameter tuning, and submission formatting.

What Comes Next

In the next chapter, Chapter 23: Tool Use, Function Calling & Protocols, we explore how agents interact with the outside world through tool use. The evaluation frameworks covered here provide the measurement infrastructure for assessing whether tool-using agents can handle the complex, multi-step workflows that research replication demands.

References & Further Reading
Research Replication Benchmarks

Starace, G., Jaffe, O., Sherburn, D., et al. (2025). PaperBench: Evaluating AI's Ability to Replicate AI Research. arXiv:2504.01848.

The primary reference for PaperBench. Introduces the rubric-based evaluation methodology and reports results across multiple frontier models. Essential reading for understanding the benchmark design covered in Sections 1 and 2.


Siegel, Z. S., Kapoor, S., Nadgir, N., et al. (2024). CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark. arXiv:2409.11363.

Defines the three-tier (easy, medium, hard) computational reproducibility benchmark. Complements PaperBench by focusing on re-executing existing code rather than writing new code from papers.


Chan, J. S., Chowdhury, N., Jaffe, O., et al. (2024). MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. arXiv:2410.07095.

Evaluates agents on 75 Kaggle competitions covering the full ML engineering pipeline. Provides the medal-based scoring system discussed in Section 3.

Related Code Generation Benchmarks

Jimenez, C. E., Yang, J., Wettig, A., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.

The bug-fixing benchmark that complements MLE-bench. Tests whether agents can resolve real GitHub issues by generating patches, providing a baseline for comparing different types of coding agent evaluations.


Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.

Introduces HumanEval and the pass@k metric for code generation evaluation. Provides the foundation against which the more complex benchmarks in this section are compared.

Paper