"The real test of an AI research agent is not whether it can read a paper, but whether it can reproduce the results."
Probe, Reproducibility-Obsessed AI Agent
Can an AI agent reproduce a research paper from scratch? This question, once hypothetical, is now the subject of rigorous benchmarks. PaperBench evaluates agents that attempt to replicate the experiments in published research papers, scoring their work against expert-written rubrics. CORE-Bench measures computational reproducibility by testing whether agents can re-execute existing research code and match published results. MLE-bench evaluates end-to-end ML engineering capability through Kaggle-style competitions. Together, these benchmarks measure something fundamentally different from standard coding benchmarks like HumanEval: they test the agent's ability to manage complex, multi-step workflows that involve reading documentation, installing dependencies, handling nondeterminism, debugging failures, and producing validated results. This section covers the design, methodology, failure modes, and practical applications of all three benchmarks.
Prerequisites
This section builds on the agent architecture concepts from Section 22.6, tool use from Chapter 23, and the evaluation frameworks from Chapter 29. Familiarity with the sandboxed execution environments from Section 26.2 is helpful, as all three benchmarks discussed here rely on containerized execution. Prior experience with ML workflows (training, evaluation, hyperparameter tuning) is assumed.
1. PaperBench: Reproducing Research Papers
PaperBench, introduced by OpenAI in 2025, is a benchmark for evaluating AI agents that attempt to reproduce the results of machine learning research papers. The benchmark consists of 20 papers from ICML 2024, each accompanied by a detailed rubric that decomposes the reproduction task into a hierarchy of sub-tasks. A complete rubric may contain 100 to 300 individual checkpoints organized in a tree structure, from high-level goals ("reproduce Figure 3") down to specific implementation details ("use AdamW optimizer with learning rate 3e-4").
The PaperBench evaluation pipeline works as follows. The agent receives the paper's PDF, any supplementary materials, and access to a sandboxed compute environment with internet access. The agent must read the paper, understand the methodology, write the code to reproduce the experiments, execute the code, and produce outputs that match the paper's reported results. A separate judge agent then evaluates the agent's outputs against the rubric, scoring each checkpoint as passed or failed. The final score is a weighted sum of rubric checkpoints, where weights reflect the relative importance and difficulty of each checkpoint.
# Code Fragment 22.8.1: PaperBench rubric structure and scoring
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    """A single node in the PaperBench rubric tree."""
    node_id: str
    description: str
    weight: float
    children: list["RubricNode"] = field(default_factory=list)
    is_leaf: bool = False
    verification_method: str = "judge_model"  # or "exact_match", "numeric_tolerance"
    tolerance: float | None = None  # used only for numeric verification


@dataclass
class PaperBenchTask:
    """A complete PaperBench task for one research paper."""
    paper_id: str
    paper_title: str
    paper_pdf_path: str
    supplementary_paths: list[str]
    rubric_root: RubricNode
    total_checkpoints: int
    compute_budget_hours: float
    required_packages: list[str]


def build_sample_rubric() -> RubricNode:
    """Example rubric for a simple paper reproduction."""
    return RubricNode(
        node_id="root",
        description="Reproduce all experiments from the paper",
        weight=1.0,
        children=[
            RubricNode(
                node_id="setup",
                description="Environment and data preparation",
                weight=0.2,
                children=[
                    RubricNode(
                        node_id="setup.deps",
                        description="All dependencies installed correctly",
                        weight=0.5, is_leaf=True,
                    ),
                    RubricNode(
                        node_id="setup.data",
                        description="Dataset downloaded and preprocessed",
                        weight=0.5, is_leaf=True,
                    ),
                ],
            ),
            RubricNode(
                node_id="training",
                description="Model training pipeline",
                weight=0.4,
                children=[
                    RubricNode(
                        node_id="training.arch",
                        description="Model architecture matches paper description",
                        weight=0.3, is_leaf=True,
                    ),
                    RubricNode(
                        node_id="training.optim",
                        description="Optimizer and hyperparameters match paper",
                        weight=0.3, is_leaf=True,
                    ),
                    RubricNode(
                        node_id="training.converge",
                        description="Training converges within stated epochs",
                        weight=0.4, is_leaf=True,
                        verification_method="numeric_tolerance",
                        tolerance=0.02,
                    ),
                ],
            ),
            RubricNode(
                node_id="results",
                description="Reproduce reported results",
                weight=0.4,
                children=[
                    RubricNode(
                        node_id="results.table1",
                        description="Table 1 results within tolerance",
                        weight=0.5, is_leaf=True,
                        verification_method="numeric_tolerance",
                        tolerance=0.01,
                    ),
                    RubricNode(
                        node_id="results.fig3",
                        description="Figure 3 qualitatively matches",
                        weight=0.5, is_leaf=True,
                        verification_method="judge_model",
                    ),
                ],
            ),
        ],
    )


def score_rubric(root: RubricNode, results: dict[str, bool]) -> float:
    """
    Recursively score a rubric tree given leaf-level pass/fail results.

    Parameters
    ----------
    root : RubricNode
        The root of the rubric tree.
    results : dict
        Mapping from leaf node_id to True (pass) or False (fail).

    Returns
    -------
    float
        Weighted score between 0.0 and 1.0.
    """
    if root.is_leaf:
        return root.weight if results.get(root.node_id, False) else 0.0
    child_scores = sum(score_rubric(child, results) for child in root.children)
    total_child_weight = sum(child.weight for child in root.children)
    if total_child_weight == 0:
        return 0.0
    normalized = child_scores / total_child_weight
    return root.weight * normalized
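To make the weighting concrete, here is a compact, self-contained sketch of the same weighted-tree recursion using plain dicts. The rubric values mirror the hypothetical sample rubric above; none of the numbers come from a real PaperBench task.

```python
# A compact, dict-based sketch of the weighted-tree scoring recursion
# (all weights and checkpoint IDs are hypothetical).
def score_tree(node: dict, results: dict[str, bool]) -> float:
    """Recursively compute a weighted score for a rubric subtree."""
    if not node.get("children"):  # leaf: full weight if passed, else 0
        return node["weight"] if results.get(node["id"], False) else 0.0
    total = sum(score_tree(c, results) for c in node["children"])
    max_total = sum(c["weight"] for c in node["children"])
    return node["weight"] * (total / max_total) if max_total else 0.0

rubric = {
    "id": "root", "weight": 1.0, "children": [
        {"id": "setup", "weight": 0.2, "children": [
            {"id": "setup.deps", "weight": 0.5},
            {"id": "setup.data", "weight": 0.5},
        ]},
        {"id": "training", "weight": 0.4, "children": [
            {"id": "training.arch", "weight": 0.3},
            {"id": "training.optim", "weight": 0.3},
            {"id": "training.converge", "weight": 0.4},
        ]},
        {"id": "results", "weight": 0.4, "children": [
            {"id": "results.table1", "weight": 0.5},
            {"id": "results.fig3", "weight": 0.5},
        ]},
    ],
}
# Everything passes except the two hardest checkpoints:
passed = {k: True for k in ["setup.deps", "setup.data", "training.arch",
                            "training.optim", "results.table1"]}
score = score_tree(rubric, passed)
print(round(score, 2))  # setup 0.20 + training 0.24 + results 0.20 = 0.64
```

Note that failing a heavily weighted leaf (here `training.converge` at weight 0.4 within a 0.4-weight branch) costs far more than failing an equally "hard" but lightly weighted one, which is exactly the behavior rubric authors exploit.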
PaperBench includes a dev split (a small set of papers for testing and debugging) and a test split (the full evaluation set). The dev split is publicly available and is the recommended starting point for researchers and practitioners who want to evaluate their agents. Current state-of-the-art agents achieve scores between 0.20 and 0.40 on the full benchmark, indicating that fully automated paper reproduction remains an open challenge. The most common failure modes are dependency installation errors, incorrect hyperparameter settings, and inability to handle papers that omit key implementation details.
PaperBench reveals that reading comprehension is not the bottleneck for research agents. Modern LLMs can accurately summarize papers and extract methodological details. The failures occur downstream: during environment setup (conflicting package versions, missing system libraries), code generation (subtle bugs in matrix operations, incorrect loss function implementations), and execution management (GPU memory errors, training instability, nondeterministic results). This means that improving research agents requires advances in tool use, error recovery, and iterative debugging, not just better language understanding.
2. CORE-Bench: Computational Reproducibility Evaluation
CORE-Bench (Computational Reproducibility Benchmark) takes a complementary approach to PaperBench. Instead of asking agents to write code from scratch based on a paper, CORE-Bench provides the paper's original code and asks the agent to execute it and reproduce the published results. This isolates the computational reproducibility challenge: can the agent set up the environment, resolve dependency conflicts, fix broken paths, handle missing data files, and successfully run the provided code?
CORE-Bench includes 270 tasks drawn from papers across computer science, social science, and medicine. Each task is classified into one of three difficulty tiers. Easy tasks require running a single script with minimal environment setup. Medium tasks require running multiple scripts in sequence, resolving dependency conflicts, or handling missing data. Hard tasks require significant debugging, code modification, or creative problem-solving to get the original code running.
# Code Fragment 22.8.2: CORE-Bench task structure and evaluation
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class CoreBenchDifficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass
class CoreBenchTask:
    """A single CORE-Bench reproducibility task."""
    task_id: str
    paper_title: str
    paper_doi: str
    code_repo_path: str
    expected_outputs: dict[str, str]  # filename -> expected content/hash
    difficulty: CoreBenchDifficulty
    domain: str  # "cs", "social_science", "medicine"
    known_issues: list[str]  # documented reproducibility barriers


def evaluate_core_bench_task(
    task: CoreBenchTask,
    agent_outputs_dir: str,
    tolerance: float = 0.05,
) -> dict:
    """
    Evaluate whether the agent successfully reproduced the task outputs.

    Compares agent-generated outputs against expected outputs using
    appropriate comparison methods (exact match for text, numeric
    tolerance for data, visual similarity for plots).
    """
    results = {
        "task_id": task.task_id,
        "difficulty": task.difficulty.value,
        "outputs_checked": 0,
        "outputs_matched": 0,
        "failures": [],
    }
    for filename, expected in task.expected_outputs.items():
        agent_file = Path(agent_outputs_dir) / filename
        results["outputs_checked"] += 1
        if not agent_file.exists():
            results["failures"].append({
                "file": filename,
                "reason": "output_file_missing",
            })
            continue
        # Compare outputs
        match = compare_outputs(
            agent_output=agent_file.read_text(),
            expected_output=expected,
            tolerance=tolerance,
        )
        if match:
            results["outputs_matched"] += 1
        else:
            results["failures"].append({
                "file": filename,
                "reason": "output_mismatch",
            })
    results["score"] = (
        results["outputs_matched"] / max(results["outputs_checked"], 1)
    )
    return results


def compare_outputs(
    agent_output: str,
    expected_output: str,
    tolerance: float,
) -> bool:
    """Compare two outputs with appropriate tolerance."""
    # Try numeric comparison first
    try:
        agent_val = float(agent_output.strip())
        expected_val = float(expected_output.strip())
        return abs(agent_val - expected_val) / max(abs(expected_val), 1e-10) <= tolerance
    except ValueError:
        pass
    # Fall back to exact string match (ignoring whitespace)
    return agent_output.strip() == expected_output.strip()
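The two-stage comparison can be exercised standalone. The following sketch repeats the same logic in a single function and runs it on a few hypothetical outputs (the values are made up for illustration):

```python
# Standalone sketch of the two-stage comparison: numeric values are matched
# with a relative tolerance; everything else falls back to an exact
# (whitespace-stripped) string match.
def outputs_match(agent: str, expected: str, tol: float = 0.05) -> bool:
    try:
        a, e = float(agent.strip()), float(expected.strip())
        return abs(a - e) / max(abs(e), 1e-10) <= tol
    except ValueError:
        return agent.strip() == expected.strip()

print(outputs_match("0.847", "0.85"))   # True: ~0.4% off, within 5% tolerance
print(outputs_match("0.95", "0.85"))    # False: ~11.8% off
print(outputs_match("accuracy: 0.85\n", "accuracy: 0.85"))  # True: string match
```

The fallback order matters: trying the numeric parse first means `"0.850"` and `"0.85"` match even though they differ as strings.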
3. MLE-bench: End-to-End ML Engineering Evaluation
MLE-bench evaluates a broader set of ML engineering skills by presenting agents with Kaggle-style competitions. Each task provides a dataset, a problem description, an evaluation metric, and a submission format. The agent must explore the data, engineer features, select and train models, tune hyperparameters, and produce a submission file that is scored against a held-out test set. The benchmark includes 75 competitions spanning tabular data, natural language processing, computer vision, and time series forecasting.
The key differentiator of MLE-bench is that it measures practical ML engineering competence rather than isolated coding ability. Success requires the agent to make strategic decisions: which algorithm to try first, how to allocate a limited compute budget across experiments, when to stop tuning and submit, and how to diagnose and recover from common ML failures (overfitting, data leakage, evaluation metric mismatches). The benchmark scores submissions using the same evaluation metric as the original Kaggle competition, and reports the agent's percentile rank relative to the original competition leaderboard.
# Code Fragment 22.8.3: MLE-bench task runner
from dataclasses import dataclass
from typing import Optional


@dataclass
class MLEBenchTask:
    """A single MLE-bench competition task."""
    competition_id: str
    title: str
    description: str
    train_data_path: str
    test_data_path: str
    sample_submission_path: str
    evaluation_metric: str  # "rmse", "accuracy", "auc", "f1", etc.
    higher_is_better: bool
    time_budget_minutes: int
    leaderboard_scores: list[float]  # original competition scores


@dataclass
class MLEBenchResult:
    """Result of an agent's submission to an MLE-bench task."""
    competition_id: str
    agent_score: float
    percentile_rank: float  # 0 to 100, higher is better
    medal: Optional[str]  # "gold", "silver", "bronze", or None
    time_used_minutes: float
    num_submissions: int
    strategies_tried: list[str]


def score_mle_submission(
    task: MLEBenchTask,
    agent_score: float,
) -> MLEBenchResult:
    """
    Score an agent's submission against the original competition leaderboard.

    The percentile rank indicates what fraction of human competitors
    the agent would have outperformed.
    """
    leaderboard = sorted(
        task.leaderboard_scores,
        reverse=task.higher_is_better,
    )
    # Compute percentile rank
    if task.higher_is_better:
        beaten = sum(1 for s in leaderboard if agent_score > s)
    else:
        beaten = sum(1 for s in leaderboard if agent_score < s)
    percentile = (beaten / len(leaderboard)) * 100 if leaderboard else 0
    # Determine medal (top 10% gold, top 25% silver, top 40% bronze)
    medal = None
    if percentile >= 90:
        medal = "gold"
    elif percentile >= 75:
        medal = "silver"
    elif percentile >= 60:
        medal = "bronze"
    return MLEBenchResult(
        competition_id=task.competition_id,
        agent_score=agent_score,
        percentile_rank=percentile,
        medal=medal,
        time_used_minutes=0,  # filled by the harness
        num_submissions=0,
        strategies_tried=[],
    )


# Example: evaluating an agent on a tabular regression task
# task = MLEBenchTask(
#     competition_id="house-prices",
#     title="House Prices: Advanced Regression",
#     evaluation_metric="rmse",
#     higher_is_better=False,
#     leaderboard_scores=[0.12, 0.13, 0.14, 0.15, ...],
#     ...
# )
# result = score_mle_submission(task, agent_score=0.128)
# print(f"Percentile: {result.percentile_rank:.1f}%, Medal: {result.medal}")
4. Failure Taxonomies for Research and ML Agents
Analyzing agent failures across PaperBench, CORE-Bench, and MLE-bench reveals a recurring set of failure categories. Understanding these categories is essential for improving agent performance and for designing more informative benchmarks. The taxonomy below organizes failures by their root cause rather than their surface symptoms.
Data failures occur when the agent cannot acquire, parse, or preprocess the required data. Common examples include: datasets that have been removed from their original hosting location, CSV files with inconsistent encoding, image datasets that require specific directory structures, and preprocessing steps that are described ambiguously in the paper.
Infrastructure failures occur during environment setup and execution. These include: conflicting Python package versions (the paper requires PyTorch 1.13 but the code uses features from PyTorch 2.0), missing system-level dependencies (CUDA version mismatches, missing C compilers), and resource constraints (the agent's sandbox has less GPU memory than the original experiments required).
Modelling failures occur when the agent's implementation differs from the paper in ways that affect results. These are the most subtle failures because the code may run without errors but produce incorrect results. Examples include: using the wrong activation function, implementing attention with the wrong scaling factor, or applying regularization at the wrong stage of the pipeline.
Scoring failures occur during evaluation. The agent may train a model correctly but compute the evaluation metric incorrectly, use the wrong data split for evaluation, or produce output in a format that the scoring pipeline cannot parse. These failures are particularly frustrating because the core work is correct but the final step is wrong.
# Code Fragment 22.8.4: Classifying agent failures for postmortem analysis
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    DATA_ACQUISITION = "data_acquisition"
    DATA_PREPROCESSING = "data_preprocessing"
    DATA_FORMAT = "data_format"
    INFRA_DEPENDENCY = "infra_dependency"
    INFRA_RESOURCE = "infra_resource"
    INFRA_ENVIRONMENT = "infra_environment"
    MODEL_ARCHITECTURE = "model_architecture"
    MODEL_HYPERPARAMETER = "model_hyperparameter"
    MODEL_TRAINING = "model_training"
    SCORING_METRIC = "scoring_metric"
    SCORING_FORMAT = "scoring_format"
    SCORING_SPLIT = "scoring_split"


@dataclass
class FailureRecord:
    """A structured record of an agent failure."""
    task_id: str
    category: FailureCategory
    description: str
    error_message: str
    agent_action_before_failure: str
    recovery_attempted: bool
    recovery_succeeded: bool


def analyze_failure_distribution(
    failures: list[FailureRecord],
) -> dict:
    """
    Analyze the distribution of failures across categories.

    Returns per-category counts and recovery success rates.
    """
    category_stats = {}
    for failure in failures:
        cat = failure.category.value
        if cat not in category_stats:
            category_stats[cat] = {
                "count": 0,
                "recovery_attempted": 0,
                "recovery_succeeded": 0,
            }
        category_stats[cat]["count"] += 1
        if failure.recovery_attempted:
            category_stats[cat]["recovery_attempted"] += 1
        if failure.recovery_succeeded:
            category_stats[cat]["recovery_succeeded"] += 1
    # Compute recovery rates
    for cat, stats in category_stats.items():
        attempted = stats["recovery_attempted"]
        stats["recovery_rate"] = (
            stats["recovery_succeeded"] / attempted if attempted > 0 else 0.0
        )
    return {
        "total_failures": len(failures),
        "by_category": category_stats,
        "top_category": max(
            category_stats, key=lambda k: category_stats[k]["count"]
        ) if category_stats else None,
    }
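The same tallying idea can be sketched in a few lines with `collections.Counter`, here on a few hypothetical `(category, recovery_attempted, recovery_succeeded)` tuples:

```python
# Minimal usage sketch of failure tallying with collections.Counter
# (failure records are hypothetical).
from collections import Counter

failures = [
    ("infra_dependency", True, True),
    ("infra_dependency", True, False),
    ("data_acquisition", False, False),
    ("scoring_format", True, True),
]
counts = Counter(cat for cat, _, _ in failures)
recovered = Counter(cat for cat, attempted, ok in failures if attempted and ok)
print(counts.most_common(1)[0])  # ('infra_dependency', 2)
# Fraction of infra_dependency failures that were recovered:
print(recovered["infra_dependency"] / counts["infra_dependency"])  # 0.5
```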
5. Task Design Challenges: Nondeterminism, Missing Data, and Dependencies
Designing tasks for research replication benchmarks requires careful attention to three challenges that do not arise in standard coding benchmarks. Nondeterminism is inherent in ML training: different random seeds, GPU non-determinism, and floating-point ordering can cause results to vary between runs even with identical code. Benchmarks must define appropriate tolerances (PaperBench uses rubric-based judgment; CORE-Bench uses numeric tolerance; MLE-bench uses leaderboard ranking, which is inherently tolerant of small variations).
Missing datasets are a common problem for replication benchmarks. Papers may reference datasets that are no longer publicly available, require registration or approval to access, or are hosted on services with rate limits. Benchmarks address this by either pre-downloading all required data into the sandbox, providing synthetic substitutes, or restricting the task set to papers with permanently available data.
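Pre-downloading is typically paired with content pinning, so the harness can fail fast when a dataset file is missing or has silently changed upstream. A minimal sketch, with illustrative paths and hashes (a real harness would load the pins from a manifest file):

```python
# Sketch: verify a pre-downloaded dataset file against a pinned SHA-256
# hash before the agent run starts.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets do not exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_dataset(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file exists and matches the pinned hash."""
    return path.exists() and sha256_of(path) == expected_sha256
```

A harness would call `verify_dataset` for every pinned file at sandbox startup and abort the run (rather than letting the agent burn budget) on the first mismatch.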
Dependency management is perhaps the most underappreciated challenge. A paper published in 2023 may depend on package versions that conflict with the Python version available in 2025. The agent must either find compatible package versions, use containerized environments with specific Python versions, or modify the code to work with newer packages. This is a skill that many human ML engineers struggle with, and it is a significant source of failures for agents as well.
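A harness can surface many of these conflicts before launching the agent by diffing a requirements list against the installed environment. The sketch below supports only `==` and `>=` pins and takes a caller-supplied version mapping; a real implementation would query `importlib.metadata` for installed versions and use the `packaging` library's full specifier grammar:

```python
# Sketch: check minimal "name==x.y" / "name>=x.y" requirements against a
# mapping of installed versions (the mapping is hypothetical test input).
def parse_version(v: str) -> tuple[int, ...]:
    return tuple(int(part) for part in v.split("."))

def unmet(requirements: list[str], installed: dict[str, str]) -> list[str]:
    """Return human-readable descriptions of unmet requirements."""
    problems = []
    for req in requirements:
        op = "==" if "==" in req else ">="
        name, want = req.split(op)
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: not installed")
        elif op == "==" and parse_version(have) != parse_version(want):
            problems.append(f"{name}: have {have}, need =={want}")
        elif op == ">=" and parse_version(have) < parse_version(want):
            problems.append(f"{name}: have {have}, need >={want}")
    return problems

print(unmet(["torch==1.13.0", "numpy>=1.24"],
            {"torch": "2.0.0", "numpy": "1.26.4"}))
# ['torch: have 2.0.0, need ==1.13.0']
```

Running such a check up front turns a cryptic mid-run import error into an actionable pre-flight report, which is exactly the kind of error recovery these benchmarks reward.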
# Code Fragment 22.8.5: Handling nondeterminism in benchmark scoring
import numpy as np


def score_with_nondeterminism_tolerance(
    agent_results: list[float],
    reference_results: list[float],
    relative_tolerance: float = 0.05,
    absolute_tolerance: float = 0.01,
    reference_std: float | None = None,
) -> dict:
    """
    Score agent results against a reference, accounting for
    nondeterminism in ML training.

    Uses a two-tier tolerance: results within the reference's own
    variability range (measured across multiple reference runs, or
    estimated) are considered a strong match. Results within a broader
    absolute or relative tolerance are considered an acceptable match.
    """
    agent_arr = np.array(agent_results)
    ref_arr = np.array(reference_results)
    # If no measured reference standard deviation is provided,
    # estimate one from the relative tolerance
    if reference_std is None:
        reference_std = np.mean(np.abs(ref_arr)) * relative_tolerance
    diffs = np.abs(agent_arr - ref_arr)
    relative_diffs = diffs / np.maximum(np.abs(ref_arr), 1e-10)
    # Tier 1: within reference variability (strong match)
    strong_matches = np.sum(diffs <= 2 * reference_std)
    # Tier 2: within tolerance (acceptable match)
    acceptable_matches = np.sum(
        (relative_diffs <= relative_tolerance)
        | (diffs <= absolute_tolerance)
    )
    # Tier 3: outside tolerance (mismatch)
    mismatches = len(agent_arr) - acceptable_matches
    return {
        "total_metrics": len(agent_arr),
        "strong_matches": int(strong_matches),
        "acceptable_matches": int(acceptable_matches),
        "mismatches": int(mismatches),
        "match_rate": float(acceptable_matches / max(len(agent_arr), 1)),
        "mean_relative_error": float(np.mean(relative_diffs)),
    }
Who: An ML infrastructure team at an AI lab evaluating their research replication agent.
Situation: The team ran the PaperBench dev split against their agent to establish a baseline before a customer demo. The agent attempted five papers and achieved rubric scores of 0.42, 0.18, 0.31, 0.05, and 0.38.
Problem: The average score of 0.27 was well below the team's target of 0.50. Worse, the failures were scattered across different categories, making it unclear where to focus improvement effort.
Decision: The team conducted a structured postmortem, mapping each paper's failure to a taxonomy: Paper 1 (0.42) failed on evaluation due to a custom metric not in scikit-learn (scoring failure). Paper 2 (0.18) could not access a gated dataset (data failure). Paper 3 (0.31) used cosine scheduling instead of linear warmup (modelling failure). Paper 4 (0.05) required a specific CUDA version (infrastructure failure). Paper 5 (0.38) produced slightly misformatted figures (scoring/formatting failure).
Result: The taxonomy revealed that infrastructure and data access failures (Papers 2 and 4) were complete blockers, while modelling and scoring failures allowed partial progress. The team prioritized dependency management and dataset credential handling, raising average scores to 0.41 within two weeks.
Lesson: Structured failure taxonomies transform opaque benchmark scores into prioritized engineering roadmaps by distinguishing total blockers from partial degradations.
Research replication benchmarks measure a fundamentally different capability than coding benchmarks. HumanEval and SWE-bench test whether an agent can write correct code for a well-specified task. PaperBench, CORE-Bench, and MLE-bench test whether an agent can manage the full lifecycle of a research or ML project: reading underspecified instructions, making design decisions under uncertainty, debugging failures iteratively, and producing validated results. The gap between coding benchmark scores and research benchmark scores for current agents suggests that project management and iterative problem-solving remain significant challenges for AI agents.
Lab: Run PaperBench Dev Split and Write a Postmortem
This section outlines a hands-on lab exercise for running the PaperBench dev split and conducting a structured failure analysis. The lab is designed to give practitioners direct experience with the benchmark and to develop the analytical skills needed to interpret agent evaluation results.
# Code Fragment 22.8.6: Setting up and running PaperBench dev split
# Clone the PaperBench repository
git clone https://github.com/openai/preparedness.git
cd preparedness/paperbench
# Install dependencies
# (Script names and flags in this fragment are illustrative; check the
# repository README for the current interface.)
pip install -e ".[dev]"
# Configure your agent (example: using an OpenAI agent)
export OPENAI_API_KEY="your-key-here"
# Run the dev split (subset of papers for testing)
python run_benchmark.py \
--split dev \
--agent openai \
--model gpt-4o \
--output-dir results/dev_run_001 \
--timeout-per-paper 3600
# The benchmark will:
# 1. Download each paper's PDF and supplementary materials
# 2. Launch a sandboxed environment for the agent
# 3. Let the agent attempt reproduction
# 4. Score the results against the rubric
# 5. Generate a detailed report
# View results summary
python analyze_results.py --results-dir results/dev_run_001
After running the benchmark, write a structured postmortem for each paper using the following template: (1) Summary: overall rubric score and which major sections passed or failed. (2) Failure classification: map each failure to the taxonomy from Section 4 (data, infrastructure, modelling, or scoring). (3) Root cause analysis: for the most impactful failure, trace the agent's decision history to identify where it went wrong. (4) Proposed improvements: specific changes to the agent's prompts, tools, or architecture that would address the identified failures. This postmortem practice builds the diagnostic skills needed to systematically improve research agents.
PaperBench requires significant compute resources. Running the full benchmark requires GPU access (for training ML models) and can consume hundreds of dollars in API credits (for the agent's LLM calls). The dev split is much cheaper but still requires a capable GPU for some papers. Budget approximately $20 to $50 in API credits and 2 to 4 GPU-hours for a dev split run. Always set a per-paper timeout to prevent runaway costs from agents stuck in retry loops.
Exercises
Choose a recent ML paper you are familiar with. Design a PaperBench-style rubric tree with at least 20 leaf checkpoints organized into four to five top-level categories. Assign weights to each checkpoint reflecting its importance to the reproduction. Explain your weighting decisions.
Answer Sketch
A rubric for a paper on "BERT fine-tuning for sentiment analysis" might include: Environment (10%): correct Python version, transformers library version, dataset download. Data Preparation (15%): tokenization matches paper, train/val/test splits match, data augmentation if specified. Model (25%): correct base model, correct classification head architecture, correct freezing/unfreezing schedule. Training (25%): correct optimizer, learning rate, batch size, number of epochs, early stopping criteria. Evaluation (25%): correct metric (F1 vs accuracy), correct test set, results within tolerance. Higher weights on training and evaluation because those directly determine whether results match.
Review the agent trace from a failed PaperBench or CORE-Bench run (you may use a hypothetical trace if you do not have access to a real one). Classify each error into the failure taxonomy from Section 4. Identify which failure was the "blocking" failure (the one that prevented further progress) and which were "downstream" failures caused by the blocking failure.
Answer Sketch
Example trace: (1) Agent installs torch==1.13.0, gets dependency conflict with numpy 2.0 (INFRA_DEPENDENCY, blocking). (2) Agent falls back to torch==2.0.0, but the paper's custom CUDA kernel does not compile (INFRA_ENVIRONMENT, downstream). (3) Agent skips the custom kernel and uses a pure Python fallback, but training is 10x slower and times out (INFRA_RESOURCE, downstream). The blocking failure is the initial dependency conflict. Fixing that single issue would likely resolve the cascade.
Design an agent strategy for MLE-bench that allocates a fixed compute budget across exploration (trying different model types) and exploitation (tuning the best model). Implement a simple budget allocator that decides, at each step, whether to explore a new model family or tune the current best. Test it on a simulated competition with known ground truth.
Answer Sketch
Use an explore-then-exploit strategy: spend the first 40% of the budget trying at least three model families (linear, tree-based, neural), each with default hyperparameters. Record validation scores. Spend the remaining 60% on hyperparameter tuning of the top model. Implement as a state machine with EXPLORE and EXPLOIT phases. The transition trigger is either budget exhaustion for the explore phase or a plateau in validation score improvement. For the simulated competition, generate a regression dataset where gradient boosting is the best model family, and verify that the agent discovers this during exploration.
Run a simple ML training script (e.g., MNIST classification with a small CNN) five times with different random seeds. Record the test accuracy for each run. Compute the mean, standard deviation, and 95% confidence interval. Then use the scoring function from Code Fragment 22.8.5 to determine what tolerance would be needed to accept all five runs as "matching" the mean.
Answer Sketch
Typical MNIST CNN results: 99.1%, 99.2%, 99.0%, 99.15%, 99.05%. Mean: 99.1%, std: 0.07%. A relative tolerance of 0.001 (0.1%) would accept all runs. The 95% confidence interval is approximately 99.1% plus or minus 0.09%. This exercise demonstrates that even simple tasks have nontrivial nondeterminism, and benchmark tolerances must be calibrated empirically rather than set arbitrarily.
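The arithmetic in this sketch is easy to verify with the standard library, using the sample standard deviation and the t critical value t(4, 0.975) ≈ 2.776 for five runs:

```python
# Check the mean, sample standard deviation, and 95% CI half-width for
# the five example accuracy runs above.
import math
import statistics

runs = [99.1, 99.2, 99.0, 99.15, 99.05]  # test accuracy (%) per seed
mean = statistics.mean(runs)
sd = statistics.stdev(runs)                   # sample standard deviation
ci_half = 2.776 * sd / math.sqrt(len(runs))   # t(4, 0.975) ~= 2.776
print(round(mean, 2), round(sd, 3), round(ci_half, 3))
```

This yields a mean of 99.1, a sample standard deviation of about 0.079, and a CI half-width of about 0.098 percentage points, consistent with the rough plus-or-minus 0.09 figure quoted above.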
Run the PaperBench dev split (or simulate it using the rubric scorer from Code Fragment 22.8.1 with synthetic pass/fail results). Write a structured postmortem following the template from Section 6: summary, failure classification, root cause analysis, and proposed improvements. Present your findings as a one-page report with a failure distribution chart.
Answer Sketch
Simulate five papers with rubric trees of 50 checkpoints each. Randomly assign 60% of checkpoints as passed, 40% as failed, with failures clustered in infrastructure (50% of failures), modelling (25%), data (15%), and scoring (10%). The postmortem should note that infrastructure failures are the dominant category and propose: (1) pre-building Docker images for common ML frameworks, (2) adding a dependency resolver tool that checks compatibility before installation, (3) implementing a "fallback environment" strategy that tries multiple Python versions. The failure distribution chart is a bar chart showing counts per failure category.
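A deterministic version of this simulation, distributing 100 failures (40% of 250 checkpoints) according to the stated proportions, can be sketched as:

```python
# Deterministic simulation sketch: 5 papers x 50 checkpoints, 40% failure
# rate, failures clustered per the stated (hypothetical) distribution.
total_checkpoints = 5 * 50
n_failures = int(0.4 * total_checkpoints)  # 100 failures
split = {"infrastructure": 0.50, "modelling": 0.25,
         "data": 0.15, "scoring": 0.10}
failure_counts = {cat: int(frac * n_failures) for cat, frac in split.items()}
top = max(failure_counts, key=failure_counts.get)
print(failure_counts)  # {'infrastructure': 50, 'modelling': 25, 'data': 15, 'scoring': 10}
print(top)             # infrastructure
```

The `failure_counts` dict is exactly the data behind the proposed bar chart; swapping the hard-coded proportions for draws from a seeded random generator gives a stochastic variant of the same exercise.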
- PaperBench tests the full stack of agent capabilities: reading papers, writing code, debugging, and reproducing experimental results.
- MLE-bench focuses on ML engineering tasks (Kaggle-style competitions), complementing SWE-bench's focus on bug-fixing.
- Research replication benchmarks reveal gaps in agent reasoning that simpler coding benchmarks miss.
What does PaperBench evaluate?
PaperBench evaluates an agent's ability to reproduce the experiments and results from published ML research papers. This tests reading comprehension, code generation, environment setup, debugging, and scientific reasoning, making it one of the most comprehensive agent benchmarks.
How does MLE-bench differ from SWE-bench?
SWE-bench measures bug-fixing ability (resolving GitHub issues with patches), while MLE-bench measures ML engineering ability (competing on Kaggle-style data science challenges), testing model selection, feature engineering, hyperparameter tuning, and submission formatting.
What Comes Next
In the next chapter, Chapter 23: Tool Use, Function Calling & Protocols, we explore how agents interact with the outside world through tool use. The evaluation frameworks covered here provide the measurement infrastructure for assessing whether tool-using agents can handle the complex, multi-step workflows that research replication demands.
Further Reading
Starace et al. (2025), "PaperBench: Evaluating AI's Ability to Replicate AI Research." The primary reference for PaperBench. Introduces the rubric-based evaluation methodology and reports results across multiple frontier models. Essential reading for understanding the benchmark design covered in Section 1.
Siegel et al. (2024), "CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark." Defines the three-tier (easy, medium, hard) computational reproducibility benchmark. Complements PaperBench by focusing on re-executing existing code rather than writing new code from papers.
Chan et al. (2024), "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering." Evaluates agents on 75 Kaggle competitions covering the full ML engineering pipeline. Provides the medal-based scoring system discussed in Section 3.
Jimenez et al. (2024), "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" The bug-fixing benchmark that complements MLE-bench. Tests whether agents can resolve real GitHub issues by generating patches, providing a baseline for comparing different types of coding agent evaluations.
Chen et al. (2021), "Evaluating Large Language Models Trained on Code." Introduces HumanEval and the pass@k metric for code generation evaluation. Provides the foundation against which the more complex benchmarks in this section are compared.
