"You cannot improve what you cannot measure, and you cannot measure an agent by asking it to grade itself."
Evaluating and Rigorously Benchmarking AI Agents
An agent you cannot measure is an agent you cannot improve. Unlike static LLM benchmarks where the model produces a single output, agent evaluation must account for non-deterministic tool calls, variable-length action sequences, and the interplay between task completion, cost, and safety. This section covers the major agent benchmarks (SWE-bench, WebArena, GAIA, PaperBench), explains how to design custom evaluation harnesses for your own agents, and introduces the Pareto frontier approach to balancing accuracy against cost. The general evaluation frameworks from Chapter 29 provide complementary techniques for the non-agentic dimensions of your system.
Prerequisites
This section builds on all previous sections in this chapter. Familiarity with SWE-bench is helpful but not required; the benchmark is explained here from first principles. For deeper coverage of code generation agents, see Section 25.1 later in this part.
This section includes a hands-on lab: Evaluate an Agent on SWE-bench Lite. Look for the lab exercise within the section content.
1. Why Agent Evaluation Is Hard
Evaluating agents is fundamentally harder than evaluating language models on static benchmarks. An agent's output depends not just on the model's capabilities but on the tools available, the environment's state, the agent's memory, and the specific sequence of actions taken. Two runs of the same agent on the same task can produce different results because of stochastic model outputs, timing-dependent tool responses, or different exploration paths. This non-determinism makes reproducibility a persistent challenge.
Agent evaluation also requires assessing multiple dimensions simultaneously. A code generation agent that solves the problem but takes 50 tool calls and $2 in API costs is less useful than one that solves it in 5 calls for $0.10. Metrics must capture task completion (did the agent succeed?), efficiency (how many steps and tokens did it use?), safety (did it avoid harmful actions?), and robustness (does it handle edge cases and failures gracefully?).
The field has converged on a set of standardized benchmarks that test different agent capabilities. Each benchmark creates a controlled environment where agents can be compared fairly, with automated scoring that removes subjective human judgment from the evaluation loop. Understanding these benchmarks, including their strengths and limitations, is essential for anyone building or selecting agent systems.
The most informative agent metric is not pass rate alone but pass rate at a given cost. An agent that solves 80% of tasks at $0.05 per task is often more valuable than one that solves 90% at $2.00 per task. Always report evaluation results as a Pareto frontier of accuracy vs. cost (and accuracy vs. latency). This prevents the common mistake of optimizing for benchmark scores while ignoring the economic viability of the agent in production.
2. Major Agent Benchmarks
SWE-bench
SWE-bench evaluates software engineering agents on real GitHub issues from popular open-source projects. Each task provides a repository snapshot, a natural language issue description, and a test patch that verifies the fix. The agent must navigate the codebase, understand the issue, and produce a code patch that passes the tests. SWE-bench Lite is a curated subset of 300 tasks; SWE-bench Verified uses human-validated tasks to reduce noise. As of early 2026, the best agents solve roughly 50 to 60% of SWE-bench Verified tasks, with Claude Code and Devin among the top performers.
GAIA
GAIA (General AI Assistants) tests real-world assistant capabilities across three difficulty levels. Tasks require web browsing, file manipulation, mathematical reasoning, and multi-step planning. Unlike coding benchmarks, GAIA tasks mirror the diverse challenges a general-purpose agent encounters: "Find the cheapest flight from NYC to London next Tuesday and calculate the per-mile cost." The benchmark uses exact-match scoring against verified ground truth answers.
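Exact-match scoring typically normalizes answers before comparison, so that trivial formatting differences do not count as failures. A minimal sketch of such a normalizer (an illustration, not GAIA's official scorer) might look like this:

```python
def normalized_match(predicted: str, truth: str) -> bool:
    """Illustrative exact-match check: strip surrounding whitespace and
    trailing periods, lowercase, and drop thousands separators before
    comparing. Real scorers add number parsing and unit handling."""
    norm = lambda s: s.strip().strip(".").lower().replace(",", "")
    return norm(predicted) == norm(truth)

print(normalized_match(" $4,200. ", "$4200"))  # matches despite formatting
```

The design choice here is to normalize both sides symmetrically, so the agent is not rewarded or penalized for matching the ground truth's arbitrary formatting.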
WebArena and OSWorld
WebArena provides realistic web environments (e-commerce sites, forums, content management systems) where agents must accomplish tasks by navigating web pages, filling forms, and clicking buttons. OSWorld extends this to full desktop environments where agents interact with applications through screenshots, mouse clicks, and keyboard inputs. These benchmarks are critical for evaluating the emerging category of computer use agents discussed in Section 25.3.
# Evaluating an agent on SWE-bench Lite.
# Note: argument names vary across harness versions; check the
# installed swebench package's documentation before running.
from swebench.harness.run_evaluation import run_evaluation

results = run_evaluation(
    predictions_path="predictions.json",  # agent's patches
    swe_bench_tasks="princeton-nlp/SWE-bench_Lite",
    log_dir="./eval_logs",
    timeout=300,  # 5 minutes per task
)
# Analyze results across dimensions
total = len(results)
resolved = sum(1 for r in results if r["resolved"])
print(f"Pass rate: {resolved}/{total} ({100*resolved/total:.1f}%)")
# Cost analysis
total_tokens = sum(r["total_tokens"] for r in results)
total_cost = sum(r["api_cost"] for r in results)
print(f"Average tokens per task: {total_tokens/total:,.0f}")
print(f"Average cost per task: ${total_cost/total:.3f}")
print(f"Cost per resolved task: ${total_cost/max(resolved,1):.3f}")
3. Building Custom Agent Evaluations
Standard benchmarks test general capabilities, but production agents need domain-specific evaluation. A legal research agent should be evaluated on legal accuracy, citation quality, and jurisdictional awareness. A customer support agent should be measured on resolution accuracy, customer satisfaction proxies, and escalation appropriateness. Building custom evaluations requires defining task sets, scoring criteria, and automated verification methods specific to your domain.
The most effective custom evaluations use a "golden set" approach: a curated collection of tasks with verified correct outputs, annotated with difficulty levels and capability tags. Human experts create the golden set, and automated scoring compares agent outputs against the verified answers. For tasks without single correct answers (summarization, recommendations), LLM-as-judge evaluation provides a scalable alternative, though it requires calibration against human judgments.
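The golden-set loop described above can be sketched as follows. Here `agent` and `judge` are placeholder callables, and the exact-match rule is deliberately crude; a production harness would use normalized matching or a calibrated LLM-as-judge:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class GoldenTask:
    question: str
    expected: str
    tags: list  # capability/difficulty tags, e.g. ["calculation", "hard"]

def evaluate_golden_set(agent, tasks, judge=None):
    """Run `agent` (question -> answer) over a golden set.
    Try exact match first; fall back to an optional `judge`
    callable (answer, expected -> bool) for open-ended tasks.
    Returns the pass rate per capability tag."""
    passed, total = defaultdict(int), defaultdict(int)
    for task in tasks:
        answer = agent(task.question)
        ok = answer.strip().lower() == task.expected.strip().lower()
        if not ok and judge is not None:
            ok = judge(answer, task.expected)
        for tag in task.tags:
            total[tag] += 1
            passed[tag] += int(ok)
    return {tag: passed[tag] / total[tag] for tag in total}
```

Tagging each task with capability labels is what makes the later regression analysis possible: pass rates can be broken down by tag rather than reported as one opaque number.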
Who: A quantitative analytics team at a mid-size investment firm building an internal research agent.
Situation: The team deployed an agent that answered questions about SEC filings, earnings reports, and market data. Early user feedback was positive, but the team had no systematic way to measure accuracy or track regressions across model updates.
Problem: When the team upgraded from GPT-4o to a newer model version, analysts reported that "something felt off" in calculation-heavy queries. Without a benchmark, the team could not confirm, quantify, or localize the regression.
Decision: They built a golden set of 200 questions with verified answers, tagged by type (factual lookup, calculation, trend analysis, comparison). Scoring combined exact-match accuracy for numbers, source attribution checks, LLM-as-judge reasoning quality on a 1-to-5 scale, and cost per query.
Result: The evaluation revealed the agent scored 92% on factual lookups but only 64% on multi-step calculations. The model upgrade had improved lookups by 3% but degraded calculations by 11%. The team traced the issue to a prompt change that discouraged tool use for "simple math," causing the agent to attempt mental arithmetic on compound calculations.
Lesson: Custom domain evaluations with tagged question types let you pinpoint exactly which capabilities regressed, turning vague user complaints into actionable engineering tasks.
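A per-tag comparison like the one this team ran can be sketched with pandas. The record format below is an assumption for illustration; any log schema with a capability tag and a pass/fail flag per query would work:

```python
import pandas as pd

def regression_report(old_results, new_results):
    """old_results / new_results: lists of dicts with 'tag' and
    'passed' (bool) per query. Returns per-tag pass rates for both
    model versions and the delta, sorted worst-regression-first."""
    def rates(rows):
        return pd.DataFrame(rows).groupby("tag")["passed"].mean()
    report = pd.DataFrame({"old": rates(old_results),
                           "new": rates(new_results)})
    report["delta"] = report["new"] - report["old"]
    return report.sort_values("delta")
```

Sorting by delta surfaces the regressed capability immediately, turning "something felt off" into a specific row in a table.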
Lab: Evaluate a Code Agent on SWE-bench Lite
Objective
Set up the SWE-bench evaluation harness, run a simple code generation agent against SWE-bench Lite tasks, collect metrics across accuracy, cost, and efficiency dimensions, and identify patterns in which types of issues the agent handles well vs. poorly.
What You'll Practice
- Configuring and running the SWE-bench evaluation harness
- Instrumenting an agent to collect pass/fail, token usage, and cost metrics
- Categorizing agent failure modes (misunderstanding, incorrect code, navigation failure)
- Comparing standard vs. reasoning model performance on coding tasks
Setup
The following cell installs the required packages and configures the environment for this lab.
pip install swebench openai tiktoken pandas
Install the SWE-bench evaluation harness and configure a code generation agent with access to file reading, code editing, and test execution tools.
Steps
Step 1: Set up the SWE-bench harness and agent
Install dependencies, load 10 SWE-bench Lite tasks from different repositories, and configure your agent with file read/write and test execution tools.
# TODO: Load SWE-bench Lite tasks and configure the agent.
# One option (exact loading API may vary by version) is the
# Hugging Face dataset:
# from datasets import load_dataset
# tasks = load_dataset("princeton-nlp/SWE-bench_Lite", split="test").select(range(10))
Step 2: Run the agent and collect metrics
For each task, run the agent and record pass/fail, token usage, API cost, and number of tool calls.
# TODO: Loop over tasks, run agent, record metrics in a DataFrame
Step 3: Categorize failures
For failed tasks, determine whether the agent misunderstood the issue, produced incorrect code, or failed to navigate the repository.
# TODO: Analyze agent logs for each failure and assign a category
Step 4: Compare model variants
Re-run the same tasks with a reasoning model and compare pass rates, cost, and efficiency.
# TODO: Run with reasoning model, then create a comparison table
Expected Output
- A results table showing pass/fail, token usage, cost, and tool calls per task
- A failure categorization breakdown (misunderstanding vs. incorrect code vs. navigation)
- A side-by-side comparison of standard vs. reasoning model performance
Stretch Goals
- Add a self-debugging retry loop and measure how many tasks it recovers
- Compare cost-efficiency: is the reasoning model worth its higher per-token cost?
- Implement a repository navigation strategy (search before read) and measure its impact on pass rate
Complete Solution
# Complete solution outline for SWE-bench evaluation
# This lab is primarily a measurement/evaluation exercise.
# The key deliverable is the metrics table and failure analysis,
# not a single script. See the SWE-bench documentation for
# harness setup: https://github.com/princeton-nlp/SWE-bench
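A minimal version of the metrics-table deliverable can be assembled with pandas. The per-task records below are hypothetical; in the lab they would come from your instrumented agent runs:

```python
import pandas as pd

# Hypothetical per-task records collected during the lab runs
records = [
    {"task_id": "astropy-1", "resolved": True,  "tokens": 41000, "cost": 0.38, "tool_calls": 12},
    {"task_id": "django-7",  "resolved": False, "tokens": 88000, "cost": 0.91, "tool_calls": 31},
]
df = pd.DataFrame(records)

summary = {
    "pass_rate": df["resolved"].mean(),
    "avg_cost": df["cost"].mean(),
    "cost_per_resolved": df["cost"].sum() / max(df["resolved"].sum(), 1),
    "avg_tool_calls": df["tool_calls"].mean(),
}
print(summary)
```

Note that cost per resolved task, not average cost, is the number that matters when comparing model variants: a cheaper model that fails more often can end up costing more per useful result.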
Exercises
List four dimensions on which agents should be evaluated beyond simple task completion. Explain why pass rate alone is an insufficient metric.
Answer Sketch
Task completion (did it succeed?), efficiency (steps and tokens used), safety (did it avoid harmful actions?), and robustness (does it handle edge cases?). Pass rate alone ignores cost: an agent solving 80% at $0.05/task may be more valuable than one solving 90% at $2/task. It also ignores latency, safety violations, and failure modes.
Explain the structure of a SWE-bench task: what inputs does the agent receive, what output must it produce, and how is success determined? Why is SWE-bench Verified considered more reliable than the original SWE-bench?
Answer Sketch
Inputs: a repository snapshot and a natural language issue description. Output: a code patch. Success: the patch passes the provided test suite. SWE-bench Verified uses human-validated tasks, removing noisy or ambiguous tasks from the original set that could give misleading results about agent capabilities.
Design a custom evaluation harness for a customer support agent. Write a Python class SupportAgentEval that takes a list of test cases (question, expected_resolution, difficulty) and produces a report with accuracy, average cost, and failure categorization.
Answer Sketch
The class should: (1) iterate through test cases, (2) run the agent on each, (3) compare output to expected_resolution using exact match or LLM-as-judge, (4) record token usage and latency per case, (5) categorize failures (misunderstanding, wrong tool, incomplete answer), and (6) produce a summary DataFrame with per-category pass rates and cost statistics.
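One possible sketch of the class, with a deliberately crude substring match standing in for the LLM-as-judge scoring step:

```python
import time
import pandas as pd

class SupportAgentEval:
    """Sketch of the harness described above. `agent_fn` is a placeholder
    callable (question -> answer); the substring check below is a stand-in
    for exact-match or LLM-as-judge scoring."""

    def __init__(self, test_cases):
        # test_cases: list of (question, expected_resolution, difficulty)
        self.test_cases = test_cases

    def run(self, agent_fn):
        rows = []
        for question, expected, difficulty in self.test_cases:
            start = time.time()
            answer = agent_fn(question)
            ok = expected.lower() in answer.lower()  # crude; use a judge in practice
            rows.append({"difficulty": difficulty, "passed": ok,
                         "latency_s": time.time() - start})
        df = pd.DataFrame(rows)
        return df.groupby("difficulty")["passed"].mean(), df
```

Token usage, cost, and failure categorization would be added as extra columns per row, following the same pattern as the latency measurement.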
Given a set of agent evaluation results (accuracy, cost_per_task), write a Python function that identifies the Pareto-optimal configurations and plots the accuracy vs. cost frontier using matplotlib.
Answer Sketch
Sort results by cost. A point is Pareto-optimal if no other point has both higher accuracy and lower cost. Iterate through sorted results, tracking the maximum accuracy seen. A point is on the frontier if its accuracy exceeds the current maximum. Plot all points as scatter, highlight Pareto-optimal points, and connect them with a line.
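A direct implementation of that sketch (the plotting helper assumes matplotlib is installed):

```python
def pareto_frontier(results):
    """results: list of (accuracy, cost_per_task) tuples.
    Returns the Pareto-optimal subset, sorted cheapest-first."""
    pts = sorted(results, key=lambda p: p[1])  # sort by cost, ascending
    frontier, best_acc = [], float("-inf")
    for acc, cost in pts:
        if acc > best_acc:  # strictly better accuracy than anything cheaper
            frontier.append((acc, cost))
            best_acc = acc
    return frontier

def plot_frontier(results):
    """Scatter all configurations and highlight the frontier."""
    import matplotlib.pyplot as plt
    accs, costs = zip(*results)
    plt.scatter(costs, accs, alpha=0.5, label="all configs")
    f_acc, f_cost = zip(*pareto_frontier(results))
    plt.plot(f_cost, f_acc, "r-o", label="Pareto frontier")
    plt.xlabel("Cost per task ($)")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()
```

Any point not on the frontier is dominated: some other configuration is both cheaper and at least as accurate, so it should never be deployed.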
WebArena and OSWorld test agents in simulated environments. Discuss two ways these benchmarks might overestimate or underestimate agent capabilities compared to real-world deployment.
Answer Sketch
Overestimate: benchmarks use clean, deterministic environments; real websites have CAPTCHAs, dynamic content, and rate limits. Underestimate: benchmarks evaluate single sessions; real agents can learn from past attempts and use persistent memory. Also, benchmark scoring may miss partial successes that would still be useful to a human user.
Key Takeaways
- Agent evaluation is harder than LLM evaluation because of multi-step interactions, tool use, and compounding errors.
- SWE-bench, GAIA, and WebArena test different agent capabilities: code generation, general reasoning, and web navigation respectively.
- Reproducibility requires controlling for stochasticity: fix seeds, record trajectories, and report variance across multiple runs.
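The last point can be made concrete with a small helper; `run_eval` here is a placeholder for one full evaluation pass that returns a pass rate:

```python
import statistics

def pass_rate_with_variance(run_eval, n_runs=5):
    """Repeat a full evaluation pass n_runs times and report the mean
    pass rate with its sample standard deviation. `run_eval` is a
    zero-argument callable returning a pass rate in [0, 1]."""
    rates = [run_eval() for _ in range(n_runs)]
    return statistics.mean(rates), statistics.stdev(rates)
```

Reporting a single run's pass rate for a stochastic agent is effectively reporting one sample from a distribution; the standard deviation tells you whether a 3-point difference between two agents is signal or noise.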
Why is evaluating agents fundamentally harder than single-turn LLM evaluation?
Show Answer
Agents interact with environments over multiple steps, use tools, and make sequential decisions where early errors compound. The same agent can produce different trajectories on the same task due to stochasticity, making reproducibility and scoring much harder than single-turn LLM evaluation.
What does SWE-bench measure, and why is it considered a strong benchmark for code agents?
Show Answer
SWE-bench measures an agent's ability to resolve real GitHub issues by generating code patches that pass existing test suites. It is considered strong because it tests end-to-end software engineering (reading code, understanding issues, writing correct patches) rather than isolated coding ability.
What Comes Next
In Chapter 23: Tool Use, Function Calling and Protocols, we shift from evaluating agents to building their core capability: interacting with external tools through function calling, MCP, and other protocols.
References and Further Reading
Agent Benchmarks
Jimenez, C., Yang, J., Wettig, A., et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
The standard benchmark for evaluating code agents on real-world software engineering tasks, requiring agents to resolve actual GitHub issues from popular repositories.
Zhou, S., Xu, F., Zhu, H., et al. (2024). "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024.
Provides a realistic web environment benchmark with self-hosted websites where agents must complete complex tasks, establishing the standard for evaluating web agents.
Liu, X., Yu, H., Zhang, H., et al. (2024). "AgentBench: Evaluating LLMs as Agents." ICLR 2024.
A multi-dimensional benchmark evaluating LLM agents across 8 distinct environments including operating systems, databases, and web browsing, providing a holistic agent evaluation framework.
Evaluation Methodology
Kapoor, S., Stroebl, B., Siegel, Z.S., et al. (2024). "AI Agents That Matter." arXiv preprint.
Identifies common pitfalls in agent evaluation including overfitting to benchmarks, cost neglect, and lack of statistical rigor. Essential reading for designing meaningful agent evaluations.
Xie, J., Zhang, K., Chen, J., et al. (2024). "TravelPlanner: A Benchmark for Real-World Planning with Language Agents." ICML 2024.
A planning benchmark requiring agents to create travel itineraries with complex constraints, testing multi-step reasoning and constraint satisfaction abilities.
Evaluates agents on research engineering tasks with human expert baselines, providing rigorous comparison between human and AI agent capabilities on complex technical work.
