"You cannot improve what you cannot measure, and you cannot measure an agent by asking it to grade itself."
Evaluating and Rigorously Benchmarking AI Agents
An agent you cannot measure is an agent you cannot improve. Unlike static LLM benchmarks where the model produces a single output, agent evaluation must account for non-deterministic tool calls, variable-length action sequences, and the interplay between task completion, cost, and safety. This section covers the major agent benchmarks (SWE-bench, WebArena, GAIA, PaperBench), explains how to design custom evaluation harnesses for your own agents, and introduces the Pareto frontier approach to balancing accuracy against cost. The general evaluation frameworks from Chapter 29 provide complementary techniques for the non-agentic dimensions of your system.
Prerequisites
This section builds on all previous sections in this chapter. Familiarity with SWE-bench is helpful but not required; the benchmark is explained here from first principles. For deeper coverage of code generation agents, see Section 25.1 later in this part.
This section includes a hands-on lab: Evaluate an Agent on SWE-bench Lite. Look for the lab exercise within the section content.
1. Why Agent Evaluation Is Hard
Evaluating agents is fundamentally harder than evaluating language models on static benchmarks. An agent's output depends not just on the model's capabilities but on the tools available, the environment's state, the agent's memory, and the specific sequence of actions taken. Two runs of the same agent on the same task can produce different results because of stochastic model outputs, timing-dependent tool responses, or different exploration paths. This non-determinism makes reproducibility a persistent challenge.
Agent evaluation also requires assessing multiple dimensions simultaneously. A code generation agent that solves the problem but takes 50 tool calls and $2 in API costs is less useful than one that solves it in 5 calls for $0.10. Metrics must capture task completion (did the agent succeed?), efficiency (how many steps and tokens did it use?), safety (did it avoid harmful actions?), and robustness (does it handle edge cases and failures gracefully?).
The field has converged on a set of standardized benchmarks that test different agent capabilities. Each benchmark creates a controlled environment where agents can be compared fairly, with automated scoring that removes subjective human judgment from the evaluation loop. Understanding these benchmarks, including their strengths and limitations, is essential for anyone building or selecting agent systems.
The most informative agent metric is not pass rate alone but pass rate at a given cost. An agent that solves 80% of tasks at $0.05 per task is often more valuable than one that solves 90% at $2.00 per task. Always report evaluation results as a Pareto frontier of accuracy vs. cost (and accuracy vs. latency). This prevents the common mistake of optimizing for benchmark scores while ignoring the economic viability of the agent in production.
2. Major Agent Benchmarks
SWE-bench
SWE-bench evaluates software engineering agents on real GitHub issues from popular open-source projects. Each task provides a repository snapshot, a natural language issue description, and a test patch that verifies the fix. The agent must navigate the codebase, understand the issue, and produce a code patch that passes the tests. SWE-bench Lite is a curated subset of 300 tasks; SWE-bench Verified uses human-validated tasks to reduce noise. As of early 2026, the best agents solve roughly 50 to 60% of SWE-bench Verified tasks, with Claude Code and Devin among the top performers.
GAIA
GAIA (General AI Assistants) tests real-world assistant capabilities across three difficulty levels. Tasks require web browsing, file manipulation, mathematical reasoning, and multi-step planning. Unlike coding benchmarks, GAIA tasks mirror the diverse challenges a general-purpose agent encounters: "Find the cheapest flight from NYC to London next Tuesday and calculate the per-mile cost." The benchmark uses exact-match scoring against verified ground truth answers.
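Exact-match scoring typically normalizes answers before comparison, so that trivial formatting differences do not count as failures. A minimal sketch of such a normalizer (an illustration, not GAIA's official scorer) might look like this:

```python
def normalized_match(predicted: str, truth: str) -> bool:
    """Illustrative exact-match check: strip surrounding whitespace and
    trailing periods, lowercase, and drop thousands separators before
    comparing. Real scorers add number parsing and unit handling."""
    norm = lambda s: s.strip().strip(".").lower().replace(",", "")
    return norm(predicted) == norm(truth)

print(normalized_match(" $4,200. ", "$4200"))  # matches despite formatting
```

The design choice here is to normalize both sides symmetrically, so the agent is not rewarded or penalized for matching the ground truth's arbitrary formatting.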
WebArena and OSWorld
WebArena provides realistic web environments (e-commerce sites, forums, content management systems) where agents must accomplish tasks by navigating web pages, filling forms, and clicking buttons. OSWorld extends this to full desktop environments where agents interact with applications through screenshots, mouse clicks, and keyboard inputs. These benchmarks are critical for evaluating the emerging category of computer use agents discussed in Section 25.3.
# Evaluating an agent on SWE-bench Lite.
# Note: argument names vary across harness versions; check the
# installed swebench package's documentation before running.
from swebench.harness.run_evaluation import run_evaluation

results = run_evaluation(
    predictions_path="predictions.json",  # agent's patches
    swe_bench_tasks="princeton-nlp/SWE-bench_Lite",
    log_dir="./eval_logs",
    timeout=300,  # 5 minutes per task
)
# Analyze results across dimensions
total = len(results)
resolved = sum(1 for r in results if r["resolved"])
print(f"Pass rate: {resolved}/{total} ({100*resolved/total:.1f}%)")
# Cost analysis
total_tokens = sum(r["total_tokens"] for r in results)
total_cost = sum(r["api_cost"] for r in results)
print(f"Average tokens per task: {total_tokens/total:,.0f}")
print(f"Average cost per task: ${total_cost/total:.3f}")
print(f"Cost per resolved task: ${total_cost/max(resolved,1):.3f}")
3. Building Custom Agent Evaluations
Standard benchmarks test general capabilities, but production agents need domain-specific evaluation. A legal research agent should be evaluated on legal accuracy, citation quality, and jurisdictional awareness. A customer support agent should be measured on resolution accuracy, customer satisfaction proxies, and escalation appropriateness. Building custom evaluations requires defining task sets, scoring criteria, and automated verification methods specific to your domain.
The most effective custom evaluations use a "golden set" approach: a curated collection of tasks with verified correct outputs, annotated with difficulty levels and capability tags. Human experts create the golden set, and automated scoring compares agent outputs against the verified answers. For tasks without single correct answers (summarization, recommendations), LLM-as-judge evaluation provides a scalable alternative, though it requires calibration against human judgments.
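The golden-set loop described above can be sketched as follows. Here `agent` and `judge` are placeholder callables, and the exact-match rule is deliberately crude; a production harness would use normalized matching or a calibrated LLM-as-judge:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class GoldenTask:
    question: str
    expected: str
    tags: list  # capability/difficulty tags, e.g. ["calculation", "hard"]

def evaluate_golden_set(agent, tasks, judge=None):
    """Run `agent` (question -> answer) over a golden set.
    Try exact match first; fall back to an optional `judge`
    callable (answer, expected -> bool) for open-ended tasks.
    Returns the pass rate per capability tag."""
    passed, total = defaultdict(int), defaultdict(int)
    for task in tasks:
        answer = agent(task.question)
        ok = answer.strip().lower() == task.expected.strip().lower()
        if not ok and judge is not None:
            ok = judge(answer, task.expected)
        for tag in task.tags:
            total[tag] += 1
            passed[tag] += int(ok)
    return {tag: passed[tag] / total[tag] for tag in total}
```

Tagging each task with capability labels is what makes the later regression analysis possible: pass rates can be broken down by tag rather than reported as one opaque number.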
Who: A quantitative analytics team at a mid-size investment firm building an internal research agent.
Situation: The team deployed an agent that answered questions about SEC filings, earnings reports, and market data. Early user feedback was positive, but the team had no systematic way to measure accuracy or track regressions across model updates.
Problem: When the team upgraded from GPT-4o to a newer model version, analysts reported that "something felt off" in calculation-heavy queries. Without a benchmark, the team could not confirm, quantify, or localize the regression.
Decision: They built a golden set of 200 questions with verified answers, tagged by type (factual lookup, calculation, trend analysis, comparison). Scoring combined exact-match accuracy for numbers, source attribution checks, LLM-as-judge reasoning quality on a 1-to-5 scale, and cost per query.
Result: The evaluation revealed the agent scored 92% on factual lookups but only 64% on multi-step calculations. The model upgrade had improved lookups by 3% but degraded calculations by 11%. The team traced the issue to a prompt change that discouraged tool use for "simple math," causing the agent to attempt mental arithmetic on compound calculations.
Lesson: Custom domain evaluations with tagged question types let you pinpoint exactly which capabilities regressed, turning vague user complaints into actionable engineering tasks.
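A per-tag comparison like the one this team ran can be sketched with pandas. The record format below is an assumption for illustration; any log schema with a capability tag and a pass/fail flag per query would work:

```python
import pandas as pd

def regression_report(old_results, new_results):
    """old_results / new_results: lists of dicts with 'tag' and
    'passed' (bool) per query. Returns per-tag pass rates for both
    model versions and the delta, sorted worst-regression-first."""
    def rates(rows):
        return pd.DataFrame(rows).groupby("tag")["passed"].mean()
    report = pd.DataFrame({"old": rates(old_results),
                           "new": rates(new_results)})
    report["delta"] = report["new"] - report["old"]
    return report.sort_values("delta")
```

Sorting by delta surfaces the regressed capability immediately, turning "something felt off" into a specific row in a table.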
Lab: Evaluate a Code Agent on SWE-bench Lite
Objective
Set up the SWE-bench evaluation harness, run a simple code generation agent against SWE-bench Lite tasks, collect metrics across accuracy, cost, and efficiency dimensions, and identify patterns in which types of issues the agent handles well vs. poorly.
What You'll Practice
- Configuring and running the SWE-bench evaluation harness
- Instrumenting an agent to collect pass/fail, token usage, and cost metrics
- Categorizing agent failure modes (misunderstanding, incorrect code, navigation failure)
- Comparing standard vs. reasoning model performance on coding tasks
Setup
The following cell installs the required packages and configures the environment for this lab.
pip install swebench openai tiktoken pandas
Install the SWE-bench evaluation harness and configure a code generation agent with access to file reading, code editing, and test execution tools.
Steps
Step 1: Set up the SWE-bench harness and agent
Install dependencies, load 10 SWE-bench Lite tasks from different repositories, and configure your agent with file read/write and test execution tools.
# TODO: Load SWE-bench Lite tasks and configure the agent.
# One option (exact loading API may vary by version) is the
# Hugging Face dataset:
# from datasets import load_dataset
# tasks = load_dataset("princeton-nlp/SWE-bench_Lite", split="test").select(range(10))
Step 2: Run the agent and collect metrics
For each task, run the agent and record pass/fail, token usage, API cost, and number of tool calls.
# TODO: Loop over tasks, run agent, record metrics in a DataFrame
Step 3: Categorize failures
For failed tasks, determine whether the agent misunderstood the issue, produced incorrect code, or failed to navigate the repository.
# TODO: Analyze agent logs for each failure and assign a category
Step 4: Compare model variants
Re-run the same tasks with a reasoning model and compare pass rates, cost, and efficiency.
# TODO: Run with reasoning model, then create a comparison table
Expected Output
- A results table showing pass/fail, token usage, cost, and tool calls per task
- A failure categorization breakdown (misunderstanding vs. incorrect code vs. navigation)
- A side-by-side comparison of standard vs. reasoning model performance
Stretch Goals
- Add a self-debugging retry loop and measure how many tasks it recovers
- Compare cost-efficiency: is the reasoning model worth its higher per-token cost?
- Implement a repository navigation strategy (search before read) and measure its impact on pass rate
Complete Solution
# Complete solution outline for SWE-bench evaluation
# This lab is primarily a measurement/evaluation exercise.
# The key deliverable is the metrics table and failure analysis,
# not a single script. See the SWE-bench documentation for
# harness setup: https://github.com/princeton-nlp/SWE-bench
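A minimal version of the metrics-table deliverable can be assembled with pandas. The per-task records below are hypothetical; in the lab they would come from your instrumented agent runs:

```python
import pandas as pd

# Hypothetical per-task records collected during the lab runs
records = [
    {"task_id": "astropy-1", "resolved": True,  "tokens": 41000, "cost": 0.38, "tool_calls": 12},
    {"task_id": "django-7",  "resolved": False, "tokens": 88000, "cost": 0.91, "tool_calls": 31},
]
df = pd.DataFrame(records)

summary = {
    "pass_rate": df["resolved"].mean(),
    "avg_cost": df["cost"].mean(),
    "cost_per_resolved": df["cost"].sum() / max(df["resolved"].sum(), 1),
    "avg_tool_calls": df["tool_calls"].mean(),
}
print(summary)
```

Note that cost per resolved task, not average cost, is the number that matters when comparing model variants: a cheaper model that fails more often can end up costing more per useful result.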
Exercises
List four dimensions on which agents should be evaluated beyond simple task completion. Explain why pass rate alone is an insufficient metric.
Answer Sketch
Task completion (did it succeed?), efficiency (steps and tokens used), safety (did it avoid harmful actions?), and robustness (does it handle edge cases?). Pass rate alone ignores cost: an agent solving 80% at $0.05/task may be more valuable than one solving 90% at $2/task. It also ignores latency, safety violations, and failure modes.
Explain the structure of a SWE-bench task: what inputs does the agent receive, what output must it produce, and how is success determined? Why is SWE-bench Verified considered more reliable than the original SWE-bench?
Answer Sketch
Inputs: a repository snapshot and a natural language issue description. Output: a code patch. Success: the patch passes the provided test suite. SWE-bench Verified uses human-validated tasks, removing noisy or ambiguous tasks from the original set that could give misleading results about agent capabilities.
Design a custom evaluation harness for a customer support agent. Write a Python class SupportAgentEval that takes a list of test cases (question, expected_resolution, difficulty) and produces a report with accuracy, average cost, and failure categorization.
Answer Sketch
The class should: (1) iterate through test cases, (2) run the agent on each, (3) compare output to expected_resolution using exact match or LLM-as-judge, (4) record token usage and latency per case, (5) categorize failures (misunderstanding, wrong tool, incomplete answer), and (6) produce a summary DataFrame with per-category pass rates and cost statistics.
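One possible sketch of the class, with a deliberately crude substring match standing in for the LLM-as-judge scoring step:

```python
import time
import pandas as pd

class SupportAgentEval:
    """Sketch of the harness described above. `agent_fn` is a placeholder
    callable (question -> answer); the substring check below is a stand-in
    for exact-match or LLM-as-judge scoring."""

    def __init__(self, test_cases):
        # test_cases: list of (question, expected_resolution, difficulty)
        self.test_cases = test_cases

    def run(self, agent_fn):
        rows = []
        for question, expected, difficulty in self.test_cases:
            start = time.time()
            answer = agent_fn(question)
            ok = expected.lower() in answer.lower()  # crude; use a judge in practice
            rows.append({"difficulty": difficulty, "passed": ok,
                         "latency_s": time.time() - start})
        df = pd.DataFrame(rows)
        return df.groupby("difficulty")["passed"].mean(), df
```

Token usage, cost, and failure categorization would be added as extra columns per row, following the same pattern as the latency measurement.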
Given a set of agent evaluation results (accuracy, cost_per_task), write a Python function that identifies the Pareto-optimal configurations and plots the accuracy vs. cost frontier using matplotlib.
Answer Sketch
Sort results by cost. A point is Pareto-optimal if no other point has both higher accuracy and lower cost. Iterate through sorted results, tracking the maximum accuracy seen. A point is on the frontier if its accuracy exceeds the current maximum. Plot all points as scatter, highlight Pareto-optimal points, and connect them with a line.
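A direct implementation of that sketch (the plotting helper assumes matplotlib is installed):

```python
def pareto_frontier(results):
    """results: list of (accuracy, cost_per_task) tuples.
    Returns the Pareto-optimal subset, sorted cheapest-first."""
    pts = sorted(results, key=lambda p: p[1])  # sort by cost, ascending
    frontier, best_acc = [], float("-inf")
    for acc, cost in pts:
        if acc > best_acc:  # strictly better accuracy than anything cheaper
            frontier.append((acc, cost))
            best_acc = acc
    return frontier

def plot_frontier(results):
    """Scatter all configurations and highlight the frontier."""
    import matplotlib.pyplot as plt
    accs, costs = zip(*results)
    plt.scatter(costs, accs, alpha=0.5, label="all configs")
    f_acc, f_cost = zip(*pareto_frontier(results))
    plt.plot(f_cost, f_acc, "r-o", label="Pareto frontier")
    plt.xlabel("Cost per task ($)")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()
```

Any point not on the frontier is dominated: some other configuration is both cheaper and at least as accurate, so it should never be deployed.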
WebArena and OSWorld test agents in simulated environments. Discuss two ways these benchmarks might overestimate or underestimate agent capabilities compared to real-world deployment.
Answer Sketch
Overestimate: benchmarks use clean, deterministic environments; real websites have CAPTCHAs, dynamic content, and rate limits. Underestimate: benchmarks evaluate single sessions; real agents can learn from past attempts and use persistent memory. Also, benchmark scoring may miss partial successes that would still be useful to a human user.
Key Takeaways
- Agent evaluation is harder than LLM evaluation because of multi-step interactions, tool use, and compounding errors.
- SWE-bench, GAIA, and WebArena test different agent capabilities: code generation, general reasoning, and web navigation respectively.
- Reproducibility requires controlling for stochasticity: fix seeds, record trajectories, and report variance across multiple runs.
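The last point can be made concrete with a small helper; `run_eval` here is a placeholder for one full evaluation pass that returns a pass rate:

```python
import statistics

def pass_rate_with_variance(run_eval, n_runs=5):
    """Repeat a full evaluation pass n_runs times and report the mean
    pass rate with its sample standard deviation. `run_eval` is a
    zero-argument callable returning a pass rate in [0, 1]."""
    rates = [run_eval() for _ in range(n_runs)]
    return statistics.mean(rates), statistics.stdev(rates)
```

Reporting a single run's pass rate for a stochastic agent is effectively reporting one sample from a distribution; the standard deviation tells you whether a 3-point difference between two agents is signal or noise.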
Why is evaluating agents fundamentally harder than single-turn LLM evaluation?
Show Answer
Agents interact with environments over multiple steps, use tools, and make sequential decisions where early errors compound. The same agent can produce different trajectories on the same task due to stochasticity, making reproducibility and scoring much harder than single-turn LLM evaluation.
What does SWE-bench measure, and why is it considered a strong benchmark for code agents?
Show Answer
SWE-bench measures an agent's ability to resolve real GitHub issues by generating code patches that pass existing test suites. It is considered strong because it tests end-to-end software engineering (reading code, understanding issues, writing correct patches) rather than isolated coding ability.
What Comes Next
In Chapter 23: Tool Use, Function Calling and Protocols, we shift from evaluating agents to building their core capability: interacting with external tools through function calling, MCP, and other protocols.
References and Further Reading
Agent Benchmarks
Jimenez, C., Yang, J., Wettig, A., et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
The standard benchmark for evaluating code agents on real-world software engineering tasks, requiring agents to resolve actual GitHub issues from popular repositories.
Zhou, S., Xu, F., Zhu, H., et al. (2024). "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024.
Provides a realistic web environment benchmark with self-hosted websites where agents must complete complex tasks, establishing the standard for evaluating web agents.
Liu, X., Yu, H., Zhang, H., et al. (2024). "AgentBench: Evaluating LLMs as Agents." ICLR 2024.
A multi-dimensional benchmark evaluating LLM agents across 8 distinct environments including operating systems, databases, and web browsing, providing a holistic agent evaluation framework.
Evaluation Methodology
Kapoor, S., Stroebl, B., Siegel, Z.S., et al. (2024). "AI Agents That Matter." arXiv preprint.
Identifies common pitfalls in agent evaluation including overfitting to benchmarks, cost neglect, and lack of statistical rigor. Essential reading for designing meaningful agent evaluations.
Xie, J., Zhang, K., Chen, J., et al. (2024). "TravelPlanner: A Benchmark for Real-World Planning with Language Agents." ICML 2024.
A planning benchmark requiring agents to create travel itineraries with complex constraints, testing multi-step reasoning and constraint satisfaction abilities.
Evaluates agents on research engineering tasks with human expert baselines, providing rigorous comparison between human and AI agent capabilities on complex technical work.
