Agentic Evaluation: AgentBench, SWE-Bench, GAIA, τ-bench

Section 43.2

"An agent that gets the right answer half the time, but bricks your database the other half, has a zero-percent useful success rate."

Eval, Vibe-Averse AI Agent
Big Picture

Agents break every assumption that classical NLP evaluation rests on. A standard NLP benchmark compares a single model output against a reference string; an agent produces a trajectory: a sequence of tool calls, observations, side effects, and partial outputs spread over minutes or hours of wall time. The same task can succeed through multiple correct paths and fail in multiple instructive ways. Side effects mean re-running a benchmark mutates the world, so eval is no longer idempotent. This section walks through the 2024-2026 agentic benchmark cube (AgentBench, BFCL, SWE-Bench and SWE-Bench Verified, GAIA and ARC-AGI-2, WebArena and OSWorld, τ-bench, τ²-bench, and MM-τ-p2); the trajectory eval primitives (success-at-N, partial credit, exact-match-at-final-state); the dual-control evaluation pattern introduced by τ²-bench; the contamination problem at scale; and the surprisingly large cost of running a 200-task benchmark on a moderate agent. We close with a contamination-check code fragment and a forward pointer to eval-as-product workflows.

Prerequisites

This section assumes familiarity with agent architectures from Section 26.1 and tool calling from Section 27.1. The eval foundations in Section 42.1 and the RAG eval patterns in Section 43.1 are useful context, especially the layered-eval idea, which extends naturally to trajectories.

Before looking at benchmarks, be clear about what makes agentic eval hard. Two properties of agent runs break standard test infrastructure.

43.2.1 Why Standard NLP Eval Does Not Work for Agents

Figure 43.2.1: Agentic benchmarks like SWE-Bench, GAIA, and AgentBench function as obstacle courses: each task is a sealed environment where the agent's trajectory of tool calls, observations, and side effects determines success, not a single output string.

Standard eval assumes (1) a single input produces a single output, and (2) the same input always produces the same output (modulo sampling temperature). Both fail for agents.

Key Insight: Goal-State Diff, Not Output String Match

The single most useful primitive in agentic eval is the goal-state diff: rather than scoring the agent's text output, you snapshot the relevant environment state before and after the run, and compare it against a target state. For τ-bench's retail and airline domains, this is a database row diff. For SWE-Bench, it is a "did the patch pass the hidden tests" check. For browser agents, it is a comparison of DOM state or downloaded files. The agent's chat output becomes evidence, not the score.
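The goal-state diff can be sketched in a few lines. This is a minimal illustration, not any benchmark's actual scorer: it assumes the environment state can be snapshotted as nested dictionaries (table name, then primary key, then row), and the names `diff_snapshots` and `reached_goal` are hypothetical.

```python
from typing import Any

Row = dict[str, Any]
Table = dict[str, Row]        # primary key -> row
Snapshot = dict[str, Table]   # table name -> table

def diff_snapshots(before: Snapshot, after: Snapshot) -> dict[str, dict[str, Any]]:
    """Return {table: {pk: new_row}} for every row that changed.

    A value of None marks a row that existed before and was deleted.
    """
    changes: dict[str, dict[str, Any]] = {}
    for table in before.keys() | after.keys():
        old_rows = before.get(table, {})
        new_rows = after.get(table, {})
        table_changes = {
            pk: new_rows.get(pk)
            for pk in old_rows.keys() | new_rows.keys()
            if old_rows.get(pk) != new_rows.get(pk)
        }
        if table_changes:
            changes[table] = table_changes
    return changes

def reached_goal(before: Snapshot, after: Snapshot, goal: Snapshot) -> bool:
    """Success means the run changed exactly the rows the goal state changes."""
    return diff_snapshots(before, after) == diff_snapshots(before, goal)

# Toy retail example (assumed data): the goal is a $40 refund on order o1.
before = {"orders": {"o1": {"status": "delivered", "refund": 0}}}
goal = {"orders": {"o1": {"status": "refunded", "refund": 40}}}
after_good = {"orders": {"o1": {"status": "refunded", "refund": 40}}}
after_bad = {"orders": {"o1": {"status": "refunded", "refund": 400}}}

print(reached_goal(before, after_good, goal))
print(reached_goal(before, after_bad, goal))
```

Note that the agent's chat transcript never enters the comparison: an agent that politely claims it issued the refund but wrote the wrong amount fails.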

43.2.2 The 2024-2026 Agentic Benchmark Cube

The agentic benchmark landscape between 2024 and 2026 is best understood as a multi-axis cube. Each axis stresses a different agent capability; production agents are usually evaluated on a slice across multiple axes, not on a single benchmark.

Table 43.2.1a: The agentic benchmark cube as of 2026. Axis labels are conventional rather than universal; some benchmarks straddle axes.
Axis | Headline Benchmarks | What It Stresses | Scoring Primitive
Tool use | AgentBench (2023, refreshed 2024-26), BFCL (Berkeley Function Calling Leaderboard, 2024-26) | Tool selection, argument filling, multi-turn tool chaining | Per-task success rate, function-arg exact match
Software engineering | SWE-Bench (2023), SWE-Bench Verified (2024), LiveCodeBench (2024-26) | Real GitHub-issue resolution, end-to-end patch quality | Hidden-test pass rate
General research | GAIA (2023), ARC-AGI-2 (2024-26) | Multi-step reasoning, web research, multimodal synthesis | Exact final-answer match (graded by humans + judge)
Browser control | WebArena (2023-26), Visual WebArena (2024), OSWorld (2024-26) | Real-world web/OS navigation, GUI control | Goal-state DOM/filesystem diff
API orchestration | τ-bench (2024), τ²-bench (2025), MM-τ-p2 (2026) | Multi-turn dialogue + tool calls + database state | Database-row goal-state compare
Tool decathlon | Atlas, MCP-Atlas (2025-26) | Breadth across many tool ecosystems, MCP-style protocols | Composite per-tool success rate

A 2026 production-readiness review for an agent typically picks one benchmark per axis to gate on, plus a private golden set drawn from production traffic. Public benchmarks anchor the comparison to peer models; the private set guards against contamination and shape mismatch.

AgentBench and BFCL

AgentBench (Liu et al., 2023, refreshed for 2024-26) covers eight environments: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, web browsing, web shopping, and household. Each environment has a curated task set; the agent is scored on per-task completion. The 2024-26 refreshes added MCP-style tool wrappers and a leaderboard.

BFCL (Berkeley Function-Calling Leaderboard; Patil et al., 2024) targets a narrower question: does the agent select the right function and fill its arguments correctly? BFCL v3 (2024) includes simple, parallel, parallel-multiple, multi-turn, and Java/JavaScript multilingual variants. The headline metrics are AST match (does the predicted function call's AST match the gold call?) and executable match (does the predicted call actually run successfully?).
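To make the AST-match idea concrete, here is a simplified sketch using Python's standard `ast` module. It is not BFCL's scorer: it assumes calls are written in Python syntax with keyword arguments and literal values only, and the helper names `parse_call` and `ast_match` are hypothetical.

```python
import ast

def parse_call(call_src: str) -> tuple[str, dict]:
    """Parse 'f(a=1, b="x")' into (function name, {argument name: literal value})."""
    node = ast.parse(call_src, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError(f"not a function call: {call_src!r}")
    if node.args:
        raise ValueError("this sketch only handles keyword arguments")
    name = ast.unparse(node.func)
    # literal_eval raises ValueError for anything that is not a plain literal
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def ast_match(predicted: str, gold: str) -> bool:
    """True if both calls name the same function with the same arguments.

    Comparing parsed structure rather than raw strings makes the check
    insensitive to argument order, whitespace, and quote style.
    """
    try:
        return parse_call(predicted) == parse_call(gold)
    except (SyntaxError, ValueError):
        return False  # unparseable predictions count as misses

gold = 'get_weather(city="Paris", unit="C")'
print(ast_match("get_weather(unit='C', city='Paris')", gold))  # same call, reordered
print(ast_match('get_weather(city="Rome", unit="C")', gold))   # wrong argument value
```

An executable-match check would go one step further and actually invoke the predicted call against a live or mocked API, which is why it needs a sandbox and is omitted here.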

SWE-Bench and LiveCodeBench

SWE-Bench (Jimenez et al., 2023) is the most influential coding benchmark of the agent era. Each task is a real GitHub issue from a popular open-source Python repo (e.g., Django, scikit-learn) paired with the human-written PR that resolved it. The agent's job is to generate a patch that makes the hidden test suite pass without breaking the existing tests. Pass rate on SWE-Bench is a standard headline metric in 2025-2026 model releases.

SWE-Bench Verified is OpenAI's 500-task curated subset (2024), filtered by humans for clarity, correctness, and solvability. It emerged after the discovery that the original SWE-Bench had test-set leakage in newer model training data and that some tasks were under-specified.

LiveCodeBench (Jain et al., 2024-26) attacks contamination differently: tasks are continuously scraped from competitive-programming platforms (LeetCode, AtCoder, Codeforces) with a date stamp, and models are evaluated only on tasks dated after the model's training cutoff. The benchmark is dynamic by construction.
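The date-gating idea reduces to a filter on release dates. The sketch below assumes each task record carries an ISO-format `released` field; that field name and the sample tasks are illustrative, not LiveCodeBench's actual schema.

```python
from datetime import date

def post_cutoff_tasks(tasks: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only tasks released strictly after the model's training cutoff."""
    return [
        task for task in tasks
        if date.fromisoformat(task["released"]) > training_cutoff
    ]

# Assumed sample data for illustration only.
tasks = [
    {"id": "t1", "released": "2024-01-15"},
    {"id": "t2", "released": "2024-09-02"},
    {"id": "t3", "released": "2025-03-20"},
]

eligible = post_cutoff_tasks(tasks, training_cutoff=date(2024, 6, 30))
print([task["id"] for task in eligible])  # ['t2', 't3']
```

When comparing two models, the same filter applies with the later of the two cutoffs, so that neither model has an exposure advantage.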

GAIA and ARC-AGI-2

GAIA (Mialon et al., 2023) tests "general AI assistants" on questions that are easy for humans but hard for current LLMs: questions requiring web research, file manipulation, mathematical reasoning, and multimodal interpretation. Tasks are split into three levels; level 3 tasks routinely take humans 30 minutes and require multiple tool calls.

ARC-AGI-2 (Chollet et al., 2024-26) is the second-generation Abstraction and Reasoning Corpus. The benchmark targets fluid intelligence: each task is a small grid puzzle that requires inferring a transformation rule from a handful of examples. ARC-AGI-2 raised the difficulty floor after large models began solving the original ARC-AGI tasks; the 2025 leaderboard saw single-digit pass rates from frontier models.

WebArena and OSWorld

WebArena (Zhou et al., 2023-26) provides a Dockerized environment with self-hosted clones of GitLab, Reddit, an e-commerce site, a CMS, and a map service. Tasks are realistic web workflows ("file a bug report with the following label and assignee") scored by goal-state DOM diff. OSWorld (Xie et al., 2024-26) extends this to full Linux/Windows desktop environments, including image-editing, spreadsheet, and IDE tasks; scoring uses filesystem and process-state inspection.

τ-bench, τ²-bench, MM-τ-p2

τ-bench (Yao et al., 2024; arXiv 2406.12045) is the methodological standout of the family. The benchmark places an agent in a customer-service role, interacting with a simulated user (powered by a second LLM) and a backend database. The task is to satisfy the user's underlying intent, which the user describes naturally but does not state as a structured goal. Scoring is by database-state diff against a target state: did the agent correctly issue a refund, update a flight, or modify the right record?

τ²-bench (Barres et al., 2025; arXiv 2506.07982) extends this to dual-control settings: both the agent and the simulated user can update state. This models real-world coordination tasks where the user might, for instance, share a credit-card number that the agent then enters, or where the user updates their own profile while the agent updates the booking. Goal-state diff is still the scoring primitive, but the trajectory space is wider because either party can advance the state.

MM-τ-p2 (Persona-Adaptive Multi-Modal Agent Evaluation; 2026) adds two further axes: multimodal inputs (images, screenshots, PDFs) and persona-adaptive simulated users (different user types have different patience, expressiveness, and adversarial tendencies). The result is a stress-test for production assistants that have to handle the long tail of user variety.

43.2.3 Trajectory Evaluation Primitives

Warning: Common Misconception

"Higher pass@k means a better agent" is the easiest misreading of agentic leaderboards. Pass@k says "the agent succeeded in at least one of k attempts," so a flaky agent that succeeds 1 in 10 times can post a high pass@10. For a production system you usually care about pass@1 (single-shot reliability) or about the cost-adjusted curve. The token-cost side of agentic eval is often hidden: an agent that needs k=20 samples to reach the same pass@k as another agent's pass@1 is approximately 20x more expensive to run. Always read the k and the average token cost.

Across all of these benchmarks, four scoring primitives recur. Knowing which primitive a benchmark uses tells you what to optimize for.

Real-World Scenario
Why pass@1 and pass@5 Tell Different Stories

Two coding agents on LiveCodeBench: Agent A scores pass@1 = 0.42 and pass@5 = 0.48. Agent B scores pass@1 = 0.31 and pass@5 = 0.58. At pass@1, A wins; at pass@5, B wins. The interpretation is operational: A is a reliable single-shot solver; B is a creative-but-inconsistent solver that benefits from multiple attempts. If your product surface allows the user to retry (e.g., an IDE plugin where the user can click "regenerate"), B is better. If you can only afford one shot (e.g., a billing-critical pipeline that costs $0.50 per agent invocation), A is better. Reporting only one number hides the choice.
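Both numbers can be computed from the same batch of rollouts. A common way to do this is the combinatorial estimator popularized by code-generation evaluations (Chen et al., 2021): run n rollouts per task, count c successes, and estimate pass@k as 1 - C(n-c, k) / C(n, k). The sketch below assumes you already have per-task success counts; the function names are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n rollouts, c of which succeeded.

    1 - C(n-c, k) / C(n, k) is the probability that a random size-k
    subset of the n rollouts contains at least one success.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(success_counts: list[int], n: int, k: int) -> float:
    """Average the per-task estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in success_counts) / len(success_counts)

# One task, 10 rollouts, 3 successes:
print(pass_at_k(10, 3, 1))  # about 0.30
print(pass_at_k(10, 3, 5))  # about 0.917
```

The example shows how wide the gap can be: a task solved in only 3 of 10 rollouts still contributes roughly 0.92 to pass@5, which is why the k must always be reported alongside the score.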

43.2.4 Dual-Control Evaluation: τ²-bench's Methodological Twist

Most agent benchmarks freeze the user. The agent acts; the user is passive (a prompt, a state-mutation by an oracle, or a simulated user that only responds to questions). τ²-bench breaks this asymmetry. In a dual-control rollout, both the agent and the simulated user can call tools, update state, and advance the conversation.

Practically, this models scenarios like: the agent asks the user to upload a passport scan; the user uploads it (a state mutation); the agent then verifies it. Or: the agent suggests a flight; the user counters with a date constraint; the agent re-searches. Or: the agent updates the booking while the user simultaneously updates their profile.

The eval question shifts from "did the agent succeed?" to "did the agent and user, jointly, reach the goal state?" This is closer to real production. A real customer-service agent does not unilaterally drive the conversation; it coordinates.
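A toy version of a dual-control rollout makes the shift visible. This is a sketch under strong simplifying assumptions, not τ²-bench's implementation: the agent and user are plain Python functions that mutate a shared state dictionary on alternating turns, and all names are hypothetical.

```python
def dual_control_rollout(agent_turn, user_turn, state, goal, max_turns=10):
    """Alternate agent and user turns; either party may mutate the shared state.

    Returns (reached_goal, turns_used). Scoring is joint: it does not matter
    which party made the final state change.
    """
    for turn in range(max_turns):
        actor = agent_turn if turn % 2 == 0 else user_turn
        actor(state)
        if state == goal:
            return True, turn + 1
    return False, max_turns

def toy_agent(state):
    """Ask for the passport scan, then verify it once the user has uploaded it."""
    if not state["passport_requested"]:
        state["passport_requested"] = True
    elif state["passport_uploaded"]:
        state["passport_verified"] = True

def cooperative_user(state):
    """Upload the scan once asked: a state mutation made by the user."""
    if state["passport_requested"]:
        state["passport_uploaded"] = True

def passive_user(state):
    """A frozen user, as in single-control benchmarks: never changes state."""

start = {"passport_requested": False, "passport_uploaded": False, "passport_verified": False}
goal = {"passport_requested": True, "passport_uploaded": True, "passport_verified": True}

print(dual_control_rollout(toy_agent, cooperative_user, dict(start), goal))
print(dual_control_rollout(toy_agent, passive_user, dict(start), goal))
```

The same agent succeeds with the cooperative user and stalls with the passive one, which is exactly the behavior a frozen-user benchmark cannot measure.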

Warning: Simulated Users Drift

The simulated user in τ-bench and τ²-bench is itself an LLM. As the simulator model changes (e.g., upgraded from GPT-4 to GPT-4.5), the simulated user behaves differently, and the benchmark's difficulty drifts. Reproducible runs require pinning the simulator model and version alongside the agent model. Many published τ-bench scores fail this and are not directly comparable across papers.
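One lightweight way to enforce pinning is to record a run manifest next to every score and refuse to compare scores whose non-agent settings differ. The sketch below is one possible shape; the field names and model identifiers are placeholders, not part of any τ-bench tooling.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RunManifest:
    benchmark: str              # benchmark name and version
    agent_model: str            # the model under test
    simulator_model: str        # exact, versioned ID of the simulated-user model
    simulator_temperature: float

    def fingerprint(self) -> str:
        """Short stable hash to store alongside the reported score."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

def comparable(a: RunManifest, b: RunManifest) -> bool:
    """Two scores are comparable only if everything except the agent matches."""
    settings_a, settings_b = asdict(a), asdict(b)
    settings_a.pop("agent_model")
    settings_b.pop("agent_model")
    return settings_a == settings_b

# Placeholder identifiers for illustration.
run_1 = RunManifest("tau-bench-airline", "agent-v1", "sim-model-2025-01", 0.0)
run_2 = RunManifest("tau-bench-airline", "agent-v2", "sim-model-2025-01", 0.0)
run_3 = RunManifest("tau-bench-airline", "agent-v2", "sim-model-2025-06", 0.0)

print(comparable(run_1, run_2))  # same simulator, different agents
print(comparable(run_2, run_3))  # simulator changed between runs
```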

43.2.5 Code: Agent Harness Wired to a Benchmark

The pattern below shows the core loop of an agentic eval harness. It loads tasks from a benchmark spec, runs an agent rollout for each, and scores by goal-state diff against the target state. The harness is benchmark-agnostic: swap the task loader and the diff function and the same loop works for τ-bench, SWE-Bench, or a custom internal suite.

# Minimal agentic eval harness: load tasks, run rollouts, score by goal-state diff, report pass@k
import json
from collections import Counter

def run_one_rollout(agent, task, env, max_steps=25):
    """Reset env, give task to agent, step until done or max_steps reached."""
    env.reset(initial_state=task["initial_state"])
    obs = env.observe()
    for _ in range(max_steps):
        action = agent.act(task["prompt"], obs)     # LLM call w/ tool schema
        obs, done = env.step(action)                 # execute tool call
        if done:
            break
    return env.snapshot_state()                       # final environment state

def score_pass_at_k(agent, env, tasks, k=5):
    """For each task, run k rollouts; success if ANY final state matches the goal."""
    results = Counter()
    for task in tasks:
        any_pass = False
        for _ in range(k):
            final_state = run_one_rollout(agent, task, env)
            if goal_state_match(final_state, task["goal_state"]):
                any_pass = True
                break
        results["pass" if any_pass else "fail"] += 1
    return results["pass"] / (results["pass"] + results["fail"])

def goal_state_match(final, goal):
    """Benchmark-specific. For tau-bench: row-level diff on the relevant DB tables."""
    return json.dumps(final, sort_keys=True) == json.dumps(goal, sort_keys=True)

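# Example: run on a 200-task slice of an AgentBench-style suite.
# load_benchmark_tasks, DockerizedOSEnv, MyAgent, and OS_TOOLS are placeholders
# for your own task loader, environment, agent wrapper, and tool schema.
tasks = load_benchmark_tasks("agentbench-v2/os/tasks.jsonl")
env = DockerizedOSEnv()
agent = MyAgent(model="claude-opus-4-7", tools=OS_TOOLS)
pass_rate = score_pass_at_k(agent, env, tasks, k=5)
print(f"AgentBench-OS pass@5 = {pass_rate:.3f}")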
Code Fragment 43.2.1b: A benchmark-agnostic agentic eval harness. The two abstractions are env (must support reset, observe, step, snapshot_state) and goal_state_match (benchmark-specific). The same loop drives τ-bench (DB-state match), SWE-Bench (test-suite pass), and WebArena (DOM diff).

43.2.6 The Contamination Problem at Scale

Contamination, the leakage of benchmark items into model training data, is the defining hygiene problem of agentic eval. The 2024-2026 history of SWE-Bench illustrates it:

  1. SWE-Bench launched in late 2023 with ~2,300 real GitHub issue tasks.
  2. Through 2024, frontier-model scores climbed unevenly. Some labs reported >50% pass rates; others reported much lower scores with comparable models.
  3. Investigation showed that some tasks were ambiguously specified (the model had to guess which behavior the issue meant) and that newer training data included resolved PR threads with their solutions inline.
  4. OpenAI's SWE-Bench Verified (August 2024) curated 500 tasks that human reviewers confirmed were well-specified and not obviously leaked. It became the new standard benchmark; the original SWE-Bench is now described as the "full" version.
  5. LiveCodeBench took a different tack: dynamic by construction, only score on tasks dated after the model cutoff.

The lessons generalize. Three contamination-resistance patterns are now standard:

Warning: A Benchmark Score Is Always Provisional

Benchmark numbers reported six months after release often disagree with numbers reported at launch. Models change; benchmarks get patched; contamination is discovered post-hoc. Treat every public score as a snapshot, not a fact. When comparing models, prefer benchmark slices dated after both models' training cutoffs, and prefer dynamic benchmarks over static ones for headline claims.

Code: An N-gram Contamination Check

The fragment below computes an n-gram overlap between a benchmark's task strings and a sample of the training distribution. High overlap is a contamination signal. This is the simplest of several contamination detectors; full pipelines also use embedding-similarity, exact-substring matching, and membership-inference attacks.

# Simple n-gram overlap contamination check between benchmark items and a training-data sample
from collections import Counter

def ngrams(text, n=13):
    """Yield word n-grams. 13 is the common threshold for "memorization-suspicious" overlap."""
    toks = text.split()
    for i in range(len(toks) - n + 1):
        yield " ".join(toks[i:i+n])

def build_train_ngram_index(train_docs, n=13):
    idx = Counter()
    for d in train_docs:
        idx.update(ngrams(d, n))
    return idx

def contamination_score(bench_item, train_index, n=13):
    """Fraction of benchmark n-grams that also appear in the training index."""
    item_grams = list(ngrams(bench_item, n))
    if not item_grams:
        return 0.0
    hits = sum(1 for g in item_grams if train_index[g] > 0)
    return hits / len(item_grams)

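# Usage: build index once, score every benchmark item, flag the suspicious ones.
# load_train_sample and benchmark_items are placeholders: supply your own loader
# (returning an iterable of document strings) and a list of {"text": ...} items.
train_index = build_train_ngram_index(load_train_sample("common-crawl-2024-shard.jsonl"))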
suspicious = [
    item for item in benchmark_items
    if contamination_score(item["text"], train_index) > 0.30
]
print(f"{len(suspicious)}/{len(benchmark_items)} benchmark items flagged for review.")
Code Fragment 43.2.2: 13-gram overlap contamination detector. The 30% threshold is a starting point and should be tuned per domain; legal and biomedical text have higher natural n-gram overlap than narrative prose. Real production pipelines combine this with exact-substring matching and embedding similarity.

43.2.7 The Cost of Agentic Eval

A 200-task AgentBench run on a moderate frontier agent has a measurable bill. Empirically (mid-2026 figures for Claude- and GPT-class agents):

The implication for design: full benchmark runs cannot happen on every commit. Production teams typically run a 20-30 task smoke-test slice on every PR (cost: ~$30-100) and a full 200-task run nightly or on release candidates only. Golden-set selection for the smoke-test slice is itself an evaluation problem: pick tasks that are sensitive to the specific changes the team is most likely to make.
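The budget arithmetic behind this split is simple enough to keep in a script. The sketch below uses assumed inputs ($2 per task, a 25-task smoke slice, 20 PRs per week, seven nightly full runs); substitute your own measured per-task cost.

```python
def weekly_eval_cost(cost_per_task: float, smoke_tasks: int, prs_per_week: int,
                     full_tasks: int, full_runs_per_week: int) -> dict[str, float]:
    """Weekly spend for a per-PR smoke slice plus scheduled full runs."""
    smoke = cost_per_task * smoke_tasks * prs_per_week
    full = cost_per_task * full_tasks * full_runs_per_week
    return {"smoke": smoke, "full": full, "total": smoke + full}

# Assumed figures for illustration only.
tiered = weekly_eval_cost(cost_per_task=2.0, smoke_tasks=25, prs_per_week=20,
                          full_tasks=200, full_runs_per_week=7)

# Counterfactual: run the full 200-task suite on every PR and skip nightlies.
full_on_every_pr = weekly_eval_cost(cost_per_task=2.0, smoke_tasks=200, prs_per_week=20,
                                    full_tasks=200, full_runs_per_week=0)

print(tiered)
print(full_on_every_pr["total"])
```

Under these assumptions the tiered setup costs $3,800 per week against $8,000 for full runs on every PR, and the gap widens as PR volume grows.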

Fun Note: The "Eval Whale" Phenomenon

By late 2025, several AI labs reported that their internal eval pipelines were consuming more inference compute than their model-training runs. The "eval whale" became a budgeting joke and a real concern: if you spend $500K running benchmarks every week, you have effectively built a second model-training program. Smart teams started building cached rollout traces, deterministic replay environments, and judge-model distillation so the eval bill scaled sub-linearly with model iteration.

Key Takeaways
Exercise 43.2.1: Design a Smoke-Test Slice (Analysis)

You maintain an agent that runs an AgentBench-OS benchmark suite of 200 tasks nightly at a cost of $400/run. Your team merges 20 PRs/week. Design a 25-task smoke-test slice that runs on every PR. Specify: (a) how you select the 25 tasks; (b) what cost/coverage tradeoff you accept; (c) how you detect when the smoke slice has gone stale (no longer correlates with the full 200-task score).

Hint
(a) Sample by failure-mode coverage: pick the 25 tasks that, over the past 30 days, have flipped pass/fail most often (these are the sensitive tasks); ensure each AgentBench-OS subcategory (filesystem, process, network, package-manager, etc.) is represented at least twice. (b) Cost: ~$50/PR × 20 PRs/week = $1,000/week versus nightly $400 × 7 = $2,800/week, so the smoke-test budget is a little over one third of the nightly budget. Coverage tradeoff: smoke catches ~60-75% of regressions a full run would catch, sufficient for blocking merges and triggering a follow-up nightly run. (c) Staleness detection: weekly, compute the Pearson correlation between the smoke-slice score and the full 200-task score across the week's 7 nightly runs (the 25 smoke tasks are a subset of the 200, so both scores come from the same runs); if the correlation drops below 0.7, refresh the smoke selection from the latest 30 days of pass/fail data.
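Parts (a) and (c) of the hint can be prototyped as follows. This is a sketch with assumed data shapes (a per-task list of nightly pass/fail booleans, and paired lists of nightly scores); the function names are illustrative, and subcategory balancing is left out for brevity.

```python
from math import sqrt

def flip_count(history: list[bool]) -> int:
    """Number of pass/fail flips between consecutive nightly runs."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)

def pick_sensitive_tasks(histories: dict[str, list[bool]], size: int) -> list[str]:
    """Pick the tasks whose outcome flipped most often (ties broken by task ID)."""
    ranked = sorted(histories, key=lambda task_id: (-flip_count(histories[task_id]), task_id))
    return ranked[:size]

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def smoke_slice_is_stale(smoke_scores, full_scores, threshold=0.7) -> bool:
    """Flag the slice for refresh when it stops tracking the full-suite score."""
    return pearson(smoke_scores, full_scores) < threshold

# Assumed toy data: three tasks over five nights.
histories = {
    "fs-01": [True, True, True, True, True],      # stable, uninformative
    "net-07": [True, False, True, False, True],   # flips every night
    "pkg-03": [True, True, False, False, True],   # flips twice
}
print(pick_sensitive_tasks(histories, size=2))
```

With only seven nightly points the correlation estimate is noisy, so in practice it may be worth using a longer window before triggering a refresh.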
Exercise 43.2.2: Map Benchmarks To An Agent (Conceptual)

For each of these production agents, identify which 2-3 benchmarks from the 2024-2026 cube you would gate on and explain why. (a) A pull-request-review agent that comments on code changes. (b) A customer-support agent for an airline that books and modifies reservations. (c) A research-assistant agent that searches the web and synthesizes findings into a brief. (d) A general-purpose "personal assistant" agent that does email, calendar, and browsing.

Answer Sketch
(a) PR-review: SWE-Bench Verified (real-issue resolution), LiveCodeBench (post-cutoff freshness), BFCL multi-turn (tool selection for git/CI tools). (b) Airline support: τ-bench airline domain (exact use case), τ²-bench (dual-control for user-driven changes), private golden set from real escalations. (c) Research assistant: GAIA (multi-step research), WebArena (browser control), private set of synthesis tasks. (d) Personal assistant: AgentBench (broad tool use), OSWorld (desktop control), MM-τ-p2 (multimodal personas), plus a private set; the breadth of the use case means no single benchmark suffices.
Self-Check
Q1: Explain why "exact match on the final answer string" is a poor scoring primitive for an agent that must "book a flight from SFO to JFK on a specified date." What primitive does τ-bench use instead, and why is it more faithful to the task?
Answer
The agent's natural-language reply can be phrased many ways ("I have booked your flight" vs "Booked: AA 100, SFO-JFK, depart 8am Friday") and still represent the same correct action. Exact string match would penalize valid paraphrases. τ-bench uses a goal-state diff against a target database state: did the agent actually create the right reservation in the booking database, with the right passenger and date? This ignores the agent's verbal output entirely and scores the effect on the world, which is what the user actually cares about.
Q2: An agent reports pass@1 = 0.35 and pass@5 = 0.62 on LiveCodeBench. In what production setting should you ship it, and in what setting should you not? Explain.
Answer
Ship in settings where retries are cheap and the user can re-roll: IDE-integrated coding assistants, generate-N-and-pick interactive tools, brainstorming surfaces. The pass@5 = 0.62 says the agent will produce a correct solution within 5 tries most of the time. Do not ship in settings where a single attempt must succeed: billing-critical automation, irreversible-side-effect pipelines, regulated workflows where the per-call cost or risk is high. The pass@1 = 0.35 says single-shot reliability is poor; on a one-attempt-only basis, this agent fails about two-thirds of the time.
Q3: SWE-Bench Verified is a 500-task subset of the original SWE-Bench. Why was the subset created, and what does its existence say about how to evaluate agents using the original SWE-Bench in 2026?
Answer
SWE-Bench Verified was created (by OpenAI in mid-2024) after analysis showed the original SWE-Bench had two problems: some tasks were under-specified (the issue text did not uniquely determine the correct behavior), and some had leaked into newer model training data via PR-thread scraping. Human reviewers curated 500 tasks confirmed to be well-specified and not obviously leaked. The implication for 2026 is: when reporting SWE-Bench scores, always specify which subset (full vs Verified), prefer Verified for headline comparisons, and treat any single SWE-Bench score with skepticism unless the report includes the model's training cutoff date and a contamination check.
What's Next

The benchmarks in this section are offline anchors. They tell you how your agent compares to peer models on a published, version-pinned task set. But the agents that ship to real users get exercised by inputs that look nothing like the benchmark, and the failure modes that matter most (a single $50,000 wrong refund, a single security regression) are exactly the ones a 200-task suite cannot guarantee against. §37.7 Eval-as-Product picks up where this section leaves off: how Braintrust, Latitude, and Laminar treat continuous evaluation as a first-class product surface, replaying production traces against new agent versions, surfacing regressions to developers in real time, and turning eval into something developers use, not just a CI step. The two views, benchmark anchors here, eval-as-product there, are how 2026 production-agent teams actually run eval.

Further Reading

AgentBench and Tool Use

Liu, X., Yu, H., Zhang, H., et al. (2023). "AgentBench: Evaluating LLMs as Agents." arXiv:2308.03688
Patil, S.G., Zhang, T., Wang, X., Gonzalez, J.E. (2024). "Berkeley Function-Calling Leaderboard (BFCL v3)." gorilla.cs.berkeley.edu/leaderboard.html

SWE-Bench and Code Generation

Jimenez, C.E., Yang, J., Wettig, A., et al. (2023). "SWE-bench: Can Language Models Resolve Real-world GitHub Issues?" arXiv:2310.06770
OpenAI (2024). "Introducing SWE-Bench Verified." openai.com/index/introducing-swe-bench-verified/
Jain, N., Han, K., Gu, A., et al. (2024). "LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code." arXiv:2403.07974

GAIA and General Research

Mialon, G., Fourrier, C., Swift, C., et al. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv:2311.12983
Chollet, F., et al. (2024-2026). "ARC-AGI-2 Leaderboard and Benchmark." arcprize.org

Browser and OS Control

Zhou, S., Xu, F.F., Zhu, H., et al. (2023). "WebArena: A Realistic Web Environment for Building Autonomous Agents." arXiv:2307.13854
Xie, T., Zhang, D., Chen, J., et al. (2024). "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." arXiv:2404.07972

τ-bench, τ²-bench, MM-τ-p2

Yao, S., Shinn, N., Razavi, P., Narasimhan, K. (2024). "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv:2406.12045
Barres, V., et al. (2025). "τ²-Bench: Evaluating Conversational Agents in Dual-Control Environments." arXiv:2506.07982
MM-τ-p2 Authors (2026). "MM-τ-p2: Persona-Adaptive Multi-Modal Agent Evaluation." arXiv:2603.09643

Surveys and Contamination Hygiene

"A Survey on Evaluation of LLM-based Agents (2025)." arXiv:2503.16416
"Evaluation and Benchmarking of LLM Agents: A Survey (2025)." arXiv:2507.21504