Section 30.4: Datasets & Benchmarks

Agent benchmarks have to test something that traditional NLP benchmarks do not: whether the system can complete an actual multi-step task with side effects. The benchmarks below are the ones with active leaderboards as of 2026.

**Figure 30.4.1:** Agent benchmarks sort into six families by task shape: software engineering, browser/web, tool-use mechanics, customer service, general assistants, and computer use. Pick benchmarks that match your deployment: SWE-bench for code agents, tau-bench for customer service, GAIA for general assistants. An "agent score" without a family is meaningless.

30.4.1 Software engineering agent benchmarks

SWE-bench, SWE-bench Verified, SWE-bench Multimodal (2024-Q4), and SWE-bench Live / Pro (2025): the most-cited coding-agent benchmark family. SWE-bench Verified is the 500-issue human-validated subset. SWE-bench Multimodal adds image-bearing issues (UI bugs, dashboards). SWE-bench Live and Pro are 2025 contamination-resistant variants that periodically refresh the issue pool so old contamination patterns expire.
SWE-Lancer (OpenAI, 2025) is a coding-agent benchmark built from real freelance jobs on Upwork, with the actual payout attached to each task (about $1 million in total). Its objective is to measure the economic value an agent could generate, not just its pass rate, which matters when you want to know what an agent is actually worth. The core concept is per-task dollar amounts plus a managerial-task variant where the agent reviews competing implementation proposals and selects one. Pick SWE-Lancer when you want a benchmark that translates to dollars; for comparability with the literature, SWE-bench is still the default.
AgentBench (Tsinghua, 2023) is a broad agent-evaluation suite covering 8 distinct environments: OS interaction, databases, knowledge graphs, card games, lateral thinking, web browsing, web shopping, and household. Its objective is to test cross-domain agent capabilities so a single number reflects more than one specialty, which matters when SWE-bench-only scores would over-fit to coding. The core concept is per-environment task pools with environment-specific success criteria. Pick AgentBench for breadth; for specific domains (code, web, customer service), domain-specific benchmarks are more diagnostic.
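
The difference between a pass rate and a dollar-weighted score, as SWE-Lancer frames it, can be illustrated with a minimal sketch. The `TaskResult` type, the function names, and the payout figures below are hypothetical and chosen only for illustration; they are not taken from the SWE-Lancer harness or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One graded task. All field names are illustrative assumptions."""
    task_id: str
    payout_usd: float
    passed: bool


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved, ignoring what each task is worth."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.passed for r in results) / len(results)


def earned_fraction(results: list[TaskResult]) -> float:
    """Share of the total available payout the agent actually earned."""
    total = sum(r.payout_usd for r in results)
    if total <= 0:
        raise ValueError("total payout must be positive")
    earned = sum(r.payout_usd for r in results if r.passed)
    return earned / total


# Hypothetical run: three cheap fixes solved, one expensive feature missed.
results = [
    TaskResult("t1", 250.0, True),
    TaskResult("t2", 500.0, True),
    TaskResult("t3", 1000.0, True),
    TaskResult("t4", 8000.0, False),
]

print(f"pass rate:    {pass_rate(results):.0%}")        # 75%
print(f"earned share: {earned_fraction(results):.1%}")  # 17.9%
```

In this made-up run the agent solves three of four tasks yet earns under a fifth of the available money, which is the gap a dollar-denominated benchmark is designed to expose.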

30.4.2 Browser and web agent benchmarks

WebArena (Carnegie Mellon, 2023) is the containerized web-agent benchmark featuring four realistic sites (an e-commerce store, GitLab, a CMS, a Reddit-like forum) where the agent must complete 812 tasks, some of which span multiple sites. Its objective is reproducible browser-agent evaluation without depending on real websites whose content changes weekly, which matters because real-web benchmarks become impossible to reproduce within months. The core concept is fully self-hosted Docker images of the four sites so every run starts from the same state. Pick WebArena for any browser-agent comparison; for multimodal browser tasks, use VisualWebArena.
VisualWebArena (Carnegie Mellon, 2024) is WebArena's multimodal extension where tasks include images (find a similar product, identify which checkbox to click in a screenshot). Its objective is to evaluate the visual grounding agents need for real web use, which matters because text-only browser agents miss the large share of web UI content that is carried by images or icons. Pick VisualWebArena when your agent uses vision; pair it with WebArena for a non-visual baseline comparison.
BrowseComp (OpenAI, 2025) is a web-research benchmark of hard, open-ended questions whose answers require multi-page browsing and synthesis. Its objective is to test agents on questions that a single search query cannot answer, which matters as a frontier evaluation for research-agent capability. The core concept is curated questions with hidden gold answers and a strict scoring rubric. Pick BrowseComp as an end-to-end web-research benchmark; the questions are deliberately hard, so expect only frontier models to score above 30%.
MMInA (Multi-hop Multimodal Internet Agents, 2024) is a multi-hop information-seeking benchmark on real websites, requiring multi-page reasoning over text + images. Its objective is to push beyond single-page or single-modality tasks into realistic multi-step research, which matters when your agent needs to combine evidence from different sites. Pick MMInA when evaluating multi-hop multimodal agents.

30.4.3 Tool-use and general agent benchmarks

Berkeley Function-Calling Leaderboard (BFCL) (UC Berkeley, 2023; v3 2024) is the standard tool-use benchmark measuring whether models correctly select, parameterize, and chain function calls. Its objective is to evaluate the mechanics of tool calling specifically (parsing, parameter types, parallel calls, missing-parameter handling), which matters because function-calling errors cascade silently in agents. The core concept is a curated tool catalog plus prompts with gold-standard expected calls; BFCL v3 added multi-turn and multi-step evaluation. Use BFCL to choose a function-calling-capable model; the synthetic functions limit external validity, so cross-check with tau-bench.
tau-bench (Sierra Research, 2024) is the customer-service agent benchmark covering airline and retail domains with realistic policies, tools, and user-simulator interactions. Its objective is to test whether an agent can follow domain-specific business rules under user pressure, which matters because real customer-service agents must refuse off-policy requests gracefully. The core concept is multi-turn user-simulator conversations where success requires both task completion and policy compliance. Pick tau-bench when your application is customer-service-shaped; for general agents, GAIA is broader.
GAIA and GAIA-2 (Meta + HF, 2023 / 2024) are general-AI-assistant benchmarks of real-world questions requiring web search, file operations, multimodal reasoning, and tool use. GAIA-2 (2024) added harder multi-step questions and updated the gold answers. Pick GAIA-2 over the original for current evaluation; the test set is small, so be careful not to over-fit to it.
AssistantBench (Yoran et al., 2024): web-research assistant benchmark; relevant for evaluating "find this on the web and answer" tasks.
τ²-bench (tau-bench v2, 2025): follow-up customer-service benchmark that adds a dual-control telecom domain in which both the agent and the simulated user can act on the environment.
MLE-bench (OpenAI, 2024-10): agents doing ML engineering tasks drawn from Kaggle competitions; the right benchmark for "agent submits a Kaggle-style competition entry".
AgentClinic (multi-turn medical agent benchmark, 2024): vertical clinical reasoning.
BrowseComp (OpenAI, 2025): described in Section 30.4.2; listed again here because it is the standard for "deep web research" capability.
OpenHands (Wang et al., 2024): open platform for AI software developer agents; the relevant infrastructure paper.
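
The kind of check that function-calling benchmarks such as BFCL rely on, comparing a predicted call against a gold-standard expected call, can be sketched as follows. This is a simplified illustration under stated assumptions, not BFCL's actual checker: the dictionary layout (`name`, `arguments`), the strict type comparison, and the order-insensitive matching of parallel calls are all choices made for this example.

```python
from typing import Any

# Assumed call layout for this sketch: {"name": str, "arguments": dict}
Call = dict[str, Any]


def call_matches(predicted: Call, expected: Call) -> bool:
    """True if function name, parameter set, value types, and values all agree."""
    if predicted.get("name") != expected["name"]:
        return False
    pred_args = predicted.get("arguments", {})
    exp_args = expected["arguments"]
    # A missing or extra parameter is a failure, not a partial match.
    if set(pred_args) != set(exp_args):
        return False
    # Strict typing: the string "3" does not satisfy an expected integer 3.
    return all(
        type(pred_args[key]) is type(value) and pred_args[key] == value
        for key, value in exp_args.items()
    )


def parallel_calls_match(predicted: list[Call], expected: list[Call]) -> bool:
    """Order-insensitive match: every expected call pairs with one distinct prediction."""
    if len(predicted) != len(expected):
        return False
    remaining = list(predicted)
    for exp in expected:
        for i, pred in enumerate(remaining):
            if call_matches(pred, exp):
                del remaining[i]
                break
        else:
            return False  # no unused prediction matched this expected call
    return True


# Hypothetical gold call and two model outputs.
gold = {"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}
good = {"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}
wrong_type = {"name": "get_weather", "arguments": {"city": "Paris", "days": "3"}}

print(call_matches(good, gold))        # True
print(call_matches(wrong_type, gold))  # False
```

The wrong-type case is the sort of error that passes silently through a permissive tool layer and only surfaces several steps later in an agent run, which is why it is checked explicitly here.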

METR's "time horizon" measurement (the observation that the duration of tasks AI agents can reliably complete has been doubling roughly every 7 months) is the single most-cited longitudinal claim about agent progress in 2025-26. See Section 78.3 for the full context; relevant here as a meta-metric for agent capability trajectory.
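
The arithmetic behind a fixed doubling time can be made concrete with a short sketch. It assumes a clean exponential with the 7-month doubling period quoted above; the 60-minute starting horizon and the function names are hypothetical and serve only to illustrate the calculation, not to reproduce METR's fitted values.

```python
import math

DOUBLING_MONTHS = 7.0  # doubling period quoted in the text


def projected_horizon(h0_minutes: float, months_ahead: float,
                      doubling_months: float = DOUBLING_MONTHS) -> float:
    """Task horizon after `months_ahead`, assuming pure exponential growth."""
    return h0_minutes * 2 ** (months_ahead / doubling_months)


def months_to_reach(h0_minutes: float, target_minutes: float,
                    doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for the horizon to grow from h0 to the target."""
    if h0_minutes <= 0 or target_minutes <= 0:
        raise ValueError("horizons must be positive")
    return doubling_months * math.log2(target_minutes / h0_minutes)


# Hypothetical starting point: a 60-minute horizon today.
print(f"{projected_horizon(60, 14):.0f} minutes after 14 months")  # 240
print(f"{months_to_reach(60, 480):.0f} months to reach 8 hours")    # 21
```

Two doublings take 14 months and three take 21, so under this assumption a one-hour horizon becomes a full working day in under two years. The extrapolation is only as good as the assumption that the trend continues.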

30.4.4 Comparing the benchmarks

Table 30.4.1a: Agent benchmarks (2026).

Benchmark	Domain	SOTA pass rate	Caveat
SWE-bench Verified	Real GitHub issues	~60-75% (frontier)	Verified subset
WebArena	Web tasks	~50%	Containerized envs only
GAIA	General assistance	~70% (frontier)	Test set partially private
BFCL	Function calling	~90% (frontier)	Synthetic functions
tau-bench	Customer service	~50-70%	Two domains only

Warning: Beware single-number rankings

Agent benchmarks aggregate many sub-tasks. A high single-number score can hide catastrophic failure on a critical sub-domain. Always look at the breakdown.
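
A minimal sketch of such a breakdown follows. The sub-domain names, the counts, and the 50% floor are hypothetical; the point is only that an aggregate score and a per-sub-domain report are computed from the same results and can tell very different stories.

```python
from collections import defaultdict


def overall(results: list[tuple[str, bool]]) -> float:
    """Single-number pass rate across all tasks."""
    return sum(passed for _, passed in results) / len(results)


def breakdown(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per sub-domain, from (subdomain, passed) pairs."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [passes, total]
    for subdomain, passed in results:
        counts[subdomain][0] += int(passed)
        counts[subdomain][1] += 1
    return {name: passes / total for name, (passes, total) in counts.items()}


def flag_weak(per_domain: dict[str, float], floor: float = 0.5) -> list[str]:
    """Sub-domains whose pass rate falls below the chosen floor."""
    return sorted(name for name, rate in per_domain.items() if rate < floor)


# Hypothetical run: strong on two sub-domains, nearly failing on the third.
results = (
    [("booking", True)] * 45 + [("booking", False)] * 5
    + [("faq", True)] * 40
    + [("refunds", True)] * 1 + [("refunds", False)] * 9
)

per_domain = breakdown(results)
print(f"overall: {overall(results):.0%}")  # 86%
for name, rate in sorted(per_domain.items()):
    print(f"  {name}: {rate:.0%}")
print("below floor:", flag_weak(per_domain))  # ['refunds']
```

An 86% headline score looks deployable, while the breakdown shows the agent failing nine of ten refund tasks, which may be the sub-domain that matters most.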

What's Next?

In the next section, Section 30.5: Models, we build on the material covered here.

Further Reading

Agent Benchmarks

Liu, X., Yu, H., Zhang, H., et al. (2023). "AgentBench: Evaluating LLMs as Agents." ICLR 2024. arXiv:2308.03688. Reference agent benchmark suite.

Yao, S., Chen, H., Yang, J., & Narasimhan, K. (2022). "WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents." NeurIPS 2022. arXiv:2207.01206. Reference web-agent benchmark.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024. arXiv:2310.06770. The current standard benchmark for software-engineering agents using real GitHub bug-fix tasks; the canonical reference for tracking coding-agent progress in 2024 to 2026.

Zhou, S., Xu, F. F., Zhu, H., et al. (2024). "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024. arXiv:2307.13854. Realistic multi-site web environment for agent evaluation that goes beyond WebShop's single-domain scope; reference for browser-agent benchmarking.

Tool-Use Benchmarks

Qin, Y., Liang, S., Ye, Y., et al. (2024). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." ICLR 2024. arXiv:2307.16789. Reference large-scale tool-use benchmark.

Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). "Gorilla: Large Language Model Connected with Massive APIs." NeurIPS 2024. arXiv:2305.15334. UC Berkeley's reference work on API-grounded LLMs; the Berkeley Function-Calling Leaderboard that grew out of this paper is the standard reference for tool-use accuracy.