Datasets & Benchmarks

Section 30.4

Agent benchmarks have to test something that traditional NLP benchmarks do not: whether the system can complete an actual multi-step task with side effects. The benchmarks below are the ones with active leaderboards as of 2026.

Six-cell grid of agent benchmark families: software engineering, browser/web, tool-use mechanics, customer service, general assistants, and computer use
Figure 30.4.1: Agent benchmarks sort into six families by task shape. Pick the benchmarks that match your deployment: SWE-bench for code agents, tau-bench for customer service, GAIA for general assistants. An "agent score" reported without a family is meaningless.

30.4.1 Software engineering agent benchmarks

30.4.2 Browser and web agent benchmarks

30.4.3 Tool-use and general agent benchmarks

METR's "time horizon" measurement is the observation that the length of tasks AI agents can reliably complete has been doubling roughly every 7 months. It is the single most-cited longitudinal claim about agent progress in 2025-26. Section 78.3 gives the full context; it matters here as a meta-metric for the trajectory of agent capability rather than as a benchmark in its own right.
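To make the doubling claim concrete, the sketch below treats it as a plain exponential, horizon(t) = h0 * 2^(t / 7), with t in months. The function names and the 60-minute starting horizon are hypothetical, chosen only for illustration. Extrapolating the trend forward is itself an assumption, not a METR result.

```python
import math


def projected_horizon(h0_minutes: float, months_elapsed: float,
                      doubling_months: float = 7.0) -> float:
    """Task horizon after `months_elapsed`, assuming a constant doubling time."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)


def months_to_reach(h0_minutes: float, target_minutes: float,
                    doubling_months: float = 7.0) -> float:
    """Months needed to grow from h0 to target under the same assumption."""
    return doubling_months * math.log2(target_minutes / h0_minutes)


# Hypothetical starting point: agents reliably complete 60-minute tasks today.
after_14_months = projected_horizon(60, 14)   # two doublings
months_to_8_hours = months_to_reach(60, 480)  # three doublings
```

Under these assumptions, two doublings take a 60-minute horizon to 240 minutes, and reaching an 8-hour horizon takes three doublings, or 21 months.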

30.4.4 Comparing the benchmarks

Table 30.4.1a: Agent benchmarks (2026).
Benchmark | Domain | SOTA pass rate | Caveat
SWE-bench Verified | Real GitHub issues | ~60-75% (frontier) | Verified subset
WebArena | Web tasks | ~50% | Containerized envs only
GAIA | General assistance | ~70% (frontier) | Test set partially private
BFCL | Function calling | ~90% (frontier) | Synthetic functions
tau-bench | Customer service | ~50-70% | Two domains only

Warning: Beware single-number rankings

Agent benchmarks aggregate many sub-tasks. A high single-number score can hide catastrophic failure on a critical sub-domain. Always look at the breakdown.
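A minimal sketch of what looking at the breakdown means in practice: compute the pass rate per sub-domain next to the overall rate and flag any sub-domain below a floor. The sub-domain names, counts, and the 0.5 floor are invented for illustration and do not come from any real leaderboard.

```python
from collections import defaultdict


def pass_rates(results):
    """Return (overall_rate, per_subdomain_rates) from (subdomain, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for subdomain, passed in results:
        totals[subdomain] += 1
        passes[subdomain] += int(passed)
    overall = sum(passes.values()) / sum(totals.values())
    per_subdomain = {name: passes[name] / totals[name] for name in totals}
    return overall, per_subdomain


def weak_subdomains(per_subdomain, floor=0.5):
    """Names of sub-domains whose pass rate falls below `floor`."""
    return sorted(name for name, rate in per_subdomain.items() if rate < floor)


# Hypothetical run: 90 easy "lookup" tasks and 10 critical "refund" tasks.
results = (
    [("lookup", True)] * 85 + [("lookup", False)] * 5
    + [("refund", True)] * 1 + [("refund", False)] * 9
)
overall, per_subdomain = pass_rates(results)
flagged = weak_subdomains(per_subdomain)
```

In this made-up run the headline score is 86%, yet the agent passes only 1 of the 10 refund tasks. The aggregate hides that failure because the easy sub-domain dominates the task count.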

What's Next?

In the next section, Section 30.5: Models, we build on the material covered here.

Further Reading

Agent Benchmarks

Liu, X., Yu, H., Zhang, H., et al. (2023). "AgentBench: Evaluating LLMs as Agents." ICLR 2024. arXiv:2308.03688. Reference agent benchmark suite.
Yao, S., Chen, H., Yang, J., & Narasimhan, K. (2022). "WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents." NeurIPS 2022. arXiv:2207.01206. Reference web-agent benchmark.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024. arXiv:2310.06770. The current standard benchmark for software-engineering agents using real GitHub bug-fix tasks; the canonical reference for tracking coding-agent progress in 2024 to 2026.
Zhou, S., Xu, F. F., Zhu, H., et al. (2024). "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024. arXiv:2307.13854. Realistic multi-site web environment for agent evaluation that goes beyond WebShop's single-domain scope; reference for browser-agent benchmarking.

Tool-Use Benchmarks

Qin, Y., Liang, S., Ye, Y., et al. (2024). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." ICLR 2024. arXiv:2307.16789. Reference large-scale tool-use benchmark.
Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). "Gorilla: Large Language Model Connected with Massive APIs." NeurIPS 2024. arXiv:2305.15334. UC Berkeley's reference work on API-grounded LLMs; the Berkeley Function-Calling Leaderboard that grew out of this paper is the standard reference for tool-use accuracy.