Agent benchmarks have to test something that traditional NLP benchmarks do not: whether the system can complete an actual multi-step task with side effects. The benchmarks below are the ones with active leaderboards as of 2026.
30.4.1 Software engineering agent benchmarks
- SWE-bench, SWE-bench Verified, SWE-bench Multimodal (2024-Q4), and SWE-bench Live / Pro (2025): the most-cited coding-agent benchmark family. SWE-bench Verified is the 500-issue human-validated subset. SWE-bench Multimodal adds image-bearing issues (UI bugs, dashboards). SWE-bench Live and Pro are 2025 contamination-resistant variants: Live periodically refreshes the issue pool so that issues memorized during training age out, and Pro draws on repositories that are less likely to appear in training data.
- SWE-Lancer (OpenAI, 2025) is a coding-agent benchmark built from more than 1,400 real freelance jobs posted on Upwork, each carrying its actual payout (about $1 million in total). Its objective is to measure the economic value an agent could generate, not just its pass rate, which matters when you want to know what an agent is actually worth. The core concept is per-task dollar amounts plus a managerial-task variant in which the agent reviews competing implementation proposals and selects the best one. Pick SWE-Lancer when you want a benchmark that translates to dollars; for comparability with the literature, SWE-bench is still the default.
- AgentBench (Tsinghua, 2023) is a broad agent-evaluation suite covering 8 distinct environments: OS interaction, databases, knowledge graphs, card games, lateral-thinking puzzles, web browsing, web shopping, and household tasks. Its objective is to test cross-domain agent capabilities so that a single number reflects more than one specialty, which matters because a SWE-bench-only score says nothing about performance outside coding. The core concept is per-environment task pools with environment-specific success criteria. Pick AgentBench for breadth; for specific domains (code, web, customer service), domain-specific benchmarks are more diagnostic.
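The difference between pass rate and dollar-weighted scoring, as SWE-Lancer frames it, can be illustrated with a small sketch. This is not SWE-Lancer's own harness: the `TaskResult` type, the task names, and the payouts below are hypothetical, chosen only to show how the two numbers can diverge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str       # hypothetical identifier, not a real SWE-Lancer task
    payout_usd: float  # dollar value attached to the task
    passed: bool       # whether the agent's submission passed the task's checks


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved, ignoring what each task is worth."""
    return sum(r.passed for r in results) / len(results)


def earned_fraction(results: list[TaskResult]) -> float:
    """Fraction of the total available payout that the agent earned."""
    total = sum(r.payout_usd for r in results)
    earned = sum(r.payout_usd for r in results if r.passed)
    return earned / total


# Made-up results: three cheap fixes solved, one expensive feature failed.
results = [
    TaskResult("small-bugfix-1", 250.0, True),
    TaskResult("small-bugfix-2", 500.0, True),
    TaskResult("small-bugfix-3", 250.0, True),
    TaskResult("large-feature-1", 4000.0, False),
]

print(f"pass rate:       {pass_rate(results):.2f}")
print(f"earned fraction: {earned_fraction(results):.2f}")
```

In this made-up example the agent solves three of four tasks but earns only a fifth of the available money, because the single failure is the most valuable task.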
30.4.2 Browser and web agent benchmarks
- WebArena (Carnegie Mellon, 2023) is the containerized web-agent benchmark featuring four realistic sites (an e-commerce store, GitLab, a CMS, a Reddit-like forum) on which the agent must complete 812 tasks, some of which span more than one site. Its objective is reproducible browser-agent evaluation without depending on real websites whose content changes weekly, which matters because benchmarks built on the live web become impossible to reproduce within months. The core concept is fully self-hosted Docker images of the four sites so every run starts from the same state. Pick WebArena for any browser-agent comparison; for multimodal browser tasks, use VisualWebArena.
- VisualWebArena (Carnegie Mellon, 2024) is WebArena's multimodal extension where tasks include images (find a similar product, identify which checkbox to click in a screenshot). Its objective is to evaluate the visual grounding agents need for real web use, which matters because text-only browser agents miss the half of web UIs that are images or icons. Pick VisualWebArena when your agent uses vision; pair with WebArena for non-visual baseline comparison.
- BrowseComp (OpenAI, 2025) is a web-research benchmark of hard questions whose short, checkable answers can only be found through multi-page browsing and synthesis. Its objective is to test agents on questions that a single search query cannot answer, which matters as a frontier evaluation for research-agent capability. The core concept is curated questions with hidden gold answers and a strict scoring rubric. Pick BrowseComp as an end-to-end web-research benchmark; the questions are deliberately hard, so expect only frontier models to score above 30%.
- MMInA (Multi-hop Multimodal Internet Agents, 2024) is a multi-hop information-seeking benchmark on real websites, requiring multi-page reasoning over text + images. Its objective is to push beyond single-page or single-modality tasks into realistic multi-step research, which matters when your agent needs to combine evidence from different sites. Pick MMInA when evaluating multi-hop multimodal agents.
30.4.3 Tool-use and general agent benchmarks
- Berkeley Function-Calling Leaderboard (BFCL) (UC Berkeley, 2023; v3 2024) is the standard tool-use benchmark measuring whether models correctly select, parameterize, and chain function calls. Its objective is to evaluate the mechanics of tool calling specifically (parsing, parameter types, parallel calls, missing-parameter handling), which matters because function-calling errors cascade silently in agents. The core concept is a curated tool catalog plus prompts with gold-standard expected calls; BFCL v3 added multi-turn and multi-step evaluation. Use BFCL to choose a function-calling-capable model; the synthetic functions limit external validity, so cross-check with tau-bench.
- tau-bench (Sierra Research, 2024) is the customer-service agent benchmark covering airline and retail domains with realistic policies, tools, and user-simulator interactions. Its objective is to test whether an agent can follow domain-specific business rules under user pressure, which matters because real customer-service agents must refuse off-policy requests gracefully. The core concept is multi-turn user-simulator conversations where success requires both task completion and policy compliance. Pick tau-bench when your application is customer-service-shaped; for general agents, GAIA is broader.
- GAIA (Meta + HF, 2023) and GAIA-2 (2025) are general-AI-assistant benchmarks of real-world questions requiring web search, file operations, multimodal reasoning, and tool use. GAIA-2 added harder multi-step tasks and updated the gold answers. Pick GAIA-2 over the original for current evaluation; the test set is small, so be careful about over-fitting to it.
- AssistantBench (Yoran et al., 2024): web-research assistant benchmark; relevant for evaluating "find this on the web and answer" tasks.
- τ²-bench (tau-bench v2, Sierra Research, 2025): improved customer-service benchmark that fixes several validation issues from v1 and adds a telecom domain in which the simulated user can also act on the environment.
- MLE-bench (OpenAI, 2024-10): agents doing ML engineering tasks; the right benchmark for "agent submits a Kaggle-style competition entry".
- AgentClinic (multi-turn medical agent benchmark, 2024): vertical clinical reasoning.
- BrowseComp (OpenAI, 2025), described in Section 30.4.2, also belongs on this list as the standard for "deep web research" capability.
- OpenHands (Wang et al., 2024): open platform for AI software developer agents; the relevant infrastructure paper.
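The idea behind BFCL's gold-standard expected calls can be sketched as a strict comparison between the call a model produced and the call that was expected. The following is a simplified illustration, not BFCL's actual checker: the call format (a dict with `name` and `arguments`), the function `get_weather`, and the exact-match rule are all assumptions made for the example.

```python
from typing import Any

Call = dict[str, Any]  # assumed shape: {"name": str, "arguments": dict}


def call_matches(predicted: Call, expected: Call) -> bool:
    """True if the function name, parameter names, value types, and values all match."""
    if predicted.get("name") != expected["name"]:
        return False
    pred_args = predicted.get("arguments", {})
    gold_args = expected["arguments"]
    # Missing or extra parameters count as a failure.
    if pred_args.keys() != gold_args.keys():
        return False
    # Require the same type as well as the same value, so "3" does not match 3.
    return all(
        type(pred_args[k]) is type(gold_args[k]) and pred_args[k] == gold_args[k]
        for k in gold_args
    )


def parallel_calls_match(predicted: list[Call], expected: list[Call]) -> bool:
    """Order-insensitive match: every predicted call pairs with a distinct gold call."""
    if len(predicted) != len(expected):
        return False
    remaining = list(expected)
    for call in predicted:
        for i, gold in enumerate(remaining):
            if call_matches(call, gold):
                del remaining[i]
                break
        else:
            return False  # no unmatched gold call fits this prediction
    return True


# Hypothetical gold calls for an illustrative weather and currency tool set.
gold_weather = {"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}
gold_fx = {"name": "convert", "arguments": {"amount": 10.0, "to": "EUR"}}

wrong_type = {"name": "get_weather", "arguments": {"city": "Paris", "days": "3"}}
missing_param = {"name": "get_weather", "arguments": {"city": "Paris"}}

print(call_matches(gold_weather, gold_weather))
print(call_matches(wrong_type, gold_weather))
print(call_matches(missing_param, gold_weather))
print(parallel_calls_match([gold_fx, gold_weather], [gold_weather, gold_fx]))
```

A real checker typically has to be more forgiving than this (for example, accepting several equivalent values for one parameter), but the sketch shows why a wrong parameter type or a dropped parameter counts as a failure even when the model chose the right function.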
METR's "time horizon" measurement (the observation that the duration of tasks AI agents can reliably complete has been doubling roughly every 7 months) is the single most-cited longitudinal claim about agent progress in 2025-26. See Section 78.3 for the full context; relevant here as a meta-metric for agent capability trajectory.
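To make the doubling claim concrete, the arithmetic can be written as a short sketch. The 7-month doubling period comes from the text above; the 60-minute starting horizon is a hypothetical number chosen for illustration, and the functions simply extrapolate the trend rather than predict anything.

```python
import math


def projected_horizon(current_minutes: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Task duration an agent could handle after `months_ahead`, if the trend held."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_to_reach(current_minutes: float, target_minutes: float,
                    doubling_months: float = 7.0) -> float:
    """Months of steady doubling needed to grow from the current to the target horizon."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# Hypothetical starting point: a 60-minute horizon today.
print(projected_horizon(60, 14))  # two doublings
print(months_to_reach(60, 480))   # three doublings to reach 8 hours
```

Two doublings take the hypothetical 60-minute horizon to 240 minutes after 14 months, and reaching an 8-hour horizon takes three doublings, or 21 months, under the same assumption.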
30.4.4 Comparing the benchmarks
| Benchmark | Domain | SOTA pass rate | Caveat |
|---|---|---|---|
| SWE-bench Verified | Real GitHub issues | ~60-75% (frontier) | 500 human-validated issues; Python repositories only |
| WebArena | Web tasks | ~50% | Containerized envs only |
| GAIA | General assistance | ~70% (frontier) | Test set partially private |
| BFCL | Function calling | ~90% (frontier) | Synthetic functions |
| tau-bench | Customer service | ~50-70% | Two domains only |
Agent benchmarks aggregate many sub-tasks. A high single-number score can hide catastrophic failure on a critical sub-domain. Always look at the breakdown.
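A minimal sketch of such a breakdown follows. The sub-domain names, the result counts, and the 0.5 alert threshold are invented for illustration; the point is only that the aggregate and the per-domain numbers should be reported side by side.

```python
from collections import defaultdict


def per_domain_pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate for each sub-domain, given (domain, passed) pairs."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for domain, passed in results:
        totals[domain] += 1
        passes[domain] += int(passed)
    return {domain: passes[domain] / totals[domain] for domain in totals}


def weak_domains(rates: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Sub-domains whose pass rate falls below the (assumed) threshold."""
    return sorted(d for d, rate in rates.items() if rate < threshold)


# Made-up results: 90 easy "search" tasks and 10 critical "refund" tasks.
results = (
    [("search", True)] * 88 + [("search", False)] * 2
    + [("refund", True)] * 1 + [("refund", False)] * 9
)

overall = sum(passed for _, passed in results) / len(results)
rates = per_domain_pass_rates(results)

print(f"overall: {overall:.2f}")
for domain, rate in sorted(rates.items()):
    print(f"{domain}: {rate:.2f}")
print("below threshold:", weak_domains(rates))
```

Here the headline score is 0.89, yet the agent fails nine of the ten refund tasks, which is exactly the kind of failure a single number hides.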
What's Next?
In the next section, Section 30.5: Models, we build on the material covered here.