Models

Section 30.5

Agentic workloads put unusual pressure on models. They need strong function calling, instruction following over long traces, the ability to recover from errors, and (for browser and code-use agents) multimodal grounding. Model recommendations therefore differ from those for a generic chat workload.

[Figure: three-panel summary of what agents need from a model. Panels: tool-call reliability (with reasoning depth); long-trace coherence (with context window and code specialty); how agent workloads differ from chat.]
Figure 30.5.1: Agent workloads stress different model axes than chat. Tool-call reliability compounds across steps; long-trace coherence prevents derailment; reasoning depth helps when single-step heuristics fail. A model topping the chat leaderboard may still trail on SWE-bench.

30.5.1 Frontier models for agents (2026)

30.5.2 Open-weights options

30.5.3 Vision-language models for browser agents

Browser-using agents need to read screenshots, not just the DOM. The relevant models:

Real-World Scenario
Pattern: the WebVoyager annotate-then-act pipeline

The reference design that turns a screen-reading VLM into a working browser agent is WebVoyager (He et al., 2024). The agent loop described here has five stages, wired together as nodes in a LangGraph state machine with one node per tool.

Stage 1 (capture): drive a headless Chrome instance via Playwright, screenshot the page, and pull the accessibility tree, whose roles and names follow the W3C ARIA standard.

Stage 2 (annotate): for every actionable ARIA element (button, link, input), overlay a numbered coloured bounding box on the screenshot and emit a parallel list of {id, role, accessible_name} tuples.

Stage 3 (predict): send both the annotated image and the textual element list to the VLM with a prompt that asks for the next action as Action[id, value], where the id refers to one of the numbered boxes.

Stage 4 (parse and dispatch): parse the model's output into one of a small set of typed tool nodes (click(id), type(id, text), scroll(direction), go_back(), answer(text)), each implemented as a Playwright call.

Stage 5 (scratchpad): append the action and the resulting new screenshot to a scratchpad that limits context growth (typically keep only the last three screenshots; older actions become text-only summaries).

A conditional edge routes from "parse" back into "capture" until the model emits answer. The screenshot-annotation step is what makes the loop reliable: without numbered boxes the VLM has to write pixel coordinates and often gets them wrong; with them the action space collapses to picking an integer.
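Stages 4 and 5 can be sketched without a browser. The following is a minimal illustration, not WebVoyager's or LangGraph's actual code: the wire format `tool[args]`, the `Action` and `Scratchpad` classes, and the `parse_action` helper are all assumptions made for this example. It shows the two ideas that keep the loop stable: rejecting any action whose id is not one of the numbered boxes, and capping how many screenshots stay in context.

```python
import re
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional, Set

# Assumed wire format for this sketch: "<tool>[<args>]", e.g. "type[7, wireless mouse]".
ACTION_RE = re.compile(r"^\s*(click|type|scroll|go_back|answer)\[(.*)\]\s*$", re.DOTALL)


@dataclass
class Action:
    tool: str
    element_id: Optional[int] = None
    value: Optional[str] = None


def parse_action(raw: str, valid_ids: Set[int]) -> Action:
    """Turn the model's text into a typed action, rejecting malformed output."""
    match = ACTION_RE.match(raw)
    if match is None:
        raise ValueError(f"unparseable action: {raw!r}")
    tool, args = match.group(1), match.group(2).strip()
    if tool in ("click", "type"):
        id_text, _, rest = args.partition(",")
        id_text = id_text.strip()
        # The id must be one of the numbered boxes drawn in the annotate stage.
        if not id_text.isdigit() or int(id_text) not in valid_ids:
            raise ValueError(f"unknown element id in: {raw!r}")
        return Action(tool, int(id_text), rest.strip() if tool == "type" else None)
    if tool == "scroll":
        if args not in ("up", "down"):
            raise ValueError(f"bad scroll direction in: {raw!r}")
        return Action("scroll", value=args)
    if tool == "go_back":
        return Action("go_back")
    return Action("answer", value=args)


@dataclass
class Scratchpad:
    """Keeps every action as text but only the most recent screenshots."""
    max_screenshots: int = 3
    history: List[str] = field(default_factory=list)
    screenshots: Deque[bytes] = field(default_factory=deque)

    def record(self, action: Action, screenshot: bytes) -> None:
        step = len(self.history) + 1
        self.history.append(f"{step}. {action.tool}(id={action.element_id}, value={action.value!r})")
        self.screenshots.append(screenshot)
        while len(self.screenshots) > self.max_screenshots:
            self.screenshots.popleft()  # older steps survive only as text
```

In a real agent, a `ValueError` from `parse_action` would typically be fed back to the model as an observation rather than crashing the loop, and each `Action` would be dispatched to the corresponding Playwright call.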

30.5.4 Comparing the models

Table 30.5.1a: Agent-capable models (mid-2026).

Model             | Best for                   | SWE-bench Verified | Access
Claude Opus 4.5   | Coding agents, long traces | ~70-75%            | API only
Claude Sonnet 4.5 | Cost-efficient agents      | ~55-65%            | API only
GPT-5 / o3        | Reasoning-heavy agents     | ~60-70%            | API only
Gemini 2.5 Pro    | Long-context agents        | ~50-60%            | API only
Llama-4 70B       | Self-hosted agents         | ~30-45%            | Open weights

Key Insight
Tool-call accuracy is one metric, long-trace context fidelity is the other

Most short-trace agent failures trace back to a single bad tool call: wrong arguments, a missing field, or the wrong tool entirely. Because every call in a trace has to succeed, small per-call error rates compound across steps, which makes per-call function-calling accuracy more predictive of agent success than general benchmark strength. Claude and GPT-5 lead this metric as of mid-2026; open-weights models trail by 5 to 20 percentage points. Long-trace agent failures have a different root cause: context fidelity degrades as the trace grows past 50K tokens, and the model forgets or contradicts earlier tool results. Claude's 1M context with high-fidelity retrieval and o3's reasoning-tracing are the current leaders on long-trace fidelity. Plan tests for both metrics before locking in an agent model.
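A back-of-the-envelope calculation shows why per-call accuracy matters so much. The sketch below assumes that tool calls fail independently and that a trace succeeds only if every call succeeds; real agents can sometimes recover from a bad call, so treat the numbers as a pessimistic bound. The function names are illustrative.

```python
def trace_success_probability(per_call_accuracy: float, num_calls: int) -> float:
    """Probability that all calls in a trace succeed, assuming independent failures."""
    if not 0.0 <= per_call_accuracy <= 1.0:
        raise ValueError("per_call_accuracy must be in [0, 1]")
    return per_call_accuracy ** num_calls


def required_per_call_accuracy(target_trace_success: float, num_calls: int) -> float:
    """Per-call accuracy needed to hit a target whole-trace success rate."""
    return target_trace_success ** (1.0 / num_calls)


if __name__ == "__main__":
    for accuracy in (0.99, 0.95, 0.90):
        row = ", ".join(
            f"{n} calls: {trace_success_probability(accuracy, n):.1%}"
            for n in (5, 20, 50)
        )
        print(f"per-call {accuracy:.0%} -> {row}")
```

Under these assumptions, a 20-call trace at 95% per-call accuracy completes cleanly only about 36% of the time, versus about 82% at 99%. A few points of per-call accuracy therefore translate into a large gap in end-to-end success.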

Key Insight: Test-time scaffolding is a production technique

"Sample multiple traces, pick the best with a verifier" (best-of-N at the agent step level) is a production technique that lifts capability without retraining. Combined with self-consistency voting and external verifiers (test-suite pass, parse success, retrieval-confidence score), test-time scaffolding can lift mid-tier models toward flagship-level results, at a cost that grows roughly in proportion to the number of sampled traces. The 2025 frontier reasoning models do something similar internally; for agents on smaller bases, you implement it yourself.

What's Next?

In the next section, Section 30.6: External Reading & Communities, we build on the material covered here.

Further Reading

Agent Models

OpenAI (2024). "GPT-4o System Card." openai.com/index/gpt-4o-system-card. Reference for an agent-capable frontier LLM.
Anthropic (2024). "Claude 3.5 Sonnet." anthropic.com/news/claude-3-5-sonnet. Reference for an agent-capable LLM with computer-use support.
DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. Reference open-weight agent-capable model.