Chapter 29: Specialized Agents | Building Language AI

Chapter opener illustration: Specialized Agents.

"A jack of all trades is a master of none, but oftentimes better than a master of one."
Agent X, Versatile AI Agent

Looking Back

Chapters 26-28 covered agents as a general pattern. This chapter zooms in on the specializations that actually ship: code agents (Cursor, Claude Code, Devin), browser agents (web navigation, form-filling), research agents (deep research, Open Deep Research), and the benchmarks that measure them (SWE-bench, WebArena, GAIA). The patterns here are the most production-grade in the book.

Chapter Overview

In late 2025, Anthropic's Claude Code passed 60 percent on SWE-bench Verified, the benchmark of real-world GitHub issues that two years earlier had seemed unreachable. Cursor agents now run autonomously on multi-file pull requests; Anthropic's Computer Use models book travel by clicking buttons; OpenAI's Deep Research compiles 30-source literature reviews in under ten minutes. None of these systems are general agents. They are specialists, tuned for one domain, with prompt scaffolding, custom tools, and evaluation benchmarks that look nothing alike. This chapter is the field guide to which specialization wins which task, and which scaffolds you can copy into your own product.

The chapter also covers domain-specific agent design patterns for healthcare, legal, finance, and customer service, where compliance constraints, safety requirements, and domain knowledge integration demand careful architectural choices. It concludes with a detailed examination of AI-generated code quality, security vulnerabilities, and trust calibration strategies for human-AI collaboration in software engineering.

Big Picture

While Chapters 26 through 28 cover general agent principles, this chapter focuses on domain-specific agent types: coding assistants, research agents, data analysis agents, and more. Understanding specialization patterns helps you design agents that excel at specific tasks rather than being mediocre generalists.

Learning Objectives

Design code generation agent architectures using self-debugging loops, test-driven development, and SWE-bench evaluation patterns
Build browser and web agents using Playwright MCP, Stagehand, and WebArena-style task automation
Implement computer use agents with screenshot-based reasoning, GUI automation, and desktop interaction using Anthropic Computer Use
Construct research and data analysis agents for literature review, scientific discovery workflows, and data pipeline automation
Apply domain-specific agent design patterns for healthcare, legal, finance, and customer service with appropriate compliance constraints
Evaluate AI-generated code quality using static analysis tools (CodeQL, Semgrep, Bandit) and establish trust calibration for code review

Prerequisites

Chapter 26: AI Agent Foundations (agent architectures, memory, planning)
Chapter 27: Tool Use, Function Calling & Protocols (function calling, MCP, tool design)
Chapter 11: LLM APIs (chat completions, streaming, structured outputs)
Experience building at least one simple agent or tool-calling pipeline

Sections

Lab 29: Build a Browser-Use Agent That Fills a Form Across Tabs

Objective

Implement a browser-using agent, with either the browser-use Python library or Anthropic's computer_use tool, that opens a real website, navigates multiple pages, and submits a form. By the end you will see why computer-use agents are so much harder than tool-use agents, and where the failure modes cluster.

Steps

Step 1: Pick a target. Use a benign sandbox: the-internet.herokuapp.com (Sauce Labs test playground), specifically its "Multiple Windows" and "Form Authentication" pages.
Step 2: Install browser-use. Run pip install browser-use playwright, then playwright install chromium. Then run the library's hello-world example (a Google search driven by an LLM).
Step 3: Single-tab task. Write: "Log in at /login with username 'tomsmith' and password 'SuperSecretPassword!'. Confirm success." Run with browser-use backed by GPT-4o. Watch the trace; capture screenshots at each step.
Step 4: Multi-tab task. Extend to: "Open /windows, click the link, then come back and report what page title you saw." The agent must manage two tabs.
Step 5: Failure analysis. Run the same task 5 times and tally the success rate. Inspect two of the failures: was the cause a click target that changed, a hallucination (the agent reports "I see a button" that does not exist), or an action taken in the wrong tab? This is where computer-use research lives.
Step 6: Library alternative. Try the same task with Anthropic's computer_use API (pixel-based). Compare reliability and cost per task. The DOM-based and pixel-based approaches have different failure modes.
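
Steps 5 and 6 amount to a small benchmarking harness: run the task several times, label each failure, and tally the results. The following is a minimal sketch in Python, not part of either library. It assumes a hypothetical run_task callable that wraps whichever agent you are testing and returns a (success, failure_kind) pair. The failure labels mirror the three categories in Step 5 and are assigned by you after reading the trace; nothing here detects them automatically. The stub runner exists only so the sketch runs without a browser or an API key.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

# Failure categories from Step 5; the names are illustrative, not a library API.
FAILURE_KINDS = ("click_target_changed", "hallucinated_element", "wrong_tab")

# A runner takes the run index and returns (success, failure_kind).
# failure_kind is None when the run succeeded.
Runner = Callable[[int], Tuple[bool, Optional[str]]]


@dataclass
class Tally:
    runs: int = 0
    successes: int = 0
    failures: Counter = field(default_factory=Counter)

    @property
    def success_rate(self) -> float:
        # Guard against division by zero when no runs were recorded.
        return self.successes / self.runs if self.runs else 0.0


def benchmark(run_task: Runner, n_runs: int = 5) -> Tally:
    """Run the same task n_runs times and tally outcomes by failure kind."""
    tally = Tally()
    for i in range(n_runs):
        ok, kind = run_task(i)
        tally.runs += 1
        if ok:
            tally.successes += 1
        else:
            # Unrecognized labels are counted under "other" rather than dropped.
            tally.failures[kind if kind in FAILURE_KINDS else "other"] += 1
    return tally


def stub_runner(run_index: int) -> Tuple[bool, Optional[str]]:
    """Stand-in for a real agent call: fails on runs 1 and 3 with fixed labels."""
    scripted = {1: "wrong_tab", 3: "hallucinated_element"}
    kind = scripted.get(run_index)
    return (kind is None, kind)


if __name__ == "__main__":
    result = benchmark(stub_runner, n_runs=5)
    print(f"success rate: {result.success_rate:.0%}")
    for kind, count in result.failures.most_common():
        print(f"  {kind}: {count}")
```

To use this with a real agent, replace stub_runner with a function that launches the browser task, checks the final page for the expected confirmation, and returns the label you assign after inspecting the trace. Keeping the runner behind a plain callable means the same harness can wrap both the DOM-based and the pixel-based agent in Step 6.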

Expected Output

Expected time: 3 to 4 hours. Difficulty: intermediate. Artifact: a benchmarked browser agent with success-rate logs.
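
One possible shape for the success-rate logs, sketched under assumptions: one JSON record per run, with a backend label ("dom" for browser-use, "pixel" for computer_use), a success flag, and a cost that you compute yourself from token usage. The field names and both helper functions are hypothetical, chosen for illustration, and the sample values are placeholders rather than measured results.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize run records as JSON Lines, one run per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def summarize(jsonl_text: str) -> Dict[str, dict]:
    """Aggregate success rate and mean cost per task for each backend."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines
            record = json.loads(line)
            grouped[record["backend"]].append(record)

    summary: Dict[str, dict] = {}
    for backend, runs in grouped.items():
        summary[backend] = {
            "runs": len(runs),
            # True counts as 1 and False as 0 when summed.
            "success_rate": sum(r["success"] for r in runs) / len(runs),
            "mean_cost_usd": sum(r["cost_usd"] for r in runs) / len(runs),
        }
    return summary


if __name__ == "__main__":
    # Placeholder values purely to exercise the code; not measured results.
    sample = [
        {"backend": "dom", "task": "login", "success": True, "cost_usd": 0.25},
        {"backend": "dom", "task": "login", "success": False, "cost_usd": 0.75},
        {"backend": "pixel", "task": "login", "success": True, "cost_usd": 1.0},
    ]
    for backend, stats in summarize(to_jsonl(sample)).items():
        print(backend, stats)
```

Logging one record per run, rather than only the final tally, keeps the raw data available for the reliability and cost-per-task comparison that Step 6 asks for.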

What's Next?

Next: Chapter 30: Tools of the Trade, Agent Stack. Chapter 30 closes Part VI with the consolidated agentic toolbox: LangGraph, LlamaIndex agents, CrewAI, AutoGen, OpenAI Agents SDK, Anthropic's claude-agent-sdk, the MCP registry, agent observability stacks (Langfuse, AgentOps), and the eval rigs for trajectory-level testing. Then Part VII attacks the agent's biggest weakness so far: it can plan and act, but it does not know things outside its training data. Retrieval changes that.