Section 26.1: What Makes an LLM an Agent (and What Doesn't)

The best way to predict the future is to invent it. The second best way is to build an agent and let it figure things out.
Agent X, Self-Starting AI Agent

Big Picture

An AI agent is an LLM operating in a loop. Instead of producing a single response, an agent repeatedly perceives its environment, reasons about what to do, takes an action, and observes the result. This perception-reasoning-action cycle is the fundamental abstraction that transforms language models from passive text generators into active problem solvers. Understanding this loop, and the design patterns built on top of it, is essential for building any agentic system. The ReAct framework from Section 12.2 introduced the reasoning-plus-action pattern that agents formalize.

Prerequisites

This section assumes familiarity with LLM API basics from Section 11.1 and prompt engineering fundamentals from Section 12.1. An understanding of chain-of-thought reasoning (Section 8.1) and the ReAct pattern (Section 12.2) will be particularly helpful, as the agent loop builds directly on those ideas.

Figure 26.1.1: The agent control room: observe the environment, plan the next move, execute an action, and repeat. It is an endless loop of perceive, think, act.

26.1.1 What Makes an Agent?

The Perception-Reasoning-Action Loop

Every agent runs the same four-step cycle. The agent perceives by reading input: user messages, tool outputs, results from the previous step. It reasons with the language model about the next move. It acts: call a tool, write a response, ask for clarification. Then it observes the result and loops back to perceive. Figure 26.1.3 shows the loop.

Figure 26.1.3: The perception-reasoning-action loop that defines all AI agents
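As a minimal sketch of the cycle above, the loop below replaces the language model with a hand-written stand-in so that it runs without an API key. The names `run_loop`, `toy_policy`, and the `add` tool are illustrative assumptions, not part of any library.

```python
from typing import Callable

# Hypothetical stand-in for the LLM: maps everything observed so far to the next action.
# A real agent would call a language model here.
def toy_policy(observations: list[str]) -> tuple[str, str]:
    if not observations:
        return ("add", "2,3")            # first step: use a tool
    return ("finish", observations[-1])  # then answer with the last result

def add(args: str) -> str:
    """Add two comma-separated integers."""
    a, b = (int(x) for x in args.split(","))
    return str(a + b)

def run_loop(policy, tools: dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        action, arg = policy(observations)       # perceive + reason
        if action == "finish":                   # act: respond and stop
            return arg
        observations.append(tools[action](arg))  # act: call a tool, then observe
    return "step budget exhausted"

print(run_loop(toy_policy, {"add": add}))  # prints 5
```

The step budget matters even in this toy: a policy that never emits `finish` would otherwise loop forever.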

Agents vs. Chains vs. Workflows

The simple-to-complex orchestration spectrum has three rungs. A chain is a fixed sequence of LLM calls; you wrote the steps in code. A workflow adds if/else and loops, but you wrote those branches too. An agent hands control flow to the LLM: it picks which tool to call, in what order, and when to stop. Use the hybrid ML/LLM decision framework from Chapter 13 to decide whether the extra autonomy is worth the unpredictability.

Table 26.1.1a: Aspect Comparison (as of 2026).

Aspect	Chain	Workflow	Agent
Control flow	Fixed sequence	Developer-defined conditionals	LLM-determined
Steps known in advance	Yes, always	Paths defined, selection dynamic	No, emergent
Determinism	High	Medium	Low
Error handling	Static retry logic	Branching on error type	Model reasons about recovery
Complexity	Simple	Moderate	High
Best for	Predictable pipelines	Structured tasks with variants	Open-ended problem solving
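The difference between the three columns is easiest to see in who owns the control flow. The sketch below is an illustration under stated assumptions: `LLM` is a hypothetical prompt-in, text-out callable, and `fake_llm` is a deterministic stand-in so the example runs offline.

```python
from typing import Callable

LLM = Callable[[str], str]  # assumed interface: prompt in, text out

def fake_llm(prompt: str) -> str:
    # Deterministic stand-in for a real model.
    if prompt.startswith("next step"):
        return "DONE" if "[translated]" in prompt else "translate"
    op, _, body = prompt.partition(": ")
    return f"[{op}d] {body}"

def chain(llm: LLM, text: str) -> str:
    # Fixed sequence: always summarize, then translate.
    summary = llm(f"summarize: {text}")
    return llm(f"translate: {summary}")

def workflow(llm: LLM, text: str) -> str:
    # Developer-written branch: the code, not the model, picks the path.
    if len(text) > 40:
        text = llm(f"summarize: {text}")
    return llm(f"translate: {text}")

def agent(llm: LLM, text: str, max_steps: int = 4) -> str:
    # Model-chosen control flow: the LLM names the next step or says DONE.
    for _ in range(max_steps):
        choice = llm(f"next step for: {text}")
        if choice == "DONE":
            break
        text = llm(f"{choice}: {text}")
    return text

print(chain(fake_llm, "hello"))  # prints [translated] [summarized] hello
print(agent(fake_llm, "hello"))  # prints [translated] hello
```

Only `agent` can take a different number of steps on each run, which is the source of both its flexibility and its unpredictability.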

Key Insight

Start with the simplest approach that works. Anthropic and other leading AI labs recommend using agents only when simpler patterns fail. Chains are easiest to debug and most predictable. Workflows add flexibility with manageable complexity. Agents provide maximum flexibility but introduce non-determinism, higher latency, and harder debugging. Choose the right level of autonomy for your use case.

Warning: Common Misconception: Agents Are Autonomous AI

Readers often confuse "agentic" with "autonomous" or even "AGI." An LLM agent is not a sentient system making independent decisions; it is a loop where a language model repeatedly selects the next action from a predefined set of tools. The model has no goals of its own, no persistent state beyond what the developer provides, and no ability to act outside its tool set. When an agent "decides" to call a search API, it is producing a structured text output that matches a tool schema. The autonomy is in action selection within a constrained loop, not in general intelligence. This distinction matters for both engineering (agents need guardrails, not trust) and for setting realistic expectations with stakeholders.

26.1.2 The Four Agentic Design Patterns

Andrew Ng identified four foundational agentic design patterns that appear across virtually all agent architectures. These patterns can be used individually or composed together, and understanding them provides a vocabulary for designing and analyzing agentic systems.

Figure 26.1.4: Four agentic design patterns define the modern agent landscape: reflection, tool use, planning, and multi-agent coordination. Most real systems combine several.

Pattern 1: Reflection

In the reflection pattern, the LLM reviews its own output and iteratively improves it. This can be as simple as asking the model to critique its response, or as sophisticated as having separate "generator" and "critic" roles. Reflection is powerful because it lets the model catch errors, improve quality, and refine its approach without external feedback.

# Reflection pattern: draft, critique, revise. The same model wears two hats,
# first generator then critic, and we loop until the critic signals no major issues.
from openai import OpenAI

client = OpenAI()

def _chat(messages: list[dict], model: str = "gpt-4o") -> str:
    return client.chat.completions.create(model=model, messages=messages).choices[0].message.content

def reflect_and_improve(task: str, max_rounds: int = 3) -> str:
    """Generate a response, then iteratively critique and revise it."""
    draft = _chat([{"role": "user", "content": task}])
    for _ in range(max_rounds):
        # Ask the critic for specific, actionable feedback on the current draft.
        critique = _chat([
            {"role": "system",
             "content": ("You are a critical reviewer. Find flaws, gaps, and areas for improvement. "
                         "Be specific. If the draft needs no significant changes, "
                         "reply with exactly: No major issues.")},
            {"role": "user",
             "content": f"Task: {task}\n\nDraft:\n{draft}\n\nProvide specific, actionable critique."},
        ])
        if "no major issues" in critique.lower():
            break  # Critic is satisfied; stop early to save tokens.
        # Otherwise revise: feed the original task, the current draft, and the critique back in.
        draft = _chat([
            {"role": "system", "content": "Revise the draft to address all critique points."},
            {"role": "user",
             "content": f"Original task: {task}\n\nCurrent draft:\n{draft}\n\nCritique:\n{critique}\n\nRevised version:"},
        ])
    return draft

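Illustrative log: the function above returns only the final draft and prints nothing, so a trace like the following would appear only in a variant that logs each round and checks the result after each attempt. Attempt 1 failed. Reflection: The function does not handle non-list iterables correctly and uses isinstance(item, list) which misses nested cases. I should use a recursive check for all iterables. Attempt 2 failed. Reflection: The base case fails on an empty list because the function tries to index into it. I should check for empty input first. Solved on attempt 3!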

Code Fragment 26.1.1b: Reflection loop where reflect_and_improve drafts an answer, asks the same model to critique it, and revises until the critic returns "no major issues" or max_rounds is hit. A related critique-and-revise loop appears in Constitutional AI (Bai et al., 2022), where the model critiques and revises its own responses against written principles to produce fine-tuning data.

Pattern 2: Tool Use

Tool use extends the LLM beyond text generation by giving it the ability to call external functions: searching the web, querying databases, executing code, sending emails, or interacting with any API. The model receives tool descriptions, decides when and which tools to call, and incorporates the results into its reasoning. This is covered in depth in Section 26.6.

Tool use is architecturally significant, not merely an API feature. When a model gains the ability to call external functions, it transitions from a closed system (bounded by its training data) to an open system that can interact with the live world. This is the same leap that distinguishes a calculator from a spreadsheet connected to a database. The model's role shifts from "answer generator" to "action coordinator," and the design constraints change accordingly: latency now depends on external services, reliability depends on tool robustness, and safety requires controlling what actions the model can take. Chapter 27 explores these architectural implications in depth, including standardized protocols like MCP and A2A that formalize tool interfaces.
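To make the "action coordinator" role concrete, here is a minimal sketch of the dispatch layer that sits between the model and its tools. The registry contents and the call format `{"name": ..., "arguments": {...}}` are assumptions chosen for illustration; provider APIs define their own schemas.

```python
import json
from typing import Callable

# Hypothetical registry: tool name -> (description, function).
TOOLS: dict[str, tuple[str, Callable[..., str]]] = {
    "get_weather": ("Return a canned forecast for a city.",
                    lambda city: f"Sunny in {city}"),
    "add": ("Add two integers.", lambda a, b: str(a + b)),
}

def dispatch(tool_call_json: str) -> str:
    """Validate and execute one model-emitted tool call.

    Errors are returned as text so the model can read them and recover.
    """
    try:
        call = json.loads(tool_call_json)
        name, args = call["name"], call.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as exc:
        return f"Error: malformed tool call ({exc})"
    if name not in TOOLS:
        return f"Error: unknown tool '{name}'"
    try:
        return TOOLS[name][1](**args)
    except TypeError as exc:  # wrong, missing, or non-mapping arguments
        return f"Error: bad arguments for '{name}' ({exc})"

print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))  # prints 5
```

Because the model can only reach functions listed in `TOOLS`, the registry doubles as the safety boundary the paragraph above describes.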

Pattern 3: Planning

Planning involves the LLM decomposing a complex task into subtasks before executing them. Rather than acting step by step reactively, a planning agent creates an explicit plan, then executes each step while potentially revising the plan based on intermediate results. Plan-and-execute architectures, reflection loops, and tree search methods all fall under this pattern. Section 26.2 covers planning in detail.
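A plan-and-execute skeleton can be sketched in a few lines. In the version below, `toy_planner` and `toy_executor` are hypothetical stand-ins for LLM calls, and replanning on intermediate results is deliberately left out to keep the structure visible.

```python
from typing import Callable

def plan_and_execute(
    task: str,
    planner: Callable[[str], list[str]],
    executor: Callable[[str, list[str]], str],
    max_steps: int = 10,
) -> list[str]:
    """Create a plan up front, then run each step with access to earlier results."""
    plan = planner(task)
    results: list[str] = []
    for step in plan[:max_steps]:  # cap the number of executed steps
        results.append(executor(step, results))
    return results

# Hypothetical stand-ins; in a real agent both would be LLM calls.
def toy_planner(task: str) -> list[str]:
    return [f"research {task}", f"outline {task}", f"write {task}"]

def toy_executor(step: str, prior: list[str]) -> str:
    return f"done: {step} (saw {len(prior)} earlier results)"

for line in plan_and_execute("report", toy_planner, toy_executor):
    print(line)
```

Passing the accumulated `results` to each step is what distinguishes this from three independent calls: later steps can build on earlier ones.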

Pattern 4: Multi-Agent Collaboration

In the multi-agent pattern, multiple LLM instances (each potentially with different system prompts, tools, or roles) collaborate to solve a problem. One agent might research while another writes; a supervisor agent might coordinate workers; or agents might debate to reach a consensus. Chapter 27 is dedicated entirely to multi-agent architectures. Figure 26.1.5 summarizes these four patterns.

Figure 26.1.5: The four agentic design patterns (Ng, 2024)
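The supervisor-and-workers variant of the multi-agent pattern can be sketched as follows. The `researcher` and `writer` functions are hypothetical stand-ins for separately prompted LLM instances, and the fixed research-then-write routing is a simplification: a real supervisor would typically ask a model which worker to call next.

```python
from typing import Callable

Worker = Callable[[str], str]  # assumed interface: instruction in, text out

def researcher(instruction: str) -> str:
    return f"notes on {instruction}"          # stand-in for an LLM with search tools

def writer(instruction: str) -> str:
    return f"draft based on [{instruction}]"  # stand-in for an LLM with a writing prompt

def supervisor(task: str, workers: dict[str, Worker]) -> str:
    """Route the task through the workers in a fixed order."""
    notes = workers["researcher"](task)
    return workers["writer"](notes)

print(supervisor("solar panels", {"researcher": researcher, "writer": writer}))
# prints draft based on [notes on solar panels]
```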

26.1.3 The ReAct Framework

ReAct (Reasoning + Acting) is the most widely adopted agent architecture. We introduced ReAct as a prompting pattern in Section 12.2; here we build it into a full agent system with tool execution, state management, and error handling. Algorithm 26.1.2 formalizes the ReAct loop.

Figure 26.1.6: The ReAct loop. Each Thought is conditioned on every prior Thought, Action, and Observation; the loop terminates when the model emits FINAL_ANSWER or hits the step budget.

Formally, the agent maintains a trajectory $\tau_t = (h_1, a_1, o_1, \ldots, h_{t-1}, a_{t-1}, o_{t-1})$ of thoughts $h$, actions $a$, and observations $o$. At step $t$ the LLM $M$ jointly samples the next thought and action conditioned on the system prompt $\pi$, the user task $T$, and the trajectory:

$$ (h_t, a_t) \sim p_M(\cdot \mid \pi, T, \tau_t) $$

The environment then returns an observation $o_t = \mathrm{exec}(a_t)$, and the trajectory is extended by concatenation: $\tau_{t+1} = \tau_t \oplus (h_t, a_t, o_t)$. The loop terminates either when $a_t = \texttt{FINAL\_ANSWER}(y)$ (returning $y$) or when $t$ reaches the step budget $S$. The "Re" in ReAct is exactly the conditioning of $a_t$ on $h_t$ and on the earlier thoughts stored in $\tau_t$. Pure Act baselines drop the thoughts entirely: they keep an action-observation history $\tau_t^{\mathrm{act}} = (a_1, o_1, \ldots, a_{t-1}, o_{t-1})$ and sample $a_t \sim p_M(\cdot \mid \pi, T, \tau_t^{\mathrm{act}})$ directly. Empirically this gives lower accuracy on multi-hop tool tasks, because the model cannot externalize the intermediate reasoning that grounds the next call.

Algorithm 26.1.2

This pseudocode formalizes the ReAct agent loop: given a user task T, a tool set, and an LLM M, the agent iterates through Thought, Action, and Observation steps for at most S steps. The loop terminates when the LLM emits FINAL_ANSWER or the step budget is exhausted, returning the extracted answer in the first case and a failure message in the second.


Input: user task T, tool set {tool_1, ..., tool_n}, LLM M, max steps S
Output: final answer or action result

1. Initialize context = [system_prompt, T]
2. for step = 1 to S:
     a. Thought: response = M(context)
        (the LLM reasons about the current state, what is known, and what is needed)
     b. if response contains FINAL_ANSWER:
          return extracted answer
     c. Action: parse tool_name and arguments from response
     d. Observation: result = execute(tool_name, arguments)
     e. Append (Thought, Action, Observation) to context
3. return "Max steps reached without resolution"

The key insight is that the explicit reasoning in step 2a (the "Thought") dramatically improves decision quality compared to acting without thinking or thinking without acting. Each thought provides a chain-of-reasoning that is also valuable for debugging when the agent makes mistakes.

Key Insight

Why ReAct works better than pure chain-of-thought for agents. Pure chain-of-thought (CoT) reasons in a closed loop: the model thinks step by step but never checks its reasoning against reality. ReAct adds grounding by interleaving reasoning with real-world observations from tool calls. When the model hypothesizes "the bug is in the authentication module," CoT continues reasoning from that hypothesis whether or not it is correct. ReAct instead calls a search tool, observes actual code, and corrects course if the hypothesis was wrong. This grounding effect is why ReAct agents outperform CoT-only approaches on tasks requiring factual accuracy, external data, or multi-step verification. The trade-off is latency: each tool call adds seconds to the total execution time.

from typing import Callable


class ReActAgent:
    """Minimal ReAct agent: Thought -> Action -> Observation loop."""

    def __init__(self, client, tools: dict[str, Callable], model: str = "gpt-4o"):
        self.client = client  # OpenAI-style chat client, e.g. openai.OpenAI()
        self.tools = tools
        self.model = model

    def run(self, task: str, max_steps: int = 10) -> str:
        # Build tool descriptions for the system prompt from each tool's docstring.
        tool_desc = "\n".join(
            f"- {name}: {func.__doc__}" for name, func in self.tools.items()
        )
        system_prompt = (
            "You are a ReAct agent. For each step:\n"
            "1. Thought: Reason about the current state and what to do next\n"
            "2. Action: Call a tool using the format: ACTION: tool_name(args)\n"
            "3. Wait for Observation (tool result)\n"
            "When you have the final answer, respond: FINAL ANSWER: [your answer]\n"
            f"Available tools:\n{tool_desc}"
        )
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": task},
        ]
        for _ in range(max_steps):
            response = self.client.chat.completions.create(
                model=self.model,
                messages=messages,
            ).choices[0].message.content
            messages.append({"role": "assistant", "content": response})

            # Check for final answer.
            if "FINAL ANSWER:" in response:
                return response.split("FINAL ANSWER:", 1)[1].strip()

            # Parse and execute the action, then feed the result back as an observation.
            if "ACTION:" in response:
                action_str = response.split("ACTION:", 1)[1].strip()
                observation = self._execute_action(action_str)
            else:
                # Nudge the model instead of silently re-sending the same context.
                observation = "No ACTION or FINAL ANSWER found. Follow the required format."
            messages.append({"role": "user", "content": f"Observation: {observation}"})

        # Reached only once the step budget is exhausted.
        return "Max steps reached without final answer."

    def _execute_action(self, action_str: str) -> str:
        # Parse "tool_name(args)" and execute; errors are returned as text
        # so the model can see them and adjust.
        try:
            name = action_str.split("(")[0].strip()
            args_str = action_str.split("(", 1)[1].rsplit(")", 1)[0]
            if name in self.tools:
                return str(self.tools[name](args_str))
            return f"Error: Unknown tool '{name}'"
        except Exception as e:
            return f"Error executing action: {e}"

Code Fragment 26.1.2a: A minimal ReAct agent with a registry of callable tools, tool descriptions built from each function's docstring, and a conversation loop that parses and executes text-formatted tool calls.

Note

The ReAct implementation above uses text parsing for simplicity. In production, you would use the provider's native function calling API (covered in Section 26.6), which gives structured JSON outputs instead of requiring text parsing. The conceptual loop is the same: think, act, observe.

ReAct Trace Example

A typical ReAct trace shows the interleaved thought-action-observation pattern. Notice how the agent explicitly reasons before each action, and how observations feed back into the next reasoning step.

# Example trace for: "What is the population of the capital of France?"
Thought: I need to find the capital of France, then look up its population.
 The capital of France is Paris, but let me verify and get the
 current population figure.
Action: search("Paris population 2024")
Observation: Paris has a city population of approximately 2.1 million
 and a metropolitan area population of about 12.3 million.
Thought: I now have the information. The capital of France is Paris,
 with a city population of about 2.1 million. I should provide
 both the city and metro figures for completeness.
FINAL ANSWER: The capital of France is Paris, with a city population
of approximately 2.1 million and a metropolitan area population
of about 12.3 million.

Code Fragment 26.1.3a: A worked ReAct trace for "What is the population of the capital of France?", showing the Thought, Action, Observation, FINAL ANSWER pattern.

Production Pattern

Production Example: ReAct in Shipped Products

The ReAct pattern is the backbone of several widely deployed agents. LangChain's create_react_agent and LangGraph's prebuilt ReAct executor are the default starting point for tens of thousands of production deployments tracked on LangSmith. Anthropic's Claude Code CLI (the same coding agent that produced this book's pipeline) uses an explicit Thought-Action-Observation loop with a tool registry of Bash, Read, Edit, Grep, and Glob. The published Claude 4 system card describes the same loop driving Computer Use: think about the screenshot, click or type, observe the next screenshot, repeat. When you see "Thinking..." in Cursor, Devin, or Replit's Agent mode, that is a ReAct loop emitting its Thought turn.

26.1.4 Cognitive Architectures and State Machines

As agents grow more complex, the simple ReAct loop becomes insufficient. Cognitive architectures provide a richer framework for organizing agent behavior by introducing explicit state management, memory systems, and structured decision-making processes. A cognitive architecture defines how an agent thinks, not just what it thinks about.

Library Shortcut: langgraph (state-machine agent runtime)

langgraph from LangChain models the agent as an explicit StateGraph: nodes are functions, edges are transitions, and the typed state object is the shared blackboard. The built-in create_react_agent wires up a ReAct loop in three lines, while custom graphs let you add reflection, human-in-the-loop checkpoints, and persistent checkpointing via Postgres or SQLite. It is the runtime under most production agents shipped in 2025-2026.

# Install first: pip install langgraph langchain-openai
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


def search(query: str) -> str:
    """Search the web for the query."""
    return f"results for {query}"  # stub; replace with a real search call


agent = create_react_agent(ChatOpenAI(model="gpt-4o"), tools=[search])
result = agent.invoke({"messages": [("user", "Who won the 2024 Turing Award?")]})
print(result["messages"][-1].content)  # the final assistant message

Code Fragment 26.1.4a: A few lines of LangGraph wire up a ReAct agent with tool execution. Persistence is not enabled in this snippet; it can be added by passing a checkpointer to create_react_agent.

Agent State Machines

Many production agents are best modeled as state machines, where the agent transitions between well-defined states based on its observations and decisions. This provides predictability and debuggability while still allowing the LLM to make autonomous decisions within each state. Figure 26.1.7 shows the agent state machine with its transitions.

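# State-machine agent: AgentState, AgentContext, and StatefulAgent.
# Helpers marked "assumed" are minimal placeholders so the class runs end to end.
from dataclasses import dataclass, field
from enum import Enum


class AgentState(Enum):
    PLANNING = "planning"
    EXECUTING = "executing"
    REFLECTING = "reflecting"
    WAITING_FOR_HUMAN = "waiting_for_human"
    COMPLETE = "complete"
    ERROR = "error"


@dataclass
class AgentContext:
    """Tracks the full state of an agent's execution."""
    task: str
    state: AgentState = AgentState.PLANNING
    plan: list[str] = field(default_factory=list)
    completed_steps: list[str] = field(default_factory=list)
    observations: list[dict] = field(default_factory=list)
    current_step_index: int = 0
    error_count: int = 0
    max_errors: int = 3


class StatefulAgent:
    """Agent that operates as a state machine with explicit transitions."""

    def __init__(self, client, tools, model: str = "gpt-4o"):
        self.client = client  # OpenAI-style chat client
        self.tools = tools
        self.model = model
        self.transitions = {
            AgentState.PLANNING: self._handle_planning,
            AgentState.EXECUTING: self._handle_executing,
            AgentState.REFLECTING: self._handle_reflecting,
            AgentState.ERROR: self._handle_error,
        }

    def run(self, task: str, max_transitions: int = 50) -> str:
        ctx = AgentContext(task=task)
        terminal = (AgentState.COMPLETE, AgentState.WAITING_FOR_HUMAN)
        # Cap transitions so a plan/reflect cycle cannot run forever.
        for _ in range(max_transitions):
            if ctx.state in terminal:
                break
            ctx = self.transitions[ctx.state](ctx)
        return self._format_result(ctx)

    def _handle_planning(self, ctx: AgentContext) -> AgentContext:
        # LLM creates a step-by-step plan; on a replan it also sees what is done.
        plan = self._call_llm(
            f"Break this task into concrete steps, one per line:\n{ctx.task}\n"
            f"Already completed: {ctx.completed_steps}"
        )
        ctx.plan = self._parse_plan(plan)
        ctx.current_step_index = 0  # restart at the top of the new plan
        ctx.state = AgentState.EXECUTING
        return ctx

    def _handle_executing(self, ctx: AgentContext) -> AgentContext:
        if ctx.current_step_index >= len(ctx.plan):
            ctx.state = AgentState.REFLECTING
            return ctx
        step = ctx.plan[ctx.current_step_index]
        try:
            result = self._execute_step(step, ctx)
            ctx.observations.append({"step": step, "result": result})
            ctx.completed_steps.append(step)
            ctx.current_step_index += 1
        except Exception:
            # Retry the same step until the error budget is spent.
            ctx.error_count += 1
            ctx.state = (AgentState.ERROR if ctx.error_count >= ctx.max_errors
                         else AgentState.EXECUTING)
        return ctx

    def _handle_reflecting(self, ctx: AgentContext) -> AgentContext:
        # LLM reviews results and decides: complete or replan.
        assessment = self._call_llm(
            f"Task: {ctx.task}\nCompleted: {ctx.completed_steps}\n"
            f"Results: {ctx.observations}\n\n"
            "Is the task fully complete? Start your reply with COMPLETE or "
            "INCOMPLETE, then say what remains."
        )
        # Match the leading verdict; a substring test would also match "incomplete".
        if assessment.strip().upper().startswith("COMPLETE"):
            ctx.state = AgentState.COMPLETE
        else:
            ctx.state = AgentState.PLANNING  # Replan with new context
        return ctx

    def _handle_error(self, ctx: AgentContext) -> AgentContext:
        # Assumed policy: after repeated failures, hand off to a human.
        ctx.state = AgentState.WAITING_FOR_HUMAN
        return ctx

    # --- Assumed helpers (placeholders, not part of any library) ---
    def _call_llm(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model, messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content

    def _parse_plan(self, plan: str) -> list[str]:
        return [line.strip() for line in plan.splitlines() if line.strip()]

    def _execute_step(self, step: str, ctx: AgentContext) -> str:
        # Placeholder: a fuller agent would run a tool-calling loop here.
        return self._call_llm(f"Task: {ctx.task}\nCarry out this step: {step}")

    def _format_result(self, ctx: AgentContext) -> str:
        return f"[{ctx.state.value}] completed {len(ctx.completed_steps)} step(s)"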

Code Fragment 26.1.5a: Adding explicit planning to the agent loop: the LLM first generates a step-by-step plan, then executes each step while tracking progress.

Agent state machine: PLANNING transitions to EXECUTING then REFLECTING then COMPLETE; reflection can loop back to PLANNING (replan), and EXECUTING can drop into ERROR on failure.

Figure 26.1.7: Agent state machine with planning, executing, reflecting, and error states

26.1.5 Agent Memory Systems

Agents need memory beyond the single context window. We sketch the categories briefly here; Section 26.6 covers the full taxonomy, storage strategies, retrieval policies, and production patterns. The four-way split below maps directly to the human-memory taxonomy in cognitive psychology (Tulving 1972, "Episodic and Semantic Memory"), which is not an accident: the names were borrowed in the 2023-2024 MemGPT, Voyager, and Generative Agents papers, and the analogy has stuck in production agent architectures since.

Working memory: the live context window during a turn (system prompt, user message, tool results, reasoning traces). Bounded by the model's context limit; the engineering question is what to keep and what to evict. As a rough illustration, a turn against a 128k-context model such as GPT-4o might devote 5-10k tokens to the system prompt and tool schemas, 20-40k to retrieved evidence, and the rest to scratch reasoning. Filling the window is not free: the RULER long-context benchmark (Hsieh et al., 2024) found that many models degrade well before they reach their advertised context length.
Episodic memory: records of past interactions stored outside the context window so the agent can recall what happened in earlier sessions. Typically a vector database keyed by user/session/topic. Park et al.'s 2023 "Generative Agents" paper (Stanford, arXiv:2304.03442) recorded every observation in a natural-language memory stream and retrieved from it by scoring recency, importance, and relevance; in the paper's ablations, removing memory components made the simulated agents' behavior less believable.
Semantic memory: extracted facts, preferences, and learned heuristics that persist across sessions. Distilled from episodic memory or populated by the agent's own reflection step. MemGPT (Packer et al., 2023, arXiv:2310.08560) approaches persistence from an operating-systems angle: the model issues function calls to move information between its in-context "main context" and external storage, much as an operating system pages data between RAM and disk.
Procedural memory: skill libraries, tool descriptions, and learned routines the agent invokes by name. Often stored as structured templates or callable code rather than embeddings. A well-known reference is Voyager (Wang et al., 2023), which played Minecraft by incrementally extending a library of executable JavaScript skills; each skill (for example, crafting an iron pickaxe or building a furnace) is stored as code and reused without re-reasoning.

The hard problems (when to write, when to retrieve, how to compress, how to forget, how to prevent contamination across users) are the topic of Section 26.6.
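The four categories can be pictured as four differently shaped stores. The dataclass below is a toy sketch under stated assumptions: `AgentMemory`, `end_session`, and `recall` are hypothetical names, and the keyword match stands in for the embedding-based retrieval a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentMemory:
    working: list[str] = field(default_factory=list)        # current context
    episodic: list[str] = field(default_factory=list)       # past interaction records
    semantic: dict[str, str] = field(default_factory=dict)  # distilled facts
    procedural: dict[str, Callable[..., str]] = field(default_factory=dict)  # named skills

    def end_session(self) -> None:
        """Archive the working context as episodes and clear it."""
        self.episodic.extend(self.working)
        self.working.clear()

    def recall(self, keyword: str) -> list[str]:
        """Naive keyword retrieval over past episodes."""
        return [e for e in self.episodic if keyword.lower() in e.lower()]

memory = AgentMemory()
memory.working += ["User prefers metric units", "Asked about Paris weather"]
memory.end_session()
memory.semantic["units"] = "metric"                       # fact distilled from an episode
memory.procedural["greet"] = lambda name: f"Hello, {name}"
print(memory.recall("paris"))  # prints ['Asked about Paris weather']
```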

26.1.6 Token Budget Management

Token management is one of the most practical challenges in building agents. Unlike a single-turn completion where you control the input size, agents accumulate context over many iterations. Without careful budgeting, agents hit context limits, lose important early context, or incur excessive costs.

Strategies for Managing Token Budgets

Summarize tool outputs: Instead of including raw API responses, extract only the relevant fields. A search result page might be 10,000 tokens raw but only 200 tokens of useful information.
Sliding window with summarization: Periodically summarize older conversation turns and replace them with a compact summary, keeping recent turns intact.
Tiered context priority: Assign priorities to different message types. System prompts and the current task have highest priority; old tool results have lowest priority and are evicted first.
Lazy loading: Instead of loading all context upfront, fetch information only when the agent needs it. Store tool descriptions in a separate index and inject only the ones the agent requests.
Step limits: Set hard limits on the number of agent iterations. If the agent cannot solve a task in N steps, it should report what it found and ask for guidance.
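Tiered priority eviction, one of the strategies listed above, can be sketched as follows. The four-characters-per-token estimate and the priority tiers are assumptions for illustration; a production system would use the model's tokenizer and its own ranking.

```python
def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    return max(1, len(text) // 4)

# Lower number = evicted first. These tiers are illustrative, not prescriptive.
PRIORITY = {"tool": 0, "assistant": 1, "user": 2, "system": 3}

def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Drop the lowest-priority, oldest messages until the estimate fits the budget.

    System messages and the most recent message are never evicted.
    """
    kept = list(messages)
    while sum(estimate_tokens(m["content"]) for m in kept) > budget:
        candidates = [m for m in kept[:-1] if m["role"] != "system"]
        if not candidates:
            break  # nothing left to evict; the caller must handle the overflow
        # min() returns the first minimum, i.e. the oldest message in the lowest tier.
        victim = min(candidates, key=lambda m: PRIORITY.get(m["role"], 1))
        kept = [m for m in kept if m is not victim]
    return kept

history = [
    {"role": "system", "content": "s" * 40},
    {"role": "user", "content": "u" * 40},
    {"role": "tool", "content": "t" * 400},   # large raw tool output
    {"role": "assistant", "content": "a" * 40},
    {"role": "user", "content": "q" * 40},
]
print([m["role"] for m in trim_to_budget(history, budget=50)])
# prints ['system', 'user', 'assistant', 'user']
```

The bulky tool result goes first, which matches the intuition that old tool output is usually the cheapest context to lose.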

Table 26.1.2b: Strategy Comparison (as of 2026).

Strategy	Token Savings	Implementation	Risk
Summarize tool outputs	50-90%	LLM-based or rule-based extraction	May lose relevant details
Sliding window	Variable	Drop oldest N messages	Loses early context
Tiered priority eviction	30-60%	Score and rank all messages	Complex priority logic
Lazy tool loading	20-40%	Tool registry with on-demand injection	Extra LLM call to select tools
Hard step limits	Bounded	Counter in agent loop	May not complete complex tasks

Key Insight

The best agents are frugal with their context. Every token in the context window should earn its place. Production agents typically combine multiple strategies: summarizing tool outputs immediately, using a sliding window for conversation history, and imposing step limits as a safety net. The goal is to maintain the information density of the context while staying well within token limits.

26.1.7 Designing for Failure

Agents fail in ways that are qualitatively different from non-agentic systems. A simple chain either succeeds or produces an error. An agent can get stuck in loops, waste tokens on unproductive actions, misinterpret tool outputs, or take increasingly erratic actions as its context window degrades. Robust agent design requires anticipating and handling these failure modes. Chapter 42 covers how to observe and measure these failures in production.

Common Agent Failure Modes

Infinite loops: The agent repeats the same action because it does not recognize that the result is unchanged. Always implement a maximum step counter.
Tool misuse: The agent calls a tool with invalid arguments or misinterprets the output. Clear tool descriptions and structured error messages help.
Goal drift: Over many steps, the agent gradually shifts away from the original task. Periodically re-injecting the original task description helps maintain focus.
Context window overflow: The agent accumulates so much history that it cannot generate useful output. Token management strategies (above) are essential.
Cascading errors: An early mistake propagates through subsequent steps, leading the agent further astray. Reflection checkpoints catch and correct errors early.
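The first failure mode, infinite loops, is cheap to guard against. The sketch below combines a step counter with repeated-action detection; `LoopGuard` and its default thresholds are hypothetical and would need tuning for a real agent.

```python
from collections import Counter
from typing import Optional

class LoopGuard:
    """Stop an agent that exceeds a step budget or keeps repeating one action."""

    def __init__(self, max_steps: int = 10, max_repeats: int = 2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()  # (tool_name, args) -> times requested

    def check(self, tool_name: str, args: str) -> Optional[str]:
        """Return a stop reason, or None if the action may proceed."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step budget exhausted"
        self.seen[(tool_name, args)] += 1
        if self.seen[(tool_name, args)] > self.max_repeats:
            return f"repeated action: {tool_name}({args})"
        return None

guard = LoopGuard(max_steps=5, max_repeats=2)
for _ in range(3):
    reason = guard.check("search", "agent loops")
print(reason)  # prints repeated action: search(agent loops)
```

Calling `check` before every tool execution lets the agent loop exit with a readable reason instead of burning tokens on an unchanged result.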

Tip: Start with a Single Tool, Then Expand

When building your first agent, start with one well-tested tool (for example, web search or code execution). Get the tool-calling loop working reliably before adding more tools. Agents with many poorly-tested tools fail in unpredictable ways.

Real-World Scenario

Building a First AI Agent for IT Helpdesk Triage

Who: An IT operations manager and a junior ML engineer at a 2,000-employee enterprise

Situation: The IT helpdesk received 300+ tickets daily. Tier-1 agents spent 40% of their time on routine issues (password resets, VPN troubleshooting, software installation requests) that followed well-documented runbooks.

Problem: A simple chatbot with static decision trees handled only 25% of tickets because users described problems in unpredictable ways ("my computer is being weird" vs. "Outlook keeps crashing after the latest Windows update").

Dilemma: A fully autonomous agent with access to Active Directory and device management tools could resolve tickets end-to-end but posed security risks (what if it disabled the wrong account?). A classification-only agent was safe but still required human execution of every resolution.

Decision: They built a ReAct-style agent with tiered autonomy: the agent could autonomously execute low-risk actions (check account status, look up device info, send KB articles), but required human approval for medium-risk actions (password resets, group membership changes) and could not perform high-risk actions (account deletion, admin privilege grants) at all.

How: Each tool was tagged with a risk level. The agent's system prompt enforced the approval workflow. A supervisor dashboard showed pending approvals with the agent's reasoning chain, allowing Tier-1 agents to approve with one click.

Result: The agent autonomously resolved 35% of tickets and pre-triaged another 40% (reducing Tier-1 handling time from 12 minutes to 3 minutes per ticket). Zero security incidents occurred in the first 6 months.

Lesson: Tiered autonomy (auto-execute low risk, human-approve medium risk, block high risk) lets you deploy agents safely while still capturing significant efficiency gains from automation.
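The tiered-autonomy rule from this scenario can be expressed as a small gate in front of tool execution. This is a sketch under stated assumptions: the tool names and risk tags are hypothetical, and the check is placed in code as a complement to prompt instructions, so it still holds if the model ignores its prompt.

```python
from enum import Enum
from typing import Callable

class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

# Hypothetical tool names and risk tags, loosely modeled on the scenario.
TOOL_RISK = {
    "check_account_status": Risk.LOW,
    "reset_password": Risk.MEDIUM,
    "delete_account": Risk.HIGH,
}

def gate(tool: str, approve: Callable[[str], bool]) -> str:
    """Decide whether a requested tool call may run.

    `approve` stands in for the human approval step (e.g. a dashboard click).
    Untagged tools are treated as high risk.
    """
    risk = TOOL_RISK.get(tool, Risk.HIGH)
    if risk is Risk.LOW:
        return "execute"
    if risk is Risk.MEDIUM:
        return "execute" if approve(tool) else "denied"
    return "blocked"

print(gate("reset_password", approve=lambda tool: True))  # prints execute
print(gate("delete_account", approve=lambda tool: True))  # prints blocked
```

Defaulting unknown tools to high risk means a newly added tool is blocked until someone deliberately assigns it a tier.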

Research Frontier

Agentic Reasoning and Self-Improvement (2024-2026): Recent work explores agents that learn from their own execution traces, adapting their strategies without retraining the underlying model. Reflexion (Shinn et al., 2023) demonstrated verbal reinforcement learning where agents store failure reflections in episodic memory. Voyager (Wang et al., 2023) showed that agents can build a persistent skill library, composing new abilities from previously learned ones.

Open questions remain about how to balance exploration with exploitation in agentic settings and how to evaluate agents on long-horizon tasks where success depends on dozens of sequential decisions. The intersection of reasoning models (covered in Section 26.4) with self-improving agents is a particularly active area.

Key Takeaways

An AI agent is an LLM operating in a perception-reasoning-action loop, where the model determines the control flow rather than the developer.
Prefer the simplest orchestration pattern that works: chains before workflows, workflows before agents.
The four agentic design patterns (Reflection, Tool Use, Planning, Multi-Agent) are composable building blocks for all agent architectures.
ReAct interleaves explicit reasoning with actions and observations, providing a structured and debuggable agent loop.
Agent state machines combine the predictability of workflows with the flexibility of agents by defining explicit states and transitions.
Three-tier memory (working, episodic, semantic) addresses different timescales of agent information needs.
Token budget management is a critical production concern; combine output summarization, sliding windows, and step limits.
Design for failure by implementing step limits, re-injecting task descriptions, adding reflection checkpoints, and handling cascading errors.

Self-Check

Q1: What is the fundamental difference between an AI agent and a workflow?

Show Answer

In a workflow, the developer defines the control flow (which steps run and in what order, including conditionals and branches). In an agent, the LLM itself determines the control flow, deciding which tools to call, in what order, and when to stop. The key distinction is who controls action selection: the developer (workflow) or the model (agent).

Q2: Name the four agentic design patterns and briefly describe each.

Show Answer

Reflection: The LLM reviews and iteratively improves its own output. Tool Use: The LLM calls external functions to extend its capabilities. Planning: The LLM decomposes complex tasks into subtasks before execution. Multi-Agent: Multiple LLM instances with different roles collaborate to solve a problem.

Q3: In the ReAct framework, what are the three components of each iteration, and why is the "Thought" step important?

Show Answer

Each ReAct iteration consists of a Thought (reasoning trace), an Action (tool call or response), and an Observation (action result). The Thought step is important because explicit reasoning before each action improves decision quality, provides a chain-of-thought for debugging, and helps the agent maintain focus on the overall task rather than acting reflexively.

Q4: What are the three types of agent memory, and how do they differ in timescale?

Show Answer

Working memory (short-term) holds the current context window contents, lasting for a single agent run. Episodic memory (session-based) stores records of past interactions and can persist across sessions. Semantic memory (long-term) stores factual knowledge and learned procedures, persisting indefinitely via vector stores or knowledge bases.
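
As a rough illustration of the three tiers, the sketch below models them as fields on one object. The class name `AgentMemory` and the method `end_run` are hypothetical, and a plain dictionary stands in for the vector store or knowledge base that would normally back semantic memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentMemory:
    # Working memory: contents of the current run only.
    working: List[str] = field(default_factory=list)
    # Episodic memory: records of past runs, kept across sessions.
    episodic: List[dict] = field(default_factory=list)
    # Semantic memory: durable facts (a dict stands in for a vector store).
    semantic: Dict[str, str] = field(default_factory=dict)

    def end_run(self, outcome: str) -> None:
        """Archive the finished run as an episode and clear working memory."""
        self.episodic.append({"steps": list(self.working), "outcome": outcome})
        self.working.clear()
```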

Q5: Why is token budget management critical for agents, and what is the most effective strategy?

Show Answer

Token budget management is critical because agents accumulate context over many iterations. Every tool call, observation, and reasoning trace adds tokens. Without management, agents hit context limits, lose important early context, or incur excessive costs. The most effective strategy is combining multiple approaches: summarize tool outputs immediately to reduce their size by 50-90%, use a sliding window with summarization for old messages, and impose hard step limits as a safety net.
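
The sliding-window-plus-summarization idea can be sketched as follows. This is only an illustration under stated assumptions: `approx_tokens` is a crude four-characters-per-token estimate rather than a real tokenizer, `truncate_summary` is a placeholder where a production agent would call an LLM, and `compact_history` is a hypothetical helper name.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def truncate_summary(text: str, max_chars: int = 200) -> str:
    """Placeholder summarizer that keeps only the head of the text."""
    return text if len(text) <= max_chars else text[:max_chars] + " ...[truncated]"


def compact_history(messages: List[str], budget: int, keep_recent: int = 4,
                    summarize: Callable[[str], str] = truncate_summary) -> List[str]:
    """Keep recent messages verbatim; fold older ones into one summary line."""
    total = sum(approx_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return list(messages)  # under budget, or nothing old enough to fold
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = "Summary of earlier steps: " + summarize(" | ".join(old))
    return [summary] + recent
```

A single pass does not guarantee the result fits the budget (four very long recent messages can still exceed it), which is one reason the answer above pairs this technique with summarizing tool outputs up front and a hard step limit.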

Exercises

Exercise 26.1.1: Agent vs. Chain vs. Router (Conceptual)

Explain the difference between a chain, a router, and a full agent. For each, describe a concrete use case where it would be the most appropriate choice.

Answer Sketch

A chain follows a fixed sequence of steps (e.g., summarize then translate). A router selects one path from several based on input (e.g., classify intent then route to the right handler). A full agent decides which actions to take and in what order in a loop, continuing until the task is done (e.g., a research assistant that searches, reads, and synthesizes). Chains suit deterministic workflows; routers suit classification-driven dispatch; agents suit open-ended tasks with unknown step counts.
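
To make the contrast concrete, here is one possible sketch of the three shapes side by side. The `LLM` type (a function from prompt to text), the prompt strings, and the `tool: input` / `FINAL: answer` reply convention are all assumptions made for illustration; `router` also assumes the handlers dictionary contains an `"other"` fallback.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def chain(llm: LLM, text: str) -> str:
    """Fixed sequence written by the developer: summarize, then translate."""
    summary = llm(f"Summarize: {text}")
    return llm(f"Translate to French: {summary}")


def router(llm: LLM, text: str, handlers: Dict[str, Callable[[str], str]]) -> str:
    """One model-made choice, followed by a developer-written path."""
    intent = llm(f"Classify intent ({', '.join(handlers)}): {text}").strip()
    return handlers.get(intent, handlers["other"])(text)


def agent(llm: LLM, task: str, tools: Dict[str, Callable[[str], str]],
          max_steps: int = 5) -> str:
    """Loop in which the model picks a tool each step, or FINAL to stop."""
    notes: List[str] = []
    for _ in range(max_steps):
        reply = llm(f"Task: {task}\nNotes: {notes}\n"
                    "Reply 'tool: input' or 'FINAL: answer'")
        name, _sep, arg = reply.partition(":")
        if name.strip() == "FINAL":
            return arg.strip()
        tool = tools.get(name.strip())
        notes.append(tool(arg.strip()) if tool else f"unknown tool {name.strip()}")
    return "stopped: step limit reached"
```

Only `agent` contains a loop whose length and branching are decided by model output; in `chain` and `router` the developer's code fixes the path.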

Exercise 26.1.2: Agentic Design Patterns (Conceptual)

Andrew Ng identified four agentic design patterns: Reflection, Tool Use, Planning, and Multi-Agent Collaboration. For each pattern, describe one real-world scenario where it provides clear value over a simpler approach.

Answer Sketch

Reflection: a code-writing agent that reviews its own output for bugs before submitting. Tool Use: a customer support agent that queries a CRM database. Planning: a travel agent that decomposes a multi-city itinerary into bookable steps. Multi-Agent Collaboration: a content pipeline where one agent drafts, another fact-checks, and a third edits for style.

Exercise 26.1.3: Breaking a ReAct Loop (Coding)

A ReAct agent keeps searching for information and never produces a final answer. Write a Python function run_react_loop() that implements a maximum-step limit and a "repeated action" detector that stops the loop if the agent calls the same tool with the same arguments twice in a row.

Answer Sketch

Maintain a list of (tool_name, arguments) tuples. After each action, compare with the previous entry. If they match, inject a prompt like 'You have repeated the same action. Synthesize what you know and provide a final answer.' Also enforce max_steps (e.g., 10) and return whatever partial answer the agent has produced when the limit is reached.
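
One possible shape for this answer is sketched below. The `Policy` interface (a callable that receives the transcript and returns either `{"tool": ..., "args": ...}` or `{"final": ...}`) is an assumption standing in for a real LLM call, and the transcript format is illustrative only.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical policy interface: transcript in, decision dict out.
Policy = Callable[[List[str]], dict]

NUDGE = ("You have repeated the same action. "
         "Synthesize what you know and provide a final answer.")


def run_react_loop(policy: Policy, tools: Dict[str, Callable[..., str]],
                   task: str, max_steps: int = 10) -> str:
    transcript: List[str] = [f"Task: {task}"]
    history: List[Tuple[str, str]] = []  # (tool_name, canonical JSON args)
    last_observation: Optional[str] = None

    for _ in range(max_steps):
        decision = policy(transcript)
        if "final" in decision:
            return decision["final"]

        # sort_keys makes equal argument dicts compare equal as strings.
        key = (decision["tool"], json.dumps(decision["args"], sort_keys=True))
        if history and history[-1] == key:
            # Same tool and arguments twice in a row: skip the call and nudge.
            transcript.append(NUDGE)
            continue
        history.append(key)

        last_observation = tools[decision["tool"]](**decision["args"])
        transcript.append(f"Action: {key[0]}({key[1]})")
        transcript.append(f"Observation: {last_observation}")

    # Step limit reached: return whatever partial result exists.
    return f"Step limit reached. Last observation: {last_observation}"
```

A repeated action still consumes one step from the budget, so a policy that ignores the nudge is stopped by `max_steps` rather than looping forever.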

Exercise 26.1.4: Cognitive Architecture State Machine (Coding)

Implement a minimal three-state agent (PLANNING, EXECUTING, REFLECTING) using a Python dictionary to track state transitions. The agent should plan how to answer a user question, execute the plan step by step, and reflect on whether the result is satisfactory.

Answer Sketch

Use a while loop with a current_state variable. In PLANNING, call the LLM to produce a numbered list of steps. In EXECUTING, iterate through steps and call tools. In REFLECTING, ask the LLM whether the collected results answer the original question. Transition back to PLANNING if reflection says 'no' (with a max-retry limit).
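
A minimal sketch along these lines follows. The callables `plan`, `execute`, and `satisfied` are hypothetical stand-ins for the LLM and tool calls, and the `TRANSITIONS` dictionary is used here to validate each state change; other designs could store handler functions in the dictionary instead.

```python
from typing import Callable, Dict, List

# Allowed transitions; REFLECTING may loop back to PLANNING or finish.
TRANSITIONS: Dict[str, List[str]] = {
    "PLANNING": ["EXECUTING"],
    "EXECUTING": ["REFLECTING"],
    "REFLECTING": ["PLANNING", "DONE"],
}


def run_agent(question: str,
              plan: Callable[[str], List[str]],
              execute: Callable[[str], str],
              satisfied: Callable[[str, List[str]], bool],
              max_retries: int = 2) -> List[str]:
    state, retries = "PLANNING", 0
    steps: List[str] = []
    results: List[str] = []
    while state != "DONE":
        if state == "PLANNING":
            steps = plan(question)  # e.g. an LLM call returning numbered steps
            next_state = "EXECUTING"
        elif state == "EXECUTING":
            results = [execute(step) for step in steps]
            next_state = "REFLECTING"
        else:  # REFLECTING
            if satisfied(question, results) or retries >= max_retries:
                next_state = "DONE"
            else:
                retries += 1
                next_state = "PLANNING"
        # Guard against transitions the table does not allow.
        assert next_state in TRANSITIONS[state], f"illegal {state} -> {next_state}"
        state = next_state
    return results
```

With `max_retries=2`, a reflection step that always answers "no" leads to three planning passes in total before the loop gives up and returns the last results.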

Exercise 26.1.5: Memory Type Classification (Conceptual)

Classify each of the following as episodic, semantic, or procedural memory: (a) "The user prefers Python over JavaScript." (b) "Last Tuesday the deployment failed because of a missing environment variable." (c) "To deploy to staging, first run migrations, then restart the service."

Answer Sketch

(a) Semantic memory: a distilled fact about the user. (b) Episodic memory: a timestamped record of a specific event. (c) Procedural memory: a learned action sequence that can be replayed.

What Comes Next

In the next section, Section 26.2: Planning & Agentic Reasoning, we explore tool use and function calling, the capability that allows agents to interact with external systems and APIs.

Further Reading

Yao, S. et al. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. The paper that defined the reasoning-plus-acting paradigm for LLM agents. Shows how interleaving thought and action improves task completion. Among the most widely cited references in the AI agents literature.

Wang, L. et al. (2024). "A Survey on Large Language Model based Autonomous Agents." Frontiers of Computer Science. A comprehensive survey covering agent architectures, capabilities, and applications. Organizes the field into a clear taxonomy. Best single resource for understanding the agent landscape.

Xi, Z. et al. (2023). "The Rise and Potential of Large Language Model Based Agents: A Survey." arXiv preprint. An extensive survey with emphasis on cognitive architectures and social simulation. Covers both single-agent and multi-agent scenarios. Complements the Wang et al. survey with different perspectives.

Sumers, T.R. et al. (2024). "Cognitive Architectures for Language Agents." TMLR. Proposes a formal framework (CoALA) for understanding language agent architectures. Draws on cognitive science to organize agent design patterns. Essential for researchers designing principled agent systems.

Anthropic (2024). "Building Effective Agents." Anthropic's practical guide to building agents with Claude, covering common patterns and best practices. Includes concrete implementation advice. Must-read for teams building production agents with Claude.