"A plan is just a list of things that will not happen in order."
Agent X, Pragmatically Planning AI Agent
Planning is what separates a tool-calling chatbot from a genuine problem solver. The ReAct loop from Section 22.1 handles simple tasks well, but complex goals require the agent to think ahead, decompose work into subtasks, and recover when individual steps fail. This section covers the spectrum of planning strategies, from simple plan-and-execute to tree search methods like LATS, and explains the compute-cost tradeoffs that determine which approach fits each use case. The chain-of-thought reasoning from Section 11.2 provides the cognitive foundation that planning agents build upon.
Prerequisites
This section builds on agent foundations from Section 22.1 and chain-of-thought reasoning from Section 11.2.
1. From Simple Loops to Strategic Planning
A basic ReAct agent operates one step at a time: observe, think, act, repeat. This works well for simple tasks but breaks down when the problem requires coordinating multiple steps, managing dependencies between actions, or recovering from dead ends. Planning gives agents the ability to think ahead, decompose complex tasks into manageable subtasks, and reason about the order in which those subtasks should be executed.
The simplest planning approach is plan-and-execute: the agent first generates a complete plan (a numbered list of steps), then executes each step sequentially, checking results against expectations after each step. If a step fails or produces unexpected output, the agent can revise its plan before continuing. This separation of planning from execution makes the agent's reasoning transparent and debuggable.
More sophisticated approaches treat planning as a search problem. Tree of Thoughts (ToT) explores multiple reasoning paths in parallel, evaluating each path's promise before committing resources. Language Agent Tree Search (LATS) combines Monte Carlo tree search with LLM-based evaluation, treating each potential action sequence as a node in a search tree and using the model to estimate the value of unexplored branches.
The key trade-off in agent planning is compute cost vs. plan quality. A simple plan-and-execute approach uses one LLM call for planning and one per step. Tree search methods like LATS may require dozens of LLM calls to explore the search space. For most practical applications, plan-and-execute with reflection (replan after failures) provides the best cost-quality balance. Reserve tree search for high-stakes decisions where the cost of a wrong action far exceeds the cost of additional planning compute.
Agent planning algorithms sit within a rich tradition of search and planning in AI and operations research. Greedy plan-and-execute is equivalent to a depth-first search with no backtracking: fast but brittle. Plan-and-execute with replanning resembles the "replan on failure" strategy from classical AI planning (STRIPS, 1971). Tree search methods like LATS are direct descendants of Monte Carlo Tree Search (MCTS), the algorithm that powered AlphaGo's victory in 2016. The fundamental tradeoff between exploration (trying diverse strategies) and exploitation (committing to a promising plan) was formalized in decision theory as the multi-armed bandit problem (Robbins, 1952) and later generalized by the UCB (Upper Confidence Bound) family of algorithms. Every agent planning strategy implicitly takes a position on this exploration-exploitation spectrum. Understanding this lineage helps practitioners choose the right planning depth: most business tasks are "easy bandits" where greedy exploitation suffices, while research or creative tasks are "hard bandits" where exploration pays off.
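To make the exploration-exploitation tradeoff concrete, here is a minimal sketch of the UCB1 rule from the bandit literature. The `ucb1` helper and the toy branch statistics are illustrative assumptions, not part of any agent framework:

```python
import math

def ucb1(mean_reward: float, visits: int, total_visits: int,
         c: float = 1.414) -> float:
    """UCB1 score: an exploitation term (mean reward) plus an
    exploration bonus that shrinks as a branch accumulates visits."""
    if visits == 0:
        return float("inf")  # always try unvisited branches first
    return mean_reward + c * math.sqrt(math.log(total_visits) / visits)

# A heavily explored branch needs a higher mean to beat a rarely tried one:
scores = {
    "plan_a": ucb1(0.8, visits=50, total_visits=60),  # strong but well-known
    "plan_b": ucb1(0.5, visits=10, total_visits=60),  # weaker but uncertain
}
```

With these numbers, the under-explored `plan_b` outscores `plan_a` despite its lower mean reward, which is exactly the pressure toward exploration that tree-search planners inherit from UCB.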
Plan-and-Execute Architecture
Input: task T, tool set Tools, LLM M, max replans R
Output: final answer
1. plan = M("Decompose T into numbered steps")   // planning phase
2. results = []; replans_remaining = R
3. for step_idx = 0 to len(plan) - 1:
   a. result = execute_step(plan[step_idx], Tools, results)
   b. results.append(result)
   c. // reflection: check if plan still valid
   d. verdict = M("Given results so far, does remaining plan make sense?")
   e. if verdict == "replan" and replans_remaining > 0:
         plan = M("Revise plan given results: " + results)
         replans_remaining -= 1
         // continue with the updated plan from the current step
4. answer = M("Synthesize final answer from all results")
5. return answer
from typing import TypedDict, List, Optional

from langgraph.graph import StateGraph, END

# Assumed to be defined elsewhere in the chapter: `llm` (a chat model),
# `agent_executor` (a tool-calling agent), `parse_numbered_list`, and
# `synthesize_answer`.

class PlanExecuteState(TypedDict):
    task: str
    plan: List[str]
    current_step: int
    results: List[str]
    final_answer: Optional[str]

def create_plan(state: PlanExecuteState) -> dict:
    """Generate a multi-step plan for the task."""
    response = llm.invoke(
        f"Create a step-by-step plan to accomplish this task:\n"
        f"{state['task']}\n\n"
        f"Return a numbered list of concrete, actionable steps."
    )
    steps = parse_numbered_list(response.content)
    return {"plan": steps, "current_step": 0, "results": []}

def execute_step(state: PlanExecuteState) -> dict:
    """Execute the current step of the plan."""
    step = state["plan"][state["current_step"]]
    previous = "\n".join(
        f"Step {i+1}: {r}" for i, r in enumerate(state["results"])
    )
    result = agent_executor.invoke(
        f"Execute this step: {step}\n\nPrevious results:\n{previous}"
    )
    return {
        "results": state["results"] + [result],
        "current_step": state["current_step"] + 1,
    }

def should_replan(state: PlanExecuteState) -> str:
    """Check if the plan needs revision after the latest step."""
    if state["current_step"] >= len(state["plan"]):
        return "synthesize"
    # Ask the LLM if the plan still makes sense
    check = llm.invoke(
        f"Given the results so far, does the remaining plan still make sense?\n"
        f"Results: {state['results']}\n"
        f"Remaining steps: {state['plan'][state['current_step']:]}"
    )
    if "replan" in check.content.lower():
        return "replan"
    return "execute"

# Build the graph
graph = StateGraph(PlanExecuteState)
graph.add_node("plan", create_plan)
graph.add_node("execute", execute_step)
graph.add_node("replan", create_plan)
graph.add_node("synthesize", synthesize_answer)

graph.set_entry_point("plan")
graph.add_edge("plan", "execute")
# should_replan returns the name of the next node:
# "execute", "replan", or "synthesize"
graph.add_conditional_edges("execute", should_replan)
graph.add_edge("replan", "execute")
graph.add_edge("synthesize", END)

app = graph.compile()
The same plan-and-execute pattern in 10 lines with PydanticAI (pip install pydantic-ai):
from pydantic_ai import Agent
agent = Agent(
    "openai:gpt-4o",
    system_prompt=(
        "You are a research assistant. Break the user's request "
        "into steps, execute each, and synthesize a final answer."
    ),
)
result = agent.run_sync("Plan a migration from session-based to JWT auth")
print(result.data)
- Simple ReAct loop: Best for exploratory tasks where the next step depends on discoveries made during the current step (debugging, open-ended research, conversational agents). Low overhead, high adaptability.
- Plan-and-execute: Best for tasks with clear subtask decomposition and dependencies between steps (data pipelines, migration workflows, report generation). More predictable, easier to debug, but less adaptive to surprises.
- Tree search (ToT/LATS): Reserve for high-stakes decisions where the cost of a wrong action far exceeds the additional compute cost (code migrations affecting production, financial analyses, safety-critical domains). Expensive but thorough.
In practice, most production agents start with plan-and-execute and add reflection (replan after failures). Tree search is rarely used outside benchmarks and high-value specialized applications.
2. Tree of Thoughts and LATS
Tree of Thoughts (ToT) extends chain-of-thought prompting by exploring multiple reasoning paths simultaneously. Instead of committing to a single chain of reasoning, the agent generates several candidate next steps, evaluates each one's promise using the LLM as a heuristic, and selects the most promising branch to continue exploring. This is particularly effective for problems where the first approach attempted is unlikely to be optimal, such as creative writing, mathematical proofs, or complex code refactoring.
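The generate-evaluate-select loop can be sketched in a few lines. Here `propose` and `score` are stand-ins for the LLM calls (candidate generation and the 1-to-10 heuristic rating); this is an illustrative greedy variant, not the full breadth-first or beam-search ToT:

```python
from typing import Callable, List

def tot_search(problem: str,
               propose: Callable[[str], List[str]],
               score: Callable[[str], float],
               depth: int = 3) -> str:
    """Greedy best-first Tree of Thoughts: at each level, generate
    candidate next thoughts, score each with the LLM-as-heuristic,
    and expand only the most promising branch."""
    state = problem
    for _ in range(depth):
        candidates = propose(state)          # in practice: one LLM call
        if not candidates:
            break
        state = max(candidates, key=score)   # in practice: LLM rates each
    return state
```

A fuller implementation would keep the top-k branches per level (beam search) rather than a single one, trading more LLM calls for a lower chance of pruning the correct path early.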
LATS takes this further by applying Monte Carlo tree search (MCTS) principles. Each node in the tree represents a state (the agent's observations and actions so far). The agent simulates potential action sequences, uses the LLM to evaluate terminal states, and backpropagates those evaluations to guide future exploration. LATS has shown strong results on challenging benchmarks like HumanEval and WebShop, where the search-based approach discovers solutions that greedy single-path agents miss.
A greedy ReAct agent solves a coding task in 5 LLM calls at $0.01 each, for a total cost of $0.05. A LATS agent explores a tree with depth 3 and branching factor 3. The number of nodes is the sum of a geometric series: 3^0 + 3^1 + 3^2 + 3^3 = 1 + 3 + 9 + 27 = 40 nodes. Each node requires one LLM call, so the LATS cost is 40 × $0.01 = $0.40, which is 8× more expensive than the greedy approach. The extra $0.35 pays off only if it prevents costly failures: if LATS raises the success rate by, say, 12.5 percentage points, the break-even point is a failure cost of $0.35 / 0.125 = $2.80, above which LATS is economically justified. For routine queries with low failure cost, the greedy agent is the better choice.
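The arithmetic above can be packaged as a quick sketch (the prices are the illustrative figures from the example, not real API rates):

```python
def tree_llm_calls(branching: int, depth: int) -> int:
    """Total nodes in a full tree: b^0 + b^1 + ... + b^depth."""
    return sum(branching ** level for level in range(depth + 1))

COST_PER_CALL = 0.01                               # illustrative price
greedy_cost = 5 * COST_PER_CALL                    # 5-call ReAct agent
lats_cost = tree_llm_calls(3, 3) * COST_PER_CALL   # 40-node LATS tree
ratio = lats_cost / greedy_cost                    # 8x multiplier
```

Varying `branching` and `depth` shows how quickly tree search scales: depth 4 with branching 3 already costs 121 calls, which is why production systems cap both aggressively.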
The MAP (Multi-Agent Planning) approach distributes planning across multiple agents. A decomposition agent breaks the problem into independent subproblems. Specialist agents solve each subproblem in parallel. A synthesis agent combines the results. This is useful when subproblems have minimal dependencies, enabling parallelism that reduces wall-clock time compared to sequential planning approaches.
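A minimal sketch of this decompose-solve-combine pattern, with `decompose`, `solve`, and `combine` as hypothetical stand-ins for the three agent roles:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def map_plan(task: str,
             decompose: Callable[[str], List[str]],   # decomposition agent
             solve: Callable[[str], str],             # specialist agent
             combine: Callable[[List[str]], str]) -> str:
    """Multi-agent planning: decompose the task, solve subproblems in
    parallel, then synthesize. Parallelism pays off only when the
    subproblems are genuinely independent."""
    subproblems = decompose(task)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(solve, subproblems))  # preserves order
    return combine(partials)
```

Threads suffice here because LLM calls are I/O-bound; the wall-clock win is roughly the number of independent subproblems, while total token spend stays the same.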
Who: A staff engineer at a fintech company maintaining a 120,000-line Java 8 monolith.
Situation: The company mandated migration to Java 21 within three months to maintain long-term support coverage. The codebase used deprecated APIs, lacked records, and relied heavily on pre-pattern-matching switch statements.
Problem: A greedy single-pass migration agent converted files one at a time but introduced subtle behavioral regressions in 14% of files. Some deprecated API replacements changed exception-handling semantics, and the agent could not detect these issues until integration tests ran much later.
Decision: The team switched to a LATS-based agent that explored multiple migration strategies per file: conservative refactoring (minimal changes preserving exact behavior), idiomatic modernization (records, switch expressions, sealed classes), and a hybrid approach. Each branch was evaluated by running the test suite and measuring code quality metrics before committing.
Result: LATS kept the full test suite passing (zero regressions) across all migrated files. The total migration cost was 8x higher in API spend than the greedy approach, but the team saved an estimated three weeks of manual regression debugging.
Lesson: For high-stakes code transformations where correctness matters more than cost, tree-search agents that evaluate multiple strategies before committing outperform greedy single-pass approaches.
3. Reflection and Self-Correction
Reflection is the agent's ability to evaluate its own outputs and improve them. After completing a task or receiving feedback (from tools, tests, or humans), a reflective agent asks itself: "Did this work? What went wrong? How can I do better?" This self-evaluation step is what separates agents that improve over a task from those that simply execute a fixed strategy regardless of results.
Reflexion (Shinn et al., 2023) formalized this into a three-component architecture: an Actor that takes actions, an Evaluator that assesses the outcome, and a Self-Reflection module that generates verbal feedback. The reflection output is stored in memory and included in the prompt for subsequent attempts, enabling the agent to learn from its mistakes within a single task episode. This approach has shown significant improvements on coding benchmarks, where the agent can learn from failed test cases.
To build intuition for why reflection works: consider a student who writes an essay, then re-reads it with fresh eyes. The re-reading step catches errors that were invisible during writing because the student's mental state has "reset" enough to see the text as a reader would. LLM reflection works similarly. The critic prompt shifts the model's attention from generating to evaluating, engaging different patterns in the model's weights. This is why separate generator and critic roles outperform a single "generate then check" instruction: the role switch creates a meaningful shift in the model's behavior.
Reflection loops can get stuck in infinite self-criticism cycles where the agent repeatedly revises its output without making meaningful progress. Always set a maximum reflection budget (typically 2 to 3 iterations) and implement a "good enough" threshold that stops reflection when the output meets minimum quality criteria. Monitor reflection loops in production to identify patterns where agents waste tokens on unproductive self-evaluation.
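One way to enforce that budget is a loop like the following sketch, where `act` and `evaluate` are placeholders for the Actor and Evaluator LLM calls from the Reflexion architecture:

```python
from typing import Callable, Tuple

def reflect_loop(task: str,
                 act: Callable[[str, str], str],             # (task, feedback) -> draft
                 evaluate: Callable[[str], Tuple[float, str]],  # draft -> (score, critique)
                 threshold: float = 0.8,
                 max_rounds: int = 3) -> str:
    """Reflexion-style loop with a hard iteration cap and a 'good
    enough' gate, so the agent cannot spiral into endless
    self-criticism."""
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = act(task, feedback)
        score, critique = evaluate(draft)
        if score >= threshold:
            break                 # good enough: stop reflecting
        feedback = critique       # verbal feedback feeds the next attempt
    return draft
```

The two stopping conditions mirror the guidance above: `max_rounds` caps token spend unconditionally, and `threshold` implements the "good enough" gate that prevents revising an already-correct answer.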
Exercises
Compare the plan-and-execute architecture with a standard ReAct loop. In which scenarios does plan-and-execute outperform ReAct, and when might ReAct be preferable?
Answer Sketch
Plan-and-execute excels when the task has clear subtask decomposition and dependencies between steps (e.g., a multi-step data pipeline). ReAct is better for exploratory tasks where the next step depends on discoveries made in the current step (e.g., debugging). Plan-and-execute is more debuggable because the plan is explicit; ReAct is more adaptive because it re-evaluates after every action.
Implement a simplified Tree of Thoughts explorer that generates three candidate next steps for a problem, scores each using the LLM, and expands the highest-scoring branch. Test it on the problem: 'Write a function to find the longest palindromic substring.'
Answer Sketch
For each candidate step, prompt the LLM to rate it 1 to 10 on 'likelihood of leading to a correct and efficient solution.' Select the highest-rated branch, generate the next set of candidates, and repeat for a fixed depth (e.g., 3 levels). Compare the final solution to a single-pass chain-of-thought approach.
A Reflexion agent is allowed unlimited self-correction loops and spends 15 iterations refining an answer that was already correct after iteration 2. Propose a concrete stopping criterion that balances thoroughness with efficiency.
Answer Sketch
Use a combination of: (1) a maximum iteration cap (e.g., 3), (2) a 'delta check' that stops if the reflection produces no substantive changes to the answer, and (3) a test-based gate that stops as soon as all validation tests pass. In practice, 2 to 3 reflection rounds capture most of the value; additional rounds show diminishing returns.
A standard ReAct agent uses 5 LLM calls per task at $0.01 each. A LATS agent explores a tree of depth 3 with branching factor 3, requiring an LLM call per node. Calculate the cost ratio and discuss when LATS is economically justified.
Answer Sketch
LATS nodes: 3^0 + 3^1 + 3^2 + 3^3 = 1 + 3 + 9 + 27 = 40 calls. Cost ratio: (40 × $0.01) / (5 × $0.01) = 8×. LATS is justified when the cost of a wrong answer far exceeds the extra compute cost, such as code migrations where a bug could cause production outages, or financial decisions where errors have monetary consequences.
Extend the plan-and-execute code from this section to add a replanning step. When a step fails, the agent should generate a revised plan that accounts for the failure. Implement this as a LangGraph conditional edge.
Answer Sketch
Add a should_replan function that checks if the latest result contains an error or unexpected output. If so, route to a replan node that receives the original task, completed steps, and the failure message, then generates a new plan starting from the current state. Add an edge from replan back to execute.
Log each step of your agent's reasoning: the observation, the chosen action, the tool call parameters, and the result. This trace is essential for debugging failures and is the foundation for building evaluation datasets.
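A minimal sketch of such a trace logger; the JSON field names here are a hypothetical schema, not a standard:

```python
import json
import time

def log_step(trace: list, observation: str, action: str,
             params: dict, result: str) -> None:
    """Append one agent step to an in-memory trace; dump to JSONL
    later for debugging and for seeding evaluation datasets."""
    trace.append({
        "ts": time.time(),
        "observation": observation,
        "action": action,
        "params": params,
        "result": result,
    })

trace: list = []
log_step(trace, "user asked for auth migration plan", "search_docs",
         {"query": "JWT rotation"}, "found 3 relevant pages")
print(json.dumps(trace[0], indent=2))
```

Writing each entry as one JSON line keeps traces greppable and trivially loadable into an evaluation pipeline later.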
- ReAct agents interleave thinking and acting one step at a time; plan-and-execute agents create a full plan before acting.
- Plan-and-execute architectures excel at complex, multi-step tasks where global task decomposition matters more than reactive flexibility.
- Re-planning on failure is essential: even the best initial plan will encounter unexpected observations that require adaptation.
How does a standard ReAct loop differ from a plan-and-execute architecture?
A ReAct loop interleaves reasoning and action one step at a time, reacting to each observation. A plan-and-execute architecture first generates a full multi-step plan, then executes each step, with the ability to re-plan when assumptions are invalidated.
What advantages does plan-and-execute offer over ReAct for complex, multi-step tasks?
Plan-and-execute agents can reason about task decomposition upfront, allocate steps efficiently, detect dependencies between subtasks, and avoid the myopic decision-making that can cause ReAct agents to get stuck in local loops or miss the big picture.
What Comes Next
In the next section, Reasoning Models as Agent Backbones, we examine how specialized reasoning models (such as o1 and DeepSeek-R1) improve agent reliability through extended internal chain-of-thought before acting.
References and Further Reading
Planning and Reasoning in LLM Agents
Yao, S., et al. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." Introduces the Tree of Thoughts framework that enables LLMs to explore multiple reasoning paths using tree search, significantly improving performance on complex planning tasks.
Zhou, A., et al. (2023). "Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models." Proposes LATS, which combines Monte Carlo Tree Search with LLM-based reasoning, acting, and self-reflection, achieving state-of-the-art results on agentic tasks.
Shinn, N., et al. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." Introduces Reflexion, where agents learn from verbal feedback by reflecting on failures and storing insights in an episodic memory buffer.
Agent Reasoning Surveys
Huang, X., et al. (2024). "Understanding the Planning of LLM Agents: A Survey." Comprehensive survey of planning methods for LLM agents, covering task decomposition, multi-plan selection, external planner integration, and reflection-based refinement.
Yao, S., et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." The foundational paper that interleaves reasoning traces and actions, enabling agents to think step-by-step while interacting with external tools and environments.
Hao, S., et al. (2023). "Reasoning with Language Model is Planning with World Model." Frames LLM reasoning as planning by treating the language model as both a world model and a reasoning agent, bridging classical AI planning with modern LLM capabilities.
