Deep Research & Agentic RAG

Section 32.3

The best researchers do not simply search once. They search, reflect, refine, and search again until the full picture emerges.

RAG, Relentlessly Curious AI Agent
Big Picture

Naive RAG performs a single retrieval step, but complex research questions require multiple rounds of searching, reading, reflecting, and refining. Agentic RAG systems give the LLM the ability to decide what to search for, evaluate whether retrieved results are sufficient, generate follow-up queries, and synthesize findings from multiple sources. Building on the advanced retrieval techniques from Section 32.1, this transforms RAG from a simple retrieve-and-generate pattern into an autonomous research workflow. Section 32.3a extends this with full deep-research architectures (OpenAI/Gemini/Anthropic Deep Research), a competitive-intelligence case study, and a complete LlamaIndex example.

Prerequisites

Agentic RAG combines the retrieval techniques from Section 32.1 with agent design patterns. You should understand basic prompt engineering, as the agent loop relies on well-structured prompts to decide when and how to retrieve information.

A detective at a research desk surrounded by books and documents, following a looping process of searching, reading, thinking, and refining queries toward a glowing lightbulb
Figure 32.3.1: Agentic RAG iterates through search, evaluation, and refinement cycles until the full picture emerges.

32.3.1 From Single-Shot to Iterative Retrieval

Consider the research question: "How do the climate policies of the top 5 GDP countries compare in their approach to carbon taxation, and what evidence exists for the effectiveness of each approach?" This question cannot be answered with a single retrieval step. It requires identifying the top 5 GDP countries, finding each country's climate policy, extracting carbon taxation details, finding effectiveness studies for each approach, and then synthesizing the comparison.

Agentic RAG addresses this by giving the LLM a loop: plan what information is needed, retrieve it, evaluate whether it is sufficient, and either proceed to synthesis or generate follow-up queries. This iterative approach mirrors how a human researcher would tackle such a question, and it directly applies the agentic design patterns covered in Chapter 26.

Fun Fact

Agentic RAG systems can sometimes spiral into what practitioners call "research rabbit holes," where the agent keeps generating follow-up queries that get progressively further from the original question. Setting a maximum iteration count is less about compute cost and more about preventing your AI from writing a dissertation when you asked for a paragraph.

32.3.1.1 Query Decomposition

Key Insight

Query decomposition is where agentic RAG diverges most sharply from traditional RAG. In naive RAG, the user's question goes directly to the retriever. In agentic RAG, the LLM first plans the research strategy, much like a librarian who reads your question, thinks about which sections of the library to visit, and decides the order of lookups before pulling a single book off the shelf.

The first step in agentic RAG is decomposing a complex query into smaller, answerable sub-queries. Each sub-query targets a specific piece of information needed to construct the final answer. The decomposition can be sequential (each sub-query depends on the previous answer) or parallel (sub-queries are independent and can be executed concurrently).

from openai import OpenAI
import json
client = OpenAI()

def decompose_query(query):
    """Break a complex question into sub-queries with explicit dependencies."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": """Decompose the user's research question into sub-queries.
Return JSON with: "sub_queries" (list), "dependencies" (dict of index -> list),
"strategy" ("parallel" or "sequential")."""},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

plan = decompose_query(
    "How do carbon tax policies in the EU and US compare, "
    "and what evidence exists for their effectiveness?")
# Returns sub-queries like:
#   "What are the current carbon tax policies in the EU?"
#   "What are the current carbon tax policies in the US?"
#   "What studies evaluate EU carbon tax effectiveness?"
#   "What studies evaluate US carbon pricing effectiveness?"
Code Fragment 32.3.1: decompose_query asks the LLM to plan the research strategy, returning a list of focused sub-queries plus their dependency graph and execution strategy (parallel vs sequential).
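
The dependency map returned by decompose_query can drive the execution order. The following is a minimal sketch, assuming the map uses JSON string indices that point to lists of prerequisite indices (for example, {"3": [1, 2]}); execution_waves is a hypothetical helper name, not part of any library. It groups sub-queries into waves: queries within a wave are independent and can run in parallel, while the waves themselves run in sequence.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def execution_waves(sub_queries, dependencies):
    """Group sub-queries into waves; every query in a wave has all of its
    prerequisites answered by earlier waves, so a wave can run concurrently."""
    # Assumed shape: {"3": [1, 2]} means sub-query 3 needs the answers to 1 and 2.
    graph = {i: set() for i in range(len(sub_queries))}
    for idx, deps in dependencies.items():
        graph[int(idx)].update(int(d) for d in deps)

    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all queries whose prerequisites are done
        waves.append([sub_queries[i] for i in ready])
        sorter.done(*ready)
    return waves

# Illustrative plan, shaped like the JSON that decompose_query requests.
example_plan = {
    "sub_queries": [
        "Which countries have the 5 largest GDPs?",
        "What is each country's carbon tax policy?",
        "What studies evaluate each policy's effectiveness?",
    ],
    "dependencies": {"1": [0], "2": [1]},
}
for number, wave in enumerate(
        execution_waves(example_plan["sub_queries"], example_plan["dependencies"]), 1):
    print(f"wave {number}: {wave}")
```

A fully parallel plan collapses into a single wave, and a fully sequential plan yields one query per wave, so the same helper covers both strategies.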

32.3.2 Parallel Search and Multi-Source Retrieval

Once sub-queries are generated, an agentic RAG system can execute searches in parallel across multiple sources. Unlike naive RAG, which searches a single vector store, agentic RAG can simultaneously query document stores, web search APIs, databases, and specialized APIs, then combine results from all sources.

import asyncio

async def search_web(query):
    """Live web search; wire to Tavily, Serper, or Brave Search in production."""
    return []

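async def search_documents(query, collection):
    """Query the internal vector store; tag each hit with its source label."""
    # Assuming collection.query is a blocking (synchronous) call, run it in a
    # worker thread so it does not stall the other sources on the event loop.
    results = await asyncio.to_thread(collection.query, query_texts=[query], n_results=5)
    return [{"text": d, "source": "internal_docs"} for d in results["documents"][0]]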

async def search_database(query):
    """Text-to-SQL; see Section 32.4."""
    return []

async def search_one(query, collection):
    """Fan out one sub-query across all three sources concurrently."""
    web, docs, db = await asyncio.gather(
        search_web(query),
        search_documents(query, collection),
        search_database(query),
        return_exceptions=True,
    )
    return {
        "query": query,
        "web": [] if isinstance(web, Exception) else web,
        "docs": [] if isinstance(docs, Exception) else docs,
        "db": [] if isinstance(db, Exception) else db,
    }

async def multi_source_search(sub_queries, collection):
    return await asyncio.gather(*[search_one(q, collection) for q in sub_queries])
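Code Fragment 32.3.2: Fan-out retrieval across three sources with asyncio.gather. return_exceptions=True isolates failures so one broken source does not fail the whole query. It does not bound latency: a slow source still delays the combined result unless each call has its own timeout.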
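
Because asyncio.gather waits for the slowest coroutine, each source usually needs its own time budget. The sketch below is one way to add that, using asyncio.wait_for from the standard library. The helper name with_timeout and the two stand-in sources (fast_source, slow_source) are hypothetical and exist only to make the example runnable.

```python
import asyncio

async def with_timeout(coro, seconds):
    """Await coro, but give up and return an empty result list after `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return []  # treat a slow source the same way as a source with no hits

async def fast_source(query):
    # Stand-in for a source that answers immediately.
    return [{"text": f"hit for {query}", "source": "fast"}]

async def slow_source(query):
    # Stand-in for a source that takes longer than the budget allows.
    await asyncio.sleep(0.5)
    return [{"text": "late result", "source": "slow"}]

async def demo():
    # Both sources share the same 0.1 s budget; only the fast one returns hits.
    return await asyncio.gather(
        with_timeout(fast_source("carbon tax"), seconds=0.1),
        with_timeout(slow_source("carbon tax"), seconds=0.1),
    )

results = asyncio.run(demo())
print(results)
```

In search_one, each of the three source calls could be wrapped this way before being passed to asyncio.gather. The right budget depends on the source and is a tuning decision rather than a fixed rule.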

32.3.3 Iterative Refinement and Follow-Up Generation

After initial retrieval, the agent evaluates whether the gathered information is sufficient to answer the original question. If gaps remain, the agent generates follow-up queries targeting the missing information. This loop continues until the agent determines it has enough evidence or reaches a maximum iteration limit.

import json
from openai import OpenAI

client = OpenAI()  # same client as in Code Fragment 32.3.1

SUFFICIENCY_PROMPT = """Evaluate whether the gathered information is sufficient to
comprehensively answer the question. Return JSON with:
 - "sufficient": true/false
 - "missing": list of what information is still needed
 - "follow_up_queries": list of queries to fill gaps
 - "confidence": 0.0 to 1.0"""

def evaluate_and_refine(original_query, gathered_info, retrieve_for_queries, max_iterations=3):
    """Iterate retrieve-evaluate-refine until evidence is sufficient or budget runs out.

    retrieve_for_queries is a caller-supplied function that takes a list of query
    strings and returns a list of evidence items (for example, a wrapper around
    multi_source_search from Code Fragment 32.3.2).
    """
    evaluation = {"sufficient": False, "confidence": 0.0}
    for _ in range(max_iterations):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SUFFICIENCY_PROMPT},
                {"role": "user",
                 "content": f"Question: {original_query}\n\nEvidence:\n{json.dumps(gathered_info, indent=2)}"},
            ],
            response_format={"type": "json_object"},
        )
        evaluation = json.loads(resp.choices[0].message.content)
        # The model may omit keys, so read them defensively.
        if evaluation.get("sufficient") or evaluation.get("confidence", 0.0) > 0.85:
            break
        follow_ups = evaluation.get("follow_up_queries", [])
        if not follow_ups:
            break  # nothing new to search for, so another pass would not change the evidence
        gathered_info.extend(retrieve_for_queries(follow_ups))
    return gathered_info, evaluation
Code Fragment 32.3.3: Iterative evaluate-and-refine loop. The LLM assesses whether gathered information is sufficient and generates follow-up queries to fill gaps; max_iterations caps cost.
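
One risk of this loop is that follow-up queries drift away from the original question. A simple guard is to compare each follow-up with the original question and drop those that fall below a similarity threshold. The sketch below illustrates the idea with a toy bag-of-words cosine similarity so that it runs without external services. In a real system an embedding model would replace bow_vector, and the 0.2 threshold is an arbitrary assumption that would need tuning. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

def bow_vector(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def filter_drifting_queries(original_query, follow_ups, min_similarity=0.2):
    """Keep only follow-up queries that stay close to the original question."""
    anchor = bow_vector(original_query)
    return [q for q in follow_ups if cosine(anchor, bow_vector(q)) >= min_similarity]

original = "How do carbon tax policies in the EU and US compare?"
candidates = [
    "What studies evaluate EU carbon tax effectiveness?",
    "Who won the 2022 football world cup?",
]
print(filter_drifting_queries(original, candidates))
```

Applied to the follow_up_queries list before retrieval, a filter like this keeps the research budget focused on the user's question. It complements the iteration cap and does not replace it.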

32.3.4 Source Credibility Assessment

Not all retrieved sources are equally trustworthy. Agentic RAG systems benefit from explicit credibility assessment before passing sources to the synthesis step. Five key credibility signals:

In practice, assign each source a 0 to 1 credibility score by combining signals (domain whitelist, age in days, citation count where available), then use the score to weight context placement: the most trustworthy evidence goes first in the synthesis prompt, low-credibility sources are either excluded or explicitly tagged.
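
The scoring step described above can be sketched as a weighted combination of the three signals. Everything specific in this example is an assumption made for illustration: the whitelist contents, the weights (0.5, 0.3, 0.2), the one-year recency half-life, the citation cap, and the 0.4 cut-off. None of these values come from a standard, and they should be calibrated against your own sources.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

TRUSTED_DOMAINS = {"europa.eu", "oecd.org", "nature.com"}  # hypothetical whitelist

@dataclass
class Source:
    url: str
    age_days: int
    citations: Optional[int] = None  # None when no citation count is available

def credibility_score(src, half_life_days=365.0, citation_cap=100):
    """Combine domain, recency, and citation signals into a score in [0, 1]."""
    host = urlparse(src.url).hostname or ""
    trusted = any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)
    domain_score = 1.0 if trusted else 0.3
    # Recency halves every half_life_days; negative ages are clamped to zero.
    recency_score = 0.5 ** (max(src.age_days, 0) / half_life_days)

    signals = [(0.5, domain_score), (0.3, recency_score)]
    if src.citations is not None:
        signals.append((0.2, min(src.citations, citation_cap) / citation_cap))
    # Renormalize so a missing citation count does not penalize the source.
    total_weight = sum(weight for weight, _ in signals)
    return sum(weight * score for weight, score in signals) / total_weight

def rank_sources(sources, min_score=0.4):
    """Drop low-credibility sources and return (score, source) pairs, best first."""
    scored = [(credibility_score(s), s) for s in sources]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

official = Source("https://ec.europa.eu/report", age_days=0, citations=100)
old_blog = Source("https://random-blog.example/post", age_days=3650)
for score, source in rank_sources([old_blog, official]):
    print(f"{score:.2f}  {source.url}")
```

Instead of dropping sources below the cut-off, the same score could be attached to each source as an explicit low-credibility tag, which is the alternative the text mentions.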

32.3.5 Synthesis and Report Generation

The final step combines the gathered (and credibility-scored) sources into a coherent answer. Effective synthesis prompts instruct the LLM to: (1) place the most credible evidence first; (2) cite sources inline; (3) handle source disagreement explicitly rather than silently picking one version. When two high-credibility sources contradict each other, the system should present both perspectives with their supporting evidence and let the reader decide. This "epistemic honesty" approach builds far more user trust than confidently presenting a single answer that papers over genuine uncertainty.
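
The three instructions above can be turned into a prompt builder. This is a minimal sketch that assumes each source arrives as a (score, text, label) tuple. The function name build_synthesis_prompt, the tuple layout, and the exact wording of the instructions are illustrative choices, not a prescribed format.

```python
def build_synthesis_prompt(question, scored_sources):
    """Build a synthesis prompt from (score, text, label) tuples.

    Sources are ordered from most to least credible and numbered so the
    model can cite them inline.
    """
    ordered = sorted(scored_sources, key=lambda item: item[0], reverse=True)
    evidence = "\n".join(
        f"[{number}] (credibility {score:.2f}, {label}) {text}"
        for number, (score, text, label) in enumerate(ordered, start=1)
    )
    instructions = (
        "Answer the question using only the numbered evidence below.\n"
        "- Evidence is listed from most to least credible.\n"
        "- Cite sources inline with their bracketed numbers, e.g. [1].\n"
        "- If sources disagree, present each position with its supporting "
        "evidence instead of silently choosing one."
    )
    return f"{instructions}\n\nQuestion: {question}\n\nEvidence:\n{evidence}"

prompt = build_synthesis_prompt(
    "Has the carbon tax reduced emissions?",
    [
        (0.40, "A blog post claims the tax had no measurable effect.", "blog"),
        (0.90, "A government report finds emissions fell after the tax.", "gov"),
    ],
)
print(prompt)
```

The resulting string can be sent as the user message of a chat completion. Keeping the credibility score visible in each evidence line gives the model a basis for weighing conflicting claims.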

Key Insight

The most effective synthesis prompts instruct the LLM to handle source disagreement explicitly rather than silently picking one version. When two high-credibility sources contradict each other, the system should present both perspectives with their supporting evidence and let the reader decide.

Key Takeaways
Self-Check
Q1: Why does a complex research question require iterative retrieval rather than a single retrieval step?
Answer:
Complex questions often require multiple pieces of information that cannot be captured by a single query. They may need sequential resolution (the answer to one sub-question is needed to formulate the next), information from different source types, and verification through cross-referencing. A single retrieval step would miss these dependencies.
Q2: What is the difference between parallel and sequential query decomposition?
Answer:
Parallel decomposition produces independent sub-queries that can all be executed simultaneously (e.g., "What is country A's policy?" and "What is country B's policy?"). Sequential decomposition produces dependent sub-queries where each depends on the answer to the previous one (e.g., "Who are the top 5 GDP countries?" must be answered before "What is each country's carbon tax rate?").
Q3: How should a synthesis prompt handle conflicting information from multiple sources?
Answer:
Rather than silently picking one version, the synthesis should present both perspectives with supporting evidence and source citations. Note which sources agree and disagree, highlight claims supported by only one source vs multiple, prioritize high-credibility sources, and let the reader make the final judgment.

Exercises

Exercise 32.3.1: Single-shot vs. iterative (Conceptual)

Explain why a single retrieval step is insufficient for complex research questions. Give an example of a question that requires iterative retrieval.

Answer:

Complex questions have information dependencies: the answer to one sub-question determines what to search next. Example: "How does Company X's revenue growth compare to its main competitors over the last 5 years?" requires first identifying competitors, then finding revenue data for each.

Exercise 32.3.2: Query drift (Conceptual)

Define query drift in the context of agentic RAG. How can you detect and prevent it?

Answer:

Query drift occurs when follow-up queries gradually shift away from the original topic. Detect it by computing semantic similarity between each follow-up query and the original question. Prevent it by always including the original question as context when generating follow-up queries, and by setting a maximum drift threshold.

Exercise 32.3.3: Iterative retrieval (Coding)

Implement a simple agentic RAG loop: (1) decompose a complex question into sub-questions, (2) retrieve for each sub-question, (3) evaluate if the combined information is sufficient, (4) generate follow-up queries if not.

What Comes Next

This section continues in Section 32.3a: Deep Research Architectures & Production Patterns, which compares Naive RAG vs Agentic RAG vs Deep Research, walks through the Plan-Gather-Verify-Refine-Synthesize pipeline used by OpenAI/Gemini/Anthropic Deep Research, and includes a competitive-intelligence case study plus a complete LlamaIndex example.

Further Reading
Asai, A., Wu, Z., Wang, Y., et al. (2024). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." ICLR 2024. Trains a single model to decide when to retrieve, what to retrieve, and how to critique the result.
Yan, S.-Q., Gu, J.-C., Zhu, Y., Ling, Z.-H. (2024). "Corrective Retrieval Augmented Generation." arXiv:2402.03367. Adds a lightweight retrieval evaluator that classifies retrieved documents as correct, ambiguous, or incorrect, and triggers web-search fallback when needed.