The best researchers do not simply search once. They search, reflect, refine, and search again until the full picture emerges.
RAG, Relentlessly Curious AI Agent
Naive RAG performs a single retrieval step, but complex research questions require multiple rounds of searching, reading, reflecting, and refining. Agentic RAG systems give the LLM the ability to decide what to search for, evaluate whether retrieved results are sufficient, generate follow-up queries, and synthesize findings from multiple sources. Building on the advanced retrieval techniques from Section 32.1, this transforms RAG from a simple retrieve-and-generate pattern into an autonomous research workflow. Section 32.3a extends this with full deep-research architectures (OpenAI/Gemini/Anthropic Deep Research), a competitive-intelligence case study, and a complete LlamaIndex example.
Prerequisites
Agentic RAG combines the retrieval techniques from Section 32.1 with agent design patterns. You should understand basic prompt engineering, as the agent loop relies on well-structured prompts to decide when and how to retrieve information.
32.3.1 From Single-Shot to Iterative Retrieval
Consider the research question: "How do the climate policies of the top 5 GDP countries compare in their approach to carbon taxation, and what evidence exists for the effectiveness of each approach?" This question cannot be answered with a single retrieval step. It requires identifying the top 5 GDP countries, finding each country's climate policy, extracting carbon taxation details, finding effectiveness studies for each approach, and then synthesizing the comparison.
Agentic RAG addresses this by giving the LLM a loop: plan what information is needed, retrieve it, evaluate whether it is sufficient, and either proceed to synthesis or generate follow-up queries. This iterative approach mirrors how a human researcher would tackle such a question, and it directly applies the agentic design patterns covered in Chapter 26.
Agentic RAG systems can sometimes spiral into what practitioners call "research rabbit holes," where the agent keeps generating follow-up queries that get progressively further from the original question. Setting a maximum iteration count is less about compute cost and more about preventing your AI from writing a dissertation when you asked for a paragraph.
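The loop and its iteration cap can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `research_loop` and the three callables it takes (`plan`, `retrieve`, `evaluate`) are hypothetical names, and the toy stand-ins exist only so the sketch runs without an LLM or a search backend.

```python
from typing import Callable

def research_loop(
    question: str,
    plan: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    evaluate: Callable[[str, list[str]], list[str]],
    max_iterations: int = 3,
) -> tuple[list[str], int]:
    """Plan, retrieve, evaluate; repeat with follow-ups until done or out of budget.

    `evaluate` returns follow-up queries; an empty list means "sufficient".
    Returns the gathered evidence and the number of rounds used.
    """
    evidence: list[str] = []
    queries = plan(question)
    for iteration in range(1, max_iterations + 1):
        for q in queries:
            evidence.extend(retrieve(q))
        queries = evaluate(question, evidence)
        if not queries:  # evaluator is satisfied, stop early
            return evidence, iteration
    return evidence, max_iterations  # budget exhausted; synthesize with what we have

# Toy stand-ins (assumptions for illustration only).
def toy_plan(question):
    return ["carbon tax EU", "carbon tax US"]

def toy_retrieve(query):
    return [f"note on {query}"]

def toy_evaluate(question, evidence):
    # Pretend three pieces of evidence are enough.
    return [] if len(evidence) >= 3 else ["carbon tax effectiveness studies"]

evidence, rounds = research_loop(
    "Compare EU and US carbon taxes", toy_plan, toy_retrieve, toy_evaluate
)
```

Passing the three steps in as functions keeps the stopping rule in one place: the loop ends either when the evaluator returns no follow-ups or when `max_iterations` is reached, whichever comes first.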
32.3.1.1 Query Decomposition
Query decomposition is where agentic RAG diverges most sharply from traditional RAG. In naive RAG, the user's question goes directly to the retriever. In agentic RAG, the LLM first plans the research strategy, much like a librarian who reads your question, thinks about which sections of the library to visit, and decides the order of lookups before pulling a single book off the shelf.
The first step in agentic RAG is decomposing a complex query into smaller, answerable sub-queries. Each sub-query targets a specific piece of information needed to construct the final answer. The decomposition can be sequential (each sub-query depends on the previous answer) or parallel (sub-queries are independent and can be executed concurrently).
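The difference between sequential and parallel sub-queries can be made concrete with a small scheduling sketch. It assumes a dependency map from each sub-query index to the indices it must wait for; that interpretation, the function name `execution_waves`, and the sample plan are all illustrative assumptions. Sub-queries in the same wave are independent and could run concurrently, while later waves wait for earlier ones.

```python
def execution_waves(sub_queries, dependencies):
    """Group sub-query indices into waves; each wave depends only on earlier waves."""
    # JSON object keys arrive as strings ("2"), so normalize them to ints.
    deps = {int(k): set(v) for k, v in dependencies.items()}
    remaining = set(range(len(sub_queries)))
    done, waves = set(), []
    while remaining:
        # A sub-query is ready once everything it depends on has finished.
        ready = sorted(i for i in remaining if deps.get(i, set()) <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return waves

# Hand-written plan for illustration (not model output).
sub_queries = [
    "Which countries have the five largest GDPs?",
    "What is the carbon taxation policy of each of those countries?",
    "What studies evaluate the effectiveness of each policy?",
    "How is carbon tax effectiveness usually measured?",
]
dependencies = {"1": [0], "2": [1]}

waves = execution_waves(sub_queries, dependencies)
# waves == [[0, 3], [1], [2]]
```

Here sub-queries 0 and 3 have no prerequisites and form the first wave; 1 needs the answer to 0, and 2 needs the answer to 1, so they run in sequence afterwards.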
from openai import OpenAI
import json

client = OpenAI()

def decompose_query(query):
    """Break a complex question into sub-queries with explicit dependencies."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": """Decompose the user's research question into sub-queries.
Return JSON with: "sub_queries" (list), "dependencies" (dict of index -> list),
"strategy" ("parallel" or "sequential")."""},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

plan = decompose_query(
    "How do carbon tax policies in the EU and US compare, "
    "and what evidence exists for their effectiveness?")
# Returns sub-queries like:
# "What are the current carbon tax policies in the EU?"
# "What are the current carbon tax policies in the US?"
# "What studies evaluate EU carbon tax effectiveness?"
# "What studies evaluate US carbon pricing effectiveness?"
decompose_query asks the LLM to plan the research strategy, returning a list of focused sub-queries plus their dependency graph and execution strategy (parallel vs sequential).
32.3.2 Parallel Search and Multi-Source Retrieval
Once sub-queries are generated, an agentic RAG system can execute searches in parallel across multiple sources. Unlike naive RAG, which searches a single vector store, agentic RAG can simultaneously query document stores, web search APIs, databases, and specialized APIs, then combine results from all sources.
import asyncio

async def search_web(query):
    """Live web search; wire to Tavily, Serper, or Brave Search in production."""
    return []

async def search_documents(query, collection):
    """Query the internal vector store; tag each hit with its source label."""
    # collection.query is a blocking call, so run it in a worker thread
    # to keep the event loop free for the other sources.
    results = await asyncio.to_thread(
        collection.query, query_texts=[query], n_results=5)
    return [{"text": d, "source": "internal_docs"} for d in results["documents"][0]]

async def search_database(query):
    """Text-to-SQL; see Section 32.4."""
    return []

async def search_one(query, collection):
    """Fan out one sub-query across all three sources concurrently."""
    web, docs, db = await asyncio.gather(
        search_web(query),
        search_documents(query, collection),
        search_database(query),
        return_exceptions=True,  # failures come back as values instead of raising
    )
    return {
        "query": query,
        "web": [] if isinstance(web, Exception) else web,
        "docs": [] if isinstance(docs, Exception) else docs,
        "db": [] if isinstance(db, Exception) else db,
    }

async def multi_source_search(sub_queries, collection):
    """Run every sub-query against every source concurrently."""
    return await asyncio.gather(*[search_one(q, collection) for q in sub_queries])
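search_one fans each sub-query out across all sources with asyncio.gather. return_exceptions=True isolates failures so one broken source does not kill the whole query; a slow source still delays the result unless you add a per-source timeout.
32.3.3 Iterative Refinement and Follow-Up Generation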
After initial retrieval, the agent evaluates whether the gathered information is sufficient to answer the original question. If gaps remain, the agent generates follow-up queries targeting the missing information. This loop continues until the agent determines it has enough evidence or reaches a maximum iteration limit.
import json

SUFFICIENCY_PROMPT = """Evaluate whether the gathered information is sufficient to
comprehensively answer the question. Return JSON with:
- "sufficient": true/false
- "missing": list of what information is still needed
- "follow_up_queries": list of queries to fill gaps
- "confidence": 0.0 to 1.0"""

# `client` is the OpenAI client created in the decomposition example.
# `retrieve_for_queries` is not defined in this section; it stands for any
# retriever that maps a list of queries to a list of evidence items.
def evaluate_and_refine(original_query, gathered_info, max_iterations=3):
    """Iterate retrieve-evaluate-refine until evidence is sufficient or budget runs out."""
    evaluation = {"sufficient": False, "confidence": 0.0}
    for _ in range(max_iterations):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SUFFICIENCY_PROMPT},
                {"role": "user",
                 "content": f"Question: {original_query}\n\nEvidence:\n{json.dumps(gathered_info, indent=2)}"},
            ],
            response_format={"type": "json_object"},
        )
        evaluation = json.loads(resp.choices[0].message.content)
        # Use .get so a missing key in the model's JSON does not raise KeyError.
        if evaluation.get("sufficient") or evaluation.get("confidence", 0.0) > 0.85:
            break
        gathered_info.extend(retrieve_for_queries(evaluation.get("follow_up_queries", [])))
    return gathered_info, evaluation
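evaluate_and_refine repeats the retrieve-evaluate-refine cycle until the evidence is judged sufficient; max_iterations caps cost.
32.3.4 Source Credibility Assessment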
Not all retrieved sources are equally trustworthy. Agentic RAG systems benefit from explicit credibility assessment before passing sources to the synthesis step. Five key credibility signals:
- Source authority: recognized expertise in the domain (peer-reviewed journal > preprint > blog).
- Recency: newer information for time-sensitive topics.
- Consistency: corroboration across multiple independent sources.
- Specificity: concrete data and citations versus vague claims.
- Bias indicators: commercial interests, political slant, or advocacy motivations that might skew the information.
In practice, assign each source a 0 to 1 credibility score by combining signals (domain whitelist, age in days, citation count where available), then use the score to weight context placement: the most trustworthy evidence goes first in the synthesis prompt, low-credibility sources are either excluded or explicitly tagged.
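One way to combine these signals is a weighted sum, sketched below. Everything specific here is an assumption for illustration: the `Source` fields, the domain whitelist values, the weights, the one-year recency half-life, and the 0.4 cutoff would all need tuning for a real corpus.

```python
from dataclasses import dataclass

# Illustrative values only; tune the whitelist and weights for your domain.
TRUSTED_DOMAINS = {"nature.com": 1.0, "europa.eu": 0.9, "arxiv.org": 0.7}
WEIGHTS = {"authority": 0.5, "recency": 0.3, "citations": 0.2}  # sums to 1.0

@dataclass
class Source:
    text: str
    domain: str
    age_days: int
    citations: int = 0

def credibility(src, half_life_days=365.0, citation_cap=100):
    """Combine three signals, each in [0, 1], into one score in [0, 1]."""
    authority = TRUSTED_DOMAINS.get(src.domain, 0.3)  # unknown domains get a low prior
    recency = 0.5 ** (src.age_days / half_life_days)  # halves every half_life_days
    citations = min(src.citations, citation_cap) / citation_cap
    return (WEIGHTS["authority"] * authority
            + WEIGHTS["recency"] * recency
            + WEIGHTS["citations"] * citations)

def rank_sources(sources, min_score=0.4):
    """Return (score, source) pairs, most credible first, dropping weak sources."""
    scored = sorted(((credibility(s), s) for s in sources),
                    key=lambda pair: pair[0], reverse=True)
    return [(score, s) for score, s in scored if score >= min_score]

sources = [
    Source("Blog post on carbon pricing", "example-blog.com", age_days=730),
    Source("Peer-reviewed evaluation", "nature.com", age_days=0, citations=100),
]
ranked = rank_sources(sources)
```

In this toy input the peer-reviewed source ranks first and the two-year-old blog post falls below the cutoff. Instead of dropping low scorers, a system could keep them and tag them explicitly in the synthesis prompt.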
32.3.5 Synthesis and Report Generation
The final step combines the gathered (and credibility-scored) sources into a coherent answer. Effective synthesis prompts instruct the LLM to: (1) place the most credible evidence first; (2) cite sources inline; (3) handle source disagreement explicitly rather than silently picking one version. When two high-credibility sources contradict each other, the system should present both perspectives with their supporting evidence and let the reader decide. This "epistemic honesty" approach builds far more user trust than confidently presenting a single answer that papers over genuine uncertainty.
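A synthesis prompt that follows these three instructions might be assembled as in the sketch below. The function name `build_synthesis_prompt`, the `(score, text)` input format, and the instruction wording are assumptions for illustration; the resulting string would be sent to whichever chat model the system uses.

```python
def build_synthesis_prompt(question, scored_sources):
    """Build a prompt from (score, text) pairs; higher score means more credible."""
    # (1) Most credible evidence first.
    ordered = sorted(scored_sources, key=lambda pair: pair[0], reverse=True)
    evidence = "\n".join(
        f"[{i}] (credibility {score:.2f}) {text}"
        for i, (score, text) in enumerate(ordered, start=1)
    )
    instructions = (
        "Answer the question using only the numbered evidence below.\n"
        # (2) Inline citations by bracketed number.
        "Cite sources inline with their bracketed numbers, e.g. [1].\n"
        # (3) Surface disagreement instead of hiding it.
        "If sources disagree, present each position with its supporting "
        "evidence instead of choosing one silently."
    )
    return f"{instructions}\n\nQuestion: {question}\n\nEvidence:\n{evidence}"

prompt = build_synthesis_prompt(
    "Do carbon taxes reduce emissions?",
    [(0.40, "Industry report: effect is negligible."),
     (0.90, "Peer-reviewed study: emissions fell after introduction.")],
)
```

Numbering the evidence after sorting means citation markers in the answer map directly onto the ranked list, which makes it easier to check which claims rest on weaker sources.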
- Agentic RAG transforms retrieval into research: a plan-retrieve-evaluate loop makes complex multi-faceted questions tractable through iterative decomposition and refinement.
- Query decomposition is the foundation: breaking complex questions into focused sub-queries (parallel or sequential) enables targeted retrieval.
- Multi-source retrieval combines complementary strengths: web search for breadth and recency, document stores for curated depth, databases for structured data; asyncio.gather with return_exceptions=True isolates source failures.
- Source credibility prevents misinformation amplification: score by authority, recency, consistency, specificity, and bias before synthesis.
- Synthesis should be epistemically honest: present source disagreement explicitly rather than papering over it.
Exercises
Explain why a single retrieval step is insufficient for complex research questions. Give an example of a question that requires iterative retrieval.
Answer:
Complex questions have information dependencies: the answer to one sub-question determines what to search next. Example: "How does Company X's revenue growth compare to its main competitors over the last 5 years?" requires first identifying competitors, then finding revenue data for each.
Define query drift in the context of agentic RAG. How can you detect and prevent it?
Answer:
Query drift occurs when follow-up queries gradually shift away from the original topic. Detect it by computing semantic similarity between each follow-up query and the original question. Prevent it by always including the original question as context when generating follow-up queries, and by setting a maximum drift threshold.
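A minimal sketch of the similarity check follows. It uses a toy bag-of-words cosine similarity so that it runs without any model; in practice you would swap `bow` for a real embedding model. The function names and the 0.2 threshold are illustrative assumptions, not recommended values.

```python
import math
import re
from collections import Counter

def bow(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def filter_drifted(original, follow_ups, min_similarity=0.2):
    """Keep follow-ups whose similarity to the original question meets the threshold."""
    anchor = bow(original)
    return [q for q in follow_ups if cosine(anchor, bow(q)) >= min_similarity]

original = "How do carbon tax policies in the EU and US compare?"
follow_ups = [
    "What is the current EU carbon tax rate?",
    "Who invented the steam engine?",
]
kept = filter_drifted(original, follow_ups)
# kept == ["What is the current EU carbon tax rate?"]
```

Comparing every follow-up against the original question, not against the previous follow-up, is what stops small shifts from compounding across iterations.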
Implement a simple agentic RAG loop: (1) decompose a complex question into sub-questions, (2) retrieve for each sub-question, (3) evaluate if the combined information is sufficient, (4) generate follow-up queries if not.
What Comes Next
This section continues in Section 32.3a: Deep Research Architectures & Production Patterns, which compares Naive RAG vs Agentic RAG vs Deep Research, walks through the Plan-Gather-Verify-Refine-Synthesize pipeline used by OpenAI/Gemini/Anthropic Deep Research, and includes a competitive-intelligence case study plus a complete LlamaIndex example.