"Data does not speak for itself. It needs an agent who can read, query, plot, and then explain what it all means."
Agent X, Methodically Analytical AI Agent
Research and data analysis agents turn the agentic loop into a systematic investigation process. Unlike simple RAG systems that retrieve and summarize, research agents plan their investigation strategy, execute multi-step information gathering, evaluate source quality, identify gaps, and produce comprehensive reports with citations. Data analysis agents extend this to structured data: they write and execute code, generate visualizations, and iterate on their analysis until the results are sound. This section covers deep research architectures (as seen in OpenAI Deep Research and Gemini Deep Research), data analysis agent patterns, and the quality control mechanisms that separate reliable agents from unreliable ones.
Prerequisites
This section builds on agent foundations from Chapter 26, tool use from Chapter 27, and multi-agent patterns from Chapter 28.
29.3.1 Deep Research Agents
OpenAI's Deep Research feature launched in February 2025 and was internally code-named "GPT Researcher", borrowing from an open-source project of the same name. The deep-research pattern (plan, search, synthesize, reflect) was reportedly inspired by the chapter on "problem-solving agents" in Stuart Russell and Peter Norvig's textbook; one team member quipped that the LLM era had finally caught up with the 1995 edition of AIMA.
Research agents automate the process of gathering, analyzing, and synthesizing information from multiple sources. The workflow mirrors how a human researcher works: formulate a question, search for sources, read and evaluate them, identify what is still missing, search again, and synthesize.
The plan-and-execute architecture from Section 29.2 is the natural fit for research agents. The planning phase generates a research outline with specific questions to answer. The execution phase uses search tools (web search, academic search, database queries) to find relevant sources for each question. A synthesis phase combines findings into a coherent report. A reflection phase identifies gaps and triggers additional research cycles. OpenAI's Deep Research and Gemini's Deep Research features implement this pattern at scale.
The Plan-Execute-Reflect Loop
The four phases just described are not a straight pipeline; they are a loop whose continuation depends on what the agent has found so far. The skeleton below makes the control flow explicit: a planner turns the question into a set of sub-questions, the executor retrieves and reads sources for each open sub-question and extracts claims, and a reflection step scores how much of the question is now covered and proposes new sub-questions for whatever remains. The loop repeats until coverage crosses a threshold or a budget (search calls or wall-clock) is exhausted, whichever comes first. The two stopping conditions are both necessary: coverage alone can loop forever on an unanswerable question, and a budget alone would stop a nearly-finished investigation one search short.
# Deep-research control loop: plan sub-questions, retrieve + read +
# extract per sub-question, then reflect to find gaps and replan.
# Stops when coverage clears a threshold or the search budget runs out.
# Skeleton only: Report, Finding, and the helper functions are assumed
# to be defined elsewhere.
def deep_research(
    question: str,
    coverage_target: float = 0.8,
    max_searches: int = 40,
) -> Report:
    open_questions = plan_subquestions(question)  # initial decomposition
    findings: list[Finding] = []
    searches_used = 0

    while open_questions and searches_used < max_searches:
        sq = open_questions.pop(0)
        sources = retrieve(sq)  # web/academic/db search tools
        searches_used += 1
        for src in sources:
            claims = read_and_extract(src, sq)  # claims grounded in src
            cred = credibility(src, findings)  # see scoring below
            findings.append(Finding(sq, src, claims, cred))

        # Reflect: how much of the question is covered, and what is missing?
        coverage, gaps = reflect(question, findings)
        if coverage >= coverage_target:
            break  # enough evidence; stop early
        # Replan: chase the gaps, skipping any that are already queued
        open_questions.extend(g for g in gaps if g not in open_questions)

    return synthesize(question, findings)  # report with citations
reflect returns both a scalar coverage and a list of gaps (new sub-questions), so the same call decides whether to stop and what to do next if it does not. The while guard enforces the max_searches budget (and exits when no open sub-questions remain), while the coverage target is checked by the break that follows reflect; together these are the two stopping conditions discussed above.
Quality control is the critical differentiator between good and poor research agents. Effective research agents implement source credibility scoring (preferring academic papers over blog posts, primary sources over secondary ones), cross-reference verification (checking claims against multiple independent sources), recency filtering (prioritizing recent information for fast-moving topics), and explicit uncertainty flagging (noting when findings conflict or when evidence is limited).
Scoring Credibility and Gaps
The two scalars that drive the loop, the per-source credibility and the overall coverage, are easy to assert and harder to make mechanical. Making them concrete is what turns "the agent prefers reliable sources" from a slogan into a number the loop can act on. We score each source's credibility as a weighted sum of three signals: a source-type prior (a peer-reviewed paper or official documentation starts higher than a forum post), a recency term that decays with age (important for fast-moving topics, near-irrelevant for settled ones), and a corroboration count capturing how many independent already-collected sources assert the same claim. Writing p for the source-type prior and r for recency, both on a 0-to-1 scale, and c for the corroboration count (a non-negative integer that the formula normalizes to the same scale):
credibility = 0.5 · p + 0.2 · r + 0.3 · min(c / 3, 1)
The weights say that what the source is matters most (0.5), corroboration matters next (0.3), and recency is a tie-breaker (0.2); the min(c / 3, 1) caps the corroboration bonus so that three independent confirmations saturate it and a fourth adds nothing. Worked example: a peer-reviewed paper (p = 0.9) published 18 months ago on a topic with a two-year half-life (r ≈ 0.6) corroborated by two other sources (c = 2, so min(2/3, 1) = 0.667) scores 0.5 · 0.9 + 0.2 · 0.6 + 0.3 · 0.667 = 0.45 + 0.12 + 0.20 = 0.77. A recent but anonymous blog post (p = 0.3, r = 1.0, c = 0) scores only 0.15 + 0.20 + 0 = 0.35, so the loop trusts the paper roughly twice as much and weights its claims accordingly during synthesis.
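The worked example can be checked with a few lines of Python. This is a minimal sketch, not the scoring used by any particular product: the function names credibility_score and recency are illustrative, and the exponential half-life decay is an assumption chosen because it reproduces the r ≈ 0.6 figure for an 18-month-old source on a topic with a two-year half-life.

```python
def recency(age_months: float, half_life_months: float) -> float:
    """Assumed decay model: 1.0 for a brand-new source, 0.5 after one half-life."""
    return 0.5 ** (age_months / half_life_months)


def credibility_score(p: float, r: float, c: int) -> float:
    """Weighted sum from the text: 0.5*p + 0.2*r + 0.3*min(c/3, 1)."""
    return 0.5 * p + 0.2 * r + 0.3 * min(c / 3, 1.0)


# Peer-reviewed paper, 18 months old, two-year half-life, two corroborating sources.
paper = credibility_score(p=0.9, r=recency(18, 24), c=2)

# Recent anonymous blog post with no corroboration.
blog = credibility_score(p=0.3, r=1.0, c=0)

print(f"paper={paper:.2f} blog={blog:.2f}")
```

Because the corroboration term is capped, passing c = 3 or c = 10 yields the same score, which matches the saturation behavior described above.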
Coverage is the complement of the gap score. If the planner has identified a set of sub-questions and we mark a sub-question covered once it has at least one finding whose credibility clears a floor (say 0.5), then the gap score is the uncovered fraction and coverage is one minus it:
gap_score = uncovered_subquestions / total_subquestions, coverage = 1 − gap_score
Worked example: a research outline with 5 sub-questions where 4 have at least one credible finding gives gap_score = 1 / 5 = 0.2 and coverage = 0.8. With the default coverage_target = 0.8 from Code Fragment 29.3.1b, the loop would stop here; lower the target to 0.6 and it stops one sub-question sooner, raise it to 1.0 and it keeps chasing the last gap until it finds a credible source or burns through max_searches. Because both numbers are computed from the same findings list in a single reflect pass, the stopping decision and the gap list it produces are always consistent with each other, which is the property that makes the quality-control step mechanistic rather than asserted.
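The coverage computation can be sketched as a single function that returns both numbers at once. The function name, the dictionary layout (each sub-question mapped to the best credibility among its findings, 0.0 if it has none), and the sub-question labels are all illustrative assumptions rather than part of any specific framework.

```python
def coverage_and_gaps(
    best_credibility: dict[str, float],
    floor: float = 0.5,
) -> tuple[float, list[str]]:
    """Return (coverage, uncovered sub-questions) from one pass over the data."""
    if not best_credibility:
        raise ValueError("at least one sub-question is required")
    gaps = [sq for sq, cred in best_credibility.items() if cred < floor]
    gap_score = len(gaps) / len(best_credibility)
    return 1.0 - gap_score, gaps


# Five hypothetical sub-questions; four have a finding that clears the floor.
outline = {
    "pricing": 0.77,
    "benchmarks": 0.35,  # only a weak source so far, so still a gap
    "index types": 0.81,
    "cloud integrations": 0.64,
    "funding": 0.58,
}
coverage, gaps = coverage_and_gaps(outline)
print(round(coverage, 2), gaps)
```

Returning the gap list alongside the scalar is what keeps the stopping decision and the replanning input consistent: both come from the same filter over the same data.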
Who: A solutions architect at a mid-size SaaS company tasked with selecting a vector database for their new semantic search feature.
Situation: The architect needed a comprehensive competitive analysis of the top 5 vector database providers, covering pricing, performance benchmarks, supported index types, cloud integrations, and recent funding. The last manual competitive analysis had taken two analysts an entire week.
Problem: Information was scattered across vendor websites, GitHub repositories, blog posts, benchmark reports, and Crunchbase. No single source provided a complete comparison, and some vendors published benchmark data while others did not, making apples-to-apples comparison difficult.
Decision: The architect deployed a research agent using a plan-execute-reflect loop. The agent generated a research outline, executed 47 web searches, read 23 pages, extracted data into structured comparison tables, and ran a reflection pass that identified two providers lacking public benchmark data (flagged as a gap requiring vendor outreach).
Result: The agent produced a 3,000-word report with comparison tables, sourced claims, and an explicit "limitations" section in 45 minutes. The architect spent an additional 2 hours verifying key claims and adding internal context, for a total of under 3 hours compared to the previous week-long manual process.
Lesson: Research agents provide the most value when they explicitly flag gaps and uncertainties rather than papering over missing data, because the human reviewer can then focus verification effort on the areas that matter most.
Research agents reveal a fundamental asymmetry in intelligence: synthesis is harder than analysis. A single web search is trivial; reading one paper is straightforward. But combining findings from dozens of sources, detecting contradictions, identifying what is missing, and weighting evidence by credibility requires the kind of recursive, self-correcting reasoning that separates genuine research from mere retrieval. This is why the plan-execute-reflect loop is essential: research is not a pipeline with a fixed number of steps, but an expanding search through an information space whose boundaries you discover only by exploring it.
29.3.2 Data Analysis Agents
Data analysis agents combine natural language understanding with code execution to answer questions about data. The user asks a question in plain language ("What was our churn rate by cohort last quarter?"), the agent writes Python or SQL code to analyze the data, executes the code in a sandbox, interprets the results, and presents findings with visualizations. This is the code agent pattern from Section 29.1 specialized for analytical workflows.
The key architectural decision is how the agent accesses data. Direct database access (the agent writes SQL) is the most flexible but requires careful security controls to prevent destructive queries. Pre-loaded DataFrames (the agent writes pandas code against data already loaded in the sandbox) are simpler and safer but limit the agent to the pre-loaded data. API-based access (the agent calls analytics APIs) provides the best security but limits the types of analysis possible. Most production deployments use a combination: SQL for data extraction, pandas for analysis, and matplotlib/plotly for visualization.
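# Data analysis agent with sandboxed code execution.
# Assumes `llm` is a chat-model client defined elsewhere whose invoke()
# returns an object with a .content string, and that E2B_API_KEY is set.
from e2b_code_interpreter import Sandbox

def analyze_data(question: str, data_description: str, sales_csv: str) -> dict:
    sandbox = Sandbox()  # newer SDK releases may use Sandbox.create() instead
    try:
        # Upload the data (CSV text) to the sandbox
        sandbox.files.write("/data/sales.csv", sales_csv)
        # Generate and execute analysis code
        code = llm.invoke(
            f"Write Python code to answer this question about the data:\n"
            f"Question: {question}\n"
            f"Data description: {data_description}\n"
            f"The data is available at /data/sales.csv\n"
            f"Use pandas for analysis and matplotlib for any charts.\n"
            f"Save charts to /output/chart.png\n"
            f"Print the answer clearly at the end."
        )
        result = sandbox.run_code(code.content)
        if result.error:
            # Surface the failure instead of returning an empty answer
            return {"answer": None, "chart": None, "code": code.content,
                    "error": result.error.value}
        # Not every question produces a chart, so check before reading
        chart_path = "/output/chart.png"
        chart = (sandbox.files.read(chart_path, format="bytes")
                 if sandbox.files.exists(chart_path) else None)
        return {
            # The prompt asks the code to print its answer, so read stdout
            "answer": "".join(result.logs.stdout),
            "chart": chart,
            "code": code.content,
        }
    finally:
        sandbox.kill()  # always release the sandbox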
A complete data analysis agent in about ten lines with smolagents (pip install smolagents):
# Note: recent smolagents releases rename HfApiModel to InferenceClientModel;
# use whichever name your installed version provides.
from smolagents import CodeAgent, HfApiModel

agent = CodeAgent(
    tools=[],  # CodeAgent can write and run pandas/matplotlib natively
    model=HfApiModel(),
    additional_authorized_imports=["pandas", "matplotlib"],
)
result = agent.run(
    "Load /data/sales.csv, compute monthly revenue totals, "
    "and plot a bar chart. Save the chart to /output/chart.png."
)
29.3.3 Scientific Discovery Agents
At the frontier of research agents are systems designed for scientific discovery: generating hypotheses, designing experiments, analyzing results, and proposing new research directions. These agents are being deployed in drug discovery, materials science, and genomics, where the volume of literature and data exceeds any human's ability to synthesize. FutureHouse's Robin system, for example, chains literature-search and data-analysis agents to generate and test hypotheses, and was used to propose a drug-repurposing candidate for dry age-related macular degeneration.
Scientific agents face unique challenges around reproducibility, uncertainty quantification, and domain expertise. A research agent that confidently states an incorrect finding could waste months of laboratory work. Production scientific agents therefore implement aggressive uncertainty quantification, require citations for every claim, flag when they are extrapolating beyond their training data, and always present findings as hypotheses to be verified rather than conclusions.
Research agents can produce plausible-sounding but incorrect analyses, especially when they hallucinate sources or misinterpret statistical results. Always verify agent-produced research against primary sources before making decisions based on it. Implement citation verification (check that cited URLs exist and contain the claimed information) and statistical sanity checks (verify that reported numbers are within plausible ranges).
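The statistical sanity check mentioned above can be as simple as comparing each reported number against a plausible range before it reaches the report. The sketch below is one minimal way to do that; the function name, the metric names, and the bounds are hypothetical and would need to come from domain knowledge in a real deployment.

```python
import math


def sanity_check(
    metrics: dict[str, float],
    bounds: dict[str, tuple[float, float]],
) -> list[str]:
    """Return a human-readable problem for every implausible metric."""
    problems = []
    for name, value in metrics.items():
        if math.isnan(value) or math.isinf(value):
            problems.append(f"{name}: not a finite number")
            continue
        if name in bounds:
            low, high = bounds[name]
            if not low <= value <= high:
                problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


# Hypothetical numbers taken from an agent's report.
reported = {"churn_rate": 1.7, "avg_order_value": 54.2, "growth": float("nan")}
plausible = {"churn_rate": (0.0, 1.0), "avg_order_value": (0.0, 10_000.0)}

issues = sanity_check(reported, plausible)
for issue in issues:
    print(issue)
```

A check like this does not prove a number is right, only that it is not obviously wrong, so it complements rather than replaces verification against primary sources.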
- Deep research agents plan multi-step research strategies, unlike RAG pipelines that execute a single retrieve-generate cycle.
- Key phases include query decomposition, multi-source search, evaluation, synthesis, and gap identification.
- Deep research agents are suited for complex, open-ended questions that require information from multiple sources.
Deep research agents actively plan research strategies, formulate multiple search queries, evaluate and synthesize information from diverse sources, identify knowledge gaps, and iterate until a comprehensive answer is assembled. RAG pipelines execute a single retrieve-generate cycle without strategic planning.
A deep research workflow typically has five phases: (1) query decomposition (break a complex question into sub-questions), (2) multi-source search (search the web, databases, and documents), (3) evaluation and filtering (assess relevance and reliability), (4) synthesis (combine findings into a coherent report), and (5) gap identification (determine whether more research is needed).
Exercises
Describe the architecture of a deep research agent. What distinguishes it from a simple RAG system, and what components are necessary for multi-step research?
Answer Sketch
A deep research agent goes beyond single-query retrieval. It decomposes research questions into sub-questions, searches multiple sources, evaluates and cross-references findings, identifies gaps, and iterates until the question is thoroughly answered. Required components: a planner (decomposes questions), a search tool (web, papers, databases), a note-taking system (accumulates findings), and a synthesizer (produces the final report with citations).
Write a prompt template for a data analysis agent that receives a CSV file path and a natural language question. The agent should generate Python code to analyze the data, execute it in a sandbox, and interpret the results.
Answer Sketch
The prompt should include: (1) instructions to first read column names and data types, (2) generate pandas code for the analysis, (3) execute the code and capture output, (4) interpret numerical results in plain language. Include safety instructions: do not modify the original file, handle missing values, and validate results with sanity checks before reporting.
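One possible shape for such a template is sketched below. The wording, the PROMPT_TEMPLATE name, and the build_prompt helper are illustrative assumptions, not a tested or recommended prompt; the point is only to show the four steps and the safety rules in one place.

```python
PROMPT_TEMPLATE = """\
You are a data analysis assistant working in a sandbox.

CSV file: {csv_path}
Question: {question}

Follow these steps:
1. Load the file with pandas and print the column names and dtypes.
2. Write pandas code that answers the question.
3. Run the code and capture its printed output.
4. Explain the numerical result in plain language.

Rules:
- Do not modify or overwrite the original file.
- Handle missing values explicitly and say how you treated them.
- Check that results are plausible before reporting them.
"""


def build_prompt(csv_path: str, question: str) -> str:
    """Fill the template with a concrete file path and question."""
    return PROMPT_TEMPLATE.format(csv_path=csv_path, question=question)


prompt = build_prompt("/data/sales.csv", "What was monthly revenue in Q3?")
print(prompt)
```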
A research agent cites sources in its report. Design a verification pipeline that checks whether each citation actually supports the claim it is attached to.
Answer Sketch
For each claim-citation pair: (1) retrieve the cited source, (2) extract the relevant passage, (3) use an LLM to evaluate whether the passage supports the claim (supports, contradicts, or is unrelated), (4) flag unsupported claims for human review. Also check: does the cited paper exist? Is the author attribution correct? Is the year correct? This catches hallucinated citations.
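The pipeline in this sketch can be expressed with the retrieval and judging steps passed in as functions, so that an LLM judge can be swapped in later. Everything here is illustrative: the Citation class, the verdict labels, and especially toy_judge, which uses crude keyword overlap as a stand-in for the LLM evaluation step and cannot detect contradictions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Citation:
    claim: str
    source_id: str


def verify_citations(
    citations: list[Citation],
    fetch: Callable[[str], Optional[str]],
    judge: Callable[[str, str], str],
) -> list[tuple[Citation, str]]:
    """Pair each citation with a verdict; 'missing' means the source was not found."""
    report = []
    for cit in citations:
        text = fetch(cit.source_id)
        verdict = "missing" if text is None else judge(cit.claim, text)
        report.append((cit, verdict))
    return report


def needs_review(report: list[tuple[Citation, str]]) -> list[Citation]:
    """Everything that is not clearly supported goes to a human."""
    return [cit for cit, verdict in report if verdict != "supports"]


def toy_judge(claim: str, passage: str) -> str:
    """Stand-in for an LLM judge: keyword overlap only, assumed for the demo."""
    words = {w.strip(".,").lower() for w in claim.split() if len(w) > 3}
    hits = sum(w in passage.lower() for w in words)
    return "supports" if words and hits / len(words) >= 0.6 else "unrelated"


# Hypothetical corpus and citations.
corpus = {"doc-1": "The index supports HNSW and IVF index types."}
citations = [
    Citation("The index supports HNSW", "doc-1"),
    Citation("Pricing starts at ten dollars", "doc-1"),
    Citation("Funding reached a new high", "doc-404"),
]

report = verify_citations(citations, corpus.get, toy_judge)
flagged = needs_review(report)
print([verdict for _, verdict in report])
```

Treating a source that cannot be retrieved as its own verdict keeps hallucinated citations visible instead of letting them fall through as merely "unrelated".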
Implement a research agent that searches three sources (arXiv, Wikipedia, and a web search engine) for information on a given topic, deduplicates findings, and produces a structured summary with source attribution.
Answer Sketch
Create async tool functions for each source. Search all three in parallel. For each result, extract key claims and tag with the source. Use embedding similarity to identify duplicate claims across sources. Group unique claims into themes. Produce a structured summary with inline citations: 'Claim X (arXiv:2301.xxxxx, also confirmed by Wikipedia).'
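A minimal sketch of the parallel-search-then-deduplicate flow follows. The three search functions are stubs returning canned results rather than real API calls, and token-level Jaccard similarity is used here as a simple stand-in for embedding similarity; both are assumptions made so the example runs without network access or model downloads.

```python
import asyncio


# Stub search tools: real versions would call the arXiv, Wikipedia, and web search APIs.
async def search_arxiv(topic: str) -> list[tuple[str, str]]:
    await asyncio.sleep(0)
    return [("arXiv", "HNSW builds a layered proximity graph")]


async def search_wikipedia(topic: str) -> list[tuple[str, str]]:
    await asyncio.sleep(0)
    return [("Wikipedia", "hnsw builds a layered proximity graph")]


async def search_web(topic: str) -> list[tuple[str, str]]:
    await asyncio.sleep(0)
    return [("web", "Vector databases often charge per stored vector")]


def jaccard(a: str, b: str) -> float:
    """Token overlap; a crude substitute for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def deduplicate(results, threshold: float = 0.8):
    """Merge near-duplicate claims and keep every source that asserted them."""
    merged: list[tuple[str, list[str]]] = []
    for source, claim in results:
        for kept_claim, sources in merged:
            if jaccard(kept_claim, claim) >= threshold:
                if source not in sources:
                    sources.append(source)
                break
        else:
            merged.append((claim, [source]))
    return merged


async def research(topic: str):
    # Run all three searches concurrently; gather preserves argument order.
    batches = await asyncio.gather(
        search_arxiv(topic), search_wikipedia(topic), search_web(topic)
    )
    flat = [item for batch in batches for item in batch]
    return deduplicate(flat)


summary = asyncio.run(research("vector databases"))
for claim, sources in summary:
    print(f"{claim} ({', '.join(sources)})")
```

Keeping the list of sources on each merged claim is what makes the inline attribution in the final summary possible.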
Discuss the potential and limitations of AI agents for scientific discovery. Can an agent genuinely discover new knowledge, or is it limited to finding patterns in existing literature?
Answer Sketch
Agents can: synthesize findings across papers that human researchers might miss, identify gaps in the literature, generate hypotheses based on pattern recognition, and automate routine analyses. Limitations: agents cannot run physical experiments, they may confuse correlation with causation, they can hallucinate plausible-sounding but incorrect claims, and they lack the deep domain intuition that guides human researchers toward fruitful directions.
What Comes Next
In the next section, Section 29.4: Production Agentic Coding Systems (2026), we survey the named vendor tools (Claude Code, OpenAI Codex, Cursor, Windsurf, Aider, Devin, GitHub Copilot Workspace) that wrap the code-agent patterns from Section 29.1 in shippable form.