Section 33.3: When to Retrieve, When to Reason

"The model knows everything in its weights. Retrieval is for everything else."
RAG, Retrieval-Reasoning AI Agent

Big Picture

Not every multimodal query needs RAG. A frontier VLM like GPT-4o or Gemini 2.5 Pro already encodes vast world knowledge, and many vision tasks ("count the people in this image", "describe the chart") can be answered directly without retrieval. Retrieval pays off when the answer requires specific external content: private documents, time-sensitive data, large-scale visual catalogs. This section provides a decision rubric for choosing between direct reasoning, RAG, and hybrid strategies. We cover what signals tell you the model is hallucinating, how to design a routing layer, and the cost/quality math behind the choice.

Prerequisites

This section assumes the multimodal RAG patterns from Section 33.2, the reasoning patterns (chain-of-thought, deliberation) from Section 26.2, and basic familiarity with the cost-latency-quality trade-off matrix from Section 14.1.

Figure 33.3.1: A decision rubric for choosing retrieve vs. reason. The decision tree diagram puts query characteristics on the left (private data, time-sensitive, large catalog, etc.), leading to four leaves (direct VLM, RAG, hybrid, agentic search). The four leaves correspond to four production patterns; each has a distinct cost and quality profile.

33.3.1 The Four Patterns

Fun Fact

The Pareto frontier of "when to retrieve" is unusually steep. A direct VLM call can answer 60% of multimodal queries in 1.2 seconds; adding RAG bumps quality to 78% but at 3.5 seconds; adding agentic search reaches 84% at 12 seconds. Whether you choose 60%, 78%, or 84% depends almost entirely on whether your product is a chat box or a billable workflow.

Four production patterns cover the multimodal query space:

Direct VLM: send the query (and any user-provided images) to a frontier multimodal model; trust its trained knowledge. Cheapest at the per-query level but limited to whatever the model has learned.
Static RAG: retrieve top-k from a pre-built multimodal index, splice into the VLM prompt. The pattern from Section 33.2. Use for private or domain-specific corpora.
Hybrid (RAG + reasoning): the VLM first decides whether retrieval is needed (a "self-RAG" pattern) and fetches only when its confidence is low or the query mentions specific entities.
Agentic search: the VLM uses tools (web search, internal database, image search) iteratively to assemble context, similar to Section 37.1's agent patterns extended to multimodal queries.

33.3.2 When Direct VLM Wins

Skip retrieval and answer with the VLM directly when:

The query is self-contained: the user uploads an image and asks "what's in this picture?" No retrieval can improve on what the VLM sees directly.
The knowledge is well within the model's training: identifying common objects, places, plants, animals; basic image manipulation requests; general factual visual questions.
Latency budget is tight: an extra retrieval round-trip adds 100 to 500 ms; in real-time applications (voice agents, AR overlays), that may be unaffordable.
Cost matters more than absolute quality: direct VLM is the cheapest pattern per query.

The empirical rule of thumb: if a knowledgeable human could answer your query just by looking at the image, the VLM probably can too. Save retrieval for queries that would stump that human without access to specific documents.

Key Insight: Aha Moment: PlantNet vs. Crop-Disease-RAG

The PlantNet team's 2024 ablation provides the clean contrast. Direct GPT-4V on the iNaturalist 2021-Plants benchmark hits 71.3 percent top-1 accuracy on common species and 12.4 percent on rare species (those with fewer than 50 training examples). Add a SigLIP retrieval layer over the PlantNet 4M-image herbarium and rerun: common species score 72.1 percent (essentially no change; the VLM already knew them) and rare species jump to 64.7 percent. Same VLM, same query format, one retrieval layer: direct wins on the common 80 percent of plants by tying on quality with 100 ms less latency; RAG wins on the rare 20 percent by a 52-point margin. The decision rule is not "RAG always" or "direct always"; it is "what fraction of your traffic falls in the long tail?" That single number determines which pattern you reach for, and most teams discover their long-tail fraction only after the first month of production logs.
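To make the long-tail question concrete, here is a minimal sketch that blends the per-segment accuracies quoted in the PlantNet example by the share of traffic that falls in the rare segment. The function names are illustrative, and treating the common/rare split as a traffic fraction is an assumption; the accuracy figures are the ones quoted in this section.

```python
def blended_accuracy(tail_fraction, common_acc, rare_acc):
    """Expected accuracy when `tail_fraction` of queries hit the rare segment."""
    if not 0.0 <= tail_fraction <= 1.0:
        raise ValueError("tail_fraction must be between 0 and 1")
    return (1.0 - tail_fraction) * common_acc + tail_fraction * rare_acc

# Per-segment top-1 accuracies (percent) as quoted in the text.
DIRECT = {"common": 71.3, "rare": 12.4}
WITH_RAG = {"common": 72.1, "rare": 64.7}

def rag_gain(tail_fraction):
    """Accuracy points gained by adding retrieval at a given long-tail share."""
    direct = blended_accuracy(tail_fraction, DIRECT["common"], DIRECT["rare"])
    rag = blended_accuracy(tail_fraction, WITH_RAG["common"], WITH_RAG["rare"])
    return rag - direct

if __name__ == "__main__":
    for tail in (0.0, 0.05, 0.20, 0.50):
        print(f"tail share {tail:.0%}: RAG gain = {rag_gain(tail):.1f} points")
```

With no long-tail traffic the gain is under one point, which rarely justifies the extra round-trip; at a 20 percent tail share it is about 11 points. Measuring your own tail share from logs is what turns this into a decision.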

33.3.3 When RAG Wins

Reach for retrieval when the VLM cannot have seen the answer:

The answer is in private content: company documents, proprietary images, internal video archives. The VLM has never seen these.
The content is time-sensitive: news photos, stock charts, current events imagery. The VLM's training cutoff is months to years stale.
You need source attribution: regulatory or audit contexts where you must show which document or image the answer came from.
Specific entities matter: "find me Dr. Smith's report on patient 12345" requires precise retrieval, not the VLM's parametric memory.
The corpus is too large for in-context: a 100k-image catalog cannot fit in any prompt, so retrieval is forced.

Key Insight: Hallucination is the smoking gun

The single most reliable signal that direct VLM is insufficient is hallucination. If your VLM regularly produces plausible-but-wrong answers on your task, you have a knowledge gap that RAG addresses. Specifically: confident answers with no factual grounding, made-up document references, made-up image details. If you see these patterns in spot-checks of your production traffic, add retrieval. If you don't, save the engineering effort.
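Spot checks are small samples, so the observed hallucination rate carries real uncertainty. The sketch below estimates the rate from human-reviewed labels and attaches a Wilson score interval. The 5 percent escalation threshold and the function names are assumptions for illustration, not values from this chapter.

```python
import math

def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("need at least one reviewed sample")
    p = failures / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def spot_check(labels, threshold=0.05):
    """labels: list of bools, True where a reviewer marked the answer hallucinated.

    Recommends retrieval only when even the lower end of the interval
    exceeds the (assumed) acceptable hallucination rate.
    """
    n = len(labels)
    bad = sum(labels)
    low, high = wilson_interval(bad, n)
    return {"rate": bad / n, "ci": (low, high), "add_retrieval": low > threshold}

if __name__ == "__main__":
    # Hypothetical review: 18 hallucinated answers out of 150 sampled queries.
    reviewed = [True] * 18 + [False] * 132
    print(spot_check(reviewed))
```

Using the lower bound makes the check conservative: a couple of bad answers in a small sample will not trigger an engineering project, while a sustained double-digit rate will.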

33.3.4 When Hybrid Wins

Hybrid patterns (the VLM decides whether to retrieve) win when:

Your traffic is mixed: some queries need retrieval, others don't. Routing avoids paying retrieval cost on every query.
You have a budget for false positives: occasionally retrieving when you shouldn't have is fine, as long as you don't miss the cases where retrieval would have helped.
The model can reliably self-assess: modern VLMs (GPT-4o, Gemini 2.5) are fairly good at recognizing "I don't know this" or "I would need to look this up", which makes the routing layer simpler.

The simplest hybrid is a router prompt: ask the VLM "should I retrieve external context to answer this, or do I have enough to answer directly?" before generating the final answer. Cost adds ~100 tokens per query; quality benefit on knowledge-heavy queries is substantial.

# A hybrid router that decides retrieve vs reason per query.
# Uses GPT-4o-mini to classify, then dispatches accordingly.
# answer_direct, multimodal_rag_retrieve, answer_with_context, and agent_search
# are application-specific helpers assumed to be defined elsewhere.
from openai import OpenAI

oai = OpenAI()  # reads OPENAI_API_KEY from the environment

ROUTER_PROMPT = """Decide if answering this query needs retrieval over the user's
private image and document corpus, or if the question can be answered
from general knowledge and the directly attached media.

Respond with exactly one of:
  DIRECT  -- answerable from VLM knowledge and attached media
  RAG     -- needs retrieval from the private corpus
  AGENT   -- needs iterative search across multiple tools

Query: {query}"""

def route_and_answer(query, attachments=None):
    # 1. Decide the path. The router sees only the query text, not the attachments.
    response = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": ROUTER_PROMPT.format(query=query)}],
        max_tokens=10,
        temperature=0,  # keep the classification as repeatable as possible
    )
    # message.content can be None, so guard before normalizing
    classification = (response.choices[0].message.content or "").strip().upper()
    # 2. Dispatch
    if "DIRECT" in classification:
        return answer_direct(query, attachments)
    elif "RAG" in classification:
        retrieved = multimodal_rag_retrieve(query, k=5)
        return answer_with_context(query, attachments, retrieved)
    else:  # AGENT, or any unrecognized label (note: this is the most expensive path)
        return agent_search(query, attachments)

Code Fragment 33.3.1a: A three-way router. GPT-4o-mini classifies each query as direct, RAG, or agent in a single tiny call. The cost overhead is negligible (~$0.0001 per query); the savings come from not running expensive retrieval on direct-answerable queries.

33.3.5 When Agentic Search Wins

Agentic search is the right choice when:

The query is complex and multi-hop: "Find me images of solar panel installations in Texas that were completed after January 2025 and have published efficiency metrics" requires combining multiple retrievals.
The corpus is partitioned: different tools index different content (web image search, internal photo archive, regulatory database). The agent decides which to use.
The query may require iterative refinement: the VLM looks at first-pass results, recognizes they're insufficient, and reformulates the query.

Agentic search is the most expensive pattern (multiple tool calls per query, longer latency) but produces the highest-quality answers on open-ended tasks. The 2025-2026 production pattern combines Gemini 2.5 Pro's native tool-use with a small set of multimodal-aware tools (image search, document RAG, web search).
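The control flow behind agentic search is a bounded loop: plan a tool call, run it, look at the accumulated evidence, and either stop or plan again. The sketch below shows only that loop. The tools are stand-in stubs and the planner is scripted; in a real system the planner would be a VLM call with tool-use enabled, and every name here is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], List[str]]
# A planner inspects the query and the evidence so far and returns either
# (tool_name, tool_query) for the next step, or None when it has enough.
Planner = Callable[[str, List[str]], Optional[Tuple[str, str]]]

def agentic_search(query: str, tools: Dict[str, Tool], planner: Planner,
                   max_steps: int = 5) -> List[str]:
    """Run up to max_steps plan/act iterations and return the collected evidence."""
    evidence: List[str] = []
    for _ in range(max_steps):
        step = planner(query, evidence)
        if step is None:  # planner judges the evidence sufficient
            break
        tool_name, tool_query = step
        if tool_name not in tools:
            continue  # ignore unknown tools; max_steps still bounds the loop
        evidence.extend(tools[tool_name](tool_query))
    return evidence

# Stub tools standing in for real image search and document RAG.
def image_search(q: str) -> List[str]:
    return [f"image hit for '{q}'"]

def doc_rag(q: str) -> List[str]:
    return [f"document hit for '{q}'"]

# Scripted planner: one hop per constraint of the Texas solar-panel example.
# It relies on each stub returning exactly one hit to track progress.
PLAN = [
    ("image_search", "solar panel installations in Texas"),
    ("doc_rag", "installations completed after January 2025"),
    ("doc_rag", "published efficiency metrics"),
]

def scripted_planner(query: str, evidence: List[str]) -> Optional[Tuple[str, str]]:
    return PLAN[len(evidence)] if len(evidence) < len(PLAN) else None

if __name__ == "__main__":
    tools = {"image_search": image_search, "doc_rag": doc_rag}
    for item in agentic_search("Texas solar query", tools, scripted_planner):
        print(item)
```

The `max_steps` cap is what keeps cost and latency bounded; without it, a planner that never declares itself satisfied would keep issuing tool calls.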

33.3.6 The Cost-Quality Frontier

Pattern	Cost per Query	p95 Latency	Quality (knowledge tasks)	Quality (general visual tasks)
Direct VLM (GPT-4o-mini)	~$0.003	500 ms	Poor on private data	Good
Direct VLM (GPT-4o)	~$0.015	800 ms	Mediocre on private data	Excellent
Static RAG	~$0.025	1.5 s	Good	Comparable to direct
Hybrid router	~$0.005 to $0.025	0.7 to 1.7 s	Good (routed)	Excellent (direct path)
Agentic search	~$0.10 to $0.50	5 to 15 s	Excellent	Excellent (slow)

Figure 33.3.2: Pattern comparison, late 2025 numbers. The hybrid router gives the best cost/quality trade-off for mixed traffic; agentic search wins for complex multi-hop queries but is 10x to 100x more expensive than direct.
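The hybrid row is a range because its cost depends on how traffic splits across paths. A small sketch of that arithmetic, using the per-query costs from the table and the ~$0.0001 router overhead quoted for the router code fragment; the 60/40 traffic mix is a made-up example, not a measured figure.

```python
def blended_cost(mix, path_cost, router_cost=0.0):
    """Expected cost per query for a router.

    mix:        {path_name: share of traffic}, shares must sum to 1
    path_cost:  {path_name: cost per query in dollars}
    router_cost: fixed classification overhead paid on every query
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return router_cost + sum(share * path_cost[path] for path, share in mix.items())

# Per-query costs taken from the table above (GPT-4o-mini direct, static RAG).
PATH_COST = {"direct": 0.003, "rag": 0.025}

if __name__ == "__main__":
    mix = {"direct": 0.60, "rag": 0.40}  # hypothetical traffic split
    hybrid = blended_cost(mix, PATH_COST, router_cost=0.0001)
    always_rag = PATH_COST["rag"]
    print(f"hybrid: ${hybrid:.4f}/query, saving {1 - hybrid / always_rag:.0%} vs always-RAG")
```

For this hypothetical mix the blend comes to about $0.0119 per query, roughly half the cost of running RAG on everything. The saving shrinks as the RAG share grows, which is why the router matters most when a large share of traffic is answerable directly.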

33.3.7 Debugging the Choice

Picking the right pattern is empirical. A practical workflow:

Start with direct VLM as the baseline.
Collect 100 to 200 representative production queries.
Label each query with the "correct" pattern (a human reviews and decides).
Measure accuracy of direct VLM vs RAG vs hybrid on this labeled set.
Choose the pattern that hits your quality target at the lowest cost.
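The last two steps of this workflow reduce to a small selection problem: score each pattern on the labeled set, then take the cheapest one that clears the quality bar. A minimal sketch follows; the class and function names are illustrative, and the counts in the example are placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PatternResult:
    name: str
    correct: int            # answers judged correct on the labeled set
    total: int              # size of the labeled set
    cost_per_query: float   # dollars

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

def choose_pattern(results: List[PatternResult],
                   quality_target: float) -> Optional[PatternResult]:
    """Return the cheapest pattern meeting the target, or None if none does."""
    eligible = [r for r in results if r.accuracy >= quality_target]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.cost_per_query)

if __name__ == "__main__":
    # Placeholder numbers for a 150-query labeled set.
    results = [
        PatternResult("direct", correct=96, total=150, cost_per_query=0.003),
        PatternResult("static_rag", correct=120, total=150, cost_per_query=0.025),
        PatternResult("hybrid", correct=118, total=150, cost_per_query=0.012),
    ]
    best = choose_pattern(results, quality_target=0.75)
    print(best.name if best else "no pattern meets the target")
```

Returning `None` when nothing clears the bar is deliberate: it signals that the next step is a more capable pattern or a better index, not a lower target.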

For production monitoring, log three things for every request: the pattern the router chose, the user's downstream feedback (thumbs up/down), and any retrieval hit-rate metrics. Over time, you can adapt the router thresholds based on which patterns produce satisfactory outcomes.

Note: A common anti-pattern: RAG on everything

A frequent design mistake in 2024-2025 was applying RAG to every multimodal query "just in case". This adds latency, cost, and the risk of low-quality retrievals contaminating the prompt. If 60% of your queries are answerable directly, the hybrid router pays for itself many times over. Don't reach for RAG until direct-VLM hallucination shows up in your spot checks.

33.3.8 The Router as a Policy

The router is not just a static prompt. In production, it can be:

Rule-based: if query mentions a private entity, RAG; else, direct.
Prompt-based: ask the VLM to classify (the simplest pattern, Code Fragment 33.3.1a).
Confidence-based: run direct VLM first; if its confidence (or self-rated confidence) is low, fall back to RAG.
Learned: a small classifier trained on query features and downstream outcome signals (clicks, thumbs up).

The 2026 production default is prompt-based for systems under 1M queries/day (engineering simplicity) and learned for high-volume systems where the router's accuracy directly affects cost.
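The rule-based and confidence-based policies are simple enough to sketch without any model call. In the sketch below the regular expressions are hypothetical examples of "private entity" patterns, the 0.7 confidence threshold is an arbitrary assumption, and the two answer functions are passed in as callables so the policy can be tested with stubs.

```python
import re

# Hypothetical patterns for mentions of private entities; tune to your corpus.
PRIVATE_ENTITY_PATTERNS = [
    re.compile(r"\b(order|ticket|patient|invoice)\s*#?\s*\d+\b", re.IGNORECASE),
    re.compile(r"\bmodel\s+[A-Z]{1,4}-\d+\b"),
]

def rule_based_route(query):
    """Route to RAG when the query names a private entity, else answer directly."""
    if any(p.search(query) for p in PRIVATE_ENTITY_PATTERNS):
        return "RAG"
    return "DIRECT"

def confidence_based_answer(query, answer_direct, answer_with_rag,
                            min_confidence=0.7):
    """Try the direct path first; fall back to RAG when confidence is low.

    answer_direct(query)   -> (answer, confidence in [0, 1])
    answer_with_rag(query) -> answer
    Returns (answer, path_taken).
    """
    answer, confidence = answer_direct(query)
    if confidence >= min_confidence:
        return answer, "DIRECT"
    return answer_with_rag(query), "RAG"

if __name__ == "__main__":
    print(rule_based_route("what's the torque spec for model XJ-7?"))
    print(rule_based_route("what's in this picture?"))
```

Note the cost asymmetry between the two: the rule-based router pays nothing extra per query, while the confidence-based policy pays for a direct call on every query and for a second call whenever it falls back.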

Real-World Scenario: A Field Service Assistant

In 2025, an industrial-equipment company built a field-service assistant for technicians. Queries fell into three buckets: (a) general visual ID ("what's this part?"), best answered direct; (b) specific equipment lookup ("what's the torque spec for model XJ-7?"), best answered via RAG over internal documentation; (c) troubleshooting ("this error code with this video; what should I check?"), best answered via agentic search across docs, parts catalog, and historic tickets. They implemented a three-way router with a GPT-4o-mini classifier; routing added ~120 ms of latency but cut per-query cost by 73% versus naive "RAG on everything." Field technician satisfaction with the assistant rose from 3.1 to 4.6 out of 5.

Key Insight

Choose the multimodal pipeline that matches the query, not the loudest hype. Direct VLM is right for self-contained visual queries; static RAG for private or time-sensitive content; hybrid routers for mixed traffic; agentic search for complex multi-hop queries. The hybrid router is the right default for any production system with non-uniform query types. Hallucination patterns and human spot checks are the practical signals that tell you when to escalate from direct VLM to retrieval-augmented patterns.

Self-Check

Q1: A user uploads a screenshot of an obscure scientific instrument and asks "what is this?" Which pattern wins, and what is the failure mode if you pick wrong?

Show Answer

If the instrument is general-knowledge (a standard piece of lab equipment a knowledgeable human could identify by sight), direct VLM wins: the query is self-contained, latency matters, and retrieval cannot improve on what the VLM already sees. If the instrument is sufficiently obscure that the VLM's training data did not cover it (a niche prototype, an internal device), the failure mode of going direct is a confident hallucination: the VLM produces a plausible-sounding identification that is wrong, with no grounding signal that anything is amiss. The fix in that regime is RAG over a specialist image catalog (vendor catalogs, internal asset databases) so the answer is grounded in actual retrieved instances rather than the model's parametric guess. The smoking gun for "you picked wrong" is hallucination in spot checks; that is the cue to add retrieval.

Q2: Sketch a hybrid router for a customer-support assistant where 70% of queries are about general product behavior and 30% are about specific orders.

Show Answer

A two-branch router fits the traffic mix. Step 1: a GPT-4o-mini classifier with a router prompt that decides DIRECT (general product behavior) vs RAG (specific order or account lookup). Step 2: on DIRECT, send the query plus any attachments to GPT-4o-mini with a system prompt encoding the product manual; on RAG, fetch the user's order details by ID from the internal database, plus top-k FAQ chunks from a vector index, and pass the assembled context to the same generation model. Routing cost is ~$0.0001 per query (~100 tokens at mini-variant prices); the 70 percent direct branch costs ~$0.003 per query and the 30 percent RAG branch ~$0.025, blending to roughly $0.0096 average. This beats running RAG on every query (~$0.025 always) by ~60 percent while preserving quality on the order-lookup tail.

Q3: You instrument your direct-VLM-only system and find 12% of answers are hallucinated. What is the minimum change to fix this, and what does it cost per query?

Show Answer

A 12 percent hallucination rate is the canonical signal that direct VLM has a knowledge gap; the minimum fix is to add retrieval over the content the model is fabricating. Concretely: index the relevant corpus (user documents, internal catalog, time-sensitive data) once with SigLIP 2 or a text embedding model, and at query time retrieve the top-k results and pass them to the VLM together with an explicit "Answer using ONLY these..." grounding prompt. The added cost is roughly $0.02 to $0.025 per query for retrieval plus a slightly larger prompt context, on top of the original ~$0.003 to $0.015 direct cost; latency rises ~500 to 1000 ms. Empirically this is the fix that takes hallucination rates from double digits down to single-digit percentages (the Q4 case study in section 33.4 shows 18 percent dropping to 6 percent after adding ColPali RAG).

Q4: Why does agentic search produce higher quality on multi-hop queries than static RAG with k=20?

Show Answer

Static RAG with k=20 retrieves once against a single query embedding, so for a multi-hop question (e.g., "Texas solar panel installations completed after January 2025 with published efficiency metrics") the retrieved chunks have to satisfy all constraints simultaneously, and the embedding similarity is dominated by whichever clause is most prominent. Agentic search instead lets the VLM issue multiple retrievals iteratively: first fetch installations in Texas, inspect results, then issue a follow-up retrieval filtered by date, then a third for efficiency metrics, combining evidence across hops. This decomposition matches the structure of the question and surfaces relevant content that no single-shot retrieval ranks highly enough to make the top-20. The trade-off is cost and latency (5 to 15 seconds and $0.10 to $0.50 per query versus ~1.5 s and $0.025 for static RAG), so agentic search is reserved for the complex long tail.

What Comes Next

Section 33.4: Multimodal Reasoning in Production closes the chapter with the full cost/latency/quality matrix and the model-selection guidance for production deployments.

Further Reading

Self-RAG and Routing

Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2024). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." ICLR. arXiv:2310.11511

Jiang, Z., Xu, F. F., Gao, L., et al. (2023). "Active Retrieval Augmented Generation." EMNLP. arXiv:2305.06983

Agentic Search

Schick, T., Dwivedi-Yu, J., Dessi, R., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS. arXiv:2302.04761

Google DeepMind. (2024). "Project Mariner: Computer-using agents with Gemini." Blog. blog.google/google-mariner-browser-ai-agent

Hallucination Detection

Sun, Z., Shen, S., Cao, S., et al. (2024). "Aligning Large Multimodal Models with Factually Augmented RLHF." arXiv. arXiv:2309.14525

Multimodal RAG Evaluation

Saad-Falcon, J., Khattab, O., Potts, C., & Zaharia, M. (2024). "ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems." NAACL. arXiv:2311.09476