When to Retrieve, When to Reason

Section 33.3

"The model knows everything in its weights. Retrieval is for everything else."

RAG, Retrieval-Reasoning AI Agent
Big Picture

Not every multimodal query needs RAG. A frontier VLM like GPT-4o or Gemini 2.5 Pro already encodes vast world knowledge, and many vision tasks ("count the people in this image", "describe the chart") can be answered directly without retrieval. Retrieval pays off when the answer requires specific external content: private documents, time-sensitive data, large-scale visual catalogs. This section provides a decision rubric for choosing between direct reasoning, RAG, and hybrid strategies. We cover what signals tell you the model is hallucinating, how to design a routing layer, and the cost/quality math behind the choice.

Prerequisites

This section assumes the multimodal RAG patterns from Section 33.2, the reasoning patterns (chain-of-thought, deliberation) from Section 26.2, and basic familiarity with the cost-latency-quality trade-off matrix from Section 14.1.

[Figure: decision tree diagram with query characteristics on the left (private data, time-sensitive, large catalog, etc.) leading to four leaves (direct VLM, RAG, hybrid, agentic search)]
Figure 33.3.1: A decision rubric for choosing retrieve vs reason. The four leaves correspond to four production patterns; each has a distinct cost and quality profile.

33.3.1 The Four Patterns

Fun Fact

The Pareto frontier of "when to retrieve" is unusually steep. A direct VLM call can answer 60% of multimodal queries in 1.2 seconds; adding RAG bumps quality to 78% but at 3.5 seconds; adding agentic search reaches 84% at 12 seconds. Whether you choose 60%, 78%, or 84% depends almost entirely on whether your product is a chat box or a billable workflow.

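Four production patterns cover the multimodal query space: direct VLM (answer from the model's own knowledge and the attached media), static RAG (always retrieve, then answer), a hybrid router (decide per query whether to retrieve), and agentic search (iterative search across multiple tools). The subsections below take them in turn.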

33.3.2 When Direct VLM Wins

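Skip retrieval and answer with the VLM directly when the query is self-contained, meaning the attached media plus general world knowledge are enough to answer it (for example, "count the people in this image" or "describe the chart").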

Key Insight: Aha Moment: PlantNet vs. Crop-Disease-RAG

The PlantNet team's 2024 ablation provides the clean contrast. Direct GPT-4V on the iNaturalist 2021-Plants benchmark hits 71.3 percent top-1 accuracy on common species and 12.4 percent on rare species (those with fewer than 50 training examples). Add a SigLIP retrieval layer over the PlantNet 4M-image herbarium and rerun: common species score 72.1 percent (no change, the VLM already knew them) and rare species jump to 64.7 percent. Same VLM, same query format, one retrieval layer: direct wins on the common 80 percent of plants by tying on quality with 100 ms less latency, RAG wins on the rare 20 percent by a 52-point margin. The decision rule is not "RAG always" or "direct always," it is "what fraction of your traffic falls in the long tail?" That single number determines which pattern you reach for, and most teams discover their long-tail fraction only after the first month of production logs.
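To make the long-tail rule concrete, the following sketch weights the accuracy figures quoted above by the share of rare-species traffic. The two-bucket split and the function names are illustrative assumptions; real traffic will rarely divide this cleanly.

```python
# Accuracy figures quoted in the PlantNet contrast above (percent top-1).
DIRECT = {"common": 71.3, "rare": 12.4}
RAG = {"common": 72.1, "rare": 64.7}


def blended_accuracy(per_bucket, rare_fraction):
    """Traffic-weighted accuracy for a given share of long-tail queries."""
    if not 0.0 <= rare_fraction <= 1.0:
        raise ValueError("rare_fraction must be between 0 and 1")
    return ((1.0 - rare_fraction) * per_bucket["common"]
            + rare_fraction * per_bucket["rare"])


def rag_gain(rare_fraction):
    """Accuracy points gained by adding retrieval at this traffic mix."""
    return (blended_accuracy(RAG, rare_fraction)
            - blended_accuracy(DIRECT, rare_fraction))


# Hypothetical traffic mixes, from no long tail to half of all queries.
for frac in (0.0, 0.05, 0.20, 0.50):
    print(f"long-tail share {frac:.0%}: "
          f"direct {blended_accuracy(DIRECT, frac):.1f}, "
          f"RAG {blended_accuracy(RAG, frac):.1f}, "
          f"gain {rag_gain(frac):+.1f} points")
```

The gain grows linearly with the long-tail share, which is why that one number dominates the decision.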

The empirical rule of thumb: if a knowledgeable human could answer your query just by looking at the image, the VLM probably can too. Save retrieval for queries that would stump that human without access to specific documents.

33.3.3 When RAG Wins

Reach for retrieval when the VLM cannot have seen the answer: the query depends on private documents, time-sensitive data, or a large-scale visual catalog.

Key Insight: Hallucination is the smoking gun

The single most reliable signal that direct VLM is insufficient is hallucination. If your VLM regularly produces plausible-but-wrong answers on your task, you have a knowledge gap that RAG addresses. Specifically: confident answers with no factual grounding, made-up document references, made-up image details. If you see these patterns in spot-checks of your production traffic, add retrieval. If you don't, save the engineering effort.
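One way to turn spot checks into a decision is to estimate the hallucination rate with a confidence interval, so that a handful of bad answers in a small sample does not trigger a rebuild. The sketch below uses the standard Wilson score interval; the 5 percent threshold and the function names are assumptions chosen for illustration, not values from this section.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("need at least one labeled sample")
    p = hits / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def should_add_retrieval(labels, threshold=0.05):
    """labels: list of bools, True where a reviewer marked a hallucination.

    Recommends retrieval only when even the lower bound of the interval
    exceeds the (assumed) tolerable hallucination rate.
    """
    n = len(labels)
    hits = sum(labels)
    low, high = wilson_interval(hits, n)
    return {"rate": hits / n, "low": low, "high": high,
            "add_retrieval": low > threshold}


# Hypothetical spot check: 24 hallucinated answers out of 200 reviewed.
report = should_add_retrieval([True] * 24 + [False] * 176)
print(report)
```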

33.3.4 When Hybrid Wins

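Hybrid patterns (the VLM decides whether to retrieve) win when traffic is mixed, with a meaningful share of queries answerable directly and the rest needing external content.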

The simplest hybrid is a router prompt: before generating the final answer, ask the VLM "should I retrieve external context to answer this, or do I have enough to answer directly?" The router call adds roughly 100 tokens per query; the quality gain on knowledge-heavy queries is substantial.

# A hybrid router that decides retrieve vs reason per query.
# Uses GPT-4o-mini to classify, then dispatches accordingly.
# answer_direct, multimodal_rag_retrieve, answer_with_context, and
# agent_search are application-specific helpers assumed to be defined elsewhere.
from openai import OpenAI

oai = OpenAI()  # reads OPENAI_API_KEY from the environment

ROUTER_PROMPT = """Decide if answering this query needs retrieval over the user's
private image and document corpus, or if the question can be answered
from general knowledge and the directly attached media.

Respond with exactly one of:
  DIRECT  -- answerable from VLM knowledge and attached media
  RAG     -- needs retrieval from the private corpus
  AGENT   -- needs iterative search across multiple tools

Query: {query}"""


def route_and_answer(query, attachments=None):
    # 1. Decide the path with a short, deterministic classification call
    response = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": ROUTER_PROMPT.format(query=query)}],
        max_tokens=10,
        temperature=0,
    )
    # content can be None, so fall back to an empty string before normalizing
    classification = (response.choices[0].message.content or "").strip().upper()
    # 2. Dispatch
    if "DIRECT" in classification:
        return answer_direct(query, attachments)
    elif "RAG" in classification:
        retrieved = multimodal_rag_retrieve(query, k=5)
        return answer_with_context(query, attachments, retrieved)
    else:  # AGENT, or any label the router did not recognize
        return agent_search(query, attachments)
Code Fragment 33.3.1a: A three-way router. GPT-4o-mini classifies each query as direct, RAG, or agent in a single tiny call. The cost overhead is negligible (~$0.0001 per query); the savings come from not running expensive retrieval on direct-answerable queries.
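The dispatch logic can be tested offline if label parsing is separated from the API call. The sketch below is one possible arrangement, not part of the fragment above: `parse_route`, `route`, and the handler mapping are hypothetical names, and defaulting to `RAG` on an unrecognized label is an assumption that avoids sending every malformed reply down the most expensive path. It matches whole tokens, so a stray word such as "STORAGE" is not read as `RAG`.

```python
import re

VALID_ROUTES = ("DIRECT", "RAG", "AGENT")


def parse_route(raw, default="RAG"):
    """Return the first whole-word route label in raw, else the default."""
    for token in re.findall(r"[A-Z]+", (raw or "").upper()):
        if token in VALID_ROUTES:
            return token
    return default


def route(query, classify, handlers, default="RAG"):
    """classify: callable mapping a query to raw label text (for example,
    a thin wrapper around the router model call).
    handlers: dict mapping each route label to a callable taking the query.
    """
    label = parse_route(classify(query), default)
    return label, handlers[label](query)


# Stub classifier and handlers so the routing logic runs without network access.
stub_handlers = {name: (lambda q, n=name: f"{n.lower()}:{q}")
                 for name in VALID_ROUTES}
print(route("what's this part?", lambda q: " direct\n", stub_handlers))
```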

33.3.5 When Agentic Search Wins

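Agentic search is the right choice when a query is complex, multi-hop, or open-ended, so that no single retrieval pass surfaces all the evidence.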

Agentic search is the most expensive pattern (multiple tool calls per query, longer latency) but produces the highest-quality answers on open-ended tasks. The 2025-2026 production pattern combines Gemini 2.5 Pro's native tool-use with a small set of multimodal-aware tools (image search, document RAG, web search).

33.3.6 The Cost-Quality Frontier

Pattern                  | Cost per Query    | p95 Latency  | Quality (knowledge tasks) | Quality (general visual tasks)
Direct VLM (GPT-4o-mini) | ~$0.003           | 500 ms       | Poor on private data      | Good
Direct VLM (GPT-4o)      | ~$0.015           | 800 ms       | Mediocre on private data  | Excellent
Static RAG               | ~$0.025           | 1.5 s        | Good                      | Comparable to direct
Hybrid router            | ~$0.005 to $0.025 | 0.7 to 1.7 s | Good (routed)             | Excellent (direct path)
Agentic search           | ~$0.10 to $0.50   | 5 to 15 s    | Excellent                 | Excellent (slow)
Figure 33.3.2: Pattern comparison, late 2025 numbers. The hybrid router gives the best cost/quality trade-off for mixed traffic; agentic search wins for complex multi-hop queries but is 10x to 100x more expensive than direct.
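The comparison can be read as a constrained choice: pick the cheapest pattern that meets a quality floor within a latency budget. The sketch below encodes the rows above using the upper end of each cost and latency range, which is conservative for the hybrid router. Mapping the qualitative labels to ordinal ranks is an assumption made for illustration.

```python
from dataclasses import dataclass

# Ordinal encoding of the qualitative labels (an illustrative assumption).
RANK = {"poor": 0, "mediocre": 1, "good": 2, "excellent": 3}


@dataclass(frozen=True)
class Pattern:
    name: str
    max_cost_usd: float      # upper end of the quoted cost range
    max_p95_s: float         # upper end of the quoted p95 latency range
    knowledge_quality: str   # quality on knowledge tasks


PATTERNS = [
    Pattern("direct-mini", 0.003, 0.5, "poor"),
    Pattern("direct-4o", 0.015, 0.8, "mediocre"),
    Pattern("static-rag", 0.025, 1.5, "good"),
    Pattern("hybrid-router", 0.025, 1.7, "good"),
    Pattern("agentic-search", 0.50, 15.0, "excellent"),
]


def cheapest_pattern(min_quality, latency_budget_s):
    """Cheapest pattern meeting the quality floor and latency budget, or None."""
    eligible = [p for p in PATTERNS
                if RANK[p.knowledge_quality] >= RANK[min_quality]
                and p.max_p95_s <= latency_budget_s]
    return min(eligible, key=lambda p: p.max_cost_usd, default=None)


print(cheapest_pattern("good", latency_budget_s=1.5))
print(cheapest_pattern("excellent", latency_budget_s=2.0))  # no pattern fits
```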

33.3.7 Debugging the Choice

Picking the right pattern is empirical. A practical workflow:

  1. Start with direct VLM as the baseline.
  2. Collect 100 to 200 representative production queries.
  3. Label each query with the "correct" pattern (a human reviews and decides).
  4. Measure accuracy of direct VLM vs RAG vs hybrid on this labeled set.
  5. Choose the pattern that hits your quality target at the lowest cost.
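Steps 2 and 3 also yield two numbers worth computing before any accuracy comparison: the traffic mix implied by the human labels, and how often a candidate router agrees with those labels. A minimal sketch follows, with hypothetical function names and a made-up four-query sample.

```python
from collections import Counter


def traffic_mix(human_labels):
    """Share of queries whose human-assigned 'correct' pattern is each label."""
    counts = Counter(human_labels)
    total = len(human_labels)
    return {label: count / total for label, count in counts.items()}


def router_agreement(human_labels, router_labels):
    """Fraction of queries where the router matched the human label,
    plus a confusion count keyed by (human, router) pairs."""
    if len(human_labels) != len(router_labels):
        raise ValueError("label lists must be the same length")
    confusion = Counter(zip(human_labels, router_labels))
    agree = sum(c for (h, r), c in confusion.items() if h == r)
    return agree / len(human_labels), confusion


# Made-up sample; a real run would use the 100 to 200 labeled queries.
human = ["DIRECT", "DIRECT", "RAG", "AGENT"]
router = ["DIRECT", "RAG", "RAG", "AGENT"]
print(traffic_mix(human))
print(router_agreement(human, router))
```

Off-diagonal entries in the confusion count show which mistakes the router makes: sending a direct-answerable query to RAG wastes money, while sending a RAG query down the direct path risks a hallucinated answer.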

For production monitoring, instrument the router with the chosen pattern, the user's downstream feedback (thumbs up/down), and any retrieval hit-rate metrics. Over time, you can adapt the router thresholds based on which patterns produce satisfactory outcomes.
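A minimal sketch of that instrumentation, assuming one log record per routed query. The `RoutedQuery` fields and the `summarize` helper are hypothetical names; a production system would write these records to its existing logging or analytics pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoutedQuery:
    pattern: str                   # "DIRECT", "RAG", or "AGENT"
    thumbs_up: Optional[bool]      # None when the user gave no feedback
    retrieval_hit: Optional[bool]  # None when no retrieval ran


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else None


def summarize(log):
    """Per-pattern query count, satisfaction rate, and retrieval hit rate."""
    counts = defaultdict(lambda: {"n": 0, "rated": 0, "up": 0,
                                  "retrievals": 0, "hits": 0})
    for entry in log:
        c = counts[entry.pattern]
        c["n"] += 1
        if entry.thumbs_up is not None:
            c["rated"] += 1
            c["up"] += int(entry.thumbs_up)
        if entry.retrieval_hit is not None:
            c["retrievals"] += 1
            c["hits"] += int(entry.retrieval_hit)
    return {pattern: {"queries": c["n"],
                      "satisfaction": _ratio(c["up"], c["rated"]),
                      "hit_rate": _ratio(c["hits"], c["retrievals"])}
            for pattern, c in counts.items()}


# Made-up records for illustration.
log = [
    RoutedQuery("DIRECT", True, None),
    RoutedQuery("DIRECT", None, None),
    RoutedQuery("RAG", False, True),
    RoutedQuery("RAG", True, False),
]
print(summarize(log))
```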

Note: A common anti-pattern: RAG on everything

A frequent design mistake in 2024-2025 was applying RAG to every multimodal query "just in case". This adds latency, cost, and the risk of low-quality retrievals contaminating the prompt. If 60% of your queries are answerable directly, the hybrid router pays for itself many times over. Don't reach for RAG until direct-VLM hallucination shows up in your spot checks.

33.3.8 The Router as a Policy

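The router is not just a static prompt. In production, it can be prompt-based, like the classifier in Code Fragment 33.3.1a, or learned from logged routing outcomes.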

The 2026 production default is prompt-based for systems under 1M queries/day (engineering simplicity) and learned for high-volume systems where the router's accuracy directly affects cost.

Real-World Scenario: A Field Service Assistant

A 2025 industrial-equipment company built a field-service assistant for technicians. Queries fell into three buckets: (a) general visual ID ("what's this part?"), best answered direct; (b) specific equipment lookup ("what's the torque spec for model XJ-7?"), best answered via RAG over internal documentation; (c) troubleshooting ("this error code with this video; what should I check?"), best answered via agentic search across docs, parts catalog, and historic tickets. They implemented a three-way router with a GPT-4o-mini classifier; routing latency added ~120 ms but cut per-query cost by 73% versus naive "RAG on everything." Field technician satisfaction with the assistant rose from 3.1 to 4.6 out of 5.

Key Insight

Choose the multimodal pipeline that matches the query, not the loudest hype. Direct VLM is right for self-contained visual queries; static RAG for private or time-sensitive content; hybrid routers for mixed traffic; agentic search for complex multi-hop queries. The hybrid router is the right default for any production system with non-uniform query types. Hallucination patterns and human spot checks are the practical signals that tell you when to escalate from direct VLM to retrieval-augmented patterns.

Self-Check
Q1: A user uploads a screenshot of an obscure scientific instrument and asks "what is this?" Which pattern wins, and what is the failure mode if you pick wrong?
Answer:
If the instrument is general-knowledge (a standard piece of lab equipment a knowledgeable human could identify by sight), direct VLM wins: the query is self-contained, latency matters, and retrieval cannot improve on what the VLM already sees. If the instrument is sufficiently obscure that the VLM's training data did not cover it (a niche prototype, an internal device), the failure mode of going direct is a confident hallucination: the VLM produces a plausible-sounding identification that is wrong, with no grounding signal that anything is amiss. The fix in that regime is RAG over a specialist image catalog (vendor catalogs, internal asset databases) so the answer is grounded in actual retrieved instances rather than the model's parametric guess. The smoking gun for "you picked wrong" is hallucination in spot checks; that is the cue to add retrieval.
Q2: Sketch a hybrid router for a customer-support assistant where 70% of queries are about general product behavior and 30% are about specific orders.
Answer:
A two-branch router fits the traffic mix. Step 1: a GPT-4o-mini classifier with a router prompt decides DIRECT (general product behavior) vs RAG (specific order or account lookup). Step 2: on DIRECT, send the query plus any attachments to GPT-4o-mini with a system prompt encoding the product manual; on RAG, fetch the user's order details by ID from the internal database, plus top-k FAQ chunks from a vector index, and pass the assembled context to the same generation model. Routing cost is ~$0.0001 per query (~100 tokens at mini-variant prices); the 70 percent direct branch costs ~$0.003 per query and the 30 percent RAG branch ~$0.025, blending to roughly $0.0097 on average once the router call is included (0.0001 + 0.7 × 0.003 + 0.3 × 0.025). This beats running RAG on every query (~$0.025 always) by ~60 percent while preserving quality on the order-lookup tail.
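The blended figure in this answer can be reproduced in a few lines. The per-branch costs are the ones quoted in the answer; the function name is a hypothetical helper. Depending on whether the router call is counted, the average comes out at roughly $0.0096 or $0.0097.

```python
def blended_cost(direct_share, direct_cost, rag_cost, router_cost=0.0):
    """Average per-query cost for a two-branch router."""
    if not 0.0 <= direct_share <= 1.0:
        raise ValueError("direct_share must be between 0 and 1")
    return (router_cost
            + direct_share * direct_cost
            + (1.0 - direct_share) * rag_cost)


ALWAYS_RAG = 0.025  # cost of running RAG on every query

without_router = blended_cost(0.70, 0.003, ALWAYS_RAG)
with_router = blended_cost(0.70, 0.003, ALWAYS_RAG, router_cost=0.0001)
savings = 1.0 - with_router / ALWAYS_RAG

print(f"blended cost, router excluded: ${without_router:.4f}")
print(f"blended cost, router included: ${with_router:.4f}")
print(f"savings vs RAG on everything:  {savings:.0%}")
```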
Q3: You instrument your direct-VLM-only system and find 12% of answers are hallucinated. What is the minimum change to fix this, and what does it cost per query?
Answer:
A 12 percent hallucination rate is the canonical signal that direct VLM has a knowledge gap; the minimum fix is to add retrieval over the content the model is fabricating. Concretely: index the relevant corpus (user documents, internal catalog, time-sensitive data) once with SigLIP 2 or a text embedding model, and at query time retrieve the top-k results and pass them with the explicit "Answer using ONLY these..." grounding prompt. Retrieval and the larger prompt context bring the total to roughly $0.025 per query (the Static RAG row in Section 33.3.6), up from ~$0.003 to $0.015 for direct; latency rises by ~500 to 1000 ms. Empirically this is the fix that takes hallucination rates from double digits down to single-digit percentages (the Q4 case study in Section 33.4 shows 18 percent dropping to 6 percent after adding ColPali RAG).
Q4: Why does agentic search produce higher quality on multi-hop queries than static RAG with k=20?
Answer:
Static RAG with k=20 retrieves once against a single query embedding, so for a multi-hop question (e.g., "Texas solar panel installations completed after January 2025 with published efficiency metrics") the retrieved chunks have to satisfy all constraints simultaneously, and the embedding similarity is dominated by whichever clause is most prominent. Agentic search instead lets the VLM issue multiple retrievals iteratively: first fetch installations in Texas, inspect results, then issue a follow-up retrieval filtered by date, then a third for efficiency metrics, combining evidence across hops. This decomposition matches the structure of the question and surfaces relevant content that no single-shot retrieval ranks highly enough to make the top-20. The trade-off is cost and latency (5 to 15 seconds and $0.10 to $0.50 per query versus ~1.5 s and $0.025 for static RAG), so agentic search is reserved for the complex long tail.
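The hop-by-hop structure can be illustrated with a toy in-memory corpus. This is a deliberately simplified sketch: the records are invented, each "retrieval" is an exact filter rather than an embedding search, the hops are hard-coded where a real agent would choose each follow-up after inspecting results, and "after January 2025" is interpreted as on or after 1 February 2025.

```python
from datetime import date

# Invented records standing in for an installation database.
CORPUS = [
    {"id": 1, "state": "TX", "completed": date(2025, 3, 10), "efficiency": 0.21},
    {"id": 2, "state": "TX", "completed": date(2024, 11, 2), "efficiency": 0.20},
    {"id": 3, "state": "CA", "completed": date(2025, 6, 1), "efficiency": 0.22},
    {"id": 4, "state": "TX", "completed": date(2025, 8, 19), "efficiency": None},
]


def search(candidates, predicate):
    """One retrieval hop: keep the candidates that satisfy the predicate."""
    return [record for record in candidates if predicate(record)]


def multi_hop_lookup(corpus):
    """Apply one constraint per hop, recording what each hop returned."""
    hops = [
        ("installed in Texas", lambda r: r["state"] == "TX"),
        ("completed after January 2025",
         lambda r: r["completed"] >= date(2025, 2, 1)),
        ("has published efficiency", lambda r: r["efficiency"] is not None),
    ]
    trace = []
    candidates = corpus
    for description, predicate in hops:
        candidates = search(candidates, predicate)
        trace.append((description, [r["id"] for r in candidates]))
    return candidates, trace


results, trace = multi_hop_lookup(CORPUS)
for description, ids in trace:
    print(f"{description}: {ids}")
```

Each hop narrows the candidate set against a single constraint, which is the property a single-shot top-k retrieval lacks.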

What Comes Next

Section 33.4: Multimodal Reasoning in Production closes the chapter with the full cost/latency/quality matrix and the model-selection guidance for production deployments.
