Multimodal RAG: Image, Audio, Video Retrieval-Augmented Generation

Section 33.2

"A text RAG explains an answer from a document. A multimodal RAG explains an answer from a picture."

RAGRAG, Multimodally-Indexed AI Agent
Big Picture

Multimodal RAG extends the retrieval-augmented generation pattern of Chapter 32 to images, audio, and video. The architecture has three pieces: a multimodal retrieval index (built on the joint embedding spaces of Section 33.1), a multimodal LLM (the VLMs from Section 31.1 and the omni models of Section 37.4), and a query orchestrator that decides what to retrieve, how to chunk it, and how to splice it into the generation prompt. This section covers image-as-context patterns, video chunking, audio retrieval, and the production trade-offs (cost, latency, modality interleaving) that make multimodal RAG one of the most-deployed but least-discussed pieces of 2026 production AI.

Prerequisites

This section assumes the unimodal RAG architecture and vector-database patterns from Section 32.1, and the multimodal-embedding fundamentals from Section 19.2.

Multimodal RAG architecture: user query, retrieval over text/image/audio/video index, top-k results spliced into VLM prompt with text query, VLM generates answer
Figure 33.2.1: Multimodal RAG pipeline. The retrieval index spans multiple modalities; the VLM consumes them alongside the text query. The decisions about chunking, modality mixing, and reranking determine production quality.

33.2.1 The Three Cuts of Multimodal RAG

Fun Fact

Multimodal RAG's most common production failure is also the most banal: the system retrieves the right image, then describes a slightly different image because the VLM's captioner is hallucinating against a similar one from training. Teams chasing this bug usually find it after spending a week tuning embeddings, not minutes tuning prompts.

A cartoon buffet with three labelled stations: Image-as-Context with a small camera, Document-with-Images with a stack of PDFs, and Audio-Video RAG with a tiny film reel. A friendly VLM diner robot walks past sampling one plate from each, holding all three on a single tray
Figure 33.2.2: Multimodal RAG is one umbrella term sheltering three different buffet stations. The VLM diner can sample from any of them, but each station has its own chunking, embedding, and prompt-engineering rules.
Key Insight: Worked Example: Three Cuts, One Insurance Claim

Lemonade Insurance's 2024 claims pipeline uses all three multimodal-RAG cuts on the same incoming claim. Image-as-context: the customer uploads a smartphone photo of their water-damaged ceiling, the system retrieves the 8 most-similar past photos from a 470,000-image library indexed with SigLIP, and GPT-4o estimates damage severity by comparing them. Document-with-images: the claims system retrieves the customer's policy PDF and the 3 most relevant pages with diagrams of covered scenarios, then Claude 3.5 Sonnet checks coverage against the highlighted text. Audio-or-video: the customer's 90-second video walk-through is transcribed with Whisper-Large-v3 and chunked into 5-second segments, the segments are indexed with transcript embeddings alongside CLAP audio embeddings, and the agent retrieves the moment where the customer says "the leak started after the storm last Tuesday" to confirm the timeline. Three retrieval indexes, three different chunkers, three different prompts, one $4,200 claim approved in 47 seconds. This is the "three cuts" abstraction made tangible: not three competing patterns, but three layers of the same pipeline.
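
"Multimodal RAG" is an umbrella term for three distinct retrieval patterns:

  1. Image-as-context: a text query retrieves images, and a VLM answers from them (Section 33.2.2).
  2. Document-with-images: the text chunks and visual elements of PDFs, slide decks, and manuals are indexed and retrieved jointly (Section 33.2.3).
  3. Audio-or-video: time-based media is chunked into keyframes, transcript segments, or acoustic segments and retrieved by timestamp (Sections 33.2.4 and 33.2.5).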

Each pattern has different chunking, embedding, and prompt-engineering needs, but all share the basic flow: encode-the-corpus, embed-the-query, retrieve-top-k, prompt-the-VLM.
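
The shared flow can be sketched without any model at all. The following is a minimal illustration in which hand-written 3-dimensional vectors stand in for joint-space embeddings; `cosine`, `retrieve_top_k`, and the toy index entries are hypothetical names invented for this sketch, and the final prompt-the-VLM step is left as a comment.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, index, k=2):
    """Return the ids of the k indexed items most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["id"] for item in ranked[:k]]

# 1. encode-the-corpus: toy vectors standing in for real embeddings (assumption).
index = [
    {"id": "ceiling_leak.jpg", "vec": [0.9, 0.1, 0.0]},
    {"id": "policy_page_3.png", "vec": [0.1, 0.9, 0.1]},
    {"id": "walkthrough_00m15s.wav", "vec": [0.2, 0.1, 0.9]},
]
# 2. embed-the-query: a toy query vector that lies closest to the first item.
query_vec = [0.8, 0.3, 0.1]
# 3. retrieve-top-k
top_ids = retrieve_top_k(query_vec, index, k=2)
# 4. prompt-the-VLM: the retrieved items would be attached to the prompt here.
print(top_ids)
```

In a real pipeline only steps 1 and 2 change per modality (a different encoder and chunker); the ranking step stays the same.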

33.2.2 Image-as-Context Pattern

The simplest multimodal RAG: a text query retrieves images, and a VLM uses the retrieved images as visual context. The flow:

  1. Encode all images with SigLIP 2 (or your chosen joint embedding model from Section 33.1). Store in a vector database.
  2. At query time, encode the text query, retrieve top-$k$ images.
  3. Pass the images plus the original query to a VLM (GPT-4o, Gemini 2.5 Pro, Qwen2-VL).
  4. The VLM produces an answer grounded in the retrieved visual context.
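# Image-as-context multimodal RAG: SigLIP for retrieval,
# GPT-4o-mini for grounded generation.
import base64
import os

import qdrant_client
from openai import OpenAI

# Assumption: the Qdrant endpoint is taken from the environment.
QDRANT_URL = os.environ.get("QDRANT_URL", "http://localhost:6333")
qd = qdrant_client.QdrantClient(url=QDRANT_URL)
oai = OpenAI()  # reads OPENAI_API_KEY from the environment

def multimodal_rag(query, k=5):
    # 1. Encode the text query into the joint space.
    #    encode_text_siglip is not defined in this fragment; it is assumed
    #    to be a SigLIP text-encoder helper such as the one in Section 33.1.
    q_emb = encode_text_siglip([query])[0]
    # 2. Retrieve top-k images from Qdrant.
    #    Recent qdrant-client releases deprecate search() in favor of
    #    query_points(); adapt the call to your installed version.
    hits = qd.search(
        collection_name="images",
        query_vector=q_emb.tolist(),
        limit=k,
    )
    image_payloads = [h.payload["path"] for h in hits]
    # 3. Build a multimodal prompt for the VLM. Count the images actually
    #    retrieved, since the index may return fewer than k.
    n = len(image_payloads)
    content = [{"type": "text",
                "text": f"Answer using ONLY these {n} images. {query}"}]
    for path in image_payloads:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({
            "type": "image_url",
            # Assumes the stored files are JPEGs; change the MIME type otherwise.
            "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                            "detail": "high"},
        })
    # 4. Generate the grounded answer.
    response = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": content}],
        max_tokens=400,
    )
    return response.choices[0].message.content, image_payloads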
Code Fragment 33.2.1a: End-to-end image-as-context RAG in about 40 lines. Two production considerations: (1) the "Answer using ONLY these images" instruction is essential to suppress hallucinations from the VLM's prior knowledge; (2) the "detail": "high" mode increases image tokens by ~6x but is necessary for fine-grained visual content.
Key Insight: Why retrieval beats giving the model everything

Why not just pass all your images to the VLM and let it figure out which are relevant? Cost and quality. A 1024x1024 image at "high" detail consumes ~1100 tokens. A 100-image corpus is therefore 110,000 input tokens per query, roughly $0.50 per request at GPT-4o pricing. With retrieval, you pass 5 to 10 images per query: about $0.025 at k=5, a 20x cost saving (10x at k=10), with comparable quality. The retrieval-then-VLM pattern is the multimodal analog of classical text RAG: cheap retrieval over expensive context.
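
The arithmetic behind this comparison is easy to reproduce. The sketch below uses the ~1100 tokens-per-image figure from the text; the price of $5 per million input tokens is an assumption chosen to roughly match the $0.5 figure, not a quoted rate, and `image_context_cost` is a hypothetical helper name.

```python
def image_context_cost(num_images, tokens_per_image=1100,
                       usd_per_million_tokens=5.0):
    """Input tokens and USD cost of attaching num_images to one request.

    tokens_per_image comes from the text; the price is an assumed value.
    """
    tokens = num_images * tokens_per_image
    return tokens, tokens * usd_per_million_tokens / 1_000_000

full_tokens, full_cost = image_context_cost(100)  # whole corpus in the prompt
rag_tokens, rag_cost = image_context_cost(5)      # top-5 retrieved images only

print(f"all images: {full_tokens} tokens, ${full_cost:.3f} per query")
print(f"top-5 RAG:  {rag_tokens} tokens, ${rag_cost:.4f} per query")
print(f"saving:     {full_tokens // rag_tokens}x")
```

Because cost is linear in the number of attached images, the saving is simply corpus size divided by k, so it grows with the corpus while the per-query cost of retrieval stays flat.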

33.2.3 Document-with-Images RAG

Real-world documents (PDFs, slide decks, technical manuals) interleave text, figures, charts, and tables. A naive approach extracts only text and ignores images, losing critical information. A better approach indexes both text chunks and visual elements, then retrieves jointly.

Three patterns dominate in 2026, shown here against a text-only baseline:

| Pattern | Indexing Cost | Query Quality | Best for |
|---|---|---|---|
| Text-only (ignore images) | Lowest | Poor on figure-heavy docs | Pure text corpora |
| ColPali / ColQwen page-as-image | Medium (VLM per page) | State of the art | Technical PDFs, slide decks |
| Hybrid text + image embedding | Medium | Good | Mixed corpora at scale |
| VLM-summarized chunks | Highest (VLM at index time) | Good | Small to medium corpora |
Figure 33.2.3: Document-with-images RAG patterns, late 2025. ColPali-family methods are the new state of the art for technical document QA because they sidestep OCR entirely; for very large corpora, hybrid text + image embedding scales better.

33.2.4 Video RAG and Chunking Strategies

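A video is a temporal sequence of images plus an audio track. The chunking strategy determines what retrieval can find:

  1. Transcript-only: run ASR on the audio track and index timestamped text chunks. Finds what was said, misses what was only shown.
  2. Keyframe-only: sample or detect keyframes and embed them in the joint space. Finds what was shown, misses what was only said.
  3. Hybrid keyframe-plus-transcript: index both channels with shared timestamps and merge the hits at query time.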

Key Insight: Aha Moment: Why Transcript-Only RAG Misses the Demo

Take a 47-minute MLflow conference talk from 2024. Transcript-only RAG (Whisper-large-v3 plus 200-token text chunks indexed with bge-large) answered the query "what command did the speaker run to register the model?" with the wrong timestamp because the speaker said "I will run this command" at 09:12 and then ran the actual command silently at 11:43, with the command name visible only in the terminal capture on screen. Keyframe-only RAG (one frame per 5 seconds embedded with SigLIP) found the right moment (11:43) but could not parse the small terminal text. Hybrid keyframe-plus-transcript RAG retrieved 09:12 from the transcript and 11:43 from the keyframe, then sent both frames plus the surrounding 30 seconds of transcript to GPT-4o, which read the visible command "mlflow models register --name churn-v3 ..." off the keyframe and produced the correct answer. The lesson: each chunking strategy is sensitive to a different signal, so retrieval recall on real video corpora caps at the channel-coverage of your indexer. Hybrid is the production default not because it is cleverer but because it is the only strategy that covers both channels at once.

For long videos (lectures, podcasts), the hybrid pattern with timestamp alignment is the production default. A 1-hour lecture might produce 60 keyframe vectors plus 200 transcript chunk vectors; total index size is small even for huge video libraries.
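
A back-of-envelope calculation shows how small the index is. This sketch takes the 60 keyframes and 200 transcript chunks per hour from the text; the embedding widths (1152 for frames, 1024 for text), the uncompressed float32 storage, and the 10,000-hour library are all assumptions for illustration, and metadata and ANN index overhead are ignored.

```python
BYTES_PER_FLOAT32 = 4

def index_bytes_per_hour(keyframes=60, frame_dim=1152,
                         transcript_chunks=200, text_dim=1024):
    """Raw vector storage for one hour of video (no metadata, no ANN overhead).

    The per-hour counts come from the text; both dimensions are assumed values.
    """
    floats = keyframes * frame_dim + transcript_chunks * text_dim
    return floats * BYTES_PER_FLOAT32

per_hour = index_bytes_per_hour()
library_hours = 10_000  # hypothetical large video library
total_gb = per_hour * library_hours / 1e9

print(f"{per_hour / 1e6:.2f} MB per hour of video")
print(f"{total_gb:.1f} GB for {library_hours} hours")
```

Under these assumptions the vectors for an hour of video take about a megabyte, so even a library of thousands of hours fits on a single machine.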

# Hybrid video chunking: keyframes at scene boundaries + transcript
# segments. Frame chunks carry a single `time` in seconds; transcript
# chunks carry a (start, end) range in seconds.
import cv2
from faster_whisper import WhisperModel

whisper = WhisperModel("large-v3", device="cuda")

def chunk_video(video_path, scene_threshold=0.4):
    chunks = []
    # 1. Keyframes via scene-change detection
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unreported
    key_hist = None  # histogram of the most recently kept keyframe
    frame_idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # Coarse 8x8x8 color histogram of the current frame.
        hist = cv2.calcHist([frame], [0, 1, 2], None,
                            [8, 8, 8], [0, 256] * 3)
        # Keep the frame when it differs enough from the last keyframe.
        if key_hist is None or \
           cv2.compareHist(key_hist, hist,
                           cv2.HISTCMP_BHATTACHARYYA) > scene_threshold:
            t = frame_idx / fps
            chunks.append({"type": "frame", "time": t, "frame": frame})
            key_hist = hist
        frame_idx += 1
    cap.release()
    # 2. Transcript chunks with timestamps
    segments, _ = whisper.transcribe(video_path, word_timestamps=False)
    for seg in segments:
        chunks.append({
            "type": "transcript",
            "start": seg.start,
            "end": seg.end,
            "text": seg.text,
        })
    return chunks
Code Fragment 33.2.2a: Hybrid video chunking. Scene-boundary detection produces visual keyframes; Whisper produces transcript chunks. Each is independently embedded (SigLIP for frames, a text embedding model for transcripts) and indexed with timestamp metadata so the retrieval result can be played back at the matched moment.
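
Once both channels are indexed with timestamps, the query side has to merge hits and pull in the surrounding transcript, as in the conference-talk example above. The following is a minimal sketch of that alignment step under stated assumptions: `gather_context`, the hit format (`time`, `source`), and the toy transcript are hypothetical, and a window of 15 seconds either side gives the "surrounding 30 seconds" of context.

```python
def gather_context(hits, transcript_chunks, window_sec=15.0):
    """For each retrieved hit, collect transcript text within +/- window_sec.

    hits: dicts with "time" (seconds) and "source" ("frame" or "transcript").
    transcript_chunks: dicts with "start", "end", "text".
    """
    contexts = []
    for hit in hits:
        lo, hi = hit["time"] - window_sec, hit["time"] + window_sec
        # Keep every segment whose (start, end) range overlaps the window.
        texts = [c["text"].strip() for c in transcript_chunks
                 if c["start"] <= hi and c["end"] >= lo]
        contexts.append({"time": hit["time"], "source": hit["source"],
                         "transcript": " ".join(texts)})
    # Present the moments in playback order.
    return sorted(contexts, key=lambda c: c["time"])

# Toy data loosely mirroring the talk example: 09:12 = 552 s, 11:43 = 703 s.
transcript_chunks = [
    {"start": 548.0, "end": 555.0, "text": " I will run this command."},
    {"start": 600.0, "end": 606.0, "text": " Let me switch to the terminal."},
    {"start": 698.0, "end": 709.0, "text": " And now the model is registered."},
]
hits = [
    {"time": 703.0, "source": "frame"},       # keyframe hit
    {"time": 552.0, "source": "transcript"},  # transcript hit
]
contexts = gather_context(hits, transcript_chunks)
for c in contexts:
    print(c["time"], c["source"], "->", c["transcript"])
```

The returned contexts, together with the frames at those timestamps, are what would be placed in the VLM prompt.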

33.2.5 Audio RAG

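Pure audio RAG (no video) follows a similar pattern, with one additional design axis: whether to treat audio as "text after transcription" or as a first-class modality with its own embedding. The right answer almost always depends on what the user is searching for (lexical content versus acoustic phenomena), which gives three choices:

  1. Transcript-only: transcribe with an ASR model such as Whisper, then run ordinary text RAG over timestamped chunks.
  2. Acoustic embedding: embed audio segments directly with a model such as CLAP, so that non-linguistic content (laughter, tone) is searchable.
  3. Hybrid: index both and merge results by timestamp.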

For most production audio RAG (call center transcripts, lecture archives), transcript-only retrieval is the right starting point. Add acoustic embeddings only if your use case demands non-linguistic content. Gong.io and Chorus.ai, both 8-figure-ARR conversation-intelligence products as of 2024, run on what is essentially "Whisper plus diarization plus text RAG"; the acoustic-embedding layer is only worth adding when your users start asking the kind of "find the laughter" or "find the angry tone" queries that text cannot serve.

33.2.6 Modality-Aware Reranking

Cross-modal retrieval is noisier than within-modal retrieval. A text-to-image query can return semantically related but spurious matches: the same picture of dogs surfaces for both "dog playing" and "dog at the beach". The standard fix is a reranker: a more expensive model (cross-encoder, VLM) that scores the top-k retrieved items more carefully.

For image retrieval, a small VLM (Qwen2-VL-2B, Idefics2-8B) reranks by computing a relevance score between the query text and each candidate image. The pattern:

  1. Joint-embedding retrieval returns top-50 candidates (cheap).
  2. VLM reranker scores each candidate against the query (medium-cost).
  3. Top-5 reranked images go to the final generation VLM.

This two-stage retrieve-then-rerank pattern, identical to text RAG (Section 32.1), improves answer quality by 5 to 15 percentage points on most benchmarks at modest extra latency (50 to 200 ms).
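
The control flow of the two stages can be sketched independently of any particular model. In the illustration below, `retrieve_then_rerank` and `toy_rerank_score` are hypothetical names; the word-overlap scorer is only a stand-in for a VLM reranker, and the `caption` field stands in for what such a model would see in the image.

```python
def retrieve_then_rerank(query, candidates, rerank_score,
                         retrieve_n=50, final_n=5):
    """Two-stage selection: cheap embedding scores, then an expensive reranker."""
    # Stage 1: keep the retrieve_n candidates with the best embedding score.
    shortlist = sorted(candidates, key=lambda c: c["embed_score"],
                       reverse=True)[:retrieve_n]
    # Stage 2: rescore only the shortlist with the expensive model.
    rescored = [(rerank_score(query, c), c) for c in shortlist]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in rescored[:final_n]]

def toy_rerank_score(query, candidate):
    """Stand-in for a VLM relevance score: word overlap with a caption."""
    return len(set(query.lower().split()) & set(candidate["caption"].lower().split()))

candidates = [
    {"id": "dog_park.jpg", "embed_score": 0.81,
     "caption": "dog playing fetch in a park"},
    {"id": "dog_beach.jpg", "embed_score": 0.79,
     "caption": "dog running at the beach"},
    {"id": "cat_sofa.jpg", "embed_score": 0.30,
     "caption": "cat asleep on a sofa"},
]
top = retrieve_then_rerank("dog at the beach", candidates, toy_rerank_score,
                           retrieve_n=2, final_n=1)
print(top[0]["id"])
```

Here the embedding stage alone would have ranked the park photo first; the reranker only has to look at the shortlist to promote the beach photo, which is why its cost stays bounded by `retrieve_n` rather than by corpus size.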

Note: Late interaction is the dark horse

ColPali-style late interaction (ColBERT for documents, MaxSim aggregation for page-as-image) is gaining ground over the retrieve-then-rerank pattern. Late interaction stores multiple vectors per item and matches at the token level, giving better precision than single-vector retrieval without requiring a separate reranker. Storage costs are higher (5 to 10x), but for high-precision technical document QA, late interaction is the new state of the art in 2026.
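
The MaxSim aggregation can be illustrated with toy vectors. This is a sketch under stated assumptions, not the ColPali implementation: the 2-dimensional vectors are hand-written, `maxsim` and `pooled_score` are hypothetical helper names, and mean pooling is used only as a crude stand-in for a single-vector embedding.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, page_vecs):
    """Late interaction: each query token takes its best-matching page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

def mean_pool(vecs):
    """Average a set of vectors into one (crude single-vector stand-in)."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def pooled_score(query_vecs, page_vecs):
    return dot(mean_pool(query_vecs), mean_pool(page_vecs))

query = [[1.0, 0.0], [0.0, 1.0]]    # two query-token vectors
page_a = [[0.9, 0.1], [0.1, 0.8]]   # one strong patch per query token
page_b = [[0.5, 0.5], [0.6, 0.4]]   # uniformly mediocre patches

print("MaxSim:", maxsim(query, page_a), maxsim(query, page_b))
print("Pooled:", pooled_score(query, page_a), pooled_score(query, page_b))
```

In this toy case MaxSim prefers `page_a`, where each query token has a dedicated strong match, while the pooled score slightly prefers `page_b` because averaging blurs the distinctive patches together. The price is one stored vector per patch instead of one per page.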

33.2.7 Production Trade-offs: Cost, Latency, Quality

| Pipeline | Index Cost (1M items) | Query Cost | Query Latency p95 | Quality Floor |
|---|---|---|---|---|
| SigLIP retrieval + GPT-4o-mini | ~$200 GPU | ~$0.02 | 1.2 s | Good for general use |
| ColPali + Qwen2-VL-7B | ~$1500 GPU | ~$0.04 | 2.5 s | SOTA for technical docs |
| Hybrid (text + image) + GPT-4o | ~$500 GPU | ~$0.06 | 1.8 s | Best general quality |
| Video keyframe + transcript | ~$300 GPU + ASR | ~$0.03 | 1.5 s | Time-aligned answers |
Figure 33.2.4: Production cost/latency/quality matrix for multimodal RAG pipelines, late 2025 numbers. Adjust for inflation in your fiscal year.
Real-World Scenario: A Pharmaceutical Document Q&A

A 2025 pharma R&D team needed Q&A over a 40,000-page corpus of internal documents heavy with chemical structures, dosing tables, and study charts. Text-only RAG missed 60% of figure-heavy queries. They moved to ColQwen-2.5 with Qwen2-VL-72B for generation. Recall@10 rose from 41% to 88%; the team's literature-review time per query dropped from ~30 minutes to ~2 minutes. Total cost: $4,200 in one-time indexing GPU time plus ~$0.18 per query at production volume of 5000 queries/day.

Key Insight

Multimodal RAG combines joint-embedding retrieval with multimodal LLMs to answer questions grounded in image, audio, and video corpora. Image-as-context is the simplest pattern; ColPali-family page-as-image is the new state of the art for technical document QA; hybrid keyframe-plus-transcript handles video. The retrieve-then-rerank pattern transfers cleanly from text RAG. Pick the pipeline based on the corpus type, the precision requirement, and the cost target. The biggest 2024-2026 shift: OCR is no longer required for technical document QA, because native VLM embedding of pages outperforms it.

Self-Check
Q1: Why does the image-as-context prompt include "Answer using ONLY these images" rather than the conventional "Answer the question"? What VLM behavior is being suppressed?
Answer:
VLMs like GPT-4o and Gemini 2.5 Pro carry a large parametric memory; given a generic prompt they will happily blend retrieved visual context with prior knowledge, producing answers that look grounded but actually came from training data. The "Answer using ONLY these images" instruction explicitly suppresses this prior-knowledge leakage, forcing the model to either ground its claim in the retrieved images or refuse. The behavior being suppressed is the same hallucination pattern that text RAG fixes with explicit grounding instructions; without the directive, the entire point of retrieval (private, time-sensitive, or domain-specific content) is undermined by the model falling back on parametric memory.
Q2: ColPali stores multiple vectors per page rather than a single embedding. What is the late-interaction step at query time, and why does it improve precision?
Answer:
ColPali (using PaliGemma) and ColQwen (using Qwen2-VL) encode each page as a collection of per-patch (or per-token) vectors rather than a single page-level vector. At query time, the text query is encoded into its own set of token vectors, and the late-interaction step computes a MaxSim score: for each query token, find the most similar page-patch vector, then sum those maxima across query tokens. This is the same trick ColBERT uses for text: instead of forcing the page into one bottlenecked vector, the model preserves token-level granularity and matches each query token to its best evidence patch. Precision improves because fine-grained content (a number in a table, a label on a chart) gets its own vector and can decisively match a query about that detail, where a single-vector embedding would average it away.
Q3: For a 100-hour podcast archive, sketch your chunking and embedding strategy. What does the query latency look like?
Answer:
For most podcast use cases, transcript-only RAG is the right starting point. Run Whisper large-v3 on each episode to produce time-aligned transcript segments (typical chunk size 200 to 400 tokens with 10 to 20 percent overlap, preserving start/end timestamps), embed each chunk with a text embedding model (e.g., text-embedding-3-large or BGE), and store in a vector database with metadata for episode ID, speaker, and time range. A 100-hour archive at roughly 10K words per hour is about 1M words, or on the order of 1.3M tokens, which at these chunk sizes yields only a few thousand chunks (roughly 4,000 to 8,000); their embeddings fit well under a gigabyte. Query latency: ~10 to 30 ms for the embedding step, ~10 to 50 ms for ANN retrieval, ~500 ms to 1 s for VLM/LLM generation; total p95 around 1 second. Add CLAP-based acoustic embeddings only if the use case explicitly needs non-linguistic content like "find the segment with audience laughter."
Q4: You notice that your image-as-context RAG returns visually similar but semantically wrong images. Name two debugging steps and one architectural change that would help.
Answer:
Two debugging steps: (1) inspect the top-k retrievals offline for a sample of failing queries to confirm the failure mode is retrieval (wrong images) rather than generation (right images, wrong answer); (2) measure modality-gap and hubness statistics on the embedding store, since visually similar but semantically wrong matches often signal a hub-point artifact or a domain-mismatched encoder (CLIP on technical drawings, for example). The architectural change is a retrieve-then-rerank pipeline: retrieve top-50 with the cheap joint-embedding model (SigLIP 2), then rerank with a small VLM (Qwen2-VL-2B, Idefics2-8B) that scores each candidate against the query text. The reranker has access to fine-grained semantics the dual-encoder does not and reliably lifts answer quality by 5 to 15 percentage points at modest extra latency.

What Comes Next

Section 33.3: When to Retrieve, When to Reason covers the decision rubric for choosing between RAG and direct multimodal reasoning, with hybrid strategies that combine both.

Further Reading

ColPali and Page-as-Image

Faysse, M., Sibille, H., Wu, T., et al. (2024). "ColPali: Efficient Document Retrieval with Vision Language Models." ICLR. arXiv:2407.01449
Khattab, O., & Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." SIGIR. arXiv:2004.12832

Surveys and Tutorials

Yu, S., Tang, C., Xu, B., et al. (2024). "VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents." arXiv. arXiv:2410.10594
Cohere. (2024). "Multimodal Embeddings and Retrieval-Augmented Generation." cohere.com/blog/multimodal-embeddings

Video RAG

Wang, Y., Li, K., Li, X., et al. (2024). "InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding." ECCV. arXiv:2403.15377

Audio RAG

Wu, Y., Chen, K., Zhang, T., et al. (2023). "Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation." ICASSP. arXiv:2211.06687