Section 35.2a: Fusion Retrieval, Multi-Modal RAG & Comparison

Dense retrieval found the meaning. Sparse retrieval found the keyword. Fusion retrieval found both, then bickered about which one mattered. The reranker had to step in.
Rag, Reluctantly Hybrid AI Agent

Big Picture

This section continues Section 35.2, which covered query transformation, HyDE, contextual retrieval, and self-corrective RAG (CRAG, Self-RAG, RAFT). Here we add the remaining advanced-RAG families for production LLM systems: fusion retrieval (RAG-Fusion, query diversification), multi-modal RAG (text plus images, tables, and charts), and a comparison table to help you choose which technique pays off for which agent or chatbot workload.

Prerequisites

This section continues from Section 35.2, which introduced the building blocks of advanced RAG: query rewriting, multi-hop retrieval, and rerankers. Familiarity with dense and sparse retrieval (Chapter 31), the basic RAG pipeline (Chapter 32), and the comparison-table conventions used throughout Part 7 is assumed.

Fun Fact: The BM25 Comeback Tour

Around 2022, the consensus was clear: dense retrieval had defeated BM25 forever. Then someone tried hybrid retrieval, fused dense and sparse scores, and discovered the 1994-vintage algorithm still pulled its weight on out-of-distribution queries. BM25 is now a permanent fixture in production RAG stacks, often quietly outperforming the embedding model it was supposed to replace. Few algorithms have been declared obsolete and then reinstated as senior staff quite so smoothly.

35.2.4 Fusion Retrieval and Multi-Modal RAG

Figure: Two retrieval branches (dense and sparse) merge into RRF fusion, then feed a reranker. Fusion retrieval combines dense (vector) and sparse (BM25) results via reciprocal rank fusion (RRF) before passing the merged list to a reranker.

Fusion retrieval goes beyond combining dense and sparse signals. RAG Fusion (Raudaschl, 2023) generates multiple search queries, retrieves results for each, and applies RRF across all result sets. This approach captures diverse perspectives on the query and is particularly effective for complex, multi-faceted questions.
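The RRF step described above can be sketched in a few lines. This is a minimal illustration, not Raudaschl's reference implementation: each document's fused score is the sum of 1 / (k + rank) over every ranked list it appears in. The function name `reciprocal_rank_fusion`, the document ids, and the constant k=60 (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first ranked lists of document ids into one ranking.

    k is a smoothing constant; 60 is a commonly used default (assumption).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical result lists, one per generated query variant
results_per_query = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_e"],
]
fused = reciprocal_rank_fusion(results_per_query)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because RRF uses ranks rather than raw scores, it can merge lists whose scores are not comparable (for example BM25 scores and cosine similarities) without any normalization step.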

35.2.4.1 Multi-Modal RAG

Multi-modal RAG extends retrieval beyond text to include images, tables, charts, and diagrams. This is essential for domains where critical information is encoded visually, such as scientific papers (figures and plots), financial reports (tables and charts), or technical documentation (architecture diagrams). Vision-language models like GPT-4o and Claude can process both retrieved text and images in their context window.

Warning

Multi-modal RAG introduces several unique challenges: (1) embedding images and text into a shared vector space is still an active research area, with models like CLIP providing only coarse alignment; (2) table extraction from PDFs is error-prone, often requiring specialized tools; (3) the token cost of including images in the context is high (a single image may consume 500+ tokens); and (4) evaluation is more complex because both visual and textual relevance must be assessed.

35.2.5 Comparison of Advanced RAG Techniques

Table 35.2.1b: Advanced RAG techniques at a glance (as of 2026).

Technique	What It Fixes	Latency Cost	Best For
HyDE	Query-document vocabulary gap	+1 LLM call	Technical/domain queries
Multi-Query	Single-perspective retrieval	+1 LLM call, N retrievals	Ambiguous or broad queries
Step-Back	Missing background context	+1 LLM call, 2 retrievals	Specific factual questions
BM25 Hybrid	Missed keyword matches	Minimal (BM25 is fast)	Technical, legal, medical
Cross-Encoder Rerank	Imprecise initial ranking	+N model inferences	High-precision applications
Contextual Retrieval	Context-stripped chunks	Ingestion-time LLM cost	Large document corpora
CRAG / Self-RAG	Blind trust in bad retrieval	+1 to 3 LLM calls	Safety-critical applications
HyPE	Query/document linguistic mismatch	Ingestion-time LLM cost	Stable, high-traffic FAQs
MMR	Redundant top-k chunks	Negligible (one-line switch)	Multi-faceted queries with duplicate sources
RAG-Fusion	Single-query brittleness	+1 LLM call + N retrievals + RRF	Open-ended exploratory questions
Fusion-in-Decoder	Long-tail QA across many passages	N encoder passes per query	Open-domain QA with encoder-decoder models
RAFT	Generator trusts distractors	One-time fine-tune; zero at query time	Narrow-domain QA with stable corpus
CAG	RAG infrastructure overhead	One-time prefill; cached afterwards	Small, stable, high-traffic corpora
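To make one row of the table concrete, the MMR entry can be sketched as a greedy loop that trades relevance to the query against similarity to documents already selected. This is an illustrative pure-Python version under stated assumptions: the 2-D vectors are made up, and `lambda_mult` is a hypothetical parameter name for the relevance/diversity weight.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, top_k=3, lambda_mult=0.5):
    """Greedy MMR: return indices of the selected documents in selection order."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < top_k:
        def mmr_score(i):
            # Penalize similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings (made up): documents 0 and 1 are near-duplicates
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6], [0.0, 1.0]]
diverse = mmr(query_vec, doc_vecs, top_k=2)                      # skips the duplicate
relevance_only = mmr(query_vec, doc_vecs, top_k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=1.0` the loop degenerates to plain top-k by relevance, which is why the near-duplicate pair is returned in that setting.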

Tip: Include Chunk Overlap

When splitting documents, use 10 to 20% overlap between adjacent chunks. This prevents losing context at chunk boundaries. For a 512-token chunk, a 50 to 100 token overlap is a good starting point.
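A sliding window is one simple way to apply this tip. The sketch below assumes the document is already tokenized; integers stand in for real tokenizer output, and the function name `chunk_with_overlap` is hypothetical. A 64-token overlap on 512-token chunks is 12.5%, inside the suggested range.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into chunks that share `overlap` tokens with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

# Stand-in tokens: integers instead of real tokenizer output
tokens = list(range(1200))
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=64)
print([(c[0], c[-1], len(c)) for c in chunks])
```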

Measuring RAG Improvements with Ragas

After adding advanced retrieval techniques, you need to measure whether the pipeline actually improved. The ragas library (pip install ragas) provides RAG-specific metrics that evaluate both retrieval quality and generation faithfulness. See Section 42.1 for deeper coverage of RAG evaluation.

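# pip install ragas
# Note: these metrics call an LLM judge (OpenAI by default, so OPENAI_API_KEY must be set).
# The column names below follow the older ragas schema; newer releases may expect different names.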
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
eval_data = Dataset.from_dict({
    "question": ["What is the return policy for electronics?"],
    "answer": ["Electronics can be returned within 30 days..."],
    "contexts": [["Our return policy allows 30-day returns for electronics..."]],
    "ground_truth": ["Electronics: 30-day return window with receipt."],
})
results = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results) # {'faithfulness': 0.95, 'answer_relevancy': 0.88, ...}

Code Fragment 35.2.7: Evaluating RAG pipeline quality with the Ragas library: faithfulness, answer relevancy, and context precision metrics.

Real-World Scenario

Adding Query Expansion and Reranking to an E-Commerce FAQ Bot

Who: A product engineer at an online marketplace with 15,000 FAQ entries covering returns, shipping, seller policies, and payment issues

Situation: The naive RAG system answered 68% of customer queries correctly, but struggled with ambiguous questions like "what happens if my package never arrived" (which could relate to refund policy, insurance claims, or seller disputes).

Problem: Single-vector retrieval often returned chunks from only one relevant topic, missing the other facets of multi-aspect questions. Customers received incomplete answers and escalated to human agents.

Dilemma: Retrieving more chunks (top-20 instead of top-5) improved coverage but introduced noise, causing the LLM to generate confused or contradictory responses. A cross-encoder reranker improved precision but added 200ms of latency per query.

Decision: They implemented a two-stage pipeline: (1) query expansion using an LLM to generate three alternative phrasings, (2) retrieve top-10 per variant, deduplicate, then (3) rerank the merged set with a lightweight cross-encoder (ms-marco-MiniLM-L-6-v2) to select the final top-5.

How: Query expansion ran asynchronously in parallel. The cross-encoder reranker was quantized to INT8 and served on a GPU endpoint, reducing reranking latency to 40ms for 30 candidates.

Result: Correct answer rate rose from 68% to 86%. Human escalation dropped 28%. Total end-to-end latency was 320ms, within the 500ms SLA.

Lesson: Query expansion and reranking are complementary: expansion increases recall by casting a wider net, while reranking restores precision by filtering out noise before the LLM sees the context.
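The "retrieve per variant in parallel, then deduplicate" step from this scenario might look like the following asyncio sketch. The `retrieve` coroutine, its in-memory index, and the document ids are hypothetical stand-ins for a real vector-store call; only the concurrency pattern is the point.

```python
import asyncio

# Hypothetical stand-in for a vector-store call; a real retriever would do network I/O
async def retrieve(query, top_k=10):
    await asyncio.sleep(0.01)  # simulate I/O latency
    fake_index = {
        "package never arrived": ["refund-policy", "insurance-claims"],
        "lost parcel refund": ["refund-policy", "seller-disputes"],
        "item not delivered what to do": ["seller-disputes", "shipping-times"],
    }
    return fake_index.get(query, [])[:top_k]

async def retrieve_all(queries, top_k=10):
    """Run one retrieval per query variant concurrently, then deduplicate in order."""
    result_lists = await asyncio.gather(*(retrieve(q, top_k) for q in queries))
    seen, merged = set(), []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

variants = ["package never arrived", "lost parcel refund", "item not delivered what to do"]
candidates = asyncio.run(retrieve_all(variants))
print(candidates)
```

The merged candidate list is what would then be handed to the cross-encoder reranker, so total retrieval latency is roughly that of the slowest single retrieval rather than the sum of all of them.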

Research Frontier

Learned sparse retrieval (SPLADE v3, 2024) is narrowing the gap with dense retrieval while maintaining the interpretability and efficiency of sparse methods. Listwise reranking with LLMs (e.g., RankGPT) directly outputs a reranked list rather than scoring documents independently, capturing inter-document relevance signals. Multi-vector retrieval models like ColBERT v2 and ColPali use late interaction to approach cross-encoder quality at a cost much closer to bi-encoder retrieval. Research into retrieval-aware training is producing LLMs that are jointly optimized for both generating and utilizing retrieved passages, blurring the line between the retriever and the generator.

Lab: Implement Query Expansion and Cross-Encoder Reranking

Duration: ~60 minutes Advanced

Objective

Upgrade a basic RAG retrieval pipeline with LLM-powered query expansion (to improve recall) and cross-encoder reranking (to improve precision), then measure the impact of each technique.

What You'll Practice

Implementing multi-query expansion using an LLM
Using a cross-encoder reranker to rescore retrieved passages
Combining bi-encoder retrieval with cross-encoder reranking
Measuring retrieval improvements with recall@k

Setup

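Install the packages used in this lab (sentence-transformers, openai, and numpy, for example with pip install sentence-transformers openai numpy) and set the OPENAI_API_KEY environment variable so the OpenAI client can authenticate.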

Steps

Step 1: Set up the baseline retrieval system

Build a bi-encoder retrieval system as the baseline to improve.

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from openai import OpenAI
client = OpenAI()
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Python supports procedural, object-oriented, and functional programming.",
    "The GIL prevents multiple threads from executing Python bytecode simultaneously.",
    "NumPy provides efficient array operations for scientific computing in Python.",
    "Pandas provides DataFrames for data manipulation and analysis.",
    "Scikit-learn offers tools for data mining and machine learning.",
    "PyTorch and TensorFlow are the dominant deep learning frameworks.",
    "Flask and Django are popular Python web frameworks.",
    "List comprehensions provide concise list creation from iterables.",
    "Virtual environments isolate project dependencies to avoid conflicts.",
    "asyncio enables asynchronous programming for concurrent I/O.",
    "Type hints improve code readability and enable static analysis with mypy.",
    "Python 3.12 introduced performance improvements via adaptive specialization.",
]
doc_embs = bi_encoder.encode(documents)
doc_norms = np.linalg.norm(doc_embs, axis=1)
def baseline_search(query, top_k=5):
    qe = bi_encoder.encode(query)
    scores = np.dot(doc_embs, qe) / (doc_norms * np.linalg.norm(qe))
    idx = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], scores[i], i) for i in idx]
query = "What tools should I use for data science in Python?"
print("Baseline results:")
for doc, score, _ in baseline_search(query):
    print(f" [{score:.3f}] {doc[:70]}...")

Code Fragment 35.2.8: Baseline dense-only retrieval against the indexed corpus, used as the control arm for the rerank, hybrid, and multi-query experiments later in this section.

Hint

This baseline uses a single query. It may miss relevant documents if the query wording does not closely match the document text.

Step 2: Implement LLM-powered query expansion

Generate multiple reformulations of the query, then search with all of them.

def expand_query(query, n=3):
    """Generate alternative query phrasings using an LLM."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
        "content": f"Generate {n} alternative phrasings of this search query. "
        f"One per line, no numbering.\n\nQuery: {query}"}],
        temperature=0.7, max_tokens=200)
    expansions = [q.strip() for q in
        resp.choices[0].message.content.strip().split("\n") if q.strip()]
    return [query] + expansions[:n]
def expanded_search(query, top_k=5):
    """Search with all expanded queries, deduplicate by best score."""
    queries = expand_query(query)
    print(f" Expanded: {queries}")
    # Search with each query and keep the best score per document index
    best = {}
    for q in queries:
        for doc, score, idx in baseline_search(q, top_k=top_k):
            if idx not in best or score > best[idx][1]:
                best[idx] = (doc, score, idx)
    # Rank the deduplicated candidates only after every query has been searched
    ranked = sorted(best.values(), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
print("\nExpanded results:")
for doc, score, _ in expanded_search(query):
    print(f" [{score:.3f}] {doc[:70]}...")

Output:
 Expanded: ['What tools should I use for data science in Python?', 'Best Python libraries for data analysis and machine learning', 'Python packages for scientific computing and data processing', 'Which Python frameworks are recommended for data science work?']

Expanded results:
 [0.634] Scikit-learn offers tools for data mining and machine learning.
 [0.612] NumPy provides efficient array operations for scientific computing in P...
 [0.589] Pandas provides DataFrames for data manipulation and analysis.
 [0.521] PyTorch and TensorFlow are the dominant deep learning frameworks.
 [0.412] Python supports procedural, object-oriented, and functional programming.

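Code Fragment 35.2.9: expand_query asks the LLM for alternative phrasings of the query; expanded_search retrieves for each phrasing and keeps the best score per document.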

Hint

Track results in a dictionary keyed by document index. For each expanded query, update the entry only if the new score is higher than the existing one.

Step 3: Add cross-encoder reranking

Rescore the top candidates using a cross-encoder, which is more accurate than bi-encoder similarity.

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query, candidates, top_k=3):
    """Rerank candidates using cross-encoder scores."""
    pairs = [(query, doc) for doc, _, _ in candidates]
    scores = reranker.predict(pairs)
    reranked = sorted(zip(scores, candidates), reverse=True)
    return [(doc, float(s), idx) for s, (doc, _, idx) in reranked[:top_k]]
# Compare before and after reranking
baseline = baseline_search(query, top_k=8)
print("Before reranking (top 5):")
for doc, score, _ in baseline[:5]:
    print(f" [{score:.3f}] {doc[:70]}...")
reranked = rerank(query, baseline, top_k=5)
print("\nAfter reranking (top 5):")
for doc, score, _ in reranked:
    print(f" [{score:.3f}] {doc[:70]}...")

Output:
Before reranking (top 5):
 [0.612] NumPy provides efficient array operations for scientific computing in P...
 [0.589] Pandas provides DataFrames for data manipulation and analysis.
 [0.574] Scikit-learn offers tools for data mining and machine learning.
 [0.486] PyTorch and TensorFlow are the dominant deep learning frameworks.
 [0.412] Python supports procedural, object-oriented, and functional programming.

After reranking (top 5):
 [0.987] Scikit-learn offers tools for data mining and machine learning.
 [0.981] Pandas provides DataFrames for data manipulation and analysis.
 [0.972] NumPy provides efficient array operations for scientific computing in P...
 [0.834] PyTorch and TensorFlow are the dominant deep learning frameworks.
 [0.312] Python supports procedural, object-oriented, and functional programming.

Code Fragment 35.2.10: A cross-encoder rerank step layered on top of the dense baseline: candidates are re-scored against the query with the more accurate but slower model, then the top-k by reranker score are returned.

Hint

The cross-encoder processes (query, document) pairs jointly, which is slower but more accurate than bi-encoder cosine similarity. Use it to rescore a small set of candidates (8 to 20), not the entire corpus.

Step 4: Measure the improvement

Compare all three approaches on queries with known relevant documents.

eval_queries = {
    "Data science tools in Python": [2, 3, 4],
    "How to speed up Python": [1, 9, 11],
    "Web development in Python": [6],
    "Managing Python dependencies": [8],
}
def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)
for query, relevant in eval_queries.items():
    b = baseline_search(query, 5)
    b_idx = [i for _, _, i in b]
    e = expanded_search(query, 8)
    e_idx = [i for _, _, i in e[:5]]
    r = rerank(query, e, 5)
    r_idx = [i for _, _, i in r]
    print(f"\nQuery: {query}")
    print(f" Baseline R@5: {recall_at_k(b_idx, relevant, 5):.2f}")
    print(f" Expanded R@5: {recall_at_k(e_idx, relevant, 5):.2f}")
    print(f" Reranked R@5: {recall_at_k(r_idx, relevant, 5):.2f}")

Output:

Query: Data science tools in Python
 Baseline R@5: 0.67
 Expanded R@5: 1.00
 Reranked R@5: 1.00

Query: How to speed up Python
 Baseline R@5: 0.33
 Expanded R@5: 0.67
 Reranked R@5: 0.67

Query: Web development in Python
 Baseline R@5: 1.00
 Expanded R@5: 1.00
 Reranked R@5: 1.00

Query: Managing Python dependencies
 Baseline R@5: 1.00
 Expanded R@5: 1.00
 Reranked R@5: 1.00

Code Fragment 35.2.11: Computes recall@k by counting how many gold-labelled passages appear in the top-k retrieved hits. This is the standard metric for measuring whether the relevant document is reachable at all, separately from whether it ranks first.

Hint

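Recall@5 measures the fraction of relevant documents found in the top 5. In the sample run above, mean recall@5 across the four queries is 0.75 for the baseline and about 0.92 for both the expanded and the expanded+reranked pipelines. Your numbers will vary with the expansions the LLM generates.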

Expected Output

Query expansion generating 3 to 4 meaningful reformulations per query
Cross-encoder reranking reshuffling results to prioritize relevant documents
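Recall@5 improving from the baseline to the expanded pipeline (about 0.75 to 0.92 on average in the sample run), with reranking keeping recall at least as high while reordering the top results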

Stretch Goals

Implement HyDE: have the LLM generate an ideal answer, embed that instead of the query
Add reciprocal rank fusion (RRF) to merge results from multiple expanded queries
Benchmark the latency overhead of expansion and reranking vs. the quality gain
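For the latency stretch goal, a small timing helper is usually enough. The sketch below is one possible shape, with a hypothetical `time_call` helper and a fake pipeline stage standing in for real retrieval; it reports the median over several runs to dampen warm-up effects.

```python
import statistics
import time

def time_call(fn, *args, repeats=5, **kwargs):
    """Return (median latency in ms, last result) for fn(*args, **kwargs)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(durations), result

# Hypothetical stand-in for a pipeline stage
def fake_stage(query):
    time.sleep(0.002)  # pretend to do 2 ms of work
    return [query.upper()]

median_ms, output = time_call(fake_stage, "managing python dependencies", repeats=3)
print(f"median latency: {median_ms:.1f} ms, output: {output}")
```

In the lab, the same helper could wrap baseline_search, expanded_search, and rerank in turn, so each stage's latency can be set against its recall@5.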

Complete Solution

from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI
import numpy as np
client = OpenAI()
bi = SentenceTransformer("all-MiniLM-L6-v2")
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
docs = [
    "Python supports procedural, OOP, and functional programming.",
    "The GIL prevents multi-threaded Python bytecode execution.",
    "NumPy provides efficient array operations for scientific computing.",
    "Pandas provides DataFrames for data manipulation.",
    "Scikit-learn offers ML and data mining tools.",
    "PyTorch and TensorFlow are deep learning frameworks.",
    "Flask and Django are Python web frameworks.",
    "List comprehensions create lists concisely.",
    "Virtual environments isolate dependencies.",
    "asyncio enables async programming for concurrent I/O.",
    "Type hints improve readability and enable mypy.",
    "Python 3.12 has performance improvements.",
]
embs = bi.encode(docs)
norms = np.linalg.norm(embs, axis=1)
def search(q, k=5):
    qe = bi.encode(q)
    s = np.dot(embs, qe)/(norms*np.linalg.norm(qe))
    idx = np.argsort(s)[::-1][:k]
    return [(docs[i],s[i],i) for i in idx]
def expand(q, n=3):
    r = client.chat.completions.create(model="gpt-4o-mini",
        messages=[{"role":"user","content":f"Generate {n} alternative phrasings:\n{q}"}],
        temperature=0.7, max_tokens=200)
    return [q]+[l.strip() for l in r.choices[0].message.content.strip().split("\n") if l.strip()][:n]
def expanded_search(q, k=5):
    best = {}
    for eq in expand(q):
        for d,s,i in search(eq, k):
            if i not in best or s>best[i][1]: best[i]=(d,s,i)
    return sorted(best.values(), key=lambda x:x[1], reverse=True)[:k]
def rerank(q, cands, k=3):
    pairs = [(q,d) for d,_,_ in cands]
    scores = ce.predict(pairs)
    return [(d,float(s),i) for s,(d,_,i) in sorted(zip(scores,cands), reverse=True)[:k]]
def recall(ret, rel, k):
    return len(set(ret[:k])&set(rel))/len(rel)
for q,rel in {"Data science tools":[2,3,4],"Speed up Python":[1,9,11],
    "Web dev in Python":[6],"Managing dependencies":[8]}.items():
    b=[i for _,_,i in search(q,5)]
    cands=expanded_search(q,8)  # expand once so both metrics score the same candidates
    e=[i for _,_,i in cands[:5]]
    r=[i for _,_,i in rerank(q, cands, 5)]
    print(f"{q}: base={recall(b,rel,5):.2f} exp={recall(e,rel,5):.2f} rr={recall(r,rel,5):.2f}")

Output:
Data science tools: base=0.67 exp=1.00 rr=1.00
Speed up Python: base=0.33 exp=0.67 rr=0.67
Web dev in Python: base=1.00 exp=1.00 rr=1.00
Managing dependencies: base=1.00 exp=1.00 rr=1.00

Code Fragment 35.2.12a: Complete lab solution: dense search, LLM query expansion, cross-encoder reranking, and the recall@5 comparison.

Library Shortcut: FAISS for the search-and-expand kernel

Both search and the deduplication loop inside expanded_search are dominated by repeated cosine top-k. A persistent FAISS index trims that to a single C++ call per expansion.


import faiss

embs32 = embs.astype("float32"); faiss.normalize_L2(embs32)
index = faiss.IndexFlatIP(embs32.shape[1]); index.add(embs32)

def search(q, k=5):
    qe = bi.encode([q]).astype("float32"); faiss.normalize_L2(qe)
    s, i = index.search(qe, k)
    return [(docs[j], float(s[0, p]), int(j)) for p, j in enumerate(i[0])]

Code Fragment 35.2.7b: A reusable search() helper over a normalized FAISS IndexFlatIP: it L2-normalizes the corpus once, then encodes and normalizes each query so inner product equals cosine similarity, returning (document, score, id) triples.

Library Shortcut: FAISS IndexFlatIP for cosine top-k

A normalized FAISS index gives exact cosine search at C++ speed. Build it once at index time, then every query is a single index.search call.


import faiss, numpy as np

doc_embs = doc_embs.astype("float32"); faiss.normalize_L2(doc_embs)
index = faiss.IndexFlatIP(doc_embs.shape[1]); index.add(doc_embs)

def baseline_search(query, top_k=5):
    qe = bi_encoder.encode([query]).astype("float32"); faiss.normalize_L2(qe)
    scores, idx = index.search(qe, top_k)
    return [(documents[i], scores[0, j], int(i)) for j, i in enumerate(idx[0])]

Code Fragment 35.2.6d: Minimal working example using FAISS IndexFlatIP.

Key Takeaways

Query transformation bridges the vocabulary gap: HyDE, multi-query, and step-back prompting each address different causes of retrieval failure by rewriting the query before it reaches the index.
Hybrid retrieval outperforms single-method retrieval in most scenarios: Combining dense and sparse (BM25) retrieval with Reciprocal Rank Fusion consistently outperforms either method alone in technical domains. The exception is when your corpus consists entirely of short, homogeneous documents where dense retrieval alone may suffice.
Re-ranking is high-impact and low-effort: Adding a cross-encoder re-ranker on top of initial retrieval is one of the highest-ROI improvements you can make to a RAG pipeline.
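Contextual retrieval makes chunks self-contained: Prepending LLM-generated context to chunks at ingestion time reduced top-20 retrieval failures by 35% in Anthropic's reported experiments, by 49% when combined with contextual BM25, and by 67% when a reranker was added.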
Self-corrective RAG prevents blind trust: CRAG and Self-RAG evaluate retrieval quality and generation faithfulness, triggering corrective actions when problems are detected.

Self-Check

Q1: How does HyDE improve retrieval compared to directly embedding the user query?


HyDE generates a hypothetical answer to the query using an LLM, then embeds this hypothetical document for retrieval instead of the raw query. The hypothetical answer is longer and more semantically similar to real documents than the short query, bridging the vocabulary and length gap between queries and documents. Even if the hypothetical answer is factually wrong, it uses the same style, terminology, and structure as real documents in the index.

Q2: Why does hybrid retrieval (dense + BM25) outperform either method alone?


Dense retrieval captures semantic similarity (paraphrases, synonyms, conceptual matches) but can miss exact keyword matches. BM25 captures exact term matches and handles rare terms well but misses semantic relationships. By combining both with Reciprocal Rank Fusion (RRF), hybrid retrieval gets the best of both worlds: documents that are semantically relevant and those that contain exact query terms both contribute to the final ranking.

Q3: Why are cross-encoders more accurate than bi-encoders for relevance scoring?


Cross-encoders encode the query and document as a single concatenated input, enabling full token-level attention between query and document tokens. This allows fine-grained interaction and comparison. Bi-encoders encode query and document independently, computing similarity only through a simple dot product or cosine of the final vectors. The trade-off is that cross-encoders are too slow for searching millions of documents, so they are used only for re-ranking a small candidate set (typically 20 to 100 documents).

Q4: How does CRAG differ from standard RAG in handling retrieval failures?


Standard RAG blindly trusts whatever documents are retrieved and generates from them regardless of quality. CRAG adds a retrieval evaluator that grades the retrieved documents and routes the query into one of three branches: correct, ambiguous, or incorrect. If the retrieval is correct, the relevant portions are kept and generation proceeds. If it is incorrect, the retrieved documents are discarded and the system falls back to web search. If it is ambiguous, the system combines the useful portions of the retrieved documents with web search results. This three-way branching prevents the model from generating answers grounded in irrelevant or misleading context.

Q5: What problem does contextual retrieval solve, and what is the cost?


Contextual retrieval solves the problem of context-stripped chunks that lose their meaning when isolated from surrounding text. It prepends each chunk with an LLM-generated contextual summary (50 to 100 tokens) describing the document, section, and the chunk's role. This makes chunks self-contained for embedding and retrieval. The cost is an additional LLM call per chunk during ingestion (not at query time), which can be significant for large corpora but is a one-time expense.
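The ingestion-time step can be sketched as a small function that prepends a generated context to each chunk before embedding. In this illustration the LLM call is replaced by an injected `summarize` callable so the code runs offline; `contextualize_chunks`, `stub_summarize`, and the sample texts are all hypothetical names and data.

```python
def contextualize_chunks(document_title, chunks, summarize):
    """Prepend a short situating context to each chunk before embedding.

    summarize: callable(document_title, chunk) -> str. In a real pipeline this
    would be an LLM call (one per chunk); here it is injected as a stub.
    """
    contextualized = []
    for chunk in chunks:
        context = summarize(document_title, chunk).strip()
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized

# Hypothetical stub standing in for the LLM-generated context
def stub_summarize(title, chunk):
    topic = chunk.split()[0].lower()
    return f"From '{title}': this passage covers {topic} details."

raw_chunks = ["Refunds are issued within 5 business days.", "Shipping is free over $50."]
ready_to_embed = contextualize_chunks("Store Policy Handbook", raw_chunks, stub_summarize)
for text in ready_to_embed:
    print(text, end="\n---\n")
```

The strings in `ready_to_embed` are what would be embedded and indexed; the original chunk text is preserved at the end of each string, so it can still be shown to the generator unchanged.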

Exercises

Exercise 18.3.1: Query transformation Conceptual

A user asks "Why is my app slow?" The RAG system retrieves nothing relevant from the performance optimization docs. Describe two query transformation strategies that could fix this.


(a) Query expansion: generate synonyms like "performance issues," "latency problems," "slow response time." (b) HyDE: generate a hypothetical answer about app performance optimization, then embed that answer for retrieval.

Exercise 18.4.1: HyDE intuition Conceptual

Explain why generating a hypothetical answer and embedding that (HyDE) can outperform directly embedding the question. When might HyDE backfire?


Questions and answers occupy different regions of embedding space. The hypothetical answer is closer to actual answer passages than the question is. HyDE backfires when the LLM generates an incorrect hypothetical answer, leading retrieval away from the correct documents.
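A toy example makes the intuition visible. The sketch below assumes a bag-of-words "embedding" over a five-word vocabulary and a canned hypothetical answer; `toy_embed`, `toy_generate`, and the documents are made-up stand-ins for a real embedding model and LLM. The raw question shares no vocabulary with the documents, while the hypothetical answer does.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hyde_search(question, generate, embed, doc_vecs, top_k=2):
    """Embed an LLM-written hypothetical answer instead of the question itself."""
    hypothetical = generate(question)  # may be factually wrong; only its wording matters
    query_vec = embed(hypothetical)
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return ranked[:top_k]

# --- Hypothetical stubs so the sketch runs offline ---
VOCAB = ["slow", "latency", "profiling", "cache", "invoice"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def toy_generate(question):
    return "reduce latency with profiling and a cache"

docs = ["profiling finds latency hotspots", "add a cache layer", "invoice templates"]
doc_vecs = [toy_embed(d) for d in docs]
question = "why is my app slow"

direct_scores = [cosine(toy_embed(question), v) for v in doc_vecs]
hyde_hits = hyde_search(question, toy_generate, toy_embed, doc_vecs)
print("direct scores:", direct_scores)
print("HyDE top-2:", [docs[i] for i in hyde_hits])
```

The same mechanism explains the failure mode: if `toy_generate` returned text about invoices, retrieval would follow it to the wrong document.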

Exercise 18.4.2: Reranking cost Conceptual

A cross-encoder reranker processes 50 candidate documents per query in 200ms. If your system handles 100 QPS, what is the total compute cost of reranking? How would you reduce it?


100 QPS multiplied by 200ms = 20 seconds of GPU compute per second, requiring at least 20 GPUs. Reduce cost by: (a) using a distilled cross-encoder, (b) reducing candidates to top-20, (c) caching frequent queries, (d) using a lighter reranker like ColBERT.
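The arithmetic can be checked with a short calculation. The helper below is a back-of-the-envelope sketch under two assumptions: each GPU processes one query at a time unless `concurrent_per_gpu` says otherwise, and reranking latency scales linearly with the number of candidates.

```python
import math

def gpus_needed(qps, latency_ms, concurrent_per_gpu=1):
    """GPU-seconds of work arriving per second, divided by what one GPU absorbs."""
    gpu_seconds_per_second = qps * latency_ms / 1000
    return math.ceil(gpu_seconds_per_second / concurrent_per_gpu)

full = gpus_needed(qps=100, latency_ms=200)             # 50 candidates per query
top20_latency_ms = 200 * 20 // 50                       # linear-scaling assumption
reduced = gpus_needed(qps=100, latency_ms=top20_latency_ms)
print(full, top20_latency_ms, reduced)
```

Under these assumptions, cutting the candidate set from 50 to 20 drops the requirement from 20 GPUs to 8; batching several queries per GPU would lower it further.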

Exercise 18.4.3: Self-corrective RAG Conceptual

Explain the three branches of CRAG (Corrective RAG). What triggers each branch, and what action does the system take?


Correct: retrieved documents are relevant, proceed to generation. Incorrect: documents are irrelevant, perform web search or query transformation to find better sources. Ambiguous: documents are partially relevant, extract useful portions and supplement with additional retrieval.
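The three branches can be sketched as a routing function. This is an illustration rather than the paper's implementation: the thresholds are arbitrary, and `overlap_grade` and `stub_web_search` are hypothetical stubs standing in for a trained retrieval evaluator and a real search call.

```python
def crag_route(question, docs, grade, web_search, upper=0.7, lower=0.3):
    """Pick a CRAG branch from per-document relevance scores in [0, 1].

    grade: callable(question, doc) -> float
    web_search: callable(question) -> list of str
    upper/lower are illustrative thresholds (assumptions).
    """
    scores = [grade(question, d) for d in docs]
    if scores and max(scores) >= upper:
        # Correct: keep the documents judged relevant and generate from them
        return "correct", [d for d, s in zip(docs, scores) if s >= upper]
    if not scores or max(scores) < lower:
        # Incorrect: discard retrieval and fall back to external search
        return "incorrect", web_search(question)
    # Ambiguous: keep partially relevant documents and supplement them
    kept = [d for d, s in zip(docs, scores) if s >= lower]
    return "ambiguous", kept + web_search(question)

# --- Hypothetical stubs so the sketch runs offline ---
def overlap_grade(question, doc):
    q_words = set(question.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words)

def stub_web_search(question):
    return [f"[web] {question}"]

q = "reset account password"
r_correct = crag_route(q, ["how to reset your account password", "billing faq"],
                       overlap_grade, stub_web_search)
r_ambiguous = crag_route(q, ["change password settings"], overlap_grade, stub_web_search)
r_incorrect = crag_route(q, ["billing faq"], overlap_grade, stub_web_search)
print(r_correct, r_ambiguous, r_incorrect, sep="\n")
```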

Exercise 18.4.4: Dense vs. sparse Conceptual

You search a medical knowledge base for "treatment for type 2 diabetes." Dense search returns general diabetes articles, while BM25 returns articles that mention "type 2" specifically. Explain why hybrid retrieval would outperform either alone in this case.


Dense search captures semantic similarity (diabetes treatment concepts) but may miss the "type 2" specificity. BM25 matches "type 2" exactly but may miss semantically related treatments described differently. Hybrid combines both signals, retrieving documents that are both semantically relevant and contain the precise terms.

Exercise 18.4.5: Query expansion Coding

Implement multi-query retrieval: given a user question, use an LLM to generate 3 reformulations, retrieve for each, and deduplicate results. Compare recall against single-query retrieval.

Exercise 18.3.7: Cross-encoder reranking Coding

Add a cross-encoder reranker (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) to your RAG pipeline. Retrieve top-50 with bi-encoder, rerank to top-5 with cross-encoder. Measure improvement in answer quality.

Exercise 18.3.8: Contextual retrieval Coding

Implement Anthropic's contextual retrieval pattern: for each chunk, use an LLM to generate a brief context summary, prepend it to the chunk text, then re-embed. Compare retrieval quality before and after.

Exercise 18.3.9: CRAG loop Coding

Implement a simplified self-corrective RAG loop: after retrieval, use an LLM to evaluate whether the retrieved documents are relevant. If not, transform the query and retry (up to 3 attempts).

What Comes Next

The next section, Section 35.3: RAG with Knowledge Graphs, combines structured and unstructured retrieval for richer context.

Further Reading

Nogueira, R. & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv preprint. Pioneering work on using BERT cross-encoders for passage re-ranking. Demonstrates significant improvements over traditional retrieval methods. Foundational reading for anyone implementing re-ranking in RAG.

Ma, X. et al. (2023). "Query Rewriting for Retrieval-Augmented Large Language Models." EMNLP 2023. Shows how LLMs can rewrite user queries to improve retrieval quality. A practical technique that consistently boosts RAG performance. Recommended for teams optimizing retrieval pipelines.

Asai, A. et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." arXiv preprint. Introduces a self-reflective RAG approach where the model decides when to retrieve and critiques its own outputs. A key advance in adaptive retrieval strategies. Essential for researchers exploring autonomous RAG.

Yan, S. et al. (2024). "Corrective Retrieval Augmented Generation." arXiv preprint. Proposes corrective mechanisms that detect and fix retrieval errors before generation. Addresses a critical failure mode in production RAG systems. Valuable for engineers building robust pipelines.

Glass, M. et al. (2022). "Re2G: Retrieve, Rerank, Generate." NAACL 2022. Presents a unified framework combining retrieval, re-ranking, and generation stages. Demonstrates the value of multi-stage processing. Useful reference for designing advanced RAG architectures.

Cross-Encoder Models on Hugging Face. A collection of pretrained cross-encoder models for semantic similarity and re-ranking. Provides ready-to-use models that integrate easily into RAG systems. Ideal for practitioners who want to add re-ranking quickly.

Zhang, T. et al. (2024). "RAFT: Adapting Language Model to Domain Specific RAG." arXiv preprint. Introduces distractor-aware fine-tuning with chain-of-thought answers used in Section 35.2.3.3. Combine with the LlamaIndex RAFT dataset cookbook for an end-to-end training pipeline.

Carbonell, J. & Goldstein, J. (1998). "The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries." SIGIR 1998. The original Maximal Marginal Relevance paper behind Section 35.2.1.5 and the LangChain search_type="mmr" switch.

Raudaschl, A. (2023). "Forget RAG, the Future is RAG-Fusion." Coined the RAG-Fusion name for the multi-query + RRF pattern in Section 35.2.1.6. Includes the original LangChain reference implementation.

Izacard, G. & Grave, E. (2021). "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering" (Fusion-in-Decoder). EACL 2021. The encoder-decoder architecture behind Section 35.2.1.7 that cross-attends over per-passage encodings instead of concatenating all passages into a single prompt.
