Fusion Retrieval, Multi-Modal RAG & Comparison

Section 35.2a

Dense retrieval found the meaning. Sparse retrieval found the keyword. Fusion retrieval found both, then bickered about which one mattered. The reranker had to step in.

RagRag, Reluctantly Hybrid AI Agent
Big Picture

This section continues Section 35.2, which covered query transformation, HyDE, contextual retrieval, and self-corrective RAG (CRAG, Self-RAG, RAFT). Here we add the remaining advanced-RAG families for production LLM systems: fusion retrieval (RAG-Fusion, query diversification), multi-modal RAG (text plus images, tables, and charts), and a comparison table to help you choose which technique pays off for which agent or chatbot workload.

Prerequisites

This section continues from Section 35.2, which introduced the building blocks of advanced RAG: query rewriting, multi-hop retrieval, and rerankers. Familiarity with dense and sparse retrieval (Chapter 31), the basic RAG pipeline (Chapter 32), and the comparison-table conventions used throughout Part 7 is assumed.

Fun Fact: The BM25 Comeback Tour

Around 2022, the consensus was clear: dense retrieval had defeated BM25 forever. Then someone tried hybrid retrieval, fused dense and sparse scores, and discovered the 1994-vintage algorithm still pulled its weight on out-of-distribution queries. BM25 is now a permanent fixture in production RAG stacks, often quietly outperforming the embedding model it was supposed to replace. Few algorithms have been declared obsolete and then reinstated as senior staff quite so smoothly.

Two retrieval branches (dense and sparse) merging into RRF fusion, then feeding a reranker
Fusion retrieval combines dense (vector) and sparse (BM25) results via reciprocal rank fusion (RRF) before passing the merged list to a reranker.

35.2.4 Fusion Retrieval and Multi-Modal RAG

Fusion retrieval goes beyond combining dense and sparse signals. RAG-Fusion (Raudaschl, 2023) uses an LLM to generate multiple search queries from the original question, retrieves results for each, and applies RRF across all result sets. A document that ranks well for several query variants accumulates a higher fused score, so this approach captures diverse perspectives on the query and is particularly effective for complex, multi-faceted questions.
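The multi-query plus RRF step can be sketched as follows. This is a minimal illustration, not Raudaschl's reference implementation: the function name reciprocal_rank_fusion, the toy document IDs, and the constant k=60 (a commonly used default) are assumptions made for the example. Each document receives the sum of 1/(k + rank) over every result list in which it appears.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first lists of document IDs with reciprocal rank fusion.

    k is a smoothing constant; 60 is a common default and an assumption here.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc_id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical result lists, one per generated query variant.
results_per_query = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_e", "doc_a"],
]
fused = reciprocal_rank_fusion(results_per_query)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Because RRF uses only ranks, it can merge lists whose raw scores are not comparable (for example BM25 scores and cosine similarities) without any score normalization.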

35.2.4.1 Multi-Modal RAG

Multi-modal RAG extends retrieval beyond text to include images, tables, charts, and diagrams. This is essential for domains where critical information is encoded visually, such as scientific papers (figures and plots), financial reports (tables and charts), or technical documentation (architecture diagrams). Vision-language models like GPT-4o and Claude can process both retrieved text and images in their context window.

Warning

Multi-modal RAG introduces several unique challenges: (1) embedding images and text into a shared vector space is still an active research area, with models like CLIP providing only coarse alignment; (2) table extraction from PDFs is error-prone, often requiring specialized tools; (3) the token cost of including images in the context is high (a single image may consume 500+ tokens); and (4) evaluation is more complex because both visual and textual relevance must be assessed.

35.2.5 Comparison of Advanced RAG Techniques

Table 35.2.1b: Advanced RAG techniques at a glance (as of 2026).
Technique | What It Fixes | Latency Cost | Best For
HyDE | Query-document vocabulary gap | +1 LLM call | Technical/domain queries
Multi-Query | Single-perspective retrieval | +1 LLM call, N retrievals | Ambiguous or broad queries
Step-Back | Missing background context | +1 LLM call, 2 retrievals | Specific factual questions
BM25 Hybrid | Missed keyword matches | Minimal (BM25 is fast) | Technical, legal, medical
Cross-Encoder Rerank | Imprecise initial ranking | +N model inferences | High-precision applications
Contextual Retrieval | Context-stripped chunks | Ingestion-time LLM cost | Large document corpora
CRAG / Self-RAG | Blind trust in bad retrieval | +1 to 3 LLM calls | Safety-critical applications
HyPE | Query/document linguistic mismatch | Ingestion-time LLM cost | Stable, high-traffic FAQs
MMR | Redundant top-k chunks | Negligible (one-line switch) | Multi-faceted queries with duplicate sources
RAG-Fusion | Single-query brittleness | +1 LLM call + N retrievals + RRF | Open-ended exploratory questions
Fusion-in-Decoder | Long-tail QA across many passages | N encoder passes per query | Open-domain QA with encoder-decoder models
RAFT | Generator trusts distractors | One-time fine-tune; zero at query time | Narrow-domain QA with stable corpus
CAG | RAG infrastructure overhead | One-time prefill; cached afterwards | Small, stable, high-traffic corpora
Tip: Include Chunk Overlap

When splitting documents, use 10 to 20% overlap between adjacent chunks. This prevents losing context at chunk boundaries. For a 512-token chunk, a 50 to 100 token overlap is a good starting point.
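A minimal sketch of overlapping chunking, assuming the document has already been tokenized into a list. The function name chunk_with_overlap is hypothetical, and the synthetic token list stands in for real tokenizer output. A 64-token overlap on a 512-token chunk is 12.5%, inside the suggested range.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

# Synthetic stand-in for tokenizer output.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=64)
print([len(c) for c in chunks])  # [512, 512, 304]
```

The last 64 tokens of each chunk reappear as the first 64 tokens of the next, so a sentence that straddles a boundary is fully present in at least one chunk as long as it is shorter than the overlap.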

Measuring RAG Improvements with Ragas

After adding advanced retrieval techniques, you need to measure whether the pipeline actually improved. The ragas library (pip install ragas) provides RAG-specific metrics that evaluate both retrieval quality and generation faithfulness. See Section 42.1 for deeper coverage of RAG evaluation.

# pip install ragas
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
eval_data = Dataset.from_dict({
    "question": ["What is the return policy for electronics?"],
    "answer": ["Electronics can be returned within 30 days..."],
    "contexts": [["Our return policy allows 30-day returns for electronics..."]],
    "ground_truth": ["Electronics: 30-day return window with receipt."],
})
results = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results) # {'faithfulness': 0.95, 'answer_relevancy': 0.88, ...}
Code Fragment 35.2.7: Evaluating RAG pipeline quality with the Ragas library: faithfulness, answer relevancy, and context precision metrics.
Real-World Scenario
Adding Query Expansion and Reranking to an E-Commerce FAQ Bot

Who: A product engineer at an online marketplace with 15,000 FAQ entries covering returns, shipping, seller policies, and payment issues

Situation: The naive RAG system answered 68% of customer queries correctly, but struggled with ambiguous questions like "what happens if my package never arrived" (which could relate to refund policy, insurance claims, or seller disputes).

Problem: Single-vector retrieval often returned chunks from only one relevant topic, missing the other facets of multi-aspect questions. Customers received incomplete answers and escalated to human agents.

Dilemma: Retrieving more chunks (top-20 instead of top-5) improved coverage but introduced noise, causing the LLM to generate confused or contradictory responses. A cross-encoder reranker improved precision but added 200ms of latency per query.

Decision: They implemented a two-stage pipeline. Stage one expands the query with an LLM into three alternative phrasings, retrieves the top-10 chunks per variant, and deduplicates the results. Stage two reranks the merged set with a lightweight cross-encoder (ms-marco-MiniLM-L-6-v2) to select the final top-5.

How: Query expansion ran asynchronously in parallel. The cross-encoder reranker was quantized to INT8 and served on a GPU endpoint, reducing reranking latency to 40ms for 30 candidates.

Result: Correct answer rate rose from 68% to 86%. Human escalation dropped 28%. Total end-to-end latency was 320ms, within the 500ms SLA.

Lesson: Query expansion and reranking are complementary: expansion increases recall by casting a wider net, while reranking restores precision by filtering out noise before the LLM sees the context.
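The parallel fan-out described under "How" might look like the sketch below. It assumes an async retriever: retrieve is a hypothetical stand-in that returns canned (doc_id, score) pairs, and the query variants are hard-coded rather than generated by an LLM. The point is the shape of the pipeline: retrieve for all variants concurrently, then deduplicate by best score before handing the merged set to the reranker.

```python
import asyncio

# Hypothetical canned index: query variant -> [(doc_id, score), ...]
FAKE_INDEX = {
    "package never arrived": [("refund_policy", 0.81), ("shipping_times", 0.62)],
    "lost parcel refund": [("refund_policy", 0.85), ("insurance_claims", 0.70)],
    "seller did not ship": [("seller_disputes", 0.77), ("shipping_times", 0.66)],
}

async def retrieve(variant, top_k=10):
    """Stand-in for an async vector-store call (an assumption for this sketch)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return FAKE_INDEX.get(variant, [])[:top_k]

async def retrieve_all(variants, top_k=10):
    """Run one retrieval per variant concurrently and keep the best score per document."""
    per_variant = await asyncio.gather(*(retrieve(v, top_k) for v in variants))
    best = {}
    for hits in per_variant:
        for doc_id, score in hits:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

merged = asyncio.run(retrieve_all(list(FAKE_INDEX)))
print(merged)
```

With concurrent retrieval, the wall-clock cost of three variants is roughly that of the slowest single retrieval rather than the sum of all three, which is what keeps expansion inside a tight latency budget.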

Research Frontier

Learned sparse retrieval (SPLADE v3, 2024) is narrowing the gap with dense retrieval while maintaining the interpretability and efficiency of sparse methods. Listwise reranking with LLMs (e.g., RankGPT) directly outputs a reranked list rather than scoring documents independently, capturing inter-document relevance signals. Multi-vector retrieval models like ColBERT v2 and ColPali use late interaction to approach cross-encoder quality at close to bi-encoder speeds. Research into retrieval-aware training is producing LLMs that are jointly optimized for both generating and utilizing retrieved passages, blurring the line between the retriever and the generator.

Lab: Implement Query Expansion and Cross-Encoder Reranking
Duration: ~60 minutes Advanced

Objective

Upgrade a basic RAG retrieval pipeline with LLM-powered query expansion (to improve recall) and cross-encoder reranking (to improve precision), then measure the impact of each technique.

What You'll Practice

  • Implementing multi-query expansion using an LLM
  • Using a cross-encoder reranker to rescore retrieved passages
  • Combining bi-encoder retrieval with cross-encoder reranking
  • Measuring retrieval improvements with recall@k

Setup

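This lab needs the sentence-transformers, openai, and numpy packages (pip install sentence-transformers openai numpy). The OpenAI client used in Step 2 reads its API key from the OPENAI_API_KEY environment variable, so set that before running the code.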

Steps

Step 1: Set up the baseline retrieval system

Build a bi-encoder retrieval system as the baseline to improve.

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from openai import OpenAI
client = OpenAI()
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Python supports procedural, object-oriented, and functional programming.",
    "The GIL prevents multiple threads from executing Python bytecode simultaneously.",
    "NumPy provides efficient array operations for scientific computing in Python.",
    "Pandas provides DataFrames for data manipulation and analysis.",
    "Scikit-learn offers tools for data mining and machine learning.",
    "PyTorch and TensorFlow are the dominant deep learning frameworks.",
    "Flask and Django are popular Python web frameworks.",
    "List comprehensions provide concise list creation from iterables.",
    "Virtual environments isolate project dependencies to avoid conflicts.",
    "asyncio enables asynchronous programming for concurrent I/O.",
    "Type hints improve code readability and enable static analysis with mypy.",
    "Python 3.12 introduced performance improvements via adaptive specialization.",
    ]
doc_embs = bi_encoder.encode(documents)
doc_norms = np.linalg.norm(doc_embs, axis=1)
def baseline_search(query, top_k=5):
    qe = bi_encoder.encode(query)
    scores = np.dot(doc_embs, qe) / (doc_norms * np.linalg.norm(qe))
    idx = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], scores[i], i) for i in idx]
query = "What tools should I use for data science in Python?"
print("Baseline results:")
for doc, score, _ in baseline_search(query):
    print(f" [{score:.3f}] {doc[:70]}...")
Code Fragment 35.2.8: Baseline dense-only retrieval against the indexed corpus, used as the control arm for the query-expansion and reranking experiments later in this lab.
Hint

This baseline uses a single query. It may miss relevant documents if the query wording does not closely match the document text.

Step 2: Implement LLM-powered query expansion

Generate multiple reformulations of the query, then search with all of them.

def expand_query(query, n=3):
    """Generate alternative query phrasings using an LLM."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
        "content": f"Generate {n} alternative phrasings of this search query. "
        f"One per line, no numbering.\n\nQuery: {query}"}],
        temperature=0.7, max_tokens=200)
    expansions = [q.strip() for q in
        resp.choices[0].message.content.strip().split("\n") if q.strip()]
    return [query] + expansions[:n]
def expanded_search(query, top_k=5):
    """Search with all expanded queries, deduplicate by best score."""
    queries = expand_query(query)
    print(f" Expanded: {queries}")
    # Search with each query and keep the best score seen for each document
    best = {}
    for q in queries:
        for doc, score, idx in baseline_search(q, top_k=top_k):
            if idx not in best or score > best[idx][1]:
                best[idx] = (doc, score, idx)
    # Rank only after every query variant has been searched
    ranked = sorted(best.values(), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
# Run the search first so the expansion log prints before the results header
expanded = expanded_search(query)
print("\nExpanded results:")
for doc, score, _ in expanded:
    print(f" [{score:.3f}] {doc[:70]}...")
Output:
 Expanded: ['What tools should I use for data science in Python?', 'Best Python libraries for data analysis and machine learning', 'Python packages for scientific computing and data processing', 'Which Python frameworks are recommended for data science work?']
Expanded results:
 [0.634] Scikit-learn offers tools for data mining and machine learning.
 [0.612] NumPy provides efficient array operations for scientific computing in P...
 [0.589] Pandas provides DataFrames for data manipulation and analysis.
 [0.521] PyTorch and TensorFlow are the dominant deep learning frameworks.
 [0.412] Python supports procedural, object-oriented, and functional programming.
Code Fragment 35.2.9: Defines expand_query and expanded_search
Hint

Track results in a dictionary keyed by document index. For each expanded query, update the entry only if the new score is higher than the existing one.

Step 3: Add cross-encoder reranking

Rescore the top candidates using a cross-encoder, which is more accurate than bi-encoder similarity.

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query, candidates, top_k=3):
    """Rerank candidates using cross-encoder scores."""
    pairs = [(query, doc) for doc, _, _ in candidates]
    scores = reranker.predict(pairs)
    reranked = sorted(zip(scores, candidates), reverse=True)
    return [(doc, float(s), idx) for s, (doc, _, idx) in reranked[:top_k]]
# Compare before and after reranking
baseline = baseline_search(query, top_k=8)
print("Before reranking (top 5):")
for doc, score, _ in baseline[:5]:
    print(f" [{score:.3f}] {doc[:70]}...")
# Rerank once, outside the loop, over all 8 candidates
reranked = rerank(query, baseline, top_k=5)
print("\nAfter reranking (top 5):")
for doc, score, _ in reranked:
    print(f" [{score:.3f}] {doc[:70]}...")
Output:
Before reranking (top 5):
 [0.612] NumPy provides efficient array operations for scientific computing in P...
 [0.589] Pandas provides DataFrames for data manipulation and analysis.
 [0.574] Scikit-learn offers tools for data mining and machine learning.
 [0.486] PyTorch and TensorFlow are the dominant deep learning frameworks.
 [0.412] Python supports procedural, object-oriented, and functional programming.
After reranking (top 5):
 [0.987] Scikit-learn offers tools for data mining and machine learning.
 [0.981] Pandas provides DataFrames for data manipulation and analysis.
 [0.972] NumPy provides efficient array operations for scientific computing in P...
 [0.834] PyTorch and TensorFlow are the dominant deep learning frameworks.
 [0.312] Python supports procedural, object-oriented, and functional programming.
Code Fragment 35.2.10: A cross-encoder rerank step layered on top of the dense baseline: candidates are re-scored against the query with the more accurate but slower model, then the top-k by reranker score are returned.
Hint

The cross-encoder processes (query, document) pairs jointly, which is slower but more accurate than bi-encoder cosine similarity. Use it to rescore a small set of candidates (8 to 20), not the entire corpus.

Step 4: Measure the improvement

Compare all three approaches on queries with known relevant documents.

eval_queries = {
    "Data science tools in Python": [2, 3, 4],
    "How to speed up Python": [1, 9, 11],
    "Web development in Python": [6],
    "Managing Python dependencies": [8],
    }
def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)
for query, relevant in eval_queries.items():
    b = baseline_search(query, 5)
    b_idx = [i for _, _, i in b]
    e = expanded_search(query, 8)
    e_idx = [i for _, _, i in e[:5]]
    r = rerank(query, e, 5)
    r_idx = [i for _, _, i in r]
    print(f"\nQuery: {query}")
    print(f" Baseline R@5: {recall_at_k(b_idx, relevant, 5):.2f}")
    print(f" Expanded R@5: {recall_at_k(e_idx, relevant, 5):.2f}")
    print(f" Reranked R@5: {recall_at_k(r_idx, relevant, 5):.2f}")
Output:
Query: Data science tools in Python
 Baseline R@5: 0.67
 Expanded R@5: 1.00
 Reranked R@5: 1.00
Query: How to speed up Python
 Baseline R@5: 0.33
 Expanded R@5: 0.67
 Reranked R@5: 0.67
Query: Web development in Python
 Baseline R@5: 1.00
 Expanded R@5: 1.00
 Reranked R@5: 1.00
Query: Managing Python dependencies
 Baseline R@5: 1.00
 Expanded R@5: 1.00
 Reranked R@5: 1.00
Code Fragment 35.2.11: Computes recall@k by counting how many gold-labelled passages appear in the top-k retrieved hits. This is the standard metric for measuring whether the relevant document is reachable at all, separately from whether it ranks first.
Hint

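Recall@5 measures the fraction of relevant documents found in the top 5. Exact values depend on the reformulations the LLM generates. In the sample run above, mean recall@5 across the four queries rises from 0.75 (baseline) to 0.92 (expanded) and stays at 0.92 after reranking.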

Expected Output

  • Query expansion generating 3 to 4 meaningful reformulations per query
  • Cross-encoder reranking reshuffling results to prioritize relevant documents
  • Recall@5 at or above the baseline after expansion and reranking (the sample run averages 0.75 for the baseline and 0.92 for both expanded and reranked)

Stretch Goals

  • Implement HyDE: have the LLM generate an ideal answer, embed that instead of the query
  • Add reciprocal rank fusion (RRF) to merge results from multiple expanded queries
  • Benchmark the latency overhead of expansion and reranking vs. the quality gain
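For the third stretch goal, a small timing harness is one possible starting point. The helper name time_stage and the stand-in workload are hypothetical; in the lab you would pass baseline_search, expanded_search, or rerank instead. Timing expanded_search repeatedly also repeats the LLM call, so keep repeats small.

```python
import statistics
import time

def time_stage(fn, *args, repeats=5, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times; return (median latency in ms, last result)."""
    timings_ms = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms), result

# Stand-in workload so the sketch runs without a model or API key.
def fake_stage(n):
    return sum(i * i for i in range(n))

median_ms, value = time_stage(fake_stage, 10_000, repeats=5)
print(f"median latency: {median_ms:.3f} ms")
```

Reporting the median rather than the mean keeps a single slow call (model warm-up, a network hiccup) from dominating the comparison.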
Complete Solution
from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI
import numpy as np
client = OpenAI()
bi = SentenceTransformer("all-MiniLM-L6-v2")
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
docs = [
    "Python supports procedural, OOP, and functional programming.",
    "The GIL prevents multi-threaded Python bytecode execution.",
    "NumPy provides efficient array operations for scientific computing.",
    "Pandas provides DataFrames for data manipulation.",
    "Scikit-learn offers ML and data mining tools.",
    "PyTorch and TensorFlow are deep learning frameworks.",
    "Flask and Django are Python web frameworks.",
    "List comprehensions create lists concisely.",
    "Virtual environments isolate dependencies.",
    "asyncio enables async programming for concurrent I/O.",
    "Type hints improve readability and enable mypy.",
    "Python 3.12 has performance improvements.",
    ]
embs = bi.encode(docs)
norms = np.linalg.norm(embs, axis=1)
def search(q, k=5):
    qe = bi.encode(q)
    s = np.dot(embs, qe)/(norms*np.linalg.norm(qe))
    idx = np.argsort(s)[::-1][:k]
    return [(docs[i],s[i],i) for i in idx]
def expand(q, n=3):
    r = client.chat.completions.create(model="gpt-4o-mini",
        messages=[{"role":"user","content":f"Generate {n} alternative phrasings:\n{q}"}],
        temperature=0.7, max_tokens=200)
    return [q]+[l.strip() for l in r.choices[0].message.content.strip().split("\n") if l.strip()][:n]
def expanded_search(q, k=5):
    best = {}
    for eq in expand(q):
        for d,s,i in search(eq, k):
            if i not in best or s>best[i][1]: best[i]=(d,s,i)
    # Return only after all expansions have been searched
    return sorted(best.values(), key=lambda x:x[1], reverse=True)[:k]
def rerank(q, cands, k=3):
    pairs = [(q,d) for d,_,_ in cands]
    scores = ce.predict(pairs)
    return [(d,float(s),i) for s,(d,_,i) in sorted(zip(scores,cands), reverse=True)[:k]]
def recall(ret, rel, k):
    return len(set(ret[:k])&set(rel))/len(rel)
for q,rel in {"Data science tools":[2,3,4],"Speed up Python":[1,9,11],
    "Web dev in Python":[6],"Managing dependencies":[8]}.items():
    b=[i for _,_,i in search(q,5)]
    e=[i for _,_,i in expanded_search(q,8)[:5]]
    r=[i for _,_,i in rerank(q, expanded_search(q,8), 5)]
    print(f"{q}: base={recall(b,rel,5):.2f} exp={recall(e,rel,5):.2f} rr={recall(r,rel,5):.2f}")
Output:
Data science tools: base=0.67 exp=1.00 rr=1.00
Speed up Python: base=0.33 exp=0.67 rr=0.67
Web dev in Python: base=1.00 exp=1.00 rr=1.00
Managing dependencies: base=1.00 exp=1.00 rr=1.00
Code Fragment 35.2.12a: Defines search and expand
Library Shortcut: FAISS for the search-and-expand kernel

Both search and the deduplication loop inside expanded_search are dominated by repeated cosine top-k. A persistent FAISS index trims that to a single C++ call per expansion.

import faiss

embs32 = embs.astype("float32"); faiss.normalize_L2(embs32)
index = faiss.IndexFlatIP(embs32.shape[1]); index.add(embs32)

def search(q, k=5):
    qe = bi.encode([q]).astype("float32"); faiss.normalize_L2(qe)
    s, i = index.search(qe, k)
    return [(docs[j], float(s[0, p]), int(j)) for p, j in enumerate(i[0])]
Code Fragment 35.2.7b: A reusable search() helper over a normalized FAISS IndexFlatIP: it L2-normalizes the corpus once, then encodes and normalizes each query so inner product equals cosine similarity, returning (document, score, id) triples.
Library Shortcut: FAISS IndexFlatIP for cosine top-k

A normalized FAISS index gives exact cosine search at C++ speed. Build it once at index time, then every query is a single index.search call.

import faiss, numpy as np

doc_embs = doc_embs.astype("float32"); faiss.normalize_L2(doc_embs)
index = faiss.IndexFlatIP(doc_embs.shape[1]); index.add(doc_embs)

def baseline_search(query, top_k=5):
    qe = bi_encoder.encode([query]).astype("float32"); faiss.normalize_L2(qe)
    scores, idx = index.search(qe, top_k)
    return [(documents[i], scores[0, j], int(i)) for j, i in enumerate(idx[0])]
Code Fragment 35.2.6d: Minimal working example using FAISS IndexFlatIP.
Key Takeaways
Self-Check
Q1: How does HyDE improve retrieval compared to directly embedding the user query?
Show Answer
HyDE generates a hypothetical answer to the query using an LLM, then embeds this hypothetical document for retrieval instead of the raw query. The hypothetical answer is longer and more semantically similar to real documents than the short query, bridging the vocabulary and length gap between queries and documents. Even if the hypothetical answer is factually wrong, it uses the same style, terminology, and structure as real documents in the index.
Q2: Why does hybrid retrieval (dense + BM25) outperform either method alone?
Show Answer
Dense retrieval captures semantic similarity (paraphrases, synonyms, conceptual matches) but can miss exact keyword matches. BM25 captures exact term matches and handles rare terms well but misses semantic relationships. By combining both with Reciprocal Rank Fusion (RRF), hybrid retrieval gets the best of both worlds: documents that are semantically relevant and those that contain exact query terms both contribute to the final ranking.
Q3: Why are cross-encoders more accurate than bi-encoders for relevance scoring?
Show Answer
Cross-encoders encode the query and document as a single concatenated input, enabling full token-level attention between query and document tokens. This allows fine-grained interaction and comparison. Bi-encoders encode query and document independently, computing similarity only through a simple dot product or cosine of the final vectors. The trade-off is that cross-encoders are too slow for searching millions of documents, so they are used only for re-ranking a small candidate set (typically 20 to 100 documents).
Q4: How does CRAG differ from standard RAG in handling retrieval failures?
Show Answer
Standard RAG blindly trusts whatever documents are retrieved and generates from them regardless of quality. CRAG adds a lightweight retrieval evaluator that scores the retrieved documents and triggers one of three actions: correct, ambiguous, or incorrect. If the retrieval is judged correct, the documents are refined to keep their relevant portions and generation proceeds. If incorrect, the documents are discarded and the system falls back to web search. If ambiguous, the system combines the refined documents with web search results. This three-way branching prevents the model from generating answers grounded in irrelevant or misleading context.
Q5: What problem does contextual retrieval solve, and what is the cost?
Show Answer
Contextual retrieval solves the problem of context-stripped chunks that lose their meaning when isolated from surrounding text. It prepends each chunk with an LLM-generated contextual summary (50 to 100 tokens) describing the document, section, and the chunk's role. This makes chunks self-contained for embedding and retrieval. The cost is an additional LLM call per chunk during ingestion (not at query time), which can be significant for large corpora but is a one-time expense.

Exercises

Exercise 18.3.1: Query transformation Conceptual

A user asks "Why is my app slow?" The RAG system retrieves nothing relevant from the performance optimization docs. Describe two query transformation strategies that could fix this.

Show Answer

(a) Query expansion: generate synonyms like "performance issues," "latency problems," "slow response time." (b) HyDE: generate a hypothetical answer about app performance optimization, then embed that answer for retrieval.

Exercise 18.4.1: HyDE intuition Conceptual

Explain why generating a hypothetical answer and embedding that (HyDE) can outperform directly embedding the question. When might HyDE backfire?

Show Answer

Questions and answers occupy different regions of embedding space. The hypothetical answer is closer to actual answer passages than the question is. HyDE backfires when the LLM generates an incorrect hypothetical answer, leading retrieval away from the correct documents.

Exercise 18.4.2: Reranking cost Conceptual

A cross-encoder reranker processes 50 candidate documents per query in 200ms. If your system handles 100 QPS, what is the total compute cost of reranking? How would you reduce it?

Show Answer

100 QPS multiplied by 200ms = 20 seconds of GPU compute per second of wall-clock time, requiring at least 20 GPUs if each GPU reranks one query at a time. Reduce cost by: (a) using a distilled cross-encoder, (b) reducing candidates to top-20, (c) caching frequent queries, (d) using a lighter reranker like ColBERT.
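The arithmetic in this answer can be checked with a few lines of Python. The helper name gpus_needed is hypothetical, and the second call assumes reranking latency scales linearly with the number of candidates, which is a simplification.

```python
import math

def gpus_needed(qps, rerank_ms_per_query):
    """GPUs required when each GPU reranks one query at a time (no cross-query batching)."""
    gpu_ms_per_second = qps * rerank_ms_per_query  # GPU-milliseconds demanded per wall-clock second
    return math.ceil(gpu_ms_per_second / 1000)

full = gpus_needed(qps=100, rerank_ms_per_query=200)
# Assumption: latency scales linearly with candidate count, so 20 of 50 candidates costs 80 ms.
trimmed = gpus_needed(qps=100, rerank_ms_per_query=200 * 20 // 50)
print(full, trimmed)  # prints: 20 8
```

Under the linear-scaling assumption, cutting candidates from 50 to 20 reduces the requirement from 20 GPUs to 8.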

Exercise 18.4.3: Self-corrective RAG Conceptual

Explain the three branches of CRAG (Corrective RAG). What triggers each branch, and what action does the system take?

Show Answer

Correct: retrieved documents are relevant, proceed to generation. Incorrect: documents are irrelevant, perform web search or query transformation to find better sources. Ambiguous: documents are partially relevant, extract useful portions and supplement with additional retrieval.
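The branch selection can be sketched as a thresholding rule over the evaluator's per-document relevance scores. The function name crag_action and the threshold values are illustrative assumptions, not values from the paper. The rule is: correct if at least one document clears the upper threshold, incorrect if every document falls below the lower one, and ambiguous otherwise.

```python
def crag_action(relevance_scores, upper=0.6, lower=0.2):
    """Pick a CRAG-style action from per-document evaluator scores.

    The thresholds are illustrative assumptions for this sketch.
    """
    if not relevance_scores:
        return "incorrect"   # nothing retrieved: fall back to external search
    top = max(relevance_scores)
    if top > upper:
        return "correct"     # at least one document is confidently relevant
    if top < lower:
        return "incorrect"   # every document is confidently irrelevant
    return "ambiguous"       # in between: keep useful portions and supplement

print(crag_action([0.9, 0.1]))    # correct
print(crag_action([0.1, 0.05]))   # incorrect
print(crag_action([0.4, 0.3]))    # ambiguous
```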

Exercise 18.4.4: Dense vs. sparse Conceptual

You search a medical knowledge base for "treatment for type 2 diabetes." Dense search returns general diabetes articles, while BM25 returns articles that mention "type 2" specifically. Explain why hybrid retrieval would outperform either alone in this case.

Show Answer

Dense search captures semantic similarity (diabetes treatment concepts) but may miss the "type 2" specificity. BM25 matches "type 2" exactly but may miss semantically related treatments described differently. Hybrid combines both signals, retrieving documents that are both semantically relevant and contain the precise terms.

Exercise 18.4.5: Query expansion Coding

Implement multi-query retrieval: given a user question, use an LLM to generate 3 reformulations, retrieve for each, and deduplicate results. Compare recall against single-query retrieval.

Exercise 18.3.7: Cross-encoder reranking Coding

Add a cross-encoder reranker (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) to your RAG pipeline. Retrieve top-50 with bi-encoder, rerank to top-5 with cross-encoder. Measure improvement in answer quality.

Exercise 18.3.8: Contextual retrieval Coding

Implement Anthropic's contextual retrieval pattern: for each chunk, use an LLM to generate a brief context summary, prepend it to the chunk text, then re-embed. Compare retrieval quality before and after.

Exercise 18.3.9: CRAG loop Coding

Implement a simplified self-corrective RAG loop: after retrieval, use an LLM to evaluate whether the retrieved documents are relevant. If not, transform the query and retry (up to 3 attempts).

What Comes Next

The next section, Section 35.3: RAG with Knowledge Graphs, combines structured and unstructured retrieval for richer context.

Further Reading
Nogueira, R. & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv preprint. Pioneering work on using BERT cross-encoders for passage re-ranking. Demonstrates significant improvements over traditional retrieval methods. Foundational reading for anyone implementing re-ranking in RAG.
Ma, X. et al. (2023). "Query Rewriting for Retrieval-Augmented Large Language Models." EMNLP 2023. Shows how LLMs can rewrite user queries to improve retrieval quality. A practical technique that consistently boosts RAG performance. Recommended for teams optimizing retrieval pipelines.
Asai, A. et al. (2023). "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." arXiv preprint. Introduces a self-reflective RAG approach where the model decides when to retrieve and critiques its own outputs. A key advance in adaptive retrieval strategies. Essential for researchers exploring autonomous RAG.
Yan, S. et al. (2024). "Corrective Retrieval Augmented Generation." arXiv preprint. Proposes corrective mechanisms that detect and fix retrieval errors before generation. Addresses a critical failure mode in production RAG systems. Valuable for engineers building robust pipelines.
Glass, M. et al. (2022). "Re2G: Retrieve, Rerank, Generate." NAACL 2022. Presents a unified framework combining retrieval, re-ranking, and generation stages. Demonstrates the value of multi-stage processing. Useful reference for designing advanced RAG architectures.
Cross-Encoder Models on Hugging Face. A collection of pretrained cross-encoder models for semantic similarity and re-ranking. Provides ready-to-use models that integrate easily into RAG systems. Ideal for practitioners who want to add re-ranking quickly.
Zhang, T. et al. (2024). "RAFT: Adapting Language Model to Domain Specific RAG." arXiv preprint. Introduces distractor-aware fine-tuning with chain-of-thought answers used in Section 35.2.3.3. Combine with the LlamaIndex RAFT dataset cookbook for an end-to-end training pipeline.
Carbonell, J. & Goldstein, J. (1998). "The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries." SIGIR 1998. The original Maximal Marginal Relevance paper behind Section 35.2.1.5 and the LangChain search_type="mmr" switch.
Raudaschl, A. (2023). "Forget RAG, the Future is RAG-Fusion." Coined the RAG-Fusion name for the multi-query + RRF pattern in Section 35.2.1.6. Includes the original LangChain reference implementation.
Izacard, G. & Grave, E. (2021). "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering" (Fusion-in-Decoder). EACL 2021. The encoder-decoder architecture behind Section 35.2.1.7 that cross-attends over per-passage encodings instead of concatenating all passages into a single prompt.