Dense retrieval found the meaning. Sparse retrieval found the keyword. Fusion retrieval found both, then bickered about which one mattered. The reranker had to step in.
Rag, Reluctantly Hybrid AI Agent
This section continues Section 35.2, which covered query transformation, HyDE, contextual retrieval, and self-corrective RAG (CRAG, Self-RAG, RAFT). Here we add the remaining advanced-RAG families for production LLM systems: fusion retrieval (RAG-Fusion, query diversification), multi-modal RAG (text plus images, tables, and charts), and a comparison table to help you choose which technique pays off for which agent or chatbot workload.
Prerequisites
This section continues from Section 35.2, which introduced the building blocks of advanced RAG: query rewriting, multi-hop retrieval, and rerankers. Familiarity with dense and sparse retrieval (Chapter 31), the basic RAG pipeline (Chapter 32), and the comparison-table conventions used throughout Part 7 is assumed.
Around 2022, the consensus was clear: dense retrieval had defeated BM25 forever. Then someone tried hybrid retrieval, fused dense and sparse scores, and discovered the 1994-vintage algorithm still pulled its weight on out-of-distribution queries. BM25 is now a permanent fixture in production RAG stacks, often quietly outperforming the embedding model it was supposed to replace. Few algorithms have been declared obsolete and then reinstated as senior staff quite so smoothly.
35.2.4 Fusion Retrieval and Multi-Modal RAG
Fusion retrieval goes beyond combining dense and sparse signals. RAG-Fusion (Raudaschl, 2023) generates multiple search queries from the user's question, retrieves results for each, and applies Reciprocal Rank Fusion (RRF) across all result sets. Because a document that ranks well for several reformulations accumulates score from each of them, this approach captures diverse perspectives on the query and is particularly effective for complex, multi-faceted questions.
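The fusion step itself is small enough to sketch. The snippet below is a minimal illustration, assuming each per-query result set is simply a ranked list of document IDs. The function name `reciprocal_rank_fusion`, the sample IDs, and the smoothing constant `k=60` (a value commonly used with RRF) are assumptions for illustration, not details taken from this section.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in result_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical retrieval results for three LLM-generated query variants.
results_per_query = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d5", "d1"],
]
fused = reciprocal_rank_fusion(results_per_query)
print(fused)
```

Because RRF uses only ranks, it can merge lists whose raw scores are not comparable (for example BM25 scores and cosine similarities), which is one reason it is a common default for fusion.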
35.2.4.1 Multi-Modal RAG
Multi-modal RAG extends retrieval beyond text to include images, tables, charts, and diagrams. This is essential for domains where critical information is encoded visually, such as scientific papers (figures and plots), financial reports (tables and charts), or technical documentation (architecture diagrams). Vision-language models like GPT-4o and Claude can process both retrieved text and images in their context window.
Multi-modal RAG introduces several unique challenges: (1) embedding images and text into a shared vector space is still an active research area, with models like CLIP providing only coarse alignment; (2) table extraction from PDFs is error-prone, often requiring specialized tools; (3) the token cost of including images in the context is high (a single image may consume 500+ tokens); and (4) evaluation is more complex because both visual and textual relevance must be assessed.
35.2.5 Comparison of Advanced RAG Techniques
| Technique | What It Fixes | Latency Cost | Best For |
|---|---|---|---|
| HyDE | Query-document vocabulary gap | +1 LLM call | Technical/domain queries |
| Multi-Query | Single-perspective retrieval | +1 LLM call, N retrievals | Ambiguous or broad queries |
| Step-Back | Missing background context | +1 LLM call, 2 retrievals | Specific factual questions |
| BM25 Hybrid | Missed keyword matches | Minimal (BM25 is fast) | Technical, legal, medical |
| Cross-Encoder Rerank | Imprecise initial ranking | +N model inferences | High-precision applications |
| Contextual Retrieval | Context-stripped chunks | Ingestion-time LLM cost | Large document corpora |
| CRAG / Self-RAG | Blind trust in bad retrieval | +1 to 3 LLM calls | Safety-critical applications |
| HyPE | Query/document linguistic mismatch | Ingestion-time LLM cost | Stable, high-traffic FAQs |
| MMR | Redundant top-k chunks | Negligible (one-line switch) | Multi-faceted queries with duplicate sources |
| RAG-Fusion | Single-query brittleness | +1 LLM call + N retrievals + RRF | Open-ended exploratory questions |
| Fusion-in-Decoder | Long-tail QA across many passages | N encoder passes per query | Open-domain QA with encoder-decoder models |
| RAFT | Generator trusts distractors | One-time fine-tune; zero at query time | Narrow-domain QA with stable corpus |
| CAG | RAG infrastructure overhead | One-time prefill; cached afterwards | Small, stable, high-traffic corpora |
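The MMR row is easiest to understand from its selection rule. The following is a rough sketch of greedy Maximal Marginal Relevance over toy two-dimensional vectors. The helper names `cosine` and `mmr_select`, the weight `lam`, and the vectors are all illustrative assumptions; in practice a vector store's built-in MMR option would normally be used instead.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, top_k=2, lam=0.5):
    """Greedy MMR: maximize lam * sim(q, d) - (1 - lam) * max sim(d, already selected)."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < top_k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.10],   # 0: highly relevant
    [1.0, 0.12],   # 1: near-duplicate of document 0
    [0.8, -0.60],  # 2: less relevant, but covers a different direction
]
picked = mmr_select(query_vec, doc_vecs, top_k=2, lam=0.5)
print(picked)
```

With `lam=1.0` the rule reduces to plain relevance ranking and returns the two near-duplicates; lowering `lam` trades some relevance for diversity.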
When splitting documents, use 10 to 20% overlap between adjacent chunks. This prevents losing context at chunk boundaries. For a 512-token chunk, a 50 to 100 token overlap is a good starting point.
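As a concrete illustration of the overlap rule, here is a minimal sketch that windows a token list with a fixed stride. It assumes the text has already been tokenized; the function name `chunk_with_overlap` and the 64-token overlap (12.5% of 512) are illustrative choices rather than prescribed values.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = list(range(1200))  # stand-in for real token IDs
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=64)
print([len(c) for c in chunks])
```

Each chunk after the first repeats the final 64 tokens of its predecessor, so a sentence that straddles a boundary appears whole in at least one chunk.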
Measuring RAG Improvements with Ragas
After adding advanced retrieval techniques, you need to measure whether the pipeline actually improved. The ragas library (pip install ragas) provides RAG-specific metrics that evaluate both retrieval quality and generation faithfulness. See Section 42.1 for deeper coverage of RAG evaluation.
# pip install ragas
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# One row per evaluated question. These column names follow the ragas 0.1-style
# schema; newer releases may expect different names, so check your installed version.
eval_data = Dataset.from_dict({
    "question": ["What is the return policy for electronics?"],
    "answer": ["Electronics can be returned within 30 days..."],
    "contexts": [["Our return policy allows 30-day returns for electronics..."]],
    "ground_truth": ["Electronics: 30-day return window with receipt."],
})

# The metrics are LLM-judged, so evaluate() needs a configured judge model
# (by default, an OpenAI key in OPENAI_API_KEY).
results = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)  # e.g. {'faithfulness': 0.95, 'answer_relevancy': 0.88, ...}
Who: A product engineer at an online marketplace with 15,000 FAQ entries covering returns, shipping, seller policies, and payment issues
Situation: The naive RAG system answered 68% of customer queries correctly, but struggled with ambiguous questions like "what happens if my package never arrived" (which could relate to refund policy, insurance claims, or seller disputes).
Problem: Single-vector retrieval often returned chunks from only one relevant topic, missing the other facets of multi-aspect questions. Customers received incomplete answers and escalated to human agents.
Dilemma: Retrieving more chunks (top-20 instead of top-5) improved coverage but introduced noise, causing the LLM to generate confused or contradictory responses. A cross-encoder reranker improved precision but added 200ms of latency per query.
Decision: They implemented a three-step pipeline: (1) expand the query with an LLM into three alternative phrasings, (2) retrieve the top-10 per variant and deduplicate, and (3) rerank the merged set with a lightweight cross-encoder (ms-marco-MiniLM-L-6-v2) to select the final top-5.
How: Query expansion ran asynchronously in parallel. The cross-encoder reranker was quantized to INT8 and served on a GPU endpoint, reducing reranking latency to 40ms for 30 candidates.
Result: Correct answer rate rose from 68% to 86%. Human escalation dropped 28%. Total end-to-end latency was 320ms, within the 500ms SLA.
Lesson: Query expansion and reranking are complementary: expansion increases recall by casting a wider net, while reranking restores precision by filtering out noise before the LLM sees the context.
Learned sparse retrieval (SPLADE v3, 2024) is narrowing the gap with dense retrieval while maintaining the interpretability and efficiency of sparse methods. Listwise reranking with LLMs (e.g., RankGPT) directly outputs a reranked list rather than scoring documents independently, capturing inter-document relevance signals. Multi-vector retrieval models like ColBERT v2 and ColPali use late interaction to approach cross-encoder quality at a cost much closer to that of a bi-encoder. Research into retrieval-aware training is producing LLMs that are jointly optimized for both generating and utilizing retrieved passages, blurring the line between the retriever and the generator.
Objective
Upgrade a basic RAG retrieval pipeline with LLM-powered query expansion (to improve recall) and cross-encoder reranking (to improve precision), then measure the impact of each technique.
What You'll Practice
- Implementing multi-query expansion using an LLM
- Using a cross-encoder reranker to rescore retrieved passages
- Combining bi-encoder retrieval with cross-encoder reranking
- Measuring retrieval improvements with recall@k
Setup
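This lab needs the sentence-transformers, openai, and numpy packages (pip install sentence-transformers openai numpy). The query-expansion step also calls the OpenAI API, so set the OPENAI_API_KEY environment variable before you start.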
Steps
Step 1: Set up the baseline retrieval system
Build a bi-encoder retrieval system as the baseline to improve.
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Python supports procedural, object-oriented, and functional programming.",
    "The GIL prevents multiple threads from executing Python bytecode simultaneously.",
    "NumPy provides efficient array operations for scientific computing in Python.",
    "Pandas provides DataFrames for data manipulation and analysis.",
    "Scikit-learn offers tools for data mining and machine learning.",
    "PyTorch and TensorFlow are the dominant deep learning frameworks.",
    "Flask and Django are popular Python web frameworks.",
    "List comprehensions provide concise list creation from iterables.",
    "Virtual environments isolate project dependencies to avoid conflicts.",
    "asyncio enables asynchronous programming for concurrent I/O.",
    "Type hints improve code readability and enable static analysis with mypy.",
    "Python 3.12 introduced performance improvements via adaptive specialization.",
]

# Embed the corpus once and cache the norms used for cosine similarity.
doc_embs = bi_encoder.encode(documents)
doc_norms = np.linalg.norm(doc_embs, axis=1)

def baseline_search(query, top_k=5):
    """Return the top_k (document, cosine score, index) triples for a query."""
    qe = bi_encoder.encode(query)
    scores = np.dot(doc_embs, qe) / (doc_norms * np.linalg.norm(qe))
    idx = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], scores[i], int(i)) for i in idx]

query = "What tools should I use for data science in Python?"
print("Baseline results:")
for doc, score, _ in baseline_search(query):
    print(f"  [{score:.3f}] {doc[:70]}...")
Hint
This baseline uses a single query. It may miss relevant documents if the query wording does not closely match the document text.
Step 2: Implement LLM-powered query expansion
Generate multiple reformulations of the query, then search with all of them.
def expand_query(query, n=3):
    """Generate alternative query phrasings using an LLM."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Generate {n} alternative phrasings of this search query. "
                              f"One per line, no numbering.\n\nQuery: {query}"}],
        temperature=0.7, max_tokens=200)
    expansions = [q.strip() for q in
                  resp.choices[0].message.content.strip().split("\n") if q.strip()]
    # Always keep the original query alongside the reformulations.
    return [query] + expansions[:n]

def expanded_search(query, top_k=5):
    """Search with all expanded queries, deduplicate by best score."""
    queries = expand_query(query)
    print(f"  Expanded: {queries}")
    # Search with each query and keep the best score per document index.
    best = {}
    for q in queries:
        for doc, score, idx in baseline_search(q, top_k=top_k):
            if idx not in best or score > best[idx][1]:
                best[idx] = (doc, score, idx)
    ranked = sorted(best.values(), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

print("\nExpanded results:")
for doc, score, _ in expanded_search(query):
    print(f"  [{score:.3f}] {doc[:70]}...")
Hint
Track results in a dictionary keyed by document index. For each expanded query, update the entry only if the new score is higher than the existing one.
Step 3: Add cross-encoder reranking
Rescore the top candidates using a cross-encoder, which is more accurate than bi-encoder similarity.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=3):
    """Rerank candidates using cross-encoder scores."""
    pairs = [(query, doc) for doc, _, _ in candidates]
    scores = reranker.predict(pairs)
    # Sort on the score only, so ties never fall through to comparing candidates.
    reranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [(doc, float(s), idx) for s, (doc, _, idx) in reranked[:top_k]]

# Compare before and after reranking
baseline = baseline_search(query, top_k=8)
print("Before reranking (top 5):")
for doc, score, _ in baseline[:5]:
    print(f"  [{score:.3f}] {doc[:70]}...")

reranked = rerank(query, baseline, top_k=5)
print("\nAfter reranking (top 5):")
for doc, score, _ in reranked:
    print(f"  [{score:.3f}] {doc[:70]}...")
The rerank step is layered on top of the dense baseline: candidates are re-scored against the query with the more accurate but slower cross-encoder, then the top-k by reranker score are returned.
Hint
The cross-encoder processes (query, document) pairs jointly, which is slower but more accurate than bi-encoder cosine similarity. Use it to rescore a small set of candidates (8 to 20), not the entire corpus.
Step 4: Measure the improvement
Compare all three approaches on queries with known relevant documents.
eval_queries = {
    "Data science tools in Python": [2, 3, 4],
    "How to speed up Python": [1, 9, 11],
    "Web development in Python": [6],
    "Managing Python dependencies": [8],
}

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

for query, relevant in eval_queries.items():
    b = baseline_search(query, 5)
    b_idx = [i for _, _, i in b]
    e = expanded_search(query, 8)
    e_idx = [i for _, _, i in e[:5]]
    r = rerank(query, e, 5)
    r_idx = [i for _, _, i in r]
    print(f"\nQuery: {query}")
    print(f"  Baseline R@5: {recall_at_k(b_idx, relevant, 5):.2f}")
    print(f"  Expanded R@5: {recall_at_k(e_idx, relevant, 5):.2f}")
    print(f"  Reranked R@5: {recall_at_k(r_idx, relevant, 5):.2f}")
Hint
Recall@5 measures the fraction of relevant documents found in the top 5. Expect: baseline ~0.5, expanded ~0.7, expanded+reranked ~0.8.
Expected Output
- Query expansion generating 3 meaningful reformulations per query (4 search queries including the original)
- Cross-encoder reranking reshuffling results to prioritize relevant documents
- Recall@5 improving from ~0.5 (baseline) to ~0.7 (expanded) to ~0.8 (reranked)
Stretch Goals
- Implement HyDE: have the LLM generate an ideal answer, embed that instead of the query
- Add reciprocal rank fusion (RRF) to merge results from multiple expanded queries
- Benchmark the latency overhead of expansion and reranking vs. the quality gain
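For the latency stretch goal, a small timing helper is usually enough to get started. The sketch below is one possible approach, not part of the lab's reference solution: `time_stage` and `fake_stage` are hypothetical names, and `fake_stage` is only a stand-in so the snippet runs on its own. In the lab you would pass `baseline_search`, `expanded_search`, or `rerank` instead.

```python
import statistics
import time

def time_stage(fn, *args, repeats=5, **kwargs):
    """Run fn repeatedly and return (median wall-clock seconds, last result)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result

# Stand-in stage that simulates a slow call (hypothetical, for illustration only).
def fake_stage(query, delay=0.01):
    time.sleep(delay)
    return [query.upper()]

median_s, out = time_stage(fake_stage, "how to speed up python", repeats=3)
print(f"median latency: {median_s * 1000:.1f} ms, result: {out}")
```

Reporting the median rather than the mean keeps a single slow call (a cold model load, a network hiccup) from dominating the comparison.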
Complete Solution
from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI
import numpy as np

client = OpenAI()
bi = SentenceTransformer("all-MiniLM-L6-v2")
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

docs = [
    "Python supports procedural, OOP, and functional programming.",
    "The GIL prevents multi-threaded Python bytecode execution.",
    "NumPy provides efficient array operations for scientific computing.",
    "Pandas provides DataFrames for data manipulation.",
    "Scikit-learn offers ML and data mining tools.",
    "PyTorch and TensorFlow are deep learning frameworks.",
    "Flask and Django are Python web frameworks.",
    "List comprehensions create lists concisely.",
    "Virtual environments isolate dependencies.",
    "asyncio enables async programming for concurrent I/O.",
    "Type hints improve readability and enable mypy.",
    "Python 3.12 has performance improvements.",
]
embs = bi.encode(docs)
norms = np.linalg.norm(embs, axis=1)

def search(q, k=5):
    qe = bi.encode(q)
    s = np.dot(embs, qe) / (norms * np.linalg.norm(qe))
    idx = np.argsort(s)[::-1][:k]
    return [(docs[i], s[i], int(i)) for i in idx]

def expand(q, n=3):
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Generate {n} alternative phrasings of this search query. "
                   f"One per line, no numbering.\n\nQuery: {q}"}],
        temperature=0.7, max_tokens=200)
    lines = [l.strip() for l in r.choices[0].message.content.strip().split("\n") if l.strip()]
    return [q] + lines[:n]

def expanded_search(q, k=5):
    best = {}
    for eq in expand(q):
        for d, s, i in search(eq, k):
            if i not in best or s > best[i][1]:
                best[i] = (d, s, i)
    return sorted(best.values(), key=lambda x: x[1], reverse=True)[:k]

def rerank(q, cands, k=3):
    pairs = [(q, d) for d, _, _ in cands]
    scores = ce.predict(pairs)
    ranked = sorted(zip(scores, cands), key=lambda p: p[0], reverse=True)
    return [(d, float(s), i) for s, (d, _, i) in ranked[:k]]

def recall(ret, rel, k):
    return len(set(ret[:k]) & set(rel)) / len(rel)

eval_queries = {"Data science tools": [2, 3, 4], "Speed up Python": [1, 9, 11],
                "Web dev in Python": [6], "Managing dependencies": [8]}
for q, rel in eval_queries.items():
    b = [i for _, _, i in search(q, 5)]
    # Expand once and reuse the candidates, so the expanded and reranked
    # numbers come from the same LLM reformulations.
    cands = expanded_search(q, 8)
    e = [i for _, _, i in cands[:5]]
    r = [i for _, _, i in rerank(q, cands, 5)]
    print(f"{q}: base={recall(b, rel, 5):.2f} exp={recall(e, rel, 5):.2f} rr={recall(r, rel, 5):.2f}")
Both search and the deduplication loop inside expanded_search are dominated by repeated cosine top-k. A persistent FAISS index trims that to a single C++ call per expansion.
import faiss

embs32 = embs.astype("float32")
faiss.normalize_L2(embs32)
index = faiss.IndexFlatIP(embs32.shape[1])
index.add(embs32)

def search(q, k=5):
    qe = bi.encode([q]).astype("float32")
    faiss.normalize_L2(qe)
    s, i = index.search(qe, k)
    return [(docs[j], float(s[0, p]), int(j)) for p, j in enumerate(i[0])]
A search() helper over a normalized FAISS IndexFlatIP: it L2-normalizes the corpus once, then encodes and normalizes each query so inner product equals cosine similarity, returning (document, score, id) triples.
A normalized FAISS index gives exact cosine search at C++ speed. Build it once at index time, then every query is a single index.search call.
import faiss
import numpy as np

doc_embs = doc_embs.astype("float32")
faiss.normalize_L2(doc_embs)
index = faiss.IndexFlatIP(doc_embs.shape[1])
index.add(doc_embs)

def baseline_search(query, top_k=5):
    qe = bi_encoder.encode([query]).astype("float32")
    faiss.normalize_L2(qe)
    scores, idx = index.search(qe, top_k)
    return [(documents[i], scores[0, j], int(i)) for j, i in enumerate(idx[0])]
- Query transformation bridges the vocabulary gap: HyDE, multi-query, and step-back prompting each address different causes of retrieval failure by rewriting the query before it reaches the index.
- Hybrid retrieval outperforms single-method retrieval in most scenarios: Combining dense and sparse (BM25) retrieval with Reciprocal Rank Fusion consistently outperforms either method alone in technical domains. The exception is when your corpus consists entirely of short, homogeneous documents where dense retrieval alone may suffice.
- Re-ranking is high-impact and low-effort: Adding a cross-encoder re-ranker on top of initial retrieval is one of the highest-ROI improvements you can make to a RAG pipeline.
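- Contextual retrieval makes chunks self-contained: Prepending LLM-generated context to chunks at ingestion time reduced failed retrievals by 49% in Anthropic's reported experiments when contextual embeddings were combined with contextual BM25, and by 67% once a reranker was added.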
- Self-corrective RAG prevents blind trust: CRAG and Self-RAG evaluate retrieval quality and generation faithfulness, triggering corrective actions when problems are detected.
Exercises
A user asks "Why is my app slow?" The RAG system retrieves nothing relevant from the performance optimization docs. Describe two query transformation strategies that could fix this.
Show Answer
(a) Query expansion: generate synonyms like "performance issues," "latency problems," "slow response time." (b) HyDE: generate a hypothetical answer about app performance optimization, then embed that answer for retrieval.
Explain why generating a hypothetical answer and embedding that (HyDE) can outperform directly embedding the question. When might HyDE backfire?
Show Answer
Questions and answers occupy different regions of embedding space. The hypothetical answer is closer to actual answer passages than the question is. HyDE backfires when the LLM generates an incorrect hypothetical answer, leading retrieval away from the correct documents.
A cross-encoder reranker processes 50 candidate documents per query in 200ms. If your system handles 100 QPS, what is the total compute cost of reranking? How would you reduce it?
Show Answer
100 QPS multiplied by 200ms = 20 seconds of GPU compute per second, requiring at least 20 GPUs. Reduce cost by: (a) using a distilled cross-encoder, (b) reducing candidates to top-20, (c) caching frequent queries, (d) using a lighter reranker like ColBERT.
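The arithmetic in this answer can be checked with a few lines. The sketch below assumes each GPU reranks one query at a time with no batching across queries, that reranking time scales linearly with candidate count, and a hypothetical 30% cache hit rate; none of these figures come from a measured system.

```python
import math

def gpus_needed(qps, ms_per_query):
    """GPUs required if each GPU handles one reranking request at a time."""
    busy_seconds_per_second = qps * ms_per_query / 1000
    return math.ceil(busy_seconds_per_second)

# Baseline from the exercise: 100 QPS, 200 ms to rerank 50 candidates.
baseline = gpus_needed(100, 200)

# Assumption: linear scaling, so 20 candidates take 200 * 20 / 50 = 80 ms.
fewer_candidates = gpus_needed(100, 80)

# Assumption: a hypothetical 30% cache hit rate leaves 70 QPS on the GPU path.
with_cache = gpus_needed(70, 80)

print(baseline, fewer_candidates, with_cache)
```

Under these assumptions, trimming the candidate list does most of the work; caching then helps further only to the extent that queries actually repeat.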
Explain the three branches of CRAG (Corrective RAG). What triggers each branch, and what action does the system take?
Show Answer
Correct: retrieved documents are relevant, proceed to generation. Incorrect: documents are irrelevant, perform web search or query transformation to find better sources. Ambiguous: documents are partially relevant, extract useful portions and supplement with additional retrieval.
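The branching logic can be sketched as a small routing function. This is an illustration only: it assumes a retrieval evaluator has already produced one relevance score per document in the range 0 to 1, and the name `crag_route` and the thresholds 0.7 and 0.3 are hypothetical values chosen for the example.

```python
def crag_route(relevance_scores, upper=0.7, lower=0.3):
    """Map retrieval-evaluator scores to one of the three CRAG branches."""
    if not relevance_scores:
        return "incorrect"   # nothing retrieved: fall back to other sources
    best = max(relevance_scores)
    if best >= upper:
        return "correct"     # at least one document is confidently relevant
    if best < lower:
        return "incorrect"   # every document is confidently irrelevant
    return "ambiguous"       # in between: refine what exists and retrieve more

for scores in ([0.9, 0.1], [0.1, 0.2], [0.5, 0.1]):
    print(scores, "->", crag_route(scores))
```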
You search a medical knowledge base for "treatment for type 2 diabetes." Dense search returns general diabetes articles, while BM25 returns articles that mention "type 2" specifically. Explain why hybrid retrieval would outperform either alone in this case.
Show Answer
Dense search captures semantic similarity (diabetes treatment concepts) but may miss the "type 2" specificity. BM25 matches "type 2" exactly but may miss semantically related treatments described differently. Hybrid combines both signals, retrieving documents that are both semantically relevant and contain the precise terms.
Implement multi-query retrieval: given a user question, use an LLM to generate 3 reformulations, retrieve for each, and deduplicate results. Compare recall against single-query retrieval.
Add a cross-encoder reranker (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) to your RAG pipeline. Retrieve top-50 with bi-encoder, rerank to top-5 with cross-encoder. Measure improvement in answer quality.
Implement Anthropic's contextual retrieval pattern: for each chunk, use an LLM to generate a brief context summary, prepend it to the chunk text, then re-embed. Compare retrieval quality before and after.
Implement a simplified self-corrective RAG loop: after retrieval, use an LLM to evaluate whether the retrieved documents are relevant. If not, transform the query and retry (up to 3 attempts).
What Comes Next
In the next section, Section 35.3: RAG with Knowledge Graphs, we examine RAG with knowledge graphs, combining structured and unstructured retrieval for richer context.