RAG Indexing, Evaluation & Long-Context Tradeoff

Section 32.2

The best students are not the ones who memorize the most. They are the ones who know exactly which book to open and which page to turn to.

RAG, Well-Read AI Agent
Big Picture

RAG bridges the gap between what an LLM knows and what it needs to know. Rather than encoding all knowledge in model parameters, RAG retrieves relevant documents at inference time and injects them into the prompt. This simple idea yields enormous practical benefits: reduced hallucination, up-to-date information, domain-specific expertise, and full source attribution. Understanding the fundamental architecture, its failure modes, and when to choose RAG over fine-tuning is the foundation for everything else in this chapter. The embedding and vector database infrastructure from Chapter 31 provides the retrieval backbone that RAG depends on.

Prerequisites

This section continues directly from Section 32.1, which introduces RAG architecture and failure modes. You should be comfortable with the embedding pipeline from Section 31.1, vector index choices from Section 31.2, and the chunking strategies from Section 31.4. The evaluation methodology here builds on offline-eval patterns from Section 42.1.

This continuation of Section 32.1 moves from "what RAG is" to "how to make RAG work in production." It covers the indexing choices you must make for large corpora, how to measure whether your RAG pipeline is actually retrieving the right things, and the practical comparison between RAG and the increasingly capable long-context windows of frontier models.

See Also

For framework-level RAG pipelines (LangChain, LlamaIndex), see Section 36.2.

32.2.1 Indexing Strategies for Large Corpora

Fun Fact

The acronym RAG was coined by Patrick Lewis and co-authors at FAIR in 2020, in a paper that almost did not get the name. Early drafts called it "retrieval-augmented sequence-to-sequence," which is technically correct and rhetorically fatal. The three-letter version stuck so completely that "do RAG" is now both a verb and an entire startup category, and Lewis himself has joked in talks that he had no idea he was naming an industry.

When your knowledge base contains millions of documents, naive flat indexing becomes impractical. Several indexing strategies help maintain retrieval quality at scale.

32.2.1.1 Hierarchical Indexing

Hierarchical indexing creates multiple levels of abstraction. At the top level, document summaries are indexed. When a query matches a summary, the system then searches within that document's chunks for specific passages. This two-stage approach dramatically reduces the search space while maintaining recall. Figure 32.2.3 illustrates how hierarchical indexing narrows the search space.

Diagram: Hierarchical Indexing
Figure 32.2.3: Hierarchical indexing narrows search from document summaries to section-level chunks, reducing search space while maintaining precision.
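The two-stage lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production index: the function names (`cosine_scores`, `hierarchical_search`) and the toy 2-D vectors are assumptions made for this example, and a real system would use an approximate-nearest-neighbor index at both levels.

```python
import numpy as np

def cosine_scores(matrix: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `matrix` and `query`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / norms

def hierarchical_search(query_vec, summary_vecs, chunk_vecs, chunk_doc_ids,
                        top_docs=1, top_chunks=2):
    """Stage 1: rank document summaries. Stage 2: rank chunks of the winning documents."""
    doc_scores = cosine_scores(summary_vecs, query_vec)
    best_docs = np.argsort(doc_scores)[::-1][:top_docs]
    # Only chunks that belong to the selected documents remain candidates.
    candidate_idx = np.flatnonzero(np.isin(chunk_doc_ids, best_docs))
    chunk_scores = cosine_scores(chunk_vecs[candidate_idx], query_vec)
    order = np.argsort(chunk_scores)[::-1][:top_chunks]
    return [(int(candidate_idx[i]), float(chunk_scores[i])) for i in order]

# Toy 2-D vectors (hypothetical): document 0 points along x, document 1 along y.
summary_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
chunk_vecs = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]])
chunk_doc_ids = np.array([0, 0, 1, 1])  # chunk i belongs to document chunk_doc_ids[i]

hits = hierarchical_search(np.array([1.0, 0.05]), summary_vecs, chunk_vecs, chunk_doc_ids)
print(hits)  # (chunk index, score) pairs, all drawn from document 0
```

The trade-off is visible in `top_docs`: a small value shrinks the stage-2 search space, but a relevant chunk inside a document whose summary scored poorly is never considered.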

32.2.1.2 Metadata Filtering

Adding metadata to chunks enables pre-retrieval filtering that narrows the search space before vector similarity is computed. Common metadata fields include document type, creation date, author, department, language, and topic tags. This filtering can be combined with vector search for efficient hybrid retrieval.
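As a rough sketch of pre-retrieval filtering, the snippet below applies a metadata predicate first and computes similarity only over the surviving chunks. The record layout, the field names (`dept`, `created`), and the helper `filtered_search` are hypothetical; vector databases expose the same idea through their own filter syntax.

```python
import numpy as np
from datetime import date

# Hypothetical chunk records: an embedding plus filterable metadata.
chunks = [
    {"id": "c1", "vec": np.array([0.9, 0.1]), "dept": "hr", "created": date(2024, 5, 1)},
    {"id": "c2", "vec": np.array([1.0, 0.0]), "dept": "legal", "created": date(2024, 6, 1)},
    {"id": "c3", "vec": np.array([0.7, 0.3]), "dept": "hr", "created": date(2022, 1, 15)},
]

def filtered_search(query_vec, chunks, dept, created_after, top_k=2):
    """Apply the metadata filter first, then rank the survivors by cosine similarity."""
    survivors = [c for c in chunks if c["dept"] == dept and c["created"] > created_after]
    if not survivors:
        return []
    mat = np.stack([c["vec"] for c in survivors])
    scores = mat @ query_vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(scores)[::-1][:top_k]
    return [(survivors[i]["id"], float(scores[i])) for i in order]

results = filtered_search(np.array([1.0, 0.0]), chunks,
                          dept="hr", created_after=date(2023, 1, 1))
print(results)
```

In this toy data the legal chunk `c2` is the closest vector to the query, yet it never enters the ranking because the filter removes it before any similarity is computed.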

32.2.1.3 RAPTOR: Recursive Cluster-and-Summarize Trees

Naive hierarchical indexing assumes the corpus already exposes a clean document outline (book chapter, then section, then paragraph). Many real corpora do not. A pile of support tickets, a dump of research papers, or a podcast transcript archive has no built-in hierarchy at all. RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval, Sarthi et al., ICLR 2024) builds the hierarchy automatically by recursively clustering and summarizing the chunks themselves.

The construction algorithm is a tight three-step loop. First, embed every leaf chunk with a sentence-embedding model. Second, cluster those embeddings (RAPTOR uses soft Gaussian mixture clustering over UMAP-reduced embeddings, so a chunk can belong to multiple parent topics). Third, for each cluster, call an LLM to write a short summary of the chunks inside, then treat that summary as a new node and re-embed it. Repeat: cluster the summary embeddings, summarize again, embed again, until the top of the tree contains only a handful of nodes that cover the entire corpus.

At retrieval time, RAPTOR offers two traversal modes. Tree traversal starts at the root, picks the top branches whose summaries are most similar to the query, and descends layer by layer until reaching leaf chunks. Collapsed-tree retrieval ignores the hierarchy at query time and treats every node (leaf and summary) as a flat candidate pool, retrieving the top-$k$ by cosine similarity. The collapsed variant performs better in the paper's experiments because it lets the retriever mix abstract summary nodes with concrete leaf chunks in a single context, which is exactly what multi-hop questions need. The authors report that RAPTOR retrieval paired with GPT-4 improves the best previously published accuracy on the QuALITY long-document QA benchmark by 20 percentage points.

Key insight

RAPTOR generalizes the LlamaIndex-style "summary of summaries" pattern into a principled algorithm: the corpus is now a tree of generated abstractions, not just the documents the author wrote. The cost is one extra LLM call per cluster at indexing time; the benefit is that questions like "what are the overarching themes across these 1,000 papers" become answerable without loading every paper into context.

RAPTOR recursive tree: leaf chunks at the bottom, then clustered LLM summaries at level 1, then summary-of-summaries at level 2, with a root node at the top. Each level reduces node count by roughly 5x.
Figure 32.2.4: The RAPTOR tree built bottom-up from raw chunks. Each upper-layer node is an LLM-written summary of the cluster below it. Tree traversal at query time picks a small subtree; collapsed-tree retrieval treats all nodes as a single flat candidate pool.
Algorithm 32.2.1: RAPTOR Tree Construction and Traversal
Algorithm: BUILD-RAPTOR-TREE
Input:  leaf chunks C = {c_1, ..., c_n}, embedding model E,
        summarizer LLM S, max tree height H
Output: tree T with leaves C and summary nodes above

  layer_0 := [(c_i, E(c_i)) for c_i in C]
  T       := layer_0
  for h = 1 .. H:
      clusters := SOFT-CLUSTER(layer_{h-1}, method = GMM-over-UMAP)
      layer_h  := []
      for cluster K in clusters:
          text_K   := concatenate text of nodes in K
          summary  := S("Summarize:", text_K)
          layer_h.append((summary, E(summary), children = K))
      T := T ∪ layer_h
      if |layer_h| <= 2: break    # stop at near-root
  Return T (all nodes across all layers)

Algorithm: SEARCH-RAPTOR-COLLAPSED
Input:  query q, tree T, top-k count k
Output: top-k context chunks (mix of leaves and summaries)

  pool   := all nodes in T (leaves and every summary level)
  scores := [cos(E(q), node.vec) for node in pool]
  Return TOP-K(pool, scores, k)

Algorithm: SEARCH-RAPTOR-TREE-TRAVERSAL
Input:  query q, tree T, branch width b, levels to descend L
Output: nodes selected at each level along the best branches

  frontier := root.children
  selected := {}
  for l = 1 .. L:
      keep     := TOP-K(frontier, cos(E(q), .), b)
      selected := selected ∪ keep
      frontier := union of (n.children for n in keep)
  Return selected    # summaries from upper levels plus the leaves reached last

Source: Sarthi et al., "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval," ICLR 2024 (arXiv:2401.18059). The official implementation defaults to collapsed-tree retrieval because empirically it wins on multi-hop benchmarks: mixing summary nodes with leaf chunks in one context lets the generator do both the high-level synthesis and the detail grounding in a single pass.
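To make the difference between the two search modes in Algorithm 32.2.1 concrete, here is a small runnable sketch over a hand-built tree. Everything in it is an assumption for illustration: the `Node` class, the 2-D vectors, and the strict (single-parent) tree, whereas RAPTOR's soft clustering lets a node have several parents. The traversal variant collects the nodes kept at every level, which is one reasonable reading of the algorithm.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    vec: np.ndarray
    children: list = field(default_factory=list)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def all_nodes(root):
    """Every node below the root (summaries and leaves), depth-first."""
    out = []
    for child in root.children:
        out.append(child)
        out.extend(all_nodes(child))
    return out

def search_collapsed(q, root, k):
    """Rank all nodes in one flat pool, ignoring the hierarchy."""
    return sorted(all_nodes(root), key=lambda n: cos(q, n.vec), reverse=True)[:k]

def search_traversal(q, root, b, levels):
    """Keep the best b nodes per level and descend only into their children."""
    selected, frontier = [], root.children
    for _ in range(levels):
        if not frontier:
            break
        keep = sorted(frontier, key=lambda n: cos(q, n.vec), reverse=True)[:b]
        selected.extend(keep)
        frontier = [c for n in keep for c in n.children]
    return selected

# Hypothetical two-level tree: two summaries, each with two leaves.
A = Node("summary A", np.array([1.0, 0.0]),
         [Node("leaf a1", np.array([0.9, 0.2])), Node("leaf a2", np.array([0.5, 0.5]))])
B = Node("summary B", np.array([0.0, 1.0]),
         [Node("leaf b1", np.array([0.1, 0.9])), Node("leaf b2", np.array([0.7, 0.6]))])
root = Node("root", np.zeros(2), [A, B])

q = np.array([1.0, 0.1])
collapsed = [n.text for n in search_collapsed(q, root, k=3)]
traversal = [n.text for n in search_traversal(q, root, b=1, levels=2)]
print("collapsed:", collapsed)
print("traversal:", traversal)
```

With a branch width of 1, traversal commits to summary A and can never reach leaf b2, even though b2 is the third-best match overall; the collapsed pool picks it up. That is the behavior the source note attributes to collapsed-tree retrieval.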

32.2.2 Evaluation and Common Failure Modes

Evaluating a RAG system requires measuring both retrieval quality and generation quality independently. The RAG triad framework assesses three dimensions: context relevance (did we retrieve the right documents?), groundedness (does the answer stick to the retrieved context?), and answer relevance (does the answer address the original question?).

32.2.2.1 Common Failure Modes

Key Insight: Retrieval-only metrics: recall@k, MRR, and nDCG

Before any LLM is even invoked, you can score the retriever as a pure information-retrieval (IR) system using the three classical ranking metrics. Each requires only a small labeled set of (query, gold_doc_ids) pairs (typically 100 to 500 examples). For a single query, let the retriever return a ranked list of $k$ candidates and let $\text{rel}_i \in \{0,1\}$ indicate whether the $i$-th candidate is in the gold set.

The three metrics answer different questions: recall@k says "is the right answer reachable?", MRR says "how close to the top is the first right answer?", and nDCG@k says "is the whole ranking good?". A production RAG diagnostic dashboard typically tracks all three at $k \in \{1, 5, 10, 20\}$, plus the generation-side RAG-triad scores introduced above, because a failure in any of them points to a different bug (chunker, embedder, reranker, generator).
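The sketch below implements the three metrics with their standard binary-relevance definitions: recall@k is the fraction of gold documents found in the top $k$, the reciprocal rank is $1/\text{rank}$ of the first relevant result (averaged over queries to give MRR), and nDCG@k divides $\sum_{i=1}^{k} \text{rel}_i / \log_2(i+1)$ by the same sum for an ideal ranking. The function names and the two-query evaluation set are made up for illustration.

```python
import math

def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold documents that appear in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in gold_ids)
    return hits / len(gold_ids)

def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gold_ids, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc_id in enumerate(ranked_ids[:k], start=1) if doc_id in gold_ids)
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical labeled set: (ranked retriever output, gold document ids).
eval_set = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d5", "d9"], {"d2", "d9"}),
]
k = 3
mean_recall = sum(recall_at_k(r, g, k) for r, g in eval_set) / len(eval_set)
mrr = sum(reciprocal_rank(r, g) for r, g in eval_set) / len(eval_set)
mean_ndcg = sum(ndcg_at_k(r, g, k) for r, g in eval_set) / len(eval_set)
print(f"recall@{k}={mean_recall:.3f}  MRR={mrr:.3f}  nDCG@{k}={mean_ndcg:.3f}")
```

On this toy set recall@3 is perfect while MRR and nDCG are not, which is the typical signature of a retriever that finds the right documents but ranks them too low: a hint to look at the reranker rather than the chunker.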

The RAG triad above is the intrinsic evaluation track: every metric scores the system against the retrieved context rather than against a gold answer, which means no labeled test set is required. The complementary extrinsic track scores generated answers against human-written reference answers using the classical NLG metrics covered in depth in Section 42.1. BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) compare n-gram overlap between candidate and reference and were originally designed for machine translation and summarization respectively; BERTScore (Zhang et al., 2020) replaces the n-gram match with cosine similarity over contextual embeddings and correlates better with human judgments for free-form answers. For RAG specifically, BLEU and ROUGE are too rigid (the same correct answer can be phrased many ways), so practitioners default to either BERTScore for surface-form similarity or LLM-as-judge correctness scores from RAGAS or ARES (Section 32.4.8.2) for semantic correctness. The intrinsic-vs-extrinsic split matters because they fail in different ways: an unfaithful answer scores high on BLEU if it happens to share n-grams with the reference, and a faithful answer scores low if it phrases the truth differently.
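The rigidity of n-gram metrics is easy to demonstrate. The snippet below uses a simplified unigram-overlap F1 as a stand-in for the ROUGE-1 idea; it is not the official ROUGE implementation, and the example sentences are invented. A correct paraphrase scores far lower than a factually wrong answer that copies the reference wording.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a simplified ROUGE-1-style score)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "employees receive 20 vacation days per year"
paraphrase = "staff get twenty days of annual leave"        # correct, different wording
wrong = "employees receive 10 vacation days per year"       # incorrect, same wording

print(f"paraphrase: {unigram_f1(paraphrase, reference):.3f}")
print(f"wrong:      {unigram_f1(wrong, reference):.3f}")
```

The paraphrase shares a single token with the reference, while the wrong answer shares six of seven, so overlap alone ranks them in the opposite order from a human grader.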

Note

Tools like RAGAS (Retrieval Augmented Generation Assessment), TruLens, and DeepEval provide automated metrics for evaluating RAG pipelines. RAGAS computes faithfulness, answer relevance, and context precision scores using LLM-as-judge approaches. For production systems, a combination of automated metrics and human evaluation on a golden test set provides the most reliable quality signal.

Production Pattern
Production Example: Named RAG Stacks in the Wild

The naive RAG pattern (chunk + embed + retrieve + generate) underpins many shipped products. Notion AI Q&A uses an internal RAG stack against the user's workspace pages with OpenAI embeddings and GPT-4. Glean, valued at $7.2B in 2025, runs enterprise RAG over Slack, Google Drive, and Confluence with a hybrid BM25-plus-dense retriever. GitHub Copilot Workspace's "ask the repo" feature is RAG over an indexed codebase. Mendable.ai (which powers the docs search on LangChain.com, Coinbase Pay, and Hugging Face) is a hosted RAG product. The stack is similar across all of these: an embedding model (OpenAI text-embedding-3-large or Voyage), a vector DB (Pinecone, Weaviate, or pgvector), and a generation step with a frontier model.

32.2.3 RAG vs. Long Context Windows

The rapid expansion of context windows raises a fundamental question: if a model can accept 1 million tokens or more in a single prompt, is RAG still necessary? Models like Gemini 1.5 Pro (1M tokens), Claude 3 (200K tokens), and GPT-4o (128K tokens) can process entire codebases, full books, or large document collections in a single request. Some practitioners have concluded that "just stuff everything in the context" eliminates the need for retrieval infrastructure. This conclusion, while tempting, overlooks several critical factors that keep RAG relevant even in an era of enormous context windows.

32.2.3.1 Why RAG Still Matters

Five distinct advantages keep RAG relevant even when the context window is large enough to hold your entire corpus. We walk through each in turn.

Cost and economics. Long-context inference is expensive. Processing 1M tokens through a frontier model costs orders of magnitude more than embedding a query and retrieving 5 relevant chunks. For a production system handling thousands of queries per hour against a 10M-token knowledge base, the cost difference between "retrieve then generate" and "load everything into context" can be 100x or greater. RAG allows you to pay embedding costs once at indexing time, then pay only for the retrieved subset at query time. For most enterprise workloads, this economic advantage is decisive.
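A back-of-the-envelope calculation shows where the gap comes from. All numbers below (chunk size, question length, price per million tokens) are illustrative assumptions rather than provider quotes, and the calculation ignores output tokens and the one-time embedding cost.

```python
# Illustrative assumptions, not provider quotes.
CORPUS_TOKENS = 10_000_000   # knowledge-base size from the paragraph above
CHUNK_TOKENS = 500           # assumed chunk size
TOP_K = 5                    # chunks retrieved per query
QUESTION_TOKENS = 50         # assumed question length
PRICE_PER_MTOK = 3.00        # assumed input price, USD per million tokens

def input_cost(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Input-side cost of one request, in USD."""
    return tokens / 1_000_000 * price_per_mtok

rag_tokens = TOP_K * CHUNK_TOKENS + QUESTION_TOKENS
long_context_tokens = CORPUS_TOKENS + QUESTION_TOKENS
ratio = long_context_tokens / rag_tokens

print(f"RAG:          {rag_tokens:>10,} tokens  ${input_cost(rag_tokens):.4f} per query")
print(f"Long context: {long_context_tokens:>10,} tokens  ${input_cost(long_context_tokens):.2f} per query")
print(f"Input-token ratio: {ratio:,.0f}x")
```

Under these assumptions the ratio is in the thousands, so even generous chunk sizes or a larger `TOP_K` leave the "100x or greater" claim comfortably intact.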

Latency. Prefill time scales with context length. Processing 1M tokens takes seconds to tens of seconds even on high-end hardware, whereas a RAG system can retrieve relevant chunks and generate a response with a context of a few thousand tokens in under a second. For interactive applications where users expect sub-second responses, RAG provides a latency profile that long-context approaches cannot match. The time-to-first-token for a 1M token prompt is fundamentally constrained by the compute required to process the full attention matrix.

Freshness and dynamic knowledge. RAG systems can incorporate new information the moment it is indexed. When a document is updated, the RAG pipeline re-chunks and re-embeds it, making the new content immediately available to queries. Long-context approaches require re-loading the entire corpus for every request, and they cannot incorporate changes that occur between requests without rebuilding the full prompt. For knowledge bases that change hourly or daily (support tickets, news feeds, regulatory updates), RAG provides a natural mechanism for staying current.

Privacy and access control. RAG pipelines can enforce document-level access controls during retrieval. A user with clearance for Department A's documents sees only Department A's chunks in their context, even though the full knowledge base spans all departments. With a long-context approach, access control requires either maintaining separate prompts per user role or implementing complex filtering logic before context assembly. RAG's retrieval stage is a natural enforcement point for per-document permissions.

Accuracy at scale. Research on the "lost in the middle" problem (discussed in Section 4) shows that models struggle to attend equally to all parts of very long contexts. Information placed in the middle of a 100K+ token context is less likely to be used accurately than information placed at the beginning or end. RAG sidesteps this problem by placing only the most relevant chunks in context, ensuring they are in a position where the model attends to them reliably. Empirical studies have shown that for needle-in-a-haystack retrieval tasks, RAG with good retrieval outperforms naive long-context approaches once the corpus exceeds roughly 50K tokens.

32.2.3.2 When Long Context Wins

Long context does offer genuine advantages in specific scenarios. For tasks requiring holistic understanding of a single large document (such as summarizing an entire book, analyzing patterns across a full codebase, or answering questions that require synthesizing information scattered across many pages), long-context models can outperform RAG because they have access to the complete document and its structure. RAG's chunking process inevitably loses cross-chunk relationships and document-level structure. Additionally, for small, static knowledge bases (under 50K tokens), simply including the full content in context avoids the complexity of building and maintaining a retrieval pipeline.

32.2.3.3 Hybrid Approaches

The most effective production systems increasingly combine both approaches. A retrieve-then-read hybrid uses RAG to narrow a large corpus to the most relevant documents, then passes those complete documents (not just chunks) into a long-context model. This combines RAG's precision in finding relevant material with the long-context model's ability to reason across the full document structure. Another pattern uses RAG as a first pass, then applies a long-context model to re-rank and synthesize across the top retrieved results.
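The retrieve-then-read pattern can be sketched as a small post-processing step on top of any chunk retriever. The function `retrieve_then_read`, its arguments, and the whitespace-based token count are assumptions for this example; a real pipeline would use the model's tokenizer and its actual context limit.

```python
def retrieve_then_read(chunk_hits, documents, token_budget,
                       count_tokens=lambda text: len(text.split())):
    """Promote chunk-level hits to whole documents, best document first, within a budget.

    chunk_hits: list of (doc_id, score) pairs from a chunk-level retriever.
    documents:  mapping doc_id -> full document text.
    """
    # Score each document by its best-matching chunk.
    best_score = {}
    for doc_id, score in chunk_hits:
        best_score[doc_id] = max(score, best_score.get(doc_id, float("-inf")))
    ranked_docs = sorted(best_score, key=best_score.get, reverse=True)

    selected, used = [], 0
    for doc_id in ranked_docs:
        cost = count_tokens(documents[doc_id])
        if used + cost > token_budget:
            continue  # skip documents that would overflow the context window
        selected.append(doc_id)
        used += cost
    context = "\n\n".join(f"[{d}]\n{documents[d]}" for d in selected)
    return selected, context

# Toy data: whitespace "tokens" stand in for real tokenizer counts.
documents = {"a": "alpha " * 40, "b": "beta " * 80, "c": "gamma " * 30}
hits = [("b", 0.9), ("a", 0.8), ("b", 0.5), ("c", 0.4)]

selected, context = retrieve_then_read(hits, documents, token_budget=120)
print(selected)
```

Ranking documents by their best chunk is only one choice; summing or averaging chunk scores favors documents with many moderately relevant passages instead.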

Google's Gemini team has documented a context caching approach that offers a middle ground. Frequently accessed documents are cached in the model's context at a reduced per-token rate, allowing repeated queries against the same corpus without re-processing the full context each time. This reduces the cost disadvantage of long-context approaches while preserving their holistic understanding advantage. However, cached content expires after a time-to-live and must be re-created whenever the underlying documents change, so caching alone does not solve the freshness or access control challenges.

32.2.3.4 Cache-Augmented Generation (CAG)

The academic name for the same middle-ground pattern is Cache-Augmented Generation (CAG). The recipe is: take the documents you would have retrieved, run them through the model once at indexing time, and persist the resulting KV cache to disk. At query time, the LLM loads that pre-computed cache, attends to it, and generates an answer without ever rerunning the (expensive) prefill over those documents. Conceptually, the documents have been baked into "soft tokens" that live in the attention key-value store rather than in the prompt.

CAG is a sensible production sweet spot for corpora that are small enough to fit in the context window of the chosen model (typically a few hundred thousand tokens), change rarely, and serve high query volume. Examples include the public API reference for a SaaS product, the engineering wiki for a single team, or a regulatory rulebook. The retrieval step disappears entirely (so no chunker, no embedder, no vector index to operate), latency drops because prefill is skipped, and you keep the LLM's holistic reasoning over the full corpus. The cost is that any document update invalidates the cache and forces a full re-prefill, so CAG is the wrong pattern for fast-moving knowledge bases (where RAG wins) or for multi-tenant access control (where per-user prompt slicing is harder than per-user retrieval filtering).

Gemini context caching, Anthropic prompt caching, and OpenAI prompt caching are the three production realizations of CAG you will meet most often. Each charges a discounted per-token rate (typically 10% to 25% of the input price) for cached tokens, which is exactly the economic mechanism that makes CAG viable. Think of CAG as the third corner of a triangle with RAG and naive long-context: RAG retrieves and prefills a small context per query, long-context prefills a large context per query, CAG prefills a large context once and reuses it for many queries.

The per-query cost difference between naive long-context and CAG is easy to write down. Let $C_{\text{prefill}}$ be the cost of running the corpus through the model and $C_{\text{generate}}$ the cost of a single long-context call (which includes corpus prefill plus question prefill plus decode). Then a CAG call avoids the corpus prefill on every cached hit:

$$ C_{\text{CAG}} \;=\; C_{\text{generate}} - C_{\text{prefill}} \;+\; \alpha \cdot C_{\text{prefill}}, $$

where $\alpha \in [0.10, 0.25]$ is the price of a cached token as a fraction of the normal input price. For $N$ queries against the same corpus the total cost is $C_{\text{prefill}} + N \cdot (C_{\text{generate}} - C_{\text{prefill}} + \alpha C_{\text{prefill}})$, so the amortized per-query cost converges to $\alpha C_{\text{prefill}} + (C_{\text{generate}} - C_{\text{prefill}})$ as $N$ grows large. When corpus prefill dominates $C_{\text{generate}}$, that is up to an order of magnitude cheaper than the naive baseline of $C_{\text{generate}}$ per query.
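Plugging numbers into the formula makes the amortization visible. The cost values below are arbitrary illustrative units chosen so that corpus prefill dominates a long-context call; the function names are hypothetical, and provider-specific surcharges for writing or storing the cache are deliberately left out.

```python
def naive_long_context_cost(n_queries, c_generate):
    """Every query pays for corpus prefill, question prefill, and decode."""
    return n_queries * c_generate

def cag_cost(n_queries, c_generate, c_prefill, alpha):
    """One-time corpus prefill plus N cached calls, following the formula above."""
    per_query = c_generate - c_prefill + alpha * c_prefill
    return c_prefill + n_queries * per_query

# Illustrative numbers in arbitrary cost units.
C_PREFILL = 0.36    # running the corpus through the model once
C_GENERATE = 0.37   # corpus prefill + question prefill + decode
ALPHA = 0.10        # cached tokens billed at 10% of the input price

for n in (1, 10, 1000):
    naive = naive_long_context_cost(n, C_GENERATE)
    cag = cag_cost(n, C_GENERATE, C_PREFILL, ALPHA)
    print(f"N={n:>4}  naive={naive:8.2f}  CAG={cag:8.2f}  per-query CAG={cag / n:.4f}")
```

For a single query CAG costs more than the naive call because the one-time prefill is not yet amortized. Setting the two totals equal shows that CAG is cheaper once $N > 1/(1-\alpha)$, which is just above one query for any $\alpha$ in the stated range.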

CAG architecture: at indexing time the full corpus is prefilled once through the LLM, producing a KV cache that is persisted; at query time only the new user question is prefilled and concatenated with the cached keys and values during attention, skipping the cost of re-reading the corpus.
Figure 32.2.5: CAG separates the expensive corpus prefill (run once at indexing time) from the cheap per-query work (load the cached keys and values, prefill only the new question, decode the answer). The "soft tokens" framing reflects that the cached K and V tensors are exactly what the model would have computed by reading the documents word by word.

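In practice, hosted APIs expose the cache in one of two ways: as an explicit server-side handle that the client references on later requests, or through prefix matching, where the client resends the same prompt prefix and the server recognizes it. The following snippet shows the Anthropic API form, which uses prefix matching: a single cache_control marker on the corpus block turns the first prompt into an indexing call and every subsequent prompt with the same prefix into a cached-prefix lookup: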

from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
CORPUS = Path("api_reference.md").read_text(encoding="utf-8")  # 120K tokens, changes weekly

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": "You answer questions strictly from the provided reference.",
            },
            {
                "type": "text",
                "text": CORPUS,
                # First call: billed as a cache write (a premium over the base input
                # price); the server keeps the KV cache for ~5 min by default.
                # Later calls within the TTL: cached read at ~10% of input price.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    usage = response.usage
    print(f"cache_creation={usage.cache_creation_input_tokens}, "
          f"cache_read={usage.cache_read_input_tokens}")
    return response.content[0].text

print(ask("What is the rate limit for POST /v1/messages?"))  # cold path: writes the cache
print(ask("How do I list models?"))                          # warm path: reads the cache
Output:
cache_creation=121480, cache_read=0
The POST /v1/messages endpoint is rate-limited per organization based on tier: tier-1 keys get 50 requests per minute and 40,000 input tokens per minute, scaling to 4,000 RPM and 400,000 ITPM at tier 4. Rate-limit headers (anthropic-ratelimit-requests-remaining and anthropic-ratelimit-tokens-remaining) are returned on every response.
cache_creation=0, cache_read=121480
The Anthropic API exposes GET /v1/models, which returns a paginated list of available model identifiers with their display names, context window, and per-token pricing tier.
Code Fragment 32.2.1: Cache-Augmented Generation via Anthropic prompt caching. The cache_control marker on the corpus block of the system prompt instructs the server to persist the KV cache for that prefix. The usage.cache_read_input_tokens field confirms that subsequent calls reuse it: corpus prefill cost is paid once, then amortized across every question in the cache window. Gemini context caching and OpenAI prompt caching expose the same pattern with slightly different SDK calls.
Tip

If your corpus is under 200K tokens, updated weekly, and serves more than ten queries per minute, prototype the CAG variant before the RAG variant. The infrastructure savings (no vector DB, no chunking, no embedding service) often outweigh the slightly higher per-query token cost, and you avoid an entire category of retrieval-failure bugs.

Key Insight

RAG and long context are complementary, not competing, paradigms. RAG excels at efficient, cost-effective retrieval from massive, dynamic knowledge bases with access controls. Long context excels at holistic reasoning over moderate-sized document sets. The best production systems use RAG to select relevant material and long context to reason deeply over it. As context windows continue to grow and costs decline, the boundary between these approaches will shift, but the fundamental advantages of retrieval (cost, latency, freshness, privacy) ensure RAG remains a core architectural pattern.

Table 32.2.3b: Dimension Comparison (as of 2026).
| Dimension | RAG | Long Context | CAG | Hybrid |
|---|---|---|---|---|
| Cost per query | Low (small context) | High (full corpus) | Low (cached prefill) | Medium |
| Latency | Fast (sub-second) | Slow (seconds) | Fast (prefill skipped) | Moderate |
| Freshness | Real-time updates | Static per request | Cache invalidation on edit | Real-time |
| Access control | Natural enforcement | Complex to implement | Hard (one cache per tenant) | Natural enforcement |
| Holistic reasoning | Limited by chunking | Strong | Strong | Strong |
| Corpus scale | Unlimited | Limited by window | Limited by window | Unlimited |

Lab: Build a Complete RAG Pipeline from Scratch
Duration: ~75 minutes · Intermediate

Objective

Build a complete Retrieval-Augmented Generation pipeline: ingest documents, embed and index them, retrieve relevant passages for a query, and generate a grounded answer with source citations.

What You'll Practice

  • Document ingestion and chunking for a RAG knowledge base
  • Building a vector index with sentence-transformers
  • Implementing retrieval with cosine similarity search
  • Crafting RAG prompts that ground the LLM in retrieved context
  • Adding source citations to generated answers

Setup

The following cell installs the required packages and configures the environment for this lab.

You will need an OpenAI API key (or any OpenAI-compatible endpoint).
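The lab code relies on three packages, published on PyPI as sentence-transformers, openai, and numpy. As a quick sanity check before starting, a sketch like the following reports anything that is missing. The helper names are made up for this example, and the check assumes the OpenAI client's default behavior of reading the key from the OPENAI_API_KEY environment variable.

```python
import importlib.util
import os

# Import names of the packages used in this lab.
REQUIRED_PACKAGES = ["sentence_transformers", "openai", "numpy"]

def missing_packages(names=REQUIRED_PACKAGES):
    """Return the packages that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def api_key_present(env=os.environ):
    """True when a non-empty OPENAI_API_KEY is available."""
    return bool(env.get("OPENAI_API_KEY"))

print("Missing packages:", missing_packages() or "none")
print("OPENAI_API_KEY set:", api_key_present())
```

If you use an OpenAI-compatible endpoint instead, the key variable and base URL depend on that provider, so adjust the check accordingly.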

Steps

Step 1: Create the knowledge base

Define a set of documents simulating an internal company wiki.

knowledge_base = [
    {"id": "doc1", "title": "Vacation Policy",
    "text": "All full-time employees receive 20 days of paid vacation per year. Vacation days accrue monthly at 1.67 days per month. Unused days carry over up to 10 days maximum. Requests for 3+ consecutive days need 2 weeks advance notice."},
    {"id": "doc2", "title": "Remote Work Policy",
    "text": "Employees may work remotely up to 3 days per week with manager approval. Core hours are 10 AM to 3 PM local time. A stable internet connection and dedicated workspace are required. International remote work requires HR and Legal approval."},
    {"id": "doc3", "title": "Health Benefits",
    "text": "Three health plans available: Basic (80% coverage, $1500 deductible), Standard (90%, $750), Premium (95%, $250). Dental and vision included in Standard and Premium. Open enrollment occurs annually in November."},
    {"id": "doc4", "title": "Expense Reimbursement",
    "text": "Submit expenses within 30 days. Receipts required over $25. Meals reimbursed up to $75/day during travel. Domestic flights: economy class. International flights over 6 hours: business class. Hotels up to $200/night."},
    {"id": "doc5", "title": "Performance Reviews",
    "text": "Reviews conducted in June and December. Self-assessment required before each review. Manager rating on 1 to 5 scale. Rating 4+ qualifies for bonus. Two consecutive ratings below 2 triggers a performance improvement plan."},
]
print(f"Knowledge base: {len(knowledge_base)} documents")
for doc in knowledge_base:
    print(f" [{doc['id']}] {doc['title']} ({len(doc['text'])} chars)")
Code Fragment 32.2.2: Defining a five-document mock company wiki (vacation, remote work, benefits, expenses, reviews) that will seed the RAG knowledge base.
Hint

In production, these documents would come from files, databases, or web scraping, and each would be chunked into smaller pieces. For this lab, each document is already a reasonable chunk size.
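For longer source documents, a simple overlapping word-window chunker is a common starting point. The sketch below is one minimal way to do it; the function name `chunk_words` and the default sizes are assumptions, and real pipelines usually split on tokens or on sentence and section boundaries instead of raw words.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Synthetic 120-word document: w0 w1 ... w119.
sample = " ".join(f"w{i}" for i in range(120))
pieces = chunk_words(sample, chunk_size=50, overlap=10)
print(len(pieces), [len(p.split()) for p in pieces])
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some words twice.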

Step 2: Build the vector index

Encode all documents into embeddings for similarity search.

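from sentence_transformers import SentenceTransformer
import numpy as np
# Load a small general-purpose sentence-embedding model (384-dimensional vectors).
embed_model = SentenceTransformer("all-MiniLM-L6-v2")
doc_texts = [doc['text'] for doc in knowledge_base]
# encode() returns a NumPy array with one row per document.
doc_embeddings = embed_model.encode(doc_texts)
# Cache the per-document L2 norms so each query needs only a dot product.
doc_norms = np.linalg.norm(doc_embeddings, axis=1)
print(f"Index built: {doc_embeddings.shape}")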
Output: Index built: (5, 384)
Code Fragment 32.2.3c: Encodes every policy document with all-MiniLM-L6-v2 and pre-computes the L2 norms once. Caching the norms now means later cosine-similarity calls only need a dot product, not a fresh norm computation per query.
Hint

Pre-computing doc_norms avoids recomputing them on every query, making search faster.

Step 3: Implement retrieval

Build a function that finds the most relevant documents for a query.

import numpy as np
def retrieve(query, top_k=2):
    """Retrieve top-k documents by cosine similarity."""
    q_emb = embed_model.encode(query)
    scores = np.dot(doc_embeddings, q_emb) / (doc_norms * np.linalg.norm(q_emb))
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [(knowledge_base[i], scores[i]) for i in top_idx]
# Test against four representative queries from the lab's policy KB.
for query in ["How many vacation days do I get?", "Can I work from home?",
    "What health insurance options exist?", "How do I submit expenses?"]:
    results = retrieve(query, top_k=2)
    print(f"\nQ: {query}")
    for doc, score in results:
        print(f" [{score:.3f}] {doc['title']}")
Output:
Q: How many vacation days do I get?
 [0.712] Vacation Policy
 [0.384] Performance Reviews

Q: Can I work from home?
 [0.681] Remote Work Policy
 [0.312] Vacation Policy

Q: What health insurance options exist?
 [0.745] Health Benefits
 [0.298] Expense Reimbursement

Q: How do I submit expenses?
 [0.723] Expense Reimbursement
 [0.341] Remote Work Policy
Code Fragment 32.2.9: A retrieve helper that runs cosine similarity over the indexed policy chunks and returns the top-k matches. The printed scores show how cleanly the right policy dominates for each well-formed query.
Hint

Retrieval should return the vacation policy for vacation questions, remote work policy for WFH questions, and so on. If results look wrong, check that embeddings were computed correctly.

Step 4: Build the RAG generation pipeline

Combine retrieval with LLM generation, grounding answers in retrieved context.

from openai import OpenAI
client = OpenAI()
def rag_answer(query, top_k=2):
    """Retrieve context and generate a grounded answer."""
    retrieved = retrieve(query, top_k)
    # Build context with source labels
    context = "\n\n".join(
        f"[Source {i+1}: {doc['title']}]\n{doc['text']}"
        for i, (doc, _) in enumerate(retrieved))
    system = ("You are a helpful HR assistant. Answer ONLY from the provided "
        "context. Cite sources with [Source N]. If the context does not "
        "cover the question, say so clearly.")
    user_msg = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
        {"role": "user", "content": user_msg}],
        temperature=0.3, max_tokens=300)
    answer = response.choices[0].message.content
    sources = [doc['title'] for doc, _ in retrieved]
    return answer, sources
# Test the full pipeline
for q in ["How many vacation days do I get?", "Can I work from home?",
    "What health plans are available?", "How do I expense meals?"]:
    answer, sources = rag_answer(q)
    print(f"\nQ: {q}\nA: {answer}\nSources: {sources}\n{'='*50}")
Output:
Q: How many vacation days do I get?
A: All full-time employees receive 20 days of paid vacation per year, accruing at 1.67 days per month [Source 1]. Unused days carry over up to a maximum of 10 days.
Sources: ['Vacation Policy', 'Performance Reviews']
==================================================
...
Code Fragment 32.2.10: The full RAG loop: rag_answer stitches retrieved chunks into a citation-tagged context, instructs gpt-4o-mini to answer "ONLY from the provided context", and returns both the grounded answer and the source titles for attribution.
Hint

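The system prompt is critical: instruct the model to answer ONLY from context, cite sources, and admit uncertainty. A low temperature (0.3) makes sampling less random, which helps keep answers close to the retrieved text, although it does not by itself prevent hallucination.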

Step 5: Test with out-of-scope queries

Verify the system handles unanswerable questions gracefully.

out_of_scope = [
    "What is the company's stock price?",
    "Who is the CEO?",
    "What programming languages does engineering use?",
]
print("=== Out-of-Scope Query Test ===")
for q in out_of_scope:
    answer, sources = rag_answer(q)
    print(f"\nQ: {q}\nA: {answer}")
Output:
=== Out-of-Scope Query Test ===
Q: What is the company's stock price?
A: I don't have enough information in the provided context to answer this question. The documents cover HR policies but not financial data.
Q: Who is the CEO?
A: The provided context does not contain information about the company's leadership or CEO.
Q: What programming languages does engineering use?
A: I don't have information about engineering tools or programming languages in the available context.
Code Fragment 32.2.11: Out-of-scope query test: the RAG system correctly refuses to answer questions whose grounding documents are absent from the retrieved context.
Hint

A well-prompted RAG system should say "I don't have enough information" rather than hallucinating. If the model invents answers, strengthen the system prompt.

Expected Output

  • Retrieval returning the correct document for each in-scope query
  • Generated answers citing sources and referencing facts from retrieved documents
  • Out-of-scope queries receiving "I don't have enough information" responses

Stretch Goals

  • Add a cross-encoder reranking step between retrieval and generation
  • Implement query expansion: generate alternative phrasings before searching
  • Add a confidence score based on retrieval similarity scores
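As a starting point for the confidence-score stretch goal, the sketch below labels a retrieval result from its similarity scores alone. The function name `retrieval_confidence` and both thresholds (`min_top`, `min_margin`) are hypothetical and would need tuning on your own queries; the scores are assumed to be cosine similarities sorted in descending order, as returned by the lab's retrieve function.

```python
def retrieval_confidence(scores, min_top=0.35, min_margin=0.05):
    """Label a retrieval result as 'high', 'ambiguous', 'low', or 'none'.

    scores: similarity scores sorted in descending order.
    min_top / min_margin: illustrative thresholds, not tuned values.
    """
    if not scores:
        return "none", 0.0, 0.0
    top = float(scores[0])
    # Margin between the best and second-best hit; a small margin means
    # the retriever could not clearly separate the top candidates.
    margin = top - float(scores[1]) if len(scores) > 1 else top
    if top < min_top:
        label = "low"
    elif margin < min_margin:
        label = "ambiguous"
    else:
        label = "high"
    return label, top, margin


if __name__ == "__main__":
    print(retrieval_confidence([0.62, 0.31]))  # clear winner
    print(retrieval_confidence([0.21, 0.18]))  # nothing similar enough
```

One possible use is to skip the LLM call and return a fixed "I don't have enough information" message when the label is "low", which also saves the generation cost for out-of-scope queries.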
Complete Solution
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy HR knowledge base: one short document per policy
kb = [
    {"id":"doc1","title":"Vacation Policy","text":"20 days PTO per year. Accrue 1.67/month. Carry over max 10 days. 2 weeks notice for 3+ days."},
    {"id":"doc2","title":"Remote Work","text":"Up to 3 days/week remote with manager approval. Core hours 10AM-3PM. International needs HR/Legal."},
    {"id":"doc3","title":"Health Benefits","text":"Basic (80%/$1500), Standard (90%/$750), Premium (95%/$250). Dental/vision in Standard+. November enrollment."},
    {"id":"doc4","title":"Expenses","text":"Submit within 30 days. Receipts over $25. Meals $75/day. Domestic economy, international biz class 6hr+. Hotels $200/night."},
    {"id":"doc5","title":"Reviews","text":"June and December. Self-assessment required. 1-5 scale. 4+ for bonus. Two sub-2 ratings triggers PIP."},
]

# Embed every document once and cache the vector norms for cosine similarity
texts = [d['text'] for d in kb]
embs = embed_model.encode(texts)
norms = np.linalg.norm(embs, axis=1)

def retrieve(q, k=2):
    """Return the k most similar documents as (doc, cosine_score) pairs."""
    qe = embed_model.encode(q)
    scores = np.dot(embs, qe) / (norms * np.linalg.norm(qe))
    idx = np.argsort(scores)[::-1][:k]  # indices in descending score order
    return [(kb[i], scores[i]) for i in idx]

def rag(q, k=2):
    """Retrieve context, then generate an answer grounded in it."""
    ret = retrieve(q, k)
    # Label each retrieved document so the model can cite it as [Source N]
    ctx = "\n\n".join(f"[Source {i+1}: {d['title']}]\n{d['text']}" for i,(d,_) in enumerate(ret))
    r = client.chat.completions.create(model="gpt-4o-mini",
        messages=[{"role":"system","content":"Answer ONLY from context. Cite [Source N]. Say if info insufficient."},
                  {"role":"user","content":f"Context:\n{ctx}\n\nQ: {q}\nA:"}],
        temperature=0.3, max_tokens=300)
    return r.choices[0].message.content, [d['title'] for d,_ in ret]

# Three in-scope queries plus one out-of-scope query
for q in ["How many vacation days?","Can I work remotely?","Health plan options?","Who is the CEO?"]:
    a, s = rag(q)
    print(f"Q: {q}\nA: {a}\nSources: {s}\n{'='*50}")
Output:
Q: How many vacation days?
A: Full-time employees receive 20 days of paid vacation per year [Source 1].
Sources: ['Vacation Policy', 'Reviews']
==================================================
Q: Can I work remotely?
A: Yes, up to 3 days per week with manager approval [Source 1].
Sources: ['Remote Work', 'Vacation Policy']
==================================================
...
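Code Fragment 32.2.12: Complete solution: a condensed version of the lab pipeline that embeds a five-document HR knowledge base, retrieves the top-k documents by cosine similarity, and generates a source-cited answer with gpt-4o-mini.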
Tip: Chunk by Semantic Boundaries

Split documents at paragraph or section boundaries rather than fixed token counts. A 512-token chunk that splits a sentence in half creates two useless fragments. Use heading detection or sentence boundaries as natural split points.
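A minimal sketch of boundary-aware chunking follows. It packs whole paragraphs into chunks and falls back to sentence boundaries only when a single paragraph is over budget. The function name `chunk_by_paragraph`, the `max_words` parameter, and the use of whitespace word counts as a stand-in for token counts are all assumptions made for illustration; a production pipeline would count tokens with the embedding model's tokenizer.

```python
import re


def chunk_by_paragraph(text, max_words=120):
    """Pack whole paragraphs into chunks of at most max_words words.

    Paragraphs longer than max_words are split at sentence ends instead.
    A single sentence longer than max_words is kept whole (over budget)
    rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Break the text into units that are never split further.
    units = []
    for p in paragraphs:
        if len(p.split()) <= max_words:
            units.append(p)
        else:
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", p) if s)

    # Greedily pack units into chunks without crossing the word budget.
    chunks, current, count = [], [], 0
    for u in units:
        n = len(u.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(u)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    doc = "Vacation accrues monthly.\n\nUnused days carry over.\n\nNotice is required."
    for c in chunk_by_paragraph(doc, max_words=6):
        print(repr(c))
```

Every chunk produced this way starts and ends on a paragraph or sentence boundary, at the cost of uneven chunk sizes.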

Key Takeaways
Real-World Scenario
Building a RAG Pipeline for Internal Documentation

Who: A DevOps engineer and an ML engineer at a 500-person software company

Situation: Engineers spent an average of 45 minutes per day searching across Confluence, Notion, and Slack for answers to internal process questions (deployment procedures, on-call runbooks, architecture decisions).

Problem: A vanilla LLM chatbot hallucinated confidently about internal procedures it had never seen. Simply stuffing documents into the prompt exceeded the context window and cost $0.12 per query.

Dilemma: Fine-tuning the model on internal docs would bake in stale information (procedures changed weekly). Pure keyword search missed semantic matches (e.g., "how to roll back a deploy" vs. "revert production release").

Decision: They built a naive RAG pipeline: embed all documentation into a vector store, retrieve the top 5 chunks per query, and pass them as context to GPT-4 with explicit grounding instructions.

How: Documents were chunked at 400 tokens with 50-token overlap, embedded with text-embedding-3-small, and stored in Qdrant. The system prompt instructed the model to cite chunk sources and say "I don't know" when context was insufficient.

Result: Engineers reported 70% fewer failed searches. Average query cost dropped to $0.008. The grounding instruction reduced hallucinations from 34% to 4% on a 200-question evaluation set.

Lesson: A well-constructed naive RAG pipeline with clear grounding instructions solves the majority of internal knowledge retrieval problems before you need any advanced techniques.
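The chunking configuration in this scenario (400-token chunks with 50-token overlap) can be sketched as a sliding window over a token list. The function name `chunk_with_overlap` is hypothetical, and the sketch assumes the document has already been tokenized; the scenario does not say which tokenizer the team used.

```python
def chunk_with_overlap(tokens, size=400, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


if __name__ == "__main__":
    tokens = list(range(1000))  # stand-in for real token IDs
    chunks = chunk_with_overlap(tokens)
    print([len(c) for c in chunks])
```

With these settings each window starts 350 tokens after the previous one, so the last 50 tokens of one chunk are repeated as the first 50 tokens of the next.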

Self-Check
Q1: What are the three fundamental limitations of parametric knowledge in LLMs that RAG addresses?
Show Answer
Knowledge cutoff (the model has no information after its training date), incompleteness (no model can memorize every fact, especially rare or domain-specific information), and unverifiability (generated claims cannot be traced back to specific source documents). RAG addresses all three by grounding generation in retrieved evidence.
Q2: Why is chunk overlap important in the ingestion pipeline?
Show Answer
Chunk overlap (typically 10 to 15% of chunk size) prevents important information from being split across chunk boundaries. Without overlap, a sentence or concept that spans two chunks would be incomplete in both, reducing retrieval quality. The overlap ensures continuity so that key passages remain intact within at least one chunk.
Q3: What is the "lost-in-the-middle" problem and how does it affect RAG design?
Show Answer
LLMs use information at the beginning and end of their context window more reliably than information in the middle, so answer accuracy follows a U-shaped curve over the position of the relevant passage. For RAG, this means: limit retrieved context to 3 to 5 chunks, rerank by relevance before insertion, and place the most relevant chunk first so that critical information is not buried in the middle.
Q4: When should you choose RAG over fine-tuning for adapting an LLM?
Show Answer
Choose RAG when: knowledge changes frequently, source citations are required, the corpus is large (thousands of documents), the task involves factual Q&A or research, hallucination must be minimized with evidence, and you need to avoid the cost and delay of retraining. Fine-tuning is better for stable knowledge, style or format adaptation, and minimal-latency scenarios. The best systems often combine both.
Q5: What are the three dimensions of the RAG triad evaluation framework?
Show Answer
The RAG triad evaluates: (1) Context relevance, which measures whether the right documents were retrieved; (2) Groundedness (or faithfulness), which measures whether the generated answer stays faithful to the retrieved context; and (3) Answer relevance, which measures whether the answer actually addresses the user's original question. Measuring all three independently helps diagnose whether failures originate in retrieval, generation, or both.
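The ordering advice from Q3 can be illustrated with a small reordering helper. This is one common heuristic rather than the only option: it keeps the most relevant chunk first, puts the second most relevant last, and pushes the weakest chunks toward the middle. The function name `reorder_for_edges` is hypothetical, and the input is assumed to be already ranked with the most relevant chunk first.

```python
def reorder_for_edges(ranked):
    """Arrange ranked items so the strongest sit at both ends of the context.

    ranked: items sorted by relevance, most relevant first.
    """
    front, back = [], []
    for i, item in enumerate(ranked):
        # Even ranks fill the front in order; odd ranks fill the back.
        (front if i % 2 == 0 else back).append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


if __name__ == "__main__":
    print(reorder_for_edges(["r1", "r2", "r3", "r4", "r5"]))
```

Whether this helps depends on the model; comparing it against plain relevance order on an evaluation set is the safest way to decide.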
Research Frontier

Self-RAG (Asai et al., 2024) trains the LLM to decide when to retrieve, what to retrieve, and how to use retrieved passages, eliminating the need for a separate retrieval orchestrator. Corrective RAG (CRAG) adds a lightweight evaluator that assesses retrieved document relevance and triggers web search as a fallback when the initial retrieval is insufficient. Speculative RAG borrows the draft-then-verify pattern of speculative decoding: a smaller specialist model drafts several candidate answers in parallel, each from a different subset of the retrieved documents, and a larger generalist model then verifies the drafts and selects the best one.

Research into long-context models vs. RAG is exploring whether models with 1M+ token windows can replace RAG entirely for certain use cases, though evidence suggests RAG still outperforms long-context stuffing for precision-critical applications.

Exercises

Exercise 18.1.1: RAG motivation Conceptual

An LLM confidently answers a question about a company's Q3 2024 earnings, but the information is wrong. How does RAG prevent this failure mode? What failure modes does RAG introduce instead?

Show Answer

RAG grounds the LLM in retrieved documents, reducing hallucination. It introduces new failure modes: (a) relevant document not retrieved (retrieval failure), (b) correct document retrieved but LLM ignores it (attention failure), (c) outdated or incorrect documents in the knowledge base (data quality failure).

Exercise 18.1.2: Four stages Conceptual

Name the four stages of a naive RAG pipeline and describe what happens at each stage.

Show Answer

(1) Query encoding: convert user query to an embedding vector. (2) Retrieval: find the most similar document chunks from the vector store. (3) Augmentation: combine retrieved chunks with the query in a prompt template. (4) Generation: pass the augmented prompt to the LLM for answer generation.

Exercise 18.1.3: RAG vs. fine-tuning Conceptual

A legal firm wants their LLM to answer questions about their proprietary case database. Should they use RAG or fine-tuning? What if the database changes weekly?

Show Answer

RAG is the clear choice. It handles weekly updates by re-indexing new documents without retraining. Fine-tuning would require retraining every week, is expensive, and risks catastrophic forgetting. RAG also provides citations to specific cases.

Exercise 18.2.1: Context window management Coding

You retrieve 10 relevant chunks, but they total 8000 tokens and your prompt template plus instructions use 2000 tokens. Your model has a 4096-token context window. What strategies can you use?

Show Answer

Options: (a) reduce to top-K chunks by relevance score, (b) summarize retrieved chunks before inserting, (c) use a model with a larger context window, (d) implement a reranker to select only the most relevant chunks, (e) truncate lower-ranked chunks.
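Options (a) and (e) can be combined into a greedy packing step, sketched below with the numbers from this exercise. The function name `pack_chunks`, the tuple layout, and the 300-token reserve for the answer are assumptions for illustration; the per-chunk token counts are assumed to come from the model's tokenizer.

```python
def pack_chunks(chunks, context_window=4096, prompt_tokens=2000, answer_tokens=300):
    """Select the highest-scoring chunks that fit the remaining token budget.

    chunks: list of (score, n_tokens, text) tuples.
    Returns (selected_texts, tokens_used, budget).
    """
    # Whatever is left after the prompt template and the reserved answer space
    budget = context_window - prompt_tokens - answer_tokens
    selected, used = [], 0
    for score, n_tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + n_tokens <= budget:
            selected.append(text)
            used += n_tokens
        # Chunks that do not fit are skipped; a smaller, lower-ranked
        # chunk may still fit later in the loop.
    return selected, used, budget


if __name__ == "__main__":
    # Ten chunks of 800 tokens each (8000 total), ranked by an integer score
    chunks = [(10 - i, 800, f"chunk{i}") for i in range(10)]
    print(pack_chunks(chunks))
```

With a 4096-token window, a 2000-token prompt, and 300 tokens reserved for the answer, only 1796 tokens remain, so two of the ten 800-token chunks fit. That gap is why summarization or a larger-context model appears among the options.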

Exercise 18.2.2: RAG vs. long context Conceptual

Models now support 128K+ context windows. Does this make RAG obsolete? Argue both sides and state your conclusion.

Show Answer

Against RAG: long context could ingest entire knowledge bases, eliminating retrieval errors. For RAG: long context is expensive (cost scales linearly), attention degrades over long sequences ("lost in the middle"), and dynamic knowledge bases still need a retrieval mechanism. Conclusion: RAG remains necessary for large-scale, dynamic, cost-sensitive applications, while long context helps with small, static corpora.

Exercise 18.2.5: Naive RAG pipeline Coding

Build a minimal RAG pipeline: load a 10-page PDF, chunk it, embed with sentence-transformers, store in ChromaDB, and answer 5 questions. Compare answers with and without retrieval.

Exercise 18.2.3: Prompt template comparison Conceptual

Using the same retrieval results, experiment with three prompt templates: (a) "Answer based on the context: {context}", (b) a structured template with numbered sources, (c) a template that instructs the model to cite sources. Compare answer quality and citation accuracy.

Exercise 18.2.4: Failure mode analysis Discussion

Deliberately create five queries that expose RAG failure modes: irrelevant retrieval, partial retrieval, conflicting sources, insufficient context, and hallucination despite context. Document each failure and propose a mitigation strategy.

Exercise 18.1.9: Hierarchical indexing Coding

Implement a two-level index: document summaries as the first level, section chunks as the second level. Compare retrieval precision against a flat single-level index on the same corpus.

What's Next?

In the next section, Section 32.3: Vector Stores and Embedding Models in RAG, we go a layer deeper into the retrieval stack: how vector stores actually index embeddings, which embedding models you should pick for which corpus, and how their interaction sets the ceiling on RAG quality.

Further Reading
Lewis, P. et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. The foundational RAG paper that introduced the retrieve-then-generate paradigm. Useful for understanding how retrieval and generation components interact. Start here if you are new to RAG.
Gao, Y. et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv preprint. A comprehensive survey covering RAG taxonomies, techniques, and evaluation methods. Provides an excellent map of the RAG landscape as of 2024. Ideal for practitioners seeking a broad overview.
Ram, O. et al. (2023). "In-Context Retrieval-Augmented Language Models." TACL. Explores how retrieval can be integrated into in-context learning without fine-tuning. Demonstrates strong performance on knowledge-intensive tasks. Recommended for researchers studying zero-shot RAG.
Es, S. et al. (2024). "RAGAs: Automated Evaluation of Retrieval Augmented Generation." arXiv preprint. Introduces automated metrics for evaluating RAG pipelines, including faithfulness and answer relevancy. A practical, widely adopted framework for RAG evaluation. Must-read for anyone building production RAG.
LangChain RAG Tutorial. Official tutorial covering end-to-end RAG implementation with LangChain. Includes code examples for document loading, chunking, and retrieval. Best starting point for hands-on RAG development.
LlamaIndex: Build RAG Applications. Comprehensive documentation for the LlamaIndex framework with a focus on data ingestion and indexing. Offers advanced features like query engines and response synthesizers. Recommended for complex RAG architectures.
Sarthi, P. et al. (2024). "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval." ICLR 2024. Introduces the recursive embed-cluster-summarize tree construction algorithm and the tree-traversal vs. collapsed-tree retrieval modes referenced in Section 32.2.1.3. Required reading if the corpus has no built-in hierarchy.