Part V: Retrieval and Conversation
Chapter 20: Retrieval-Augmented Generation

RAG Frameworks & Orchestration

A good framework should make the easy things trivial and the hard things possible, especially when retrieval is involved.

Big Picture

RAG frameworks transform weeks of plumbing into hours of configuration. Building a production RAG system from raw API calls requires wiring together embedding models, vector stores, retrievers, rerankers, prompt templates, and LLM calls. Frameworks like LangChain, LlamaIndex, and Haystack provide pre-built abstractions for these components, letting you swap implementations without rewriting your pipeline. Understanding each framework's philosophy, strengths, and trade-offs is essential for choosing the right tool (or deciding to go without one entirely). The production patterns from Section 10.3 apply equally when building RAG pipelines with these frameworks.

Prerequisites

This section surveys practical RAG frameworks, so familiarity with the RAG architecture from Section 20.1 and the advanced retrieval strategies in Section 20.2 is essential. Experience with Python and basic API usage from Section 10.1 will help you follow the code examples. The framework comparison here prepares you for building production applications discussed in Section 28.1.

1. Why Use a RAG Framework?

A minimal RAG pipeline requires at least five distinct operations: loading documents, splitting them into chunks, computing embeddings, storing vectors in a database, and orchestrating retrieval with LLM generation. Each of these steps has multiple implementation choices (sentence splitters vs. recursive splitters, OpenAI embeddings vs. Cohere embeddings, Pinecone vs. Chroma vs. pgvector). Without a framework, every component switch requires rewriting integration code. The hybrid ML and LLM pipeline principles from Chapter 10 apply here as well, since RAG systems combine traditional retrieval with LLM generation.

RAG frameworks solve this by providing a common interface layer. A retriever is a retriever regardless of whether it queries Pinecone or Weaviate underneath. A text splitter is a text splitter whether it uses token counts or recursive character boundaries. This abstraction brings three concrete benefits: faster prototyping, easier component swapping during evaluation, and a shared vocabulary that simplifies team communication.
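The "common interface layer" idea can be sketched in a few lines of plain Python. This is an illustrative, framework-free sketch: the `Retriever` protocol and `KeywordRetriever` class below are hypothetical, not any framework's actual API.

```python
# A minimal sketch of the "common interface" idea, framework-free.
# The Protocol and classes below are illustrative, not a real framework API.
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...

class KeywordRetriever:
    """Toy retriever: ranks documents by query-term overlap."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        terms = set(query.lower().split())
        scored = sorted(
            self.docs,
            key=lambda d: len(terms & set(d.lower().split())),
            reverse=True,
        )
        return scored[:k]

def answer(retriever: Retriever, query: str) -> str:
    """Pipeline code depends only on the interface, not the backend."""
    context = " | ".join(retriever.retrieve(query, k=2))
    return f"[answer grounded in: {context}]"

docs = ["RAG reduces hallucination", "Vector stores index embeddings",
        "Chunking splits documents"]
print(answer(KeywordRetriever(docs), "how does RAG reduce hallucination"))
```

Swapping `KeywordRetriever` for a vector-store-backed class leaves `answer` untouched, which is exactly the decoupling frameworks formalize at scale.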

However, frameworks also introduce complexity. They add layers of abstraction that can obscure what is actually happening, they impose opinions about pipeline structure that may not match your needs, and they evolve rapidly, sometimes introducing breaking changes. The decision to adopt a framework should weigh these trade-offs against the complexity of your specific use case.

Fun Fact

LangChain's GitHub repository accumulated over 400 open issues about breaking changes in its first year alone. The framework moved so fast that tutorials written in January were often obsolete by March. This is both a testament to the pace of innovation and a cautionary tale about coupling your production code to a rapidly evolving abstraction layer.

Figure 20.6.1 shows the framework abstraction stack.

Figure 20.6.1: RAG framework abstraction stack. Frameworks provide a uniform interface layer that decouples your application from specific vendor implementations. You can swap any green (service) box without touching the red (application) layer.

2. LangChain

LangChain is the most widely adopted framework for LLM application development. Originally built around the concept of "chains" (sequential pipelines of operations), it has evolved into a comprehensive ecosystem with separate packages for core abstractions (langchain-core), community integrations (langchain-community), and the orchestration runtime (langgraph). For RAG specifically, LangChain provides document loaders, text splitters, embedding models, vector stores, retrievers, and output parsers as composable building blocks.

2.1 Core Concepts

LangChain's architecture revolves around several key abstractions. Document loaders ingest data from PDFs, web pages, databases, and dozens of other sources into a uniform Document object. Text splitters break documents into chunks with configurable size and overlap. Retrievers provide a standard interface for fetching relevant documents, whether from a vector store, a BM25 index, or a custom API. Chains wire these components together into executable pipelines.

2.2 LCEL (LangChain Expression Language)

Key Insight

LCEL's pipe operator is not just syntactic sugar. Because every component implements the Runnable interface, you get streaming, batching, and async execution for free, without modifying any of the component code. This means you can prototype a pipeline with synchronous calls and deploy it with streaming by changing a single method call from invoke to stream.

LCEL is LangChain's declarative composition syntax, introduced to replace imperative chain construction. Using the pipe operator (|), LCEL lets you compose components into readable pipelines that support streaming, batching, and async execution out of the box. Each component in an LCEL pipeline implements the Runnable interface, meaning it has invoke, stream, batch, and ainvoke methods automatically. Code Fragment 20.6.1 below puts this into practice.

Example 1: RAG Pipeline with LangChain LCEL

This snippet builds a retrieval-augmented generation pipeline using LangChain Expression Language (LCEL).


# RAG pipeline with LangChain LCEL: retrieve, format context,
# fill the prompt, generate, and parse, composed with the pipe operator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Initialize components
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
 collection_name="docs",
 embedding_function=embeddings,
 persist_directory="./chroma_db"
)
retriever = vectorstore.as_retriever(
 search_type="similarity",
 search_kwargs={"k": 5}
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Define the prompt template
template = """Answer the question based only on the following context:

{context}

Question: {question}

Provide a detailed answer. If the context does not contain
enough information, say so explicitly."""

prompt = ChatPromptTemplate.from_template(template)

# Helper to format retrieved documents
def format_docs(docs):
 return "\n\n".join(doc.page_content for doc in docs)

# LCEL pipeline: pipe operator composes Runnables
rag_chain = (
 {
 "context": retriever | format_docs,
 "question": RunnablePassthrough()
 }
 | prompt
 | llm
 | StrOutputParser()
)

# Invoke the pipeline
answer = rag_chain.invoke("What are the key benefits of RAG?")
print(answer)

# Streaming is automatic with LCEL
for chunk in rag_chain.stream("Explain hybrid search approaches"):
 print(chunk, end="", flush=True)
Sample output:

RAG (Retrieval-Augmented Generation) offers several key benefits: (1) It grounds LLM responses in factual, up-to-date documents, reducing hallucination. (2) It enables the model to reference domain-specific knowledge without retraining. (3) Retrieved sources can be cited, making answers verifiable and trustworthy.
Code Fragment 20.6.1: RAG pipeline with LangChain LCEL.

2.3 Memory and Conversation

For conversational RAG, LangChain provides memory classes that persist chat history across turns. The simplest is ConversationBufferMemory, which stores all messages. For long conversations, ConversationSummaryMemory uses an LLM to compress earlier turns into a summary, keeping the context window manageable. In the newer LangGraph paradigm, explicit graph state that flows between nodes replaces these memory classes, providing more control over how conversation context evolves.
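The buffer-versus-summary trade-off can be illustrated without any framework. The classes below are simplified, hypothetical stand-ins (not LangChain's actual implementations), with a string join standing in for the LLM summarization call:

```python
# Framework-free sketch of the two memory strategies described above.
# BufferMemory and SummaryMemory are illustrative names, not LangChain's API.

class BufferMemory:
    """Keeps every turn verbatim -- simple, but grows without bound."""
    def __init__(self):
        self.turns: list[tuple[str, str]] = []

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def context(self) -> str:
        return "\n".join(f"{role}: {text}" for role, text in self.turns)

class SummaryMemory:
    """Compresses old turns once the buffer exceeds max_turns.
    A real implementation would call an LLM instead of joining strings."""
    def __init__(self, max_turns: int = 4):
        self.max_turns = max_turns
        self.summary = ""
        self.turns: list[tuple[str, str]] = []

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        if len(self.turns) > self.max_turns:
            old = self.turns[: -self.max_turns]
            self.turns = self.turns[-self.max_turns:]
            # Stand-in for an LLM summarization call:
            self.summary += " " + "; ".join(t for _, t in old)

    def context(self) -> str:
        recent = "\n".join(f"{r}: {t}" for r, t in self.turns)
        return f"Summary:{self.summary}\n{recent}" if self.summary else recent

mem = SummaryMemory(max_turns=2)
for i in range(4):
    mem.add("user", f"question {i}")
print(mem.context())
```

The context window stays bounded: only the last `max_turns` messages appear verbatim, with older turns folded into the running summary.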

Note

LangChain has undergone significant architectural changes since its early days. The original monolithic langchain package has been split into langchain-core (stable interfaces), langchain-community (third-party integrations), and vendor-specific packages like langchain-openai. For complex agent workflows, langgraph is now the recommended approach over legacy chain classes. When reading tutorials or documentation, check the version carefully, as patterns from six months ago may already be deprecated.

3. LlamaIndex

LlamaIndex (formerly GPT Index) takes a data-centric approach to RAG. While LangChain provides general-purpose LLM application primitives, LlamaIndex focuses specifically on connecting LLMs with external data. Its core philosophy is that different data structures and query patterns require different index types, and the framework should help you choose and combine them.

3.1 Index Types

LlamaIndex offers several index types, each optimized for different query patterns. VectorStoreIndex is the most common, storing embeddings for semantic similarity search. SummaryIndex (formerly ListIndex) stores all nodes and iterates through them sequentially, useful when you need to process every document. TreeIndex builds a hierarchical tree of summaries, enabling top-down traversal for broad questions. KeywordTableIndex extracts keywords from each node and uses keyword matching for retrieval.
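Of these, the keyword-table idea is the easiest to sketch from scratch. The toy below uses naive whitespace tokenization and contains no LlamaIndex code; it just shows the underlying data structure, a keyword-to-node mapping queried by keyword overlap:

```python
# Toy sketch of the KeywordTableIndex idea: map keywords -> node ids.
from collections import defaultdict

nodes = {
    0: "RAG combines retrieval with generation",
    1: "Embeddings map text to vectors",
    2: "Rerankers reorder retrieved results",
}

# Build step: extract (here, naively lowercased) keywords per node
table = defaultdict(set)
for node_id, text in nodes.items():
    for word in text.lower().split():
        table[word].add(node_id)

def keyword_retrieve(query: str) -> list[str]:
    """Query step: union of nodes matching any query keyword."""
    hits = set()
    for word in query.lower().split():
        hits |= table.get(word, set())
    return [nodes[i] for i in sorted(hits)]

print(keyword_retrieve("retrieval with rerankers"))
```

A vector index replaces the keyword table with embedding similarity, and a tree index layers summaries on top, but the retrieve-nodes-by-key pattern is the same.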

3.2 Query Engines and Response Synthesizers

A query engine in LlamaIndex combines a retriever with a response synthesizer. The retriever fetches relevant nodes (chunks), and the response synthesizer determines how to construct the final answer from those nodes. LlamaIndex provides several synthesis strategies: compact (stuff all context into one prompt), refine (iteratively refine the answer by processing one chunk at a time), and tree_summarize (recursively summarize groups of chunks in a tree structure). The choice of synthesizer affects both answer quality and token consumption. Code Fragment 20.6.2 below puts this into practice.

Example 2: RAG Pipeline with LlamaIndex

This snippet builds a RAG pipeline using LlamaIndex with a vector store index and query engine.


# RAG pipeline with LlamaIndex: load documents, chunk with a
# sentence splitter, build a vector index, and query with streaming
from llama_index.core import (
 VectorStoreIndex,
 SimpleDirectoryReader,
 Settings,
 StorageContext
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.node_parser = SentenceSplitter(
 chunk_size=1024,
 chunk_overlap=200
)

# Load documents from a directory
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")

# Build the vector index (embeds and stores automatically)
index = VectorStoreIndex.from_documents(documents)

# Create a query engine with custom parameters
query_engine = index.as_query_engine(
 similarity_top_k=5,
 response_mode="compact", # or "refine", "tree_summarize"
 streaming=True
)

# Query with streaming response
response = query_engine.query(
 "What are the key benefits of RAG?"
)

# Stream the response
response.print_response_stream()

# Access source nodes for citations
for node in response.source_nodes:
 print(f"\nSource: {node.metadata.get('file_name', 'unknown')}")
 print(f"Score: {node.score:.4f}")
 print(f"Text: {node.text[:200]}...")
Sample output:

Loaded 12 documents
RAG provides grounded, verifiable responses by retrieving relevant context from a knowledge base before generation. Key benefits include reduced hallucination, access to current information, and source attribution...

Source: rag_overview.txt
Score: 0.8734
Text: Retrieval-Augmented Generation combines the strengths of information retrieval systems with large language models...

Source: benefits_comparison.txt
Score: 0.8412
Text: The primary advantages of RAG over pure generation include factual grounding, reduced hallucination rates, and...
Code Fragment 20.6.2: RAG pipeline with LlamaIndex.

3.3 Routers and Multi-Index Queries

One of LlamaIndex's distinctive features is its routing system. A RouterQueryEngine selects which sub-query engine to use based on the question. For example, you might route factual questions to a vector index, summary questions to a tree index, and comparison questions to a SQL query engine. This enables a single application to handle diverse query types by dispatching each question to the most appropriate retrieval strategy.
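The routing pattern itself is simple to sketch without the framework. The classification rules and engine names below are purely illustrative (a real RouterQueryEngine typically uses an LLM or embedding similarity to select the route):

```python
# Framework-free sketch of query routing: classify, then dispatch.
# The rules and engine names are illustrative, not LlamaIndex's API.

def route(question: str) -> str:
    q = question.lower()
    if q.startswith(("summarize", "give an overview")):
        return "tree_index"       # broad questions -> hierarchical summaries
    if "compare" in q or "versus" in q:
        return "sql_engine"       # structured comparison -> tabular data
    return "vector_index"         # default: semantic lookup

engines = {
    "vector_index": lambda q: f"[vector answer to: {q}]",
    "tree_index": lambda q: f"[summary answer to: {q}]",
    "sql_engine": lambda q: f"[sql answer to: {q}]",
}

def router_query(question: str) -> str:
    return engines[route(question)](question)

print(router_query("Summarize the chapter"))
print(router_query("Compare Pinecone and Chroma"))
print(router_query("What is chunk overlap?"))
```

Each sub-engine stays simple because it only ever sees the query types it is built for.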

4. Haystack by deepset

Haystack takes a pipeline-first approach to NLP and RAG applications. Developed by deepset, it models every workflow as a directed graph of components. Each component has typed inputs and outputs, and pipelines are validated at construction time to ensure that component connections are compatible. This design philosophy emphasizes explicit data flow, type safety, and reproducibility.

4.1 Pipeline-Based Architecture

In Haystack, a pipeline is a directed acyclic graph (DAG) where each node is a component that performs a specific operation. Components declare their input and output types using Python dataclasses, and the pipeline validates that connected components have compatible types. This strict typing catches configuration errors at build time rather than at runtime, which is valuable for complex production pipelines with many components. Code Fragment 20.6.3 below puts this into practice.

Example 3: RAG Pipeline with Haystack

This snippet builds a RAG pipeline using Haystack's component-based pipeline API.


# Two Haystack pipelines: an indexing pipeline (convert, split,
# embed, write) and a query pipeline (embed, retrieve, prompt, generate)
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import (
 SentenceTransformersDocumentEmbedder,
 SentenceTransformersTextEmbedder
)
from haystack.components.writers import DocumentWriter
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.document_stores.chroma import (
 ChromaDocumentStore
)
from haystack_integrations.components.retrievers.chroma import (
 ChromaEmbeddingRetriever
)

# ---- Indexing Pipeline ----
document_store = ChromaDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", TextFileToDocument())
indexing_pipeline.add_component("splitter", DocumentSplitter(
 split_by="sentence", split_length=3, split_overlap=1
))
indexing_pipeline.add_component("embedder",
 SentenceTransformersDocumentEmbedder()
)
indexing_pipeline.add_component("writer",
 DocumentWriter(document_store=document_store)
)

# Connect components explicitly
indexing_pipeline.connect("converter", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

# Run indexing
indexing_pipeline.run({
 "converter": {"sources": ["./data/doc1.txt", "./data/doc2.txt"]}
})

# ---- Query Pipeline ----
template = """Given the following context, answer the question.

Context:
{% for doc in documents %}
 {{ doc.content }}
{% endfor %}

Question: {{ question }}
Answer:"""

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder",
 SentenceTransformersTextEmbedder()
)
query_pipeline.add_component("retriever",
 ChromaEmbeddingRetriever(document_store=document_store)
)
query_pipeline.add_component("prompt_builder",
 PromptBuilder(template=template)
)
query_pipeline.add_component("llm", OpenAIGenerator(model="gpt-4o"))

query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "llm")

# Run the query pipeline
result = query_pipeline.run({
 "text_embedder": {"text": "What are the key benefits of RAG?"},
 "prompt_builder": {"question": "What are the key benefits of RAG?"}
})
print(result["llm"]["replies"][0])
Sample output:

RAG systems ground LLM outputs in retrieved documents, providing factual accuracy and source traceability. The key benefits include reduced hallucination, domain adaptability without retraining, and the ability to cite specific sources in generated responses.
Code Fragment 20.6.3: RAG pipeline with Haystack.
Key Insight

Haystack's explicit component wiring may feel verbose compared to LangChain's LCEL pipes, but it provides a major advantage for production systems: the pipeline graph can be serialized to YAML, versioned in Git, and reconstructed identically across environments. This makes Haystack pipelines highly reproducible and easy to audit, which matters in regulated industries where you must document exactly how your system processes data.

5. Framework Comparison

Each framework reflects a different philosophy about how RAG applications should be built. LangChain prioritizes breadth of integrations and developer velocity, LlamaIndex focuses on data-aware retrieval strategies, and Haystack emphasizes pipeline clarity and production robustness. The following table summarizes the key differences. Figure 20.6.2 compares the composition models used by each framework.

Dimension Comparison
Dimension          | LangChain                      | LlamaIndex                    | Haystack
Philosophy         | General-purpose LLM toolkit    | Data-centric RAG framework    | Pipeline-first NLP framework
Primary strength   | Breadth of integrations (700+) | Index types and query routing | Type-safe pipeline composition
Composition model  | LCEL pipes, LangGraph          | Query engines, routers        | DAG pipelines with typed I/O
Learning curve     | Moderate (many concepts)       | Lower for RAG tasks           | Lower (explicit data flow)
Agent support      | LangGraph (strong)             | AgentRunner (growing)         | Agent components (newer)
Production tooling | LangSmith tracing, LangServe   | Observability callbacks       | Hayhooks, pipeline YAML
Community size     | Largest (90k+ GitHub stars)    | Large (35k+ GitHub stars)     | Growing (17k+ GitHub stars)
API stability      | Frequent changes (improving)   | More stable core API          | Stable since Haystack 2.0
Best for           | Prototyping, diverse use cases | Complex data retrieval        | Production NLP pipelines
Figure 20.6.2: Each framework uses a different composition model. LangChain chains Runnables with pipe operators, LlamaIndex routes queries through a hierarchy of engines and indices, and Haystack wires typed components into an explicit DAG.

6. When to Use a Framework vs. Building from Scratch

Frameworks are not always the right choice. For simple RAG pipelines (embed, retrieve, generate), the overhead of learning and maintaining a framework may exceed the effort of writing the integration code yourself. The decision depends on several factors: pipeline complexity, team size, iteration speed requirements, and the need for component swapping. Code Fragment 20.6.4 below puts this into practice.

6.1 Choose a Framework When

- Your pipeline is complex, with many components (retrievers, embedders, rerankers, synthesizers) you expect to swap during evaluation.
- You are prototyping rapidly and want pre-built integrations instead of writing glue code for each vendor.
- Your team benefits from a shared vocabulary and standard interfaces across projects.

6.2 Build from Scratch When

- Your pipeline is simple and stable (embed, retrieve, generate) with known, fixed components.
- Latency is critical, since frameworks add overhead on every component call.
- You need deep customization, such as non-standard scoring functions, that fights against framework abstractions.
- You need minimal dependencies for lightweight deployments such as Lambda functions or edge computing.

Example 4: Minimal RAG Without a Framework

This snippet implements a minimal RAG pipeline using only the OpenAI SDK and the Chroma client directly, with no orchestration framework.


# Minimal RAG without a framework: embed, retrieve, and generate
# as three plain functions over the OpenAI and Chroma clients
import openai
import chromadb

# Direct API calls: no framework needed
client = openai.OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
 name="docs",
 metadata={"hnsw:space": "cosine"}
)

def embed(text: str) -> list[float]:
 """Get embedding from OpenAI."""
 response = client.embeddings.create(
 model="text-embedding-3-small",
 input=text
 )
 return response.data[0].embedding

def retrieve(query: str, k: int = 5) -> list[str]:
 """Retrieve top-k relevant documents."""
 results = collection.query(
 query_embeddings=[embed(query)],
 n_results=k
 )
 return results["documents"][0]

def generate(query: str, context_docs: list[str]) -> str:
 """Generate answer using retrieved context."""
 context = "\n\n".join(context_docs)
 response = client.chat.completions.create(
 model="gpt-4o",
 messages=[
 {"role": "system", "content": (
 "Answer based on the provided context. "
 "If the context is insufficient, say so."
 )},
 {"role": "user", "content": (
 f"Context:\n{context}\n\nQuestion: {query}"
 )}
 ],
 temperature=0
 )
 return response.choices[0].message.content

# The entire RAG pipeline in three function calls
query = "What are the key benefits of RAG?"
docs = retrieve(query)
answer = generate(query, docs)
print(answer)
Sample output:

The key benefits of RAG include: grounding LLM responses in retrieved factual documents to reduce hallucination, enabling access to current or proprietary information without model retraining, and providing source citations for verifiable and trustworthy answers.
Code Fragment 20.6.4: Minimal RAG without a framework.
Warning

Be cautious about deep framework coupling. If you use LangChain's custom prompt classes, LlamaIndex's specialized node postprocessors, and framework-specific serialization formats throughout your codebase, migrating to a different framework (or to raw API calls) becomes expensive. As a safeguard, keep your core business logic in plain Python functions that accept and return standard types (strings, lists, dictionaries). Use the framework for orchestration and wiring, not for your domain logic. This layered approach lets you swap the orchestration layer without rewriting your retrieval and generation logic. Figure 20.6.3 provides a decision tree for choosing the right approach.

Figure 20.6.3: Decision tree for choosing between building from scratch and selecting a RAG framework. Pipeline complexity and primary needs drive the decision. Pro tip: start with a framework to prototype and evaluate; once your pipeline is stable, replace framework wiring with direct API calls if latency or control matters.

7. Lab: Comparing Frameworks Side by Side

The best way to evaluate frameworks is to implement the same pipeline in each one and compare the developer experience. In this lab, we build identical RAG pipelines in LangChain and LlamaIndex, then measure lines of code, setup complexity, retrieval quality, and response latency. Code Fragment 20.6.5 below puts this into practice.

Example 5: Side-by-Side Comparison Test Harness

This snippet provides a test harness that runs the same query through multiple RAG pipelines and compares their outputs.


# Test harness: build the same RAG pipeline in LangChain and
# LlamaIndex, then compare latency and answer length per question
import time
import json

# --- LangChain Implementation ---
def build_langchain_rag(docs_path: str):
 from langchain_community.document_loaders import DirectoryLoader
 from langchain_text_splitters import RecursiveCharacterTextSplitter
 from langchain_openai import OpenAIEmbeddings, ChatOpenAI
 from langchain_community.vectorstores import Chroma
 from langchain_core.prompts import ChatPromptTemplate
 from langchain_core.output_parsers import StrOutputParser
 from langchain_core.runnables import RunnablePassthrough

 # Load and split
 loader = DirectoryLoader(docs_path, glob="**/*.txt")
 splitter = RecursiveCharacterTextSplitter(
 chunk_size=1024, chunk_overlap=200
 )
 chunks = splitter.split_documents(loader.load())

 # Index
 vectorstore = Chroma.from_documents(
 chunks, OpenAIEmbeddings(model="text-embedding-3-small")
 )

 # Build chain
 template = """Context: {context}\n\nQuestion: {question}\nAnswer:"""
 chain = (
 {
 "context": vectorstore.as_retriever(search_kwargs={"k": 5})
 | (lambda docs: "\n".join(d.page_content for d in docs)),
 "question": RunnablePassthrough()
 }
 | ChatPromptTemplate.from_template(template)
 | ChatOpenAI(model="gpt-4o", temperature=0)
 | StrOutputParser()
 )
 return chain

# --- LlamaIndex Implementation ---
def build_llamaindex_rag(docs_path: str):
 from llama_index.core import (
 VectorStoreIndex, SimpleDirectoryReader, Settings
 )
 from llama_index.core.node_parser import SentenceSplitter
 from llama_index.llms.openai import OpenAI
 from llama_index.embeddings.openai import OpenAIEmbedding

 Settings.llm = OpenAI(model="gpt-4o", temperature=0)
 Settings.embed_model = OpenAIEmbedding(
 model="text-embedding-3-small"
 )
 Settings.node_parser = SentenceSplitter(
 chunk_size=1024, chunk_overlap=200
 )

 documents = SimpleDirectoryReader(docs_path).load_data()
 index = VectorStoreIndex.from_documents(documents)
 query_engine = index.as_query_engine(similarity_top_k=5)
 return query_engine

# --- Comparison ---
test_questions = [
 "What are the main components of a RAG system?",
 "How does hybrid search improve retrieval?",
 "What are best practices for chunking documents?",
]

docs_path = "./test_data"

# Build both pipelines
lc_chain = build_langchain_rag(docs_path)
li_engine = build_llamaindex_rag(docs_path)

results = []
for question in test_questions:
 # LangChain timing
 start = time.time()
 lc_answer = lc_chain.invoke(question)
 lc_time = time.time() - start

 # LlamaIndex timing
 start = time.time()
 li_answer = li_engine.query(question)
 li_time = time.time() - start

 results.append({
 "question": question,
 "langchain_time": round(lc_time, 3),
 "llamaindex_time": round(li_time, 3),
 "langchain_answer_len": len(lc_answer),
 "llamaindex_answer_len": len(str(li_answer)),
 })

# Summary
print("Framework Comparison Results:")
print(json.dumps(results, indent=2))

avg_lc = sum(r["langchain_time"] for r in results) / len(results)
avg_li = sum(r["llamaindex_time"] for r in results) / len(results)
print(f"\nAvg LangChain latency: {avg_lc:.3f}s")
print(f"Avg LlamaIndex latency: {avg_li:.3f}s")
Sample output (truncated):

Framework Comparison Results:
[
  {
    "question": "What are the main components of a RAG system?",
    "langchain_time": 2.341,
    "llamaindex_time": 2.187,
    "langchain_answer_len": 412,
    "llamaindex_answer_len": 389
  },
  ...
]

Avg LangChain latency: 2.456s
Avg LlamaIndex latency: 2.312s
Code Fragment 20.6.5: Side-by-side framework comparison test harness.
Note

To deepen your comparison, try these extensions: (1) Add a Haystack implementation as a third pipeline and compare all three. (2) Swap the vector store from Chroma to FAISS or Pinecone and measure how much framework code changes in each case. (3) Add a reranker step (such as Cohere Rerank) to each pipeline and compare the integration effort. (4) Test with larger document sets (1,000+ documents) to measure indexing performance differences. (5) Evaluate answer quality using an LLM judge that scores relevance and completeness for each framework's output.

8. Production Considerations

Moving a framework-based RAG pipeline from prototype to production introduces additional requirements: observability, error handling, caching, rate limiting, and deployment packaging. Each framework addresses these concerns differently.

8.1 Observability and Tracing

LangChain offers LangSmith, a hosted tracing platform that records every step of your pipeline (retriever calls, LLM requests, latency breakdowns). LlamaIndex provides callback handlers and integrations with observability platforms like Arize and Weights & Biases. Haystack pipelines can export their graph structure as YAML, making it straightforward to visualize and audit the processing flow. Regardless of framework, production RAG systems should log the query, retrieved documents, generated answer, and latency for every request.
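Whatever the framework, that logging contract can be expressed as a thin wrapper. This is a framework-agnostic sketch with illustrative field names; `logged_rag` and its stub components are hypothetical:

```python
# Minimal per-request RAG logging, framework-agnostic.
# Field names are illustrative; adapt to your logging pipeline.
import json
import time

def logged_rag(query: str, retrieve, generate, log=print):
    """Run retrieve + generate, logging query, docs, answer, latency."""
    start = time.perf_counter()
    docs = retrieve(query)
    answer = generate(query, docs)
    log(json.dumps({
        "query": query,
        "retrieved": docs,          # or doc ids/scores in production
        "answer": answer,
        "latency_s": round(time.perf_counter() - start, 3),
    }))
    return answer

# Stub components standing in for real retrieval and generation
fake_retrieve = lambda q: ["doc-a", "doc-b"]
fake_generate = lambda q, docs: f"answer using {len(docs)} docs"
logged_rag("what is RAG?", fake_retrieve, fake_generate)
```

In production, `log` would write structured records to your tracing backend rather than stdout.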

8.2 Error Handling and Fallbacks

Production pipelines must handle failures gracefully. Common failure modes include embedding API timeouts, vector store connection errors, and LLM rate limiting. Frameworks provide varying levels of built-in retry logic. LangChain supports configurable retry with exponential backoff on all Runnable components. LlamaIndex provides retry logic through its service context. Haystack lets you define fallback components in the pipeline graph. For any framework, you should also implement application-level fallbacks (returning cached results, falling back to a simpler model, or showing a helpful error message).
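An application-level retry with exponential backoff is also straightforward to write yourself. The wrapper below is a generic sketch (attempt counts and delays are illustrative), demonstrated against a stubbed flaky embedding call:

```python
# Generic retry wrapper with exponential backoff -- an application-level
# fallback pattern that works with any framework. Parameters are illustrative.
import time

def with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    def wrapped(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise              # out of retries: surface the error
                time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return wrapped

# Demonstration: a stub that fails twice, then succeeds
calls = {"n": 0}
def flaky_embed(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("embedding API timeout")
    return [0.1, 0.2]

safe_embed = with_retry(flaky_embed, attempts=3, base_delay=0.01)
print(safe_embed("hello"))  # succeeds on the third attempt
```

A fuller version would catch only retryable exceptions (timeouts, rate limits) and add jitter to the delays.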

8.3 Caching Strategies

Embedding computation and LLM calls are the most expensive operations in a RAG pipeline. Caching these results can dramatically reduce both cost and latency. All three frameworks support caching at multiple levels: embedding caches (avoid re-embedding identical text), retrieval caches (return the same documents for identical queries), and LLM caches (return the same answer for identical prompts). For production systems, Redis or a similar distributed cache is recommended over in-memory caching to support horizontal scaling.
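An embedding cache is a small amount of code regardless of framework. The sketch below (the `EmbeddingCache` class is illustrative) keys entries on a content hash and uses an in-process dict; in production you would back it with Redis, as noted above:

```python
# Sketch of an embedding cache keyed by content hash. The dict stands in
# for Redis; the structure is the same either way.
import hashlib

class EmbeddingCache:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}   # swap for Redis in prod
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(text)  # only embed on a miss
        return self.store[key]

# Stub embedder standing in for a real API call
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("hello"); cache.get("hello"); cache.get("world")
print(f"hits={cache.hits} misses={cache.misses}")  # hits=1 misses=2
```

Retrieval and LLM caches follow the same shape, with the hash computed over the query or the full prompt respectively.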

Key Insight

The most pragmatic approach to framework adoption is the "prototype with, produce without" pattern. Use a framework during the exploration phase to rapidly test different retrieval strategies, embedding models, and LLM configurations. Once you have identified the winning combination, evaluate whether the framework's abstractions are still earning their keep. For simple, stable pipelines, replacing the framework layer with direct API calls often yields a faster, more maintainable system. For complex pipelines with many components, the framework's orchestration value usually justifies its continued use.

Self-Check
Q1: What is the main advantage of using LCEL (LangChain Expression Language) over manually wiring chains?
Show Answer
LCEL provides a declarative composition syntax using the pipe operator that automatically grants every pipeline streaming, batching, and async execution. When you compose components with LCEL, each step implements the Runnable interface, which means invoke, stream, batch, and ainvoke work automatically without extra code. Manually wired chains require implementing these capabilities separately for each pipeline variant.
Q2: How does LlamaIndex's RouterQueryEngine improve retrieval for diverse question types?
Show Answer
The RouterQueryEngine examines each incoming question and dispatches it to the most appropriate sub-query engine based on the question's characteristics. For example, factual lookup questions can be routed to a VectorStoreIndex for semantic search, summary questions to a TreeIndex for hierarchical summarization, and data questions to a SQL query engine. This means a single application can handle diverse query types effectively, rather than forcing all questions through the same retrieval strategy.
Q3: What makes Haystack's pipeline architecture particularly well-suited for regulated industries?
Show Answer
Haystack's key advantage for regulated industries is its strict typing and serializability. Components declare typed inputs and outputs, and the pipeline validates connections at build time. The entire pipeline can be serialized to YAML, versioned in Git, and reconstructed identically across environments. This makes pipelines reproducible and auditable, which is essential when regulations require documenting exactly how data is processed, who approved the pipeline configuration, and ensuring consistent behavior across deployments.
Q4: When should you build a RAG pipeline from scratch instead of using a framework?
Show Answer
Build from scratch in four scenarios: (1) Your pipeline is simple and stable with known, fixed components (direct API calls are simpler). (2) Latency is critical, since frameworks add 5 to 50ms overhead per component call. (3) You need deep customization with non-standard scoring functions or pipeline patterns that fight against framework abstractions. (4) You need minimal dependencies for lightweight deployments like Lambda functions or edge computing, where framework transitive dependencies add unwanted bloat.
Q5: What caching strategies should production RAG systems implement to reduce cost and latency?
Show Answer
Production systems should implement caching at three levels: (1) Embedding caches that avoid re-computing embeddings for text that has already been embedded. (2) Retrieval caches that return the same set of documents for identical or near-identical queries without hitting the vector store. (3) LLM response caches that return the same answer when the same prompt (including context) is seen again. For horizontal scaling, these caches should use distributed systems like Redis rather than in-memory caches, and cache invalidation policies should account for document updates in the underlying knowledge base.
Real-World Scenario: Choosing Between LangChain and LlamaIndex for a Production RAG System

Who: A tech lead at an insurance company building a claims processing assistant

Situation: The team needed to build a RAG system that searched 2 million policy documents, applied business rules for claims validation, and generated structured determination letters. A working prototype was needed in 6 weeks.

Problem: Both LangChain and LlamaIndex could handle the core RAG pipeline, but the team worried about framework lock-in. Previous experience with a LangChain v0.1 prototype required significant rework when v0.2 changed core abstractions.

Dilemma: LangChain offered richer agent tooling and LCEL for composable chains. LlamaIndex provided better out-of-the-box document indexing with its node/index abstractions. Building from scratch with raw API calls would avoid lock-in but slow development by 3 to 4 weeks.

Decision: They used LlamaIndex for the ingestion and retrieval layer (leveraging its document parsers and hierarchical index), but kept all business logic (claims rules, letter templates, approval workflows) in plain Python functions that accepted standard dictionaries.

How: A thin adapter layer translated between LlamaIndex's NodeWithScore objects and the team's internal ClaimContext dataclass. This meant the business logic never imported LlamaIndex directly.

Result: The prototype shipped in 5 weeks. When LlamaIndex released a breaking change to its retriever API three months later, the migration required updating only the adapter layer (47 lines of code) rather than the entire application.

Lesson: Use frameworks for what they do best (parsing, indexing, orchestration) but isolate your domain logic behind adapter layers. This gives you framework velocity without framework lock-in.
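The adapter layer from this scenario can be sketched as follows. `ClaimContext` and the node field names are hypothetical, mirroring the shapes described above; the `_Node` classes are stand-ins for the framework objects:

```python
from dataclasses import dataclass

@dataclass
class ClaimContext:
    """Framework-free shape used by all business logic (hypothetical)."""
    text: str
    source: str
    score: float

def to_claim_contexts(nodes_with_scores):
    """Adapter: the only code that touches the framework's objects.

    Each item mimics LlamaIndex's NodeWithScore (a `.node` carrying text
    and metadata, plus a `.score`). When the framework's API changes,
    only this function needs updating.
    """
    return [
        ClaimContext(
            text=n.node.text,
            source=n.node.metadata.get("source", "unknown"),
            score=n.score,
        )
        for n in nodes_with_scores
    ]

# Minimal stand-ins for framework objects, for demonstration only.
class _Node:
    def __init__(self, text, metadata):
        self.text, self.metadata = text, metadata

class _NodeWithScore:
    def __init__(self, node, score):
        self.node, self.score = node, score

raw = [_NodeWithScore(_Node("Policy A covers floods.",
                            {"source": "policy_a.pdf"}), 0.91)]
ctx = to_claim_contexts(raw)
print(ctx[0].source, ctx[0].score)  # policy_a.pdf 0.91
```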

9. Compound AI Systems and DSPy

The RAG pipelines discussed throughout this chapter represent a broader trend: the shift from monolithic LLMs (a single model answering all queries) to compound AI systems (multi-component architectures where each component handles a specific subtask). Berkeley's Compound AI Systems manifesto (Zaharia et al., 2024) argues that the most effective AI systems combine multiple models, retrievers, tools, and verifiers into orchestrated pipelines, and that this compositional approach consistently outperforms scaling a single model alone.

9.1 The Compound AI Architecture

A compound AI system decomposes a complex task into specialized stages. A typical RAG system is already a compound system with three core components: a retriever that finds relevant documents, an optional reranker that refines the ranking, and a generator that produces the final answer. More sophisticated systems add a query rewriter that reformulates the user's question for better retrieval, a verifier that checks the answer for faithfulness, and a router that selects between different retrieval strategies based on query type.

The key advantage of compound systems is that each component can be optimized independently. You can swap a cheaper embedding model for a better one without touching the generator, add a reranker without changing the retriever, or replace the generator with a smaller model for latency-sensitive queries. This modularity also enables targeted evaluation: you can measure retrieval quality independently from generation quality, diagnosing exactly where failures occur.

9.2 DSPy: Programming (Not Prompting) LLM Pipelines

DSPy (Khattab et al., 2024) from Stanford reimagines LLM pipelines as programs rather than prompt templates. Instead of manually crafting prompts, you define signatures (input/output specifications) and modules (processing steps), then let a compiler optimize the prompts, few-shot examples, and even the choice of LLM for each module based on evaluation metrics.

The DSPy workflow follows four steps: (1) define signatures describing what each module should do ("question, context -> answer"), (2) compose modules into a pipeline, (3) provide a small set of training examples and an evaluation metric, and (4) run the compiler, which searches over prompt strategies, few-shot selections, and module configurations to maximize the metric. This connects directly to the automated prompt optimization ideas from Section 11.3.
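Step 4, the compile step, can be illustrated with a toy search over candidate configurations. This shows only the optimization idea, not DSPy's actual API; all names and the uppercasing task are invented for the sketch:

```python
def compile_pipeline(candidates, examples, metric, run):
    """Pick the candidate configuration that maximizes a metric.

    `candidates` are prompt/few-shot configurations, `examples` are
    (input, expected) pairs, and `run(config, x)` executes the pipeline.
    DSPy's real compiler searches a much richer space (few-shot
    selections, module settings), but the loop is the same shape.
    """
    def score(config):
        return sum(metric(run(config, x), y) for x, y in examples)
    return max(candidates, key=score)

# Toy setup: two "prompt strategies" evaluated on two labeled examples.
examples = [("rag", "RAG"), ("llm", "LLM")]
candidates = [{"style": "identity"}, {"style": "upper"}]
run = lambda cfg, x: x.upper() if cfg["style"] == "upper" else x
metric = lambda pred, gold: int(pred == gold)

best = compile_pipeline(candidates, examples, metric, run)
print(best)  # {'style': 'upper'}
```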

9.3 When Compound Systems Beat Single Models

Table: When Compound Systems Beat Single Models

| Scenario | Single Model | Compound System |
|---|---|---|
| Knowledge-intensive QA | Relies on parametric knowledge (hallucinates) | Retriever grounds the answer in documents |
| Multi-step reasoning | Single pass may miss steps | Decomposer + solver + verifier catches errors |
| High reliability required | No self-checking | Generator + NLI verifier ensures faithfulness |
| Heterogeneous data | One retrieval strategy for all | Router selects vector, keyword, or SQL per query |
| Cost optimization | Same large model for everything | Cheap model for easy queries, large model for hard ones |

The compound approach connects naturally to multi-agent systems (Chapter 24), where specialized agents collaborate to solve complex tasks. The difference is one of framing: compound AI systems emphasize data-flow pipelines with compiled optimization, while multi-agent systems emphasize autonomous agents with message-passing coordination. In practice, many production systems blend both paradigms.

Key Insight

The shift to compound AI systems reflects a fundamental insight: LLMs are better as components than as complete solutions. A 7B model in a well-designed compound system (with retrieval, reranking, and verification) often outperforms a 70B model used alone, at a fraction of the cost. The engineering challenge shifts from "pick the best model" to "design the best system," which favors teams with strong software engineering skills alongside ML expertise.

10. RAG Poisoning and Retrieval-Layer Security

RAG systems inherit a unique class of vulnerabilities: attacks that target the retrieval layer itself. Unlike prompt injection (which targets the LLM), retrieval-layer attacks manipulate which documents the model sees, corrupting the context before generation even begins. As RAG pipelines connect to external data sources such as wikis, customer databases, and web crawlers, the attack surface grows with every new data connection.

10.1 Attack Vectors

PoisonedRAG. An adversary crafts documents specifically designed to be retrieved for target queries. The poisoned documents contain false information, biased framing, or embedded prompt injections. Because the retrieval model selects documents by semantic similarity rather than truthfulness, a well-crafted adversarial document can outrank legitimate sources. Research by Zou et al. (2024) demonstrated that injecting as few as five poisoned documents into a corpus of 10,000 can flip RAG answers on targeted queries with over 90% success rate.

Embedding space attacks. Adversarial perturbations to document text can manipulate cosine similarity scores without changing the document's apparent meaning to a human reader. By adding invisible Unicode characters, strategic synonym substitutions, or appended trigger phrases, an attacker can boost a document's similarity to specific queries. These perturbations exploit the gap between what embeddings measure (surface-level semantic proximity) and what humans judge (factual relevance and trustworthiness).

Retrieval jamming. Instead of targeting specific queries, an attacker floods the index with high-similarity noise documents. These documents are semantically close to many queries but contain no useful information, diluting retrieval quality across the board. This is the retrieval equivalent of a denial-of-service attack: the system still returns results, but they are degraded enough to make the RAG pipeline unreliable.

10.2 Defense Strategies

Defending against retrieval-layer attacks requires controls at multiple points in the pipeline: content validation, provenance tagging, and anomaly detection at the ingestion boundary; trust scoring of sources at retrieval time; and cross-document consistency checks before the context reaches the generator. The following scenario shows how these layers combine in practice.

Real-World Scenario: Trust-Scored Retrieval Pipeline

A production RAG system for a financial services firm implements a three-stage retrieval pipeline. Stage 1: standard vector similarity retrieval returns the top 20 candidates. Stage 2: a trust scorer assigns each candidate a composite score based on source reputation (internal documents score 1.0, verified partner feeds score 0.8, web-crawled content scores 0.4), document freshness (exponential decay over 90 days), and cross-document consistency (documents that contradict the majority of other retrieved results receive a penalty). Stage 3: the final ranking combines semantic similarity (60% weight) with trust score (40% weight), and the top 5 documents are passed to the generator. Documents with trust scores below 0.3 are excluded entirely, regardless of their similarity score.
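Stage 3 of this pipeline can be sketched directly from the numbers given (60/40 weighting, 0.3 trust floor, top 5 kept); the dict field names are assumptions for the sketch:

```python
def trust_rerank(candidates, sim_weight=0.6, trust_weight=0.4,
                 trust_floor=0.3, top_k=5):
    """Blend semantic similarity with trust, per the scenario's weights.

    Each candidate is a dict with `similarity` and `trust` in [0, 1]
    (field names are assumptions). Documents below the trust floor are
    dropped regardless of how similar they are to the query.
    """
    kept = [c for c in candidates if c["trust"] >= trust_floor]
    ranked = sorted(
        kept,
        key=lambda c: sim_weight * c["similarity"] + trust_weight * c["trust"],
        reverse=True,
    )
    return ranked[:top_k]

docs = [
    {"id": "internal", "similarity": 0.80, "trust": 1.0},
    {"id": "partner", "similarity": 0.85, "trust": 0.8},
    {"id": "web", "similarity": 0.95, "trust": 0.4},
    {"id": "poisoned", "similarity": 0.99, "trust": 0.1},  # below the floor
]
print([d["id"] for d in trust_rerank(docs)])  # ['internal', 'partner', 'web']
```

Note how the highest-similarity document is excluded entirely: this is exactly the property that blunts PoisonedRAG-style attacks, which win on similarity but not on provenance.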

Warning

Every external data connection in your RAG pipeline is an entry point for adversarial content. A web crawler, a customer-facing upload endpoint, a Slack integration, or a shared knowledge base can all be vectors for document poisoning. Treat every ingestion source as untrusted input. Apply content validation, provenance tagging, and anomaly detection at the ingestion boundary, not just at retrieval time. The retrieval layer should be the last line of defense, not the only one.
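One concrete ingestion-boundary check suggested by the embedding-space attacks above is flagging invisible or zero-width characters before a document ever reaches the index. A minimal sketch using the standard library:

```python
import unicodedata

def find_invisible_chars(text):
    """Return positions of format/zero-width characters that could carry
    embedding-space perturbations invisible to a human reviewer."""
    suspicious = []
    for i, ch in enumerate(text):
        # Category Cf covers zero-width spaces, joiners, direction marks.
        if unicodedata.category(ch) == "Cf":
            suspicious.append((i, f"U+{ord(ch):04X}"))
    return suspicious

clean = "Standard policy wording."
poisoned = "Standard\u200bpolicy\u200cwording."  # zero-width chars inserted
print(find_invisible_chars(clean))     # []
print(find_invisible_chars(poisoned))  # [(8, 'U+200B'), (15, 'U+200C')]
```

A real ingestion pipeline would pair this with provenance tagging and statistical anomaly detection; this check alone only catches one perturbation family.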

Exercises

These exercises compare RAG frameworks and help you decide when to use one.

Exercise 20.6.1: Framework motivation Conceptual

List three benefits and two drawbacks of using a RAG framework like LangChain instead of building from raw API calls.

Show Answer

Benefits: (a) faster development with pre-built components, (b) easy swapping of embedding models, vector stores, and LLMs, (c) built-in patterns for common tasks (summarization, QA). Drawbacks: (a) abstraction hides important details, making debugging harder, (b) framework updates can break your code, introducing dependency risk.

Exercise 20.6.2: LangChain vs. LlamaIndex Conceptual

Compare the core philosophies of LangChain and LlamaIndex. When would you choose one over the other?

Show Answer

LangChain is agent-first: designed for chaining arbitrary tools, actions, and LLM calls. LlamaIndex is data-first: optimized for ingesting, indexing, and querying data. Choose LangChain for complex agentic workflows, LlamaIndex for data-heavy RAG applications.

Exercise 20.6.3: Abstraction tradeoff Conceptual

A framework's retriever abstraction hides the vector database implementation. When does this abstraction help, and when does it hurt?

Show Answer

Helps: rapid prototyping, easy A/B testing of different vector stores, team members do not need to learn each database's API. Hurts: when you need fine-grained control over index parameters, when the abstraction does not expose a feature you need (e.g., custom metadata filtering), or when debugging retrieval quality issues.

Exercise 20.6.4: DSPy approach Conceptual

Explain how DSPy differs from LangChain in its approach to prompt optimization. What problem does DSPy's compiler solve?

Show Answer

DSPy treats prompts as learnable parameters optimized by a compiler. Instead of manually writing prompt templates, you define input/output signatures and provide examples. The compiler searches over prompt strategies and few-shot selections to maximize a metric. This solves the brittle prompt engineering problem.

Exercise 20.6.5: Build vs. buy Conceptual

Your team is building a production RAG system for a financial services company. The system must meet strict latency (under 500ms), accuracy, and compliance requirements. Would you use a framework or build from scratch? Justify your decision.

Show Answer

For strict requirements, a hybrid approach works best: use a framework for rapid prototyping and evaluation, then replace framework components with custom implementations where you need control (e.g., custom retriever for latency, custom prompt for compliance). This gets the speed of a framework with the control of custom code.

Exercise 20.6.6: LangChain quickstart Coding

Build a RAG pipeline using LangChain: load a PDF, split into chunks, embed with OpenAI embeddings, store in FAISS, and create a RetrievalQA chain. Answer 5 questions.

Exercise 20.6.7: LlamaIndex comparison Conceptual

Rebuild the same pipeline using LlamaIndex. Compare the code complexity, default behaviors (chunk size, retrieval strategy), and answer quality.

Exercise 20.6.8: Custom retriever Coding

In LangChain, implement a custom retriever that combines BM25 and dense search using RRF. Plug it into the same QA chain and compare results against the default dense-only retriever.

Exercise 20.6.9: Framework-free RAG Coding

Build a complete RAG pipeline without any framework: direct calls to an embedding API, manual ChromaDB operations, hand-crafted prompt templates, and direct LLM API calls. Compare the code size, flexibility, and performance against the LangChain version.

Research Frontier

DSPy (Stanford, 2024) is pioneering a compiler-based approach to RAG pipeline optimization, automatically tuning prompts and few-shot examples based on evaluation metrics rather than manual engineering. LlamaIndex Workflows provide event-driven RAG orchestration that supports complex branching, parallelism, and error recovery. Haystack 2.0 introduces a component-based pipeline architecture with strong typing and serialization. Research into declarative RAG frameworks is exploring SQL-like query languages for expressing retrieval and generation logic, making RAG pipelines more composable and testable. Compound AI optimization is an active research direction, with work on end-to-end pipeline tuning, automatic component selection, and cost-aware routing that allocates expensive models only to queries that need them.

What Comes Next

In the next chapter, Chapter 21: Building Conversational AI Systems, we apply these retrieval and generation techniques to building complete conversational AI systems.

References & Further Reading

LangChain Documentation.

The most popular LLM application framework, with extensive RAG components including document loaders, splitters, and retrievers. Offers both Python and JavaScript libraries. The default choice for rapid RAG prototyping.

Tool

LlamaIndex Documentation.

A data framework purpose-built for RAG with advanced indexing, query engines, and response synthesis. Excels at structured data integration and complex retrieval strategies. Best for projects with sophisticated data ingestion needs.

Tool

Haystack by deepset.

A production-ready NLP framework with modular pipeline architecture for search and RAG. Strong emphasis on deployment and scalability. Recommended for enterprise teams building production search systems.

Tool

DSPy: Programming (not Prompting) Foundation Models.

A framework from Stanford that replaces manual prompting with programmatic optimization of LLM pipelines. Automatically tunes retrieval and generation steps. Ideal for researchers and advanced practitioners.

Tool

Semantic Kernel by Microsoft.

Microsoft's SDK for integrating LLMs into applications, with strong .NET and Python support. Includes planners, memory, and plugin systems. Best for teams in the Microsoft ecosystem.

Tool

Flowise: Drag-and-drop LLM Flow Builder.

A visual tool for building LLM workflows and RAG pipelines without writing code. Supports custom nodes and integrations. Perfect for non-developers or rapid prototyping of RAG flows.

Tool