Production RAG Pipelines, Evaluation & Topic Modeling

Section 31.7

A chunk is not a deliverable. A maintained, evaluated, topic-aware index is.

VecVec, Pipeline-Minded AI Agent
Big Picture

Section 31.4 established how to chunk a document well. This continuation answers what comes after the first batch run: how to keep an index fresh as documents change (incremental indexing, metadata enrichment), how to systematically evaluate chunking choices against retrieval metrics, and how to repurpose the same embeddings into topic models (BERTopic) that surface latent themes in a corpus. Together, these three layers move a chunking strategy from a one-off prototype into a maintained production retrieval system.

Prerequisites

This section continues from Section 31.4: Document Processing & Chunking, which established the chunking strategies (fixed-size, recursive, semantic, structure-aware, parent-child) that this section now operationalizes for production. Familiarity with the embedding model fundamentals in Section 31.1 and the vector database systems in Section 31.5 is assumed.

31.7.1 Production RAG ETL Pipelines

A production ingestion pipeline must handle document updates, deletions, and versioning in addition to initial loading. The key engineering challenges include:

Incremental Indexing

When documents are updated, you must re-chunk and re-embed only the changed documents, not the entire corpus. This requires tracking document versions (typically via content hashes or timestamps) and maintaining a mapping between source documents and their chunks in the vector database.

# Incremental indexing with content hashing
import hashlib
import json
from typing import Dict, List, Optional
from pathlib import Path


class IncrementalIndexer:
    """
    Tracks document versions to enable incremental re-indexing.
    Only processes documents that have changed since the last run.
    """

    def __init__(self, state_file: str = "indexer_state.json"):
        self.state_file = Path(state_file)
        self.state: Dict[str, str] = {}
        if self.state_file.exists():
            self.state = json.loads(self.state_file.read_text())

    def content_hash(self, content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()

    def get_changes(
        self, documents: Dict[str, str]
    ) -> Dict[str, List[str]]:
        """
        Compare current documents against stored state.

        Args:
            documents: dict of {doc_id: content}

        Returns:
            {"added": [...], "modified": [...], "deleted": [...]}
        """
        current_ids = set(documents.keys())
        stored_ids = set(self.state.keys())
        added = current_ids - stored_ids
        deleted = stored_ids - current_ids
        modified = set()
        for doc_id in current_ids & stored_ids:
            new_hash = self.content_hash(documents[doc_id])
            if new_hash != self.state[doc_id]:
                modified.add(doc_id)
        return {
            "added": list(added),
            "modified": list(modified),
            "deleted": list(deleted),
        }

    def update_state(self, documents: Dict[str, str]):
        """Update stored hashes after successful indexing."""
        for doc_id, content in documents.items():
            self.state[doc_id] = self.content_hash(content)
        self.state_file.write_text(json.dumps(self.state, indent=2))

    def process_changes(self, documents: Dict[str, str]):
        """Main entry point for incremental processing."""
        changes = self.get_changes(documents)
        print(f"Added: {len(changes['added'])} documents")
        print(f"Modified: {len(changes['modified'])} documents")
        print(f"Deleted: {len(changes['deleted'])} documents")
        # For added/modified: chunk, embed, upsert
        to_process = changes["added"] + changes["modified"]
        if to_process:
            print(f"Processing {len(to_process)} documents...")
            # chunk_and_embed(to_process)
            # vector_db.upsert(chunks)
        # For deleted: remove from vector DB
        if changes["deleted"]:
            print(f"Removing {len(changes['deleted'])} documents...")
            # vector_db.delete(filter={"doc_id": {"$in": changes["deleted"]}})
        # For modified: also remove old chunks before upserting new ones
        if changes["modified"]:
            print(f"Replacing chunks for {len(changes['modified'])} documents...")
            # vector_db.delete(filter={"doc_id": {"$in": changes["modified"]}})
            # vector_db.upsert(new_chunks)
        self.update_state(documents)


# Usage
indexer = IncrementalIndexer()
docs = {
    "report_2024.pdf": "Full text of the 2024 report...",
    "manual_v3.pdf": "Updated product manual content...",
    "faq.md": "Frequently asked questions...",
}
indexer.process_changes(docs)
Output: Added: 3 documents Modified: 0 documents Deleted: 0 documents Processing 3 documents...
Code Fragment 31.7.1: Incremental indexing with content hashing

Metadata Enrichment

Every chunk should carry metadata that enables effective filtering and attribution. Essential metadata fields include:

Warning: Common Chunking Mistakes

The most common mistakes in document processing are: (1) Not evaluating chunking quality by measuring retrieval performance with different strategies and parameters on representative queries. (2) Ignoring document structure by applying the same chunking strategy to all document types. (3) Losing metadata context by stripping headers, section titles, or table captions during chunking. (4) Using the default settings of your framework without tuning chunk size and overlap for your specific content and queries. (5) Not handling tables and figures as special elements that should either be kept intact or described textually.

31.7.2 Evaluation and Iteration

Chunking is not a one-time configuration; it requires ongoing evaluation and tuning. The most effective approach is to build a small evaluation set of 50 to 100 representative queries with known relevant passages, then measure retrieval metrics (recall@k, MRR, NDCG) across different chunking configurations. Systematic A/B testing of chunking strategies often reveals that the optimal configuration depends heavily on the document type and query patterns specific to your application. Figure 31.7.1a shows the iterative evaluation loop for chunking quality.

Diagram: Evaluation and Iteration
Figure 31.7.1b: Chunking quality requires iterative evaluation against representative queries with ground-truth relevance judgments.
Tip: Tune HNSW Parameters for Your Use Case

The default HNSW index parameters (ef_construction, M) work for prototyping but not production. Higher ef_construction (256 to 512) improves recall at index build time cost; higher M (32 to 64) improves search quality at memory cost. Tune these based on your recall requirements.

Real-World Scenario
Optimizing Chunk Size for a Medical Knowledge Base

Who: An NLP engineer at a health-tech company building a clinical decision support tool

Situation: The system indexed 120,000 medical journal articles, clinical guidelines, and drug interaction databases. Physicians queried it during patient consultations expecting precise, citation-worthy answers.

Problem: Using a fixed 512-token chunk size produced fragments that split drug dosage tables, broke apart multi-step treatment protocols, and lost critical context about contraindications.

Dilemma: Larger chunks (1,024 tokens) preserved context but reduced retrieval precision because irrelevant content diluted the embedding signal. Smaller chunks (256 tokens) improved precision but often omitted the surrounding clinical context physicians needed.

Decision: The team implemented semantic chunking using section headers and paragraph boundaries, with a target range of 300 to 600 tokens per chunk. They added 2-sentence overlap between adjacent chunks and stored parent document IDs for context expansion at retrieval time.

How: A custom parser detected document structure (headers, lists, tables) and kept logical units intact. Tables were chunked as single units regardless of token count. Metadata (article title, section name, publication year) was prepended to each chunk before embedding.

Result: Answer accuracy (judged by physicians) improved from 71% to 89%. Retrieval precision@5 rose from 0.54 to 0.78, and physicians reported that returned passages were "immediately useful" rather than requiring manual context reconstruction.

Lesson: Chunking strategy should respect document structure rather than applying arbitrary token boundaries. Preserving logical units (tables, protocols, lists) and adding metadata context produces dramatically better retrieval quality.

31.7.3 Topic Modeling with LLM Embeddings

Topic modeling discovers the latent themes in a collection of documents without requiring labeled data. Classical approaches like LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization) operate on bag-of-words representations, which discard word order and semantic nuance. BERTopic (Grootendorst, 2022) replaces this with a pipeline built on the same embedding models used for retrieval, producing topics that are semantically coherent and interpretable. Understanding BERTopic is valuable because the same embeddings you create for RAG (as covered in Section 31.1) can power topic discovery, clustering, and content organization without additional model training.

31.7.3.1 The BERTopic Pipeline

BERTopic operates in four sequential stages, each handled by a separate, swappable component:

  1. Embed: Convert each document to a dense vector using a sentence softmax model (the same models from Section 31.1).
  2. Reduce: Project the high-dimensional embeddings into a lower-dimensional space using UMAP, preserving local neighborhood structure while making clustering feasible.
  3. Cluster: Group similar documents using HDBSCAN, a density-based clustering algorithm that automatically determines the number of clusters and identifies outliers (documents that do not fit any topic).
  4. Represent: Label each cluster with descriptive terms using c-TF-IDF (class-based TF-IDF) or, optionally, an LLM that generates human-readable topic labels from the cluster's representative documents.
BERTopic four-stage pipeline: documents flow through embed (sentence transformer), reduce (UMAP from 384 to 5 dimensions), cluster (HDBSCAN density clustering), and represent (c-TF-IDF keyword labels).
Figure 31.7.4: The BERTopic pipeline. Each stage is a swappable component, so the same architecture supports any embedding model, any reducer (UMAP, PCA), any clustering algorithm (HDBSCAN, k-means), and any labeling strategy (c-TF-IDF, KeyBERT, LLM-generated names).

The labeling stage uses a class-based variant of TF-IDF (Grootendorst, 2022) that treats each cluster as a single super-document. For class $c$ and term $t$, the c-TF-IDF score is:

$$\mathrm{c\text{-}TF\text{-}IDF}(t, c) = \mathrm{tf}_{t,c} \cdot \log\!\left(1 + \frac{A}{\mathrm{tf}_t}\right)$$

where $\mathrm{tf}_{t,c}$ is the frequency of term $t$ inside cluster $c$, $\mathrm{tf}_t$ is its frequency across the whole corpus, and $A$ is the average number of words per cluster. The $\log(1 + A/\mathrm{tf}_t)$ factor penalizes terms that appear everywhere (stopwords) and rewards terms that are concentrated inside one cluster. Sorting terms by c-TF-IDF and keeping the top 10 gives the keyword label that BERTopic reports.

Worked Example: BERTopic on 500 Support Tickets

A SaaS company runs BERTopic on 500 customer-support tickets. The four stages produce:

The same five-keyword summary that would have taken a human analyst a full afternoon to produce arrives in under 15 seconds. Re-running with an LLM representation_model upgrades the keyword lists to readable names ("Billing and Refunds", "Account Access", "Data Export").

Under the Hood: HDBSCAN density clustering (BERTopic)

HDBSCAN clusters by density rather than by a preset cluster count. It first reweights distances into a 'mutual reachability' metric that pushes sparse points apart, then builds a minimum spanning tree and cuts it into a hierarchy of nested clusters at every density threshold. Instead of slicing that hierarchy at one fixed level, it selects the clusters that persist over the widest range of densities, a stability criterion that automatically yields a sensible number of clusters of varying size. Points that never join a dense-enough group are labeled noise rather than forced into a topic. This is why BERTopic discovers topic count on its own and cleanly sets aside off-topic documents that LDA would still assign.

31.7.3.2 BERTopic vs. Classical Topic Models

Table 31.7.2: 8.2 BERTopic vs. Classical Topic Models (as of 2026).
Dimension LDA NMF BERTopic
Input representation Bag-of-words (BoW) TF-IDF matrix Dense embeddings (sentence transformers)
Semantic awareness None (word co-occurrence only) Minimal (term weighting) Full (contextual embeddings)
Number of topics Must be specified upfront Must be specified upfront Automatically determined by HDBSCAN
Short text handling Poor (sparse BoW vectors) Poor Good (dense embeddings capture meaning)
Topic coherence Moderate Good Excellent (semantically grouped)
Scalability Good (efficient inference) Good Moderate (embedding step is the bottleneck)
Dynamic topics Not natively supported Not natively supported Built-in support for topics over time

31.7.3.3 Practical Example

This snippet demonstrates a practical hybrid search query that combines dense and sparse retrieval.

# BERTopic: embedding-based topic modeling
# Uses the same sentence transformers as RAG embedding pipelines
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
# Step 1: Configure each pipeline component
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(
    n_neighbors=15, n_components=5,
    min_dist=0.0, metric="cosine",
    )
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    metric="euclidean",
    prediction_data=True,
    )
# Step 2: Build the BERTopic model with custom components
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
    )
# Step 3: Fit on your documents (e.g., customer support tickets)
documents = [
    "My order has not arrived after two weeks",
    "How do I reset my password for the dashboard?",
    "The API returns a 500 error on large payloads",
    "I was charged twice for the same subscription",
    "Can I export my data as a CSV file?",
    # ... thousands more documents
    ]
topics, probs = topic_model.fit_transform(documents)
# Step 4: Inspect discovered topics
topic_info = topic_model.get_topic_info()
print(topic_info.head(10))
# Step 5: Get the top terms for a specific topic
for topic_id in range(min(5, len(topic_model.get_topics()))):
    terms = topic_model.get_topic(topic_id)
    print(f"\nTopic {topic_id}:")
    for term, score in terms[:5]:
        print(f" {score:.3f} {term}")
Output: Topic Count Name 0 -1 2 -1_default_topic 1 0 2 0_order_charged_subscription 2 1 1 1_password_reset_dashboard 3 2 1 2_api_error_payload Topic 0: 0.142 order 0.128 charged 0.115 subscription 0.098 arrived 0.087 weeks Topic 1: 0.189 password 0.167 reset 0.134 dashboard ...
Code Fragment 31.7.2a: BERTopic: embedding-based topic modeling
Key Insight

Embedding-based topic models like BERTopic outperform bag-of-words approaches (LDA, NMF) because they capture semantic similarity rather than surface-level word co-occurrence. Two documents about "machine learning model deployment" and "putting ML systems into production" share no words in common, so LDA would assign them to different topics. BERTopic, working from dense embeddings, recognizes they discuss the same concept and clusters them together. This semantic awareness is especially valuable for short texts (tweets, support tickets, search queries) where bag-of-words vectors are too sparse to produce meaningful topics.

Fun Fact

BERTopic can optionally use an LLM to generate human-readable topic labels. Instead of a topic being described as "deployment, production, inference, serving, latency," the LLM reads the cluster's representative documents and produces a label like "ML Model Deployment and Serving Infrastructure." This makes topic models immediately useful for non-technical stakeholders who need to understand what their customer base is talking about.

31.7.3.4 Plug-In Representation Models

The default c-TF-IDF keyword list per topic is a reasonable starting point but rarely the most readable one. BERTopic exposes a representation_model hook so the four-stage pipeline (embed, reduce, cluster, represent) can be customized at the final step without touching the upstream UMAP or HDBSCAN configuration. Three plug-ins cover almost every production need.

KeyBERTInspired re-ranks the top c-TF-IDF candidate words for each topic by their cosine similarity to the topic's centroid embedding. The intuition is that c-TF-IDF favors words that distinguish the cluster from other clusters statistically, but those words are not always the words a human would call descriptive. KeyBERTInspired pushes the keyword list toward terms that are semantically central to the topic, not just statistically distinctive. MaximalMarginalRelevance applies the MMR algorithm (see Section 35.2.1.5) to the candidate keywords, so the final list trades relevance against diversity and avoids near-duplicates like "summary" and "summaries" both surfacing for the same topic. OpenAI hands a few representative documents per topic to an LLM and asks it to write a one-line topic label, the same trick promised in the fun-fact above but now an actual line of code. The three plug-ins stack:

from bertopic import BERTopic
from bertopic.representation import (
    KeyBERTInspired, MaximalMarginalRelevance, OpenAI,
)
from openai import OpenAI as OpenAIClient

client = OpenAIClient()

representation = {
    "Main":      KeyBERTInspired(),                          # centroid-aware keywords
    "MMR":       MaximalMarginalRelevance(diversity=0.3),    # de-duped keywords
    "LLM Label": OpenAI(client=client, model="gpt-4o-mini"), # one-line topic name
}

topic_model = BERTopic(
    embedding_model=embedder,
    representation_model=representation,
)
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
Code Fragment 31.7.3: Stacking three representation models on a single BERTopic run. Each plug-in produces its own column in get_topic_info(), so a downstream UI can display the LLM label, fall back to the KeyBERT keywords, and use the MMR list for tag clouds without rerunning UMAP or HDBSCAN.

31.7.3.5 BERTopic Visualizations

BERTopic ships an interactive visualization layer that turns the topic model into a quick analytical UI without writing any plotting code. Four methods cover the common reporting needs: visualize_documents() plots every document as a point in 2-D UMAP space, colored by topic, with hover tooltips showing the document text; visualize_topics() draws an inter-topic distance map so reviewers can see which topics overlap and which are well separated; visualize_hierarchy() renders the dendrogram of agglomerative clustering over topic embeddings, which is the input to the next method; reduce_topics(nr_topics=20) collapses fine-grained topics into a target number using that same hierarchy, useful when HDBSCAN produces 80 small topics but the business audience wants 10 broad themes. All four return Plotly figures that drop directly into a Jupyter notebook or a Streamlit dashboard.

Research Frontier

LLM-guided chunking uses language models to identify semantic boundaries in documents, producing chunks that align with topical shifts rather than arbitrary token counts. Late chunking (Jina AI, 2024) embeds the full document first and then splits the embedding sequence into chunks, preserving cross-chunk context that is lost with naive chunking. Proposition-based indexing decomposes documents into atomic factual statements before embedding, improving retrieval precision for fact-seeking queries. Research into multimodal document parsing (combining OCR, layout analysis, and vision models) is enabling chunking of complex documents with tables, figures, and mixed layouts.

Lab: Build and Compare Document Chunking Strategies
Duration: ~45 minutes Intermediate

Objective

Implement three different chunking strategies (fixed-size, recursive, semantic), apply them to a structured document, and compare their retrieval quality on test queries.

What You'll Practice

  • Implementing fixed-size chunking with character-based overlap
  • Building recursive text splitting using structural markers (headers, paragraphs)
  • Creating semantic chunking using embedding similarity breakpoints
  • Measuring retrieval quality differences between chunking strategies

Setup

The following cell installs the required packages and configures the environment for this lab.

Steps

Step 1: Create a sample document

Define a structured document with clear section boundaries.

document = (
    "# Introduction to Machine Learning\n\n"
    "Machine learning is a branch of artificial intelligence that enables "
    "computers to learn from data without being explicitly programmed.\n\n"
    "## Supervised Learning\n\n"
    "In supervised learning, the algorithm learns from labeled training data. "
    "Each example consists of an input and a desired output.\n\n"
    "### Classification\n\n"
    "Classification predicts categorical labels. For example, an email spam "
    "filter classifies emails as spam or not spam. Popular algorithms include "
    "logistic regression, SVMs, and random forests.\n\n"
    "### Regression\n\n"
    "Regression predicts continuous numerical values. For instance, predicting "
    "house prices based on features like square footage and location.\n\n"
    "## Unsupervised Learning\n\n"
    "Unsupervised learning works with unlabeled data, seeking to discover "
    "hidden patterns and structures without target labels.\n\n"
    "### Clustering\n\n"
    "Clustering groups similar data points together. K-means partitions data "
    "into k groups. DBSCAN discovers clusters of arbitrary shape based on "
    "density. Hierarchical clustering builds a tree of nested clusters.\n\n"
    "### Dimensionality Reduction\n\n"
    "Dimensionality reduction compresses high-dimensional data. PCA finds "
    "directions of maximum variance. t-SNE and UMAP create 2D visualizations "
    "that preserve local neighborhood structure.\n\n"
    "## Deep Learning\n\n"
    "Deep learning uses neural networks with many layers to learn hierarchical "
    "representations. It has achieved breakthroughs in vision, NLP, and games."
)
print(f"Document: {len(document)} chars, {document.count(chr(10))} lines")
Code Fragment 31.7.6: Assembles the lab's evaluation corpus: a markdown document with a clear heading hierarchy (h1, h2, h3) and clean paragraph breaks that gives each chunking strategy a fair shot at preserving structure.
Hint

This document has clear structural markers: # for h1, ## for h2, ### for h3, and blank lines between paragraphs. Good chunking should respect these boundaries.

Step 2: Implement three chunking strategies

Build fixed-size, recursive, and semantic chunkers.

import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Strategy 1: Fixed-size with overlap
def fixed_chunk(text, size=300, overlap=50):
    chunks, start = [], 0
    while start < len(text):
        chunk = text[start:start+size].strip()
        if chunk:
            chunks.append(chunk)
            start += size - overlap
            return chunks
# Strategy 2: Recursive splitting on headers
def recursive_chunk(text, max_size=500):
    # TODO: Split on "## " first, then "### " for oversized sections
    # Merge chunks smaller than max_size/3 with their neighbors
    sections = text.split("\n## ")
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_size:
            chunks.append(section)
        else:
            for sub in section.split("\n### "):
                sub = sub.strip()
                if sub:
                    chunks.append(sub)
                    return chunks
# Strategy 3: Semantic chunking
def semantic_chunk(text, model, threshold=0.5):
    # TODO: Split into sentences, encode them, find similarity drops
    sentences = [s.strip() for s in text.replace('\n', ' ').split('. ')
        if len(s.strip()) > 10]
    if len(sentences) <= 1:
        return sentences
    embs = model.encode(sentences)
    norms = np.linalg.norm(embs, axis=1)
    chunks, current = [], [sentences[0]]
    for i in range(len(sentences) - 1):
        sim = np.dot(embs[i], embs[i+1]) / (norms[i] * norms[i+1] + 1e-8)
        if sim < threshold:
            chunks.append(". ".join(current))
            current = [sentences[i+1]]
        else:
            current.append(sentences[i+1])
            if current:
                chunks.append(". ".join(current))
                return chunks
            c_fixed = fixed_chunk(document)
            c_recursive = recursive_chunk(document)
            c_semantic = semantic_chunk(document, model)
            for name, chunks in [("Fixed", c_fixed), ("Recursive", c_recursive),
                ("Semantic", c_semantic)]:
                print(f"\n{name}: {len(chunks)} chunks")
                for i, c in enumerate(chunks):
                    print(f" [{i}] {len(c)} chars: {c[:60]}...")
Output: Fixed: 5 chunks [0] 300 chars: # Introduction to Machine Learning Machine learning is a b... [1] 300 chars: ed training data. Each example consists of an input and a d... [2] 300 chars: ps similar data points together. K-means partitions data in... [3] 300 chars: eduction compresses high-dimensional data. PCA finds direct... [4] 142 chars: Deep learning uses neural networks with many layers to lear... Recursive: 6 chunks [0] 178 chars: # Introduction to Machine Learning Machine learning is a b... [1] 153 chars: Supervised Learning In supervised learning, the algorithm ... [2] 142 chars: Classification Classification predicts categorical labels.... [3] 208 chars: Clustering Clustering groups similar data points together.... [4] 195 chars: Dimensionality Reduction Dimensionality reduction compres... [5] 148 chars: Deep Learning Deep learning uses neural networks with many... Semantic: 4 chunks [0] 312 chars: Machine learning is a branch of artificial intelligence tha... [1] 287 chars: Clustering groups similar data points together. K-means par... [2] 215 chars: Dimensionality reduction compresses high-dimensional data. ... [3] 148 chars: Deep learning uses neural networks with many layers to lear...
Code Fragment 31.7.9: Defines fixed_chunk and recursive_chunk
Hint

For semantic chunking, compute cosine similarity between consecutive sentence embeddings. A drop below the threshold signals a topic change, which is where you create a new chunk.

Step 3: Compare retrieval quality

Search each set of chunks and check which strategy finds the best match.

import numpy as np
queries_expected = [
    ("What is classification in ML?", "classification"),
    ("How does clustering work?", "clustering"),
    ("What is PCA used for?", "dimensionality"),
    ("What is deep learning?", "deep learning"),
    ]
def search_chunks(query, chunks, model, top_k=1):
    qe = model.encode(query)
    ce = model.encode(chunks)
    scores = np.dot(ce, qe) / (np.linalg.norm(ce, axis=1) * np.linalg.norm(qe))
    idx = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], scores[i]) for i in idx]
for query, keyword in queries_expected:
    print(f"\nQuery: {query}")
    for name, chunks in [("Fixed", c_fixed), ("Recursive", c_recursive),
        ("Semantic", c_semantic)]:
        top_chunk, score = search_chunks(query, chunks, model)[0]
        hit = "PASS" if keyword.lower() in top_chunk.lower() else "MISS"
        print(f" {name:10s} [{hit}] score={score:.3f} | {top_chunk[:55]}...")
Output: Query: What is classification in ML? Fixed [MISS] score=0.487 | ed training data. Each example consists of an in... Recursive [PASS] score=0.612 | Classification Classification predicts categori... Semantic [PASS] score=0.534 | Machine learning is a branch of artificial intel... Query: How does clustering work? Fixed [PASS] score=0.523 | ps similar data points together. K-means partiti... Recursive [PASS] score=0.648 | Clustering Clustering groups similar data points... Semantic [PASS] score=0.591 | Clustering groups similar data points together. ... Query: What is PCA used for? Fixed [PASS] score=0.481 | eduction compresses high-dimensional data. PCA f... Recursive [PASS] score=0.573 | Dimensionality Reduction Dimensionality reducti... Semantic [PASS] score=0.542 | Dimensionality reduction compresses high-dimensi... Query: What is deep learning? Fixed [PASS] score=0.612 | Deep learning uses neural networks with many lay... Recursive [PASS] score=0.687 | Deep Learning Deep learning uses neural networks... Semantic [PASS] score=0.654 | Deep learning uses neural networks with many lay...
Code Fragment 31.7.10: Runs the same query set against three chunk indexes (fixed-size, recursive, semantic) and prints the top hit per strategy. Recursive chunking wins on the deep-learning query because chunks align with heading boundaries instead of mid-sentence splits.
Hint

Recursive chunking should perform best because chunks align with natural topic boundaries. Fixed-size chunks may split topics mid-sentence.

Expected Output

  • Three sets of chunks with different sizes (fixed: ~8, recursive: ~6, semantic: ~5 to 8)
  • Recursive and semantic chunking matching relevant chunks more reliably
  • Fixed-size chunking occasionally missing because topics get split at boundaries

Stretch Goals

  • Add metadata (section title, position) to each chunk and use it to improve retrieval context
  • Implement a "parent document retriever" that returns the larger parent section when a small chunk matches
  • Test on a real PDF by extracting text with PyMuPDF and applying the same strategies
Complete Solution
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
document = (
    "# Introduction to Machine Learning\n\n"
    "Machine learning enables computers to learn from data.\n\n"
    "## Supervised Learning\n\nLearns from labeled data.\n\n"
    "### Classification\n\nPredicts categorical labels like spam/not-spam.\n\n"
    "### Regression\n\nPredicts continuous values like house prices.\n\n"
    "## Unsupervised Learning\n\nFinds patterns in unlabeled data.\n\n"
    "### Clustering\n\nGroups similar points. K-means, DBSCAN, hierarchical.\n\n"
    "### Dimensionality Reduction\n\nCompresses data. PCA, t-SNE, UMAP.\n\n"
    "## Deep Learning\n\nNeural networks with many layers for hierarchical representations."
    )
def fixed_chunk(text, size=300, overlap=50):
    chunks, s = [], 0
    while s < len(text):
        c = text[s:s+size].strip()
        if c: chunks.append(c)
        s += size - overlap
        return chunks
def recursive_chunk(text, max_size=500):
    chunks = []
    for sec in text.split("\n## "):
        sec = sec.strip()
        if not sec: continue
        if len(sec) <= max_size: chunks.append(sec)
    else:
        for sub in sec.split("\n### "):
            sub = sub.strip()
            if sub: chunks.append(sub)
            return chunks
def semantic_chunk(text, model, threshold=0.5):
    sents = [s.strip() for s in text.replace('\n',' ').split('. ') if len(s.strip())>10]
    if len(sents) <= 1: return sents
    embs = model.encode(sents)
    norms = np.linalg.norm(embs, axis=1)
    chunks, cur = [], [sents[0]]
    for i in range(len(sents)-1):
        sim = np.dot(embs[i],embs[i+1])/(norms[i]*norms[i+1]+1e-8)
        if sim < threshold: chunks.append(". ".join(cur)); cur = [sents[i+1]]
    else: cur.append(sents[i+1])
    if cur: chunks.append(". ".join(cur))
    return chunks
cf, cr, cs = fixed_chunk(document), recursive_chunk(document), semantic_chunk(document, model)
def search(q, chunks, model):
    qe = model.encode(q); ce = model.encode(chunks)
    scores = np.dot(ce,qe)/(np.linalg.norm(ce,axis=1)*np.linalg.norm(qe))
    i = np.argmax(scores)
    return chunks[i], scores[i]
for q, kw in [("What is classification?","classification"),("How does clustering work?","clustering"),
    ("What is PCA?","dimensionality"),("What is deep learning?","deep learning")]:
    print(f"\n{q}")
    for nm, ch in [("Fixed",cf),("Recursive",cr),("Semantic",cs)]:
        c, s = search(q, ch, model)
        print(f" {nm:10s} [{'PASS' if kw in c.lower() else 'MISS'}] {s:.3f} | {c[:55]}")
Output: What is classification? Fixed [MISS] 0.487 | ed training data. Each example consists of an in Recursive [PASS] 0.612 | Classification Predicts categorical labels like Semantic [PASS] 0.534 | Machine learning enables computers to learn from ...
Code Fragment 31.7.11: Defines fixed_chunk and recursive_chunk
Tip: Use vectorized cosine similarity in production

The hand-rolled np.dot / np.linalg.norm loop above shows the math; the library version is what you would ship. In production, prefer sklearn.metrics.pairwise.cosine_similarity(A, B) (vectorized over arrays, drop-in replacement for the loop above) for moderate corpora, or faiss.IndexFlatIP with L2-normalized vectors when you need billion-scale retrieval. See the next two Library Shortcut blocks for ready-to-paste sklearn and LangChain wrappers.

Library Shortcut
sklearn cosine_similarity for batched search

Reach for sklearn when you need a one-call cosine search over a moderate corpus. For larger indexes, swap in FAISS the same way as in Section 31.2.

Show code
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def search_chunks(query, chunks, model, top_k=1):
    qe = model.encode([query])
    ce = model.encode(chunks)
    sims = cosine_similarity(qe, ce)[0]
    idx = np.argsort(sims)[::-1][:top_k]
    return [(chunks[i], sims[i]) for i in idx]
Code Fragment 31.7.4a: A search_chunks() helper that embeds both the query and the candidate chunks on the fly, then uses sklearn.cosine_similarity to return the best-matching chunk and its score (handy for one-off retrieval without a persistent index).
Library Shortcut: LangChain RecursiveCharacterTextSplitter

LangChain ships a battle-tested splitter that handles the recursive fallback (paragraph → sentence → word) and overlap in a single call. Production RAG pipelines almost always use this.

Show code
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(document)
Code Fragment 31.7.5: Minimal working example using LangChain RecursiveCharacterTextSplitter.
Key Takeaways
Self-Check
Q1: Why is chunking quality often the most important factor in RAG system performance?
Show Answer
Chunking determines what the retriever can find. If a relevant answer is split across two chunks at an unfortunate boundary, the retriever may never surface it as a coherent result. If a chunk mixes two unrelated topics, its embedding will be a noisy average that matches neither topic well. No embedding model or vector database can compensate for poorly chunked documents because the quality of retrieval is fundamentally bounded by the quality of the units being retrieved.
Q2: What is the fundamental tradeoff in choosing chunk size?
Show Answer
Smaller chunks (100 to 200 tokens) produce more focused embeddings that match specific queries precisely, improving retrieval precision. However, they may lack sufficient context for the LLM to generate a complete answer. Larger chunks (500 to 1000 tokens) provide richer context but may cover multiple topics, making their embeddings less precise and reducing retrieval recall. The parent-child strategy resolves this by using small chunks for retrieval and returning larger parent chunks for LLM context.
Q3: How does semantic chunking differ from recursive character splitting?
Show Answer
Recursive character splitting uses predefined text separators (paragraph breaks, newlines, spaces) in a hierarchical order to find chunk boundaries. It is rule-based and deterministic. Semantic chunking uses the embedding model itself to determine boundaries: it embeds each sentence, computes similarity between consecutive sentences, and splits where similarity drops significantly. This produces chunks aligned with actual topic transitions rather than formatting conventions. Semantic chunking is slower (it requires embedding all sentences) but produces more coherent chunks for content with complex topic structure.
Q4: How does parent-child retrieval solve the chunk-size dilemma?
Show Answer
Parent-child retrieval decouples the retrieval unit from the context unit. Small child chunks (100 to 200 tokens) are embedded and used for retrieval because their focused content produces precise, topic-specific embeddings. When a child chunk matches a query, the system looks up its associated parent chunk (500 to 1000 tokens) and returns the parent to the LLM. This provides the precision of small chunks for matching while giving the LLM the broader context of large chunks for answer generation.
Q5: What is incremental indexing and why is it necessary for production systems?
Show Answer
Incremental indexing tracks document versions (via content hashes or timestamps) and processes only documents that have been added, modified, or deleted since the last indexing run. It is necessary because re-processing an entire corpus on every update is expensive and slow. A production system with thousands of documents that change daily must detect which documents have changed, remove old chunks for modified or deleted documents, and insert new chunks, all without reprocessing unchanged documents. This requires maintaining a mapping between source documents and their chunks in the vector database.

Exercises

Exercise 17.4.1: Chunk size tradeoff Conceptual

A 200-token chunk about climate change is highly precise for retrieval, but the LLM struggles to generate a good answer from it. A 1000-token chunk provides more context but retrieves poorly. How would you resolve this tension?

Show Answer

Use parent-child retrieval: embed small chunks (200 tokens) for precise retrieval, but pass the larger parent chunk (800 to 1000 tokens) to the LLM for generation. This gets the best of both worlds.

Exercise 17.4.2: Overlap rationale Conceptual

Explain why overlapping chunks improve retrieval quality. What percentage of overlap is typical, and what happens if overlap is too high?

Show Answer

Overlap ensures that information spanning a chunk boundary is captured in at least one chunk. Typical overlap is 10 to 15% of chunk size. Excessive overlap (e.g., 50%) wastes storage and creates near-duplicate embeddings that dilute search results.

Exercise 17.4.3: Semantic chunking Conceptual

Compare fixed-size chunking with semantic chunking (splitting at topic boundaries). When does semantic chunking clearly outperform fixed-size, and when is fixed-size good enough?

Show Answer

Semantic chunking outperforms fixed-size when documents contain clear topic shifts (e.g., news articles, textbooks). Fixed-size is good enough when documents are homogeneous (e.g., product reviews, FAQ entries) or when the retrieval pipeline includes reranking.

Exercise 17.4.4: Parent-child retrieval Conceptual

Describe the parent-child chunking strategy. Why would you retrieve on small chunks but pass the parent chunk to the LLM?

Show Answer

Small child chunks produce more focused embeddings (better retrieval precision), while parent chunks give the LLM enough surrounding context to generate coherent, grounded answers.

Exercise 17.4.5: Multimodal documents Conceptual

You have a 200-page financial report with charts, tables, and prose. Compare two approaches: (a) extract text only and chunk, (b) use vision-based retrieval with ColPali. What are the tradeoffs?

Show Answer

(a) Text extraction loses visual layout, table structure, and chart data, but is faster and cheaper. (b) ColPali preserves visual information and handles charts/tables natively, but requires more compute and storage (one embedding per page patch). Best approach: use text extraction for prose-heavy sections and ColPali for pages with visual elements.

Exercise 17.4.6: Chunking comparison Conceptual

Take a 5-page document and chunk it three ways: fixed 512 tokens, recursive character splitting, and by paragraph boundaries. Count the chunks produced and examine where splits occur. Which method preserves semantic coherence best?

Exercise 17.4.7: Chunk evaluation Coding

Write an evaluation script: given a set of 20 questions with known answer passages, embed all chunks using a sentence transformer, retrieve top-5 for each question, and compute Hit Rate and MRR. Compare across chunk sizes of 128, 256, 512, and 1024 tokens.

Exercise 17.4.8: Topic modeling Coding

Use BERTopic on a collection of at least 500 documents (e.g., 20 Newsgroups). Visualize the discovered topics and compare them against the known categories.

Exercise 17.4.9: Parent-child implementation Coding

Implement a parent-child retrieval system where child chunks (128 tokens) are used for retrieval but parent chunks (512 tokens) are passed to the LLM. Compare answer quality against a flat 512-token chunking approach.

What Comes Next

In the next section, Section 31.8: Vision-Based Document Retrieval, we explore how vision-based retrieval with ColPali and ColQwen2 bypasses the text extraction pipeline entirely by processing document pages as images, enabling retrieval from visually rich content that text-based methods cannot handle.

Further Reading

Chunking Frameworks & Tutorials

LangChain Documentation: Text Splitters Comprehensive guide to LangChain's text splitting abstractions including recursive character, token-based, and semantic chunking. Good starting point for understanding the API surface of chunking tools.
LlamaIndex Documentation: Node Parsers LlamaIndex's approach to document parsing and node creation, including sentence-window and hierarchical parsers. Useful for comparing chunking philosophies across frameworks.
Kamradt, G. (2023). "Chunking Strategies for LLM Applications." Practical, code-driven comparison of chunking strategies with visual examples. Includes the influential "five levels of chunking" framework from simple to semantic splitting.

Document Parsing Tools

Unstructured.io: Open-source Document Parsing Industry-standard open-source library for extracting text and structure from PDFs, DOCX, HTML, images, and more. Handles complex layouts with table detection and OCR integration.
Nougat: Neural Optical Understanding for Academic Documents. Meta's neural approach to converting academic PDFs (including equations and tables) to structured Markdown. Particularly effective for scientific papers where traditional OCR fails on mathematical notation.
Marker: PDF to Markdown Converter High-quality PDF to Markdown converter that preserves document structure, tables, and formatting. Faster and more accurate than many alternatives for general-purpose PDF extraction.