Part V: Retrieval and Conversation
Chapter 19: Embeddings and Vector Databases

Document Processing & Chunking

Garbage in, garbage out. But with chunking, it is more like: split wrong, retrieve wrong, answer wrong.

Big Picture

The quality of your RAG system is bounded by the quality of your chunks. No embedding model or vector database can compensate for poorly chunked documents. If a relevant answer spans two chunks that were split in the wrong place, the retriever will never surface it as a single coherent result. Document processing and chunking is where most RAG systems succeed or fail, yet it receives far less attention than model selection or index tuning. This section covers chunking strategies from basic to advanced, document parsing tools for complex formats, and the engineering of production-grade ingestion pipelines. The tokenization concepts from Section 02.1 directly inform chunk size decisions, since models have fixed token-level context windows.

Prerequisites

Effective chunking depends on understanding what embedding models expect as input, so review the embedding model fundamentals in Section 19.1 before proceeding. The tokenization concepts from Section 02.1 are directly relevant because chunk boundaries interact with token limits. This section feeds directly into the RAG pipeline design covered in Section 20.1, where chunking quality determines retrieval quality.

A baguette being sliced into chunks of different sizes, representing document chunking strategies
Figure 19.4.1: Chunking a document is like slicing a baguette. Too thin and you lose context; too thick and it will not fit in the model's mouth.
A sushi chef carefully cutting fish into precise pieces, representing semantic chunking that respects natural boundaries
Figure 19.4.2: Semantic chunking is sushi-grade precision. Cut at the natural boundaries, not at arbitrary character counts, and every piece is a complete thought.

1. The Document Processing Pipeline

Before text can be embedded and indexed, raw documents must pass through a multi-stage processing pipeline. Each stage introduces potential failure modes that can degrade retrieval quality downstream. Figure 19.4.3 outlines the complete ingestion pipeline from raw files to indexed vectors.

  1. Loading: Reading raw files from various sources (file systems, S3, URLs, databases, APIs).
  2. Parsing: Extracting text and structure from complex formats (PDF, DOCX, HTML, slides, scanned images).
  3. Cleaning: Removing headers, footers, page numbers, boilerplate, and artifacts from parsing.
  4. Chunking: Splitting cleaned text into segments suitable for embedding and retrieval.
  5. Enrichment: Adding metadata (source, page number, section title, date) to each chunk.
  6. Embedding: Converting chunks to vectors using the selected embedding model.
  7. Indexing: Storing vectors and metadata in the vector database.
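The data flow through these stages can be sketched as a few composable functions. Everything below is a deliberately trivial placeholder (assumed names; no real parser, embedding model, or database), meant only to show how the stages chain together:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def parse(raw: bytes) -> str:
    # Stage 2: a real pipeline would dispatch to a PDF/DOCX/HTML parser here.
    return raw.decode("utf-8")


def clean(text: str) -> str:
    # Stage 3: strip blank lines and surrounding whitespace (stand-in for
    # header/footer/boilerplate removal).
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def chunk(text: str, size: int = 80) -> List[Chunk]:
    # Stage 4: naive fixed-size split; the strategies below improve on this.
    return [Chunk(text[i:i + size]) for i in range(0, len(text), size)]


def enrich(chunks: List[Chunk], source: str) -> List[Chunk]:
    # Stage 5: attach provenance metadata to every chunk.
    for c in chunks:
        c.metadata["source"] = source
    return chunks


def ingest(raw: bytes, source: str) -> List[Chunk]:
    # Stages 6-7 (embed, index) are omitted: they depend on your embedding
    # model and vector database and are covered in the sections that follow.
    return enrich(chunk(clean(parse(raw))), source)
```

The value of keeping each stage a separate function is testability: every stage can be inspected in isolation when a downstream retrieval failure needs to be traced back to its cause.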
Key Insight

The chunking problem in document processing is, at its core, a segmentation problem with deep roots in psycholinguistics and information theory. George Miller's seminal 1956 paper "The Magical Number Seven" showed that human working memory processes information in "chunks," and that the boundaries between chunks are determined by semantic coherence, not arbitrary length. The same principle applies to RAG: chunks that align with natural discourse boundaries (paragraphs, sections, topic shifts) produce better retrieval results than chunks split at arbitrary token counts. Linguists study this under the name "discourse segmentation" or "text segmentation," and the computational approaches (TextTiling by Hearst, 1997; Bayesian topic segmentation by Eisenstein and Barzilay, 2008) predate RAG by decades. Semantic chunking strategies in modern RAG systems are rediscovering these linguistic principles: the best chunk boundary is where the topic changes, which is precisely where human readers would naturally pause and begin processing a new idea.

Figure 19.4.3: The RAG document ingestion pipeline from raw files to indexed vectors.

2. Document Parsing

The PDF Challenge

PDFs are the most common and most difficult document format for RAG systems. A PDF is fundamentally a page layout format, not a text format. Text is stored as positioned glyphs on a page, with no inherent reading order, paragraph structure, or semantic hierarchy. Tables, multi-column layouts, headers, footers, and embedded images all require specialized handling. Scanned PDFs contain only images, requiring OCR before text extraction is even possible. For an alternative approach that skips text extraction entirely, see Section 19.5 on vision-based document retrieval. Code Fragment 19.4.1 below puts this into practice.

A telescope zooming from a wide view to a specific detail, representing parent-child chunk retrieval
Figure 19.4.4: Parent-child retrieval: search finds the specific detail (child chunk), then hands back the full context (parent chunk). Zoom in to find it, zoom out to understand it.

Parsing Tools

Tip

Before investing time in parsing optimization, test your documents with the simplest tool first. Run PyPDF on a sample of 20 documents and manually inspect the output. If 80% parse cleanly, you may only need a specialized parser for the remaining 20%. Many teams over-engineer their parsing pipeline for edge cases that represent a tiny fraction of their corpus.
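That manual inspection pass can be partially scripted. The heuristic and thresholds below are illustrative assumptions, not a standard metric; they only flag extractions that are obviously garbled so you know which documents deserve a closer look:

```python
def looks_clean(text: str, min_printable: float = 0.95,
                min_avg_word: float = 3.0) -> bool:
    """Rough parse-quality triage: clean prose is mostly printable characters
    with reasonably long words. Thresholds are illustrative, not calibrated."""
    if not text.strip():
        return False
    # Fraction of characters that are printable or whitespace
    printable = sum(ch.isprintable() or ch.isspace() for ch in text) / len(text)
    # Garbled extractions often scatter letters into one-character "words"
    words = text.split()
    avg_word = sum(len(w) for w in words) / len(words)
    return printable >= min_printable and avg_word >= min_avg_word


print(looks_clean("This report presents a benchmark of vector databases."))  # True
print(looks_clean("\x00\x01 g a r b l e d \x02"))                            # False
```

Run this over the extracted text of your 20-document sample to sort the output into "probably fine" and "needs eyes" piles before deciding whether a specialized parser is worth the investment.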

# Document parsing with Unstructured.io
from unstructured.partition.pdf import partition_pdf

# Parse a PDF with layout detection
elements = partition_pdf(
    filename="technical_report.pdf",
    strategy="hi_res",               # Use layout detection model
    infer_table_structure=True,      # Extract table structure
    include_page_breaks=True,        # Track page boundaries
)

# Inspect extracted elements
for element in elements[:10]:
    print(f"Type: {type(element).__name__:20s} | "
          f"Page: {element.metadata.page_number} | "
          f"Text: {str(element)[:60]}...")

# Filter by element type
from unstructured.documents.elements import Title, NarrativeText, Table

titles = [e for e in elements if isinstance(e, Title)]
text_blocks = [e for e in elements if isinstance(e, NarrativeText)]
tables = [e for e in elements if isinstance(e, Table)]

print(f"\nExtracted: {len(titles)} titles, "
      f"{len(text_blocks)} text blocks, "
      f"{len(tables)} tables")

Type: Title                | Page: 1 | Text: Technical Report: Vector Database Performance Analy...
Type: NarrativeText        | Page: 1 | Text: This report presents a comprehensive benchmark of v...
Type: NarrativeText        | Page: 1 | Text: We evaluated five vector database systems across th...
Type: Title                | Page: 2 | Text: Methodology...
Type: NarrativeText        | Page: 2 | Text: Our benchmark framework measures three primary dime...
Type: Table                | Page: 2 | Text: System | QPS | Recall@10 | P99 Latency...

Extracted: 8 titles, 24 text blocks, 3 tables
Code Fragment 19.4.1: Document parsing with Unstructured.io

3. Chunking Strategies

Key Insight: The Chunking Dilemma

Chunk size involves a fundamental tradeoff. Smaller chunks (100 to 200 tokens) produce more precise embeddings because each chunk covers a single topic, improving retrieval precision. However, they may lack sufficient context for the LLM to generate a good answer. Larger chunks (500 to 1000 tokens) provide more context but may cover multiple topics, reducing embedding precision and retrieval recall. Most production systems settle on 256 to 512 tokens as a baseline, then tune based on evaluation results.

Common Misconception: Smaller Chunks Are More Precise, So They Must Be Better

Developers often assume that smaller chunks improve retrieval because each chunk covers a narrower topic. While this is true for embedding precision, it ignores two critical failure modes. First, when an answer spans two small chunks split at the wrong boundary, neither chunk contains the complete answer, and the retriever may surface only one half. Second, small chunks strip away surrounding context that the LLM needs to interpret the passage correctly. A 100-token chunk saying "the rate increased to 3.5%" is useless without knowing which rate, in which time period. Chunk size must be tuned empirically on your data, not chosen from a general rule. Start at 400 to 512 tokens with 50-token overlap, then use retrieval evaluation metrics (covered in Chapter 29) to find the optimum for your corpus.

Fun Fact

Ask ten RAG engineers for their optimal chunk size and you will get twelve answers. The chunking literature is littered with benchmarks "proving" that 256, 512, or 1024 tokens is best, usually on completely different datasets. The real answer is always "it depends," which is the most frustrating and most honest thing in engineering.

Fixed-Size Chunking

Chunk Overlap Geometry.

For a document of length D tokens, chunk size C, and overlap O:

$$\text{Number of chunks: } N = \left\lceil \frac{D - O}{C - O} \right\rceil$$

Effective stride: stride = C - O

Overlap ratio: O / C (typically 10% to 20%)

Storage overhead from overlap: (N × C) / D ≈ C / (C - O)

Worked example: A 10,000-token document with C = 512, O = 50:

$$\text{stride} = 512 - 50 = 462 \\ N = \lceil (10000 - 50) / 462 \rceil = \lceil 21.5 \rceil = 22 \text{ chunks}$$

Storage overhead: 512 / 462 ≈ 1.11× (11% more vectors than zero-overlap chunking).

$$\text{With zero overlap: } \lceil 10000 / 512 \rceil = 20 \text{ chunks}$$

The extra 2 chunks ensure no sentence split at a boundary is lost.
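These formulas are easy to check in code. The helper below is written for this worked example, not a library function:

```python
import math


def chunk_stats(doc_tokens: int, chunk_size: int, overlap: int):
    """Chunk count, effective stride, and storage overhead for overlapping chunks."""
    stride = chunk_size - overlap
    n_chunks = math.ceil((doc_tokens - overlap) / stride)
    overhead = chunk_size / stride  # approx. ratio of stored tokens to source tokens
    return n_chunks, stride, overhead


# Worked example from above: 10,000-token document, C = 512, O = 50
n, stride, overhead = chunk_stats(10_000, 512, 50)
print(n, stride, round(overhead, 2))   # 22 462 1.11

# Zero-overlap baseline for comparison
print(chunk_stats(10_000, 512, 0)[0])  # 20
```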

The simplest approach splits text into chunks of a fixed number of characters or tokens. While naive, fixed-size chunking is fast, deterministic, and serves as a reasonable baseline. Code Fragment 19.4.2 below puts this into practice.

# Fixed-size chunking with overlap
from typing import List

def fixed_size_chunk(
    text: str,
    chunk_size: int = 500,
    chunk_overlap: int = 50,
) -> List[str]:
    """
    Split text into fixed-size chunks with overlap.

    Args:
        text: Input text to chunk
        chunk_size: Maximum characters per chunk
        chunk_overlap: Characters to overlap between consecutive chunks
    """
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size

        # If not the last chunk, try to break at a sentence boundary
        if end < len(text):
            # Look for a sentence boundary near the end
            for boundary in [". ", ".\n", "? ", "! "]:
                last_boundary = text[start:end].rfind(boundary)
                if last_boundary > chunk_size * 0.5:
                    end = start + last_boundary + len(boundary)
                    break

        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        # Move start position, accounting for overlap
        start = end - chunk_overlap

    return chunks

# Example
sample_text = """
Vector databases are specialized systems designed for storing and querying
high-dimensional vectors. They use approximate nearest neighbor algorithms
to find similar vectors efficiently.

The most common algorithm is HNSW, which builds a multi-layer graph structure.
Each layer connects vectors to their nearest neighbors, enabling fast navigation
from any starting point to the target region of the vector space.

Product Quantization reduces memory usage by compressing vectors. Each vector
is split into sub-vectors, and each sub-vector is replaced by its nearest
codebook entry. This can achieve 32x compression with acceptable accuracy loss.
"""

chunks = fixed_size_chunk(sample_text, chunk_size=200, chunk_overlap=30)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:70]}...")

Chunk 0 (198 chars): Vector databases are specialized systems designed for storing and query...
Chunk 1 (203 chars): The most common algorithm is HNSW, which builds a multi-layer graph st...
Chunk 2 (196 chars): Product Quantization reduces memory usage by compressing vectors. Each...
Code Fragment 19.4.2: Fixed-size chunking with overlap

Recursive Character Splitting

Recursive character splitting (popularized by LangChain) attempts to split text at the most semantically meaningful boundary possible. It tries a hierarchy of separators: first by paragraph (\n\n), then by sentence (\n), then by word ( ), and finally by character. At each level, if a chunk exceeds the size limit, it is split using the next separator in the hierarchy. Code Fragment 19.4.3 below puts this into practice.

# Recursive character text splitting (LangChain-style)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
    is_separator_regex=False,
)

document = """# Introduction to Embeddings

Text embeddings convert natural language into dense vector representations.
These vectors capture semantic meaning, allowing mathematical operations
like cosine similarity to measure how related two pieces of text are.

## Training Approaches

Modern embedding models use contrastive learning. The model is trained to
produce similar vectors for semantically related text pairs and different
vectors for unrelated pairs. Hard negative mining improves training by
providing challenging negative examples that force the model to learn
fine-grained distinctions.

## Applications

Embeddings power semantic search, recommendation systems, clustering,
and retrieval-augmented generation. They serve as the foundation for
virtually every modern NLP application that requires understanding
meaning beyond keyword matching.
"""

chunks = splitter.split_text(document)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(f"  {chunk[:80]}...")
    print()

Chunk 0 (251 chars):
  # Introduction to Embeddings Text embeddings convert natural language into den...

Chunk 1 (308 chars):
  ## Training Approaches Modern embedding models use contrastive learning. The ...

Chunk 2 (241 chars):
  ## Applications Embeddings power semantic search, recommendation systems, cl...
Code Fragment 19.4.3: Recursive character text splitting (LangChain-style)

Semantic Chunking

Semantic chunking uses the embedding model itself to determine chunk boundaries. It computes embeddings for each sentence (or small segment), then identifies natural breakpoints where the cosine similarity between consecutive segments drops below a threshold. This produces chunks that are semantically coherent, with boundaries aligned to topic transitions.

# Semantic chunking based on embedding similarity
import re
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunk(
    text: str,
    model: SentenceTransformer,
    threshold_percentile: int = 25,
    min_chunk_size: int = 100,
) -> List[str]:
    """
    Split text into semantically coherent chunks by detecting
    topic boundaries using embedding similarity.
    """
    # Split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s for s in sentences if len(s) > 10]

    if len(sentences) <= 1:
        return [text]

    # Embed all sentences
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # Cosine similarity between consecutive sentences
    # (a dot product suffices because the embeddings are normalized)
    similarities = [
        float(np.dot(embeddings[i], embeddings[i + 1]))
        for i in range(len(embeddings) - 1)
    ]

    # Find breakpoints where similarity drops below threshold
    threshold = np.percentile(similarities, threshold_percentile)
    breakpoints = [i + 1 for i, sim in enumerate(similarities)
                   if sim < threshold]

    # Build chunks from breakpoints; a segment shorter than min_chunk_size
    # is merged into the following chunk rather than dropped
    chunks = []
    start = 0
    for bp in breakpoints:
        chunk = " ".join(sentences[start:bp])
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
            start = bp

    # Add remaining sentences
    final_chunk = " ".join(sentences[start:])
    if final_chunk:
        chunks.append(final_chunk)

    return chunks

# Example usage
model = SentenceTransformer("all-MiniLM-L6-v2")
text = """
Machine learning models learn patterns from data. They adjust internal
parameters to minimize prediction errors. The training process uses
gradient descent to iteratively improve the model.

Vector databases store high-dimensional vectors. They use algorithms like
HNSW for fast approximate nearest neighbor search. These systems are
critical for semantic search applications.

Python is the most popular language for data science. It provides libraries
like NumPy, pandas, and scikit-learn. The ecosystem continues to grow rapidly.
"""

chunks = semantic_chunk(text, model)
for i, chunk in enumerate(chunks):
    print(f"Semantic Chunk {i}: {chunk[:70]}...")

Semantic Chunk 0: Machine learning models learn patterns from data. They adjust internal...
Semantic Chunk 1: Vector databases store high-dimensional vectors. They use algorithms l...
Semantic Chunk 2: Python is the most popular language for data science. It provides libr...
Code Fragment 19.4.4: Semantic chunking based on embedding similarity

Structure-Aware Chunking

When documents have clear structural elements (headings, sections, subsections), the most effective strategy respects this structure. Structure-aware chunking uses document hierarchy to create chunks that align with the author's intended organization. A section with its heading forms a natural chunk; a table stays intact rather than being split across chunks. Figure 19.4.5 compares these three approaches.

Figure 19.4.5: Fixed-size chunking breaks at arbitrary points; recursive splitting respects paragraphs; structure-aware chunking preserves semantic units.
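The structure-aware idea can be sketched for Markdown-style documents, where headings mark section boundaries. This is a simplification: real PDF or DOCX corpora need a structural parser (such as the Unstructured.io example earlier) before heading-based splitting is possible.

```python
import re
from typing import List


def chunk_by_headings(markdown: str) -> List[str]:
    """Start a new chunk at every Markdown heading, so each section
    (heading plus its body) stays intact as one retrieval unit."""
    chunks: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines():
        # A heading closes the previous section and opens a new one
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


doc = "# Setup\nInstall the package.\n\n# Usage\nCall the API.\nCheck results."
for c in chunk_by_headings(doc):
    print(repr(c))
```

A production version would additionally enforce a maximum chunk size (falling back to recursive splitting inside oversized sections) and keep tables and code blocks intact, but the heading-bounded core is the part that preserves the author's organization.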

4. Overlap and Parent-Child Retrieval

Chunk Overlap

Adding overlap between consecutive chunks ensures that sentences at chunk boundaries are not lost in context. A typical overlap of 10 to 20% of the chunk size (e.g., 50 to 100 tokens for a 500-token chunk) provides continuity without excessive duplication. Too much overlap wastes storage and can introduce duplicate results; too little risks losing context at boundaries.

Parent-Child (Small-to-Big) Retrieval

The parent-child strategy addresses the chunk-size dilemma by decoupling the retrieval unit from the context unit. Small chunks (child chunks, 100 to 200 tokens) are used for embedding and retrieval because their focused content produces precise embeddings. When a child chunk is retrieved, the system returns the larger parent chunk (500 to 1000 tokens) that contains it, providing the LLM with sufficient context to generate a high-quality answer. Code Fragment 19.4.5 below puts this into practice.

# Parent-child chunking strategy
import uuid
from typing import Dict, List

from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_parent_child_chunks(
    text: str,
    parent_chunk_size: int = 1000,
    child_chunk_size: int = 200,
    child_overlap: int = 20,
) -> List[Dict]:
    """
    Create a two-tier chunking structure for parent-child retrieval.

    Child chunks are used for embedding and retrieval.
    Parent chunks are returned for LLM context.
    """
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=parent_chunk_size,
        chunk_overlap=0,
    )
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=child_chunk_size,
        chunk_overlap=child_overlap,
    )

    parent_chunks = parent_splitter.split_text(text)
    all_chunks = []

    for parent_text in parent_chunks:
        parent_id = str(uuid.uuid4())

        # Store parent chunk
        all_chunks.append({
            "id": parent_id,
            "text": parent_text,
            "type": "parent",
            "parent_id": None,
        })

        # Create child chunks from this parent
        for child_text in child_splitter.split_text(parent_text):
            all_chunks.append({
                "id": str(uuid.uuid4()),
                "text": child_text,
                "type": "child",
                "parent_id": parent_id,
            })

    parents = [c for c in all_chunks if c["type"] == "parent"]
    children = [c for c in all_chunks if c["type"] == "child"]
    print(f"Created {len(parents)} parents, {len(children)} children")
    print(f"Avg parent size: {sum(len(p['text']) for p in parents) / len(parents):.0f} chars")
    print(f"Avg child size: {sum(len(c['text']) for c in children) / len(children):.0f} chars")

    return all_chunks

# Usage: embed children, retrieve parents
# At query time:
# 1. Search child embeddings for top-k matches
# 2. For each matching child, look up its parent_id
# 3. Return deduplicated parent chunks to the LLM

Created 3 parents, 14 children
Avg parent size: 812 chars
Avg child size: 178 chars
Code Fragment 19.4.5: Parent-child chunking strategy
Note: Sentence Window Retrieval

A variation of parent-child retrieval is sentence window retrieval. Each sentence is embedded individually for maximum retrieval precision. When a sentence matches, the system returns a window of surrounding sentences (e.g., 3 sentences before and after) as context. This provides a fine-grained retrieval unit with a flexible context window, and it avoids the need to predefine parent chunk boundaries. LlamaIndex provides a built-in SentenceWindowNodeParser for this pattern.
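The pattern is simple enough to sketch from scratch. The version below is a simplified illustration, not LlamaIndex's implementation; it reuses the naive regex sentence splitter from the semantic chunking example:

```python
import re
from typing import List, Tuple


def build_sentence_windows(text: str, window: int = 2) -> List[Tuple[str, str]]:
    """Pair each sentence (the unit to embed and retrieve) with its
    surrounding window (the context handed to the LLM on a match).
    window=2 keeps two sentences on either side."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = []
    for i, sent in enumerate(sentences):
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        pairs.append((sent, " ".join(sentences[lo:hi])))
    return pairs


text = ("HNSW builds a layered graph. Search starts at the top layer. "
        "It descends greedily. Recall depends on ef_search.")
pairs = build_sentence_windows(text, window=1)
print(pairs[2][0])  # the sentence that would be embedded
print(pairs[2][1])  # the window returned as context if it matches
```

At index time, the first element of each pair is embedded and the second is stored as metadata; at query time, a matching sentence returns its stored window instead of the bare sentence.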

5. Chunking Strategy Comparison

| Strategy | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Fixed-size | Simple, fast, predictable | Splits mid-sentence, ignores structure | Baseline, homogeneous text |
| Recursive | Respects natural boundaries, configurable | May still break complex elements | General purpose (default choice) |
| Semantic | Topic-coherent chunks, data-driven boundaries | Slower (requires embeddings), variable sizes | Long-form content, mixed topics |
| Structure-aware | Preserves document hierarchy, best quality | Requires structural parsing, format-specific | Structured docs (manuals, reports) |
| Parent-child | Precise retrieval with rich context | More complex pipeline, extra storage | High-stakes RAG applications |
| Sentence window | Maximum retrieval precision | Many embeddings, higher index cost | Q&A over dense technical content |

6. Production RAG ETL Pipelines

A production ingestion pipeline must handle document updates, deletions, and versioning in addition to initial loading. The key engineering challenges include:

Incremental Indexing

When documents are updated, you must re-chunk and re-embed only the changed documents, not the entire corpus. This requires tracking document versions (typically via content hashes or timestamps) and maintaining a mapping between source documents and their chunks in the vector database. Code Fragment 19.4.6 below puts this into practice.

# Incremental indexing with content hashing
import hashlib
import json
from pathlib import Path
from typing import Dict, List

class IncrementalIndexer:
    """
    Tracks document versions to enable incremental re-indexing.
    Only processes documents that have changed since the last run.
    """

    def __init__(self, state_file: str = "indexer_state.json"):
        self.state_file = Path(state_file)
        self.state: Dict[str, str] = {}
        if self.state_file.exists():
            self.state = json.loads(self.state_file.read_text())

    def content_hash(self, content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()

    def get_changes(
        self, documents: Dict[str, str]
    ) -> Dict[str, List[str]]:
        """
        Compare current documents against stored state.

        Args:
            documents: dict of {doc_id: content}

        Returns:
            {"added": [...], "modified": [...], "deleted": [...]}
        """
        current_ids = set(documents.keys())
        stored_ids = set(self.state.keys())

        added = current_ids - stored_ids
        deleted = stored_ids - current_ids
        modified = set()

        for doc_id in current_ids & stored_ids:
            new_hash = self.content_hash(documents[doc_id])
            if new_hash != self.state[doc_id]:
                modified.add(doc_id)

        return {
            "added": list(added),
            "modified": list(modified),
            "deleted": list(deleted),
        }

    def update_state(self, documents: Dict[str, str]):
        """Replace stored hashes after successful indexing.
        Rebuilding the dict also drops entries for deleted documents,
        so they are not reported as deleted again on the next run."""
        self.state = {
            doc_id: self.content_hash(content)
            for doc_id, content in documents.items()
        }
        self.state_file.write_text(json.dumps(self.state, indent=2))

    def process_changes(self, documents: Dict[str, str]):
        """Main entry point for incremental processing."""
        changes = self.get_changes(documents)

        print(f"Added: {len(changes['added'])} documents")
        print(f"Modified: {len(changes['modified'])} documents")
        print(f"Deleted: {len(changes['deleted'])} documents")

        # For added/modified: chunk, embed, upsert
        to_process = changes["added"] + changes["modified"]
        if to_process:
            print(f"Processing {len(to_process)} documents...")
            # chunk_and_embed(to_process)
            # vector_db.upsert(chunks)

        # For deleted: remove from vector DB
        if changes["deleted"]:
            print(f"Removing {len(changes['deleted'])} documents...")
            # vector_db.delete(filter={"doc_id": {"$in": changes["deleted"]}})

        # For modified: also remove old chunks before upserting new ones
        if changes["modified"]:
            print(f"Replacing chunks for {len(changes['modified'])} documents...")
            # vector_db.delete(filter={"doc_id": {"$in": changes["modified"]}})
            # vector_db.upsert(new_chunks)

        self.update_state(documents)

# Usage
indexer = IncrementalIndexer()
docs = {
    "report_2024.pdf": "Full text of the 2024 report...",
    "manual_v3.pdf": "Updated product manual content...",
    "faq.md": "Frequently asked questions...",
}
indexer.process_changes(docs)

Added: 3 documents
Modified: 0 documents
Deleted: 0 documents
Processing 3 documents...
Code Fragment 19.4.6: Incremental indexing with content hashing

Metadata Enrichment

Every chunk should carry metadata that enables effective filtering and attribution. Essential metadata fields include the source document identifier, page number, section title, ingestion or publication date, and access permissions.
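A common enrichment pattern, sketched below with an illustrative (not prescribed) field set, is to store the metadata alongside the chunk and prepend the title and section to the text that gets embedded, so the vector carries document-level context:

```python
from typing import Dict


def enrich_chunk(chunk_text: str, meta: Dict[str, str]) -> Dict:
    """Attach metadata to a chunk and build the text to embed.
    Prepending title and section gives an otherwise ambiguous passage
    (e.g. 'the rate increased to 3.5%') its document-level context.
    Field names here are illustrative, not a required schema."""
    embed_text = f"{meta.get('title', '')} | {meta.get('section', '')}\n{chunk_text}"
    return {
        "embed_text": embed_text,      # what the embedding model sees
        "display_text": chunk_text,    # what the LLM / user sees
        "metadata": meta,              # used for filtering and attribution
    }


record = enrich_chunk(
    "The rate increased to 3.5% in Q2.",
    {"title": "2024 Annual Report", "section": "Interest Rates",
     "source": "report_2024.pdf", "page": "12"},
)
print(record["embed_text"].splitlines()[0])  # 2024 Annual Report | Interest Rates
```

Keeping `embed_text` and `display_text` separate means the prepended context improves retrieval without duplicating boilerplate in the prompt.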

Warning: Common Chunking Mistakes

The most common mistakes in document processing are:

  1. Not evaluating chunking quality by measuring retrieval performance with different strategies and parameters on representative queries.
  2. Ignoring document structure by applying the same chunking strategy to all document types.
  3. Losing metadata context by stripping headers, section titles, or table captions during chunking.
  4. Using the default settings of your framework without tuning chunk size and overlap for your specific content and queries.
  5. Not handling tables and figures as special elements that should either be kept intact or described textually.

7. Evaluation and Iteration

Chunking is not a one-time configuration; it requires ongoing evaluation and tuning. The most effective approach is to build a small evaluation set of 50 to 100 representative queries with known relevant passages, then measure retrieval metrics (recall@k, MRR, NDCG) across different chunking configurations. Systematic A/B testing of chunking strategies often reveals that the optimal configuration depends heavily on the document type and query patterns specific to your application. Figure 19.4.6 shows the iterative evaluation loop for chunking quality.
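A minimal recall@k harness makes this comparison concrete. The query and chunk IDs below are toy data invented for the example:

```python
from typing import Dict, List, Set


def recall_at_k(results: Dict[str, List[str]],
                relevant: Dict[str, Set[str]], k: int = 5) -> float:
    """Fraction of queries whose top-k results contain at least
    one known-relevant chunk (hit rate)."""
    hits = sum(
        1 for query, ranked in results.items()
        if set(ranked[:k]) & relevant[query]
    )
    return hits / len(results)


# Ground truth: which chunks answer each query
relevant = {"q1": {"c3"}, "q2": {"c7"}, "q3": {"c1", "c9"}}

# Top-2 retrieval results under two chunking configurations
config_a = {"q1": ["c3", "c2"], "q2": ["c4", "c5"], "q3": ["c9", "c8"]}
config_b = {"q1": ["c3", "c2"], "q2": ["c7", "c4"], "q3": ["c1", "c9"]}

print(recall_at_k(config_a, relevant, k=2))  # 2 of 3 queries hit
print(recall_at_k(config_b, relevant, k=2))  # 1.0
```

Run the same eval set through each candidate chunking configuration, re-index, and compare; the configuration is only as good as its measured retrieval quality on your queries.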

Figure 19.4.6: Chunking quality requires iterative evaluation against representative queries with ground-truth relevance judgments.
Self-Check
Q1: Why is chunking quality often the most important factor in RAG system performance?
Show Answer
Chunking determines what the retriever can find. If a relevant answer is split across two chunks at an unfortunate boundary, the retriever may never surface it as a coherent result. If a chunk mixes two unrelated topics, its embedding will be a noisy average that matches neither topic well. No embedding model or vector database can compensate for poorly chunked documents because the quality of retrieval is fundamentally bounded by the quality of the units being retrieved.
Q2: What is the fundamental tradeoff in choosing chunk size?
Show Answer
Smaller chunks (100 to 200 tokens) produce more focused embeddings that match specific queries precisely, improving retrieval precision. However, they may lack sufficient context for the LLM to generate a complete answer. Larger chunks (500 to 1000 tokens) provide richer context but may cover multiple topics, making their embeddings less precise and reducing retrieval recall. The parent-child strategy resolves this by using small chunks for retrieval and returning larger parent chunks for LLM context.
Q3: How does semantic chunking differ from recursive character splitting?
Show Answer
Recursive character splitting uses predefined text separators (paragraph breaks, newlines, spaces) in a hierarchical order to find chunk boundaries. It is rule-based and deterministic. Semantic chunking uses the embedding model itself to determine boundaries: it embeds each sentence, computes similarity between consecutive sentences, and splits where similarity drops significantly. This produces chunks aligned with actual topic transitions rather than formatting conventions. Semantic chunking is slower (it requires embedding all sentences) but produces more coherent chunks for content with complex topic structure.
Q4: How does parent-child retrieval solve the chunk-size dilemma?
Show Answer
Parent-child retrieval decouples the retrieval unit from the context unit. Small child chunks (100 to 200 tokens) are embedded and used for retrieval because their focused content produces precise, topic-specific embeddings. When a child chunk matches a query, the system looks up its associated parent chunk (500 to 1000 tokens) and returns the parent to the LLM. This provides the precision of small chunks for matching while giving the LLM the broader context of large chunks for answer generation.
Q5: What is incremental indexing and why is it necessary for production systems?
Show Answer
Incremental indexing tracks document versions (via content hashes or timestamps) and processes only documents that have been added, modified, or deleted since the last indexing run. It is necessary because re-processing an entire corpus on every update is expensive and slow. A production system with thousands of documents that change daily must detect which documents have changed, remove old chunks for modified or deleted documents, and insert new chunks, all without reprocessing unchanged documents. This requires maintaining a mapping between source documents and their chunks in the vector database.
Tip: Tune HNSW Parameters for Your Use Case

The default HNSW index parameters (ef_construction, M) work for prototyping but not production. Higher ef_construction (256 to 512) improves recall at index build time cost; higher M (32 to 64) improves search quality at memory cost. Tune these based on your recall requirements.

Real-World Scenario: Optimizing Chunk Size for a Medical Knowledge Base

Who: An NLP engineer at a health-tech company building a clinical decision support tool

Situation: The system indexed 120,000 medical journal articles, clinical guidelines, and drug interaction databases. Physicians queried it during patient consultations expecting precise, citation-worthy answers.

Problem: Using a fixed 512-token chunk size produced fragments that split drug dosage tables, broke apart multi-step treatment protocols, and lost critical context about contraindications.

Dilemma: Larger chunks (1,024 tokens) preserved context but reduced retrieval precision because irrelevant content diluted the embedding signal. Smaller chunks (256 tokens) improved precision but often omitted the surrounding clinical context physicians needed.

Decision: The team implemented semantic chunking using section headers and paragraph boundaries, with a target range of 300 to 600 tokens per chunk. They added 2-sentence overlap between adjacent chunks and stored parent document IDs for context expansion at retrieval time.

How: A custom parser detected document structure (headers, lists, tables) and kept logical units intact. Tables were chunked as single units regardless of token count. Metadata (article title, section name, publication year) was prepended to each chunk before embedding.

Result: Answer accuracy (judged by physicians) improved from 71% to 89%. Retrieval precision@5 rose from 0.54 to 0.78, and physicians reported that returned passages were "immediately useful" rather than requiring manual context reconstruction.

Lesson: Chunking strategy should respect document structure rather than applying arbitrary token boundaries. Preserving logical units (tables, protocols, lists) and adding metadata context produces dramatically better retrieval quality.
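The metadata-prepending step from the scenario can be sketched in a few lines. The `enrich_chunk` helper and its field layout are illustrative assumptions, not a standard format; the point is that title, section, and year become part of the text that gets embedded.

```python
def enrich_chunk(chunk_text: str, title: str, section: str, year: int) -> str:
    """Prepend metadata so the embedding captures document context.
    (Field layout is illustrative, not a fixed standard.)"""
    header = f"[{title} | {section} | {year}]"
    return f"{header}\n{chunk_text}"

enriched = enrich_chunk(
    "Reduce dose by 50% in patients with renal impairment.",
    title="Drug X Prescribing Guide", section="Dosage", year=2023,
)
print(enriched)
```

Storing the same fields as structured metadata alongside the vector additionally enables filtered search (e.g., only guidelines published after 2020) and proper citation in answers.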

8. Topic Modeling with LLM Embeddings

Topic modeling discovers the latent themes in a collection of documents without requiring labeled data. Classical approaches like LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization) operate on bag-of-words representations, which discard word order and semantic nuance. BERTopic (Grootendorst, 2022) replaces this with a pipeline built on the same embedding models used for retrieval, producing topics that are semantically coherent and interpretable. Understanding BERTopic is valuable because the same embeddings you create for RAG (as covered in Section 19.1) can power topic discovery, clustering, and content organization without additional model training.

8.1 The BERTopic Pipeline

BERTopic operates in four sequential stages, each handled by a separate, swappable component:

  1. Embed: Convert each document to a dense vector using a sentence transformer model (the same models from Section 19.1).
  2. Reduce: Project the high-dimensional embeddings into a lower-dimensional space using UMAP, preserving local neighborhood structure while making clustering feasible.
  3. Cluster: Group similar documents using HDBSCAN, a density-based clustering algorithm that automatically determines the number of clusters and identifies outliers (documents that do not fit any topic).
  4. Represent: Label each cluster with descriptive terms using c-TF-IDF (class-based TF-IDF) or, optionally, an LLM that generates human-readable topic labels from the cluster's representative documents.

8.2 BERTopic vs. Classical Topic Models

| Dimension | LDA | NMF | BERTopic |
|---|---|---|---|
| Input representation | Bag-of-words (BoW) | TF-IDF matrix | Dense embeddings (sentence transformers) |
| Semantic awareness | None (word co-occurrence only) | Minimal (term weighting) | Full (contextual embeddings) |
| Number of topics | Must be specified upfront | Must be specified upfront | Automatically determined by HDBSCAN |
| Short text handling | Poor (sparse BoW vectors) | Poor | Good (dense embeddings capture meaning) |
| Topic coherence | Moderate | Good | Excellent (semantically grouped) |
| Scalability | Good (efficient inference) | Good | Moderate (embedding step is the bottleneck) |
| Dynamic topics | Not natively supported | Not natively supported | Built-in support for topics over time |

8.3 Practical Example

This snippet builds a BERTopic model from custom embedding, dimensionality reduction, and clustering components, then fits it on a document collection and inspects the discovered topics.

# BERTopic: embedding-based topic modeling
# Uses the same sentence transformers as RAG embedding pipelines
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Step 1: Configure each pipeline component
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(
    n_neighbors=15, n_components=5,
    min_dist=0.0, metric="cosine",
)
hdbscan_model = HDBSCAN(
    min_cluster_size=15,
    metric="euclidean",
    prediction_data=True,
)

# Step 2: Build the BERTopic model with custom components
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
)

# Step 3: Fit on your documents (e.g., customer support tickets)
documents = [
    "My order has not arrived after two weeks",
    "How do I reset my password for the dashboard?",
    "The API returns a 500 error on large payloads",
    "I was charged twice for the same subscription",
    "Can I export my data as a CSV file?",
    # ... thousands more documents
]

topics, probs = topic_model.fit_transform(documents)

# Step 4: Inspect discovered topics
topic_info = topic_model.get_topic_info()
print(topic_info.head(10))

# Step 5: Get the top terms for a specific topic
for topic_id in range(min(5, len(topic_model.get_topics()))):
    terms = topic_model.get_topic(topic_id)
    print(f"\nTopic {topic_id}:")
    for term, score in terms[:5]:
        print(f"  {score:.3f} {term}")
   Topic  Count  Name
0     -1      2  -1_default_topic
1      0      2  0_order_charged_subscription
2      1      1  1_password_reset_dashboard
3      2      1  2_api_error_payload

Topic 0:
  0.142 order
  0.128 charged
  0.115 subscription
  0.098 arrived
  0.087 weeks

Topic 1:
  0.189 password
  0.167 reset
  0.134 dashboard
...
Code Fragment 19.4.7: BERTopic: embedding-based topic modeling
Key Insight

Embedding-based topic models like BERTopic outperform bag-of-words approaches (LDA, NMF) because they capture semantic similarity rather than surface-level word co-occurrence. Two documents about "machine learning model deployment" and "putting ML systems into production" share no words in common, so LDA would assign them to different topics. BERTopic, working from dense embeddings, recognizes they discuss the same concept and clusters them together. This semantic awareness is especially valuable for short texts (tweets, support tickets, search queries) where bag-of-words vectors are too sparse to produce meaningful topics.
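The failure mode is easy to verify: the bag-of-words similarity between those two phrases is exactly zero. A quick stdlib check using Jaccard similarity over lowercase word sets makes the point without any model at all.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap: the only signal bag-of-words models can see."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

doc1 = "machine learning model deployment"
doc2 = "putting ML systems into production"
print(jaccard(doc1, doc2))  # → 0.0: no shared words, so BoW sees no relationship
print(jaccard(doc1, "deployment of a learning model"))  # → 0.5: surface overlap
```

A dense embedding model, by contrast, places `doc1` and `doc2` close together because it encodes meaning rather than vocabulary.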

Fun Fact

BERTopic can optionally use an LLM to generate human-readable topic labels. Instead of a topic being described as "deployment, production, inference, serving, latency," the LLM reads the cluster's representative documents and produces a label like "ML Model Deployment and Serving Infrastructure." This makes topic models immediately useful for non-technical stakeholders who need to understand what their customer base is talking about.

Key Takeaways
  • Chunking quality bounds RAG quality. No downstream component can compensate for chunks that split relevant information or mix unrelated topics.
  • Recursive character splitting is the best default for most text content, balancing simplicity with respect for natural text boundaries.
  • Semantic chunking produces the most coherent chunks by detecting topic boundaries via embedding similarity, at the cost of additional computation.
  • Structure-aware chunking is essential for formatted documents (PDFs, HTML, Markdown) where headings, tables, and figures define natural semantic units.
  • Parent-child retrieval resolves the chunk-size tradeoff by using small chunks for precise retrieval and large chunks for LLM context.
  • Always enrich chunks with metadata (source, page, section title, date) to enable filtered search and proper attribution.
  • Build an evaluation set of representative queries with known relevant passages, and systematically test chunking configurations against retrieval metrics.
  • Incremental indexing with content hashing is essential for production pipelines that process evolving document collections.
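The takeaway about building an evaluation set fits in a few lines of stdlib Python. The `evaluate` helper and the toy rankings below are hypothetical; real rankings would come from your retriever over each candidate chunking configuration.

```python
def evaluate(rankings: dict, relevant: dict, k: int = 5):
    """Compute Hit Rate@k and MRR given ranked chunk ids per query.
    rankings: {query: [chunk_id, ...]}, relevant: {query: chunk_id}."""
    hits, rr = 0, 0.0
    for q, ranked in rankings.items():
        target = relevant[q]
        if target in ranked[:k]:
            hits += 1
            rr += 1.0 / (ranked.index(target) + 1)
    n = len(rankings)
    return hits / n, rr / n

rankings = {
    "q1": ["c3", "c1", "c7"],   # relevant chunk at rank 1
    "q2": ["c2", "c5", "c9"],   # relevant chunk at rank 3
    "q3": ["c4", "c6", "c8"],   # relevant chunk not retrieved
}
relevant = {"q1": "c3", "q2": "c9", "q3": "c0"}
hit_rate, mrr = evaluate(rankings, relevant)
print(f"HitRate@5={hit_rate:.3f}  MRR={mrr:.3f}")  # → HitRate@5=0.667  MRR=0.444
```

Running this over chunk sizes of 128, 256, 512, and 1024 tokens turns the chunking decision into a measured comparison instead of a guess.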

Lab: Build and Compare Document Chunking Strategies

Duration: ~45 minutes Intermediate

Objective

Implement three different chunking strategies (fixed-size, recursive, semantic), apply them to a structured document, and compare their retrieval quality on test queries.

What You'll Practice

  • Implementing fixed-size chunking with character-based overlap
  • Building recursive text splitting using structural markers (headers, paragraphs)
  • Creating semantic chunking using embedding similarity breakpoints
  • Measuring retrieval quality differences between chunking strategies

Setup

The following cell installs the required packages and configures the environment for this lab.

pip install sentence-transformers numpy
Code Fragment 19.4.8: Installing lab dependencies

Steps

Step 1: Create a sample document

Define a structured document with clear section boundaries.

document = (
 "# Introduction to Machine Learning\n\n"
 "Machine learning is a branch of artificial intelligence that enables "
 "computers to learn from data without being explicitly programmed.\n\n"
 "## Supervised Learning\n\n"
 "In supervised learning, the algorithm learns from labeled training data. "
 "Each example consists of an input and a desired output.\n\n"
 "### Classification\n\n"
 "Classification predicts categorical labels. For example, an email spam "
 "filter classifies emails as spam or not spam. Popular algorithms include "
 "logistic regression, SVMs, and random forests.\n\n"
 "### Regression\n\n"
 "Regression predicts continuous numerical values. For instance, predicting "
 "house prices based on features like square footage and location.\n\n"
 "## Unsupervised Learning\n\n"
 "Unsupervised learning works with unlabeled data, seeking to discover "
 "hidden patterns and structures without target labels.\n\n"
 "### Clustering\n\n"
 "Clustering groups similar data points together. K-means partitions data "
 "into k groups. DBSCAN discovers clusters of arbitrary shape based on "
 "density. Hierarchical clustering builds a tree of nested clusters.\n\n"
 "### Dimensionality Reduction\n\n"
 "Dimensionality reduction compresses high-dimensional data. PCA finds "
 "directions of maximum variance. t-SNE and UMAP create 2D visualizations "
 "that preserve local neighborhood structure.\n\n"
 "## Deep Learning\n\n"
 "Deep learning uses neural networks with many layers to learn hierarchical "
 "representations. It has achieved breakthroughs in vision, NLP, and games."
)

print(f"Document: {len(document)} chars, {document.count(chr(10))} lines")
Document: 1042 chars, 18 lines
Code Fragment 19.4.9: Defining the sample document
Hint

This document has clear structural markers: # for h1, ## for h2, ### for h3, and blank lines between paragraphs. Good chunking should respect these boundaries.

Step 2: Implement three chunking strategies

Build fixed-size, recursive, and semantic chunkers.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Strategy 1: Fixed-size with overlap
def fixed_chunk(text, size=300, overlap=50):
    chunks, start = [], 0
    while start < len(text):
        chunk = text[start:start + size].strip()
        if chunk:
            chunks.append(chunk)
        start += size - overlap
    return chunks

# Strategy 2: Recursive splitting on headers
def recursive_chunk(text, max_size=500):
    # Split on "## " first, then "### " for oversized sections.
    # (Merging undersized chunks with neighbors is left as an extension.)
    sections = text.split("\n## ")
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_size:
            chunks.append(section)
        else:
            for sub in section.split("\n### "):
                sub = sub.strip()
                if sub:
                    chunks.append(sub)
    return chunks

# Strategy 3: Semantic chunking
def semantic_chunk(text, model, threshold=0.5):
    # Split into sentences, encode them, and start a new chunk
    # wherever similarity between consecutive sentences drops.
    sentences = [s.strip() for s in text.replace('\n', ' ').split('. ')
                 if len(s.strip()) > 10]
    if len(sentences) <= 1:
        return sentences
    embs = model.encode(sentences)
    norms = np.linalg.norm(embs, axis=1)
    chunks, current = [], [sentences[0]]
    for i in range(len(sentences) - 1):
        sim = np.dot(embs[i], embs[i + 1]) / (norms[i] * norms[i + 1] + 1e-8)
        if sim < threshold:
            chunks.append(". ".join(current))
            current = [sentences[i + 1]]
        else:
            current.append(sentences[i + 1])
    if current:
        chunks.append(". ".join(current))
    return chunks

c_fixed = fixed_chunk(document)
c_recursive = recursive_chunk(document)
c_semantic = semantic_chunk(document, model)

for name, chunks in [("Fixed", c_fixed), ("Recursive", c_recursive),
                     ("Semantic", c_semantic)]:
    print(f"\n{name}: {len(chunks)} chunks")
    for i, c in enumerate(chunks):
        print(f"  [{i}] {len(c)} chars: {c[:60]}...")
Fixed: 5 chunks
  [0] 300 chars: # Introduction to Machine Learning Machine learning is a b...
  [1] 300 chars: ed training data. Each example consists of an input and a d...
  [2] 300 chars: ps similar data points together. K-means partitions data in...
  [3] 300 chars: eduction compresses high-dimensional data. PCA finds direct...
  [4] 142 chars: Deep learning uses neural networks with many layers to lear...

Recursive: 6 chunks
  [0] 178 chars: # Introduction to Machine Learning Machine learning is a b...
  [1] 153 chars: Supervised Learning In supervised learning, the algorithm ...
  [2] 142 chars: Classification Classification predicts categorical labels....
  [3] 208 chars: Clustering Clustering groups similar data points together....
  [4] 195 chars: Dimensionality Reduction Dimensionality reduction compres...
  [5] 148 chars: Deep Learning Deep learning uses neural networks with many...

Semantic: 4 chunks
  [0] 312 chars: Machine learning is a branch of artificial intelligence tha...
  [1] 287 chars: Clustering groups similar data points together. K-means par...
  [2] 215 chars: Dimensionality reduction compresses high-dimensional data. ...
  [3] 148 chars: Deep learning uses neural networks with many layers to lear...
Code Fragment 19.4.10: Implementation of fixed_chunk, recursive_chunk, semantic_chunk
Hint

For semantic chunking, compute cosine similarity between consecutive sentence embeddings. A drop below the threshold signals a topic change, which is where you create a new chunk.

Step 3: Compare retrieval quality

Search each set of chunks and check which strategy finds the best match.

queries_expected = [
    ("What is classification in ML?", "classification"),
    ("How does clustering work?", "clustering"),
    ("What is PCA used for?", "dimensionality"),
    ("What is deep learning?", "deep learning"),
]

def search_chunks(query, chunks, model, top_k=1):
    qe = model.encode(query)
    ce = model.encode(chunks)
    scores = np.dot(ce, qe) / (np.linalg.norm(ce, axis=1) * np.linalg.norm(qe))
    idx = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i], scores[i]) for i in idx]

for query, keyword in queries_expected:
    print(f"\nQuery: {query}")
    for name, chunks in [("Fixed", c_fixed), ("Recursive", c_recursive),
                         ("Semantic", c_semantic)]:
        top_chunk, score = search_chunks(query, chunks, model)[0]
        hit = "PASS" if keyword.lower() in top_chunk.lower() else "MISS"
        print(f"  {name:10s} [{hit}] score={score:.3f} | {top_chunk[:55]}...")
Query: What is classification in ML?
  Fixed      [MISS] score=0.487 | ed training data. Each example consists of an in...
  Recursive  [PASS] score=0.612 | Classification Classification predicts categori...
  Semantic   [PASS] score=0.534 | Machine learning is a branch of artificial intel...

Query: How does clustering work?
  Fixed      [PASS] score=0.523 | ps similar data points together. K-means partiti...
  Recursive  [PASS] score=0.648 | Clustering Clustering groups similar data points...
  Semantic   [PASS] score=0.591 | Clustering groups similar data points together. ...

Query: What is PCA used for?
  Fixed      [PASS] score=0.481 | eduction compresses high-dimensional data. PCA f...
  Recursive  [PASS] score=0.573 | Dimensionality Reduction Dimensionality reducti...
  Semantic   [PASS] score=0.542 | Dimensionality reduction compresses high-dimensi...

Query: What is deep learning?
  Fixed      [PASS] score=0.612 | Deep learning uses neural networks with many lay...
  Recursive  [PASS] score=0.687 | Deep Learning Deep learning uses neural networks...
  Semantic   [PASS] score=0.654 | Deep learning uses neural networks with many lay...
Code Fragment 19.4.11: Implementation of search_chunks
Hint

Recursive chunking should perform best because chunks align with natural topic boundaries. Fixed-size chunks may split topics mid-sentence.

Expected Output

  • Three sets of chunks with different sizes (fixed: ~5, recursive: ~6, semantic: ~4 to 8)
  • Recursive and semantic chunking matching relevant chunks more reliably
  • Fixed-size chunking occasionally missing because topics get split at boundaries

Stretch Goals

  • Add metadata (section title, position) to each chunk and use it to improve retrieval context
  • Implement a "parent document retriever" that returns the larger parent section when a small chunk matches
  • Test on a real PDF by extracting text with PyMuPDF and applying the same strategies
Complete Solution
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

document = (
    "# Introduction to Machine Learning\n\n"
    "Machine learning enables computers to learn from data.\n\n"
    "## Supervised Learning\n\nLearns from labeled data.\n\n"
    "### Classification\n\nPredicts categorical labels like spam/not-spam.\n\n"
    "### Regression\n\nPredicts continuous values like house prices.\n\n"
    "## Unsupervised Learning\n\nFinds patterns in unlabeled data.\n\n"
    "### Clustering\n\nGroups similar points. K-means, DBSCAN, hierarchical.\n\n"
    "### Dimensionality Reduction\n\nCompresses data. PCA, t-SNE, UMAP.\n\n"
    "## Deep Learning\n\nNeural networks with many layers for hierarchical representations."
)

def fixed_chunk(text, size=300, overlap=50):
    chunks, start = [], 0
    while start < len(text):
        chunk = text[start:start + size].strip()
        if chunk:
            chunks.append(chunk)
        start += size - overlap
    return chunks

def recursive_chunk(text, max_size=500):
    chunks = []
    for sec in text.split("\n## "):
        sec = sec.strip()
        if not sec:
            continue
        if len(sec) <= max_size:
            chunks.append(sec)
        else:
            for sub in sec.split("\n### "):
                sub = sub.strip()
                if sub:
                    chunks.append(sub)
    return chunks

def semantic_chunk(text, model, threshold=0.5):
    sents = [s.strip() for s in text.replace('\n', ' ').split('. ')
             if len(s.strip()) > 10]
    if len(sents) <= 1:
        return sents
    embs = model.encode(sents)
    norms = np.linalg.norm(embs, axis=1)
    chunks, cur = [], [sents[0]]
    for i in range(len(sents) - 1):
        sim = np.dot(embs[i], embs[i + 1]) / (norms[i] * norms[i + 1] + 1e-8)
        if sim < threshold:
            chunks.append(". ".join(cur))
            cur = [sents[i + 1]]
        else:
            cur.append(sents[i + 1])
    if cur:
        chunks.append(". ".join(cur))
    return chunks

cf, cr, cs = fixed_chunk(document), recursive_chunk(document), semantic_chunk(document, model)

def search(q, chunks, model):
    qe = model.encode(q)
    ce = model.encode(chunks)
    scores = np.dot(ce, qe) / (np.linalg.norm(ce, axis=1) * np.linalg.norm(qe))
    i = np.argmax(scores)
    return chunks[i], scores[i]

for q, kw in [("What is classification?", "classification"),
              ("How does clustering work?", "clustering"),
              ("What is PCA?", "dimensionality"),
              ("What is deep learning?", "deep learning")]:
    print(f"\n{q}")
    for nm, ch in [("Fixed", cf), ("Recursive", cr), ("Semantic", cs)]:
        c, s = search(q, ch, model)
        print(f"  {nm:10s} [{'PASS' if kw in c.lower() else 'MISS'}] {s:.3f} | {c[:55]}")
What is classification?
  Fixed      [MISS] 0.487 | ed training data. Each example consists of an in
  Recursive  [PASS] 0.612 | Classification Predicts categorical labels like
  Semantic   [PASS] 0.534 | Machine learning enables computers to learn from
...
Code Fragment 19.4.12: Complete lab solution
Research Frontier

LLM-guided chunking uses language models to identify semantic boundaries in documents, producing chunks that align with topical shifts rather than arbitrary token counts. Late chunking (Jina AI, 2024) embeds the full document first and then splits the embedding sequence into chunks, preserving cross-chunk context that is lost with naive chunking. Proposition-based indexing decomposes documents into atomic factual statements before embedding, improving retrieval precision for fact-seeking queries. Research into multimodal document parsing (combining OCR, layout analysis, and vision models) is enabling chunking of complex documents with tables, figures, and mixed layouts.

Exercises

These exercises cover document parsing, chunking strategies, and topic modeling. Use LangChain or LlamaIndex document loaders for coding exercises.

Exercise 19.4.1: Chunk size tradeoff Conceptual

A 200-token chunk about climate change is highly precise for retrieval, but the LLM struggles to generate a good answer from it. A 1000-token chunk provides more context but retrieves poorly. How would you resolve this tension?

Show Answer

Use parent-child retrieval: embed small chunks (200 tokens) for precise retrieval, but pass the larger parent chunk (800 to 1000 tokens) to the LLM for generation. This gets the best of both worlds.

Exercise 19.4.2: Overlap rationale Conceptual

Explain why overlapping chunks improve retrieval quality. What percentage of overlap is typical, and what happens if overlap is too high?

Show Answer

Overlap ensures that information spanning a chunk boundary is captured in at least one chunk. Typical overlap is 10 to 15% of chunk size. Excessive overlap (e.g., 50%) wastes storage and creates near-duplicate embeddings that dilute search results.
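To see the mechanics, the sketch below slices without stripping so the shared region between neighboring chunks is byte-for-byte identical (`fixed_chunks_raw` is a hypothetical helper, not the lab's `fixed_chunk`).

```python
def fixed_chunks_raw(text: str, size: int = 300, overlap: int = 50):
    """Fixed-size slicing with overlap. No stripping, so the shared
    region between neighbors is byte-for-byte identical."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Varied content so the overlap equality below is non-trivial
text = "".join(str(i % 10) for i in range(1000))
chunks = fixed_chunks_raw(text)

# The last `overlap` chars of chunk i reappear at the start of chunk i+1,
# so a sentence straddling the boundary is intact in at least one chunk.
print(len(chunks), chunks[0][-50:] == chunks[1][:50])  # → 4 True
```

Here the overlap is 50/300, roughly 17% of chunk size, at the upper end of the typical 10 to 15% range.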

Exercise 19.4.3: Semantic chunking Conceptual

Compare fixed-size chunking with semantic chunking (splitting at topic boundaries). When does semantic chunking clearly outperform fixed-size, and when is fixed-size good enough?

Show Answer

Semantic chunking outperforms fixed-size when documents contain clear topic shifts (e.g., news articles, textbooks). Fixed-size is good enough when documents are homogeneous (e.g., product reviews, FAQ entries) or when the retrieval pipeline includes reranking.

Exercise 19.4.4: Parent-child retrieval Conceptual

Describe the parent-child chunking strategy. Why would you retrieve on small chunks but pass the parent chunk to the LLM?

Show Answer

Small child chunks produce more focused embeddings (better retrieval precision), while parent chunks give the LLM enough surrounding context to generate coherent, grounded answers.

Exercise 19.4.5: Multimodal documents Conceptual

You have a 200-page financial report with charts, tables, and prose. Compare two approaches: (a) extract text only and chunk, (b) use vision-based retrieval with ColPali. What are the tradeoffs?

Show Answer

(a) Text extraction loses visual layout, table structure, and chart data, but is faster and cheaper. (b) ColPali preserves visual information and handles charts/tables natively, but requires more compute and storage (one embedding per page patch). Best approach: use text extraction for prose-heavy sections and ColPali for pages with visual elements.

Exercise 19.4.6: Chunking comparison Conceptual

Take a 5-page document and chunk it three ways: fixed 512 tokens, recursive character splitting, and by paragraph boundaries. Count the chunks produced and examine where splits occur. Which method preserves semantic coherence best?

Exercise 19.4.7: Chunk evaluation Coding

Write an evaluation script: given a set of 20 questions with known answer passages, embed all chunks using a sentence transformer, retrieve top-5 for each question, and compute Hit Rate and MRR. Compare across chunk sizes of 128, 256, 512, and 1024 tokens.

Exercise 19.4.8: Topic modeling Coding

Use BERTopic on a collection of at least 500 documents (e.g., 20 Newsgroups). Visualize the discovered topics and compare them against the known categories.

Exercise 19.4.9: Parent-child implementation Coding

Implement a parent-child retrieval system where child chunks (128 tokens) are used for retrieval but parent chunks (512 tokens) are passed to the LLM. Compare answer quality against a flat 512-token chunking approach.

What Comes Next

In the next section, Section 19.5: Vision-Based Document Retrieval, we explore how vision-based retrieval with ColPali and ColQwen2 bypasses the text extraction pipeline entirely by processing document pages as images, enabling retrieval from visually rich content that text-based methods cannot handle.

References & Further Reading
Chunking Frameworks & Tutorials

LangChain Documentation: Text Splitters

Comprehensive guide to LangChain's text splitting abstractions including recursive character, token-based, and semantic chunking. Good starting point for understanding the API surface of chunking tools.

🎓 Tutorial

LlamaIndex Documentation: Node Parsers

LlamaIndex's approach to document parsing and node creation, including sentence-window and hierarchical parsers. Useful for comparing chunking philosophies across frameworks.

🎓 Tutorial

Kamradt, G. (2023). "Chunking Strategies for LLM Applications."

Practical, code-driven comparison of chunking strategies with visual examples. Includes the influential "five levels of chunking" framework from simple to semantic splitting.

🎓 Tutorial
Document Parsing Tools

Unstructured.io: Open-source Document Parsing

Industry-standard open-source library for extracting text and structure from PDFs, DOCX, HTML, images, and more. Handles complex layouts with table detection and OCR integration.

🔧 Tool

Nougat: Neural Optical Understanding for Academic Documents.

Meta's neural approach to converting academic PDFs (including equations and tables) to structured Markdown. Particularly effective for scientific papers where traditional OCR fails on mathematical notation.

📄 Paper

Marker: PDF to Markdown Converter

High-quality PDF to Markdown converter that preserves document structure, tables, and formatting. Faster and more accurate than many alternatives for general-purpose PDF extraction.

🔧 Tool