Section 36.2

Libraries and Frameworks

"A thin library is six function calls you understand; a thick framework is six abstractions hiding the same six function calls. Choose the one you can debug at 3 a.m."

PipPip, Dependency-Pruning AI Agent
Big Picture

Where the platform (Section 36.1) owns the index, the libraries own the pipeline that fills it and queries it. A production retrieval stack composes five library families: an embedding client (sentence-transformers, fastembed, infinity, FlagEmbedding, or a hosted-API SDK from Cohere, OpenAI, or Voyage); a reranker (cross-encoder or hosted reranker API); a retrieval orchestrator (LangChain, LlamaIndex, Haystack, DSPy) that wires the steps together; a document loader (Unstructured.io, LlamaParse, Docling, MarkItDown, pypdf, pymupdf) that turns PDFs and Office files into chunked text; and a hybrid-retrieval helper (rank-bm25, ranx, or in-engine fusion) that combines lexical and dense scores. The 2026 lesson is that "framework" and "library" are not the same: prefer thin libraries you control over thick frameworks that own your data flow, and use the orchestrators for what they are good at (caching, callbacks, evaluation hooks) rather than as a hidden control flow.

Prerequisites

This section assumes the retrieval platforms from Section 36.1 and the basic RAG orchestration patterns from Section 32.1.

The library choice is less consequential than the platform choice but more consequential than people think. A team that wires sentence-transformers directly into a Postgres-with-pgvector stack writes 200 lines of Python and owns every step. A team that picks LlamaIndex as the orchestrator can build the same system in 30 lines but inherits the framework's abstractions for chunking, retrieval, and post-processing, which can be hard to override when production requirements diverge from the framework's defaults. Both are valid; the wrong default is to assume that a framework is required when the underlying operations are six function calls.

36.2.1 Embedding libraries

Split panel cartoon. Left panel: a small thin library labelled "six function calls" is six neat building blocks the robot understands. Right panel: a thick framework labelled "six abstractions" is a tower of opaque crates with the same six blocks barely visible inside, while the robot scratches its head at 3 a.m.
Figure 36.2.1: A thin library is six function calls you can read. A thick framework is six abstractions hiding the same six function calls. Choose the one you can debug at 3 a.m.

Embedding libraries fall into three groups: open-weight runners (you load a model from Hugging Face and call .encode), high-throughput servers (Infinity, TEI, fastembed), and SDKs for hosted APIs. Each has a different operational profile and a different cost story.

Real-World Scenario
A typical encoding loop in sentence-transformers

The default open-weight embedder workflow is six lines (see code block below). The two production-relevant details: (1) the model-specific query prompt (BGE, E5, GTE all have different conventions; getting this wrong silently loses 2-5 NDCG points) and (2) normalize_embeddings=True so cosine similarity equals dot product in your vector database. Most "my retrieval is bad" tickets in 2024-25 traced back to missing one of these two.

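from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

# Placeholder corpus; replace with your own chunked passages.
documents = ["The Federal Reserve held rates steady.",
             "Paris is the capital of France."]

# BGE v1.5 expects an instruction prefix on queries; the document side is unprefixed.
# Passing the prefix explicitly avoids relying on a "query" prompt being
# registered in the model's configuration.
query_emb = model.encode("federal reserve interest rate decision",
                         prompt="Represent this sentence for searching relevant passages: ",
                         normalize_embeddings=True)
doc_embs = model.encode(documents, batch_size=64, normalize_embeddings=True)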
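
To make the normalization point concrete, here is a minimal standard-library sketch (no model involved; the vectors are made-up numbers and the helper names are illustrative, not part of sentence-transformers). It checks that the dot product of L2-normalized vectors equals the cosine similarity of the originals, while the raw dot product still depends on vector length.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-dimensional "embeddings" (hypothetical values).
query_vec = [3.0, 4.0, 0.0]
doc_vec = [1.0, 2.0, 2.0]

raw_dot = dot(query_vec, doc_vec)      # sensitive to vector length
cos_sim = cosine(query_vec, doc_vec)   # length-independent
unit_dot = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
```

If the index is configured for dot-product distance and the vectors are not normalized, longer vectors win regardless of direction, which is one way the "missing normalization" failure shows up.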

36.2.2 Reranking libraries

Rerankers run after the first-stage retriever, on a candidate set of ~50-200 documents, with a more expensive model that scores each query-document pair directly. Two architectural families dominate: cross-encoders (one transformer over [query; document]) and ColBERT-style late-interaction (per-token vectors with MaxSim aggregation). The hosted reranker APIs are usually cross-encoders behind the curtain.
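
The MaxSim aggregation used by late-interaction models can be sketched in a few lines of plain Python. This is an illustration of the scoring rule only, with hypothetical two-dimensional token vectors; a real ColBERT model produces the per-token vectors with a transformer, and the function names here are assumptions for the example.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token vector, take its best-matching document token
    vector, then sum those maxima over the query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy unit-length token vectors (hypothetical values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # matches both query tokens
doc_b_tokens = [[1.0, 0.0], [0.8, 0.6]]              # matches only the first well

scores = {
    "doc_a": maxsim_score(query_tokens, doc_a_tokens),
    "doc_b": maxsim_score(query_tokens, doc_b_tokens),
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A cross-encoder has no equivalent decomposition: it scores the concatenated pair in one forward pass, which is why document vectors can be precomputed for late interaction but not for cross-encoders.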

Library Shortcut
rerankers (Answer.AI) as the unified reranker API

Each reranker backend has its own loading API (sentence-transformers CrossEncoder, the Cohere SDK, Jina's HTTP client, PyLate's ColBERT loader). The rerankers package from Answer.AI (Clavie, 2024) collapses all of them into one Reranker("model-name", model_type=...) constructor, so swapping BGE for Cohere for ColBERT is a one-line change. Prefer rerankers when you expect to A/B-test reranker choices, or when you want a hosted-or-self-hosted toggle without rewriting glue code.

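# Shell (run once): pip install "rerankers[transformers]"
# `candidate_docs` below is assumed to be a list of strings from the first-stage retriever.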
from rerankers import Reranker

# Pick a model by string; the same call works for any backend.
ranker = Reranker("BAAI/bge-reranker-v2-m3", model_type="cross-encoder")
# ranker = Reranker("cohere", api_key="...", lang="en")      # hosted Cohere
# ranker = Reranker("answerdotai/answerai-colbert-small-v1", model_type="colbert")

results = ranker.rank(query="When was BGE-M3 released?", docs=candidate_docs)
top_ids = [result.doc_id for result in results.top_k(5)]
Code Fragment 36.2.1.1: One reranker API across cross-encoders, Cohere, and ColBERT.

36.2.3 Retrieval orchestration frameworks

Orchestration frameworks are not the index and not the embedder; they are the glue that wires them into a pipeline, plus the callbacks, retries, caching, evaluation hooks, and observability that production stacks need. The 2026 landscape has converged on four heavyweight options plus several lightweight competitors.

Key Insight
Framework abstractions leak when production requirements diverge

Every orchestration framework has reasonable defaults that work well for the demo and break when production requires something specific. The recurring pattern: you start on LangChain or LlamaIndex, hit a requirement the framework does not anticipate (custom chunking with metadata propagation, mid-pipeline reranking with a private model, a hybrid query with engine-specific filter syntax), and find that overriding the framework's behavior is harder than reading the underlying vendor SDK and writing the pipeline by hand. The right mental model: use frameworks for their callbacks and observability, drop to the vendor SDK for the actual retrieval call. The thinnest viable stack in 2026 is "vendor vector DB SDK + sentence-transformers + your own 50-line pipeline + LangSmith or LangFuse for tracing"; everything heavier should justify itself against this baseline.
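
As a rough illustration of what "your own pipeline plus a tracing hook" can look like, here is a minimal sketch. Everything in it is hypothetical: `run_pipeline`, the `on_step` callback signature, and the toy retrieve and rerank functions stand in for a vendor SDK call, a real reranker, and a tracing client such as LangSmith or LangFuse.

```python
import time
from typing import Callable, List, Tuple

def run_pipeline(query: str,
                 retrieve: Callable[[str, int], List[str]],
                 rerank: Callable[[str, List[str]], List[str]],
                 on_step: Callable[[str, float, int], None],
                 first_stage_k: int = 50,
                 final_k: int = 5) -> List[str]:
    """Retrieve, then rerank, reporting each step to a tracing callback."""
    start = time.perf_counter()
    candidates = retrieve(query, first_stage_k)
    on_step("retrieve", time.perf_counter() - start, len(candidates))

    start = time.perf_counter()
    ranked = rerank(query, candidates)[:final_k]
    on_step("rerank", time.perf_counter() - start, len(ranked))
    return ranked

# Toy stand-ins for the vector-store call and the reranker.
CORPUS = ["pgvector stores vectors in Postgres",
          "Qdrant is a vector database",
          "BM25 is a lexical ranking function"]

def toy_retrieve(query: str, k: int) -> List[str]:
    words = query.lower().split()
    return [d for d in CORPUS if any(w in d.lower() for w in words)][:k]

def toy_rerank(query: str, docs: List[str]) -> List[str]:
    words = query.lower().split()
    return sorted(docs, key=lambda d: -sum(w in d.lower() for w in words))

# The callback records (step name, result count); a real one would emit a span.
trace: List[Tuple[str, int]] = []
result = run_pipeline("vector database", toy_retrieve, toy_rerank,
                      on_step=lambda name, secs, n: trace.append((name, n)))
```

The control flow stays in code you own, and the only framework-shaped piece is the callback, which can be swapped for any tracing backend.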

Figure 36.2.2 makes the nesting concrete: every abstraction layer eventually bottoms out in one HTTP call.

A row of nested Russian matryoshka dolls in decreasing size, labeled LangChain, LlamaIndex, Anthropic SDK, and a tiny innermost doll labeled POST /v1/messages, while an engineer in glasses peers into the smallest doll with a flashlight.
Figure 36.2.2: Each framework wraps the one beneath it, and at the center of the nesting dolls is a single POST /v1/messages. When an abstraction leaks, debugging means opening every doll until you reach the HTTP call the framework was hiding.

36.2.4 Document loaders and parsers

"Garbage in, garbage out" is the recurring lesson of production RAG. The chunk that goes into your embedder determines the quality of your retrieval; the parser that produces that chunk from a PDF or DOCX or HTML determines the chunk. The 2026 parser landscape sorts into low-level libraries (pypdf, pymupdf) and high-level structure-aware parsers (Unstructured, LlamaParse, Docling, MarkItDown).

Library Shortcut: Marker for table-heavy and math-heavy PDFs

marker-pdf (Vik Paruchuri / Datalab, 2024) is an open-weights PDF-to-Markdown converter that combines layout detection, table recognition (Surya), and an OCR fallback into a single pipeline; on the standard PDF benchmarks (PubLayNet, FinTabNet, and Marker's own evaluation) it outperforms Unstructured and is competitive with LlamaParse on tables and math while remaining fully self-hostable. Prefer marker-pdf when your corpus is scientific papers, financial filings, or textbooks where tables and equations dominate retrieval quality.

# Shell: install the package.
#   pip install marker-pdf
# Shell: one-shot CLI; writes the Markdown and a metadata JSON under ./parsed/.
#   marker_single report.pdf --output_dir ./parsed/ --output_format markdown

# Python API for batch ingestion:
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

converter = PdfConverter(artifact_dict=create_model_dict())  # loads layout/OCR models once
rendered = converter("report.pdf")
markdown_text, images = rendered.markdown, rendered.images
Code Fragment 36.2.2.1: Marker as the table-and-math-aware PDF parser.

36.2.5 Hybrid retrieval helpers

When your platform does not run hybrid search natively (pgvector, Chroma, FAISS), you implement it yourself, with one library for the BM25 side (rank-bm25 or bm25s) and one for the fusion step (ranx).

Real-World Scenario
A minimal hybrid retriever with bm25s and ranx

When the vector store does not run hybrid for you, the pattern is two queries plus rank fusion (see code block below). The RRF k parameter (60 is the de facto default) controls how much weight late ranks get. For most production cases, RRF-with-default-k is within 1-2 NDCG points of a fully tuned weighted-sum fusion and does not require tuning.

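import bm25s
from ranx import Run, fuse

# Assumed to exist already: `collection` (a Chroma collection), `query` (a string),
# and `doc_ids`, a list mapping BM25 corpus positions to the Chroma string IDs.

# 1. Pre-built BM25 index (loaded from disk).
bm25 = bm25s.BM25.load("bm25_index/")

# 2. Dense top-k from your vector database (Chroma in this example).
dense_hits = collection.query(query_texts=[query], n_results=50)
dense_ids = dense_hits["ids"][0]
dense_dists = dense_hits["distances"][0]  # plain Python list; smaller = closer

# 3. BM25 top-k. Results have shape (n_queries, k), so take row 0.
sparse_idx, sparse_scores = bm25.retrieve(
    bm25s.tokenize(query), k=50, return_as="tuple")
sparse_ids = [doc_ids[int(i)] for i in sparse_idx[0]]

# 4. Reciprocal Rank Fusion via ranx (negate distances so higher = better).
dense_run = Run({"q": {d: -float(s) for d, s in zip(dense_ids, dense_dists)}})
sparse_run = Run({"q": {d: float(s) for d, s in zip(sparse_ids, sparse_scores[0])}})
fused = fuse([dense_run, sparse_run], method="rrf", params={"k": 60})
fused_scores = fused.to_dict()["q"]
top_ids = sorted(fused_scores, key=fused_scores.get, reverse=True)[:10]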
Algorithm 36.2.1: Reciprocal Rank Fusion (Cormack, Clarke & Buettcher 2009)

Reciprocal Rank Fusion (RRF) combines the ranked lists produced by a set of retrievers $Q$ into one by summing reciprocal-rank contributions:

$$\text{RRF}(d) = \sum_{q \in Q} \frac{1}{k + \text{rank}_q(d)}$$

where $\text{rank}_q(d)$ is document $d$'s rank in retriever $q$'s output (rank 1 = top), with $\text{rank}_q(d) = \infty$ if $d$ is not retrieved by $q$ (contribution drops to 0). The constant $k = 60$ (proposed in the original paper) downweights early-rank dominance: at rank 1 the contribution is $\frac{1}{61} \approx 0.0164$, at rank 10 it is $\frac{1}{70} \approx 0.0143$, at rank 100 it is $\frac{1}{160} \approx 0.00625$. Larger $k$ flattens the rank curve further; smaller $k$ rewards top positions more aggressively.

Worked example. Two retrievers, BM25 and a dense embedder, return the following ranks for three documents:

| Document | BM25 rank | Dense rank | RRF score ($k=60$) |
|---|---|---|---|
| $d_A$ | 1 | 5 | $\frac{1}{61} + \frac{1}{65} \approx 0.0318$ |
| $d_B$ | 3 | 2 | $\frac{1}{63} + \frac{1}{62} \approx 0.0320$ |
| $d_C$ | 10 | 1 | $\frac{1}{70} + \frac{1}{61} \approx 0.0307$ |

$d_B$ wins because it appears near the top in both rankings; the score is robust to a single retriever's outlier ranks. The key property is that RRF is score-scale-free: it does not require calibrated similarities, which is what makes it the de facto default for hybrid BM25 + dense pipelines where the two scores live in different units.
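
The worked example can be reproduced with a short pure-Python sketch of the RRF formula. The function names and the filler document IDs are illustrative assumptions; the fillers only pad each list so that $d_A$, $d_B$, and $d_C$ land at the ranks given in the table.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs (first element = rank 1).
    A document missing from a list contributes nothing for that list."""
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Return documents ordered by fused score, best first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

def build_ranking(placed, length, filler_prefix):
    """Put known documents at their 1-based ranks and pad the rest with fillers."""
    by_rank = {rank: doc for doc, rank in placed.items()}
    return [by_rank.get(r, f"{filler_prefix}{r}") for r in range(1, length + 1)]

bm25_ranking = build_ranking({"d_A": 1, "d_B": 3, "d_C": 10}, 10, "bm25_only_")
dense_ranking = build_ranking({"d_A": 5, "d_B": 2, "d_C": 1}, 10, "dense_only_")

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], k=60)
top3 = list(fused)[:3]
```

Here `top3` comes out as `d_B`, `d_A`, `d_C`: every filler appears in only one list, so none of them can outscore a document that both retrievers returned.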

Library Shortcut: The thinnest retrieval stack (2026)

For a small team standing up RAG from zero in a week, the canonical "four-library" recipe ships with no bespoke framework: sentence-transformers (BGE-M3 or NV-Embed encoding), Qdrant OSS (single container, filterable HNSW), RAGAS (reference-free faithfulness / context precision / context recall), and Arize Phoenix (trace-level retrieval observability and embedding-drift dashboards). Total install: pip install sentence-transformers qdrant-client ragas arize-phoenix. Add bm25s and ranx if hybrid retrieval is required; add a 50-line Python pipeline for ingest and query. This stack is the unrolled-loop equivalent of LangChain or LlamaIndex for a retrieval-first application; only adopt a heavier orchestrator once a concrete missing capability (callbacks, agent loops, multi-modal nodes) justifies it.

# End-to-end thinnest-viable RAG: encode, index, query, in three calls.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

encoder = SentenceTransformer("BAAI/bge-m3")
client = QdrantClient(":memory:")  # in-process instance, no server needed
# create_collection replaces the deprecated recreate_collection.
client.create_collection("docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE))

docs = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
embeddings = encoder.encode(docs, normalize_embeddings=True)
client.upsert("docs", points=[PointStruct(id=i, vector=v.tolist(),
              payload={"text": d}) for i, (v, d) in enumerate(zip(embeddings, docs))])

# query_points replaces the deprecated search; the hits are on .points.
hits = client.query_points("docs",
    query=encoder.encode("What is the capital of France?",
                         normalize_embeddings=True).tolist(),
    limit=2).points
Code Fragment 36.2.3: End-to-end thinnest-viable RAG: encode, index, query, in three calls.

36.2.6 Evaluation and observability libraries

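Retrieval evaluation libraries are the often-skipped fourth leg of the stack; they are what turn "the retrieval feels better" into a number you can track. The ones that appear elsewhere in this section are RAGAS for reference-free RAG metrics and Arize Phoenix, LangSmith, or LangFuse for trace-level observability.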
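
To show what "a number you can track" means in practice, here is a minimal sketch of two common ranking metrics, recall@k and NDCG@k, in plain Python. It assumes linear gains for NDCG (some implementations use exponential gains instead), and the document IDs and relevance labels are made up for the example; a dedicated evaluation library would compute the same quantities over a full query set.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG with linear gains; `relevance` maps doc ID -> graded gain,
    and documents not listed count as gain 0."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical labels and two system outputs for the same query.
relevance = {"d1": 2, "d2": 1}
before = ["d9", "d2", "d1", "d7"]   # relevant docs present but ranked low
after = ["d1", "d2", "d9", "d7"]    # same docs, better order

ndcg_before = ndcg_at_k(before, relevance, k=3)
ndcg_after = ndcg_at_k(after, relevance, k=3)
recall_before = recall_at_k(before, set(relevance), k=1)
recall_after = recall_at_k(after, set(relevance), k=1)
```

Both runs retrieve the same documents within the top three, so recall@3 would not distinguish them; NDCG@3 and recall@1 do, which is why it helps to track a rank-sensitive metric alongside recall.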

36.2.7 Comparing the orchestration frameworks

Table 36.2.1a: Retrieval orchestration frameworks (2026).
| Framework | Sweet spot | Abstraction | 2026 momentum |
|---|---|---|---|
| LangChain + LangGraph | General LLM apps, agent loops | Runnable + StateGraph | High; agent push dominates |
| LlamaIndex | RAG-first applications | Index + QueryEngine | High; retrieval-deep catalog |
| Haystack 2.x | Typed pipelines, production audit | Component + Pipeline | Stable; enterprise focus |
| DSPy | Optimizer-driven prompt programs | Module + Signature + Optimizer | Rising; research-community led |
| RAGatouille | ColBERT-first late interaction | One-line ColBERT API | Niche but growing |
| GraphRAG | Entity-graph augmented RAG | Ingest graph + query plan | Active research area |

36.2.8 The thinnest viable stack

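The most defensible 2026 retrieval stack for a small production team is sentence-transformers for embedding, pgvector for storage, bm25s for lexical search, ranx for fusion, BGE-Reranker for reranking, and pymupdf for parsing.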

This stack is six libraries plus one optional framework; it fits in a single repository and one person can hold the whole flow in their head. The recurring 2026 lesson is that adding more framework rarely makes retrieval better; tuning the embedder, the chunker, the reranker, and the BM25-vs-dense weight does. Figure 36.2.3 collects the six core ingredients on a single card.

A hand-illustrated recipe card on parchment titled The 2026 thinnest viable RAG stack, listing six ingredients with cute icons: Embedder sentence-transformers, Vector store pgvector, BM25 bm25s, Fusion ranx, Reranker BGE-Reranker, and Parser pymupdf.
Figure 36.2.3a: The thinnest viable RAG stack as a recipe card: sentence-transformers, pgvector, bm25s, ranx, BGE-Reranker, and pymupdf. Six small dependencies cover embedding, storage, lexical search, fusion, reranking, and parsing, which is the whole retrieval pipeline before any orchestration framework is added.
Warning: Pin everything, especially embedder versions

An embedder upgrade changes every vector in your index. The 2024 BGE v1.5 to BGE-M3 transition required full re-encoding for everyone who upgraded; the 2024-25 OpenAI text-embedding-ada-002 to text-embedding-3-large migration likewise. The right defaults: pin the embedder version in your dependency file, write the embedder model ID into the vector metadata so you can detect mixed-version corpora, and have a documented re-encoding playbook before you need it. Every team that has been bitten by this learned the lesson the same way: a quiet pip upgrade ships a "minor" embedder version bump and retrieval quality silently regresses.
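
A minimal sketch of the "write the embedder model ID into the vector metadata" advice follows. The record layout and the helper names (`make_record`, `embedder_census`, `find_stale`) are assumptions for illustration; in a real system the payload would live in your vector database and the census would be a metadata query.

```python
from collections import Counter

EMBEDDER_ID = "BAAI/bge-m3"  # the model the index is supposed to use

def make_record(doc_id, vector, text, embedder_id=EMBEDDER_ID):
    """Attach the embedder ID to the metadata of every stored vector."""
    return {"id": doc_id, "vector": vector,
            "payload": {"text": text, "embedder": embedder_id}}

def embedder_census(records):
    """Count stored vectors per embedder ID to spot mixed-version corpora."""
    return Counter(r["payload"].get("embedder", "unknown") for r in records)

def find_stale(records, expected=EMBEDDER_ID):
    """IDs of vectors that were not produced by the expected embedder."""
    return [r["id"] for r in records if r["payload"].get("embedder") != expected]

# Toy corpus in which one vector predates the embedder switch.
records = [
    make_record(1, [0.1, 0.2], "alpha"),
    make_record(2, [0.3, 0.4], "beta", embedder_id="BAAI/bge-large-en-v1.5"),
    make_record(3, [0.5, 0.6], "gamma"),
]

census = embedder_census(records)
stale_ids = find_stale(records)   # candidates for the re-encoding playbook
```

Running the census as a scheduled check turns a silent quality regression into an explicit alert: any count outside the expected embedder ID means the index holds vectors from incompatible spaces.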

36.2.9 Knowledge-graph and structured-retrieval libraries

For corpora where entity-and-relationship structure matters more than free-text similarity, the 2026 libraries that handle the structured side:

36.2.10 Chunking libraries

Chunking sits between the document parser and the embedder, and it is the single highest-leverage place to improve retrieval quality after the embedder choice. The 2026 chunking libraries:

The recurring chunking lesson: fixed-size chunking (split every N tokens) is the wrong default in 2026. Structure-aware chunking (split at headings, paragraph boundaries, or semantic breakpoints) gives 3-8 NDCG points over fixed-size on most corpora. The Anthropic contextual-retrieval recipe (prepend an LLM-generated context line to each chunk before embedding) adds another 5-15 points on top of structure-aware chunking. Both are cheap upgrades relative to changing the embedder.
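
The difference between fixed-size and structure-aware chunking is easy to see on a toy document. The sketch below is an assumption-laden illustration: it counts whitespace-separated words rather than model tokens, treats Markdown headings as the only structural boundary, and uses hypothetical function names.

```python
import re

def fixed_size_chunks(text, size):
    """Split every `size` words, ignoring document structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def heading_chunks(markdown_text):
    """Split immediately before each Markdown heading line, so every chunk
    keeps its own heading together with the text beneath it."""
    parts = re.split(r"^(?=#{1,6} )", markdown_text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

doc = (
    "# Setup\nInstall the package.\n\n"
    "## Usage\nCall encode on your text.\n\n"
    "## Limits\nInputs are truncated at 512 tokens.\n"
)

fixed = fixed_size_chunks(doc, size=6)
structured = heading_chunks(doc)
```

With a six-word window, the first fixed-size chunk ends on the `##` marker of the next section and the last one is a two-word fragment, while the heading-based split yields one self-describing chunk per section.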

Note
Library churn is high; pin versions and read changelogs

The 2024-25 LangChain and LlamaIndex releases broke their retriever APIs multiple times. The right defaults: pin minor versions in your dependency file, subscribe to the framework's release notes, and budget a few hours per quarter to absorb breaking changes. The library churn is the single biggest hidden cost of orchestration frameworks; treating it as a known operational tax (rather than a surprise) keeps the project moving.

What's Next?

In the next section, Section 36.3: Datasets and Benchmarks, we build on the material covered here.

Further Reading
Reimers, N. and Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019. arxiv.org/abs/1908.10084. The Sentence-BERT paper and the original sentence-transformers library. The reference for pooled-transformer dense retrieval that defines the open-weight embedder category.
Khattab, O. and Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." SIGIR 2020. arxiv.org/abs/2004.12832. The original ColBERT paper. Defines the late-interaction architecture that PyLate and RAGatouille re-implement for modern stacks.
Cormack, G. V., Clarke, C. L. A., and Buettcher, S. (2009). "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods." SIGIR 2009. The RRF paper. The k=60 default that every hybrid retrieval pipeline uses traces back to this 2009 work.
Es, S., James, J., Espinosa-Anke, L., and Schockaert, S. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EACL 2024 Demo. arxiv.org/abs/2309.15217. The RAGAS paper, introducing the reference-free RAG-evaluation metrics (faithfulness, answer relevance, context precision/recall) that became the 2024-26 default.
Khattab, O. et al. (2024). "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines." ICLR 2024. arxiv.org/abs/2310.03714. The DSPy paper. Defines the Module / Signature / Optimizer abstractions that underlie the framework's prompt-as-program design.