Section 36.4

Models

Big Picture

This section catalogs the embedding and reranking models that power retrieval-augmented generation in 2026. An embedding model maps a query or document into a fixed-length vector so that semantically similar items end up close to each other; a reranker then re-scores the top-k candidates returned by the embedding-based search with a heavier cross-encoder pass to maximize relevance. The section compares three architectural families: bi-encoders (separate query and document towers, ANN-friendly), cross-encoders (joint encoding, slower but more accurate), and ColBERT-style late-interaction models (per-token vectors plus MaxSim). For each, we name the dominant closed-API and open-weight choices, their multilingual and multimodal coverage, and the licensing constraints that matter when shipping to production. For the underlying contrastive training, see Section 31.1.
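The two-stage pattern described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the vectors are toy values, and `toy_pair_scores` is a hypothetical stand-in for a real cross-encoder's query-document scores.

```python
from typing import Callable, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    pair_score: Callable[[int], float],
    top_k: int,
    top_n: int,
) -> list[int]:
    # Stage 1: bi-encoder style search, one cheap dot product per document.
    by_vector = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_vector[:top_k]
    # Stage 2: the heavier scorer runs only on the top_k candidates.
    reranked = sorted(candidates, key=pair_score, reverse=True)
    return reranked[:top_n]


# Toy corpus of four 2-dim document vectors and one query vector.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.2]

# Hypothetical cross-encoder scores, keyed by document index.
toy_pair_scores = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}

result = retrieve_then_rerank(
    query_vec, doc_vecs, toy_pair_scores.__getitem__, top_k=3, top_n=2
)
print(result)
```

Document 2 has the highest stand-in reranker score but never reaches the second stage, because the first stage did not place it in the top 3. A reranker can reorder candidates; it cannot recover documents the embedder missed.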

Three retrieval-model architectures
Figure 36.4.1: The three retrieval architectures. Bi-encoders are the cheap default (one vector per doc, ANN-friendly), cross-encoders are the high-precision second stage (joint encoding, slow), and late-interaction (ColBERT) sits between them with per-token vectors and a MaxSim score that recovers much of the cross-encoder's accuracy at latency closer to a bi-encoder's.

"Change the embedder, re-encode the corpus. Pick the model after the eval set, never before; the leaderboards shortlist three, and your data picks one."

VecVec, Dimensionally-Particular AI Agent
Big Picture

The 2026 embedding-model landscape sorts into four camps. Closed-API embedders (OpenAI text-embedding-3, Cohere Embed v3 / Embed-4, Voyage AI voyage-3) are the easiest path with strong multilingual support and per-call pricing. Open-weight embedders (BGE-M3, NV-Embed, GTE-Qwen2, Stella, Snowflake Arctic Embed, mxbai-embed, Linq-Embed-Mistral, Jina Embeddings v3) are the self-host path with similar quality and full control of the model. Late-interaction embedders (ColBERTv2, ColPali, JinaColBERT) emit per-token vectors for higher recall on hard queries. And multimodal embedders (ColPali for documents, Cohere Embed v3 multimodal, JinaCLIP, Voyage multimodal) handle images and document layouts directly. The choice axes are dimension count (affects vector-store cost), license (closed APIs vs Apache 2.0 vs research-only), context length (512 vs 8K vs 32K), language coverage, and whether you need the matryoshka property that lets you truncate the vector at query time.
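To make the dimension-versus-cost axis concrete, the sketch below estimates raw vector storage for a corpus. It assumes float32 values and ignores ANN index overhead, metadata, and replication; the 10M-vector corpus size is a hypothetical example.

```python
def index_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for n_vectors embeddings of the given dimension."""
    return n_vectors * dim * bytes_per_value


n_vectors = 10_000_000  # hypothetical corpus size
for dim in (3072, 1024, 256):
    gib = index_bytes(n_vectors, dim) / 2**30
    print(f"{dim:>5} dims: {gib:.1f} GiB raw float32")
```

Storage scales linearly with dimension, which is why a truncatable (matryoshka) embedder gives a direct lever on vector-store cost.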

Prerequisites

This section assumes the embedding-model architectures from Section 3.1, the open-versus-closed licensing landscape from Section 10.6, and the multilingual considerations from Section 32.4.

Pick the model after you have an in-domain eval set; pick it before you ingest a corpus, because the embedder defines the vectors and changing the embedder forces a full re-encode. The leaderboards (Section 36.3) shortlist the candidates; your eval set decides between the top three. The dimension count, license, and context length below are 2026-accurate but change quarterly; always check the model card for the current state.
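Deciding between shortlisted models on an eval set needs a ranking metric. The sketch below implements NDCG@k with linear gain, one common formulation (some benchmarks use exponential gain instead). The labels and the two ranked lists are hypothetical toy data standing in for the output of two candidate embedders.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """NDCG@k for one query; unlabeled documents count as relevance 0."""
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy eval set: one query with two relevant documents.
labels = {"d1": 1, "d2": 1}
run_a = ["d1", "d2", "d9"]  # hypothetical ranking from candidate model A
run_b = ["d9", "d1", "d2"]  # hypothetical ranking from candidate model B

print(f"model A NDCG@3 = {ndcg_at_k(run_a, labels, k=3):.3f}")
print(f"model B NDCG@3 = {ndcg_at_k(run_b, labels, k=3):.3f}")
```

In practice the per-query scores are averaged over the whole eval set, and each candidate must be run with its own documented prompt convention for the comparison to be fair.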

36.4.1 Closed-API embedders

Closed APIs are the fastest path to retrieval that works. You pay per token (or per call), the model is opaque, and you cannot self-host. The 2026 standouts:

36.4.2 Open-weight embedders

Open-weight embedders are the right pick when self-hosting is required, when per-call API cost would dominate at scale, or when the model card needs to be inspectable. The 2026 leaders, mostly Apache 2.0 or MIT, runnable on a single 24GB GPU at production batch sizes:

36.4.3 Late-interaction and multi-vector models

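Late-interaction models emit per-token vectors and score each query-document pair with MaxSim: every query token is matched to its most similar document token, and the per-token maxima are summed. They outperform single-vector dense retrieval on hard out-of-domain queries at the cost of roughly 25-100x more storage per document, depending on passage length and the single-vector dimension being compared against. Use them as rerankers, or as primary retrievers with a vector store that supports multi-vector indexes.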

Algorithm 36.4.1: ColBERT MaxSim late interaction

ColBERT (Khattab & Zaharia 2020) replaces the single-vector dot product with a per-token MaxSim aggregation. Given query $q$ with $|q|$ tokens producing per-token embeddings $\{q_1, \ldots, q_{|q|}\}$ and document $d$ with $|d|$ tokens producing $\{d_1, \ldots, d_{|d|}\}$ (each typically 128-dim normalized), the relevance score is:

$$s(q, d) = \sum_{i=1}^{|q|} \max_{j=1, \ldots, |d|} q_i \cdot d_j$$

Every query token contributes its best-matching document token's similarity; no soft averaging, no pooling. The "late" interaction means token-level matching happens at scoring time rather than being baked into a single sentence vector at encoding time, which preserves rare-token sensitivity in a way single-vector embeddings cannot.
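The MaxSim formula above translates almost directly into code. The sketch below uses plain Python lists and 2-dim toy vectors for readability; a real ColBERT index would hold normalized 128-dim token vectors and use batched tensor operations.

```python
def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """s(q, d): sum over query tokens of the best-matching document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy unit-norm token vectors (2-dim for readability).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # contains an exact match per query token
doc_b = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches

print(maxsim_score(query_tokens, doc_a))
print(maxsim_score(query_tokens, doc_b))
```

Document A scores higher because each query token finds an exact match among its tokens, whereas document B's tokens only partially align with either query token. A single pooled vector would blur that distinction.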

Complexity and storage. Storage per document is $O(|d| \cdot d_{\text{tok}})$ where $d_{\text{tok}}$ is the per-token dimension (typically 128 in ColBERTv2, or 96 with PLAID quantization), versus $O(d)$ for a single-vector retriever (e.g. 1024 in BGE-M3). For an average 200-token passage with $d_{\text{tok}}=128$, that is $200 \cdot 128 \cdot 4 = 102{,}400$ bytes per document versus $1024 \cdot 4 = 4{,}096$ bytes single-vector: a $25\times$ storage premium. Score complexity is $O(|q| \cdot |d|)$ per pair: for $|q|=32$ tokens and $|d|=200$ tokens, that is 6,400 token-pair dot products versus 1 single-vector dot product, a $6{,}400\times$ ratio. PLAID indexes (Santhanam et al. 2022) and ColBERTv2's residual quantization cut both factors roughly $4\times$ to $8\times$, making the architecture viable for ~10M passages on commodity hardware.
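The arithmetic in the paragraph above can be checked directly. The sketch uses the same example sizes (200-token passage, 32-token query, 128-dim token vectors, 1024-dim single vector, float32). The multiply-add ratio in the last lines is an added observation, not a figure from the text: it accounts for each token-level dot product being 128-dim rather than 1024-dim.

```python
doc_tokens, query_tokens = 200, 32
token_dim, single_dim = 128, 1024
bytes_per_float = 4

# Storage per document.
multi_bytes = doc_tokens * token_dim * bytes_per_float
single_bytes = single_dim * bytes_per_float
storage_ratio = multi_bytes / single_bytes

# Scoring work per query-document pair.
pair_dot_products = query_tokens * doc_tokens
multi_macs = pair_dot_products * token_dim  # multiply-adds for MaxSim
single_macs = single_dim                    # multiply-adds for one dot product
mac_ratio = multi_macs / single_macs

print(f"storage: {multi_bytes} vs {single_bytes} bytes ({storage_ratio:.0f}x)")
print(f"dot products per pair: {pair_dot_products}")
print(f"multiply-adds per pair: {multi_macs} vs {single_macs} ({mac_ratio:.0f}x)")
```

Counting dot products gives the 6,400x figure; counting multiply-adds gives a smaller but still large gap, because the per-token vectors are shorter than the single pooled vector.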

36.4.4 Multimodal and cross-modal embedders

Multimodal embedders let you index and query across images, text, and sometimes audio in one shared vector space.

36.4.5 Reranker models

Reranker models are cross-encoders or late-interaction models that score query-document pairs on a candidate set. The 2026 leaders:

Real-World Scenario
Picking between BGE-M3, NV-Embed, and OpenAI text-embedding-3-large

The most common 2026 embedder decision is between an open-weight leader (BGE-M3 or NV-Embed) and OpenAI text-embedding-3-large. A worked decision:

For a healthcare RAG with PHI residency requirements: BGE-M3 wins (self-hostable, MIT, hybrid). For a research lab benchmarking against MTEB: NV-Embed wins (top scores, license OK for research). For a B2B SaaS prototyping in a week: OpenAI wins (one API key, ship Friday). All three are correct; the wrong default is to pick by leaderboard score alone.

36.4.6 Comparing the embedders

Table 36.4.1a: Embedding models (mid-2026).

| Model | Dim | Context | License | Best for |
| --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-large | 3072 (matryoshka) | 8K | API only | Closed-API default |
| Cohere Embed v3 / Embed-4 | 1024 (matryoshka) | 512 | API only | Multilingual hosted |
| Voyage voyage-3-large | 1024 (matryoshka) | 32K | API only | Long-doc, vertical-tuned |
| BGE-M3 | 1024 + sparse + 8x token | 8K | MIT | Open hybrid default |
| NV-Embed-v2 | 4096 (matryoshka) | 32K | CC-BY-NC-4.0 | MTEB-English top score |
| Stella en 1.5B v5 | 1024 (matryoshka) | 8K | MIT | Small + top MTEB |
| Snowflake Arctic Embed L 2.0 | 1024 (matryoshka) | 8K | Apache 2.0 | Enterprise open |
| GTE-Qwen2-7B | 3584 | 32K | Apache 2.0 | Long-context permissive |
| Jina Embeddings v3 | 1024 (matryoshka to 32) | 8K | CC-BY-NC-4.0 | Task-adapter routing |
| Linq-Embed-Mistral | 4096 | 32K | Apache 2.0 | Mistral-lineage open |
| mxbai-embed-large-v1 | 1024 (matryoshka) | 512 | Apache 2.0 | Small fast English |
| ColPali / ColQwen2 | per-patch x 128 | visual | MIT | Document-image RAG |
| BGE Reranker v2-m3 | (reranker) | 8K | MIT | Open reranker default |
| Cohere Rerank 3 | (reranker) | 4K | API only | Hosted reranker default |

36.4.7 Matryoshka and dimension tradeoffs

Matryoshka representation learning (MRL, Kusupati et al. 2022) trains an embedder so that the first k dimensions of the output are themselves a useful embedding, for many different k. Most leading 2024-26 embedders are described as supporting truncation: OpenAI text-embedding-3, Cohere Embed-4, Voyage voyage-3, BGE-M3, NV-Embed-v2, Stella, Snowflake Arctic Embed, Jina v3, and mxbai-embed. Not every row in Table 36.4.1a carries a matryoshka note (GTE-Qwen2-7B and Linq-Embed-Mistral do not, and the BGE-M3 row does not mention it), so confirm on the model card before truncating. The production implications:

Key Insight
The Matryoshka loss forces redundancy decay along coordinate order

The matryoshka training objective (Kusupati et al. 2022) is a sum of standard retrieval losses computed at a nested set of prefix lengths $K = \{k_1 < k_2 < \ldots < k_m\}$:

$$\mathcal{L}_{\text{MRL}} = \sum_{k \in K} c_k \cdot \mathcal{L}(z_{1:k}, y)$$

where $z \in \mathbb{R}^{d}$ is the full embedding, $z_{1:k}$ is its first-$k$ prefix, $y$ is the label (positive/negative pair), $\mathcal{L}$ is typically InfoNCE or a softmax cross-entropy, and $c_k$ are scalar mixing weights (the original paper uses uniform $c_k = 1$). The crucial mechanism: every prefix $z_{1:k}$ must function as a useful embedding under the same loss, so the optimizer is forced to pack the highest-information content into the lowest-index coordinates and to relegate residual variance to later coordinates. This implicit redundancy ordering — first dimensions carry coarse semantics, later dimensions carry fine-grained details — explains why $k=512$ matryoshka prefixes lose only 1-3 MTEB points versus the full $k=1024$ or $k=3072$, while truncating a non-matryoshka embedder by the same factor catastrophically scrambles its geometry. The reader-friendly mental model: classical PCA orders dimensions by variance after training; MRL trains the model to produce that ordering by construction.
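On the serving side, using a matryoshka prefix comes down to slicing and re-normalizing. The sketch below assumes the embedder was MRL-trained so that prefixes are meaningful; the vector is a toy value, and `truncate_embedding` is a hypothetical helper name.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def truncate_embedding(v, k):
    """Keep the first k coordinates, then re-normalize the prefix."""
    if not 0 < k <= len(v):
        raise ValueError(f"k must be in 1..{len(v)}, got {k}")
    return l2_normalize(v[:k])


# Toy 8-dim embedding whose magnitude decays along coordinate order.
full = l2_normalize([0.6, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1])
short = truncate_embedding(full, 4)

print(len(full), len(short))
print(round(math.sqrt(sum(x * x for x in short)), 6))
```

The re-normalization step matters: a prefix of a unit vector is generally shorter than unit length, so cosine or dot-product scores computed on raw prefixes would be on a different scale. Queries and documents must be truncated to the same k.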

Library Shortcut
sentence-transformers v3 for fine-tuning your own embedder

When an off-the-shelf model loses too many NDCG points on your domain, fine-tune. sentence-transformers v3 (Reimers, 2024) ships a SentenceTransformerTrainer that mirrors the Hugging Face Trainer API, plus the canonical losses for retrieval (MultipleNegativesRankingLoss for query-passage pairs, MatryoshkaLoss for nested truncation training, CachedGISTEmbedLoss for guide-model filtering of false in-batch negatives). A few thousand in-domain pairs and a few hours on a single A100 typically recover 5-10 NDCG points over the base checkpoint.

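# Install first (shell command, not Python): pip install -U sentence-transformers
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Toy placeholder data: replace with your own aligned in-domain (query, passage) pairs.
queries = ["what is late interaction?"]
passages = ["Late interaction scores per-token vectors with MaxSim."]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train = Dataset.from_dict({"anchor": queries, "positive": passages})

# Apply the same in-batch-negatives loss at each nested prefix length.
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model),
                      matryoshka_dims=[768, 512, 256, 128])

SentenceTransformerTrainer(model=model, train_dataset=train, loss=loss).train()
model.save_pretrained("./my-domain-embedder")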
Code Fragment 36.4.1b: Wrapping the retrieval loss in MatryoshkaLoss preserves the truncation-friendly geometry described above; drop the wrapper if you only need the full-dim model.

36.4.8 Licensing and deployment realities

License is the single most common 2024-26 deployment surprise. The recurring traps:

Warning
Read the embedder's prompt convention before measuring quality

Every modern instruction-tuned embedder (E5, BGE, GTE, Stella, NV-Embed, Linq-Embed) has a query prompt convention. E5 wants "query: ..."; BGE wants "Represent this sentence for searching relevant passages: ..."; NV-Embed wants a long task instruction. Getting the prompt wrong silently loses 2-10 NDCG points. The model card lists the convention; verify it before benchmarking. Nearly every "we benchmarked X and it underperformed" thread in 2024-25 traces back to one of three root causes once you dig in: missing query prompt, wrong instruction format, or skipped L2 normalization.
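One way to keep the convention from being forgotten is to route every query through a small formatting helper. The sketch below is illustrative only: the two prefixes are the ones quoted above, the family keys `"e5"` and `"bge"` are hypothetical labels, and any real deployment should copy the exact string from the model card.

```python
import math

# Query prefixes as quoted in the text above; confirm against the model card.
QUERY_PREFIXES = {
    "e5": "query: ",
    "bge": "Represent this sentence for searching relevant passages: ",
}


def format_query(family: str, text: str) -> str:
    """Prepend the documented query prefix, failing loudly on unknown families."""
    try:
        prefix = QUERY_PREFIXES[family]
    except KeyError:
        raise ValueError(
            f"No known prompt convention for {family!r}; check the model card"
        ) from None
    return prefix + text


def l2_normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


print(format_query("e5", "how does MaxSim work?"))
print(l2_normalize([3.0, 4.0]))
```

Raising on an unknown family is a deliberate choice: silently falling back to the bare query is exactly the failure mode that costs NDCG points without any visible error.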

Figure 36.4.2 shows what the same model looks like with and without its prompt convention:

Two identical BGE embedder characters at podiums: the left one uses the correct query prompt and wears a small crown labeled +10 NDCG, while the right one uses a plain prompt and frowns sadly with less retrieval quality.
Figure 36.4.2: The same embedder, two prompts. Wrapping the query in the model's documented convention (here BGE's "Represent this sentence for searching relevant passages: ...") is worth several NDCG points for free; skipping it silently throws that quality away.

36.4.9 Domain-specific and niche embedders

General-purpose embedders score well on aggregate benchmarks but lose 5-15 points to domain-tuned alternatives on in-domain data. The 2026 domain-specific landscape:

The recurring 2024-25 finding: a domain-tuned 110M-parameter model often beats a general-purpose 7B model on in-domain retrieval. Domain fine-tuning is the highest-leverage upgrade once you have a working general-purpose baseline.

36.4.10 Fine-tuning your own embedder

If a domain-tuned model does not exist for your domain, fine-tuning your own is the next step. The 2026 standard recipe:

The expected improvement from a single fine-tune is 3-10 NDCG points on in-domain data versus the base model. Beyond that, the next leverage points are the reranker (which often gives another 2-5 points) and the retrieval pipeline structure (hybrid, query rewriting, contextual retrieval).

Key Insight
InfoNCE — the contrastive loss every modern embedder is trained with

The training objective behind every BGE, E5, NV-Embed, Stella, and Linq-Embed model, and, as far as is publicly documented, most closed-API embedders in 2026, is InfoNCE (van den Oord et al. 2018, "Representation Learning with Contrastive Predictive Coding"), applied at the query-passage pair level. For a query $q$ with a positive passage $d^+$ and a batch of negatives $\{d_1^-, \ldots, d_N^-\}$:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(s(q, d^+) / \tau)}{\exp(s(q, d^+) / \tau) + \sum_{i=1}^{N} \exp(s(q, d_i^-) / \tau)}$$

where $s(q, d) = q \cdot d / (\|q\| \|d\|)$ is cosine similarity (or scaled dot product) and $\tau$ is the temperature.

Temperature. Small $\tau$ (e.g. 0.01-0.05) sharpens the softmax, forcing the model to push negatives strongly away from the query; large $\tau$ (e.g. 0.5-1.0) softens it. The 2024-25 community default sits around $\tau = 0.02$ for English retrieval and $\tau = 0.05$ for multilingual where the negative distribution is noisier. Too-small $\tau$ collapses positives into delta functions and amplifies label noise; too-large $\tau$ fails to separate near-duplicates.
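The effect of temperature is easy to see numerically. The sketch below computes the InfoNCE loss from precomputed cosine similarities and, separately, the share of softmax weight that each negative receives. The similarity values are toy numbers, and `negative_shares` is a hypothetical helper added for illustration.

```python
import math


def info_nce(pos_sim, neg_sims, tau):
    """InfoNCE loss for one query, given similarity scores and temperature."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def negative_shares(neg_sims, tau):
    """Softmax over the negatives only: how the repulsion is distributed."""
    logits = [s / tau for s in neg_sims]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


pos_sim = 0.80
neg_sims = [0.75, 0.30, 0.10]  # one hard negative, two easy ones

for tau in (0.02, 0.05, 0.5):
    loss = info_nce(pos_sim, neg_sims, tau)
    hardest = negative_shares(neg_sims, tau)[0]
    print(f"tau={tau}: loss={loss:.4f}, weight on hardest negative={hardest:.3f}")
```

At small temperatures almost all of the weight lands on the single hardest negative, which is what sharpens the ranking and also what amplifies the damage when that negative is actually a mislabeled positive. At large temperatures the weight spreads across easy negatives and the hard one gets less attention.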

Hard negatives. Random in-batch negatives are easy (a typical document is trivially distinct from a query); the gradient signal saturates quickly. ANCE-style hard-negative mining (Xiong et al. 2020) refreshes negatives every few epochs from the current model's top-100 retrieval errors, where each negative is "almost right" by current parameters. The empirical recipe: 5-10 hard negatives per query, refreshed every 2-4 epochs, with a small fraction (1-2%) screened by LLM-judge to filter false negatives (documents that are actually relevant but unlabeled). This step lifts MTEB scores by 3-7 points over random-negative baselines and is the most important reason BGE-family models outperform naive DPR.
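The core of the mining step can be sketched as follows: rank the corpus with the current model, take the top results, and keep the ones that are not labeled positive. This is a simplified illustration with toy vectors and brute-force search. It omits the periodic refresh and the false-negative screening described above, and `mine_hard_negatives` is a hypothetical helper name.

```python
def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(query_vec, doc_vecs, positive_ids, top_k=100, n_neg=5):
    """Return up to n_neg highest-ranked documents that are not labeled positive."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return [i for i in ranked[:top_k] if i not in positive_ids][:n_neg]


# Toy corpus embedded by the "current" model; document 0 is the labeled positive.
doc_vecs = [[1.0, 0.0], [0.9, 0.3], [0.8, 0.5], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.1]

hard_negatives = mine_hard_negatives(
    query_vec, doc_vecs, positive_ids={0}, top_k=4, n_neg=2
)
print(hard_negatives)
```

The mined documents are the ones the model currently confuses with the positive, which is why they carry more gradient signal than random negatives. It is also why some of them turn out to be unlabeled positives and need screening before training.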

Library Shortcut
A minimal fine-tune with sentence-transformers 3.x
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

# 1. Load a strong base model.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# 2. Prepare training data: one (query, positive, hard_negative) triplet per row.
#    `your_mined_triplets` is a placeholder for the output of your mining step.
train_triplets = Dataset.from_list([
    {"anchor": q, "positive": p, "negative": n}
    for q, p, n in your_mined_triplets
])

# 3. MultipleNegativesRankingLoss is an InfoNCE-style loss, not a triplet loss:
#    each row's mined negative is scored alongside the in-batch negatives.
loss = losses.MultipleNegativesRankingLoss(model)

# 4. Train.
trainer = SentenceTransformerTrainer(
    model=model, loss=loss, train_dataset=train_triplets,
    args=SentenceTransformerTrainingArguments(
        output_dir="bge-base-domain-finetuned",
        num_train_epochs=2, per_device_train_batch_size=64,
        learning_rate=2e-5, warmup_ratio=0.1,
    ),
)
trainer.train()
model.save_pretrained("bge-base-domain-finetuned")

The training step is a few dozen lines; the data preparation (mining hard negatives, filtering false negatives) is the part that takes a week. Budget 80% of the project time for the data, not the training loop.

What's Next?

In the next section, Section 36.5: External Reading and Communities, we build on the material covered here.

Further Reading
Chen, J. et al. (2024). "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation." BAAI. arxiv.org/abs/2402.03216. The BGE-M3 paper. Defines the dense + sparse + multi-vector multi-functional design that became the 2024-26 open-weight hybrid-retrieval default.
Lee, C. et al. (2024). "NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models." NVIDIA. arxiv.org/abs/2405.17428. The NV-Embed paper. Defines the decoder-LLM-as-encoder approach plus the training-recipe details that produced the highest-MTEB open embedder of 2024.
Kusupati, A. et al. (2022). "Matryoshka Representation Learning." NeurIPS 2022. arxiv.org/abs/2205.13147. The Matryoshka paper. The training technique that lets a single embedder serve many dimension counts and that powers the dimensions parameter in OpenAI, Cohere, Voyage, BGE-M3, NV-Embed, Stella, Snowflake Arctic, and Jina v3.
Faysse, M. et al. (2024). "ColPali: Efficient Document Retrieval with Vision Language Models." arXiv preprint. arxiv.org/abs/2407.01449. The ColPali paper. Defines the late-interaction-over-document-image approach that became the 2024 default for retrieving from visually-rich PDFs.
OpenAI (2024). "New Embedding Models and API Updates." OpenAI Blog, January 2024. openai.com/index/new-embedding-models-and-api-updates. Launch reference for text-embedding-3-large and text-embedding-3-small, the introduction of the dimensions parameter, and the matryoshka-style truncation that became standard across the closed-API category.
Wang, L. et al. (2024). "Improving Text Embeddings with Large Language Models." ACL 2024. arxiv.org/abs/2401.00368. The E5-Mistral paper. Defines the decoder-LLM-encoder pattern and the instruction-prefix-on-query technique that the GTE-Qwen2 and Linq-Embed lineages adopted.