Section 36.3

Datasets and Benchmarks

"Never trust an aggregate score; always read the per-subtask breakdown. Last year's leaderboard is this year's contamination report."

EvalEval, Benchmark-Skeptical AI Agent
Big Picture

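Retrieval benchmarks sort into four layers, from foundational to end-to-end: classical IR datasets, BEIR-style cross-domain benchmarks, MTEB embedding aggregates, and RAG-specific end-to-end benchmarks.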

Knowing which benchmark answers which question is the first step in evaluating retrieval systems honestly. The most common production mistake is to ship on an MTEB score and discover three months in that your in-domain numbers do not match.

Prerequisites

This section assumes the retrieval evaluation methodology from Section 31.8 and the embedding-model fundamentals from Section 3.1. The LLM-as-judge methodology is covered in detail later in the book.

The 2026 retrieval-benchmark landscape is shaped by two facts: training-data contamination has eroded the validity of every public benchmark older than a year, and embedding leaderboards have ceiling effects that make the top five models indistinguishable on overall score even though they still differ by 10+ NDCG points on individual subtasks. The mature reading practice is to never trust an aggregate score, always look at the per-subtask breakdown, and always pair public benchmark numbers with an in-domain evaluation set of at least 200 queries that you constructed yourself.

Four-tier benchmark pyramid for retrieval: classical IR at the bottom, BEIR cross-domain, MTEB embedding aggregates, and RAG-specific end-to-end benchmarks at the top
Figure 36.3.1: The retrieval benchmark hierarchy. Classical IR datasets at the base train and grade individual rankers; BEIR tests cross-domain transfer; MTEB aggregates embedder skills across tasks; RAG benchmarks test the full pipeline. Each tier asks a different question, and contamination eats into validity as you climb.

36.3.1 Classical retrieval datasets

These are the foundational datasets that taught the field what retrieval evaluation looks like. Most predate transformers; all remain in active use for both training and evaluation of dense retrievers.

36.3.2 Cross-domain and multilingual benchmarks

These benchmarks test transfer across domains and languages, which makes them critical when choosing an embedder whose training data does not match your corpus.

36.3.3 RAG-specific benchmarks

Retrieval benchmarks evaluate retrievers; RAG benchmarks evaluate the whole pipeline (retriever + reader + generator). The distinction matters because a perfect retriever can still produce a hallucinating answer, and an imperfect retriever can produce a correct one if the generator is robust.

36.3.4 Reranking and late-interaction benchmarks

36.3.5 Comparing the benchmarks

Table 36.3.1a: Retrieval and RAG benchmarks (2026).

| Benchmark | What it tests | SOTA score | Caveat |
|---|---|---|---|
| MS MARCO Passage | Web-scale passage retrieval | ~45 MRR@10 | Sparse judgments |
| TREC-DL 2019-2023 | Deeply judged MS MARCO | ~75 NDCG@10 | Same corpus as MS MARCO |
| BEIR (avg) | Cross-domain zero-shot | ~55 NDCG@10 | Per-task variance huge |
| MTEB (overall) | Embedder generalist score | ~72 | Aggregate hides subtasks |
| MIRACL (avg) | Multilingual retrieval | ~70 NDCG@10 | 18 languages only |
| HotpotQA (fullwiki) | Multi-hop retrieval | ~55 EM | Wikipedia-bound |
| NQ-open | Single-hop QA retrieval | ~60 EM | Wikipedia-bound |
| FRAMES | Multi-hop RAG with constraints | ~65 EM (agents) | Hard by design |
| LongRAG | Long-context RAG | ~50 EM | Long-context model dependent |

Warning: Contamination has eaten older benchmarks

MS MARCO, NQ, TriviaQA, and HotpotQA are all in the training data of every embedder released after 2022 and every LLM released after 2021. Their absolute numbers are no longer comparable across model generations; they are useful only as relative measures within a single model release. The 2024-25 BEIR re-runs by independent labs found "leaderboard inflation" of 5-15 NDCG points on the most-contaminated subtasks. The right reading practice is to weigh recent contamination-resistant benchmarks (FRAMES 2024, CRAG 2024, MS MARCO Web Search 2024, BRIGHT 2024) more heavily than the classical lineage when comparing 2024-26 embedders.

Fun Fact
The BM25 baseline still beats half of the dense retrievers on BEIR

The BEIR paper's central finding (December 2021) was that BM25, a 1994-era algorithm with three tunable constants, beat most state-of-the-art dense retrievers on out-of-domain BEIR datasets. The result was so embarrassing that it changed the field: every modern open-weight embedder (BGE, GTE, Stella, NV-Embed, Linq-Embed) now trains explicitly on the BEIR-style transfer setup with hard-negative mining and instruction-tuned queries. As of 2026, the best dense retrievers beat BM25 on the BEIR average by 8-12 NDCG points; but on specific tasks like fiqa (financial QA) and scifact (scientific fact-checking), BM25 is still within 1-2 points of the leaders. The lesson: never ship pure dense retrieval to a domain you have not benchmarked; hybrid retrieval is the safe default precisely because BM25 absorbs the failure modes of dense retrieval that the embedder authors did not train against.

Figure 36.3.2 sums up the surprise in one image:

Two characters share a park bench: on the left an elderly grandparent labeled BM25 (1994) in reading glasses and a cardigan, knitting; on the right a young hipster robot labeled Dense Retriever (2024) in sunglasses. Both hold scoreboards showing the same number and look surprised.
Figure 36.3.2: A 1994 lexical baseline and a 2024 neural retriever post nearly the same out-of-domain NDCG on several BEIR tasks. BM25 remains a stubbornly strong baseline, which is why hybrid retrieval, not pure dense, is the safe production default.

36.3.6 Leaderboards and where to read them

The 2026 active leaderboards every retrieval engineer should know:

36.3.7 Contamination-resistant and fresh benchmarks (2024-26)

Because every classical benchmark has now been in some embedder's training set, the 2024-26 field has produced a wave of contamination-resistant benchmarks. These are the ones to weight most heavily when picking a recent embedder:

36.3.8 Information extraction benchmarks

Retrieval and information extraction overlap heavily in 2026 because LLMs do both with the same prompt structure. The relevant benchmarks for the IE side:

A three-layer cake on a stand in watercolor: the largest bottom layer reads Public benchmarks - noisy, the middle layer reads MTEB shortlist - top 10, and the smallest top layer with a cherry reads Your 200-query gold set, with a friendly chef standing beside it holding a label-pen.
Figure 36.3.3: The evaluation hierarchy as a tiered cake: broad public benchmarks at the noisy base, an MTEB shortlist in the middle, and your own small in-domain gold set on top. The higher the tier, the smaller and more trustworthy the signal, which is why the next section builds that top layer by hand.

36.3.9 Building your own evaluation set

Every public benchmark is wrong for your use case in some way. The standard production practice is to build an in-domain evaluation set of 200-2000 query-answer pairs. The 2026 best practices:

Key Insight
Public leaderboards pick the embedder; your eval set ships the product

The right way to use leaderboards is as a shortlister: filter to the top-10 embedders on a leaderboard whose tasks resemble yours, then run all 10 against your in-domain eval set and pick by your own NDCG numbers. The wrong way is to pick the top-1 on MTEB and ship. Every retrieval war story in 2024-25 has the same structure: "we picked the best model on MTEB, it scored 5 points below a worse-on-MTEB model on our actual queries, and we caught it three months too late". The leaderboards do real work; they shortlist. Your eval set decides.
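The shortlist-then-decide workflow can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed pipeline: the model names, the leaderboard order, and the per-query NDCG@10 values are all made-up placeholders, and `rank_by_in_domain` is a hypothetical helper introduced only for this example.

```python
from statistics import mean

def rank_by_in_domain(per_query_ndcg):
    """Sort candidate names by mean in-domain NDCG, best first."""
    return sorted(per_query_ndcg,
                  key=lambda name: mean(per_query_ndcg[name]),
                  reverse=True)

# Hypothetical shortlist in public-leaderboard order (placeholder names).
leaderboard_order = ["model_a", "model_b", "model_c"]

# Made-up per-query NDCG@10 on your own gold set; a real set would
# hold 200+ queries rather than four.
per_query_ndcg = {
    "model_a": [0.60, 0.55, 0.70, 0.50],
    "model_b": [0.72, 0.66, 0.75, 0.61],
    "model_c": [0.58, 0.49, 0.66, 0.52],
}

in_domain_order = rank_by_in_domain(per_query_ndcg)
print(in_domain_order)  # ['model_b', 'model_a', 'model_c']
```

In this toy data the leaderboard's second-place model wins in-domain, which is exactly the situation the war stories describe.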

36.3.10 Evaluation metrics: the numbers themselves

Knowing the benchmark is half the story; knowing the metric is the other half. The 2026 evaluation-metric inventory you should be fluent with:

The recurring mistake in 2024-25 RAG evaluation reports is to report a single number; the right report is at least a triple: a retrieval metric (NDCG@10 or Recall@k), an answer-quality metric (EM, F1, or LLM-judge correctness), and a faithfulness or hallucination metric.

Algorithm 36.3.1: Metrics primer: NDCG, MRR, MAP, Recall@k, BM25

The canonical formulas every retrieval engineer should be able to write from memory.

Discounted Cumulative Gain. For top-$k$ results with graded relevance $\text{rel}_i \in \{0, 1, 2, \ldots\}$ (Järvelin & Kekäläinen 2002):

$$\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}, \qquad \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$

where $\text{IDCG@}k$ is the DCG of the ideal ranking (relevant docs sorted by relevance descending). The $\log_2(i+1)$ denominator is the rank discount; the $2^{\text{rel}_i} - 1$ numerator gives exponential reward to higher-graded documents and is what makes NDCG sensitive to relevance levels rather than just binary hits.

Mean Reciprocal Rank. For one correct answer per query:

$$\text{MRR@}k = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q^*}$$

where $\text{rank}_q^*$ is the rank of the first relevant document (or $\infty$ if no relevant document in top-$k$, contributing 0).
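A minimal sketch of MRR@k as defined above, assuming each run is a ranked list of document ids and each gold entry is a set of relevant ids. The function name and the toy ids are illustrative, not taken from any library.

```python
def mrr_at_k(runs, gold, k=10):
    """Mean reciprocal rank of the first relevant doc within the top k."""
    total = 0.0
    for qid, ranked in runs.items():
        relevant = gold.get(qid, set())
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
        # no hit in the top k contributes 0
    return total / len(runs)

runs = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8", "d7"]}
gold = {"q1": {"d1"}, "q2": {"d5"}}
print(mrr_at_k(runs, gold))  # 0.25
```

Here q1 finds its answer at rank 2 (reciprocal rank 0.5) and q2 misses entirely (0), so the mean is 0.25.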

Mean Average Precision. Average over recall levels:

$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{|R_q|} \sum_{i=1}^{|R_q|} \text{Prec}@\text{rank}(r_i)$$

where $R_q$ is the set of relevant documents for query $q$ and $\text{rank}(r_i)$ is the rank of the $i$-th relevant document.
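The MAP formula can be sketched the same way, under the usual convention that a relevant document that is never retrieved contributes a precision of 0. Function names and ids are illustrative assumptions.

```python
def average_precision(ranked, relevant):
    """Sum of precision at each relevant hit, divided by |R_q|."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs, gold):
    """Average AP over all queries that have gold judgments."""
    return sum(average_precision(runs.get(q, []), gold[q])
               for q in gold) / len(gold)

ranked = ["d1", "d4", "d2", "d5"]
relevant = {"d1", "d2", "d3"}
print(round(average_precision(ranked, relevant), 3))  # 0.556
```

The hits at ranks 1 and 3 give precisions 1/1 and 2/3; the unretrieved d3 adds nothing, so AP = (1 + 2/3) / 3.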

Recall@k. Fraction of relevant documents present in the top-$k$:

$$\text{Recall@}k = \frac{1}{|Q|} \sum_{q \in Q} \frac{|\{\text{relevant docs in top-}k\}|}{|R_q|}$$
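A short sketch of Recall@k under the same assumed data layout (ranked id lists per query, sets of relevant ids as gold). Queries with no judged relevant documents are skipped here, which is one common convention rather than the only one.

```python
def recall_at_k(runs, gold, k):
    """Mean fraction of each query's relevant docs found in the top k."""
    per_query = []
    for qid, relevant in gold.items():
        if not relevant:
            continue  # undefined for queries without relevant docs
        top_k = set(runs.get(qid, [])[:k])
        per_query.append(len(top_k & relevant) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0

runs = {"q1": ["d1", "d4", "d2"], "q2": ["d7", "d8", "d9"]}
gold = {"q1": {"d1", "d2", "d3"}, "q2": {"d9"}}
print(round(recall_at_k(runs, gold, k=2), 3))  # 0.167
print(round(recall_at_k(runs, gold, k=3), 3))  # 0.833
```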

BM25 (Robertson & Walker 1994). Lexical score of document $d$ for query $q = \{t_1, \ldots, t_n\}$:

$$\text{BM25}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot (1 - b + b \cdot |d| / \text{avgdl})}$$

with $\text{IDF}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}$, $f(t,d)$ = term frequency in $d$, $|d|$ = document length in tokens, $\text{avgdl}$ = mean document length, and tunables $k_1 \in [1.2, 2.0]$ (term-frequency saturation), $b \in [0.5, 0.85]$ (length normalization). Lucene defaults $k_1 = 1.2$, $b = 0.75$.
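The BM25 formula translates almost line for line into Python. The sketch below assumes documents are already tokenized into lists of terms; the function name and the five-document toy corpus are illustrative. Note that the IDF exactly as written above turns negative for terms that occur in more than half the corpus, which is why Lucene's implementation adds 1 inside the logarithm; this sketch keeps the textbook form.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized doc against the query with textbook BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # n_t: number of documents containing each term
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            n_t = doc_freq.get(t, 0)
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5))
            f = tf[t]  # 0 when the term is absent from this doc
            norm = f + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    ["tirzepatide", "dosing", "guide"],
    ["semaglutide", "dosing", "guide"],
    ["semaglutide", "pricing", "overview"],
    ["insulin", "storage", "guide"],
    ["metformin", "side", "effects"],
]
scores = bm25_scores(["tirzepatide", "dosing"], docs)
print(scores.index(max(scores)))  # 0
```

The first document wins because it matches the corpus-rare term as well as the more common one; the second document matches only the common term and scores lower.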

Worked NDCG@5 example. A retriever returns five documents with graded relevance $(\text{rel}_1, \ldots, \text{rel}_5) = (3, 2, 0, 0, 1)$ (3 = perfect, 0 = irrelevant). Then $\text{DCG@5} = \frac{2^3-1}{\log_2 2} + \frac{2^2-1}{\log_2 3} + 0 + 0 + \frac{2^1-1}{\log_2 6} = 7 + 1.893 + 0.387 = 9.280$. The ideal ranking $(3, 2, 1, 0, 0)$ gives $\text{IDCG@5} = 7 + 1.893 + \frac{1}{\log_2 4} + 0 + 0 = 9.393$, so $\text{NDCG@5} = 9.280 / 9.393 = 0.988$. Almost-ideal because the only fault is putting the rel=1 doc in rank 5 instead of rank 3.
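The worked example can be checked with a few lines of Python. This sketch assumes the graded list contains every judged-relevant document for the query, so the ideal ranking is just the same grades sorted in descending order; the function names are illustrative.

```python
import math

def dcg_at_k(rels, k):
    """DCG@k with exponential gain 2^rel - 1 and a log2(i + 1) discount."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """NDCG@k: DCG of the given order divided by DCG of the ideal order."""
    idcg = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0

rels = [3, 2, 0, 0, 1]
print(round(dcg_at_k(rels, 5), 3))   # 9.28
print(round(ndcg_at_k(rels, 5), 3))  # 0.988
```

If some relevant documents were never retrieved, the ideal ranking must be built from the full set of judgments instead, otherwise NDCG is overstated.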

Library Shortcut: ranx for IR metrics and rank fusion

Do not roll your own NDCG, MRR, or Reciprocal Rank Fusion. ranx (Bassani 2022) is a Python successor to trec_eval: it computes every metric in the inventory above on a pair of Qrels (gold) and Run (system) objects, supports significance testing when comparing runs (paired Student's t-test, Fisher's randomization test, and Tukey's HSD), and fuses multiple runs (RRF, CombSUM, weighted) in two lines. It is a widely used tool in 2024-26 retrieval research and a convenient backend for BEIR-style ablations.

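# Install first: pip install ranx
from ranx import Qrels, Run, evaluate, fuse

# Gold judgments: query id -> {doc id: graded relevance}
qrels = Qrels({"q1": {"d1": 2, "d2": 1}})
# System outputs: query id -> {doc id: retrieval score}
bm25  = Run({"q1": {"d1": 0.9, "d3": 0.4}})
dense = Run({"q1": {"d2": 0.8, "d1": 0.7}})

# Score a single run, then fuse both runs with Reciprocal Rank Fusion
ndcg = evaluate(qrels, bm25, "ndcg@10")
hybrid = fuse(runs=[bm25, dense], norm="min-max", method="rrf")
print(evaluate(qrels, hybrid, ["ndcg@10", "mrr@10", "recall@100"]))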
Code Fragment 36.3.1b: fuse(...) is the canonical RRF call referenced in the Key Insight below; replace method="rrf" with "wsum" for weighted blends.
Key Insight
Why hybrid BM25 + dense almost always wins on out-of-domain data

The BEIR finding that BM25 still beats half of dense retrievers on out-of-domain tasks (Thakur et al. 2021) has a clean mechanistic explanation, sharpened by Sciavolino et al. (2021)'s entity-question study. Dense bi-encoders pool a passage into a single $d$-dimensional vector by averaging or [CLS]-pooling over token embeddings; this projection is dominated by frequent terms in the encoder's training distribution. Rare entity tokens (an obscure drug name, a niche product SKU, a 2024-launched startup) appear so rarely during pretraining that their token vectors sit close to the centroid of frequent neighbors after pooling, so dense retrieval cannot distinguish documents that mention "Tirzepatide" from those that mention "Semaglutide" if neither token was well-trained. BM25 does not have this failure mode: its $\text{IDF}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}$ term assigns very high weight to corpus-rare terms, so an exact match on a rare entity dominates the score. Hybrid retrieval ($\text{RRF}$ over BM25 and dense lists) gets both: dense captures paraphrase and synonymy ("monetary policy" vs "interest rate decision"), BM25 captures rare-entity exact match. The 2024-25 production consensus that hybrid is the safe default is empirical confirmation of this asymmetric coverage.
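To make the fusion step concrete, here is a from-scratch sketch of Reciprocal Rank Fusion for illustration only; for real evaluations a tested library implementation is the better choice. It assumes the standard formulation from Cormack, Clarke and Büttcher (2009), where each list contributes $1 / (k + \text{rank})$ with the commonly used constant $k = 60$; the document ids are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# BM25 puts the rare-entity exact match first; dense prefers a near neighbor.
bm25_list = ["d_tirzepatide", "d_pricing", "d_semaglutide"]
dense_list = ["d_semaglutide", "d_tirzepatide", "d_overview"]

fused = reciprocal_rank_fusion([bm25_list, dense_list])
print(fused)
# ['d_tirzepatide', 'd_semaglutide', 'd_pricing', 'd_overview']
```

Because RRF uses only ranks, it needs no score normalization across the two retrievers, and a document ranked well by both lists beats one ranked first by only a single list.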

What's Next?

In the next section, Section 36.4: Models, we build on the material covered here.

Further Reading
Nguyen, T. et al. (2016). "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset." CoCo@NeurIPS 2016. arxiv.org/abs/1611.09268. The MS MARCO paper. The training-data substrate for nearly every modern dense retriever and the canonical web-scale passage benchmark.
Thakur, N. et al. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS 2021 Datasets and Benchmarks. arxiv.org/abs/2104.08663. The BEIR paper. Defines the 18-task zero-shot evaluation that exposed dense retrievers' transfer weakness and motivated the hybrid-retrieval consensus.
Muennighoff, N. et al. (2023). "MTEB: Massive Text Embedding Benchmark." EACL 2023. arxiv.org/abs/2210.07316. The MTEB paper. It introduces the aggregate benchmark (8 task types over 58 datasets, 56 of them in the English average) behind the canonical embedder leaderboard; required reading for understanding what an MTEB score does and does not measure.
Kwiatkowski, T. et al. (2019). "Natural Questions: A Benchmark for Question Answering Research." TACL 2019. aclanthology.org/Q19-1026. The Natural Questions paper. Real-Google-query-distribution QA benchmark and the source of NQ-open for open-domain retrieval evaluation.
Yang, Z. et al. (2018). "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering." EMNLP 2018. arxiv.org/abs/1809.09600. The HotpotQA paper. The canonical multi-hop retrieval and reasoning benchmark; the source of "fullwiki" evaluation that every modern multi-hop RAG paper uses.
Krishna, S. et al. (2024). "Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation." Google Research. arxiv.org/abs/2409.12941. The FRAMES (Factuality, Retrieval, And reasoning MEasurement Set) paper. The 2024 multi-hop fact-retrieval benchmark designed to break naive single-hop RAG; the canonical evaluation for advanced retrieval strategies.