Section 36.1

Platforms

"Pick the wrong vector platform on Monday; spend the next six months explaining to leadership why the migration is on the roadmap. The index format is forever; the marketing copy is not."

VecVec, Platform-Comparison-Spreadsheet AI Agent
Big Picture

A retrieval "platform" is the system of record for your vector index, your keyword index, and the metadata filters that join them. The 2026 landscape sorts into four buckets: serverless vector databases (Pinecone, Turbopuffer, Weaviate Cloud, Qdrant Cloud) that hide the index from you; managed search engines with vector add-ons (Elasticsearch, OpenSearch, Azure AI Search, Vespa Cloud, MongoDB Atlas Vector Search) that bolt kNN onto a mature lexical engine; self-hosted vector databases (Milvus, Qdrant OSS, Weaviate OSS, Chroma, LanceDB, Marqo, Vald) that you run yourself for cost or data-residency reasons; and SQL extensions like pgvector that put the vectors inside your existing Postgres so you do not run a second store at all. Pick along four axes: managed vs. self-hosted, vector-only vs. hybrid, single-tenant vs. multi-tenant, and how selective your query filters are.

Prerequisites

This section assumes the vector-search fundamentals and the RAG architecture from Section 32.1, and the embedding-model vocabulary from Section 3.1.

The platform choice is the most consequential infrastructure decision in a retrieval-augmented system, because every downstream concern (ingest throughput, query latency at the 99th percentile, how filtering interacts with the ANN index, multi-tenant isolation, backup and disaster recovery, cost at 10x today's volume) inherits its idioms. A team on pgvector reuses Postgres operations; a team on Pinecone outsources the index entirely but lives inside one vendor's pricing model; a team on Milvus runs a distributed system whose etcd, MinIO, and Pulsar dependencies need their own oncall. None of these are wrong, but switching costs after six months are large.

A 2x2 quadrant placing vector platforms by deployment model (managed vs self-hosted) and search style (vector-only vs hybrid lexical-plus-vector)
Figure 36.1.1: The 2026 vector platform landscape sorted along two axes: managed vs. self-hosted on the horizontal, hybrid (BM25 + dense) vs. vector-first on the vertical. Pick the quadrant first, then the engine.
Key Insight
History matters — the 60-year lineage from tf-idf to dense retrieval

Every "modern" retrieval platform sits on a 60-year algorithmic lineage that is easy to forget under the marketing layer. The probabilistic relevance framework that anchors BM25 was formalized by Robertson and Spärck Jones (1976), giving the term weight $\text{IDF}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}$; the Okapi BM25 work of the 1990s added the saturation curve $\frac{f(t,d)(k_1+1)}{f(t,d)+k_1(1-b+b|d|/\text{avgdl})}$ that still ships inside Lucene. Latent Semantic Indexing (Deerwester et al. 1990) was the first to map documents into a low-rank continuous space via truncated SVD of the term-document matrix, $\mathbf{X} \approx U_k \Sigma_k V_k^\top$, prefiguring every dense embedding by roughly 25 years. word2vec (Mikolov et al. 2013) replaced the SVD with a contrastive shallow network; BERT-based dense retrieval (Devlin et al. 2018, Karpukhin et al. DPR 2020) replaced the shallow network with a transformer encoder; and the bi-encoder + ANN-index pipeline we ship in 2026 is the same architectural idea ("project queries and documents into a shared semantic space; score by inner product") that LSI introduced. The through-line tf-idf $\to$ BM25 $\to$ LSI $\to$ word2vec $\to$ BERT-dense is not a sequence of revolutions; it is one slowly-improving recipe for compressed semantic similarity.
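The two BM25 formulas quoted above can be turned into a few lines of code. The following is a minimal sketch, not Lucene's implementation: the function name `bm25_scores`, the toy corpus, and the whitespace tokenizer are illustrative assumptions, and `k1=1.2`, `b=0.75` are the commonly used defaults.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    Uses the IDF and saturation terms exactly as written in the text above.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # n_t: number of documents that contain term t at least once
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            f = tf[t]
            if f == 0:
                continue  # term absent from this document
            n_t = doc_freq[t]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5))
            saturation = f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * saturation
        scores.append(score)
    return scores

# Toy corpus (illustrative only), tokenized by whitespace
docs = [
    "the fed raised interest rates".split(),
    "rates of interest in vector search".split(),
    "vector databases index embeddings".split(),
    "the cat sat on the mat".split(),
]
scores = bm25_scores("fed rates".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

In this toy corpus "rates" appears in two of the four documents, so its IDF is $\log 1 = 0$ and only "fed" contributes to the score. With this form of the IDF a term present in more than half the corpus gets a negative weight, which is one reason production engines often use a shifted variant.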

36.1.1 Serverless and hosted vector databases

Serverless and hosted vector databases are the right default when you want to ship a retrieval system in weeks rather than months, when you do not have a dedicated infrastructure team, and when per-query cost is acceptable in exchange for zero operational surface area. You pay in vendor lock-in (each platform's index format is proprietary) and per-vector / per-query pricing.

Key Insight: Serverless does not mean zero cost at zero load

Every "serverless" vector database in 2026 has a storage-tier cost that bills continuously, even when nobody is querying. Pinecone Serverless charges per stored vector per month; Turbopuffer charges per GB of object storage; Weaviate Cloud charges per provisioned compute regardless of traffic. The "serverless" promise is autoscaled compute, not free idle. A B2B product with 500 customer indexes and 480 of them inactive most of the month will see storage cost dominate. Model your billing as "storage + writes + queries" with realistic per-tenant activity distributions before assuming serverless is cheaper.

Figure 36.1.2 dramatizes exactly this gap between the pitch and the bill:

A cartoon salesperson on a podium waves a flag labeled SERVERLESS while behind them a printer endlessly spits out a long paper labeled STORAGE: $50K/mo, and a confused engineer reads the printout.
Figure 36.1.2: The salesperson sells autoscaled compute; the printer prints the storage tier. "Serverless" eliminates idle compute charges, not the per-vector storage bill that accrues every month whether or not a single query is run.
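The "storage + writes + queries" billing model from the Key Insight above can be sketched in a few lines. Everything numeric here is a hypothetical placeholder: the `Prices` values are not any vendor's rate card, and the tenant mix (20 active, 480 idle, 20 GB each) is an assumed distribution chosen to mirror the 500-index example.

```python
from dataclasses import dataclass

@dataclass
class Prices:
    """Hypothetical unit prices; replace with your vendor's published rates."""
    storage_per_gb_month: float
    per_million_writes: float
    per_million_queries: float

def monthly_bill(prices, tenants):
    """Return the bill split by component.

    tenants: list of (stored_gb, writes_per_month, queries_per_month) tuples.
    """
    storage = sum(gb for gb, _, _ in tenants) * prices.storage_per_gb_month
    writes = sum(w for _, w, _ in tenants) / 1e6 * prices.per_million_writes
    queries = sum(q for _, _, q in tenants) / 1e6 * prices.per_million_queries
    return {"storage": storage, "writes": writes, "queries": queries}

# Assumed prices and tenant activity, for illustration only
prices = Prices(storage_per_gb_month=0.30, per_million_writes=2.0, per_million_queries=8.0)
active = [(20.0, 1_000_000, 2_000_000)] * 20   # busy tenants
idle = [(20.0, 0, 10_000)] * 480               # stored but rarely queried

bill = monthly_bill(prices, active + idle)
total = sum(bill.values())
for component, cost in bill.items():
    print(f"{component:8s} ${cost:10,.2f}  ({cost / total:.0%} of total)")
```

Under these assumed inputs the storage line is the largest component even though most tenants issue almost no queries, which is the pattern the callout warns about. The useful part of the exercise is swapping in your own per-tenant activity histogram rather than a single average.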

36.1.2 Self-hosted vector databases

Self-hosted vector databases are the right default when data residency rules out a cloud vendor, when per-query economics at high scale make a managed service unaffordable, or when the team has the operational maturity to run another stateful system. You pay in operational complexity and a longer time-to-first-bot, but you keep full control of the index format, the upgrade cadence, and the cost model.

Library Shortcut
qdrant-client for self-hosted vector search

Of the self-hosted options above, qdrant-client is the easiest path from "pip install" to a running production index. The Python client speaks both REST and gRPC, supports async, and exposes Qdrant's filterable-HNSW, hybrid (sparse + dense), and named-vector primitives uniformly. Pair it with the qdrant/qdrant Docker image (one binary, one volume) and you have a single-node deployment in five minutes.

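# Install the client first (shell): pip install qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Connect to a local single-node Qdrant on the default REST port
client = QdrantClient(url="http://localhost:6333")

# One collection of 1024-dim vectors compared by cosine similarity
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Insert one point; the payload carries the filterable metadata
client.upsert(collection_name="docs", points=[
    PointStruct(id=1, vector=[0.1]*1024, payload={"lang": "en"}),
])

# Top-5 nearest neighbors; the scored points are in hits.points
hits = client.query_points(collection_name="docs", query=[0.1]*1024, limit=5)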
Code Fragment 36.1.1a: Add client.create_payload_index(collection_name="docs", field_name="lang", field_schema="keyword") to enable filter-aware HNSW for tenant-style queries.

36.1.3 Hybrid search and metadata filtering

"Hybrid" in 2026 means three different things depending on who is talking. Pin the speaker to one of these:

Hybrid filter performance is the biggest production differentiator at scale. A query that filters down to 1% of vectors and then asks for top-10 nearest neighbors stresses every engine differently. Benchmark on your actual filter selectivity, not on unfiltered queries.

Key Insight: Aha Moment: When 1% Selectivity Kills Recall

The 2024 ANN-Benchmarks "filtered-recall" addendum measured the same query (top-10 vector kNN, filter selectivity 1 percent) on a 10M-vector deep-image collection across four engines. Qdrant's filterable HNSW returned recall 0.94 at 8ms p95. Pinecone's post-filter strategy returned recall 0.31 at 5ms p95 because the ANN walked the index without knowing about the filter and most of the top-100 candidates were filtered out. Same query, same data, same recall target, 3x recall gap, all from how the engine threads the predicate. The lesson: at 50 percent selectivity all engines look identical, but at 1 percent selectivity (the regime that matters for tenant-scoped or date-scoped production search) the index architecture diverges by a factor of 3. This is why "we benchmarked on unfiltered queries" is the most common silent failure in vector-DB procurement.
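The mechanism behind that recall gap can be reproduced without any vector database. The sketch below is an illustration under stated assumptions: brute-force exact search stands in for the ANN index, the data is random, and the constants `N`, `DIM`, `K`, and `POOL` are arbitrary choices. It contrasts filtering before ranking with filtering after taking an unfiltered top-100 candidate pool at 1 percent selectivity.

```python
import random

random.seed(0)
N, DIM, K, POOL = 5000, 8, 10, 100  # corpus size, dims, top-k, post-filter pool

vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
matches_filter = [i % 100 == 0 for i in range(N)]  # 1 percent selectivity
query = [random.gauss(0, 1) for _ in range(DIM)]

def score(i):
    """Inner-product similarity between the query and vector i."""
    return sum(a * b for a, b in zip(query, vectors[i]))

# Global ranking of every vector, best first (exact search as an ANN stand-in)
ranked = sorted(range(N), key=score, reverse=True)

# Ground truth: the K best vectors among those that satisfy the filter
truth = [i for i in ranked if matches_filter[i]][:K]

# Pre-filter: restrict the candidate set first, then rank inside it
pre = sorted((i for i in range(N) if matches_filter[i]), key=score, reverse=True)[:K]

# Post-filter: take the unfiltered top-POOL, then drop non-matching candidates
post = [i for i in ranked[:POOL] if matches_filter[i]][:K]

def recall(result):
    return len(set(result) & set(truth)) / K

print("pre-filter recall: ", recall(pre))
print("post-filter recall:", recall(post), f"({len(post)} of {K} slots filled)")
```

Because only about one in a hundred candidates survives the predicate, the post-filter path can fill only a few of the ten result slots, no matter how good the underlying index is. Raising `POOL` recovers recall at the price of more distance computations, which is the trade-off a filter-aware index avoids.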

Library Shortcut: A typical hybrid query in Weaviate

Hybrid search in Weaviate is a single GraphQL or Python call with an alpha that controls the BM25-vs-dense weight:

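from datetime import datetime, timezone

import weaviate
from weaviate.classes.query import Filter, HybridFusion, MetadataQuery

# Older v4 clients name this helper connect_to_wcs
client = weaviate.connect_to_weaviate_cloud(cluster_url="...", auth_credentials=...)
docs = client.collections.get("Docs")

results = docs.query.hybrid(
    query="federal reserve interest rate decision",
    alpha=0.7,                        # 0.0 = pure BM25, 1.0 = pure vector
    fusion_type=HybridFusion.RANKED,  # or RELATIVE_SCORE
    filters=Filter.by_property("language").equal("en") &
            Filter.by_property("published_at").greater_than(
                datetime(2025, 1, 1, tzinfo=timezone.utc)),  # date properties compare against datetimes
    return_metadata=MetadataQuery(score=True),  # without this, o.metadata.score is None
    limit=10,
)
for o in results.objects:
    print(o.properties["title"], o.metadata.score)

client.close()  # release the connection when done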

The alpha is the most-tuned parameter in production hybrid search. 0.7 is a common starting point; tune on a held-out test set rather than by intuition.
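To make the effect of alpha concrete, here is a minimal sketch of score-based fusion: min-max normalize each score list, then take the alpha-weighted sum. This illustrates the weighting idea only and is not Weaviate's internal implementation; the function names `min_max` and `fuse` and the three-document score tables are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(bm25, dense, alpha):
    """Rank documents by alpha * dense + (1 - alpha) * bm25 on normalized scores."""
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    doc_ids = set(bm25_n) | set(dense_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical raw scores for three documents
bm25 = {"a": 12.0, "b": 4.0, "c": 1.0}      # "a" wins on keywords
dense = {"a": 0.60, "b": 0.90, "c": 0.70}   # "b" wins on embeddings

for alpha in (0.0, 0.7, 1.0):
    print(alpha, fuse(bm25, dense, alpha))
```

At alpha 0.0 the keyword ranking is returned unchanged, at 1.0 the dense ranking is, and at 0.7 the dense winner leads while the keyword winner stays ahead of the third document. Sweeping alpha over a held-out query set with a relevance metric is the tuning loop the paragraph above recommends.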

36.1.4 Selection criteria: the axes that matter

The platform choice mostly reduces to seven axes. Score each candidate honestly against your workload before reading reviews:

Numeric Example
HNSW vs IVF-Flat vs IVF-PQ on Glove-100 (ANN-Benchmarks)

Public numbers from the ANN-Benchmarks suite (Aumuller et al. 2017; live at ann-benchmarks.com) on the Glove-100 dataset (1.18M 100-dim vectors, angular distance, k=10) anchor the three-way trade-off between the dominant ANN families. Numbers vary $\pm 20\%$ across hardware generations but the shape is stable.

Index (typical config) | Recall@10 | QPS (single core) | Memory
HNSW (M=16, efConstruction=200, efSearch=64) | ~0.95 | ~5,000 | 1.0x (baseline ~470 MB)
IVF-Flat (nlist=4096, nprobe=8) | ~0.85 | ~12,000 | ~1.0x (~470 MB)
IVF-PQ (nlist=4096, m=16 subquantizers) | ~0.80 | ~30,000 | ~0.12x (~60 MB, 8x savings)

The Pareto frontier reads cleanly: HNSW buys the top recall at moderate throughput and full memory; IVF-Flat gives up about 10 recall points in exchange for $2.4\times$ throughput; IVF-PQ trades another 5 recall points for $6\times$ the HNSW throughput ($2.5\times$ that of IVF-Flat) and an $8\times$ memory cut. The IVF-PQ slot is the canonical billion-scale recipe because RAM cost dominates at that scale and a reranker recovers 3-5 of the lost recall points. For sub-50M-vector workloads where memory is cheap, HNSW is almost always the right default.

Algorithm 36.1.1: Complexity at a glance

The asymptotic bounds for the three index families, with $N$ = corpus size, $D$ = vector dimension, $M$ = HNSW out-degree, $n_{\text{list}}$ = IVF cell count, $D'$ = PQ code length in bytes, $k$ = top-k:

HNSW (Malkov & Yashunin 2018). Insertion cost $O(M \log N)$ per vector; query cost $O(\log N)$ greedy descent with constant factor proportional to $M \cdot D$ for distance computations. Memory $O(N \cdot D + N \cdot M)$, dominated by the vector payload plus the graph edges.

IVF-Flat (Jegou et al. 2011). Build cost $O(N \cdot n_{\text{list}} \cdot D \cdot n_{\text{iter}})$ for $n_{\text{iter}}$ Lloyd iterations of k-means over $n_{\text{list}}$ centroids, since each iteration compares every vector with every centroid. Query cost $O\!\left(n_{\text{list}} \cdot D + \frac{n_{\text{probe}} \cdot N}{n_{\text{list}}} \cdot D\right)$: first locate the $n_{\text{probe}}$ closest cells, then scan all vectors inside them. When $n_{\text{probe}}$ is fixed and the cell scan dominates the coarse step, the expected per-query work is approximately $O(N/n_{\text{list}} \cdot D)$.

IVF-PQ (Johnson et al. 2019, FAISS). Same coarse step plus a PQ-coded fine step: $O(n_{\text{list}} \cdot D + \frac{n_{\text{probe}} \cdot N}{n_{\text{list}}} \cdot D' + k \cdot D)$. The $D'$ factor (typically 16-64 bytes per vector) replaces a $D \cdot 4$-byte float scan; the final $k \cdot D$ is the optional exact rerank of the top-$k$ candidates against the raw vectors. The compression ratio is $\frac{4D}{D'}$; recall is recovered via $n_{\text{probe}}$ or by reranking against full-precision vectors.
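Plugging concrete numbers into the IVF query-cost expressions above shows which term dominates. The sketch below is a rough operation count, not a latency model: it evaluates the big-O terms literally with constant factors of one, uses the Glove-100 configuration from the numeric example (rounded to $N$ = 1.18M), and assumes a 16-byte PQ code. The function names are illustrative.

```python
def ivf_flat_ops(n, d, n_list, n_probe):
    """Return (coarse, scan) operation counts for one IVF-Flat query."""
    coarse = n_list * d               # compare the query with every centroid
    scan = n_probe * n / n_list * d   # scan the vectors in the probed cells
    return coarse, scan

def ivf_pq_ops(n, d, n_list, n_probe, code_bytes, k):
    """Return (coarse, scan, rerank) operation counts for one IVF-PQ query."""
    coarse = n_list * d
    scan = n_probe * n / n_list * code_bytes  # PQ codes instead of raw floats
    rerank = k * d                            # optional exact rerank of top-k
    return coarse, scan, rerank

N, D = 1_180_000, 100
flat = ivf_flat_ops(N, D, n_list=4096, n_probe=8)
pq = ivf_pq_ops(N, D, n_list=4096, n_probe=8, code_bytes=16, k=10)

print("IVF-Flat coarse/scan:      ", flat)
print("IVF-PQ coarse/scan/rerank: ", pq)
```

At this corpus size and with 4096 cells, the coarse centroid comparison is the larger term for both index types, so the cell scan only dominates once $N / n_{\text{list}}$ grows well beyond a few hundred vectors per cell. That is one reason $n_{\text{list}}$ is usually tuned together with corpus size rather than fixed.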

36.1.5 A decision tree

The fastest way to narrow the platform shortlist is to answer one or two of the questions below; the first question that matches your situation typically picks the platform within a small set. The tree is rough by design: the right second-pass evaluation is a small load-test on representative data, since real workloads diverge from synthetic benchmarks by 2-5x in either direction.

36.1.6 Comparing the platforms

Table 36.1.1b: Vector and hybrid platforms (mid-2026).
Platform | Best for | Index types | Hybrid | Deployment
Pinecone Serverless | Managed default, elastic cost | Proprietary (HNSW-class) | Sparse + dense | SaaS only
Turbopuffer | Many cold tenants, S3-native | Proprietary on object store | BM25 + dense | SaaS only
Weaviate (Cloud / OSS) | Hybrid BM25+dense default | HNSW + inverted | First-class | SaaS or self-hosted
Qdrant (Cloud / OSS) | Heavy filtering, simple ops | HNSW with filter graph | Sparse + dense | SaaS or self-hosted
Milvus / Zilliz | Billion-scale distributed | HNSW, IVF, IVF-PQ, DiskANN | Sparse + dense | SaaS or self-hosted
Vespa Cloud / OSS | Multi-phase ranking at scale | HNSW + tensor + BM25 | First-class | SaaS or self-hosted
Elasticsearch / OpenSearch | Already-running keyword search | HNSW (Lucene) | First-class | SaaS or self-hosted
MongoDB Atlas Vector Search | Same store as documents | HNSW (Lucene) | BM25 + dense | SaaS only
Azure AI Search | Azure compliance perimeter | HNSW + BM25 + semantic | First-class | SaaS only
pgvector | One database, joins to tables | HNSW + IVFFlat | Via tsvector | Self-hosted Postgres
Chroma | Prototyping, simplest API | HNSW | Limited | Embedded or SaaS
LanceDB | Lakehouse-native vectors | IVF-PQ + HNSW on Lance | BM25 + dense | Embedded or SaaS
Warning: Benchmark on your data, not theirs

Vendor benchmarks always look great. The 99th-percentile latency on a vendor's reference dataset (often Glove, MS MARCO, or a synthetic 1M-vector set) tells you almost nothing about how the engine behaves on your data, with your filter distribution, your vector dimensionality, and your insert / query mix. Allocate a week of evaluation time to load 1-5% of your real corpus into the top two or three candidates and measure recall and latency on a held-out query set with your real filter predicates. Every retrospective complaint about a vector database in 2026 ("we picked X, and it does not scale to our load") traces back to a vendor benchmark that did not match the real workload.

Figure 36.1.3 captures the moment that gap becomes a production incident:

A split-panel cartoon: on the left a glossy vendor poster celebrates p99 = 8ms with confetti, while on the right a tired engineer holding a clipboard labeled OUR DATA stares at a much sadder graph reading p99 = 320ms and mutters huh.
Figure 36.1.3: The vendor's p99 of 8ms and your p99 of 320ms can both be true at once: latency depends on your filter distribution, dimensionality, and query mix, which is why the only benchmark that counts runs on a slice of your real corpus.
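The evaluation described in the warning above (recall and latency on a held-out query set) needs very little harness code. The following is a minimal sketch under stated assumptions: `search_fn` is a hypothetical adapter you would write around each candidate engine's client, `ground_truth` comes from an exact brute-force search over the same filtered slice, and the percentile uses the nearest-rank method.

```python
import math
import time

def evaluate(search_fn, queries, ground_truth, k=10):
    """Measure recall@k and p95 latency of search_fn over a held-out query set.

    search_fn(query, k) -> list of ids (hypothetical adapter around an engine client)
    ground_truth[i]     -> exact top-k ids for queries[i], filters applied
    """
    latencies_ms, hits = [], 0
    for q, truth in zip(queries, ground_truth):
        start = time.perf_counter()
        result = search_fn(q, k)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        hits += len(set(result[:k]) & set(truth[:k]))
    latencies_ms.sort()
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it
    rank = math.ceil(0.95 * len(latencies_ms)) - 1
    return {
        "recall_at_k": hits / (k * len(queries)),
        "p95_ms": latencies_ms[rank],
    }

# Stand-in engine for demonstration: returns ids q..q+k-1
def fake_search(q, k):
    return list(range(q, q + k))

queries = [0, 100, 200]
ground_truth = [list(range(q + 5, q + 15)) for q in queries]  # half overlap by construction

report = evaluate(fake_search, queries, ground_truth, k=10)
print(report)
```

Running the same `queries` and `ground_truth` through one adapter per candidate engine keeps the comparison honest, because every engine sees identical predicates and identical scoring. A few thousand queries are usually needed before a p95 figure is stable enough to compare.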

36.1.7 Platform pricing shapes

Vector platform pricing clusters into four shapes; each has a different break-point at scale:

Key Insight: Aha Moment: 100M Vectors, Four Bills

Take one workload: a 100M-vector index of 768-dim float32 embeddings, ingest of 1M new vectors per day, query rate 200 QPS at p95 latency under 50ms. Plug it into 2024-Q4 published pricing for the four shapes. Pinecone Serverless (per-vector + per-query): about $2,800/month at $0.33 per million writes and $8 per million reads. Pinecone provisioned pods (s1.x2): about $1,440/month for two pods that hold the index with headroom. Azure AI Search Standard S1 (per-search-unit): about $1,000/month for a single tier that just fits the workload. Self-hosted Qdrant on a single c6a.4xlarge EC2: about $530/month in pure infrastructure, plus an estimated $4,000/month in engineer on-call time (the recurring lesson). Same workload, four shapes, a roughly 5.3x raw-cost spread ($530 to $2,800) that flips entirely if you include engineering time. The pricing-shape decision is rarely about the sticker price; it is about which slope you can budget for as your vector count grows 10x in year two.

The single most common pricing mistake is to launch on a per-vector-plus-query plan, grow 10x in a year, and discover that the storage bill alone is now the size of three engineering salaries. Build the 10x model into the procurement decision.
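A 10x model does not need to be elaborate. The sketch below takes two of the monthly figures quoted in the 100M-vector example and extrapolates them under a loudly stated assumption: that both the usage-based bill and the self-hosted infrastructure bill scale linearly with vector count, while the on-call cost stays fixed. Real rate cards have tiers and step changes, so treat the function names and slopes as illustrative, not as vendor pricing.

```python
def usage_based(vectors_m):
    """Assumed linear: $2,800/month at 100M vectors -> $28 per million vectors."""
    return 28.0 * vectors_m

def self_hosted(vectors_m, oncall=4000.0):
    """Assumed linear infra ($530 at 100M -> $5.30 per million) plus fixed on-call."""
    return 5.3 * vectors_m + oncall

for vectors_m in (100, 1000):  # today and 10x, in millions of vectors
    print(
        f"{vectors_m:>5}M vectors: "
        f"usage-based ${usage_based(vectors_m):>9,.0f}  "
        f"self-hosted ${self_hosted(vectors_m):>9,.0f}"
    )

# Vector count (in millions) where the two lines cross under these assumptions
break_even_m = 4000.0 / (28.0 - 5.3)
print(f"break-even at about {break_even_m:.0f}M vectors")
```

Under these assumptions the usage-based plan is cheaper today and clearly more expensive at 10x, with the crossover arriving well before the index doubles. The exact crossover is sensitive to the slopes, which is the argument for writing the model down before signing rather than after the first large invoice.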

36.1.8 Quantization and disk-resident indexes

Memory is the dominant cost at billion-vector scale. The 2026 platform-level techniques for staying within budget:

The cost savings from quantization compound quickly: a billion-vector index of 1024-dim float32 vectors needs ~4 TB of RAM at full precision; binary-quantized it needs ~128 GB, which fits on a single machine. The recall recovery via reranking is essentially free once the index is small enough to fit.
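The footprint arithmetic above, and the quantize-then-rerank pattern it relies on, can be checked with a short sketch. The assumptions are stated plainly: sign-bit binary quantization, Hamming distance for the first stage, exact inner product for the rerank, random toy data, and illustrative names (`footprint_bytes`, `to_bits`, a shortlist of 50). Production engines use packed SIMD codes rather than Python integers.

```python
import random

def footprint_bytes(n_vectors, dim, bits_per_dim):
    """Raw vector storage in bytes, ignoring index overhead."""
    return n_vectors * dim * bits_per_dim // 8

full = footprint_bytes(10**9, 1024, 32)   # float32
binary = footprint_bytes(10**9, 1024, 1)  # one sign bit per dimension
print(f"float32: {full / 1e12:.3f} TB, binary: {binary / 1e9:.0f} GB")

def to_bits(v):
    """Pack sign bits into an int: bit i is set when v[i] > 0."""
    code = 0
    for i, x in enumerate(v):
        if x > 0:
            code |= 1 << i
    return code

def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
DIM, N = 64, 2000
vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
codes = [to_bits(v) for v in vectors]

# Query: a lightly perturbed copy of vector 42, so the right answer is known
query = [x + random.gauss(0, 0.1) for x in vectors[42]]
query_code = to_bits(query)

# Stage 1: cheap Hamming search over the 1-bit codes, oversampled to 50 candidates
shortlist = sorted(range(N), key=lambda i: hamming(codes[i], query_code))[:50]

# Stage 2: exact rerank of the shortlist against the full-precision vectors
top = sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:10]
print("best match:", top[0])
```

Only the 50 shortlisted vectors are touched at full precision, so the raw vectors can live on disk while the 1-bit codes stay in RAM. The oversampling factor (here 50 candidates for a top-10) is the knob that trades rerank cost against recovered recall.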

36.1.9 Operations: backup, replication, disaster recovery

The 2026 operational questions that should appear in every platform evaluation:

Note
Vector indexes are stateful systems that are easy to underestimate

The recurring lesson of 2024-25 retrieval-platform retrospectives: teams budget for the index but underbudget for the operational machinery around it (backups, replication, monitoring, capacity planning, the on-call rotation, the periodic re-encode when an embedder upgrades). A vector index is a stateful database; treat it with the same operational seriousness as your primary OLTP store. If you cannot articulate your RTO and RPO for the index, the platform choice is premature.

What's Next?

In the next section, Section 36.2: Libraries and Frameworks, we build on the material covered here.

Further Reading
Malkov, Y. A. and Yashunin, D. A. (2018). "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." IEEE Transactions on Pattern Analysis and Machine Intelligence. arxiv.org/abs/1603.09320. The HNSW paper. Most vector databases in this section use HNSW or a close variant as their primary index, so this is the canonical citation for the algorithm behind much of the category.
Pinecone (2024). "Introducing Pinecone Serverless." Pinecone Blog, January 2024. pinecone.io/blog/serverless. Launch reference for the storage-compute separation that defines the 2024-26 serverless vector database category, and the canonical description of geometric partitioning over object storage.
Wang, X. et al. (2021). "Milvus: A Purpose-Built Vector Data Management System." SIGMOD 2021. cs.purdue.edu / Milvus SIGMOD paper. The Milvus architecture paper. The reference for distributed vector serving with separated compute and storage tiers, etcd metadata, and message-queue ingest.
Johnson, J., Douze, M., and Jegou, H. (2019). "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data. arxiv.org/abs/1702.08734. The FAISS paper. Defines the IVF, PQ, and IVF-PQ family of indexes that every CPU-bound vector database uses as a memory-saving option below HNSW.
ANN-Benchmarks (2024-2026). "Benchmarking Nearest Neighbor Algorithms." Project site. ann-benchmarks.com. The community-maintained benchmark suite for ANN algorithms. The right place to compare published recall-at-throughput numbers across engines; treat as a starting point, never as a substitute for benchmarking on your own data.