
Vector Search & Embeddings: The Ultimate Guide (2026 Edition)

Comprehensive 2026 guide to vector search and embeddings — the embedding-model landscape (OpenAI text-embedding-3, Cohere Embed v4, Voyage 3, Jina v3, BGE, MTEB winners), vector database choice (Pinecone, Qdrant, Weaviate, Milvus, Chroma, pgvector, Turbopuffer, Vespa, OpenSearch, Vertex AI Vector Search), retrieval algorithms (HNSW, IVF, DiskANN, ScaNN), hybrid lexical + vector search, evaluation, multi-tenant patterns, and the cost math.

By Prompt20 Editorial · 52 min read

Vector search is the substrate underneath modern RAG, semantic search, recommendation, fraud detection, anti-spam, deduplication, and the agent memory layer. By 2026 every production AI system touches an embedding model and a vector index somewhere — but the design choices are spread across at least four moving parts (the embedding model, the index algorithm, the storage backend, the query layer), each with a dozen credible options. This guide is the canonical map: what each layer does, what the 2026 frontier looks like, how to pick, and the patterns that actually scale to billions of vectors and tens of thousands of QPS.

The take: in 2026 the embedding-model decision is between OpenAI text-embedding-3-large (default for breadth), Cohere Embed v4 (best multilingual + multimodal), Voyage 3 / Voyage Multilingual 3 (domain-tuned excellence), Jina v3 (open-weights, Matryoshka-style), and the BGE / GTE / E5 open-weight families. The vector-database decision is between purpose-built (Pinecone, Qdrant, Weaviate, Milvus), Postgres-native (pgvector), search-engine-native (Vespa, OpenSearch, Elasticsearch), columnar-cloud (Turbopuffer, LanceDB), and managed-hyperscaler (Vertex Vector Search, AWS OpenSearch, Azure AI Search). The index-algorithm decision is mostly HNSW (default), DiskANN (billion-scale on SSD), ScaNN (Google), or IVF-PQ (memory-constrained). And the query decision is rarely pure-vector — almost everyone ends up at hybrid (BM25 + vector + reranker). Picking right requires honesty about volume, query patterns, latency budget, multi-tenancy needs, and whether you actually need a separate vector database at all.

Companion reading: RAG production architecture for the end-to-end retrieval pipeline that sits on top, open-weights ultimate guide for the LLM half of RAG, KV cache math for the inference economics that compete with retrieval cost, AI inference cost economics for the broader cost picture, and the agent protocols guide for how MCP servers wrap vector stores.

Table of contents

  1. Key takeaways
  2. What an embedding actually is
  3. The 2026 embedding-model landscape
  4. Matryoshka, mixed-precision, and quantized embeddings
  5. The vector-database landscape
  6. Vector index algorithms (HNSW, IVF, DiskANN, ScaNN)
  7. Distance functions: cosine, dot, L2, hybrid
  8. Hybrid search: BM25 + vector + reranker
  9. Reranking: cross-encoder + late-interaction
  10. Multi-tenant patterns (per-tenant namespaces, metadata filtering)
  11. Cost math: index storage, query, embedding generation
  12. Latency budget: where the ms go
  13. Scaling to billions of vectors
  14. Evaluation: MTEB, BEIR, MIRACL, and the limits
  15. Common production patterns
  16. Common anti-patterns
  17. When you don't need a vector database
  18. The 2026 outlook

Key takeaways

  • An embedding is a learned dense vector (typically 384-3072 floats) representing semantic meaning. Modern embedding models are decoder-only or encoder-only transformers trained with contrastive learning on billions of pairs.
  • The 2026 frontier: OpenAI text-embedding-3-large (3072d, $0.13/M tokens), Cohere Embed v4 (1024d multimodal + multilingual, $0.10/M), Voyage 3 (1024d, $0.06/M, best on most public benchmarks), Jina Embeddings v3 (1024d matryoshka, open weights, Apache), BGE-M3 / BGE-Reranker (BAAI, open weights, multilingual). All within ~5 nDCG points on MTEB / BEIR; the spread is smaller than between LLMs.
  • Open-weight embeddings (BGE, GTE, E5, Jina, Mixedbread) are within 1-3 points of closed leaders on most benchmarks and dramatically cheaper to operate at scale (no per-token cost). They've narrowed the gap faster than open-weight LLMs.
  • Matryoshka representation learning (MRL) lets you truncate a 3072-dim vector to 512 or 256 with graceful quality loss. Saves 6-12× on storage. text-embedding-3, Voyage 3, Jina v3 all support it.
  • Vector databases split into 5 archetypes in 2026: purpose-built (Pinecone, Qdrant, Weaviate, Milvus), Postgres-native (pgvector, AlloyDB AI), search-engine-native (Vespa, OpenSearch, Elasticsearch), columnar-cloud (Turbopuffer, LanceDB), and managed-hyperscaler (Vertex, AWS OpenSearch Service, Azure AI Search).
  • HNSW is the default index, but DiskANN wins at billion-scale on SSD, ScaNN wins on Google-internal-scale workloads, and IVF-PQ remains relevant for memory-constrained edge / mobile.
  • Hybrid (BM25 + vector + reranker) almost always wins pure-vector on production retrieval quality. The reranker is usually a cross-encoder (Cohere Rerank 3, Voyage Rerank 2, Jina Reranker v2) or a late-interaction model (ColBERT v2, ColPali for documents).
  • You probably don't need a dedicated vector database if you have <10M vectors and already run Postgres. pgvector or Postgres + the vchord / pgvecto.rs extension handles this scale fine.
  • Cost dominates at scale. Embedding generation is paid per token; index storage is paid per GB; queries are paid per QPS. At 1B vectors and 1000 QPS you're spending tens of thousands per month — choose the backend deliberately.
  • Multi-tenancy is the most under-discussed problem. Per-tenant namespaces with metadata filtering work to ~10k tenants; beyond that, sharding strategy matters. Some vector DBs handle this natively (Pinecone serverless, Turbopuffer), others don't.
  • Evaluation is the bottleneck. MTEB / BEIR / MIRACL leaderboards are reference points but your retrieval-quality reality is workload-specific. Build a 200-query gold set for your domain and re-run after every model or index change.

What an embedding actually is

An embedding is a fixed-length vector of floating-point numbers that represents the semantic content of a piece of text (or image, audio, code). Two embeddings of semantically similar inputs are close together in vector space; two of unrelated inputs are far apart. The geometry is the entire point — it's what makes "find similar things" become "find nearest neighbors in a vector index."

The vectors come from a neural network trained with contrastive learning: present the model with (anchor, positive, negative) triplets, where anchor and positive are semantically related and anchor and negative are not, and adjust weights to pull anchor and positive together while pushing the negative away. Modern training corpora contain billions of pairs sourced from web data, search-click logs, paraphrase corpora, multilingual translation pairs, and synthetic LLM-generated pairs.
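The triplet objective described above can be sketched in a few lines. This is a minimal illustration, not any vendor's training code: the margin value and the toy 2-d vectors are assumptions, and real training typically uses large batched variants of the same idea.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer than the negative by at least `margin`."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

# Toy 2-d "embeddings" (hypothetical values, for illustration only).
anchor, positive, negative = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
well_separated = triplet_loss(anchor, positive, negative)  # already satisfied
violating = triplet_loss(anchor, negative, positive)       # roles swapped
```

When the loss is zero the triplet contributes no weight update; a positive loss is what pulls the anchor toward the positive and away from the negative.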

The output dimensionality depends on the model:

  • 384 — small, fast, mobile. all-MiniLM-L6-v2, bge-small-en-v1.5.
  • 768 — classic BERT-era size. bge-base-en-v1.5, Nomic Embed v2.
  • 1024 — modern default. Voyage 3, Cohere Embed v4, Jina v3, BGE-M3, OpenAI text-embedding-3-small (when truncated).
  • 1536 — OpenAI text-embedding-ada-002 legacy default, text-embedding-3-small native.
  • 3072 — OpenAI text-embedding-3-large native (truncatable via Matryoshka).
  • Higher — research; rarely production.

Higher dimensions usually mean slightly better retrieval quality but proportionally higher storage and query cost. Picking the right dimension is one of the more impactful decisions you'll make; see the Matryoshka section below.

The 2026 embedding-model landscape

The frontier in mid-2026, ranked by composite signal (MTEB v2 average, BEIR average, MIRACL multilingual, vendor benchmarks). All are within ~5-8 nDCG points of each other on broad benchmarks — the differences matter more on specific verticals than on aggregate.

Closed / hosted-API:

  • OpenAI text-embedding-3-large — 3072d native, MRL-truncatable to 256/512/1024. $0.13 per 1M tokens. Strong all-rounder; widest ecosystem integration; SLA-backed. Default for "we'll pay for managed" choices. MTEB v2 ~67.
  • OpenAI text-embedding-3-small — 1536d, $0.02 per 1M. Best price/quality for cost-sensitive workloads. MTEB v2 ~63.
  • Cohere Embed v4 — 1024d, multilingual (100+ languages), multimodal (text + image). $0.10 per 1M tokens. Strongest multilingual leader. Native binary / int8 / int4 output for storage savings.
  • Voyage 3 / Voyage Multilingual 3 / Voyage Code 3 / Voyage Finance 3 / Voyage Law 3 — 1024d Matryoshka. $0.06 per 1M. Multiple domain-tuned variants are the headline differentiator. Voyage 3 is the most-cited "outperforms OpenAI" model on MTEB v2 in 2026.
  • Google text-embedding-005 (formerly Gecko) — 768d, $0.025/M for input. Strong on Google ecosystem; tightly integrated with Vertex AI Vector Search.
  • AWS Titan Embeddings v2 — 1024d, $0.10/M tokens via Bedrock. Practical mostly for AWS-shop convenience.
  • Mistral Embed — 1024d, $0.10/M, OpenAI-API-compatible. Strong European-language coverage.

Open weights:

  • BGE-M3 / BGE-Reranker (BAAI, Beijing) — 1024d, multilingual, MIT. Often the strongest open-weight on MTEB; supports dense + sparse + multi-vector embeddings in one model.
  • Jina Embeddings v3 — 1024d Matryoshka, Apache 2.0. Strong multilingual; deliberately compatible with cosine similarity at any truncation.
  • GTE-Qwen2-7B-instruct / gte-large-en-v1.5 — strong English; Apache 2.0.
  • E5-Mistral-7B-instruct (Microsoft) — 4096d, MIT. The "embed-with-an-LLM-encoder" pattern; very high quality, high inference cost.
  • Mixedbread mxbai-embed-large-v1 — 1024d, Apache. Strong general-purpose open weights.
  • Nomic Embed v2 — 768d, Apache. Pioneer of "fully open including training data."
  • Stella-en-1.5B-v5 — strong on MTEB English.
  • Snowflake Arctic Embed L — 1024d, Apache. Tuned for retrieval over enterprise data.
  • SFR-Embedding-Mistral (Salesforce) — research-grade, high quality.
  • ColBERTv2 / ColPali / ColEra — late-interaction (multi-vector) models; not direct competitors to single-vector models but a different paradigm worth knowing about.

Multimodal embedding models (text + image, sometimes audio/video):

  • Cohere Embed v4 — text + image in same space.
  • OpenAI CLIP (and successors via OpenAI multimodal embeddings) — original text-image space.
  • Google multimodalembedding@001 — Vertex multimodal.
  • SigLIP 2 / SigLIP-So400m (Google research, open weights) — open SOTA on image-text matching.
  • Nomic Embed Vision v1.5 — open multimodal.
  • JinaCLIP v2 — open, Apache.
  • ColPali / ColQwen2 — document-image retrieval (page images embedded for retrieval of PDFs).

Code embeddings (specialized):

  • Voyage Code 3 — best closed code embeddings.
  • CodeRankEmbed — open, strong code retrieval.
  • Jina Code Embeddings v1 — open, Apache.
  • Salesforce SFR-Embedding-Code — research-grade.

Domain-specific:

  • Voyage Finance 3 / Law 3 — verticalized.
  • MedCPT — medical literature retrieval.
  • BGE-M3-Legal, DRAGON-Multiturn — specialized retrievers.

Picking: default to text-embedding-3-large (closed) or BGE-M3 / Jina v3 (open) for general use. Go to Voyage when MTEB quality matters or when you need a domain-tuned variant. Go to Cohere Embed v4 when multilingual or multimodal is core. Go to E5-Mistral when retrieval quality is paramount and inference cost is acceptable. Always re-evaluate on your own data before committing.

Matryoshka, mixed-precision, and quantized embeddings

A 3072-dim float32 vector takes 12 KB. A billion of them takes 12 TB. Storage is the cost dominator at scale — Matryoshka, quantization, and binary embedding are how you cut it 10-100×.

Matryoshka Representation Learning (MRL) — train the model so the first K dimensions of the vector are themselves a useful (lower-quality) embedding. You can truncate post-hoc:

  • 3072 → 1024 = ~98% of full-dim quality, 3× less storage.
  • 3072 → 512 = ~95% quality, 6× less.
  • 3072 → 256 = ~90% quality, 12× less.
  • 3072 → 128 = ~80% quality, 24× less.

Supported natively by OpenAI text-embedding-3-*, Voyage 3 (1024 → 256 / 512), Jina v3, Nomic v2.
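Client-side truncation of an MRL embedding amounts to slicing and re-normalizing. A minimal sketch, assuming the model was trained with Matryoshka and that you compare with cosine or dot product (the 4-d vector below is a stand-in for a real 3072-d one):

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components, then L2-renormalize the result."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # stand-in for a 3072-d unit vector
short = truncate_embedding(full, 2)  # stand-in for a 512-d truncation
```

Re-normalizing matters because the truncated prefix is no longer unit length, which would skew dot-product scores. Some hosted APIs can also return the shorter vector directly (OpenAI's embeddings endpoint accepts a `dimensions` parameter for the text-embedding-3 models).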

Scalar quantization (float32 → int8 / int4):

  • Int8: 4× less storage, 1-3% quality loss. Negligible cost.
  • Int4: 8× less storage, 3-6% loss.
  • Cohere Embed v4 natively outputs int8 / int4.
  • Most vector DBs (Qdrant, Pinecone, Milvus, Weaviate, Turbopuffer) support scalar quantization in-index.
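To make the int8 trade-off concrete, here is a sketch of symmetric per-vector scalar quantization. It is one simple scheme among several; in-database implementations often use per-dimension ranges or percentile clipping instead, so treat the details as assumptions.

```python
def quantize_int8(vec):
    """Map floats to integers in [-127, 127] plus one float scale per vector."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

vec = [0.12, -0.50, 0.33, 0.04]
codes, scale = quantize_int8(vec)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
```

Each component now costs 1 byte instead of 4, and the reconstruction error per component is bounded by half the scale step.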

Binary quantization (float → 1-bit):

  • 32× less storage, 5-12% quality loss before reranking.
  • The trick: use binary for the coarse search, then re-rank top-N candidates with the full vector. Quality recovers to ~98%.
  • Cohere natively supports binary output; many DBs (Qdrant, Weaviate, Turbopuffer) support binary indexing.
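The coarse-then-rescore trick can be sketched as follows. This is an illustrative toy, assuming sign-based binarization and a linear scan over the codes; real systems run the binary pass inside an ANN index, and the `oversample` factor is a hypothetical tuning knob.

```python
def to_bits(vec):
    """1 bit per dimension (the sign of each component), packed into an int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def search(query, vectors, codes, top_k=2, oversample=3):
    """Shortlist by Hamming distance, then rescore the shortlist with full vectors."""
    q_bits = to_bits(query)
    shortlist = sorted(range(len(vectors)),
                       key=lambda i: hamming(q_bits, codes[i]))[:top_k * oversample]
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:top_k]

vectors = [
    [0.9, 0.1, -0.2, 0.3],
    [0.8, 0.2, -0.1, 0.4],
    [-0.7, 0.6, 0.5, -0.1],
    [0.1, -0.9, 0.3, 0.2],
]
codes = [to_bits(v) for v in vectors]
result = search([1.0, 0.0, -0.3, 0.2], vectors, codes)
```

The binary pass only has to get the true neighbors into the shortlist; the full-precision rescoring decides the final order, which is why quality recovers after reranking.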

Product Quantization (PQ) — compress vectors into a product of smaller sub-codes. Used inside IVF-PQ indexes. 16-32× compression with manageable quality loss.

Practical recipe for billion-scale at 2026 economics:

  1. Use OpenAI text-embedding-3-large MRL-truncated to 512 dim (or BGE-M3 at native 1024).
  2. Store as int8 in the vector index. Keep float32 in cold S3/object storage for the rerank fallback.
  3. Use binary quantization for the first-stage HNSW; rerank top-100 with int8.
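  4. Quality: ~95% of full-fat; storage: ~24× less for the int8 copy (512 bytes vs 12 KB per vector), and the binary first-stage index alone is ~190× smaller than full float32.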

The vector-database landscape

Five archetypes in 2026, each with its own strengths and several credible options:

Purpose-built vector databases:

  • Pinecone — managed, serverless option since 2024. Strong on multi-tenancy (namespaces), metadata filtering, and operational simplicity. Pricier than alternatives at scale. The "default for teams who don't want to operate infrastructure."
  • Qdrant — open-source + managed cloud. Rust core; fast; rich filtering. Strong on hybrid search and quantization options.
  • Weaviate — open-source + managed. GraphQL query layer; first-class hybrid search; modular with embedding-model integrations.
  • Milvus / Zilliz — open-source + managed (Zilliz Cloud). Mature; multi-tenancy via "collections"; supports many index types (HNSW, IVF-PQ, DiskANN). Largest deployments by vector count.
  • Chroma — open-source, dev-first. Local-first SQLite-backed for prototypes; cloud-managed for production. Strong DX.
  • LanceDB — open-source columnar format on object storage. Embedded-first; serverless-friendly.

Postgres-native:

  • pgvector — extension for Postgres. HNSW + IVFFlat. Default for teams already on Postgres. Supports billion-scale with careful tuning + the right hardware.
  • pgvecto.rs — Rust-based alternative to pgvector; sometimes faster.
  • vchord — research-grade Postgres extension by TensorChord; aims for ScaNN-class quality on Postgres.
  • Supabase Vector — managed pgvector with serverless edge functions.
  • AlloyDB AI (Google) — Postgres-compatible managed offering with ScaNN integration; native vector support.
  • Aurora pgvector (AWS) — managed.
  • Neon pgvector — managed serverless Postgres with vector.

Search-engine-native:

  • Vespa — Yahoo's open-source engine. Hybrid (BM25 + vector + ML-ranker) is first-class; tensor-as-cell. Strong at recommendation / ad-rank / search.
  • OpenSearch / Elasticsearch — full-text + vector in one engine. KNN plugin matures every release; widely deployed.
  • Typesense — lightweight; strong filtering; hybrid.
  • Meilisearch — DX-focused; vector support added; small team.

Columnar-cloud:

  • Turbopuffer — serverless vector + full-text on object storage. Multi-tenant first; ~10× cheaper than dedicated DBs at scale. Strong on multi-namespace use cases.
  • LanceDB — same family; columnar on object storage; embedded mode for local prototypes.

Managed hyperscaler / search-as-a-service:

  • Vertex AI Vector Search (Google, formerly Matching Engine) — managed ScaNN. Tightly integrated with Google ecosystem.
  • AWS OpenSearch Service — managed OpenSearch with k-NN.
  • Azure AI Search — managed search with vector + semantic ranker. Tight Azure OpenAI integration.

Picking:

  • <10M vectors and you have Postgres: pgvector.
  • <10M vectors, no Postgres: Chroma (dev) → Pinecone serverless (prod).
  • 10M-100M vectors, multi-tenant SaaS: Turbopuffer, Pinecone serverless, or Qdrant Cloud.
  • 100M-1B vectors, in-house ops: Milvus, Vespa, or Weaviate self-hosted.
  • 1B+ vectors, custom needs: Vespa, Milvus on DiskANN, or roll-your-own with Faiss / DiskANN on shared storage.
  • Multilingual + multimodal: prefer Weaviate or Vespa (mature multi-vector); pair with Cohere Embed v4.

Vector index algorithms

The index determines query latency and recall. Six algorithms matter in 2026, plus Faiss, the library that implements several of them:

HNSW (Hierarchical Navigable Small World) — graph-based. The default. ~95-99% recall at 5-20 ms query latency for tens of millions of vectors. Memory-resident; needs RAM ≈ vector size + ~20-30% graph overhead. Tunable via M (graph connectivity) and efConstruction / ef (build/query depth). Most DBs default to HNSW.

IVF (Inverted File Index) — clustered. Partitions space into K clusters (Voronoi cells); query searches only nearby clusters. Faster index build than HNSW; lower memory; slightly lower recall. Usually paired with PQ (Product Quantization) for compression.
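A toy version of the IVF idea shows where the recall loss comes from. The centroids here are hand-picked for illustration (in practice they are learned with k-means), and `nprobe` is the usual name for the number of cells scanned per query.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_ivf(vectors, centroids):
    """Assign each vector id to its nearest centroid (one inverted list per cell)."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[cell].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the `nprobe` cells whose centroids are closest to the query."""
    cells = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in cells for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.5, 0.2], [1.0, 1.5], [9.0, 9.5], [10.5, 10.0], [4.9, 4.9]]
lists = build_ivf(vectors, centroids)
hits = ivf_search([5.2, 5.2], vectors, centroids, lists, k=1, nprobe=1)
hits_wide = ivf_search([5.2, 5.2], vectors, centroids, lists, k=1, nprobe=2)
```

With `nprobe=1` the query lands in one cell while its true nearest neighbor sits just across the boundary in the other, so the search misses it; raising `nprobe` finds it at the cost of scanning more vectors.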

IVF-PQ — IVF + PQ. The classic recipe for memory-constrained billion-scale. ~85-92% recall; 16-64× compression. Used by Faiss-on-disk deployments; less common as RAM has gotten cheaper.

DiskANN (Microsoft Research) — SSD-resident graph index. Trades latency for cost: 90-95% recall at 20-100 ms p99 latency from SSD instead of RAM. Wins decisively at >100M vectors when RAM cost > SSD cost. Used by Pinecone's "p2" tier, Milvus, and several large deployments.

ScaNN (Google) — partitioned + asymmetric hashing + reranking. Google's internal-default; available externally via Vertex AI Vector Search and via AlloyDB AI. Often the highest recall at given latency, especially at billion scale.

SPANN (Microsoft, less common) — hybrid memory + disk; partitioned graph. Similar trade-offs to DiskANN.

Faiss — Meta's library implementing IVF, IVF-PQ, HNSW, and several others. Library, not a database. Most "self-hosted vector store on object storage" pipelines use Faiss internally.

Picking the index:

  • Default to HNSW if recall matters and RAM is available.
  • DiskANN when you have >100M vectors and SSD is much cheaper than RAM.
  • ScaNN if you're on Vertex AI / AlloyDB and want Google's best.
  • IVF-PQ for edge / mobile / very memory-constrained.
  • Object-storage-backed indexes (e.g. LanceDB, Turbopuffer) for cold-storage workloads.

Distance functions

Three matter in practice:

  • Cosine similarity — angle between vectors, ignores magnitude. The default for normalized embeddings (which most modern models output). Range -1 to 1; higher = more similar.
  • Dot product — magnitude-aware. Use when embeddings carry implicit weights (e.g. some recommendation models). If embeddings are L2-normalized, dot product ≡ cosine.
  • Euclidean / L2 — distance in absolute space. Less common for semantic search; still used in some image-similarity and clustering tasks.

Pick what the embedding model's docs recommend. OpenAI, Cohere, Voyage, BGE all use normalized vectors and cosine.
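The equivalence between the three metrics on normalized vectors is easy to check numerically. A small sketch with hand-picked 2-d vectors:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def normalize(u):
    n = norm(u)
    return [x / n for x in u]

a, b = normalize([3.0, 4.0]), normalize([4.0, 3.0])
dot_ab = dot(a, b)
cos_ab = cosine(a, b)
l2_squared = l2(a, b) ** 2
```

For unit vectors, dot product equals cosine, and squared L2 distance equals 2 - 2 × cosine, so all three produce the same ranking. The choice only changes results when vectors are not normalized.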

Hybrid search: BM25 + vector + reranker

Pure vector search loses on important workloads:

  • Exact-keyword matches (codes, identifiers, product SKUs) — vector embeddings smear these.
  • Long-tail rare terms — under-trained in the embedding model.
  • Adversarial queries — vectors miss exact phrasings.

Hybrid search combines:

  1. BM25 (lexical) — keyword-aware, exact-match strong.
  2. Vector — semantic-aware, paraphrase strong.
  3. Reranker (optional) — cross-encoder rerank of the top-N union.

The combination consistently outperforms either alone on most production benchmarks (BEIR, MTEB-retrieval, custom enterprise tests). Most modern vector DBs ship hybrid as first-class:

  • Weaviate — built-in hybrid query with alpha blend parameter.
  • Vespa — hybrid ranking is the design assumption.
  • Qdrant — hybrid via Query API (added late 2024).
  • Pinecone — sparse-dense hybrid via the sparse vector field.
  • Elasticsearch / OpenSearch — knn + query_string blended via reciprocal rank fusion (RRF).
  • Postgres + pgvector — DIY: BM25-like ranking via ts_rank_cd plus vector cosine, blend at query time.

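Reciprocal Rank Fusion (RRF) is the most-used blending algorithm: take the ranked list from each system and score every document as the sum of 1 / (k + rank_i), where rank_i is its position in system i's list and k is a smoothing constant (60 is a common choice). Works without tuning weights; robust to score-distribution differences.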
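RRF is short enough to write out in full. A minimal sketch, assuming 1-based ranks and k = 60 (a conventional default, not something this guide prescribes); the document ids are made up.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["sku-123", "doc-a", "doc-b"]     # lexical ranking
vector_hits = ["doc-a", "doc-c", "sku-123"]   # semantic ranking
fused = rrf([bm25_hits, vector_hits])
```

Documents that appear in both lists accumulate two terms and rise to the top, while documents found by only one system still survive into the fused list. Only ranks are used, so BM25 scores and cosine similarities never need to be put on a common scale.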

Reranking

Reranking takes the top-N candidates from a fast first stage (vector / hybrid) and reorders them with a more expensive, more accurate model. Two paradigms:

Cross-encoder rerankers — score (query, document) pairs through a transformer that sees them concatenated. Much higher quality than bi-encoder embeddings; much higher cost per pair (10-100ms each). Use only for top-N (typically 20-100). The 2026 frontier:

  • Cohere Rerank 3 / 3 Multilingual — closed, $2/1k searches.
  • Voyage Rerank 2 — closed.
  • Jina Reranker v2 / v3 — open weights.
  • BGE Reranker v2-m3 — open, multilingual, very strong.
  • Mixedbread mxbai-rerank-large-v1 — open.
  • MonoT5 / MiniLM-Cross-Encoder — classic baselines.

Late-interaction (ColBERT-style) — produces a token-level multi-vector representation; scores via maxsim over query tokens. Higher quality than bi-encoder, lower cost than cross-encoder. Recent variants:

  • ColBERTv2 / PLAID — Stanford / Vespa-integrated.
  • ColPali / ColQwen2 — for page-image retrieval (great for PDF-heavy use cases).
  • ColEra / Jina-ColBERT-v2 — open implementations.
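The MaxSim scoring used by late-interaction models can be sketched as below. The token vectors are toy 2-d values chosen for illustration; real models emit one vector of around a hundred dimensions per token.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]                # two query token vectors
doc_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]    # covers both query tokens
doc_b = [[0.9, 0.1], [0.8, 0.2]]                # covers only the first
score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

Because every query token looks for its own best match, a document has to cover all parts of the query to score well. That per-token matching is also why these indexes store one vector per token and cost more storage than single-vector models.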

Recipe that wins on most workloads:

  1. Hybrid BM25 + vector → top-100.
  2. Cross-encoder rerank → top-10.
  3. (Optional) LLM-as-judge final filter for high-stakes use cases.

The added latency of reranking (50-200 ms for cross-encoder) is usually worth it on quality. The cost (cents to dollars per 1k queries) is usually negligible compared to LLM cost in the downstream generation step.

Multi-tenant patterns

Multi-tenancy in vector search has three patterns, scaling from simplest to most complex:

1. Metadata filtering — single index, all vectors mixed, every vector tagged with tenant_id, filter at query. Works to ~1k tenants and ~10M vectors total. Fails on noisy-neighbor performance and per-tenant query isolation.

2. Per-tenant namespaces / collections — one logical "index" per tenant within a single DB cluster. Pinecone's namespaces, Qdrant's collections, Weaviate's tenants, Milvus's collections, Turbopuffer's namespaces all work this way. Scales to ~10k-100k tenants depending on DB. Best balance of isolation + cost for most B2B SaaS.

3. Per-tenant clusters / serverless billing — every tenant gets a logical "instance" that scales independently. Pinecone Serverless, Turbopuffer, and modern Qdrant Cloud support this natively. Scales to millions of tenants; matches per-tenant billing exactly.

Picking:

  • <100 tenants: metadata filtering is fine.
  • 100-10k tenants, varied sizes: namespaces / collections.
  • 10k+ tenants: serverless-per-namespace (Turbopuffer, Pinecone Serverless).
  • <100 tenants, very heavy per-tenant volume: dedicated clusters per tenant.

Cost math

Three cost lines:

1. Embedding generation — paid per token to the embedding API (or amortized GPU cost if self-hosted).

  • text-embedding-3-large: $0.13 / 1M tokens. 1B documents × 500 tokens avg = $65k one-time + incremental.
  • Self-hosted BGE-M3 on a single A10G ($0.50/hr) processes ~200 docs/sec → ~17M docs/day. 1B documents = 60 days × $12/day = $720 + ops time. Drastically cheaper at scale.

2. Index storage — paid per GB-month.

  • 1B vectors at 3072d float32 = 12 TB. At Pinecone serverless $0.33/GB-mo = ~$4k/mo just storage.
  • Same 1B at 512d int8 (Matryoshka + quantization) = 0.5 TB. Same provider = ~$170/mo. 24× cheaper.

3. Query cost — paid per query (managed) or amortized compute (self-hosted).

  • Pinecone serverless: about $0.001 per 1k queries at default tier. 1k QPS sustained = 86.4M queries/day = ~$86/day = ~$2.6k/mo.
  • Self-hosted HNSW on 64-core 256 GB RAM box ($1500/mo CoreWeave) handles 1k QPS sustained at <20ms p99.
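The embedding and storage figures above can be reproduced with a few lines of arithmetic, which is also a convenient way to plug in your own volumes. The prices and throughput are the ones quoted in this section; treat them as inputs to re-check against current vendor pricing.

```python
def embedding_cost_usd(num_docs, avg_tokens, usd_per_million_tokens):
    """One-time API cost to embed a corpus."""
    return num_docs * avg_tokens / 1_000_000 * usd_per_million_tokens

def index_size_gb(num_vectors, dims, bytes_per_dim):
    """Raw vector payload only; excludes graph or metadata overhead."""
    return num_vectors * dims * bytes_per_dim / 1e9

def self_host_embedding_cost_usd(num_docs, docs_per_sec, usd_per_gpu_hour):
    """GPU rental cost to embed the corpus on one GPU running continuously."""
    hours = num_docs / docs_per_sec / 3600
    return hours * usd_per_gpu_hour

ONE_BILLION = 1_000_000_000
api_embed = embedding_cost_usd(ONE_BILLION, 500, 0.13)
gpu_embed = self_host_embedding_cost_usd(ONE_BILLION, 200, 0.50)
full_gb = index_size_gb(ONE_BILLION, 3072, 4)    # float32, full dimension
small_gb = index_size_gb(ONE_BILLION, 512, 1)    # int8, MRL-truncated
full_monthly = full_gb * 0.33
small_monthly = small_gb * 0.33
```

The exact results land slightly off the rounded numbers in the text (about 58 GPU-days rather than 60, about 12.3 TB rather than 12), but the ratios are the same: roughly two orders of magnitude between API and self-hosted embedding cost, and 24× between the two storage layouts.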

Break-even:

  • <100k queries/day: managed (Pinecone, Turbopuffer) wins on ops + total cost.
  • 100k-10M queries/day: depends on data volume; managed serverless still competitive.
  • >10M queries/day or >100M vectors: self-hosting starts to pay back, saving ~50-80% vs managed.

Latency budget

A typical RAG retrieval call:

  • Network round-trip to DB: ~5-20ms (region-dependent).
  • Query embedding generation: ~30-100ms (API) or ~10-30ms (self-hosted on GPU).
  • ANN search (HNSW, top-100): ~5-20ms in-memory; ~30-100ms DiskANN.
  • Hybrid BM25 step: ~5-20ms.
  • Reranker (cross-encoder top-100 → top-10): ~50-200ms.
  • Total: ~100-400ms p99 for hybrid+reranker; ~50-150ms for vector-only.

Budget pressure usually leads teams to:

  • Cache the embedding for repeated queries (free; 5-30% hit rate on most workloads).
  • Skip reranker for low-stakes queries (10-30% latency reduction).
  • Move embedding to a self-hosted GPU pool (cuts ~50-100ms vs API).
  • Use a closer region for the vector DB (cuts ~10-50ms).
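The first of these, the query-embedding cache, is a small amount of code. A minimal sketch, assuming an in-process dict and a hypothetical `embed_fn` callable standing in for the API or GPU call; the lowercase and whitespace normalization is an assumption that may not suit case-sensitive queries.

```python
import hashlib

class EmbeddingCache:
    """Maps a normalized query string to its embedding."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # any callable: str -> list of floats
        self.store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, text):
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

def fake_embed(text):
    """Hypothetical stand-in for a real embedding call."""
    return [float(len(text)), 1.0]

cache = EmbeddingCache(fake_embed)
first = cache.get("Reset my password")
second = cache.get("  reset my   PASSWORD ")
```

The second lookup is served from the cache because both strings normalize to the same key. A production version would bound the dict (LRU or TTL) and include the model name in the key so a model swap does not serve stale vectors.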

Scaling to billions of vectors

Billion-scale is a different regime. Practical lessons:

  • Sharding is necessary. Single-node HNSW maxes out around 100-500M vectors depending on hardware; beyond that, you shard by tenant, geography, or hash.
  • DiskANN or partitioned ScaNN wins on cost. RAM at billion-scale is expensive; SSD is OK.
  • Build time matters. HNSW build on 1B vectors takes hours-to-days. Plan for incremental updates and re-builds.
  • Recall vs latency is a knob, not a constant. At billion-scale you often accept 90-92% recall to stay under 50ms p99.
  • Multi-vector / late-interaction is harder. ColBERT-style indexes are 5-10× more storage; only worth it for high-value verticals (legal, scientific, page-image PDFs).

Evaluation: MTEB, BEIR, MIRACL, and the limits

The public benchmarks:

  • MTEB v2 (Massive Text Embedding Benchmark) — 56+ tasks across retrieval, clustering, classification, STS, summarization. The default ranking on Hugging Face Spaces.
  • BEIR — 18 retrieval-only benchmarks for zero-shot generalization. Most-cited single retrieval benchmark.
  • MIRACL — 18-language multilingual retrieval. Best benchmark for non-English use.
  • mMARCO — multilingual MS MARCO.
  • MMTEB — newer extension of MTEB across 1000+ languages and 500+ tasks; less saturated than MTEB v1.

The limits:

  • Public benchmarks measure generalist quality; your workload is specific.
  • Many models train on subsets of the public benchmarks (contamination). Some leaderboard positions are inflated.
  • Reranker + embedder pairings matter; an embedder that wins solo may lose paired.
  • Build a 200-query gold set for your domain. Re-run on every model or index change.

Common production patterns

What works in 2026:

  1. Hybrid by default: BM25 + vector + RRF blend, top-100, then cross-encoder rerank to top-10. Cohere or Voyage if you want the best closed; BGE / Jina reranker if you want open weights.
  2. Chunk smartly: ~512-1024 tokens per chunk for general text; smaller (256) for code; larger (2048+) for long-form analytic content where context matters.
  3. Always store the source URL + offsets alongside each vector. Provenance saves you when a retrieved chunk is wrong.
  4. Embedding cache: hash query → cached vector. 10-30% hit rate on most chatty workloads. Free latency win.
  5. Tier hot vs cold: keep last-N-days vectors in fast HNSW + RAM; older in DiskANN + SSD.
  6. Use metadata filtering aggressively: tenant_id, language, date-range, type. Filtered ANN is much faster than post-filtering.
  7. Rebuild monthly: HNSW degrades over many edits; periodic rebuilds restore recall.
  8. Monitor recall: ground-truth a sample of queries via exhaustive search and compare to ANN results; if recall drops below your threshold, alert.
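The recall monitor in the last item can be sketched as follows. `ann_search` is a hypothetical callable wrapping whatever index you run, and the 0.95 threshold is an example value, not a recommendation; exhaustive search is only run on the sampled queries.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def exact_top_k(query, vectors, k):
    """Ground truth by exhaustive scan."""
    return sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:k]

def recall_at_k(ann_ids, exact_ids):
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

def check_recall(sample_queries, vectors, ann_search, k=10, threshold=0.95):
    """Average recall@k over the sample; returns (recall, should_alert)."""
    recalls = [recall_at_k(ann_search(q, k), exact_top_k(q, vectors, k))
               for q in sample_queries]
    mean_recall = sum(recalls) / len(recalls)
    return mean_recall, mean_recall < threshold

vectors = [[float(i), 0.0] for i in range(10)]

def lossy_ann(query, k):
    """Simulated degraded index that never returns the last two vectors."""
    return exact_top_k(query, vectors[:8], k)

mean_recall, should_alert = check_recall(
    [[0.0, 0.0], [9.0, 0.0]], vectors, lossy_ann, k=2)
```

In this toy run one query is answered perfectly and the other misses both true neighbors, so average recall is 0.5 and the alert fires.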

Common anti-patterns

What burns teams:

  1. Pure vector when hybrid would work: still depressingly common. Costs you 5-15 nDCG points on most real workloads.
  2. Picking a vector DB before measuring: every workload has different access patterns. Prototype with pgvector before committing to managed.
  3. Storing raw text in the vector record: blows up storage cost. Store an ID and join externally.
  4. Embedding documents and queries identically when the model is asymmetric: some models (E5, BGE) expect different prefixes or instructions for queries vs documents; check the model card.
  5. Skipping reranking because "it's slow": it usually adds only 50-150ms and recovers 5-15% nDCG.
  6. Ignoring multilingual queries: BGE-M3, Cohere v4, and Jina v3 cover 100+ languages well; English-only embedders fail silently on non-English queries.
  7. No eval pipeline: shipping a new embedder without measuring the impact. Even a 50-query gold set catches major regressions.

When you don't need a vector database

Not every RAG system needs a dedicated vector DB. Skip it when:

  • <10k vectors: numpy or a Python dict beats any DB on latency. Faiss-in-process works fine.
  • <10M vectors and you already run Postgres: pgvector. One less moving part.
  • Documents change very rarely and re-embedding is cheap: rebuild in batch; serve from S3 + Faiss in-process.
  • Search is only one step in a larger LLM workflow and quality is dominated by the LLM: a simple BM25 (Tantivy, MeiliSearch) may be enough.
  • Latency budget is not tight (>1s acceptable): even brute-force scan of 1M vectors is feasible.
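For the small-corpus case in the first bullet, the whole "vector database" can be a dict and a linear scan. A pure-Python sketch with made-up document ids and 3-d vectors; numpy would vectorize the same computation.

```python
import heapq
import math

def normalize(vec):
    n = math.sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]

def brute_force_top_k(query, corpus, k=3):
    """corpus: dict of id -> unit vector. Returns [(score, id)], best first."""
    q = normalize(query)
    scored = ((sum(a * b for a, b in zip(q, v)), doc_id)
              for doc_id, v in corpus.items())
    return heapq.nlargest(k, scored)

corpus = {
    "refund-policy": normalize([0.9, 0.1, 0.0]),
    "shipping-times": normalize([0.1, 0.9, 0.1]),
    "api-reference": normalize([0.0, 0.2, 0.9]),
}
top = brute_force_top_k([1.0, 0.2, 0.0], corpus, k=2)
top_ids = [doc_id for _, doc_id in top]
```

Exhaustive search has perfect recall by construction and nothing to tune, rebuild, or monitor, which is the appeal at this scale.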

This is the most under-emphasized point: vector DBs are infrastructure, and infrastructure that you don't need is a liability. Default to the simplest option that ships.

The 2026 outlook

Best-guess trajectory:

  • Embedding models continue to consolidate: 3-5 closed leaders (OpenAI, Cohere, Voyage, Google) and 5-8 open-weight leaders (BGE, Jina, GTE, E5, Snowflake Arctic, Mixedbread, Nomic).
  • Matryoshka + quantization becomes default, eliminating most "but storage is too expensive" objections.
  • Multi-vector (ColBERT-style) quietly wins in document-heavy verticals (legal, scientific, PDFs).
  • Multimodal embeddings become standard; text + image in one space; ColPali / ColQwen-style page-image retrieval continues to grow.
  • Postgres + pgvector absorbs most of the long tail; specialized DBs keep the high-scale and feature-rich end.
  • Turbopuffer-style serverless object-storage DBs continue to take share from per-instance managed DBs on cost.
  • Reranker quality plateaus close to LLM-as-judge quality at 10× the speed; cross-encoder rerankers become the default last-mile.
  • Hybrid search becomes universal expectation, not a feature flag.
