Prompt20
All posts
ragretrievalvector-dbembeddingsrerankingllm-servingguide

RAG in Production: The Complete Guide

The definitive 2026 guide to retrieval-augmented generation in production: when RAG beats long context, ingestion and chunking, dense + BM25 hybrid search, embedding models in 2026, vector databases compared (Pinecone / Qdrant / Milvus / Weaviate / pgvector / Vespa / Turbopuffer), rerankers (Cohere, BGE, JinaAI, ColBERT), citation grounding, multi-stage and agentic RAG patterns, eval (RAGAS, ARES), cost math, and the failure modes that kill production.

By Prompt20 Editorial · 110 min read

A RAG system is three boxes connected by two arrows: index, retrieve, generate. The boxes are easy to draw. The arrows are where everything actually breaks. By 2026 the field has run enough RAG in production to know which architectures survive — and which "ship it tomorrow" demos disintegrate the first time a customer asks a question that uses pronouns.

The take. Long context did not kill RAG. Long context made RAG cheaper to do well: with 128k–1M-token windows you can retrieve more, rerank harder, and stop micro-optimizing chunk size. But the bottleneck moved — from "fit it in the prompt" to "retrieve the right thing at all." The dominant production stack in 2026 is hybrid (BM25 + dense) retrieval → reranker → grounded generation with mandatory citations, evaluated on workload-representative traces, not on public RAG benchmarks. Everyone who skips the reranker regrets it. Everyone who skips eval ships hallucinations. The vector database is the least important decision; six different products in this space are good enough.

This is the production reference: where time actually goes in the request path, which embedding model and reranker combinations win in 2026, how the six top vector databases differ for the workloads that matter, chunking strategies that survive edge cases, citation and grounding patterns that survive lawyers, multi-stage and agentic RAG, eval frameworks (RAGAS, ARES, RAGAS-Auto, TruLens), and the failure modes that account for most of the production tickets. Cross-links: long-context attention, agent serving infrastructure, eval infrastructure, KV cache inference memory math, reasoning model serving, AI inference cost economics, multimodal serving.

Table of contents

  1. Key takeaways
  2. Mental model: RAG in one minute
  3. The RAG landscape in 2026
  4. When RAG beats long context (and when it doesn't)
  5. The production RAG architecture
  6. Ingestion: parsing, chunking, enrichment
  7. Embedding models in 2026
  8. Vector databases compared
  9. BM25, dense, and hybrid retrieval
  10. Rerankers: where most of the quality lives
  11. Citation, grounding, and faithfulness
  12. Multi-stage and agentic RAG
  13. Graph RAG and structured retrieval
  14. Evaluating RAG honestly
  15. Cost economics
  16. Failure modes that actually happen
  17. Parser deep dive: LlamaParse, Unstructured, Textract, Document AI, Reducto
  18. Chunking strategies: fixed, semantic, hierarchical, late, contextual
  19. Embedding deep dive: dim, Matryoshka, binary, quantization
  20. Sparse retrieval and SPLADE/ColBERT details
  21. Hybrid fusion: RRF, weighted, learned fusion
  22. Query rewriting: HyDE, multi-query, step-back, decomposition
  23. Contextual retrieval and contextual embedding
  24. Agentic RAG patterns
  25. Production cost stack: worked example
  26. Eval methodology: RAGAS, TruLens, golden sets
  27. Long-context vs RAG vs fine-tune decision math
  28. Observability for RAG
  29. Security: PII, row-level access, multi-tenant isolation
  30. 2026 trends and what's next
  31. Freshness and incremental indexing
  32. Domain-specific RAG: legal, medical, financial, code
  33. RAG SaaS and managed offerings
  34. Long-context-aware RAG: the 2026 pattern
  35. The bottom line
  36. FAQ
  37. Eighteen-month outlook
  38. Glossary
  39. References

Key takeaways

  • RAG is alive in 2026 because retrieval cost scales with document count; long-context cost scales with prompt length. Whichever number is smaller for your workload wins.
  • The default production stack: chunk → embed → hybrid (BM25 + dense) retrieve → rerank top-100 to top-5 → generate with citations. Skipping the reranker is the most common reason RAG quality plateaus.
  • Embedding model in 2026: Cohere embed-v4, OpenAI text-embedding-3-large, Voyage voyage-3-large, or BGE-M3 for open-weight. All within ~2 points on MTEB; differences are larger on domain-specific data than on benchmarks.
  • Vector DB choice is almost a tie at moderate scale (<100M chunks). pgvector / Qdrant / Milvus / Weaviate / Pinecone / Turbopuffer / Vespa all work. Pick on operational fit (managed vs self-hosted, hybrid search support, filtering performance).
  • Reranker is the cheap quality lever: a cross-encoder (Cohere Rerank 3.5, BGE-reranker-v2, JinaAI Reranker v2) on top-100 candidates raises recall@5 by 10–30 points on real workloads.
  • Chunking matters less than the internet thinks once you have a reranker. 512–1024 token chunks with 10–20% overlap is fine for most prose. Code, tables, and structured docs need their own paths.
  • Cite or die. Every claim in generation must point at a retrieved chunk. Force the model to emit [source:N] tokens, and reject answers without citations.
  • Eval with traces from your own product. Public RAG benchmarks (HotpotQA, NaturalQuestions, FinanceBench) are contaminated and don't predict your workload's behavior.
  • Failure modes are mostly upstream: bad parsing (PDFs), bad chunking (split tables), bad rewriter (query expansion that drifts), bad reranker threshold (too few or too many docs). Generation hallucination is usually the symptom, not the disease.

Mental model: RAG in one minute

The named problem is the context-mismatch problem: the model has been trained on a frozen, public, generic corpus, and your users are asking it about a moving, private, specific one. No amount of base-model scaling fixes this — your data was never in the training set. RAG closes the gap by fetching the relevant slice of your corpus at request time and putting it in front of the model.

The useful analogy is an open-book exam. The student (the LLM) is bright but does not know the textbook. You let them bring a book in and look things up. The exam is now about three skills: choosing the right book to bring (ingestion), flipping to the right page fast (retrieval + rerank), and reading the page accurately enough to answer (generation with citations). A clever student with a bad index loses to an average student with a great index. That is the RAG architecture in one sentence.

Stage What it does Failure mode if skipped
Parse Turn PDFs / HTML / Office into clean text Tables split, headings lost
Chunk Split into retrieval units Context shredded or too coarse
Embed Map chunks to dense vectors Lexical-only retrieval, semantic miss
Hybrid retrieve BM25 + dense, top-100 Recall collapses on rare terms
Rerank Cross-encoder picks top-5 Quality plateaus 10–30 points low
Generate LLM answers with citations Hallucination, ungrounded claims

The production one-liner. The reference request path:

q = rewrite(query)                                       # disambiguate, expand
candidates = bm25.search(q, k=100) | dense.search(q, k=100)
top = reranker.rerank(q, candidates, k=5)                # cross-encoder
context = "\n\n".join(c.text + f" [src:{c.id}]" for c in top)
answer = llm.generate(SYSTEM + q + context, require_citations=True)

Skipping the reranker is the most common reason a working RAG never gets better than mediocre.

The sticky number: Anthropic's Contextual Retrieval reports a 49% drop in failed retrievals when chunks are prefixed with a short LLM-generated context summary before embedding, and a 67% drop when combined with a reranker. That single technique is the largest free quality lever published in the last 18 months for production RAG.


The RAG landscape in 2026

The 2023 picture of RAG was one embedding model, one vector DB, one LLM. The 2026 picture is a layered pipeline where each layer has matured into its own product category.

Embeddings. Cohere embed-v4 (frontier closed), OpenAI text-embedding-3-large (closed default), Voyage AI voyage-3-large and voyage-3-code (closed, domain-specialised), BGE-M3 and BGE-large-v2 from BAAI (github.com/FlagOpen/FlagEmbedding) (open-weight default), Nomic nomic-embed-text-v2, Jina jina-embeddings-v3, Mistral mistral-embed. Most are 1024–3072 dim; matryoshka variants let you truncate to 256–512 dim with minor quality loss, which matters for storage cost.

Vector databases. Pinecone (managed), Qdrant (open-source, fastest growing), Milvus (open-source, scale leader at >1B vectors), Weaviate (managed and self-hosted), pgvector / pgvectorscale (Postgres extension, dominant choice for small-to-mid), Vespa (hybrid search legend, complex ops), Turbopuffer (cheap object-storage-backed serverless), LanceDB (embedded), Chroma (DX leader, less production-hardened). MongoDB Atlas Vector and Redis Vector exist; usable if you already run them.

Rerankers. Cohere rerank-3.5 (closed, best general-purpose), Voyage rerank-2, JinaAI jina-reranker-v2-base-multilingual, BGE bge-reranker-v2-m3 and bge-reranker-v2-gemma (open-weight defaults), ColBERTv2 and the late-interaction lineage (Khattab et al., arXiv:2112.01488) for high-recall over many chunks. Rerankers are universally cross-encoders or late-interaction models — bi-encoders (the embedding model itself) are inadequate as the final filter.

Retrieval engines. Vector DB search alone is bi-encoder retrieval. Production stacks pair it with BM25 (lexical, sparse) via Elasticsearch / OpenSearch / Tantivy / Lucene, or a hybrid-native engine like Vespa or Weaviate that does both in one query. SPLADE (Formal et al., arXiv:2107.05720) and similar learned-sparse retrievers are a third lane that has grown through 2024–2026; they need a dedicated index but combine BM25-style precision with embedding-model semantics.

RAG frameworks. LlamaIndex and LangChain dominate the framework conversation, but the production trend in 2026 is fewer abstractions, not more — most serious teams now write their own thin orchestration on top of a vector DB SDK, a reranker SDK, and an LLM SDK. The frameworks are useful for prototyping and for non-standard integrations (graph stores, document loaders), less useful in the request path. The shift is partly LCEL fatigue, partly that LLM SDKs (OpenAI, Anthropic) and vector DB SDKs (Qdrant, Pinecone, pgvector via psycopg) became good enough that the abstraction premium stopped paying for itself.

Hosted RAG-as-a-service. Vectara, Azure AI Search, Vertex AI Search, Pinecone Assistant, OpenAI Assistants (file search), Amazon Bedrock Knowledge Bases. These bundle parsing, chunking, embedding, retrieval, and generation behind one API. Useful for teams that want a working baseline in a day; weaker when you need control over chunking, reranking, or routing. Most graduate off the hosted offering once they hit quality limits or want per-tenant customisation.

Eval. RAGAS (Es et al., arXiv:2309.15217) is the de facto first stop for automated metrics (faithfulness, context precision/recall, answer relevance). ARES (Saad-Falcon et al., arXiv:2311.09476) trains domain-specific judges. TruLens, Phoenix (Arize), and Patronus AI are observability and eval platforms layered on top. None replaces workload-representative traces from your own users; all of them help you scale review beyond 50 examples.


When RAG beats long context (and when it doesn't)

The question every serious team gets asked: "now that Gemini does 2M tokens, do we still need RAG?"

Yes — for most workloads, for two reasons.

Cost scales with content, not corpus. A long-context prompt pays for every token in the window every time, regardless of relevance. A 1M-token prefill on Gemini 1.5 Pro or Claude 3.7 costs roughly $1–3 per request depending on input pricing. A 5k-token RAG context costs $0.005–0.015. If you have a corpus of any size, you cannot afford to pass it whole on every request.

Quality breaks before the limit. Effective context length is consistently 1/4–1/2 of advertised on retrieval-heavy tasks — "Lost in the Middle" (Liu et al., 2023, arXiv:2307.03172), RULER (Hsieh et al., 2024, arXiv:2404.06654), and NoCha (Karpinska et al., arXiv:2406.16264) all document this. A model with a 200k-token window may only reliably attend to the first 50k. RAG sidesteps this by handing the model 5–20k of relevant tokens.

Where long context wins.

  • Single-document tasks. Summarising a contract, drafting a response to a 200-page PDF, extracting structured data from a long report. The document is small enough to fit; retrieval would lose context cohesion.
  • Multi-turn reasoning over a fixed dossier. An agent that needs to reference the same set of documents across many turns. Pay the prefill once (with prefix caching), reuse for the conversation.
  • Code analysis on a whole repo. Modern code-task models work better with the full repo than with retrieved snippets, when the repo fits.
  • Workloads where retrieval can never be correct. Synthesis questions ("what changed between these two reports?") that require seeing both sources in full. RAG with top-k retrieval misses the comparison signal.

Where RAG wins.

  • Knowledge that doesn't fit. Corporate wikis, customer-support corpora, codebases >1M tokens, legal libraries, medical literature. The corpus is the size of a library; the model can't read the library on every request.
  • Freshness. Information that updates faster than you retrain. Pricing, news, internal docs.
  • Citability. Compliance, legal, healthcare, financial advisory — any domain where "the model said so" is not an acceptable answer. RAG gives you a source URL to point at.
  • Cost. The most reliable lever you have to keep per-request costs in cents instead of dollars.

The honest answer in 2026: most production knowledge-grounded systems are hybrid. They use long context for the response (let the model think) and RAG for retrieval (don't pass the whole corpus). The two are complements.

A decision table: RAG vs long context vs fine-tuning

Workload RAG Long context Fine-tuning
Corpus > 1M tokens, factual Q&A Default Too expensive Wrong tool
Single 200-page contract Skip Right tool Skip
Style adaptation (write like our brand) Wrong tool Few-shot ok Right tool
Frequently-updating prices/news Default Stale within hours Wrong tool
Compliance with citations to source Default Hard to cite Wrong tool
Code repo < 100k tokens Optional Right tool Skip
Multi-hop synthesis across 200 docs Graph RAG Lost-in-middle hurts Skip
Per-customer customisation at scale Skip Skip LoRA (see multi-tenant LoRA serving)

Prefix caching changes the math

Anthropic's prompt caching ($0.30/1M token cache writes, $0.03/1M token cache reads — 90% off input cost) and OpenAI's prompt caching (50% off input) shifted the cost curve. If you re-use the same 100k-token document across 50 queries, the amortised prefill cost drops by ~10×. This narrows the RAG cost advantage on dossier-style workloads but doesn't eliminate it once your corpus exceeds the context window or your queries don't share a common prefix.


The production RAG architecture

A request through a serious 2026 RAG system touches 7–10 services. The canonical path:

1. User query
2. Query understanding (rewrite, expand, classify, route)
3. Hybrid retrieval (BM25 + dense)
4. Filter (metadata, ACL, recency)
5. Rerank (cross-encoder, top-100 → top-5)
6. Context assembly (deduplicate, order, format)
7. Generation with citation
8. Post-generation grounding check
9. Trace logging (for eval and debug)
10. Response with sources

Each layer optimises a different metric.

Query understanding. Rewrite under-specified queries ("the issue we discussed"), expand for recall ("can => able to, capability, possibility"), classify for routing (which corpus to query), and decompose multi-hop questions ("compare A and B" → two queries). HyDE (Gao et al., arXiv:2212.10496) generates a hypothetical answer first and embeds that for retrieval; it works on out-of-distribution queries. Multi-query expansion (generate N rewrites, run all of them, union results) is cheap and consistently lifts recall.

Hybrid retrieval. BM25 over chunks (keyword precision), dense over chunks (semantic recall), combined by reciprocal-rank fusion (RRF — Cormack et al., 2009) or learned fusion. The hybrid step is the largest single quality lift over pure dense retrieval, and it's nearly free; both indices are small enough to query in parallel.

Filtering. Metadata filters (date range, document type, tenant ID), ACL filters (per-user access), recency boost (decay older docs). Vector DB metadata-filter performance varies wildly across products — Qdrant and Vespa are best-in-class, pgvector and Pinecone are competitive, Weaviate has historically been weaker on cardinality.

Reranking. The 100→5 cut. A cross-encoder sees query and document together, with full attention between them — much higher quality than the bi-encoder embeddings used in retrieval, much slower per pair, hence the funnel. Latency budget is typically 30–80 ms for the rerank stage. Cohere rerank-3.5 at top-100 is the production default.

Context assembly. Deduplicate near-duplicates (same content from multiple sources), order by relevance descending (or by document hierarchy if reading order matters), respect the model's context limit. Always include the citation pointer alongside each chunk so the generator can ground answers.

Generation with citation. System prompt instructs the model to cite every factual claim. Output is post-processed to verify citations point to actually-retrieved chunks (defends against hallucinated citations). Models in 2026 are competent at this; the failure mode is when the system prompt is weak or the chunks are formatted ambiguously.

Grounding check. A second LLM call (or a cheap entailment model) verifies the answer is supported by the cited chunks. RAGAS faithfulness or a custom NLI model both work. Optional but pays off in compliance-heavy domains.

Trace logging. Every query, retrieved chunks, citations, latency per stage. Required for eval, debugging, and incident response. Store traces in a queryable store (BigQuery, Snowflake, Clickhouse) keyed by request ID, with retention of at least 30 days for production and 365 days for compliance domains. The trace is the only artifact that lets you reconstruct what happened on a specific bad answer; without it, every postmortem becomes guesswork.

A request through this stack runs 300–1500 ms end-to-end depending on document length, reranker, and model. The retrieval portion is 50–300 ms; everything else is generation latency.

Latency budget by stage

Stage p50 p99 Where time goes
Query rewrite (cheap LLM) 80 ms 250 ms Single round-trip to Haiku/Flash/4o-mini
Hybrid retrieve (BM25 + dense, parallel) 20 ms 80 ms Network + ANN graph traversal
Metadata filter 5 ms 30 ms Filter cardinality dependent
Rerank top-100 40 ms 120 ms Cross-encoder forward pass
Context assembly 5 ms 20 ms String ops, dedup
Generation (5k context, 500 token output, Sonnet-class) 1500 ms 4500 ms TTFT + decode
Grounding check (optional) 200 ms 800 ms Second LLM call or NLI

The generation step dominates everything. Optimising retrieval below 50 ms p50 is rarely the right place to spend engineering effort — fix the reranker, the parser, or the prompt instead.


Ingestion: parsing, chunking, enrichment

Ingestion is offline. It's also where most production RAG systems quietly fail. The pipeline is conceptually simple — parse, chunk, embed, index — and each step has a sharp edge.

Parsing. PDFs are the worst format in widespread use. Layout-aware parsers (Unstructured, LlamaParse, AWS Textract, Azure Document Intelligence, Reducto, Vespa's document AI) recover structure that PyPDF and pdfminer lose. For technical documents with tables and figures, the parser choice is the biggest single quality lever in ingestion. HTML is the second-worst — strip nav, footer, ads, but preserve heading structure and list semantics. DOCX and Markdown are easy. Code requires its own path (AST-aware splitting; see below).

Chunking strategies.

  • Fixed-size sliding window (512–1024 tokens, 10–20% overlap). The default that works for prose.
  • Sentence- or paragraph-aware (split on sentence boundaries, group to target size). Preserves coherence; small quality lift over fixed-size.
  • Heading-aware (split on H1/H2/H3 boundaries). Pairs well with structured documents and helps preserve "this section is about X" context.
  • Late chunking (Günther et al., arXiv:2409.04701). Embed the long document first, then chunk the embedding by averaging across token spans. Preserves context across chunk boundaries. Works well with long-context embedding models like Jina v3.
  • Semantic chunking. Split where consecutive sentences' embeddings diverge. Conceptually appealing, empirically marginal over heading-aware.
  • Recursive chunking (LlamaIndex / LangChain default). Try paragraph splits; if too large, sentence; if still too large, word. Good fallback chain.
  • Code-aware chunking. Tree-sitter or LSP-driven splits on function and class boundaries. Critical for code RAG; naïve splitting cuts functions in half.

The reranker hides a lot of chunking sins. A 1024-token chunk with a sharp opening sentence will outrank a perfectly-segmented but worse-written 256-token chunk every time. Don't over-engineer; profile first.

Enrichment. Add metadata that filters can use: document type, author, date, ACL, version. Add synthetic summaries or titles to chunks ("this chunk is from the FY24 10-K, section 7, discussing revenue") — these short summaries can be embedded alongside the chunk and queried at retrieval time. Parent-child or contextual retrieval (Anthropic's "contextual retrieval" approach) prepends a one-sentence document context to each chunk before embedding; reduces retrieval failures by 30–50% on long-document corpora.

Deduplication. Near-duplicate chunks pollute retrieval and waste context. Hash-based exact dedup is free; min-hash or simhash catches near-duplicates. For prose, embed and cluster — anything above ~0.95 cosine is functionally a duplicate.

Incremental indexing. Production corpora change. Decide upfront whether you re-index nightly (simplest), stream updates with CDC (most robust), or batch on document edits. Most vector DBs handle deletes by tombstoning; periodic compaction matters at scale.

Parser benchmarks: which PDF tool to pick

On the OmniDocBench and DocLayNet evaluations through 2024–2025, a rough quality ranking for production PDF parsing:

Parser Tables Math/formulas Multi-column Cost per 1k pages
Reducto Excellent Excellent Excellent ~$5
LlamaParse Premium Very good Very good Very good ~$3
AWS Textract Very good Weak Good ~$1.50
Azure Document Intelligence Very good Good Good ~$1.50
Unstructured.io (hosted) Good Weak Good ~$1
Marker (open-weight) Good Good Fair Self-host
PyMuPDF / pdfplumber Fair Poor Poor Self-host
PyPDF Poor Poor Poor Self-host

For high-stakes documents (financial filings, scientific papers, legal contracts), the $5/1k pages cost of the best parsers is trivial compared to the cost of garbage chunks polluting your index.

Contextual retrieval: the cheap win

Anthropic's contextual retrieval recipe (Sept 2024) prepends an LLM-generated one-sentence summary of "what this chunk is from / about" to each chunk before embedding. Their numbers: 35% reduction in retrieval failures from contextual embeddings alone; 49% reduction combined with contextual BM25; 67% reduction with reranking added. The cost: one Haiku call per chunk at ingestion time, cached aggressively. For a 1M-chunk corpus that's ~$80 of one-time generation. Free, on the scale of any real corpus.


Embedding models in 2026

The MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) shows the top 20 models clustered within ~2 points of each other on aggregate. Differences on your specific domain are usually larger. Test on your data.

Top closed in 2026.

  • Cohere embed-v4 — 1536 dim, multilingual, strong instruction-tuned retrieval. Cohere also publishes a "documents" and "queries" input-type distinction that improves retrieval over symmetric embedding.
  • OpenAI text-embedding-3-large — 3072 dim with matryoshka truncation to 256/512/1024. Solid baseline; widely supported.
  • Voyage voyage-3-large — domain-leading on financial, legal, and code; voyage-3-code is the strongest code retrieval embedding in 2026.
  • Google text-embedding-005 — tight integration with Vertex AI; competitive on multilingual.

Top open-weight.

  • BGE-M3 (Chen et al., arXiv:2402.03216) — multilingual, multi-functionality (dense + sparse + ColBERT-style multi-vector), 1024 dim. The strongest single open-weight model for hybrid retrieval.
  • bge-large-en-v1.5 — English-focused, smaller, fast; a solid drop-in for self-hosted English-only stacks.
  • nomic-embed-text-v2-moe — MoE embeddings, sits between speed and quality.
  • jina-embeddings-v3 — 8192 context, supports late chunking out of the box.
  • mxbai-embed-large-v1 — Mixed-bread AI, popular self-hosted choice.

Practical knobs.

  • Dimensionality. 1024 is the sweet spot in 2026. 3072 helps marginally on long-document tasks and costs more in storage and compute. Matryoshka truncation lets you store at 512 and lose little.
  • Quantization. Binary (1-bit) and int8 quantization for stored vectors cut memory 8–32× with 2–8% recall loss; reasonable for cold tiers. Production hot path still serves at fp16/fp32 quantized to int8 at most.
  • Asymmetric encoding. Many 2026 embedding models distinguish "query" and "document" input types. Use them — symmetric retrieval is leaving accuracy on the table.
  • Domain adaptation. Fine-tuning a small embedding model on your own query-document pairs raises in-domain recall by 10–25% over generic models. The 2024 LoRA-style adaptation paths (sentence-transformers, GPL Wang et al., arXiv:2112.07577) are mature.

Embedding model price/quality table

Model Provider Dim Max tokens MTEB avg Price ($/1M tokens)
text-embedding-3-large OpenAI 3072 (matryoshka) 8191 ~65 $0.13
text-embedding-3-small OpenAI 1536 (matryoshka) 8191 ~62 $0.02
embed-v4 Cohere 1536 128k ~66 $0.10
voyage-3-large Voyage AI 1024 32k ~66 $0.18
voyage-3-code Voyage AI 1024 32k n/a (code-tuned) $0.18
text-embedding-005 Google 768 2k ~64 $0.025
mistral-embed Mistral 1024 8k ~63 $0.10
jina-embeddings-v3 Jina AI 1024 8192 ~64 $0.02
BGE-M3 BAAI (open) 1024 8192 ~64 Self-host
nomic-embed-text-v2-moe Nomic (open) 768 2k ~62 Self-host

For most teams, OpenAI text-embedding-3-small ($0.02/1M) is the right default if you want closed and cheap; BGE-M3 is the right open-weight default. The 2 MTEB points between this and the frontier models do not predict your workload performance.

Embedding storage math

Storage cost for a 100M-chunk corpus at common dimensions:

Dim Precision Bytes/vector Total raw With HNSW (2–4×)
384 fp16 768 73 GB 150–290 GB
768 fp16 1536 140 GB 280–560 GB
1024 fp16 2048 190 GB 380–760 GB
1536 fp16 3072 290 GB 580–1150 GB
3072 fp16 6144 570 GB 1150–2280 GB
1024 int8 (quantized) 1024 95 GB 190–380 GB
1024 binary 128 12 GB 24–48 GB

Binary quantization (1 bit per dim) costs 2–8% recall but cuts memory 16×. Combined with a rescoring step using fp16 vectors for the top-100 candidates, you get full quality at fraction of the RAM cost. Most major DBs (Qdrant, Milvus, pgvector) support this two-tier setup natively.


Vector databases compared

The honest assessment: at <100M chunks, every major vector DB is fast enough. Choose on operational fit. At >1B vectors the picks narrow to Milvus, Vespa, and managed offerings (Pinecone, Turbopuffer).

DB License Best at Weak at Notes
pgvector / pgvectorscale Postgres Small-to-mid (<50M), already-on-Postgres, ACID Hybrid search, very large scale Default if you already run PG. pgvectorscale adds StreamingDiskANN for >100M.
Qdrant Apache 2.0 Metadata filtering, hybrid, single-binary ops Petascale Fastest-growing OSS choice; high-quality Rust core; managed cloud available.
Milvus Apache 2.0 Petascale, GPU indexing, multi-tenant Operations complexity Scale leader. Zilliz is the managed offering.
Weaviate BSD-3 Built-in hybrid, modules (embedders, rerankers) Filter performance at scale Strong DX; managed cloud is mature.
Pinecone Managed only Hands-off ops, multi-cloud, hybrid Lock-in, cost at scale The "no infra team" default. Pinecone Serverless changed the cost curve.
Vespa Apache 2.0 Hybrid search, learned ranking, scale Operations complexity Yahoo-bred; the most powerful retrieval engine in the list, also the steepest learning curve.
Turbopuffer Managed only Cheap large-scale, object-storage-backed Latency floor (~50ms) Cost/GB an order of magnitude below Pinecone. Right for archives and large corpora; not for sub-50ms hot path.
LanceDB Apache 2.0 Embedded, single-process, no server Multi-node, concurrent writes Right for desktop apps, notebooks, edge.
Elasticsearch / OpenSearch Various Already running for logs, mature hybrid Pure-vector performance If you already operate ES at scale, vector + BM25 in one index is compelling.

ANN index choice. HNSW (Malkov & Yashunin, 2018) is the default for memory-resident workloads — sub-10ms latency, 95%+ recall, but full RAM cost. DiskANN (Subramanya et al., NeurIPS 2019) and its streaming variant trade some latency for SSD-residence; the right pick at >100M vectors. IVF/PQ live on for memory-constrained edge cases. Most vector DBs auto-tune; you rarely need to pick.

Hybrid search support. Vespa, Weaviate, Qdrant (since 1.10), Elastic/OpenSearch, and Milvus 2.4+ support BM25 + dense natively. Pinecone added sparse-dense in 2024. pgvector + paradedb or tsvector for BM25 in Postgres. You can always do hybrid externally with two indices and a fusion step, but native is operationally simpler.

Recall vs latency on a 10M-vector benchmark

Rough numbers from public benchmarks (ANN-benchmarks, big-ANN-benchmarks) and vendor-published results, normalised to a single c7g.4xlarge-class node for self-hosted and managed-tier defaults for cloud:

DB Recall@10 p50 latency p99 latency Notes
pgvector (HNSW, m=16) 0.96 12 ms 45 ms Single-node Postgres
pgvectorscale (StreamingDiskANN) 0.97 8 ms 30 ms Disk-backed, scales further
Qdrant 0.98 6 ms 22 ms Rust core, very consistent tail
Milvus (HNSW) 0.97 7 ms 28 ms Scales to 10B+ vectors
Weaviate 0.96 9 ms 38 ms Hybrid native
Pinecone Serverless 0.97 25 ms 80 ms Network + object-storage backed
Turbopuffer 0.95 60 ms 250 ms Object storage; cheap at rest
Vespa 0.98 8 ms 25 ms Hybrid + learned ranking

At <10M vectors the differences are mostly noise. At >100M vectors only Milvus, Vespa, Pinecone, Turbopuffer, and pgvectorscale stay sane. The operational difference (managed vs self-hosted; how much YAML you wrote) matters more than the latency tail.


BM25, dense, and hybrid retrieval

Dense retrieval embeds query and documents in the same space and finds nearest neighbors. BM25 (Robertson & Walker, 1994) scores documents by weighted term frequency × inverse document frequency. They fail in opposite directions.

Dense recall but bad precision on exact tokens. Embedding models conflate near-synonyms — a query for "GPT-4o" may retrieve documents about "GPT-4" because they cluster nearby in embedding space. For acronyms, product names, error codes, identifiers, and SKUs, BM25 wins.

BM25 recall but bad precision on paraphrase. A query for "how do I cancel a subscription" misses a document titled "ending recurring billing" if the lexical overlap is poor.

The hybrid combination — query both indices in parallel, fuse the results — recovers both failure modes. Reciprocal-rank fusion is the simplest: each document gets a score of 1/(k + rank_BM25) + 1/(k + rank_dense) for a small constant k (60 is the common default). Weighted variants exist (α·dense + (1-α)·BM25); RRF is more robust because it normalizes away score-scale differences.

Production results. Across published RAG evaluations (BEIR (Thakur et al., arXiv:2104.08663) and internal traces from teams that publish results), hybrid lifts recall@10 by 5–15 points over pure dense retrieval on most domains. The lift is largest in technical, legal, and code domains where exact terminology matters.

SPLADE and learned sparse. SPLADE produces a sparse-vector representation that lives in the BM25 index but encodes semantic meaning via term expansion. It can replace or augment BM25, gives near-dense quality with sparse-index efficiency, and pairs well with dense retrieval for hybrid. Production adoption is growing but Vespa and Pinecone are the main mature paths.

Multi-vector retrieval. ColBERT and its successors (ColBERTv2, PLAID — Santhanam et al., NAACL 2022) store one vector per token and score query-document pairs with late interaction. Higher quality than single-vector dense retrieval; ~10× the storage cost. Worth it for high-precision domains with budget for the index size.

RRF tuning: when defaults break

Reciprocal rank fusion with k=60 is the published default. It works because k=60 dampens the contribution of low-rank documents in both lists, making the fusion robust to one retriever having a long tail of noise. Two cases where defaults fail:

  • One retriever is much stronger than the other. If BM25 nDCG is 0.3 and dense nDCG is 0.6, equal weighting under-uses dense. Either use weighted RRF (w_dense / (k + r_dense) + w_bm25 / (k + r_bm25)) or normalised score fusion. Tune weights on a held-out eval set.
  • Tiny candidate lists. When you fuse top-10 from each retriever, k=60 swamps rank signal. Drop k to 10 or 20 for small lists.

For most workloads, default RRF is correct and tuning is premature. Measure before optimising.

Sparse retrievers compared

Retriever Type Index size Latency Quality (BEIR avg) Production maturity
BM25 (Lucene/Tantivy) Lexical 0.3–0.5× text size <5 ms 0.42 Decades
BM25 + RM3 / pseudo-relevance Lexical + expansion Same +10 ms 0.45 Mature
SPLADE-v3 Learned sparse 2–5× BM25 10–30 ms 0.51 Vespa, Pinecone, Qdrant
TILDE / TILDEv2 Learned sparse 2–3× BM25 10–30 ms 0.50 Research
uniCOIL Learned sparse 1.5× BM25 10–30 ms 0.49 Anserini

SPLADE-v3 in a hybrid with dense retrieval is the strongest non-cross-encoder retrieval stack in 2026 outside Vespa's bespoke learned ranking. It costs more index space and more inference at query time than BM25, but the quality gain is substantial on technical and multilingual corpora.


Rerankers: where most of the quality lives

The biggest underrated lever in production RAG. A cross-encoder reranker on top-100 candidates raises recall@5 by 10–30 points on real workloads — often the difference between a system that works and one that doesn't.

Why rerankers work. A bi-encoder (the embedding model) encodes query and document independently, then compares. A cross-encoder attends across the concatenation of query and document, with full attention between them. The cross-encoder sees how every query token interacts with every document token. It is strictly more powerful, at the cost of being too slow to run over a whole corpus.

The pipeline pattern: bi-encoder for cheap recall (top-100), cross-encoder for expensive precision (top-5).

Models in 2026.

  • Cohere rerank-3.5 — multilingual, strong general performance, ~30ms for 100 docs. Production default.
  • Voyage rerank-2 — competitive with Cohere; better on code and finance domains.
  • JinaAI jina-reranker-v2-base-multilingual — open-weight, ~140M params, runs on CPU.
  • BGE bge-reranker-v2-m3 — open-weight default; ~568M params with strong multilingual support.
  • BGE bge-reranker-v2-gemma — larger (~2.5B), slower, higher quality.
  • ColBERTv2 / PLAID — late-interaction reranker, can replace the bi-encoder entirely at the cost of index size.

Latency budget. A cross-encoder rerank of 100 documents (each ~1024 tokens with the query) is one batched forward pass — typically 20–80ms on a single GPU or 100–300ms on CPU for the smaller models. Don't rerank more than 100 candidates unless you have a clear quality reason; the recall ceiling from bi-encoder retrieval is the binding constraint above that.

Threshold tuning. Don't return all top-k unconditionally. If the reranker score for rank-5 falls below ~0.3 (model-dependent), the chunk is probably noise; truncate the context rather than padding it with irrelevant material. A short context with three highly-relevant chunks beats a long context with one relevant and four noisy chunks.

Don't skip it. Every team that says "the reranker didn't help us" has either (a) not implemented one yet, (b) compared against a workload where the bi-encoder happens to be sufficient, or (c) skipped the threshold tuning. The fourth case — the workload truly doesn't need rerankers — exists but is rare in production.

Reranker model comparison

Model Params Languages Latency (100 docs, 1×H100) Cost (closed) Quality (BEIR avg nDCG@10)
Cohere rerank-3.5 n/a (closed) 100+ ~30 ms (API) $2.00/1k searches ~0.59
Voyage rerank-2 n/a (closed) English-strong ~40 ms (API) $0.05/1M tokens ~0.58
BGE-reranker-v2-m3 568M 100+ ~80 ms Self-host ~0.57
BGE-reranker-v2-gemma 2.5B English+ ~250 ms Self-host ~0.59
Jina-reranker-v2 278M Multilingual ~50 ms $0.02/1k searches ~0.55
ColBERTv2 / PLAID 110M English ~20 ms (indexed) Self-host ~0.56
MixedBread mxbai-rerank-large-v1 435M English ~70 ms Self-host ~0.56

Quality numbers are approximate aggregates from BEIR and MTEB reranking subsets through 2024–2025 and shift with each release. The closed APIs win on operational simplicity; BGE-reranker-v2-m3 is the open-weight default that most self-hosted stacks land on.


Citation, grounding, and faithfulness

A RAG answer without citations is a chatbot answer; the retrieval system might as well not exist. Citations are the contract between the retrieval layer and the generation layer.

Citation patterns.

  • Inline citation. The model emits [N] or [doc_id] after each factual statement. Simple, well-supported, requires only a system prompt.
  • Sentence-level grounding. Each sentence in the output maps to one or more retrieved chunks. Stricter, harder to enforce, useful in compliance.
  • Span-level grounding. Specific spans (numbers, names, dates) cite specific chunks. Highest precision; used in legal and medical.

System prompt template.

You are an assistant that answers questions strictly from the provided
sources. For every factual claim, cite the source ID in brackets like
[source:N]. If the sources don't answer the question, say so. Do not use
prior knowledge outside the sources.

Sources:
[source:1] <chunk text>
[source:2] <chunk text>
...

Question: <user query>

This template, in some form, is in every production RAG system. The phrasing matters more than people think — "strictly from the provided sources" and "do not use prior knowledge" measurably reduce hallucination over softer versions.

Post-generation verification.

  • Citation existence. Parse the output for [source:N] patterns; verify each N is in the retrieved set. Reject otherwise.
  • Faithfulness check. A second LLM call: "given these sources and this answer, is every claim in the answer supported by the sources?" RAGAS faithfulness metric formalises this. Catches hallucinated content that nominally has a citation but isn't actually supported.
  • NLI-based grounding. A small entailment model (DeBERTa-NLI, BGE-reranker as a classifier) checks if each claim is entailed by the cited chunk. Cheaper than a full LLM call.

Confidence and refusal. Train the system to refuse rather than guess. If retrieval returns nothing above the reranker threshold, the right answer is "I don't have information about that," not a fabricated response. This is hard to get right — models default to helpfulness — but is the single largest gain available for production quality.

Legal and compliance. In regulated domains (finance, healthcare, legal), every response must be traceable to a source. Persistent storage of (query, retrieved chunks, response, citation map) per request becomes a legal requirement. Plan for this from day one.

What good citations look like in practice

A good RAG response has three properties that a bad one lacks. First, every numeric or proper-noun claim ends with a bracketed source ID that maps to a chunk the model actually saw — not a URL the model invented. Second, the citation density scales with claim density; a paragraph of seven facts should have something like five to seven citations, not one trailing citation at the end. Third, when the sources contradict each other, the response says so explicitly rather than picking one and ignoring the other. Citation hallucination, where the model emits [source:9] after no [source:9] was retrieved, is the failure mode that kills compliance use cases — and post-generation validation catches it for the cost of a regex. See AI hallucinations for the broader picture.

Faithfulness vs answer relevance: don't conflate them

A response can be faithful (every claim supported by the sources) and irrelevant (doesn't answer the question). A response can be relevant (answers the question) and unfaithful (claims facts the sources don't support). Eval frameworks like RAGAS measure these separately; production systems should too. The expensive failure is "faithful and irrelevant" — the model summarises the retrieved chunks correctly but doesn't address what the user asked. Fix by tightening the system prompt to start with the question, and by adding answer-relevance scoring to the eval loop.


Multi-stage and agentic RAG

Single-pass retrieval-then-generate solves a narrow class of questions. Production systems increasingly use multi-stage patterns.

Query decomposition. A multi-hop question ("compare the revenue trends of Apple and Microsoft from 2020 to 2024") is decomposed by the model into sub-queries, each retrieved separately, then synthesised. Decomp-and-retrieve patterns (Press et al. 2023, Self-Ask, ReAct-RAG) have matured into stable production patterns.

Iterative retrieval. The model generates a partial answer, identifies what it still needs, retrieves again, continues. Useful for long-form responses that require many distinct sources. The challenge is termination — when to stop retrieving. Hard limits (max 5 iterations, max 30 retrieved chunks) plus a "I have enough" classifier keep it bounded.

Routing. Different query types go to different retrieval paths. A "what is the policy on X" goes to the wiki index; "how do I fix error Y" goes to the support corpus; "what was the Q3 result" goes to the financial filings. A small classifier (or a cheap LLM call) makes the routing decision. Routing dramatically improves retrieval quality on heterogeneous corpora.

Agentic RAG. The retrieval tool is one of several the agent can call. The model decides whether to search, when, and how to refine. This is a agent serving infrastructure problem more than a RAG problem; the right framing is "retrieval is a tool the agent uses," not "the agent is wrapped around RAG."

Memory-augmented patterns. Conversational RAG stores prior turns alongside the corpus, so follow-up questions retrieve from both. The MemGPT-style (Packer et al., arXiv:2310.08560) pattern of treating context as a managed working memory is now mainstream in agent products. Practical implementation: a per-user "memory" index alongside the global corpus, with retrieval routing to both and the per-user index weighted higher when query intent suggests personal context (pronouns, references to prior turns).

Self-correction. A second pass verifies the first pass's answer against the retrieved chunks, optionally requesting more retrieval if the verification fails. Adds latency; cuts hallucination on hard queries. CRAG (Yan et al., arXiv:2401.15884) formalised this with a lightweight retrieval evaluator.

Multi-stage RAG patterns: when each pays off

Pattern Latency cost Quality lift When to use
Query rewrite (single call) +80–250 ms +5–15 pts recall@5 (multi-turn) Multi-turn chat, anything with pronouns
Multi-query expansion (N=3) +150 ms (parallel) +3–8 pts recall@5 Out-of-distribution queries
HyDE +200–500 ms +5–15 pts (OOD only) Domains where queries are very short, docs long
Query decomposition +300–800 ms 2–5× on multi-hop Comparative or analytic questions
Iterative retrieval (up to 3 hops) +1–3 s 2–5× on long-form synthesis Research-style tasks
Self-correction / CRAG +500–1500 ms -30 to -60% hallucinations Compliance, healthcare, legal
Agentic retrieval (model decides) Variable High variance Open-ended agent tasks

Adding all of these stacks doesn't make a better system; it makes a slower, more expensive one. Production systems pick the two or three patterns that match their query distribution and stop.


Graph RAG and structured retrieval

Plain RAG retrieves chunks of text. Graph RAG (Microsoft's GraphRAG, Edge et al., arXiv:2404.16130) builds an entity-relationship graph from the corpus during ingestion and retrieves subgraphs relevant to the query. Useful for synthesis queries that need to span many documents ("summarise the regulatory exposure across these 200 contracts") rather than answer-from-one-doc queries.

When graph RAG pays off.

  • Synthesis questions across many documents.
  • Entity-centric questions ("what do we know about Customer X across all their tickets, calls, and contracts").
  • Domains where relationships matter as much as content (legal, medical, scientific literature).

When it doesn't.

  • Pointed factual questions answerable from one chunk.
  • Frequently-updating corpora — graph construction is expensive.
  • Small corpora where chunk retrieval is already sufficient.

The cost: graph construction can run 10–100× the cost of plain chunking. The lift on synthesis queries can be 2–3×. Run it only where the workload justifies it.

Structured retrieval. SQL-over-tables and Cypher-over-graph are increasingly part of the retrieval layer. Text-to-SQL with verified execution (Pourreza & Rafiei, arXiv:2304.11015) covers analytic questions that pure text retrieval can't. Production systems route between text retrieval and structured retrieval based on query classification.

GraphRAG vs LightRAG vs Microsoft GraphRAG

Three graph-retrieval implementations dominate in 2026:

System Construction cost Query cost Best for
Microsoft GraphRAG High (full LLM extraction + community detection) High (multi-hop traversal) Synthesis over hundreds-to-thousands of docs
LightRAG (HKU) Medium (dual-level entity + relation extraction) Medium Mid-size corpora with entity-centric queries
LlamaIndex KnowledgeGraphIndex Low–medium (configurable) Low–medium Smaller corpora, easier setup
Plain vector RAG Low Low Pointed factual queries

Microsoft's published numbers show GraphRAG winning ~70% of comparative judgments against vector RAG on "global sensemaking" queries — questions about themes and patterns across a whole corpus. For "what does the contract say about indemnification," plain RAG matches or beats it.

The 2026 production pattern: route. Classify queries into "pointed-fact" and "synthesis" buckets, send pointed to vector RAG (cheap, fast), synthesis to graph RAG (expensive, slow). A small classifier or an LLM-as-router decides.


Evaluating RAG honestly

Public RAG benchmarks (HotpotQA, NaturalQuestions, FinanceBench, FiQA, MS MARCO) are useful for tracking the field. They predict your production behaviour about as well as MMLU predicts your customer-support quality.

The eval that matters. A curated set of 100–500 query-answer pairs from your own workload, with the correct answer and the correct retrieved sources tagged. Run your system end-to-end on this set, weekly, after every meaningful change. Track recall@k for retrieval, faithfulness for generation, and end-to-end correctness for the overall system.

Automated eval frameworks.

  • RAGAS (Es et al., arXiv:2309.15217) — context precision, context recall, faithfulness, answer relevance. LLM-as-judge under the hood. Start here.
  • ARES (Saad-Falcon et al., arXiv:2311.09476) — trains a domain-specific judge model from a small labelled set; more accurate on domain-specific tasks than generic RAGAS.
  • TruLens, Phoenix (Arize), Patronus AI, LangSmith — observability platforms that wrap eval into a UI and log every trace. Pick on operational fit.

The metrics that matter.

  • Recall@k for retrieval (ground-truth chunks tagged). Did the right chunks make it into the top-k?
  • Reranker uplift. Recall@5 after reranking minus recall@5 without. Should be 10–30 points; if not, debug the reranker.
  • Faithfulness. Is every claim in the answer supported by the cited chunks? Catches hallucination.
  • Answer relevance. Does the answer actually address the question? Catches off-topic responses.
  • Citation accuracy. Do the citations point to chunks that actually contain the cited material? Catches fabricated citations.
  • Refusal rate on out-of-corpus queries. How often does the system correctly refuse to answer when the corpus doesn't cover the question? Catches over-eager guessing. See AI hallucinations for the broader treatment.
  • Latency p50/p99 per stage. Wedge for performance regressions; alerts when any stage's p99 doubles.
  • Cost per query. Wedge for cost regressions; alerts when generation tokens, reranker calls, or retrieval candidates exceed budgets.

The eval-loop discipline. Every regression you see in production becomes a new eval case. Every novel failure mode becomes a new eval category. The eval set is a living artifact; it grows with the product. Teams that don't do this build eval-set rot, where the eval becomes increasingly disconnected from real workload behaviour.

LLM-as-judge: when to trust it, when not to

RAGAS, ARES, and most production graders use an LLM to score outputs. This works well for binary-ish judgments (was claim X supported? yes/no) and pairwise preference (is response A better than B?). It breaks down on absolute quality scores, niche domain expertise, and adversarial cases where the grader and the generator are the same model and share blind spots.

Three rules for LLM-as-judge:

  1. Use a different model family for grading than for generation. Sonnet generates, GPT-4o grades — or vice versa. Reduces shared-bias failures.
  2. Calibrate the judge against human labels. A 200-example human-labelled set, scored by your judge, tells you the judge's accuracy. If judge accuracy is below 85% on your domain, switch judges or write a stricter rubric.
  3. Prefer pairwise to absolute. "Is A better than B" is more reliable than "rate A out of 5." Pairwise preference also matches how you'll actually use the eval — to pick between candidate pipelines.

Eval cost and frequency

A 500-example eval against Sonnet-class generation and Haiku-class grading costs roughly:

  • Generation: 500 × $0.025 = $12.50
  • Grading (3 metrics × 1 judge call each): 500 × 3 × $0.002 = $3.00
  • Total: ~$15 per full eval run.

At that price, run the eval on every meaningful PR. Most teams gate on the eval before merging retrieval-affecting changes. The cost discipline that breaks is when teams add 50 metrics and the eval costs $500 — that's when it stops running and the rot starts.


Cost economics

A request through a 2026 RAG system has a cost stack that scales with three things: documents stored, queries served, and tokens generated.

Storage cost (per-document, monthly).

  • Embeddings at 1536 dim, fp16: 3 KB/chunk. 1M chunks = 3 GB. Cheap; the index overhead dominates.
  • HNSW index overhead: 2–4× the raw vector size. 1M chunks at 1536 dim ≈ 9–12 GB RAM.
  • BM25 index: roughly 30–50% of raw text size on disk.
  • Managed vector DB (Pinecone, Turbopuffer, Zilliz): $0.05–$0.50 per GB per month for hot tiers; Turbopuffer-style object-storage-backed is ~$0.005/GB/month for cold.

Per-query cost.

  • Vector DB query: $0.0001–$0.001 depending on managed vs self-hosted.
  • Reranker (Cohere rerank-3.5 on 100 docs): ~$0.002.
  • Embedding the query (Cohere embed-v4): ~$0.0001.
  • LLM generation (5k context, 500 token output, Claude Sonnet 4.6): ~$0.025.
  • Total: $0.025–$0.030 per query for a typical retrieval-heavy domain.

Long context comparison. Same query, no retrieval, 200k-token prompt on Gemini 1.5 Pro: ~$0.25. Ten times the cost of RAG, with worse quality on factual recall.

Where costs grow.

  • Reranking 1000 candidates instead of 100. 10× cost on the reranker stage. Almost never worth it.
  • Reasoning models on top of RAG. o3 or DeepSeek-R1 over RAG context costs 5–20× a standard LLM. Use only where the reasoning budget is justified by task quality lift.
  • Re-embedding the corpus. Switching embedding model or fine-tuning your own means re-embedding everything. At 100M chunks and $0.0001/embedding, that's $10,000 per re-index. Plan accordingly.

Capacity planning. Hot vector index in RAM is the binding constraint at scale. A 100M-chunk corpus at 1024 dim, fp16, with HNSW overhead, needs ~600 GB RAM. That's the size of the index. Distribute across nodes; replicate for availability. Disk-based indices (DiskANN) push the constraint to SSD bandwidth, which is easier to scale.

Full cost stack for a real workload

Take a SaaS support chatbot: 10M chunks, 100k queries/day, 5k context, 500 token output, Claude Sonnet 4.6.

Line item Monthly cost
Embedding (one-time + incremental, Cohere embed-v4) $200
Vector DB (Qdrant Cloud, 30 GB hot) $400
BM25 index (managed OpenSearch t3.large × 2) $250
Reranker (Cohere rerank-3.5, 100k searches × $0.002) $200
Query LLM (Sonnet 4.6, ~$0.025 × 100k × 30) $75,000
Grounding check (Haiku 4.5, ~$0.002 × 100k × 30) $6,000
Observability + storage $300
Total ~$82,350

Generation is 98% of the cost. Every "should we move to Pinecone or Milvus" debate is rearranging deck chairs. The cost levers that actually move the needle: smaller model on the easy 80% of queries, prompt caching on the static system prompt and few-shot examples, output token limits, and refusal on out-of-scope queries. See AI inference cost economics for the full breakdown.


Failure modes that actually happen

Production RAG postmortems cluster into a handful of categories. The taxonomy here is from incidents across several large RAG deployments — each one shows up more often than the next.

Bad parsing. PDF tables shredded into garbage text. HTML nav and footers slipping into chunks. Code with broken indentation. The fix is investing in the parser; the symptom is "the model can't find the answer that's clearly in the document."

Chunk boundaries through critical content. A table split across two chunks; a definition separated from its term; code where the signature is in one chunk and the body in another. Heading-aware chunking and parent-child context windows mitigate. Late chunking helps if the embedding model supports it.

Query-corpus mismatch. Embedding models trained on web text retrieve poorly from legal, financial, or medical corpora. Fine-tune the embedding model on in-domain data, or use a domain-specialised model (Voyage's code/legal/finance variants).

Pronoun and reference drift in queries. "What about for the second one?" — the system has no context for "the second one." Multi-turn RAG requires query rewriting that resolves references against conversation history before retrieval.

Retrieval-generation mismatch. Retrieval returns the right chunks; the generator ignores them and answers from prior knowledge. The fix is a strict system prompt, citation enforcement, and refusal training.

Stale index. The corpus updated; the index didn't. Document deletes that didn't tombstone properly. The fix is operational: change-data-capture from source systems, dead-letter queues for failed ingestion, dashboards for index freshness.

Reranker threshold off. Too low: noise floods the context. Too high: relevant chunks dropped, the system refuses queries it should answer. Tune empirically; revisit when the corpus or model changes.

Citation hallucination. The model emits [source:7] when only [source:1..5] were retrieved. Post-generation validation catches this; many production systems silently fail to validate.

The Long-Tail of one specific format. A customer's PDFs come from a legacy system that produces unparseable output. The whole pipeline works except for this one customer. Detect early; build format-specific paths.

Acronym and identifier failures

Embedding models squash near-token-identical strings into the same neighbourhood. A query for "GPT-4o" can pull back chunks about "GPT-4," "GPT-3.5," and "Gemini 1.5" because the model's representation of model names is fuzzy. Same story for SKUs, error codes, regulation references (HIPAA vs HITECH), and CVE IDs. The mitigation is hybrid retrieval — BM25 handles exact-string match perfectly — plus a metadata field for the canonical identifier when you can extract one at ingestion.

Hallucinated structure

The model generates a confident table or list that the source document doesn't contain. The retrieved chunks support some of the rows; the others are fabricated. This is sneaky because the response looks correctly cited but the structure has invented rows. Catch it with a post-generation NLI check at the claim level, not the paragraph level.

Multi-language drift

A multilingual corpus where some chunks are English, some Mandarin, some Spanish. A Spanish query retrieves English chunks because the embedding model has stronger English representations. Fix: route by detected query language, or use a model with stronger multilingual symmetry (BGE-M3, Cohere embed-v4 multilingual). Same problem in reverse for code with comments in non-English languages.

How to debug a failing RAG query in 10 minutes

A debug procedure that catches most production issues, in order:

  1. Did retrieval return the right chunk? Log top-20 retrieved chunks for the query. If the right chunk isn't in top-20, the retrieval layer is broken (chunking, embedding, BM25 stopword, query rewrite).
  2. Did the reranker keep the right chunk? Check top-5 after rerank. If the right chunk dropped out of top-5 but was in top-20, tune the reranker or its threshold.
  3. Did the generator see the right chunk? Log the assembled prompt. Truncation, dedup, or ordering bugs can drop chunks before they reach the model.
  4. Did the generator use the chunk? If the right chunk is in the prompt but the answer ignores it, the system prompt is too weak or the chunk format is ambiguous. Tighten the prompt; add explicit "answer from sources only" language.
  5. Did the generator hallucinate a citation? Validate citations against retrieved IDs in post-processing.
  6. Is the eval signal real? Confirm the "correct" answer in your eval set is actually correct. Eval-set rot is real.

Most teams blame the LLM first and the retrieval layer last. Reverse that order and you'll resolve incidents faster.


Parser deep dive: LlamaParse, Unstructured, Textract, Document AI, Reducto

The single largest determinant of RAG quality on real corpora is not the embedding model or the reranker. It is the parser. A perfect retriever cannot recover information that was destroyed by a bad PDF extractor. The 2026 parser landscape:

LlamaParse (LlamaIndex)

A managed service tuned for LLM ingestion. Handles complex PDFs with multi-column layouts, tables, embedded images, and footnotes. Output is markdown with preserved table structure. Premium tier uses vision models for layout understanding. Pricing $3 per 1000 pages (free tier 1000 pages / day) makes it tractable for moderate corpora; for hyperscale ingestion the cost adds up.

Best for: legal contracts, financial filings (10-Ks), scientific papers with tables, anything where layout matters. Not the best for code or HTML.

Unstructured.io

Open-source-core parser with a paid managed service. Supports 60+ document formats out of the box (PDF, DOCX, PPTX, HTML, EML, images via OCR). The "hi-res" mode uses YOLOX / layout-parser models for layout detection, producing structured output with elements tagged as Title, Header, NarrativeText, ListItem, Table, Image. The "fast" mode is pdfminer-based and skips layout detection.

Best for: heterogeneous corpora where you need format coverage more than per-format excellence. The structured-element output integrates cleanly with downstream chunking.

AWS Textract

AWS's managed OCR and form/table extraction service. Best-in-class for hand-filled forms, scanned receipts, and any document where OCR quality is the bottleneck. Table extraction is solid for grid-shaped tables, weaker for nested or merged-cell tables. Pricing is per-page and accumulates quickly at scale ($1.50 per 1000 pages for synchronous, less for asynchronous).

Best for: form-heavy workflows (insurance claims, government documents, scanned legal records). Less compelling for born-digital PDFs where simpler parsers do better.

Azure Document Intelligence (formerly Form Recognizer)

Azure's equivalent. Strong on tables, key-value extraction, and pre-built models for invoices, receipts, IDs. The "Layout" model is the general-purpose option; pre-built specialized models handle the rest. Pricing $1.50 per 1000 pages for the general model, more for specialized ones.

Best for: Azure-native deployments, regulated industries that need the Microsoft compliance stack, multi-language documents (handles 100+ languages competently).

Google Document AI

Google's offering. The "Document OCR" baseline is fine; the value is in the custom-trainable processors that can be tuned for specific document types. The 2025+ Gemini-powered processors handle layout reasoning end-to-end with reasonable quality.

Best for: GCP-native deployments, custom document types where you can afford to train a processor.

Reducto

A newer entrant (2024) focused specifically on LLM-grade PDF extraction. Marketed on accuracy at table and chart extraction. Independent benchmarks (2025) show Reducto outperforming Textract and LlamaParse on table-heavy documents by 5-15 points in F1. Pricing is in the same range.

Best for: production workflows where table accuracy is the dominant quality metric (financial documents, scientific tables, supply chain documents).

ChunkR

Open-source parser focused on extracting structured chunks ready for embedding, including layout-aware chunk boundaries. Less polished than LlamaParse, free, and runs locally — useful for on-prem deployments where data cannot leave the network.

Parser selection matrix

Parser Best fit Cost / 1k pages Open-source Strengths Weaknesses
LlamaParse Mixed corpora, complex PDFs $3 No (free tier) LLM-grade markdown, tables Cost at scale
Unstructured.io Heterogeneous formats $0 (OSS) or $0.50 (cloud) Yes Format breadth, structured elements Tables weaker than specialists
AWS Textract Forms, scanned docs $1.50 No OCR quality, form extraction Born-digital PDFs
Azure Doc Intelligence Multi-language, regulated $1.50 No Languages, compliance Vendor lock
Google Document AI Custom trainable $1.50+ No Custom processors GCP-only
Reducto Table-heavy $2-3 No Table F1 Newer, less ecosystem
ChunkR On-prem, OSS $0 Yes Local control Less polished
pdfminer.six Last-resort fallback $0 Yes Free, simple Loses all structure

The pragmatic decision: try LlamaParse or Unstructured first for breadth, escalate to Reducto/Textract for table-heavy verticals, fall back to pdfminer + custom logic when budget is zero. Many production pipelines use two parsers — a primary for most documents, a fallback for edge cases the primary rejects.

OCR vs layout-aware: when each wins

OCR-only (Tesseract, basic Textract OCR) extracts text from images but loses layout. Layout-aware parsers (Unstructured hi-res, LlamaParse, Reducto) preserve reading order, table structure, headers. For born-digital PDFs (PDFs made from Word, never scanned), layout-aware parsers extract from the PDF directly without OCR — faster and more accurate. For scanned PDFs (camera photos, fax records), OCR is unavoidable; pair it with a layout model for best results.

The classic failure: parsing a multi-column legal document with a single-column extractor produces interleaved column 1 / column 2 text that is gibberish to embeddings. Always use a layout-aware parser on documents with non-trivial layout.


Chunking strategies: fixed, semantic, hierarchical, late, contextual

After parsing, chunking decides what units the retriever sees. Five mainstream strategies, each with different cost / quality trade-offs.

Fixed-size chunking

Split text into N-token windows (typically 256-1024 tokens) with M-token overlap (10-20%). Trivially fast, deterministic. The baseline that everyone starts with.

The problems: ignores semantic boundaries, can split a sentence mid-clause or a table mid-row, treats a code block and a paragraph the same way. Despite this, with a good reranker on top, fixed-size chunking at 512 tokens is acceptable for most prose workloads.

Semantic chunking

Use embeddings to detect semantic breakpoints. Encode each sentence, compute cosine similarity between adjacent sentences, place a chunk boundary where similarity drops below a threshold. The intuition: chunks should be internally coherent.

Implementations vary. LlamaIndex's SemanticSplitterNodeParser and LangChain's SemanticChunker both exist; both depend on the embedding model used. The cost: embedding overhead at ingestion time (small), occasional missed boundaries on technical content where adjacent sentences are unrelated but should be in the same chunk.

Realistic quality lift over fixed-size: 2-5% recall@5 on prose, larger on long-form content with topic shifts. Not transformative on its own.

Hierarchical chunking

Maintain multiple chunk granularities: paragraph, section, document. At retrieval time, search at the smallest granularity but optionally expand to the parent section for context. LlamaIndex's HierarchicalNodeParser exposes this pattern.

Useful when documents have clear hierarchy (legal contracts with clauses, scientific papers with sections, code with functions and modules). The retriever uses the small chunks for precision and expands to the parent for context. Increases retrieval complexity but recovers context that fixed-size loses.

Late chunking (Jina, 2024)

Embed the entire document first (or large windows), then chunk the resulting token embeddings. The chunk embedding is a pool over the chunk's tokens, but each token's representation was computed with full-document context. The chunks "know" what comes before and after them.

Requires an embedding model that supports long context (Jina v3, Voyage, BGE-M3). Benchmark wins: 5-10% recall@5 over plain fixed-size chunking on documents where context flow matters.

Practical caveat: most embedding-model APIs charge per token, so late chunking costs more (the full document is embedded once, then chunk embeddings are derived from those token reps). On-prem embedding makes it cheaper.

Contextual retrieval (Anthropic, September 2024)

For each chunk, generate a 50-100 token context summary using a small LLM (Claude 3.5 Haiku in the original paper) that locates the chunk within its document. Prefix the summary to the chunk before embedding.

Example: a chunk that says "Revenue grew 23% YoY" becomes "From the Q3 2024 earnings call discussing the EU region: Revenue grew 23% YoY." The embedding now captures both the fact and its location, making retrieval more accurate when the query asks about Q3 EU revenue.

Anthropic's published result: 49% reduction in failed retrievals from contextual embedding alone, 67% reduction when combined with a reranker. This is the largest single quality lever published in the last 18 months and worth the implementation cost (one LLM call per chunk at ingestion, cached forever via prompt caching).

Document-hierarchy chunking

For highly structured documents (legal contracts, scientific papers, technical manuals), chunk along the explicit hierarchy (Section, Subsection, Paragraph) and attach the path metadata to each chunk. At retrieval time, the path can be used as a filter or rerank signal. This is the right answer for docs where the structure is reliable; not useful for free-form prose.

Chunking strategy comparison

Strategy Ingestion cost Quality vs fixed Best for
Fixed-size $0 baseline Baseline Prose, fast prototypes
Semantic +5% (embedding for boundaries) +2-5% recall Long-form prose with topic shifts
Hierarchical +10% (multiple granularities) +5-15% recall Structured docs (legal, scientific)
Late chunking +100-300% (full-doc embedding) +5-10% recall Context-flow heavy
Contextual retrieval +200-500% (one LLM call per chunk) +30-60% recall reduction in failures Anything important
Document hierarchy +20% (structure detection) +10-25% recall Explicit structure (legal, manuals)

The 2026 default for serious deployments: contextual retrieval + fixed-size chunking + a strong reranker. The combination delivers most of what's available.


Embedding deep dive: dim, Matryoshka, binary, quantization

Embedding model choice is overstudied; the differences between top contenders are smaller than the chunking strategy or the reranker. What does matter more than people realize: dimension, Matryoshka support, and quantization.

Dimension

Common dimensions in 2026: 768 (BGE-large), 1024 (Cohere v4, Voyage v3, Jina v3), 1536 (OpenAI ada-002, text-embedding-3-small), 3072 (OpenAI text-embedding-3-large), 4096 (NV-Embed v2).

Higher dimension stores more information but costs more in HBM, network, and vector-DB indexing time. The relationship is sublinear — going from 768 to 3072 typically gets you 2-5% recall improvement, not 4×.

The pragmatic choice: 1024 dim is the sweet spot for most production workloads. 768 is acceptable if storage cost matters. 3072 is only worth it for high-stakes retrieval where every point of recall counts.

Matryoshka Representation Learning (MRL)

Models trained with MRL produce embeddings that can be truncated to shorter dimensions while preserving most of the quality. OpenAI text-embedding-3, Nomic embed-v2, BGE-M3 all support MRL. Truncating a 3072-dim embedding to 512 dim typically loses 2-4% recall vs full-dim.

This is the right answer for storage-constrained deployments: store the truncated embedding, retrieve at the truncated dim (fast), optionally rerank with the full-dim embedding for higher quality on the top candidates.

Binary quantization

Quantize each embedding dimension to a single bit (sign of the float). Storage drops 32× (a 1024-float embedding becomes a 1024-bit = 128-byte vector). Retrieval via Hamming distance is extremely fast (single XOR + popcount per comparison).

Quality loss: 5-15% recall@5 on standard benchmarks, often much smaller on real workloads. Combine with reranking on a small top-k set retrieved via binary distance, then re-score with full-precision dot product, and you recover most of the lost quality.

Vector DBs supporting binary embeddings: Qdrant, Milvus, pgvector (via bit-vector), Weaviate (preview), Pinecone (binary indexes in Serverless). The technique is mature enough for production at scale.

Scalar quantization (int8)

A middle ground: quantize each dimension to 8 bits instead of 32. Storage drops 4×. Recall@5 loss is typically under 2%. Supported by most vector DBs natively. The right default for cost-conscious deployments that are not ready for binary.

Matryoshka + binary combined

Cohere's embed-v4 and BGE-M3 support both: use a truncated, binary-quantized embedding for the broad retrieval, then full-dim float for reranking. This is the state-of-the-art for production at scale; storage cost drops by 50-100× vs naive 1024-float, recall drops by 5-10% which a reranker recovers.

Per-model 2026 quick reference

Model Dim MRL Binary Languages Strengths
Cohere embed-v4 1024 Yes Yes 100+ Best general, multilingual
OpenAI text-embedding-3-large 3072 Yes No 100+ Strong default, MTEB high
Voyage voyage-3-large 1024 Yes Yes 100+ Domain-tuned variants
BGE-M3 1024 Yes Yes 100+ Best open-weight, hybrid
Jina embeddings v3 1024 Yes Yes 89 Long-context (8k tokens)
Nomic embed v2 768 Yes No English-heavy Best small open
NV-Embed v2 4096 Yes No English Top MTEB, large
E5-Mistral-7B 4096 No No 100+ LLM-based, strong

For most production deployments in 2026, the choice is between Cohere embed-v4 (managed, multilingual, highest quality) and BGE-M3 (open-weight, self-hostable, comparable quality). The price differentiator is operational, not quality.


Sparse retrieval and SPLADE/ColBERT details

Dense retrieval (embeddings) wins on semantic queries. Sparse retrieval (BM25 and learned-sparse) wins on lexical precision — exact term matches, rare technical terms, code identifiers. Production stacks combine both.

BM25 (Okapi)

The classical baseline. Term-frequency × inverse-document-frequency with length normalization. Implementations: Elasticsearch, OpenSearch, Tantivy, Lucene, Whoosh. Tuning parameters: k1 (term frequency saturation, typically 1.2-2.0) and b (length normalization, typically 0.75).

BM25 is the workhorse for keyword retrieval. Every production RAG stack should have a BM25 lane, even if dense retrieval is the primary one. Cost is nearly zero (lexical indexes are small and fast).

SPLADE (Sparse Lexical and Expansion Model)

Learned-sparse retriever. A neural model produces a sparse vector over the vocabulary for each document and query, where each non-zero entry represents a term and its learned weight (including terms not in the original text — expansion). Retrieval uses the same inverted-index machinery as BM25 but with learned weights.

SPLADE++ (2022) is the practical implementation. Quality is typically between BM25 and dense embeddings; combined with dense it can outperform either alone.

Operational caveat: SPLADE indexes are 5-10× larger than BM25 indexes (more non-zero terms per document). The retrieval is still fast but storage matters at scale.

ColBERT v2 (late interaction)

ColBERT (Khattab and Zaharia, 2020, v2 Santhanam et al., 2021) is a different paradigm. Instead of a single embedding per document, ColBERT stores per-token embeddings. At retrieval, the query's per-token embeddings are compared against each document's per-token embeddings via MaxSim (for each query token, find the most similar document token, sum the similarities).

The result: higher recall than bi-encoder retrieval, with the cost of much larger storage (per-token embeddings) and more complex retrieval. ColBERT v2 introduces compression (centroid-based quantization) that makes it feasible at scale.

Use cases: high-stakes retrieval where every recall point matters and storage is not the bottleneck. Stanford's Stella, OpenAI internal systems, and some legal-search products use ColBERT-style retrieval. For general production RAG, the simpler dense + reranker stack is usually a better cost/quality trade-off.

When to use which sparse path

  • BM25 only: legacy compatibility, very small corpora, code search where exact-match is critical.
  • BM25 + dense (hybrid): 95% of production workloads. The right default.
  • SPLADE + dense: high-stakes search where the few-point quality lift over BM25 is worth the storage cost.
  • ColBERT: research, high-stakes search, legal / medical retrieval where recall is dominant.

Hybrid fusion: RRF, weighted, learned fusion

After retrieving K candidates from both BM25 and dense, you need to merge them into one ranked list. Three fusion methods:

Reciprocal Rank Fusion (RRF)

For each candidate document, score it as the sum of 1 / (k + rank_i) across the retrieval lists it appears in, where rank_i is its rank in list i and k is a constant (typically 60). The intuition: documents that appear high in multiple lists score highest.

RRF is parameter-light, requires no training data, and works robustly across heterogeneous score distributions (BM25 scores and cosine similarities live on different scales). It is the default fusion method in most production stacks.

Weighted score fusion

Compute a weighted sum of normalized scores: final = alpha * normalize(bm25_score) + (1 - alpha) * normalize(dense_score). Requires score normalization (min-max or z-score) since BM25 and cosine live on different scales. The weight alpha is workload-specific.

Tunable, but the optimization typically requires labeled training data. Production stacks that tune alpha typically find values in the range 0.3-0.5 (slight preference for dense).

Learned fusion

Train a lightweight model (often a gradient boosting model like XGBoost) that takes per-document features (BM25 score, dense score, both ranks, term overlap, document length, freshness) and predicts relevance. The standard learning-to-rank pattern.

Wins another 2-5% recall@5 over RRF when trained on representative labeled data. Costs: labeling, training pipeline, monitoring for drift. Only worth it for high-stakes applications where the marginal recall matters.

Fusion strategy comparison

Method Training data needed Quality vs RRF Operational cost
RRF None Baseline Trivial
Weighted scores Some (for alpha tuning) -2 to +3% Low
Learned fusion Substantial labeled data +2 to +5% High

For most production workloads, RRF is the right default. Reach for learned fusion only when you have a labeled relevance dataset and the marginal quality matters.


Query rewriting: HyDE, multi-query, step-back, decomposition

User queries are messy. They use pronouns ("how does it work"), abbreviations ("RAG"), domain jargon, typos. The retriever sees these queries cold. Pre-processing queries before retrieval can improve recall substantially.

HyDE (Hypothetical Document Embeddings)

Gao et al., 2022. For each query, use an LLM to generate a hypothetical answer document (a few sentences pretending to answer the query). Embed the hypothetical document instead of the query and retrieve against it. The intuition: the hypothetical document is closer in embedding space to real answer documents than the query itself is.

Empirical results: 5-15% recall@5 improvement on out-of-domain queries. Cost: one LLM call per query. For latency-sensitive deployments, the rewriter LLM should be a small fast model (Claude 3.5 Haiku, GPT-4o mini); rewriter latency is added to every request.

Multi-query expansion

Use an LLM to generate N (typically 3-5) paraphrased versions of the query. Retrieve against each, union the results, rerank. The diversity of phrasings catches different relevant documents that any single phrasing might miss.

Wins 3-10% recall on workloads where the original query has poor vocabulary match. Cost is N retrieval round trips plus one LLM call.

Step-back prompting

Zheng et al., 2023. For complex queries, first generate a more abstract "step-back" question, retrieve documents relevant to it, then use both the original and step-back queries together. The technique trades precision on the original query for broader context.

Most useful for analytical or reasoning-heavy queries where the relevant documents do not lexically match the query (e.g., "What caused the Q3 revenue decline?" -> step-back: "What factors affect quarterly revenue?").

Query decomposition

For multi-hop queries, decompose into sub-questions answerable by single retrievals. "What was the highest-grossing film by the director of Pulp Fiction?" -> ("Who directed Pulp Fiction?", "What is the highest-grossing film by Quentin Tarantino?"). Run retrievals serially or in parallel, then synthesize.

This is the entry point to agentic RAG. Done well, it dramatically improves multi-hop accuracy. Done badly, it explodes latency and cost.

Rewriting strategy comparison

Strategy Latency cost Quality lift Best for
None (raw query) 0 Baseline Simple short queries
HyDE 1 LLM call +5-15% recall OOD vocabulary mismatch
Multi-query 1 LLM + N retrieve +3-10% recall Ambiguous queries
Step-back 1 LLM call +5-15% on analytical Reasoning queries
Decomposition 1 LLM + N retrieve Large on multi-hop Complex multi-step questions

The pragmatic stack: route queries by type. Simple lookups skip rewriting; complex analytical queries get step-back; multi-hop questions get decomposition.


Contextual retrieval and contextual embedding

Worth a dedicated section because the technique is the single largest published quality lever for RAG in the last 18 months.

The Anthropic technique (2024)

For each chunk in the index, use a small LLM to generate a 50-100 token context summary that locates the chunk within its parent document. Prefix the summary to the chunk text before embedding. Optionally also include the summary in the chunk text shown to BM25.

The cost at ingestion: one LLM call per chunk. Anthropic uses Claude 3.5 Haiku, with prompt caching to amortize the document content across all chunks of that document. With caching, the marginal cost per chunk is ~$0.0001 — tractable for millions of chunks.

Why it works

Two reasons. First, embeddings of context-prefixed chunks are more discriminative: a chunk about "revenue growth" embedded alongside its document context ("Q3 2024 EU earnings") embeds differently than the same chunk in a different document context, so retrieval matches the right one. Second, BM25 indexing of the context summary adds lexical hooks the original chunk lacked.

Reported quality results

Anthropic's published numbers: 49% reduction in failed retrievals from contextual embedding alone, 35% from contextual BM25 alone, 67% from both, 96% reduction when combined with a reranker. The "96%" number is the dramatic one: combining contextual retrieval with a strong reranker effectively eliminates retrieval failures on the evaluated workloads.

Implementation pattern

# Pseudocode
for doc in documents:
    chunks = chunk(doc, size=512)
    for chunk in chunks:
        context = llm.generate(
            f"Document title: {doc.title}\n"
            f"Full document content: {doc.content}\n"  # prompt-cached
            f"Chunk: {chunk.text}\n"
            f"Summarize where this chunk fits in 50 tokens:"
        )
        chunk.embedding = embed(f"{context}\n{chunk.text}")
        chunk.bm25_text = f"{context}\n{chunk.text}"

The doc.content portion is shared across all chunks of the document and benefits from prompt caching at the LLM provider (Anthropic, OpenAI, Gemini all support caching now). The marginal cost per chunk is just the per-chunk text and the output tokens.

Contextual retrieval cost math (worked example)

A corpus of 10M chunks, 100K source documents. Each document averages 100 chunks. Each chunk averages 500 tokens; each context summary averages 75 tokens.

Per chunk: prompt = ~2000 tokens (cached document) + 500 tokens (chunk) + 100 tokens (instruction); response = 75 tokens.

With Claude 3.5 Haiku at 2024 prices ($0.80 / M input, $0.10 / M cached input, $4.00 / M output): cached prompt cost = $0.20 / 1M tokens = essentially free per chunk. Per-chunk input not cached (chunk text + instruction): ~600 tokens at $0.80 / M = $0.0005. Output: 75 tokens at $4.00 / M = $0.0003. Total per chunk: ~$0.0008.

10M chunks × $0.0008 = $8,000 one-time ingestion cost. For a production RAG handling thousands of QPS, this is a rounding error.


Agentic RAG patterns

The frontier of RAG in 2026 is moving from "retrieve-then-generate" to agentic patterns where an LLM decides what to retrieve, when, and how many times.

ReAct over knowledge

ReAct (Yao et al., 2022) interleaves reasoning and retrieval. The model emits Thought / Action / Observation cycles: a Thought reasons about the next step, an Action is a retrieval query, an Observation is the retrieved content. Repeat until the model emits an Answer.

Implementation: the model is given a tool definition (search(query)) and decides when to use it. Production frameworks (LangGraph, LlamaIndex Agent, OpenAI Assistants) provide the orchestration. Costs are per-step LLM calls plus retrievals; latency grows linearly with the number of steps.

Multi-hop with reflection

For multi-hop questions, the agent decomposes the query, retrieves for each sub-question, optionally reflects on whether the retrieved content actually answers the sub-question, retries with refined queries if not, and synthesizes a final answer. The reflection step (a self-critique pass) is where modern agents get robustness.

Self-RAG (Asai et al., 2023)

The model decides whether to retrieve at all per-token (or per-segment). Generated text that the model has high confidence in skips retrieval; uncertain segments trigger a retrieval. Reduces retrieval cost on questions where the model already knows the answer.

Plan-and-execute

The agent first emits a plan (a sequence of retrievals and reasoning steps), then executes the plan. Separating planning from execution allows the plan to be validated, cached, or reused. Useful for complex investigative queries.

Production trade-offs

Agentic RAG dramatically increases quality on hard questions and equally dramatically increases cost and latency. A simple retrieve-generate path is ~1 LLM call + 1 retrieve; an agentic path can be 3-15 LLM calls + 3-15 retrieves. The economics work for high-value queries (medical research assistance, legal investigation, financial analysis); they do not work for high-volume chat workloads.

The pragmatic deployment pattern: route queries by complexity. Simple lookups go through the single-shot path; complex queries (detected by a small classifier or by the model's own complexity estimation) go through the agentic path. Most production RAGs in 2026 use this two-tier routing.


Production cost stack: worked example

A realistic RAG deployment, May 2026. Workload: enterprise document Q&A, 1M source documents, 100M chunks after parsing, 1000 queries / second peak (average 200 QPS).

Ingestion costs (one-time + delta)

  • Parsing: 1M documents × $0.003 / page × 20 pages avg = $60K (LlamaParse).
  • Contextual retrieval: 100M chunks × $0.0008 = $80K (Claude 3.5 Haiku).
  • Embedding: 100M chunks × 600 tokens / chunk × $0.13 / M tokens = $7.8K (Cohere embed-v4).
  • Indexing into vector DB: included in storage cost.

Total one-time: ~$150K. Recurring delta (assume 10% of corpus updates monthly): ~$15K / month.

Storage costs (recurring)

  • Vector DB: 100M chunks × 1024 dim × 4 bytes = 410 GB raw. With binary quantization + scalar reranking: ~50 GB. Pinecone Serverless: ~$2K / month for 50 GB. Qdrant self-hosted: ~$500 / month for the equivalent.
  • BM25 index: ~50 GB on disk. Elasticsearch cluster: ~$1K / month.
  • Source document storage: 1M × 1 MB avg = 1 TB. S3: ~$25 / month.

Total storage: ~$3K / month managed, ~$1.5K / month self-hosted.

Query path costs (recurring)

Per query: 1 query embedding ($0.0001), 2 retrievals (BM25 + dense, ~$0.0002 amortized infrastructure), 1 reranker call on top-100 ($0.0001 with Cohere Rerank 3.5), 1 generation call (assume Llama-3-70B FP8 self-hosted, ~$0.0005 per query at $0.50 / M token equivalent for 800 input + 200 output tokens). Total per query: ~$0.001.

At 200 QPS × 86400 s × 30 days = ~520M queries / month. Total query cost: ~$520K / month. Generation dominates (50%); reranking is ~10%; retrieval is ~20%; embedding is ~20%.

Cost optimization levers

Lever Effect on monthly cost
Switch generation to Llama-3-8B for simple queries (50% of traffic) -25% (~$130K saved)
Cache identical queries (assume 20% hit rate) -20% generation + retrieval cost
Skip reranker for top-3 high-confidence retrievals -5%
Binary embedding + reranking -50% storage, ~0% query cost
Self-host embedding (BGE-M3) -90% embedding cost (~$15K / month at this scale)
Use Cohere rerank-3-nano on cheap queries -50% rerank cost

A well-optimized production stack at this scale runs at ~$300-400K / month, dominated by generation. RAG infrastructure proper (retrieval, reranking) is typically 20-30% of total spend.


Eval methodology: RAGAS, TruLens, golden sets

Eval is where most production RAG deployments fail silently. The disease: shipping a system that demos well, then watching customer questions reveal failure modes the team never tested.

RAGAS

RAGAS is the open-source standard for RAG evaluation. Computes per-query metrics including:

  • Faithfulness: how much of the answer is supported by retrieved context (LLM judge).
  • Answer relevance: how directly the answer addresses the query (LLM judge).
  • Context precision: proportion of retrieved chunks that are relevant (LLM judge with ground truth).
  • Context recall: proportion of ground-truth relevant chunks that were retrieved.
  • Answer correctness: against a ground-truth answer (LLM judge).

RAGAS provides a complete eval harness. The judge is configurable (GPT-4o, Claude 3.5 Sonnet, etc.). The metrics correlate reasonably with human judgment but are noisy on edge cases; treat them as directional, not absolute.

TruLens

TruLens takes a similar approach but with a richer observability layer. It instruments your RAG pipeline, captures intermediate states (retrieved chunks, reranked top-k, generated text), and runs evaluations on them. Useful for production observability — running TruLens on a sample of live traffic surfaces drift over time.

Custom golden sets

The most reliable eval is a curated set of 50-500 question-answer pairs from your actual workload, hand-validated by domain experts. Run the RAG against the golden set after every change; track recall@5, precision@5, and answer correctness. A small golden set updated weekly with new edge cases catches more production bugs than any benchmark.

Build the golden set from production traces. Sample diverse queries (cluster embeddings, sample one from each cluster), have a domain expert write the ideal answer, validate that the retrieval surfaces the relevant chunks. Maintain the set; rotate in new traces, retire stale ones.

Public benchmarks: useful but contaminated

HotpotQA, NaturalQuestions, FinanceBench, MultiHop-RAG. All are useful for cross-system comparison and for catching obvious regressions. All are contaminated to some degree in modern LLM training data; absolute numbers are inflated. Use them as one signal among many, not as your primary eval.

Eval frequency in production

Recommended cadence: golden-set eval before every deploy (block on regression > 2 points), continuous TruLens-style sampling on live traffic (alert on drift > 5 points), weekly review of failure cases with the product team. The combination catches most production regressions before they hurt users.


Long-context vs RAG vs fine-tune decision math

Three ways to give an LLM your data: put it in the prompt (long context), retrieve and put the relevant slice in the prompt (RAG), or bake it into the weights (fine-tuning). The economic and quality trade-offs in 2026:

Long context

Cost: input tokens × per-token price, every query. For a 100K-token corpus at $3 / M input tokens, every query costs $0.30 just for the context. For 1M queries / month: $300K / month, ignoring all other costs.

Quality: depends on the model's long-context performance. Gemini 2.5 Pro and Claude Opus 4.x handle 1M tokens with reasonable retrieval-from-haystack quality. GPT-5 at 200K context is similar. Long-context quality degrades on multi-needle queries and lost-in-the-middle effects.

Best for: small corpora (< 500K tokens), one-shot tasks, prototyping.

RAG

Cost: storage + retrieval infrastructure (small fixed cost) + per-query embedding + retrieval + reranking + generation. For typical workloads, per-query cost is 1-10% of long-context cost.

Quality: depends on retrieval quality. With contextual retrieval + strong reranker, RAG matches or beats long-context on most retrieval-style queries. Worse on synthesis-heavy queries that need the full document.

Best for: moderate to large corpora (>1M tokens), high query volumes, anything where storage is feasible.

Fine-tuning

Cost: training cost (one-time, $1K-100K depending on model size and data volume) + inference at fine-tuned weight prices (usually similar to base model prices). Updates require re-training or LoRA adapter swaps.

Quality: best for style and format adaptation, mediocre for factual recall. Fine-tuning does not reliably "teach" facts; it teaches patterns. For factual knowledge, RAG is more reliable.

Best for: style adaptation, domain-specific tone, output format consistency. Not for factual lookup.

Decision matrix

Workload Corpus size Query volume Recommended approach
Chatbot over docs < 100K tokens Any Long context
Chatbot over docs 100K - 1M tokens < 10K / day Long context if budget allows, RAG otherwise
Chatbot over docs > 1M tokens Any RAG
Enterprise search Any High RAG
Style adaptation Any High Fine-tune (LoRA)
Mixed Any Any RAG + fine-tune for style

The 2026 reality: most non-trivial production deployments use RAG + LoRA fine-tuning. Long-context is for prototyping and small corpora; pure fine-tune is for narrow style use cases.


Observability for RAG

Production RAG needs observability beyond standard web service metrics. Key signals:

Per-stage latency breakdown

Track median and P99 latency for each stage: parse (ingestion only), embed, BM25 retrieve, dense retrieve, fusion, rerank, generate. The bottleneck shifts as the system grows; without per-stage tracking you cannot identify which stage to optimize.

Recall proxies

Track distribution of top-k similarity scores. A drop in the distribution (top-1 score median falling) usually signals a retrieval quality regression — either bad new documents in the corpus or a model drift. Alert on distribution shifts.

Citation rate

Track the fraction of generated answers that include valid citations (citations that reference actual retrieved chunks). A drop in citation rate is an early signal of generation drift or prompt changes that broke citation discipline.

Failed retrievals

Track queries where the top-k retrieved chunks have low similarity scores (no result is confidently relevant). For these queries, the LLM is likely to hallucinate. Production stacks should detect this and respond with "I don't know" rather than fabricating.

Distribution monitoring

Track the distribution of query embeddings over time. Sudden shifts (clusters appearing or disappearing) indicate workload changes that may require re-tuning chunking, embedding model choice, or retrieval thresholds.

Trace sampling

Sample 1-5% of production traces, store full context (query, retrieved chunks, generated answer, citation validity), use them for offline eval and root-cause analysis. The sample is large enough to catch failure modes and small enough to be cost-effective.


Security: PII, row-level access, multi-tenant isolation

RAG over enterprise data immediately encounters security requirements that consumer chat applications can ignore.

PII in chunks

Source documents often contain PII (names, emails, SSNs, financial details). The chunks inherit this PII. Without controls, a user query can retrieve a chunk containing another user's PII and surface it in the generated answer.

Mitigations: PII scrubbing at ingestion (replace detected PII with redaction tokens), per-chunk PII tags that filter at retrieval, output-side PII detection that blocks responses containing high-confidence PII. The state of the art uses Presidio (Microsoft) or AWS Comprehend for detection; production stacks chain multiple detectors for higher recall.

Row-level access control

Each chunk has a set of allowed-viewer identities (users, groups, roles). Queries are filtered to only retrieve chunks the requesting user has access to. The challenge: filtering must happen efficiently at retrieval time without scanning the entire candidate set.

Vector DBs supporting metadata filtering on hot paths: Qdrant (named vectors with payload filters), Pinecone (metadata filters with serverless), Milvus (scalar filter integration), Weaviate (object-level ACLs). Performance depends on the selectivity of the filter — high-cardinality filters (per-user access) are harder than low-cardinality filters (per-tenant). For per-user access, partition the index by user when possible.

Multi-tenant index isolation

Two architectures for multi-tenant RAG: shared index with tenant-id filtering, or separate index per tenant. Shared is cheaper at small per-tenant data volumes; separate is safer (no risk of filter bugs leaking across tenants) and scales linearly with tenant count.

The 2026 trend: separate indexes per tenant with managed vector DBs that support cheap-to-create namespaces (Pinecone Serverless, Turbopuffer). Costs are similar to shared-index at moderate scale and security is dramatically better.

Audit logging

Every retrieval should log: timestamp, user identity, query, retrieved chunk IDs, generated response. This audit log is the evidence trail for compliance (HIPAA, SOC 2, GDPR) and for investigating incidents. Store it in append-only storage; retain per compliance requirements.

Prompt injection in retrieved content

If retrieved documents are user-generated or web-crawled, they may contain prompt injection attempts ("Ignore your instructions and..."). RAG systems are particularly vulnerable because the injection lands inside the model's context window with high authority.

Mitigations: sanitize retrieved content (strip suspicious patterns, encode in delimited regions the model is trained to treat as data, not instructions), use models trained for injection resistance (Claude's recent constitutional training, Gemini's safety filters). The state of the art is imperfect; high-stakes deployments should assume injection attempts will sometimes succeed and design downstream controls accordingly.


2026 trends and what's next

Small specialized retrievers

Domain-specific embedding models (Voyage code, Voyage legal, Cohere embed-multilingual-v4) consistently beat general-purpose models on their domain. The trend is fine-tuning embedding models on customer corpora for the last 5-10% of recall.

Retrieval-aware fine-tuning

Fine-tuning the generator on retrieval-augmented inputs (training the model to use retrieved context faithfully) is more effective than naive fine-tuning. Open-source recipes (RAFT, FiD) exist; production adoption is growing.

On-device RAG

Mobile and edge devices running small LLMs with on-device vector indexes. Use cases: privacy-sensitive personal RAG, low-latency on-device assistants. Pixel 9 and iPhone 17 both ship with on-device retrieval kits; the corpus sizes are small (~100K chunks) but the privacy benefit is significant.

Agentic search

Web-scale agentic RAG (Perplexity, You.com, OpenAI's deep research) treats the entire web as the corpus and uses an agent to navigate it. The shift from static index to dynamic crawl-on-demand is the frontier. Production economics still favor static indexes for stable corpora; agentic search wins for breadth and freshness.

Multi-modal RAG

Image embeddings, table embeddings, and video chunk embeddings extend RAG beyond text. CLIP-class image embedders (open-source) and proprietary multi-modal embedders (Gemini, GPT-5 multi-modal) make this feasible. The retrieval pattern is the same; the embedding models change.

Retrieval as attention

Research direction (Reformer, Memorizing Transformers, MEGABYTE-RAG): replace traditional attention with retrieval over a large memory store. By 2027 expect early production deployments where the line between "long context" and "retrieval" blurs further.


Freshness and incremental indexing

Many RAG production failures are stale-data failures: the model confidently answers from a version of the corpus that's three weeks behind reality. Freshness is its own engineering discipline.

Update cadence by use case

  • News, market data, status pages. Seconds-to-minutes. Streaming ingestion required.
  • Internal docs (wiki, Confluence, Notion). Minutes to hours. CDC from the source or polling at short intervals.
  • Customer support corpora. Hours to a day. Tickets close and new ones open continuously.
  • Legal, medical guidelines. Days to weeks. Updates come in distinct events.
  • Historical archives. Quarterly or annual. Mostly write-once.

Incremental indexing patterns

  • Upsert by document ID. When a document changes, recompute its chunks, embed them, and upsert into the vector DB. Old chunks deleted by ID. Simple and works for most workloads.
  • Versioned chunks. Keep multiple versions of each chunk with timestamps; retrieval filters by "latest" or by a specific point-in-time. Useful for compliance ("what was the answer on date X").
  • Delta indexing. Compute hashes of each chunk; only re-embed chunks whose hash changed. Saves cost on large documents where most content is stable.
  • Hot / cold index split. Recent documents live in a small hot index (fast updates, slightly worse retrieval); historical documents in a large cold index (rebuilt nightly). Hybrid query searches both.

Stale-answer detection

  • Recency-weighted ranking. Boost scores for newer documents. Trade-off: may favor recency over relevance.
  • Conflict detection. When retrieved chunks contradict each other, surface the conflict to the user or pick the most recent. Requires entailment or NLI checks on retrieved sets.
  • Source-modified-date in prompts. Pass document modification dates with each chunk; instruct the model to prefer recent sources and to disclose source dates when relevant.
  • TTL on cached answers. Semantic-cached answers expire after a window appropriate for the domain.

Background re-indexing for embedding upgrades

When you upgrade the embedding model, every chunk in the corpus needs re-embedding. Patterns:

  1. Dual-index live migration. Run old and new indices in parallel. Route a percentage of traffic to new; ramp up as quality validates. Cut over and decommission old.
  2. Background batch re-embed. Kick off a batch job that re-embeds in priority order (newest docs first, popular docs second). Switch to the new index for new queries once a threshold is reached.
  3. Lazy re-embed on query. Only re-embed chunks that get retrieved; over time the most-accessed chunks migrate. Slow but cheap.

For corpora of 100M+ chunks, planned re-embedding is a quarterly capacity event. Budget the cost; preserve the old index for rollback.

Index hygiene

  • Garbage collection. Documents deleted from the source must be deleted from the index. Add a daily reconciliation job.
  • Duplicate detection. Same document indexed twice (different paths, soft-deleted-and-restored). De-duplicate by content hash.
  • Outlier removal. Empty chunks, chunks with mostly whitespace, chunks that are just URLs. Periodic audit and cleanup.
  • Embedding distribution monitoring. Track the mean and variance of embeddings over time. Drift indicates corpus shift; outlier detection finds garbage.

Domain-specific RAG: legal, medical, financial, code

Generic RAG advice covers maybe 70% of production use cases. The other 30% are vertical domains with their own corpus shapes, accuracy bars, and regulatory constraints.

Legal RAG

Corpora: case law, statutes, regulations, contracts, internal precedent libraries. Distinctive properties: documents are long (50–500 pages each), cite-heavy (footnotes, internal cross-references), structurally rigid (numbered sections, defined terms), and accuracy bar is hard — a wrong citation or misquoted section is malpractice.

Production patterns:

  • Hierarchical chunking with section preservation. Chunk by legal section, not by token count. Keep the section number and document title as metadata.
  • Citation graph as retrieval feature. When the question is "what did Court X say about Y," follow the citation graph from a seed document. Tools: Neo4j or Memgraph for the graph, vector DB for content similarity, hybrid query that combines both.
  • Defined-term resolution. "The Company" in a contract refers to a specific party defined on page 1. RAG must resolve definitions before searching.
  • Domain embeddings. Voyage's legal embedding model, Cohere's domain-tuned variants, or fine-tuned BGE on legal corpora. Generic embedders miss legal jargon and statute citations.
  • Eval against attorney-validated golden sets. No public legal RAG benchmark adequately covers production needs; build your own with practicing attorneys.
  • Mandatory citations with click-through verification — surface the source statute or case section verbatim alongside the answer.

Vendors in the legal RAG space include Harvey, Casetext (now part of Thomson Reuters), Lexis+ AI, Spellbook, Bloomberg Legal AI, and a long tail of vertical AI startups. The pattern across products: more parsing engineering than retrieval engineering.

Medical RAG

Corpora: clinical guidelines (NCCN, USPSTF), drug references (UpToDate, DailyMed), peer-reviewed literature (PubMed, NEJM), EMR data, internal hospital protocols. Distinctive: rapidly evolving (guidelines update monthly), structured terminologies (ICD-10, SNOMED, RxNorm), high precision required, regulated (HIPAA, FDA).

Production patterns:

  • Terminology-aware embeddings. Models trained or fine-tuned on biomedical corpora — Voyage's medical embedding, BiomedBERT, MedCPT. SapBERT for entity linking.
  • Code system integration. Map free-text symptoms/conditions to ICD-10 / SNOMED codes; retrieve over the code-tagged index. Improves recall on synonym-heavy clinical text.
  • Evidence-grade tagging. Cite by evidence quality (RCT > observational > expert opinion). Surface in the answer.
  • HIPAA-grade infra. BAA with all vendors. PHI redaction before any prompt that leaves the BAA boundary. Audit logs of every retrieval.
  • Guideline versioning. Track which guideline version was used for each answer; if guidelines update, flag prior answers for review.
  • FDA considerations. Software that diagnoses or recommends treatment may fall under medical device regulation (FDA Clinical Decision Support guidance). Most production medical RAG deployments today are scoped to "information retrieval" not "clinical decision support" to stay out of medical-device territory; legal review required.

Financial RAG

Corpora: filings (10-K, 10-Q, 8-K, S-1), earnings transcripts, research reports, market data, internal trading documents, compliance manuals. Distinctive: time-sensitive (a stale answer is wrong), structured (tabular financial statements), regulated (SOX, FINRA, MiFID II), citation-heavy (every number traces to a source filing).

Production patterns:

  • Time-indexed retrieval. Every document tagged with effective-as-of date; queries filtered to relevant date range.
  • Table-aware parsing. Financial documents are mostly tables. LlamaParse + dedicated table-extraction (Reducto, ChunkR) outperform generic parsers by 10–20 points of recall on table-grounded questions.
  • Number provenance. Every numeric claim in the answer must cite the exact cell, line item, or paragraph in the source filing.
  • Regulatory disclaimers. Most production financial RAG attaches a "not investment advice" disclaimer; some jurisdictions (EU, UK) have specific disclosure requirements.
  • Compliance review. A second pipeline reviews answers for non-compliant content (recommendations, predictions, guarantees) before delivery.

Bloomberg, Refinitiv, FactSet, AlphaSense, Hebbia, and Brightwave all ship financial RAG products in 2026. The differentiator is data licensing more than architecture — owning rights to filings and transcripts is a moat.

Code RAG

Corpora: source code repositories, API documentation, internal libraries, code review comments. Distinctive: structured (ASTs), highly cross-referenced (function calls, imports), evolving rapidly (commits per minute), tokenization-sensitive (tokenizers handle code differently).

Production patterns:

  • AST-aware chunking. Tree-sitter for language-aware splits on function and class boundaries. Preserve scope metadata (the surrounding class, the file path, the imports).
  • Code embeddings. Voyage voyage-3-code, BGE-Coder, Jina embeddings v3 code mode, OpenAI's text-embedding-3-large with code-aware training. Generic embedders score ~5–15 points lower on code retrieval benchmarks.
  • Repo-level long context as alternative. For repos under 200k tokens, long-context with prefix caching is often better than RAG. The cutover point in 2026 is roughly 200k–500k tokens of repo content; above that, RAG wins.
  • Symbol resolution. Combine vector retrieval with LSP/Sourcegraph-style symbol search. "Where is foo() defined?" doesn't need embeddings; it needs symbol indexing.
  • Test-aware retrieval. When the agent is debugging, retrieve test files for the relevant module alongside the implementation.
  • Diff-aware updates. On git push, only re-embed changed files. Incremental indexing is essential for active repos.

Cursor, Sourcegraph Cody, GitHub Copilot Workspace, Cline, Codeium, and Continue all run some form of code RAG in 2026. The architectural commonality is heavy reliance on symbol search alongside embedding search.

Multilingual and cross-lingual RAG

Corpora that span languages, or queries in language A against documents in language B. Patterns:

  • Multilingual embeddings. BGE-M3 (100+ languages), Cohere embed-v4 multilingual, jina-embeddings-v3 (multilingual variant). Generic English models lose 10–25 points on non-English retrieval.
  • Tokenizer-aware BM25. Many BM25 implementations use whitespace tokenization, which fails on CJK languages. Use language-specific tokenizers (kuromoji for Japanese, jieba for Chinese, mecab for Korean).
  • Cross-lingual retrieval. Query in English, retrieve from Spanish corpus. Multilingual embedders handle this; quality is 5–10 points below single-language retrieval but acceptable.
  • Translation as a fallback. Translate retrieved chunks to the user's language before passing to the generator if the generator is weaker in the corpus language. Adds latency and cost; only useful when retrieval quality demands it.

RAG SaaS and managed offerings

For teams that don't want to build a RAG stack, managed offerings have multiplied through 2024–2026. They trade flexibility for time-to-production.

The 2026 lineup

Service Stack Sweet spot Limitations Price model
Vectara Proprietary end-to-end Mid-market RAG, citations-first Less flexibility on retrieval tuning Tiered + per-request
OpenAI Assistants File Search OpenAI proprietary Tight OpenAI integration, prototyping Limited to OpenAI models Bundled with API
Anthropic Files API Anthropic proprietary Claude users wanting RAG with contextual retrieval baked in Limited to Anthropic models Bundled with API
AWS Bedrock Knowledge Bases Bedrock + OpenSearch Serverless AWS-native, multimodal Tied to AWS stack Per-request
Vertex AI Search GCP + Cloud Storage GCP-native, multilingual strong Vertex ecosystem dependence Per-query
Azure AI Search + OpenAI Azure stack Microsoft 365 / enterprise Enterprise pricing Per-document + per-request
Pinecone Assistant Pinecone + their orchestration Pinecone users wanting RAG Single-vendor lock-in Premium tier
Cohere RAG (Command R+) Cohere end-to-end Citation quality, multilingual Smaller model selection Per-token
LlamaIndex Cloud LlamaIndex + LlamaCloud parsing Document-heavy workloads Newer service Per-document
Glean Enterprise search + RAG Enterprise knowledge search across SaaS sources Enterprise pricing Per-seat
Hebbia Vertical financial/legal Highly specialized verticals Niche; pricey Enterprise

When to use managed vs roll-your-own

  • Use managed when: prototype phase, no in-house ML infra team, single cloud, RAG is not a core differentiator, modest scale (<10M chunks).
  • Roll your own when: RAG is a product moat, you need per-tenant customization, you have specific retrieval quality requirements that off-the-shelf can't meet, scale exceeds managed-tier economics, multi-cloud deployment.

The 2026 trend: hybrid. Teams start managed, migrate to roll-your-own once they have the data to know what to optimize.

Migrating off a managed RAG service

A common 2026 maturation arc:

  1. Year 1: ship on a managed service (Vectara, Bedrock KB, or OpenAI Assistants). Get to product-market fit.
  2. Year 1.5: hit quality or cost ceiling. Decide what to optimize.
  3. Year 2: roll your own pipeline component-by-component. Often start by replacing the reranker (highest leverage); then the embedding model; then the orchestration; then the vector DB last (least leverage).

The pattern is the same as managed databases: managed for the long left-hand of the adoption curve, custom for the right.


Long-context-aware RAG: the 2026 pattern

Long context didn't kill RAG, but it changed how the two compose. The 2026 production pattern blends them.

Hierarchical retrieve-then-read-whole

  • Retrieve top-K small chunks via hybrid search + reranking.
  • Identify the parent documents containing the top chunks.
  • Inject the top 1–3 parent documents whole into the long-context prompt (with prefix caching).
  • Generate.

The model sees full document context for the most relevant sources, not isolated chunks. Quality on synthesis questions ("compare these two contracts") improves by 15–30 points over chunk-only RAG on long-document corpora. Cost: more tokens in the prompt, partially offset by prefix caching.

Long-context as a reranker

A research direction worth watching: use a long-context model as the reranker itself. Retrieve top-100 chunks; pass all 100 with the query to a long-context model; ask it to identify the most relevant. Quality is excellent on hard synthesis questions; cost is high. Used in 2026 mostly for offline eval and high-stakes legal/medical workloads.

Dossier mode

For agent workflows that reference the same document set across many turns, prefill the documents once (with prompt caching) and reuse for the conversation. Cost amortizes across turns. Common in legal research, financial analysis, and code review assistants.

Sliding window over a corpus

For corpora that fit in long context but are too big for a single prompt, slide a window across the corpus, summarize each window, and use the summaries as a higher-level index. The summaries are short; the index is small; queries hit summaries first, then drill into full documents. Microsoft GraphRAG uses a related pattern (community summaries).


The bottom line

The context-mismatch problem is structural: your data was never in the model's training set, and no model upgrade fixes that. RAG is the standard answer, and the standard answer in 2026 is a layered pipeline where the reranker is the single biggest quality lever and citations are the single biggest trust lever. Most production RAG plateaus because teams skip one or both.

Five takeaways to leave with:

  • Default stack: chunk → embed → hybrid (BM25 + dense) retrieve top-100 → cross-encoder rerank → generate with mandatory citations. Skip nothing in that chain.
  • Contextual prefixes before embedding (Anthropic's recipe) are the largest free recall win currently published. Add them before chasing model upgrades.
  • Long context is not a RAG replacement at corpus scale. Cost and recall both favor retrieval beyond a few hundred thousand tokens.
  • Failure is almost always upstream of generation. Parsing, chunking, and reranker thresholds account for the majority of quality issues; hallucination is usually a symptom.
  • Evaluate with your own traces. Public RAG benchmarks are partly contaminated and do not predict production behavior.

For neighboring topics: long-context attention is the alternative architecture this whole pipeline trades against, and eval infrastructure is the discipline that keeps the pipeline honest as your corpus and traffic evolve.


FAQ

Is RAG dead now that models have 1M-token context? No. Long context made RAG cheaper to do well, not obsolete. Retrieval cost scales with corpus size; long-context cost scales with prompt length per request. Whichever scales worse for your workload determines the architecture. Most production systems are hybrid — long context for the response, RAG for retrieval.

Which embedding model should I use? Cohere embed-v4 or OpenAI text-embedding-3-large for closed. BGE-M3 for open-weight. Voyage voyage-3-large for domain-specialised (legal, finance, code). The MTEB leaderboard's top 20 are within 2 points of each other; test on your data.

Pinecone vs Qdrant vs pgvector? Pinecone if you don't want infra ops. Qdrant if you want self-hosted with strong filtering. pgvector if you already run Postgres and your corpus is <50M chunks. At >1B vectors, look at Milvus, Vespa, or managed offerings.

Do I need a reranker? Yes. The bi-encoder + cross-encoder funnel is the largest single quality lever you have. The only workloads that don't benefit are ones where bi-encoder retrieval is already near-perfect — rare in practice.

Chunk size: 256 vs 512 vs 1024 tokens? 512–1024 tokens with 10–20% overlap is the default for prose. Code wants AST-aware splits. Tables want structure preservation. The reranker hides a lot of chunking sins; don't over-engineer.

BM25 or dense — which one? Both. Hybrid (BM25 + dense, fused via RRF) consistently outperforms either alone by 5–15 points of recall@10 on technical and code domains.

How do I evaluate RAG? RAGAS for automated metrics (faithfulness, context precision/recall, answer relevance). A curated 100–500 query set from your own workload for the eval that actually predicts production behaviour. Public RAG benchmarks are contaminated; don't trust them for production decisions.

Citation: how strict? For chat: inline [source:N] after each factual claim, post-validate that N exists in the retrieved set. For compliance domains: sentence-level grounding with NLI verification. For chat-only consumer products: less strict is acceptable.

When does graph RAG pay off? Synthesis queries across many documents, entity-centric workloads, domains where relationships matter as much as content. For pointed factual questions, plain chunk RAG is sufficient and cheaper.

RAG over a codebase? Tree-sitter or LSP-driven chunking (split on functions/classes), code-specialised embeddings (Voyage voyage-3-code, BGE-Coder), and a reranker. For repos that fit in long context, consider whole-repo prompting instead.

How fresh can RAG be? As fresh as your ingestion pipeline. Most teams batch nightly. Streaming ingestion (CDC from source systems) gets you to minutes; harder ops, worth it for news, prices, status pages.

Multi-tenant RAG? ACL via metadata filters at retrieval time. Per-tenant indices for hard isolation if regulation requires. Most vector DBs support filter-based isolation efficiently; per-tenant indices burn more storage but win on auditability.

LangChain or LlamaIndex? For prototypes, either. In production, most serious teams write thin orchestration directly on top of vector DB and reranker SDKs. The frameworks bring abstraction cost that production stacks eventually shed.

How do I handle PDFs? Use a layout-aware parser (Unstructured, LlamaParse, AWS Textract, Reducto, Azure Document Intelligence). Naïve text extraction loses tables and figure context. Parser quality is one of the top three quality levers in ingestion.

What about Anthropic's Contextual Retrieval? Prepend a one-sentence document-level context to each chunk before embedding. Reduces retrieval failures by 30–50% on long-document corpora. Cheap to implement; production-worth it.

Where does the latency go in a RAG request? Roughly: 10–50ms hybrid retrieval, 30–100ms rerank, 200–1000ms generation. Generation dominates. Retrieval optimisation usually isn't the right place to spend engineering hours.

Should I fine-tune the embedding model? Only if you've already done the hybrid + reranker + contextual retrieval work. Fine-tuning an embedding model on in-domain query-document pairs lifts in-domain recall by 10–25%, but it's not the first lever. The order of operations: parser fix, chunking fix, hybrid retrieval, reranker, contextual retrieval, then embedding fine-tune. If your retrieval is still bad after those five, the embedding model is the suspect.

What's the right top-k for retrieval before the reranker? 100 to the reranker is the production default. Below 50 and you bottleneck on bi-encoder recall; above 200 and reranker latency dominates with diminishing returns. Tune empirically by measuring recall@5 of the reranker output as you vary the input k.

Do I need a query rewriter? For single-turn chat, often no. For multi-turn (where pronouns and references depend on prior turns), absolutely yes. The rewriter is one cheap LLM call (Haiku, Flash, 4o-mini) that takes the conversation history and produces a self-contained search query. Multi-turn RAG without query rewriting routinely loses 30+ points of recall@5 versus single-turn baselines.

HyDE vs multi-query expansion? HyDE (generate a hypothetical answer, embed that) helps on out-of-distribution queries where the query and documents are stylistically far apart (short question, long technical document). Multi-query expansion (generate N rewrites, union results) is simpler and works well on most workloads. Most production stacks use multi-query; HyDE is a domain-specific tool.

What about RAG for non-English corpora? Use a multilingual embedding model (BGE-M3, Cohere embed-v4 multilingual, jina-embeddings-v3) and a multilingual reranker (Cohere rerank-3.5 multilingual, bge-reranker-v2-m3). Don't translate the corpus to English first — embedding quality on the original language usually beats English-via-translation. BM25 over the source language; ensure your tokeniser handles the language's whitespace and morphology.

How do I deal with PII in retrieved chunks? Two layers. At ingestion, redact or pseudonymise PII before embedding (Presidio, AWS Comprehend, custom regex for known formats). At retrieval, post-process retrieved chunks against the requesting user's permissions — if the user doesn't have access to document X, never return its content even if it ranked well. ACL filters at the vector DB layer are the standard implementation.

Should I use a single embedding model or one per domain? Single model is the operational default; specialised models only when you have a domain where the generic model demonstrably underperforms (legal, biomedical, code). Voyage's domain-tuned variants and BGE's domain checkpoints are the standard picks. Splitting embeddings across models means splitting indices, which complicates retrieval.

RAG for agents — anything different? Yes. In an agent setting, retrieval is one tool among many, called when the agent decides it needs evidence. The agent often issues multiple retrieval calls per task with different queries. Caching across calls (same query → same retrieved set within the same session) matters more. See agent serving infrastructure for the broader pattern.

Can I run RAG on a reasoning model like o3 or DeepSeek-R1? Yes, and the quality on hard synthesis questions is markedly better. The cost is 5–20× a standard LLM call, and latency is 10–60 seconds per query. Use for the 5–10% of queries that demonstrably need it; route everything else to a cheaper model. See reasoning model serving.

Should I store both the chunk text and the chunk embedding in the vector DB? Yes. The text is needed at retrieval time to assemble context; refetching from a separate document store adds latency and a failure mode. The 1–5 KB per chunk of text is trivial compared to the vector and index overhead.

How often should I re-embed the corpus? Only when you change the embedding model. Adding documents uses the existing model; recomputing existing documents costs the corpus × per-embedding price. Plan re-embedding events as quarterly or annual capacity decisions; budget the spend in advance. Some teams keep two indices live during migration and switch atomically once the new index is validated.

What's the right way to handle code in a RAG corpus? Tree-sitter or LSP-based chunking on function and class boundaries. Use a code-specialised embedding model (Voyage voyage-3-code, BGE-Coder, jina-embeddings-v3 code mode). Include the file path, language, and surrounding scope as metadata. For repos that fit in long context, whole-repo prompting often beats retrieved snippets on code-comprehension tasks.

How do I A/B test RAG changes safely? Shadow traffic: run the new pipeline on a copy of live queries, log the differences, eval offline. Once offline metrics are stable, ramp a percentage of live traffic to the new pipeline with a kill switch. Track per-tenant quality, not just aggregate — a change can lift average recall while breaking one customer's workload.

Is contextual retrieval worth the cost on a small corpus? For corpora under 10K chunks, the ingestion cost is negligible (~$10) and the quality lift is substantial. Always worth it. For very large corpora (>100M chunks), the cost scales but is still typically a small fraction of total infrastructure spend. The only case where it does not pay is when retrieval is already near-perfect (rare in practice).

Why does my RAG work in eval but fail in production? Three common causes. First, eval queries are typically clean and well-formed; production queries have typos, pronouns, abbreviations. Add query rewriting. Second, eval corpora are static; production corpora drift. Monitor embedding distribution. Third, eval is single-turn; production is multi-turn with conversational context the retrieval cannot see. Add conversation summarization to the retrieval input.

Should I use a multi-modal embedder for text-only RAG? No. Multi-modal embedders are tuned across modalities and typically underperform pure-text embedders on text-only retrieval by 3-8 points. Use them only when the corpus actually contains images or other modalities that need retrieval.

What's the right top-k for retrieval and rerank? For retrieval: top-100 to top-200 candidates from each lane (BM25 + dense). For rerank: top-5 to top-10 to feed the generator. The retrieval top-k should be large enough that the reranker can find the relevant chunks; the rerank top-k should be small enough that the generator does not get confused by tangentially relevant context.

How do I handle freshness? Two patterns. Tag each chunk with a timestamp, filter at retrieval to recent-N-days for time-sensitive queries (detected by a classifier or by explicit user intent). Or maintain a "hot" index of recent documents and a "cold" index of historical ones, query both, weight recent results higher. Both work; the first is simpler if your vector DB has efficient metadata filtering.

Does prompt caching change RAG economics? Yes, materially. For RAG with stable system prompts and per-query retrieved context, the system prompt is cached and only the retrieved chunks + query incur full input cost. At 90% cache hit rate, the effective input price is ~5-15% of nominal. Always design RAG prompts to maximize cacheability: stable system prompt, then user query, then retrieved context in a deterministic order.

Can I skip BM25 if I use dense retrieval? For most workloads, no. BM25 catches lexical matches (exact terms, proper nouns, rare technical vocabulary) that dense embeddings often miss. The cost is low; the quality lift is consistent. The exception: pure semantic-matching workloads (e.g., "find documents similar in meaning to this paragraph") where lexical match is not relevant.

Why does my reranker not help? Three diagnostics. First, check that the retriever is producing diverse top-100 candidates — if all 100 are near-duplicates, the reranker has nothing to choose from. Increase retrieval diversity (MMR, deduplication). Second, ensure the reranker is appropriate for your domain — try a domain-tuned reranker (Voyage rerank-2-finance, etc.). Third, verify the reranker is actually being called — production bugs where the reranker is silently bypassed are not rare.

How do I handle very long source documents (entire books)? Hierarchical retrieval: chunk into small units for the primary index, retrieve the small units, expand to the parent section or chapter for context. Or, if the model supports long context, retrieve top-K small chunks, identify the parent documents, and inject the most-relevant parent document whole. The latter is the "long-context-aware RAG" pattern that has gained traction in 2026.

What's the right way to handle citations in regulated industries? Mandatory citations with click-through verification. Every claim must reference a specific chunk, the chunk must be displayed to the user (or available on demand), and the source document must be retrievable. The model is instructed to refuse questions that cannot be answered from cited sources. Production deployments in healthcare and legal use this pattern; it is more conservative than general-purpose RAG and the quality bar is higher.

Should I fine-tune my embedding model? For high-volume, high-stakes deployments with proprietary data, yes — domain-tuned embeddings consistently outperform general-purpose ones by 3-8 points on domain queries. The cost: collecting a training set (query-document pairs, typically 10K-100K examples), running the fine-tune ($1K-10K), and re-embedding the corpus. Worth it when the marginal recall matters.

How do I monitor for prompt injection in retrieved content? Run a classifier (regex + small model) on retrieved chunks before they reach the generator. Flag chunks containing suspicious patterns (instruction-like text, role-switching attempts, "ignore previous"). For flagged chunks, either skip them, mark them as untrusted in the prompt (so the model knows to treat them as data not instructions), or escalate to human review. Production stacks chain multiple detectors and accept some false positives in exchange for injection robustness. See production safety guardrails for the broader injection-defense pattern.

Is contextual retrieval just better chunking? Not quite — it's chunk augmentation via a separately generated context. Anthropic's recipe (Sept 2024) prepends a 50–100 token summary describing the chunk's role in the parent document, then embeds the combined string. The chunk text is unchanged; the embedding sees more context. This is why the lift is large: the embedding space sees relevant disambiguation rather than naked chunks.

How do I size my vector DB capacity? Start with vectors-per-chunk × bytes-per-vector × replication-factor × index-overhead-factor. For 1024-dim float32 (4096 bytes per vector), 100M chunks, 2× replication, 1.5× HNSW index overhead: ~1.2 TB. At 4096-dim, double it. With Matryoshka truncation to 256 dim and int8 quantization, you can cut this by 16×. Plan for 3× the cold capacity as working memory for HNSW.

Can I quantize my embeddings to save storage? Yes. int8 quantization (per-dimension scale + bias) costs ~1 point of recall on most workloads. Binary embeddings (1-bit per dimension) cost 2–5 points but reduce storage by 32×. Matryoshka representation learning lets you truncate dimensions on the fly. Combine these: Matryoshka to 512 dim + int8 quantization gives an 8× storage reduction at <2 points recall cost on most production workloads.

Should I use a managed vector DB or self-host? Below 100M chunks and 100 QPS, managed (Pinecone, Qdrant Cloud, Weaviate Cloud, Turbopuffer) is the right call — operational simplicity. Above 1B vectors or 1000+ QPS, self-hosted on Kubernetes with Qdrant, Milvus, or Vespa amortizes the engineering cost. The 100M–1B middle range is a toss-up; pick on team skill rather than economics.

What about caching the LLM responses themselves? Semantic caching — cache the (query, retrieved-context-hash, response) triple. On a new query, check if a similar query produced a response with overlapping context recently. Cache hit returns the cached response with optional re-validation. Saves 30–60% of LLM cost on chat workloads with repetitive questions. Risk: stale answers if the corpus updates; mitigate with TTL.

How do I handle conversational RAG with multi-turn context? Three patterns. (1) Query rewriting: use a cheap LLM call to produce a self-contained search query from the conversation history. (2) Conversation summarization: maintain a rolling summary of the conversation that's part of the retrieval input. (3) Context-aware retrieval: pass the recent user turns directly to the retriever (works if the retriever handles long queries). Pattern 1 is the most common; pattern 3 is rising as long-context retrievers improve.

Does the embedding dimension actually matter? Yes but less than expected. 256-dim Matryoshka-truncated embeddings recover 95–98% of 1024-dim quality on most benchmarks. 1024 is the production default; below 256 quality degrades visibly. Above 3072, diminishing returns. The bigger drivers of retrieval quality are reranking, hybrid search, and contextual retrieval — not embedding dimension.

What's the right retention for RAG traces? For production: 30 days hot (queryable), 90 days warm (compressed in object storage), 1+ year cold (archival) for compliance. Privacy: scrub user PII from traces before long-term retention. Some regulated industries require 7+ years (financial, healthcare); plan for it.

Should I evaluate RAG with LLM-as-judge? Yes for scale (LLM judges run on thousands of examples cheaply), but calibrate against human ratings on a sample of 100–200 examples per quarter. Judge models have biases (length, position, formality); the calibration adjusts. RAGAS, Patronus, and TruLens all support LLM-as-judge with calibration hooks.

How fresh can streaming ingestion get me? Minutes to seconds. CDC (Change Data Capture) from your source database → event stream → embedding worker → vector DB upsert. End-to-end latency: 30 seconds to 5 minutes depending on batch size and throughput. Required for news, pricing, status pages. For most corporate docs, nightly batch is sufficient.

Is hybrid retrieval worth the operational complexity? Yes for most workloads. The recall lift is 5–15 points consistently, the operational cost is one extra service (Elasticsearch / OpenSearch / Tantivy), and most vector DBs (Qdrant, Weaviate, Vespa) now ship hybrid natively. Skip hybrid only for pure-semantic workloads where lexical match is irrelevant.

How do I tune my reranker threshold? Empirically per workload. Plot precision-recall curves on a labeled set: as you raise the threshold (admit fewer chunks), precision rises and recall falls. Pick the elbow that matches your product's tolerance for "no answer" vs "wrong answer." Typical thresholds: 0.6–0.8 for cross-encoder rerankers normalized to [0,1].

What's the practical difference between Voyage-3 and Cohere embed-v4? Voyage offers domain-specialized variants (voyage-3-code, voyage-law, voyage-finance) that beat general-purpose models on their domains by 5–15 points. Cohere embed-v4 is stronger on multilingual and on conversational retrieval. OpenAI text-embedding-3-large is the safe default if you can't decide. Try all three on your data; the winner often depends on corpus shape rather than headline MTEB score.

Is there a downside to Anthropic's contextual retrieval? Cost — you LLM-summarize every chunk during ingestion. For a 1M chunk corpus at $0.001 per summary (Haiku), that's $1000. Cheap relative to the recall lift, but a real budget line for very large corpora. The other cost is ingestion latency (an extra LLM call per chunk); not a problem for batch ingestion, an issue for streaming.

How do I detect retrieval failure in production? Three signals: (1) the generator's answer doesn't cite any retrieved chunk (retrieval returned nothing useful), (2) the generator says "I don't have information about that" (graceful failure), (3) per-query recall@5 against a labeled subset drops over time. Alert on all three. Log retrieval queries that produced zero high-confidence results — they're your re-indexing or query-rewriting backlog.

Do I need a separate search system for "I want to find a document" vs "answer a question"? Often yes. Question-answering RAG needs reranking and citation grounding. Document-discovery search needs faceting, sorting, and a different UX. They can share the same retrieval index but typically have different front-end logic. Many production systems run both modes against the same vector DB.


Glossary

  • Bi-encoder — embedding model that encodes query and document independently. Cheap retrieval, lower precision.
  • BM25 — Okapi BM25 ranking function for keyword retrieval. The default sparse retriever.
  • Chunking — splitting documents into smaller passages for indexing.
  • Cross-encoder — reranker that scores a query-document pair jointly. Expensive, high precision.
  • Dense retrieval — embedding-based vector similarity retrieval.
  • HNSW — Hierarchical Navigable Small World, the dominant ANN graph index.
  • Hybrid search — combination of dense + sparse retrieval, typically fused with RRF.
  • Late chunking — embed long document first, derive chunk embeddings from spans of the long embedding.
  • Matryoshka embeddings — embeddings trainable to be truncatable to lower dimensions without re-training.
  • Recall@k — fraction of relevant documents that appear in the top-k retrieved set.
  • Reranker — cross-encoder model that scores query-document pairs after retrieval, narrowing top-100 to top-5.
  • RRF (Reciprocal Rank Fusion) — score combination method for hybrid retrieval.
  • SPLADE — learned-sparse retriever that produces sparse vectors with semantic meaning.

Eighteen-month outlook

The architecture above is stable in 2026. The pieces that are moving:

  • Native long-context retrieval models. Models that take a corpus and a query and produce a grounded answer without an explicit retrieval step. Research-stage; production-rare in 2026. Likely to matter for small-to-mid corpora over the next two years.
  • Encoder-decoder retrievers like RankT5 returning. As reasoning models get expensive, cheap encoder-based retrievers that match cross-encoder quality on a budget are getting attention again.
  • Multimodal RAG. Image + text retrieval where the embedding space spans both modalities. Cohere embed-v4 multimodal, Voyage's multimodal embeddings, and various open-weight efforts. Production use cases: technical documentation with diagrams, e-commerce, medical imaging notes. See multimodal serving.
  • Retrieval-aware fine-tuning. Train the generator with retrieval in the loop so it learns to use citations and refuse out-of-corpus questions. Replaces brittle system-prompt-only enforcement. Production-rare but the academic results are strong.
  • Agentic retrieval as a first-class API surface. OpenAI's Responses API, Anthropic's tool-use loop, and Google's Vertex agents all push retrieval into the agent's tool layer. The architectural impact: less monolithic "RAG pipeline," more "retrieval tool + agent runtime."

The bones — chunk, embed, retrieve, rerank, generate with citations, eval — are unlikely to change materially. The skin keeps shifting.


References

  • Lost in the Middle — Liu et al., 2023. arXiv:2307.03172. Quality degradation across long-context positions.
  • RULER — Hsieh et al., 2024. arXiv:2404.06654. Effective context length benchmark.
  • BGE-M3 — Chen et al., 2024. arXiv:2402.03216. Multi-functionality multilingual embeddings.
  • BEIR — Thakur et al., 2021. arXiv:2104.08663. Heterogeneous retrieval benchmark.
  • SPLADE — Formal et al., 2021. arXiv:2107.05720. Learned sparse retrieval.
  • ColBERTv2 / PLAID — Santhanam et al., NAACL 2022. Late-interaction retrieval.
  • HyDE — Gao et al., 2022. arXiv:2212.10496. Hypothetical document embeddings.
  • RAGAS — Es et al., 2023. arXiv:2309.15217. Automated RAG evaluation framework.
  • ARES — Saad-Falcon et al., 2023. arXiv:2311.09476. Domain-specific RAG judge training.
  • GraphRAG — Edge et al., 2024. arXiv:2404.16130. Microsoft's entity-graph-based RAG.
  • CRAG — Yan et al., 2024. arXiv:2401.15884. Corrective retrieval augmented generation.
  • Late Chunking — Günther et al., 2024. arXiv:2409.04701. Long-context-embedded chunking.
  • MemGPT — Packer et al., 2023. arXiv:2310.08560. LLMs as operating systems for memory.
  • Anthropic Contextual Retrievalanthropic.com/news/contextual-retrieval. Per-chunk document-level context prepending.
  • DiskANN — Subramanya et al., NeurIPS 2019. SSD-resident ANN search.
  • NoCha — Karpinska et al., 2024. arXiv:2406.16264. Long-context narrative comprehension benchmark.