
How Modern LLM Inference Works: Prefill, Decode, KV, Disaggregation — The Complete Guide

The definitive guide to how modern LLM inference actually works: the two-phase prefill/decode structure, the KV cache, continuous batching, paged attention, and the full serving landscape from single-node vLLM through Mooncake/DistServe/Splitwise disaggregation, SGLang, TRT-LLM, and multi-region routing.

By Prompt20 Editorial · 92 min read

A modern LLM inference request looks simple from the outside — text in, tokens out — and the cost structure underneath is almost the opposite of what most engineers expect. Two workloads with completely different bottlenecks share one GPU. A cache the size of the model itself sits in HBM for the duration of every generation. Batching means something different in this stack than in any prior server architecture. And the layout of the cluster — which GPU holds which phase of which request — determines per-token cost more than the choice of model does.

This guide is the authoritative answer to how modern LLM inference actually works in 2026. It walks the full stack from first principles (what prefill and decode are, why they have opposite arithmetic intensities, what the KV cache actually contains) through every layer of the serving architecture (PagedAttention, continuous batching, prefix caching, chunked prefill), into the production patterns that frontier labs converged on: same-node and multi-node disaggregation, distributed KV pools (Mooncake), goodput-optimized scheduling (DistServe), phase splitting (Splitwise), and the differences between vLLM, SGLang, and TensorRT-LLM as serving stacks.

The take: get continuous batching, paged attention, prefix caching, and FP8 KV cache right first — those are the largest wins for most workloads. Same-node disaggregation (different GPUs on one NVLink-connected box) is the right next step. Full multi-node disaggregation is overkill until you're at hosted-provider scale. The literature reports 1.5–3× throughput improvements from disaggregation alone (DistServe, Splitwise), but most of that win is captured by the same-node version at a fraction of the engineering cost.

Table of contents

  1. Key takeaways
  2. Mental model: disaggregated inference in one minute
  3. The LLM serving landscape in 2026
  4. The two phases of inference
  5. Why colocation hurts
  6. The disaggregated architecture
  7. KV-cache transfer mathematics
  8. Layer-wise streaming and overlap
  9. Scheduling, routing, and prefix caching
  10. Hardware mix: prefill vs decode pools
  11. Comparison with co-located serving
  12. Production deployments in 2026
  13. When disaggregation matters
  14. When it doesn't
  15. Open research and engineering questions
  16. KV transfer mechanics: NIXL, NCCL, RDMA, GDRCopy
  17. P/D ratio optimization: workload-driven sizing
  18. Per-stack support: SGLang, vLLM, TRT-LLM, DistServe, Splitwise, Mooncake
  19. Cost math worked example: prefill + decode pool sizing
  20. Mixed B200/H200 pools and disaggregation
  21. Prefix caching with disaggregation
  22. Speculative decoding with disaggregation
  23. Reasoning models with disaggregation
  24. Reference designs: Mooncake, DistServe, Splitwise
  25. Failure handling in disaggregated serving
  26. P/D scheduling: per-request routing and signals
  27. Cross-rack disaggregation
  28. Observability for disaggregation
  29. The "fused KV" alternative: SARATHI and chunked prefill batching
  30. 2026 trends: B200 NVL72 and multi-DC
  31. The bottom line
  32. FAQ
  33. Glossary
  34. References
  35. Prefill vs decode mechanics in depth
  36. KV transfer mechanics deep dive
  37. P/D ratio optimization in depth
  38. Per-stack disaggregation support
  39. Cost math worked example
  40. Mooncake / DistServe / Splitwise / Dynamo deep dives
  41. P/D scheduling: routing and signals
  42. Prefix caching with disaggregation
  43. SARATHI: chunked prefill alternative
  44. 2026 trends: NVL72 and the disagg shift
  45. Disaggregation in multi-tenant serving

Key takeaways

  • Prefill is FLOPs-bound; decode is HBM-bandwidth-bound. Their arithmetic intensities differ by 10-100×. No single GPU is optimal for both.
  • Disaggregation runs them on separate pools and streams the KV cache between them. Published results report 1.5-3× goodput and 30-50% lower time to first token (TTFT).
  • Cost: gigabytes of KV cache transfer per request. Needs NVLink or RDMA-class networking. Adds router and scheduler complexity.
  • Layer-wise KV streaming hides nearly all transfer latency behind ongoing prefill compute. This is what makes disaggregation practical at scale.
  • Prefix caching is the other half of the win: many requests share system prompts. Reusing prefix KV cache yields another 5-10× on prefill-heavy workloads.
  • Do it if: prompts > 500 tokens average, output > 100 tokens, QPS > 5/node, RDMA available, prefixes shared.
  • Skip it if: short prompts, low QPS, slow inter-pool network, single-tenant deployment.
  • Production reality: DeepSeek, Moonshot (Mooncake), Together, Fireworks, and every major hosted provider run disaggregated. vLLM and SGLang ship first-class support. The architecture is no longer experimental.

Mental model: disaggregated inference in one minute

The problem has a name: the prefill/decode mismatch. An LLM inference request has two phases with opposite hardware appetites sharing one GPU. Prefill processes the prompt in parallel (tens of thousands of tokens through every layer at once) and saturates tensor cores. It's FLOPs-bound; HBM bandwidth is mostly idle. Decode generates one token at a time, reading the full weight set through HBM (140 GB for a 70B model in FP16) to do almost no math, and saturates memory bandwidth. It's bandwidth-bound; tensor cores are mostly idle. The two arithmetic intensities differ by 10–100×. When they share a GPU and a batch, one phase always stalls the other: a long prefill blocks decode mid-stream (latency spike), and decode batches dilute prefill throughput (lower goodput).

The fix is to split them onto separate pools. Prefill runs on FLOPs-rich GPUs in big batches; decode runs on bandwidth-rich GPUs with high concurrency; the KV cache produced by prefill streams over NVLink or RDMA to the decode pool. The analogy is a kitchen with separate prep and assembly stations — vegetables get chopped in bulk in the back, plates get assembled to order at the line, and a runner moves prepped ingredients between them.

Co-located vs disaggregated — side-by-side:

Aspect | Co-located | Disaggregated
Phases on same GPU | Both | Separated
Prefill interference with decode | Yes (TTFT spikes) | No
Hardware tuning per phase | One compromise | Independent (FLOPs vs HBM)
KV transfer cost | Zero | GBs/request over NVLink/RDMA
Operational complexity | Single pool | Router + scheduler + KV transport
When it pays off | Short prompts, low QPS | Long prompts, high QPS, RDMA

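Production one-liner (vLLM): --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}' on the prefill node and the matching kv_consumer config on decode. Exact keys and accepted roles vary by connector and release, so check the documentation for the version you pin. Modern vLLM and SGLang both ship first-class disaggregation flags.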

Sticky number: Disaggregation delivers +30–60% raw throughput on long-context workloads; measured as goodput at SLA, the DistServe and Mooncake reports show 1.5–3×. Same-node disaggregation captures most of the win at a fraction of the engineering cost.

The rest of this guide is the layer-wise streaming that hides KV transfer, the prefix-caching interaction, and when same-node beats full multi-node.


The LLM serving landscape in 2026

"LLM inference" is shorthand for a stack of about a dozen techniques, layered. Each one independently buys a 1.3-2× improvement on some axis; the combined stack is what gets advertised as "10× cheaper inference" by hosted providers. Here is the full landscape, roughly in the order it has to be deployed.

1. Vanilla autoregressive decode. One forward pass per token. Reads the entire model from HBM every step. The unoptimized baseline; nobody runs this in production beyond toy demos.

2. KV cache. Store K and V tensors from each layer so attention doesn't recompute them every step. Memory cost grows linearly with sequence length; for a 70B model at 128k context, the cache is ~43 GB per request — comparable to the weights themselves. For the math, see our KV cache memory guide.

3. PagedAttention (Kwon et al. 2023, vLLM). Treat the KV cache as virtual memory: page it in fixed-size blocks, allocate non-contiguously, evict cleanly. Doubles or triples effective batch size at long context. This is the substrate every modern serving stack assumes.

4. Continuous batching. Admit new requests into a running batch token-by-token instead of waiting for the slowest sequence in a fixed batch. Pushes decode GPU utilization from ~5% of peak to ~50% on production workloads. Original implementation: Orca. Now standard in vLLM, SGLang, and TRT-LLM.

5. Prefix caching. Reuse KV cache across requests that share a prompt prefix. System prompts, few-shot examples, retrieved RAG documents — all candidates. vLLM's automatic prefix caching and SGLang's RadixAttention (Zheng et al. 2023) implement this with different data structures. 5-10× speedup on prefix-heavy workloads.

6. Chunked prefill. Split a long prefill into smaller chunks so it can interleave with ongoing decode steps. Improves TTFT for new requests when other requests are mid-generation. Standard in vLLM since 2024.

7. KV cache quantization. Drop K and V tensors to FP8 or INT4 to halve or quarter HBM footprint. FP8 KV is nearly free quality-wise; INT4 KV is workload-dependent. See our quantization tradeoffs guide.

8. Speculative decoding. Draft K candidate tokens cheaply, verify in one expensive forward pass. EAGLE-2 is the dominant variant in 2026. Provably identical output distribution. Full treatment in our speculative decoding guide.

9. Same-node disaggregation. Run prefill on some GPUs of an 8-GPU node, decode on the others. KV cache moves over NVLink (essentially free). Captures the scheduling benefits without any cross-node network engineering. The recommended starting point.

10. Full disaggregation (Splitwise, DistServe, Mooncake). Prefill and decode on separate pools of nodes, KV cache streams over RDMA. Layer-wise streaming hides transfer latency behind ongoing prefill compute. The Mooncake design adds a distributed KV-cache pool shared across the prefill side. Reported 1.5-3× goodput improvements over colocated baselines.

11. Multi-region routing. Route requests across geographically distributed serving regions for latency and cost. Prefix-cache affinity becomes a routing constraint. Adds capacity planning complexity.
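To make item 4 concrete, here is a toy step-count comparison between static and continuous batching. It is a minimal sketch under loud assumptions: one decode step emits one token for every running request, prefill is ignored, and the function names are hypothetical rather than taken from any serving stack.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Fixed batches: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Admit a waiting request as soon as a slot in the running batch frees up."""
    waiting = deque(output_lens)
    running = []  # tokens still to generate for each running request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step produces one token per running request
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lens = [10, 2, 2, 2]  # one long generation followed by three short ones
print(static_batching_steps(lens, batch_size=2))
print(continuous_batching_steps(lens, batch_size=2))
```

The gap between the two counts grows with the variance of output lengths, which is why the technique matters most on mixed chat traffic.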

Where each technique fits

Layer | Bottleneck it addresses | Typical speedup | Operational cost | Where to start
KV cache | Recomputing attention | Foundational | None (just turn it on) | Always
PagedAttention | KV memory fragmentation | 2-3× batch size at long context | Already in vLLM/SGLang | Always
Continuous batching | Decode GPU utilization | 5-10× throughput | Already in vLLM/SGLang | Always
Prefix caching | Redundant prefill compute | 5-10× on prefix-heavy traffic | Light (cache mgmt) | If you have shared prefixes
Chunked prefill | TTFT under load | 2-5× p99 TTFT | Light | High-QPS, mixed prompt lengths
FP8 KV cache | HBM capacity | 2× concurrent requests | Light (negligible quality cost) | Long context, dense workloads
Speculative decoding | Decode bandwidth | 1.5-3× | Medium (draft model) | Large targets (≥30B)
Same-node disaggregation | Phase scheduling interference | 1.4× decode throughput | Medium (worker layout) | 4+ GPU nodes
Full disaggregation | Phase + hardware mismatch | 1.8-2.5× throughput, lower TTFT | High (RDMA, routing) | Hosted-provider scale
Multi-region routing | Geographic latency, capacity | Workload-dependent | High | Global products

The serving stacks (vLLM, SGLang, TensorRT-LLM) differ on which of these they ship out of the box, how well-tuned each one is, and how they compose:

  • vLLM — broadest community support, fastest to adopt new techniques (EAGLE, lookahead, disaggregation). Defaults are sane. The right choice for most teams.
  • SGLang — RadixAttention makes prefix sharing exceptionally efficient on workloads with branching prompts (agents, multi-turn RAG, evaluation pipelines).
  • TensorRT-LLM — best raw kernel performance on NVIDIA hardware (FlashAttention-3, FP8 throughout), tightly coupled with Triton Inference Server. Most engineering to operate.
  • Hosted APIs — OpenAI, Anthropic, Google, Together, Fireworks, Groq, Cerebras all run proprietary stacks that combine most of the above. Their prompt-caching features are user-visible surfaces of (5).

A common end-state stack in 2026: vLLM or SGLang, paged attention + continuous batching + automatic prefix caching + FP8 KV cache + EAGLE-2 speculative decoding + same-node disaggregation, with multi-tenant LoRA serving on top. That stack alone is roughly 8-15× cheaper per token than naïve PyTorch generate(). Full multi-node disaggregation à la Mooncake adds another 1.5-2× on top, but only at the scale where the engineering pays back. For the wider context on serving infrastructure, see our vLLM and PagedAttention deep-dive; for the rack-scale topology that makes disaggregation viable, see NVLink and rack-scale topology.

Version pinning for a reproducible 2026 stack

Specific versions matter because the disaggregated-serving APIs are still moving. As of May 2026, a reproducible reference stack is: vLLM 0.7.x or 0.8.x with disaggregated prefill configured through --kv-transfer-config and the V1 scheduler, SGLang 0.4.x with --disaggregation-mode on Hopper, TensorRT-LLM 0.18.x with executor API and kv_cache_transfer enabled, NCCL 2.23 or newer for GPUDirect RDMA over RoCEv2, CUDA 12.6 with the matching driver (560.x), and PyTorch 2.6 with CUDA Graphs enabled in the decode path. Pin them. Mixing a 0.5.x vLLM with a current Hopper driver works, but the chunked-prefill heuristics diverge from anything published in the last twelve months. For the kernel layer that sits underneath, see our Triton kernel primer.

What the serving stack does not solve

Continuous batching, paged attention, and prefix caching do not fix tokenizer cost (a 32k-token prompt with a slow SentencePiece tokenizer still spends 10-40 ms on the CPU before any GPU work begins), do not fix Python overhead on the request path (uvicorn + fastapi adds 1-3 ms per request even at zero load), and do not fix slow model loading (a cold 70B in FP16 is 140 GB of NVMe-to-HBM traffic, 30-90 s on PCIe Gen4). For high-throughput deployments, push tokenizer onto a separate worker thread, replace the Python request handler with a Rust or C++ shim where it matters, and pre-warm weights in HBM before joining the load balancer. These are mundane and easy to forget; they regress p99 TTFT more reliably than any kernel choice.


The two phases of inference

A transformer inference request has two distinct phases, and confusing them is the source of almost every serving inefficiency.

Prefill

Prefill takes the entire prompt and produces a KV cache covering it. One forward pass, all tokens in parallel.

Compute profile: large GEMMs, very high arithmetic intensity. For a 70B-class model prefilling a 4k-token prompt on an H100 SXM, you'll see ~80% of peak FP16 FLOPs — comparable to training utilization. Tensor cores are saturated.

Bottleneck: compute.

Time complexity: O(n²·d) per layer for attention (scores over all n×n token pairs) and O(n·d²) per layer for the projections and MLP, where n is sequence length and d is model dimension. Wall time: tens to hundreds of milliseconds for typical prompts; seconds for very long ones.

Memory pattern: a single sequence with parallel processing. Reads weights from HBM once per layer, reuses them across all tokens in the sequence. Bandwidth pressure is modest.

Prefill compute breakdown

A prefill forward pass for a 70B model on 4k tokens spends its time roughly as: 55% in QKV and MLP projections, 25% in attention compute (FlashAttention-3 on H100/H200 keeps this from blowing up quadratically by tiling), 12% in MLP activation, 5% in layer norms and residuals, and 3% in embedding and output projection. On B200 with FP8, the projection share drops to ~45% and attention rises to ~30% as the FP8 matmul throughput outruns the attention kernel's improvements. Knowing this breakdown matters when you are kernel-tuning — the leverage is in the matmuls, not in shaving microseconds off layer norm.

Decode

Decode generates output tokens one at a time. Each step is a forward pass over a single new token attending to the existing KV cache.

Compute profile: tiny matmuls with massive weight loads. For a 70B model decoding on an H100 with batch size 1, the GPU achieves ~5% of peak FP16 FLOPs but ~80% of peak HBM bandwidth (about 2.7 TB/s of the H100's 3.35 TB/s peak; the H200's peak is 4.8 TB/s). FLOPs are nearly idle; the memory bus is the bottleneck.

Bottleneck: HBM bandwidth.

Time per token: a 70B model in FP16 reads 140 GB per forward pass. At 3.35 TB/s HBM bandwidth (H100), that's a hard floor of ~42 ms per token at batch size 1. Larger batches amortize this — at batch size 64, the same weight read serves 64 tokens.
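The floor above is plain division, and it is worth keeping as a helper so you can swap in other models and chips. This is a back-of-envelope sketch: it counts weight traffic only, ignores KV-cache reads and compute, and assumes a whole batch is served by a single weight read. The function names are illustrative.

```python
def decode_floor_ms(n_params, bytes_per_param, hbm_bytes_per_s):
    """Lower bound on one decode step: every weight byte crosses HBM once."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / hbm_bytes_per_s * 1e3


def max_tokens_per_s(floor_ms, batch_size):
    """Upper bound on tokens/s when one weight read serves the whole batch."""
    return batch_size * 1e3 / floor_ms


# 70B parameters, FP16 (2 bytes each), H100 HBM at 3.35 TB/s.
floor = decode_floor_ms(70e9, 2, 3.35e12)
print(f"floor: {floor:.1f} ms/token")
print(f"batch 1:  {max_tokens_per_s(floor, 1):.0f} tokens/s at most")
print(f"batch 64: {max_tokens_per_s(floor, 64):.0f} tokens/s at most")
```

Real decode falls short of the batch-64 bound because KV-cache reads grow with batch size and sequence length; the helper only shows why batching is the first lever.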

Memory pattern: many requests, each contributing a small slice of work, each requiring its own KV cache to be present in HBM. (For the underlying memory math, see our KV cache memory guide.)

Why decode batching is hard

Decode batching is fundamentally a packing problem. Each request in the batch is at a different sequence position with a different KV cache length, and the attention kernel must handle this without padding to a common length (which would waste HBM and FLOPs). PagedAttention solves this by giving each request a non-contiguous block list and writing kernels that read those blocks directly. The cost is roughly a 5-10% kernel overhead vs perfectly contiguous attention, recouped many times over by the throughput gain from larger effective batches. FlashAttention-3 with paged-KV support is the production kernel for this in 2026; vendor-specific alternatives (TRT-LLM's xqa kernel) close the gap on NVIDIA hardware.
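The block-list idea can be sketched as a tiny allocator. This is a toy model of the bookkeeping only, not vLLM's implementation: the class name, the 16-token block size, and the LIFO free list are all assumptions made for illustration, and no attention kernel is involved.

```python
class PagedKVAllocator:
    """Toy block table: each request owns a list of fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request_id -> list of block ids
        self.seq_lens = {}      # request_id -> tokens stored so far

    def append_tokens(self, request_id, n_tokens):
        """Grow a request by n_tokens, allocating new blocks only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        new_len = self.seq_lens.get(request_id, 0) + n_tokens
        blocks_needed = -(-new_len // self.block_size)  # ceiling division
        extra = blocks_needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("out of KV blocks; caller must evict or queue")
        for _ in range(extra):
            table.append(self.free_blocks.pop())
        self.seq_lens[request_id] = new_len
        return table

    def free(self, request_id):
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.seq_lens.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=8)
alloc.append_tokens("a", 20)          # two blocks
alloc.append_tokens("b", 5)           # one block, interleaved with a's
print(alloc.append_tokens("a", 13))   # a grows past 32 tokens: third block
```

Request "a" ends up with a non-contiguous block list because "b" was allocated in between, and no padding to a common length is ever needed.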

The prefix that is not really prefill

A subtle but important distinction: when a request reuses a cached prefix, the "prefill" of that prefix has already happened on some prior request. What the new request needs is just the prefill of the suffix — the user-specific tail after the cached prefix. This means that prefix-hit prefill is genuinely cheap (often under 5 ms even for prompts whose total length is 4k tokens), and from the scheduler's perspective such a request is closer to a long decode than a true prefill. Disaggregation routers exploit this: prefix-hit requests can sometimes be routed directly to a decode worker that already holds the prefix KV, skipping the prefill pool entirely. The hit-rate for this fast path is workload-dependent but reaches 30-60% on agent and RAG workloads with stable system prompts.
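How much of a prompt still needs prefill after a prefix hit can be sketched in a few lines. The sketch assumes the cache is reused at whole-block granularity with a hypothetical 16-token block, and it compares raw token IDs rather than the hashed blocks or radix trees that production stacks use.

```python
def cached_prefix_len(prompt, cached, block_size=16):
    """Tokens reusable from cache: longest common prefix, rounded down to whole blocks."""
    common = 0
    for a, b in zip(prompt, cached):
        if a != b:
            break
        common += 1
    return (common // block_size) * block_size


def suffix_to_prefill(prompt, cached, block_size=16):
    """Tokens that still need a real prefill pass."""
    return len(prompt) - cached_prefix_len(prompt, cached, block_size)


system_prompt = list(range(35))               # shared across requests
cached = system_prompt + [999] * 5            # an earlier request's tokens
prompt = system_prompt + [1, 2, 3, 4, 5]      # new request, different tail
print(suffix_to_prefill(prompt, cached))
```

Rounding down to block boundaries means a few shared tokens get recomputed, which is the price of keeping cache entries immutable and shareable.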

The arithmetic-intensity gap

The cleanest way to see why these are different workloads is arithmetic intensity — FLOPs performed per byte loaded from HBM.

Phase | Operation | Arithmetic intensity | Bound by
Prefill | Long-sequence GEMM | 100-500 FLOPs/byte | Compute
Decode | Batch-1 GEMM | 1-5 FLOPs/byte | Memory
Decode | Batch-64 GEMM | 30-60 FLOPs/byte | Mixed

On the H100, the "ridge point", where compute and bandwidth balance, is around 295 FLOPs/byte (989 TFLOP/s ÷ 3.35 TB/s) for dense FP16. Prefill sits comfortably above the ridge; decode at small batch sits far below. Same hardware. Opposite regimes.
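The ridge point and the decode intensities can be reproduced with a simple roofline calculation. The sketch below uses a weights-only approximation for a decode GEMM (2 FLOPs per parameter per token, each parameter read once per step), so it ignores KV-cache and activation traffic and gives an upper estimate of intensity. The function names are illustrative.

```python
def ridge_point(peak_flops_per_s, hbm_bytes_per_s):
    """FLOPs/byte at which compute time equals memory time."""
    return peak_flops_per_s / hbm_bytes_per_s


def decode_gemm_intensity(batch_size, bytes_per_param=2.0):
    """Weights-only estimate: 2 FLOPs per parameter per token in the batch,
    divided by the bytes read for that parameter."""
    return 2.0 * batch_size / bytes_per_param


def bound_by(intensity, ridge):
    return "compute" if intensity >= ridge else "memory"


h100_ridge = ridge_point(989e12, 3.35e12)  # dense FP16 peak over HBM bandwidth
for batch in (1, 64, 512):
    ai = decode_gemm_intensity(batch)
    print(batch, ai, bound_by(ai, h100_ridge))
```

Under this approximation FP16 decode intensity is roughly equal to the batch size, which is why small-batch decode cannot escape the memory-bound regime on any current chip.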

Why a single hardware choice cannot bridge the gap

Hardware roadmaps do not narrow this gap; if anything they widen it. B200's FP8 compute over its HBM bandwidth gives a ridge near 575 FLOPs/byte, which means even more of the decode regime sits below the ridge. MI325X with 256 GB at 6 TB/s shifts the ridge down to ~167 FLOPs/byte for BF16, which makes it a better decode-only chip but leaves prefill underutilized in the opposite direction. The structural answer — separate prefill and decode hardware — is the only one that scales with each new HBM generation. For the chip-by-chip breakdown, see H100 vs H200 vs B200 architecture and the 2026 NVIDIA AI GPU lineup.

Decode arithmetic intensity by batch size, model, and quantization

A more useful table than the abstract one above is what the operating point actually looks like in production. For Llama-3-70B FP16 decode on H100 SXM:

Batch | Tokens/s/GPU | HBM utilization | FLOP utilization | Notes
1 | 24 | 78% | 4% | Pure bandwidth bound
8 | 165 | 82% | 11% | Bandwidth bound; KV per request hurts
32 | 540 | 79% | 35% | Approaching ridge; FP8 weights help
64 | 880 | 71% | 52% | Compute and memory both pressing
128 | 1,150 | 58% | 73% | Compute bound, KV cache near limit
256 | 1,260 | 41% | 85% | KV cache eviction begins

The decode pool's job is to push this curve as far right as KV memory allows. That is why FP8 KV cache and KV-cache offload (CPU, hierarchical) are non-negotiable at scale: each doubling of the per-GPU concurrent batch cuts $/output-token, nearly in proportion at small batches (batch 1 to 8 is roughly a 7× gain) and with diminishing returns as you approach the compute ceiling around batch 256 (batch 128 to 256 adds under 10%).


Why colocation hurts

In a naive serving system (early vLLM, raw HuggingFace, classic TGI), both phases run on the same GPU. Three concrete problems.

Hardware mismatch

You pick one GPU. The choices that maximize prefill throughput per dollar (high FLOPs density: H100 SXM, B200) waste capacity during decode. The choices that maximize decode throughput per dollar (high HBM capacity and bandwidth: H200, MI325X) underperform on prefill.

Cloud bill consequence: you're paying for one phase's bottleneck on hardware that's wrong for the other phase. Typical opportunity cost is 20-40% on the dollar.

Scheduler interference

This is the user-visible problem. Prefill is synchronous — one long sequence saturates the GPU for the duration of the forward pass. Decode tokens for other in-flight requests have to wait.

When an 8k-prompt request arrives, inter-token latency (ITL) for every other request in the batch spikes by 100-300 ms. Users perceive the model stalling. Continuous batching (vLLM, TGI, Triton's in-flight batching) softens this by interleaving micro-batches, but doesn't eliminate it: while prefill is using tensor cores, decode is using HBM bandwidth, and there's only one GPU.

Batch-size conflict

Prefill wants small batches — one long sequence already saturates the GPU. Adding a second sequence to the same forward pass mainly pads to a common length and wastes compute on the padding.

Decode wants large batches: at batch 1, decode achieves ~5% of peak FLOPs; at batch 64, ~50%; at batch 256, ~85%. The HBM weight load amortizes across the batch.

A unified scheduler has to pick. It typically picks something in the middle and loses both ways.

The combined cost

Sum these and a colocated stack runs at maybe 40-50% of disaggregated throughput on production workloads. The exact number depends on prompt and output length distributions, but the direction is consistent across the published benchmarks (DistServe, Splitwise, Mooncake).

Goodput vs throughput: the metric that actually matters

Raw tokens/s/GPU is the wrong number to optimize. The relevant metric is goodput: requests/s that meet a target TTFT and per-token ITL SLA. DistServe (Zhong et al., 2024) introduced this framing explicitly. A colocated system can post strong tokens/s numbers while violating its TTFT SLA on 30% of requests because prefill chokes the decode stream. A disaggregated system sacrifices a few percent of nominal throughput but holds its SLA on 99% of requests because the phases never compete for the same kernel slot. When you re-plot the same benchmarks in goodput-at-SLA, the 1.5-3× DistServe gain stops looking like marketing and starts looking like a floor.
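Goodput is easy to compute from per-request traces, which makes it a practical dashboard metric. The sketch below assumes each request records its TTFT and its worst observed ITL; the field names and SLA thresholds are illustrative, not taken from DistServe.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    ttft_ms: float       # time to first token
    worst_itl_ms: float  # worst inter-token latency seen during generation


def throughput(requests, window_s):
    """Completed requests per second, regardless of latency."""
    return len(requests) / window_s


def goodput(requests, window_s, ttft_sla_ms, itl_sla_ms):
    """Requests per second that met both the TTFT and the ITL target."""
    ok = sum(
        1 for r in requests
        if r.ttft_ms <= ttft_sla_ms and r.worst_itl_ms <= itl_sla_ms
    )
    return ok / window_s


window = [
    RequestStats(150, 30),
    RequestStats(900, 25),   # TTFT violation: stuck behind a long prefill
    RequestStats(180, 120),  # ITL violation: decode stalled mid-stream
    RequestStats(100, 40),
]
print(throughput(window, window_s=2.0), goodput(window, 2.0, 200, 50))
```

In this made-up window half of the nominal throughput is worthless to users, which is exactly the gap the goodput framing exposes.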

Head-of-line blocking under bursty load

Real chat traffic is Poisson-ish with bursts: a viral moment, a scheduled cron of agent batch jobs, a regional incident that flushes a queue all at once. In a colocated system, a single 32k-token prompt arriving during a burst stalls the decode batch for 800 ms - 2 s, and because decode is also where ITL is measured, every other user perceives a hiccup. Disaggregation moves that hiccup off the latency-critical path: the prefill pool absorbs the spike, the decode pool keeps grinding its existing batch. The size of this effect is workload-dependent but consistently large at p99 — it is the reason most hosted providers' p99 TTFT graphs flattened in late 2024 when they finished rolling out disaggregation.


The disaggregated architecture

Two physical pools, one router, one cache-transfer plane.

                client request
                      │
                      ▼
              ┌──────────────┐
              │ Router       │  picks prefill + decode workers
              └──────────────┘
                      │
       ┌──────────────┼──────────────┐
       ▼                             ▼
┌──────────────┐              ┌──────────────┐
│ Prefill pool │              │ Decode pool  │
│ FLOPs-heavy  │── KV cache ──│ HBM-heavy    │
│ small batches│   stream     │ large batches│
└──────────────┘   (RDMA)     └──────────────┘
                                     │
                                     ▼
                              tokens stream to client

Request flow

  1. Request arrives at the router.
  2. Router selects a prefill worker (load-balanced, with prefix-cache affinity).
  3. Router selects a decode worker (load-balanced by expected remaining decode work).
  4. Prefill worker reads the prompt, runs one forward pass, produces KV cache layer by layer.
  5. KV cache streams to the decode worker as each layer completes.
  6. Decode worker enters the active batch and generates tokens, streaming each one back to the client.
  7. When generation completes, decode worker frees the KV cache slot.
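Steps 2 and 3 of the flow above can be sketched as a single selection function. This is a deliberately naive policy: it always prefers a prefill worker that holds the prefix, then breaks ties by queue depth, and it picks the decode worker with the least remaining work. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PrefillWorker:
    name: str
    queued_tokens: int
    cached_prefixes: set = field(default_factory=set)


@dataclass
class DecodeWorker:
    name: str
    remaining_tokens: int  # expected decode work still in its batch


def pick_workers(prefix_hash, prefill_pool, decode_pool):
    """Return a (prefill, decode) pair for one request."""
    with_prefix = [w for w in prefill_pool if prefix_hash in w.cached_prefixes]
    candidates = with_prefix or prefill_pool  # fall back to the whole pool
    prefill = min(candidates, key=lambda w: w.queued_tokens)
    decode = min(decode_pool, key=lambda w: w.remaining_tokens)
    return prefill, decode


prefill_pool = [
    PrefillWorker("p0", queued_tokens=100),
    PrefillWorker("p1", queued_tokens=5000, cached_prefixes={"sys-v1"}),
]
decode_pool = [DecodeWorker("d0", 9000), DecodeWorker("d1", 1200)]
p, d = pick_workers("sys-v1", prefill_pool, decode_pool)
print(p.name, d.name)
```

A production router would cap how much extra queueing it accepts for the sake of affinity; unconditional affinity, as here, can overload a popular worker.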

Pool sizing

Pools are sized independently. The ratio depends entirely on workload:

prefill_GPUs / decode_GPUs ≈ (avg_prompt_len / prefill_throughput) / (avg_output_len / decode_throughput)

For a typical chat workload (1k average prompt, 200 average output, modern GPUs), this works out to roughly 1 prefill GPU per 4-8 decode GPUs. For RAG with long contexts, the ratio shifts toward more prefill. For agent workloads with long generations, toward more decode.

A common operational pattern is autoscaling each pool independently against its own queue depth.
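The sizing ratio is simple enough to keep as a function next to your capacity plan. In this sketch the per-GPU throughputs (10,000 prefill tokens/s and 400 decode tokens/s) are placeholder assumptions chosen for illustration; substitute measured numbers from your own hardware and model.

```python
def prefill_to_decode_ratio(avg_prompt_len, avg_output_len,
                            prefill_tok_s, decode_tok_s):
    """GPU-seconds of prefill per request divided by GPU-seconds of decode."""
    prefill_gpu_s = avg_prompt_len / prefill_tok_s
    decode_gpu_s = avg_output_len / decode_tok_s
    return prefill_gpu_s / decode_gpu_s


# Chat-like workload: 1k-token prompts, 200-token outputs.
# Throughput figures are assumed, not measured.
ratio = prefill_to_decode_ratio(1000, 200, prefill_tok_s=10_000, decode_tok_s=400)
print(f"{ratio:.2f} prefill GPUs per decode GPU")
print(f"about 1 prefill GPU per {1 / ratio:.0f} decode GPUs")
```

Because the ratio moves with the prompt and output length distributions, it is worth recomputing from live traffic rather than fixing it at deploy time.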

Router architecture: stateful vs stateless

The router can be stateless (every request is a fresh routing decision based on current pool metrics) or stateful (the router maintains a model of which prefill workers hold which prefix-cache entries, which decode workers have which sessions). Stateful routers win on prefix-cache hit rate by 20-40 percentage points on workloads with stable prefixes (system prompts, RAG indexes), but they require consistent hashing or explicit session affinity and they fail gracefully only if you have built explicit failover. Most production routers in 2026 are stateful with a short-TTL session table backed by Redis or an in-process store, replicated across two or three router replicas behind a stateless L4 load balancer.
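A short-TTL location table is the core data structure of a stateful router. The sketch below is an in-process version with an explicit clock argument so it is easy to test; the class name, the heartbeat shape, and the 30-second TTL are assumptions, and a Redis-backed table would follow the same logic with key expiry.

```python
class PrefixLocationTable:
    """Maps prefix hashes to the worker last known to hold them, with expiry."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._entries = {}  # prefix_hash -> (worker, last_seen)

    def heartbeat(self, worker, prefix_hashes, now):
        """Workers periodically report which prefixes they currently cache."""
        for h in prefix_hashes:
            self._entries[h] = (worker, now)

    def lookup(self, prefix_hash, now):
        """Return the worker for a prefix, or None if unknown or stale."""
        entry = self._entries.get(prefix_hash)
        if entry is None:
            return None
        worker, last_seen = entry
        if now - last_seen > self.ttl_s:
            del self._entries[prefix_hash]  # stale: worker may have restarted
            return None
        return worker


table = PrefixLocationTable(ttl_s=30.0)
table.heartbeat("prefill-1", ["sys-v1"], now=0.0)
print(table.lookup("sys-v1", now=10.0))  # fresh entry
print(table.lookup("sys-v1", now=31.0))  # no heartbeat for over 30 s
```

Passing the clock in explicitly keeps the sketch deterministic; a real router would read a monotonic clock instead.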

Failure modes specific to disaggregation

Three failure modes are unique to disaggregated serving and worth designing for explicitly:

  1. KV transfer stall. RDMA queue pair degrades, congestion in the fabric, or a misconfigured PFC priority causes transfer to slow without dropping. Symptom: TTFT degrades without queue depth changing. Detection: per-layer transfer time histograms with alerts on tail movement. Mitigation: detect, evict the in-flight request, retry on a different prefill/decode pair.

  2. KV slot leak on decode worker. A request fails between admission and first token, but the decode worker has already reserved a KV slot. If the cleanup path is not bulletproof, slots leak slowly until the worker can no longer accept new requests. Detection: KV slot free count diverges from in-flight request count. Mitigation: periodic reconciliation pass that frees orphaned slots.

  3. Router state desync. Stateful routers maintain a model of prefix-cache locations. If a worker restarts and loses its cache without notifying the router, the router keeps routing prefix-affinity requests to a worker that no longer holds the prefix. Symptom: prefix-cache hit rate drops without traffic changing. Mitigation: short TTLs on the router's cache-location table, with workers heartbeating their current cache state.

These are debugged with the same observability investments that pay back in normal operation. Skip them at your peril.
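The reconciliation pass for the KV slot leak (failure mode 2) can be sketched as a pure function over the worker's bookkeeping. The data shapes here are assumptions: a dict from slot ID to (request ID, reservation time) and a set of in-flight request IDs. The grace period is a hypothetical guard that avoids freeing a slot for a request that is still being admitted.

```python
def reconcile_kv_slots(reserved, in_flight, now, grace_s=5.0):
    """Free slots whose request is no longer in flight.

    reserved:  dict of slot_id -> (request_id, reserved_at)
    in_flight: set of request ids the worker is actively serving
    Returns the list of freed slot ids and removes them from `reserved`.
    """
    orphaned = [
        slot
        for slot, (request_id, reserved_at) in reserved.items()
        if request_id not in in_flight and now - reserved_at > grace_s
    ]
    for slot in orphaned:
        del reserved[slot]
    return orphaned


reserved = {
    0: ("req-a", 0.0),   # still running
    1: ("req-b", 0.0),   # failed before first token: leaked
    2: ("req-c", 98.0),  # just admitted, not yet registered as in flight
}
freed = reconcile_kv_slots(reserved, in_flight={"req-a"}, now=100.0)
print(freed, sorted(reserved))
```

Exporting the length of the returned list as a metric gives you the leak-rate signal described above for free.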

Cold start and warm pool sizing

Cold-starting a 70B decode worker takes 30-90 s from NVMe and 5-15 s from CPU memory. Cold-starting a prefill worker is the same plus a graph warm-up. Autoscaling that adds workers reactively in response to a load spike will miss the spike entirely. Production stacks pre-warm a buffer of standby workers (typically 10-20% of steady-state capacity), keep them in the load balancer with weight zero, and promote them to active weight when the queue depth crosses a threshold. The warm pool itself burns money, so the right size is a tradeoff between SLA risk and idle cost — most teams settle around 15% warm overhead for chat workloads. For the loading-side of this picture, see checkpoint storage and recovery.

[Figure: Disaggregated inference in 2026. Infographic comparing the monolithic inference stack with a disaggregated architecture of independently scalable components.]
Disaggregated inference at a glance. Splitting the inference stack into independent specialized components — frontend router, scheduler, prefill workers, decode workers, KV-cache store, model store — lets each layer scale on its own load profile. Prefill is compute-heavy but not latency-critical; decode is latency-sensitive and needs fast KV-cache access. Benefits: better GPU utilization, cost optimization via spot/preemptible prefill, resilience (failures isolated to one layer), and freedom to mix frameworks and hardware. Challenges: inter-component network latency, KV-cache bandwidth, consistency, operational complexity, security boundaries, and per-layer cost visibility. Use it for high-concurrency variable loads, cost-sensitive environments, heterogeneous infra, large batch + real-time mixes, and multi-region deployments.

KV-cache transfer mathematics

The defining constraint of disaggregation is that the KV cache produced by prefill must reach the decode worker before decode can begin (or in time for it to begin without stalling).

KV cache size

kv_bytes = 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_elem

The factor 2 is for both K and V. For a 70B model (80 layers, 8 KV heads with grouped-query attention, head_dim 128) at FP16:

Prompt length | Per-request KV cache
1k tokens | 335 MB
2k tokens | 670 MB
8k tokens | 2.7 GB
32k tokens | 10.7 GB
128k tokens | 43 GB

For a 405B model (126 layers, 8 KV heads, head_dim 128) at FP16:

Prompt length | Per-request KV cache
8k tokens | 4.2 GB
32k tokens | 16.9 GB
128k tokens | 67.6 GB

These are large. For long-context workloads, the KV cache for a single request can rival the model weights themselves.
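The tables above follow directly from the kv_bytes formula. The helper below reproduces them, assuming "1k" means 1,024 tokens and that sizes are reported in decimal MB and GB; the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-request KV cache size; the leading 2 covers both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# 70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128, FP16.
for k in (1, 2, 8, 32, 128):
    size = kv_cache_bytes(80, 8, 128, seq_len=k * 1024)
    print(f"{k:>3}k tokens: {size / 1e9:.2f} GB")

# 405B-class model: 126 layers, same KV head layout.
for k in (8, 32, 128):
    size = kv_cache_bytes(126, 8, 128, seq_len=k * 1024)
    print(f"{k:>3}k tokens: {size / 1e9:.2f} GB")
```

Setting bytes_per_elem to 1 gives the FP8 footprint, which is the quickest way to see what KV quantization buys in concurrent requests per GPU.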

Aggregate transfer bandwidth

At steady state, the inter-pool fabric must move KV cache at the rate prefill produces it:

required_bandwidth = QPS × avg_kv_bytes_per_request

For 100 req/s with mean 4k-token prompts on a 70B model:

100 × 1.34 GB ≈ 134 GB/s

Achievable on NVLink (within node, ~900 GB/s aggregate) or on multiple 400G InfiniBand links (50 GB/s each unidirectional, so ~3 NICs for unidirectional, double for full duplex). Painful on 100G Ethernet (12.5 GB/s). For the underlying fabric tradeoffs, see InfiniBand vs RoCE.
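As a quick sizing aid, the bandwidth arithmetic above can be scripted. This is a sketch only: the helper names are hypothetical, and it ignores protocol overhead and burstiness, both of which push the real link count higher.

```python
import math

# 70B GQA config from the text at FP16: 2 x 80 layers x 8 KV heads x 128 dim x 2 bytes.
BYTES_PER_TOKEN_70B_FP16 = 2 * 80 * 8 * 128 * 2


def required_gb_per_s(qps: float, avg_prompt_tokens: int, bytes_per_token: int) -> float:
    """Steady-state KV traffic from the prefill pool to the decode pool, in GB/s."""
    return qps * avg_prompt_tokens * bytes_per_token / 1e9


def links_needed(required: float, per_link_gb_per_s: float) -> int:
    """Minimum number of one-way links at the given per-link rate."""
    return math.ceil(required / per_link_gb_per_s)


need = required_gb_per_s(100, 4096, BYTES_PER_TOKEN_70B_FP16)
print(f"{need:.1f} GB/s")
print("400G InfiniBand links:", links_needed(need, 50.0))
print("100G Ethernet links:", links_needed(need, 12.5))
```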

KV cache wire format and compression

The on-the-wire format affects both bandwidth and latency. The practical options, from lossless to aggressive:

Format | Bytes/token (70B GQA, 80L, 8 KV heads) | Quality impact | Notes
BF16/FP16 | 327 KB | None (reference) | Default; widely supported
FP8 E4M3 | 164 KB | ~0% on chat; ~1-2% on long-context retrieval | Per-tensor scale, common production choice
INT8 per-channel | 164 KB | ~0.5% across most benchmarks | Slightly better than FP8 on some workloads
INT4 grouped (g=128) | 82 KB | 1-4% workload-dependent | Aggressive; verify on your eval set
Sparse + FP8 (top-k, k=0.5) | ~82 KB | 1-3% on most tasks | Adds compute on send side

Most production stacks ship FP8 KV transfer in 2026; INT4 is opt-in. The transfer-side quantization can be different from the in-HBM storage format — many stacks store FP16 on the decode side and accept FP8 over the wire, dequantizing on receive. See our quantization tradeoffs guide for the broader picture.
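To make the "quantize on send, dequantize on receive" pattern concrete, here is a pure-Python sketch of symmetric per-channel INT8 quantization. It is illustrative only: real stacks do this in fused GPU kernels, the FP8 path uses a different number format, and both function names are hypothetical.

```python
from typing import List, Tuple


def quantize_per_channel(rows: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per row (channel)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in rows:
        max_abs = max(abs(x) for x in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer step and clamp to the INT8 range.
        q_rows.append([max(-127, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_channel(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Receive side: expand the INT8 payload back to floats using the sent scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

The sender ships the INT8 payload plus one scale per channel, so the wire size is roughly half of FP16. The reconstruction error per element is bounded by half a quantization step.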

NIXL, Mooncake transfer engine, and the standardization story

Through 2024 every disaggregated stack rolled its own KV-transfer code. In 2025 NVIDIA's NIXL (NVIDIA Inference Xfer Library) and the Mooncake transfer engine emerged as two competing open-source abstractions over GPUDirect RDMA, UCX, and NVLink. NIXL is what TensorRT-LLM and most NVIDIA partners ship; Mooncake's engine is what SGLang and parts of vLLM consume. Both expose a layer-wise async-send primitive with completion callbacks and queue-depth control. If you are writing inference plumbing in 2026, do not write your own RDMA wrappers; use one of these. The choice between them is mostly about which serving stack you have committed to, not technical merit.

Three transfer strategies

1. Direct over RDMA. GPUDirect RDMA copies tensor data directly from one GPU's HBM to another's, bypassing host CPU and host memory. Works over InfiniBand and RoCE (RDMA over Converged Ethernet). Latency ~3-10 µs per transfer, bandwidth limited by the link.

If you transfer the entire cache after prefill completes, you wait for the full transfer before decode starts. For a 10 GB cache on 50 GB/s IB, that's ~200 ms of pure transfer latency added to TTFT. Unacceptable for interactive workloads. Solved by streaming (next section).

2. Layer-wise streaming. Start transferring layer L's KV as soon as prefill finishes layer L, while prefill continues on later layers. Detailed in the next section.

3. Same-node disaggregation. Prefill and decode on different GPUs of the same node, connected by NVLink. Transfer is essentially free (~900 GB/s within NVSwitch fabric). Captures the scheduling benefit without inter-node network engineering. A common starting point for teams new to disaggregation. See NVLink and rack-scale topology for the fabric details.


Layer-wise streaming and overlap

Layer-wise streaming is the key technique that makes disaggregation practical for interactive workloads. The idea: never wait for the full cache; pipeline the transfer with the remaining prefill compute.

How it works

Prefill processes layers sequentially. After layer L completes:

  1. K and V tensors for layer L are written to HBM.
  2. An async copy kicks off, moving those tensors to the decode worker.
  3. Prefill proceeds to layer L+1.

If layer-compute time per layer is roughly equal to per-layer transfer time, the transfer completes "for free" — by the time the last layer finishes prefill, the cache for layers 1 through N-1 is already on the decode side, and only layer N's data needs to finish moving.

The math

For an N-layer model with t_compute per layer (prefill) and t_transfer per layer:

naive end-to-end = N × t_compute + N × t_transfer
streamed end-to-end ≈ N × max(t_compute, t_transfer) + min(t_compute, t_transfer)

When t_transfer ≤ t_compute, the streamed version reduces to N × t_compute + t_transfer, hiding nearly all transfer behind compute. The added TTFT is ~one layer's worth of transfer. When t_transfer > t_compute, the link is the bottleneck and end-to-end time is set by transfer rather than by prefill.

For a 70B model with ~80 layers at ~5 ms per layer prefill and ~2.5 ms per layer transfer (~125 MB of KV per layer, roughly a 30k-token FP16 cache, on a 50 GB/s link), naive end-to-end is 80 × 7.5 = 600 ms; streamed is 80 × 5 + 2.5 ≈ 403 ms. That is about 33% lower latency, and the absolute savings grow with context length because the transfer being hidden grows with it.
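The worked example can be checked with a small simulation. The sketch below assumes one copy in flight at a time and uniform per-layer times, which is a simplification of any real transfer engine; both function names are hypothetical.

```python
def naive_latency_ms(n_layers: int, t_compute: float, t_transfer: float) -> float:
    """Prefill every layer, then transfer every layer."""
    return n_layers * (t_compute + t_transfer)


def streamed_latency_ms(n_layers: int, t_compute: float, t_transfer: float) -> float:
    """Layer i's copy starts once its compute and the previous copy have both finished."""
    compute_done = 0.0
    transfer_done = 0.0
    for _ in range(n_layers):
        compute_done += t_compute
        transfer_done = max(compute_done, transfer_done) + t_transfer
    return transfer_done


print(naive_latency_ms(80, 5.0, 2.5))     # compute-bound example from the text
print(streamed_latency_ms(80, 5.0, 2.5))
print(streamed_latency_ms(80, 2.5, 5.0))  # transfer-bound: the link sets the pace
```

In the transfer-bound case the recurrence shows the link running back to back from the moment the first layer is ready, so faster prefill hardware no longer shortens the handoff.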

Implementation notes

  • The decode worker needs to track which layers have arrived and which have not; it cannot run attention for a layer whose KV is still in flight.
  • In practice, decode waits for all layers before starting generation, but the wait is much shorter than the naive transfer.
  • Layer-wise streaming requires careful HBM allocation on the receive side — preallocate slots so async copies have a destination.

This is what Mooncake (Moonshot AI's serving paper) made widely known, and what production stacks implement under the hood.
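The receive-side bookkeeping from the implementation notes can be sketched as follows. This is a toy model under stated assumptions: `bytearray` stands in for preallocated HBM slots, `LayerReceiveBuffer` is a hypothetical name, and a real engine would signal completion through its own callbacks.

```python
from typing import List


class LayerReceiveBuffer:
    """Receive-side bookkeeping for one request's layer-wise KV stream."""

    def __init__(self, n_layers: int, bytes_per_layer: int) -> None:
        # Preallocate every destination slot before the first byte arrives.
        self.slots: List[bytearray] = [bytearray(bytes_per_layer) for _ in range(n_layers)]
        self.arrived: List[bool] = [False] * n_layers

    def on_layer_received(self, layer: int, payload: bytes) -> None:
        """Completion handler for one layer's copy; layers may arrive in any order."""
        if len(payload) != len(self.slots[layer]):
            raise ValueError("payload size does not match the preallocated slot")
        self.slots[layer][:] = payload
        self.arrived[layer] = True

    def missing_layers(self) -> List[int]:
        return [i for i, ok in enumerate(self.arrived) if not ok]

    def ready_for_decode(self) -> bool:
        # Decode starts only once every layer's KV is resident.
        return all(self.arrived)
```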

Per-layer chunking and tensor-parallel KV layout

When the decode pool uses tensor parallelism, each KV layer is sharded across N decode GPUs. The prefill worker must either (a) send the full layer to one decode GPU which then scatter-broadcasts, or (b) shard the layer at send time and stream to each TP rank in parallel. Option (b) is faster but requires the prefill and decode TP layouts to match or for the prefill side to know the decode side's sharding. Production stacks (TRT-LLM, vLLM V1) implement option (b) with a small protocol header that includes the TP rank. If the prefill pool and decode pool have different TP degrees — for example, TP=2 prefill, TP=4 decode — a reshuffle is required, which usually means falling back to option (a) and accepting the latency hit.
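To see why mismatched TP degrees force a reshuffle, it helps to write out which KV heads each prefill rank would have to send to each decode rank. The sketch below assumes contiguous, evenly divisible head sharding; that layout and the function names are assumptions for illustration, and real stacks may shard differently or replicate KV heads.

```python
from typing import Dict, List, Tuple


def head_range(rank: int, tp: int, n_kv_heads: int) -> range:
    """KV heads owned by one rank under contiguous sharding."""
    per_rank = n_kv_heads // tp
    return range(rank * per_rank, (rank + 1) * per_rank)


def reshard_plan(n_kv_heads: int, prefill_tp: int,
                 decode_tp: int) -> Dict[Tuple[int, int], List[int]]:
    """Map (prefill_rank, decode_rank) to the KV head indices that must be sent."""
    if n_kv_heads % prefill_tp or n_kv_heads % decode_tp:
        raise ValueError("n_kv_heads must be divisible by both TP degrees")
    plan: Dict[Tuple[int, int], List[int]] = {}
    for p in range(prefill_tp):
        src = set(head_range(p, prefill_tp, n_kv_heads))
        for d in range(decode_tp):
            heads = sorted(src & set(head_range(d, decode_tp, n_kv_heads)))
            if heads:
                plan[(p, d)] = heads
    return plan


print(reshard_plan(8, 4, 4))  # matched layouts: each rank sends to its peer only
print(reshard_plan(8, 2, 4))  # TP=2 prefill, TP=4 decode: each sender splits its shard
```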

Backpressure and the receive-side allocator

The decode worker must pre-allocate a KV slot before the first byte arrives. If the decode pool is at HBM capacity, the prefill side has nowhere to send. Production routers reserve a decode slot at admission time and refuse to start prefill if no decode slot is available — this turns a confusing "transfer stalls mid-flight" failure mode into a clean "request queued" admission decision. The corollary is that the router needs an accurate model of decode-pool HBM, including any future growth (generation length headroom). Underestimating this is the most common source of mysterious p99 spikes in young disaggregated deployments.
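A minimal sketch of reserve-at-admission follows, assuming the router tracks decode-pool HBM as a single byte budget and reserves for the prompt plus the maximum generation length up front. The class and method names are hypothetical; real routers track per-worker state and paged blocks.

```python
class DecodePoolBudget:
    """Router-side model of decode-pool KV capacity."""

    def __init__(self, capacity_bytes: int, bytes_per_token: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.bytes_per_token = bytes_per_token
        self.reserved_bytes = 0

    def _need(self, prompt_tokens: int, max_new_tokens: int) -> int:
        # Reserve generation headroom too, so growth during decode cannot overflow HBM.
        return (prompt_tokens + max_new_tokens) * self.bytes_per_token

    def try_reserve(self, prompt_tokens: int, max_new_tokens: int) -> bool:
        """Return True and reserve a slot, or False so the request stays queued."""
        need = self._need(prompt_tokens, max_new_tokens)
        if self.reserved_bytes + need > self.capacity_bytes:
            return False
        self.reserved_bytes += need
        return True

    def release(self, prompt_tokens: int, max_new_tokens: int) -> None:
        """Return the reservation when the request finishes or is cancelled."""
        self.reserved_bytes = max(0, self.reserved_bytes - self._need(prompt_tokens, max_new_tokens))
```

Prefill starts only after `try_reserve` succeeds, which is what turns a mid-flight transfer stall into a plain queueing decision.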


Scheduling, routing, and prefix caching

The router is the brain of a disaggregated system. Its job is harder than a colocated scheduler because it makes decisions across two pools with different load characteristics.

Worker selection

For prefill:

  • Pure load balance: shortest queue, least-loaded GPU.
  • With prefix caching: prefer workers that already hold relevant prefix KV cache.
  • Skew handling: avoid hot workers that have accumulated long contexts.

For decode:

  • Load balance by expected remaining work: tokens in flight × estimated tokens remaining per request. A worker with many long-running generations stays loaded.
  • KV cache capacity: workers near HBM capacity should not accept new requests.
  • Affinity: once a request lands on a decode worker, it stays for the request's lifetime.

Prefix caching

Many requests share prefixes:

  • A common system prompt across all chat users.
  • A retrieved document attached to many user questions.
  • A few-shot prefix in an API workload.

If the prefill worker already holds the prefix KV cache, only the user-specific tail needs to be processed. On prefix-heavy workloads this can cut prefill time, and therefore TTFT, by 5-10×.

Implementation choices:

  • Local prefix cache. Each prefill worker keeps an LRU cache of prefix KVs. Router routes by prefix hash to maximize cache hits.
  • Distributed prefix cache. A shared store (NVMe-backed, or distributed in HBM/CPU memory) holds prefix KVs. Any prefill worker can fetch. More complex; useful at large scale.
  • Hierarchical caching. Hot prefixes in HBM, warm in CPU memory, cold on NVMe. Eviction by access frequency.

vLLM's automatic prefix caching, SGLang's RadixAttention, and TensorRT-LLM's prefix-cache support all implement variants of this. Hosted provider features like "prompt caching" surface a controllable version to the API user. For the broader serving stack context, see our LLM serving guide.
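Prefix-aware routing can be sketched with chained block hashes: hash the prompt in fixed-size token blocks, then pick the worker holding the longest run of matching blocks. This is an illustrative sketch, not how any particular stack implements it; the 16-token block size and the function names are assumptions.

```python
import hashlib
from typing import Dict, List, Sequence, Set

BLOCK = 16  # tokens per hashed block (assumed for illustration)


def block_hashes(tokens: Sequence[int], block: int = BLOCK) -> List[str]:
    """Chained hashes: block i's hash covers every token up to the end of block i."""
    hashes: List[str] = []
    h = hashlib.sha256()
    full = len(tokens) - len(tokens) % block  # ignore the trailing partial block
    for start in range(0, full, block):
        h.update((",".join(map(str, tokens[start:start + block])) + ";").encode())
        hashes.append(h.copy().hexdigest())
    return hashes


def pick_prefill_worker(tokens: Sequence[int], cached: Dict[str, Set[str]],
                        queue_len: Dict[str, int]) -> str:
    """Longest cached prefix wins; the shorter queue breaks ties."""
    hashes = block_hashes(tokens)

    def matched_blocks(worker: str) -> int:
        n = 0
        for digest in hashes:
            if digest not in cached[worker]:
                break
            n += 1
        return n

    return max(cached, key=lambda w: (matched_blocks(w), -queue_len[w]))
```

A production router would also weigh queue length against cache hits, since sending every request with a popular system prompt to one worker creates the hot-worker skew described above.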

KV eviction under pressure

Decode workers have finite HBM. When new requests arrive at a full worker:

  • Preemption: drop an in-flight request's KV cache, recompute via prefill if/when it resumes. Costly.
  • Offloading: page KV cache to CPU memory. Even costlier (PCIe bandwidth).
  • Queueing: reject or delay the new request.

Production stacks have all three as fallbacks, ordered by cost. Avoiding eviction is itself a scheduling objective: don't admit more concurrent generations than HBM can hold.

Decode worker affinity and migration

Once a request lands on a decode worker, moving it is expensive: the full KV cache for that request must be migrated, which is the same transfer cost as the original prefill-to-decode handoff. Yet migration is sometimes desirable — a decode worker is being drained for maintenance, or load imbalance has become severe. Production stacks support migration as an explicit operation with three steps: (1) pause generation, (2) layer-wise stream KV to the new worker, (3) resume from the partial state. The pause is user-visible as a brief ITL spike. Most operators set the migration threshold conservatively (only drain for actual maintenance, not for load-balancing) because the cure is often worse than the disease.

Chunked prefill scheduling inside the prefill pool

Inside a single prefill worker, chunked prefill (Agrawal et al., 2023) breaks a long prompt into chunks (typically 512-2048 tokens) and interleaves them with chunks from other requests in flight. This is itself a mini-scheduler problem. The standard heuristic in vLLM and SGLang is "fill to a target batch token-budget per step" — for example, 8192 tokens of work per step, split across whichever requests are in queue. Short prompts get processed in a single chunk; long prompts spread over many. The effect on TTFT distributions is large: a workload with a few 32k prompts mixed into mostly 1k prompts shows p99 TTFT improvements of 3-5× when chunked prefill is on with a reasonable budget. Disaggregation does not replace chunked prefill — it works on top of it inside the prefill pool.
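The token-budget heuristic can be sketched in a few lines. This is a simplified round-robin version for illustration, not the actual vLLM or SGLang scheduler; `schedule_step` and its defaults are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


def schedule_step(pending: Deque[Tuple[str, int]], token_budget: int = 8192,
                  max_chunk: int = 2048) -> List[Tuple[str, int]]:
    """Build one step's batch of (request_id, tokens_to_prefill) under a token budget."""
    batch: List[Tuple[str, int]] = []
    budget = token_budget
    for _ in range(len(pending)):  # each queued request gets at most one chunk per step
        if budget == 0:
            break
        req_id, remaining = pending.popleft()
        take = min(remaining, max_chunk, budget)
        batch.append((req_id, take))
        budget -= take
        if remaining > take:
            # Unfinished prompts go to the back so short prompts are not starved.
            pending.append((req_id, remaining - take))
    return batch


queue: Deque[Tuple[str, int]] = deque([("long", 32768), ("a", 1024), ("b", 1024)])
print(schedule_step(queue))  # the long prompt gets one chunk; both short prompts finish
print(list(queue))           # only the remainder of the long prompt is left
```

The two 1k prompts complete in the first step instead of waiting behind the 32k prompt, which is where the p99 TTFT improvement comes from.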


Hardware mix: prefill vs decode pools

The whole reason to disaggregate is that the two phases want different hardware. The 2026 mix:

Prefill pool

Wants: highest possible FLOPs density (FP8 or BF16), modest HBM (only one request's prefill at a time on a worker).

Good fits:

  • H100 SXM — 989 TFLOPs FP16, 80 GB HBM3, 3.35 TB/s. Mature, available, well-tooled.
  • H200 — same compute as H100, more HBM (141 GB) and bandwidth (4.8 TB/s). Overkill for prefill alone but useful in shared deployments.
  • B200 — ~2.5× the FLOPs of H100 in FP8, 192 GB HBM3e at ~8 TB/s. The current frontier.
  • MI300X — competitive FLOPs, 192 GB HBM3. Strong choice if you can use ROCm-tuned serving stacks.

Decode pool

Wants: highest possible HBM bandwidth and capacity. FLOPs don't matter much.

Good fits:

  • H200 — the workhorse. 141 GB HBM3e at 4.8 TB/s. Bandwidth is what you're paying for.
  • B200 — even better HBM (192 GB, 8 TB/s). Expensive per GPU but excellent decode throughput.
  • MI325X — 256 GB HBM3e at 6 TB/s. Decode-friendly profile.
  • MI300X — solid, cheaper-per-GB than NVIDIA equivalents. Good for cost-sensitive decode.

A common 2026 mix

For a large hosted deployment serving frontier models:

  • Prefill pool: B200 nodes, sized for peak prompt processing.
  • Decode pool: H200 or B200 nodes, sized for peak concurrent decode batch.
  • Fast interconnect: NVLink within node, NVL72-class rack-scale within rack, 400G+ InfiniBand across racks.

For a budget deployment:

  • Prefill: smaller H100 pool.
  • Decode: larger H200 or MI325X pool.
  • Cross-pool: InfiniBand or RoCE.

The point of the split is that decode capacity dominates the bill at scale, and you don't want to pay H100 SXM prices for capacity you'll use at 5% FLOPs utilization.

Per-token cost by hardware mix

A rough cost model for 70B Llama-class serving on AWS p5.48xlarge (H100 SXM) and p5e.48xlarge (H200) on-demand pricing, with $/M output tokens including amortized prefill:

Configuration | Hardware | $/M output tokens | TTFT p50 | TTFT p99
Colocated | H100 (8x SXM), $98/hr/node | $1.40 | 220 ms | 1,600 ms
Same-node disag | H100 (8x SXM), $98/hr/node | $1.05 | 170 ms | 700 ms
Disag | H100 prefill + H200 decode, mixed | $0.85 | 150 ms | 450 ms
Disag | B200 prefill + H200 decode, mixed | $0.65 | 130 ms | 380 ms
Disag | B200 both pools, $$ | $0.55 | 110 ms | 320 ms

Numbers are illustrative for steady-state production load. In this model the all-B200 configuration has the lowest $/M tokens and the lowest latency, but only if B200 capacity is actually obtainable at the modeled price. Until B200 supply catches up with H200 in the market, the disag B200/H200 mix is the practical cost optimum: B200 prefill is fast enough that you need fewer prefill nodes, while H200 decode is the cheapest per-GB HBM that meets latency. For broader cost framing, see AI inference cost economics.


Comparison with co-located serving

A rough side-by-side for a 70B-class model serving a typical chat workload (1k average prompt, 200 average output, mixed traffic):

Metric | Colocated (vLLM default) | Same-node disaggregated | Fully disaggregated
TTFT (p50) | 200 ms | 150 ms | 130 ms
TTFT (p99) | 1500 ms | 600 ms | 400 ms
ITL (p50) | 30 ms | 25 ms | 22 ms
Decode throughput / GPU | 1× (baseline) | 1.4× | 1.8-2.5×
$/M output tokens | 1× (baseline) | 0.8× | 0.5-0.65×
Operational complexity | Low | Medium | High
Inter-pool bandwidth need | None | NVLink | 100-400G IB

Exact numbers depend hugely on workload, hardware, and tuning. The directions are robust.

Tail latency: where disaggregation wins hardest

The median numbers above understate disaggregation's value. Look at p99 and p99.9:

Metric | Colocated | Same-node disag | Fully disag
TTFT p99 | 1,500 ms | 600 ms | 400 ms
TTFT p99.9 | 5,000 ms | 1,200 ms | 700 ms
ITL p99 | 80 ms | 45 ms | 35 ms
ITL p99.9 | 300 ms | 90 ms | 60 ms

The p99.9 rows are where SLA pain lives. A colocated system whose median looks "as good as" a disaggregated one still misses much harder on the 1-in-1000 request: 5,000 ms vs 700 ms TTFT and 300 ms vs 60 ms ITL, a 5-7× gap. At hosted-provider scale (millions of requests per day), 1-in-1000 means thousands of affected requests daily, so those violations are user-visible.

Where disaggregation does not help

Disaggregation does nothing for tokenizer cost, nothing for cold-start latency on the first request, nothing for model-load time, and nothing for output-quality issues. It is a serving-architecture optimization for steady-state throughput and latency under load. If your problem is "first request after deploy is slow", you want pre-warming and weight pinning, not disaggregation. If your problem is "outputs are wrong sometimes", you want better evals and post-training, not disaggregation.


Production deployments in 2026

Who runs disaggregated inference today:

DeepSeek. Their published serving stack for V3 is fully disaggregated with aggressive layer-wise streaming and expert-parallel decode. They report roughly 14.8k output tokens/s per 8-GPU H800 node during decode (about 1.85k tokens/s/GPU) at production load, achievable only with disaggregation plus MoE-aware scheduling.

Mooncake (Moonshot AI / Kimi). The Mooncake paper is the widely-cited reference design. Disaggregated prefill/decode with a distributed KV-cache pool, layer-wise streaming, and prefix-aware routing.

vLLM. First-class support shipped in 2024, production-stable by mid-2025. Configured via --kv-transfer-config, which names the KV connector and marks each instance as a KV producer (prefill) or KV consumer (decode).

SGLang. Disaggregated serving with RadixAttention for tree-structured prefix sharing. Strong on workloads with many forking prefixes (multi-turn agents, branching evaluations).

TensorRT-LLM. NVIDIA's serving stack with Triton Inference Server integration. Disaggregation support landed in 2024, tightly coupled with their GPU-direct RDMA paths.

Hosted providers. OpenAI, Anthropic, Google, Together, Fireworks, Anyscale, and Groq are all understood to run some form of disaggregated serving. Prompt caching features (Anthropic's prompt caching, OpenAI's prompt caching) are user-visible surfaces of the underlying prefix-share mechanism.

Reading provider prompt caching as disaggregation signals

Anthropic exposes prompt caching with a 5-minute TTL by default, billed at 10% of input price for cache reads and 125% for cache writes, a clear tell that they have a per-tenant prefix cache with explicit accounting. OpenAI's prompt caching is automatic above 1024 tokens, with no per-request control, suggesting a cluster-shared cache with implicit eviction. Together and Fireworks expose pricing tiers that imply mixed disaggregated and colocated pools depending on context length. Reading these provider features as APIs onto the underlying serving architecture is useful when you are deciding whether to build your own stack or rent: if your workload's economics match the cache-heavy discount tier, your prefix-share rate is already high enough that a self-hosted disaggregated stack would pay back.

DeepSeek-V3 numbers in context

DeepSeek's published serving numbers (~14.8k output tokens/s per H800 node in decode, about 1.85k tokens/s/GPU) are widely cited but rarely contextualized. They are achieved with disaggregated prefill/decode, expert-parallel decode across 32+ GPUs for the MoE layers, FP8 throughout, and the company's own optimized kernels. Reproducing them on stock vLLM requires careful expert-parallel tuning and typically reaches roughly 60-80% of the published figure. The remaining gap is partly proprietary kernels, partly workload-specific tuning, and partly that DeepSeek's numbers are on their own traffic mix rather than a public benchmark. Treat them as a north star, not a target you should expect to hit with a default config.

The pattern across these is that the architecture is no longer disputed. The frontier engineering is in scheduling, prefix caching at scale, and KV-cache compression.

Stack-by-stack feature comparison

A condensed snapshot of which serving stack supports what in May 2026:

Feature | vLLM 0.8 | SGLang 0.4 | TensorRT-LLM 0.18 | LMDeploy
Continuous batching | Yes | Yes | Yes | Yes
PagedAttention | Yes | Yes (via RadixAttention) | Yes | Yes
Automatic prefix caching | Yes | Yes (radix tree) | Yes | Yes
Chunked prefill | Yes (V1 scheduler) | Yes | Yes | Yes
FP8 KV cache | Yes | Yes | Yes (mature) | Yes
Same-node disaggregation | Yes | Yes | Yes | Partial
Multi-node disaggregation | Yes (beta) | Yes | Yes (NIXL) | No
Speculative decoding (EAGLE-2) | Yes | Yes | Yes | Partial
Multi-tenant LoRA | Yes (mature) | Yes | Yes | Yes
MoE expert parallelism | Yes | Yes | Yes | Partial
Quantized weight loading (AWQ/GPTQ) | Yes | Yes | Yes | Yes
Hardware: NVIDIA | Yes | Yes | Yes (only) | Yes
Hardware: AMD (ROCm) | Yes | Yes | No | Partial
Hardware: TPU | Partial | No | No | No

The pragmatic call: vLLM if you want a general-purpose stack with broad hardware support and the largest community; SGLang if your workload has heavy prefix-tree branching (agents, evals, multi-turn) and you want RadixAttention's structural fit; TensorRT-LLM if you are NVIDIA-only, latency-obsessed, and willing to pay the operational cost; LMDeploy if you are on a tight budget and need a leaner runtime.


When disaggregation matters

Disaggregation pays off when several conditions stack:

Long prompts (> 500 tokens average). Long prefill dominates colocated latency. Splitting it out helps most here.

Long outputs (> 100 tokens). Decode batching dominates throughput; you want big decode batches uninterrupted by prefill.

Non-trivial QPS (> 5 req/s/node). Scheduler interference hurts most at high load. Below this, a colocated GPU is rarely the bottleneck.

Fast inter-pool fabric. NVLink within node, RDMA-capable network (InfiniBand or 400G+ RoCE) across nodes.

Shared prefixes. System prompts, RAG context, few-shot prefixes. Prefix caching is the single largest win in many production workloads, and disaggregation makes it scalable.

Cost sensitivity at scale. A 20-40% improvement on $/token at 100M+ tokens/day is material. The same improvement at 100k tokens/day is rounding error.


When it doesn't

Skip disaggregation when:

Short prompts and outputs (both < 100 tokens). Transfer overhead and router complexity dominate the modest scheduling win.

Low QPS (< 1 req/s). A single colocated GPU has spare capacity; the win is theoretical.

Slow network. 1G or 10G Ethernet without RDMA makes inter-pool transfer the new bottleneck. Either upgrade or stay colocated.

No shared prefixes. Loses the biggest prefix-cache optimization. The pure prefill/decode split still wins, but less.

Single-tenant edge deployment. Operational complexity isn't worth it for a deployment with one customer and predictable load.

Tiny models (< 7B). The arithmetic-intensity gap is smaller, the absolute KV cache is smaller, and a single GPU at decent batch sizes is hard to beat operationally.

For most personal projects, hobbyist deployments, and small-team production: vanilla vLLM with continuous batching on one or two GPUs is the right answer.

Decision checklist

Use this as a fast triage before committing to a disaggregated build:

Question | If yes | If no
Average prompt > 1k tokens? | +1 | -1
Sustained QPS > 5/node? | +1 | -1
Model > 30B parameters? | +1 | -1
Shared prefixes (system prompt, RAG)? | +2 | 0
RDMA or NVLink available between target nodes? | +1 | -2
TTFT SLA tighter than 500 ms p99? | +1 | 0
Engineering team ready for two-pool ops? | 0 | -2

Score ≥ 3: disaggregate (start with same-node). Score 0-2: consider same-node only. Score < 0: stay colocated, optimize the easy stuff first (FP8 KV, prefix caching, chunked prefill).
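The checklist is simple enough to encode directly. A small sketch, with dictionary keys that are shorthand invented here for the seven questions:

```python
from typing import Dict, Tuple

# (points if yes, points if no) for each checklist question, in table order.
CHECKLIST: Dict[str, Tuple[int, int]] = {
    "avg_prompt_over_1k": (1, -1),
    "qps_over_5_per_node": (1, -1),
    "model_over_30b": (1, -1),
    "shared_prefixes": (2, 0),
    "rdma_or_nvlink": (1, -2),
    "ttft_sla_under_500ms_p99": (1, 0),
    "team_ready_for_two_pool_ops": (0, -2),
}


def triage(answers: Dict[str, bool]) -> Tuple[int, str]:
    """Score a deployment against the checklist and return (score, verdict)."""
    score = sum(yes if answers[key] else no for key, (yes, no) in CHECKLIST.items())
    if score >= 3:
        verdict = "disaggregate (start with same-node)"
    elif score >= 0:
        verdict = "consider same-node only"
    else:
        verdict = "stay colocated"
    return score, verdict


print(triage({key: True for key in CHECKLIST}))
```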


Open research and engineering questions

A few areas where the field is still moving:

Disaggregation across heterogeneous hardware. Mixing NVIDIA and AMD GPUs in one disaggregated stack — prefill on one, decode on the other. Promising on cost; held back by software immaturity.

Disaggregation under MoE. Expert parallelism in the decode pool interacts non-trivially with KV-cache transfer. Best-known approaches are workload-specific; general scheduling is open.

KV-cache compression on the wire. Quantizing or sparsifying KV before transfer to reduce inter-pool bandwidth. Trades CPU/GPU cycles for network. Some production deployments do FP8 or INT4 KV transfer; aggressive approaches (sparsity, learned compression) are research-grade.

Speculative decoding in disaggregated stacks. The draft model adds another scheduling dimension. Currently solved by running draft on the decode worker; alternatives (separate draft pool, shared draft service) are explored. See our speculative decoding guide for the underlying mechanism.

Multi-tenant prefix-cache safety. Sharing prefix caches across tenants is fast but leaks information (response time correlates with cache hits, which correlates with prefix content). Mitigations are early.

Dynamic prefill/decode pool ratios. Autoscaling each pool independently is standard; tightly coordinating their scaling (so a queue surge on one side triggers preemptive scaling on the other) is less mature.

Disaggregating the embedding and LM head

Most stacks lump the input embedding lookup and output projection (LM head) into either the prefill or decode worker. For dense models with vocabularies of 128k+ and hidden dimensions of 8k+, the LM head alone is a 1B+ parameter matmul that runs on every decode step. Some research deployments split it into a separate pool of tiny compute-light workers that simply project hidden states to logits and sample. The gain is small (5-10% at most) and the engineering cost is large, but for very high-throughput hosted APIs with tight latency budgets it has shown up as a real optimization.

Sub-layer pipelining and head-parallel transfer

Today's layer-wise streaming sends KV at layer granularity. Some research (2025-2026) splits a single attention layer's KV transfer across multiple heads concurrently, hiding more latency on very deep models (Llama-3-405B at 126 layers, DeepSeek-V3 at 61 layers with high head count). The implementation cost is real — kernel-level coordination between the prefill compute and the send buffers — and gains are 5-15% on the longest contexts. Production stacks have not adopted it widely yet because the engineering overhead outpaces the win on typical workloads, but the technique is likely to land in mainline vLLM and TRT-LLM during 2026 for long-context-heavy deployments.

Differentiated per-phase SLAs

The current production model has one SLA (TTFT, ITL). A more refined approach has separate SLAs for prefill (TTFT contribution alone) and decode (ITL plus total generation time), with the router making routing decisions accordingly — for example, sending latency-sensitive short requests to a fast-prefill pool and long-context requests to a high-HBM prefill pool. The framing is correct; the implementation is fiddly because most user-facing SLAs are end-to-end and customers do not perceive the split. Watch this space — providers that get it right will offer pricing tiers that map cleanly to the underlying phase costs.

Long-context-aware admission control

A 200k-token prompt is not just 200x more expensive than a 1k prompt — it is qualitatively different. Attention is quadratic in sequence length on the prefill side; KV cache grows linearly but its absolute size approaches the model itself. Admission control for long contexts is its own discipline: separate queue, separate priority class, separate pool of workers with extended HBM. Mixing 1M-token requests into the same prefill queue as 1k-token chat requests destroys queueing fairness. The standard pattern in production is two queues, one for "long" (above some threshold like 32k or 100k tokens) and one for "short", with the long queue served by a smaller dedicated pool. See our long-context attention guide for the underlying mechanics.

Cross-region disaggregation: pitfalls

A few teams have tried disaggregating across regions — prefill in one region, decode in another — to take advantage of cheaper capacity in one location. The math rarely works. Inter-region latency floors are 30-100 ms on dedicated fiber, 100-300 ms on public Internet. That latency adds directly to TTFT and gets paid every layer if you try to stream. Even with aggressive KV compression and parallel streams, the user-perceived TTFT degrades enough that the cost savings are washed out by SLA violations. The exception is async batch workloads with no latency SLA, where cross-region disaggregation can be a real cost play.

Disaggregating the tokenizer

Tokenization runs on CPU and is rarely the bottleneck, but at the long-prompt end of the distribution it becomes one. A 100k-token prompt with a slow tokenizer (BPE in pure Python, or SentencePiece on a weak core) can take 30-100 ms before any GPU work starts. Some production stacks run a dedicated tokenizer pool on cheap CPU nodes, with the request flowing tokenizer → prefill → decode. The handoff serializes already-tokenized arrays so the prefill worker never touches the raw string. Worth it only if you have measured tokenizer as a bottleneck — for most workloads, a fast Rust tokenizer (tokenizers crate from Hugging Face) embedded in the prefill worker is sufficient.

KV-cache offload tiers

For decode-side capacity beyond what HBM holds, a three-tier hierarchy, optionally backed by a fourth frozen tier, is standard at scale:

Tier | Latency to access | Capacity per node | Use
HBM (hot) | < 1 µs | 80-256 GB per GPU | Active requests
CPU DRAM (warm) | 5-15 µs (PCIe) | 1-2 TB | Prefix cache, paused sessions
Local NVMe (cold) | 50-200 µs | 8-30 TB | Long-tail prefix cache, archived sessions
Remote object store (frozen) | 10-100 ms | unbounded | Audit logs, long-paused sessions

Hot-to-warm movement is essentially free at the rates production workloads need; warm-to-cold is workload-tuned (LRU with a frequency floor). The frozen tier exists mostly for multi-day sessions and compliance, not for latency-sensitive serving. Most teams skip the frozen tier entirely and just rebuild from prompt if a session ages out.

Multi-tenant fairness in a disaggregated pool

Production deployments serve many tenants from a shared pool. Without explicit fairness, a single tenant with bursty long-prompt traffic can starve the prefill queue for everyone. The standard mitigations are weighted fair queueing in the router (each tenant has a quota of in-flight prompt tokens), per-tenant KV-cache budgets in the decode pool (to prevent one tenant from consuming all KV slots), and priority classes for paid vs free tiers. Implementing this correctly is the difference between a serving stack that works on a benchmark and one that survives production. See our eval infrastructure post for how to load-test fairness explicitly.
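The per-tenant quota on in-flight prompt tokens is the simplest of these mitigations to sketch. The class below covers only that quota check, not full weighted fair queueing, and its names are hypothetical.

```python
from collections import defaultdict
from typing import Dict


class TenantTokenQuota:
    """Caps each tenant's in-flight prompt tokens so one tenant cannot fill the prefill queue."""

    def __init__(self, quotas: Dict[str, int], default_quota: int) -> None:
        self.quotas = quotas
        self.default_quota = default_quota
        self.in_flight: Dict[str, int] = defaultdict(int)

    def try_admit(self, tenant: str, prompt_tokens: int) -> bool:
        limit = self.quotas.get(tenant, self.default_quota)
        if self.in_flight[tenant] + prompt_tokens > limit:
            return False  # caller queues or rejects; other tenants are unaffected
        self.in_flight[tenant] += prompt_tokens
        return True

    def finish_prefill(self, tenant: str, prompt_tokens: int) -> None:
        """Release the tenant's tokens once prefill for the request completes."""
        self.in_flight[tenant] = max(0, self.in_flight[tenant] - prompt_tokens)
```

Priority classes map naturally onto the `quotas` dictionary: paid tenants get a larger entry, and everyone else falls back to `default_quota`.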


KV transfer mechanics: NIXL, NCCL, RDMA, GDRCopy

The KV transfer from prefill GPU to decode GPU is the operation that makes or breaks disaggregated serving. Done well, it's a few-millisecond hop on the critical path. Done badly, it eats the entire latency budget and disaggregation becomes a regression.

NIXL (NVIDIA Inference Xfer Library)

NIXL is NVIDIA's library for inference KV transfer, released in 2025 as part of the Dynamo serving stack. It handles GPU-to-GPU KV migration with support for both NVLink (same-node) and RDMA (cross-node) backends transparently. The API allows registration of KV tensor regions, asynchronous initiation of transfer, and explicit completion signals.

NIXL's distinguishing feature: per-layer streaming. Instead of waiting for prefill to complete and transferring the entire KV at once, NIXL streams each layer's KV to the decode GPU as soon as the prefill layer completes. The decode GPU starts processing once layer 0's KV arrives, overlapping transfer with decode start-up. For long prefills, this reduces TTFT (time-to-first-token) by 30-50%.

NCCL

NCCL is the standard library for inter-GPU communication in training. For disaggregated inference, NCCL collectives can transfer KV between GPUs, but the API is awkward (collectives are designed for symmetric all-reduce patterns, not asymmetric point-to-point). NCCL with ncclSend / ncclRecv is the most-used path for hand-rolled disaggregation; it works but is less efficient than NIXL on the same hardware.

RDMA

For cross-node transfer, RDMA (Remote Direct Memory Access) over InfiniBand or RoCE (RDMA over Converged Ethernet) is the standard. RDMA bypasses the kernel and writes directly to remote GPU memory, achieving 100-400 Gbps bandwidth and 1-2 µs latency. NVIDIA's GPUDirect RDMA enables direct GPU-to-NIC paths, eliminating the CPU bounce that older patterns required.

GDRCopy

For small KV transfers (<128 KB), GDRCopy provides direct memory-mapped GPU access from CPU, useful for control-plane operations (sending request metadata alongside the KV). Not for bulk KV transfer; the bandwidth is much lower than direct GPU-to-GPU paths.

Transfer mechanism comparison

Mechanism | Same-node bandwidth | Cross-node bandwidth | Best for
NVLink (intra-node) | 900 GB/s (H100) / 1.8 TB/s (B200) | n/a | Same-server disagg
NIXL on NVLink | 800-850 GB/s | n/a | Same-server, production
NIXL on RDMA | n/a | 100-400 Gbps | Cross-server disagg
NCCL P2P | 600-800 GB/s (NVLink) | 100-200 Gbps (RDMA) | Custom stacks
GPUDirect RDMA | n/a | 100-400 Gbps | Manual cross-node
TCP/IP | n/a | 10-25 Gbps | Fallback only

The pragmatic stack: NIXL on NVLink for same-node, NIXL on RDMA for cross-node, NCCL as fallback for stacks that haven't adopted NIXL. Avoid TCP/IP transfer; bandwidth is roughly an order of magnitude lower than RDMA and adds latency that erases the disaggregation benefit.
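Because the table mixes GB/s (NVLink) and Gbps (network links), it helps to convert before comparing. The sketch below estimates bulk transfer time for the 2.7 GB cache of an 8k-token 70B request; it ignores per-transfer latency and protocol overhead, and the helper names are hypothetical.

```python
def gbps_to_gb_per_s(gbps: float) -> float:
    """Network links are quoted in gigabits per second; divide by 8 for gigabytes."""
    return gbps / 8.0


def transfer_ms(payload_gb: float, link_gb_per_s: float) -> float:
    """Bulk transfer time in milliseconds at the given sustained rate."""
    return payload_gb / link_gb_per_s * 1000.0


KV_8K_70B_GB = 2.7  # per-request KV cache at 8k tokens, FP16
for name, rate in [
    ("NVLink (H100), 900 GB/s", 900.0),
    ("RDMA, 400 Gbps", gbps_to_gb_per_s(400)),
    ("RDMA, 100 Gbps", gbps_to_gb_per_s(100)),
    ("TCP/IP, 25 Gbps", gbps_to_gb_per_s(25)),
]:
    print(f"{name}: {transfer_ms(KV_8K_70B_GB, rate):.0f} ms")
```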


P/D ratio optimization: workload-driven sizing

The fundamental design decision in disaggregated serving: how many prefill GPUs vs decode GPUs. The ratio depends on the workload.

Workload categories

  • Chat-style (short prompts, longer responses): average prompt ~200 tokens, response ~500 tokens. Decode dominates total compute. P:D ratio ~1:8 (one prefill GPU per 8 decode GPUs).
  • Agent-style (long prompts with tools, short responses): average prompt ~2000 tokens (system prompt + tool defs + history), response ~100 tokens. Prefill dominates. P:D ratio ~1:2 to 1:4.
  • RAG (long retrieved context, medium response): average prompt ~4000 tokens (retrieved chunks), response ~300 tokens. Roughly balanced. P:D ratio ~1:4.
  • Reasoning models (medium prompts, very long thinking chains): prompt ~500 tokens, response ~5000 tokens (including thinking). Decode-heavy. P:D ratio ~1:12 to 1:16.
  • Long-context summarization (very long prompts, short responses): prompt ~100K tokens, response ~1K tokens. Prefill dominates strongly. P:D ratio ~1:1 or even 2:1 (more prefill than decode).

Dynamic adjustment

Production stacks (Mooncake, DistServe) implement dynamic P/D ratio adjustment based on real-time queue lengths. If prefill queue builds up, spin up more prefill workers (or convert decode workers to prefill, where the hardware supports it). If decode queue builds up, the reverse.

The conversion is not free — switching a GPU from prefill to decode requires draining in-flight requests and reloading the model in a new configuration. Typical cycle time: 30-60 seconds. Production deployments treat P/D ratio as a slowly-changing variable, adjusted on minute-to-hour timescales rather than per-request.
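A minimal sketch of queue-driven rebalancing with a cooldown, in the spirit of the slow adjustment described above. The class, function, and every threshold (`cooldown_s`, `imbalance`, `min_backlog`) are hypothetical illustrations, not the policy of Mooncake, DistServe, or any other stack.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    prefill_workers: int
    decode_workers: int
    last_change_s: float = float("-inf")  # time of the last conversion


def rebalance(state: PoolState, prefill_queue: int, decode_queue: int, now_s: float,
              cooldown_s: float = 300.0, imbalance: float = 2.0,
              min_backlog: float = 4.0, min_workers: int = 1) -> str:
    """Convert at most one worker toward the more backlogged pool.

    All thresholds are illustrative assumptions.
    """
    if now_s - state.last_change_s < cooldown_s:
        return "hold"  # conversions are expensive, so rate-limit them

    p_load = prefill_queue / state.prefill_workers  # queued requests per worker
    d_load = decode_queue / state.decode_workers

    if (p_load >= min_backlog and p_load > imbalance * d_load
            and state.decode_workers > min_workers):
        state.decode_workers -= 1
        state.prefill_workers += 1
        state.last_change_s = now_s
        return "decode->prefill"

    if (d_load >= min_backlog and d_load > imbalance * p_load
            and state.prefill_workers > min_workers):
        state.prefill_workers -= 1
        state.decode_workers += 1
        state.last_change_s = now_s
        return "prefill->decode"

    return "hold"


state = PoolState(prefill_workers=2, decode_workers=8)
print(rebalance(state, prefill_queue=40, decode_queue=8, now_s=0.0), state)
```

The cooldown is what keeps the ratio a slowly-changing variable: a burst that clears within the cooldown window never triggers a second conversion.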

Static sizing example

Take a workload mix of 70% chat (1:8 ratio), 20% agent (1:3), and 10% RAG (1:4). The weighted average prefill-per-decode ratio is 0.7 × 1/8 + 0.2 × 1/3 + 0.1 × 1/4 = 0.088 + 0.067 + 0.025 = 0.18, so each prefill GPU pairs with ~5.6 decode GPUs.

A 64-GPU cluster sized for this mix: ~10 prefill, ~54 decode. Adjust based on actual queue lengths after observing production traffic for a week or two.
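The static sizing arithmetic can be reproduced in a few lines. This sketch follows the same simplification as the worked example (a traffic-share-weighted average of per-class P:D ratios); the function name and data layout are illustrative.

```python
def split_cluster(mix, total_gpus):
    """Split a cluster into prefill and decode GPUs.

    mix: list of (traffic_share, decode_gpus_per_prefill_gpu) pairs,
         e.g. (0.7, 8) for 70% of traffic at a 1:8 P:D ratio.
    Returns (prefill_per_decode_ratio, prefill_gpus, decode_gpus).
    """
    prefill_per_decode = sum(share / decode_per_prefill
                             for share, decode_per_prefill in mix)
    prefill_fraction = prefill_per_decode / (1 + prefill_per_decode)
    prefill_gpus = round(total_gpus * prefill_fraction)
    return prefill_per_decode, prefill_gpus, total_gpus - prefill_gpus


# 70% chat (1:8), 20% agent (1:3), 10% RAG (1:4) on a 64-GPU cluster.
ratio, prefill, decode = split_cluster([(0.7, 8), (0.2, 3), (0.1, 4)], total_gpus=64)
print(f"prefill per decode: {ratio:.3f}, prefill GPUs: {prefill}, decode GPUs: {decode}")
```

Treat the result as a starting point only; the queue-depth observations described above should override it once real traffic is available.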


Per-stack support: SGLang, vLLM, TRT-LLM, DistServe, Splitwise, Mooncake

SGLang

SGLang has the most mature open-source disaggregation support in 2026. The disaggregation mode launches separate prefill and decode worker pools, routes requests appropriately, and handles KV transfer via NCCL or NIXL. Configuration is via flags; no need to modify the model code.

Performance numbers (SGLang 0.4, Llama-3-70B, May 2026): 1.8-2.4× throughput improvement vs SGLang co-located on the same hardware count, depending on workload mix.

vLLM

vLLM has a "disaggregation prototype" in v0.8 — functional but not production-recommended. The current limitation: prefill and decode workers run as separate vLLM instances with manual configuration, and KV transfer goes through a slower path. Expected to mature in vLLM 0.9 / 1.0.

TRT-LLM

TRT-LLM's disaggregation is part of NVIDIA's broader Dynamo serving stack. The integration is tight: TRT-LLM engines for prefill and decode, NIXL for KV transfer, Triton inference server for routing. Performance leads the open ecosystem (typically 20-40% throughput advantage over SGLang at matched configuration) but the deployment is NVIDIA-only and more complex to operate.

DistServe

DistServe (Zhong et al., 2024) is the academic reference paper. The implementation is open-source but not actively maintained as a production stack; the ideas have been absorbed into SGLang, Mooncake, and TRT-LLM.

Splitwise

Splitwise (Patel et al., Microsoft, 2024) is Microsoft's production disaggregation system, used in Azure OpenAI Service. Not open-source; details published in the paper. Splitwise's distinguishing feature: heterogeneous hardware (prefill on the GPUs with the most compute, decode on lower-cost GPUs whose memory bandwidth is still sufficient).

Mooncake

Mooncake (Qin et al., Moonshot AI, 2024) is Moonshot AI's open-sourced disaggregation stack, used for Kimi K2 serving. Distinguishing feature: distributed KV cache pool (KV stored across the cluster, not pinned to specific decode GPUs). Mooncake's KV pool design influenced subsequent stacks.

Stack feature matrix

| Stack | Open source | Same-node disagg | Cross-node disagg | KV transfer | Production-ready |
| --- | --- | --- | --- | --- | --- |
| SGLang 0.4 | Yes | Yes | Yes | NCCL/NIXL | Yes |
| vLLM 0.8 | Yes | Beta | Beta | Custom | Beta |
| TRT-LLM 0.18 | Partial | Yes | Yes | NIXL | Yes |
| DistServe | Yes (academic) | Yes | Yes | NCCL | Reference |
| Splitwise | No (paper) | Yes | Yes | Custom | Yes (Azure-internal) |
| Mooncake | Yes | Yes | Yes | NIXL/custom | Yes (Moonshot prod) |
| TGI | Yes | No | No | n/a | n/a |

Cost math worked example: prefill + decode pool sizing

A concrete sizing exercise: serve Llama-3-70B at 1000 QPS, mixed chat workload (average prompt 500 tokens, response 600 tokens).

Co-located baseline

Each request needs ~1.5 seconds of prefill compute on H100 + ~10 seconds of decode at average response length. With continuous batching, an H100 SXM at FP8 can serve ~250 tokens/sec total throughput in a co-located setup. At 1000 QPS × 1100 tokens/req = 1.1M tokens/sec total. Required GPUs: 1.1M / 250 = ~4400 H100s. At $30/hour each = $132K/hour cluster cost.

Disaggregated layout

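Split into prefill pool and decode pool. Workload analysis: a dense 70B model needs roughly 2 FLOPs per parameter per token, or ~140 GFLOP per prompt token, so prefill demand is 500 prompt tokens × 140 GFLOP × 1000 QPS ≈ 70 PFLOP/s. Decode involves a similar number of FLOPs per token but is limited by memory bandwidth rather than compute, so the decode pool is sized by per-stream throughput instead.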

Sizing prefill pool: 1 H100 prefills 500 tokens in ~0.5 sec at FP8 with continuous batching at batch 4. So 1 H100 serves 8 prefills/sec. 1000 QPS / 8 = 125 prefill H100s.

Sizing decode pool: 1 H100 decodes at ~50 tokens/sec per stream at batch 32. Total decode throughput needed: 1000 QPS × 600 tokens = 600K tokens/sec. 600K / 50 = 12000 concurrent decode streams. With batch 32, that's 12000 / 32 = 375 decode H100s.

Total disaggregated: 125 + 375 = 500 H100s. Vs 4400 co-located.

But — that calculation is too optimistic. The 250 tokens/sec co-located number already accounts for some prefill/decode interleaving. Realistic disagg savings vs co-located in production: 2-3×. So ~1500 H100s instead of 4400.
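The idealized arithmetic above, before the realism correction, fits in a short script. Every input is a number quoted in this worked example; the function names are illustrative, and the model ignores queueing, load imbalance, and KV transfer time.

```python
import math


def colocated_gpus(qps, tokens_per_req, tokens_per_sec_per_gpu):
    """GPUs needed when each GPU handles both phases."""
    return math.ceil(qps * tokens_per_req / tokens_per_sec_per_gpu)


def prefill_gpus(qps, prefill_batch, prefill_seconds):
    """GPUs needed if each GPU finishes `prefill_batch` prompts every `prefill_seconds`."""
    prefills_per_sec_per_gpu = prefill_batch / prefill_seconds
    return math.ceil(qps / prefills_per_sec_per_gpu)


def decode_gpus(qps, output_tokens, tokens_per_sec_per_stream, decode_batch):
    """GPUs needed to hold all concurrent decode streams at a given batch size."""
    concurrent_streams = qps * output_tokens / tokens_per_sec_per_stream
    return math.ceil(concurrent_streams / decode_batch)


QPS = 1000
colocated = colocated_gpus(QPS, tokens_per_req=500 + 600, tokens_per_sec_per_gpu=250)
prefill = prefill_gpus(QPS, prefill_batch=4, prefill_seconds=0.5)
decode = decode_gpus(QPS, output_tokens=600, tokens_per_sec_per_stream=50, decode_batch=32)

print(f"co-located:    {colocated} H100s (${colocated * 30:,}/hour at $30/GPU-hour)")
print(f"disaggregated: {prefill} prefill + {decode} decode = {prefill + decode} H100s")
```

Changing `decode_batch` or `tokens_per_sec_per_stream` shows how sensitive the decode pool is to those two inputs, which is one reason the idealized total should not be taken at face value.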

Realistic numbers

For a 1000 QPS Llama-3-70B chat workload:

| Configuration | GPUs | Hourly cost |
| --- | --- | --- |
| Co-located vLLM (baseline) | ~4400 H100 | $132K |
| Co-located with full optimization (FA3, FP8, prefix cache) | ~2200 H100 | $66K |
| Disaggregated (same-node) | ~1500 H100 | $45K |
| Disaggregated (cross-node, Mooncake-style) | ~1200 H100 | $36K |
| Disaggregated + spec decoding (EAGLE-2) | ~800 H100 | $24K |

The disaggregation win is ~30-45% over fully-optimized co-located (2200 H100s down to 1200-1500). Combined with speculative decoding, the cost reduction reaches roughly 5.5× vs naive serving (4400 H100s down to ~800).


Mixed B200/H200 pools and disaggregation

A 2026 design pattern: heterogeneous pools optimized for each phase.

B200 for decode, H100/H200 for prefill

B200's higher HBM bandwidth (8 TB/s vs 3.35 TB/s on H100) makes it the better decode GPU — decode is bandwidth-bound. B200's compute (FP8 around 4.5 PFLOPs) is also higher, but the compute advantage matters less for decode than the bandwidth.

H100 / H200 remain capable for prefill — prefill is compute-bound, and H100's compute is "enough" for most workloads. Using older H100s for prefill while reserving B200s for decode-only optimizes the per-token cost.

H200 for both, with KV migration

H200's 141 GB HBM is enough for both phases of most workloads. A mixed pool of all H200s with dynamic P/D allocation (any GPU can serve either phase) is simpler operationally than maintaining separate hardware pools.

When to use mixed vs homogeneous

Mixed pools win when (a) the workload has stable P/D ratio, (b) the cost differential between GPU types is significant, and (c) the operational complexity is justified by the scale.

Homogeneous pools win when (a) workload mix is variable, (b) operational simplicity is prioritized, or (c) scale doesn't justify the additional procurement complexity.

Most 2026 production stacks at hyperscale (Azure, AWS Bedrock, GCP Vertex) use mixed pools. Most enterprise on-prem deployments use homogeneous pools.


Prefix caching with disaggregation

Prefix caching (storing KV for common prefixes) interacts non-trivially with disaggregation.

The decision: cache where?

Three options. (1) Cache on prefill GPUs (the natural place — prefill produces the KV). On cache hit, prefill is skipped, and only the suffix is computed and transferred to decode. (2) Cache on decode GPUs (the natural place to use it — decode consumes the KV). On cache hit, decode skips waiting for transfer; only the suffix is transferred. (3) Distributed cache pool (Mooncake-style), accessible from both. On cache hit, the consuming GPU pulls from the pool.

Each has trade-offs. Option 1 (prefill cache) is simplest; the prefill GPU is the natural producer. Option 2 (decode cache) has lower TTFT on hits but requires the decode GPU to maintain a large cache. Option 3 (pool) scales best at large fleet sizes but adds infrastructure complexity.

The "transfer or share" decision

For very common prefixes (system prompts shared across users), the prefix's KV may be replicated across all decode GPUs to avoid per-request transfer. For less common prefixes (per-conversation history), single-copy storage with on-demand transfer is more memory-efficient.

Production heuristic: replicate prefixes with >10% hit rate; single-copy for the rest. Heuristic boundaries are workload-specific; tune based on production trace.
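The replicate-or-single-copy heuristic reduces to a threshold on per-prefix hit rate. A minimal sketch, assuming hit counts come from a production trace; the function name, prefix identifiers, and counts are hypothetical, and the 10% default is the starting point mentioned above, not a tuned value.

```python
def plan_prefix_placement(prefix_hits: dict[str, int], total_requests: int,
                          replicate_above: float = 0.10) -> dict[str, str]:
    """Decide, per prefix, whether to replicate its KV on every decode GPU.

    prefix_hits maps a prefix id to the number of requests that reused it.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    plan = {}
    for prefix_id, hits in prefix_hits.items():
        hit_rate = hits / total_requests
        plan[prefix_id] = "replicate" if hit_rate > replicate_above else "single-copy"
    return plan


# Hypothetical trace: one shared system prompt and two per-conversation prefixes.
trace = {"system-prompt-v3": 6200, "conv-8841": 40, "conv-1207": 12}
print(plan_prefix_placement(trace, total_requests=10_000))
```

In practice the threshold would also weigh prefix length, since replicating a long prefix costs more HBM per decode GPU than a short one.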

Performance impact

Prefix caching combined with disaggregation typically reduces total compute by 30-60% for workloads with repeated prefixes (most chat, agentic). The combination is greater than the sum of parts: disaggregation makes the prefill GPU available for non-cached prefills while cached requests bypass prefill entirely.


Speculative decoding with disaggregation

Speculative decoding uses a small draft model to propose tokens and a large target model to verify. The pattern composes well with disaggregation.

Target on decode pool

The verification step (large model evaluating N proposed tokens) runs on the decode pool, batched alongside normal decode. The arithmetic intensity is higher than normal decode (N tokens instead of 1), which is favorable for the decode pool's bandwidth-bound regime.

Draft co-located or on dedicated hardware

The draft model (typically 1-10% of target size) runs either co-located with target on decode, or on dedicated draft hardware. Co-located is simpler; dedicated draft hardware is more efficient at hyperscale.

Combined speedup

Speculative decoding alone delivers 1.5-3× decode throughput. Disaggregation alone delivers 1.5-2× over co-located. Together: 2.5-5× over naive co-located. The gains compound because they address different bottlenecks (verification efficiency vs phase separation), which is why the combined range sits close to the straight product of the two (2.25-6×).

EAGLE-2 integration

EAGLE-2 is a state-of-the-art speculative decoder. Production stacks (SGLang, TRT-LLM) integrate EAGLE-2 into the decode pool with negligible code changes. The draft network adds ~3% overhead to decode steps and accepts 3-6 tokens per verification on average, yielding 2-3× decode speedup.


Reasoning models with disaggregation

Reasoning models (o1, o3, R1, Claude Opus thinking mode) emit long thinking chains before final answers. Thinking tokens are decode-style work, so the workload is extremely decode-heavy.

P/D ratio for reasoning

Typical P/D ratio for reasoning workloads: 1:12 to 1:16. The prompt is normal-length (a few hundred to a few thousand tokens); the response is very long (often 5K-30K thinking tokens plus a short final answer).

Decode pool sizing

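The decode pool needs to hold KV for very long sequences during thinking. With FP8 KV, a Llama-3-70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128) stores about 160 KB per token, so 30K thinking tokens cost ~5 GB of KV per request; architectures without grouped-query attention need several times more. The decode batch per GPU is whatever fits in the HBM left after weights, which is why an H200 (141 GB) sustains a noticeably larger reasoning batch than an H100 (80 GB).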

The throughput economics of reasoning serving are worse than chat — same model produces fewer requests per GPU per hour because each request takes longer. Pricing for reasoning models reflects this (OpenAI o1 is several times more expensive per output token than GPT-4o).

Truncated thinking

Production stacks expose thinking-budget caps. Truncate thinking after N tokens, force the model to emit a final answer with whatever reasoning was completed. This limits the worst case of pathologically long thinking chains that pin decode-pool capacity.


Reference designs: Mooncake, DistServe, Splitwise

Mooncake architecture (Moonshot AI, 2024)

Distinguishing features: (1) Distributed KV cache pool — KV not tied to specific decode GPUs but pooled across the cluster. (2) Layer-wise streaming KV transfer — overlap transfer with prefill completion. (3) Heterogeneous-aware scheduling — routes requests to the GPU with the right KV pool affinity and the right hardware.

Result: 5-10× throughput improvement over baseline on Kimi K2's serving workload. The win is partly disaggregation, partly prefix caching at the pool level, partly hardware-aware routing.

DistServe (Peking University, UC San Diego, 2024)

The seminal academic paper on disaggregation. Key contribution: formal characterization of "goodput" — throughput that meets SLAs — and an optimization framework for P/D resource allocation. Showed 4.5× goodput improvement vs co-located on production workloads.

Splitwise (Microsoft, 2024)

Microsoft's production system. Key contribution: heterogeneous hardware utilization. Compute-bound prefill runs on the newest GPUs (H100), while memory-bound decode runs on older, cheaper GPUs (A100) whose memory bandwidth is still adequate. Cost savings come from extending the useful life of older hardware.

Common architectural elements

All three converge on similar patterns: separate worker pools, KV transfer via RDMA, dynamic routing based on queue depth, KV cache locality awareness. The differences are in implementation details (KV pool architecture, scheduler signals) rather than fundamental design.


Failure handling in disaggregated serving

Disaggregation introduces failure modes that co-located serving doesn't have.

Decode pod failure mid-request

A request is mid-generation on a decode GPU when the GPU fails. Options: (a) abort the request, return error to user; (b) replay from beginning on a new decode GPU; (c) restore from a saved KV checkpoint. Production stacks typically choose (a) or (b); checkpoint-based recovery is rare due to the cost of frequent KV snapshots.

Prefill pool overflow

Prefill queue exceeds capacity. Strategies: shed load (reject new requests with 503), spill to decode pool (convert idle decode GPUs to prefill temporarily), or extend wait time. Production stacks combine load shedding with graceful queue depth limits.

KV transfer failure

NIXL/NCCL transfer fails (network issue, GPU error). Recovery: retry transfer once, then fail the request. Robust stacks track per-link failure rates and route around persistent issues.
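A minimal sketch of the retry-once-then-fail policy with per-link failure tracking. `send` stands in for whatever call actually moves the KV; `TransferError`, the link identifiers, and the function name are hypothetical and do not correspond to any NIXL or NCCL API.

```python
class TransferError(Exception):
    """Raised by the (hypothetical) transport when a KV transfer fails."""


def transfer_with_retry(send, link_id, link_failures, max_retries=1):
    """Call `send()`; on failure retry up to `max_retries` times, then give up.

    link_failures is a dict of failure counts per link, updated in place so a
    router can steer traffic away from links that fail persistently.
    """
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return send()
        except TransferError as err:
            link_failures[link_id] = link_failures.get(link_id, 0) + 1
            last_error = err
    raise TransferError(
        f"KV transfer on {link_id} failed after {max_retries + 1} attempts"
    ) from last_error


# Usage: a transfer that fails once, then succeeds on the retry.
attempts = {"count": 0}

def flaky_send():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TransferError("transient network error")
    return "kv-delivered"

failures = {}
print(transfer_with_retry(flaky_send, "prefill0->decode3", failures), failures)
```

Keeping the failure counts outside the retry function lets the scheduler own the routing decision, separate from the transport.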

Decode pool memory pressure

KV memory across decode pool fills up. Mitigations: more aggressive prefix-cache eviction, in-flight request preemption (suspend, save KV elsewhere, resume), load shedding. Mooncake's distributed KV pool spreads pressure across the fleet; non-pooled designs are more vulnerable.

SLA preservation under partial failure

If 10% of decode GPUs fail, total decode capacity drops 10% but per-request SLAs should not change. Production stacks maintain headroom (typically 20-30% over peak demand) to absorb partial failures without violating SLAs. Cost-conscious deployments run tighter; SLA-conscious deployments run looser.
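The headroom argument can be checked numerically. In this sketch the 375-GPU peak is a hypothetical figure, and capacity is assumed to scale linearly with GPU count, which ignores batching effects.

```python
import math


def provisioned_gpus(peak_gpus: int, headroom: float) -> int:
    """GPUs to deploy for a given peak demand and fractional headroom (0.25 = 25%)."""
    return math.ceil(peak_gpus * (1 + headroom))


def survives_failure(provisioned: int, failed_fraction: float, peak_gpus: int) -> bool:
    """True if the GPUs left after a partial failure still cover peak demand."""
    remaining = math.floor(provisioned * (1 - failed_fraction))
    return remaining >= peak_gpus


peak = 375  # hypothetical peak decode demand, in GPUs
for headroom in (0.05, 0.25):
    fleet = provisioned_gpus(peak, headroom)
    ok = survives_failure(fleet, failed_fraction=0.10, peak_gpus=peak)
    print(f"headroom {headroom:.0%}: {fleet} GPUs, survives a 10% loss: {ok}")
```

Under these assumptions a 25% margin absorbs a 10% loss at peak, while a 5% margin does not.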


P/D scheduling: per-request routing and signals

The scheduler routes each request to the appropriate prefill GPU and (after prefill) to the appropriate decode GPU. Signals used:

Queue length

Route to the GPU with the shortest queue. Simple, effective at moderate loads. Breaks at scale when GPUs have heterogeneous capacity — a long queue on a fast GPU may finish before a short queue on a slow GPU.

Latency-aware scheduling

Estimate completion time for each candidate GPU based on queue depth, request size, and historical latency. Route to minimize estimated TTFT. More accurate than queue-length alone; requires latency tracking infrastructure.

KV-cache affinity

If the request shares a prefix with cached content on a specific GPU, route there to hit the cache. Trade-off: cache hit saves prefill; non-uniform routing may overload some GPUs.
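The first three signals can be folded into one estimate: time to first token on each candidate worker, given its backlog, its measured throughput, and how much of the prompt it already has cached. The sketch below is a simplified illustration with hypothetical names and a linear cost model, not the scheduler of any particular stack.

```python
from dataclasses import dataclass


@dataclass
class PrefillWorker:
    name: str
    queued_tokens: int     # prompt tokens already waiting on this worker
    tokens_per_sec: float  # measured prefill throughput


def estimated_ttft_s(worker: PrefillWorker, prompt_tokens: int,
                     cached_tokens: int = 0) -> float:
    """Backlog plus the uncached part of this prompt, divided by throughput."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    return (worker.queued_tokens + uncached) / worker.tokens_per_sec


def pick_worker(workers, prompt_tokens, cached_by_worker=None):
    """Route to the worker with the lowest estimated TTFT.

    cached_by_worker maps worker name to the number of prefix tokens of this
    request already cached there (missing names mean no cache hit).
    """
    cached_by_worker = cached_by_worker or {}
    return min(
        workers,
        key=lambda w: estimated_ttft_s(w, prompt_tokens,
                                       cached_by_worker.get(w.name, 0)),
    )


workers = [
    PrefillWorker("fast-long-queue", queued_tokens=8000, tokens_per_sec=10_000),
    PrefillWorker("slow-short-queue", queued_tokens=2000, tokens_per_sec=2_000),
    PrefillWorker("fast-with-cache", queued_tokens=9000, tokens_per_sec=10_000),
]
choice = pick_worker(workers, prompt_tokens=2000,
                     cached_by_worker={"fast-with-cache": 1800})
print(choice.name)
```

In this example the worker with the longest queue wins because it is fast and already holds most of the prefix, which is the case pure queue-length routing gets wrong.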

Workload-class routing

Route reasoning requests to decode-pool GPUs with longer-sequence capacity; route chat requests to higher-batch-throughput GPUs. Heterogeneous routing requires classifier (typically a small model classifying request intent) at the front of the pipeline.

Per-request priority

Premium-tier users routed to faster (less-loaded) GPUs. Free-tier batched with longer queues. Common in commercial deployments; the scheduling logic is straightforward but the policies are operationally complex.


Cross-rack disaggregation

When the prefill and decode pools span multiple racks (separate NVLink domains), KV transfer must traverse inter-rack networking.

Network requirements

Inter-rack bandwidth needs: for a KV transfer of 5 GB (roughly a 16K-token prompt for a 70B model with grouped-query attention at FP16) at acceptable latency (<20 ms), aggregate bandwidth must be ≥2 Tbps (5 GB × 8 bits / 0.02 s). Spread across the GPUs of an 8-GPU node, that is 200+ Gbps per GPU on the inter-rack network, typically achieved via 400 Gbps InfiniBand or 800 Gbps RoCE.
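The bandwidth requirement is payload size over deadline. A small sketch of that arithmetic, assuming the transfer is striped evenly across 8 NICs (one per GPU in a node); the NIC count is an assumption made here for illustration.

```python
def required_gbps(kv_bytes: float, deadline_s: float) -> float:
    """Aggregate bandwidth in Gbps needed to move kv_bytes within deadline_s."""
    return kv_bytes * 8 / deadline_s / 1e9


kv_bytes = 5e9      # 5 GB KV payload
deadline_s = 0.020  # 20 ms transfer budget
nics = 8            # assumption: one NIC per GPU, evenly striped

aggregate = required_gbps(kv_bytes, deadline_s)
print(f"aggregate: {aggregate:.0f} Gbps, per NIC: {aggregate / nics:.0f} Gbps")
```

Relaxing the deadline or shrinking the payload (for example with FP8 KV on the wire) lowers the requirement proportionally.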

When cross-rack disagg makes sense

  • Decode pool needs HBM that exceeds a single rack's capacity (e.g., serving 70B with 1M context on a fleet larger than one NVL72 unit).
  • Cost optimization where prefill and decode racks are in different cost zones.
  • Resilience: spread workload across racks to survive rack-level failures.

When it doesn't

For most production workloads, single-rack disaggregation (prefill and decode within one NVL72 rack, or even a single DGX H100 node) suffices. Cross-rack adds latency and bandwidth cost that pays off only at extreme scale.


Observability for disaggregation

Metrics that matter for disaggregated serving:

Per-phase latency split

TTFT (time-to-first-token) = prefill latency + KV transfer latency + decode start-up latency. Track each component independently. A regression in any one points to a different operational issue.
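Tracking the components separately makes regressions attributable. A minimal sketch, assuming each component is already exported as a per-request or per-window value in milliseconds; the dictionary keys and sample numbers are hypothetical.

```python
COMPONENTS = ("prefill_ms", "kv_transfer_ms", "decode_startup_ms")


def ttft_ms(sample: dict) -> int:
    """TTFT as the sum of its three tracked components."""
    return sum(sample[name] for name in COMPONENTS)


def largest_regression(baseline: dict, current: dict):
    """Return (component, delta_ms) for the component that grew the most."""
    deltas = {name: current[name] - baseline[name] for name in COMPONENTS}
    worst = max(deltas, key=deltas.get)
    return worst, deltas[worst]


baseline = {"prefill_ms": 180, "kv_transfer_ms": 8, "decode_startup_ms": 12}
current = {"prefill_ms": 185, "kv_transfer_ms": 45, "decode_startup_ms": 12}

print(ttft_ms(baseline), ttft_ms(current), largest_regression(baseline, current))
```

Here the 42 ms TTFT regression is almost entirely transfer time, which points at the network rather than at either pool.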

Transfer-time histogram

Histogram of KV transfer times. Mode should be sub-10ms on NVLink, sub-50ms on RDMA. Long tail indicates network congestion or competing traffic; investigate.

P/D queue depth

Independent queue depths for prefill and decode pools. Imbalance (one full, one empty) suggests P/D ratio is wrong for the current workload mix.

KV pool utilization

For Mooncake-style distributed KV pools, track per-GPU KV memory utilization. Hot spots indicate non-uniform prefix-cache distribution.

Stack-specific metrics

Track what each layer of the stack exposes: NIXL transfer success rate, NCCL collective timing, vLLM/SGLang scheduler queue depth, TRT-LLM engine load. Integrate with standard observability (Prometheus, Grafana) for unified dashboards.


The "fused KV" alternative: SARATHI and chunked prefill batching

Disaggregation is one way to escape the prefill/decode bottleneck. Chunked prefill is another — and the two are not mutually exclusive.

SARATHI / chunked prefill

SARATHI splits long prefills into chunks and interleaves chunk processing with decode operations. The result: prefill no longer monopolizes a GPU for the full prefill duration; decode operations make progress between chunks.

This is the "fused KV" approach — instead of separating prefill and decode onto different GPUs, run them on the same GPU but with finer-grained scheduling that prevents the prefill from blocking decode.
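The scheduling idea can be sketched as a per-step token budget: every step carries one token for each active decode, and whatever budget remains is filled with the next chunk of the pending prompt. This is a simplified illustration that assumes a fixed set of active decodes and a single pending prompt; the function name and the budget of 512 are hypothetical, and it is not SARATHI's actual scheduler.

```python
def plan_chunked_prefill(prompt_tokens: int, active_decodes: int, token_budget: int):
    """Plan steps that mix prefill chunks with ongoing decodes.

    Returns a list of (prefill_chunk_tokens, decode_tokens) per step. Every
    step serves all active decodes, so no decode stalls while the prompt
    is being prefilled.
    """
    if token_budget <= active_decodes:
        raise ValueError("token budget leaves no room for prefill chunks")
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, token_budget - active_decodes)
        steps.append((chunk, active_decodes))
        remaining -= chunk
    return steps


# A 2000-token prompt arriving while 28 requests are decoding, 512 tokens per step.
for step, (chunk, decodes) in enumerate(plan_chunked_prefill(2000, 28, 512)):
    print(f"step {step}: prefill {chunk} tokens + {decodes} decode tokens")
```

A smaller budget shortens each step, which protects inter-token latency for the decodes, at the cost of more steps before the new request emits its first token.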

Trade-offs vs disaggregation

Chunked prefill keeps the serving topology simple (no separate pools), works on commodity hardware (no special interconnect), and is easier to operate. The throughput win is smaller than disaggregation (typically 1.3-1.7× vs co-located) but the engineering cost is much lower.

For workloads where disaggregation would be marginal (small scale, simple workloads), chunked prefill is often the right answer. For hyperscale serving where every percent matters, disaggregation + chunked prefill combined wins.

Sparse Inference Serving (2024)

A newer approach that uses sparsity in the prefill to skip non-relevant context, reducing effective prefill cost. Less mature than chunked prefill but shows promise for long-context workloads where most of the context is irrelevant to the query.


2026 trends: B200 NVL72 and multi-DC

NVL72 reduces disagg ROI for some workloads

B200 NVL72 (72 GPUs in one NVLink domain, ~130 TB/s aggregate NVLink bandwidth at 1.8 TB/s per GPU) enables intra-domain disaggregation with bandwidth that is rarely the bottleneck. The transfer-cost component of disaggregation becomes negligible.

The flip side: NVL72's massive HBM (13.8 TB aggregate) makes large monolithic serving feasible. For workloads that fit in 13.8 TB (any model up to ~250B parameters at FP8 plus KV for thousands of concurrent users), a single NVL72 may serve everything without disaggregation.

The economic question: is the operational complexity of disaggregation worth it on hardware where co-located already scales to extreme batch sizes? For most enterprise deployments using NVL72-class hardware, the answer in 2026 is: no, simple co-located serving is enough. Hyperscalers still benefit from disaggregation for the long tail.

Multi-DC disaggregation

Some hyperscalers experiment with disaggregating prefill and decode across data centers connected by 100+ Gbps WAN. The economics: cheap power in one region for compute-intensive prefill, premium hardware in another region for bandwidth-intensive decode.

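Practical viability: limited by WAN bandwidth and latency. A KV transfer across 50 ms of WAN latency adds at least 50 ms to TTFT before any data moves, and each 1 GB of KV takes another 80 ms on a 100 Gbps link, which is often unacceptable for interactive traffic. Multi-DC disaggregation is mostly research, with limited production use (offline batch inference where TTFT doesn't matter).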

What B200 changes for disagg economics

Three concrete changes. (1) Per-GPU decode throughput rises ~2× over H100 due to higher HBM bandwidth, so fewer decode GPUs are needed for a given QPS — the decode pool shrinks. (2) Per-GPU prefill throughput rises ~2.5-3× due to FP8/FP4 tensor cores, so the prefill pool shrinks even more. (3) NVL72 changes the network topology — within an NVL72 domain, KV transfer is effectively free; across domains it requires deliberate routing.

The combined effect: B200 makes disaggregation less compelling at small-to-medium scale (just use NVL72 co-located) and more compelling at hyperscale (where multiple NVL72s are linked and the disaggregation can span domains efficiently). The hyperscale providers report continued ROI from disaggregation on B200 hardware; the enterprise providers report that NVL72 co-located is now sufficient for most workloads.

Looking ahead: 2026-2027

Three trends to watch. (1) Native KV-aware networking (compute-storage-class fabrics with KV-specific transfer primitives) reducing transfer overhead further. (2) Hardware support for cross-pool KV migration (NVIDIA Dynamo-class libraries with first-class scheduler integration). (3) Workload-specific disaggregation patterns (reasoning workloads with multi-stage decoding, agentic workloads with tool-call interleaving) emerging as serving-stack design points.


The bottom line

The prefill/decode mismatch is the central structural fact of LLM inference: two phases with opposite hardware appetites sharing one GPU, each stalling the other. Disaggregation solves it by giving each phase its own pool, tuned for its bottleneck, with the KV cache as the courier between them. The single biggest lever is layer-wise KV streaming — it hides nearly all transfer latency behind ongoing prefill compute, which is what makes the whole architecture practical at production scale.

If you take only this away:

  • Prefill is FLOPs-bound; decode is HBM-bandwidth-bound. No single GPU is optimal for both.
  • Get the substrate right first. Continuous batching, paged attention, prefix caching, and FP8 KV are the biggest wins for most teams — bigger than disaggregation alone.
  • Same-node disaggregation captures most of the gain at a fraction of the engineering cost. Reach for full multi-node only at hosted-provider scale.
  • RDMA-class networking is a prerequisite for cross-node disaggregation. Without it, the KV transfer kills the win.
  • Prefix caching compounds disaggregation. Both attack redundant prefill work; together they are 5–20× on prefix-heavy traffic.

For the memory math behind the KV cache that gets streamed, read KV cache. For the decoding optimizations that stack on top, see speculative decoding.


FAQ

Do I need disaggregation for a 7B model? Probably not. The arithmetic-intensity gap exists but is smaller. Operational complexity outweighs the throughput win for most 7B deployments. Run vLLM on one or two GPUs and see if it's a bottleneck before reaching for disaggregation.

Can I disaggregate inside a single node? Yes, and it's the recommended on-ramp. Put prefill workers on some GPUs of an 8-GPU node and decode workers on the others. NVLink makes KV transfer essentially free. You capture most of the scheduling win without inter-node network engineering.

How does this interact with MoE models? Cleanly and beneficially. Each token routes to only a few experts, but a long prompt collectively touches nearly all of them, so MoE prefill keeps every expert busy and stays compute-heavy, which makes the prefill/decode split more pronounced. Expert parallelism lives in the decode pool; prefill workers can be smaller and FLOPs-dense. See our mixture-of-experts serving guide for the expert-parallel scheduling story.

What about CPU offloading the KV cache? PCIe Gen5 is ~64 GB/s — too slow for interactive decode (the KV cache for a single 70B request might be 10 GB, and you need it accessible at sub-millisecond latency). Used only for offline batch workloads where latency doesn't matter.

Does prompt caching require disaggregation? No. They're independent. But disaggregation makes prefix caching easier to scale because the prefill pool is shared infrastructure that can hold a large prefix cache. Colocated systems can prefix-cache too, just with less aggregate cache capacity.

What's the smallest deployment where this matters? Empirically, 4+ GPUs and > 5 req/s. Below that, schedulers don't get enough load for the split to pay off.

How much engineering does it cost to deploy? Using a serving stack with built-in support (vLLM, SGLang, TensorRT-LLM): a few weeks of tuning. Building from scratch: months. Most teams should use an off-the-shelf stack.

Will B200 or future hardware make disaggregation unnecessary? No. The arithmetic-intensity gap between prefill and decode is structural to the transformer, not a property of any hardware generation. Newer GPUs make both phases faster but don't change the relative profile. The on-package HBM capacity increases (B200's 192 GB, MI325X's 256 GB) make decode pools more capable but also raise the bar on what counts as "prefill-heavy" — you can hold more concurrent decodes per GPU.

Does disaggregation matter for sub-70B models? Marginally. For models below ~30B, decode at a healthy batch size already saturates HBM bandwidth on a single GPU, and prefill is not long enough to be a scheduling problem. Run vLLM with continuous batching, paged attention, and prefix caching. Skip disaggregation. The crossover where the engineering pays back is roughly: 70B-class model, average prompt > 1k tokens, > 10 QPS per node. Below that, the operational overhead exceeds the throughput win.

What is the cost of switching from a colocated to a disaggregated serving architecture? Three costs. Engineering: 4-8 weeks for a serious team using a stack with built-in disaggregation support (vLLM, SGLang, TRT-LLM), plus another month of load testing and tuning routing. Capital: faster intra-cluster fabric (RDMA-capable NICs or NVL72-class racks) if you don't have it. Operational: harder capacity planning because you scale two pools instead of one, and harder observability because failures can be in either pool or in the KV transfer plane. Most teams that have made the switch report that the engineering cost is paid back within a quarter at hosted-provider scale; smaller deployments rarely recoup it.

What about disaggregating prefill, decode, and the embedding/output projections separately? Some research deployments (DistServe variants, internal hyperscaler systems) further split the embedding lookup and output projection from the decode forward. The gains are small (a few percent) and the engineering cost is large. Not worth it for almost any deployment outside of frontier labs.

Does disaggregation make speculative decoding easier or harder? Slightly harder. The draft model has to live somewhere. Practical pattern: place the draft on the decode worker so the draft and target share the same KV. Putting the draft on a separate pool would add a network hop per draft step, which kills the speedup. For the underlying mechanism, see our speculative decoding guide.

How does this interact with long-context serving? Disaggregation amplifies the long-context KV pressure problem. A 128k-context request produces 43 GB of KV cache per request on a 70B model; transferring that between pools is expensive even on fast fabrics. Production stacks pair disaggregation with KV quantization (FP8 or INT4 on the wire) and aggressive prefix caching. See our long-context attention guide for the underlying memory math.
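The 43 GB figure follows from the model's attention shape. The sketch below assumes a Llama-3-70B-style configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128); those architecture parameters are assumptions made here, and a model with more KV heads scales the result up proportionally.

```python
def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """KV cache size for one sequence: keys and values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * tokens


context = 128 * 1024  # 128k tokens
for label, width in (("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)):
    size_gb = kv_cache_bytes(context, bytes_per_elem=width) / 1e9
    print(f"{label}: {size_gb:.1f} GB")
```

Halving the element width halves the bytes on the wire, which is why quantizing KV before transfer is the first lever for long-context disaggregation.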

Can I disaggregate MoE serving? Yes, and the gains are usually larger because MoE prefill is even more compute-heavy (a long prompt activates nearly every expert, even though each token routes to only a few) and MoE decode benefits more from large batches (each expert wants enough work to amortize its weight load). Expert parallelism lives in the decode pool. See our mixture-of-experts serving guide.

What about agent workloads with many short turns? A different regime. Each agent turn is short prefill + short decode + tool call + next turn. Disaggregation helps if turns share prefixes (system prompt + conversation history); it hurts if every turn is short enough that the prefill/decode split is rounding error. The Mooncake / SGLang RadixAttention pattern of branching prefix trees fits this workload best. See our agent serving infrastructure guide.

How do I size the prefill-to-decode pool ratio in practice? Start from your measured request mix. The closed-form ratio in §5 is a starting point, not a target. In practice, instrument prefill-pool queue depth and decode-pool queue depth separately, autoscale each on its own queue, and let the ratio settle. Typical converged ratios in 2026 production: chat workloads 1:5 to 1:8 (prefill:decode), RAG with long contexts 1:3 to 1:5, agent workloads 1:8 to 1:12 because each turn is short. Re-evaluate quarterly because the distribution drifts as your product changes.

Does FP8 KV cache work safely with disaggregation? Yes, and most production stacks use it. The subtlety: if prefill stores KV in FP16 in its own HBM but transmits in FP8, the decode side receives FP8 — make sure that is what your decode kernel expects. FlashAttention-3 supports FP8 KV natively; older FlashAttention-2 kernels require an in-place dequantization on receive, which adds 1-3 ms per layer. Quality regressions from FP8 KV are typically under 0.5% on standard chat benchmarks but can hit 1-2% on long-context retrieval tasks. Always validate on your own eval set. See quantization tradeoffs.

Can I disaggregate when the prefill and decode pools use different model parallelism degrees? You can, with a reshuffle. If prefill is TP=2 and decode is TP=4, each KV layer arrives sharded across 2 ranks and must be re-sharded to 4. This adds an all-to-all on the receive side that costs roughly half a layer's worth of NVLink time. The clean solution is to match TP degrees across pools when possible; the workaround is to do the reshuffle and accept the latency. Pipeline parallelism mismatch is much worse — different layer assignments mean per-layer streaming has to be reconstructed end-to-end. Most production stacks recommend matching PP degrees strictly.

How does disaggregation interact with multi-tenant LoRA serving? LoRA adapters live with the base model in the decode pool (they affect every decode step), and prefill workers must also load the matching adapter to produce a KV cache consistent with the decode-side weights. The standard pattern is to hot-swap adapters per request on both sides, with the router carrying the adapter ID through the prefill→decode handoff. Adapter loading from CPU memory is ~1-10 ms; from NVMe it is 50-200 ms. See our multi-tenant LoRA serving guide.

What happens if the prefill pool fails mid-request? A request that has not yet emitted its first token can be retried from scratch — the prompt is on the router, the decode slot is reserved, the prefill is just compute. Routers handle this with a short retry budget (typically 2-3 attempts) before failing the request. Failures after the first token are harder: you have a partial KV on the decode side and a half-finished generation. Most production stacks fail the request and surface it to the client rather than attempting recovery. The MTBF of a single prefill worker is high enough that the per-request failure rate is negligible at typical scale.

Is disaggregation worth it on consumer hardware (RTX 4090, 5090)? Almost never. Consumer cards have no GPUDirect RDMA, no NVLink between cards on most boards, and PCIe-only inter-card transfer. The KV-transfer overhead consumes the scheduling win. Stick with single-card vLLM at small batch on consumer hardware, or use Ollama-style cold-start serving. If you need higher throughput on a budget, rent A100 or H100 by the hour rather than building a consumer-card disaggregated rig.

How does disaggregation affect observability and SLOs? You now have two separate latency budgets (prefill TTFT contribution, decode ITL contribution) plus a transfer budget. The metrics you want: prefill queue depth, prefill p99 latency by prompt-length bucket, KV transfer time per layer, decode KV slot fill rate, per-pool HBM utilization. The traps: averaging across pools hides one pool overloading while the other is idle; not separating "stuck in transfer" from "stuck in decode" makes triage impossible. Most production stacks expose Prometheus metrics keyed by pool and by request phase; if yours does not, build it before you scale.

Should I disaggregate reasoning models (o1, R1-style) differently? Yes. Reasoning models generate very long internal chains-of-thought before emitting a final answer, which makes the output much longer than the prompt. The prefill:decode ratio shifts heavily toward decode. The prefix-cache benefit on shared system prompts is unchanged, but the per-request decode cost balloons. Provision decode capacity assuming 5-20× more output tokens per request than for non-reasoning workloads, and budget for KV memory growth during the chain. See our reasoning model serving guide.

Where does disaggregation fit relative to speculative decoding? They compose. Speculative decoding (EAGLE-2) gives the decode pool a 1.5-3× per-request speedup; disaggregation gives the entire system a 1.5-3× throughput improvement. Run both. The draft model lives on the decode worker so the draft and target share the KV cache. The combined stack delivers roughly 3-7× over a naive colocated baseline on chat workloads. See our speculative decoding guide.

What is NIXL and do I need it? NIXL (NVIDIA Inference Xfer Library) is NVIDIA's library for inference KV transfer across GPUs and nodes. It handles both NVLink (same-node) and RDMA (cross-node) transparently, with per-layer streaming for low TTFT. If you're using NVIDIA Dynamo or TRT-LLM at scale, you're already using NIXL. For other stacks, NIXL is available as a library to integrate. Not strictly required — NCCL works as a fallback — but NIXL is 20-40% faster for KV transfer on the same hardware.

Does disaggregation reduce or increase total HBM usage? Slightly reduces. Co-located serving holds both phases' working set in the same GPU; disaggregated splits across separate pools where each pool only holds its phase's working set. The KV cache is now in the decode pool only (not duplicated). For typical workloads, total HBM usage drops 10-15%. The bigger win is not memory but utilization — both pools run their hardware at higher utilization than co-located could achieve.

What if my workload doesn't fit any clean P/D ratio? Adaptive P/D allocation. Production stacks (Mooncake, SGLang) can re-purpose GPUs between phases on minute-timescales. Hardware that supports both phases efficiently (H100, H200) makes this easier than mixed pools. The cost is conversion latency (30-60 seconds) and operational complexity; the benefit is automatic adjustment to changing workload mixes.

How does disaggregation affect cold start? Negatively. Co-located serving has one model load per GPU; disaggregated needs the model loaded on both prefill and decode pool GPUs. Total cold-start time and memory waste are higher. AOTInductor or TRT engine builds reduce the cold-start cost; production deployments pre-warm GPUs in both pools before accepting traffic.

Does disaggregation work with MoE? Yes, but with complications. MoE has expert dispatch all-to-all collectives that need to be handled within the prefill or decode phase. Cross-phase MoE all-to-all (prefill experts on prefill pool, decode experts on decode pool) is research-stage; production MoE disaggregation typically replicates all experts on both pools. The expert-parallelism strategy interacts with the disaggregation topology in non-trivial ways.

How small can a decode pool be? As small as the number of GPUs needed to hold one copy of the weights plus the KV cache for your target concurrency; a single decode GPU batches many concurrent requests. Theoretical minimum for Llama-3-70B FP8: 1 H100 SXM, with little HBM left over for KV. Practical minimum for production reliability: 4-8 GPUs with redundancy. Below that, single-GPU failures cause unacceptable outages.

Can I disaggregate inference on a single GPU? No — disaggregation is fundamentally about separating phases across hardware. On a single GPU, you can use chunked prefill to achieve similar interleaving benefits without disaggregation. For workloads where one GPU is enough, chunked prefill is the right answer.

What is goodput vs throughput? Throughput is raw tokens/sec. Goodput is throughput meeting SLAs (e.g., tokens/sec for requests where TTFT < 1s and ITL < 50ms). Disaggregation optimizes for goodput, not raw throughput. A naive measurement of throughput might suggest co-located is similar to disaggregated; goodput measurement reveals the win — co-located meets SLAs for far fewer requests at the same load.
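
The distinction is easy to see in code. This is a minimal sketch, assuming each finished request is recorded as a `(ttft_seconds, itl_seconds, output_tokens)` tuple; the record layout, function names, and sample numbers are illustrative.

```python
def throughput_tps(requests, window_s: float) -> float:
    """Raw tokens per second, regardless of latency."""
    return sum(tokens for _, _, tokens in requests) / window_s


def goodput_tps(requests, window_s: float, ttft_slo_s: float, itl_slo_s: float) -> float:
    """Tokens per second counting only requests that met both SLAs."""
    good = sum(
        tokens
        for ttft, itl, tokens in requests
        if ttft < ttft_slo_s and itl < itl_slo_s
    )
    return good / window_s


# Four requests observed over a 10-second window (made-up values).
requests = [
    (0.3, 0.04, 200),  # meets both SLAs
    (0.8, 0.03, 100),  # meets both SLAs
    (0.4, 0.09, 300),  # ITL too slow
    (1.5, 0.02, 400),  # TTFT too slow
]
raw = throughput_tps(requests, window_s=10.0)
good = goodput_tps(requests, window_s=10.0, ttft_slo_s=1.0, itl_slo_s=0.05)
print(raw, good)
```

Here the system looks healthy on raw throughput while most of its tokens belong to requests that missed an SLA.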

How does KV cache layout affect transfer efficiency? KV cache stored contiguously per layer transfers faster (one large memcpy per layer); paged KV requires gather operations. Most stacks use layer-contiguous storage during transfer, even if the at-rest storage is paged. NIXL handles this transparently; hand-rolled implementations need to be careful about the layout transformation.

What's the right way to debug disaggregation issues? Three-step process. First, verify per-phase functionality: route a small workload to prefill-only and decode-only paths and confirm each works in isolation. Second, instrument the KV transfer: log transfer initiation, completion, and per-layer timing. Third, monitor queue depths separately; an imbalance indicates routing or sizing issues. Most disaggregation bugs manifest as either silent KV transfer corruption (rare with NIXL) or P/D ratio misalignment (very common).

Should I disaggregate for fine-tuning workloads? No, fine-tuning is training-style — long sequences in large batches, prefill-shaped. Co-located is the right pattern. Disaggregation is purely an inference optimization.

How does prefix caching affect disaggregation economics? Prefix caching reduces prefill load disproportionately — cached prefixes don't need prefill at all. This shifts the P/D ratio further toward decode-heavy. If your workload has high prefix cache hit rate (>50%), the prefill pool can be smaller than naive sizing suggests. Conversely, deployments without prefix caching pay full prefill cost on every request and need larger prefill pools.


Glossary

  • Arithmetic intensity — FLOPs performed per byte loaded from HBM. The number that tells you whether a workload is compute- or bandwidth-bound.
  • Continuous batching — admitting new requests into an active batch as old ones finish, instead of running fixed batches.
  • Decode — the phase where a model generates output tokens one at a time. Bandwidth-bound at small batch sizes.
  • Disaggregation — running prefill and decode on separate GPU pools connected by a fast fabric.
  • GPUDirect RDMA — direct GPU-to-GPU memory transfer over RDMA networks, bypassing CPU and host memory.
  • HBM — High Bandwidth Memory. The on-package memory of modern GPUs.
  • ITL — Inter-Token Latency. Time between consecutive generated tokens.
  • KV cache — per-token key and value tensors stored to avoid recomputing attention.
  • Layer-wise streaming — pipelining KV cache transfer with ongoing prefill compute.
  • NVLink / NVSwitch — NVIDIA's high-bandwidth GPU-to-GPU interconnect; NVSwitch is the crossbar fabric that connects multiple NVLink-equipped GPUs at full bandwidth.
  • Prefill — the phase where a model processes the input prompt to produce the initial KV cache. Compute-bound.
  • Prefix caching — reusing KV cache across requests that share a prompt prefix.
  • Ridge point — on a roofline plot, the arithmetic intensity at which compute and bandwidth saturate equally.
  • RoCE — RDMA over Converged Ethernet.
  • TTFT — Time To First Token. End-to-end latency from request to first generated token.

References

  • Mooncake — Qin et al., 2024. "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving." arXiv:2407.00079. The reference paper for layer-wise streaming and distributed KV pools.
  • DistServe — Zhong et al., 2024. "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving." arXiv:2401.09670. Reports the 1.5-3× goodput numbers cited above.
  • Splitwise — Patel et al., 2023. "Splitwise: Efficient generative LLM inference using phase splitting." arXiv:2311.18677. Microsoft Research's independent demonstration of the same pattern.
  • PagedAttention / vLLM — Kwon et al., 2023. "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv:2309.06180. Foundational paper for paged KV cache used throughout disaggregated stacks.
  • SGLang / RadixAttention — Zheng et al., 2023. "Efficient Programming of Large Language Models using SGLang." arXiv:2312.07104.
  • DeepSeek-V3 technical report — DeepSeek-AI, 2024. arXiv:2412.19437. Production serving infrastructure section describes their disaggregated design.
  • Roofline model — Williams, Waterman, Patterson, 2009. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM 52(4). The mental model for compute-vs-bandwidth bounds.
  • NVIDIA TensorRT-LLM documentation — current disaggregated serving and KV cache reuse docs.
  • FlashAttention — Dao et al., 2022. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." arXiv:2205.14135. The kernel that makes long-context prefill tractable.
  • Speculative Decoding — Leviathan et al., 2022. arXiv:2211.17192. Foundational paper for the decode-side optimization that composes with disaggregation.
  • EAGLE-2 — Li et al., 2024. arXiv:2406.16858. Dominant production speculative-decoding variant.

Prefill vs decode mechanics in depth

The two phases differ in compute pattern, memory access, and bottleneck shape. Understanding the asymmetry is the foundation for disaggregation.

Prefill: compute-bound

Prefill processes the entire prompt in parallel. For a 4096-token prompt on Llama-70B, every layer runs large matmuls over a 4096 × d_model activation matrix, repeated across all n_layers. Arithmetic intensity is high (many FLOPs per byte of HBM read), so the GPU runs near its TFLOPS ceiling. Time scales roughly linearly with prompt length until the quadratic attention term takes over at long context. Batch parallelism helps modestly because each request already has many tokens.

Decode: memory-bound

Decode generates one token at a time. Each step reads the full weight set from HBM, reads the KV cache, computes one token, and appends to the KV cache. Arithmetic intensity is low (per-token compute is small relative to the bytes read from HBM). Time scales with output length. Batch parallelism helps a lot: at batch 32, the 32 sequences share a single weight read per step, amortizing memory bandwidth.

Kernel choices per phase

  • Prefill kernels: large GEMM optimized for high arithmetic intensity. FlashAttention-3 (Hopper) and FlashAttention-Blackwell (B200) for attention. cuBLAS / cuBLASLt for FFN.
  • Decode kernels: GEMV (matrix-vector) or grouped GEMM for batched decode. FlashDecoding for attention. Smaller tile sizes than prefill.

Batch composition in disagg

In disaggregated serving:

  • Prefill pool batches multiple full prompts simultaneously (batch dimension = number of concurrent prefills).
  • Decode pool batches many in-flight decodes (batch dimension = concurrent active sessions).
  • These look completely different from a kernel-tuning perspective.

The arithmetic-intensity gap

For Llama-70B FP8:

  • Prefill: ~700-900 TFLOPS sustained (close to H100 peak).
  • Decode: ~30-50 TFLOPS sustained (bandwidth-bound, ~5-7% of compute peak).

This 15-30× gap is why disaggregation works: each phase is limited by a different GPU resource, compute for prefill and memory bandwidth for decode.
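
A back-of-the-envelope roofline check makes the gap concrete. This is a sketch under simplifying assumptions: only weight reads are counted (activations and KV reads are ignored), a linear layer costs 2 FLOPs per parameter per token, and the peak numbers are round illustrative values, not measured figures for any GPU.

```python
def arithmetic_intensity(tokens_per_step: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weights read: (2 * params * tokens) / (params * bytes)."""
    return 2.0 * tokens_per_step / bytes_per_param


def ridge_point(peak_flops: float, hbm_bytes_per_s: float) -> float:
    """Arithmetic intensity at which compute and bandwidth saturate together."""
    return peak_flops / hbm_bytes_per_s


def bound(tokens_per_step: int, bytes_per_param: float,
          peak_flops: float, hbm_bytes_per_s: float) -> str:
    ai = arithmetic_intensity(tokens_per_step, bytes_per_param)
    return "compute" if ai >= ridge_point(peak_flops, hbm_bytes_per_s) else "bandwidth"


# Assumed hardware: 1e15 FLOP/s peak, 3.35e12 B/s HBM bandwidth, FP8 weights (1 byte).
PEAK, BW, FP8 = 1e15, 3.35e12, 1.0
decode_bound = bound(32, FP8, PEAK, BW)      # 32 sequences, one token each
prefill_bound = bound(4096, FP8, PEAK, BW)   # one 4096-token prompt
print(decode_bound, prefill_bound)
```

Under these assumptions the ridge point sits near 300 FLOPs per byte, so decode at batch 32 (intensity 64) is bandwidth-bound while a 4096-token prefill (intensity 8192) is compute-bound.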


KV transfer mechanics deep dive

Moving KV cache from prefill pool to decode pool is the new central operation in disaggregated serving.

NIXL (NVIDIA Dynamo)

NVIDIA's KV-transfer library introduced with NVIDIA Dynamo. Optimizes KV movement with prefetch and zero-copy paths. Used in NVIDIA's reference disaggregation stack.

NCCL Send/Recv

Standard point-to-point primitive. Works for KV transfer but not optimized for the access pattern. Used in vLLM disagg prototype.

RDMA direct

Direct RDMA writes from prefill GPU to decode GPU memory. Fastest path for cross-node KV. Used in Mooncake's pooled KV design.

GDRCopy

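A low-latency copy library built on GPUDirect RDMA that lets the CPU map GPU memory and read or write it directly. Best suited to small transfers such as control messages and metadata; bulk same-node KV movement over NVLink or PCIe goes through CUDA peer-to-peer copies.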

IB Write semantics

InfiniBand provides reliable one-sided write with completion. The natural primitive for KV transfer cross-node.

Transfer bandwidth math

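KV cache size for one request, Llama-70B, 4096-token prompt:

  • Full-width K and V (no grouped-query attention): 80 layers × 2 (K + V) × 4096 tokens × 8192 hidden × 2 bytes (BF16) = ~10.7 GB.
  • With Llama-70B's grouped-query attention (8 KV heads × 128 head dim = 1024-wide K and V), the cache is 8× smaller: ~1.3 GB.
  • At 1.8 TB/s NVLink: ~6 ms for the full-width case, under 1 ms with GQA.
  • At 50 GB/s InfiniBand: ~200 ms for the full-width case, ~27 ms with GQA.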

NVLink is fast enough for same-rack disagg; cross-rack disagg over IB adds substantial latency.
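
The arithmetic above generalizes to any model shape. This is a minimal sketch with illustrative helper names (`kv_cache_bytes`, `transfer_ms`). The 64-head case reproduces the ~10 GB figure, and the 8-head case assumes grouped-query attention with 8 KV heads; check your model's config for the real KV head count.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int) -> int:
    """Size of K and V for one sequence: 2 tensors per layer, per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def transfer_ms(num_bytes: int, link_bytes_per_s: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return 1000.0 * num_bytes / link_bytes_per_s


# 80 layers, head_dim 128, 4096 tokens, BF16 (2 bytes per element).
full_width = kv_cache_bytes(80, 64, 128, 4096, 2)  # 64 KV heads = 8192-wide K and V
with_gqa = kv_cache_bytes(80, 8, 128, 4096, 2)     # 8 KV heads (assumed GQA setting)

nvlink_ms = transfer_ms(full_width, 1.8e12)  # 1.8 TB/s
ib_ms = transfer_ms(full_width, 50e9)        # 50 GB/s
print(full_width, with_gqa, round(nvlink_ms, 1), round(ib_ms, 1))
```

Halving `bytes_per_elem` to 1 models an FP8 KV cache and halves both the size and the transfer time.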

Compression and offload

  • FP8 KV cache halves transfer cost.
  • CPU-memory KV offload (Mooncake) keeps cold KV in DRAM, swaps to HBM when needed.
  • Per-layer streaming: send each layer's KV as soon as prefill finishes that layer, so the transfer overlaps with the remaining prefill compute.

P/D ratio optimization in depth

The prefill-to-decode pool ratio determines cost. Workload-driven.

Chat workload (1:8 P/D)

Typical chat: 200-token prompt, 200-token response. Decode-heavy. 1 prefill GPU per 8 decode GPUs is a reasonable starting ratio.

Agent workload (1:16 P/D)

Agent calls have short prompts and short outputs but high concurrency. Decode-heavy. 1:16 is common.

RAG workload (1:4 P/D)

RAG injects retrieved chunks into prompt: 2k-8k input, 200-token response. Prefill-heavy vs chat. 1:4 to 1:6.

Reasoning workload (1:32 P/D)

Reasoning models generate long thinking traces (5k-20k tokens). Massively decode-heavy. 1:32 or higher.

Long-document workload (1:1 P/D)

Summarization of long documents: 16k-100k input, short output. Prefill dominates. 1:1 or even prefill-heavy.

Decision table

Workload         | Input tokens | Output tokens | P/D ratio | Decode tier dominant?
Chat             | 200          | 200           | 1:8       | Yes
Agent            | 500          | 200           | 1:16      | Yes
RAG              | 4000         | 200           | 1:4       | Mixed
Reasoning        | 1000         | 10000         | 1:32      | Yes
Long-doc summary | 32000        | 500           | 1:1       | No
Code completion  | 500          | 100           | 1:6       | Yes
Translation      | 2000         | 2000          | 1:2       | Mixed

Mixed-workload deployments

Most production deployments serve mixed traffic. Strategies:

  • Static partition: dedicate fixed P/D ratio; size for worst case.
  • Dynamic re-allocation: rebalance pools based on traffic. Mooncake and Dynamo support this.
  • Workload-aware routing: route to per-workload pools.

Cost math

For 100 QPS of chat traffic: 12 prefill GPUs + 96 decode GPUs = 108 total. A uniform pool at the same 100 QPS needs ~140 GPUs (no specialization). Disaggregation saves ~23% of capex in this illustrative example.


Per-stack disaggregation support

Engine-by-engine capability survey.

SGLang

The most-deployed disagg-prefill-decode engine in 2026. Reference for DeepSeek-V3 production. Pooled KV, supports cross-rack via NVLink+IB. Active development.

vLLM

Disaggregation is a prototype as of 2026; production-grade support emerging. Strong PagedAttention and continuous batching baseline.

TensorRT-LLM (via NVIDIA Dynamo)

NVIDIA Dynamo integrates TRT-LLM with disaggregation. NIXL KV transfer. Production-grade for NVIDIA-hosted deployments.

Mooncake (Moonshot AI)

Production system from Moonshot AI (Qin et al., Mooncake paper, July 2024). Distributed KV pool with CPU-memory offload. Used internally at Moonshot scale.

DistServe (UCSD)

Zhong et al., 2024. Academic system demonstrating goodput optimization via disaggregation. Reference for many follow-up systems.

Splitwise (Microsoft)

Patel et al., 2023. Microsoft Research's foundational disaggregation paper.

NVIDIA Dynamo

NVIDIA's 2025 disaggregation framework. Integrates Triton, TRT-LLM, NIXL, GPUDirect. Production-grade.

Engine comparison

Engine           | Disagg state          | KV transport       | P/D dynamic | Production scale
SGLang           | Mature                | NCCL/RDMA          | Yes         | Frontier
vLLM             | Prototype             | NCCL               | Limited     | Smaller
TRT-LLM (Dynamo) | Mature                | NIXL               | Yes         | NVIDIA-aligned
Mooncake         | Production (Moonshot) | RDMA + CPU offload | Yes         | Hosted-provider
DistServe        | Research              | RDMA               | Yes         | Academic
Splitwise        | Research / Azure      | Custom             | Yes         | Microsoft internal
NVIDIA Dynamo    | Production            | NIXL               | Yes         | NVIDIA reference

Workload-driven disaggregation decision

When disagg helps vs when it doesn't.

Disagg clearly helps

  • Mixed prefill-heavy and decode-heavy traffic.
  • Long-context with high decode token counts.
  • Reasoning models with multi-thousand-token thinking.
  • Hosted-provider scale (10k+ QPS).
  • Heterogeneous GPU pools (B200 decode + H200 prefill).

Disagg doesn't help

  • Small models (< 7B) where the cost gap is small.
  • Short context everywhere (< 200 tokens both prompt and response).
  • Low concurrency (< 100 QPS).
  • Single-node, single-tenant deployments.

Borderline cases

  • Mid-scale deployments (1k-10k QPS) where disagg's complexity may not be worth the 20-30% cost win.
  • Single-workload deployments where uniform pool is simpler.

Cost math worked example: prefill + decode pool sizing

A target: 1000 QPS chat workload, Llama 70B, p50 TTFT < 500ms, p50 TPOT < 50ms.

Uniform pool

  • Per-GPU throughput Llama 70B FP8 on H100: ~5000 tps.
  • 1000 QPS × 400 tokens/req = 400k tokens/s.
  • 80 H100 minimum, ~96 with headroom.
  • Cost: 96 H100 × $30k = $2.88M capex.

Disaggregated pool

  • Prefill: 1000 QPS × 200 input tokens = 200k input tokens/s. Per-GPU prefill rate: ~50k tps. So 4 H100 prefill GPUs.
  • Decode: 1000 QPS × 200 output tokens = 200k output tokens/s. Per-GPU decode at batch 32: ~5000 tps total, or ~156 tps per session, so each request holds a slot for ~1.3 s and ~1280 sessions are active at once. 1280 / 32 = 40 H100 decode GPUs.
  • Total: 4 prefill + 40 decode = 44 H100, ~53 with the same 20% headroom as the uniform pool.
  • Cost: 53 × $30k = $1.59M capex.

Savings

~45% capex reduction with disagg vs uniform, with comparable latency targets. Subject to assumptions about token rates and batching.

Mixed B200/H200

Replace decode pool with B200: per-GPU decode rate at FP4 ~12000 tps. 40 H100 decode → ~17 B200 decode. Cost-balance shifts toward B200.
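
The same sizing logic can be written as a small calculator. This is a sketch, not a capacity planner: it assumes each pool is limited purely by aggregate token rate, and the workload and per-GPU rates in the example are hypothetical values chosen for illustration. Measure your own rates before trusting the output.

```python
import math


def size_pools(qps: float, in_tokens: int, out_tokens: int,
               prefill_tps: float, decode_tps: float,
               headroom: float = 0.0) -> tuple:
    """Return (prefill_gpus, decode_gpus) needed to sustain the token rates."""
    prefill = math.ceil(qps * in_tokens / prefill_tps * (1 + headroom))
    decode = math.ceil(qps * out_tokens / decode_tps * (1 + headroom))
    return prefill, decode


def active_sessions(qps: float, out_tokens: int, per_session_tps: float) -> float:
    """Little's law: concurrency = arrival rate × time each request spends decoding."""
    return qps * out_tokens / per_session_tps


# Hypothetical workload: 50 QPS, 1000-token prompts, 500-token outputs.
pools = size_pools(qps=50, in_tokens=1000, out_tokens=500,
                   prefill_tps=40_000, decode_tps=4_000)
pools_with_headroom = size_pools(qps=50, in_tokens=1000, out_tokens=500,
                                 prefill_tps=40_000, decode_tps=4_000,
                                 headroom=0.2)
concurrent = active_sessions(qps=50, out_tokens=500, per_session_tps=25)
print(pools, pools_with_headroom, concurrent)
```

`active_sessions` is a useful cross-check: if the concurrency it reports exceeds the number of KV slots the decode pool can hold, memory rather than token rate is the binding constraint.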


Reasoning models and disaggregation

Reasoning workloads dominate decode pool capacity.

The problem

Reasoning models (DeepSeek-R1, o-series) generate 5k-20k tokens of thinking before the visible response. Per-request decode time: 10x-100x longer than chat. The decode pool fills with long-running sessions.

Disagg helps disproportionately

Prefill cost is unchanged (short user prompt). Decode cost is massive. Disagg lets prefill pool stay small while decode pool scales independently.

Sizing for reasoning

P/D ratio 1:32+ for reasoning-dominated workloads. Decode pool sized for concurrent long-running sessions, not request rate.

Cache implications

Long thinking traces produce huge KV caches per session. KV cache memory becomes the binding constraint on decode pool size.

See also

Reasoning model serving for the full reasoning serving stack.


Multi-DC disaggregation

Cross-DC disagg is an emerging pattern.

When it makes sense

  • DC1 has cheap prefill capacity (H200, off-peak electricity).
  • DC2 has decode capacity (B200, low-latency for users).
  • Workload tolerates KV transfer cost.

Latency cost

WAN KV transfer adds 10-100ms to TTFT depending on DC distance and KV size. For chat, this is significant. For batch / async use cases, acceptable.

Cost arbitrage

Different DCs have different per-GPU-hour costs. Disagg lets each phase run where it's cheapest.

Production state

Experimental in 2026. Microsoft, Google, AWS are reportedly piloting. No public mature deployment yet.


Failure handling in disaggregated serving

Disagg introduces new failure modes.

Decode-pod failure mid-request

KV cache was transferred to a decode GPU; decode GPU dies. Request stalls. Recovery: retry on different decode pod (KV must be regenerated via prefill).

Prefill-pool overflow

Prefill queue grows; TTFT spikes. Recovery: spin up more prefill GPUs (slow) or shed load.

KV transfer failure

Network drops during KV transfer. Recovery: retransmit, fall back to local prefill, or fail the request.
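
The recovery order (retransmit, then local prefill, then fail) can be sketched as a small wrapper. The callables `send_kv` and `local_prefill` are hypothetical stand-ins for whatever your stack exposes, and treating `ConnectionError` as the retryable error is an assumption for illustration.

```python
from typing import Callable, Optional


def transfer_with_fallback(send_kv: Callable[[], bytes],
                           local_prefill: Optional[Callable[[], bytes]] = None,
                           max_attempts: int = 3) -> bytes:
    """Retry the KV transfer, then fall back to recomputing locally, then fail."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return send_kv()
        except ConnectionError as exc:  # assumed retryable failure type
            last_error = exc
    if local_prefill is not None:
        return local_prefill()
    raise RuntimeError("KV transfer failed after retries") from last_error


# Demo: a link that drops twice and succeeds on the third attempt.
calls = {"n": 0}


def flaky_send() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("link dropped")
    return b"kv"


def always_fail() -> bytes:
    raise ConnectionError("link down")


result = transfer_with_fallback(flaky_send)
fallback_result = transfer_with_fallback(always_fail, local_prefill=lambda: b"recomputed")
print(result, fallback_result)
```

A real implementation would add backoff between attempts and a deadline tied to the request's TTFT budget.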

Cache eviction

Decode pool needs to evict KV to make room for new sessions. Choose eviction by LRU or attention-based heuristic.
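
LRU eviction over session-level KV allocations can be sketched with an ordered map. `LruKvSlots` is an illustrative class, not a component of any engine; it assumes eviction happens at whole-session granularity, that block counts are known at admit time, and that session ids are unique.

```python
from collections import OrderedDict


class LruKvSlots:
    """Tracks KV blocks per session and evicts least-recently-used sessions."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.used = 0
        self.sessions = OrderedDict()  # session_id -> blocks held, oldest first

    def touch(self, session_id: str) -> None:
        """Mark a session as recently used (for example, after a decode step)."""
        self.sessions.move_to_end(session_id)

    def admit(self, session_id: str, blocks: int) -> list:
        """Admit a new session, evicting old ones if needed. Returns evicted ids."""
        if blocks > self.capacity:
            raise ValueError("session larger than the whole cache")
        evicted = []
        while self.used + blocks > self.capacity:
            victim, freed = self.sessions.popitem(last=False)  # oldest entry
            self.used -= freed
            evicted.append(victim)
        self.sessions[session_id] = blocks
        self.used += blocks
        return evicted


slots = LruKvSlots(capacity_blocks=10)
slots.admit("a", 4)
slots.admit("b", 4)
slots.touch("a")               # "a" is now more recent than "b"
evicted = slots.admit("c", 4)  # needs room, so the oldest session goes
print(evicted, list(slots.sessions))
```

An attention-based heuristic would replace the recency order with a score per session or per block; the bookkeeping around capacity stays the same.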

Graceful degradation

Production designs detect these failure modes and degrade smoothly: fall back to uniform-pool serving, shed non-priority traffic.


Observability for disaggregation

Disagg requires more metrics than uniform.

Key metrics

  • Prefill pool: GPU utilization, queue depth, per-request time, batch composition.
  • Decode pool: GPU utilization, batch occupancy, KV cache memory pressure, per-session decode rate.
  • KV transfer: throughput, p99 latency, failure rate.
  • End-to-end: TTFT, TPOT, completion rate.

Alerting

  • Prefill p99 TTFT > target.
  • Decode p99 TPOT > target.
  • KV transfer failures > threshold.
  • Pool utilization imbalance.

Tracing

Per-request distributed traces showing prefill → KV transfer → decode timing.
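
A per-request trace reduces to a handful of timestamps. This is a minimal sketch, assuming five timestamps are recorded per request and that KV transfer is measured from the end of prefill (with per-layer streaming the phases overlap and would need separate spans). The field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrived: float        # request reaches the router
    prefill_start: float  # prefill worker picks it up
    prefill_end: float    # last layer's KV is computed
    transfer_end: float   # KV is fully resident on the decode worker
    first_token: float    # first output token is emitted

    def phases(self) -> dict:
        """Split TTFT into the time spent in each phase."""
        return {
            "prefill_queue": self.prefill_start - self.arrived,
            "prefill": self.prefill_end - self.prefill_start,
            "kv_transfer": self.transfer_end - self.prefill_end,
            "decode_wait": self.first_token - self.transfer_end,
        }

    def slowest_phase(self) -> str:
        phases = self.phases()
        return max(phases, key=phases.get)


trace = RequestTrace(arrived=0.0, prefill_start=0.02, prefill_end=0.10,
                     transfer_end=0.35, first_token=0.38)
print(trace.phases(), trace.slowest_phase())
```

Aggregating `slowest_phase()` across requests is a quick way to tell "stuck in transfer" apart from "stuck in decode".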



Additional FAQ

Q: What's the simplest disagg deployment?

Same-node disagg: dedicate some GPUs in a node to prefill, others to decode. NVLink handles KV transfer. Production-deployable with SGLang or TRT-LLM.

Q: When is uniform-pool simpler and adequate?

Small deployments (< 1k QPS), single-workload traffic, when engineering capacity for disagg complexity is unavailable.

Q: How does disagg interact with multi-tenant LoRA?

Both prefill and decode pools must load the right adapter. Cache-affine routing keeps per-tenant requests on consistent adapters. See multi-tenant LoRA serving.

Q: Does disagg help with cold-start latency?

No. Disagg does not shorten cold model load, and both pools must load weights before serving; its benefits apply only once workers are warm.

Q: Can disagg run on AMD MI300X?

Yes; SGLang and vLLM AMD builds support disagg. Less mature than NVIDIA path.

Q: What's the right monitoring for KV transfer?

p50/p99 transfer time, per-link bandwidth utilization, failure count, retransmit rate.

Q: Does disagg help streaming responses?

Streaming is inherent to decode; disagg makes decode pool more efficient, which benefits streaming.

Q: How does cache-affine routing work in disagg?

Route requests with same prefix to same decode pod (where the KV cache lives). Prefill happens once; decode reuses. Substantial savings for chat with system prompts.

Q: What's the KV transfer cost for a 1M-token context?

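For Llama 70B with GQA (8 KV heads) in BF16, KV is ~328 KB per token, so 1M tokens is ~330 GB (half that with FP8 KV). At 1.8 TB/s NVLink: ~180 ms. At 50 GB/s InfiniBand: ~6.6 s. A cache this large also exceeds a single GPU's HBM, so it has to be sharded or tiered.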

Q: Does disagg help reduce tail latency?

Yes. Mixed prefill/decode on uniform pool has high tail latency from phase interference. Disagg eliminates this.

Q: How does Mooncake's CPU-memory KV pool work?

KV cache spills to CPU DRAM when HBM is full. Decode pulls KV from DRAM via PCIe / GDR. Slower per-token but enables much larger concurrent session count.

Q: What's the network cost of disagg KV transfer at frontier scale?

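For 10k QPS at ~10 GB of BF16 KV per request (the 4096-token figure from the transfer bandwidth math), halved by FP8: 10k × 5 GB = 50 TB/s of sustained KV traffic. That is ~28× the 1.8 TB/s of a single NVLink connection, so the traffic has to be spread across many links. Confirms NVL72 or fast IB is mandatory.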

Q: Can disagg span heterogeneous GPU types?

Yes. Prefill on H200 (cheaper compute), decode on B200 (faster memory). Common pattern. KV format must be compatible.

Q: Does disagg play well with PagedAttention?

Yes. PagedAttention is local to each pool; KV transfer moves the paged blocks. Most disagg engines integrate cleanly.

Q: What's the operational complexity overhead?

Disagg adds: separate pool sizing, KV transfer monitoring, P/D ratio tuning, failure modes. Roughly 20-40% more operational complexity than uniform pool.


Mooncake architecture deep dive

Mooncake (Qin et al., July 2024) is Moonshot AI's production disaggregated serving system. The most-documented production disagg architecture.

Layered architecture

  • Prefill workers: GPU instances dedicated to prompt processing. High throughput per session.
  • Decode workers: GPU instances dedicated to token generation. Optimized for high concurrency.
  • Distributed KV pool: KV cache stored across cluster DRAM, with HBM as a hot tier. Workers fetch on demand.
  • Conductor: scheduler that routes requests, manages KV placement, and handles cache eviction.

KV pool design

Each KV cache page (in PagedAttention sense) has a unique key derived from prefix tokens. Pages can live in:

  • Decode worker HBM (hottest).
  • Decode worker CPU DRAM (warm).
  • Remote DRAM in dedicated KV pool servers (cold).

Pages migrate across tiers based on access pattern. Cross-tier movement uses RDMA.

Prefix caching

Mooncake's design naturally enables prefix caching across all workers. Same-prefix requests find the same KV pages regardless of which decode worker handles them.

Production scale

Moonshot's Kimi service runs on Mooncake. Public stats from the paper suggest 75% capacity gains vs uniform serving on production workloads.

Lessons from Mooncake

  • CPU DRAM is a viable KV tier when HBM is saturated.
  • Cross-worker KV transfer is faster than expected with modern RDMA.
  • Prefix caching at scale captures large fractions of traffic.

DistServe architecture deep dive

DistServe (Zhong et al., 2024, arXiv:2401.09670) is the academic foundation for goodput-optimized disaggregation.

Goodput definition

"Goodput" = requests served per second that meet both TTFT and TPOT SLO targets. Pure throughput counts requests that violate SLO; goodput does not.

Goodput optimization

DistServe shows that goodput is maximized by phase separation. The intuition: a uniform pool serving mixed prefill+decode has phase interference that pushes some requests above SLO. Disagg lets each phase run at its optimal batch size and arithmetic intensity, hitting more SLOs.

Reported gains

DistServe paper reports up to 4.48× goodput improvement vs uniform serving on representative workloads. Real-world gains are smaller (1.5-3×) but consistently positive.

Algorithmic contributions

  • Per-phase batch composition strategies.
  • Goodput-aware scheduling.
  • KV transfer scheduling that pipelines with compute.

Splitwise architecture deep dive

Splitwise (Patel et al., 2023) is Microsoft Research's foundational disagg paper, motivated by Azure production data.

Findings

  • Prefill is compute-bound; decode is memory-bound. The phases have opposite optimal hardware.
  • Mixed-phase serving wastes one or the other resource on every GPU.
  • Phase splitting (disagg) improves cluster throughput by ~40-50% in Microsoft's analyses.

Production at Azure

Microsoft Research publications attribute Azure efficiency gains in part to phase-splitting principles. Specifics are not public; the public paper is the documentation.

Influence

Splitwise predates DistServe and Mooncake; both cite it as foundational. Most production disagg systems incorporate Splitwise's phase-splitting principle.


NVIDIA Dynamo architecture

NVIDIA Dynamo is NVIDIA's production-supported disaggregation framework, announced in 2025 with NIXL and Triton integration.

Components

  • Triton Inference Server: request routing, batching, observability.
  • TensorRT-LLM: the inference engine inside each worker.
  • NIXL: KV-transfer library optimized for NVLink, IB, and GPUDirect.
  • Planner and Smart Router: the disagg orchestration layer, which sizes the pools and routes requests with KV-cache awareness.

Design principles

  • Same-rack first (NVLink for KV transfer).
  • Cross-rack support via InfiniBand.
  • Production-grade reliability and observability.
  • NVIDIA-native; tight integration with H100/H200/B200 features.

When to use Dynamo

NVIDIA-aligned production deployments. Customers running TRT-LLM at scale typically adopt Dynamo for disagg.


P/D scheduling: queue-length-aware and latency-aware

The scheduler decides which prefill worker and which decode worker handle each request.

Queue-length-aware routing

Route to the prefill worker with the shortest queue. Standard load-balancing extension to disagg. Works well at low contention; can oscillate at high load.

Latency-aware routing

Route based on predicted latency: queue depth × per-request time. More accurate; requires online estimation.

Cache-affine routing

For decode workers, route to the worker holding the KV for this prefix. Maximizes prefix-cache hits.

Composable routing

Production schedulers combine: cache-affine for decode (prefix hit benefit dominates), queue-aware for prefill (no per-worker cache to preserve), latency-aware overlay.
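
A composed routing decision can be sketched in a few lines. The worker records, field names, and the `max_sessions` admission cap below are hypothetical; real schedulers track far more state. This only illustrates latency-aware selection for prefill and cache-affine selection for decode.

```python
from dataclasses import dataclass, field


@dataclass
class PrefillWorker:
    name: str
    queue_depth: int
    est_seconds_per_request: float  # assumed to come from an online estimator

    def predicted_wait(self) -> float:
        return self.queue_depth * self.est_seconds_per_request


@dataclass
class DecodeWorker:
    name: str
    active_sessions: int
    cached_prefixes: set = field(default_factory=set)


def pick_prefill(workers):
    """Latency-aware: lowest predicted wait, not just the shortest queue."""
    return min(workers, key=lambda w: w.predicted_wait())


def pick_decode(workers, prefix_hash: str, max_sessions: int):
    """Cache-affine: prefer workers holding the prefix, then the least loaded."""
    candidates = [w for w in workers if w.active_sessions < max_sessions]
    if not candidates:
        raise RuntimeError("decode pool is full")
    hits = [w for w in candidates if prefix_hash in w.cached_prefixes]
    return min(hits or candidates, key=lambda w: w.active_sessions)


prefill_pool = [PrefillWorker("p0", 4, 0.05), PrefillWorker("p1", 1, 0.30)]
decode_pool = [DecodeWorker("d0", 10, {"sys-v1"}), DecodeWorker("d1", 2)]

chosen_prefill = pick_prefill(prefill_pool).name
chosen_hit = pick_decode(decode_pool, "sys-v1", max_sessions=64).name
chosen_miss = pick_decode(decode_pool, "other", max_sessions=64).name
print(chosen_prefill, chosen_hit, chosen_miss)
```

In the example, pure queue-length routing would pick `p1` (one queued request), while the latency-aware rule picks `p0` because its queued requests are short.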

Scheduler implementations

Mooncake's Conductor, Dynamo's Smart Router, SGLang's router all implement variations. Open-source reference: SGLang's router code.


Prefix caching with disaggregation

Prefix caching is a major lever; disagg interacts with it.

Mechanics

  • Hash prompt prefix.
  • Look up cached KV for hash.
  • If hit, skip prefill for that prefix.
  • Decode from the cached KV.
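
The steps above can be sketched with chained block hashes, so that a block's key depends on every token before it. This is an illustration only: the 4-token block size is artificially small, SHA-256 over a comma-joined string is an arbitrary choice, and a plain `set` stands in for the KV store.

```python
import hashlib

BLOCK = 4  # tokens per cache block; real systems use larger blocks


def block_hashes(tokens, block: int = BLOCK):
    """Hash each full block, chaining in the previous block's hash."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block  # ignore a trailing partial block
    for i in range(0, full, block):
        chunk = ",".join(map(str, tokens[i:i + block])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


def cached_prefix_len(tokens, cache: set, block: int = BLOCK) -> int:
    """Number of leading tokens whose KV is already cached."""
    matched = 0
    for h in block_hashes(tokens, block):
        if h not in cache:
            break
        matched += block
    return matched


# Populate the cache with an 8-token system prompt.
cache = set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8]))

same_prefix = cached_prefix_len([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], cache)
half_prefix = cached_prefix_len([1, 2, 3, 4, 9, 9, 9, 9], cache)
no_prefix = cached_prefix_len([0, 2, 3, 4, 5, 6, 7, 8], cache)
print(same_prefix, half_prefix, no_prefix)
```

Prefill then only needs to run on `tokens[matched:]`. Chaining the hashes is what stops a block from matching when an earlier token differs.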

Cross-worker prefix caching

Mooncake's distributed KV pool naturally supports this. Other systems require explicit cross-worker cache lookup.

Hit rate impact

For chat with consistent system prompts: 30-70% cache hit rate. Massive cost savings.

Cache invalidation

When the system prompt changes, invalidate old entries. Simple TTL or explicit invalidation.

Interaction with quantization

KV must be cached in same precision as serving. FP8 KV cache for FP8 serving.


SARATHI: chunked prefill alternative

SARATHI (Agrawal et al., 2023) is the chunked-prefill alternative to disagg.

Mechanism

Instead of separating prefill and decode pools, batch a chunk of one prefill with many decodes in a single GPU step. The mixed batch keeps the GPU busy throughout.
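
One scheduling step of this mixed batching can be sketched as follows. It is a simplification under stated assumptions: a fixed per-step token budget, every decode contributing exactly one token, and at most one prefill chunk per step. `build_step` is an illustrative name, not an API from SARATHI or vLLM.

```python
def build_step(decode_ids, prefill_queue, token_budget: int):
    """Build one mixed batch.

    decode_ids: ids of in-flight sequences, each needing 1 token this step.
    prefill_queue: FIFO list of (request_id, prompt_tokens_remaining).
    Returns (batch, updated_prefill_queue), where batch is a list of (id, tokens).
    """
    batch = [(rid, 1) for rid in decode_ids[:token_budget]]
    budget = token_budget - len(batch)
    queue = list(prefill_queue)
    if queue and budget > 0:
        rid, left = queue[0]
        chunk = min(left, budget)  # fill the leftover budget with prefill work
        batch.append((rid, chunk))
        if chunk == left:
            queue.pop(0)           # prompt fully processed
        else:
            queue[0] = (rid, left - chunk)
    return batch, queue


decodes = ["d1", "d2", "d3"]
batch1, queue1 = build_step(decodes, [("p1", 1000)], token_budget=512)
batch2, queue2 = build_step(decodes, queue1, token_budget=512)
print(batch1, queue1)
print(batch2, queue2)
```

The 1000-token prompt is absorbed over two steps while the three decodes keep emitting a token every step, which is the interference-limiting effect chunked prefill is after.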

Pros vs disagg

  • No KV transfer.
  • Simpler operational model.
  • Works on uniform pool.

Cons vs disagg

  • Cannot specialize hardware per phase.
  • Mixed-batch scheduling complexity.
  • Less effective at frontier scale.

When to choose SARATHI

For mid-scale deployments where uniform pool is simpler and disagg's gains don't justify the engineering investment.

Production state

vLLM's chunked-prefill mode implements SARATHI-style mixing. TRT-LLM supports similar. Many production deployments use chunked prefill as a "disagg-lite" pattern.


2026 trends: NVL72 and the disagg shift

GB200 NVL72 changes the disagg calculus.

NVL72 reduces some disagg need

A single NVL72 rack acts as one giant GPU with roughly 13.5 TB of HBM3e and 1.8 TB/s of NVLink bandwidth per GPU across all 72 GPUs. The phase-interference problem disagg solves is smaller within NVL72 because intra-rack bandwidth is so high.

NVL72 enables larger disagg

Cross-NVL72 disagg (one rack prefill, another rack decode) becomes the natural unit. Internal-rack disagg is less critical.

Multi-DC disagg emerges

WAN improvements (1 Tbps+ DCI) make multi-DC disagg plausible. Production experiments in 2026.

Reasoning-model deployment is decode-pool dominated

Disagg's value rises with reasoning. Decode pool size grows; prefill pool stays small.

Operator-friendly defaults

Most operators in 2026:

  1. Start with chunked prefill (SARATHI-style) on uniform pool.
  2. Move to same-node disagg when scale justifies.
  3. Multi-node disagg at frontier scale only.

Disaggregation for fine-tuning workloads

Fine-tuning is mostly training-side, but inference for evaluation during fine-tuning benefits from disagg.

Eval-during-training

During RLHF / DPO, the policy model serves inference on the held-out set. Disagg helps if eval volume is high.

Per-checkpoint serving

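After each fine-tune, serve the new checkpoint for human eval. Disagg's KV-transfer plumbing carries over unchanged across checkpoints if the architecture is identical, although KV computed by one checkpoint is not valid for another and must be recomputed.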


Disaggregation in multi-tenant serving

Multi-tenant serving (multiple customers sharing infrastructure) has specific disagg considerations.

Per-tenant pools

High-priority tenants get dedicated prefill and decode pools. This is the most expensive option.

Shared pools with priority

Standard pattern: shared pools with priority queues. Disagg helps each pool stay busy.
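One way to sketch a shared-pool priority queue, assuming integer priority tiers where a lower number is served first. The `TenantQueue` class and the tier values are hypothetical; production schedulers usually add aging or quotas so low-priority tenants are not starved.

```python
import heapq
import itertools


class TenantQueue:
    """Lower priority number is served first; FIFO within a priority level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, priority: int, request_id: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = TenantQueue()
queue.push(1, "free-1")
queue.push(0, "premium-1")
queue.push(1, "free-2")
queue.push(0, "premium-2")
served = [queue.pop() for _ in range(len(queue))]
```

Here `served` comes out as both premium requests in arrival order, followed by both free-tier requests in arrival order.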

Per-tenant cache

Scope KV cache lookup and invalidation per tenant. Prefixes hit within a tenant and always miss across tenants, so one tenant's cached prompts are never served to another.
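A hedged sketch of tenant-scoped cache keys, assuming a block-hashed prefix cache where each block's key chains from the previous block's key. The `block_keys` function and the 16-token block size are assumptions for illustration, not any particular engine's scheme.

```python
import hashlib


def block_keys(tenant_id: str, token_ids: list[int], block_size: int = 16) -> list[str]:
    """Hash each full block of tokens, chained to the previous block's hash.

    Seeding the chain with the tenant ID means two tenants never produce
    the same key, even for identical prompts.
    """
    keys = []
    prev = hashlib.sha256(f"tenant:{tenant_id}".encode()).hexdigest()
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(prev)
    return keys
```

Two prompts from the same tenant that share their first 32 tokens share their first two keys, while the same prompt under a different tenant ID shares none.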

LoRA + disagg

Per-tenant LoRA adapters loaded on demand. Disagg's cache-affine routing extends to "tenant-affine" routing. See multi-tenant LoRA serving.
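Tenant-affine routing can be sketched with rendezvous (highest-random-weight) hashing, so that a tenant keeps landing on the worker that already holds its adapter and cache. The `pick_worker` function and the worker names are hypothetical; a real router would also weigh load and adapter residency.

```python
import hashlib


def pick_worker(tenant_id: str, workers: list[str]) -> str:
    """Return the worker with the highest hash score for this tenant.

    The mapping is stable: removing a worker only remaps the tenants
    that were assigned to that worker.
    """
    def score(worker: str) -> int:
        digest = hashlib.sha256(f"{tenant_id}|{worker}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(workers, key=score)


workers = ["decode-0", "decode-1", "decode-2", "decode-3"]
home = pick_worker("tenant-a", workers)
```

The design choice here is stability over balance: plain modulo hashing would reshuffle most tenants whenever the pool is resized, which throws away warm adapters and caches.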


Disagg benchmarks and reported gains

What the literature and production reports show.

Throughput / goodput improvements

| Source | Workload | Gain vs uniform |
| --- | --- | --- |
| DistServe paper | Mixed chat + RAG | Up to 4.48× goodput |
| Splitwise (Microsoft) | Production Azure | ~1.5-2× cost efficiency |
| Mooncake paper | Moonshot Kimi | ~75% capacity gain |
| Common production reports | Chat | 1.3-1.8× throughput |
| Common production reports | Reasoning | 2-3× throughput |
| Common production reports | Long-context | 1.5-2.5× throughput |

The headline: 1.5-3× is the typical real-world win. Paper numbers are higher because they optimize for benchmark conditions.

Latency improvements

Disagg tightens p99 TTFT and p99 TPOT because phase interference is eliminated.
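To check these claims on your own traffic, a small sketch like the following can compute tail latencies and SLO attainment from per-request measurements. The field names `ttft_s` and `tpot_s`, the sample values, and the SLO thresholds are assumptions. Goodput in the DistServe sense is then the highest request rate at which attainment stays above a target such as 90%.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slo_attainment(requests, ttft_slo_s, tpot_slo_s):
    """Fraction of requests that meet both the TTFT and TPOT targets."""
    ok = sum(1 for r in requests
             if r["ttft_s"] <= ttft_slo_s and r["tpot_s"] <= tpot_slo_s)
    return ok / len(requests)


# Hypothetical measurements: time to first token and time per output token, in seconds.
sample = [
    {"ttft_s": 0.2, "tpot_s": 0.03},
    {"ttft_s": 0.9, "tpot_s": 0.03},   # misses the TTFT target
    {"ttft_s": 0.2, "tpot_s": 0.08},   # misses the TPOT target
    {"ttft_s": 0.3, "tpot_s": 0.04},
]
attainment = slo_attainment(sample, ttft_slo_s=0.5, tpot_slo_s=0.05)
p99_ttft = percentile([r["ttft_s"] for r in sample], 99)
```

Running the same computation before and after enabling disagg, at the same request rate, shows whether the tail actually tightened.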

Cost reductions

For mixed workloads at scale, 20-50% capex savings vs uniform pool. Higher for reasoning workloads.


Practical disagg deployment guide

How to actually deploy disagg in 2026.

Step 1: measure current workload

Collect mean and p99 input tokens, mean and p99 output tokens, QPS, and concurrent sessions. These numbers indicate whether disagg is likely to help.
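A minimal profiling sketch, assuming the trace is available as `(input_tokens, output_tokens)` pairs per request. The `profile` function and its output keys are illustrative; QPS and concurrency need timestamps and are left out here.

```python
import math
from statistics import mean


def profile(requests):
    """Summarize a trace of (input_tokens, output_tokens) pairs."""
    def p99(values):
        ordered = sorted(values)
        return ordered[max(1, math.ceil(0.99 * len(ordered))) - 1]

    inputs = [r[0] for r in requests]
    outputs = [r[1] for r in requests]
    return {
        "mean_in": mean(inputs),
        "p99_in": p99(inputs),
        "mean_out": mean(outputs),
        "p99_out": p99(outputs),
        # Share of all tokens that are prompt tokens: a rough prefill-heaviness signal.
        "input_token_share": sum(inputs) / (sum(inputs) + sum(outputs)),
    }
```

A high `input_token_share` points at a prefill-heavy workload (RAG, long context), and a low one points at a decode-heavy workload (reasoning, long generations).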

Step 2: pick a stack

  • SGLang: open-source frontier-aligned.
  • TRT-LLM + Dynamo: NVIDIA-aligned.
  • Mooncake: full platform not generally available outside Moonshot; its KV transfer engine is open source.
  • vLLM: prototype; production-grade emerging.

Step 3: start small (same-node)

Run two prefill GPUs and six decode GPUs on one 8x H100 node, with KV transferred over NVLink. This stays operationally manageable.

Step 4: monitor and tune

Adjust the P/D ratio based on observed traffic, and track KV transfer latency and per-pool utilization.
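One rough way to turn observed traffic into a P/D split, assuming you have measured per-GPU prefill and decode throughput on your own model and hardware. The `split_gpus` function and every number in the example are hypothetical.

```python
def split_gpus(total_gpus, prefill_tok_per_s, decode_tok_per_s,
               prefill_cap_per_gpu, decode_cap_per_gpu):
    """Split GPUs in proportion to the GPU count each phase demands.

    Demand and per-GPU capacity are both in tokens/second and must be
    measured on the real model and hardware.
    """
    prefill_need = prefill_tok_per_s / prefill_cap_per_gpu
    decode_need = decode_tok_per_s / decode_cap_per_gpu
    share = prefill_need / (prefill_need + decode_need)
    # Keep at least one GPU in each pool.
    n_prefill = min(total_gpus - 1, max(1, round(total_gpus * share)))
    return n_prefill, total_gpus - n_prefill


# Hypothetical traffic: 40k prompt tokens/s and 6k generated tokens/s,
# with assumed capacities of 20k prefill and 1k decode tokens/s per GPU.
n_prefill, n_decode = split_gpus(8, 40_000, 6_000, 20_000, 1_000)
```

With these made-up numbers the split is two prefill GPUs and six decode GPUs. Recomputing it as the traffic mix drifts is what keeps one pool from idling while the other queues.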

Step 5: scale up

Cross-node disagg when single-node hits capacity. Cross-rack at very large scale.

Common pitfalls

  • P/D ratio mismatched to workload → one pool idles while the other queues.
  • KV transfer becomes bottleneck if fabric is slow.
  • Cache thrashing if KV pool too small.
  • Failure handling not designed → cascading failures.

Disaggregation summary table

The full landscape in one table.

| Aspect | Uniform pool | Chunked prefill (SARATHI) | Same-node disagg | Multi-node disagg |
| --- | --- | --- | --- | --- |
| Operational complexity | Low | Medium | Medium-high | High |
| Throughput gain | baseline | 1.2-1.5× | 1.5-2× | 1.5-3× |
| Latency improvement | baseline | small | substantial | substantial |
| Engineering investment | minimal | modest | substantial | major |
| Best for | small deployments | mid-scale | large single-rack | frontier multi-rack |
| KV transfer cost | none | none | NVLink | NVLink + IB |
| Reference systems | vLLM default | vLLM, TRT-LLM | SGLang, TRT-LLM | Mooncake, Dynamo |

Disagg interactions with other techniques

How disagg composes with other inference optimizations.

Disagg + speculative decoding

The draft model runs on the decode pool, and the target model verifies its proposed tokens. Speculative gains compound with disagg gains. See speculative decoding.

Disagg + quantization

FP8 KV cache halves KV transfer cost relative to FP16. FP4 weights reduce decode pool memory pressure. Both are essential at scale. See quantization tradeoffs.
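The halving follows directly from the size of the cache. A back-of-envelope sketch, assuming a hypothetical model shape (80 layers, 8 KV heads, head dimension 128) and an idealized link with no protocol overhead; the function names and the 50 GB/s figure are illustrative only.

```python
def kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache for `tokens` tokens: K and V, for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_ms(num_bytes, link_gb_per_s):
    """Idealized transfer time in milliseconds at a given bandwidth in GB/s."""
    return num_bytes / (link_gb_per_s * 1e9) * 1e3


# Hypothetical 8,192-token prompt on the assumed model shape.
fp16 = kv_bytes(8192, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)
fp8 = kv_bytes(8192, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1)

fp16_ms = transfer_ms(fp16, link_gb_per_s=50)  # assumed effective link bandwidth
fp8_ms = transfer_ms(fp8, link_gb_per_s=50)
```

Under these assumptions the FP16 cache is 2,684,354,560 bytes (2.5 GiB) and the FP8 cache is exactly half of that, so the idealized transfer time halves as well.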

Disagg + MoE

MoE adds expert parallelism (EP), and disagg composes with it: the prefill pool runs the full MoE forward pass once per prompt (compute-bound), while the decode pool runs it once per generated token (memory-bandwidth-bound). All-to-all communication happens within each pool. See mixture-of-experts serving.

Disagg + reasoning

This is the most synergistic combination. Reasoning is decode-dominated, and disagg lets the decode pool scale independently. Production reasoning deployments use disagg by default in 2026.

Disagg + multi-tenant LoRA

Per-tenant adapters are loaded in decode workers, and cache affinity becomes tenant-aware. See multi-tenant LoRA serving.

Disagg + RAG

RAG inflates prompt length, shifting the workload toward prefill. The P/D ratio should shift toward more prefill capacity accordingly. See RAG production architecture.

Disagg + multimodal

Vision encoding is prefill-like: one compute-heavy pass over the whole input. Token generation afterwards is ordinary decode work. Some deployments place the vision encoder on the prefill pool. See multimodal serving.