
Long Context: The Complete Guide

The definitive guide to long-context LLMs: why attention is O(n²), how FlashAttention helps, position encoding tricks (RoPE, YaRN, NTK), ring attention at extreme scales, KV-cache pressure, and what advertised context lengths actually deliver.

By Prompt20 Editorial · 95 min read

Context lengths kept growing — 8k, 32k, 128k, a million. On paper, the model architecture barely changed. In practice, almost everything around the model changed: the attention kernel, the position encoding, the cache layout, the network topology, and what "out of memory" means.

The take: advertised context length is marketing; effective context length is what matters, and the gap is wider than the field admits. The "Lost in the Middle" finding (Liu et al., 2023) and the RULER benchmark (Hsieh et al., 2024) both document substantial quality degradation well below advertised limits. As a working assumption, plan for effective context around 1/4 of advertised on retrieval-heavy tasks. For most workloads, RAG over a smart context budget beats raw long context on cost and quality. Long context wins for true global-reasoning tasks; it's not a replacement for retrieval.

This guide is about what gets hard, not which model is longest. It covers the two scaling problems (O(n²) attention compute, O(n) KV memory); the kernel-level fix (FlashAttention 1/2/3); position-encoding strategies (RoPE, ALiBi, YaRN, NTK-aware) and what each means for context extension; distributed attention via ring attention and DeepSpeed-Ulysses-style sequence parallelism; sliding-window and sparse alternatives; the KV-cache pressure that dominates serving cost; and the 1M+ context production realities: what works, what doesn't, and where claims part ways with measured quality. Pair it with our guides on the KV cache, quantization tradeoffs (KV quantization is the dominant practical win), NVLink and rack-scale topology, and disaggregated inference. The honest answer: "long context" is rarely the right primary optimization, but when it is, every layer of the stack changes.

Table of contents

  1. Key takeaways
  2. Mental model: long-context attention in one minute
  3. The long-context landscape in 2026
  4. The two scaling problems
  5. FlashAttention and IO-aware attention
  6. Position encoding: RoPE and friends
  7. Extending context without retraining
  8. RoPE vs ALiBi vs YaRN
  9. Ring attention and sequence parallelism
  10. Sequence parallelism patterns: Ring, Ulysses, Context-Parallel
  11. Sliding window vs full attention
  12. 1M+ context production realities
  13. Sparse and approximate attention
  14. KV-cache pressure at long contexts
  15. Long context vs retrieval
  16. Evaluating long-context quality
  17. Hardware considerations
  18. Production deployments
  19. Open problems
  20. Attention sinks and StreamingLLM
  21. Sparse attention deep dive: Longformer, BigBird, Native Sparse Attention
  22. Linear attention and state-space models: Mamba, RWKV, GLA
  23. Hybrid architectures: Jamba, Recurrent Llama, Falcon-Mamba
  24. SWA + global tokens: Mistral, Gemma, Gemini patterns
  25. Long-context evaluation deep dive: NIAH, RULER, BABILong, InfiniteBench
  26. Per-model 2026 long-context details
  27. Production serving math for million-token KV
  28. Context dilution and remedies
  29. YaRN/PI/NTK-aware extension details
  30. Block-sparse routing and learned compression
  31. FlashAttention generations: FA1, FA2, FA3 mechanics
  32. Decision math: RAG vs long-context vs fine-tune worked examples
  33. Evaluation pitfalls and methodology
  34. Production checklists for shipping long-context
  35. Long-context cost tables by model and hardware
  36. Per-model long-context details, 2026 snapshot
  37. Long-context training: why pretraining at scale is hard
  38. The bottom line
  39. FAQ
  40. Glossary
  41. References

Key takeaways

  • Attention is O(n²) in compute and the KV cache is O(n) in memory per layer. Long context taxes both.
  • FlashAttention removed the n² memory cost by tiling. Compute is still O(n²), but with the n×n score matrix gone, KV-cache memory becomes the binding constraint in serving.
  • RoPE is the standard position encoding. YaRN / NTK-aware scaling extends trained context windows further at modest fine-tuning cost.
  • Ring attention distributes a single sequence across many GPUs. Required for million-token contexts.
  • KV cache is the dominant memory cost at long context. KV quantization is the largest practical optimization.
  • Advertised context length ≠ effective context length. Many models claim long windows but degrade quality in the middle. Always evaluate with workload-representative long-context tasks.
  • For most use cases, retrieval-augmented generation beats raw long context on cost and quality. Long context wins for tasks requiring true global reasoning.

Quick comparison: long-context techniques

Technique | Compute | Memory | Quality at full length | When to use
Full attention + FlashAttention | O(n²) | O(n) KV | Best | Default for moderate context (≤128k)
RoPE + YaRN extension | O(n²) | O(n) KV | Good with fine-tune | Extending a trained model past its base length
Ring attention | O(n²) distributed | O(n/N) per GPU | Same as full | Single sequence > one GPU's HBM (1M+ tokens)
Sliding window (window w) | O(n·w) | O(w) KV | Local-only | Code completion, chat continuation
Sparse / global tokens (k attended per token) | O(n·k) | O(n) KV | Task-dependent | Mixed-strategy architectures
State-space (Mamba) | O(n) | O(1) state | Behind frontier | Research / niche long-streaming workloads
RAG over short context | O(k²) for chunk k | O(k) KV | Depends on retriever | Document QA, massive corpora

Numbers are big-O; constants and effective context vary by model. See References for the underlying papers.


Mental model: long-context attention in one minute

The named problem is the quadratic attention wall. Every new token has to attend to every previous token, so doubling the context quadruples the work and doubles the KV cache. For most of the 2017–2022 transformer era the wall was hidden inside fast kernels and short contexts. At 128k it dominates latency; at 1M it dominates the whole rack.

Attention is best understood as an O(n²) handshake protocol. Each token shakes hands with every other token, asks "are you relevant to me?", and weights the answer. With 1,000 tokens that's a million handshakes — fine. With 128,000 it's 16 billion handshakes per layer, and the protocol leaves behind a KV-cache "guest list" that has to stay in HBM until the conversation ends.

Aspect | Short context (≤8k) | Long context (128k+)
Attention FLOPs | small slice of total | dominant at decode
KV cache footprint | negligible | tens of GB per request
Position encoding | trained as-is | RoPE + YaRN / NTK extension
Attention kernel | any | FlashAttention-2/3 mandatory
Parallelism axis | TP / PP only | + sequence parallelism (Ring, Ulysses)
Sticky number | n/a | Llama-3 70B at 128k: ~42 GB KV alone

In code, the production move is a kernel swap rather than a math change:

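import math
from flash_attn import flash_attn_func

# eager attention: O(n²) memory for the n×n score matrix
# Q, K, V: (batch, heads, n, d); causal mask omitted here for brevity
attn = (Q @ K.transpose(-1, -2) / math.sqrt(Q.shape[-1])).softmax(-1) @ V

# FlashAttention: same math, tiled IO, O(n) extra memory
# expects fp16/bf16 CUDA tensors laid out as (batch, n, heads, d)
attn = flash_attn_func(Q.transpose(1, 2), K.transpose(1, 2), V.transpose(1, 2), causal=True).transpose(1, 2)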

FlashAttention does not break the O(n²) compute wall; it removes the O(n²) memory wall by streaming the softmax across tiles. The compute itself stays quadratic. Ring attention and sequence parallelism do not reduce it either: they split the sequence across GPUs and pass KV shards around a ring, so the work and the KV cache are spread over many devices instead of one.

The honest framing: long context is not one technique, it's a stack — IO-aware kernels, position-encoding extensions, KV-cache discipline, and sequence parallelism — applied in order as n grows. The rest of this guide is that ladder.


The long-context landscape in 2026

A field map of the techniques, papers, and stacks you'll encounter when planning a long-context deployment.

Position encodings. RoPE (Su et al., 2021, arXiv:2104.09864) — rotary embeddings, the production default. ALiBi (Press et al., 2021, arXiv:2108.12409) — linear attention bias, used by some labs (Bloom, MPT). NoPE — no positional encoding, surprisingly competitive on some tasks. Learned absolute positions — the original transformer design, now largely abandoned for long context.

Context-extension methods. Position Interpolation (linear compression), NTK-aware scaling (frequency-band-aware), YaRN (Peng et al., 2023, arXiv:2309.00071 — the dominant extension method), LongRoPE (per-dimension scaling), and full length-extended pretraining (just train at longer sequences, expensive but reliable).

Attention kernels. FlashAttention 1/2/3 (Dao et al., arXiv:2205.14135 and successors), xFormers, FlexAttention (PyTorch's mask-flexible API), Tri Dao's flash-attn package, Triton-based attention (see our Triton kernel primer), and the vendor BLAS-level paths in cuDNN and hipBLASLt.

Distributed attention. Ring Attention (Liu et al., 2023, arXiv:2310.01889), Striped Attention (load-balanced ring variant), DeepSpeed-Ulysses (Jacobs et al., 2023, arXiv:2309.14509) — sequence parallelism by head sharding, and Context Parallelism (NVIDIA's NeMo / Megatron variant). Each trades comm volume vs comm topology differently.

Sparse and approximate attention. Sliding-window (Mistral, Gemma), sparse global tokens (Longformer, BigBird), block-sparse (used in Llama 4 lineage), Mamba and Mamba-2 (Gu & Dao, 2023, arXiv:2312.00752) and other state-space alternatives, RWKV, RetNet, and hybrid attention/state-space architectures (Jamba, Zamba).

KV-cache compression. FP8 KV (production default), KIVI (2-bit KV), H2O (heavy-hitter eviction), StreamingLLM (attention sinks), GEAR (outlier-aware), PyramidKV (per-layer budget tapering), and SnapKV (prompt-aware compression).

Serving systems with long-context paths. vLLM (paged attention, FP8 KV, chunked prefill), SGLang (RadixAttention for prefix sharing), TensorRT-LLM (chunked prefill, sequence parallelism, KV quantization), Mooncake (disaggregated KV pool across replicas), and lmdeploy.

Production models with long context (mid-2026). Claude family (well-evaluated 200k+), Gemini (1M–2M, leading on absolute length), GPT-4o / GPT-5 lineage (long but not market-leading), DeepSeek-V3 / R1 (128k), Qwen3 (128k+ with YaRN), Llama 3.x / 4.x (128k), Mistral Large (128k with sliding window mix).

The honest summary: 128k is now table stakes; 1M is achievable but the effective context (RULER-style) is much shorter than the label. Serving 1M requires both architectural commitments (ring attention, KV quantization) and an evaluation discipline that most teams skip.

Effective vs advertised context: a reality check

A condensed snapshot of RULER scores from the published literature and community evaluations at May 2026:

Model | Advertised | RULER 32k | RULER 128k | RULER 1M
Claude 3.7 Sonnet | 200k | 95% | 91% | n/a
Gemini 2.0 Pro | 2M | 94% | 87% | 73%
GPT-4 Turbo | 128k | 92% | 82% | n/a
Llama 3.1 70B | 128k | 88% | 74% | n/a
Qwen2.5 72B | 128k (YaRN) | 86% | 70% | n/a
DeepSeek-V3 | 128k | 90% | 78% | n/a
Mistral Large 2 | 128k | 84% | 65% | n/a

Numbers are illustrative aggregates. The pattern: every model degrades, and the rate of degradation past 32k is much larger than the marketing suggests. A model that advertises 128k may functionally hold 64k or less on hard retrieval tasks. Plan accordingly.


The two scaling problems

Attention computes pairwise interactions between every pair of tokens. Two consequences:

Compute scales as O(n²). Prefilling a 128k-token prompt is 16× the attention compute of a 32k-token prompt, not 4×. For very long contexts, attention dominates the prefill cost (the rest of the model is O(n)).

Memory scales as O(n) per layer for the KV cache. A long prompt produces a large KV cache that sits in HBM for the duration of generation. For a 70B model at 128k context, the KV cache is ~42 GB per request at FP16. (See our KV cache memory guide for the per-model math.)
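
The per-request figure can be reproduced with a few lines of arithmetic. The sketch below assumes a Llama-3-70B-style shape (80 layers, 8 KV heads under GQA, head dimension 128); those shape values and the helper name `kv_cache_bytes` are illustrative assumptions, not taken from a specific serving stack. Depending on whether "128k" means 128,000 or 131,072 tokens, the FP16 result lands at roughly 42 to 43 GB.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one request: K and V, every layer, every token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_tokens * per_token


# Assumed Llama-3-70B-style shape: 80 layers, 8 KV heads (GQA), head_dim 128.
LLAMA3_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for n in (8_000, 32_000, 128_000, 1_000_000):
    fp16 = kv_cache_bytes(n, **LLAMA3_70B) / 1e9
    fp8 = kv_cache_bytes(n, bytes_per_elem=1, **LLAMA3_70B) / 1e9
    print(f"{n:>9,} tokens: {fp16:7.1f} GB at FP16, {fp8:7.1f} GB at FP8")
```

The footprint is linear in tokens and in bytes per element, which is why halving the KV precision halves the cache for the same context length.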

The compute problem is partly solved by better kernels (FlashAttention). The memory problem isn't — it's a hard limit on what fits in HBM, and the dominant constraint on long-context serving at scale.

Numbers for the O(n²) problem

For a 70B-class model (8192 hidden dim, 64 heads, 80 layers), the attention compute for a single layer's causal prefill scales as roughly 2 × n² × d, where d is the hidden dimension: QKᵀ and the score-times-V product each cost 2 × n² × d, and the causal mask skips about half of that work. Grouped-query attention shrinks the KV cache but not this count, because every query head still scores every key position. The total attention compute across all 80 layers for various n:

Prefill length n | Causal attention FLOPs (Llama-3-70B shape) | Lower-bound time at ~989 TFLOP/s
1k | ~1.3 TFLOPs | ~1.3 ms
4k | ~21 TFLOPs | ~21 ms
32k | ~1.3 PFLOPs | ~1.4 s
128k | ~21 PFLOPs | ~22 s
1M | ~1.3 EFLOPs | ~22 min

The numbers ignore the rest of the model (MLPs, layer norms), which scales linearly. At very long sequences, attention is 80%+ of the total compute. The H100 SXM is rated at ~989 dense BF16 TFLOP/s (about twice that for FP8); sustaining peak across a multi-minute prefill is unrealistic, so real-world numbers are 2-3× worse than peak math.
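
A back-of-the-envelope calculator makes the quadratic term and the attention share concrete. This is a sketch under stated assumptions: it counts only the two attention matmuls (2 × n² × d each per layer, halved for causal masking), uses the common "2 FLOPs per parameter per token" approximation for everything else, and the function names are hypothetical.

```python
def attention_prefill_flops(n, d_model=8192, n_layers=80, causal=True):
    """Score (QK^T) and value (PV) matmuls: 2*n*n*d_model FLOPs each per layer."""
    full = 4 * n * n * d_model * n_layers
    return full // 2 if causal else full


def linear_prefill_flops(n, n_params=70e9):
    """Common approximation for the rest of the model: 2 FLOPs per parameter per token."""
    return 2 * n_params * n


def attention_share(n):
    """Fraction of total prefill FLOPs spent in the attention matmuls."""
    attn = attention_prefill_flops(n)
    return attn / (attn + linear_prefill_flops(n))


for n in (1_000, 4_000, 32_000, 128_000, 1_000_000):
    flops = attention_prefill_flops(n)
    print(f"n={n:>9,}  attention={flops:.2e} FLOPs  share={attention_share(n):.0%}")
```

Under these assumptions the attention share is small at a few thousand tokens and only passes 80% somewhere between 128k and 1M, which is where the quadratic term takes over.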


FlashAttention and IO-aware attention

Naive attention computes Q·Kᵀ, materializes the full n×n attention matrix in HBM, applies softmax, then multiplies by V. For long sequences, materializing that n×n matrix costs O(n²) memory and dominates HBM traffic.

FlashAttention's idea: never materialize the full matrix. Tile the computation, keep intermediate values in fast on-chip memory (SRAM, registers), and compute attention block by block, accumulating the softmax statistics incrementally.

The math is the same. The IO pattern is dramatically better.
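
The incremental softmax bookkeeping is the part that makes tiling possible, so it is worth seeing in isolation. The following is a minimal pure-Python sketch for a single query vector, not the actual CUDA kernel: it walks the keys block by block, keeps a running maximum, a running denominator, and a running weighted sum, and rescales them whenever the maximum changes. Function names and the block size are illustrative assumptions.

```python
import math
import random


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then weight the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) * scale for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(w[i] * values[i][c] for i in range(len(values))) / z
            for c in range(len(values[0]))]


def tiled_attention(q, keys, values, block=4):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")               # running max of scores seen so far
    l = 0.0                         # running softmax denominator
    acc = [0.0] * len(values[0])    # running unnormalized output
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        s = [sum(a * b for a, b in zip(q, k)) * scale for k in kb]
        m_new = max(m, max(s))
        correction = math.exp(m - m_new)  # rescale old state to the new max
        p = [math.exp(x - m_new) for x in s]
        l = l * correction + sum(p)
        acc = [a * correction + sum(p[i] * vb[i][c] for i in range(len(vb)))
               for c, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]


random.seed(0)
q = [random.gauss(0, 1) for _ in range(8)]
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(37)]
values = [[random.gauss(0, 1) for _ in range(8)] for _ in range(37)]
ref = naive_attention(q, keys, values)
out = tiled_attention(q, keys, values, block=5)
print("max abs difference:", max(abs(a - b) for a, b in zip(ref, out)))
```

The extra state per query is a constant number of scalars plus one output vector, which is where the O(n) total memory figure comes from.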

Practical impact

  • Memory: O(n²) → O(n). A 128k-sequence attention no longer requires gigabytes of attention-matrix scratch space.
  • Speed: 2-5× faster on long sequences due to reduced HBM traffic.
  • The bottleneck moves from "attention compute" to "KV cache memory."

FlashAttention is now standard. Every modern transformer training stack and serving stack uses it (or a derivative). FlashAttention-2 and FlashAttention-3 added further kernel-level optimizations, with FlashAttention-3 targeting Hopper GPUs specifically.

What FlashAttention doesn't fix

  • The compute is still O(n²). For 1M-token contexts, the absolute FLOPs are enormous.
  • The KV cache is still O(n). FlashAttention helps the attention computation, not the cache that feeds it.

FlashAttention 1, 2, 3: what each generation added

FA1 (Dao et al., 2022) introduced the tiled IO-aware idea. It eliminated the n×n materialization and gave the field its first practical solution to long-context attention. Baseline against which everything else is measured.

FA2 (Dao, 2023) improved work partitioning across thread blocks and warps, and added support for variable-length sequences and grouped-query attention (GQA, the dominant attention variant for modern LLMs). The FA2 paper reports roughly 2× the throughput of FA1 on A100 for the same kernel shapes.

FA3 (Shah et al., 2024) is Hopper-specific. It exploits the asynchronous tensor cores (WGMMA), the new TMA (Tensor Memory Accelerator) for async HBM loads, and an FP8 attention path. On H100, FA3 reaches about 75% of peak FP16 throughput (up to ~740 TFLOP/s) on long sequences, against roughly 35% for FA2, and its FP8 path approaches 1.2 PFLOP/s. On B200 with the analogous async machinery, the gap is similar.

When you don't want FlashAttention

For very short sequences (< 256 tokens), the FlashAttention launch overhead can be larger than the attention compute itself. Stock cuBLAS matmul-based attention can be faster. This rarely matters in production because batched serving aggregates many short sequences into longer effective inputs, but it can show up in tight latency benchmarks for chatbots with short prompts.


Position encoding: RoPE and friends

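Transformers process tokens in parallel, and self-attention has no built-in notion of order: without position information, permuting the input tokens simply permutes the outputs (the operation is permutation-equivariant). Position information is added explicitly.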

The dominant approach in 2026 is Rotary Position Embedding (RoPE): apply position information as a rotation in pairs of embedding dimensions. The rotation angle depends on position and dimension. RoPE has nice mathematical properties — relative position is preserved, attention scores depend only on relative position offsets — and it trains well.

How RoPE works (briefly)

For a query/key vector at position m, RoPE rotates pairs of dimensions by angles that are functions of m and a frequency base θ. Pairs at higher dimension rotate slower (lower frequency); pairs at lower dimension rotate faster. The attention score Q_m · K_n is then a function of (m - n), the relative position.
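
The rotation and the relative-position property can be checked numerically. This is a minimal sketch that pairs adjacent dimensions (the original RoFormer layout; some libraries pair the first and second halves of the vector instead, which is equivalent up to a permutation). The function names are illustrative.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)   # low i: fast rotation; high i: slow
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 1.1, -0.4]
k = [1.0, 0.2, -0.6, 0.8, 0.4, -0.3, 0.05, 0.6]

# Same offset (m - n = 3) at two very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 1005), rope(k, 1002))
print(near, far)  # the two scores agree up to floating-point error
```

Because rotations preserve dot products up to the angle difference, only m - n survives in the score, which is the property the extension methods below try to keep intact.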

Why this matters for long context

RoPE was originally trained at some maximum length (say, 4k or 8k). At positions beyond that, the rotations sweep through frequency-position combinations the model never saw during training. Naive extrapolation produces garbage.

This is the central problem of long-context extension: how to make a model trained at 4k work at 128k or 1M tokens.

The RoPE base frequency θ

A specific number that matters in practice: the base frequency θ used in RoPE. Llama 1 and 2 used θ = 10000 (the original transformer convention). Llama 3 increased it to 500000 to support a longer context window natively. The choice of θ controls how much "rotation" happens across the trained length — higher θ means slower rotation, which leaves more frequency space for extension. Modern long-context recipes typically use θ = 500000 to 5000000 depending on target length. The choice is part of the architecture, not a runtime parameter; changing it requires retraining or fine-tuning.

Why position encoding errors are subtle

A botched context extension does not produce obvious garbage output. The model continues to generate fluent text; it just stops being able to use information from beyond its trained length. Symptoms: long-context retrieval fails silently, the model "forgets" mid-document content, summaries miss obvious facts. Detection requires explicit long-context evaluation; standard chat metrics do not catch it. This is why so many "long context" model releases turn out to have effective contexts much shorter than advertised — the failure is not user-visible in casual testing.


Extending context without retraining

Several methods to extend RoPE-trained models to longer contexts without training from scratch.

Position Interpolation

Linearly compress positions so longer sequences map into the trained range. A model trained at 4k sees 32k positions as if they were 4k positions, compressed by 8×.

  • Simple.
  • Quality drops noticeably.
  • Recovers most quality after a small amount of fine-tuning on the extended context.

PI (Chen et al., 2023) was the first widely-used context-extension method. It is rarely used standalone in 2026 because YaRN strictly dominates it on quality, but understanding PI is useful because both NTK and YaRN are refinements of the same basic idea.
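
In code, Position Interpolation amounts to dividing the position index by the extension ratio before computing the rotation angles. The sketch below assumes the adjacent-pair RoPE frequency schedule base**(-i/d); the helper names are hypothetical.

```python
def rope_angles(pos, dim, base=10000.0):
    """Rotation angle for each dimension pair at a (possibly fractional) position."""
    return [pos * base ** (-i / dim) for i in range(0, dim, 2)]


def pi_angles(pos, dim, trained_len, target_len, base=10000.0):
    """Position Interpolation: squeeze positions back into the trained range."""
    scale = max(1.0, target_len / trained_len)
    return rope_angles(pos / scale, dim, base)


# A 4k-trained model extended to 32k: position 32,000 is presented as 4,000.
extended = pi_angles(32_000, 64, trained_len=4096, target_len=32_768)
trained = rope_angles(4_000.0, 64)
print(extended == trained)
```

Every frequency band is compressed by the same factor, including the fast-rotating pairs that distinguish adjacent tokens, which is the uniform degradation that NTK-aware scaling and YaRN try to avoid.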

NTK-aware scaling

Different frequency bands serve different purposes. Low-frequency rotations encode coarse position; high-frequency rotations encode fine position. NTK-aware scaling adjusts the frequency base θ so that high frequencies see little or no compression and low frequencies absorb most of it, which keeps fine-grained local position close to what the model saw in training.

The "NTK" in the name refers to Neural Tangent Kernel theory, which motivates why some frequency bands should be left alone while others are scaled. In practice the math reduces to a specific θ adjustment formula that practitioners apply mechanically; the deep theory is rarely needed for deployment.

  • Better quality than naive interpolation.
  • Still benefits from fine-tuning.
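
The θ adjustment is short enough to write out. The sketch below uses the commonly cited form θ' = θ · s^(d / (d − 2)) for scale factor s and head dimension d; treat that exact formula as an assumption here, since the text above does not spell it out. It shows that the fastest-rotating pair is left untouched while the slowest pair is stretched by exactly s.

```python
def ntk_scaled_base(base, scale, dim):
    """NTK-aware base adjustment (commonly cited form): base * scale**(dim/(dim-2))."""
    return base * scale ** (dim / (dim - 2))


def inv_freqs(base, dim):
    """Per-pair rotation frequency for adjacent-pair RoPE."""
    return [base ** (-i / dim) for i in range(0, dim, 2)]


dim, base, scale = 128, 10000.0, 8.0
old = inv_freqs(base, dim)
new = inv_freqs(ntk_scaled_base(base, scale, dim), dim)

print("fastest pair ratio:", old[0] / new[0])    # unchanged
print("slowest pair ratio:", old[-1] / new[-1])  # stretched by the full scale
```

Pairs in between are compressed by a factor that grows smoothly from 1 to s, in contrast to Position Interpolation, which applies s to every pair.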

YaRN (Yet another RoPE extensioN)

Combines NTK-aware scaling with attention-temperature adjustments. Preserves precision in the frequency bands the model is most sensitive to.

  • Standard approach in many open-weight long-context models.
  • Light fine-tuning required.

Length-extended pretraining

Just train on longer sequences. Expensive, but reliable.

The cost: attention FLOPs per token grow linearly with sequence length, so the attention portion of training is about 32× more expensive per token at 128k than at 4k, and the total cost per token is typically several times higher once the linear layers are included. Production-frontier labs spend tens to hundreds of billions of tokens on the length-extension phase, which is a small fraction of total pretraining (a few percent) but a non-trivial bill. The savings of doing YaRN extension instead are real (10× cheaper or more) but come with quality trade-offs that matter on hard tasks.

Many frontier models combine these: start with RoPE at a base length, apply YaRN-style extension, then fine-tune on long-context data. The advertised 128k or 1M context is usually the result.

Why this matters for evaluation

A model with claimed 1M context that never saw beyond 32k during training (even with extension) will behave poorly on real 1M-token tasks. The label is necessary but not sufficient.

LongRoPE: per-dimension scaling

LongRoPE (Ding et al., 2024, arXiv:2402.13753) extends RoPE by learning per-dimension scaling factors, rather than a uniform or band-uniform scaling. The optimization is run as an evolutionary search to find the rescaling that minimizes perplexity on a held-out long-context dataset. The reported results push base 4k models to 2M context with measurable quality preservation; production adoption is more cautious than YaRN because the per-dimension scaling factors are model-specific and harder to share across deployments. For 1M+ extensions from a moderately-sized base, LongRoPE is the SOTA approach as of mid-2026.

Length-extended pretraining: the brute-force approach

The reliable but expensive option: just train on longer sequences. The published recipes (Llama 3.1's 128k extension, DeepSeek-V3) typically use a two-stage approach: pretrain at 4k or 8k for the bulk of training (where data is abundant and FLOPs efficient), then continue training at the target length for a small fraction of total steps (a few hundred billion tokens out of trillions). The continued-pretraining phase needs data that genuinely benefits from long context — concatenated unrelated documents don't help. Books, codebases, multi-document research, and long synthetic dialogues are the typical sources.


RoPE vs ALiBi vs YaRN

The three most-deployed position-encoding lineages, with what each actually does and where each wins.

RoPE (Rotary Position Embedding)

Rotates pairs of dimensions in the query/key vectors by angles that depend on the token position and a frequency base θ. Different dimension pairs rotate at different frequencies: low dimensions fast (encode fine position), high dimensions slow (encode coarse position). The attention score Q_m · K_n becomes a function of (m – n), the relative position. Reference: Su et al., 2021, arXiv:2104.09864.

Pros: Trains stably, encodes relative position naturally, extensible to longer contexts via frequency manipulation. Default for almost all modern open-weight LLMs (Llama, Qwen, DeepSeek, Mistral, Gemma).

Cons: Naive extrapolation beyond training length is poor — the model has never seen those (position, frequency) combinations. This is the central problem that YaRN and friends solve.

Production status: Universal default. If you're starting a new model in 2026, you're using RoPE. Even the few models that experiment with NoPE (no positional encoding) typically still use RoPE in some layers, because pure NoPE underperforms RoPE on standard benchmarks.

ALiBi (Attention with Linear Biases)

Adds a per-head linear bias to the attention scores: score(i, j) -= m_h × (i - j). The bias penalizes attending to distant tokens, with a per-head slope m_h. No actual position embedding is added to inputs. Reference: Press et al., 2021, arXiv:2108.12409.

Pros: Trivially extrapolates to longer contexts (the bias formula doesn't depend on a learned position table). Used in Bloom, MPT, and some research models.

Cons: Linear decay is a strong inductive bias toward locality; performance on long-range dependencies is competitive at moderate lengths but lags RoPE-plus-YaRN on RULER and similar long-context benchmarks at frontier lengths. Doesn't capture absolute position (which RoPE does indirectly through frequency-band patterns).

Production status: Niche. A few labs and the long-tail of open models. Most have switched to RoPE-based encodings.
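
The ALiBi bias is simple enough to build by hand. This sketch uses the geometric slope schedule from the ALiBi paper for power-of-two head counts (slopes 2^(-8/H), 2^(-16/H), ...); the function names are illustrative, and the causal mask is folded into the same matrix for convenience.

```python
def alibi_slopes(n_heads):
    """Per-head slopes m_h: a geometric sequence starting at 2**(-8/n_heads)."""
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]


def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for j <= i; future positions are masked out."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]


slopes = alibi_slopes(8)
print(slopes)                         # 0.5, 0.25, ... down to 1/256
for row in alibi_bias(4, slopes[0]):  # added to the raw attention scores
    print(row)
```

Nothing in the bias depends on a trained maximum length, which is why the formula extends to any sequence length; the steep-slope heads simply stop seeing distant tokens.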

YaRN (Yet another RoPE extensioN)

Extends a RoPE-trained model to longer contexts by adjusting the frequency base θ and applying an attention-temperature correction. The key insight: different RoPE frequency bands serve different roles, and naive position interpolation degrades them uniformly. YaRN handles each band according to whether it should be interpolated, extrapolated, or left alone, then adjusts the softmax temperature to compensate for the changed entropy. Reference: Peng et al., 2023, arXiv:2309.00071.

Pros: Extends a 4k or 8k base model to 128k with light fine-tuning and minimal quality loss. Now standard in open-weight long-context recipes (Qwen, Mistral, many Llama fine-tunes).

Cons: Requires fine-tuning (cheap but not free). Doesn't trivially go to 1M without longer pretraining.

Production status: The default for "extend a base model's context." If your provider supports 128k context on an open model, YaRN is probably in the recipe.

YaRN scaling factor in practice

The YaRN scale factor s = target_length / base_length controls how aggressively the position frequencies are interpolated. For a 4k → 128k extension, s = 32. The light fine-tuning that follows (typically 100M-1B tokens on long-context data) recovers most of the quality lost at the scale step. For s > 32 (e.g., 4k → 1M, s = 256), the quality recovery from fine-tuning is incomplete and a length-extended pretraining phase is usually needed in addition to YaRN.
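
The per-band treatment and the temperature correction can be sketched in a few lines. This follows the "NTK-by-parts" ramp and the attention-scale fit described in the YaRN paper as best understood here: the thresholds alpha = 1 and beta = 32 are the values reported for Llama-family models, and the function names are hypothetical. Treat it as an illustration, not a drop-in replacement for a library implementation.

```python
import math


def yarn_inv_freqs(dim, base, scale, trained_len, alpha=1.0, beta=32.0):
    """Blend interpolated and original frequencies per dimension pair."""
    out = []
    for i in range(0, dim, 2):
        freq = base ** (-i / dim)
        wavelength = 2 * math.pi / freq
        rotations = trained_len / wavelength  # full turns inside the trained window
        # ramp = 1: many turns seen in training, keep the frequency as-is.
        # ramp = 0: less than one turn seen, interpolate by 1/scale.
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append(freq * ramp + (freq / scale) * (1.0 - ramp))
    return out


def yarn_attention_scale(scale):
    """Factor applied to q and k (so logits scale by its square): 0.1*ln(s) + 1."""
    return 0.1 * math.log(scale) + 1.0


s = 128_000 / 4_000  # 4k -> 128k, s = 32
freqs = yarn_inv_freqs(dim=128, base=10000.0, scale=s, trained_len=4096)
print("fastest pair:", freqs[0])   # untouched
print("slowest pair:", freqs[-1])  # divided by s
print("q/k scale:", round(yarn_attention_scale(s), 4))
```

The middle bands get a linear blend of the two treatments, so there is no abrupt boundary between interpolated and untouched frequencies.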

NTK-aware and Position Interpolation

Predecessors to YaRN; subsumed by it in practice. NTK-aware scaling preserves high-frequency content; Position Interpolation linearly compresses positions into the trained range. YaRN combines both ideas with the temperature fix.

Dynamic NTK

A runtime variant: instead of fixing the NTK scaling factor at deployment, compute it dynamically based on actual sequence length per request. Useful when a model serves a mix of short and long requests — short requests get no extension cost, long requests get the appropriate scaling. Reference implementations exist in Hugging Face transformers and vLLM. The cost is per-request setup time (negligible) and slightly more complex caching of position embeddings.

Choosing

Situation | Use
New pretraining from scratch | RoPE + length-extended pretrain
Extending an existing RoPE model | YaRN (with light fine-tune)
Architecture that needs to trivially extrapolate | ALiBi
Aggressive 1M+ from a 32k base | YaRN + length-extended fine-tune at full length

Ring attention and sequence parallelism

Once a single sequence is too long to fit on one GPU's HBM, attention has to be distributed across devices.

Sequence parallelism partitions the sequence dimension across GPUs. Each GPU holds a chunk of the sequence and its KV cache.

Ring attention is the most common implementation. GPUs are arranged in a logical ring. Each holds a chunk of the sequence. Attention is computed iteratively: each GPU's KV chunk passes around the ring, and every GPU updates its partial attention output as it sees each incoming chunk.

GPU 0: tokens 0..999       KV chunk A
GPU 1: tokens 1000..1999   KV chunk B
GPU 2: tokens 2000..2999   KV chunk C
...

Step 1: each GPU has its own KV. Compute attention with own chunk.
Step 2: pass KV chunks around. Each GPU sees a neighbor's chunk. Compute.
Step 3: pass again. ...

After N steps (one per GPU), every GPU has attended to every other GPU's chunk, building up the full attention output. Each GPU sends and receives O(n) worth of KV per layer in total, but the transfers are spread over the N steps and overlapped with compute.
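
The rotation schedule can be simulated on one machine to confirm that it reproduces ordinary causal attention. The sketch below is a single-process model, not a distributed implementation: the loop over `dev` stands in for GPUs running in parallel, the index `src` stands in for whichever KV chunk has arrived over the ring at that step, and each query keeps the same running-softmax state a tiled kernel would. All names are illustrative.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b)) / math.sqrt(len(a))


def causal_reference(Q, K, V):
    """Plain causal attention, one query at a time."""
    out = []
    for i, q in enumerate(Q):
        s = [dot(q, K[j]) for j in range(i + 1)]
        m = max(s)
        w = [math.exp(x - m) for x in s]
        z = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(i + 1)) / z
                    for c in range(len(V[0]))])
    return out


def ring_attention(Q, K, V, n_dev):
    n, chunk, d_v = len(Q), len(Q) // n_dev, len(V[0])
    m = [float("-inf")] * n                 # running max per query
    l = [0.0] * n                           # running denominator per query
    acc = [[0.0] * d_v for _ in range(n)]   # running unnormalized output
    for step in range(n_dev):               # one ring rotation per step
        for dev in range(n_dev):            # conceptually parallel across GPUs
            src = (dev - step) % n_dev      # KV chunk resident on dev this step
            for i in range(dev * chunk, (dev + 1) * chunk):
                for j in range(src * chunk, (src + 1) * chunk):
                    if j > i:
                        continue            # causal mask
                    s = dot(Q[i], K[j])
                    m_new = max(m[i], s)
                    corr = math.exp(m[i] - m_new)
                    p = math.exp(s - m_new)
                    l[i] = l[i] * corr + p
                    acc[i] = [a * corr + p * v for a, v in zip(acc[i], V[j])]
                    m[i] = m_new
    return [[a / l[i] for a in acc[i]] for i in range(n)]


random.seed(0)
n, d = 12, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ref, ring = causal_reference(Q, K, V), ring_attention(Q, K, V, n_dev=3)
print(max(abs(a - b) for r, s in zip(ref, ring) for a, b in zip(r, s)))
```

Each simulated device only ever touches one KV chunk at a time, which is the O(n/N) per-GPU memory property; the order in which chunks arrive does not matter because the running-softmax update is order-independent.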

Striped Attention: load-balancing the ring

Naive ring attention has a load-balance problem under causal masking. A GPU that holds late sequence positions has more attention work (its queries attend to every earlier chunk), while a GPU holding early positions finds most incoming KV chunks fully masked and sits idle. The slowest GPU dominates step time. Striped Attention (Brandon et al., 2023, arXiv:2311.09431) permutes the assignment of tokens to GPUs so each GPU holds an interleaved set rather than a contiguous chunk, which balances the work per GPU per step. The implementation cost is a permutation in the data layout; the paper reports end-to-end speedups of roughly 1.45× on A100 GPUs and 1.65× on TPUv4 over plain ring attention on causal long-context workloads. Standard in mature ring-attention implementations.
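
The imbalance is easy to quantify by counting unmasked query-key pairs per device at each ring step. The sketch below compares a contiguous layout with a striped (round-robin) one for 16 tokens on 4 devices; it is a counting exercise with hypothetical helper names, not an attention implementation.

```python
def contiguous(n, n_dev):
    """Device d holds one contiguous chunk of token positions."""
    chunk = n // n_dev
    return [list(range(d * chunk, (d + 1) * chunk)) for d in range(n_dev)]


def striped(n, n_dev):
    """Device d holds positions d, d + n_dev, d + 2*n_dev, ..."""
    return [list(range(d, n, n_dev)) for d in range(n_dev)]


def work_per_step(assignment):
    """For each ring step, unmasked (query, key) pairs each device must compute."""
    n_dev = len(assignment)
    steps = []
    for step in range(n_dev):
        row = []
        for dev in range(n_dev):
            src = (dev - step) % n_dev  # whose KV chunk is resident this step
            row.append(sum(1 for i in assignment[dev]
                           for j in assignment[src] if j <= i))
        steps.append(row)
    return steps


for name, layout in (("contiguous", contiguous(16, 4)), ("striped", striped(16, 4))):
    steps = work_per_step(layout)
    print(name, steps, "step time ~", sum(max(row) for row in steps))
```

Both layouts do the same total work, but the step time is set by the busiest device at each step, so the striped layout finishes sooner.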

What ring attention enables

Million-token contexts. A single-GPU 1M-token request would exceed HBM several times over. Ring attention spreads the KV cache and the attention work across many GPUs and overlaps the transfers with compute, so no single device ever holds the full sequence.

Concretely: serving a 1M-token context on Llama-3 70B requires ~335 GB of KV at FP16, more than 4× the HBM of an H100 SXM. Ring attention sharded across 8 GPUs puts ~42 GB of KV on each GPU, comfortable for an H200. The compute is similarly partitioned: each GPU processes its chunk's attention and the cross-chunk attention rotates through the ring.

What it costs

Inter-GPU bandwidth becomes the bottleneck. Each step of the ring transfers a chunk of KV. For long sequences across many GPUs, this is gigabytes per step.

The dominant deployment topology for ring attention is rack-scale NVLink — all GPUs in one fast-fabric domain, attention chunks move at NVLink bandwidth. Across slower links, ring attention slows substantially. See our LLM serving guide for how this fits the wider stack.


Sequence parallelism patterns: Ring, Ulysses, Context-Parallel

Once you commit to distributing a single sequence across GPUs, three patterns are in production. Each has different comm topology and bandwidth requirements.

Ring attention

GPUs arranged in a logical ring. Each holds a chunk of the sequence. KV chunks circulate around the ring; every GPU eventually attends to every other GPU's KV. Communication is O(n) total (each chunk passes each GPU once), parallelized across N GPUs and overlapped with compute. Reference: Liu et al., 2023, arXiv:2310.01889.

Pros: Memory scales as O(n/N) per GPU. Comm pattern is nearest-neighbor — friendly to ring-shaped fabrics.

Cons: Latency-bound by the slowest hop. Sensitive to load imbalance (causal attention means some chunks have less work — Striped Attention addresses this with permuted layouts).

Best fit: NVLink-bound or NVL72-rack-scale deployments where the all-to-neighbor bandwidth is high.

DeepSpeed-Ulysses (sequence parallelism by head sharding)

Different approach: shard the sequence across GPUs for the QKV projections (each GPU owns a chunk), then all-to-all to reshard so each GPU owns a subset of attention heads over the full sequence, run attention, then all-to-all back. Comm cost is O(n × d_model / N) per all-to-all. Reference: Jacobs et al., 2023, arXiv:2309.14509.

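Pros: Per-GPU comm volume is O(n × d_model / N), so it stays constant when sequence length and GPU count grow together, whereas ring attention's per-GPU KV traffic grows with n. Better than ring for very long sequences if the fabric supports all-to-all well.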

Cons: Two all-to-alls per attention layer. Couples with head count (must have N | num_heads).

Best fit: NVL72 / rack-scale where all-to-all is cheap (same fabric that makes MoE workable). Used in some DeepSpeed-Megatron training stacks.
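
The two all-to-alls are pure data movement, which a toy model shows clearly. In the sketch below each "device" is a Python list, every element is labeled with its (position, head) so the resharding can be inspected, and the functions stand in for the collective operations; none of this is DeepSpeed's API.

```python
def all_to_all_seq_to_head(shards, n_dev):
    """shards[d][t][h]: sequence-sharded, all heads -> head-sharded, full sequence."""
    heads_per_dev = len(shards[0][0]) // n_dev
    out = []
    for dst in range(n_dev):
        full_seq = []
        for src in range(n_dev):  # chunks arrive in sequence order
            for tok in shards[src]:
                full_seq.append(tok[dst * heads_per_dev:(dst + 1) * heads_per_dev])
        out.append(full_seq)
    return out


def all_to_all_head_to_seq(shards, n_dev):
    """Inverse: head-sharded, full sequence -> sequence-sharded, all heads."""
    chunk = len(shards[0]) // n_dev
    out = []
    for dst in range(n_dev):
        rows = []
        for t in range(dst * chunk, (dst + 1) * chunk):
            row = []
            for src in range(n_dev):
                row.extend(shards[src][t])
            rows.append(row)
        out.append(rows)
    return out


n_dev, n_tokens, n_heads = 2, 4, 4
chunk = n_tokens // n_dev
seq_sharded = [[[(pos, h) for h in range(n_heads)]
                for pos in range(d * chunk, (d + 1) * chunk)]
               for d in range(n_dev)]

head_sharded = all_to_all_seq_to_head(seq_sharded, n_dev)
print(head_sharded[1])  # device 1: every position, heads 2 and 3 only
```

After the first exchange each device can run an ordinary single-GPU attention kernel on its heads, which is why the head count must be divisible by the number of devices.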

Context Parallel (NVIDIA NeMo / Megatron)

Hybrid: ring-like KV passing for the attention computation, paired with normal tensor parallelism for the rest of the layer. NVIDIA's preferred recipe for training and serving long context on NVLink-rich hardware.

Choosing

Constraint | Use
Very long sequences, NVL72 fabric | Ulysses or NVIDIA Context Parallel
Moderate length, ring-shaped fabric | Ring Attention
Asymmetric work (causal masking) | Striped Attention (load-balanced ring)
Training-only, comm-bound | Ulysses

Combine this with our guides on NCCL tuning and collectives and on distributed training parallelism strategies.

Communication cost comparison

For a 1M-token sequence sharded across 8 GPUs on NVL72:

Method | Comm per attention layer | Total per forward pass (60 layers) | Wall time on NVL72
Ring Attention | 24 GB (3 GB × 8 hops) | 1440 GB | ~1 s
Striped Attention | 24 GB | 1440 GB | ~0.7 s (balanced)
Ulysses | 64 GB (two all-to-all of 32 GB) | 3840 GB | ~2 s
Context Parallel | 24 GB ring + 8 GB AR | 1920 GB | ~1.2 s

The naive expectation that Ulysses always wins is wrong — for very long sequences where ring's per-hop chunks are large but compute overlaps generously, ring can match or beat Ulysses. The right choice is workload and fabric specific. Always benchmark on your actual model and length.


Sliding window vs full attention

When to give up the O(n²) and accept that not all token pairs matter equally.

Sliding window attention

Each token attends only to a fixed window of neighbors (e.g., 4k tokens). Memory and compute O(n × window). Mistral 7B introduced this in a production model; Gemma and several other recent models use a mix of local and global attention layers.

Pros: Linear scaling. Quality is fine on locality-dominated tasks: code completion, chat continuation, language modeling perplexity.

Cons: Hard cap on long-range dependency. Retrieval across the window boundary fails. Effective context is the window size, not the nominal context length.
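As a minimal sketch of what the window does to the attention pattern, assuming the causal variant (each token sees itself and the previous `window - 1` tokens), the mask can be built like this:

```python
import numpy as np

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """mask[i, j] is True when query i may attend to key j (causal, last `window` tokens)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(n=16, window=4)
print(mask.sum(), "allowed pairs vs", 16 * 17 // 2, "for full causal attention")
print("token 15 can see token 3:", bool(mask[15, 3]))
```

The allowed-pair count grows as n × window rather than n², and the last line shows the hard cap: a token outside the window is simply not visible within a single layer.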

Full attention (with FlashAttention)

Every token attends to every prior token. O(n²) compute, O(n) memory (post-FlashAttention). The reference for quality.

Mixed-strategy models

A growing pattern in 2026: some layers run sliding-window attention (cheap, capture local structure); others run full attention (capture long-range dependency). Gemma 2/3, some Llama 4 variants, and Mistral lineage all use variants of this. The mix ratio is empirical — typically 1 full attention layer per 4–8 sliding-window layers.

Pros: Most of the long-range capacity at much lower compute and memory.

Cons: Effective long-range capacity depends on the ratio; needs evaluation on your workload.

Quantitative impact: Gemma 3's 5:1 sliding:full pattern with 4k sliding window achieves roughly 60% of the per-token cost of full attention at 128k context, while preserving 90%+ of the RULER score. The cost-quality tradeoff is favorable enough that mixed-strategy is the default for most new long-context dense models in 2026.

Sliding window + attention sinks (StreamingLLM)

The first few "sink" tokens are always attended to. The rest of the context is windowed. Stabilizes streaming workloads (long chats) without unbounded KV growth. Pairs well with KV quantization. Reference: Xiao et al., 2023, arXiv:2309.17453.

The sink-token effect was discovered by inspecting attention patterns: in models without explicit position encoding tricks, the first few tokens accumulate disproportionate attention regardless of content. Keeping them in cache stabilizes the attention distribution. The number of sink tokens needed is typically 4-16, which is negligible compared to the windowed body of the context. The combination of attention sinks + sliding window + INT4 KV produces an effective infinite-context streaming serving pattern at bounded HBM, used in production for very long chat sessions.
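The cache policy described above can be sketched in a few lines. This is an illustration of the eviction rule only: the class name and defaults are hypothetical, integers stand in for per-token (K, V) tensors, and it does not model the position re-indexing that a real streaming implementation needs.

```python
from collections import deque

class SinkWindowCache:
    """Keeps the first `n_sink` entries forever plus the most recent `window` entries."""

    def __init__(self, n_sink: int = 4, window: int = 8):
        self.n_sink = n_sink
        self.sinks: list = []
        self.recent: deque = deque(maxlen=window)  # oldest entry drops automatically

    def append(self, kv_entry) -> None:
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def entries(self) -> list:
        return self.sinks + list(self.recent)

cache = SinkWindowCache(n_sink=4, window=8)
for token_pos in range(100):
    cache.append(token_pos)  # stand-in for a per-token (K, V) pair
print(cache.entries())
```

After 100 tokens the cache holds 12 entries: positions 0 to 3 and the last 8, so memory stays bounded no matter how long the stream runs.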

Decision

| Workload | Default |
| --- | --- |
| Code completion, chat continuation | Sliding window |
| Long-document QA | Full attention |
| Streaming chat with cap on history | Sliding window + sinks |
| RAG over long retrieved context | Full attention |
| Research on cost | Mixed-strategy |
Figure: Long-context attention at a glance. Self-attention is O(n²): doubling the context costs 4× attention compute (and 4× score-matrix memory when that matrix is materialized), which makes attention the scaling bottleneck. Six approaches trade off quality vs efficiency: full attention (baseline, no information loss but doesn't scale), sparse, sliding window, dilated / strided, low-rank / kernel approximations, and hybrid methods that combine local + global + summary + sparse. Modern systems mix techniques: FlashAttention for IO-aware exact attention, RoPE / YaRN / ALiBi for positions, sliding window plus global tokens, and aggressive KV-cache quantization. Real-world context lengths span 128K (GPT-4 Turbo, Llama 3.1) to 200K (Claude 3.5), 1M (Gemini 1.5 Pro, MPT-30B-StoryWriter), and 2M (Kimi). Best practice: start with full attention + FlashAttention, profile bottlenecks, manage the KV cache, evaluate on real workloads, and don't assume longer is better.

1M+ context production realities

The honest section: what shipping million-token context actually costs and which claims to discount.

Prefill is the dominant cost. A 1M-token prefill on a 70B model processes 1000× the tokens of a 1k prefill. The linear parts of the model (projections, MLPs) therefore cost roughly 1000× more, and attention compute, which is quadratic in length, costs roughly 1,000,000× more than the attention in a 1k prefill. On serious hardware (B200 NVL72 with ring attention or Ulysses) a 1M prefill is on the order of seconds to minutes; on lesser hardware, minutes to tens of minutes. Chunked prefill helps perceived latency (you can start streaming generation earlier) but the total work is unchanged.

KV cache dominates serving memory. Llama-3-70B at 1M tokens is ~335 GB of KV cache at FP16, ~167 GB at FP8, ~84 GB at INT4. A single concurrent 1M-context request occupies a significant fraction of an entire HBM-rich GPU node. The economics only work with aggressive KV quantization and/or KV pool sharing.

Effective context is much shorter than advertised. RULER (Hsieh et al., 2024) reports many "1M context" models maintaining strong quality only to 32k–128k on retrieval-heavy tasks. The "Lost in the Middle" effect (Liu et al., 2023, arXiv:2307.03172) is real and persistent: information placed mid-context is recalled worse than information at the start or end. Planning assumption: budget for effective context around 1/4 to 1/2 of the advertised label on hard tasks.

Position encoding is rarely the bottleneck — data is. Training data for genuinely long-coherent contexts is scarce. Most "long-context training sets" are concatenations of unrelated documents with synthetic stitching. Models trained on those datasets handle long windows mechanically but lose quality on tasks requiring true cross-document reasoning.

The hardware floor is high. Useful 1M-token serving requires at minimum an NVLink-rich node (HGX B200 or similar), and is much more practical at rack scale (NVL72). Ring attention or Ulysses over slower fabrics is so slow as to make 1M serving impractical. The economics: a single H200 node serves perhaps 1-2 concurrent 1M-context requests; an NVL72 rack serves 4-8. Per-rack capital cost is in the millions; per-request cost runs in the dollars for 1M-context inference.

KV pool sharing is the emerging optimization. Mooncake and similar systems pool KV across replicas, paired with disaggregated prefill/decode. The prefill replica computes KV once; multiple decode replicas can read from a shared store. Operationally complex; pays off when the same long prefix is reused across many requests.

When 1M is actually warranted. Whole-codebase reasoning, long-document synthesis where the synthesis is the task, multi-document policy / legal analysis. For document QA with focused questions, RAG over 50k–200k context is almost always better, cheaper, and more accurate.

Use cases that justify 1M context

| Use case | Why long context helps | Alternative |
| --- | --- | --- |
| Codebase-wide refactor analysis | Cross-file reasoning, type inference | None (RAG fragments the codebase) |
| Whole-document legal/policy review | Cross-clause consistency checks | Multi-pass summarize-then-detail |
| Multi-document research synthesis | Combining evidence from many sources | Iterative retrieval |
| Long-form video/audio transcription analysis | Temporal coherence | Chunked transcription + summary |
| Repository understanding for agents | Tool calls need full state | RAG with code-aware retriever |
| Multi-turn agent with extensive history | Memory persistence | Summary-based memory |

For these workloads, the cost of long context is justified. For "find the answer to my question in this document", RAG wins on cost and accuracy.

What kills 1M-context serving in production

Two failure modes are common when teams first ship 1M-context features. First, the prefill cost is opaque to users — a user pastes a 1M-token document and waits 60+ seconds for the first token. Mitigations: explicit progress indicators, chunked prefill with progressive output, or up-front cost warnings. Second, the per-token decode cost compounds — generating a 2000-token response at 1M context costs 10-20× the same response at 32k context. Mitigations: aggressive output-length limits, structured-output constraints to keep outputs short, or compute-cap budget per request.


Sparse and approximate attention

A different family of techniques accepts that not all token pairs matter equally and trims the n² to something smaller.

Sliding window attention

Each token attends only to a window of nearby tokens (say, 4k surrounding tokens). Memory: O(n × window_size). Compute: O(n × window_size).

  • Effective for tasks where most relevant context is local: code completion, chat continuation.
  • Catastrophic for retrieval tasks requiring distant context.

Sparse global attention

Most tokens use a small window; some "global" tokens attend to everything and are attended by everything. Hybrid approach.

  • Captures distant interactions through global tokens.
  • Quality depends on which tokens are designated global. Sometimes learned, sometimes positional (e.g., first token of each sentence).

Block-sparse attention

Attention matrix is sparse with structured block patterns. Hardware-friendly because blocks can be skipped efficiently. FlexAttention in PyTorch 2.5+ provides a clean API for expressing block-sparse patterns and generates fast Triton kernels automatically. Production stacks (vLLM, TRT-LLM) ship block-sparse kernels for the common patterns. For the kernel-level mechanics see our Triton kernel primer.

Mixed-strategy models

Many 2026 long-context models use multiple attention types in different layers: some layers full attention, some sliding window, some sparse global. The mix is empirically tuned.

The trade-off

  • Full attention: highest quality, highest cost.
  • Sparse: lower cost, varying quality. Highly task-dependent.

The right choice depends on whether your workload requires long-range dependency. Code completion: probably fine with sparse. RAG over long documents: full attention preferred.

Native sparse attention (NSA) and modern sparse variants

DeepSeek's Native Sparse Attention (NSA, 2025) is a recent design that learns which tokens to attend to dynamically per layer, with hardware-friendly block patterns. The model is trained from scratch with sparse attention as a primitive, not retrofitted. Reported compute savings of 4-8× at long context with quality competitive to dense attention on standard benchmarks. The catch: training from scratch is expensive, and retrofitting existing dense-attention models is open research. NSA-style approaches are likely to become more common at the frontier through 2026-2027 as the cost of long-context training increases.

Combining sparse with dense attention

Many production models in 2026 layer dense full attention with sparse or sliding-window attention in alternating patterns. Gemma 3's "5 sliding + 1 full" pattern, Mistral's sliding-window attention with periodic full layers, and the Llama 4 lineage's hybrid all follow this template. The mix ratio is a hyperparameter; typical values are 3:1 to 8:1 (sparse:dense). The intuition: sparse layers handle the bulk of the compute, dense layers preserve long-range capability. The combined effective context is much closer to the full-attention baseline than the sparse-only model, at a fraction of the compute cost.
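A rough way to see what the mix ratio buys is to count cached token slots per layer. The sketch below assumes a 48-layer stack, a 4k window, and a strict "N sliding then 1 full" repetition; all three are illustrative choices, not the configuration of any specific model. It covers KV slots only, so it will not match end-to-end per-token cost figures, which also include the MLP and projection layers.

```python
def layer_pattern(n_layers: int, sliding_per_full: int) -> list[str]:
    """Repeat `sliding_per_full` sliding-window layers followed by one full-attention layer."""
    period = sliding_per_full + 1
    return ["full" if (i + 1) % period == 0 else "sliding" for i in range(n_layers)]

def kv_tokens_held(pattern: list[str], context: int, window: int) -> int:
    """Total cached token slots across layers; sliding layers cap at `window`."""
    return sum(context if kind == "full" else min(window, context) for kind in pattern)

pattern = layer_pattern(n_layers=48, sliding_per_full=5)
hybrid = kv_tokens_held(pattern, context=131072, window=4096)
dense = kv_tokens_held(["full"] * 48, context=131072, window=4096)
print(pattern[:6], f"KV vs all-full: {hybrid / dense:.1%}")
```

Under these assumptions the hybrid stack holds a bit under a fifth of the KV slots of an all-full stack at 128k, and almost all of what remains belongs to the few full-attention layers.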


KV-cache pressure at long contexts

The hidden cost of long context is not the prefill — that's a one-time pass. It's the KV cache living in HBM for the whole generation.

For Llama-3-70B (80 layers, 8 KV heads, head_dim 128) at FP16:

| Context length | KV cache size |
| --- | --- |
| 4k tokens | 1.3 GB |
| 32k tokens | 10.7 GB |
| 128k tokens | 42.9 GB |
| 1M tokens | 335 GB |

For a batch of concurrent requests, multiply by batch size. A worker holding 16 concurrent 32k-context requests holds ~170 GB of KV cache.
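The figures above follow from a one-line formula: two tensors (K and V) per layer, per KV head, per head dimension, per token. A minimal calculator, assuming decimal gigabytes and "k" meaning 1024 tokens:

```python
def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """Bytes of KV cache for one sequence: K and V, every layer, every KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

for label, tokens in [("4k", 4 * 1024), ("32k", 32 * 1024), ("128k", 128 * 1024)]:
    print(f"{label}: {kv_cache_bytes(tokens) / 1e9:.1f} GB")

batch = 16 * kv_cache_bytes(32 * 1024)
print(f"16 x 32k: {batch / 1e9:.0f} GB")
```

Changing `bytes_per_elem` to 1.0 or 0.5 gives the FP8 and INT4 footprints; the defaults are the Llama-3-70B shape quoted above.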

Mitigations

KV-cache quantization. Dropping KV to FP8 halves the footprint. INT4 quarters it. Quality impact ranges from negligible (FP8) to noticeable (INT4). See the quantization tradeoffs guide for depth. For long-context production, FP8 KV is non-negotiable; INT4 KV is workload-dependent but increasingly standard.
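To show where the quality impact of low-bit KV comes from, here is a minimal sketch of symmetric 4-bit quantization with one scale per token row. It is a simplification under stated assumptions: values are held in `int8` rather than packed two per byte, and production kernels typically use finer-grained (group-wise or per-head) scales.

```python
import numpy as np

def quantize_int4(x: np.ndarray):
    """Symmetric per-row 4-bit quantization: integers in [-7, 7] plus one float scale per row."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for all-zero rows
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 128)).astype(np.float32)  # [tokens, head_dim]
q, scale = quantize_int4(keys)
err = np.abs(dequantize(q, scale) - keys).max()
print(f"max abs error: {err:.3f}, bound (max scale / 2): {scale.max() / 2:.3f}")
```

The rounding error per element is at most half a quantization step, so a single outlier in a row inflates the step for every other value in that row. That is the usual reason INT4 KV is more workload-sensitive than FP8.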

KV-cache offloading. Page rarely-accessed cache to CPU memory or local NVMe. Latency hit substantial (PCIe is ~64 GB/s, HBM is ~5 TB/s). Useful for batch and high-context-but-low-QPS workloads.

KV-cache eviction. Discard cache entries deemed unlikely to be useful. Risky for retrieval-heavy workloads. Some research approaches (H2O, StreamingLLM) use attention-based heuristics.

The eviction-vs-quantization tradeoff: quantization preserves all tokens at reduced precision; eviction keeps some tokens at full precision and discards others. For retrieval-heavy workloads, quantization is safer because every token is still queryable. For streaming workloads where old context becomes irrelevant, eviction is more efficient. The combination (quantize current + evict ancient) is the most aggressive practical compression.

KV pool sharing across requests. Mooncake-style distributed KV pools let many decode replicas read from a shared KV store. For long-context production where the same long document is queried by many users, the prefix-cache hit rate can exceed 95%, eliminating the prefill cost for most requests. Operationally complex; only justifies the effort at hosted-provider scale.
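Prefix reuse of this kind is commonly keyed on hashes of fixed-size token blocks, chained so that a block's key depends on everything before it. The sketch below illustrates that lookup only; it is not Mooncake's or any serving stack's implementation, the block size is tiny for readability, and a Python `set` stands in for the distributed store.

```python
import hashlib

BLOCK = 4  # tokens per KV block (tiny for illustration)

def block_keys(tokens: list[int]) -> list[str]:
    """One key per full block; each key hashes the block together with the previous key."""
    keys, prev = [], ""
    for start in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        block = tokens[start:start + BLOCK]
        prev = hashlib.sha256((prev + "|" + ",".join(map(str, block))).encode()).hexdigest()
        keys.append(prev)
    return keys

def cached_prefix_len(tokens: list[int], store: set[str]) -> int:
    """Number of leading tokens whose KV blocks are already in the shared store."""
    hit = 0
    for key in block_keys(tokens):
        if key not in store:
            break
        hit += BLOCK
    return hit

document = list(range(100, 120))               # shared 20-token "document"
store = set(block_keys(document + [1, 2, 3]))  # first request populates the store
print(cached_prefix_len(document + [7, 8, 9, 10], store))
```

The second request reuses the 20 document tokens and only needs prefill for its own question. Because the keys are chained, a single differing token early in the prompt invalidates every later block, which is why exact-prefix discipline in prompt construction matters.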

Compressing KV with attention sinks. Keep the first few "sink" tokens always; window the rest. Works for streaming but loses information.

Sliding-window models that don't accumulate full KV. Architecture-level fix.

PyramidKV and per-layer budget tapering

Recent research observes that different attention layers need different KV cache budgets — earlier layers benefit from longer history, later layers can survive with shorter. PyramidKV (Cai et al., 2024, arXiv:2406.02069) tapers the per-layer KV budget from large at the bottom of the stack to small at the top, achieving 50-80% KV reduction with minimal quality loss on standard benchmarks. The implementation is a layer-specific eviction policy and is compatible with FP8/INT4 KV. Production adoption is growing in 2026.
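The idea of tapering can be sketched as a budget allocation that keeps the average per-layer budget fixed while giving lower layers more. The linear shape and the 4:1 first-to-last ratio below are assumptions chosen for illustration; they are not the schedule from the paper.

```python
def tapered_budgets(n_layers: int, avg_budget: int, ratio: float = 4.0) -> list[int]:
    """Linear taper: layer 0 gets `ratio` times the budget of the last layer,
    and the mean across layers stays at `avg_budget`."""
    last = 2 * avg_budget / (ratio + 1)
    first = ratio * last
    step = (first - last) / (n_layers - 1)
    return [round(first - i * step) for i in range(n_layers)]

budgets = tapered_budgets(n_layers=80, avg_budget=2048)
print(budgets[0], budgets[-1], sum(budgets) / len(budgets))
```

Each layer would then keep only its `budgets[i]` highest-scoring KV entries, so the total cache size is the same as a uniform 2048-entry budget but spent where the section above says it matters most.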

SnapKV and prompt-aware compression

SnapKV (Li et al., 2024, arXiv:2404.14469) compresses the KV cache for the prompt portion (not the generated portion) based on observed attention patterns. The intuition: most prompt tokens are not heavily attended to during generation, so they can be evicted aggressively. The compression happens after prefill, before generation, so it does not affect TTFT but reduces decode-time KV pressure substantially. Reported compressions of 4-8× with negligible quality loss on long-document QA.
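A simplified version of the selection step, for one attention head: score each earlier prompt position by how much attention it receives from the last few prompt queries, keep the top scorers, and always keep the observation window itself. This is a sketch in the spirit of the method, not the paper's algorithm (which, among other things, pools scores over neighboring positions); the function name and the synthetic attention matrix are made up for the example.

```python
import numpy as np

def select_prompt_kv(attn: np.ndarray, obs_window: int, keep: int) -> np.ndarray:
    """attn: [prompt_len, prompt_len] causal attention weights for one head.
    Returns sorted indices of prompt positions whose KV entries are kept."""
    prompt_len = attn.shape[0]
    prefix_len = prompt_len - obs_window
    votes = attn[prefix_len:, :prefix_len].sum(axis=0)  # attention received by each prefix position
    top = np.argsort(votes)[-keep:]                     # most-attended prefix positions
    window = np.arange(prefix_len, prompt_len)          # observation window is always kept
    return np.sort(np.concatenate([top, window]))

rng = np.random.default_rng(0)
logits = rng.standard_normal((64, 64))
logits[:, 10] += 6.0                                    # make position 10 a "heavy hitter"
logits = np.where(np.tril(np.ones((64, 64), dtype=bool)), logits, -np.inf)
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

kept = select_prompt_kv(attn, obs_window=8, keep=8)
print(len(kept), 10 in kept)
```

Here 64 prompt positions are reduced to 16, and the heavily attended position survives. The selection runs once after prefill, which is why it leaves time-to-first-token untouched.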

Concrete KV math for production sizing

For a 70B model with GQA (8 KV heads, head_dim 128), 80 layers, at various precisions:

| Context | FP16 KV | FP8 KV | INT4 KV |
| --- | --- | --- | --- |
| 8k | 2.7 GB | 1.3 GB | 670 MB |
| 32k | 10.7 GB | 5.4 GB | 2.7 GB |
| 128k | 42.9 GB | 21.5 GB | 10.7 GB |
| 512k | 171 GB | 86 GB | 43 GB |
| 1M | 335 GB | 167 GB | 84 GB |
| 2M | 670 GB | 335 GB | 168 GB |

The 2M-context row shows the production limit clearly: even with INT4 KV, a single 2M-token request needs 168 GB for KV alone, which exceeds an H200 outright and leaves no room for 70B weights on a 192 GB B200. Multi-GPU KV sharding (via ring or context parallelism) is mandatory at those lengths.


Long context vs retrieval

A long-running debate: do you want a model with a 1M-token context, or do you want retrieval-augmented generation (RAG) over a 1M-token corpus?

Long context

  • Model sees everything at once.
  • Can reason across arbitrary parts of the input.
  • Expensive per query.
  • Quality degrades in the middle of long contexts ("lost in the middle").

RAG

  • Retriever selects relevant chunks; model sees only those.
  • Scales to arbitrarily large corpora (the model's context is bounded; the corpus isn't).
  • Cheap per query.
  • Quality depends on retriever — bad retrieval means wrong answer regardless of model quality.

The honest answer

Neither dominates. The right answer depends on workload.

  • Document QA with focused questions: RAG is usually cheaper and better.
  • Synthesizing across a whole document: long context wins.
  • Open-ended exploration: long context.
  • Massive corpora: RAG is required.
  • Low cost requirements: RAG.
  • High-accuracy global reasoning: long context.

Many production systems combine both: retrieve a long but focused context (say, the top 50 documents), feed to a long-context model. Gets you a smarter context budget.
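The "smarter context budget" idea amounts to filling a fixed token budget with the best-ranked material. A minimal greedy sketch, where the chunk IDs, token counts, and the 100k budget are all hypothetical:

```python
def pack_context(ranked_chunks: list[tuple[str, int]], budget_tokens: int) -> list[str]:
    """ranked_chunks: (chunk_id, token_count) in retriever/reranker order, best first.
    Greedily keeps chunks that still fit in the remaining token budget."""
    chosen, used = [], 0
    for chunk_id, n_tokens in ranked_chunks:
        if used + n_tokens > budget_tokens:
            continue  # skip chunks that do not fit; smaller ones further down may still fit
        chosen.append(chunk_id)
        used += n_tokens
    return chosen

ranked = [("doc-3", 40_000), ("doc-9", 90_000), ("doc-1", 30_000), ("doc-7", 25_000)]
print(pack_context(ranked, budget_tokens=100_000))
```

Setting the budget well below the model's advertised window is one way to act on the effective-context caveats discussed earlier in this guide.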

Cost comparison: long context vs RAG

Concrete numbers for answering questions over a 10M-token corpus, normalized per query:

| Approach | Context to model | Cost per query | Latency |
| --- | --- | --- | --- |
| Naive RAG (k=5, 4k chunks) | 20k tokens | $0.02 | 0.5 s |
| Smart RAG (k=20, reranked) | 80k tokens | $0.08 | 1.5 s |
| Long context (full corpus) | 10M tokens | impossible | n/a |
| Long context (relevant sections by retrieval, k=200) | 800k tokens | $4.00 | 60 s |
| Long context after compression (filtered) | 200k tokens | $1.00 | 8 s |

In this table, the RAG rows are roughly 12-200× cheaper per query than the long-context rows. The quality differential depends on the workload: focused-question QA favors RAG; whole-corpus synthesis favors long context; the hybrid (retrieval + long-context for the selected subset) hits the cost/quality sweet spot for many production workloads. See our RAG production architecture post for the retrieval side.
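The table's dollar figures are consistent with a simple per-token price model. The rates below ($1 per million input tokens for the short-context rows, $5 per million for the long-context rows) are back-derived from the table and are assumptions for illustration, not any provider's price list.

```python
def cost_per_query(context_tokens: int, usd_per_million: float) -> float:
    """Input-token cost of one query at a flat per-million-token rate."""
    return context_tokens / 1_000_000 * usd_per_million

rows = {
    "naive RAG (20k)": cost_per_query(20_000, 1.0),
    "smart RAG (80k)": cost_per_query(80_000, 1.0),
    "long context (800k)": cost_per_query(800_000, 5.0),
    "long context, filtered (200k)": cost_per_query(200_000, 5.0),
}
for name, usd in rows.items():
    print(f"{name}: ${usd:.2f}")
```

Swapping in your own provider's rates, including any long-context surcharge and cache discounts, is usually enough to decide which row a workload belongs in.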

When retrieval breaks long context

A subtle interaction: many "long context" tasks are actually retrieval tasks in disguise. The model is supposed to find a specific fact in a long document. If the retriever's job is implicit in the model (everything is in context), the model still has to do retrieval — and the lost-in-the-middle effect is exactly that retrieval failing. Making retrieval explicit (via RAG) usually outperforms making it implicit (via long context) for fact-retrieval tasks. Long context wins only when the question requires synthesizing across the entire context, not just retrieving one piece of it.


Evaluating long-context quality

"128k context" on a model card is a structural claim, not a quality claim. Evaluation matters more than at short context.

Needle-in-a-haystack tests

Place a target fact at a specific position in a long document, ask a question that requires recall. Vary the position. Measure accuracy.

A model with uniform recall across positions is good. A model that scores well at the start and end but poorly in the middle has "lost-in-the-middle" syndrome.
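A needle test is simple enough to build in a few lines. In the sketch below, `ask_model` is a hypothetical callable standing in for whatever client you use, the filler sentence and secret code are arbitrary, and `forgetful_model` is a toy stand-in that only "reads" the start and end of the prompt so the harness has something to measure.

```python
from typing import Callable

def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end) of repeated filler."""
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)

def recall_by_depth(ask_model: Callable[[str], str], answer: str,
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0), n_sentences: int = 1000) -> dict:
    needle = f"The secret code is {answer}."
    results = {}
    for depth in depths:
        prompt = build_haystack("The sky was grey that day.", needle, n_sentences, depth)
        reply = ask_model(prompt + "\n\nWhat is the secret code?")
        results[depth] = answer in reply
    return results

# Toy stand-in "model": only sees the first and last 2,000 characters of the prompt.
def forgetful_model(prompt: str) -> str:
    visible = prompt[:2000] + prompt[-2000:]
    return "7421" if "7421" in visible else "I don't know."

print(recall_by_depth(forgetful_model, answer="7421"))
```

The toy model passes at depths 0.0 and 1.0 and fails in between, which is the lost-in-the-middle signature in its most extreme form. For a real evaluation, repeat each depth across several context lengths and needle phrasings.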

Multi-needle / multi-hop

Multiple facts to combine. Stresses cross-position reasoning, not just recall.

Workload-specific long-context tasks

Code understanding across a large codebase. Document Q&A across a long contract. Conversation memory over many turns. These tax the model in ways generic needle-in-haystack misses.

What headline numbers hide

A model can score well on aggregate long-context benchmarks while failing on:

  • Hard middle positions.
  • Multi-fact retrieval.
  • Reasoning that requires tracking long-range dependencies.

The reliable evaluation is on your actual workload at the lengths you actually serve.

Standard long-context benchmarks

| Benchmark | What it tests | Lengths covered | Used for |
| --- | --- | --- | --- |
| Needle-in-a-haystack (NIAH) | Single-fact retrieval at varied positions | 4k - 2M | Quick sanity check |
| RULER | 13 subtasks: NIAH variants, aggregation, QA, variable-tracking | 4k - 1M | Standard benchmark |
| LongBench | Multi-document QA, summarization, code | 4k - 32k typically | Older standard |
| InfiniteBench | Long-context aggregation and reasoning | up to 1M | Frontier eval |
| Loong | Long-context multi-document QA | 32k - 200k | Quality at moderate length |
| BABILong | Synthetic reasoning at long length | 4k - 1M | Reasoning chain analysis |

The de facto reference in 2026 is RULER. NIAH alone is too easy — many models pass it while failing on harder tasks. Trust RULER for cross-model comparison, but always supplement with workload-specific tests.

The lost-in-the-middle effect, quantified

The original Liu et al. paper showed mid-context recall dropping by 20-40 percentage points relative to start/end. Two years later, the effect is reduced but not eliminated in modern models:

| Model | NIAH start | NIAH middle | NIAH end |
| --- | --- | --- | --- |
| Llama 3.1 70B at 64k | 97% | 78% | 95% |
| Qwen2.5 72B at 64k | 96% | 81% | 94% |
| Claude 3.7 Sonnet at 64k | 99% | 95% | 99% |
| Gemini 2.0 Pro at 64k | 98% | 91% | 98% |
| GPT-4 Turbo at 64k | 96% | 84% | 95% |

The frontier closed models have largely fixed the issue at 64k; open weights are still working through it. At 128k+, the middle-position gap widens again across all models.


Hardware considerations

Long-context serving is HBM-capacity-limited more than anything else.

HBM capacity

  • H100 SXM: 80 GB. Tight for long-context production.
  • H200: 141 GB. Significantly more headroom.
  • B200: 192 GB. Frontier capacity.
  • MI300X / MI325X: 192 GB / 256 GB. Among the highest capacities available.
  • GB200 NVL72: 192 GB per GPU × 72 GPUs = 13.8 TB unified HBM domain. The ceiling for single-rack long-context serving.

For 128k contexts and above, the HBM-rich GPUs are essentially mandatory unless you accept aggressive KV quantization and/or offloading.

Inter-GPU bandwidth

Ring attention and tensor-parallel attention both require fast inter-GPU links. NVLink within node, NVL72-class within rack, InfiniBand across racks.

The bandwidth requirement scales with sequence length and per-step work. A 1M-token ring attention step moves ~3 GB per layer per GPU. Over 60 layers that's 180 GB per forward pass per GPU pair. At NVLink bandwidth (900 GB/s per GPU on Hopper-generation NVLink), this is 200 ms of pure transfer time per forward pass, comparable to the attention compute; it overlaps well in practice but is not free. At InfiniBand bandwidth (50 GB/s per NIC), the same transfer takes 3.6 s, which cannot be hidden behind compute and destroys the long-context economics.
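The transfer-time arithmetic above is bytes divided by bandwidth, using the paragraph's own figures (3 GB per layer, 60 layers):

```python
def transfer_seconds(gb_per_layer: float, layers: int, bandwidth_gb_s: float) -> float:
    """Pure wire time for one forward pass, ignoring latency and overlap with compute."""
    return gb_per_layer * layers / bandwidth_gb_s

nvlink = transfer_seconds(3, 60, 900)
infiniband = transfer_seconds(3, 60, 50)
print(f"NVLink {nvlink:.1f} s, InfiniBand {infiniband:.1f} s")
```

Whether the transfer is "free" depends on how it compares with per-pass attention compute on your hardware: anything shorter can in principle be overlapped, anything longer shows up directly as latency.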

Topology

Long-context serving with ring attention prefers all GPUs in one fast-fabric domain. Rack-scale fabrics (NVL72 and similar) were partially motivated by long-context and MoE workloads. For the fabric architecture see NVLink and rack-scale topology.

HBM capacity needed by context length (70B model, FP8 KV)

| Context | Per-request KV | Concurrent requests per H200 (141 GB) | Concurrent per B200 (192 GB) |
| --- | --- | --- | --- |
| 8k | 1.3 GB | 50-80 (limited by weights) | 60-100 |
| 32k | 5.4 GB | 15-20 | 20-25 |
| 128k | 21.5 GB | 3-4 | 4-6 |
| 512k | 86 GB | 0-1 | 1 |
| 1M | 167 GB | none, requires multi-GPU | none, requires multi-GPU |

The cliff between 128k and 512k is the production planning constraint. Above 128k, single-GPU serving requires aggressive INT4 KV; above 256k, multi-GPU is mandatory.

Cost-per-token by context length

A representative scaling curve, normalized to short-context cost:

| Context | Prefill cost (relative) | Decode cost (relative) | Total cost for 200 output tokens |
| --- | --- | --- | --- |
| 4k | 1× | 1× | 1× |
| 32k | ~8× | 1.5× | |
| 128k | ~50× | | 18× |
| 512k | ~250× | 12× | 80× |
| 1M | ~600× | 22× | 180× |

The decode cost grows linearly with context (each token's attention reads all KV); the prefill cost grows quadratically. At very long context, the upfront prefill dominates the total cost. Prefix caching (reusing KV across requests with shared prefixes) is the single largest practical mitigation — for an Anthropic-style 1-hour TTL prefix cache hit, the prefill cost is amortized across many requests and the per-request cost drops back toward the decode-only contribution.
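The linear-versus-quadratic statement refers to the attention term specifically. The sketch below uses the standard FLOP estimate for attention (QKᵀ and the weighted sum over V each cost about 2·n²·d per layer for prefill, halved by causal masking); `d_model = 8192` and 80 layers are Llama-3-70B-like values chosen for illustration.

```python
def prefill_attention_flops(n: int, d_model: int, layers: int) -> float:
    # QK^T and AV each cost ~2 * n^2 * d_model FLOPs per layer; causal masking halves the total
    return 0.5 * 4 * n * n * d_model * layers

def decode_attention_flops(n: int, d_model: int, layers: int) -> float:
    # one new query against n cached keys and values
    return 4 * n * d_model * layers

base = 4 * 1024
for n in (32 * 1024, 128 * 1024, 1024 * 1024):
    p = prefill_attention_flops(n, 8192, 80) / prefill_attention_flops(base, 8192, 80)
    d = decode_attention_flops(n, 8192, 80) / decode_attention_flops(base, 8192, 80)
    print(f"{n // 1024}k: prefill attention x{p:.0f}, decode attention x{d:.0f}")
```

These attention-only ratios are much steeper than whole-model cost ratios, because at short context the linear layers (projections, MLP) and weight reads dominate and dilute the quadratic term.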


Production deployments

Models with strong long-context support in 2026:

  • Claude family — long, well-evaluated context windows (200k+). Among the best at minimizing lost-in-the-middle on standard evaluations.
  • Gemini — has historically pushed the longest context windows (1M, 2M).
  • GPT-4 / GPT-5 lineage — long context, competitive but not market-leading on absolute length.
  • DeepSeek-V3 / R1 — 128k context.
  • Qwen3 — 128k+ context with YaRN extension.
  • Llama 3.x / 4.x — 128k context windows.
  • Kimi (Moonshot AI) — 2M context, an outlier on absolute length. The Mooncake disaggregated serving stack underneath is part of why it is practical.
  • Mistral Large 2 — 128k with sliding-window attention in many layers, an efficiency-first design.

The honest market summary: closed-source frontier models (Claude, Gemini) currently lead on effective long-context quality; open-weight models lead on absolute advertised length when fine-tuned aggressively. The gap on RULER-style hard tasks is narrowing but not closed.

Serving stacks with long-context optimizations:

  • vLLM — paged attention, FP8 KV cache, multi-GPU long context.
  • SGLang — RadixAttention for prefix sharing in long-context workloads.
  • TensorRT-LLM — chunked prefill, KV cache quantization, sequence parallelism.
  • lmdeploy — competitive long-context support, smaller community.
  • Mooncake — disaggregated KV pool architecture, the production-validated reference for 2M-context serving at scale.

Stack-level long-context features

| Feature | vLLM 0.8 | SGLang 0.4 | TRT-LLM 0.18 | Mooncake |
| --- | --- | --- | --- | --- |
| FlashAttention-3 | Yes | Yes | Yes | Yes |
| PagedAttention | Yes | Yes (via RadixAttention) | Yes | Yes |
| Chunked prefill | Yes (V1 scheduler) | Yes | Yes | Yes |
| FP8 KV | Yes | Yes | Yes | Yes |
| INT4 KV | Beta | Yes | Yes | Yes |
| Ring Attention | Yes (beta) | Yes | Yes (Context Parallel) | Yes |
| DeepSpeed-Ulysses | No | Partial | Partial | Yes |
| Prefix caching | Yes (auto) | Yes (radix) | Yes | Yes (distributed) |
| Distributed KV pool | No | Partial | Partial | Yes (reference) |
| YaRN / context extension at runtime | Yes | Yes | Yes | Yes |

Concrete deployment recipes

128k chat on H200 SXM (8 GPU node):

  • Model: Llama 3.1 70B FP8 + LoRA adapters
  • Attention: FlashAttention-3
  • KV cache: FP8, PagedAttention
  • Position: RoPE base, no extension needed (model is natively 128k)
  • Parallelism: TP=4, DP=2 within node
  • Concurrency: 16-32 concurrent requests at full context
  • Cost per 128k prefill: ~$0.10
  • Cost per output token at 128k context: 5-8× short-context

1M context document analysis on B200 NVL72:

  • Model: Claude or Gemini API for production; for self-hosted, Llama 4 Maverick or DeepSeek-V3 with extension
  • Attention: FlashAttention-3 + Ring Attention or NVIDIA Context Parallel
  • KV cache: FP8 or INT4 per-head
  • Position: YaRN or LongRoPE for extension to 1M
  • Parallelism: SP=8, TP=8 inside rack
  • Concurrency: 1-2 concurrent 1M requests per rack
  • Cost per 1M prefill: ~$2-5
  • Latency: 30-90 s for prefill, then streaming

RAG over 100M-token corpus with 32k context model:

  • Retriever: bi-encoder (BGE-M3, GTE-large), reranker (BGE-reranker)
  • Generator: any 32k-context model (Llama 3.1 70B is overkill; 8B works)
  • Top-k: 20-50 chunks at 1k each, post-rerank
  • Cost per query: $0.01-0.05
  • Latency: 200-800 ms total

These are starting points. Tune for your specific workload.


Open problems

Effective context vs advertised context. Closing the gap between "supports 1M tokens" and "actually maintains quality at 1M tokens" is the central open problem.

Memory-efficient attention beyond FlashAttention. Sub-quadratic alternatives (Mamba, state-space models, linear attention) are competitive in some regimes but have not displaced softmax attention at the frontier.

Hybrid architectures. Mixing attention with state-space layers, attention with sliding windows, attention with retrieval. Empirically promising; theoretically not understood.

Long-context training data quality. Many "long-context" datasets are concatenations of unrelated documents, not genuinely long coherent content. Real long-context capability requires better data.

KV cache as a shared service. Distributed KV pools across many serving replicas, often paired with disaggregated prefill/decode. Mooncake and similar systems demonstrate the idea; productionizing is still in progress.

KV reuse across requests with semantic similarity. Beyond exact prefix matching, finding KV cache reuse opportunities based on semantic similarity (paraphrased prompts, similar code patterns) is open. Current systems require exact prefix matches; a fuzzy-match capability would dramatically increase cache hit rates.

Long-context fine-tuning data. Genuinely long, coherent training data is scarce. Synthetic generation of long-context training examples (multi-document syntheses, long-chain reasoning traces) is an active area. The quality of long-context capability is upstream-bottlenecked by training data quality.

Cost-aware context routing. Models that learn to decide "should I use the full context or just the most relevant part?" Currently a manual decision per request; making it automatic would optimize cost-quality tradeoffs systematically.

Compression of KV during generation. Online KV pruning that maintains quality. Active research.

Speculative decoding at long context. Draft models trained at short context struggle to draft accurately for long-context targets, and the per-step KV cost makes the speedup math less favorable. Better long-context drafts are an active research area. See our speculative decoding guide.

Cross-modal long context. Long-video and long-audio inputs push context lengths into the millions of tokens (a 1-hour video at 24 fps × 30 tokens/frame = 2.6M tokens). The combination of long context with multimodal embeddings is at the research frontier; production deployments are very limited.

Long-context cost models for the API economy. Hosted providers price long-context heavily (typically 2-5× per-token over short context, with separate caching tiers). Building accurate cost models for "should I send 200k context or do RAG?" is non-trivial and under-tooled.

State-space and hybrid architectures

Mamba (Gu & Dao, 2023) and Mamba-2 (Dao & Gu, 2024) replace the attention block entirely with a selective state-space layer that has O(n) compute and a fixed-size state, so memory per decode step is constant. Quality on standard benchmarks lags pure attention at the frontier but the gap is closing: hybrid models (e.g., Jamba, Zamba) interleave Mamba-style state-space layers with attention layers and report competitive results at much lower compute. Production adoption is limited to a few labs, but the cost story is compelling enough to warrant tracking. RWKV and RetNet are related recurrent / linear-attention lineages with a similar pitch.

When state-space wins

For pure language modeling perplexity at extreme length (1M+), state-space architectures already match or beat attention on some metrics. For retrieval and global reasoning at long context, attention still wins because the explicit pairwise interaction lets the model attend back to specific tokens. State-space models compress history into a fixed-size state, which loses the ability to perfectly recall any specific past token. Hybrid models (state-space for the bulk, attention for recall layers) are the pragmatic compromise.


Attention sinks and StreamingLLM

A subtle observation has reshaped long-context inference since 2023: the first few tokens of a sequence accumulate disproportionate attention weight regardless of their content. They become "attention sinks."

The discovery

Xiao et al., 2023 (StreamingLLM) observed that when running an LLM with a sliding window that drops old tokens, quality collapses unless the very first tokens (typically 4 BOS-like positions) are preserved. The attention distribution at every layer routes substantial weight through these positions, even when their content is unrelated to the current query.

The explanation: softmax forces attention weights to sum to one. When a token has no strong reason to attend to any specific past token, it attends weakly but non-zero to many of them. The "sink" positions absorb this excess attention budget. Drop the sinks and the softmax gets concentrated on the remaining tokens, distorting attention patterns and degrading output.
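A toy example shows the redistribution. The logits below are invented for illustration (four sink positions scoring high, eight window tokens scoring low); the point is only what softmax does when the high-scoring entries disappear from the cache.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative logits for one query: 4 sink positions, then 8 window tokens.
logits = np.array([4.0] * 4 + [0.0] * 8)
with_sinks = softmax(logits)
without_sinks = softmax(logits[4:])  # sinks evicted from the cache

print("mass on sinks:", with_sinks[:4].sum())
print("per-token weight in window, with sinks:", with_sinks[4])
print("per-token weight in window, sinks evicted:", without_sinks[0])
```

With the sinks present, each window token gets a small weight; with them evicted, the same tokens are forced to absorb the entire probability mass, so their weights jump by more than an order of magnitude even though nothing about their relevance changed.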

Practical implications

Production sliding-window implementations (Mistral SWA, Gemma SWA) must preserve a small prefix region as attention sinks. Typical implementation: keep 4 BOS positions plus the last W tokens (the sliding window). The total KV cache is W + 4 entries per layer.
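The "sinks plus window" policy can be sketched as a toy cache. This is a minimal illustration that stores positions rather than real key/value tensors; the class name `SinkWindowCache` and its parameters are hypothetical and not taken from any serving stack.

```python
from collections import deque


class SinkWindowCache:
    """Toy per-layer KV cache: first `n_sink` entries are pinned, plus the last `window`."""

    def __init__(self, n_sink: int = 4, window: int = 8):
        self.n_sink = n_sink
        self.sinks = []                      # attention sinks, never evicted
        self.recent = deque(maxlen=window)   # sliding window; deque drops the oldest entry

    def append(self, position: int, kv=None):
        # The first n_sink tokens become sinks; everything later goes through the window.
        if len(self.sinks) < self.n_sink:
            self.sinks.append((position, kv))
        else:
            self.recent.append((position, kv))

    def positions(self):
        return [p for p, _ in self.sinks] + [p for p, _ in self.recent]


cache = SinkWindowCache(n_sink=4, window=8)
for pos in range(100):
    cache.append(pos)            # kv would hold the (key, value) tensors in a real cache

print(cache.positions())         # the 4 sink positions followed by the last 8 positions
print(len(cache.positions()))    # W + 4 entries
```

A real implementation also has to decide how positions are encoded after eviction (StreamingLLM assigns positions relative to the cache rather than the original text); that detail is omitted here.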

For ring attention and other distributed schemes, the rank holding the sinks gets special treatment — it cannot be evicted regardless of position rotation policy.

Attention sinks vs trained sinks

Microsoft's Sink Token paper, 2024 argues that explicitly training a dedicated sink token (a learnable token prepended to every input) outperforms relying on the implicit BOS sinks. Several 2025+ models include trained sink tokens; the cost is one extra position per sequence, and the quality lift is measurable on long-context retention tasks.


Sparse attention deep dive: Longformer, BigBird, Native Sparse Attention

Full attention is O(n²). Sparse attention restricts which positions can attend to which, reducing complexity at the cost of expressiveness.

Longformer (Beltagy et al., 2020)

Combines three patterns: sliding window (each token attends to ±W neighbors), dilated window (some heads use dilated patterns to capture longer-range dependencies), and global attention (designated tokens attend to and are attended by all positions). The combination has linear complexity in sequence length.
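As a rough illustration of why the pattern is linear in sequence length, the sketch below builds a boolean mask with the sliding-window and global components only (dilation is left out for brevity). The function name `longformer_mask` is an assumption made for this example, not Longformer's actual API.

```python
import numpy as np


def longformer_mask(n: int, window: int, global_idx: list[int]) -> np.ndarray:
    """Boolean (n, n) mask: True where query i may attend to key j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window     # sliding window of +/- `window` neighbors
    mask[global_idx, :] = True         # global tokens attend to every position
    mask[:, global_idx] = True         # every position attends to global tokens
    return mask


n, w = 1024, 32
mask = longformer_mask(n, w, global_idx=[0])
print(mask.sum(), n * n)   # allowed (query, key) pairs vs dense pairs
```

The number of allowed pairs grows roughly as n × (2W + 1) plus a term proportional to the number of global tokens, rather than as n².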

Production status: Longformer-style attention was the dominant sparse pattern in 2020-2022. It has largely been superseded by dense attention with FlashAttention, which is still O(n²) in compute but fast enough to deliver better quality at context lengths up to ~128K. Beyond that, sparse patterns remain relevant.

BigBird (Zaheer et al., 2020)

Adds random attention to the Longformer recipe: each token attends to a small fixed number of randomly chosen positions in addition to the windowed and global patterns. The theoretical motivation: random graph connectivity approximates the universal approximator property of full attention. Practical wins over Longformer were small; BigBird is less commonly used in 2026 production.

Block-sparse attention

A more flexible pattern: divide the sequence into blocks of B tokens, define a sparsity pattern over blocks (which blocks attend to which), implement block-level attention with FlashAttention-style tiling. Modern variants (FlashAttention's sparse mode, FlexAttention's block-sparse) make this efficient.

For workloads with predictable structure (multi-document QA where each document is a block, code generation with module-level boundaries), block-sparse attention can save 50-80% of compute with minimal quality loss.

Native Sparse Attention (DeepSeek, 2025)

DeepSeek's "Native Sparse Attention" paper (early 2025) trains the model with structured sparsity from scratch. Each attention head learns to attend to a sparse subset of positions selected by a small router. Unlike post-hoc sparsity, training-time sparsity allows the model to allocate dense attention to important positions and skip the rest.

Quality: comparable to dense attention on most benchmarks at 30-50% of the FLOPs. Adoption in 2026 is growing; DeepSeek-V3 and the Mistral Large 3 series both use variants.

Sparse attention summary

| Pattern | Complexity | Quality vs dense | Production use |
| --- | --- | --- | --- |
| Sliding window | O(n × W) | -2 to -5% | Mistral, Gemma |
| Longformer | O(n × W) | -3 to -7% | Legacy |
| BigBird | O(n × W) | -3 to -7% | Legacy |
| Block-sparse | O(n × B × density) | -2 to -10% | Custom |
| Native sparse | O(n × routed_k) | -1 to -3% | DeepSeek-V3, Mistral Large 3 |

Linear attention and state-space models: Mamba, RWKV, GLA

Linear attention re-formulates attention with kernel functions that allow recurrent computation. State-space models (SSMs) take a different route to the same goal: linear-time sequence processing.

Performer and Linear Transformers (2020-2021)

Performer (Choromanski et al., 2020) approximates softmax attention with random feature maps that allow associative reordering, reducing complexity from O(n²) to O(n × d²). Linear Transformers (Katharopoulos et al., 2020) use feature maps that allow recurrent updates.
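The "associative reordering" is easiest to see in code. The sketch below uses the elu(x) + 1 feature map from the Linear Transformers line of work and checks that a causal quadratic-form computation and a recurrent computation with a fixed-size state give the same output. Function names are illustrative, and this is a NumPy toy, not an optimized kernel.

```python
import numpy as np


def phi(x):
    # elu(x) + 1: a positive feature map (x + 1 for x > 0, exp(x) otherwise)
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention_quadratic(q, k, v):
    scores = np.tril(phi(q) @ phi(k).T)               # (n, n) causal similarity matrix
    return (scores @ v) / scores.sum(axis=1, keepdims=True)


def linear_attention_recurrent(q, k, v):
    n, d = q.shape
    S = np.zeros((d, v.shape[1]))   # running sum of phi(k_t) v_t^T, size independent of n
    z = np.zeros(d)                 # running sum of phi(k_t), for normalization
    out = np.empty((n, v.shape[1]))
    for t in range(n):
        S += np.outer(phi(k[t]), v[t])
        z += phi(k[t])
        out[t] = (phi(q[t]) @ S) / (phi(q[t]) @ z)
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 16)) for _ in range(3))
same = np.allclose(linear_attention_quadratic(q, k, v),
                   linear_attention_recurrent(q, k, v))
print(same)
```

The recurrent form touches a d × d state once per token, which is where the O(n × d²) cost comes from; the (n, n) matrix is never needed.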

Both work in theory; in practice they trail softmax attention in quality. Modern revivals (Gated Linear Attention, Retentive Networks) close some of the gap.

RWKV-7 (Peng et al., 2024+)

RWKV is a recurrent architecture that interpolates between RNN-style efficiency and transformer-style parallelizable training. RWKV-7 (2024+) introduces dynamic state evolution per layer, closing most of the gap to dense transformers on language modeling.

Quality: within 1-2 points of comparable dense transformers on most benchmarks at scale. Inference complexity: O(n) compute, O(1) memory per token (no growing KV cache). For long-context inference, this is structurally different — the entire context is compressed into a fixed-size state.

Mamba (Gu and Dao, 2023)

Mamba is a structured state-space model with selective state-update. Inference complexity matches RWKV: O(n) compute, O(1) memory per token. Training is parallelizable via a scan algorithm.

Mamba's quality on language modeling matches transformers at small scales (under 3B parameters). At larger scales the gap is smaller than feared but still present on tasks requiring exact retrieval from long context.

Mamba-2 (2024)

Mamba-2 unified the SSM and attention formalisms, showing that the SSM operations can be expressed as masked attention with structured matrices. Practical benefit: better hardware utilization (the SSM update maps to standard matmul kernels) and quality lift over Mamba-1.

Gated Linear Attention (GLA)

Yang et al., 2023. A linear attention variant with data-dependent gating. Quality is between Mamba and dense attention; inference complexity is linear. Closely related gated linear recurrences appear in hybrid models (for example, the RG-LRU layers in Recurrent Gemma).

When linear attention wins

  • Streaming workloads (audio, video processing) where tokens arrive sequentially and memory must be bounded.
  • Edge inference where the O(n) KV cache of transformers is impractical.
  • Very long contexts (>1M tokens) where transformer KV becomes infeasible.

For typical 8K-128K context LLM workloads, dense transformers with FlashAttention dominate quality and are the production default.


Hybrid architectures: Jamba, Recurrent Llama, Falcon-Mamba

The 2024-2026 trend is hybrid architectures that combine attention layers with SSM or linear-attention layers.

Jamba (AI21, 2024)

Jamba is a 52B-parameter hybrid (12B active, MoE) that interleaves attention and Mamba layers at a 1:7 ratio, with an MoE layer replacing the MLP in every other layer. The intuition: attention layers handle precise positional/retrieval reasoning; Mamba layers handle long-range information flow efficiently.

Quality: competitive with comparable-active-param dense transformers; long-context performance better than pure attention at similar scale. Context window: 256K tokens; effective context shown to be ~128K on RULER.

Recurrent Llama / Recurrent Gemma (DeepMind, 2024)

Recurrent Gemma replaces most Llama-style attention layers with linear-recurrent layers, keeping a few attention layers for tasks where exact retrieval matters. Memory per token is constant; inference throughput at long context is dramatically better than dense Llama.

Falcon-Mamba (TII, 2024)

A pure-Mamba 7B model trained at scale. Demonstrated that pure SSMs can compete with similar-scale dense transformers on most benchmarks; underperforms on exact-retrieval tasks (NIAH). Useful as a proof of concept; production deployments typically use hybrid variants for the retrieval robustness.

Hybrid pattern comparison

| Model | Attention layers | SSM layers | Context | Effective context (RULER) |
| --- | --- | --- | --- | --- |
| Jamba 52B | 1/8 | 7/8 | 256K | ~128K |
| Recurrent Gemma 9B | ~10% | 90% | 8K | ~6K |
| Falcon-Mamba 7B | 0% | 100% | 32K | ~16K |
| MiniMax-Text-01 | Mixed | Mixed | 4M | ~1M |
| Zamba 7B | ~20% | 80% | 16K | ~12K |

The pattern: attention provides retrieval precision; SSM provides long-range efficiency. Hybrids inherit the strengths of both at the cost of architectural complexity.

When hybrids make sense in production

Hybrids are most compelling when (a) inference at very long context is the primary constraint, (b) the workload tolerates the slight quality gap on exact-retrieval tasks, and (c) the team can absorb the engineering cost of less-common serving stacks. For a typical 128K-context chat workload, dense Llama-3-70B with FP8 KV beats Jamba on quality and matches it on throughput.

The clearest win for hybrids: streaming workloads (live transcription, real-time agents) where bounded memory per token is structurally required. Pure transformers cannot do this; hybrids and pure SSMs can. By 2027 expect hybrid architectures to dominate the streaming-inference category while dense transformers continue to dominate batch-inference for finite-context workloads.

Hybrid serving stack support

vLLM and SGLang both have experimental hybrid-architecture support in 2026. The implementation is more complex than dense transformers — the SSM update is not the same operation as attention and requires its own kernel. Performance is approaching dense-transformer parity for matched hardware, but the ecosystem maturity gap is still real. TRT-LLM has slower hybrid adoption; the engine-build pipeline is more transformer-centric.


SWA + global tokens: Mistral, Gemma, Gemini patterns

Sliding Window Attention with a small set of global tokens is the dominant pattern for efficient long-context attention in production frontier models.

Mistral SWA

Mistral 7B and Mixtral 8x7B used a sliding window of 4096 tokens per layer. The intuition: information propagates through layers, so a query at layer L can see up to L × 4096 tokens of effective context through the residual stream.

In practice, this "receptive field" expansion is leaky — information further than ~16K tokens from the query is not reliably retrievable. Mistral Large 3 (2025) switched to a hybrid pattern with some full-attention layers interleaved.

Gemma SWA

Gemma 2 (Google, 2024) interleaves local SWA layers (4K window) with global attention layers (full context) in a repeating pattern: five local layers, then one global layer. This compromise gives the model both efficient local context and an explicit long-range pathway every 6 layers.

Gemini long-context details (public)

Gemini 1.5 Pro's 1M-2M context is achieved via a combination of techniques Google has only partially disclosed: sparse attention patterns at long range, MoE for capacity scaling, and aggressive KV compression. RULER benchmark evaluations consistently place Gemini 1.5 Pro and 2.5 Pro at the top of long-context performance among public models, with effective context (>90% retrieval accuracy) extending past 500K tokens.

Pattern-by-model

| Model | Pattern | Window | Global layers | Effective context |
| --- | --- | --- | --- | --- |
| Mistral 7B | Pure SWA | 4K | 0 | ~16K |
| Mixtral 8x7B | Pure SWA | 4K | 0 | ~16K |
| Mistral Large 3 | Hybrid | 8K | ~10% | ~64K |
| Gemma 2 | Interleaved | 4K | ~16% | ~32K |
| Gemma 3 | Interleaved | 8K | ~20% | ~64K |
| Gemini 2.5 Pro | Multi-tier sparse | varies | varies | >500K |

Long-context evaluation deep dive: NIAH, RULER, BABILong, InfiniteBench

Evaluating long-context quality is a research problem in itself. Several benchmark families have emerged.

Needle in a Haystack (NIAH)

The classic test: insert a single piece of distinctive information (the "needle") into a long context filled with unrelated content (the "haystack"). Query the model for the needle. Measure recall accuracy as a function of needle position and haystack length.

NIAH popularized the visualization of long-context performance as a 2D heat map (position × length). It saturated quickly: most frontier 2026 models score >95% on single-needle NIAH at advertised context lengths. It has limited diagnostic value beyond confirming that basic retrieval works.
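A minimal sketch of how the position × length grid can be generated. The filler text, needle, and the helper name `build_niah_prompt` are made up for illustration; a real harness would measure length in tokens and send each prompt to the model under test.

```python
def build_niah_prompt(filler_sentences, needle, depth, n_sentences):
    """Place `needle` at fractional `depth` (0.0 = start, 1.0 = end) of the haystack."""
    haystack = [filler_sentences[i % len(filler_sentences)] for i in range(n_sentences)]
    insert_at = round(depth * n_sentences)
    haystack.insert(insert_at, needle)
    return " ".join(haystack)


filler = ["The sky was grey that morning.", "Trains ran late all week."]
needle = "The secret code is 7421."

# One prompt per (haystack length, needle depth) cell of the heat map.
grid = {}
for n_sentences in (100, 1000):
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        grid[(n_sentences, depth)] = build_niah_prompt(filler, needle, depth, n_sentences)

print(len(grid), grid[(100, 0.5)].count(needle))
```

Each cell is then scored by asking the model for the code and checking the answer; plotting accuracy per cell gives the familiar heat map.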

Multi-needle NIAH

Insert N (typically 3-10) needles, query for a subset. Tests whether the model can locate and combine multiple pieces of information across the context. Significantly harder than single-needle; most models degrade noticeably as N grows.

RULER (Hsieh et al., 2024)

RULER extends NIAH with 13 tasks of varying complexity: single retrieval, multi-key retrieval, multi-value retrieval, aggregation, common words, multi-hop tracing. Reports per-length effective context where the model maintains >85% accuracy.

The 2026 numbers worth knowing:

| Model | Advertised | RULER effective (>85%) |
| --- | --- | --- |
| GPT-5 | 200K | ~96K |
| Claude Opus 4.x | 200K | ~110K |
| Claude Sonnet 4.5 | 200K | ~95K |
| Gemini 2.5 Pro | 2M | ~512K |
| Llama-3 8B 128K | 128K | ~32K |
| Llama-3 70B 128K | 128K | ~64K |
| Llama-4 Scout 1M | 1M | ~256K |
| DeepSeek-V3 128K | 128K | ~64K |
| Qwen2.5-72B 1M | 1M | ~128K |
| Kimi K2 2M | 2M | ~512K |
| MiniMax-Text-01 4M | 4M | ~1M |

The pattern: effective context is typically 1/4 to 1/2 of advertised, with Qwen2.5-72B an outlier near 1/8. Claude and GPT have the highest effective/advertised ratios (about 1/2), possibly due to more aggressive long-context post-training; models advertising 1M+ tokens, closed or open, cluster near 1/4.
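The ratios can be read straight off the table. The snippet below simply divides the RULER effective figure by the advertised figure for each row; the numbers are copied from the table above, not independently measured.

```python
K, M = 1_000, 1_000_000

# model: (advertised tokens, RULER effective tokens), as listed in the table
ruler = {
    "GPT-5": (200 * K, 96 * K),
    "Claude Opus 4.x": (200 * K, 110 * K),
    "Claude Sonnet 4.5": (200 * K, 95 * K),
    "Gemini 2.5 Pro": (2 * M, 512 * K),
    "Llama-3 8B": (128 * K, 32 * K),
    "Llama-3 70B": (128 * K, 64 * K),
    "Llama-4 Scout": (1 * M, 256 * K),
    "DeepSeek-V3": (128 * K, 64 * K),
    "Qwen2.5-72B": (1 * M, 128 * K),
    "Kimi K2": (2 * M, 512 * K),
    "MiniMax-Text-01": (4 * M, 1 * M),
}

ratios = {name: eff / adv for name, (adv, eff) in ruler.items()}
for name, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {r:.2f}")
```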

LongBench and InfiniteBench

LongBench (Tsinghua, 2023) and InfiniteBench (2024) provide realistic task-style evaluations (summarization, QA, code completion) over long documents. Less synthetic than NIAH; more representative of production workloads. The downside: smaller context lengths (up to 200K), making them less useful for evaluating 1M+ models.

BABILong

BABILong adapts the bAbI reasoning tasks to long context by padding the supporting facts with distractor text. Tests multi-hop reasoning over long context. Most models degrade significantly beyond 32K context even when advertised context is much larger.

ZeroSCROLLS

ZeroSCROLLS evaluates summarization, QA, and aggregation over long documents in a zero-shot setting. Real-world tasks; harder than synthetic benchmarks. Useful for production-relevant comparison.

Evaluation pitfalls

Three common mistakes. First, positional bias: models often perform better on needles at the start or end of the context (lost-in-the-middle). Always evaluate at multiple positions. Second, retrieval bias: NIAH-style benchmarks reward exact-match retrieval; production workloads often require fuzzy or contextual retrieval. Augment with task-style benchmarks. Third, length extrapolation: a model trained to 128K but evaluated at 512K may pass NIAH on the trained range and fail wildly past it; benchmark at the actual deployment length.


Per-model 2026 long-context details

A consolidated reference for May 2026 long-context options:

Gemini 2.5 Pro

Context: 2M tokens advertised, with experimental 5M and 10M variants in research preview. RULER effective: ~512K. The strongest long-context model on synthetic benchmarks; real-world performance on multi-needle retrieval tasks is also leading.

GPT-5

Context: 200K tokens. RULER effective: ~96K. A much smaller advertised context than Gemini's, and the same as Claude's standard endpoints, but a strong effective/advertised ratio. Used in production by OpenAI's deep research products.

Claude Opus 4.x and Sonnet 4.5

Context: 200K tokens (some endpoints 500K for enterprise). RULER effective: ~110K (Opus), ~95K (Sonnet 4.5). Anthropic has the highest effective/advertised ratio among frontier models, attributed to extensive long-context post-training.

Llama 3 / Llama 4

Llama-3 8B/70B: 128K advertised, RULER effective ~32K (8B) / ~64K (70B). The open community has fine-tuned variants (Yarn, Chronos, ProLong) that push effective context closer to advertised. Llama-4 Scout (2026): 1M advertised, RULER effective ~256K.

DeepSeek-V3

Context: 128K. RULER effective: ~64K. Strong long-context performance for an open model; particularly good on multi-hop reasoning over long context.

Qwen2.5-72B Instruct

Context: 1M advertised (via YaRN extension), 128K trained. RULER effective: ~128K. Used as a long-context open default by many production teams.

Kimi K2 (Moonshot AI)

Context: 2M advertised. RULER effective: ~512K. The strongest open long-context model in 2026; production-deployed by Moonshot for their consumer products.

MiniMax-Text-01

Context: 4M advertised. RULER effective: ~1M. The longest advertised context among open models in 2026. Uses a hybrid SWA + linear attention architecture for tractable inference.

A note on Gemini's 10M context experiments

Google has published research showing Gemini variants with 10M-token context windows, demonstrated on synthetic NIAH-style benchmarks and on video understanding (a 1-hour video at 1 fps is 3,600 frames, which at a few hundred tokens per frame is on the order of 1M tokens). The production rollout has been cautious; the 10M endpoint is not generally available, and pricing for the rumored future release is expected to be substantially higher per token. The economic question is whether 10M context is genuinely useful enough to justify a serving infrastructure that may cost 10× per request vs 200K context. Most teams who have evaluated it report that hierarchical retrieval over 10M tokens often outperforms direct 10M attention on cost-quality trade-offs.

Production reality check

Effective context numbers exclude task-specific quality differences. A model with high RULER effective context may still underperform on a specific production workload (legal contract analysis, multi-document synthesis) due to training data distribution. Always evaluate on a representative workload before committing.


Production serving math for million-token KV

The economics of 1M-token context serving in 2026:

Llama-3-70B at 1M tokens

KV cache: 80 layers × 8 KV heads × 128 head dim × 2 (K+V) × 1M tokens × 2 bytes (FP16) = ~328 GB.

At 80GB H100 SXM: requires 5+ GPUs to hold the KV cache alone. Practical serving uses 8-GPU tensor-parallel with KV cache sharded across.

With FP8 KV: 164 GB, 3+ GPUs.

With INT4 KV (KIVI): 82 GB, 2 GPUs. Quality loss typically <2% on RULER.
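The three figures above follow from one formula. A small calculator, assuming the Llama-3-70B shape quoted in the text (80 layers, 8 KV heads, head dim 128) and ignoring the small overhead of quantization scales and zero-points for the FP8 and INT4 cases:

```python
import math


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    # The factor of 2 covers keys and values.
    return n_layers * n_kv_heads * head_dim * 2 * n_tokens * bytes_per_elem


LLAMA3_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

results = {}
for name, nbytes in [("FP16", 2), ("FP8", 1), ("INT4", 0.5)]:
    gb = kv_cache_bytes(**LLAMA3_70B, n_tokens=1_000_000, bytes_per_elem=nbytes) / 1e9
    gpus = math.ceil(gb / 80)            # 80 GB GPUs needed for the KV cache alone
    results[name] = (gb, gpus)
    print(f"{name}: {gb:.0f} GB, {gpus} x 80 GB GPUs")
```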

Llama-4 Scout (1M context, MoE) at 1M tokens

MoE sparsity reduces compute per token, but it does not shrink the KV cache: every layer still stores K and V for every token. Llama-4 Scout has ~70 layers × 8 KV heads × 128 head dim × 2 (K+V) × 1M tokens × 2 bytes = ~287 GB. Serving requirements are similar to a dense 70B.

Throughput at 1M tokens

A 1M-token prefill on Llama-3-70B at 8×H100 SXM: 45 seconds with FA3 + tensor parallelism. Cost at $30/hour for the 8-GPU node: ~$0.37 per prefill. The decode that follows runs at typical long-context decode rates (around 30 tokens/sec).

At Gemini 2.5 Pro pricing ($1.25 / M input tokens for ≤200K context, $2.50 / M for >200K), a 1M-token prefill costs $2.50. The economics of long-context-as-a-service are still expensive but feasible for high-value queries.
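The two prices quoted above can be reproduced with a few lines of arithmetic. The sketch assumes the higher API rate applies to the whole prompt once it crosses the 200K threshold, which is how the $2.50 figure in the text works out; function names are illustrative.

```python
def self_hosted_prefill_cost(prefill_seconds, node_dollars_per_hour):
    """Cost of occupying the whole node for the duration of one prefill."""
    return prefill_seconds / 3600 * node_dollars_per_hour


def tiered_api_cost(n_tokens, low_rate, high_rate, threshold=200_000):
    # Assumption: prompts above the threshold are billed entirely at the higher rate.
    rate = high_rate if n_tokens > threshold else low_rate
    return n_tokens / 1e6 * rate


self_hosted = self_hosted_prefill_cost(45, 30.0)           # 8-GPU node at $30/hour
api = tiered_api_cost(1_000_000, low_rate=1.25, high_rate=2.50)
print(f"self-hosted: ${self_hosted:.3f}  API: ${api:.2f}")
```

The self-hosted figure only holds if the node is kept busy; idle time between requests is paid for as well.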

Disaggregation at 1M context

Prefill of a 1M-token prompt is heavily compute-bound: attention FLOPs grow quadratically with length, so the attention term dominates at this scale. Decode is memory-bandwidth-bound. Disaggregated serving (prefill on H100, decode on H200 or B200 with higher bandwidth) is the canonical pattern at this scale. See disaggregated inference.

Per-GPU KV cost math

| GPU | HBM | KV at FP16 (70B) | KV at INT4 (70B) |
| --- | --- | --- | --- |
| A100 80GB | 80GB | ~245K tokens | ~980K tokens |
| H100 80GB | 80GB | ~245K tokens | ~980K tokens |
| H200 141GB | 141GB | ~432K tokens | ~1.7M tokens |
| B200 192GB | 192GB | ~590K tokens | ~2.4M tokens |

Per-GPU capacity excludes model weights. For practical serving, subtract weight footprint (140 GB FP16 for 70B; 35 GB INT4) from the available HBM before computing KV capacity.
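A short sketch of the capacity arithmetic, including the weight subtraction. It assumes the Llama-3-70B KV shape used earlier (327,680 bytes per token at FP16) and decimal gigabytes; the results land within a few percent of the rounded table entries. The helper name `kv_token_capacity` is illustrative.

```python
BYTES_PER_TOKEN_FP16 = 80 * 8 * 128 * 2 * 2   # layers x KV heads x head dim x (K+V) x 2 bytes


def kv_token_capacity(hbm_gb, bytes_per_token, weights_gb=0.0):
    """Tokens of KV cache that fit in `hbm_gb` after reserving `weights_gb` for weights."""
    return int((hbm_gb - weights_gb) * 1e9 // bytes_per_token)


for gpu, hbm in [("A100/H100 80GB", 80), ("H200 141GB", 141), ("B200 192GB", 192)]:
    fp16 = kv_token_capacity(hbm, BYTES_PER_TOKEN_FP16)
    int4 = kv_token_capacity(hbm, BYTES_PER_TOKEN_FP16 / 4)
    print(f"{gpu}: {fp16:,} tokens at FP16, {int4:,} at INT4")

# Practical case: an 8 x 80 GB node holding 140 GB of FP16 weights.
node_tokens = kv_token_capacity(8 * 80, BYTES_PER_TOKEN_FP16, weights_gb=140)
print(f"8 x 80 GB node after weights: {node_tokens:,} tokens of FP16 KV")
```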

Batching at long context

Batch size at 128K is fundamentally limited by KV memory. A 70B model at 128K needs ~42 GB of FP16 KV per sequence. With the model sharded 8 ways, each H100 SXM has about 70 GB left for KV (80 GB minus a ~10 GB FP8 weight shard), which works out to roughly one sequence per GPU at this context. With INT4 KV, the same budget holds batch 4-6 per GPU at 128K. At 1M tokens, even with INT4 KV (~82 GB per sequence), batch sizes are about 1 per GPU at best.
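The 128K batch figures can be checked with a per-GPU budget calculation. This sketch treats the ~10 GB weight figure as a per-GPU share and counts how many whole sequences fit in what remains; it is a simplification that ignores activations and allocator overhead, and the function name is illustrative.

```python
KV_BYTES_PER_TOKEN_FP16 = 80 * 8 * 128 * 2 * 2   # Llama-3-70B shape, FP16


def sequences_per_gpu(context_tokens, kv_bytes_per_token, hbm_gb=80.0, weight_share_gb=10.0):
    """Whole sequences whose KV fits in one GPU's share after reserving room for weights."""
    kv_gb_per_seq = context_tokens * kv_bytes_per_token / 1e9
    return int((hbm_gb - weight_share_gb) // kv_gb_per_seq)


fp16_batch = sequences_per_gpu(128_000, KV_BYTES_PER_TOKEN_FP16)
int4_batch = sequences_per_gpu(128_000, KV_BYTES_PER_TOKEN_FP16 / 4)
print(fp16_batch, int4_batch)
```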

The economics: serving long context has much lower throughput per GPU than short context. A 70B model at 8K context can serve thousands of tokens/sec/GPU; the same model at 128K serves hundreds; at 1M, tens. Pricing models for long-context API access reflect this, with per-token rates rising sharply past 200K context (Gemini 2.5 Pro doubles the rate above 200K; Claude maintains a single rate up to 200K and charges a premium for the 500K endpoint).

Prefix caching impact at long context

If multiple requests share a long prefix (system prompt + document), cache the prefix's KV. Subsequent requests skip the prefill for the shared portion, dropping latency from minutes to seconds. Production stacks (vLLM with prefix caching, SGLang with RadixAttention, TRT-LLM with KV cache reuse) all implement this. The cost: HBM occupied by cached prefixes; eviction policy needed when the cache fills.

At 1M context, prefix caching is essential. A 1M-token document might be queried 50 times by a single user across a session; without caching, each query re-prefills the document at 45-second cost. With caching, the first query pays the full cost; subsequent queries pay only for the small query-specific tail.



Context dilution and remedies

Even when the model can technically attend to 1M tokens, the practical quality on long context is often worse than a curated 32K context. The phenomenon: context dilution.

The mechanism

Softmax attention distributes weight across all positions. Adding more tokens, most of which are unrelated to the query, spreads the attention budget thinner. The relevant tokens still receive attention but their relative weight drops, and the noise from irrelevant tokens accumulates in the output.
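A toy simulation makes the dilution visible. It assumes one relevant token with a fixed logit advantage and i.i.d. Gaussian logits for everything else, which is a deliberately crude model of real attention scores; the function name and parameters are illustrative.

```python
import numpy as np


def weight_on_relevant(n_context, relevant_logit=5.0, noise_std=1.0, seed=0):
    """Softmax weight received by a single relevant token among n_context positions."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(0.0, noise_std, size=n_context)   # irrelevant tokens
    logits[0] = relevant_logit                            # the one relevant token
    w = np.exp(logits - logits.max())                     # numerically stable softmax
    return (w / w.sum())[0]


weights = [weight_on_relevant(n) for n in (1_000, 32_000, 1_000_000)]
for n, w in zip((1_000, 32_000, 1_000_000), weights):
    print(f"n = {n:>9,}: weight on relevant token = {w:.5f}")
```

The relevant token keeps the same logit throughout, yet its share of the attention budget shrinks roughly in proportion to 1/n as irrelevant positions are added.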

Lost-in-the-middle (Liu et al., 2023)

A specific case: information in the middle of a long context is attended to less effectively than information at the beginning or end. The original paper showed 10-30 point accuracy drops on multi-document QA when the relevant document was in the middle vs the start or end of the context.

The cause is partly position-encoding (RoPE biases attention toward nearby positions) and partly training-data distribution (models are typically trained on short contexts where this asymmetry doesn't matter).

Remedies

  • Reranking before context: retrieve top-K candidates, rerank for relevance, place top results at the start or end of the context, not the middle.
  • Hierarchical processing: summarize chunks of long context first, attend over summaries, then expand to full text for the relevant chunks.
  • Active retrieval mid-generation: instead of stuffing all relevant context upfront, retrieve incrementally as the model needs information. This is the agentic-RAG pattern; sidesteps context dilution by keeping the active context small.
  • Long-context-aware training: post-training on long-context tasks with explicit middle-position needles improves performance there. Anthropic and Google both invest heavily in this; results show in RULER's middle-position accuracy.

YaRN/PI/NTK-aware extension details

Extending a model's context beyond its trained window without full retraining.

Position Interpolation (PI, Chen et al., 2023)

Rescale position indices so that the trained range covers the new context. A model trained on 4096 positions extended to 32768 has positions divided by 8; each position now maps to an interpolated value within the trained range.

PI is cheap (no training required) but degrades quality, particularly on tasks requiring fine positional discrimination. With light fine-tuning (1-2% of training compute), most of the quality is recovered.
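In RoPE terms, Position Interpolation just divides the position index by the extension factor before computing rotation angles. A minimal sketch, assuming the standard RoPE frequency schedule with base 10,000 and a head dimension of 128 (both illustrative choices):

```python
import numpy as np


def rope_angles(positions, head_dim=128, base=10_000.0, scale=1.0):
    """Rotation angles theta[p, i] = (p / scale) * base**(-2i / head_dim)."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(np.asarray(positions) / scale, inv_freq)


trained_len, target_len = 4096, 32768
scale = target_len / trained_len                      # 8

trained = rope_angles(np.arange(trained_len))
extended = rope_angles(np.arange(target_len), scale=scale)

# All interpolated positions stay inside the trained range [0, 4096).
print(extended.max() < trained_len)
# Position 8 after interpolation lands exactly on trained position 1.
print(np.allclose(extended[8], trained[1]))
```

The cost is resolution: adjacent tokens are now 1/8 of a trained position apart, which is why tasks needing fine positional discrimination suffer until the model is fine-tuned.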

NTK-aware scaling

bloc97 et al., 2023. Modifies the RoPE frequency base to extend context: higher frequencies (high-detail) are kept; lower frequencies (long-range) are scaled. Quality degradation is less than PI; no training required.
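The commonly cited form of NTK-aware scaling replaces the RoPE base b with b × s^(d/(d-2)) for extension factor s and head dimension d. The sketch below applies that formula and shows the effect per frequency; the specific values (d = 128, s = 8) are illustrative.

```python
import numpy as np


def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware base: chosen so the lowest frequency is divided by exactly `scale`."""
    return base * scale ** (head_dim / (head_dim - 2))


head_dim, base, scale = 128, 10_000.0, 8.0
exponents = -np.arange(0, head_dim, 2) / head_dim

inv_freq_orig = base ** exponents
inv_freq_ntk = ntk_scaled_base(base, scale, head_dim) ** exponents
ratio = inv_freq_ntk / inv_freq_orig

print(ratio[0])    # highest frequency: unchanged
print(ratio[-1])   # lowest frequency: scaled by 1/scale, as in plain interpolation
```

Frequencies in between are scaled by intermediate amounts, so local (high-frequency) position information is preserved while long-range dimensions are stretched to cover the new context.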

YaRN (Peng et al., 2023)

YaRN combines NTK-aware scaling with a temperature term on the attention logits. Best balance of training-free quality and extension factor. Used by many open-model long-context fine-tunes (Yarn-Llama, Yarn-Mistral).

Typical extension factor with YaRN: 4-8× the trained context with <2% quality loss after light fine-tuning. Beyond 8×, quality degrades significantly even with fine-tuning.
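A simplified sketch of the two YaRN ingredients as described in the paper: a per-dimension blend between interpolation and no scaling, driven by how many full rotations each dimension completes within the trained window, and an attention scale of 0.1 × ln(s) + 1. The ramp bounds alpha = 1 and beta = 32 are the values the paper suggests for Llama-family models; function names are illustrative and this is not the reference implementation.

```python
import numpy as np


def yarn_inv_freq(head_dim, base, scale, orig_len, alpha=1.0, beta=32.0):
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    rotations = orig_len * inv_freq / (2 * np.pi)      # full turns inside the trained window
    gamma = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # gamma = 1: keep the frequency as trained; gamma = 0: interpolate (divide by scale)
    return (1 - gamma) * inv_freq / scale + gamma * inv_freq


def yarn_attention_scale(scale):
    # Factor applied to queries and keys; equals sqrt(1/t) for softmax temperature t.
    return 0.1 * np.log(scale) + 1.0


head_dim, base, scale, orig_len = 128, 10_000.0, 8.0, 4096
orig = base ** (-np.arange(0, head_dim, 2) / head_dim)
new = yarn_inv_freq(head_dim, base, scale, orig_len)

print(new[0] / orig[0], new[-1] / orig[-1])   # high frequency kept, low frequency interpolated
print(yarn_attention_scale(scale))
```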

Practical recipe (May 2026)

For a model trained at 8K context targeting 128K: YaRN with scaling factor 16, fine-tune for 1B tokens of long-context data, evaluate on RULER. Typical effective context after this pipeline: ~32K-64K. Pushing further requires training from scratch at the target context length.

Comparison table

| Technique | Training cost | Max practical extension | Quality at limit |
| --- | --- | --- | --- |
| Position Interpolation | 0 (or light) | | -5 to -15% |
| NTK-aware | 0 | | -3 to -10% |
| YaRN | Light fine-tune | | -1 to -5% |
| Training from scratch at long context | Full | Unbounded (model-limited) | Baseline |

Block-sparse routing and learned compression: the 2026-2027 frontier

The research direction that has the best chance of changing long-context economics within 18 months: making attention itself sparse and learned, rather than relying on post-hoc retrieval to sidestep dense attention.

Block-sparse routing

A small router network predicts, for each query token, which K of the M context blocks it should attend to. Attention is computed densely within the selected blocks. FLOPs scale as O(n × K × B) instead of O(n²), where K << M.
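A NumPy sketch of the idea. The "router" here is a stand-in: it scores each block by the dot product between the query and the block's mean key, whereas a real system would use a learned router. The sketch is non-causal and single-head for brevity, and assumes the sequence length is a multiple of the block size.

```python
import numpy as np


def block_sparse_attention(q, k, v, block=16, top_k=4):
    n, d = q.shape
    m = n // block                                    # number of context blocks M
    k_blocks = k.reshape(m, block, d)
    v_blocks = v.reshape(m, block, v.shape[1])
    summaries = k_blocks.mean(axis=1)                 # (M, d): one summary key per block
    out = np.empty((n, v.shape[1]))
    for i in range(n):
        router_scores = summaries @ q[i]              # stand-in for a learned router
        chosen = np.argsort(router_scores)[-top_k:]   # indices of the K selected blocks
        ks = k_blocks[chosen].reshape(-1, d)
        vs = v_blocks[chosen].reshape(-1, v.shape[1])
        logits = ks @ q[i] / np.sqrt(d)
        w = np.exp(logits - logits.max())
        out[i] = (w / w.sum()) @ vs                   # dense attention inside chosen blocks
    return out


rng = np.random.default_rng(0)
n, d, block, top_k = 256, 8, 16, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
sparse_out = block_sparse_attention(q, k, v, block=block, top_k=top_k)

sparse_pairs, dense_pairs = n * top_k * block, n * n
print(sparse_out.shape, sparse_pairs, dense_pairs)
```

With K equal to the number of blocks the function reduces to ordinary dense attention, which is a convenient correctness check; with K much smaller than M the scored query-key pairs drop from n² to n × K × B.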

Early 2025 papers (Sparse Mixture of Attention, DeepSeek's Native Sparse Attention) show that block-sparse routing can match dense attention quality at 30-50% of the compute when trained from scratch. The 2026 production status: experimental in research, beginning to ship in some frontier models. By 2027 we expect block-sparse routing to be the default in new frontier models targeting 1M+ context.

Learned KV compression

Train a small compression network alongside the model that compresses old KV entries into a fixed-size memory state, similar to SSM hidden states. Eviction of distant context is replaced by compression into the memory. The trade-off: bounded memory at the cost of compressing-away fine details.

Research demonstrations (Compressive Transformer, Recurrent Memory Transformer) showed feasibility years ago; production adoption has been slow because the quality trade-off was not favorable. The 2025-2026 generation (learned recurrent memories in MiniMax-Text-01 and Kimi K2 variants) is closing the gap.

Retrieval-as-attention

The most speculative direction: replace the global attention layer with an explicit retrieval over a large memory store. Each query token retrieves K relevant entries from the store via approximate nearest neighbor lookup, then attends to them densely. The retrieval store can be much larger than the model's working context; entries are evicted by an explicit policy rather than by attention weight.

Memorizing Transformer (Wu et al., 2022) and MEGABYTE-RAG (2024) are the early demonstrations. Production deployment is bottlenecked on training data and infrastructure for retrieval-aware fine-tuning; expect first production deployments in 2026-2027.

Implications for serving

If block-sparse and learned compression become the dominant patterns, the serving math changes significantly. KV memory cost may stop scaling linearly with context. Attention compute may scale sublinearly with context. The disaggregation patterns from today (prefill-heavy vs decode-heavy) may shift; sparse attention's compute is more decode-like even at long sequences.


FlashAttention generations: FA1, FA2, FA3 mechanics

The FlashAttention line of work (Dao et al., 2022; Dao, 2023; Shah et al., 2024) is the kernel-level lever that makes long-context practical. Each generation tackled a different bottleneck.

FlashAttention-1 (FA1, 2022)

The original insight: standard attention reads and writes the n×n attention-score matrix to HBM, which dominates the wall-clock cost. FA1 never materializes that matrix in HBM: it tiles Q, K, V into blocks that fit in SRAM (on-chip shared memory) and computes softmax online with a running-max trick. The arithmetic complexity stays O(n²), but the extra memory drops from O(n²) to O(n), and HBM traffic falls to O(n²d²/M) accesses for SRAM size M, versus O(nd + n²) for standard attention. On A100, FA1 delivered 2–4× speedups over naive PyTorch attention.

Tile sizes are tuned per architecture. On A100 (up to 164 KB of shared memory per SM), typical blocks are 128×64 or 64×128. On H100 (up to 228 KB), larger tiles (128×128) become possible.
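The online-softmax trick can be shown in plain NumPy. This sketch tiles only over K/V blocks and keeps a running max and running sum per query row; a real kernel also tiles over Q and keeps each tile in SRAM. The function name and block size are illustrative.

```python
import numpy as np


def tiled_attention(q, k, v, block=32):
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    row_max = np.full(n, -np.inf)       # running max of scores seen so far, per query
    row_sum = np.zeros(n)               # running softmax denominator, per query
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                     # only an (n, block) score tile exists
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)        # rescale earlier partial results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]


def reference_attention(q, k, v):
    s = q @ k.T / np.sqrt(q.shape[1])                 # full (n, n) score matrix
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 16)) for _ in range(3))
matches = np.allclose(tiled_attention(q, k, v), reference_attention(q, k, v))
print(matches)
```

The result is exact, not an approximation: the correction factor re-bases earlier partial sums whenever a larger score appears in a later tile.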

FlashAttention-2 (FA2, 2023)

Two main changes over FA1. First, restructured the parallelism: instead of parallelising over batch and heads only, parallelise over sequence-length tiles as well, which gives more work to fill the GPU at small batch sizes (common in long-context settings). Second, reduced non-matmul FLOPs (the softmax bookkeeping), which matter because tensor-core GPUs execute matmul FLOPs far faster than everything else. FA2 is roughly 2× faster than FA1 on A100/H100 for typical shapes.

FA2 also reworked how work is split across warps within a thread block, cutting shared-memory traffic. Explicit warp specialisation (producer and consumer warp groups) arrived with FA3.

FlashAttention-3 (FA3, 2024)

Targeted Hopper specifically. Three Hopper-only techniques:

  • WGMMA (warp-group matrix multiply-accumulate). Hopper's async matrix-multiply instruction lets one warp group issue MMAs while another runs softmax. FA3 splits work between producer and consumer warp groups.
  • TMA (Tensor Memory Accelerator). Hopper's async memory-copy engine moves tiles from HBM to SMEM in the background. FA3 overlaps memory copies with compute.
  • FP8 path. FA3 introduced an FP8 variant that uses Hopper's FP8 tensor cores for the QK^T and PV matmuls. Quality is comparable to BF16 at most workloads after careful scaling.

FA3 delivers roughly 1.5–2× over FA2 on H100, and the FP8 path adds another ~1.7× throughput at minimal accuracy cost for many models. The Triton port (in the official Triton repo) lags the CUDA port by 6–12 months but is closing.

FA3 on Blackwell

Blackwell (B100/B200) introduces TCGen5 (next-gen tensor cores) and partition-aware scheduling. Early FA3 patches exist; the FA4 generation expected late 2026 is rumoured to land most of the Blackwell-specific path. Until then, FA3 on Blackwell works but doesn't fully exploit the new hardware.

Why this matters for users

FlashAttention is not optional for serious long-context serving. vLLM, SGLang, TensorRT-LLM, and llama.cpp all depend on FA-family kernels. The version of FA your stack uses directly determines the prefill speed of your long-context workload. As of mid-2026, FA3 on H100/H200 is the production default; FA2 remains the AMD path until ROCm's FA3 catches up.

| Generation | Hardware | Key technique | Speedup over baseline |
| --- | --- | --- | --- |
| Naive PyTorch | A100 | None | 1× (baseline) |
| FA1 | A100 | Tiling + online softmax | 2–4× |
| FA2 | A100, H100 | Better parallelism | 4–8× |
| FA3 (BF16) | H100 | WGMMA, TMA, async | 6–12× |
| FA3 (FP8) | H100 | FP8 tensor cores | 10–20× |
| FA3 / FA4 Blackwell | B200 | TCGen5 (in development) | TBD |

For the deeper technical mechanics see Triton kernel primer.


Decision math: RAG vs long-context vs fine-tune worked examples

Three worked examples showing where each approach actually wins.

Example 1: chatbot over a 200-page company handbook

  • Corpus size: ~150K tokens.
  • Query rate: 1,000 queries/day.
  • Update frequency: monthly.

Long context approach. Stuff full handbook into context every query. Cost per query (at GPT-5-class pricing of $5/M input tokens): $0.75. Daily cost: $750. Monthly cost: $22,500.

RAG approach. Index handbook in a vector store ($10/month). Retrieve top-3 chunks (~3K tokens) per query. Cost per query: $0.015. Daily cost: $15. Monthly cost: $460.

Fine-tune approach. Fine-tune a small model on Q&A pairs derived from the handbook. One-time cost: $2,000. Ongoing inference: $0.005/query. Daily cost: $5. Monthly cost: $150 + amortised training.

Winner: RAG for moderate update frequency; fine-tune if updates are rare and the corpus is stable.
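The Example 1 figures can be reproduced with a few lines. The sketch uses the $5/M input price and 30-day month implied by the text; the break-even calculation at the end is an added illustration that compares the one-time fine-tune cost against the monthly saving over RAG.

```python
PRICE_PER_M_INPUT = 5.00     # $/M input tokens, the GPT-5-class price assumed in the text
QUERIES_PER_DAY = 1_000
DAYS_PER_MONTH = 30


def monthly_token_cost(tokens_per_query, fixed_monthly=0.0):
    per_query = tokens_per_query / 1e6 * PRICE_PER_M_INPUT
    return per_query * QUERIES_PER_DAY * DAYS_PER_MONTH + fixed_monthly


long_context = monthly_token_cost(150_000)                 # full handbook every query
rag = monthly_token_cost(3_000, fixed_monthly=10.0)        # top-3 chunks + vector store
fine_tune = 0.005 * QUERIES_PER_DAY * DAYS_PER_MONTH       # $0.005/query inference

# Months until the $2,000 one-time fine-tune cost is repaid by the saving over RAG.
break_even_months = 2_000 / (rag - fine_tune)

print(f"long context ${long_context:,.0f}  RAG ${rag:,.0f}  fine-tune ${fine_tune:,.0f}")
print(f"fine-tune breaks even vs RAG after {break_even_months:.1f} months")
```

With monthly handbook updates, each update may require a new fine-tune, which is why the break-even point matters for the verdict.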

Example 2: legal contract analysis

  • Each contract: 50K tokens.
  • Queries per contract: 5–20.
  • Reasoning across the whole contract needed: yes.

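Long context approach. Send the whole contract with every query. At $5/M input tokens, a 50K-token prefill is $0.25 per query, so roughly $1.25 per contract at 5 queries and $5 at 20, before decode and before any prefix caching. Quality is excellent because the full contract is in context.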

RAG approach. Retrieval works for "find clause about X" queries but breaks down for "summarise the parties' obligations across the entire contract." The cost savings vs long context aren't worth the quality loss for cross-cutting questions.

Fine-tune approach. Doesn't apply — each contract is unique.

Winner: long context, decisively.

Example 3: technical support knowledge base, 10,000 documents

  • Total corpus: ~50M tokens.
  • Queries per day: 50,000.
  • Each query needs maybe 1–3 documents of context.

Long context approach. Can't fit. Even at 2M-token contexts, 50M tokens is 25× too large.

RAG approach. Index everything. Retrieve top-3 chunks per query (~5K tokens). Cost per query: $0.025. Daily: $1,250. The right answer.

Fine-tune approach. Possibly complementary for tone and house-style matching, but doesn't solve the "50M tokens of knowledge" problem.

Winner: RAG, no contest.

A general decision rule

Situation | Best approach
Corpus < 100K tokens, stable | Long context
Corpus 100K–1M tokens, frequent cross-cutting queries | Long context (with budget)
Corpus 100K–1M tokens, mostly local queries | RAG
Corpus > 1M tokens | RAG
Corpus < 100K tokens + needs domain tone | Fine-tune
Mix of all three | Hybrid: RAG for knowledge, fine-tune for tone, long context for hard queries
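The rule-of-thumb table can be written down as a function, which makes the thresholds explicit and easy to argue with. This is a sketch only: `choose_approach` and its parameter names are hypothetical, and the hybrid row is left out because it is a judgment call rather than a threshold.

```python
def choose_approach(corpus_tokens, cross_cutting=False, needs_tone=False):
    """Map the decision table to code; thresholds are the table's, not tuned."""
    if corpus_tokens > 1_000_000:
        return "RAG"
    if corpus_tokens >= 100_000:
        return "long context (with budget)" if cross_cutting else "RAG"
    return "fine-tune" if needs_tone else "long context"

print(choose_approach(150_000))                      # handbook chatbot (Example 1)
print(choose_approach(50_000))                       # single contract (Example 2)
print(choose_approach(50_000_000))                   # support knowledge base (Example 3)
```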

For the production RAG stack see our RAG production architecture post.


Evaluation pitfalls and methodology

Long-context evaluation is unusually treacherous. A few specific traps and how to avoid them.

Pitfall 1: needle-in-haystack alone

NIAH (Needle in a Haystack, Kamradt) tests whether a model can find a single inserted fact in long context. Almost all 2024+ models score 100% on simple NIAH. This created a false consensus that "100K context is solved" — a few months later the field discovered that multi-fact, reasoning-heavy queries fail at much shorter contexts.

The fix: NIAH is a smoke test, not a comprehensive evaluation. Run RULER, BABILong, or LongBench in addition.

Pitfall 2: positional bias

The "Lost in the Middle" finding (Liu et al., 2023) showed that models attend disproportionately to the start and end of context. A fact placed at position 50% of context length is found less reliably than the same fact at position 5% or 95%. Many benchmarks place needles at random positions; if you sample only a few positions you'll miss this effect.

The fix: evaluate at 5%, 25%, 50%, 75%, 95% positions and report the worst.
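Reporting the worst position rather than the mean is a small change in the aggregation step. The sketch below assumes your harness already produces `(depth, correct)` pairs; the function names and the demo results are made up for illustration.

```python
from collections import defaultdict

def accuracy_by_depth(results):
    """results: iterable of (depth_fraction, correct) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for depth, correct in results:
        totals[depth] += 1
        hits[depth] += int(correct)
    return {d: hits[d] / totals[d] for d in sorted(totals)}

def worst_depth(results):
    """Return (depth, accuracy) for the weakest needle position."""
    per_depth = accuracy_by_depth(results)
    depth = min(per_depth, key=per_depth.get)
    return depth, per_depth[depth]

# Illustrative data only: four trials at each of the five depths.
demo = ([(0.05, True)] * 4
        + [(0.25, True)] * 3 + [(0.25, False)]
        + [(0.50, True)] * 2 + [(0.50, False)] * 2
        + [(0.75, True)] * 3 + [(0.75, False)]
        + [(0.95, True)] * 4)

print(accuracy_by_depth(demo))
print("worst:", worst_depth(demo))
```

In this made-up run the mean is 80%, which hides the fact that mid-context accuracy is a coin flip.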

Pitfall 3: context dilution at long contexts

When context is large but the relevant information is small, models often hallucinate or generate generic responses. This is partly an attention-spread issue and partly a training-distribution issue (long-context training data is rare).

The fix: report quality at multiple context lengths (1K, 8K, 32K, 128K, 512K, 1M) and look for the curve, not a single number.

Pitfall 4: cache hit vs cold prefill

A serving setup that looks fast on cached prompts can be much slower on fresh long-context queries, where cold prefill dominates first-token latency. Eval setups that cache aggressively can mask this.

The fix: report TTFT (time-to-first-token) separately from decode tokens-per-second, and report cache-hit and cold-prefill runs separately.
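One way to keep the two numbers apart is to time the token stream directly. This is a minimal sketch that assumes your client exposes generated tokens as a Python iterator; `measure_stream` and its injectable `clock` argument are hypothetical names, not part of any serving library.

```python
import time

def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token stream; return (ttft_seconds, decode_tokens_per_second).

    TTFT covers queueing plus prefill. Decode throughput is computed only over
    the tokens after the first, so a slow prefill cannot hide inside an average.
    """
    start = clock()
    first = last = None
    count = 0
    for _ in token_iter:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    decode_time = last - first
    decode_tps = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return ttft, decode_tps
```

Run it once with a warm prefix cache and once cold; the gap between the two TTFT values is the number cached-only evals tend to hide.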

Pitfall 5: benchmark contamination

If your eval prompts appear in training data, your scores are inflated. Older long-context benchmarks (and even some 2024 ones) have been incorporated into training corpora.

The fix: use freshly-constructed eval data for headline numbers; report contamination-detection results.

A defensible evaluation harness

A production long-context evaluation should include:

  • NIAH at 4–8 context lengths and 5 positions per length.
  • RULER's multi-needle and aggregation tasks.
  • A small custom set of in-domain questions.
  • TTFT and decode-throughput measurements.
  • A repeat-with-noise stability check (does the model give consistent answers across re-runs?).
  • Per-position accuracy reporting (not just averages).

Without these, advertised context numbers tell you nothing about real performance.
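The repeat-with-noise stability check from the list above can start as something very simple: re-run the same question several times and measure agreement. The sketch below assumes exact match after light normalization is a good enough comparison for your answers; `consistency` is a hypothetical helper name.

```python
from collections import Counter

def consistency(answers, normalize=lambda s: " ".join(s.lower().split())):
    """Fraction of runs that agree with the most common (normalized) answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# Illustrative answers from five re-runs of one question.
runs = ["Net 30 days", "net 30 days", "Net  30 days", "Net 45 days", "net 30 days"]
print(consistency(runs))  # 0.8
```

For free-form answers you would likely swap the normalizer for a semantic comparison, but the reporting shape stays the same.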


Production checklists for shipping long-context

When you're about to ship a product with long-context as a feature, the following items are worth ticking off before launch.

Pre-launch checklist

  • Evaluated effective context (not advertised) on representative workload.
  • Measured TTFT at p50, p95, p99 for the longest realistic prompt.
  • Verified KV-cache memory budget at peak concurrent requests.
  • Confirmed KV quantisation (FP8 or INT8) doesn't degrade quality past your threshold.
  • Tested chunked prefill behaviour for very long prompts.
  • Validated that paged KV / vLLM (or equivalent) handles your largest prompts without OOM.
  • If using sequence parallelism: confirmed multi-GPU communication is healthy at your context length.
  • Tested behaviour at exactly the advertised context limit (often breaks 1–2 tokens past).
  • If your product caches prefixes: confirmed cache invalidation logic is correct.
  • Set explicit per-request context limits with clear user-facing errors.

Operating checklist

  • Monitoring TTFT, decode-tokens-per-second, KV-cache utilisation per replica.
  • Per-tenant context-length quotas to prevent noisy neighbours.
  • Cost-per-conversation dashboard tracking input tokens vs output tokens.
  • Alerting on KV-cache OOM events.
  • Periodic re-evaluation as the underlying model is updated.

When to fall back to RAG

  • Corpus > 1M tokens.
  • Corpus updates more often than weekly.
  • Cost per query at full long-context exceeds your budget.
  • Effective context quality at the relevant size is too low.
  • Multi-tenant serving where context size varies wildly.

These checklists are not exotic; they're the production-engineering hygiene that distinguishes "we have a 1M-token model" demos from "our customers reliably get accurate answers on million-token inputs" products.


Long-context cost tables by model and hardware

Concrete numbers to anchor decisions. All figures are illustrative, drawn from mid-2026 public pricing where available, and rounded.

Table A: input token cost at 100K, 500K, 1M tokens (cloud APIs)

Model | Input $/M | 100K input | 500K input | 1M input
GPT-5 | ~$5 | $0.50 | n/a (200K limit) | n/a (200K limit)
Claude Opus 4.x | ~$15 | $1.50 | n/a (200K standard limit) | n/a (200K standard limit; 1M tier not priced here)
Claude Sonnet 4.x | ~$3 | $0.30 | n/a | n/a
Gemini 2.5 Pro | ~$1.25 | $0.125 | $0.625 | $1.25
Llama 4 Maverick (hosted) | ~$1 | $0.10 | $0.50 | $1.00

Output tokens are typically priced at 3–5× input tokens. For interactive, multi-turn workloads the long context is re-sent on every turn, so cumulative input cost per conversation can dominate the bill unless prefix caching applies.

Table B: KV-cache memory per million tokens (rough)

For a 70B-class Llama-style model, BF16 KV cache, 80 layers, 8 KV heads, head dim 128:

  • KV bytes per token ≈ 2 × 2 × 80 × 8 × 128 = 327,680 bytes ≈ 320 KB.
  • 1M tokens KV ≈ 320 GB.
  • FP8 KV ≈ 160 GB.
  • INT4 KV ≈ 80 GB.

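A single H100 (80 GB) cannot hold a million-token KV cache at BF16, and neither can an H200 (141 GB) or a B200 (192 GB): 320 GB exceeds all three before counting weights. At FP8 (160 GB) the cache alone fits on a B200 but leaves too little room for a 70B model's weights; at INT4 (80 GB) it fits on an H200 or B200 alongside quantised weights. In practice, multi-GPU TP/SP is required, usually combined with KV quantisation.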
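The per-token formula is worth keeping as a helper so you can plug in other model shapes. A minimal sketch, assuming a standard grouped-query layout; the function names are hypothetical. Note that exact decimal arithmetic gives about 328 GB at BF16, which the bullets above round to 320.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """K and V (the leading factor of 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gb(tokens, **shape):
    """Total KV cache size in decimal gigabytes."""
    return tokens * kv_bytes_per_token(**shape) / 1e9

SHAPE = dict(layers=80, kv_heads=8, head_dim=128)   # 70B-class shape from the text

for name, width in [("BF16", 2), ("FP8", 1), ("INT4", 0.5)]:
    gb = kv_cache_gb(1_000_000, bytes_per_elem=width, **SHAPE)
    print(f"{name}: {gb:.0f} GB for 1M tokens")
```

Compare the result against HBM capacity minus weights minus activation headroom, not against raw HBM capacity.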

Table C: when to consider each architecture pattern

Pattern | Sweet spot context | Notes
Full attention with FA3 | up to ~128K | Standard for most flagships
Full attention + KV FP8 | 128K–500K | Common 2026 production
SWA + global tokens | 32K–1M effective | Mistral, Gemma family
Linear attention / SSM | 1M+ | Mamba, RWKV, Falcon-Mamba families
Hybrid (some full, mostly SSM) | 500K–10M | Jamba
Ring attention | 1M+ training | Distributed across many GPUs
Sequence parallel (Ulysses) | 100K–1M | Production serving

Together, these cost and architecture tables let you sketch a back-of-envelope feasibility check before committing to a long-context strategy.


Per-model long-context details, 2026 snapshot

A consolidated snapshot of advertised and effective long-context characteristics across the major models in production as of mid-2026. Effective context numbers are drawn from RULER-style evaluations published by independent benchmarkers; precise numbers vary by methodology.

Frontier proprietary models

Gemini 2.5 Pro. Advertised 1M tokens generally, 2M in some access tiers. Position encoding: a Google-internal scheme combining RoPE-class rotations with custom extensions. Effective context (RULER-class): strong at 128K, degradation visible at 512K, significant degradation past 1M for retrieval-heavy queries. Best-in-class on long-document QA per most public benchmarks.

Claude Opus 4.x / Sonnet 4.x. Advertised 200K tokens standard, 1M-token enterprise tier. Position encoding details not publicly disclosed but believed RoPE-based with NTK or YaRN-style extension. Effective context: very strong throughout 200K; the 1M tier shows meaningful degradation past ~400K on complex tasks. Strong performance on cross-document reasoning specifically.

GPT-5. Advertised ~200K tokens. OpenAI does not publish detailed internal architecture. Effective context: comparable to Claude at similar lengths. Recent improvements in reasoning mode help with multi-step long-context tasks.

Frontier open-weight models

Llama 4 family (Meta, 2025). Maverick variant: 1M-token context. Scout variant: 10M-token advertised context (one of the longest advertised). Position encoding: RoPE with iRoPE (interleaved) modifications. Effective context: Maverick is strong to ~256K and acceptable to ~1M; Scout's 10M is more aspirational than operationally robust for retrieval-heavy tasks.

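DeepSeek-V3 (and V3.5 if released by mid-2026). Advertised 128K tokens. Uses Multi-head Latent Attention (MLA), which compresses keys and values into a low-rank latent and so shrinks the KV cache considerably; DeepSeek's Native Sparse Attention (NSA, learned sparse patterns) was published separately in 2025 and is not part of the V3 release. Effective context is very strong at 128K, and the compressed cache makes long-context serving much cheaper in memory than standard multi-head attention at the same length.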

Qwen 2.5 / Qwen 3. Qwen 2.5-72B with YaRN extension: 1M tokens. Qwen 3 lineage continues this. Effective context: strong to ~200K, degraded past 512K.

Mistral / Mixtral. Mistral Large 2 and successors: 128K. Mixtral 8x22B: 64K. Mistral uses sliding-window attention with a window of typically 4096 plus global attention layers; this gives O(n) memory at long context but limits cross-context retrieval to global layers' bandwidth.

Gemma 3. Google's open-weight 1B/4B/12B/27B family. Sliding-window attention interleaved with global attention, at roughly five local layers per global layer; the local window is 1024 tokens. Effective context: solid at 32K, degraded past 128K.

Kimi K2 (Moonshot, China). Advertised 2M tokens. Strong long-document benchmark performance per Moonshot's published evaluations.

MiniMax Text-01. Advertised 4M tokens via a hybrid linear-attention + full-attention architecture. Effective long-document benchmarks are competitive.

Hybrid and SSM-leaning models

Jamba (AI21, 52B total). Hybrid Transformer-Mamba — Transformer layers interspersed with Mamba layers in a specific ratio (roughly 1:7). 256K advertised context with much better effective performance at long lengths than pure-transformer equivalents.

Falcon-Mamba (TII). Pure SSM 7B model. 1M+ context advertised. Limited by being smaller and less aligned than transformer flagships.

Recurrent-Gemma (Google). RG-LRU-based architecture, smaller, designed for efficient on-device deployment with long context.

What this means in practice

The takeaway from the 2026 model landscape: advertised context length is no longer the differentiator (most flagships are 200K+, several are 1M+). The remaining differentiation is on effective context — how the model actually performs at long lengths. Gemini, Claude (1M tier), and Llama 4 lead on different parts of the curve. Open-weight models with hybrid or SSM architectures lead at the very long lengths (10M+) but lag at retrieval-heavy benchmarks.

The right model choice for long context depends on the specific workload. For document QA: Gemini 2.5 Pro or Claude 1M tier. For codebases: Claude Opus or Gemini. For streaming and very long: hybrids or SSMs. For self-hosted: Llama 4 Maverick or DeepSeek-V3.

Model | Advertised | Effective (retrieval-heavy) | Best at
Gemini 2.5 Pro | 2M | ~256K–512K | Document QA, multimodal long context
Claude Opus 4.x (1M tier) | 1M | ~400K | Cross-document reasoning
Claude Sonnet 4.x | 200K | ~150K | General long-context QA
GPT-5 | 200K | ~150K | Reasoning over long inputs
Llama 4 Maverick | 1M | ~256K | Self-host long context
Llama 4 Scout | 10M | ~512K | Maximum advertised; quality varies
DeepSeek-V3 | 128K | ~120K (excellent) | Self-host; compressed KV cache (MLA)
Qwen 2.5-72B | 1M | ~256K | Multilingual long context
Kimi K2 | 2M | ~512K | Long-document Chinese-language
MiniMax Text-01 | 4M | ~512K | Maximum context with hybrid
Jamba | 256K | ~200K (very efficient) | Cost-sensitive long context

Use this as the lookup-and-decide table for serious long-context deployments.


Long-context training: why pretraining at scale is hard

Most long-context discussions focus on inference. Training is where the bottleneck originates, and it shapes what's possible at inference time.

The long-document supply problem

Pretraining corpora are dominated by short-to-medium length texts: web pages, books, articles, code snippets. High-quality documents over 100K tokens are a small fraction of any internet-scraped corpus. Models trained primarily on short contexts learn position-dependent patterns that don't generalise to long contexts.

The result: even with a position-encoding scheme that supports 1M tokens, a model pretrained on mostly-short documents will underperform at long contexts because it hasn't seen the patterns. The fix is curated long-document training (combine books, codebases, long scientific articles, long-context synthetic data), but the supply is genuinely limited.

Sequence-parallel training

Training on long context requires distributing a single sequence across many GPUs. Three patterns:

  • DeepSpeed-Ulysses (head-sharding). Each GPU holds all tokens for some heads. All-to-all communication exchanges activations between heads. Scales well to ~32 GPUs.
  • Megatron Context Parallel. Each GPU holds some tokens for all heads. Pairwise communication. Scales to hundreds of GPUs.
  • Ring Attention (Liu et al.). KV blocks rotate around a ring of GPUs. Hides communication latency under compute. Used in large-scale training of 1M+ context models.

These training-time parallelism patterns are different from the inference-time patterns. A model trained with Megatron CP at 128K context can still be served with vLLM or SGLang at 128K context without those frameworks knowing how the training was distributed.
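The ring pattern is easier to trust once you see that rotating K/V blocks and merging partial softmax results reproduces ordinary attention. The following is a single-process simulation, not a distributed implementation: it assumes one head, no causal mask, and a sequence length divisible by the device count, and all function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(Q, K, V):
    """Reference: softmax(QK^T / sqrt(d)) V, single head, non-causal."""
    scale = 1 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [dot(q, k) * scale for k in K]
        m = max(s)
        w = [math.exp(x - m) for x in s]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out

def ring_attention(Q, K, V, n_dev):
    """Simulate n_dev devices, each owning one block of Q and one block of K/V."""
    blk = len(Q) // n_dev
    d_v, scale = len(V[0]), 1 / math.sqrt(len(Q[0]))
    q_blocks = [Q[i * blk:(i + 1) * blk] for i in range(n_dev)]
    kv_blocks = [(K[i * blk:(i + 1) * blk], V[i * blk:(i + 1) * blk])
                 for i in range(n_dev)]
    # Per-query running state: [max score, normalizer, weighted-value accumulator]
    state = [[[-math.inf, 0.0, [0.0] * d_v] for _ in range(blk)]
             for _ in range(n_dev)]
    for _ in range(n_dev):
        for dev in range(n_dev):
            k_blk, v_blk = kv_blocks[dev]
            for qi, q in enumerate(q_blocks[dev]):
                m, z, acc = state[dev][qi]
                s = [dot(q, k) * scale for k in k_blk]
                m_new = max(m, max(s))
                corr = math.exp(m - m_new)      # rescale earlier partial sums
                w = [math.exp(x - m_new) for x in s]
                z = z * corr + sum(w)
                acc = [a * corr + sum(wi * v[j] for wi, v in zip(w, v_blk))
                       for j, a in enumerate(acc)]
                state[dev][qi] = [m_new, z, acc]
        kv_blocks = kv_blocks[-1:] + kv_blocks[:-1]   # pass K/V one hop along the ring
    return [[a / z for a in acc] for dev in state for _, z, acc in dev]

random.seed(0)
rand = lambda n, d: [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
Q, K, V = rand(8, 4), rand(8, 4), rand(8, 4)
ref, ring = full_attention(Q, K, V), ring_attention(Q, K, V, n_dev=4)
print(max(abs(a - b) for r, s in zip(ref, ring) for a, b in zip(r, s)))
```

In a real system the inner loop runs in parallel on each GPU and the rotation is a send/receive that overlaps with the next block's compute; the merge arithmetic is the part this sketch shows.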

Curriculum training

A common pattern: pretrain at moderate context (8K–32K), then phase in longer contexts (32K → 128K → 256K → 1M) over a small fraction of total training tokens. The longer phases consume disproportionate compute per token but give the model the long-context patterns it needs.

This is why a modern model's architecture may support a long context while the model itself performs much better at the lengths it was heavily trained on. The Llama 4 family follows this pattern; Gemini 2.5 Pro is believed to follow a similar curriculum.

Synthetic long-context training data

Real long documents are scarce; synthetic ones are abundant if you're willing to construct them. Common patterns:

  • Concatenation. Glue multiple shorter documents together. Cheap, but the model learns to treat boundaries as discontinuities.
  • Question-document pairs at distance. Generate questions whose answers require attending to far-apart parts of the context.
  • Multi-hop reasoning chains. Construct chains where each step uses a different part of context.
  • Recall augmentation. Insert "needles" the model is asked to recall later.

The 2024–2026 trend is heavy investment in synthetic long-context data with explicit reasoning supervision. RULER-style tasks are now used as training data, not just evaluation.
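As an illustration of the recall-augmentation pattern from the list above, here is a minimal sample builder. It is a sketch under simple assumptions: documents are plain strings, the question is appended at the end, and the name `make_recall_sample` and its output fields are hypothetical.

```python
import random

def make_recall_sample(filler_docs, needle, question, answer, depth, rng=random):
    """Build one recall-augmentation example.

    depth in [0, 1] places the needle that fraction of the way through the
    shuffled filler; the question goes at the very end, so the model has to
    reach back across the intervening text.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    docs = list(filler_docs)
    rng.shuffle(docs)
    cut = round(depth * len(docs))
    context = docs[:cut] + [needle] + docs[cut:]
    return {"prompt": "\n\n".join(context) + "\n\nQuestion: " + question,
            "target": answer,
            "needle_index": cut}

sample = make_recall_sample(
    [f"Filler document {i}." for i in range(10)],
    needle="The access code is 7421.",
    question="What is the access code?",
    answer="7421",
    depth=0.5,
    rng=random.Random(0),
)
print(sample["needle_index"], sample["prompt"][-40:])
```

Sweeping `depth` and the amount of filler gives training data that covers the positions where recall is weakest, rather than only the start and end.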

Why open-weight long-context models lag

Pretraining at 1M context with a curriculum requires substantial compute (several thousand H100-equivalent days). Smaller labs can afford 128K-class training but struggle with 1M. The result: open-weight models advertise long context via YaRN-style extension more than via native training, which is why their effective context lags proprietary models with native long-context training.

This gap is closing as compute becomes cheaper and synthetic data techniques improve, but as of mid-2026, the proprietary advantage on effective long-context performance remains real.

For more on the training stack: distributed training (DP/TP/PP/FSDP) and post-training.

Training aspect | Short-context (8K) | Long-context (1M)
Document supply | Plentiful | Scarce
Compute per token | Baseline | 10–50× depending on attention pattern
GPU memory per sequence | Manageable | Requires sequence parallelism
Sequence parallelism | Optional | Mandatory
Synthetic data fraction | Small | Significant
Training-eval gap | Small | Substantial

The summary: long-context inference is the visible part; long-context training is the iceberg. The state of the art at inference is bottlenecked by the state of the art in training.

Engineering economics of long-context features

A practical question for engineering teams: when is investing in long-context support worth it relative to other priorities? A rough cost model.

Cost ingredients:

  • Engineering time for serving-stack work (chunked prefill, paged KV, KV quantisation, sequence parallelism): 3–6 engineer-months for a custom stack, 1–2 weeks to adopt vLLM or SGLang.
  • Eval harness for long context: 2–4 engineer-weeks to construct and validate.
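  • Inference cost increase: API pricing is linear in input tokens; on self-hosted hardware, prefill compute grows faster than linearly once attention dominates, and per-token decode cost grows with the KV cache being read. As a rough planning figure, a 10x context expansion increases per-query cost ~5–8x.
  • Latency budget impact: TTFT grows at least linearly with context (the quadratic attention term takes over at long lengths); for interactive apps, beyond ~30K context the latency starts to violate UX expectations.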

Benefit ingredients:

  • Reduced need for retrieval engineering (when the corpus fits).
  • Higher answer quality on cross-document and long-document tasks.
  • Simpler architecture (one model handles more cases without retrieval scaffolding).

Decision rule of thumb. For a single product, prefer the simpler architecture (short context + RAG) until you have specific evidence that long context delivers materially better outcomes. The simpler architecture costs less to build, less to run, and less to debug. Move to long context only when the workload demands it (legal docs, codebases, complex agents with accumulating tool history).

For multi-product platforms (foundation-model APIs, multi-tenant serving), supporting long context is table stakes since some customers need it; the question becomes how to amortise the engineering cost across tenants.


The bottom line

The quadratic attention wall doesn't disappear — it gets moved. FlashAttention pushes it from memory back into compute; ring attention spreads it across GPUs; sliding windows and sparse patterns trade global recall for linear cost. The single biggest lever in production today is the KV cache, not attention itself: at 128k and above, KV memory and bandwidth dominate latency, and KV quantization plus paged attention dwarf the rest of the optimization budget.

  • Advertised context length is not effective context length. Evaluate with RULER or a workload-shaped needle test before promising a number.
  • Use FlashAttention-2 or -3 by default; if you are not, you are paying the n² memory tax for nothing.
  • For 200k+ on a single sequence, plan for sequence parallelism (Ring or Ulysses) and a fabric that supports it.
  • KV quantization (FP8 or INT4) is the highest-ROI long-context optimization for serving.
  • For most workloads, retrieval over a short context beats raw long context on cost and quality. Reach for long context when the task is genuinely global.

For the dominant cost driver, see KV cache. For the precision lever that compounds with everything here, see quantization tradeoffs. For the fabric that ring attention demands, see NVLink and rack-scale topology.


FAQ

Does FlashAttention work with custom attention masks? Yes. FlashAttention-2 and -3 support various mask patterns (causal, sliding window, document masks). Custom masks need to be expressible in their kernel framework.

Is RoPE the only viable position encoding for long context? Other approaches exist (ALiBi, NoPE, learned positions), but RoPE plus YaRN-style extension dominates production. ALiBi is used by some labs; performance is competitive at moderate lengths.

How does long context interact with quantization? Cleanly. FP8 KV cache plus weight quantization is standard for long-context production. Sub-FP8 KV is workload-dependent.

Why does quality degrade in the middle of long contexts? Hypothesized: a combination of training data distribution (attention to early and late positions during training), position encoding limitations, and softmax over many tokens diluting attention to mid-context content.

Is RAG dead with long context? No. RAG is cheaper, scales to larger corpora, and often produces better focused answers. Long context is a complement, not a replacement.

Can I run 1M-token contexts on one GPU? Only with aggressive quantization and a small model. For useful frontier models, multi-GPU is required.

What's the latency hit for very long contexts? Prefill attention is O(n²), so a 1M-token prefill takes minutes on serious hardware. Streaming and chunked prefill improve perceived latency; total compute is still large.

Are state-space models a long-context win? Maybe. Mamba and successors have O(n) compute and constant memory. Quality at the frontier is still behind attention. The field is watching.

How does YaRN compare to LongRoPE or NTK-aware scaling? YaRN builds on the NTK-aware idea: it interpolates RoPE frequencies per dimension ("NTK-by-parts", leaving high-frequency dimensions untouched) and adds an attention-temperature correction. LongRoPE searches for per-dimension scaling factors and is competitive on 1M+ extensions but harder to tune. Most open-weight long-context recipes in 2026 use YaRN or a YaRN-derived approach.
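To make the comparison concrete, here is a sketch of the frequency computations as described in the YaRN paper, written in plain Python. Treat the details as assumptions to check against the paper and your framework's implementation: the function names are hypothetical, and `alpha=1`, `beta=32` are the ramp bounds the paper reports for LLaMA-family models.

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def ntk_aware_inv_freq(head_dim, scale, base=10000.0):
    """NTK-aware scaling: enlarge the RoPE base instead of squeezing positions."""
    return rope_inv_freq(head_dim, base * scale ** (head_dim / (head_dim - 2)))

def yarn_inv_freq(head_dim, scale, orig_len, base=10000.0, alpha=1.0, beta=32.0):
    """YaRN 'NTK-by-parts': interpolate only the low-frequency dimensions."""
    out = []
    for f in rope_inv_freq(head_dim, base):
        rotations = orig_len * f / (2 * math.pi)   # full turns within orig_len
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append((1 - ramp) * f / scale + ramp * f)
    return out

def yarn_attention_scale(scale):
    """Multiplier applied to q and k: sqrt(1/t) = 0.1 * ln(s) + 1."""
    return 0.1 * math.log(scale) + 1.0

base_f = rope_inv_freq(128)
ntk = ntk_aware_inv_freq(128, scale=8)
yarn = yarn_inv_freq(128, scale=8, orig_len=4096)
print(f"lowest-frequency dim: base={base_f[-1]:.3e} ntk={ntk[-1]:.3e} yarn={yarn[-1]:.3e}")
print(f"attention scale at 8x: {yarn_attention_scale(8):.3f}")
```

Both schemes stretch the slowest dimension by the full factor and leave the fastest one alone; they differ in how the dimensions in between are treated, and YaRN adds the temperature term on top.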

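When should I pick DeepSpeed-Ulysses over Ring Attention? Ulysses' per-GPU all-to-all volume scales with sequence length × hidden size ÷ GPU count, so it stays flat when GPUs are added in proportion to sequence length, but its parallel degree is capped by the number of attention heads. Ring's per-GPU traffic grows with sequence length regardless of GPU count, but it is point-to-point and overlaps with compute. For very long sequences on rack-scale fabric where all-to-all is cheap, Ulysses wins. For moderate lengths or ring-shaped fabrics, Ring is simpler and competitive.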

Can sliding-window attention serve 128k context? Mechanically yes, but each layer's direct reach is the window size, not 128k. With a 4k sliding window on a 128k context, a token attends directly only to the previous 4k tokens; anything further back reaches it only indirectly, relayed through stacked layers, which is lossy for exact retrieval. Mixed-strategy models (some full, some sliding) preserve some long-range capacity.
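A sketch of the mask a sliding-window layer applies, plus the theoretical upper bound on how far information can travel through stacked layers. Both helpers are hypothetical names; the bound says nothing about how well information survives the relay, only that a path exists.

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True when query i may attend to key j (causal, last `window` keys)."""
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]

def max_reach(layers, window):
    """Upper bound on how many tokens back information can propagate through the stack."""
    return layers * (window - 1)

for row in sliding_window_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))

print(max_reach(layers=32, window=4096))
```

A 32-layer stack with a 4096-token window has a theoretical reach of roughly 131K tokens, but retrieval across that distance depends on every intermediate layer carrying the fact forward.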

Does FlashAttention-3 help on Blackwell? Yes — FA3 added Hopper-specific optimizations; Blackwell-targeted updates extend the same approach. The throughput gains are substantial on long sequences. Most serving stacks now bundle FA3 or equivalent.

Is RAG always cheaper than long context? For document QA with focused questions, yes — RAG over 8–32k context costs a fraction of 1M-context inference. For tasks requiring global synthesis across an entire document, long context wins. The hybrid (smart RAG into a long-context model) is increasingly the default.

How do I evaluate "effective context length" cheaply? Run RULER (arXiv:2404.06654) on your target model at the lengths you care about. Plot accuracy vs context length. The point where accuracy starts to fall sharply is roughly your effective context. Pair with workload-specific tasks for a fuller picture.

Does long context interact with reasoning model serving? Yes, significantly. Reasoning models generate long chains of thought, which extend the KV cache during decode (not just prefill). Long-context plus reasoning compounds KV pressure — plan KV quantization and offload strategies accordingly.

How long does a 1M-token prefill actually take? On a 70B model with FA3 on 8x H200 with proper tensor and sequence parallelism, a 1M prefill takes ~90-150 seconds. On a B200 NVL72 with full ring attention, ~30-60 seconds. On lesser hardware, multiple minutes. Chunked prefill improves perceived latency but does not reduce total wall time. The cost is real: a single 1M prefill consumes more GPU-seconds than thousands of typical chat prefills.

Should I cache KV for long-context requests? Aggressively yes. If your workload has any shared long prefixes (a long policy document, a system prompt, a fixed corpus), caching the KV for that prefix saves the entire long prefill on every subsequent request. Anthropic and OpenAI prompt caching features are user-visible surfaces of this. For self-hosted, vLLM's automatic prefix caching and SGLang's RadixAttention handle it. Prefix-cache hit rate on long-context production workloads commonly exceeds 70%.
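The idea behind block-level prefix caching can be sketched with a hash chain: each block's key depends on its own tokens and on everything before it, so a lookup walks blocks until the first miss. This is a conceptual illustration only, not how vLLM or SGLang implement it; the block size and function names are assumptions.

```python
import hashlib

BLOCK = 16  # tokens per cache block (illustrative value)

def block_hashes(token_ids, block=BLOCK):
    """Hash each full block together with the hash of everything before it."""
    hashes, prev = [], b""
    for start in range(0, len(token_ids) - block + 1, block):
        chunk = ",".join(map(str, token_ids[start:start + block])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes

def cached_prefix_tokens(token_ids, cache, block=BLOCK):
    """Number of leading tokens whose KV blocks are already in `cache`."""
    hit = 0
    for h in block_hashes(token_ids, block):
        if h not in cache:
            break
        hit += block
    return hit

shared_doc = list(range(64))                       # stands in for a long shared prefix
cache = set(block_hashes(shared_doc + [1, 2, 3]))  # blocks stored by an earlier request
request = shared_doc + list(range(900, 920))       # same prefix, new question
print(cached_prefix_tokens(request, cache), "of", len(request), "tokens reuse cached KV")
```

Because the hash is chained, changing a single early token invalidates every later block, which is why stable content belongs at the front of the prompt.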

Are long-context models still autoregressive? Yes. Each generated token still attends to the full preceding KV cache. The cost of generating each token grows linearly with context length even after prefill is done: for a 70B-class model, a token generated at position 1M reads ~320 GB of KV (FP16) to compute attention. This is why decode at long context is so much more expensive per-token than at short context, and why KV quantization matters disproportionately for long-context decode.
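A rough floor on decode latency follows from dividing the KV bytes read per step by aggregate memory bandwidth. The sketch below assumes the 70B-class shape from Table B and a nominal 4.8 TB/s per H200; it ignores weight reads, compute, and communication, so real numbers will be higher. The function name is hypothetical.

```python
def decode_step_floor_ms(context_tokens, kv_bytes_per_token, hbm_bytes_per_s, n_gpus=1):
    """Lower bound on one decode step from KV reads alone."""
    kv_bytes = context_tokens * kv_bytes_per_token
    return kv_bytes / (hbm_bytes_per_s * n_gpus) * 1e3

KV_BF16 = 2 * 2 * 80 * 8 * 128   # bytes/token: K+V, 2 bytes, 80 layers, 8 KV heads, dim 128
H200_BW = 4.8e12                 # bytes/s, nominal peak HBM bandwidth (assumed achievable)

for ctx in (8_000, 128_000, 1_000_000):
    ms = decode_step_floor_ms(ctx, KV_BF16, H200_BW, n_gpus=8)
    print(f"{ctx:>9,} tokens: >= {ms:.2f} ms per generated token")
```

Halving `KV_BF16` (FP8) halves the floor, which is the mechanical reason KV quantization helps decode more than it helps prefill.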

Does RoPE work for non-text modalities (vision, audio)? Variations of RoPE are used in vision transformers (axial RoPE, 2D RoPE) and audio transformers. The same context-extension challenges apply: a model trained at 224×224 image resolution will behave poorly at 4K resolution unless the position encoding is extended. The recipes are emerging; YaRN for images is an active research direction. See our multimodal serving guide.

Is there a quality difference between FlashAttention and standard attention? Mathematically, no — FlashAttention computes exact attention with the same numerical result as standard attention, modulo floating-point precision. In practice, the order of summation in the softmax accumulation can produce tiny numerical differences (< 1e-5 in the output), which are negligible. There is no quality regression from using FlashAttention.
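The exactness claim rests on the online-softmax identity: processing scores tile by tile with a running maximum and normalizer gives the same weighted sum as the one-shot softmax. A scalar-valued sketch in plain Python (the real kernels do this per row on GPU tiles; the function names here are hypothetical):

```python
import math
import random

def softmax_weighted_sum(scores, values):
    """One-shot reference: softmax(scores) dotted with values."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)

def tiled_softmax_weighted_sum(scores, values, tile=4):
    """Same result, but only one tile of scores is materialised at a time."""
    m, z, acc = -math.inf, 0.0, 0.0
    for i in range(0, len(scores), tile):
        s_tile, v_tile = scores[i:i + tile], values[i:i + tile]
        m_new = max(m, max(s_tile))
        corr = math.exp(m - m_new)              # rescale what was accumulated so far
        w = [math.exp(s - m_new) for s in s_tile]
        z = z * corr + sum(w)
        acc = acc * corr + sum(wi * v for wi, v in zip(w, v_tile))
        m = m_new
    return acc / z

random.seed(1)
scores = [random.gauss(0, 3) for _ in range(37)]
values = [random.gauss(0, 1) for _ in range(37)]
print(abs(softmax_weighted_sum(scores, values) - tiled_softmax_weighted_sum(scores, values)))
```

The two results differ only by floating-point rounding from the changed summation order.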

What's the relationship between MoE and long context? Largely orthogonal. MoE replaces the FFN block; long-context techniques operate on the attention block. A 1M-context MoE is the same engineering as a 1M-context dense model, plus the MoE all-to-all on top. The combination is the frontier (DeepSeek-V3 at 128k, Gemini MoE at 2M) and it stresses both KV memory and all-to-all bandwidth. See our MoE serving guide.

How does context extension affect safety alignment? Empirically, aggressive context extension (e.g., YaRN from 4k to 1M) can soften the model's safety training, because the original safety fine-tuning was done at much shorter contexts and the model's behavior at extended lengths drifts. Production deployments routinely re-run safety evaluation on the extended-context model. See our production safety guardrails post.

What hardware is sufficient for 128k production serving? For a 70B model at 128k context with FP8 KV: one H200 (141 GB) per active request is comfortable; two H100 SXM (80 GB each) per request with tensor parallelism works but is tight. For batched serving (multiple concurrent 128k requests), assume one H200 per 2-4 concurrent requests after KV quantization. The cost per request at 128k is roughly 5-10× the cost at 8k, dominated by KV memory and longer attention compute.

Do I need ring attention for 128k context? No. 128k context fits on a single GPU's HBM for a single request with FP8 KV. Ring attention is needed when a single sequence exceeds one GPU's HBM, which is roughly 256k+ on H100 (FP16) and 1M+ on H200 with INT4 KV. Below that, tensor parallelism + chunked prefill is sufficient.

What's the future of long context? Three trends in 2026: (1) state-space architectures (Mamba, hybrid) potentially replacing pure attention at very long context; (2) better KV compression closing the gap between advertised and effective context; (3) tighter integration with retrieval (long-context-aware retrievers). The combined effect should make 1M context as routine in 2027 as 128k is in 2026.

Why does my long-context model fail on multi-needle queries? Multi-needle requires the model to locate multiple pieces of information and combine them. Single-needle requires only one retrieval. The combinatorial difficulty grows with the number of needles; most models that score >95% on single-needle NIAH drop to 60-80% on 5-needle. The remedy is task-specific post-training, not architectural changes.

Should I use Jamba or Llama for long-context production? For most workloads, Llama-3 70B with FP8 KV (or Llama-4 Scout for extreme context) is the safer choice; the ecosystem support is broader and serving stacks are more mature. Jamba and other hybrids are interesting for streaming or memory-constrained deployments where the linear-time inference matters; for batch serving on H100/H200 with KV compression, the win is smaller than expected.

Is FlashAttention 3 worth the upgrade from FA2? On Hopper (H100/H200): yes, 1.5-2× speedup on attention is significant. On Ampere (A100): no, FA3 requires Hopper-specific tensor core features. On Blackwell (B200): yes, and FA3 plus B200's higher HBM bandwidth amplifies the win. Always pin the FlashAttention version explicitly in production.

What is StreamingLLM and when do I need it? StreamingLLM (Xiao et al., 2023) is a sliding-window inference pattern that preserves attention sinks at the start of the sequence. It enables LLMs to handle effectively infinite streams (chatbot history, audio transcription) without OOM. Use it for streaming workloads; not needed for standard batch inference.
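The eviction rule is simple enough to sketch: keep a handful of initial "sink" entries plus the most recent window, and drop the middle. This is an illustration of the policy only; the helper name and the default sizes are assumptions, and the published method also re-indexes positions relative to the cache, which is omitted here.

```python
def streaming_keep(cache_len, n_sinks=4, window=1020):
    """Indices of KV entries to keep: the first n_sinks plus the latest `window`."""
    if cache_len <= n_sinks + window:
        return list(range(cache_len))
    return list(range(n_sinks)) + list(range(cache_len - window, cache_len))

print(streaming_keep(10, n_sinks=2, window=3))   # [0, 1, 7, 8, 9]
```

The cache never grows past `n_sinks + window` entries, which is what makes unbounded streams possible, at the cost of forgetting everything in the evicted middle.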

How does Mamba inference compare to transformer inference? Mamba keeps a fixed-size recurrent state (no growing KV cache), so memory and time per generated token are constant and total time is linear in sequence length. For very long sequences (>1M tokens) this is structurally better than transformers. Quality is competitive with transformers on most benchmarks but trails on exact-retrieval (NIAH) tasks. Hybrid architectures (Jamba) recover the retrieval performance.

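What is Native Sparse Attention (NSA), and is it in DeepSeek-V3? NSA is DeepSeek's training-time structured sparsity, published in 2025: each attention head learns to attend to a sparse subset of positions chosen by a learned selection mechanism. It is reported to save 30-50% of FLOPs vs dense attention with minimal quality loss. The training-time integration is what makes it work; post-hoc sparsity typically degrades quality more. The DeepSeek-V3 release itself uses Multi-head Latent Attention (MLA), which compresses the KV cache rather than sparsifying attention.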

Should I use ring attention at 256K context? Only if 256K does not fit on one GPU. With INT4 KV, 256K Llama-70B context fits on one H200 (141 GB). At FP16, it requires multi-GPU tensor parallelism. Ring attention adds engineering complexity; use it only when sequence parallelism is genuinely required.

How do I evaluate long-context performance for my workload? Build a workload-specific eval set: 50-200 examples from production traces with hand-validated answers. Run the model at increasing context lengths (32K, 64K, 128K, 256K, ...) and measure quality. The "effective context" is the largest length where quality holds at ≥95% of baseline. This number is more relevant than any public benchmark.
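The 95%-of-baseline rule is easy to automate once you have a score per context length. A minimal sketch, assuming the shortest evaluated length serves as the baseline and that a dip below the threshold ends the search even if a longer length recovers; `effective_context` and the demo scores are hypothetical.

```python
def effective_context(scores_by_length, threshold=0.95):
    """Largest length whose score stays >= threshold x the shortest-length score,
    stopping at the first length that falls below it."""
    lengths = sorted(scores_by_length)
    floor = threshold * scores_by_length[lengths[0]]
    best = None
    for n in lengths:
        if scores_by_length[n] < floor:
            break
        best = n
    return best

# Illustrative scores only.
demo_scores = {32_000: 0.90, 64_000: 0.89, 128_000: 0.87, 256_000: 0.78, 512_000: 0.80}
print(effective_context(demo_scores))  # 128000
```

Stopping at the first drop is deliberate: a partial recovery at a longer length is more often noise than a real capability.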

Does YaRN-extended context match natively-trained long context? No, not at the extreme. YaRN-extended models typically achieve 50-70% of natively-trained quality at the same context length. For high-stakes long-context applications, natively-trained models (Gemini, Claude, Llama-4) outperform YaRN extensions of base models.

What's the difference between Ulysses and Ring attention? Both distribute a single sequence across GPUs, and both shard activations along the sequence dim outside attention. Ulysses uses all-to-all communication to re-shard by attention head, so each GPU computes full-sequence attention for a subset of heads; Ring keeps the sequence sharding and rotates K/V blocks around a ring topology. Ulysses is faster at small scales (≤8 GPUs) and its degree is capped by the head count; Ring scales better at large scales (≥16 GPUs). Modern stacks support both and choose based on topology.

How does sliding window attention interact with retrieval performance? Pure SWA models cannot retrieve information further than W tokens from the query reliably. For retrieval-heavy workloads, SWA models underperform full-attention models at long context. Hybrid SWA + global layers recover most of the retrieval performance at much lower compute cost than full attention.

Should I implement my own KV compression? Probably not. Production-grade KV quantization (KIVI, KVQuant) is implemented in serving stacks (vLLM, SGLang, TRT-LLM) and well-tested. Roll-your-own typically loses 2-5% quality compared to library implementations due to subtle decisions about per-channel scaling, outlier handling, and dequant fusion. Use the library.

How does FlashAttention-3 differ from FA2 in practice? FA3 uses Hopper-specific features (WGMMA async matrix multiply, TMA async memory copy, FP8 tensor cores) to overlap memory and compute much more aggressively than FA2 could. On H100, FA3 BF16 is roughly 1.5–2× faster than FA2, and FA3 FP8 adds another ~1.7× on top. For long-context prefill specifically (where attention dominates), the speedup translates directly to lower TTFT. On non-Hopper hardware (A100, AMD), FA3's Hopper-specific kernels do not apply; expect FA2-class kernels and performance there.

When does linear attention or SSM actually beat full attention? Three regimes. First, context > ~1M tokens where O(n²) full attention becomes intractable. Second, streaming workloads where state-based update is natural. Third, hardware-constrained environments where the kernel-level efficiency advantage matters. For typical 128K–256K production workloads, full attention with FA3 is still faster and higher quality. Hybrid models that interleave attention with recurrent or SSM layers (Jamba, RecurrentGemma) are the practical middle ground.

What's "context dilution" and how do I detect it? Context dilution is the observed degradation when relevant information is a small fraction of a large context. Symptoms: the model generates more generic responses, fabricates details, or fails to reference clearly present information. Detection: compare the same model's output at 8K, 32K, and 128K context with the same key information present; if 128K is meaningfully worse, you have dilution. Mitigations include positional emphasis (put key info at the start or end), explicit "focus on these sections" instructions, or falling back to retrieval.

Is paged KV (vLLM) compatible with KV quantization? Yes, vLLM supports paged KV with FP8 quantization out of the box as of mid-2026; INT4/INT8 are available via plugins. SGLang has equivalent support. The combination is the production default for long-context serving — paging gives memory efficiency for varying-length requests, quantization gives memory efficiency per token. Together they enable serving 1M-token contexts on multi-GPU setups that wouldn't fit otherwise.

What does "advertised 1M context" mean for an open-weight model? Almost always: the model was trained or post-trained at 1M tokens for at least some passes, but effective quality at 1M is substantially below the model's quality at 128K. The RULER benchmark consistently shows large degradation between 128K and 1M for most "1M context" models. The honest interpretation: 1M is the architecture's maximum, not the operating quality maximum. Plan for effective context around 200K–400K even on 1M-advertised models for retrieval-heavy work.

How does DeepSeek's Native Sparse Attention differ from earlier sparse approaches? NSA, described in a DeepSeek paper from early 2025, is a natively trainable sparse attention pattern where the model itself decides which tokens to attend to. (DeepSeek-V3 itself uses Multi-head Latent Attention, which compresses the KV cache rather than sparsifying attention.) This contrasts with fixed-pattern sparse attention (Longformer's window+global, BigBird's window+random+global), which uses hand-designed patterns. NSA reports near-full-attention quality at a fraction of the compute, especially at very long contexts. The technique is being adopted in other 2026 models.

What's the right KV quantization for a 1M-token context workload? INT8 is the safe default — quality loss <0.5% for most models, memory savings 2×. FP8 is a good alternative on Hopper / Blackwell hardware where it has dedicated support. INT4 doubles memory savings again but starts to show measurable quality regression on long contexts; use only after careful evaluation. Per-head or per-channel quantization typically outperforms per-tensor.
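
The per-channel versus per-tensor point is easy to demonstrate on synthetic data. The sketch below uses symmetric INT8 with illustrative function names and one artificially inflated channel standing in for the outlier channels often seen in key caches. It is not how any particular serving stack implements KV quantization.

```python
import numpy as np

def quantize_int8(x, axis=None):
    """Symmetric INT8 quantization.

    axis=None -> one scale for the whole tensor.
    axis=0    -> one scale per column (per channel).
    """
    max_abs = np.abs(x).max(axis=axis, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 128)).astype(np.float32)  # (tokens, channels)
keys[:, 5] *= 50.0  # synthetic outlier channel (assumption for illustration)

for label, axis in [("per-tensor", None), ("per-channel", 0)]:
    q, scale = quantize_int8(keys, axis=axis)
    err = np.abs(dequantize(q, scale) - keys).mean()
    print(f"{label}: mean abs error = {err:.4f}")
```

With a single per-tensor scale, the outlier channel sets the step size for every other channel, so most values collapse onto a handful of levels. Per-channel scales confine the damage to the outlier channel itself.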

Can I use long context to replace fine-tuning for domain adaptation? For knowledge yes, for tone no. Putting a domain corpus into context gives the model access to the information but doesn't change its writing style, default vocabulary, or reasoning patterns. Fine-tuning is still the right answer when you want the model to sound like your domain. The hybrid pattern (fine-tune for tone, RAG/long-context for knowledge) is increasingly common.

Why does my model's quality drop right at the advertised context limit? Three possible reasons. (a) The model wasn't trained at exactly that length; the limit is what it nominally supports, not what it does best at. (b) Position encoding extrapolation degrades near the limit. (c) Serving stack edge cases — KV cache management can have off-by-one issues near the boundary. The robust workflow is to test 1–10% below the advertised limit and stay there for production.

Does multi-query attention (MQA) or grouped-query attention (GQA) help with long context? Substantially. MQA and GQA share K/V across heads, reducing KV cache size by the ratio of query heads to KV heads: typically 4–8× for GQA, and by the full head count (32× or more) for MQA. Almost every flagship with a published architecture in 2026 uses GQA (Llama 3+, Gemma 2+, Mistral). For long context this is a multiplicative win; without GQA, current models would be impractical at 100K+ context.
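
A quick comparison of the three head-sharing schemes, as a sketch. The shape (80 layers, 64 query heads, head dimension 128, FP16 cache) is an assumed 70B-class configuration, and the helper name is illustrative.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache per token: K and V, for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

n_layers, n_q_heads, head_dim = 80, 64, 128  # assumed 70B-class shape
for name, n_kv in [("MHA", 64), ("GQA-8", 8), ("MQA", 1)]:
    per_token = kv_bytes_per_token(n_layers, n_kv, head_dim)
    gib_at_128k = per_token * 128 * 1024 / 2**30
    print(f"{name}: {per_token / 1024:.0f} KiB/token, "
          f"{gib_at_128k:.0f} GiB at 128K, "
          f"{n_q_heads // n_kv}x smaller than MHA")
```

Under these assumptions a 128K context needs 320 GiB of cache with full multi-head attention, 40 GiB with 8 KV heads, and 5 GiB with MQA. The reduction stacks with KV quantization because the two act on different factors of the same product.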

How does prefix caching change long-context economics? Dramatically. If your workload has stable prefixes (system prompts, document context that's reused across queries), the cached prefix's prefill cost is paid once. Anthropic's prompt caching reduces costs by up to 90% for repeated prefixes; OpenAI's caching and Gemini's context caching offer similar savings. For products built around a stable knowledge base + variable user queries, prefix caching can change the economics decisively.
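
A rough cost model for the stable-prefix case, as a sketch. The price, token counts, and the 0.1 cache-read multiplier are illustrative assumptions (the multiplier corresponds to the "up to 90%" figure). Cache-write surcharges, cache expiry, and output tokens are ignored.

```python
def cost_per_query(prefix_tokens, query_tokens, price_per_mtok,
                   cached=False, cache_read_multiplier=0.1):
    """Input cost in dollars for one request with a reusable prefix."""
    prefix_rate = price_per_mtok * (cache_read_multiplier if cached else 1.0)
    return (prefix_tokens * prefix_rate + query_tokens * price_per_mtok) / 1e6

# Hypothetical workload: 200K-token knowledge base, 500-token question,
# $3 per million input tokens.
cold = cost_per_query(200_000, 500, 3.0)
warm = cost_per_query(200_000, 500, 3.0, cached=True)
print(f"uncached: ${cold:.4f}  cached: ${warm:.4f}  saving: {1 - warm / cold:.0%}")
```

The saving approaches the cache discount only when the prefix dominates the request. If the variable part of the prompt is large relative to the prefix, the benefit shrinks proportionally.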

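What's the worst-case latency hit of going from 8K to 1M context? Much worse than the roughly 125× increase in token count suggests. FlashAttention removes the n×n memory traffic but not the O(n²) arithmetic: going from 8K to 1M tokens grows the linear-layer work about 125× and the attention work about 125² ≈ 15,600×, and at 1M tokens attention dominates total prefill FLOPs. For a 70B-class dense model, a back-of-envelope estimate puts 1M-token prefill at roughly 1,000× the cost of 8K prefill: on the order of a few seconds at 8K versus tens of minutes at 1M of single-H100-equivalent compute. Multi-GPU sequence parallelism (Ring, Ulysses) brings this down meaningfully. Decode latency per token also rises with context length, because every decode step reads the entire KV cache, so long contexts hurt both TTFT and inter-token latency.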
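
A rough FLOP count helps sanity-check prefill scaling between 8K and 1M tokens. The sketch assumes a dense 70B-class model (80 layers, model width 8192), causal attention, and the usual approximation of 2 FLOPs per parameter per token for the weight matmuls. The function name is illustrative.

```python
def prefill_flops(n_tokens, n_params=70e9, n_layers=80, d_model=8192):
    """Rough forward-pass FLOPs for prefill, split into two parts.

    linear:    2 FLOPs per parameter per token (all weight matmuls)
    attention: QK^T and PV cost 2 * n^2 * d_model each per layer,
               halved overall for causal masking
    """
    linear = 2 * n_params * n_tokens
    attention = 2 * n_layers * n_tokens**2 * d_model
    return linear, attention

for n in (8_192, 1_048_576):
    linear, attn = prefill_flops(n)
    print(f"n={n:>9,}: linear {linear:.2e}  attention {attn:.2e}  "
          f"attention share {attn / (linear + attn):.0%}")
```

Under these assumptions attention is under a tenth of the work at 8K and about nine tenths at 1M, and total FLOPs grow by a bit more than 1,000× for a 128× increase in tokens. Dividing the totals by whatever sustained throughput you actually measure gives a latency estimate for your hardware.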

Are there any contexts where SSMs cleanly beat transformers? Yes: streaming workloads (live transcription, real-time monitoring) where context grows indefinitely and you want O(1) per-token decode. Mamba and RWKV families have shipped production deployments for streaming. For batch workloads with fixed context, transformers still dominate. Hybrid architectures (full attention layers interspersed with SSM layers) get most of both benefits.

How does context window affect tool-use and agent workloads? Significantly. Agent loops with many tool calls accumulate context fast — each tool call adds the call, the result, and any reasoning. By turn 10–20 of an agent loop, you can easily hit 100K+ tokens just from accumulated tool history. Long context is enabling more capable agents (especially research and code-modification agents), but the cost per agent run scales accordingly. See agent serving infrastructure.
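
The accumulation is simple to model. The sketch below assumes every turn re-sends the full history with no trimming, summarization, or caching; the per-turn and system-prompt token counts are hypothetical.

```python
def agent_context_growth(turns, system_tokens=3_000, tokens_per_turn=6_000):
    """Return (context size at each turn, total input tokens processed).

    Assumes each turn appends a fixed number of tokens (call, result,
    reasoning) and that the whole history is re-sent every turn.
    """
    sizes = [system_tokens + t * tokens_per_turn for t in range(1, turns + 1)]
    return sizes, sum(sizes)

sizes, total_input = agent_context_growth(turns=20)
print(f"context at turn 20: {sizes[-1]:,} tokens")
print(f"total input tokens across the run: {total_input:,}")
```

Context size grows linearly with turns, but the total input processed over the run grows quadratically, because each turn pays again for everything before it. That is the part prefix caching and history compaction are meant to address.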

Will 10M-token context be common by 2027? Plausible for some models, but the effective context will likely lag. Gemini and a few competitors are already pushing context beyond 1M. The architectural changes needed (NSA-style learned sparse attention, better positional extrapolation, training-distribution improvements) are tractable. The quality at 10M effective will probably remain a research challenge through 2027 — expect "10M architecture, 1M effective" as the typical 2027 state.


Glossary

  • ALiBi — position encoding via attention bias. Alternative to RoPE.
  • Chunked prefill — splitting a long prefill into smaller chunks for better scheduling and lower TTFT.
  • FlashAttention — IO-aware attention kernel avoiding n×n materialization.
  • Lost in the middle — quality degradation for information at the middle of long contexts.
  • Needle-in-haystack — long-context recall evaluation.
  • NTK-aware scaling — frequency-band-aware RoPE extension.
  • Position interpolation — naive RoPE extension by linear position compression.
  • Ring attention — distributed attention with KV chunks rotating among GPUs.
  • RoPE — Rotary Position Embedding.
  • Sliding window attention — each token attends only to a local window.
  • State-space model — alternative architecture with O(n) attention-like operations (Mamba, etc.).
  • YaRN — Yet another RoPE extensioN. Sophisticated context-extension method.
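
The two simplest RoPE extension entries above, position interpolation and NTK-aware scaling, differ only in where the scale factor is applied. The sketch below computes rotation angles for both. Function names are illustrative, and the NTK-aware base adjustment uses the commonly cited form base · s^(d/(d−2)). YaRN's per-band ramp and attention temperature are not shown.

```python
import numpy as np

def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per rotated dimension pair."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def position_interpolation_angles(positions, head_dim, scale, base=10000.0):
    """Position interpolation: compress positions by `scale`, keep frequencies."""
    return np.outer(positions / scale, rope_inv_freq(head_dim, base))

def ntk_aware_angles(positions, head_dim, scale, base=10000.0):
    """NTK-aware scaling: keep positions, enlarge the base so the lowest
    frequency stretches by `scale` while the highest is left unchanged."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return np.outer(positions, rope_inv_freq(head_dim, new_base))

pos = np.array([0.0, 1.0, 1000.0])
plain = np.outer(pos, rope_inv_freq(128))
pi = position_interpolation_angles(pos, 128, scale=4.0)
ntk = ntk_aware_angles(pos, 128, scale=4.0)
print("highest band, NTK vs plain:", np.allclose(ntk[:, 0], plain[:, 0]))
print("lowest band,  NTK vs PI:   ", np.allclose(ntk[:, -1], pi[:, -1]))
```

Position interpolation slows every frequency band by the same factor, which blurs the high-frequency bands that encode local order. NTK-aware scaling leaves those bands alone and concentrates the stretch in the low-frequency bands.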

References

  • FlashAttention — Dao et al., 2022. arXiv:2205.14135. IO-aware tiled attention.
  • FlashAttention-2 — Dao, 2023. arXiv:2307.08691. Better work partitioning.
  • FlashAttention-3 — Shah et al., 2024. arXiv:2407.08608. Hopper-specific optimization.
  • RoFormer / RoPE — Su et al., 2021. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864.
  • YaRN — Peng et al., 2023. "YaRN: Efficient Context Window Extension of Large Language Models." arXiv:2309.00071.
  • Ring Attention — Liu et al., 2023. arXiv:2310.01889.
  • Lost in the Middle — Liu et al., 2023. arXiv:2307.03172. The canonical reference for mid-context degradation.
  • StreamingLLM — Xiao et al., 2023. "Efficient Streaming Language Models with Attention Sinks." arXiv:2309.17453.
  • H2O — Zhang et al., 2023. "Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models." arXiv:2306.14048.
  • RULER — Hsieh et al., 2024. "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv:2404.06654.
  • Mamba — Gu, Dao, 2023. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. State-space alternative.
  • ALiBi — Press, Smith, Lewis, 2021. "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." arXiv:2108.12409. Alternative position-encoding lineage.
  • DeepSpeed-Ulysses — Jacobs et al., 2023. "DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models." arXiv:2309.14509. All-to-all-based sequence parallelism.