
Mixture of Experts: The Complete Guide

The definitive guide to Mixture of Experts models: how routing works, why expert parallelism replaces tensor parallelism, the all-to-all bottleneck, load balancing under skew, serving economics, and what breaks at scale.

By Prompt20 Editorial · 92 min read

A Mixture of Experts (MoE) model is a transformer with the feed-forward block replaced by N parallel "experts" plus a router that picks the top-k per token. Parameter count rises; per-token compute stays roughly fixed.

The take. MoE pays at scale and only at scale. The capability-per-active-FLOP win is real (Switch Transformer, Mixtral, DeepSeek-V3 prove it), but it's an economy-of-scale story — high QPS, multi-tenant pooling, rack-scale fabric, generous HBM. Below that threshold a dense model with the same active-parameter count usually wins on cost, latency variance, and operational simplicity. The frontier going MoE does not mean you should.

The rest of this guide is the systems side: where the network becomes the bottleneck, how routing imbalance destroys throughput, and what the engineering trade-offs actually cost. It covers routing algorithms (top-k, expert-choice, soft, sinkhorn, auxiliary-loss-free), the all-to-all collective and the rack-scale fabric built to swallow it, expert-parallelism layouts, replication of hot experts, production load balancing, and concrete case studies from DeepSeek-V3, Mixtral, and Llama 4. It cross-links to the NCCL guide, NVLink and rack-scale topology, the KV cache guide, and disaggregated inference, because MoE never fails in isolation.

Table of contents

  1. Key takeaways
  2. Mental model: MoE serving in one minute
  3. The MoE landscape in 2026
  4. How MoE works
  5. Where the cost actually moves
  6. Expert parallelism
  7. The all-to-all collective
  8. Load balancing and the routing problem
  9. Routing algorithms compared
  10. MoE inference patterns
  11. Batch size pressure
  12. Capacity factor and token drops
  13. MoE under disaggregation
  14. Hardware and topology fit
  15. Production deployments in 2026
  16. Load balancing in production
  17. When MoE wins
  18. When dense wins
  19. Open problems
  20. Quantization for MoE
  21. When to choose specific MoE configs
  22. Routing strategies deep dive
  23. Communication patterns deep dive: dispatch, combine, overlap, DeepEP
  24. Per-architecture deep dive: Mixtral, DeepSeek-V3, Qwen-MoE, Arctic, DBRX, Llama-4, Grok-1, Hunyuan-Large, Jamba-MoE
  25. Composed parallelism: EP + TP + PP on GB200 NVL72
  26. Inference engines for MoE: TensorRT-LLM, vLLM, SGLang, llama.cpp, FasterMoE, MegaBlocks
  27. MoE on Blackwell: FP4 experts and sparse tensor cores
  28. MoE inference cost arithmetic
  29. Failure modes: router collapse, expert death, recovery
  30. Upcycling dense → MoE and the 2026 scaling story
  31. LoRA on MoE and expert-specific adapters
  32. Worked example: serving DeepSeek-V3 on GB200 NVL72
  33. KV cache for MoE
  34. The bottom line
  35. FAQ
  36. Glossary
  37. References
  38. Per-architecture deep dive: 2026 MoE catalog
  39. Routing strategies catalog
  40. All-to-all communication deep dive
  41. MoE inference engines compared
  42. MoE on Blackwell FP4
  43. MoE failure modes in production
  44. Cost-per-token math for MoE deployments
  45. Benchmarks: MoE serving throughput by config
  46. When to upcycle dense to MoE
  47. MoE-specific FAQ

Key takeaways

  • MoE keeps per-token FLOPs roughly constant while scaling parameter count. Capability per active-FLOP is genuinely higher than dense.
  • The cost moves from compute to memory and network: all experts live in HBM, and token routing requires all-to-all collectives at every MoE layer.
  • Expert parallelism (EP) replaces tensor parallelism for the MoE block. Each GPU owns some experts.
  • All-to-all is bandwidth-hungry. MoE training and inference are NVLink-bound or rack-fabric-bound in a way dense isn't.
  • Routing is imbalanced in practice. Capacity factors, drop policies, and auxiliary load-balancing losses fight this; none fully solve it.
  • Batch size matters more for MoE than for dense. Low-QPS MoE inference is wasteful.
  • MoE is a serving-economy win at scale. At small scale, dense is often better.
  • Frontier reality: by 2026 the leading open-weight models are MoE (DeepSeek-V3, Qwen3-MoE, the Llama 4 series), and the closed frontier models are widely believed to be MoE as well.

The MoE serving stack at a glance

[Request]
  │
  ▼
[Router / scheduler]  ── tenant fairness, prefix affinity
  │
  ▼
[Prefill workers]      ── high FLOPs, EP=8-16, all-to-all over NVLink
  │   KV layer-wise stream
  ▼
[Decode workers]       ── high HBM, EP=16-72, expert replication, MegaBlocks
  │
  ▼
[Streaming response]

Each layer carries its own MoE-specific concerns: the router must understand both prefix locality and per-expert load; prefill workers run dense compute at MoE-shape; decode workers carry the all-to-all weight and the imbalance pain. The handoff between them must preserve the topology assumptions on each side.


Mental model: MoE serving in one minute

The named problem is the activation/parameter gap. A modern MoE has hundreds of billions of parameters, but any single token only touches a small fraction of them. The unused weights still sit in HBM, the router still has to pick which ones to fire, and tokens still have to find their way to the right GPUs. You are paying memory cost for the full model and network cost for the routing, while only getting compute cost for the active slice.

The maître d' analogy is the cleanest. The router is a host standing at the door of a restaurant with hundreds of specialised kitchens. Each diner (token) is sent to the two or three kitchens best suited to their order. The kitchens are spread across a city block (the GPU rack), so every order requires a courier run — that courier run is the all-to-all collective. The restaurant only works if the host distributes orders evenly and the couriers ride fast roads (NVLink, not Ethernet).

Dimension           | Dense transformer  | MoE transformer
--------------------|--------------------|------------------------------------------------
Params in HBM       | All used per token | All loaded, ~5–15% used per token
Dominant cost       | FLOPs              | HBM capacity + all-to-all bandwidth
Scaling primitive   | Tensor parallel    | Expert parallel
Bad batch behaviour | Linear slowdown    | Token drops or padding waste
Hardware fit        | Any NVLink node    | Rack-scale fabric (NVL72 class)
Sticky number       | n/a                | DeepSeek-V3: 671B total, 37B active per token

In code, the routed FFN is conceptually:

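import torch

# top-k token-choice routing for one token x (simplified; `router` and `experts` are assumed modules)
gate_logits = router(x)                        # [num_experts]
weights, idx = torch.topk(gate_logits, k=2)    # torch.topk returns (values, indices), in that order
weights = torch.softmax(weights, dim=-1)       # renormalize over the chosen experts
y = sum(weights[i] * experts[int(idx[i])](x) for i in range(2))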

In production this is fused into a grouped GEMM and a pair of all-to-all collectives (dispatch + combine) — see DeepEP and MegaBlocks. The mental model carries through: routing, dispatch, expert compute, combine.

Read the rest of this guide as a tour of what happens when that few-line idea meets a 72-GPU rack and real traffic.


The MoE landscape in 2026

The MoE world in 2026 is bigger than "sparse FFN" — it's a stack of routing algorithms, expert layouts, kernels, and serving systems that have co-evolved with rack-scale hardware. A rough field map:

Model families. DeepSeek-V3 and R1 (256 routed experts + 1 shared, top-8, auxiliary-loss-free balancing), Qwen3-MoE (Alibaba's high-expert-count line), Llama 4 Maverick and the still-training Behemoth (Meta), Mixtral 8x7B / 8x22B (Mistral, the line that made MoE accessible), DBRX (Databricks, 132B / 36B active, fine-grained), Grok-1 (xAI, 314B / 78B active), and the closed-source GPT/Claude/Gemini frontier models whose pricing curves are consistent with MoE.

Routing algorithms. Top-k token choice (Switch, GShard, Mixtral), top-k with auxiliary load-balancing loss, expert-choice routing (Zhou et al., 2022 — experts pick their top tokens, guaranteeing balance), soft routing (no hard top-k, every expert weighted), sinkhorn-balanced routing (entropy-regularized optimal transport), and DeepSeek's auxiliary-loss-free bias-based balancing.

Kernels and libraries. MegaBlocks (block-sparse GEMM for variable-sized expert batches, removes the need to pad to capacity), Tutel (Microsoft's adaptive MoE comm library), ScatterMoE, Grouped GEMM in cuBLAS / hipBLASLt, NVIDIA Transformer Engine MoE kernels, and DeepEP — DeepSeek's open-source expert-parallelism communication kernels tuned for H800 and B200 fabric.

Serving stacks. vLLM (production MoE serving with EP), SGLang (RadixAttention + MoE), TensorRT-LLM (deeply integrated with rack-scale fabric), DeepSpeed-MII (Microsoft), and lmdeploy (InternLM). Each has different sweet spots for expert count, EP size, and KV-cache strategy.

Hardware substrates. B200 with NVL72 (the prototypical MoE rack — 72 GPUs in one NVLink domain), H100/H200 NVSwitch nodes (workable for EP up to 8), MI300X / MI325X with Infinity Fabric (competitive HBM, software maturing), and older A100 / MI250 generations where the all-to-all penalty makes large-EP MoE serving uneconomic.

Adjacent techniques. Speculative decoding with MoE targets (the draft must mimic routing — see speculative decoding), MoE-to-dense distillation, per-expert quantization (some FP4, others FP8 — see quantization tradeoffs), and MoE in attention layers (still mostly research). Every layer of the modern serving stack has been bent toward MoE since 2023; even dense workloads now run on hardware designed for it.

Active-to-total parameter ratios across the 2026 lineup

The single most useful comparison is the active/total ratio, because it tells you both inference cost and how aggressively the model uses sparsity.

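Model              | Total params | Active params | Active ratio | Experts        | Routing
-------------------|--------------|---------------|--------------|----------------|--------
Mixtral 8x7B       | 47B          | 13B           | 28%          | 8              | top-2
Mixtral 8x22B      | 141B         | 39B           | 28%          | 8              | top-2
DBRX               | 132B         | 36B           | 27%          | 16             | top-4
Grok-1             | 314B         | 78B           | 25%          | 8              | top-2
Qwen2-MoE 57B-A14B | 57B          | 14B           | 25%          | 64             | top-8
DeepSeek-V2        | 236B         | 21B           | 9%           | 160 + 2 shared | top-6
DeepSeek-V3        | 671B         | 37B           | 5.5%         | 256 + 1 shared | top-8
Llama 4 Maverick   | ~400B        | ~17B          | ~4%          | 128 + 1 shared | top-1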

The frontier is clearly toward lower active ratios: more total parameters, more specialization, more experts. DeepSeek-V3's 5.5% active ratio is roughly what the 2026 frontier looks like; pre-2024 designs at 25-30% active are now considered coarse-grained.


How MoE works

In a dense transformer, every feed-forward layer applies one MLP to every token. Parameters: ~8 × d_model² per FFN (two matrices of d_model × 4·d_model, i.e. an expansion factor of 4); adding attention's ~4 × d_model² gives the familiar ~12 × d_model² per layer.

In MoE, that single MLP is replaced by N parallel "experts," each a smaller MLP. A learned router computes a score for each (token, expert) pair, picks the top-k experts per token (typically k=1 or k=2 in coarse-grained designs, up to k=8 in fine-grained ones), and dispatches the token to those experts. Each expert computes its FFN output. The results are weighted by router scores and summed.

For each token:
  scores = softmax(W_router @ x)           # one score per expert
  top_k = top_k_indices(scores, k)         # which experts to use
  output = sum(
    scores[i] * expert_i(x) for i in top_k
  )
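As a rough, self-contained illustration of the pseudocode above, the sketch below routes one token through a toy MoE layer in plain Python. The `moe_forward` name, the scalar-multiplier "experts", and the choice to renormalize the selected scores so they sum to 1 are all assumptions made for the example; real models differ on renormalization and run this as batched tensor ops.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts, k=2):
    """Route one token vector x through its top-k experts.

    router_weights: one weight row per expert (hypothetical learned router).
    experts: list of callables mapping a vector to a vector.
    """
    # One router logit per expert: dot(W_router[e], x).
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    scores = softmax(logits)
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:k]
    # Assumption: renormalize the selected scores so the mix sums to 1.
    norm = sum(scores[e] for e in top)
    out = [0.0] * len(x)
    for e in top:
        y = experts[e](x)  # only k experts run; the other N - k are skipped
        out = [o + (scores[e] / norm) * yi for o, yi in zip(out, y)]
    return top, out

# Toy setup: 4 experts that scale the input by 1, 2, 3, 4.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [0.0, 0.2], [0.5, 0.5], [1.0, 1.0]]
chosen, y = moe_forward([1.0, 1.0], router_weights, experts, k=2)
print(chosen, y)
```

Only two of the four toy experts execute for this token, which is the whole point: compute follows k, while the parameter count follows N.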

Parameter count grows with N but per-token FLOPs scale with k, not N. An 8x22B MoE (Mixtral-class: 8 experts, nominally 22B each, top-2) has ~141B total parameters but activates only ~39B per token; the total is below 8 × 22B because attention and embedding weights are shared across experts. DeepSeek-V3 has 671B total parameters with ~37B active per token (256 routed experts plus a shared expert, top-8).

The bet is that adding parameters at fixed compute lets the model learn more nuanced mappings — different experts specialize on different input distributions — and that this raises capability per active FLOP.

The bet has paid off. The systems cost has been substantial.

Fine-grained vs coarse-grained experts

There are two design philosophies. Coarse-grained MoE (Mixtral, DBRX, Grok-1) uses 8-16 experts, each with full-FFN width (intermediate dim 14k for 7B-sized experts), top-2 to top-4 routing, and a 25-30% active ratio. Fine-grained MoE (DeepSeek-V2/V3, Qwen-MoE) uses 64-256 experts, each with reduced intermediate dim (1.5-3k), top-6 to top-8 routing, and a 5-10% active ratio. The DeepSeekMoE paper (arXiv:2401.06066) argues that fine-grained gives higher specialization at the cost of routing overhead, and the empirical results have moved the field. The cost in serving complexity is real: 256 experts with top-8 routing means dispatching each token to up to 8 different GPUs, which is 4× the per-token all-to-all volume of Mixtral's top-2.

The shared expert pattern

A "shared expert" is an always-active small MLP that runs on every token in addition to the routed experts. DeepSeek-V3 has one; DBRX does not. The shared expert acts as a base layer that handles general-purpose patterns, freeing routed experts to specialize harder. It also provides a graceful fallback for dropped tokens — if all top-k experts for a token are over-capacity and the token gets dropped from routing, the shared expert still contributes, preventing a complete information void. The cost is the shared expert's compute on every token; the gain is robustness and (empirically) better quality on the long tail.


Where the cost actually moves

In a dense model, compute and memory scale together. Bigger model means more FLOPs and more HBM traffic per token. They're correlated bottlenecks.

In MoE, they decouple:

  • Compute per token is fixed by k and the active parameter count. ~37B active params → roughly 74 GFLOPs per token (about 2 FLOPs per active parameter), plus the routing overhead. At batch 1 and FP8 that also means ~37 GB of weights to read per decode step.
  • Memory footprint is fixed by total parameter count. ~671B params at FP8 → 671 GB resident across the GPU pool, regardless of how many activate per token.
  • Network traffic per token is the routing dispatch and combine: roughly 2× the token's hidden state per MoE layer (once out, once back), multiplied by k, the number of experts the token is sent to.

So MoE is:

  • More memory-hungry than its active-parameter count suggests.
  • More bandwidth-hungry than dense due to all-to-all.
  • Not more compute-hungry per token.
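A back-of-the-envelope sketch of that decoupling, using the DeepSeek-V3 figures quoted above. The 2-FLOPs-per-active-parameter approximation, the hidden size of 4096, and BF16 activations are illustrative assumptions, and `moe_cost_per_token` is a hypothetical helper rather than anything from a library.

```python
def moe_cost_per_token(total_params, active_params, bytes_per_param,
                       hidden_dim, top_k, act_bytes):
    """Rough per-token cost split for an MoE model (illustrative only)."""
    return {
        # Assumption: ~2 FLOPs (multiply + add) per active parameter per token.
        "gflops_per_token": 2 * active_params / 1e9,
        # Weights that must be streamed from HBM for one token at batch 1.
        "active_weight_gb": active_params * bytes_per_param / 1e9,
        # Weights that must stay resident across the GPU pool.
        "resident_weight_gb": total_params * bytes_per_param / 1e9,
        # Dispatch + combine: the hidden state travels out and back,
        # once per selected expert, per MoE layer.
        "a2a_kb_per_token_per_layer": 2 * hidden_dim * act_bytes * top_k / 1e3,
    }

cost = moe_cost_per_token(
    total_params=671e9, active_params=37e9, bytes_per_param=1,  # FP8 weights
    hidden_dim=4096, top_k=8, act_bytes=2,                      # assumed BF16 activations
)
for name, value in cost.items():
    print(f"{name}: {value:,.1f}")
```

The three numbers that matter move independently: compute tracks the 37B active slice, residency tracks the full 671B, and network volume tracks hidden size times k.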

Concrete memory breakdown for a 671B MoE

For DeepSeek-V3 at FP8, the per-GPU resident footprint in a typical EP=8 layout on H200:

Component                  | Size    | Notes
---------------------------|---------|--------------------------------------
Routed experts (32 of 256) | ~83 GB  | Each expert ~2.6B params at FP8
Shared expert (replicated) | ~3 GB   | Always on every GPU
Attention weights (TP slice) | ~5 GB | If TP=1, full attention per GPU
Embedding + LM head        | ~2 GB   | Sharded or replicated
Activations + buffers      | ~10 GB  | Working memory
KV cache (per request)     | varies  | 256-2048 MB depending on context
Total static               | ~103 GB | Fits in 141 GB H200 with room for KV

The takeaway: H200's 141 GB is the right minimum for DeepSeek-V3-class serving at EP=8. Anything less forces a wider EP and more all-to-all volume per token.
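To see how the footprint shifts with EP width, here is a small sketch that re-adds the table's components for a few EP sizes. The component sizes are the approximate figures from the table; `per_gpu_static_gb` is a hypothetical helper, and the sketch ignores expert replication and fragmentation.

```python
def per_gpu_static_gb(num_experts, ep_size, expert_gb, other_gb):
    """Static HBM per GPU: this GPU's routed experts plus everything else."""
    experts_per_gpu = num_experts // ep_size
    return experts_per_gpu * expert_gb + sum(other_gb.values())

# Non-expert components per GPU, in GB (approximate values from the table).
other = {"shared_expert": 3, "attention": 5, "embed_lm_head": 2, "activations": 10}
HBM_GB = 141  # H200 capacity

for ep in (8, 16, 32):
    static = per_gpu_static_gb(256, ep, expert_gb=2.6, other_gb=other)
    print(f"EP={ep:2d}: static ~{static:.0f} GB, KV headroom ~{HBM_GB - static:.0f} GB")
```

Widening EP frees HBM for KV cache on each GPU, but every extra participant adds all-to-all traffic, which is the trade the rest of this guide keeps returning to.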

The implication: the right hardware for MoE looks different. You want HBM capacity (for the resident experts), HBM bandwidth (for decode reading weights), and very high inter-GPU bandwidth (for all-to-all). Pure FLOPs/$ matters less than for dense training.


Expert parallelism

The natural way to fit a large MoE across many GPUs is to put different experts on different GPUs. This is expert parallelism (EP).

A 256-expert MoE with EP=64 places 4 experts per GPU. A token routed to expert 137 is dispatched to the GPU that owns expert 137, regardless of which GPU the token currently lives on.
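A minimal sketch of that placement, assuming experts are assigned to GPUs in contiguous blocks. Real deployments may permute or replicate experts, so treat `owner_gpu` and `local_experts` as hypothetical helpers that only show the bookkeeping.

```python
def owner_gpu(expert_id, num_experts, ep_size):
    """GPU rank that owns an expert under contiguous block placement."""
    experts_per_gpu = num_experts // ep_size
    return expert_id // experts_per_gpu

def local_experts(rank, num_experts, ep_size):
    """Expert ids resident on a given GPU rank."""
    per_gpu = num_experts // ep_size
    return list(range(rank * per_gpu, (rank + 1) * per_gpu))

rank = owner_gpu(137, num_experts=256, ep_size=64)
print(rank)                          # rank that must receive tokens for expert 137
print(local_experts(rank, 256, 64))  # the other experts sharing that GPU
```

With this layout expert 137 lives on rank 34 alongside experts 136, 138, and 139, so any token that selects it has to be shipped there no matter where it started.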

Why not tensor parallelism

Tensor parallelism (TP) splits each layer's matrices across GPUs. It works for dense FFNs because the same MLP is applied to every token — every GPU does its slice of every token's FFN.

For MoE, this is wasteful. If you TP-split each expert, every GPU has to participate in every expert's computation, even when only some tokens activate that expert. You've kept all the FLOPs and added all-reduces per expert.

EP is the natural choice: the unit of parallelism (one expert) matches the unit of activation (one expert call). FLOPs are sparse; communication moves tokens between GPUs instead of moving partial activations across all GPUs.

In practice, large MoE deployments use a mix: TP within an expert (if each expert is large enough to need it), EP across experts, and data parallelism on top of both. Our distributed LLM training guide covers how TP, PP, DP, and FSDP compose.

Composing EP with TP, PP, and DP

The full parallelism matrix for a large MoE looks like this. Each layer's compute is sliced along multiple axes simultaneously:

  • EP (expert parallelism): experts distributed across GPUs. Unit = one expert.
  • TP (tensor parallelism): matrices within an expert (or within attention) sharded. Unit = one matmul row/column slice.
  • PP (pipeline parallelism): layers split across stages, one micro-batch in flight per stage. Unit = one transformer layer group.
  • DP (data parallelism): replicated model, different batch shards. Unit = one model replica.

For DeepSeek-V3 training on H800, the published layout uses EP=64, TP=1 (experts are small enough to fit), PP=16, DP=many. Each combination has a different all-reduce or all-to-all pattern, and the order matters for the compute/communication overlap. PP overlaps cleanly with EP all-to-all; TP all-reduces inside an expert serialize with EP all-to-all and have to be carefully scheduled.

When to TP inside an expert

Pure EP works when one expert fits comfortably on one GPU's HBM and the expert's GEMM is large enough to saturate that GPU. For Mixtral 8x22B (experts nominally ~22B params each, about 44 GB at BF16), one expert per GPU is borderline on an 80 GB part and does not fit on smaller ones, so TP=2 or TP=4 inside the expert is common. For DeepSeek-V3 (256 experts at ~2.6B params each), each expert easily fits on one GPU and TP is unnecessary at the expert level. The rule: if expert_params × bytes_per_param > 0.6 × HBM_per_GPU (leaving room for KV and activations), use TP inside the expert. Otherwise, pure EP.
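The rule of thumb translates directly into a one-line check. This is a sketch of the heuristic as stated, with a hypothetical `needs_tp` helper; the 40 GB GPU in the second call is an assumed example, not a configuration from the text.

```python
def needs_tp(expert_params, bytes_per_param, hbm_gb, headroom=0.6):
    """True if one expert exceeds the usable share of a single GPU's HBM."""
    expert_gb = expert_params * bytes_per_param / 1e9
    return expert_gb > headroom * hbm_gb

# DeepSeek-V3-sized expert (~2.6B params) at FP8 on an 80 GB GPU: pure EP.
print(needs_tp(2.6e9, bytes_per_param=1, hbm_gb=80))

# A ~22B-param expert at BF16 on an assumed 40 GB GPU: shard it with TP.
print(needs_tp(22e9, bytes_per_param=2, hbm_gb=40))
```

The check only covers capacity. An expert that fits may still want TP if a single GPU's HBM bandwidth cannot stream its weights fast enough for the decode latency target.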


The all-to-all collective

Every MoE forward pass involves two all-to-all communications per MoE layer:

Dispatch. Each GPU has a batch of tokens. It needs to send each token to the GPU owning its assigned top-k experts. This is an all-to-all: every GPU sends some tokens to every other GPU, and receives some tokens from every other GPU.

Combine. After experts run, the outputs need to return to the GPU where each token originated, to be combined and passed to the next layer. Another all-to-all.

Two all-to-alls per MoE layer. For a 60-layer MoE with every other layer being MoE (30 MoE layers), that's 60 all-to-alls per forward pass, which in decode means 60 per generated token.
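The dispatch and combine steps can be simulated without any network at all. The sketch below models ranks as dictionary keys and the two all-to-alls as regrouping steps; the function names, the scalar "tokens", and the unweighted sum in `combine` (gate weights are omitted for brevity) are simplifications assumed for illustration.

```python
from collections import defaultdict

def dispatch(tokens_by_rank, assignments, experts_per_gpu):
    """First all-to-all: group (origin, slot, expert, value) by owning rank."""
    inbox = defaultdict(list)
    for src, tokens in tokens_by_rank.items():
        for slot, value in enumerate(tokens):
            for expert in assignments[(src, slot)]:
                owner = expert // experts_per_gpu  # contiguous placement assumed
                inbox[owner].append((src, slot, expert, value))
    return inbox

def combine(inbox, expert_fn, tokens_by_rank):
    """Second all-to-all: send expert outputs home and sum them per token."""
    out = {rank: [0.0] * len(tokens) for rank, tokens in tokens_by_rank.items()}
    for _owner, items in inbox.items():
        for src, slot, expert, value in items:
            out[src][slot] += expert_fn(expert, value)
    return out

# Two ranks, four experts (two per rank), scalar "tokens", top-2 routing.
tokens_by_rank = {0: [1.0, 2.0], 1: [3.0]}
assignments = {(0, 0): [0, 3], (0, 1): [2, 3], (1, 0): [0, 1]}
inbox = dispatch(tokens_by_rank, assignments, experts_per_gpu=2)
out = combine(inbox, lambda expert, v: (expert + 1) * v, tokens_by_rank)
print({rank: len(items) for rank, items in inbox.items()})
print(out)
```

Even in this toy case every rank both sends to and receives from the other, and the slowest inbox sets the pace, which is why imbalance shows up as communication time.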

Why this is hard

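All-to-all cost grows with the number of participating GPUs. Each GPU exchanges a message with every other GPU, so the per-GPU message count grows linearly with EP while each message shrinks, and once the group spans nodes most of the traffic crosses slower inter-node links. In the latency-bound regime typical of decode, doubling EP can roughly double the per-step communication time unless the fabric keeps pace. On a typical 8-GPU NVLink-NVSwitch node, all-to-all completes in tens of microseconds. Across 64 GPUs over InfiniBand, it can be milliseconds. Across 256+ GPUs spanning multiple racks, more.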

For training, this is tolerable because batch sizes are huge (millions of tokens per step) and communication can largely be overlapped with compute. For inference, it's painful: batches are small, the workload is latency-critical, and the all-to-all can dominate step time.

The dominant fix

Rack-scale fabrics like NVL72, which extend NVLink-class bandwidth across 72 GPUs in a single rack, were largely motivated by MoE all-to-all costs. Inside the rack, an all-to-all across 64 GPUs runs at NVLink speed instead of InfiniBand speed, a factor of ~10× improvement. This is what makes large-EP serving practical. (See our NVLink and rack-scale topology guide for the fabric mechanics, and the NCCL guide for the collective itself.)

For deployments below rack scale, the rule of thumb is: keep your expert-parallel group inside a single fast-fabric domain. EP across InfiniBand works but is slow.

All-to-all latency budget by fabric

A concrete comparison of all-to-all latency for a single MoE layer dispatch+combine at typical decode batch sizes (256 tokens, 4096 hidden dim, top-8 routing):

Fabric                                   | Bandwidth (per GPU) | Per-layer all-to-all | 30-layer MoE forward
-----------------------------------------|---------------------|----------------------|---------------------
Intra-node NVLink (H100 NVSwitch, 8 GPU) | 900 GB/s aggregate  | 25 µs                | 0.75 ms
NVL72 rack (B200)                        | 1.8 TB/s aggregate  | 18 µs                | 0.54 ms
InfiniBand 400G (NDR) across 32 GPUs     | 50 GB/s             | 280 µs               | 8.4 ms
InfiniBand 200G (HDR)                    | 25 GB/s             | 560 µs               | 16.8 ms
RoCE 200G                                | ~22 GB/s            | 650 µs               | 19.5 ms
100G Ethernet                            | 12.5 GB/s           | 1.1 ms               | 33 ms

The rack-scale fabric is not a luxury: at the 30-MoE-layer scale typical of modern frontier models, a slow fabric adds tens of milliseconds per forward pass, which destroys decode inter-token latency (ITL). This is precisely why NVL72 sells out.
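The last column of the table is just the per-layer figure multiplied by the number of MoE layers. The sketch below redoes that arithmetic and expresses it as a share of a per-token latency budget; the 50 ms budget is a hypothetical target chosen for illustration, not a number from any benchmark.

```python
# Per-layer dispatch+combine latency in microseconds, copied from the table above.
PER_LAYER_US = {
    "NVLink (H100 NVSwitch, 8 GPU)": 25,
    "NVL72 rack (B200)": 18,
    "InfiniBand 400G (NDR)": 280,
    "InfiniBand 200G (HDR)": 560,
    "RoCE 200G": 650,
    "100G Ethernet": 1100,
}

def a2a_overhead_ms(per_layer_us, moe_layers=30):
    """All-to-all time per forward pass, assuming no overlap with compute."""
    return per_layer_us * moe_layers / 1000

ITL_BUDGET_MS = 50  # hypothetical inter-token latency target

for fabric, us in PER_LAYER_US.items():
    ms = a2a_overhead_ms(us)
    print(f"{fabric:32s} {ms:6.2f} ms  ({ms / ITL_BUDGET_MS:5.1%} of budget)")
```

Under that assumed budget, the NVLink-class fabrics spend a percent or two of each step on communication while 100G Ethernet spends most of it, before any expert has done useful work.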

DeepEP and the kernel-level wins

DeepEP (DeepSeek's open-source expert-parallelism communication kernels, released alongside V3) replaces NCCL's default all-to-all with a custom implementation tuned for the dispatch+combine pattern. It fuses the local routing computation with the network send, eliminates a copy from compute buffer to send buffer, and exposes finer-grained scheduling to overlap dispatch with the prior layer's compute. Reported speedups vs stock NCCL are 1.3-1.8× on H800; similar gains have been reported on H200 and B200. The wider lesson: NCCL is general-purpose, MoE all-to-all is a specific pattern, and the gap between general and specific is large enough that custom kernels are standard at the frontier. See our NCCL tuning guide for the general primitives.


Load balancing and the routing problem

The router is learned. The model decides which tokens go to which experts. There is no guarantee that the routing is uniform across experts.

In practice, routing is far from uniform. Some experts get 5-10× the traffic of others on real workloads. The imbalance can be persistent (same experts always hot) or workload-dependent (different traffic patterns activate different experts).

Two costs of imbalance:

1. All-to-all stragglers. All-to-all is synchronous: the slowest receiver determines the step time. A GPU holding an over-subscribed expert is a bottleneck; the rest wait. Imbalance directly slows throughput.

2. Capacity overruns. Each GPU has finite HBM and finite compute. If too many tokens route to one expert, you have to either drop tokens, allocate slack capacity, or stall.

Mitigations during training

  • Auxiliary load-balancing loss. Add a term that penalizes high variance in per-expert token counts. Encourages the router to spread traffic. Trade-off: the router's quality (routing tokens to truly-good experts) competes with uniformity.
  • Expert dropout. Randomly mask experts during training to prevent over-reliance.
  • Z-loss and other regularizers on router logits.

These help. They don't eliminate imbalance — production MoE traffic is non-uniform by design (it's useful that different experts specialize).
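One widely used form of the auxiliary load-balancing loss comes from the Switch Transformer paper: N × Σ f_i × P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The sketch below computes it for top-1 routing in plain Python; the function name and toy inputs are assumptions, and training code would compute this on tensors and scale it by a small coefficient.

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i (top-1 routing).

    router_probs: per-token softmax probabilities over experts.
    assignments: per-token index of the expert actually chosen.
    """
    tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for e in assignments:
        f[e] += 1.0 / tokens
    # P_i: mean router probability assigned to expert i.
    p = [sum(row[e] for row in router_probs) / tokens for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], num_experts=4)
collapsed = load_balancing_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], num_experts=4)
print(balanced, collapsed)
```

The loss bottoms out at 1.0 when routing is uniform and climbs toward N as traffic concentrates on one expert, which is the pressure that competes with routing quality.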

Mitigations at serving time

  • Capacity factor. Cap tokens per expert per batch. Tokens above the cap are dropped to a fallback (identity, or a small dense layer).
  • Expert replication. Place hot experts on multiple GPUs and load-balance across replicas. Costs HBM; helps throughput.
  • Routing-aware request grouping. Send requests with similar topical distribution to the same replica so warm caches are hit.
  • Dynamic bias updates. DeepSeek's aux-loss-free bias can be updated at inference time too, nudging the router away from oversubscribed experts in real time. Risky if applied too aggressively (causes oscillations); useful with conservative step sizes.
  • Hierarchical routing. Route first to an expert group, then within the group. Reduces the effective N for any single all-to-all and localizes imbalance to within a group.

Quantifying imbalance: the load CV metric

The single most useful metric is the coefficient of variation of per-expert token counts: CV = stddev / mean across experts. A perfectly balanced router has CV = 0. Production MoE routinely shows CV = 0.3-0.7. Above CV = 0.5, you should be replicating hot experts; above CV = 0.8, you have a routing pathology and need to investigate the training process. Track CV per layer (not just averaged across layers) because imbalance is layer-specific in surprising ways.
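A small sketch of the metric and the thresholds above. It uses the population standard deviation, which is an assumption (the text does not say which variant to use), and the `triage` labels are shorthand for the recommendations in this section.

```python
from statistics import mean, pstdev

def load_cv(tokens_per_expert):
    """Coefficient of variation of per-expert token counts (population stddev)."""
    return pstdev(tokens_per_expert) / mean(tokens_per_expert)

def triage(cv):
    """Map a CV value to the rule-of-thumb actions described above."""
    if cv > 0.8:
        return "investigate routing / training"
    if cv > 0.5:
        return "replicate hot experts"
    return "ok"

# Hypothetical per-layer token counts for an 8-expert model.
per_layer = {
    "layer_3": [100] * 8,
    "layer_17": [160, 130, 110, 100, 90, 80, 70, 60],
    "layer_29": [250, 150, 100, 80, 70, 60, 50, 40],
}
for name, counts in per_layer.items():
    cv = load_cv(counts)
    print(f"{name}: CV={cv:.2f} -> {triage(cv)}")
```

Reporting per layer rather than one averaged number is what surfaces the case where a single layer is badly skewed while the model-wide average looks healthy.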


Routing algorithms compared

"Top-k routing" is shorthand for a family of algorithms that differ on how they decide token-expert assignments — and the differences show up directly in throughput, balance, and quality.

Top-k token choice (Switch / GShard / Mixtral). Each token picks its k highest-scoring experts. Simple and what most people mean by MoE. Failure mode: nothing guarantees balanced load — popular experts get hammered, others idle, capacity overflows produce drops. Switch Transformer (Fedus et al., 2021, arXiv:2101.03961) was top-1; GShard (Lepikhin et al., 2020, arXiv:2006.16668) standardized top-2 with auxiliary load loss and capacity factors.

Expert-choice routing (Zhou et al., 2022, arXiv:2202.09368). Inverts the perspective: each expert picks its top-C tokens, where C is a fixed per-expert capacity. Load balance is guaranteed by construction: every expert receives exactly C tokens. The trade-off: a token may be selected by zero experts (dropped implicitly) or by many. Excellent for training throughput; awkward for autoregressive decode, because selection is made across the whole batch rather than per token (no token-level k cap), so production decoders typically stay with token-choice.
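A toy version of expert-choice selection makes the trade-off concrete. The `expert_choice` helper and the score matrix are made up for illustration; the paper's formulation operates on batched tensors.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the router affinity of token t for expert e.

    Each expert keeps its `capacity` highest-scoring tokens.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    picks = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks[e] = ranked[:capacity]
    return picks

scores = [
    [0.9, 0.8],  # token 0: attractive to both experts
    [0.7, 0.6],  # token 1
    [0.1, 0.2],  # token 2: attractive to neither
    [0.5, 0.7],  # token 3
]
picks = expert_choice(scores, capacity=2)
experts_per_token = [sum(t in chosen for chosen in picks.values()) for t in range(len(scores))]
print(picks)              # every expert holds exactly `capacity` tokens
print(experts_per_token)  # but per-token coverage is uneven
```

Both experts end up with exactly two tokens, yet token 0 is processed twice and token 2 is processed by nobody, which is the implicit drop described above.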

Soft routing. No hard top-k; every expert contributes weighted by its router score. Dense compute (you're computing every expert anyway), but smoother gradients. Used in some research models; rarely production because you lose the active-FLOP savings that are the entire point of MoE.

Sinkhorn / optimal-transport routing. Treat routing as an assignment problem with marginal constraints (uniform load on experts, top-k per token). Sinkhorn iteration solves it efficiently. Yields balanced assignments without auxiliary losses. Used in some research lines; production adoption limited by added compute per step.

Auxiliary-loss-free routing (DeepSeek-V3). Per-expert bias terms are nudged online: experts that are underloaded get their bias raised, overloaded experts get it lowered. No auxiliary loss term competing with the model's task loss. Tends to produce better quality at the same balance level. Described in the DeepSeek-V3 report (arXiv:2412.19437), which builds on the fine-grained and shared-expert design of DeepSeekMoE (arXiv:2401.06066).
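A minimal sketch of the mechanism, assuming a sign-based update with a fixed step size. The function names, the step value, and the toy scores are illustrative; the one structural point taken from the DeepSeek-V3 report is that the bias influences which experts are selected but not the gating weights applied to their outputs.

```python
def select_experts(scores, bias, k):
    """Top-k by (score + bias). The bias affects selection only; gating
    weights would still be computed from the raw scores."""
    order = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return order[:k]

def update_bias(bias, load, step=0.01):
    """Raise the bias of underloaded experts, lower it for overloaded ones."""
    target = sum(load) / len(load)
    new_bias = []
    for b, tokens in zip(bias, load):
        if tokens < target:
            new_bias.append(b + step)
        elif tokens > target:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias

scores = [0.50, 0.49, 0.01]  # expert 0 narrowly beats expert 1
bias = [0.0, 0.0, 0.0]
before = select_experts(scores, bias, k=1)

# Suppose the last batch sent every token to expert 0.
bias = update_bias(bias, load=[10, 0, 0])
after = select_experts(scores, bias, k=1)
print(before, after, bias)
```

One update is enough to flip a near-tie toward the idle expert, while expert 2 stays unselected because its score is far too low for a small bias to matter. That is the sense in which this approach balances load without overriding clear routing preferences.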

Hash routing / random routing. Bypass learning; assign tokens to experts by a hash of the token id or a fixed permutation. Trivially balanced; quality is worse than learned routing but a useful baseline for ablation.

Z-loss and router stability. A separate auxiliary term, the z-loss (introduced in ST-MoE, arXiv:2202.08906), penalizes large pre-softmax logits in the router. Without it, router logits can grow without bound during training, causing numerical instability in mixed-precision training. Z-loss is essentially free quality-wise and prevents an entire class of training failures. Universal in modern MoE training.
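For reference, the z-loss from the ST-MoE paper is the mean over tokens of the squared log-sum-exp of the router logits. A plain-Python sketch, with a hypothetical function name and toy logits:

```python
import math

def router_z_loss(logits_per_token):
    """Mean squared log-sum-exp of the router logits, one row per token."""
    total = 0.0
    for logits in logits_per_token:
        m = max(logits)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(logits_per_token)

small = router_z_loss([[0.0, 0.0, 0.0, 0.0]])
large = router_z_loss([[10.0, 10.0, 10.0, 10.0]])
print(small, large)
```

Both toy inputs produce the same softmax distribution, but the second is penalized far more heavily. The loss targets logit magnitude, not routing decisions, which is why it costs little in quality.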

Routing collapse. A persistent training pathology where the router ends up routing nearly all tokens to a few experts regardless of input. Auxiliary loss, z-loss, expert dropout, and aux-loss-free bias updates each independently reduce the risk; in combination they make collapse rare in practice. If you see expert utilization CV above 1.0 during training, suspect collapse and tune the regularizers.

Practical rule. For frontier training and inference, the field has converged on token-choice top-k with either auxiliary load-balancing loss (Mixtral-style) or auxiliary-loss-free bias updates (DeepSeek-style). Expert-choice is attractive for training-only flows; soft and sinkhorn live mostly in research.

Routing algorithm comparison table

Algorithm                             | Balance guarantee               | Quality             | Train/infer | Production use
--------------------------------------|---------------------------------|---------------------|-------------|-------------------------
Top-k token choice (no aux)           | None                            | Best raw            | Both        | Rare (load issues)
Top-k + auxiliary load loss           | Probabilistic                   | Slight quality cost | Both        | Mixtral, Llama 4
Top-k + aux-loss-free bias (DeepSeek) | Probabilistic, no loss conflict | Best of practical   | Both        | DeepSeek-V3, Qwen3
Expert-choice                         | Exact (by construction)         | Strong on training  | Train only  | Research / training-only
Soft routing                          | Dense compute                   | Smooth gradient     | Both        | Research
Sinkhorn / optimal transport          | Exact                           | Strong              | Train       | Research
Hash / random                         | Exact (trivial)                 | Poor                | Both        | Ablation baseline

The 2026 production default is top-k token-choice with aux-loss-free bias balancing and a shared expert. Everything else is workload-specific.


MoE inference patterns

Once you've picked a routing algorithm, the next decision is how the experts live across GPUs at serve time. There are three patterns in production.

1. Pure expert parallelism (EP). Each expert lives on exactly one GPU. A token routed to expert 137 must reach the one GPU holding it. This is the cleanest map, has the smallest HBM footprint, and makes all-to-all costs proportional to EP size. Default for high-expert-count models (DeepSeek-V3 with EP=64 puts 4 experts per GPU).

2. Expert replication. Hot experts are placed on multiple GPUs and load-balanced across replicas, similar to how you'd shard a hot key in a distributed cache. Costs HBM (you've doubled or tripled the footprint of the replicated experts) but eliminates the straggler problem for known-hot experts. Most production stacks support this as a tuning knob; it pairs naturally with telemetry from a few hours of traffic showing which experts trend hot.

3. Expert parallelism + tensor parallelism within an expert (EP × TP). When a single expert is large enough that one GPU can't hold it (or its decode bandwidth is insufficient), TP shards the expert across a small group of GPUs that act as the unit holding that expert. Common pattern: EP across the rack, TP=2 or TP=4 within a node for each expert. Adds an internal all-reduce per expert call but is essential for very large experts. vLLM and TensorRT-LLM both support combining EP with TP; the DeepSeek-V3 technical report describes a related layout that pairs TP for attention with EP for the MoE block.

Variations.

  • Grouped experts. Several experts collocated on the same GPU and computed via grouped GEMM (one kernel launch processing per-expert sub-batches). MegaBlocks (Gale et al., 2023, arXiv:2211.15841) provides block-sparse kernels that eliminate the capacity-factor padding entirely.
  • Shared expert. An always-active small MLP layered on every token. Cheap, improves quality on dropped tokens, used by DeepSeek-V3.
  • Dispatch-combine fusion. Recent kernels fuse the dispatch all-to-all with the post-routing local GEMM, reducing kernel launches and HBM round-trips. DeepSeek's open-source DeepEP and NVIDIA's MoE kernels both go in this direction.

Per-pattern inference characteristics

Pattern                                 | HBM cost | Throughput                | Latency variance     | Operational complexity
----------------------------------------|----------|---------------------------|----------------------|-----------------------
Pure EP                                 | Baseline | Best when balanced        | High under imbalance | Low
EP + replication of hot experts         | +10-30%  | Best in practice          | Low                  | Medium
EP × TP                                 | Baseline | Best for large experts    | Medium               | High
Grouped GEMM (multiple experts per GPU) | Baseline | Good at moderate scale    | Medium               | Low
Block-sparse (MegaBlocks)               | Baseline | Best for skewed workloads | Low                  | Medium (kernel mgmt)

The mixed pattern most production stacks settle on: EP across the rack, expert replication for the top-decile hot experts, grouped GEMM within each GPU's local expert set, block-sparse kernels enabled when CV exceeds 0.5.

Layout sketch (DeepSeek-V3-class):

Per node (8 H800 / B200):
  TP=2 inside each expert group
  4 expert "shards" per GPU
  Shared expert replicated everywhere
Across nodes (NVL72 rack):
  EP=64 across the rack
  Dispatch / combine over NVLink fabric
  DP at the request level on top

Batch size pressure

MoE wants large batches more than dense models do. The reason is structural.

A batch of B tokens with top-k routing produces ~B·k expert activations distributed across N experts. If B·k is small relative to N, most experts run with very few tokens — the wrong shape for GPU GEMMs. Below some batch size, expert GEMMs are too small to amortize their HBM weight loads, and decode utilization collapses.

The crossover is workload-dependent. For a typical MoE in 2026, decode batch sizes below ~32 produce notable underutilization. Above ~128, experts are well-fed.
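The B·k versus N argument is easy to put numbers on. The sketch below assumes perfectly uniform, independent routing, which real routers do not provide, so the "active experts" figure is an optimistic estimate; `expert_feed` is a hypothetical helper and the 256-expert top-8 shape matches the DeepSeek-V3-class models discussed in this guide.

```python
def expert_feed(batch, k, num_experts):
    """Expected tokens per expert and expected number of non-idle experts,
    assuming each token picks k distinct experts uniformly at random."""
    tokens_per_expert = batch * k / num_experts
    p_idle = (1 - k / num_experts) ** batch  # chance a given expert sees no token
    active = num_experts * (1 - p_idle)
    return tokens_per_expert, active

for batch in (1, 8, 32, 128, 512):
    per_expert, active = expert_feed(batch, k=8, num_experts=256)
    print(f"batch={batch:4d}  tokens/expert={per_expert:6.2f}  active experts~{active:6.1f}")
```

At batch 32 each expert still averages a single token, so its weights are loaded from HBM to do one token's worth of math. The average only reaches a handful of tokens per expert past batch 128, which lines up with the crossover described above.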

Implications for serving

  • High-QPS deployments fill expert batches naturally. MoE shines.
  • Low-QPS deployments see most experts under-fed. MoE suffers.
  • Multi-tenant aggregation is therefore especially valuable for MoE — pooling requests across users keeps experts loaded.

This is also part of why MoE is heavily favored by hosted providers (high QPS) and harder to justify for single-tenant on-prem deployments (lower QPS).

Throughput vs batch for DeepSeek-V3-class deployments

A representative curve, measured across published numbers and reproduced on rented H200 capacity for a 256-expert top-8 model on 16 GPUs with EP=16:

Decode batch | Tokens/s/GPU | Active experts/step | Notes
1 | 8 | ~8 of 256 | Most experts idle; all-to-all dominates
8 | 55 | ~50 | Routing imbalance hits hardest here
32 | 195 | ~160 | Approaching steady state
128 | 540 | ~256 (all hit) | All experts active most steps
256 | 720 | 256 | KV pressure starts
512 | 825 | 256 | Near compute ceiling

Two takeaways. First, low-batch decode is genuinely catastrophic for MoE — the expert utilization is below 5% of peak. Second, the curve flattens above ~256, because all experts are now compute-bound and additional batch only helps the weight amortization marginally.
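The "Active experts/step" column can be sanity-checked with a short calculation. The sketch below assumes perfectly uniform routing, which real routers do not have, so it gives an upper bound on the number of distinct experts hit: each token misses a given expert with probability 1 - k/N, and a batch of B tokens misses it with probability (1 - k/N)^B.

```python
def expected_active_experts(batch, num_experts=256, top_k=8):
    """Expected number of distinct experts hit by `batch` tokens under uniform routing."""
    miss_probability = (1 - top_k / num_experts) ** batch
    return num_experts * (1 - miss_probability)

def tokens_per_active_expert(batch, num_experts=256, top_k=8):
    """Average GEMM row count per expert that received at least one token."""
    return batch * top_k / expected_active_experts(batch, num_experts, top_k)

for batch in (1, 8, 32, 128, 256, 512):
    active = expected_active_experts(batch)
    rows = tokens_per_active_expert(batch)
    print(f"batch={batch:4d}  active~{active:6.1f}  tokens/active expert~{rows:5.1f}")
```

Under this assumption batch 8 reaches about 57 experts and batch 32 about 163, close to the table's ~50 and ~160; skewed routing pulls the real numbers lower. The second function shows the GEMM-shape problem directly: at batch 32 each active expert still sees fewer than two tokens.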

Multi-tenant aggregation: the hidden subsidy

Hosted providers' QPS advantage is structural. A single enterprise serving its own MoE at 1 req/s sees decode batches under 10 even at p99. A hosted provider aggregating across thousands of tenants sees decode batches in the hundreds at all times. The per-token economics differ by 3-5× as a result. This is the cleanest argument for using a hosted MoE API rather than self-hosting, unless your own workload already runs at hosted-provider-scale QPS.


Capacity factor and token drops

The capacity factor sets the maximum tokens per expert per batch:

capacity_per_expert = (capacity_factor × batch_size × k) / num_experts

A capacity factor of 1.0 allocates each expert its expected share. A factor of 1.5 allows 50% over-subscription before dropping.

Higher factor: fewer drops, more HBM allocated to per-expert token buffers, lower buffer utilization (slack capacity is wasted).

Lower factor: more drops, less HBM, more even utilization, possible quality degradation from dropped tokens.

Production deployments tune this per workload. Typical values: 1.0-1.5 in training, 1.25-2.0 in serving (where you can't easily retry a dropped token).
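A small sketch of the trade-off, using the capacity formula above. Rounding up with `ceil` is an assumption (frameworks differ on rounding), and the routed counts are hypothetical.

```python
import math

def capacity_per_expert(capacity_factor, batch_size, top_k, num_experts):
    """Per-expert buffer size in tokens, following the formula in the text."""
    return math.ceil(capacity_factor * batch_size * top_k / num_experts)

def drops_and_padding(tokens_per_expert, capacity):
    """Return (dropped assignments, empty buffer slots) for a fixed capacity."""
    dropped = sum(max(0, n - capacity) for n in tokens_per_expert)
    padded = sum(max(0, capacity - n) for n in tokens_per_expert)
    return dropped, padded

# Hypothetical routing result: 64 tokens x top-2 = 128 assignments over 8 experts.
counts = [40, 22, 16, 14, 12, 10, 8, 6]

for factor in (1.0, 1.25, 1.5):
    cap = capacity_per_expert(factor, batch_size=64, top_k=2, num_experts=8)
    dropped, padded = drops_and_padding(counts, cap)
    print(f"factor={factor}: capacity={cap}, dropped={dropped}, padded slots={padded}")
```

Raising the factor from 1.0 to 1.5 on this skewed example cuts drops from 30 to 16 but grows the empty buffer slots from 30 to 80, which is the HBM-versus-quality trade described above.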

What happens to dropped tokens

Several strategies:

  • Identity fallback. The dropped token bypasses the MoE layer; its representation passes through unchanged. Simple, common, slightly degrades quality.
  • Shared expert fallback. A small dense MLP that runs for all tokens regardless of routing. Drops fall back to it. Better quality, more compute. DeepSeek-V3 uses a variant of this.
  • Reroute. Send dropped tokens to the next-best expert. More balanced but adds synchronization rounds.

The "right" choice depends on quality budget and serving constraints.

Drop rates in practice and quality impact

In a well-tuned MoE at capacity factor 1.25, drop rates typically run 1-5% of tokens during inference. With a shared expert as fallback, the perceived quality impact is minimal (sub-0.5% on standard benchmarks). Without a shared expert, identity-fallback drops can produce noticeable artifacts on workload-specific tasks — for example, a math-heavy prompt that drops tokens routed to a math-specialist expert can produce visibly worse arithmetic reasoning. The fix is either a higher capacity factor (more HBM cost) or a shared expert (a small permanent compute cost). Modern frontier MoE designs lean toward the shared expert solution because the cost is bounded and the safety net is strong.

MegaBlocks and capacity-factor elimination

MegaBlocks (Gale et al., 2023) reframes MoE compute as block-sparse GEMM: instead of padding each expert's batch to a fixed capacity and dropping the overflow, it computes the actual variable-sized batches with custom kernels. This eliminates both the wasted compute from padding and the quality loss from drops, at the cost of a more complex kernel and slightly worse hardware utilization due to irregular shapes. For high-skew workloads, MegaBlocks is a 1.3-2× throughput win vs capacity-factor padding. For low-skew workloads, the wins are smaller and grouped GEMM is simpler. Production stacks (vLLM, SGLang) ship MegaBlocks-style kernels as an opt-in.


MoE under disaggregation

MoE and disaggregated prefill/decode interact in interesting ways.

Prefill side

Prefill processes the whole prompt in one parallel pass. Every MoE layer runs on every token, every all-to-all happens at full prompt-batch size. Communication amortizes well across the long prompt. Compute utilization is high.

Prefill workers for MoE want:

  • High FLOPs (it's compute-bound like dense prefill).
  • Enough HBM to hold all experts (or enough EP to partition them).
  • Fast inter-GPU bandwidth for the all-to-alls.

Decode side

Decode processes one new token per request per step. The all-to-all at each MoE layer moves single-token volumes across the fabric. Latency, not bandwidth, becomes the dominant cost.

Decode workers for MoE want:

  • High HBM capacity (resident experts plus KV cache).
  • Low-latency interconnect (rack-scale NVLink is ideal).
  • Large concurrent batches (to keep experts fed despite the per-token small-volume issue).

This is the main reason serious MoE deployments push for rack-scale fast fabrics. The decode-side all-to-all is the new bottleneck once disaggregation handles the prefill/decode split.

Prefill batching wins for MoE

Long prompts make MoE prefill especially efficient because the per-token dispatch volume amortizes across thousands of tokens. A 32k-token prefill running on a 256-expert MoE with EP=64 has all experts firing comfortably above their batch-saturation threshold during prefill; the per-token overhead is negligible. This is why long-context RAG workloads disproportionately favor MoE on the prefill side. See our long-context attention guide for the attention-side mechanics.

Cross-pool considerations

The KV cache transfer from prefill to decode doesn't differ much from dense — it's still the per-layer K and V tensors. But MoE prefill produces them on workers that may have wildly different layouts than the decode workers (different EP groupings, different replication), so the transfer layer has to handle the topology mismatch.

Asymmetric pool layouts for MoE disaggregation

A common pattern at frontier scale: prefill pool with smaller EP (EP=8 or EP=16 inside an NVSwitch node, optimized for compute) and decode pool with larger EP (EP=32 or EP=64 across a rack, optimized for HBM and decode bandwidth). The mismatch means the KV layout differs across the handoff. Practical resolution: KV cache is sharded by head and layer (not by expert), so EP differences do not affect the KV transfer directly. Expert weights themselves are static — they exist in both pools — and only the per-token routing decisions and expert dispatch differ. The transfer is therefore the same as dense, but the per-pool serving runtime must independently manage its EP layout.


Hardware and topology fit

MoE's hardware preferences differ from dense:

Resource | Dense priority | MoE priority
FLOPs/$ | High | Moderate
HBM bandwidth | High | High
HBM capacity | Moderate | High
NVLink within node | Moderate | High
Rack-scale fabric | Moderate | High
Cross-node IB | Moderate | Moderate

Concrete hardware fits for MoE in 2026:

  • B200 with NVL72-class rack fabric: the frontier. Designed (substantially) for MoE.
  • H100/H200 NVSwitch nodes: workable up to EP=8 within node. Above that, all-to-all crosses to IB and slows.
  • MI300X with Infinity Fabric: competitive HBM and bandwidth; software ecosystem still catching up for MoE.
  • Older parts (A100, MI250): suboptimal. Limited NVLink, no rack-scale fabric.

Topology rule: keep the EP group inside one fast-fabric domain. If your EP requires 64 GPUs and your fast-fabric domain holds 8, you have a problem.

Per-chip serving suitability

Chip | HBM | HBM BW (per GPU) | NVLink / Infinity Fabric BW | Suitable EP | MoE verdict
H100 SXM (80 GB) | 80 GB | 3.35 TB/s | 900 GB/s per GPU (8-GPU NVSwitch node) | 8 | Workable; below frontier
H200 SXM | 141 GB | 4.8 TB/s | 900 GB/s per GPU | 8 | Best 8-GPU node fit
B200 SXM | 192 GB | 8 TB/s | 1.8 TB/s per GPU | 8 (NVSwitch) or 72 (NVL72) | Frontier MoE serving
GB200 NVL72 | 192 GB × 72 | 8 TB/s | 130 TB/s rack aggregate | up to 72 | Designed for MoE
MI300X | 192 GB | 5.3 TB/s | 896 GB/s per GPU | 8 | Capable, software gap
MI325X | 256 GB | 6 TB/s | 896 GB/s per GPU | 8 | Strong on HBM
A100 80 GB | 80 GB | 2 TB/s | 600 GB/s per GPU | 8 | Possible but slow
L40S | 48 GB | 864 GB/s | none | 1 (no NVLink) | Inappropriate

The frontier reality: serious open-weight MoE serving in 2026 runs on GB200 NVL72, or falls back to H200/B200 NVSwitch 8-GPU nodes. MI300X/325X are competitive on raw specs but lag in tooling for MoE kernels. For the chip-level details see H100/H200/B200 architecture and the NVIDIA 2026 lineup.

Why rack-scale fabric was built for MoE

NVL72 (and rack-scale fabrics generally) was justified internally at NVIDIA in significant part by MoE dispatch costs. The math: a 256-expert top-8 MoE with EP=64 needs ~256 KB/token of all-to-all traffic per MoE layer. Multiply by 30 MoE layers, 256 tokens per decode step, 50 decode steps/s/GPU, and 72 GPUs in the rack: about 7 TB/s of aggregate all-to-all traffic demanded by serving alone. NVLink-class within-rack fabric delivers this with headroom. InfiniBand topologies max out below it: at one 400 Gb/s NIC per GPU, 72 GPUs supply about 3.6 TB/s. The bet that MoE would dominate the frontier drove the architecture decision, and both paid off.


Production deployments in 2026

DeepSeek-V3 / R1 lineage. 671B total params, 37B active, top-8 routing across 256 routed experts plus a shared expert. Trained on H800 with custom communication kernels (DeepEP) and auxiliary-loss-free load balancing via per-expert bias terms. The technical report (arXiv:2412.19437) and DeepSeekMoE (arXiv:2401.06066) describe the recipe. The lineage is the reference open-source MoE — its serving layout (EP across NVL-class fabric, shared expert always on, fine-grained experts) is now the default template.

Mixtral 8x7B / 8x22B (Mistral AI). 8 experts, top-2 routing, no shared expert. 8x7B is ~47B total / ~13B active; 8x22B is ~141B total / ~39B active. Conventional auxiliary load-balancing loss. The Mixtral paper (Jiang et al., 2024, arXiv:2401.04088) is the most accessible recipe for "small expert count, large experts." Production lesson: with only 8 experts, EP=8 inside a single NVSwitch node is sufficient, so the all-to-all stays NVLink-local and the operational complexity is dramatically lower than DeepSeek-V3-class deployments.

Llama 4 series (Meta). Llama 4 Scout and Maverick are MoE; Behemoth (still training) is the frontier-scale entry. Public detail is limited, but published material confirms top-1 routing and a relatively small expert count compared to DeepSeek. The line continues to lean on Meta's existing serving infrastructure, indicating that the more-experts-is-better trend has not fully won yet.

Qwen3-MoE. Alibaba's MoE line. Aggressive expert specialization, public serving stack support across vLLM and SGLang, integrated with their open-weight release cadence.

Hosted closed models. GPT-4-class, Claude-class, and Gemini-class hosted models are widely understood (though not always confirmed) to be MoE. Pricing and latency patterns are consistent with MoE serving.

Serving stacks with MoE-specific paths:

  • vLLM — production MoE serving, EP support.
  • SGLang — MoE with RadixAttention.
  • TensorRT-LLM — NVIDIA's stack, deeply integrated with MoE kernels and rack-scale fabric.
  • DeepSpeed-MII — Microsoft's MoE inference toolkit.

Stack feature matrix for MoE in May 2026

Feature | vLLM 0.8 | SGLang 0.4 | TRT-LLM 0.18 | DeepSpeed-MII
EP across nodes | Yes | Yes | Yes | Yes
EP + TP combined | Yes | Yes | Yes (most mature) | Partial
Expert replication | Yes | Yes | Yes | Limited
Block-sparse / MegaBlocks kernels | Yes (opt-in) | Yes | Custom (similar) | Yes
Shared expert support | Yes | Yes | Yes | Yes
Aux-loss-free routing (DeepSeek-style) | Yes | Yes | Yes | Yes
Disaggregated prefill/decode for MoE | Yes (beta) | Yes | Yes | Limited
DeepEP integration | Partial | Yes | No | No
FP8 expert weights | Yes | Yes | Yes (mature) | Yes
Per-expert quantization | Limited | Limited | Yes | Limited
Multi-tenant LoRA with MoE | Yes | Yes | Yes | Limited

Pragmatic call: TensorRT-LLM is the most mature on raw MoE serving performance for NVIDIA-only deployments; vLLM is the broadest fit; SGLang is the choice when prefix-tree workloads benefit from RadixAttention. DeepSpeed-MII has fallen behind on MoE features in the last 18 months.


Load balancing in production

Training-time mitigations (auxiliary loss, expert dropout, z-loss) are upstream of the problems you actually face in production. At serve time you have a learned router whose imbalance distribution is fixed; your job is to keep tail latency under control while honoring its decisions.

What real imbalance looks like. On a high-QPS DeepSeek-V3-class deployment, traffic to the top-decile of experts can run 3–5× the bottom-decile. The skew is partly persistent (some experts encode common patterns, e.g., code, English narrative) and partly workload-correlated — a burst of math-heavy traffic activates a different set of experts than a burst of conversational traffic. The first lesson: log per-expert token counts at minute granularity and never rely on averages.

Replication of hot experts. Once you know which experts are persistently hot, place them on multiple GPUs. The dispatch logic round-robins (or load-balances by current queue depth) across replicas. Costs HBM proportional to the replication factor, but eliminates the straggler problem deterministically. Most production stacks support replication factors per expert. Rule of thumb: replicate the top decile at 2× and you've removed the worst of the tail.
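The rule of thumb can be sketched as a placement helper. This is an illustration under simplifying assumptions, not any stack's API: the function names are hypothetical, and dispatch is assumed to split an expert's tokens evenly across its replicas.

```python
def replication_plan(tokens_per_expert, hot_fraction=0.1, hot_replicas=2):
    """Return {expert_id: replica_count}, replicating the busiest experts."""
    n = len(tokens_per_expert)
    n_hot = max(1, round(n * hot_fraction))
    ranked = sorted(range(n), key=lambda e: tokens_per_expert[e], reverse=True)
    hot = set(ranked[:n_hot])
    return {e: (hot_replicas if e in hot else 1) for e in range(n)}

def worst_replica_load(tokens_per_expert, plan):
    """Heaviest per-replica load, assuming an even split across replicas."""
    return max(tokens_per_expert[e] / plan[e] for e in plan)

# Hypothetical window of per-expert token counts: two persistent hot experts.
counts = [500, 420] + [100] * 18
no_replication = {e: 1 for e in range(len(counts))}
plan = replication_plan(counts)

print("extra expert copies:", sum(plan.values()) - len(counts))
print("straggler load before:", worst_replica_load(counts, no_replication))
print("straggler load after: ", worst_replica_load(counts, plan))
```

Here two extra expert copies halve the straggler load from 500 to 250 tokens, which is the HBM-for-tail-latency exchange the paragraph describes.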

Capacity-factor tuning. The capacity factor is the slack you allocate before tokens overflow. Too low (1.0): frequent drops, quality regressions. Too high (2.5+): wasted HBM in dispatch buffers, scheduling friction. Production sweet spot is usually 1.25–1.75 for serving. Tune empirically: run a workload-representative trace at decreasing capacity factors until token-drop rate or quality starts to move.

Per-request expert affinity. Some serving stacks (SGLang's RadixAttention pairs naturally with this) try to route requests with similar topical distributions to replicas warmed for those experts. Useful for prefix sharing across long-context requests but adds scheduler complexity.

Workload-aware routing snapshots. Periodically snapshot the empirical expert distribution and use it to bias scheduling (which replica receives this request) or capacity allocation (how much HBM to budget per expert). DeepSeek has published their auxiliary-loss-free bias updates as both a training and inference-time mechanism for this.

The straggler-pool pattern. Some teams maintain a small "straggler pool" of GPUs holding popular experts at higher replication, separate from the main EP layout. When the all-to-all detects an overflow that would otherwise drop, it spills to the straggler pool. Operationally complex; pays off on workloads with sharp daily skew.

Observability requirements. A serving stack that doesn't expose per-expert utilization, per-GPU dispatch volume, and capacity-overflow events is one you'll regret when incidents arrive. Pair with the rest of your vLLM/PagedAttention and eval infrastructure telemetry.

Concrete imbalance numbers from production

Published telemetry from DeepSeek's V3 deployment showed top-1% experts at 4.2× mean load over a 24-hour window, top-decile at ~2.5× mean. Mixtral 8x22B in vLLM benchmark traces shows similar skew at coarser granularity (one or two of the eight experts running 1.5-2× the mean). The skew is workload-dependent but never disappears — load-balancing losses reduce it, they do not eliminate it.

Why imbalance is more painful at inference than training

During training, the all-to-all happens on a batch of millions of tokens; the law of large numbers smooths per-expert counts so even imbalanced routing has manageable absolute load differences. At inference, decode operates on batches of hundreds of tokens, where a single skewed expert receiving 50% of routes is plausible. The relative imbalance is similar but the absolute load gap is sharper and harder to amortize. Production teams routinely see ITL p99 hits of 30-50% from this effect on weakly-balanced MoE serving.
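The law-of-large-numbers effect is easy to see in a toy simulation. The sketch below assumes a perfectly uniform router over 64 experts, so everything it shows is sampling noise alone; persistent router skew comes on top of it in real deployments. The assignment counts are illustrative, not measurements.

```python
import random
from collections import Counter

def max_over_mean(num_assignments, num_experts=64, seed=0):
    """Busiest expert's load divided by the mean load, for uniform random routing."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(num_experts) for _ in range(num_assignments))
    mean = num_assignments / num_experts
    return max(counts.values()) / mean

# Decode-like step: a few hundred token-to-expert assignments.
decode_ratio = max_over_mean(512)
# Training-like step: hundreds of thousands of assignments.
train_ratio = max_over_mean(200_000)

print(f"decode-scale max/mean: {decode_ratio:.2f}")
print(f"train-scale  max/mean: {train_ratio:.2f}")
```

Even with an ideal router, the busiest expert at decode scale carries a visibly larger multiple of the mean than at training scale, and the step time follows the busiest expert.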


When MoE wins

MoE is the right choice when:

  • You need high capability at fixed inference compute. MoE delivers more capability per active-FLOP than dense.
  • You serve high QPS. Expert batches stay full.
  • You have rack-scale or NVSwitch fabric. All-to-all is cheap.
  • You can afford HBM. Total parameters are large.
  • You aggregate users. Multi-tenant pooling fills experts.

These conditions are met for hosted providers, large enterprises with significant on-prem AI traffic, and labs training frontier models. They are not met for most personal projects and many smaller deployments.

MoE for batch and offline workloads

A regime where MoE wins decisively even without high QPS: large batch / offline workloads. When you can pad the decode batch to 1000+ requests because latency does not matter (overnight inference, retrieval indexing, evaluation runs), every expert is saturated and the all-to-all amortizes perfectly. Batch-mode MoE inference delivers some of the best $/token economics available, often beating both hosted APIs and dense self-hosted. The pattern is underused — many teams default to hosted APIs for batch work without checking whether their volume justifies a few hours on rented H200 capacity. For pricing math see AI inference cost economics.


When dense wins

Dense is the right choice when:

  • Low-QPS or single-tenant inference. Expert under-utilization makes MoE wasteful.
  • Limited HBM budget. MoE's total-parameter footprint is large.
  • Slow inter-GPU network. All-to-all over slow links destroys MoE throughput.
  • You need predictable latency. MoE's routing creates more variance than dense.
  • Edge deployments. Single-GPU or small-deployment settings favor dense.

This is part of why the open-source dense lineage (Llama dense models, Mistral dense, smaller Qwen dense) remains heavily used despite MoE dominating the frontier.

The 7-30B sweet spot for dense

The 7B-30B parameter range is the sweet spot for dense models in 2026: small enough to fit on a single H100 or H200, large enough to be useful, fast enough for sub-200 ms ITL on consumer-adjacent hardware. Llama 3.1 8B, Mistral 7B, Qwen 14B, and similar models continue to dominate this range. MoE has a weak story below 30B because the parameter overhead of having multiple experts dwarfs the benefit when total compute is small. For most production deployments outside hyperscalers, a well-tuned dense model in this range with FP8 weights and continuous batching is more cost-effective than chasing a frontier MoE.

A decision checklist

Use this to triage MoE vs dense for your workload:

Question | If yes | If no
Sustained QPS > 100/cluster? | +2 | -2
8+ GPUs with NVLink-class fabric? | +1 | -2
Total HBM budget > 500 GB? | +1 | -1
Multi-tenant aggregation possible? | +1 | -1
Sub-100 ms TTFT SLA? | -1 (variance) | 0
Frontier-quality target (top 10 on standard benchmarks)? | +2 | -1
Workload is on-prem, single team, low concurrency? | -2 | +1

Score ≥ 3: MoE wins. 0-2: pick by team familiarity. < 0: dense wins.
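The checklist is mechanical enough to encode. The sketch below copies the weights and thresholds from the table; the key names and the two example deployments are hypothetical.

```python
# (points if yes, points if no) for each checklist question, as in the table.
CHECKLIST = {
    "sustained_qps_over_100": (2, -2),
    "nvlink_class_fabric_8plus_gpus": (1, -2),
    "hbm_budget_over_500gb": (1, -1),
    "multi_tenant_aggregation": (1, -1),
    "sub_100ms_ttft_sla": (-1, 0),
    "frontier_quality_target": (2, -1),
    "on_prem_single_team_low_concurrency": (-2, 1),
}

def triage(answers):
    """answers maps each checklist key to True/False; returns (score, verdict)."""
    score = sum(
        CHECKLIST[q][0] if answers[q] else CHECKLIST[q][1] for q in CHECKLIST
    )
    if score >= 3:
        verdict = "moe"
    elif score >= 0:
        verdict = "either"
    else:
        verdict = "dense"
    return score, verdict

hosted_provider = {
    "sustained_qps_over_100": True,
    "nvlink_class_fabric_8plus_gpus": True,
    "hbm_budget_over_500gb": True,
    "multi_tenant_aggregation": True,
    "sub_100ms_ttft_sla": False,
    "frontier_quality_target": True,
    "on_prem_single_team_low_concurrency": False,
}
small_team = {
    "sustained_qps_over_100": False,
    "nvlink_class_fabric_8plus_gpus": True,
    "hbm_budget_over_500gb": True,
    "multi_tenant_aggregation": False,
    "sub_100ms_ttft_sla": True,
    "frontier_quality_target": False,
    "on_prem_single_team_low_concurrency": True,
}

print(triage(hosted_provider))
print(triage(small_team))
```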

Cost comparison at equivalent quality

For "GPT-4 class" quality, the cost-comparison is roughly:

Model class | Total params | Active params | Per-token serving cost (relative)
Dense 70B | 70B | 70B | 1.0×
MoE 8x22B (~140B/40B) | 141B | 39B | 0.65×
MoE 671B / 37B active (V3-class) | 671B | 37B | 0.50× (at hosted scale)
Dense 405B | 405B | 405B | 3.5×

The hosted-scale qualifier matters: the same DeepSeek-V3 self-hosted at low QPS easily costs 1.2-1.5× a dense 70B. The economics are not a property of the model — they are a property of the model × deployment scale.


Open problems

Routing quality at very small batch. Below batch ~32 per expert, the GEMM is too small for the GPU. Solutions: pad with zeros (wastes compute), reroute (adds latency), batch across replicas (requires coordination). None ideal.

MoE on heterogeneous hardware. Running some experts on AMD, some on NVIDIA, in one EP group. Cost-attractive, software-painful.

Continual learning in MoE. Adding experts post-hoc to a trained MoE, or rebalancing routing as workloads drift. Active research.

Privacy of routing. Side-channel: which expert a token routes to leaks information about the token. For sensitive workloads, this matters. Mitigations are early.

MoE-aware speculative decoding. Draft models for MoE targets have to predict routing as well as tokens. Naive speculation underperforms; specialized methods exist but are immature.

Expert offloading and just-in-time loading. For deployments without enough HBM for all experts, schemes that load experts from CPU memory on demand have been proposed. Practical only at low QPS: at high QPS nearly every expert is hit on every step, so the working set is the full expert set and host-to-GPU transfers cannot keep up.

Routing stability under drift. As workloads shift over weeks or months, the router's expert distribution drifts. No good story exists yet for incrementally rebalancing without retraining. Periodic full retraining is the current workaround.

Asymmetric per-expert quantization

A promising open direction: run different experts at different precisions based on their utility and traffic share. Heavily-used experts at higher precision (BF16), lightly-used or specialist experts at FP4. The mixed precision is opaque to the rest of the system because each expert is a self-contained module. Early results show 30-40% HBM savings with negligible quality impact on Mixtral-class models. Production adoption is held back by tooling — most quantization libraries treat a model as uniform.


Quantization for MoE

The MoE-specific quantization story differs from dense quantization.

Why MoE is more quantization-sensitive

Each expert is a smaller subnet of the model. A quantization error that's negligible in a dense 70B FFN can be meaningful in a 7B-equivalent expert that's only invoked for ~25% of tokens. Per-expert calibration becomes important — the activations into expert E only come from tokens that route to E, so the calibration distribution is narrower.

Per-expert calibration

Standard PTQ (post-training quantization) for dense models calibrates on a sample of activations through the model. For MoE, the calibration must run enough tokens through each expert to characterize its activation distribution. With 256 experts and top-8 routing, ~32× more calibration tokens are needed than for a dense model with the same active parameter count.

In practice: production MoE PTQ uses 10–100k calibration tokens (vs. 1–10k for dense), explicitly tracking which tokens hit which experts. Some experts inevitably get few tokens and stay at higher precision (BF16 fallback for cold experts).
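The 32× figure and the cold-expert problem both fall out of one division. A rough sketch, assuming each expert needs some target number of calibration tokens; the target of 1,000 and the cold-expert share of 1/128 are hypothetical.

```python
import math

def calibration_multiplier(num_experts, top_k):
    """Extra calibration tokens vs. dense, assuming uniform routing."""
    return num_experts / top_k

def tokens_for_coldest_expert(target_per_expert, coldest_share):
    """Total calibration tokens so the coldest expert still sees the target.

    coldest_share is the fraction of all tokens that route to that expert.
    """
    return math.ceil(target_per_expert / coldest_share)

print(calibration_multiplier(256, 8))              # DeepSeek-V3-style layout
print(calibration_multiplier(8, 2))                # Mixtral-style layout
print(tokens_for_coldest_expert(1000, 1 / 32))     # uniform share for 256 experts, top-8
print(tokens_for_coldest_expert(1000, 1 / 128))    # an expert 4x colder than uniform
```

The last line is why a BF16 fallback for cold experts is common: covering an expert that receives a quarter of its uniform share quadruples the calibration set for everyone.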

Mixed-precision experts

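A common pattern: experts that are routed to often get more aggressive quantization (FP8 or FP4), because they see enough calibration tokens to quantize safely and they account for most of the weight traffic; rarely used experts stay at BF16 because their calibration data is thin and their amortized per-token cost is small. The mixed-precision layout adds bookkeeping but doesn't change the serving topology meaningfully.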

FP8 weight-only

The default 2026 quantization for MoE inference: weights at FP8, activations at BF16 or FP16. Quality regression is small (<1% on most benchmarks); memory footprint roughly halves; throughput on FP8-capable hardware (Hopper, Blackwell) climbs proportionally. The standard recipe for Mixtral, DeepSeek, DBRX inference deployments.

FP4 for Blackwell

MXFP4 weights with FP8 or BF16 activations. Cuts weight memory another ~2×. Quality cost is more visible (1–3% on hard benchmarks), particularly on the rarely-used experts where calibration is weak. The pattern is standard for Blackwell MoE inference where the memory savings are needed (671B-class models fit on a single rack in MXFP4).
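A back-of-envelope sketch of the weight footprint at each precision. It ignores block-scale metadata (so real MXFP4 footprints are slightly larger), and the 60% share of HBM reserved for weights is a hypothetical planning number, with the remainder left for KV cache and activations.

```python
import math

# Bytes per parameter; MXFP4 scale overhead is ignored in this sketch.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_gb(total_params_billion, fmt):
    """Weight footprint in GB (1e9 params x bytes per param / 1e9)."""
    return total_params_billion * BYTES_PER_PARAM[fmt]

def min_gpus(total_params_billion, fmt, hbm_gb_per_gpu, weight_fraction=0.6):
    """GPUs needed if only `weight_fraction` of each GPU's HBM holds weights."""
    usable = hbm_gb_per_gpu * weight_fraction
    return math.ceil(weight_gb(total_params_billion, fmt) / usable)

for fmt in ("bf16", "fp8", "fp4"):
    print(fmt, weight_gb(671, fmt), "GB ->", min_gpus(671, fmt, 192), "x 192 GB GPUs")
```

For a 671B-parameter model this gives roughly 1,342 GB, 671 GB, and 336 GB of weights, which is the difference between needing a dozen 192 GB GPUs for weights alone and needing three.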

Router precision

The router stays at FP32 or BF16 across quantization regimes. The router is small (a tiny fraction of FLOPs and memory) and its precision matters for routing decisions. Quantizing the router rarely helps and often hurts.

KV cache quantization for MoE

KV cache quantization (INT8, INT4) works the same for MoE as for dense — no MoE-specific considerations. The KV cache lives in attention, which is shared across MoE layers' experts. See KV cache memory math for the full quant story; everything there applies.


When to choose specific MoE configs

A short decision guide for picking MoE hyperparameters when training or fine-tuning.

Number of experts

  • 8 experts: minimum that's competitive; good for small-scale (10–50B total). Mixtral. Easy to balance.
  • 16–32 experts: mid-scale; Llama-4 Scout territory. Modest sparsity, good balance properties.
  • 64–128 experts: large scale; DBRX, Arctic. Real benefits from fine-grained specialization.
  • 256+ experts: frontier; DeepSeek-V3. Aggressive sparsity; requires careful balance management.

Top-k

  • k=1: cheapest; needs strong auxiliary loss. Token dropping risk if capacity is tight.
  • k=2: most common; good quality/cost balance. Mixtral, Grok.
  • k=4–8: frontier quality; needs more all-to-all bandwidth. DBRX, DeepSeek-V3.

Shared experts

DeepSeek's pattern: 1–2 shared experts that fire on every token, alongside the top-k routed experts. The shared expert provides a baseline of compute on every token; the routed experts add capability. Pattern adopted by Qwen2-MoE, Snowflake Arctic, Hunyuan-Large.

When to use shared experts: when you want a quality baseline that doesn't depend on the router. When NOT to: when you want maximum sparsity for compute efficiency.

Auxiliary loss weight

If using token-choice routing with aux loss: typical weight is 0.01–0.1 of the main LM loss. Too low: experts collapse. Too high: aux loss dominates and quality suffers. Tune empirically per architecture.

Capacity factor

Typical: 1.0–1.5. 1.0 means each expert handles exactly its expected share of tokens; >1.0 leaves headroom for routing variance. Too low: token drops in training. Too high: wasted compute. 1.25 is a common production default.


Routing strategies deep dive

The routing function is small in code (a linear layer plus softmax plus top-k) but high in impact. The 2026 production landscape covers a half-dozen distinct strategies.

Token-choice top-k (the default)

Each token computes scores for all experts via a linear projection; softmax; pick top-k experts; weight outputs by softmax probabilities. The mainstream approach used by Mixtral, DeepSeek, Qwen, DBRX, Grok. Top-k values: 1 (Switch Transformer style, cheapest), 2 (Mixtral, Grok — balanced quality/cost), 4–8 (DBRX, DeepSeek-V3 — frontier quality).

Strengths: simple; works with standard auxiliary loss for balance; well-understood failure modes. Weaknesses: no built-in guarantee against router collapse; balance depends on auxiliary loss being well-tuned.
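A minimal pure-Python sketch of token-choice top-k for a single token. Renormalizing the selected probabilities so the gate weights sum to 1 is one common choice and is an assumption here; some models use the raw softmax probabilities instead. The router weights are toy values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(hidden, router_weights, k):
    """Return [(expert_id, gate_weight), ...] for the k highest-scoring experts.

    hidden: list of d floats (one token's activation).
    router_weights: N rows of d floats (one row per expert).
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]

# Toy router: 4 experts, hidden size 2, giving logits 2, 1, 0, -1.
router_weights = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
routes = route_top_k([1.0, 0.0], router_weights, k=2)
print(routes)
```

The layer output for the token is then the gate-weighted sum of the selected experts' FFN outputs.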

Expert-choice routing

Inverted: each expert chooses its top-N tokens from the batch. Guarantees perfect load balance by construction; no aux loss needed. Strengths: clean implementation; no balance failure modes. Weaknesses: requires batch-aware operation (each expert needs the full batch's scores); some tokens get no expert; serving requires a fallback path.

Used in research; production deployments are rare. The serving-time variant (each replica of an expert chooses tokens from its assigned batch) is conceptually attractive but operationally complex.
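The inversion, and the unrouted-token weakness, show up clearly in a toy version. This is a sketch with hypothetical scores, not a serving implementation: each expert simply takes its `capacity` highest-scoring tokens.

```python
def expert_choice(scores, capacity):
    """Each expert picks its `capacity` best tokens.

    scores[t][e] is the router affinity of token t for expert e.
    Returns ({expert: [token ids]}, [tokens no expert picked]).
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    covered = {t for tokens in assignment.values() for t in tokens}
    unrouted = [t for t in range(num_tokens) if t not in covered]
    return assignment, unrouted

# Hypothetical affinities for 4 tokens and 2 experts.
scores = [
    [0.9, 0.8],
    [0.7, 0.1],
    [0.2, 0.6],
    [0.1, 0.2],
]
assignment, unrouted = expert_choice(scores, capacity=2)
print(assignment, unrouted)
```

Every expert receives exactly two tokens, but token 0 is processed twice and token 3 by no expert at all, which is why serving needs a fallback path.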

Hash routing

Route based on a hash of the token (deterministic, content-independent). Trivial to implement; perfect load balance. Strengths: no learned router to fail. Weaknesses: experts don't specialize meaningfully because routing doesn't reflect content; quality lags learned routing.

Used as a baseline in research and as a fallback when learned routers fail; not a production pattern.

Soft routing (mixture of softmaxes)

Instead of top-k, compute a weighted combination of all experts' outputs based on the full softmax. Strengths: differentiable everywhere, no discrete decisions. Weaknesses: every token activates every expert — defeats the FLOP-saving point of MoE. Used in some specialized contexts; not for serving frontier models.

Sinkhorn routing

Use a Sinkhorn iteration to enforce balanced routing across both tokens and experts. Optimal-transport flavor. Strengths: rigorous balance. Weaknesses: extra computation per layer; not widely adopted in production.
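A generic Sinkhorn sketch, not tied to any particular paper's variant: alternately rescale columns so every expert receives an equal share of probability mass, and rows so every token's probabilities sum to 1. The iteration count and logits are hypothetical.

```python
import math

def sinkhorn(logits, iters=20):
    """Balance a token-by-expert routing matrix by alternating normalisation."""
    num_tokens, num_experts = len(logits), len(logits[0])
    p = [[math.exp(x) for x in row] for row in logits]
    target = num_tokens / num_experts  # equal mass per expert
    for _ in range(iters):
        # Column step: each expert's total mass becomes `target`.
        for e in range(num_experts):
            col = sum(p[t][e] for t in range(num_tokens))
            for t in range(num_tokens):
                p[t][e] *= target / col
        # Row step: each token's probabilities sum to 1.
        for t in range(num_tokens):
            row = sum(p[t])
            p[t] = [v / row for v in p[t]]
    return p

# Every token prefers expert 0 before balancing.
logits = [[2.0, 0.0], [2.0, 0.0], [1.0, 0.0], [1.0, 0.5]]
balanced_probs = sinkhorn(logits)
column_mass = [sum(row[e] for row in balanced_probs) for e in range(2)]
print([[round(v, 3) for v in row] for row in balanced_probs])
print([round(m, 3) for m in column_mass])
```

The nested loops are the "extra computation per layer" the text mentions: a few full passes over the token-by-expert matrix before any expert GEMM can start.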

Auxiliary-loss-free balancing (DeepSeek-V3)

A per-expert bias adjusted online during training to balance expert load. Eliminates the auxiliary loss that competes with the LM objective. Already discussed elsewhere; mentioned here for the routing-comparison context.
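A sketch of the mechanism as it is usually described: the bias is added to the router scores only when choosing experts, and after each step it is nudged down for overloaded experts and up for underloaded ones. The step size, function names, and sample numbers are hypothetical.

```python
def select_top_k(scores, bias, k):
    """Pick experts by score + bias; gate weights would still use raw scores."""
    ranked = sorted(
        range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True
    )
    return ranked[:k]

def update_bias(bias, tokens_per_expert, step=0.001):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, n in zip(bias, tokens_per_expert):
        if n > mean:
            new_bias.append(b - step)
        elif n < mean:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias

scores = [0.50, 0.49, 0.10, 0.10]
print(select_top_k(scores, [0.0, 0.0, 0.0, 0.0], k=1))    # expert 0 wins
print(select_top_k(scores, [-0.02, 0.0, 0.0, 0.0], k=1))  # bias tips it to expert 1
print(update_bias([0.0] * 4, [10, 2, 0, 4]))
```

Because the bias never enters the loss, it steers load without adding a gradient term that competes with the LM objective.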

Comparison

Routing | Load balance mechanism | Per-token FLOPs | Quality (relative) | Production use
Token-choice top-1 | Aux loss | 1× FFN | Baseline | Switch Transformer
Token-choice top-2 | Aux loss | 2× FFN | +5% over top-1 | Mixtral, Grok
Token-choice top-8 | Aux loss + aux-loss-free bias | 8× FFN | +10% over top-2 | DeepSeek-V3
Expert-choice | Built-in | Variable | Comparable to top-2 | Research
Hash | Built-in | k× FFN | Lower | Baseline only
Soft | None needed | All experts | Highest | Not for serving
Auxiliary-loss-free | Bias adjustment | k× FFN | Comparable to aux-loss | DeepSeek-V3

Communication patterns deep dive

The MoE all-to-all and surrounding communication patterns merit their own treatment.

Dispatch and combine

Every MoE layer has two communication phases. Dispatch: each token's activations are sent to the GPUs holding its top-k experts. Combine: the experts' outputs are sent back to the GPU holding the token, where they're weighted by router probabilities and combined.

Each phase is an all-to-all collective. The total volume per layer per token is 2 × top-k × hidden_dim elements (dispatch out + combine back), or 2 × top-k × hidden_dim × 2 bytes in BF16. For DeepSeek-V3 with top-8 and hidden_dim=7168, that's ~230 KB per token per layer. Across 61 layers and a batch of 4096 tokens: ~57 GB per forward pass. Spread across 64 EP ranks: ~900 MB per rank per forward.

Why NVLink matters

At NVL72's 1.8 TB/s NVLink bandwidth per GPU, 900 MB transfers in ~0.5 ms. Cross-rack on 400 Gb/s InfiniBand (50 GB/s), the same 900 MB takes ~18 ms, roughly 36× slower. Without rack-scale NVLink, the all-to-all becomes the bottleneck. This is the structural argument for GB200 NVL72 (and comparable rack-scale fabrics) for frontier MoE serving.
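The volume and transfer-time arithmetic can be reproduced in a few lines. This is a bandwidth-only sketch: it ignores link latency, protocol overhead, and the share of tokens whose experts are local, and it treats all 61 layers as MoE layers, as the text does.

```python
def all_to_all_bytes_per_token_layer(top_k, hidden_dim, bytes_per_elem=2):
    """Dispatch plus combine volume for one token in one MoE layer."""
    return 2 * top_k * hidden_dim * bytes_per_elem

def transfer_ms(num_bytes, bandwidth_bytes_per_s):
    """Pure bandwidth time in milliseconds."""
    return num_bytes / bandwidth_bytes_per_s * 1e3

per_token_layer = all_to_all_bytes_per_token_layer(top_k=8, hidden_dim=7168)
per_forward = per_token_layer * 61 * 4096   # layers x tokens in the batch
per_rank = per_forward / 64                 # spread over 64 EP ranks

nvlink_ms = transfer_ms(per_rank, 1.8e12)   # 1.8 TB/s per GPU
ib_ms = transfer_ms(per_rank, 400e9 / 8)    # 400 Gb/s = 50 GB/s

print(f"{per_token_layer / 1e3:.0f} KB per token per layer")
print(f"{per_forward / 1e9:.1f} GB per forward, {per_rank / 1e6:.0f} MB per rank")
print(f"NVLink: {nvlink_ms:.2f} ms, InfiniBand: {ib_ms:.1f} ms")
```

Switching `bytes_per_elem` to 1 models an FP8 all-to-all and halves every number downstream.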

DeepSeek's communication overlap

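DeepSeek-V3's tech report describes how they overlap MoE communication with computation: two micro-batches are processed together, so that while the dispatch and combine all-to-alls for one micro-batch are in flight, the GPUs compute attention and expert GEMMs for the other. (Layer L+1's attention cannot start before layer L's MoE output arrives for the same tokens, so the overlap has to come from a second micro-batch.) This requires double-buffered activations and careful scheduling, but it hides most of the MoE communication latency. The pattern is now standard in production MoE engines.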

Mixed-precision all-to-all

The activations sent in dispatch/combine can be quantized to FP8 (or FP4 on Blackwell) to halve the communication volume. The cost is some numerical accuracy loss, but it's small (the activations are reduced after combine). DeepSeek-V3 uses FP8 for the dispatch all-to-all and keeps combine in BF16; the perf win is significant on cross-rack deployments.

Expert pipeline overlap

A more aggressive pattern: pipeline different experts' computations within the same layer. While expert A processes its tokens, expert B's tokens are being dispatched. Requires careful kernel scheduling but reduces effective serial latency.

DeepEP and the 2025 communication libraries

The DeepEP library (github.com/deepseek-ai/DeepEP) released by DeepSeek in early 2025 provides production-grade MoE communication primitives: fused dispatch/combine, FP8 all-to-all, NVLink-aware routing. By mid-2026, DeepEP is integrated into vLLM, SGLang, and several internal engines at frontier labs. The 30–50% MoE serving throughput gains from DeepEP integration are well-documented.


Per-architecture deep dive

The 2026 MoE landscape is dominated by a small number of architectures, each with distinctive design choices that propagate to serving requirements.

Mixtral 8x7B and 8x22B

Mistral's first MoE releases (December 2023 and April 2024). 8 experts per layer, top-2 routing. The "7B" and "22B" are nominal: only the FFN blocks are replicated per expert, while attention and embeddings are shared, which is why the totals are ~47B and ~141B rather than 56B and 176B. Active parameters per token: ~13B for 8x7B, ~39B for 8x22B. Sparsity ratio (active/total): ~28%. The architectures launched the open-source MoE era; serving them is straightforward on 1–2 GPUs with FP8 or lower-precision weights (8x7B fits on one 80 GB H100, 8x22B on two), while BF16 weights need roughly twice that. Routing is standard top-k softmax with an auxiliary load-balancing loss.

DeepSeek-V2 (236B) and DeepSeek-V3 (671B)

DeepSeek's MoE designs pushed sparsity dramatically. V2 has 160 experts per layer with top-6 routing and 2 shared experts; total 236B parameters, active per token 21B. V3 doubles down: 256 experts per layer with top-8 routing and 1 shared expert; total 671B parameters, active per token 37B. Sparsity ratio dropped to ~5.5% — much more aggressive than Mixtral.

V3's serving innovations documented in the tech report (arXiv:2412.19437): auxiliary-loss-free load balancing via a per-expert bias term that gets adjusted online; FP8 training and mixed-precision serving; aggressive expert parallelism (64-way EP in training; 32-way EP for prefill and 320-way EP for decode in their H800 inference deployment); fused all-to-all + MoE-dispatch kernels.

Qwen2-MoE 14B and 57B

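Alibaba's Qwen MoE family. The 14B model (released as Qwen1.5-MoE-A2.7B) has 60 routed experts per layer, top-4 routing, ~3B active. The 57B model (Qwen2-57B-A14B) has 64 routed experts, top-8 routing, ~14B active. Both add shared experts that every token passes through. Smaller and more practical for single-node serving than DeepSeek-V3.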

Snowflake Arctic

128 experts, top-2 routing per layer, with a dense base model running in parallel (a "Dense-MoE Hybrid"). 480B total parameters, ~17B active per token. The hybrid design means every token also runs through a smaller dense block, providing a baseline of compute on every token; the MoE adds capability per active-FLOP. Snowflake has been candid about the engineering challenges; the hybrid pattern hasn't been widely copied.

Databricks DBRX

132B total, 16 experts per layer, top-4 routing, ~36B active per token. Fine-grained routing (more experts, higher top-k) at the cost of more routing computation. DBRX's open release was significant for putting a high-quality MoE in the open-source community; serving it is mainstream on 4–8 GPU nodes.

Llama-4 Maverick and Scout (2025)

Meta's MoE entries. Maverick is the larger model (~400B total, 128 routed experts); Scout is smaller (~109B total, 16 experts). Both activate ~17B parameters per token and are natively multimodal. Routing is token-choice top-1 over the routed experts, plus a shared expert that every token passes through. Serving on Meta-scale infrastructure used custom expert-parallel kernels; the open releases work on standard inference engines with some perf tuning.

Grok-1 (314B)

xAI's MoE model. 8 experts, top-2 routing, ~78B active per token. Released open-source in March 2024 under Apache 2.0. The architecture is straightforward Mixtral-style; the interesting aspect was the scale of the open release at the time.

Tencent Hunyuan-Large (389B)

Tencent's open MoE. 16 routed experts with top-1 routing per layer, plus one shared expert that is always active; ~52B active. The top-1 routing with a shared expert is unusual; designed for efficient single-GPU expert assignment in their serving infrastructure.

AI21 Jamba-MoE

Hybrid architecture combining Mamba (state-space) layers with transformer-MoE layers. The MoE layers have 16 experts with top-2 routing. The Mamba layers reduce KV cache memory for long contexts; the MoE layers carry the capability. A novel design that hasn't been widely adopted but illustrates the design space.

Comparison table

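| Model | Total params | Experts/layer | Top-k | Shared experts | Active per token | Sparsity (active/total) |
| --- | --- | --- | --- | --- | --- | --- |
| Mixtral 8x7B | 47B | 8 | 2 | 0 | 13B | 28% |
| Mixtral 8x22B | 141B | 8 | 2 | 0 | 39B | 28% |
| DeepSeek-V2 | 236B | 160 | 6 | 2 | 21B | 8.9% |
| DeepSeek-V3 | 671B | 256 | 8 | 1 | 37B | 5.5% |
| Qwen2-MoE 57B | 57B | 64 | 8 | 8 | 14B | 25% |
| Snowflake Arctic | 480B | 128 | 2 | dense base | 17B | 3.5% |
| Databricks DBRX | 132B | 16 | 4 | 0 | 36B | 27% |
| Llama-4 Maverick | ~400B | 128 | 1 | 1 | ~17B | <5% |
| Grok-1 | 314B | 8 | 2 | 0 | 78B | 25% |
| Hunyuan-Large | 389B | 16 | 1 | 1 | 52B | 13% |
| Jamba-MoE | 52B (v0.1) | 16 | 2 | 0 | 12B (v0.1) | 23% |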
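The sparsity column is simply active parameters divided by total parameters. A short sketch that recomputes it for the rows with firm numbers in the table (values in billions, copied from the table):

```python
# (total params, active params) in billions, from the comparison table
MODELS = {
    "Mixtral 8x7B": (47, 13),
    "Mixtral 8x22B": (141, 39),
    "DeepSeek-V2": (236, 21),
    "DeepSeek-V3": (671, 37),
    "Snowflake Arctic": (480, 17),
    "Databricks DBRX": (132, 36),
    "Grok-1": (314, 78),
    "Hunyuan-Large": (389, 52),
}


def sparsity(total_b: float, active_b: float) -> float:
    """Fraction of parameters that are active on any one token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    print(f"{name:18s} {sparsity(total_b, active_b):6.1%}")
```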

What the spread tells us

Frontier MoE models in 2026 trend toward higher expert count and more aggressive sparsity (DeepSeek-V3, Llama-4 Maverick). Open-source models that need to run on commodity hardware stay closer to Mixtral-style (8 experts, top-2): easier to serve on 1–2 GPUs, lower routing overhead, fewer load-balancing failure modes. The choice of architecture is driven heavily by the deployment target: frontier-only or open-and-self-hostable.


Composed parallelism on GB200 NVL72

Serving a 671B MoE like DeepSeek-V3 requires composing expert parallelism (EP), tensor parallelism (TP), and pipeline parallelism (PP). The GB200 NVL72 — NVIDIA's 72-GPU rack with all-NVLink connectivity — is the canonical hardware target.

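An illustrative NVL72 topology for DeepSeek-V3

The layout used in this section: EP=64 for the MoE layers (each GPU holds 4 experts per layer), TP=8 for the attention layers, PP=8 for the model overall. This is an illustrative target for NVL72, not DeepSeek's published configuration: the tech report's own H800 deployment uses 32-way EP with TP=4 attention for prefill and 320-way EP for decode, and its training run uses 16-way PP with 64-way EP. Total compute: 64 GPUs for the MoE expert blocks plus the attention blocks distributed across PP stages. On GB200 NVL72, this fits comfortably with room for replication and warm spare experts.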

Why EP=64

64 GPUs × 4 experts/GPU = 256 experts total. The all-to-all collective at every MoE layer carries token activations across all 64 expert-holding GPUs. At NVL72's NVLink bandwidth (1.8 TB/s per GPU bidirectional), the bandwidth cost of one layer's all-to-all is single-digit microseconds for typical batch sizes, small relative to expert compute time.
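To make the layout concrete, the sketch below maps expert ids to EP ranks and counts how many token copies each rank receives. It assumes contiguous placement (experts 0–3 on rank 0, 4–7 on rank 1, and so on), which is one common choice rather than a documented DeepSeek layout, and it counts one copy per (token, expert) pair even though real kernels may deduplicate copies bound for the same rank.

```python
from collections import Counter

NUM_EXPERTS = 256
EP_SIZE = 64
EXPERTS_PER_RANK = NUM_EXPERTS // EP_SIZE  # 4 experts per GPU


def expert_to_rank(expert_id: int) -> int:
    """Contiguous placement: consecutive expert ids share a rank."""
    return expert_id // EXPERTS_PER_RANK


def dispatch_counts(topk_expert_ids):
    """topk_expert_ids holds one list of expert ids per token."""
    counts = Counter()
    for experts in topk_expert_ids:
        for expert_id in experts:
            counts[expert_to_rank(expert_id)] += 1
    return counts


# Two tokens, top-8 each (hypothetical routing decisions)
routed = [
    [0, 1, 2, 3, 100, 101, 200, 255],
    [4, 5, 6, 7, 8, 9, 10, 11],
]
counts = dispatch_counts(routed)
print(sorted(counts.items()))
```

A skewed `counts` histogram is exactly the hotspot that expert replication and rebalancing try to flatten.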

TP=8 for attention

Attention isn't expert-parallel; it's standard tensor-parallel across 8 GPUs (one TP group per pipeline stage). Splits Q, K, V projections across GPUs; all-reduce inside the attention block.

PP=8

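Layer groups distributed across 8 pipeline stages. Each stage runs on a subset of GPUs that holds attention (TP=8) and that stage's experts. The bookkeeping matters here: if the 64 GPUs are split into 8 stages of 8 GPUs, each layer's 256 experts shard 8 ways within its stage (32 experts per GPU for that stage's layers), so "EP=64 with 4 experts per GPU" holds per layer only when a layer's experts span all 64 GPUs. Microbatch-level pipelining overlaps compute and communication.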

Fitting on NVL72

NVL72 has 72 GPUs. The layout here uses 64 of them for the active model + 8 for replicas and standby. The all-NVLink topology means cross-stage and cross-EP communication doesn't leave the rack: bandwidth is uniform within the rack, latency is microseconds. Without NVL72-class hardware, the same model has to span racks via InfiniBand, which at 400 Gb/s (50 GB/s) per port offers roughly 36× less per-GPU bandwidth than 1.8 TB/s NVLink and adds materially to all-to-all latency.

Other 2026 deployments

Other frontier MoE deployments (Mixtral, Llama-4, DBRX) use less aggressive parallelism because the models are smaller. A typical Mixtral 8x22B deployment: TP=4 EP=8 PP=1 on a single H100 node, no need for cross-node MoE communication. The serving complexity scales with model size.


MoE inference engines

The inference engines have been catching up to MoE-specific needs through 2024–2026.

TensorRT-LLM

NVIDIA's reference inference engine. MoE support added in 2024; mature on Hopper, evolving on Blackwell. Strengths: peak perf on NVIDIA hardware; CUTLASS-based grouped GEMM for expert computation; tight integration with NCCL for all-to-all. Weaknesses: the code is published under Apache 2.0, but it depends on the proprietary TensorRT runtime; opinionated about deployment topology. Best fit for production NVIDIA deployments at scale.

vLLM

The dominant open-source serving engine. MoE support is solid for Mixtral-class models and DeepSeek-V2/V3 with some perf tuning. Uses Triton kernels for expert dispatch and combine; NCCL for cross-rank all-to-all. Strengths: easy to deploy, large community, integrates with most quantization formats. Weaknesses: lags TRT-LLM by 10–20% on Hopper for very large MoE.

SGLang

The other dominant open-source engine. Strong MoE support, particularly for DeepSeek-V3 (where the SGLang team published reference deployment configs). Often matches or exceeds vLLM on MoE workloads.

llama.cpp MoE

CPU/Metal/Vulkan-targeted runtime. MoE support for Mixtral and DeepSeek (CPU offload for the experts not active on a given token). Practical only for small MoE models or large MoE on memory-rich machines with slow per-token latency tolerance.

MegaBlocks

A research-grade MoE library, originally built for training. Its block-sparse matmul kernels implement "dropless" MoE: no token dropping and no padding for non-uniform expert assignment. Strengths: novel approaches; sometimes faster than the mainstream engines on specific workloads. Weaknesses: less polished as a production engine.

FasterMoE

A predecessor research project that influenced production MoE designs. Less directly used in production by 2026 but historically important.

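DeepSpeed-MoE (and where DeepEP fits)

Microsoft's contribution is DeepSpeed-MoE, which covers MoE training and inference inside the broader DeepSpeed ecosystem; it is used in some production deployments but is less common than vLLM/SGLang/TRT-LLM. DeepEP is not a Microsoft project or a DeepSpeed successor: it is DeepSeek's expert-parallel communication library (covered above), and it plugs into engines rather than being one.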

Comparison

| Engine | MoE support | Hopper perf | Blackwell perf | Open-source | Best for |
| --- | --- | --- | --- | --- | --- |
| TensorRT-LLM | Mature | Peak | Mature | Yes (needs proprietary TensorRT) | Large-scale NVIDIA prod |
| vLLM | Mature | Strong | Good | Yes | Open-source default |
| SGLang | Mature | Strong | Good | Yes | DeepSeek-V3 specifically |
| llama.cpp | Basic | N/A | N/A | Yes | CPU / consumer hardware |
| MegaBlocks | Research-grade | Specific wins | Lag | Yes | Block-sparse experiments |
| DeepSpeed-MoE | Mature | Strong | Good | Yes | DeepSpeed-shop |

MoE on Blackwell

Blackwell (B100, B200, GB200) changes the MoE serving economics meaningfully.

FP4 experts

Blackwell's native FP4 support means MoE experts can live in HBM at FP4 (with per-block scales). For DeepSeek-V3 (671B params), FP4 weights occupy ~340 GB instead of ~1.3 TB at BF16, which fits comfortably in a single NVL72 rack's HBM (72 GPUs × 192 GB B200 = 13.8 TB). FP4 tensor-core math also runs at roughly twice the FP8 rate, so expert GEMM throughput rises along with the memory savings.

Sparse tensor cores

Blackwell carries forward the 2:4 structured sparsity support (every 4-element block has 2 zeros) that NVIDIA tensor cores have had since Ampere. For MoE, this could combine with the expert routing's natural sparsity, where only the top-k experts compute per token. The kernels haven't fully matured by mid-2026, but the hardware support is there.

MXFP8 and MXFP4

Microscaled FP formats with per-block scales let MoE inference push to FP4 or FP8 without the dynamic-range issues of unscaled FP8. The TensorRT-LLM and vLLM Blackwell paths use MXFP4 for expert weights with FP8 or BF16 activations.
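The per-block scale adds a small, predictable overhead. A sketch assuming the OCP Microscaling layout of 32-element blocks sharing one 8-bit scale; the helper names are illustrative:

```python
def mx_bits_per_element(element_bits: int, block_size: int = 32,
                        scale_bits: int = 8) -> float:
    """Effective storage per element once the shared block scale is included."""
    return element_bits + scale_bits / block_size


def weight_gb(num_params: float, bits_per_element: float) -> float:
    """Weight storage in GB (decimal) for a given effective bit width."""
    return num_params * bits_per_element / 8 / 1e9


mxfp4_bits = mx_bits_per_element(4)   # 4 data bits + 0.25 scale bits
mxfp8_bits = mx_bits_per_element(8)

print(f"MXFP4: {mxfp4_bits} bits/elem, {weight_gb(671e9, mxfp4_bits):.0f} GB for 671B params")
print(f"MXFP8: {mxfp8_bits} bits/elem, {weight_gb(671e9, mxfp8_bits):.0f} GB")
print(f"BF16 : 16 bits/elem, {weight_gb(671e9, 16):.0f} GB")
```

Under these assumptions the scale overhead is about 6% on top of raw FP4, which is why a 671B model lands near 350 GB rather than exactly 336 GB.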

Network bandwidth on NVL72

GB200 NVL72 ups the NVLink bandwidth per GPU to 1.8 TB/s. The all-to-all collective for MoE dispatch becomes correspondingly less of a bottleneck — for typical batch sizes, the all-to-all latency drops below 10 microseconds. Larger and more aggressive EP configurations become practical.

What this means for deployment

A 671B MoE that required EP=64 spread across two H100 racks in 2024 can run on a single GB200 NVL72 rack with EP=64 in FP4 in 2026. The hardware progression has caught up to the model architecture.


MoE inference cost arithmetic

The cost math for MoE inference differs from dense inference in instructive ways.

Cost components

Per-token cost in MoE inference: (a) attention computation (same as dense), (b) router computation (small), (c) active expert computation (a fraction of total expert FLOPs), (d) all-to-all communication (in distributed serving), (e) KV cache memory bandwidth (same as dense).

Active-FLOPs comparison

DeepSeek-V3 activates 37B parameters per token, roughly half of what dense Llama-3 70B activates (all 70B). Counting about 2 FLOPs per active parameter per token, on the same hardware:

  • Dense 70B: ~140 GFLOPs/token, 70B params in memory.
  • DeepSeek-V3 MoE: ~74 GFLOPs/token, 671B params in memory.

If you're compute-bound (large batches, prefill-heavy traffic), DeepSeek-V3 is ~2× cheaper per token. If you're bound by memory capacity, the dense model fits in less memory and serves more batch slots per GPU; DeepSeek-V3's memory footprint forces more GPUs per replica.
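The trade can be put in two ratios. This sketch uses the standard rule of thumb of about 2 forward-pass FLOPs per active parameter per token (one multiply, one add) and ignores attention-score FLOPs; the compute ratio is the same whichever constant is used:

```python
def forward_gflops_per_token(active_params: float) -> float:
    """Approximate forward-pass GFLOPs per token: 2 FLOPs per active param."""
    return 2 * active_params / 1e9


dense_active, dense_total = 70e9, 70e9
moe_active, moe_total = 37e9, 671e9

dense_gflops = forward_gflops_per_token(dense_active)
moe_gflops = forward_gflops_per_token(moe_active)

compute_advantage = dense_gflops / moe_gflops   # how much less compute MoE needs
memory_penalty = moe_total / dense_total        # how much more weight memory

print(f"dense: {dense_gflops:.0f} GFLOPs/token, MoE: {moe_gflops:.0f} GFLOPs/token")
print(f"MoE compute advantage: {compute_advantage:.2f}x")
print(f"MoE weight-memory penalty: {memory_penalty:.1f}x")
```

Roughly 1.9× less compute for roughly 9.6× more weight memory is the whole economic bet in one line.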

Capability per active FLOP

DeepSeek-V3 outperforms Llama-3 70B on most benchmarks despite roughly half the active FLOPs. The capability-per-active-FLOP ratio is ~2× higher. This is the central economic argument for MoE: more capability per inference dollar.

When the math breaks

  • Low QPS: with few in-flight requests, the MoE serving stack pays for expert memory and all-to-all overhead without amortizing it across batch. Cost per token can be 2–3× higher than dense.
  • Latency-sensitive single-stream: the all-to-all adds latency on every layer; the dense alternative doesn't have this overhead.
  • Small model: at small scale (<10B active), the MoE overhead dominates the savings. Dense wins.

When the math works

  • High QPS multi-tenant serving: amortize expert memory and all-to-all across many concurrent requests.
  • Rack-scale hardware: NVL72 makes all-to-all cheap; without it, cross-rack MoE has real latency cost.
  • Frontier capability requirements: when you need the best-in-class model, MoE is what frontier labs ship.

MoE failure modes

Training and serving MoE introduces failure modes that dense models don't have.

Router collapse

The router converges to picking the same few experts for almost all tokens. Other experts are starved of training signal and become useless. Mitigations: auxiliary load-balancing loss (Switch Transformer's original approach); auxiliary-loss-free balancing via online bias adjustment (DeepSeek-V3's approach); expert dropout (force random expert selection on a fraction of tokens).
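The auxiliary load-balancing loss can be sketched in a few lines. This follows the form described in the Switch Transformer paper, N × Σ f_i × P_i, where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean router probability for expert i. It is a plain-Python illustration for a single batch, not a training-ready implementation:

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def switch_aux_loss(router_logits):
    """router_logits[t][e]: logit of expert e for token t.

    Returns about 1.0 when load is balanced and approaches num_experts
    when the router sends everything to one expert.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1          # top-1 assignment
    f = [c / num_tokens for c in counts]      # dispatch fraction per expert
    mean_p = [sum(p[e] for p in probs) / num_tokens for e in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, mean_p))


balanced = [[10.0 if e == t % 4 else 0.0 for e in range(4)] for t in range(8)]
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(8)]
print(switch_aux_loss(balanced), switch_aux_loss(collapsed))
```

In training this value is multiplied by a small coefficient and added to the language-modeling loss, which is the tension the auxiliary-loss-free approach tries to remove.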

Expert death

An expert that receives few tokens has degraded gradient quality; its weights drift; capability degrades. Once an expert is "dead," restoring it is difficult. Production training pipelines monitor per-expert utilization and intervene (rebalance routing, restart problematic experts from a checkpoint, raise the auxiliary-loss weight aggressively).

Token dropping

When capacity factor is set too low, tokens that route to overflowing experts get dropped (skip the MoE block). Quality regression. Mitigations: higher capacity factor (more memory, more compute headroom); dynamic capacity adjustment; spilling overflow tokens to a fallback expert.
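A small simulation makes the capacity-factor trade visible. The sketch assumes the common definition capacity = ceil(capacity_factor × tokens × top_k / num_experts) and a first-come-first-served drop policy; real engines differ in how they order and prioritize tokens.

```python
import math


def expert_capacity(num_tokens: int, num_experts: int, top_k: int,
                    capacity_factor: float) -> int:
    """Maximum number of token slots each expert will accept."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def dispatch_with_capacity(assignments, num_experts: int, capacity: int):
    """assignments[i] is the expert chosen for slot i, in arrival order.

    Returns (kept, dropped) lists of slot indices.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for slot, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(slot)
        else:
            dropped.append(slot)   # overflow: this token skips the MoE block
    return kept, dropped


assignments = [0, 0, 0, 1, 2, 3, 0, 1]   # expert 0 is hot
tight = expert_capacity(num_tokens=8, num_experts=4, top_k=1, capacity_factor=1.0)
loose = expert_capacity(num_tokens=8, num_experts=4, top_k=1, capacity_factor=1.5)

print(tight, dispatch_with_capacity(assignments, 4, tight)[1])
print(loose, dispatch_with_capacity(assignments, 4, loose)[1])
```

Raising the factor from 1.0 to 1.5 cuts the drops in this example from two slots to one, at the price of 50% more reserved buffer per expert.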

Routing skew at serving time

The training distribution may not match serving traffic. An MoE trained on general data may have one or two experts that become overloaded on a specific customer's traffic, causing all-to-all hotspots. Mitigations: per-expert replication for hot experts; online routing rebalancing; expert-aware load balancing across replicas.

All-to-all stragglers

A single slow GPU in the EP group stalls the all-to-all for all participants. Mitigations: latency-aware routing (skip a straggler's experts temporarily); periodic GPU health checks; preemptive replacement of suspect GPUs.

Expert checkpoint corruption

A single corrupted expert weight can produce subtly bad outputs that pass surface-level monitoring. Per-expert checksums and continuous validation help — see the checkpoint storage and recovery guide for the full reliability story.


Upcycling dense → MoE

Training a frontier MoE from scratch is expensive. The 2024–2026 alternative: take a trained dense model and "upcycle" it into a MoE.

The upcycling recipe

Copy the dense FFN block N times; each copy becomes an expert. Initialize the router randomly. Continue training with auxiliary load-balancing loss until experts diverge. The capability is bootstrapped from the dense model; experts specialize as training proceeds.
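The initialization step of the recipe is mechanical. A framework-free sketch, assuming the dense FFN is represented as a dict of nested lists and the router is a hidden_dim × num_experts matrix drawn from a small Gaussian; the names and the init scale are illustrative choices, not taken from any paper's code:

```python
import copy
import random


def upcycle_ffn(dense_ffn, num_experts: int, hidden_dim: int,
                seed: int = 0, init_std: float = 0.02):
    """Turn one dense FFN into an MoE layer initialization.

    Every expert starts as an independent copy of the dense FFN weights;
    the router starts random, so experts only diverge through training.
    """
    rng = random.Random(seed)
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    router = [[rng.gauss(0.0, init_std) for _ in range(num_experts)]
              for _ in range(hidden_dim)]
    return {"experts": experts, "router": router}


dense = {"w_in": [[1.0, 2.0]], "w_out": [[3.0], [4.0]]}   # toy weights
moe_layer = upcycle_ffn(dense, num_experts=4, hidden_dim=2)
print(len(moe_layer["experts"]), len(moe_layer["router"]), len(moe_layer["router"][0]))
```

The deep copy matters: experts that alias the same weight storage would receive identical updates and never specialize.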

Snowflake's upcycling

Snowflake Arctic used dense → MoE upcycling for the MoE component. The dense backbone they trained first served as the initialization for each expert; the hybrid Dense-MoE design retains the dense path on every token.

When upcycling works

When the dense base model is already high-quality. Upcycling from a weak base produces a weak MoE; the experts can't recover what the base never had. The 2025 frontier upcycling efforts started from frontier dense models (Llama-3.1 405B, Mistral Large) to ensure a strong base.

Sparse upcycling (Komatsuzaki et al., 2023)

The original sparse-upcycling paper (arXiv:2212.05055) demonstrated that initializing experts as copies of the dense FFN block, then training, recovers most of the from-scratch MoE quality at a fraction of the compute. Standard reference.

ALCO and 2026 variants

ALCO (Adaptive Layer-wise Computation Optimization) and similar 2025–2026 techniques extend upcycling with layer-wise specialization — different layers get different upcycling treatments based on which layers benefit most from sparsity. Research-grade but influencing production designs.

The MoE-vs-dense scaling story (2026)

2026 scaling-law work suggests that for a fixed training compute budget, MoE wins on benchmark scores by ~5–15% over the equivalent dense model. The ratio depends heavily on the routing scheme and the sparsity level; aggressive sparsity (DeepSeek-V3 style) wins more than mild sparsity (Mixtral style). The conclusion: at frontier scale, MoE is the default architecture; dense models survive at smaller scales where the operational simplicity wins.


LoRA on MoE

Fine-tuning MoE models with LoRA introduces design choices not present in dense LoRA.

Adapter placement

Options: (a) LoRA on the attention only, leaving experts frozen; (b) LoRA on every expert (multiplying the parameter count of the adapter); (c) LoRA on shared experts only; (d) routing-aware LoRA where the adapter only fires on certain expert routes.

Option (a) is the simplest and most common: train attention LoRA, leave experts as-is. Works well when the fine-tuning task doesn't require expert-level adaptation. Adapter size is the same as dense LoRA.

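Option (b) multiplies the expert-side adapter count by the number of experts. For a 256-expert model, each MoE layer carries 256 expert adapters instead of one FFN adapter, so the total adapter is far larger than in the attention-only option. Used when expert-level adaptation matters.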
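The size blow-up is easy to quantify. A LoRA adapter of rank r on a d_in × d_out matrix adds r × (d_in + d_out) parameters. The sketch below assumes DeepSeek-V3-like dimensions (hidden size 7168, expert intermediate size 2048, three projection matrices per expert) and rank 16; treat the dimensions and rank as illustrative assumptions.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)


HIDDEN = 7168        # model hidden size (assumed)
EXPERT_FFN = 2048    # per-expert intermediate size (assumed)
NUM_EXPERTS = 256
RANK = 16
MATRICES_PER_EXPERT = 3   # gate, up, and down projections

per_expert = MATRICES_PER_EXPERT * lora_params(HIDDEN, EXPERT_FFN, RANK)
per_layer_all_experts = NUM_EXPERTS * per_expert

print(f"one expert's adapter : {per_expert:,} params")
print(f"all experts, 1 layer : {per_layer_all_experts:,} params")
```

Roughly 113M adapter parameters per MoE layer is why per-expert LoRA rarely fits a multi-tenant adapter cache.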

Expert-specific adapters

A more nuanced design: different LoRA adapters for different experts, possibly trained on different sub-tasks. Routing-aware fine-tuning. Research-stage in 2026; some production deployments use it for multi-tenant scenarios where different customers' use cases route to different experts.

Serving multi-LoRA MoE

The inference engines (vLLM, SGLang) support multi-LoRA on dense models. MoE LoRA support is less mature; for the simple case (attention-only LoRA), it works the same as dense. For expert-level LoRA, custom serving code is usually needed.


Worked example: serving DeepSeek-V3 on GB200 NVL72

Bringing the numbers together for a concrete deployment.

Setup

  • DeepSeek-V3: 671B total params, 37B active per token, 256 experts per layer, top-8 routing, 61 layers, hidden_dim 7168.
  • Hardware: GB200 NVL72 rack — 72 B200 GPUs, all-NVLink at 1.8 TB/s per GPU.
  • Quantization: MXFP4 expert weights (with FP8 scales), BF16 activations.
  • Topology: EP=64 (4 experts per GPU), TP=8 for attention, PP=8.

Memory footprint

  • Experts at MXFP4: 671B × 0.5 bytes ≈ 336 GB. Plus FP8 scales (~5% overhead) ≈ 350 GB total. Spread across 64 EP ranks: ~5.5 GB per rank.
  • Attention weights at BF16: ~10B × 2 = 20 GB total. Spread across TP=8: ~2.5 GB per rank.
  • KV cache: for 32k context × batch 256 × layers × KV dim, sized to fit in remaining HBM after weights. Typically 50–80 GB per GPU.
  • Per-GPU HBM use: ~5.5 GB experts + ~2.5 GB attention + ~70 GB KV cache + ~10 GB activations and overhead ≈ 90 GB out of 192 GB available. Comfortable margin.
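The per-GPU budget above can be tallied directly. The sketch reuses the list's own inputs (a ~5% scale overhead, ~70 GB of KV cache, ~10 GB of activations and overhead); these are planning assumptions rather than measurements.

```python
TOTAL_PARAMS = 671e9
EP_RANKS = 64
TP_RANKS = 8
HBM_PER_GPU_GB = 192

experts_total_gb = TOTAL_PARAMS * 0.5 / 1e9 * 1.05   # MXFP4 plus ~5% for scales
experts_per_rank_gb = experts_total_gb / EP_RANKS

attention_total_gb = 10e9 * 2 / 1e9                  # ~10B params at BF16
attention_per_rank_gb = attention_total_gb / TP_RANKS

kv_cache_gb = 70          # assumed working-set size per GPU
overhead_gb = 10          # activations, buffers, runtime

used_gb = experts_per_rank_gb + attention_per_rank_gb + kv_cache_gb + overhead_gb
print(f"experts {experts_per_rank_gb:.1f} GB + attention {attention_per_rank_gb:.1f} GB "
      f"+ KV {kv_cache_gb} GB + overhead {overhead_gb} GB = {used_gb:.0f} GB "
      f"of {HBM_PER_GPU_GB} GB")
```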

Throughput

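  • Per-token compute: 37B active params × 2 FLOPs per param ≈ 74 GFLOPs. At ~5 PFLOP/s of FP8 throughput per GPU (an optimistic, near-peak figure): 74e9 / 5e15 ≈ 15 µs of tensor-core time per token, divided across the GPUs working in parallel.
  • All-to-all, bandwidth only: the ~900 MB per rank from the communication math above is the total across all 61 layers for a 4096-token batch, so each layer moves ~15 MB per rank, about 8 µs at 1.8 TB/s, and the whole forward pass about 0.5 ms.
  • Fixed costs dominate at decode batch sizes: two collectives per layer × 61 layers each pay launch and synchronization latency, and every step re-reads expert weights and KV cache from HBM.
  • With dispatch/compute overlap (DeepSeek's pattern): effective per-token latency ~5–10 ms.
  • Throughput: ~100–200 tokens/s per request, much higher in aggregate across concurrent requests.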

Cost

  • GB200 NVL72 hardware: roughly $3M/year amortized capex + power for one rack.
  • Sustaining ~1000 concurrent requests at 100 tokens/s each: 100k tokens/s total throughput.
  • Per-token cost: $3M/year / (100k tokens/s × 3.15e7 s/year) ≈ $1 per million tokens. Competitive with mid-tier API pricing.
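The cost line is a one-function calculation, which makes it easy to rerun with your own utilization. A sketch assuming the rack runs at the stated concurrency for the whole year (real fleets rarely sustain 100% utilization, so treat the result as a floor):

```python
def usd_per_million_tokens(annual_cost_usd: float, concurrent_requests: int,
                           tokens_per_s_per_request: float) -> float:
    """Amortized serving cost per million generated tokens."""
    seconds_per_year = 365 * 24 * 3600
    tokens_per_year = (concurrent_requests * tokens_per_s_per_request
                       * seconds_per_year)
    return annual_cost_usd / tokens_per_year * 1e6


baseline = usd_per_million_tokens(3e6, concurrent_requests=1000,
                                  tokens_per_s_per_request=100)
low_qps = usd_per_million_tokens(3e6, concurrent_requests=100,
                                 tokens_per_s_per_request=100)

print(f"1000 concurrent: ${baseline:.2f} per million tokens")
print(f" 100 concurrent: ${low_qps:.2f} per million tokens")
```

The second call shows the low-QPS penalty directly: the rack costs the same, so a tenfold drop in aggregate throughput is a tenfold rise in cost per token.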

Sensitivity

  • Drop top-k to 4: ~half the all-to-all volume, modest quality regression. Per-token cost drops ~20%.
  • Use FP4 activations as well as weights: another ~25% throughput, more quality risk.
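  • Cut concurrency to 100: each request gets more compute headroom and per-token latency drops to ~3 ms, but aggregate throughput falls to roughly 33k tokens/s (100 requests × ~330 tokens/s), so per-token cost rises ~3×. If latency did not improve at all, the rise would be the full 10×. Trade-off depends on workload.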

Without rack-scale NVLink

The same model on cross-rack InfiniBand: at 400 Gb/s (50 GB/s), the ~900 MB per rank of all-to-all traffic per forward pass climbs from ~0.5 ms to ~18 ms on bandwidth alone, before congestion and multi-hop latency. That is longer than the entire 5–10 ms decode step it has to hide behind, so overlap can't absorb it. The deployment becomes throughput-only (large batch, long latency tolerance) and per-token cost rises 3–5×. The takeaway: NVL72-class hardware isn't a nice-to-have for frontier MoE; it's structural.


KV cache for MoE

MoE doesn't change the attention computation — KV cache works the same as in dense models — but the surrounding constraints differ.

Same per-token KV size

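KV cache size per token = 2 × num_layers × num_kv_heads × head_dim × bytes/element. With plain multi-head attention at DeepSeek-V3's dimensions (61 layers, 128 heads, head_dim 128, BF16): 2 × 61 × 128 × 128 × 2 ≈ 4 MB per token. For 32k context: ~131 GB per request. A dense model with the same layer count, head count, and head_dim would have exactly the same number; the formula has no expert term. (Models that use grouped-query attention, such as Llama-3 70B, have far fewer KV heads and a much smaller cache.)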
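The formula translates directly into code. The sketch below evaluates it for the multi-head configuration above and, for contrast, for a Llama-3-70B-like grouped-query configuration (80 layers, 8 KV heads, head_dim 128); the function name is illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """K and V for one token across all layers (no expert term appears)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


mha = kv_bytes_per_token(num_layers=61, num_kv_heads=128, head_dim=128)
gqa = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
context = 32 * 1024

print(f"MHA, 61 layers x 128 heads: {mha / 1e6:.1f} MB/token, "
      f"{mha * context / 1e9:.0f} GB at 32k context")
print(f"GQA, 80 layers x 8 KV heads: {gqa / 1e3:.0f} KB/token, "
      f"{gqa * context / 1e9:.1f} GB at 32k context")
```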

Multi-head Latent Attention (MLA)

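DeepSeek-V2 and V3 do not actually pay the ~4 MB figure: they use Multi-head Latent Attention, which caches a compressed latent per token instead of full K and V. The V2 paper reports a 93.3% KV cache reduction relative to the earlier DeepSeek 67B model, so the saving is well over an order of magnitude compared with plain multi-head attention. The compressed cache is then projected back to full K and V on-the-fly during attention. Standard in DeepSeek; not (yet) widely adopted in other MoE models. See the V2 and V3 tech reports for details.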

Batch composition pressure

MoE's expert load-balance benefit relies on diverse batches. KV cache pressure favors longer sequences (better amortization of attention compute over generation steps). The two pressures interact: a serving stack that aggregates many short conversations has good expert balance but worse KV efficiency; one that processes few long conversations has the opposite. Production tuning balances them.

KV cache sharing across experts

KV cache is per-token, not per-expert. The same KV cache serves attention regardless of which experts the token routes to in MoE layers. This is good news — no per-expert KV duplication — and means the KV cache analysis from the KV cache memory math guide applies directly.


The bottom line

The activation/parameter gap is what MoE both creates and exploits: most weights are idle on any given token, but they still need to be addressable, routable, and balanced. The single biggest lever is the fabric. Expert parallelism only pays off when the all-to-all collective rides hardware that can swallow it — an NVL72-class rack or equivalent. On the wrong substrate, MoE is slower than a same-active-parameter dense model and twice as fragile.

  • MoE wins on capability-per-active-FLOP at frontier scale; below that threshold dense almost always beats it on cost and tail latency.
  • HBM and bisection bandwidth, not FLOPs, set the serving economics. Plan capacity around them.
  • Routing imbalance is a permanent operational concern; capacity factors, drop policies, and auxiliary-loss-free bias balancing are mitigations, not solutions.
  • Batch size is structural: MoE needs enough concurrent tokens to keep every expert fed. Low-QPS MoE is wasteful by design.
  • Expert replication and prefix-aware scheduling are the two highest-leverage production knobs.

For the network primitives this depends on, see the NCCL guide and NVLink and rack-scale topology. For the prefill/decode split MoE strains, see disaggregated inference.


FAQ

Is MoE always better than dense at the same active parameter count? Not always, but usually at sufficient training scale. The capability advantage shows up most clearly at the frontier; at small scales, dense models with the same active params often match.

Can I run DeepSeek-V3 on one GPU? No. The total parameter count (671B) is far larger than any single GPU's HBM. Smallest practical inference setup is 8 H200 or B200 GPUs with EP, more comfortable at 16+.

How does MoE interact with quantization? Cleanly. Per-expert quantization works; some experts can be at lower precision than others if quality permits. FP8 weight-only is the typical production choice.

Why top-k routing instead of top-1? Top-1 routes each token to one expert. Top-k>1 routes to multiple and combines, which trains more stably and yields smoother gradients to the router. k=2 is the common default; some recent models use k=8 with many small experts.

What about MoE in attention layers? Most MoE applies only to the FFN block. MoE-attention exists in research but isn't dominant in production yet. Attention's structure is less amenable to expert specialization.

Can MoE models be distilled to dense? Yes, with some quality loss. Distillation from MoE teacher to dense student is a common path for edge deployments.

How many experts is the right number? There's no clean answer; it's a hyperparameter. Public MoE models range from 8 experts (Mixtral) to 256+ (DeepSeek-V3). More experts = more specialization but more all-to-all volume. Current trend: more, smaller experts.

Does MoE help training cost too? Yes, substantially. Training a 671B MoE that activates 37B per token costs roughly the same FLOPs as training a 37B dense model, but achieves much higher capability. This is part of why labs adopt MoE for frontier models. The all-to-all costs are mostly hidden behind compute at training-batch sizes, which is why training MoE is cheap relative to serving it.

How does expert-choice routing compare to top-k token-choice in production? Expert-choice gives perfect load balance by construction but doesn't translate cleanly to autoregressive decode (you'd need to know future tokens to pick experts greedily). It's used mainly during training; production decoders are almost universally top-k token-choice with auxiliary-loss-free or auxiliary-loss balancing.

What's the right routing scheme for a new MoE you're training in 2026? Default: token-choice top-k (k=2 for small expert counts, k=6–8 for fine-grained 128+ expert designs), with DeepSeek-style auxiliary-loss-free bias balancing plus a shared expert for stability and dropped-token recovery. Add z-loss on router logits.

Should I use MegaBlocks-style block-sparse kernels or grouped GEMM? Block-sparse (MegaBlocks, arXiv:2211.15841) removes capacity-factor padding entirely — you get the actual variable-size batches. Grouped GEMM is simpler and well-supported in vendor BLAS. For very high expert counts and aggressive capacity factors, block-sparse wins; for moderate counts, grouped GEMM is fine.

How do I decide between EP and TP for a given MoE layer? Each expert needs to fit on its parallelism unit. If one expert fits on one GPU, pure EP. If experts are too large, TP within an expert + EP across experts. The decision is almost mechanical from expert size and GPU HBM.

Does MoE work for small models? Below ~10B total parameters, MoE usually loses to a dense model of similar active size — the overhead of routing, all-to-all, and load imbalance dominates. The crossover where MoE starts to win is roughly when total params reach ~50B and you have rack-scale fabric to support it.

Are there latency penalties for top-8 vs top-2 routing? Yes, modest. Each additional expert per token is another partial output to combine and adds dispatch volume. Top-8 deployments (DeepSeek) lean on rack-scale fabric for the extra dispatch; the per-token quality lift is usually worth it.

Can I run MoE on a single H100 with offloading? Technically yes for small MoE (Mixtral 8x7B fits at FP8 with KV space on one H100 at 80 GB), no for frontier MoE (DeepSeek-V3 at 671B does not fit on any single GPU at any precision). For larger MoE on a single GPU, the typical pattern is offloading inactive experts to CPU memory and paging them in on demand. PCIe Gen5 at ~64 GB/s is roughly 50× lower bandwidth than H100 HBM (~3.35 TB/s), so decode that has to page experts in is often an order of magnitude slower than HBM-resident decode; it is a development convenience, not a production strategy.

How does MoE interact with LoRA fine-tuning? Three patterns are used: LoRA on the router only (cheapest, adjusts which experts fire), LoRA on each expert independently (most expressive, multiplies adapter count by N experts), or LoRA on the shared expert and attention (preserves the routed-expert structure). For most fine-tuning use cases, LoRA on attention + shared expert is the sweet spot; full per-expert LoRA is research-grade and breaks multi-tenant serving. See our multi-tenant LoRA serving guide.

What kills MoE throughput in production most often? Three culprits, in order of frequency: (1) decode batch too small, so expert utilization collapses — fix with multi-tenant aggregation or larger concurrency; (2) routing imbalance creating stragglers — fix with replication of hot experts and capacity tuning; (3) all-to-all slower than expected due to fabric misconfiguration (wrong NCCL algorithm, PFC issues on RoCE, etc.) — fix with careful NCCL tuning.

Does MoE work for reasoning models (R1, o1-style)? Yes, and DeepSeek-R1 is the most-cited example — R1 is fine-tuned from V3, which is MoE. Reasoning models generate long chains of thought, which means each request produces many decode steps, which means MoE's per-step costs are amortized over more useful output. The economics work out well. See our reasoning model serving guide.

How do I monitor MoE serving in production? Mandatory metrics: per-expert token count per minute, per-GPU dispatch volume, capacity-overflow events, all-to-all duration histograms, per-layer load-imbalance ratio (max_expert_load / mean_expert_load). Useful metrics: routing entropy (low entropy = router collapse), shared expert hit rate (if you have one), per-expert HBM occupancy. Failure to instrument these is the #1 reason MoE deployments have mystery latency spikes.
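Two of these metrics, the load-imbalance ratio and routing entropy, can be computed from nothing more than per-expert token counts. A minimal sketch with hypothetical function names, using the natural log for entropy:

```python
import math


def load_imbalance(tokens_per_expert) -> float:
    """max_expert_load / mean_expert_load; 1.0 means perfectly balanced."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    return max(tokens_per_expert) / mean_load


def routing_entropy(tokens_per_expert) -> float:
    """Shannon entropy (nats) of the load distribution.

    log(num_experts) means uniform routing; values near 0 mean collapse.
    """
    total = sum(tokens_per_expert)
    return -sum((c / total) * math.log(c / total)
                for c in tokens_per_expert if c > 0)


healthy = [10, 10, 10, 10]
skewed = [25, 5, 5, 5]
collapsed = [40, 0, 0, 0]

for name, counts in [("healthy", healthy), ("skewed", skewed), ("collapsed", collapsed)]:
    print(f"{name:10s} imbalance={load_imbalance(counts):.2f} "
          f"entropy={routing_entropy(counts):.3f}")
```

Alerting on the ratio per layer, rather than per model, tends to catch hot experts earlier because skew in one layer is averaged away in a global figure.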

Are MoE models harder to quantize than dense? Slightly. Each expert is a smaller MLP, which means quantization sensitivity is more granular — a single bad expert can degrade quality on a specific topic without affecting the average benchmark. The practical mitigation is per-expert calibration during PTQ, accepting that some experts may end up at FP8 while others stay BF16. FP8 weight-only is the production default for MoE in 2026 and works well across DeepSeek-V3 and Mixtral. See quantization tradeoffs.

Does MoE help inference cost on edge devices? No. MoE's total parameter count dominates HBM, and edge devices have tiny memory budgets. Edge inference uses dense models (often 1-7B). MoE-to-dense distillation is the bridge: train MoE at the frontier, distill to a smaller dense student for edge deployment. See our synthetic data and distillation guide.

What's the next step beyond top-k MoE? Two directions: (a) MoE in attention layers (MoA, arXiv:2406.14909) — early but interesting, and (b) hierarchical experts (groups of experts, with routing first to a group then to an expert within). Neither has displaced top-k token-choice as the production default; both are worth tracking.

Can I serve MoE on a CPU? For small MoE (Mixtral 8x7B at low quantization, INT4), yes, with the usual CPU caveats: ~5-15 tokens/s on a high-end Xeon or EPYC. For frontier MoE, no — the total HBM-equivalent memory needed is too large and the per-expert GEMMs are too compute-heavy. CPU MoE is a research and experimentation tool, not a production option.

Does MoE need a different post-training (RLHF/DPO) pipeline? Mostly the same as dense, with one caveat: the router's specialization can be perturbed by aggressive fine-tuning, causing it to collapse to a few experts. Mitigation is to freeze the router during early fine-tuning or use lower learning rates on router parameters. See our post-training RLHF/DPO guide.

How long does it take to train a frontier MoE? DeepSeek-V3 (671B total) trained on ~2.8M H800 GPU-hours over roughly 2 months on ~2k H800 GPUs. The math is dominated by active parameters times tokens, not total parameters, so frontier MoE training is comparable in GPU-hours to dense 37-70B training but spread across many more GPUs for HBM reasons. The serving GPU count is usually a fraction of the training count because batch sizes are smaller. See distributed training for the parallelism story.

Is MoE compatible with verifiable inference? Yes, though the router introduces extra steps that have to be included in proofs. The routing decisions are deterministic given the model weights and inputs, so they can be reconstructed and verified by any party with access to both. See verifiable inference and proof of sampling.

How does MoE affect checkpoint size and recovery? Checkpoints are huge (DeepSeek-V3 at FP16 is ~1.3 TB), so checkpoint I/O is a real engineering problem. Sharded checkpoints across the EP dimension are standard, and parallel write to a fast object store is non-negotiable. See our checkpoint storage and recovery guide for the storage architecture.

What's the practical difference between top-k softmax routing and expert-choice routing? Top-k softmax (the default in Mixtral, DeepSeek, most production MoE) routes each token to its top-k experts; load balance requires auxiliary loss. Expert-choice routing (Zhou et al., 2022) inverts: each expert chooses its top-N tokens; load balance is automatic by construction. Strength of expert-choice: no load-balancing loss needed; clean implementation. Weakness: some tokens may not be chosen by any expert, requiring a fallback. In production, top-k dominates because the routing model is simpler to fine-tune and because the failure mode (unbalanced experts) is well-understood.

How does DeepSeek-V3's auxiliary-loss-free balancing actually work? Instead of adding a load-balancing loss term to the training objective, DeepSeek adds a per-expert bias to the router's scores. During training, the bias is adjusted online: experts that received many tokens get their bias decreased; experts that received few get their bias increased. The effect is similar to auxiliary loss but doesn't compete with the language-modeling objective. The tech report (arXiv:2412.19437) describes the algorithm; the implementation is ~10 lines added to the router code.
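
A minimal sketch of the bias-update idea described above, in plain Python. The function names, the fixed `step_size`, and the sign-based update rule are illustrative assumptions rather than DeepSeek's code. The sketch also assumes the bias is used only to select experts, with gate weights still computed from the unbiased scores.

```python
def route_with_bias(scores, bias, k):
    """Pick the top-k experts per token using score + bias (selection only)."""
    chosen = []
    for token_scores in scores:
        biased = [s + b for s, b in zip(token_scores, bias)]
        ranked = sorted(range(len(biased)), key=lambda e: biased[e], reverse=True)
        chosen.append(ranked[:k])
    return chosen


def update_bias(bias, chosen, step_size=0.001):
    """Nudge each expert's bias toward uniform load after one training step.

    step_size is a hypothetical hyperparameter name for the update speed.
    """
    load = [0] * len(bias)
    for experts in chosen:
        for e in experts:
            load[e] += 1
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, tokens in zip(bias, load):
        if tokens > mean_load:
            new_bias.append(b - step_size)  # overloaded: make it less attractive
        elif tokens < mean_load:
            new_bias.append(b + step_size)  # underloaded: make it more attractive
        else:
            new_bias.append(b)
    return new_bias, load


# Eight tokens that all prefer expert 0.
scores = [[0.9, 0.5, 0.4, 0.3] for _ in range(8)]
bias = [0.0, 0.0, 0.0, 0.0]
chosen = route_with_bias(scores, bias, k=1)
bias, load = update_bias(bias, chosen, step_size=0.1)
```

After one step the overloaded expert's bias is negative and the idle experts' biases are positive, so repeated steps gradually pull selection away from the hot expert without adding a loss term.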

What's the impact of routing precision (FP32 vs FP16) on MoE quality? The router computes softmax over expert scores; numerical precision matters because tiny score differences flip routing decisions. Most production deployments use FP32 for the router computation even when the rest of the model is BF16 or FP8. Cost is negligible (the router is a tiny fraction of total compute); the quality impact of FP16 routing is small but measurable, particularly at long sequence lengths where rounding errors accumulate.

How do I handle expert load imbalance at serving time (post-training)? Three patterns: (1) replicate hot experts across multiple GPUs and route to the least-loaded replica; (2) capacity-aware routing where the orchestrator nudges tokens away from overloaded experts; (3) post-hoc rebalancing where a periodic offline job adjusts router biases based on production traffic. (1) is the most common; (2) requires deeper engine integration; (3) is increasingly standard for long-running deployments where the traffic distribution drifts from training.

What's the right batch composition for MoE? Diversity helps. A batch of tokens from many different conversations spreads tokens across many experts, improving balance. A batch of tokens from a few long conversations concentrates routing — same conversation tends to route to similar experts repeatedly, causing local imbalance. Production serving stacks aggregate across users and across request types to maintain diversity.

How does MoE interact with speculative decoding? Speculative decoding generates draft tokens with a smaller model, then verifies with the target model. For MoE target models, each verification step routes through experts. If the draft model is dense and the target is MoE, the routing on the verification batch is correlated (drafts from the same conversation route similarly), which can cause imbalance spikes. Mitigations: speculative decode at low batch sizes; use a dense alternative for high-frequency drafts.

What's the cost penalty for top-8 routing vs top-2 routing in serving? Roughly 2–4× the all-to-all communication volume, since dispatch volume scales with k (more expert copies of each token per layer). On rack-scale NVLink hardware, this adds a few microseconds per layer: a small absolute number, but real. On InfiniBand-only hardware, top-8 routing costs noticeably more latency per token. The quality benefit at frontier scale (DeepSeek-V3 uses top-8) typically justifies the cost.

Can I serve different MoE models on the same hardware with shared experts? Not directly. Different MoE models have different expert weights, different routing schemes, different number of experts. Sharing the underlying base model is possible if both MoEs were upcycled from the same dense base, but that's a narrow case. In practice, each MoE model gets its own serving deployment.

How do I migrate from a dense to an MoE serving stack? Phased. (1) Identify the cost lever — typically capability per token or capacity per GPU. (2) Validate the MoE model meets quality targets on your eval set. (3) Set up the EP-aware serving infrastructure (engine support, networking, monitoring). (4) Run side-by-side traffic shadowing to validate latency, error rates, and quality. (5) Gradual rollout with per-tenant override. (6) Sunset the dense deployment once MoE is proven at full load. Typical migration: 3–6 months for a serious production change.

What MoE-specific tooling do I need for production debugging? At minimum: per-expert traffic dashboards (tokens routed per expert per time window); routing entropy plots (low entropy = collapse); capacity overflow alerts; all-to-all latency histograms per layer. Useful additions: per-expert quality regression detection (does adapting the router hurt any specific expert's outputs?); routing-decision sampling for offline analysis. The instrumentation is more elaborate than dense serving; budget engineer-weeks to build it out properly.

Is there a clear winner — MoE or dense — for 2026 and beyond? At frontier scale (>100B total params, multi-rack serving), MoE wins decisively on capability-per-dollar; this is why DeepSeek-V3, Llama-4, and the major frontier deployments are all MoE. At mid-scale (10–100B), it's a tossup; operational simplicity often makes dense the better choice. At small scale (<10B), dense wins. The 2026 industry trend is "MoE at the top, dense in the middle and below"; this will likely persist into 2027 and beyond.


Glossary

  • All-to-all — collective communication where every participant sends some data to every other participant. The core MoE communication primitive.
  • Auxiliary loss — additional training term encouraging even router output.
  • Capacity factor — multiplier on per-expert token capacity. Controls drop rate.
  • Dispatch / combine — the two halves of an MoE layer's communication: dispatching tokens to experts, combining outputs back.
  • Expert — one of the parallel FFN blocks in an MoE layer.
  • Expert parallelism (EP) — partitioning strategy that places different experts on different GPUs.
  • Routed experts — experts selected per-token by the router.
  • Shared expert — an always-active expert that processes every token, sometimes used alongside routed experts.
  • Top-k routing — selecting the k highest-scoring experts per token.
  • Token drop — what happens when an expert is over-subscribed past its capacity.

References

  • DeepSeek-V3 technical report — DeepSeek-AI, 2024. arXiv:2412.19437. Reference open-weight large MoE; describes auxiliary-loss-free load balancing and shared-expert design.
  • Switch Transformer — Fedus, Zoph, Shazeer, 2021. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv:2101.03961. Introduces top-1 routing at scale.
  • GShard — Lepikhin et al., 2020. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." arXiv:2006.16668. Foundational paper on MoE all-to-all and capacity factor.
  • ST-MoE — Zoph et al., 2022. "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906. Analyses training stability tricks including z-loss.
  • Mixtral 8x7B — Jiang et al., 2024. "Mixtral of Experts." arXiv:2401.04088. Practical recipe for moderate-scale MoE.
  • MegaBlocks — Gale, Narayanan et al., 2022. "MegaBlocks: Efficient Sparse Training with Mixture-of-Experts." arXiv:2211.15841. Block-sparse kernels.
  • Tutel — Hwang et al., 2022. "Tutel: Adaptive Mixture-of-Experts at Scale." arXiv:2206.03382. Microsoft's MoE communication library.
  • Outrageously Large Neural Networks — Shazeer et al., 2017. arXiv:1701.06538. The original sparsely-gated MoE paper that re-launched the line.
  • DeepSeekMoE — Dai et al., 2024. "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models." arXiv:2401.06066. Fine-grained expert design and shared-expert strategy underlying DeepSeek-V3.
  • Expert Choice Routing — Zhou et al., 2022. "Mixture-of-Experts with Expert Choice Routing." arXiv:2202.09368. Inverted-perspective routing that guarantees load balance.

Per-architecture deep dive: 2026 MoE catalog

The 2026 MoE catalog spans 14B-active to 671B-total parameter models. Each makes specific choices in expert count, top-k, shared-expert design, and routing strategy that drive the serving requirements.

DeepSeek-V3 / R1 (671B total, 37B active, 256 experts, top-8)

The reference open-weight large MoE. 256 routed experts plus a shared expert per layer. Top-8 routing. Auxiliary-loss-free load balancing via bias updates (DeepSeek-AI, DeepSeek-V3 technical report, arXiv:2412.19437). FP8 training and inference for the MoE blocks. R1 is the reasoning variant; same architecture, different post-training. Serves cleanly on GB200 NVL72 with EP=64 or EP=256. Per-token compute is comparable to a dense 37B model; serving at scale still requires the full 671B in HBM.

Llama-4 Maverick (17B active, 400B total) and Scout (17B active, 109B total)

Meta's 2025 MoE generation. Maverick is the larger production model; Scout is the open-weight variant. Specific expert counts and top-k details are released in Meta's Llama 4 announcement. Routing follows top-k design with shared expert. Both target latency-sensitive serving with dense-17B-equivalent compute.

Mixtral 8x7B and 8x22B

Mistral's 2024 MoE. 8 experts, top-2 routing. 8x7B has ~12.9B active parameters; 8x22B has ~39B active. Production-deployable on 8x H100 nodes. The simplicity of 8 experts makes load balancing tractable; routing imbalance is rarely a concern.

Qwen2-MoE and Qwen3-MoE

Alibaba's MoE line. Qwen2-MoE 14B and 57B variants, Qwen3-MoE successors. Mostly used in Asian markets; deployable on standard 8x H100 nodes.

Snowflake Arctic (480B total)

Hybrid dense + MoE design (a dense base layer plus MoE feed-forwards). Open-weights. Targeted at enterprise SQL / analytics use cases.

DBRX (132B total, 36B active, 16 experts)

Databricks' 2024 release. Fine-grained 16-expert design with top-4 routing.

Grok-1 (314B total, 8 experts, top-2)

xAI's open release of Grok-1. ~78B active. Larger experts than Mixtral, simpler routing.

Hunyuan-Large (389B total)

Tencent's MoE. Open weights released 2024.

Jamba-MoE (52B total)

AI21's hybrid Mamba-transformer MoE.

Phi-3.5-MoE

Microsoft's small MoE for efficient inference. 16 experts, smaller total parameters; targets cost-sensitive deployments.

MiniMax M1 (456B total)

MiniMax's 2025 large MoE.

Skywork-MoE

Open-weight 146B MoE from Kunlun.

Architecture comparison

Model | Total | Active | Experts | Top-k | Shared | Notes
DeepSeek-V3/R1 | 671B | 37B | 256 | 8 | yes | Aux-loss-free; FP8
Llama-4 Maverick | 400B | 17B | (per Meta) | (per Meta) | (per Meta) | Production frontier
Llama-4 Scout | 109B | 17B | (per Meta) | (per Meta) | (per Meta) | Open-weight
Mixtral 8x7B | 47B | 12.9B | 8 | 2 | no | Simple, robust
Mixtral 8x22B | 141B | 39B | 8 | 2 | no | Larger experts
DBRX | 132B | 36B | 16 | 4 | no | Fine-grained
Grok-1 | 314B | 78B | 8 | 2 | no | Open-weight
Snowflake Arctic | 480B | 17B | 128 | 2 | yes | Hybrid dense+MoE
Hunyuan-Large | 389B | 52B | 16 | 1 | yes | Tencent open
Phi-3.5-MoE | 42B | 6.6B | 16 | 2 | no | Cost-sensitive
Qwen2-MoE-57B | 57B | 14B | 64 | 4 | yes | Alibaba

Serving implications by architecture

  • Small expert count (8): straightforward EP, low all-to-all overhead, fits 8x H100.
  • Fine-grained (64-256): aggressive EP required, NVL72 strongly preferred, all-to-all becomes a major cost.
  • Shared expert: replicate across all EP ranks; small extra cost.
  • Aux-loss-free (DeepSeek-V3): cleaner training, requires bias-update infrastructure.


Routing strategies catalog

Routing is where MoE engineering decisions concentrate. Survey of the production-relevant approaches.

Top-k routing

Each token picks the top-k highest-scoring experts. Standard since Shazeer 2017. k=1 is Switch Transformer style; k=2 is the Mixtral default; k=8 is DeepSeek-V3. Higher k means more compute, more all-to-all traffic, more redundancy.
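
As a rough illustration, top-k token-choice routing for a single token can be sketched as follows. The function names are hypothetical, and renormalizing the gate weights over the chosen experts is one common variant, assumed here rather than taken from any specific model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Return [(expert_id, gate_weight)] for one token.

    Gate weights are renormalized so the k selected experts sum to 1
    (an assumed convention for this sketch).
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
    top = ranked[:k]
    norm = sum(probs[e] for e in top)
    return [(e, probs[e] / norm) for e in top]


route = top_k_route([2.0, 1.0, 0.0, -1.0], k=2)
```

The token's output is then the gate-weighted sum of the selected experts' outputs; raising k lengthens this list and with it both the expert compute and the dispatch traffic.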

Expert-choice routing (Zhou et al., 2022)

Inverted: each expert picks the top-N tokens it wants. Guarantees load balance by construction. Trade-off: tokens may not be processed by any expert. Used in some Google research deployments.
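
A small sketch of the inverted selection, under the assumption that `scores[t][e]` holds the router's affinity of token t for expert e. The names `expert_choice_route` and `capacity` are illustrative. It shows both properties from the paragraph above: every expert gets exactly `capacity` tokens, and some tokens can end up unrouted.

```python
def expert_choice_route(scores, capacity):
    """Each expert picks its `capacity` highest-affinity tokens.

    Returns (assignment, unrouted): assignment maps expert id to a list
    of token indices, unrouted lists tokens no expert selected.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    covered = {t for tokens in assignment.values() for t in tokens}
    unrouted = [t for t in range(num_tokens) if t not in covered]
    return assignment, unrouted


scores = [
    [0.9, 0.8],   # token 0: popular with both experts
    [0.7, 0.6],   # token 1: popular with both experts
    [0.1, 0.2],   # token 2: low affinity everywhere
    [0.3, 0.05],  # token 3: low affinity everywhere
]
assignment, unrouted = expert_choice_route(scores, capacity=2)
```

Here both experts choose tokens 0 and 1, so load is perfectly balanced while tokens 2 and 3 need a fallback path (for example the residual connection alone).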

Hash routing

Deterministic hash of token ID to expert. Cheap, deterministic, but doesn't adapt to content.

Soft routing

Probabilistic mix of all experts weighted by softmax of router logits. Expensive (every expert processes every token) but smooth.

Sinkhorn routing

Iterative balanced-assignment algorithm. Provides exact load balance at higher routing cost.
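
The balancing step can be sketched with plain Sinkhorn normalization: alternately rescale rows (tokens) and columns (experts) of a positive score matrix until each token distributes a total mass of 1 and each expert receives an equal share. This is a minimal illustration with hypothetical names, not a production router, and it stops at the balanced soft plan rather than a hard assignment.

```python
import math


def sinkhorn_plan(scores, iterations=50):
    """Return a soft token-to-expert plan with balanced expert load.

    scores[t][e] is the router logit of token t for expert e.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    plan = [[math.exp(s) for s in row] for row in scores]
    col_target = num_tokens / num_experts  # equal share per expert
    for _ in range(iterations):
        # Row step: every token hands out a total mass of 1.
        for row in plan:
            total = sum(row)
            for e in range(num_experts):
                row[e] /= total
        # Column step: every expert receives col_target mass.
        for e in range(num_experts):
            total = sum(row[e] for row in plan)
            for row in plan:
                row[e] *= col_target / total
    return plan


scores = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.2, 0.1]]
plan = sinkhorn_plan(scores)
expert_load = [sum(row[e] for row in plan) for e in range(2)]
```

Each extra iteration is another pass over the token-by-expert matrix, which is the "higher routing cost" the trade-off refers to.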

Auxiliary-loss-free (DeepSeek)

DeepSeek-V3's contribution. Instead of an auxiliary balance loss term that perturbs training, learn a bias per expert and update it after each step to push toward uniform load. Cleaner training, equal or better balance.

MoE-Lightning and dynamic routing

Recent research on dynamic top-k (vary k per token based on difficulty). Adopted experimentally in some 2026 production models.

Routing strategy comparison

Strategy | Load balance | Compute cost | Training stability | Production use
Top-k (Switch) | Auxiliary loss | k× FFN | Mature | Most production
Expert-choice | By construction | k× FFN, token drops | Mature | Google research
Hash | Random uniform | k× FFN | Trivial | Rare
Soft | Perfect | All-experts | Stable | Rare (cost)
Sinkhorn | Exact | k× FFN + iter | Stable | Niche
Aux-loss-free | Bias-update | k× FFN | Cleaner | DeepSeek-V3, growing
Dynamic top-k | Variable | Variable | Research | Experimental

All-to-all communication deep dive

The all-to-all dispatch and combine is where MoE serving spends a meaningful fraction of its budget. Mechanics matter.

What all-to-all does

For each MoE layer, dispatch sends each token to its assigned expert(s). After expert computation, combine returns the results to the original rank. Two all-to-all calls per layer. With L layers and EP=N, that's 2L all-to-alls per forward pass.

Bandwidth math

For batch size B, hidden dim H, top-k routing, dispatch sends B × k × H × 2 bytes (BF16). For B=8192, H=4096, k=8: 8192 × 8 × 4096 × 2 = ~537 MB per layer. With 80 MoE layers (an illustrative depth; DeepSeek-V3 itself has 61 layers and a larger hidden size of 7168): ~43 GB of dispatch traffic per forward pass, and combine adds a similar volume on the way back. At 1.8 TB/s NVLink: ~24ms per forward pass just on dispatch. On 50 GB/s InfiniBand: ~860ms. The NVL72 advantage is concrete here.
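
The arithmetic above can be reproduced with a few lines of Python, which makes it easy to plug in a different batch size, hidden dimension, or link speed. The helper name is hypothetical, and the calculation uses the figures quoted above (B=8192, H=4096, k=8, 80 layers) and treats the link as fully utilized, which real collectives do not achieve.

```python
def dispatch_bytes(batch_tokens, hidden_dim, top_k, bytes_per_elem=2):
    """Bytes sent by one dispatch all-to-all (BF16 by default)."""
    return batch_tokens * top_k * hidden_dim * bytes_per_elem


per_layer = dispatch_bytes(batch_tokens=8192, hidden_dim=4096, top_k=8)
num_layers = 80                      # layer count used in the text above
total = per_layer * num_layers       # dispatch only; combine is extra

nvlink_ms = total / 1.8e12 * 1e3     # 1.8 TB/s NVLink, ideal utilization
infiniband_ms = total / 50e9 * 1e3   # 50 GB/s InfiniBand, ideal utilization

print(f"per layer: {per_layer / 1e6:.0f} MB, total: {total / 1e9:.1f} GB")
print(f"NVLink: {nvlink_ms:.1f} ms, InfiniBand: {infiniband_ms:.0f} ms")
```

Passing `bytes_per_elem=1` models an FP8 dispatch payload and halves every number.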

DeepEP library

DeepEP is DeepSeek's open-source all-to-all communication library specifically tuned for MoE. Supports FP8 dispatch (further reducing bandwidth), pipelined dispatch+expert+combine, and EP=256 deployments. Reference implementation runs on GB200 NVL72 racks for DeepSeek-V3.

FP8 all-to-all

Quantizing the dispatch payload to FP8 halves the bandwidth requirement. Output quality impact is small for inference (zero training quality concern because training uses BF16 dispatch). DeepEP and several other libraries support this.

Pipelined dispatch + compute + combine

Naive sequencing serializes dispatch, expert compute, and combine. Pipelined execution overlaps them: while expert compute runs on batch A, dispatch is in flight for batch B. Reduces effective all-to-all latency.

EP composed with TP and PP

Production deployments compose EP, tensor parallelism (TP), and pipeline parallelism (PP). Example: DeepSeek-V3 on NVL72 with EP=64, TP=2 within the FFN, PP=2 across racks. The composition affects which traffic uses NVLink vs InfiniBand. See distributed LLM training for the underlying parallelism mechanics.


MoE inference engines compared

The serving stack matters as much as the architecture. Survey of 2026 engines.

vLLM

Open-source. MoE support added 2023; mature for Mixtral and similar. EP support for fine-grained MoE (DeepSeek-V3 scale) added through community plugins.

SGLang

Open-source. Strong MoE support including DeepSeek-V3. Used as the reference engine for DeepSeek's own deployments. Adopts the disagg-prefill-decode pattern naturally for MoE.

TensorRT-LLM (TRT-LLM)

NVIDIA's optimized engine. MoE support via custom kernels. Strong on H100 and B200; FP8 MoE optimized.

llama.cpp

CPU + small-GPU inference for MoE. GGUF format supports MoE. Used for local / edge inference of small MoE.

FasterMoE and MegaBlocks

Research-oriented MoE engines / kernels. MegaBlocks contributes block-sparse FFN kernels widely reused. FasterMoE focuses on dynamic expert placement.

Engine comparison

Engine | MoE support | EP scale | Quantization | Production use
vLLM | Mature | up to ~16 | INT8, FP8 | Broad
SGLang | Strong; DeepSeek-V3 | up to 256 | FP8, FP4 | Frontier MoE
TRT-LLM | Optimized | up to ~16 | FP8 best | NVIDIA-aligned
llama.cpp | Adequate | small | GGUF Q2-Q8 | Edge / local
DeepEP | Library only | up to 256 | FP8 | DeepSeek deployments
MegaBlocks | Kernels | n/a | n/a | Reused widely

MoE on Blackwell FP4

Blackwell (B100, B200, GB200 NVL72) adds FP4 (NVFP4) tensor core support alongside the FP8 support introduced with Hopper. MoE benefits in specific ways.

FP4 expert weights

Quantizing expert weights to FP4 halves HBM footprint vs FP8 and quarters it vs BF16. For 671B-parameter DeepSeek-V3, FP4 puts the weights at ~336 GB instead of ~671 GB at FP8 or ~1.34 TB at BF16. Fits more comfortably on NVL72. Output quality impact is small with proper calibration.
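
A back-of-the-envelope helper for the weight footprint at each precision. It is a sketch that counts raw weight bytes only; the function name is hypothetical, and real FP4/FP8 formats add a small overhead for scale factors that is ignored here.

```python
def weight_footprint_gb(total_params, bits_per_param):
    """Raw weight storage in GB (10^9 bytes), ignoring quantization scales."""
    return total_params * bits_per_param / 8 / 1e9


deepseek_v3_params = 671e9
footprint = {
    "BF16": weight_footprint_gb(deepseek_v3_params, 16),
    "FP8": weight_footprint_gb(deepseek_v3_params, 8),
    "FP4": weight_footprint_gb(deepseek_v3_params, 4),
}
for name, gb in footprint.items():
    print(f"{name}: {gb:.0f} GB")
```

Whatever HBM remains after the weights is what is left for KV cache and activations, so each halving of weight precision translates directly into batch or context headroom.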

FP8 all-to-all dispatch

Quantizing the dispatch payload to FP8 halves all-to-all bandwidth. Trivial in production; loss is negligible.

FP4 + FP8 + NVL72

The combined Blackwell MoE recipe: FP4 weights for storage, FP8 activations for compute, NVL72 for low-latency all-to-all. This is the reference pattern for serving DeepSeek-V3 on a single rack in 2026.

Quantization tradeoffs

Per-expert quantization (vs whole-model uniform) allows treating hot vs cold experts differently. The hottest experts use FP8 for quality; rarely-used experts can use FP4 or even lower. See quantization tradeoffs.


KV cache for MoE

Important detail often missed: MoE only affects the FFN block. Attention and KV cache are unchanged from dense models. Implications:

  • KV cache size is the same as a dense model with the same layer count and attention configuration (KV heads × head dim).
  • KV cache memory dominates serving cost for long-context MoE just as it does dense (see KV cache).
  • Cache eviction, paged KV (PagedAttention), and cache-affine routing apply identically.
  • The active-parameter count does not reduce KV cache size; only the attention configuration (layer count, KV heads, head dim, cache precision) does.

For a long-context MoE deployment, KV is often the dominant memory cost, with weights second.
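
To make the point concrete, here is the standard KV cache size estimate for multi-head or grouped-query attention. Expert count and top-k appear nowhere in it. The example configuration (32 layers, 8 KV heads, head dim 128) is an illustrative Mixtral-like assumption, and architectures that compress the cache need a different formula.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes of KV cache: one K and one V vector per layer, head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative Mixtral-like attention config at 32k context, BF16 cache.
one_sequence = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                              seq_len=32768, batch_size=1)
batch_of_64 = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                             seq_len=32768, batch_size=64)
print(f"{one_sequence / 2**30:.0f} GiB per sequence, "
      f"{batch_of_64 / 2**30:.0f} GiB for a batch of 64")
```

The cache grows linearly with both context length and concurrent sequences, which is why it overtakes the weights at long context and high batch.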


MoE failure modes in production

Production-specific issues that don't appear in single-node testing.

Router collapse

Auxiliary-loss balance is too weak; the router collapses to using one or two experts. Detection: per-expert traffic histogram skewed. Recovery: stronger aux loss, switch to aux-loss-free, or restart training.

Expert death

A specific expert never receives tokens and gradient updates die. In serving this is harmless; in fine-tuning it can leave permanent gaps.

All-to-all hot spots

Skewed routing causes one EP rank to receive disproportionate traffic. Detection: per-rank latency variance. Recovery: re-shard experts, expert replication, dynamic placement.

Capacity-factor mismatch

If capacity factor is set too low, token drops degrade quality. If too high, compute is wasted. Tuning per workload.
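
A minimal sketch of how capacity factor turns into a per-expert token limit and a drop count. The formula (capacity factor times the uniform share of routed tokens) is the common convention; the function names and the example load vector are illustrative assumptions.

```python
import math


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Max tokens one expert accepts: capacity_factor x its uniform share."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def count_drops(expert_loads, capacity):
    """Tokens routed to an expert beyond its capacity are dropped."""
    return sum(max(0, load - capacity) for load in expert_loads)


capacity = expert_capacity(num_tokens=1024, num_experts=8, top_k=2,
                           capacity_factor=1.25)
# Hypothetical skewed routing outcome for the 2048 routed token copies.
loads = [400, 300, 256, 256, 256, 256, 200, 124]
dropped = count_drops(loads, capacity)
drop_rate = dropped / sum(loads)
```

With these numbers only the hottest expert overflows. Raising the capacity factor to 1.6 would absorb it, at the price of padding every other expert's buffer to the larger size.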

EP=N scaling issues

At EP=256, communication becomes a substantial fraction of forward time. If fabric isn't NVL72-class, throughput collapses. Sizing must account for this.

LoRA + MoE conflicts

LoRA adapters on the experts add complexity; adapter must be loaded per active expert. See multi-tenant LoRA serving for the underlying serving pattern.


Cost-per-token math for MoE deployments

Worked example: DeepSeek-V3 (671B total, 37B active) on NVL72 rack.

  • GPU cost: NVL72 (72×B200) ~ $3M capex; $1500/hr cloud rental.
  • Throughput: ~600-1200 tokens/sec/GPU at moderate batch, depending on context length and decoding mode.
  • Aggregate: 72 × 800 = ~57,600 tokens/sec per rack.
  • Per-token cost (cloud, rental): $1500/hr / 57,600 tps / 3600s/hr = $7.2e-6 / token, i.e. ~$7 per million tokens.

Compare to dense Llama 70B on 8x H100: ~5000 tps, $30/hr, ~$1.7e-6 / token, ~$1.7 per million tokens. The dense model is cheaper per token but has much lower quality on hard tasks. The right comparison depends on the workload.

For inference cost economics at the platform level, MoE shifts the per-token cost curve favorably at high utilization and unfavorably at low utilization.
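
The worked example above, plus the utilization effect, as a small calculator. The function and its `utilization` parameter are illustrative. The inputs are the rental prices and throughputs quoted in this section, not measured figures.

```python
def cost_per_million_tokens(hourly_cost_usd, peak_tokens_per_sec, utilization=1.0):
    """USD per million generated tokens at a given average utilization."""
    effective_tps = peak_tokens_per_sec * utilization
    return hourly_cost_usd / (effective_tps * 3600) * 1_000_000


moe_rack = cost_per_million_tokens(1500, 72 * 800)   # DeepSeek-V3 on NVL72
dense_node = cost_per_million_tokens(30, 5000)       # dense 70B on 8x H100
moe_rack_half_idle = cost_per_million_tokens(1500, 72 * 800, utilization=0.5)

print(f"MoE rack: ${moe_rack:.2f}/M tokens")
print(f"Dense node: ${dense_node:.2f}/M tokens")
print(f"MoE rack at 50% utilization: ${moe_rack_half_idle:.2f}/M tokens")
```

Because the rack is a large fixed hourly cost, halving utilization doubles the per-token price, which is the sense in which MoE economics depend on keeping the hardware busy.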


Benchmarks: MoE serving throughput by config

Public benchmark numbers, 2025-2026 Q2:

Config | Hardware | Tokens/sec/GPU | Notes
Mixtral 8x7B BF16 | 8x H100 | ~3000 | vLLM, batch 32
Mixtral 8x7B FP8 | 8x H100 | ~5000 | TRT-LLM
Mixtral 8x22B FP8 | 8x H100 | ~1800 | vLLM, batch 16
DeepSeek-V3 FP8 | NVL72 | ~800 | SGLang, batch 64
DeepSeek-V3 FP4 | NVL72 | ~1100 | SGLang+DeepEP, batch 64
Llama-4 Maverick BF16 | 16x H100 | (vendor figures) | Meta tuned
Llama-4 Scout BF16 | 8x H100 | (vendor figures) | Open-weight
DBRX BF16 | 8x H100 | ~2500 | vLLM, batch 32

Numbers are illustrative and depend on context length and batch size. Vendor benchmarks may be more optimistic.


When to upcycle dense to MoE

Upcycling: take a trained dense model and split the FFN into multiple experts, retrain briefly to specialize.

When it works

  • Existing dense model has strong base quality.
  • Compute budget allows the additional MoE training.
  • Serving infrastructure supports the resulting fine-grained MoE.

When it doesn't

  • The dense model is too small; the MoE doesn't gain much capacity.
  • Serving infrastructure is dense-optimized; MoE adds operational complexity.

Snowflake Arctic upcycling

Snowflake Arctic (2024) upcycled their dense base into a hybrid MoE design. Production-deployable at moderate scale.

ALCO

Active research on "active learning" of which experts to upcycle from which dense weights.


MoE-specific FAQ

Q: How many experts is too many?

For most use cases, 8-16 experts is the sweet spot for serving simplicity. Beyond 64, you need rack-scale fabric (NVL72) to keep all-to-all from dominating. DeepSeek-V3's 256-expert design is at the upper bound of what's serving-viable in 2026.

Q: Does shared-expert design matter?

Yes. A shared expert (always-on) means every token gets a baseline transformation; the routed experts add capacity for specialization. Improves quality on hard tasks; small cost. DeepSeek and Hunyuan use it.

Q: How does MoE training stability compare to dense?

Less stable. Routing dynamics + aux-loss interactions cause loss spikes. Mitigations: z-loss (ST-MoE), aux-loss-free, careful warmup. Aux-loss-free has become the standard for new large MoE models.

Q: Can MoE be served on a single 8x H100 node?

Yes for moderate MoE (Mixtral 8x7B, 8x22B, DBRX). For DeepSeek-V3-class (671B), no — needs rack-scale.

Q: How does context length interact with MoE?

KV cache scales linearly with context; same as dense. MoE doesn't reduce this. For long-context MoE, KV cache dominates memory.

Q: What's the all-to-all overhead at EP=32?

For typical batch sizes and Mixtral-class hidden dims, 8-15% of forward time. At EP=256 it can be 30-50% on InfiniBand; under 10% on NVL72.

Q: Is FP4 production-ready for MoE?

Yes on Blackwell with proper calibration. Loss on standard benchmarks (MMLU, MMLU-Pro) is < 1pt vs BF16 with HQQ-class calibration. Used in DeepSeek-V3 FP4 deployments.

Q: Can I run MoE on AMD MI300X?

Yes, with some engines. vLLM has AMD support; performance gap vs H100/H200 closing but still trails. For frontier MoE (DeepSeek-V3 scale), NVIDIA NVL72 is the dominant deployment.

Q: How does MoE compare to MoR (mixture of routers)?

MoR uses multiple small routers instead of one large router. Active research; not yet in production large models.

Q: Does Mixtral's "8 experts" mean it has 8x dense parameters?

No. Mixtral 8x7B has ~47B total parameters, not 56B. The "8x7B" naming is misleading: only the FFN block is replicated into 8 experts per layer, while attention, embeddings, and normalization layers are shared across them.

Q: How is MoE different on TPU?

TPU pods (v4, v5p, v6) handle MoE differently because ICI fabric and XLA compilation differ from NVLink+CUDA. Google's MoE deployments (Gemini) use TPU; specifics not public.

Q: What's the impact of FP8 dispatch on quality?

Negligible in production; the dispatch payload is small magnitude and quantization noise is averaged across many tokens. Output quality difference vs BF16 dispatch is well below noise threshold.

Q: Can experts be dynamically loaded from CPU memory?

Yes for cold experts. Tradeoff: load latency. Some research deployments do this; production typically keeps all experts in HBM for latency.

Q: How does MoE interact with speculative decoding?

The draft model is typically dense (smaller, lower latency). The MoE is the target; verification batch size is high (good for MoE throughput) but routing pressure depends on draft acceptance rate. See speculative decoding.

Q: What's the maximum EP that makes sense?

Bounded by the all-to-all latency budget. With NVL72 (1.8 TB/s NVLink), EP=64 to EP=256 is workable. Without NVL72, EP=8 to EP=32 typically. Beyond these, all-to-all kills throughput.

Q: Are MoE models cheaper to train than dense models of similar quality?

Yes, that's the entire pitch. Quality-per-FLOP for training is better for MoE because only top-k experts compute per token. The savings can be 2-4× depending on architecture.

Q: Why doesn't OpenAI publish MoE details?

Most likely because their production models are MoE and architecture details are competitive information. Public details on GPT-4 et al. are limited; community speculation suggests MoE designs.


Composed parallelism: EP + TP + PP at frontier scale

Real frontier MoE deployments compose three parallelism axes: expert parallelism (EP), tensor parallelism (TP), and pipeline parallelism (PP). The composition determines which traffic uses NVLink, which uses InfiniBand, and where the bottlenecks sit.

Why composition matters

A single parallelism axis has limits. Pure TP scales sublinearly past the NVLink domain. Pure PP introduces pipeline bubbles. Pure EP at scale = all-to-all storms. Composing them lets each axis stay in its sweet spot.

DeepSeek-V3 on NVL72 (worked composition)

  • EP=64 (within NVL72): experts distributed across 64 of the 72 GPUs. All-to-all stays on NVLink (1.8 TB/s).
  • TP=2 within FFN: larger experts split across 2 GPUs. NVLink communication.
  • PP=1 (single rack): all layers fit in HBM on one NVL72.
  • For multi-rack: PP=2 or PP=4 across racks, with InfiniBand between racks.

Net: NVLink absorbs the per-token all-to-all; InfiniBand carries only the per-step pipeline transfer. Throughput is bandwidth-bound on NVLink in the FFN block and compute-bound in attention.

Sizing rules of thumb

  • EP up to one NVL72 domain (72 GPUs). Beyond that, cross-rack EP via InfiniBand is painful but possible.
  • TP up to NVLink domain (8 GPUs in H100 servers, full NVL72 in GB200).
  • PP across racks where layer count permits.
  • Combine to hit the per-rank memory budget.

Multi-rack scaling

For larger MoE (1T+ parameters, projected for late 2026), multi-NVL72 deployments are needed. Cross-rack EP suffers; cross-rack PP works. Hybrid: keep EP within rack, layer the model across racks. The wrinkle: any layer's all-to-all stays intra-rack, but the model is split across racks.


MoE serving operations playbook

Day-2 operational concerns specific to MoE.

Monitoring

Per-expert traffic histogram (detects router collapse), all-to-all latency p50/p99, capacity-factor utilization, token-drop rate (should be < 1%), per-rank EP latency variance.

Alerting

Expert traffic skew > 2× uniform: investigate router. All-to-all p99 > 2× p50: fabric or routing issue. Token drop > 5%: increase capacity factor.
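
The three alert rules above can be sketched as a single check over one monitoring window, with a routing-entropy helper alongside. The function names, alert labels, and input shapes are assumptions for illustration; the thresholds are the ones listed above.

```python
import math


def moe_alerts(expert_tokens, dropped_tokens, a2a_p50_ms, a2a_p99_ms):
    """Evaluate the skew, all-to-all tail, and token-drop rules for one window.

    expert_tokens: tokens the router assigned to each expert (before drops).
    """
    total = sum(expert_tokens)
    if total == 0:
        return []
    alerts = []
    uniform = total / len(expert_tokens)
    if max(expert_tokens) > 2 * uniform:
        alerts.append("expert_skew")      # investigate router
    if a2a_p99_ms > 2 * a2a_p50_ms:
        alerts.append("all_to_all_tail")  # fabric or routing issue
    if dropped_tokens / total > 0.05:
        alerts.append("token_drop")       # increase capacity factor
    return alerts


def routing_entropy(expert_tokens):
    """Shannon entropy (nats) of the expert load; low values suggest collapse."""
    total = sum(expert_tokens)
    probs = [count / total for count in expert_tokens if count > 0]
    return -sum(p * math.log(p) for p in probs)


skewed = moe_alerts([900, 50, 25, 25], dropped_tokens=10,
                    a2a_p50_ms=1.0, a2a_p99_ms=1.5)
congested = moe_alerts([250, 250, 250, 250], dropped_tokens=100,
                       a2a_p50_ms=1.0, a2a_p99_ms=3.0)
```

Entropy is useful as a trend line next to these threshold alerts: it falls gradually as routing concentrates, often before the 2× skew rule fires.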

Tuning knobs

Capacity factor (typically 1.25-2.0), all-to-all algorithm (NVLS vs ring), expert replication (replicate hot experts to multiple ranks), routing temperature for inference-time balance.

Hot-expert replication

Some production deployments replicate the most-trafficked experts to multiple EP ranks. Detection happens online; replication is dynamic. Adds memory cost but smooths skew.
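
A sketch of the routing side of replication: once a hot expert lives on several ranks, each token bound for it goes to whichever replica currently has the least queued work. The data structures and the function name are hypothetical. A real engine would track load per step rather than in a plain dictionary.

```python
def pick_replica(expert_id, replicas, rank_load):
    """Choose the least-loaded rank hosting expert_id and record the token.

    replicas: dict mapping expert id to the ranks holding a copy.
    rank_load: dict mapping rank to queued tokens (updated in place).
    """
    rank = min(replicas[expert_id], key=lambda r: rank_load[r])
    rank_load[rank] += 1
    return rank


# Expert 0 is hot and replicated on ranks 0 and 1; expert 1 lives on rank 2.
replicas = {0: [0, 1], 1: [2]}
rank_load = {0: 0, 1: 0, 2: 0}
placements = [pick_replica(0, replicas, rank_load) for _ in range(4)]
placements.append(pick_replica(1, replicas, rank_load))
```

The four tokens for the hot expert alternate between its two replicas, so neither rank sees more than half of the expert's traffic.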

Cache-affine routing

For inference with KV-cache, route requests with the same prefix to the same rank to maximize cache hits. See KV cache. Interacts with EP placement.
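
One simple way to get prefix affinity is to hash the shared prefix to a rank, so that requests with the same system prompt or conversation history land where the cached KV blocks already are. This is a minimal sketch with hypothetical names. It ignores load and rank failures, which a production router would have to handle.

```python
import hashlib


def rank_for_prefix(prefix_token_ids, num_ranks):
    """Map a token-id prefix to a serving rank deterministically."""
    key = ",".join(str(t) for t in prefix_token_ids).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_ranks


system_prompt = [101, 2023, 2003, 1037, 2291, 25732]
first_request = rank_for_prefix(system_prompt, num_ranks=8)
second_request = rank_for_prefix(system_prompt, num_ranks=8)
```

Both requests resolve to the same rank because the hash depends only on the prefix. A cryptographic hash is used here so the mapping is stable across processes, unlike Python's built-in `hash()` for strings.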

Graceful degradation

If an EP rank fails mid-serving, redistribute its experts to remaining ranks (with quality cost). Production designs include this fallback.

Rolling weight updates

For multi-tenant serving where new model versions are deployed, MoE rolling updates are harder than dense because EP groups must update atomically. Pattern: blue-green deployment.


2026 frontier MoE deployments

Specific production deployments visible in 2026.

DeepSeek (internal)

DeepSeek-V3 and R1 served on NVL72 racks with SGLang. Cost-per-million-tokens is the publicly notable headline: roughly 1/10 of GPT-4-class pricing.

Together AI

Hosts DeepSeek-V3, Llama-4, Mixtral on NVIDIA fabric. Public pricing reflects the cost structure.

Anyscale

Multi-tenant MoE serving with workload-aware routing. Customer base includes enterprise AI.

Hyperscalers (Azure, AWS, GCP)

Each offers various MoE models via their respective ML platforms. Azure ML, AWS Bedrock, Google Vertex.

Open inference networks

Several decentralized inference networks serve open-weight MoE (Mixtral, DeepSeek-V3, Llama-4 Scout). See decentralized GPU compute.

What's specific to 2026

Llama-4 Maverick and Scout launched 2025; production deployment matured into 2026. DeepSeek-R1 (reasoning MoE) ramped 2025; serving infrastructure caught up. GB200 NVL72 broadly available, enabling EP=64+ for large MoE. FP4 production deployments emerged.


MoE quality benchmarks vs dense at similar active-parameter count

The honest comparison: a MoE with X active parameters vs a dense model with X parameters. Public benchmark numbers, 2025-2026 Q2:

MMLU and MMLU-Pro

Model | Active params | Total params | MMLU | MMLU-Pro
Mixtral 8x7B | 12.9B | 47B | ~70 | ~38
Llama 3.1 8B (dense) | 8B | 8B | ~66 | ~32
DBRX | 36B | 132B | ~74 | ~44
Llama 3.1 70B (dense) | 70B | 70B | ~83 | ~52
Mixtral 8x22B | 39B | 141B | ~78 | ~48
DeepSeek-V3 | 37B | 671B | ~88 | ~63
Llama 3.1 405B (dense) | 405B | 405B | ~89 | ~62

Headline: DeepSeek-V3 (37B active) reaches dense-405B-equivalent quality on benchmarks. MoE's win is real at high total-parameter counts. At low total params (Mixtral 8x7B), the MoE is at best slightly better than the equivalent dense.

Hard reasoning benchmarks

Model | GPQA | AIME | LiveCodeBench
Mixtral 8x7B | low | low | low
DeepSeek-V3 | ~60 | high | strong
DeepSeek-R1 (reasoning) | ~70+ | very high | very strong
GPT-4o (architecture undisclosed) | ~50 | high | strong
o1 / o3 (reasoning) | very high | very high | very strong

Reasoning models (R1, o-series) outperform non-reasoning at hard benchmarks even at similar active parameter counts. MoE plus reasoning post-training is the 2026 frontier recipe.

Latency comparison

Model | TTFT 8x H100 | TPOT 8x H100 | Notes
Llama 3.1 8B dense | 50ms | 8ms | Reference fast
Mixtral 8x7B | 70ms | 14ms | MoE overhead
Llama 3.1 70B dense | 200ms | 35ms | Reference slow
DeepSeek-V3 on NVL72 | 250ms | 30ms | Comparable to 70B dense

DeepSeek-V3's serving latency on NVL72 is competitive with Llama 70B dense, despite ~10× the total parameters, because per-token active compute is in the same range (37B active vs 70B dense).



MoE serving on alternative hardware

NVIDIA NVL72 is the reference. Alternatives have specific trade-offs.

AMD MI300X / MI325X

192 GB HBM per MI300X and 256 GB per MI325X (more than H100's 80 GB or H200's 141 GB). 8x MI300X servers can hold larger MoE models without TP. Software stack (ROCm, vLLM on ROCm) maturing through 2025-2026. Production deployments at AMD-backed partners. Trade-offs: less mature MoE kernel optimization, fewer FP4 features than Blackwell, all-to-all bandwidth depends on Infinity Fabric (within node) and Ethernet (cross-node).

Intel Gaudi 3

Intel's accelerator. 128 GB HBM per Gaudi3. Smaller production deployment vs NVIDIA. Software ecosystem (SynapseAI) less mature for MoE.

Google TPU pods (v5p, v6)

Google's own MoE deployments (Gemini, internal models) use TPU. Inter-chip interconnect (ICI) is a torus (3D on v5p), a different topology from NVLink's switched fabric, so MoE all-to-all maps onto it differently. Production performance on TPU is excellent for Google's models; less so for ported models.

Apple silicon (M-series for inference)

Limited to small MoE that fits unified memory. Used in on-device experiments. Not a frontier serving target.

MoE hardware comparison

| Hardware | HBM per accelerator | All-to-all fabric | MoE production scale |
|---|---|---|---|
| NVIDIA H100 | 80 GB | NVLink/NVSwitch | Mixtral, moderate MoE |
| NVIDIA H200 | 141 GB | NVLink/NVSwitch | Mixtral 8x22B, DBRX |
| NVIDIA B200 (NVL72) | 192 GB | NVLink (rack-wide) | DeepSeek-V3-class |
| AMD MI300X | 192 GB | Infinity Fabric | Up to Mixtral 8x22B |
| Intel Gaudi 3 | 128 GB | Limited | Niche |
| Google TPU v5p | 95 GB | ICI torus | Internal (Gemini) |
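The HBM column translates into a rough lower bound on how many accelerators a model needs just to hold its weights. The sketch below is a back-of-the-envelope estimate under stated assumptions: weights shard evenly, and only 70% of HBM is budgeted for weights (the `weight_budget` default is a hypothetical figure leaving room for KV cache and activations). Real deployments replicate attention and shared weights, so actual requirements are higher.

```python
import math

# HBM per accelerator in GB, copied from the hardware table above.
HBM_GB = {
    "H100": 80,
    "H200": 141,
    "B200": 192,
    "MI300X": 192,
    "Gaudi 3": 128,
    "TPU v5p": 95,
}

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_gb(total_params_b: float, precision: str) -> float:
    """Weight footprint in GB for a model with total_params_b billion parameters."""
    return total_params_b * BYTES_PER_PARAM[precision]


def min_accelerators(
    total_params_b: float,
    precision: str,
    hbm_gb: float,
    weight_budget: float = 0.7,  # assumed fraction of HBM available for weights
) -> int:
    """Smallest accelerator count whose weight budget covers the model."""
    usable_gb = hbm_gb * weight_budget
    return math.ceil(weight_gb(total_params_b, precision) / usable_gb)


if __name__ == "__main__":
    for hw, hbm in HBM_GB.items():
        n_v3 = min_accelerators(671, "fp8", hbm)     # DeepSeek-V3, 671B total
        n_8x22 = min_accelerators(141, "fp8", hbm)   # Mixtral 8x22B, 141B total
        print(f"{hw:8s} DeepSeek-V3 fp8: >= {n_v3:2d}   Mixtral 8x22B fp8: >= {n_8x22}")
```

The count is a capacity floor only. Whether the resulting group can run the all-to-all efficiently depends on the fabric column, which this estimate does not model.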

Per-architecture serving recipes

Specific recipes for the major models.

Mixtral 8x7B on 8x H100

vLLM or TRT-LLM. EP=8, no TP, no PP. FP8 weights. Batch 32-64. Throughput ~3000-5000 tps. Reference simple deployment.

Mixtral 8x22B on 8x H100

vLLM with FP8. EP=8, no TP, no PP. Larger experts, tighter HBM budget. Batch 16-32. Throughput ~1500-2500 tps.

DBRX on 8x H100

vLLM with FP8. EP=8 (16 experts, two per GPU); top-4 routing makes all-to-all more demanding. Batch 16-32. Throughput ~2000-3500 tps.

DeepSeek-V3 / R1 on GB200 NVL72

SGLang with DeepEP. EP=64, TP=2 within FFN. FP8 (or FP4 for memory-tight). Batch 32-64. Throughput ~600-1200 tps per GPU. Single rack serves the full model.
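An EP degree fixes how many experts live on each GPU and therefore where each routed token is sent. The sketch below assumes a simple contiguous placement (expert IDs assigned to ranks in blocks); production systems may use other layouts or replicate hot experts. The 256 routed experts used for DeepSeek-V3 come from its published architecture, and all function names here are illustrative.

```python
def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Experts hosted by each EP rank under an even split."""
    if ep_size < 1 or num_experts % ep_size != 0:
        raise ValueError("ep_size must be positive and divide num_experts")
    return num_experts // ep_size


def rank_of_expert(expert_id: int, num_experts: int, ep_size: int) -> int:
    """EP rank that owns expert_id under contiguous block placement (assumed)."""
    if not 0 <= expert_id < num_experts:
        raise ValueError("expert_id out of range")
    return expert_id // experts_per_rank(num_experts, ep_size)


def dispatch_counts(topk_expert_ids, num_experts: int, ep_size: int) -> list:
    """Count (token, expert) pairs sent to each rank in the dispatch all-to-all.

    topk_expert_ids holds one list of chosen expert IDs per token.
    """
    counts = [0] * ep_size
    for token_experts in topk_expert_ids:
        for expert_id in token_experts:
            counts[rank_of_expert(expert_id, num_experts, ep_size)] += 1
    return counts


if __name__ == "__main__":
    # Mixtral 8x7B recipe: 8 experts, EP=8.
    print("Mixtral experts per GPU:", experts_per_rank(8, 8))
    # DeepSeek-V3 recipe: 256 routed experts, EP=64.
    print("DeepSeek-V3 experts per GPU:", experts_per_rank(256, 64))
    # Two tokens with toy routing decisions (top-2 here for brevity).
    counts = dispatch_counts([[0, 1], [4, 255]], num_experts=256, ep_size=64)
    print("pairs sent to ranks 0, 1, 63:", counts[0], counts[1], counts[63])
```

The per-rank counts are the quantity that load balancing tries to keep even: a rank whose count spikes becomes the straggler for the whole all-to-all step.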

Llama-4 Maverick / Scout

Meta-tuned recipes. Open implementations via vLLM and SGLang. Specific tuning per Meta documentation.

Grok-1 on 8x H100

Open weights. 8 large experts, top-2. Heavier HBM than Mixtral. Throughput moderate.

Small MoE (Phi-3.5-MoE) on 2-4x H100

llama.cpp or vLLM. EP=2-4 (one EP rank per GPU). INT4 quantization typical. Latency-optimized. Throughput high in tokens-per-second-per-dollar.

Multi-LoRA on Mixtral

Mixtral base + per-tenant LoRA adapters. Adapter must be loaded for each active expert; storage doubles. See multi-tenant LoRA serving.