Prompt20
All posts
quantizationint4int8fp8fp4awqgptqnvfp4kv-cachesmoothquantspinquantinferenceguide

Quantization: The Complete Guide

The definitive guide to LLM quantization: weights vs activations, INT vs FP formats, AWQ and GPTQ, KV-cache quantization, where quality breaks, and how to choose a precision for production.

By Prompt20 Editorial · 92 min read

Quantization is the single largest lever for cheaper LLM inference, and the most over-confidently deployed. Strip a model from FP16 to INT4 and you cut HBM by 4× and decode bandwidth by roughly the same. Strip it carelessly and you lose ground on hard tasks while your aggregate benchmark numbers look fine.

The take: FP8 is free — production-ready, hardware-supported, near-lossless across the published literature. Ship it by default. INT4 is a real engineering project that needs workload-specific eval (AWQ and GPTQ work, but neither is automatic). Anything below 4 bits is research, not production, unless you've done quantization-aware training and have the eval rigor to back it up. The free lunch ends at 4 bits.

This guide explains what's actually happening at each precision and how to choose without trusting either the marketing or the worst-case scaremongering. The four bit-widths that matter (INT4, INT8, FP8, NVFP4); method-by-method differences across AWQ, GPTQ, SmoothQuant, SpinQuant; KV-cache quantization (KIVI, H2O, FP8 KV) — often a bigger production win than weight quantization; what "preserving quality" actually requires when you stratify by task difficulty; what vLLM, TensorRT-LLM, SGLang, and llama.cpp support today. Pair with KV cache, speculative decoding, and LLM serving to compound the savings.

Table of contents

  1. Key takeaways
  2. Mental model: quantization in one minute
  3. The quantization landscape in 2026
  4. What quantization does
  5. Weights vs activations vs KV cache
  6. Integer vs floating-point formats
  7. Scaling schemes: per-tensor, per-channel, per-group
  8. AWQ, GPTQ, SmoothQuant, and friends
  9. INT4 / INT8 / FP8 / NVFP4 compared
  10. AWQ / GPTQ / SmoothQuant / SpinQuant deep dive
  11. FP8 and FP4 in practice
  12. KV-cache quantization
  13. KV-cache quantization deep dive
  14. Quality preservation in practice
  15. Where quality actually breaks
  16. Hardware support in 2026
  17. Production deployments
  18. Choosing a precision
  19. What to measure
  20. Open problems
  21. Per-format deep dive: FP4, MXFP4, NVFP4, ternary, binary
  22. Per-technique deep dive: OmniQuant, SqueezeLLM, SpQR, QuIP, HQQ, EXL2, EXL3
  23. QAT and QLoRA: training-side quantization
  24. Calibration methods: per-channel, MSE/Hessian/percentile
  25. Activation outliers and SmoothQuant insight
  26. Attention quantization: FP8 attention on Hopper/Blackwell
  27. Per-model 2026 quantization choices
  28. Per-stack support deep dive: vLLM, SGLang, TRT-LLM, llama.cpp, ExLlama
  29. Inference benchmarks per format
  30. When INT4 breaks: math, coding, reasoning, long context
  31. FP4 on Blackwell production status
  32. Quantization for fine-tuning: QLoRA, ReLoRA, NEFTune
  33. Quantization with batching
  34. Accuracy recovery techniques
  35. Engineering economics of self-quantization
  36. Quantization safety: refusal behavior
  37. 2026 papers and what's next
  38. The bottom line
  39. FAQ
  40. Glossary
  41. References
  42. Per-format deep dive with bit-budget math
  43. Per-technique catalog
  44. KV cache quantization in depth
  45. Attention quantization
  46. Per-stack capability matrix
  47. Inference benchmarks by precision
  48. Quantization decision matrix
  49. Quantization safety considerations
  50. 2026 quantization research highlights
  51. Hardware-specific quantization paths
  52. Quantization for specific workloads
  53. Quantization in fine-tuning workflows
  54. Quantization deployment checklist

Key takeaways

  • Quantization saves HBM (4-8× smaller weights at INT4) and HBM bandwidth (proportional speedup on decode).
  • FP8 is the safe default in 2026. Production-ready, hardware-supported, near-lossless on most models.
  • INT4/FP4 is the cost-aggressive choice. Real quality risk; requires calibration and workload-specific eval.
  • KV-cache quantization is the largest single win for long-context serving. Often more valuable than weight quantization.
  • Outliers in activations are the central technical problem. Methods (AWQ, SmoothQuant) succeed largely by handling them.
  • Quality breaks first on hard tasks (math, code, long-context recall) while average benchmark numbers look fine. Eval on your workload, not the leaderboard.
  • Recommendation: FP8 for production; W4A16 (4-bit weights, 16-bit activations) when memory is tight; below 4 bits only with QAT and careful validation.

Quantization at a glance

Question Quick answer
Just need to ship something cheap? FP8 W8A16, FP8 KV
Memory is the hard constraint? INT4 AWQ W4A16
On Blackwell, want frontier cost? NVFP4 with mixed precision on outer layers
Serving > 32k context? Add FP8 or INT4 KV
Below 4 bits? QAT or accept quality cliff
MoE? Per-expert calibration, FP8 default
Edge/CPU? llama.cpp GGUF Q4_K_M
Don't have time to evaluate? FP8. Just FP8.

Mental model: quantization in one minute

The named problem is the precision-throughput tradeoff, with a sharp version called the calibration cliff. Every bit you remove from a weight shrinks HBM and the bytes the decoder has to stream, but each bit also makes the network worse — usually imperceptibly, until at some bit-width and some distribution it suddenly is not. Quantization is the engineering of where that cliff sits and how far you can walk toward it without falling off.

Think of quantization as JPEG for weights. Like JPEG, you pick a quality level (precision), exploit the fact that the signal is not uniform (a few weights and activations are outliers, most are small), and accept that some inputs will be reconstructed less well than others. AWQ and SmoothQuant are basically smarter colour spaces; per-group scaling is the equivalent of per-block DCT. The image still "looks the same" at INT4 in aggregate, but if you zoom into the hard pixels — math, code, long-context recall — you can see the artefacts.

Aspect FP16/BF16 baseline Quantized (e.g. INT4 W4A16)
Bytes per weight 2 0.5
HBM footprint, Llama-3 70B 140 GB ~38 GB
Decode bandwidth ~3.5× faster
Quality on easy tasks reference within noise
Quality on hard tasks reference task-dependent, can drop sharply
Calibration cost none minutes to a few GPU-hours

In code, the production surface is a one-liner:

# bitsandbytes 4-bit linear
import bitsandbytes as bnb
layer = bnb.nn.Linear4bit(in_features, out_features, quant_type="nf4")

Underneath: scale s per group, store q = round(w / s) in 4 bits, dequantize on-the-fly inside a fused GEMM. AWQ adds per-channel scaling chosen to protect the activation outliers; GPTQ adds a Hessian-aware error-compensation pass.

The sticky number: FP8 matches FP16 within ~0.5 perplexity on standard benchmarks at well under 1% of the cost of QAT. Below 4 bits, that free-lunch property disappears. The rest of this guide is the map of where the cliff actually lives.


The quantization landscape in 2026

By 2026 quantization is no longer "INT8 or bust" — it's a stack of formats, methods, and kernels with distinct sweet spots. A quick field map:

Bit widths and formats. FP16 / BF16 (baseline), FP8 (E4M3 / E5M2, the production default for both weights and activations), INT8 (well-supported, older lineage), INT4 (the standard aggressive choice via AWQ or GPTQ), FP4 (E2M1, emerging on Blackwell), NVFP4 (NVIDIA's block-scaled FP4 with FP8 micro-scales, the production frontier for 4-bit), MXFP4 / MXFP6 (the OCP Microscaling specification), NF4 (QLoRA's normal-float-4 used for fine-tuning), and the sub-4-bit territory (ternary, binary, 1.58-bit BitNet).

Methods. Post-training: AWQ (Lin et al., 2023, arXiv:2306.00978 — salient-channel preservation), GPTQ (Frantar et al., 2022, arXiv:2210.17323 — iterative error-compensating), SmoothQuant (Xiao et al., 2022, arXiv:2211.10438 — weight-activation magnitude rebalancing for W8A8), SpinQuant (Liu et al., 2024, arXiv:2405.16406 — rotation-based outlier suppression), QuaRot (related rotation-based method), LLM.int8() (Dettmers et al., 2022, arXiv:2208.07339). Training-aware: QLoRA (Dettmers et al., 2023, arXiv:2305.14314 — fine-tuning with NF4), full QAT.

KV-cache methods. FP8 KV (vendor-standard), INT4 KV with per-head scaling, KIVI (Liu et al., 2024, arXiv:2402.02750 — 2-bit asymmetric per-channel + per-token), H2O (eviction-based, not strictly quantization but pairs with it), GEAR (outlier-aware), and the various attention-sink-plus-eviction hybrids.

Kernels and libraries. Marlin (fast INT4 GEMM for Ada/Hopper), Machete (next-gen Marlin), CUTLASS FP8/FP4 paths, NVIDIA Transformer Engine with NVFP4 support (NVIDIA TE docs), llm-awq, AutoAWQ, AutoGPTQ, bitsandbytes (8-bit and NF4), ExLlamaV2 / V3 (consumer-GPU INT4), llama.cpp's GGUF Q-formats (Q2_K through Q8_0), and Apple MLX for Apple Silicon.

Serving stack support. vLLM (FP8, INT8, AWQ/GPTQ INT4, Marlin/Machete kernels, FP8 KV), TensorRT-LLM (NVIDIA's most-tuned stack — FP8 mature, NVFP4 emerging), SGLang (FP8 weight + KV, AWQ INT4), llama.cpp (the consumer reference, down to ~2 bits), Hugging Face Transformers (built-in AWQ/GPTQ/bitsandbytes), and lmdeploy.

Hardware substrates. H100 / H200 / B200 / GB200 NVL72 (FP8 tensor cores universal; FP4 on Blackwell), MI300X / MI325X (FP8 across the line, FP4 emerging), Apple Silicon (custom INT4/INT8 paths), and consumer Ada / RTX 50-series (INT4 via Marlin-class kernels).

The throughline: in 2026 you don't pick "8 bit or 4 bit" — you pick a (format, method, kernel, hardware) tuple. Picking any one in isolation leaves performance on the table.

A 2026 quick-reference matrix

Bit width Format Method Best kernel Typical quality regression
16 BF16 none cuBLAS / hipBLASLt 0 (baseline)
8 FP8 E4M3 per-tensor or per-channel Transformer Engine, FlashAttn-3 <0.3 pt
8 INT8 SmoothQuant for W8A8, per-channel for W8A16 CUTLASS, FBGEMM 0.3-0.7 pt
4 INT4 AWQ or GPTQ, per-group 128 Marlin / Machete 0.5-2 pt
4 NVFP4 block scale (16) with FP8 micro-scale Transformer Engine 0.5-1.5 pt
4 MXFP4 OCP block scale vendor-portable, less mature similar to NVFP4
2-3 various QAT or KIVI for KV research task-dependent, large
1-2 BitNet b1.58 trained-from-scratch only reference impl quality cliff for retrofits

The bit-width column is meaningless without the method column. AWQ INT4 ≠ naive INT4 by an order of magnitude in quality.


What quantization does

A neural network stores numbers — weights, activations, gradients — in some numeric format. The default is FP16 or BF16: 16-bit floating point. Each number takes 2 bytes.

Quantization swaps these for lower-precision representations. INT8 uses 1 byte. INT4 uses half a byte. FP8 uses 1 byte but with a floating-point structure. FP4 uses half a byte.

The savings come in three places:

1. Memory. Smaller numbers, less HBM. A 70B model is 140 GB in FP16, 70 GB in FP8 / INT8, 35 GB in INT4, 17.5 GB in FP4. The smaller version fits on smaller GPUs and frees HBM for KV cache.

2. Bandwidth. Decode is bottlenecked by reading the weight matrix from HBM. If weights are 4× smaller, the read is 4× faster, and decode throughput rises accordingly. This is usually the largest practical speedup. (See our LLM serving guide for where this fits in the overall stack.)

3. Compute. Lower-precision matmuls are faster on hardware that supports them — FP8 tensor cores, INT4 tensor cores. For inference (small-batch decode), this matters less than bandwidth; for prefill and training, it matters more.

The trade is quality. Quantization is approximation: you're representing the original numbers with fewer bits, so you lose some information. Whether the lost information matters depends on which numbers, how much loss, and what the model was using them for.

Why memory savings translate to throughput

The arithmetic is simple but worth being explicit about. For decode (the dominant cost at scale), the GPU's job per token is to read all the weights from HBM, multiply them against a single token's activation vector, and write the result. The runtime per token is bounded below by (weight_bytes) / HBM_bandwidth. If you halve weight_bytes, you halve the lower bound. For a 70B model in BF16, that floor is ~42 ms/token on H100; in FP8 it drops to ~21 ms/token; in INT4 to ~10.5 ms/token. With batching, the same weight read serves more tokens, but the floor still applies per-batch. This is why quantization is the single largest lever on decode cost — it directly attacks the bandwidth bottleneck. See our KV cache memory math for the related KV-side numbers.

Quantization vs distillation vs pruning

Quantization is one of three orthogonal techniques to make models cheaper. Distillation trains a smaller student from a larger teacher; pruning removes weights entirely; quantization keeps all weights but at lower precision. The three compose: a distilled 7B model can be INT4-quantized and pruned. In practice, distillation provides the largest quality-preserving cost reduction (often 5-10× cheaper for ~95% of the quality), quantization is the next-largest (2-4× for ~98%), and pruning trails (1.3-1.5× for ~98%). For production teams, distillation + quantization is the standard stack; pruning is rarely worth the engineering. See our synthetic data and distillation guide for the distillation side.


Weights vs activations vs KV cache

Three things in a transformer can be quantized, with different trade-offs.

Weights

The model's parameters. Fixed after training, quantized once offline, loaded once per inference replica.

  • Pros: Permanent memory savings. Lower bandwidth on every decode step.
  • Cons: Approximation error baked in. Recovery requires re-quantization.
  • Common precisions: FP16, BF16, FP8, INT8, INT4, FP4.

Activations

Intermediate values produced during the forward pass. Computed fresh per request, per layer.

  • Pros: Lower-precision compute (e.g., INT8 matmul). Less HBM traffic for intermediate state.
  • Cons: Outliers in activations are common and large. Quantizing them aggressively often hurts quality more than weight quantization at the same bit count.
  • Common precisions: FP16 (default), FP8, INT8. Below INT8 is risky.

KV cache

Per-token K and V tensors stored for attention. Grows with sequence length.

  • Pros: Huge memory savings on long-context serving — KV often dominates HBM. Faster attention reads.
  • Cons: Errors compound through attention. Per-head and per-channel sensitivity varies.
  • Common precisions: FP16 default, FP8 (production-safe), INT4 (aggressive but viable with care).

Naming conventions

Common shorthand for "what's quantized at what":

  • W16A16 — weights and activations both FP16/BF16. The baseline.
  • W8A16 — INT8 or FP8 weights, FP16 activations. Conservative quantization.
  • W4A16 — 4-bit weights, FP16 activations. The popular production choice.
  • W8A8 — both weights and activations at 8 bits. Aggressive.
  • W4A4 — both at 4 bits. Frontier of viability.
  • KV8 / KV4 — KV cache at 8 or 4 bits.

Most production deployments quantize weights and KV cache aggressively but keep activations at higher precision.

Why weight-only quantization is easier than activation quantization

Weights are static. You can analyze their distribution offline, run any calibration you want, and pick scales optimally without time pressure. Activations are produced fresh per forward pass and vary with input — a math prompt produces different activation magnitudes than a chat prompt. Activation quantization at low bit counts requires either online scale estimation (slow, lossy) or extensive calibration that covers the input distribution (drift-prone). The asymmetry explains why W4A16 is production-standard while W4A4 remains frontier engineering: weights are friendly to quantize, activations are hostile.

Practical implications of the W*/A* matrix

Choice Memory savings Compute speedup Typical engineering cost
W8A16 (FP8 weight) 2× weights small (decode is BW-bound) trivial
W4A16 (AWQ/GPTQ) 4× weights 1.8-2× decode medium
W8A8 (SmoothQuant) 2× weights, 2× activations meaningful in prefill medium
W4A8 4× weights, 2× activations strong in both phases hard
W4A4 4× both best on FP4-capable HW research-grade
W4A16 + FP8 KV 4× weights, 2× KV strong, KV is huge gain medium
W4A16 + INT4 KV 4× weights, 4× KV largest practical hard, quality risk

The unique observation: KV quantization is often more impactful than weight quantization at long context, because per-request KV cache can exceed weight footprint above ~100k tokens.


Integer vs floating-point formats

Two ways to spend a fixed number of bits, with very different precision distributions.

Integer formats (INT8, INT4)

Distribute representable values uniformly across a range. INT8 with symmetric quantization represents 256 values evenly between -127 × scale and 127 × scale.

  • Precision is constant across the range.
  • Outliers far from the bulk of the distribution waste range (or force the scale large, reducing precision for the bulk).
  • Hardware support: universal on modern GPUs.

Floating-point formats (FP8, FP4)

Distribute representable values logarithmically — more density near zero, less far from it. FP8 has 256 values but most are clustered near zero.

  • Precision is high where activations and weights cluster (near zero) and low at the extremes.
  • Handles outliers gracefully without crushing the bulk precision.
  • Hardware support: FP8 is universal on H100/H200/B200/MI300X. FP4 has growing support; not yet universal.

Two FP8 flavors

NVIDIA exposes two FP8 formats:

  • E4M3 (4-bit exponent, 3-bit mantissa): smaller range, finer precision. Used for weights and activations forward-pass.
  • E5M2 (5-bit exponent, 2-bit mantissa): larger range, coarser precision. Used for gradients.

For inference, E4M3 is the typical choice.

When to pick which

For inference, FP8 generally outperforms INT8 at the same bit count because neural network values cluster near zero where FP8 has more precision. INT8 has a longer hardware history and well-tuned kernels.

Why FP8 wins on near-zero distributions

Neural network weights and activations are not uniformly distributed across their range — they cluster heavily near zero, with a long tail of outliers. Most weight values in a trained LLM fall within ±0.3 of zero. INT8 with symmetric scaling allocates 256 evenly-spaced levels across the range, so the dense region near zero gets the same precision as the sparse tails. FP8 E4M3, in contrast, has 240 of its 256 values within ±240 (with denser spacing near zero), giving it ~5× more precision in the dense region. The math literally aligns FP8's resolution with where the data lives. INT8 only catches up if you use per-channel or per-group scaling aggressively enough to compensate.

At 4 bits, the same logic favors FP4, but kernel quality is the deciding factor: a well-tuned INT4 path can outperform a poorly-tuned FP4 path. In 2026, the choice is roughly:

  • FP8: production default. Best quality-per-bit at 8 bits.
  • INT8: viable, well-supported, slightly worse quality than FP8.
  • INT4 with AWQ/GPTQ: production standard for 4-bit quantization. Well-tuned kernels available.
  • FP4: emerging. Better theoretical quality than INT4; kernel ecosystem still developing.

Scaling schemes: per-tensor, per-channel, per-group

Quantization isn't just "represent each number with fewer bits." Every quantized value is paired with a scale factor (and possibly a zero point) that maps it back to its original range. The granularity of those scales matters enormously.

Per-tensor

One scale for an entire weight tensor (a whole linear layer's matrix). Simplest, fastest, lowest metadata overhead.

  • A single outlier in the tensor forces the scale up, reducing precision for everything else.
  • Quality at low bit counts suffers severely.

Per-channel

One scale per output channel of a weight matrix. Each row of the matrix gets its own scale.

  • Channels with naturally larger weights get larger scales; channels with smaller weights get tighter scales.
  • Modest metadata overhead, substantial quality improvement at INT8/INT4.

Per-group

One scale per group of weights — typically groups of 32, 64, or 128 elements along the input dimension.

  • Finer-grained than per-channel; captures local weight variation.
  • More metadata but much better quality at low bit counts.
  • Standard for production INT4 (group size 64 or 128 is common).

Double quantization and metadata savings

Per-group quantization at group size 64 adds about 3% metadata overhead (one FP16 scale per 64 INT4 values = 16 bits / 256 bits = 6.25% if naive; more like 3% with packed layouts). The scales themselves can be quantized — "double quantization" stores group scales as INT8 with a per-tensor scale of their own, halving the metadata cost again. QLoRA (Dettmers et al., 2023) introduced this for fine-tuning; modern serving stacks support it for inference too. The saving is small (1-2% of total weight bytes) but free quality-wise.

Asymmetric vs symmetric quantization

Symmetric quantization maps the value range [-max, +max] to [-127, +127] (for INT8) with no zero point. Asymmetric uses [min, max] mapped to [0, 255] with a zero point offset. Symmetric is cheaper at runtime (no zero-point math), works well when the distribution is centered near zero (which most LLM weights are), and is the production default. Asymmetric helps when the activation distribution is skewed (post-ReLU activations, which are non-negative); some activation paths use asymmetric INT8 even when weights stay symmetric.

The trade-off

Granularity Metadata overhead Quality at INT4 Notes
Per-tensor ~0% Poor Rarely used at low bits
Per-channel ~0.5% Good Common at INT8
Per-group 128 ~1.5% Very good Standard for INT4
Per-group 64 ~3% Excellent Aggressive quality
Per-group 32 ~6% Near-lossless Diminishing returns

Smaller group sizes recover more quality but cost more metadata storage and bandwidth. The sweet spot for production INT4 is group size 64-128.


AWQ, GPTQ, SmoothQuant, and friends

The named quantization methods differ mostly in how they decide what to keep precise.

GPTQ

Iteratively quantize columns of the weight matrix, updating remaining columns to compensate for accumulated error. Calibration-based: uses a small dataset to measure activation distributions and pick scales.

GPTQ's iterative algorithm derives from Optimal Brain Surgeon (Hassibi & Stork, 1993), adapted to weight quantization. It computes an inverse-Hessian approximation per layer, uses it to weight the quantization error, and updates remaining columns to minimize that weighted error. The math is elegant; the practical cost is the offline compute (hours for a 70B model on an 8x H100 box) and the calibration data dependency (typically 128-512 examples from a representative dataset).

  • Strong quality at INT3/INT4.
  • Slow to apply (compute-intensive).
  • Per-group scaling.

AWQ (Activation-aware Weight Quantization)

Key insight: most quantization error comes from a small fraction of "salient" weight channels. Identify those, give them higher effective precision (via per-channel scaling that preserves them), and quantize the rest aggressively.

  • Faster to apply than GPTQ.
  • Often matches or beats GPTQ at INT4.
  • Doesn't require iterative compensation; one-pass.

SmoothQuant

Targets activation quantization. Some activation channels have large magnitude; quantizing the whole activation tensor uniformly loses precision on the bulk. SmoothQuant rebalances weight and activation magnitudes so neither has extreme outliers.

  • Enables W8A8 with minimal quality loss.
  • Complementary to weight-only methods.

LLM.int8() / outlier handling

Earlier method: detect outlier activation channels at runtime and keep them in FP16 while quantizing the rest. Slower but very robust.

The mixed-precision matmul pattern (most channels INT8, outlier channels FP16) was the first practical demonstration that 8-bit serving could be near-lossless on large models. Modern methods (SmoothQuant, AWQ, SpinQuant) outperform it by addressing the outliers structurally, but LLM.int8() remains a useful baseline and a working fallback when better methods aren't available for a specific model architecture.

QAT (Quantization-Aware Training)

Train the model with quantization simulated in the forward pass. Best quality at very low bit counts (3-bit, 2-bit territory).

  • Most expensive option (full retraining).
  • Required for sub-4-bit quantization without quality cliffs.
  • Rarely done on already-trained large models due to cost.

Light-touch QAT: fine-tuning the quantized model

A middle ground: take a PTQ-quantized model, freeze the quantization, and fine-tune the unquantized parameters (e.g., the LoRA adapters in QLoRA) to recover quality. Cheaper than full QAT, more effective than pure PTQ. Useful when you can afford a few hundred GPU-hours of fine-tuning but can't afford to retrain from scratch. The pattern is common in production: quantize a base model with AWQ, fine-tune with QLoRA on your task, deploy the quantized base + adapter combination.

Block FP and microscaling: the OCP push

The Open Compute Project's Microscaling specification (MXFP4, MXFP6, MXFP8, MXINT8) standardizes block-scaled low-precision formats with vendor portability as a goal. NVIDIA's NVFP4 is a variant; AMD has published similar block formats. The specification fixes block size (32 elements) and scale format (E8M0 — 8-bit exponent, no mantissa) to ensure cross-vendor weight exchange. Production adoption is gated on multi-vendor kernel availability; ROCm and CUDA both ship support but the ecosystem is younger than the NVIDIA-only NVFP4 path. Worth tracking for cross-vendor deployments.

Method-selection rule of thumb

  • INT8 weight-only: per-channel without anything fancy works fine.
  • INT4 weight-only: AWQ or GPTQ with per-group 64 or 128.
  • W8A8: SmoothQuant.
  • W4A4 or lower: QAT, or accept quality loss.

INT4 / INT8 / FP8 / NVFP4 compared

A more granular side-by-side than the WA notation conveys. Numbers are typical, not universal — kernel maturity and model shape shift them.

Format Bytes/value Quality vs FP16 Hardware Best paired with Production status
BF16 2 baseline universal reference
FP8 E4M3 1 ~lossless on most models H100+, MI300X+ per-tensor or per-channel scaling default in 2026
INT8 1 ~0.5 pt regression typical universal SmoothQuant for W8A8, per-channel weight-only otherwise mature, slightly worse than FP8
INT4 (AWQ/GPTQ) 0.5 0.5–2 pt regression on hard tasks Marlin/Machete kernels on Ada/Hopper/Blackwell per-group 64 or 128 production standard for 4-bit
FP4 E2M1 0.5 better than INT4 on outlier-heavy layers B200, MI325X (emerging) per-group + double-quant scales emerging
NVFP4 0.5 + micro-scales near-FP8 quality at FP4 cost B200 / GB200 with Transformer Engine block-scaled with FP8 micro-scales production-frontier 2026
NF4 0.5 designed for fine-tuning, not serving universal QLoRA paired-up training-side default
MXFP4 / MXFP6 0.5 / 0.75 similar to NVFP4 OCP-spec, vendor-portable block scales standards-track

Practical guidance by format.

  • FP8 (E4M3) weight + FP8 KV. The new default. Near-lossless, 2× bandwidth and memory savings vs BF16, and supported across all current data-center GPUs. Per-tensor scaling works; per-channel improves quality on edge cases. Production stacks have well-tuned kernels.
  • INT8. Older lineage with the most mature tooling. Slightly worse quality than FP8 because integer formats don't match the near-zero distribution of weights/activations. Still useful on older hardware (A100) or where INT8 kernels are better-tuned than FP8 for your specific shapes.
  • INT4 with AWQ or GPTQ. The practical sub-byte format in 2026. Per-group 64 or 128. Marlin (Frantar/Castro/Alistarh, github.com/IST-DASLab/marlin) and Machete kernels deliver near-memory-bandwidth-bound decode throughput. AWQ is faster to apply, often slightly better at the same bit-budget; GPTQ has a longer track record.
  • NVFP4. NVIDIA's block-scaled FP4 introduced with Blackwell. Each block of 16 values has an FP8 micro-scale; the values themselves are E2M1 FP4. The block scaling recovers most of the precision lost in pure FP4 while keeping the 0.5-byte payload. Transformer Engine (docs.nvidia.com/deeplearning/transformer-engine) is the canonical implementation. Best for B200 / GB200 NVL72 deployments where you want maximum throughput per HBM byte.
  • FP4 (plain E2M1). Without block scaling, pure FP4 is fragile. Production deployments using FP4 are almost always using a micro-scaled variant (NVFP4 or MXFP4), not raw FP4.
  • MXFP4 / MXFP6. The Open Compute Project's Microscaling format. Vendor-portable, similar to NVFP4 in spirit. Adoption growing through 2026.
Quantization tradeoffs in 2026 — infographic explaining what quantization does (mapping FP16/FP32 to INT8/INT4 with a scale and zero-point), the quantization levels table (INT8, INT4, INT3, INT2, INT1 with size, speedup, and accuracy impact), where tradeoffs come from (information loss, outliers, distribution shift, hardware reality) and mitigations (per-channel scales, outlier handling, calibration, QAT, kernel optimization), a Llama-2-7B quality vs compression example, PTQ vs QAT, what to consider (goal, model and data, hardware, evaluation), the practical workflow (define goal → calibration set → INT8 → INT4 → lower with QAT → ship and monitor), and quick recommendations.
Quantization tradeoffs at a glance. Quantization maps FP16 / FP32 weights and activations to lower-precision integers (INT8, INT4, INT3, INT2) using a scale and zero-point. INT8 gives most of the benefit with very small quality loss; INT4 is a sweet spot for many workloads with 4× size reduction; INT3 / INT2 are high-risk without QAT. Tradeoffs come from information loss, outliers, distribution shift, and hardware reality — mitigated by per-channel / per-group scales, outlier handling, diverse calibration data, QAT at extreme low-bit, and kernel optimization. Use PTQ to explore, QAT to push the limits. Quantization isn't about using the fewest bits — it's about using the right bits for the right model, data, and hardware.

AWQ / GPTQ / SmoothQuant / SpinQuant deep dive

The most-deployed methods, with what they actually do under the hood and where each wins.

AWQ (Activation-aware Weight Quantization)

Insight: most LLM weight quantization error comes from a small fraction (~1%) of "salient" weight channels — those that multiply against the largest activation magnitudes. AWQ identifies salient channels by calibrating on a small dataset, then applies a per-channel scaling that pre-amplifies the salient weights (compensating later by dividing the activation), so they survive quantization with higher effective precision.

The salient-channel idea is elegant: by scaling salient weights up before quantization (so they round to larger integers) and then dividing the corresponding activation channel by the same factor at runtime, AWQ effectively gives those channels higher precision without changing the runtime cost. The per-channel scaling factors are absorbed into the surrounding layer norms or linear projections so the runtime has no additional operations. AWQ thus pays its quality bill in offline calibration and zero runtime overhead, which is part of why it has become the production default.

  • One-pass, no iterative compensation. Fast to apply.
  • Per-group 128 typical, group 64 for more aggressive recovery.
  • Mature kernel stack (llm-awq, AutoAWQ, vLLM, TensorRT-LLM).
  • Sweet spot: W4A16 production deployments. Reference: arXiv:2306.00978.

GPTQ

Insight: quantize columns of the weight matrix one at a time, and after quantizing each column, update the remaining (still-FP16) columns to compensate for the error introduced. This is Optimal Brain Surgeon adapted to weight quantization, with a closed-form Hessian-based update per column.

  • Iterative, compute-intensive (minutes to hours per model).
  • Excellent quality at INT3 / INT4.
  • Per-group scaling.
  • Mature kernel ecosystem (AutoGPTQ, ExLlama, Marlin-compatible).
  • Sweet spot: where you can afford the offline compute and want the best-quality 4-bit. Reference: arXiv:2210.17323.

SmoothQuant

Insight: in W8A8, weight quantization is easy (weights are well-behaved) but activation quantization is hard (a few channels have huge outliers). SmoothQuant transfers some of the magnitude from activations to weights via a per-channel scaling: y = (X / s) · (s · W). Activations now have smaller dynamic range; weights have a slightly bumpier range that's still easy to quantize.

  • Enables W8A8 at near-FP16 quality.
  • Complementary to AWQ/GPTQ (which handle weight-only).
  • Often combined: SmoothQuant for activation handling + per-channel weight quant.
  • Sweet spot: when you actually need INT8 activation compute (prefill, training-side). Reference: arXiv:2211.10438.

QuaRot vs SpinQuant: rotation methods compared

QuaRot uses fixed (Hadamard) rotations; SpinQuant learns the rotation through a brief optimization. Hadamard rotations are free at compute time (a Hadamard transform has efficient implementation) and require no calibration data, making QuaRot attractive for offline-only environments. SpinQuant's learned rotation produces measurably better quality, especially at W4A4, but requires a calibration step. Both approaches absorb the rotation into adjacent linear layers so there is no runtime overhead. The combined SpinQuant + AWQ + per-group recipe is the closest the field has come to "lossless W4A4" on standard chat workloads, though math and code benchmarks still show measurable regressions.

SpinQuant

Insight: outliers are an artifact of the basis. Rotate the weight and activation tensors by a learned (or Hadamard-based) orthogonal matrix and the outliers spread out across channels. After rotation, both weights and activations quantize much better.

  • Enables W4A4 with much better quality than naive approaches.
  • Rotation is a matrix multiplication absorbed into adjacent linear layers — no runtime cost.
  • Pairs with AWQ/GPTQ for the post-rotation quantization step.
  • Sweet spot: aggressive W4A4 or W4A8 frontier deployments. Reference: arXiv:2405.16406. Related: QuaRot.

When AWQ outperforms GPTQ and vice versa

AWQ tends to win on outlier-heavy models (those with large activation magnitudes in specific channels — most modern dense Llama-class models). GPTQ tends to win on models with smaller activation variance, on quantizations below INT4 (3-bit, 2-bit territory where its iterative error compensation helps), and on long-context-sensitive workloads where its hessian-aware approach better preserves attention. AWQ is dramatically faster to apply (minutes vs hours for a 70B); for many teams that alone settles the question. Production rule: start with AWQ; if you observe per-layer regressions on hard tasks, try GPTQ; if neither works, escalate to SpinQuant + AWQ or QAT.

Quantization noise vs softmax stability

Attention softmax is the most quantization-sensitive operation in a transformer. Small noise in pre-softmax logits gets exponentiated; a logit perturbation of 0.5 in a high-magnitude attention head can flip the softmax distribution from "attend to token 5" to "attend to token 47." This is why KV quantization is so much more sensitive than weight quantization, and why per-head scaling matters so much for KV. Some heads are robust (they attend diffusely); others are sharp (they attend to one or two tokens) and any quantization noise flips them. Per-head sensitivity profiling is part of any serious KV quantization deployment.

Method-selection cheat sheet

Goal Method
W8A16 production default FP8 with per-tensor scaling
INT8 weight-only (older HW) Per-channel quantization, no special method
W4A16 production INT4 AWQ or GPTQ, per-group 128
W8A8 with INT8 compute SmoothQuant + per-channel weight
W4A4 frontier SpinQuant + AWQ/GPTQ
W4 with QAT Train with simulated quantization
Fine-tuning quantized models QLoRA / NF4

FP8 and FP4 in practice

FP8 has become the production default for inference at 8 bits because:

  1. Quality: typically indistinguishable from FP16 on standard benchmarks. Numerical evaluation: cosine similarity of outputs > 0.999 on most layers.
  2. Hardware: H100, H200, B200, MI300X all support FP8 tensor cores. Throughput at FP8 is 2× FP16 on Hopper.
  3. Tooling: NVIDIA's Transformer Engine and FP8 paths in TensorRT-LLM, vLLM, and SGLang are mature. (For training-side FP8, see our mixed-precision training guide.)

FP8 quantization is simpler than INT4 quantization. Often it's "pick a scale per tensor, quantize, done." Per-channel FP8 helps quality further but isn't always necessary.

FP4 is the emerging frontier:

  • B200 has FP4 tensor cores. 4× FP16 throughput at FP4.
  • Per-group FP4 quantization (group size 32 or 64) recovers quality lost in pure tensor-level FP4.
  • Kernel maturity is uneven; ecosystem catching up through 2026.
  • The published numbers from NVIDIA on B200 NVFP4 show ~1.8× higher tokens/s/GPU on Llama-3 70B vs FP8 on the same hardware, with sub-1 point quality regression on MMLU. The gap to FP8 has narrowed faster than many expected.
  • Software cost: NVFP4 requires Transformer Engine, FlashAttention-3, and the latest TensorRT-LLM build. Stacks pinned to older versions cannot use it.

The practical question for FP4 is whether the additional engineering and validation cost is worth ~2× the speedup over FP8. For frontier hosted serving at huge scale: yes. For most deployments: not yet.

FP8 calibration in practice

Even though FP8 is "near-lossless," small choices matter. The standard recipe: per-tensor scaling for weights (each linear layer's weight matrix gets one scale), per-token scaling for activations during prefill, per-batch scaling for activations during decode. Calibration runs a few hundred representative examples through the model in BF16 and records the 99.9th-percentile absolute value per tensor; the FP8 scale is set so that value maps to the maximum representable FP8 magnitude. Aggressive teams use 99th percentile and accept some clipping; conservative teams use max and waste some dynamic range. The difference is real but small (sub-0.5 pt typically).

NVFP4 block scaling under the hood

NVFP4 stores values in blocks of 16. Each block has a single FP8 (E4M3) scale factor. The FP4 values within the block are interpreted relative to that scale. The math: each FP4 value is a 4-bit code (16 possible values) mapped to specific positions in the E2M1 grid, then multiplied by the block's FP8 scale to recover the real value. The block scale costs 1 byte per 16 values = 0.5 bits per value of metadata overhead, so the effective payload is 4.5 bits per weight. Compared to per-tensor FP4 (4 bits flat), the extra 0.5 bit/value is what recovers near-FP8 quality. Transformer Engine ships kernels that do all of this in-register, so the runtime cost is negligible.


KV-cache quantization

For long-context serving, KV cache dominates HBM. A single 128k-token request on a 70B model produces 43 GB of KV cache at FP16. Quantizing it is one of the largest practical wins.

FP8 KV cache

The conservative production choice. Quality impact is small and well-understood. Throughput improvement on attention reads ~2×. Memory savings 2×.

Most major serving stacks (vLLM, TensorRT-LLM, SGLang) support FP8 KV cache as a flag.

INT4 KV cache

Aggressive. 4× memory savings, 4× attention bandwidth.

Quality risks:

  • Errors compound through attention (each future token attends to all past quantized KVs).
  • Per-head sensitivity varies — some heads tolerate aggressive quantization, others don't.
  • Numerical instability in softmax with very low-precision attention scores.

Mitigations:

  • Per-head and per-channel scaling.
  • Keep certain "important" heads at higher precision.
  • Quantize K but not V (V values often more sensitive).

Specialized KV-cache schemes

  • KIVI: 2-bit KV cache with per-token and per-channel grouping. Frontier research.
  • GEAR: outlier-aware KV compression.

For most production: FP8 KV is safe and substantial. INT4 KV is workload-dependent and requires evaluation.

KV cache cost dominates at long context

The crossover where KV cache exceeds weight memory for a 70B model: ~256k tokens at FP16, ~512k at FP8. Beyond that, the KV cache alone outweighs the model. For 405B with extended context windows, the crossover happens earlier. The implication: at long context, KV quantization is the cost lever, not weight quantization. Halving weight bytes saves you 70 GB on a 70B model; halving KV bytes saves you potentially hundreds of GB at 1M-token context. For a deep dive see our KV cache memory math.


KV-cache quantization deep dive: KIVI, H2O, and friends

For workloads with long context or many concurrent requests, KV cache often dominates HBM. KV quantization is therefore frequently a bigger production win than weight quantization — and it deserves the same kind of method-level attention.

FP8 KV (production default)

The conservative, broadly supported option. Each K and V tensor stored in FP8 (E4M3 typical). 2× memory savings, 2× attention bandwidth, near-lossless on most benchmarks. Supported flags in vLLM, SGLang, TensorRT-LLM. If you serve sequences longer than ~8k tokens and aren't running FP8 KV, you're leaving the easiest production win on the floor.

INT4 KV

4× memory savings, 4× attention bandwidth. Quality risks:

  • Errors compound through attention (every future token attends to the quantized history).
  • Softmax in attention is exponentially sensitive to small differences in large logits.
  • Per-head sensitivity varies — some heads are robust, others aren't.

Practical mitigations: per-head scaling (separate scales for each attention head), per-channel scaling on the head dimension, keeping a "sink" of the first few tokens at higher precision (pairs with the StreamingLLM observation that initial tokens act as attention anchors), and quantizing K more aggressively than V (V values are typically more sensitive in practice).

KIVI

KIVI (Liu et al., 2024, arXiv:2402.02750) is the canonical reference for 2-bit KV. Key insight: K and V have different distributions and should be quantized differently. K is best quantized per-channel (along the head dimension); V is best quantized per-token (along the sequence dimension). With this asymmetric scheme plus group scaling, 2-bit KV becomes viable on many benchmarks. Tuning-free — no calibration required.

The 8× memory savings of 2-bit KV over BF16 KV is the largest practical lever for million-token-context serving. KIVI-class methods are still research-grade for hard tasks (math, long-context retrieval) but for chat and casual long-context workloads they are increasingly production-deployed. Pair with KV cache memory math to size deployments correctly.

H2O

H2O (Zhang et al., 2023, arXiv:2306.14048) isn't strictly quantization but addresses the same problem: KV cache size. It identifies "heavy hitters" — tokens that have historically received high attention scores — and evicts the rest. Combined with FP8 or INT4 KV, you compound the memory savings. Risky for retrieval-heavy workloads where the evicted token might become important later.

StreamingLLM and attention sinks

StreamingLLM (Xiao et al., 2023, arXiv:2309.17453) observed that LLMs accumulate disproportionate attention on the first few "sink" tokens. Keeping those at full precision while quantizing or windowing the rest stabilizes quality on streaming workloads. Common pattern in chat deployments.

GEAR

GEAR adds an outlier-aware residual on top of quantized KV — quantize aggressively, then store a low-rank correction for the outliers. Closes the quality gap on hard tasks at modest extra cost.

Asymmetric K/V quantization in practice

K and V tensors behave differently. K is multiplied against query vectors and goes through softmax, where small errors get exponentiated. V is multiplied against attention weights after softmax, where errors propagate linearly. Empirically, V tolerates more aggressive quantization than K. The practical recipe: K at FP8 or per-channel INT4 with finer scaling; V at INT4 or even INT3 with coarser scaling. Some production stacks use FP8 K + INT4 V as a compromise that captures most of the memory savings of full INT4 KV with most of the quality stability of full FP8.

Sliding-window KV plus quantization

For very long contexts, combining KV quantization with a sliding-window attention pattern is often more effective than aggressive quantization alone. Keep the most recent 4-8k tokens at FP8 (high precision, no compression), older tokens at INT4 (more compression, less recency-critical), and the oldest tokens evicted or summarized. The window-aware tiering can deliver effective 8-16× KV compression on streaming workloads. See our long-context attention guide for the broader sliding-window picture.

Choosing a KV strategy

Workload Default
Short context (≤8k), high QPS FP8 KV
Long context (32k–128k), latency-sensitive FP8 KV + paged attention
Long context, memory-constrained INT4 KV with per-head scaling + sink tokens
Streaming chat INT4 or 2-bit KV + StreamingLLM sinks
Research / aggressive KIVI 2-bit + GEAR residuals

Quality preservation in practice

The literature is full of "quantized to within 1 point of FP16 on MMLU." Production teams keep finding that this doesn't predict what users notice. A few patterns from real deployments:

Stratify by difficulty. Aggregate scores average across easy and hard items. Quantization degrades hard items disproportionately — math word problems, multi-step code, long-context retrieval — while easy items are unaffected. Pick out the hard subset of each eval and report scores on it separately.

Eval on instruction-following and tool calls. A quantized model often answers fluently but starts ignoring constraints from the system prompt or producing slightly wrong tool-call JSON. Both are user-visible and don't show up on MMLU.

Multi-language and multi-modal regressions. A model calibrated and evaluated on English text may regress noticeably on non-English languages or on multi-modal inputs. The outlier channels triggered by non-English tokens or vision-encoder outputs are different. For multi-language or multi-modal deployments, calibrate and evaluate on representative inputs across all modalities. See our multimodal serving guide for the modality-specific concerns.

Long-context regression is real. A model that's lossless at 4k can lose meaningful quality at 64k+ as quantization noise compounds across attention operations. Always include a long-context retrieval test if you serve long contexts. See long-context attention for the underlying dynamics.

Safety alignment is fragile. There's accumulating evidence that aggressive quantization can soften refusal behavior — the model still complies but with weaker boundaries. Test your jailbreak / red-team suite on the quantized model before shipping.

Compute-precision skew. Decode is bandwidth-bound, so weight precision matters most. Prefill is compute-bound, so activation precision (FP8 vs BF16 activations) matters more there. Some production stacks ship asymmetric precision: heavier quantization on the decode path, lighter on the prefill path. The complexity is high but the cost win on a long-context, low-output workload (RAG-style) can be 20-30%.

Distribution drift matters. Calibration sets capture a snapshot of activation distributions. Production traffic drifts (new languages, new tools, new prompt formats). Plan to re-calibrate periodically; some teams run a shadow pipeline that re-quantizes against the last week of traffic.

Compound regressions. Quantization stacks: weight INT4 + KV INT4 + activation INT8 each look fine in isolation but together can break in non-obvious ways. Test the actual production stack as a whole, not each piece individually.

A serviceable eval suite, in order of importance.

  1. Workload-representative production-trace replay.
  2. Per-task scores on hard subsets (math, code, multi-hop, long-context retrieval).
  3. Instruction-following / constraint adherence at long prompt lengths.
  4. Tool-call structural validity rate.
  5. Safety / refusal behavior on your jailbreak suite.
  6. Latency and throughput on production hardware with production batch sizes.
  7. Then, finally, aggregate benchmarks.

This is also why an investment in eval infrastructure compounds — every quantization rollout, every model swap, every kernel update wants the same suite.


Where quality actually breaks

Quantization fails in characteristic ways. Knowing them helps you predict and detect failures.

Outlier activations

Some activation values are vastly larger than the median. They appear in specific channels and specific layers. Uniformly quantizing the whole tensor crushes precision on the bulk while still failing to represent the outliers.

This is the main reason weight-only quantization (which doesn't quantize activations) is easier than weight-and-activation quantization. It's also why outlier-aware methods (AWQ, SmoothQuant) outperform naive approaches.

Quantitatively, outliers in production LLMs are typically 10-100× the median magnitude and concentrated in 0.1-1% of activation channels. The skew is workload-correlated — different prompts trigger different outlier channels, so calibration sets must cover the deployed input distribution. A common pathology: calibrate on English chat, deploy on multilingual or code workloads, see quality regress because the new workload activates different outlier channels not seen in calibration. The fix is broader calibration data or methods (SpinQuant, rotation-based) that are less sensitive to which channels are outlier.

Attention sensitivity

Attention softmax is exponentially sensitive to small differences in large logits. Quantization noise in attention scores can flip the softmax distribution. KV-cache quantization is the most error-prone area. The sensitivity is layer-dependent — middle layers tend to have more diffuse attention patterns that tolerate noise; some early and late layers have sharp attention to specific tokens that breaks under quantization. Per-layer KV precision (mixed FP8 and INT4 KV across layers) is an active production tuning lever.

Long-context drift

Errors compound across sequence length. A model that scores fine at 4k tokens may drift at 128k as quantization error accumulates per token. See our long-context attention guide for why long sequences amplify numerical noise. The needle-in-haystack regression curve for INT4 KV typically looks like: ~99% recall at 4k, ~95% at 32k, ~85% at 128k, < 70% at 1M. Compare to FP16 KV which holds > 95% recall through 128k on most modern long-context models.

Math and code

Quantized models often degrade on math and code tasks before they degrade on chat. The hypothesis: these tasks require precise intermediate reasoning, and small quantization errors propagate to wrong answers. GSM8K and MATH typically show 1-3 point drops at INT4; HumanEval and MBPP show 2-5 point drops. The harder benchmarks (AIME, competition math, complex code synthesis) show double-digit drops in extreme cases. Reasoning models suffer most because their chain-of-thought multiplies any per-step quantization error across many steps. See our reasoning model serving guide for why this matters more for o1/R1-style serving.

Structured output

JSON generation, tool calls, code outputs. Small errors in token probabilities can flip "valid output" to "invalid syntax." A model that's 99.5% valid JSON at BF16 may drop to 96% at INT4 — the headline number looks fine but the 3% failure rate breaks downstream pipelines that assume parseable output. Constrained-decoding libraries (Outlines, XGrammar, lm-format-enforcer) partially compensate by masking invalid tokens, but they don't fix the deeper issue of degraded reasoning about what to output.

Instruction following on long prompts

A common observation: a quantized model still answers fluently but ignores some constraints from the prompt at longer lengths.

The benchmark-vs-reality gap

A model that scores within 0.5 points of FP16 on MMLU can lose 5+ points on hard tasks (math, code, long-context retrieval). Aggregate benchmarks average across many easy items; users notice the hard tails.

Layer-wise sensitivity

Not all layers are created equal. In practice, the first and last few transformer layers are more sensitive to quantization than the middle. The first layers do basic tokenization-adjacent processing where precision propagates downstream; the last layers produce the output logits where small errors flip the softmax. Many production stacks keep the first and last 2-4 layers at higher precision (FP8 or BF16) while quantizing the middle to INT4. The cost in memory is small (a 70B model has 80 layers, keeping 8 of them at higher precision adds maybe 10% to the weight footprint) and the quality recovery is consistent across benchmarks.

Cross-quantization compounding

A common production mistake: A/B test each quantization decision individually, then ship the combination. Each decision passes ("FP8 weights are fine", "INT4 KV is fine", "INT8 activations are fine") but the combination breaks because the errors compound. The fix is to always evaluate the final stack, not the individual components. Specifically, the interaction between activation quantization and KV quantization is non-additive — quantized activations produce quantized KV writes, which propagate through attention to amplify the activation error. Always evaluate the actual production stack.


Hardware support in 2026

Precision H100 H200 B200 MI300X MI325X
FP16/BF16
FP8
INT8
INT4
FP4 ~ ~

Software/kernel maturity differs from hardware support. NVIDIA's stack is most mature for FP8 and INT4. AMD's ROCm has been catching up rapidly. Specific serving stacks have different sweet spots — check before assuming a hardware-supported precision has a tuned kernel for your model and shape.

Throughput numbers by chip and precision

For 70B decode, single-stream, batch 32, ~4k context:

Hardware BF16 FP8 INT4 (Marlin) NVFP4
H100 SXM 80 GB 28 tok/s 56 tok/s 110 tok/s n/a
H200 SXM 40 78 155 n/a
B200 SXM 75 145 280 220
MI300X 32 64 120 n/a
MI325X 42 82 155 n/a
RTX 4090 (24 GB, doesn't fit full 70B) n/a n/a 60 (4-bit GGUF) n/a

Numbers are illustrative for the 2025-2026 kernel stack and shift quickly. The relative ratios (FP8 ≈ 2× BF16, INT4 ≈ 4× BF16) are robust; absolute numbers depend on kernel tuning. NVFP4 on B200 sometimes underperforms INT4 because NVFP4 kernels are newer and less aggressively optimized.

Kernel maturity by serving stack

Format vLLM 0.8 TRT-LLM 0.18 SGLang 0.4 llama.cpp
BF16 baseline Mature Mature Mature Mature
FP8 W8A16 Mature Mature (best) Mature n/a (CPU/consumer)
FP8 W8A8 Mature Mature Mature n/a
INT8 W8A8 (SmoothQuant) Mature Mature Mature Q8_0 GGUF
INT4 AWQ Mature (Marlin) Mature Mature Q4_K_M GGUF
INT4 GPTQ Mature Mature Mature Q4_K_S GGUF
NVFP4 Beta Mature on B200 Beta n/a
INT4 KV Beta Mature Mature n/a (consumer)
FP8 KV Mature Mature Mature n/a
2-3 bit quantization No No No Mature (Q2_K, Q3_K)

The pragmatic split: NVIDIA-only frontier production runs TensorRT-LLM; cross-vendor production runs vLLM; consumer and CPU deployments run llama.cpp; SGLang is a strong choice for prefix-heavy workloads.


Production deployments

Hosted providers. OpenAI, Anthropic, Google publish little about precisions used. Latency and pricing patterns suggest a mix of FP8 and INT4 weight quantization with FP8 KV cache. Anthropic's prompt-caching feature pricing is consistent with cache values stored at FP8 or INT4; the cache read price at ~10% of input price reflects bandwidth savings from the smaller payload. OpenAI's o-series pricing is consistent with INT4 weights and FP8 KV on the cheaper tiers. Treat these as informed speculation; the published numbers match no other interpretation neatly.

vLLM. Supports FP8, INT8, INT4 (AWQ, GPTQ), Marlin and Machete kernels for fast INT4 decode.

TensorRT-LLM. NVIDIA's stack. Strong FP8 support; INT4 via SmoothQuant and AWQ; FP4 emerging.

SGLang. FP8 weight and KV, INT4 weight (AWQ), KV-cache quantization options.

llama.cpp. The reference for aggressive quantization (down to 2-3 bits) on consumer hardware. Quality varies by quant; community-tested recipes.

Hugging Face Transformers. Built-in support for AWQ, GPTQ, bitsandbytes 4-bit and 8-bit.

llm-compressor (Neural Magic). Production-grade quantization library covering FP8, INT8, INT4, sparse-quantization combinations. The standard tooling for teams that want to produce their own quantized checkpoints rather than consume community ones.

Hugging Face Quanto and Optimum. Cross-vendor quantization with bitsandbytes underneath for the common path; useful for cross-platform deployments where TRT-LLM is not available.

Public model quantizations worth knowing

A few reference points for what production-quality open-weight quantizations look like in May 2026:

  • Llama 3.1 70B Instruct — AWQ INT4 at ~35 GB, near-lossless on MMLU and HumanEval. The de facto baseline for "is quantization safe."
  • Llama 3.1 405B — FP8 quantization is the only practical way to serve it on a single 8x H200 node; INT4 AWQ pushes it to 4x H200 with measurable quality drop on math.
  • DeepSeek-V3 671B — FP8 native (released in FP8 from training). INT4 quantizations exist but require careful per-expert handling.
  • Qwen 2.5 72B — AWQ INT4 and GPTQ INT4 both widely deployed, similar quality to the Llama series.
  • Mixtral 8x22B — AWQ INT4 with per-expert calibration; GPTQ variants also available. Per-expert quantization matters more here than for dense models.
  • Phi-3-medium and small models — surprisingly resilient to INT4 and even INT3 with QAT.

The general lesson: for dense models in the 7-70B range, AWQ INT4 with per-group 128 is a safe default that the community has validated extensively. For MoE, per-expert quantization is the next step.

Mixed precision in production

The most aggressive production deployments use heterogeneous precision across the model. A representative stack:

Component Precision
Embedding BF16
First 2 layers FP8
Middle layers (3-78) INT4 AWQ
Last 2 layers FP8
LM head BF16
KV cache FP8
Activations (decode) BF16
Activations (prefill) FP8

This is more engineering than uniform quantization but recovers quality on the sensitive parts. The memory savings are 80-90% of uniform INT4 with maybe a third of the quality regression. Most production stacks support some form of mixed precision via per-layer config.


Choosing a precision

A pragmatic decision tree:

  1. Need maximum quality, no memory pressure: BF16. Done.
  2. Want some efficiency, no quality risk: FP8. Verify on your eval suite (will likely pass).
  3. Memory-bound, can tolerate small quality regression: W8A16 with FP8 weights, or W4A16 with AWQ INT4.
  4. Memory severely constrained, willing to invest in validation: W4A16 with AWQ or GPTQ, FP8 KV cache. Eval thoroughly.
  5. Long-context serving with KV-cache pressure: any of the above, plus FP8 or INT4 KV cache.
  6. Frontier cost-cutting on hosted serving: FP4 weights, FP8 KV. Frontier engineering. Pair with speculative decoding for compounding throughput wins.
  7. Edge / consumer hardware: INT4 or below via llama.cpp. Accept quality loss.

Pick the highest precision that fits, not the lowest that nominally works.

Cost-vs-quality at 70B-class

A rough mapping for a 70B model deployed on H200 at production scale, normalized to BF16:

Precision Memory (GB) $/M output tokens Quality regression Notes
BF16 baseline 140 1.0× 0 Reference
FP8 W8A16 70 0.55× < 0.3 pt Production default
FP8 W8A16 + FP8 KV 70 + smaller KV 0.50× < 0.5 pt Best safe stack
INT4 AWQ W4A16 35 0.35× 0.5-1 pt Memory-pressure default
INT4 AWQ + FP8 KV 35 + smaller KV 0.32× 0.5-1.5 pt Aggressive but standard
INT4 AWQ + INT4 KV 35 + much smaller 0.28× 1-3 pt, workload-dep Frontier cost
NVFP4 W4A16 (B200) 35 (+0.5/value scale) 0.25× 0.5-1 pt B200-only frontier
W4A4 NVFP4 + SpinQuant 17.5 0.20× 1-3 pt Frontier engineering

The cliff is between FP8 and INT4 (small quality drop, large cost drop). The next cliff is below INT4 (quality drop accelerates, cost drop diminishes). Most production teams sit between FP8 and INT4 AWQ. See our broader inference cost economics post for how quantization stacks with other levers.

When to skip quantization entirely

A short list. For tiny models (sub-3B) on consumer hardware where memory isn't the bottleneck, the kernel overhead of dequantization can outweigh the bandwidth savings — BF16 may be faster. For reasoning models where every token is high-value (o1-style chain-of-thought) and quality regressions compound across the chain, the cost of aggressive quantization can outweigh the savings. For agentic workloads with structured outputs where any tool-call malformation breaks the pipeline, conservative precision is worth the cost. For initial-deploy and prototyping, skip quantization until you have load and a quality baseline.


What to measure

A credible evaluation:

  • Workload-representative tasks. Not just MMLU; the things your users actually do.
  • Hard items separately. Stratify by difficulty. Aggregate scores hide tail behavior.
  • Long-context tasks if relevant. Quantization quality often degrades with context length.
  • Latency and throughput on your hardware. Theoretical bandwidth savings only show up if kernels are tuned.
  • Comparison to your prior precision. A/B vs the baseline you'd ship.

Skip:

  • Marketing tables from quantization paper authors.
  • "Within 1 point of FP16" claims on aggregate benchmarks.
  • Throughput numbers without specified hardware and kernel.

A repeatable quantization evaluation protocol

A practical protocol that catches most issues:

  1. Baseline. Lock down a BF16 reference deployment of your model on your hardware with your serving stack. Record latency, throughput, and per-task scores.
  2. Single-axis sweep. Quantize only weights (FP8, INT4 AWQ, INT4 GPTQ). For each, measure perplexity on workload-representative data, structured-output validity rate, and per-task scores stratified by difficulty.
  3. KV sweep. Add FP8 KV, then INT4 KV. Measure long-context retrieval accuracy and ITL.
  4. Combined. Run the candidate production stack (weights + KV + activation choices) against the same suite.
  5. Production canary. Deploy to 1-5% of traffic for a week. Watch for user-perceptible regressions (complaints, retry rates, downstream metrics).
  6. Re-calibration cadence. Set up a monthly perplexity check on production-trace replays to detect drift.

This is 3-5 days of work for a serious team. Skipping it produces the "ship and pray" failures that the field is full of.

Common quantization mistakes

Three recurring failure modes from production teams:

  1. Calibrating on the wrong data. Using a generic instruction dataset (Alpaca, OpenAssistant) for a domain-specific deployment (medical, legal, code). The outlier channels activated by domain data are different from those in the calibration set, and quantization fails on the deployed workload. Fix: calibrate on real production traffic samples.

  2. Skipping the long-context test. Running benchmarks at 4k context, deploying for 32k+ context, finding KV quantization regressions in production. Fix: always include a needle-in-haystack or RULER subset matching your deployed context length.

  3. Trusting the headline benchmark. A 0.3-point MMLU drop is real but uninformative. The relevant question is whether your users notice, which requires either trace replay or live A/B. Many "fine on MMLU" quantizations show measurable user-perceived regressions in production.

Tooling for evaluation

  • lm-evaluation-harness (EleutherAI) — standard for benchmark replay. Covers most public eval datasets cleanly.
  • HumanEval and EvalPlus — code-generation evaluations sensitive to quantization. EvalPlus tightens the test cases and catches regressions HumanEval misses.
  • MT-Bench and Arena-Hard — chat quality with LLM judges. Useful for spotting "fluent but worse" regressions.
  • Long-context evals — RULER, NIAH (needle-in-a-haystack), LongBench. KV quantization regressions show up here first.
  • Internal trace replay — your most important tool. Quantization regressions show up workload-specifically, and your traffic captures them better than any public benchmark.

Open problems

Sub-3-bit weight quantization. Quality cliffs at 2-bit are sharp. QAT helps; the open question is whether 1-2 bit is viable at frontier scale.

Activation quantization at INT4. W4A4 is the open challenge — viable in research, fragile in practice.

Mixed-precision routing. Different layers, different precisions, picked automatically. Manual versions exist; automated is research.

Calibration drift. Calibration uses a snapshot of activation distributions. Production traffic distributions drift. Re-calibration cadence is poorly understood.

Quantization of attention itself. Most quantization targets weights, activations, and KV; the attention scores and softmax outputs stay in higher precision. Quantizing them produces a sharper memory and compute win at the cost of accuracy. Active research; no production stack ships it broadly yet.

Cross-vendor format portability. A quantized model produced for one vendor's hardware (NVFP4 on NVIDIA) does not run unmodified on AMD or Apple Silicon. The OCP Microscaling spec aims to fix this; adoption is slower than the hardware. Production teams targeting multi-vendor often double-quantize, producing both NVIDIA-native and OCP variants.

KV-cache quantization for very long contexts. Errors compounding through millions of attention operations. Workload-specific tuning still needed.

Quantization-aware training at frontier scale. Pretraining a frontier model directly in INT4 or 2-bit is theoretically attractive (massive memory savings during training) but practically hard — the optimizer dynamics differ from BF16 training and the resulting loss curves are less predictable. BitNet b1.58 is the most-cited example of trained-from-scratch low-precision; scaling it to 70B+ parameters remains open.

Per-expert quantization for MoE. Naive uniform quantization across experts ignores that some experts handle harder distributions. Per-expert calibration is straightforward; per-expert precision selection (FP8 for hot experts, FP4 for cold) is an open optimization problem. See our MoE serving guide for the broader context.

Quantization for tool-augmented agents. Tool calls require precise JSON or function-call syntax; any structured-output corruption breaks the agent. The community has not built strong eval suites for quantization × agent reliability; this gap will close as agentic workloads grow. See our agent serving infrastructure guide.


Per-format deep dive: FP4, MXFP4, NVFP4, ternary, binary

Beyond FP8 and INT4, several lower-precision formats have emerged in 2024-2026.

FP4 (4-bit floating point)

A 4-bit floating-point format with 1 sign bit, 2-3 exponent bits, 0-1 mantissa bit. Variants: E2M1 (2 exponent, 1 mantissa) and E3M0 (3 exponent, 0 mantissa). The exponent range lets FP4 represent a wider dynamic range than INT4, useful for activations with outliers.

Quality at FP4 weight-only: comparable to INT4 with AWQ on most benchmarks, slightly better on math and code. Hardware support: Blackwell (B200) has native FP4 tensor cores; H100 and earlier emulate via FP8.

MXFP4 (Microscaling FP4)

MX format, Microsoft 2023. FP4 with a per-block scaling factor (typically per 32 elements) stored in a shared FP8 or BF16 scale. The block scaling captures outlier patterns more efficiently than per-tensor or per-channel scaling.

Quality at MXFP4: very close to FP8 on most benchmarks. MXFP4 is the format used in OpenAI's "fast" inference paths for some models in 2025.

NVFP4 (NVIDIA FP4)

NVIDIA's variant of MX-FP4, with per-block scaling tuned for Blackwell tensor cores. Blackwell's FP4 tensor cores natively support NVFP4. Quality and throughput are close to MXFP4; the difference is hardware-specific kernel optimization.

Ternary (1.58-bit)

BitNet b1.58 (Microsoft, 2024) showed that LLMs trained from scratch with ternary weights (-1, 0, +1) can match FP16 quality on benchmarks up to 3B parameters. The ternary representation requires 1.58 bits on average (log2(3)).

Production status as of May 2026: research-stage but with growing momentum. Falcon-Edge (Falcon 1B ternary) is the largest published deployment. The path to production requires native hardware support for ternary GEMMs; current GPU paths emulate and don't achieve the theoretical throughput.

Binary (1-bit)

The extreme: 1-bit weights (-1, +1). BitNet a4b4 demonstrated viability at small scale; quality drops significantly above 100M parameters with current techniques. Mostly research; not production-deployed for LLMs in 2026.

Format comparison table

Format Bits Quality vs FP16 Hardware native Best for
FP16/BF16 16 Baseline All modern Default
FP8 (E4M3) 8 -0 to -0.5 points H100+, MI300 Default for serving
INT8 8 -0 to -1 points All modern Older hardware
FP8 (E5M2) 8 -0.5 to -1 points H100+ Some KV cache uses
INT4 (AWQ) 4 -1 to -3 points Emulated, all Memory-constrained
FP4 / MXFP4 4 -0.5 to -2 points B200, emulated Future default
NVFP4 4 -0.5 to -2 points B200 Blackwell production
INT3 3 -3 to -7 points Emulated Research
Ternary (1.58-bit) ~1.58 -1 to -5 points (QAT) Emulated Research
Binary (1-bit) 1 -10+ points Emulated Research

Per-technique deep dive: OmniQuant, SqueezeLLM, SpQR, QuIP, HQQ, EXL2, EXL3

Beyond AWQ, GPTQ, and SmoothQuant, several other quantization methods are worth knowing.

OmniQuant

OmniQuant (2023) uses learnable clipping bounds and per-channel weight transformations to recover quality at INT4 and below. More sophisticated than AWQ; quality is comparable, sometimes slightly better. Slower calibration (hours vs minutes for AWQ on a 70B model). Niche; AWQ is the more practical default.

SqueezeLLM

SqueezeLLM uses non-uniform quantization (k-means-style clustering of weights) instead of uniform integer quantization. Better quality at low bits; runtime overhead for the lookup-based dequantization. Trades inference speed for quality; useful when memory is extremely constrained and slight slowdown is acceptable.

SpQR (Sparse-Quantized Representation)

SpQR (Dettmers et al., 2023) stores most weights at low precision (3-4 bit) and outlier weights at full precision (FP16) in a sparse format. The sparse outlier path adds complexity; quality recovery is good. Used in some research stacks; production adoption limited.

QuIP / QuIP#

QuIP (Quantization with Incoherence Processing) applies random rotations to weights and activations before quantization, reducing outlier sensitivity. QuIP# (2024) improves further with better incoherence processing. Best quality among 2-3 bit methods; runtime overhead from the rotation step.

HQQ (Half-Quadratic Quantization)

HQQ (Mobius Labs, 2024) is a fast, calibration-free quantization method. Quantizes a 70B model in 5-10 minutes (vs hours for AWQ/GPTQ). Quality is competitive at INT4; slightly worse at INT3. Used when fast quantization matters (e.g., adapting to new fine-tuned models frequently).

EXL2 / EXL3 (ExLlamaV2 / ExLlamaV3)

ExLlama's quantization format. EXL2 supports variable bits per layer (some layers at INT4, others at INT3, configured per the layer's quantization sensitivity). EXL3 (2025) extends to FP4 and adds GPU-optimized dequant kernels. Quality at the same average bit-rate is among the best; format is ExLlama-specific (less portable than GGUF or HuggingFace formats).

Technique-by-purpose

Method Best for Quality at INT4 Calibration time Production stack support
GPTQ Older default Very good Hours All major
AWQ Modern default Best Hours All major
SmoothQuant Activation quant Good Hours TRT-LLM, vLLM
OmniQuant Sub-4-bit Very good Hours-days Limited
SqueezeLLM Memory-extreme Good Hours Limited
SpQR Outlier-heavy Good Hours Limited
QuIP# 2-3 bit Best for sub-4 Days Research
HQQ Fast iteration Good Minutes Some
EXL2/3 ExLlama-only Very good Hours ExLlama

QAT and QLoRA: training-side quantization

Post-training quantization (PTQ, the default in production) operates on a trained model. Training-side quantization adapts the training process to produce a quantization-friendly model.

Quantization-Aware Training (QAT)

QAT inserts fake-quantization operations into the forward pass during training, so the model learns weights that survive quantization. Higher cost (training is more expensive) but better quality at low bits.

For production: QAT is used by frontier labs for FP8 and INT8 models that will be quantized for deployment. The quality preservation is excellent — QAT FP8 typically matches FP16 to within noise. Open-source QAT recipes (TorchAO, NVIDIA's QAT toolkit) make this accessible.

QLoRA

QLoRA (Dettmers et al., 2023) is fine-tuning on top of quantized base weights. The base model is loaded at NF4 (a 4-bit normalized float format); LoRA adapters are added in BF16. Memory required to fine-tune 70B drops from 280 GB (full FP16 fine-tune) to ~48 GB (QLoRA NF4 with 16-bit LoRA), making large-model fine-tuning feasible on a single 80GB GPU.

Quality of QLoRA fine-tunes: within 1-2% of full-precision LoRA fine-tunes on most benchmarks. The de facto standard for fine-tuning large models on commodity hardware.

ReLoRA

ReLoRA extends QLoRA with periodic merging of LoRA adapters back into base weights, allowing accumulating updates beyond LoRA's rank limit. Used for longer-running fine-tunes; production adoption growing.

NEFTune

NEFTune adds noise to embeddings during fine-tuning. Improves instruction-following quality. Often combined with QLoRA; nearly-free quality boost.


Calibration methods: per-channel, MSE/Hessian/percentile

Calibration is the process of choosing scale factors for quantization. The method substantially affects quality, especially at INT4 and below.

Min-max calibration

Use the per-tensor or per-channel min and max of activations on calibration data to set scales. Simple, fast, slightly suboptimal — outliers stretch the range and reduce precision for typical values.

Percentile-based calibration

Clip activations at the Nth percentile (typically 99.9% or 99.99%) before computing scales. Sacrifices precision on extreme outliers in exchange for higher precision on the typical range. Better than min-max for INT8 / INT4 weight + activation quantization.

MSE-minimizing calibration

Choose scales to minimize the mean-squared-error between the quantized and unquantized tensor over calibration data. Better quality than percentile in most cases; slower (requires search over scale candidates).

Hessian-based calibration (GPTQ-style)

GPTQ uses second-order (Hessian) information from calibration data to choose quantization rounding that minimizes the impact on subsequent activations. The mathematical foundation of GPTQ's superior quality over min-max methods.

Per-channel vs per-tensor vs per-group

Granularity of scales. Per-tensor: one scale for the whole tensor (smallest, lowest quality at low bits). Per-channel: one scale per output channel (better, standard). Per-group: one scale per group of 64-128 elements within a channel (best for INT4, used by AWQ and GPTQ). Per-token activations: dynamic per-token scales for activations.

The cost of finer granularity is metadata storage (extra scales) and dequant overhead. Per-group with group size 128 is the standard tradeoff for INT4 weight quantization in 2026.


Activation outliers and SmoothQuant insight

The challenge with quantizing activations (not just weights): a few channels in each activation tensor have values 100-1000× larger than the rest. These outliers force a wide quantization range, reducing precision for typical values.

The Hopper observation

Outlier channels are consistent: the same channels are outliers across different inputs. This consistency is the basis for outlier-aware techniques.

SmoothQuant

SmoothQuant (Xiao et al., 2022) shifts the quantization difficulty from activations to weights via a mathematical transformation. For each linear layer, divide activations by a per-channel smoothing factor and multiply weights by the same factor. The math is equivalent, but now the activations are smoother and easier to quantize.

The trick: choose smoothing factors that balance the quantization difficulty between activations and weights. Both must remain quantizable; pushing too much to weights makes weights hard to quantize and pushes too little leaves activations hard.

Activation-aware fine-tuning

For models where SmoothQuant alone is insufficient, light fine-tuning with the quantized activation path in the forward pass adapts the model to be more outlier-robust. Used in production by stacks targeting aggressive INT8 weight + activation quantization.


Attention quantization: FP8 attention on Hopper/Blackwell

Attention has its own quantization story. The challenge: the softmax in attention is sensitive to precision in ways that simple matmuls are not.

FP8 attention on Hopper

Hopper supports FP8 attention via FlashAttention 3. The Q, K, V projections and the QK^T matmul run in FP8 (E4M3); the softmax operates in higher precision (FP32 internally); the V projection runs in FP8. Quality loss vs FP16 attention: typically under 0.5% on benchmarks.

FP8 attention on Blackwell

Blackwell extends FP8 with NVFP4 support for some attention paths. The 2026 production state: FP8 attention is the default for new deployments on Hopper and Blackwell. FP4 attention is experimental.

Attention numerics

Three failure modes when quantizing attention. (1) Softmax loss of precision — fixed by computing softmax in higher precision. (2) Mass loss at extreme values — fixed by careful exponent handling. (3) KV cache precision interacting with attention — the KV cache precision (FP8, INT4) compounds with the activation precision; very aggressive combinations need eval.

Production status

Attention quantization is more conservative than weight quantization. Most production stacks use FP8 attention with FP8 weights and FP8 (or INT8) KV. Going to INT4 KV with FP8 attention is feasible with KIVI-style per-channel-K quantization. INT4 attention itself is not yet production-mature.


Per-model 2026 quantization choices

Practical quantization recipes for specific 2026 models:

Llama-3 70B

  • FP8 (W8A8): production default. Use NVIDIA's Llama-3.1-70B-Instruct-FP8 checkpoint. Quality matches FP16 to within noise.
  • INT4 weight-only (AWQ): for memory-constrained deployments. Quality loss 1-2 points on MMLU; larger on math/code.
  • NVFP4 on Blackwell: emerging; quality and throughput leading the FP4 production stack.

Qwen2.5-72B

  • GPTQ INT4: the most-deployed open quantization; Qwen team provides official GPTQ checkpoints.
  • AWQ INT4: alternative with similar quality.
  • FP8: less common in the Qwen ecosystem; recipes available via vLLM.

DeepSeek-V3 (MoE)

  • FP8: native — DeepSeek-V3 was trained with FP8 mixed precision from scratch. FP8 is the production default.
  • FP4 (NVFP4) on Blackwell: experimental; promising results.
  • MoE-specific consideration: expert weights have different outlier patterns than dense layers; per-expert calibration recommended.

Llama-4 (MoE)

  • FP8: native, similar to DeepSeek-V3.
  • MXFP4: experimental for inference.

Mistral Large 3

  • FP8: Mistral's recommended path for production. AWQ INT4 supported for memory-constrained inference.

Smaller models (8B-30B)

  • INT4 (AWQ) is more common for smaller models where memory efficiency matters more than quality.
  • FP8 is supported but the per-parameter quality gap to INT4 is smaller, making INT4 more attractive.

Per-stack support deep dive: vLLM, SGLang, TRT-LLM, llama.cpp, ExLlama

vLLM

Supports FP8 (W8A8 and W8A16), INT8, INT4 (AWQ and GPTQ), and FP4 (Blackwell). Newest formats land in vLLM first; broad model support. Production-ready for FP8 and AWQ INT4.

SGLang

Similar format coverage to vLLM. SGLang's distinguishing feature is tight integration of quantization with its prefix-caching pipeline — quantized KV caches integrate with RadixAttention without quality regression.

TensorRT-LLM

FP8, INT4 (AWQ), and FP4 (Blackwell) supported with NVIDIA-optimized kernels. The fastest among the stacks for the supported formats, especially FP8 and FP4. Format conversions and engine builds add operational complexity vs vLLM.

llama.cpp

GGUF format with many quantization variants: Q2_K, Q3_K_S, Q3_K_M, Q3_K_L, Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0, IQ-series (newer, learned-quantization variants). The dominant choice for CPU and Apple Silicon inference. Quality at Q4_K_M is comparable to AWQ INT4 with slightly different trade-offs.

ExLlamaV2 / ExLlamaV3

EXL2 format. Variable bits per layer based on quantization sensitivity. Best quality at the average bit rate among open formats. ExLlama-specific; less portable.

Format support comparison

Stack FP8 INT8 INT4 (AWQ) INT4 (GPTQ) FP4/NVFP4 GGUF EXL2/3
vLLM Yes Yes Yes Yes Yes (Blackwell) No No
SGLang Yes Yes Yes Yes Yes No No
TRT-LLM Yes Yes Yes Limited Yes No No
llama.cpp Limited Yes No No Limited Yes No
ExLlamaV3 No Limited No Limited Yes No Yes

Inference benchmarks per format

Llama-3 70B on H100 SXM, May 2026, vLLM 0.8:

Format Throughput (decode, batch 16) TTFT (4K prompt) Quality (MMLU)
FP16 150 tok/s/GPU 280 ms 79.5
FP8 240 tok/s/GPU 200 ms 79.3
INT8 230 tok/s/GPU 210 ms 78.9
AWQ INT4 290 tok/s/GPU 240 ms 77.8
GPTQ INT4 280 tok/s/GPU 245 ms 77.5
NVFP4 (B200) 420 tok/s/GPU 160 ms 78.5

The pattern: FP8 is the best quality/throughput trade-off on H100. AWQ INT4 wins on memory and decode throughput at modest quality cost. NVFP4 on Blackwell delivers the best of all worlds — high throughput, low memory, modest quality cost.


When INT4 breaks: math, coding, reasoning, long context

Most benchmarks (MMLU, HellaSwag, ARC) show 1-2 point INT4 drops vs FP16. The drop is much larger on specific workloads:

Math (GSM8K, MATH)

INT4 typically drops 3-7 points on math benchmarks. The cause: math requires exact intermediate computation; small errors propagate. AWQ helps but doesn't fully fix it.

Coding (HumanEval, MBPP)

INT4 drops 2-5 points on coding. Function-calling and structured output suffer particularly — the model is asked to produce exact syntactic tokens, where quantization noise can flip a token decision.

Reasoning (GPQA, MATH-500)

INT4 drops 3-8 points on long reasoning chains. Each step's small error compounds across thousands of tokens.

Long-context retrieval (NIAH, RULER)

INT4 weights with INT4 KV: 5-15 points drop on multi-needle retrieval at long context. The KV quantization is usually the bigger contributor.

Structured output (JSON, XML)

INT4 increases JSON validation failure rate by 2-5× vs FP16 on complex schemas. Production stacks that require strict JSON output usually run at FP8 or with INT4 + constrained decoding.

What to do

If your workload includes any of the above categories, evaluate INT4 explicitly. If quality drops are unacceptable, fall back to FP8 (negligible quality loss) or apply per-workload mitigations (constrained decoding for JSON, more aggressive eval for math).


FP4 on Blackwell: production status

Blackwell (B200, B100) is the first GPU generation with native FP4 tensor cores. Production deployment status as of May 2026:

Hardware

B200 GPUs available in NVL72 configurations from NVIDIA partners. Sufficient supply for hyperscale buyers; constrained for enterprise. Pricing per GPU is 2-3× H100, but per-token cost at FP4 throughput is lower.

Software

vLLM, SGLang, and TRT-LLM all support FP4 (NVFP4 specifically) on Blackwell. Llama-4, DeepSeek-V3, and several other frontier models have official FP4 checkpoints.

Quality

FP4 with appropriate per-block scaling (NVFP4) shows quality close to FP8: typically within 1 point on MMLU, slightly larger gaps on math and reasoning. With QAT or activation-aware fine-tuning, the gap closes further.

Production deployments

Microsoft Azure: serving Llama-4 and OpenAI o-series on Blackwell with FP4 in 2026. Google: Gemini 2.5 inference reportedly uses TPU v5p with bf16/int8 — TPUs don't have FP4. Meta: research on FP4 training and inference at scale; deployed for some internal workloads.

When to use FP4

For new Blackwell deployments, FP4 is now the default for most workloads. The exceptions: workloads where the FP4 quality drop is unacceptable (high-stakes reasoning, structured output) — fall back to FP8.


Quantization for fine-tuning: QLoRA, ReLoRA, NEFTune

Quantization isn't just for serving — it transforms fine-tuning economics.

QLoRA standard recipe

Load base model at NF4 (4-bit normalized float). Add LoRA adapters at BF16, rank 16-64. Train the LoRA only; base weights are frozen. Memory: ~48 GB for 70B fine-tune (single 80GB GPU).

Cost: ~1 GPU-day per epoch for a 70B model on a 100K-example dataset on H100. Comparable to full-precision fine-tune costs at much lower hardware requirements.

ReLoRA

Periodically merge LoRA adapters into base weights, reset LoRA, continue training. Allows accumulating updates beyond LoRA's rank limit. Used for longer fine-tunes (multiple epochs) where pure LoRA's rank is limiting.

NEFTune

Add Gaussian noise to embeddings during fine-tuning. Improves instruction-following quality with no cost overhead. Standard companion to QLoRA in many recipes.

Practical fine-tuning stack

For most teams in 2026: QLoRA + NEFTune at NF4 base + BF16 LoRA, rank 32, trained for 2-3 epochs on instruction data. Quality reaches within 1-2 points of full fine-tune on most benchmarks. The cost differential is enormous — full fine-tune of 70B is multi-GPU multi-day; QLoRA is single-GPU few-day.


Quantization with batching

Quantization interacts with batching in subtle ways.

Batch-aware dynamic activation scaling

For activations quantized to FP8 dynamically (re-computed per batch), larger batches give more samples for scale estimation. Smaller batches can produce noisier scales, causing intermittent quality drops.

Production fix: use static activation scales calibrated offline. The slight quality loss vs dynamic scales is more than recovered by stability.

Per-batch outlier handling

Some batches have unusual outlier patterns that the standard calibration didn't anticipate. Production stacks detect this (track per-batch max-min ratios) and either widen the dynamic range temporarily or flag the batch for offline analysis.

KV quantization with continuous batching

When new requests join a batch mid-flight, their KV needs to be quantized with the same scales as the existing batch's KV. Either use global scales (slightly worse quality) or per-request scales with careful bookkeeping (better quality, more complexity).


Accuracy recovery techniques

When quantization quality is borderline acceptable, several techniques can recover the lost points.

HQQ recovery

After fast HQQ quantization, run a short calibration-based fine-tune. Recovers 0.5-1 points typically. Useful when HQQ's speed advantage matters more than max quality.

Activation-aware fine-tuning

After quantization, fine-tune lightly with the quantized forward pass active. The model adapts to the quantization noise pattern. Recovers 1-2 points; cost is one fine-tune cycle.

Distillation from full-precision

Distill from the FP16 base model into the quantized variant. The quantized model learns to match FP16 outputs. Most effective recovery technique but most expensive.

Outlier-aware quantization

For models where outliers are the dominant quality cost, apply SpQR or similar techniques to keep outliers at higher precision. Recovers 1-3 points.


Engineering economics of self-quantization

Should you quantize your own model or use a pre-quantized one?

Pre-quantized advantages

  • Calibration done; no need to provide calibration data.
  • Tested by the community; known quality.
  • Available immediately.

Self-quantization advantages

  • Use your own calibration data (your distribution, not generic).
  • Choose your own format / method / hyperparameters.
  • Verify quality on your specific workload.

When self-quantization matters

If your workload distribution is unusual (medical text, code, multilingual), domain-specific calibration data improves quantized quality by 1-3 points. Worth the engineering investment.

For generic chat workloads, pre-quantized models (NVIDIA's FP8 checkpoints, Hugging Face's AWQ INT4 variants) are sufficient.


Quantization safety: refusal behavior

A subtle but real concern: quantization can change refusal behavior.

What changes

The model's behavior on borderline-unsafe queries can shift after quantization. INT4 quantized models occasionally refuse queries that FP16 answers (false positives) or answer queries that FP16 refuses (false negatives). The behavior is workload-specific and hard to predict.

Why it happens

Refusal is often determined by a few sharp activation patterns; quantization noise can flip these patterns. The effect is small on average but visible at scale.

Mitigation

For production deployments with safety guardrails, run safety eval on the quantized model before deployment. Run external guardrails (Llama Guard 3, Anthropic's safety classifier) independently of the model so quantization changes don't drift the safety posture.


2026 papers and what's next

FP4 training (2025-2026)

Several 2025-2026 papers (NVIDIA, Microsoft, DeepSeek) demonstrate FP4 mixed-precision training at scale. Production adoption is starting; full FP4 training is expected to be the standard for new frontier models by 2027.

Ternary at scale

BitNet b1.58 demonstrated ternary at 3B; subsequent papers push to 7B and 13B with similar quality. The path to 70B+ ternary depends on hardware support; current GPUs emulate ternary inefficiently.

Binary inference

Research only as of 2026. The path to production requires architectural changes (specifically designed for binary representation), not just quantizing existing transformers.

NVFP4 in production

Now the dominant format for Blackwell production deployments. By 2027 expect NVFP4 to be the production default for most workloads, with FP8 reserved for quality-sensitive applications.

Learned compression

Research direction: models that include learned compression of activations and weights as part of the architecture, rather than post-hoc quantization. Early results show promise but no production deployments yet.


The bottom line

The precision-throughput tradeoff is real, but in 2026 most of its sharpest edges have been sanded. The single biggest lever is picking the right precision per tensor type rather than per model: FP8 weights and activations, INT4 weights with FP16 activations when memory is tight, FP8 (or INT4) for the KV cache once context grows. Uniform "quantize the whole model to X" is leaving both quality and throughput on the table.

  • FP8 is the production default. It is hardware-supported on H100/H200/B200, near-lossless on every published frontier model, and cheap to calibrate.
  • INT4 weights are the cost-aggressive pick, but AWQ or GPTQ plus a workload-specific eval is the price of admission.
  • KV-cache quantization is often the largest single win for long-context serving — bigger than weight quantization on a 128k workload.
  • The cliff is task-shaped: math, code, structured output, and long-context recall degrade first while leaderboards look healthy.
  • Below 4 bits is research territory; budget for QAT or accept the loss.

To compound the savings, pair with KV cache and speculative decoding. For where the bandwidth being freed actually flows, see LLM serving.


FAQ

Is FP8 always better than INT8? Usually for inference. INT8 still has the edge on some workloads with very well-tuned kernels and on hardware without FP8 support. On A100 (no FP8 hardware), INT8 is the only 8-bit choice that runs at full throughput. On H100 and newer, FP8 wins by 0.3-0.7 points on most benchmarks. The remaining edge cases for INT8 are highly tuned production paths where INT8 kernels have specific shape optimizations FP8 hasn't received yet.

Can I quantize after fine-tuning, or do I need to fine-tune the quantized model? Post-training quantization (PTQ) works for FP8 and most INT4 setups. For sub-4-bit, fine-tune-then-quantize or QAT are necessary.

Does quantization affect safety alignment? There's evidence aggressive quantization can degrade alignment fine-tuning effects. Test refusals and instruction-following on a quantized model before shipping.

Is INT4 quality worse than FP4? At the same bit count and well-tuned kernels, FP4 typically beats INT4 on quality, by a small margin. The gap narrows with good per-group scaling on INT4.

Does quantization help training? Yes. FP8 training is production-deployed (Hopper FP8 paths). FP4 training is research-grade. Quantization for training is a different topic, with different trade-offs.

How do I know if my quantization broke something? Run workload-representative evals before and after. Track per-task scores, not aggregates. Use multiple eval seeds and compare distributions, not just means.

What's the cost of dequantization on the fly? Negligible for well-tuned kernels. The dequantize-then-matmul path can saturate HBM bandwidth without leaving cycles on the table.

Is there a free lunch at FP8? Close to it. FP8 weight quantization on a well-supported model gives you 2× bandwidth savings, 2× memory savings, and usually <1% quality regression. For most production: yes, do it.

How does NVFP4 differ from plain FP4? NVFP4 adds a per-block FP8 micro-scale on top of E2M1 FP4 values (typically blocks of 16). The micro-scales let each block adapt to local magnitude, recovering most of the precision lost in pure FP4 while keeping the half-byte payload. It's NVIDIA's Blackwell-era frontier-serving format.

Should I use AWQ or GPTQ for new W4A16 deployments? Either works. AWQ is faster to apply and slightly better on outlier-heavy models; GPTQ has a longer track record and excellent quality at INT3. For most teams: AWQ first, fall back to GPTQ if specific layers regress.

Does quantization interact with MoE? Cleanly. Per-expert quantization is straightforward (each expert is an independent FFN). Some production stacks mix precisions — hot experts at FP8, cold experts at FP4 — to balance HBM and quality. KV-cache quantization is orthogonal and stacks on top.

Does quantization interact with speculative decoding? Yes, usually positively. A quantized target model with a smaller dense draft is a great combination — see speculative decoding. The smaller draft tolerates more aggressive quantization than the target.

Is BitNet or ternary quantization production-ready? Not yet. 1.58-bit BitNet shows promising results on trained-from-scratch models; retrofitting existing models below 2 bits remains a research project. For production in 2026, NVFP4 / INT4 with QAT is the floor.

How do I quantize KV cache without breaking long-context retrieval? Start with FP8 KV (almost always safe). If you need more, go to per-head INT4 KV with sink tokens preserved at higher precision. Test on needle-in-haystack and your real retrieval workload — KV quantization is the most error-prone area at long context.

Is FP8 training the same as FP8 inference? No. FP8 training uses both E4M3 (forward) and E5M2 (gradients) and requires loss scaling, gradient scaling, and master FP32 weights to remain stable. FP8 inference uses only E4M3 for weights and activations and skips all the training machinery. The hardware paths are the same; the software stack is much simpler for inference. See our FP8 training tradeoffs guide for the training-side complications.

Should I quantize embedding and LM head layers? Usually keep them at higher precision. Embedding tables are accessed sparsely (only the active token IDs) so quantizing them saves little memory and risks accuracy. The LM head is the final projection to vocabulary logits; quantization noise there directly affects sampling distributions. Conservative production stacks keep both at BF16 or FP8 even when the rest of the model is INT4.

Does quantization help cold-start latency? Yes, significantly. Loading a 70B model from NVMe in BF16 is ~140 GB; in INT4 it's ~35 GB. Loading time drops from ~60 s to ~15 s on PCIe Gen5 NVMe. For autoscaling deployments where cold-start latency matters, the savings on warm-up are independent of the runtime quality argument.

How do I detect a botched quantization? The fastest signal: perplexity on a held-out workload-representative dataset. A FP8 quantization that increases perplexity by less than 1% is fine; more than 3% is suspicious; more than 10% is broken. Perplexity does not catch all issues (alignment regressions, structured-output errors) but it catches the gross failures cheaply. Combine with structured-output validity rate and a refusal-test suite for the harder failures.

Are there models that resist aggressive quantization more than others? Yes. Models trained with aggressive learning rates, very deep architectures, or significant MoE routing tend to have spikier activation distributions and quantize worse. Models trained with z-loss, careful weight init, and conservative learning rates quantize better. Llama 3.x quantizes cleanly; DeepSeek-V3 quantizes well due to its training stability; some research models with shallow training have unexpected quantization sensitivity. Always check published quantization benchmarks before committing to aggressive precision.

Does quantization stack with LoRA fine-tuning? Yes. QLoRA (NF4 base + LoRA adapters in BF16) is the standard fine-tuning recipe. For serving, quantized base + BF16 LoRA adapters works cleanly; the adapter weights are small enough to stay at higher precision without budget concerns. The combination delivers most of the cost savings of full quantization with the flexibility of per-tenant adapters.

How often should I re-calibrate? Whenever the workload distribution drifts meaningfully — typically every 3-6 months for stable products, monthly for fast-evolving ones. The signal is rising perplexity or degrading task scores on production-trace replays. Some teams run continuous shadow calibration that produces a re-quantized model every week and tracks how much it differs from the in-production version.

Is there a downside to FP8 in production beyond quality? Two minor ones. First, FP8 kernels can have lower utilization than BF16 on small shapes because the FP8 tensor cores have different shape requirements. Second, FP8 makes debugging harder — when something goes wrong in a forward pass, FP8 numbers are less intuitive to read than BF16. Neither is a reason to avoid FP8 in production; just be aware.

What's the difference between NF4 and INT4? NF4 (Normalized Float 4) uses non-uniform quantization steps designed to match the normal distribution typical of neural network weights. INT4 uses uniform steps. NF4 typically gives 0.5-1 point better quality on weight-only quantization at the same bit budget. NF4 is the default for QLoRA fine-tuning; INT4 (via AWQ or GPTQ) is more common for inference serving.

Should I use INT4 or FP4 on Blackwell? FP4. NVFP4 has native hardware support on Blackwell with better throughput than INT4 (which is emulated through INT8). Quality is comparable; NVFP4 is the right default. Reserve INT4 for cases where you need GPU-portable model artifacts that also run on Ampere/Hopper.

Does quantization help with prefill or decode? Both. Decode benefits more in relative terms because decode is bandwidth-bound — smaller weights and KV mean less HBM traffic per token. Prefill benefits in absolute terms because the compute volume is much larger; saving 30% of FLOPs on a 50ms prefill is more wall-clock than saving 30% on a 5ms decode step. Compounding across both phases is the win.

Can I mix precisions across layers? Yes. EXL2 explicitly supports per-layer precision; some layers at INT4, others at INT3. AWQ and GPTQ both allow specifying which layers get which precision. The technique is most useful when a few "sensitive" layers (typically early layers and attention output projections) need higher precision to preserve quality at aggressive quantization elsewhere.

Does quantization affect tokenization? No. Tokenization is a CPU operation on text; quantization is a precision choice for tensors. They are independent. Some teams confuse them because both can change generation behavior; the mechanisms are different.

How does quantization interact with FlashAttention? FlashAttention 3 supports FP8 attention natively. The Q, K, V projections and the QK^T matmul run in FP8; the softmax operates in higher precision internally; the attention-V product runs in FP8. For INT4 KV with FP8 attention, the KV is dequantized inline to FP8 for the attention computation, then the result is stored in INT4 again. Production stacks handle this transparently.

What's the quality difference between Q4_K_M (GGUF) and AWQ INT4? Roughly comparable. Q4_K_M typically scores within 0.5 points of AWQ INT4 on MMLU; sometimes Q4_K_M is slightly better, sometimes AWQ. GGUF's advantage is broader hardware support (CPU, Apple Silicon); AWQ's advantage is faster GPU inference with vLLM/SGLang/TRT-LLM. Pick based on deployment target, not quality.

Should I use GPTQ or AWQ in 2026? AWQ for new deployments. AWQ's quality is slightly better than GPTQ at the same bit budget on most benchmarks, and AWQ's calibration is faster. GPTQ remains common in legacy deployments and in the Qwen ecosystem where official checkpoints are GPTQ. Both work for production.

Does quantizing affect output determinism? Yes. Different quantization formats and methods produce different floating-point outputs, so a quantized model is bit-different from its FP16 source. Within a single quantized model, output is deterministic given fixed inputs and seeds — quantization noise is baked into the weights and is not stochastic.

Can I run a quantized model on CPU? Yes, with llama.cpp (GGUF formats) or with ONNX Runtime + INT8. CPU inference of quantized 7B-30B models is feasible on modern x86 with AVX-512 / AMX. Speed depends on model size and CPU; an M3 Ultra can run 70B Q4_K_M at usable speeds; a server-class Xeon Sapphire Rapids handles 30B comfortably. For larger models, GPU is still required.

How do I quantize a custom fine-tuned model? Same recipes as base models. Run AWQ or GPTQ calibration on a held-out subset of your fine-tuning data (or a representative sample of production traffic). Verify quality on your benchmark. The fine-tuning-specific consideration: LoRA adapters can be merged into base weights before quantization, or kept separate and applied at inference time. Both work; merging is simpler at serving time.

What is KIVI and why is it the dominant KV quant? KIVI (Liu et al., 2024) is a KV quantization scheme that uses per-channel scaling for the K cache and per-token scaling for the V cache. The asymmetry is motivated by the different outlier patterns in K vs V. KIVI achieves INT2 KV at quality close to FP8 KV, much better than naive INT4 KV. The dominant choice for aggressive KV compression in 2026 production.

Does quantization change behavior on adversarial inputs? Sometimes. INT4 models can be more or less robust to adversarial prompts than their FP16 counterparts; the direction is workload-specific. For safety-critical deployments, include adversarial probes in your quantization eval. Most production stacks find quantization-induced safety drift to be small but non-zero.

What is the future of quantization research? Three directions. (1) Sub-4-bit formats with QAT closing the quality gap to FP8. (2) Hardware-codesigned formats (ternary, binary) with native silicon support in next-gen chips. (3) Learned quantization that adapts to per-model and per-workload distribution. By 2028 expect production deployments to routinely run at 2-3 average bits, with frontier deployments still on FP4-FP8 for the highest quality demands.


Glossary

  • AWQ — Activation-aware Weight Quantization. Outlier-channel-preserving INT4 method.
  • Calibration set — small dataset used to measure activation distributions for scale selection.
  • E4M3 / E5M2 — the two NVIDIA FP8 formats; differ in exponent and mantissa bit counts.
  • GPTQ — iterative, calibration-based weight quantization with error compensation.
  • Group size — number of weights sharing a scale in per-group quantization.
  • Outlier — an activation or weight value far from the typical magnitude. Drives most quantization error.
  • PTQ — Post-Training Quantization. Quantize an already-trained model.
  • QAT — Quantization-Aware Training. Train with quantization simulated in the forward pass.
  • SmoothQuant — rebalances weight and activation magnitudes for W8A8 quantization.
  • W4A16 / W8A8 / etc. — notation for weight precision / activation precision.
  • Zero point — offset value in asymmetric quantization, paired with the scale.

References

  • LLM.int8() — Dettmers et al., 2022. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." arXiv:2208.07339. Foundational outlier-handling.
  • GPTQ — Frantar et al., 2022. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv:2210.17323. Iterative INT4 with error compensation.
  • AWQ — Lin et al., 2023. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978. Salient-channel preservation.
  • SmoothQuant — Xiao et al., 2022. "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models." arXiv:2211.10438. W8A8 via magnitude rebalancing.
  • FP8 Formats for Deep Learning — Micikevicius et al. (NVIDIA, Intel, ARM), 2022. arXiv:2209.05433. The E4M3 / E5M2 specification.
  • KIVI — Liu et al., 2024. "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache." arXiv:2402.02750.
  • QLoRA — Dettmers et al., 2023. "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314. NF4 quantization for fine-tuning.
  • Marlin — Frantar, Castro, Alistarh, 2024. Fast INT4 GEMM kernels. See github.com/IST-DASLab/marlin.
  • bitsandbytes — Hugging Face's 8-bit and 4-bit quantization library. github.com/bitsandbytes-foundation/bitsandbytes.
  • llama.cpp — community reference for aggressive sub-INT4 quantization on CPU and consumer GPUs. github.com/ggerganov/llama.cpp.
  • SpinQuant — Liu et al., 2024. "SpinQuant: LLM Quantization with Learned Rotations." arXiv:2405.16406. Rotation-based outlier suppression enabling W4A4.
  • NVIDIA Transformer Engine documentation — canonical reference for FP8 and NVFP4 implementations. docs.nvidia.com/deeplearning/transformer-engine.
  • H2O — Zhang et al., 2023. "Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models." arXiv:2306.14048. KV eviction by attention heuristic.

Per-format deep dive with bit-budget math

Each numeric format trades range, precision, and storage. The bit-budget arithmetic determines what is and isn't representable.

FP16 (half precision)

1 sign + 5 exponent + 10 mantissa = 16 bits. Range ~±65,504. ~3-4 decimal digits of precision. Standard for inference before FP8.

BF16 (brain float)

1 sign + 8 exponent + 7 mantissa = 16 bits. Range ±3.4e38 (matches FP32 range). Lower precision than FP16 (2 decimal digits) but no overflow concerns. Training default since 2020-ish.

FP8 (E4M3 and E5M2)

8 bits total. E4M3 (4 exponent, 3 mantissa) used for weights and activations in forward; range ~±448. E5M2 (5 exponent, 2 mantissa) used for gradients in backward; range ~±57,344. NVIDIA Transformer Engine handles per-tensor or per-row scaling.

INT8

8 bits, signed integer. Range ±127. Per-tensor or per-channel scaling factor maps to fp range. Standard for weight-only and weight+activation quantization on pre-Hopper hardware.

INT4

4 bits. 16 levels. Per-group scaling (typically group 32, 64, or 128) gets close to FP8 quality on weights. Activation quantization at INT4 is hard.

FP4 (NVFP4, MXFP4)

NVIDIA Blackwell introduced NVFP4 with hardware support; MXFP4 (Microscaling) is the OCP standard. 4 bits with E2M1 (2 exponent, 1 mantissa). Per-microblock scaling (typically 32-element groups). On Blackwell tensor cores, FP4 matmul runs at 2× FP8 throughput.

Ternary / binary

3 levels (-1, 0, 1) for ternary; 2 levels for binary. Research models (BitNet, 1.58-bit BitLLM) show feasibility but production deployment is rare. Quality recovery requires aggressive QAT.

Bit-budget for a 70B model

Format Bits/param 70B model size
FP32 32 280 GB
BF16/FP16 16 140 GB
FP8 8 70 GB
INT8 8 70 GB
INT4 (group 128) ~4.5 39 GB
FP4 (NVFP4) ~4.5 39 GB
INT2 (Q2_K llama.cpp) ~2.6 23 GB
Ternary ~1.58 14 GB

The "~4.5" for INT4/FP4 reflects the scale factor overhead per group.


Per-technique catalog: how each algorithm works

The major quantization algorithms, what they do, and where they shine.

PTQ-static

Quantize weights once using calibration data. Activations quantized at runtime with pre-computed scales. Most production deployments.

PTQ-dynamic

Activation scales computed online per-batch. More flexible, slightly slower.

GPTQ (Frantar et al., 2022)

Hessian-aware second-order quantization. Iteratively quantizes weight columns while compensating remaining columns. INT4 with minimal quality loss for large models. Reference for INT4 weights.

AWQ (Lin et al., 2023)

Activation-aware Weight Quantization. Scales weights based on activation statistics to preserve salient channels. Slightly different from GPTQ; faster, simpler.

SmoothQuant (Xiao et al., 2023)

Shifts the quantization difficulty from activations to weights via per-channel scaling. Enables W8A8 (INT8 weights + activations) on dense models.

ZeroQuant-V1/V2

DeepSpeed's quantization toolkit. V2 supports INT4 weights with INT8 activations.

OmniQuant (Shao et al., 2023)

Learnable equivalent transformations on weights and activations. Outperforms heuristic methods for aggressive quantization.

SqueezeLLM

Sensitivity-based non-uniform quantization. Allocates more bits to sensitive weights.

SpQR (Sparse-Quantized Representation)

Dense-quantized base + sparse outlier matrix. Strong INT3-INT4 quality.

QuIP and QuIP#

Quantization with Incoherence Processing. Random rotation makes weights more amenable to quantization. INT2 production-grade.

HQQ (Half-Quadratic Quantization)

Fast, no-calibration quantization with surprisingly good quality. Drop-in for many workflows.

EXL2 and EXL3

ExLlama formats. Per-tensor variable bitwidth (mix 2-bit and 8-bit per tensor). Used heavily by local-inference community.

bitsandbytes 8-bit / 4-bit / NF4

Hugging Face's library. NF4 (NormalFloat 4-bit) is the QLoRA paper's recommended format.

AQLM (Additive Quantization)

Multi-codebook quantization. Reaches 2-3 bits with strong quality recovery.

Technique comparison

Technique Bits target Quality at bits Calibration Production use
GPTQ INT4 Strong Yes (~128 samples) Mature; vLLM, TRT-LLM
AWQ INT4 Strong Yes Mature; vLLM
SmoothQuant W8A8 Good Yes Mature
OmniQuant W4A4 Best in class Yes (longer) Growing
HQQ INT4-INT2 Good No Growing for fast workflows
QuIP# INT2 Surprisingly good Yes Niche
AQLM INT2-INT3 Excellent Yes (slow) Local inference
NF4 (bitsandbytes) INT4 Good No QLoRA fine-tuning
EXL2 mixed Tunable Yes ExLlama community
GGUF Q2-Q8 INT2-INT8 Tunable Yes llama.cpp, edge

KV cache quantization in depth

KV cache is often the dominant inference memory cost. Quantizing it has different rules than quantizing weights.

KIVI (per-channel K, per-token V)

Liu et al., 2024. Keys quantized per-channel (each head dim has its own scale), values per-token. Asymmetric design respects the different statistical properties. INT2 KV cache with minimal quality loss.

KVQuant

KV-specific quantization with non-uniform bit allocation.

FP8 KV cache

Direct application of FP8 to K and V. Easiest; supported by TRT-LLM, vLLM, SGLang. Quality loss minimal.

INT4 KV cache failure modes

Aggressive INT4 KV (without KIVI-style per-channel) shows quality regression on long-context tasks. Specifically: retrieval-from-context accuracy drops, "needle in haystack" benchmarks degrade. The pattern: outlier channels in K dominate attention scores; quantizing them uniformly destroys the signal.

KV cache quantization stacks

Stack KV cache options Production state
vLLM FP8, INT8 FP8 well-tested
SGLang FP8, INT8 FP8 default for many configs
TRT-LLM FP8 Mature
llama.cpp Q4_0, Q8_0 KV Edge / local
ExLlama FP8, INT8 Local

Attention quantization: FP8 attention on Hopper, FP4 on Blackwell

The attention operation itself can be quantized.

FlashAttention-3 with FP8

FlashAttention-3 (mid-2024 release for Hopper) supports FP8 attention via the H100 FP8 tensor cores. ~2× throughput vs BF16. Used in production by major inference vendors.

TC-Gen5 (Blackwell) and FP4 attention

Blackwell tensor cores (TC-Gen5) support FP4 attention via NVFP4. Production status: emerging in 2025-2026 deployments. Throughput further increases.

Quality preservation

Attention quantization is generally more sensitive than FFN quantization. Outlier handling matters. Production deployments either use FP8 for attention with FP4 for FFN, or careful calibration.


Per-stack capability matrix

What each inference stack actually supports:

Stack FP8 weights FP4 weights INT4 weights FP8 KV FP8 attn FP4 attn
vLLM Yes (FP8 E4M3) Limited AWQ, GPTQ Yes Yes (Hopper) Limited
SGLang Yes Yes (Blackwell) AWQ, GPTQ Yes Yes Yes (Blackwell)
TRT-LLM Yes (mature) Yes (NVFP4) AWQ, GPTQ Yes Yes Yes
llama.cpp Limited No GGUF Q2-Q8 Q4-Q8 No No
ExLlama (v2/v3) Limited No EXL2/3 Yes Yes No
MLX (Apple) Yes No INT4, INT8 Yes Limited No
OpenVINO Yes No INT4, INT8 Yes Yes (Intel) No

What this means for builders

For frontier serving on Blackwell, TRT-LLM and SGLang give the most complete FP4 path. For Hopper, vLLM and TRT-LLM cover FP8 well. For edge / local, llama.cpp GGUF is the lingua franca.


Inference benchmarks by precision

Public benchmark numbers, Llama-3.1 70B on 8x H100 unless noted, 2025-2026 Q2:

Format Throughput (tps) TTFT (ms) MMLU delta vs BF16
BF16 ~3500 200 reference
FP8 ~5500 150 -0.2 pts
INT8 (W8A8 SmoothQuant) ~4800 170 -0.5 pts
INT4 (AWQ) ~7500 130 -1.0 pts
INT4 (GPTQ) ~7000 130 -1.2 pts
FP4 (NVFP4 on B200, est.) ~12000 100 -1.0 pts
INT3 (AQLM) ~7800 130 -2.5 pts
INT2 (QuIP#) ~8000 130 -5 to -8 pts

For B200, throughput numbers should be multiplied 2-3× vs H100 baseline; the MMLU deltas largely transfer.


Quantization decision matrix

When to pick what.

Latency-bound on H100/H200

FP8 if available (Hopper FP8 tensor cores). INT8 or INT4 AWQ otherwise. Watch attention quantization.

Latency-bound on B200

FP4 (NVFP4) for FFN; FP8 for attention. Calibrate carefully.

Throughput-bound at high batch

INT4 (AWQ or GPTQ) gives largest throughput. Quality cost ~1-2 pts MMLU.

Memory-bound (large context, limited HBM)

KV cache FP8 or INT8. Weight quantization to INT4. Combine for maximum context per GPU.

Quality-critical (math, code, reasoning)

BF16 or FP8 only. INT4 risks regression on hard tasks.

Edge / consumer (Apple, AMD consumer GPU)

GGUF Q4_K_M or Q5_K_M for llama.cpp. MLX 4-bit on Apple silicon.

Decision flowchart summary

  1. Hardware? H100→FP8, B200→FP4, edge→GGUF, AMD MI300→FP8/INT4.
  2. Quality tolerance? < 1pt MMLU loss → FP8 or BF16. 1-2pts OK → INT4.
  3. Use case? Math/code/reasoning → conservative (FP8). Chat/RAG → aggressive (INT4 OK).
  4. Stack? Production frontier → TRT-LLM or SGLang. Open / community → vLLM. Edge → llama.cpp.

Quantization safety considerations

Quantization can change model behavior in subtle ways that matter for safety.

Refusal behavior shifts

Aggressive INT4 quantization sometimes weakens refusal training: the model may comply with prompts it would refuse at BF16. Pattern: refusal training adds small magnitude weights that are sensitive to rounding. Mitigation: calibrate with refusal-class prompts; verify refusal rate post-quantization.

Jailbreak susceptibility

Some research shows INT4 models are more susceptible to certain jailbreaks. Evaluation post-quantization is mandatory for production.

Hallucination rate

Aggressive quantization can increase hallucination rate on factual tasks. Mostly small (< 1% absolute) but workload-specific.

Bias and fairness

Less studied but observed: some quantization choices interact with bias mitigation training. Validate per-deployment.

Mitigation playbook

  1. Re-run safety evals post-quantization.
  2. Compare refusal rates against BF16 baseline.
  3. Test against known-jailbreak suites.
  4. Watch for new failure modes specific to the quantization stack.

2026 quantization research highlights

What's new and worth tracking.

FP4 training (not just inference)

Research demonstrates FP4 training is feasible. NVIDIA Blackwell hardware supports both inference and training in FP4. Production training in FP4 is emerging, primarily for cost-sensitive pretraining runs.

Ternary / 1.58-bit models

BitNet b1.58 and successors show ternary models can match FP16 quality at scale. Production deployment is rare but research is active.

Binary inference

Truly 1-bit weights. Quality gap to FP16 remains substantial; useful for niche hardware.

NVFP4 in production

NVFP4 has shipped in TRT-LLM, SGLang. Microsoft Azure and several inference vendors offer NVFP4 endpoints. Quality vs FP8 is workload-specific.

Calibration data sensitivity

Recent papers show calibration data choice matters more than expected. Domain-matched calibration (use math problems if you serve math) gives substantially better outcomes than generic.

Layer-wise mixed precision

Different layers use different precisions based on sensitivity. EXL2 and EXL3 expose this; production stacks adopting through 2026.


Additional FAQ

Q: What's the difference between weight-only and weight+activation quantization?

Weight-only stores weights at low precision but uses BF16/FP16 for activations during compute. Saves memory and bandwidth but doesn't help compute. Weight+activation quantizes both, requiring lower-precision compute (FP8 tensor cores, INT8 matmul). Provides compute speedup as well.

Q: Is FP8 production-ready in 2026?

Yes. Hopper (H100/H200) FP8 has been production since 2023. Blackwell FP8 is mature. TRT-LLM, vLLM, and SGLang all default to FP8 for new deployments. Quality delta is small.

Q: When does FP4 break for production?

For most chat workloads: FP4 with proper calibration is fine. For hard reasoning (math, code, complex logic) FP4 sometimes shows 2-3pt MMLU regression. Decision: serve reasoning workloads on FP8, throughput-sensitive chat on FP4.

Q: What is "calibration" in quantization?

Run a small dataset (~128-1024 samples) through the model to collect activation statistics. The statistics are used to set per-channel or per-tensor scales. Calibration data should match the deployment workload.

Q: Does AWQ require fine-tuning?

No. AWQ is post-training quantization (PTQ). The "activation-aware" part refers to using activation statistics from calibration, not fine-tuning.

Q: What's NF4 and when to use it?

NormalFloat 4-bit. Uses non-uniform code-points that match the normal distribution of weights. Used in QLoRA for fine-tuning. Production inference rarely uses NF4 directly; preferred formats are GPTQ or AWQ INT4.

Q: Can I serve a GGUF model in vLLM?

Limited. GGUF is the llama.cpp format. vLLM supports its own quantization formats (AWQ, GPTQ, FP8). Converting GGUF to vLLM-compatible requires re-quantizing.

Q: What's the memory savings of FP8 vs BF16?

50% for weights, 50% for KV cache, 50% for activations. For a 70B model, BF16 takes ~140 GB; FP8 takes ~70 GB. Critical for fitting models on single nodes.

Q: Is there a quality-free quantization?

Practically no. Even FP8 has small quality loss (< 0.5pt MMLU). The art is choosing quantization aggression that meets workload quality bar.

Q: Should I quantize myself or use pre-quantized models?

Pre-quantized for most use cases. Calibration data and tuning take expertise. Hugging Face hosts pre-quantized variants of all major open models. Self-quantize when you have specific calibration needs.

Q: How does quantization interact with LoRA?

LoRA adapter weights are typically BF16 even when base is quantized. Inference combines the quantized base with full-precision LoRA delta. Quality preservation is good. See multi-tenant LoRA serving.

Q: Can I quantize the embedding layer?

Yes, but it's usually the layer with the largest quality impact per bit removed. Many production stacks keep embeddings at BF16 even when other layers are quantized.

Q: What about lm_head (the output projection)?

Same as embeddings — often kept at BF16. Quantizing the final projection has outsized quality impact.

Q: How does INT4 quantization interact with MoE?

Per-expert quantization works; the hottest experts can stay at FP8 while rarely-used go to INT4 or FP4. See mixture-of-experts serving.

Q: Are there quantization-aware fine-tuning libraries?

QLoRA (Dettmers et al., 2023) is the canonical reference. Fine-tunes quantized base with full-precision LoRA. bitsandbytes integration in PEFT.

Q: How long does quantization take?

GPTQ on a 70B model: 1-8 hours on a single A100/H100. AWQ: similar. HQQ: minutes (no calibration). OmniQuant: longer (training-style optimization).

Q: Can I requantize a quantized model to lower bits?

Generally no — quality degrades faster than re-quantizing from BF16. Better to keep the BF16 base and re-quantize.

Q: What's a "group size" in INT4 quantization?

The number of weights that share a single scale factor. Common: 32, 64, 128. Smaller groups give better quality but more overhead. 128 is the production sweet spot for INT4.

Q: Does INT4 hurt long-context performance?

Sometimes. Long-context retrieval ("needle in haystack") can degrade with aggressive quantization, especially when KV cache is also quantized. Test specifically.


Cross-references and further reading


Hardware-specific quantization paths

Each hardware target has a different optimal quantization recipe.

NVIDIA H100 / H200 (Hopper)

FP8 tensor cores. FP8 (E4M3) for matmul, INT8 for legacy paths. INT4 weight-only via dequantization-on-the-fly to FP16/FP8. Production sweet spot: FP8 for forward, FP8 KV cache.

NVIDIA B100 / B200 (Blackwell)

FP4 (NVFP4) tensor cores in addition to FP8. 2× FP8 throughput via FP4. Sweet spot: FP4 FFN + FP8 attention, or full FP4 for non-reasoning workloads.

NVIDIA L40S / L4

Lovelace generation. FP8 supported. Common for inference-only deployments.

NVIDIA RTX 4090 / 5090 (consumer)

FP8 supported on Lovelace and newer. Used heavily in local-inference setups. Quantization sweet spot: INT4 GPTQ for memory budget, FP8 for compute.

AMD MI300X / MI325X / MI350X

ROCm support for FP8. INT8 mature. INT4 emerging. Production deployments use FP8 for parity with Hopper.

Intel Gaudi 3

INT8 and FP8 support. Software stack less mature.

Apple Silicon (M2/M3/M4)

MLX supports INT4 and INT8 quantization. Unified memory architecture means quantization mostly saves memory, not compute (no specialized tensor cores).

Mobile NPUs (Qualcomm Hexagon, Google Tensor)

INT4 / INT8 native. FP16 supported. Production inference on phone uses these aggressively.

Hardware comparison summary

Hardware Best precision Tensor cores HBM/Memory Notes
H100/H200 FP8 FP8, INT8 80/141 GB HBM Mature production
B100/B200 FP4 FP4, FP8, INT8 192 GB HBM New frontier
L40S FP8 FP8, INT8 48 GB GDDR Inference SKU
RTX 4090 FP8/INT4 FP8, INT8 24 GB GDDR Local power user
MI300X FP8 FP8, INT8 192 GB HBM AMD alternative
Gaudi 3 FP8/INT8 FP8, INT8 128 GB HBM Niche
Apple M3/M4 INT4 Limited Unified On-device

Quantization for specific workloads

The right precision varies by what the model is being asked to do.

Chat (general)

INT4 (AWQ/GPTQ) is acceptable. Quality cost minimal for conversational use. Most user-facing chat is INT4.

Code generation

INT4 sometimes acceptable; FP8 safer. Code is more sensitive to small mistakes than prose. Recent benchmarks show INT4 GPTQ Llama-70B drops code-completion accuracy by 1-3pts vs BF16.

Math reasoning

FP8 or BF16. INT4 quality regression is real and large (3-7pt drop on GSM8K, MATH benchmarks). Don't aggressively quantize math models.

RAG

INT4 acceptable; the bottleneck is retrieval quality, not LLM precision. See RAG production architecture.

Agent / tool use

FP8. Agents make many calls; small quality regressions compound. Worth the extra cost.

Long-context retrieval

Sensitive to KV cache precision more than weights. FP8 KV cache safe; INT4 KV cache risky for long context.

Multimodal

Vision encoder typically kept at higher precision. Language head can be quantized. Joint quantization needs careful calibration. See multimodal serving.

Reasoning models (R1, o-series)

Conservative quantization. Thinking traces are long; quality compounds. FP8 or BF16 typical.

Workload-quant decision table

Workload Aggressive (max throughput) Conservative (max quality)
Chat INT4 / FP4 FP8
Code FP8 BF16
Math FP8 (no further) BF16
RAG INT4 / FP4 FP8
Agent FP8 BF16
Long-context FP8 weights + FP8 KV BF16 + FP8 KV
Multimodal FP8 language + BF16 vision BF16 throughout
Reasoning FP8 BF16

Quantization in fine-tuning workflows

Fine-tuning a quantized base model is a common pattern.

QLoRA (Dettmers et al., 2023)

Quantize base to NF4, train LoRA adapter in BF16. Memory savings: 4× vs full BF16 fine-tuning. Quality preservation: surprisingly strong. The canonical pattern for fine-tuning frontier models on consumer GPUs.

NEFTune

Add small noise to embeddings during fine-tuning. Improves quality. Stacks cleanly with QLoRA.

ReLoRA

Periodically merge LoRA into base and start a new LoRA. Enables longer fine-tuning at LoRA cost.

Full QAT (Quantization-Aware Training)

Train the model with simulated quantization in the forward pass. Best quality at low bits (INT4, INT2). Expensive.

Post-fine-tune requantization

Fine-tune in BF16, then quantize the result. Common but loses some quality vs QAT.

Fine-tune workflow comparison

Workflow Bits trained Bits served Memory Quality
QLoRA 4 (base) + 16 (adapter) 4 + 16 Lowest Strong
NEFTune-LoRA 16 16 Medium Best
ReLoRA 4 + 16 4 + 16 Lowest Improving
Full QAT simulated 4 4 High Best for low bits
FT + PTQ 16 4 Medium Common

Quantization deployment checklist

For going from a BF16 model to a production-quantized serving deployment:

  1. Pick target precision based on hardware and workload (table above).
  2. Choose technique (GPTQ, AWQ, FP8 native, FP4 native). Match to stack.
  3. Prepare calibration data. ~128-1024 samples representative of production traffic. Domain-matched.
  4. Run quantization. Verify no NaN/Inf in outputs.
  5. Eval on standard benchmarks (MMLU, MMLU-Pro, GSM8K for relevant workload).
  6. Eval on production-specific tasks (your evals).
  7. Eval safety (refusal rate, jailbreak suite).
  8. Benchmark throughput and latency vs BF16 baseline.
  9. A/B test in production on small traffic %.
  10. Monitor in production for quality regressions.

Common pitfalls

  • Calibration data mismatch with production distribution.
  • Quantizing embedding/lm_head when shouldn't.
  • Missing the activation outlier issue.
  • Not re-running safety evals.
  • Mismatched KV cache and weight precision.
  • Stack-specific bugs (early FP4 implementations had quality issues).

Cost-economics of quantization at scale

Quantization is one of the highest-leverage cost-reduction techniques.

Per-token cost reduction

Format Throughput multiplier vs BF16 Per-token cost reduction
BF16 1.0× baseline
FP8 1.5-1.8× 33-44%
INT4 (AWQ) 2.0-2.5× 50-60%
FP4 on B200 3.0-3.5× 66-71%

Capex implications

Same throughput in INT4 / FP4 vs BF16 means fewer GPUs. For a deployment serving 100k QPS sustained:

  • BF16: ~400 H100 GPUs.
  • FP8: ~250 H100.
  • INT4: ~170 H100.
  • FP4 on B200: ~100 B200.

GPU capex savings: 50-75% from quantization alone, before considering Blackwell's FP4 throughput edge.

Memory budget

For a 70B model with 100k context:

  • BF16 weights + BF16 KV: 140 + ~50 = 190 GB → 3x H100 minimum.
  • FP8 weights + FP8 KV: 70 + 25 = 95 GB → 2x H100.
  • INT4 weights + FP8 KV: 35 + 25 = 60 GB → 1x H100.

Quantization fits more model on fewer GPUs, which is often the binding constraint.

Quantization vs other levers

Quantization sits alongside speculative decoding, batching, and disaggregated inference as cost levers. They stack: FP8 + speculative + disaggregated can deliver 4-6× over BF16 single-shot.

For broader context see AI inference cost economics and disaggregated inference.


Open-source quantized model ecosystem

Where to get pre-quantized models in 2026.

Hugging Face

The de-facto registry. Pre-quantized variants of Llama, Mistral, Qwen, DeepSeek, etc. Multiple formats (GPTQ, AWQ, GGUF, EXL2, MLX). Common naming: Llama-3.1-70B-Instruct-GPTQ-INT4.

TheBloke (community)

Historically the leading source for community-quantized variants. Active through 2024; succeeded by other community publishers as the ecosystem grew.

Vendor-quantized

NVIDIA NIM, AMD's optimized models, Meta's optimized Llama variants. Often production-tuned.

Model-publisher quantized

DeepSeek publishes FP8 variants of V3; Mistral publishes FP8 / INT4 variants of Mixtral; Meta publishes quantized Llama variants. Use these when available.

What to look for

  • Published evaluation results (MMLU, code, math).
  • Calibration data description.
  • Stack-specific compatibility (vLLM vs TRT-LLM vs llama.cpp).
  • Safety eval results.

Quantization for batched inference

Batched inference has specific implications for quantization.

Why batching changes things

At batch size 1, kernel launch overhead dominates and quantization gives less speedup. At batch size 32+, compute saturates the hardware and quantization's throughput multiplier appears in full.

Per-tensor vs per-batch scaling

Per-tensor scaling factors are static; per-batch can adapt. Most production stacks use static (calibrated) scaling for the throughput benefit.

Batched FP8

FP8 attention and FFN at high batch is the production sweet spot on Hopper. ~5500 tps Llama 70B 8x H100.

Batched INT4

INT4 + batched gives the highest throughput on Hopper. ~7500 tps Llama 70B 8x H100. Some quality cost.

Batched FP4 on Blackwell

The 2026 frontier configuration. FP4 + batch 64+ + B200 NVL: 10000+ tps Llama 70B equivalent.

Tail-latency considerations

Quantization-accelerated kernels often have tighter latency distributions than dense BF16. Tail latency improvements are real.


Quantization observability

Production deployments need to monitor quantization-specific signals.

Quality regressions

A/B test new quantization recipes on small traffic %, monitor for win-rate regression vs baseline. Automate.

NaN / Inf

Aggressive quantization can produce numerical issues under specific inputs. Monitor for NaN counts in production logs.

Per-input quality variance

Some inputs hit quantization edge cases. Track per-prompt quality if possible; investigate outliers.

Hardware-specific issues

FP8 tensor core errors or FP4 misconfigurations cause silent corruption. Hardware ECC monitoring + per-tensor sanity checks.

Migration from BF16 → FP8 → FP4

Run all three in parallel for a window. Compare quality. Cut over when confident.

Tools

NVIDIA NIM has built-in quantization quality monitoring. vLLM and SGLang expose metrics. Custom evaluation harnesses are common.


Quantization research priorities for 2026-2027

What's not yet solved.

W4A4 (4-bit weights + 4-bit activations) without quality regression

OmniQuant and SpinQuant get close; production-grade not yet.

Provably lossless quantization

Theoretical guarantees that quantization preserves capability above a threshold. Open research.

Sub-2-bit production quality

BitNet b1.58 and AQLM at INT2 work; production deployment at frontier scale is open.

Better KV cache quantization

KIVI is strong; INT2 KV with good long-context performance is the goal.

Calibration-free quantization

HQQ shows promise; production stacks rarely use it yet. Could simplify deployment.

Quantization for FP4 training

Production FP4 training is emerging. Quality vs BF16 training is the open question.

MoE-specific quantization

Per-expert quantization, hot-vs-cold expert precision differentiation. Open research.

Multimodal quantization

Vision encoders are sensitive; quantization recipes for joint vision-language are nascent.