Speeding Up PyTorch for AI: CUDA Graphs, torch.compile, AOT Inductor, FlashAttention, Kernel Fusion — The Complete Guide
The definitive guide to making PyTorch fast on GPUs: CUDA Graphs, torch.compile (Dynamo + Inductor), AOTInductor, FlashAttention 1/2/3, CUTLASS, ThunderKittens, Triton, TensorRT, dynamic-shape handling, profiling — and how production inference stacks combine them.
Run a small batch on a fast GPU and watch the utilization meter. It rarely hits the ceiling. The reason isn't that the work is too small — it's that the overhead per unit of work is large enough to matter. This is the kernel-launch tax, and it explains most of the throughput difference between a naive PyTorch inference loop and a tuned production stack. But launches are only one of three taxes you're paying. The other two — redundant HBM round-trips inside attention, and the cost of every operator going through Python — show up the moment you fix the first one.
The take: making PyTorch fast for modern LLMs is a four-layer problem and any complete answer touches all four. (1) Kernel-launch overhead — fixed by CUDA Graphs. (2) Kernel count and HBM traffic — fixed by torch.compile (Dynamo frontend + Inductor backend) and ahead-of-time variants like AOTInductor. (3) Attention-specific bandwidth waste — fixed by FlashAttention 1/2/3, CUTLASS, and Hopper-tuned kernels written in Triton or ThunderKittens. (4) Cross-cutting compilation for production — handled by TensorRT / TensorRT-LLM on NVIDIA, or hand-written hot kernels for the highest-value paths. Every serious 2026 inference stack (vLLM, SGLang, TensorRT-LLM, Hugging Face TGI) combines all four.
This guide explains where each tax comes from, the techniques that remove it, the trade-offs, and how to reason about which one helps which workload.
Table of contents
- Key takeaways
- Mental model: CUDA graphs and torch.compile in one minute
- The landscape in 2026
- Where the time actually goes
- Why decode is launch-bound
- CUDA graphs
- torch.compile
- Kernel fusion
- FlashAttention generations explained
- CUDA Graphs vs torch.compile — when to use which
- AOTInductor for production
- Profiling tools (Nsight, PyTorch Profiler)
- The shape-specialization problem
- Combining graphs and compile
- Trade-offs and limitations
- Profiling for launch overhead
- Production usage in 2026
- When it doesn't help
- torch.compile internals: Dynamo, AOTAutograd, Inductor
- Per-backend tour: Inductor, IPEX, ONNXRT, TensorRT, vLLM/SGLang/TGI compile modes
- CUDA Graphs mechanics: stream capture, allocators, mutable args
- CUDA Graph gotchas: cuBLAS warmup, NCCL streams, allocator pools
- Per-stack CUDA Graph adoption (vLLM, SGLang, TRT-LLM)
- Inductor Triton template kernels and the autotune cache
- torch.compile with FSDP, DDP, Megatron, and custom ops
- Numerical precision, FP8/quantized compile, reproducibility
- Benchmarks: eager vs compile vs graphs on Llama 3, Mistral
- Production deployment workflow: AOT, warmup, version pinning
- CPU-bound vs GPU-bound regimes and Blackwell changes
- torch.compile decision matrix and the "fast eager vs slow compile" trade-off
- Reproducibility and determinism in compiled code
- Profiling compiled code: what's different
- CUDA graphs for training: the rarer use case
- Blackwell-specific compile considerations
- Comparison tables
- The bottom line
- FAQ
- Glossary
- References
Key takeaways
- Each kernel launch costs ~5-20 µs of CPU and dispatch time, regardless of how small the kernel itself is.
- A transformer forward pass launches hundreds to thousands of kernels (roughly 10-25 per layer without fusion). At small batch sizes, launch overhead can be 30-60% of step time.
- Decode is the worst case: small kernels, sequential steps, latency-critical. Prefill amortizes launch cost over large work; decode doesn't.
- CUDA graphs capture a sequence of launches once and replay it cheaply. Drops dispatch overhead to near-zero.
- torch.compile generates better kernels — fewer of them, with fusion. Reduces both kernel count and HBM traffic.
- They're complementary: compile produces fewer, better kernels; graph capture launches them cheaply.
- Cost: both are shape-specialized. Variable shapes require bucketing and pre-compilation.
- Production reality: every serious inference stack uses both. The combination is one of the largest practical decode-throughput wins available.
The full stack at a glance
[Request]
│
▼
[Python serving layer] ── routing, batching, scheduling
│
▼
[PyTorch model graph] ── torch.compile (Dynamo + Inductor) traces, fuses
│
▼
[Compiled kernels] ── Triton + CUTLASS + FlashAttention
│
▼
[CUDA Graphs] ── capture/replay, dispatch overhead ~0
│
▼
[GPU] ── tensor cores, HBM, NVLink
Each layer addresses a different bottleneck. Skipping any one leaves performance on the table. The "PyTorch is slow" complaints of 2022-2023 were largely about workflows that used eager mode for production; in 2026 the stack is mature enough that most teams reach FP8 + FA3 + CUDA Graphs + Inductor + AOTInductor without writing custom kernels.
Quick comparison: launch-overhead remedies
| Technique | What it fixes | Decode speedup | Shape-flexible | Setup cost | When to use |
|---|---|---|---|---|---|
| Eager PyTorch | Nothing (baseline) | 1.0× | Yes | None | Prototyping, training |
| CUDA Graphs only | CPU dispatch overhead | 1.5-2.0× | Bucketed | Capture pass | Decode with predictable shapes |
| torch.compile only | Kernel count + HBM traffic | 1.3-1.5× | Bucketed | Seconds-minutes | Any workload, especially element-wise heavy |
| Compile + Graphs | Both | 2.0-3.0× | Bucketed | Both | Production decode at any scale |
| TensorRT-LLM | Both, vendor-tuned | 2.0-3.5× | Bucketed | Engine build | NVIDIA-only production |
| Custom Triton | Hand-fused critical paths | Workload-dep. | Manual | Engineering | Hot kernels in a custom stack |
Numbers are typical decode wins at small-to-moderate batch on Hopper-class GPUs; your mileage varies. See References.
Where each technique fits in the optimization order
A typical optimization journey for a new serving deployment:
- Run baseline. Eager PyTorch generate(), no compile, no graphs. Measure tokens/s/GPU at production batch sizes.
- Enable FlashAttention. If not already on, this is the single largest win for long-context workloads (drop-in, no risk).
- Enable CUDA Graphs. Largest dispatch-overhead reduction at small batch. Test on representative shapes.
- Enable torch.compile or use TRT-LLM. Kernel fusion and shape specialization. Pin reasonable buckets.
- Add FP8 quantization. Halves bandwidth on the decode path.
- Add speculative decoding. Optional, large win on large targets.
- Custom Triton kernels. Only on hot paths the compiler isn't fusing. Engineering-intensive.
Skip steps at your peril; the order matters because later steps depend on earlier ones being correct.
Mental model: CUDA graphs and torch.compile in one minute
The named problem is the launch-overhead tax. Every CUDA kernel costs a few microseconds of CPU work to dispatch, regardless of how small the kernel is. A transformer decode step fires hundreds of kernels, each doing tiny amounts of work. At small batch sizes, the GPU spends most of its wall-clock time waiting for Python and the CUDA driver to tell it what to do next — the hardware is idle while the CPU types.
Two mental images cover most of it. CUDA Graphs are a recorded macro: you run the sequence of kernels once with the driver watching, save the recording, and from then on you "press play" instead of typing every keystroke. torch.compile is a transpiler: it watches your Python with Dynamo, lowers it to an FX graph, then has Inductor (which targets Triton) emit a smaller number of bigger, fused kernels. Graphs cut dispatch cost. Compile cuts kernel count and HBM round-trips. They are orthogonal and compose.
| Layer | Without | With |
|---|---|---|
| Per-step dispatch | 5–20 µs × hundreds of kernels | Single replay call |
| Kernel count per forward pass | hundreds to thousands | several-fold fewer after fusion |
| HBM traffic | every op reloads tensors | fused ops keep them in registers |
| Decode tokens/sec (small batch) | baseline | 2–3× with both |
| Shape flexibility | full dynamism | bucketed, recompile on miss |
| Sticky number | n/a | CUDA Graphs eliminate ~5–20 µs/step of launch overhead |
The production one-liners:
# torch.compile: fuses kernels, specializes on shape
model = torch.compile(model, mode="reduce-overhead")
# CUDA Graphs: capture once, replay forever
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    out = model(static_inputs)  # capture
g.replay()  # replay, near-zero CPU cost
mode="reduce-overhead" even wires CUDA Graphs in for you. In production stacks (vLLM, SGLang, TensorRT-LLM) both layers are on by default, paired with FlashAttention for the attention block and FP8 weights for the bandwidth-bound parts. The rest of this guide is how that combination is assembled and where it falls over.
The landscape in 2026
The PyTorch acceleration stack has consolidated. A 2026 inference engineer chooses from a finite menu rather than rolling everything by hand.
The frontend / compiler layer
- torch.compile: the default. Dynamo traces Python bytecode; Inductor lowers FX graphs to Triton (GPU) or C++/OpenMP (CPU). Production-grade for most LLMs as of PyTorch 2.4+. Handles dynamic shapes via guards and recompilation buckets; the chronic pain point of 2023 is largely solved in 2026 with dynamic=True and mark_dynamic.
- AOTInductor: ahead-of-time variant. Produces a compiled .so that loads without a Python interpreter at runtime (it is typically still linked against the PyTorch C++ runtime). Used for low-latency production serving and on-device deployment.
- TensorRT / TensorRT-LLM: NVIDIA's commercial compiler. More aggressive optimization than torch.compile, NVIDIA-only, requires engine builds per shape. The default for hyperscale NVIDIA-only serving.
The kernel layer
- CUDA Graphs — captures launch sequences for cheap replay. Orthogonal to compilation; used by every serious serving stack.
- Triton (Tillet et al., MAPL 2019) — Python-like DSL for GPU kernels; the default backend behind Inductor and the language most new hand-tuned kernels are written in. See our Triton kernel primer.
- Mosaic / Pallas (JAX) — JAX's kernel DSL, less relevant for PyTorch-centric serving but worth knowing for cross-framework benchmarking.
- CUTLASS — NVIDIA's C++ template library for high-performance GEMMs. Powers most production matmul kernels under the hood.
- ThunderKittens (Spector et al., Stanford Hazy Research, 2024) — minimalist C++ DSL targeting Hopper tensor cores; produces FlashAttention-3-class kernels in ~100 lines. The frontier for hand-written kernel performance.
Attention kernels specifically
FlashAttention is the most consequential kernel-fusion success story in deep learning, and it now has three generations:
- FlashAttention 1 (Dao et al., 2022, arXiv:2205.14135) — IO-aware tiling; first algorithm to fuse softmax with matmuls and avoid materializing the attention matrix in HBM.
- FlashAttention 2 (Dao, 2023, arXiv:2307.08691) — better parallelism across thread blocks and warps; ~2× over FA-1.
- FlashAttention 3 (Shah et al., 2024, arXiv:2407.08608) — Hopper-specific, uses asynchronous tensor cores (WGMMA) and TMA; ~1.5–2× over FA-2 on H100.
A dedicated section below unpacks each.
Layered comparison
| Technique | Layer | What it fixes | Typical overhead | Where it helps most |
|---|---|---|---|---|
| Eager PyTorch | — | Nothing (baseline) | — | Prototyping |
| torch.compile (Dynamo + Inductor) | Compiler | Kernel count, HBM traffic | Seconds–minutes compile | Any workload, esp. element-wise heavy |
| AOTInductor | Compiler (AOT) | Python overhead, cold-start | Compile once, ship .so | Production serving, on-device |
| CUDA Graphs | Runtime | CPU dispatch overhead | Capture pass per shape | Small-batch decode |
| FlashAttention 2/3 | Kernel | HBM round-trips in attention | None (drop-in) | Long-context, training |
| Triton hand-written | Kernel | Custom fusion | Engineering | Hot paths a compiler misses |
| ThunderKittens | Kernel DSL | Hopper-class attention | Engineering | FA-3-class kernels |
| CUTLASS | Library | GEMM performance | None | Custom matmul shapes |
| TensorRT-LLM | Compiler + Runtime | All of the above, NVIDIA-tuned | Engine build per shape | Hyperscale NVIDIA serving |
| Custom kernel | Hand | Whatever | Significant | Bottleneck-specific |
For where this fits the broader serving picture see vLLM and PagedAttention, KV cache memory math, disaggregated inference, speculative decoding, and reasoning-model serving. For the precision side of the same throughput equation, see quantization tradeoffs and FP8 training tradeoffs.
Version pinning for a 2026 stack
Reference combination in May 2026: PyTorch 2.6 with CUDA 12.6, NVIDIA driver 560.x, Triton 3.2 (the version PyTorch 2.6 pins), FlashAttention 2.6+ (FA3 enabled on H100/H200/B200), Inductor enabled by default in torch.compile, AOTInductor for production serving stacks. CUDA Toolkit version drives whether async TMA paths work; old toolkits silently fall back to slower kernels. Always pin versions explicitly; PyTorch nightly is fine for development, never for production.
Where the time actually goes
Every GPU operation — a matmul, an addition, a softmax — is dispatched as a kernel: a small program that runs on the GPU. Each launch has overhead.
The cost of one launch
CPU side, per kernel:
- Assemble the arguments (pointers, shapes, strides).
- Push to the CUDA driver.
- Driver dispatches to the GPU command queue.
Typical wall-clock cost: 5-20 µs depending on stack and number of arguments. On Hopper/Blackwell with optimized drivers, lower end. With eager PyTorch and complex argument lists, higher end.
How many launches in a forward pass
A simplified accounting for one transformer layer with no kernel fusion:
- LayerNorm: 1-3 kernels
- QKV projection: 1-3 kernels (depending on whether QKV is fused)
- RoPE rotation: 1-2 kernels
- Attention: 1-2 kernels (FlashAttention is a single fused kernel)
- Output projection: 1-2 kernels
- Residual add: 1 kernel
- LayerNorm: 1-3 kernels
- FFN up-projection: 1-2 kernels
- Activation: 1-2 kernels
- FFN down-projection: 1-2 kernels
- Residual add: 1 kernel
Roughly 10-25 kernels per layer in a naive implementation. For a 70B model with 80 layers, that's 800-2000 kernels per forward pass.
At 10 µs per launch: 8-20 ms of pure dispatch overhead per forward pass, before the GPU does any work.
Why this matters for decode
For a model where one decode step takes ~50 ms total, 10-20 ms of dispatch overhead is 20-40% of step time. Eliminating it is worth roughly 1.25-1.67× in throughput at small batch sizes on its own, before any second-order gain from keeping the GPU pipeline full.
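The arithmetic in this section fits in a few lines. The sketch below uses the kernel counts and the 10 µs launch cost quoted above; the function names are illustrative and not part of any library.

```python
def dispatch_overhead_ms(layers: int, kernels_per_layer: int, launch_us: float) -> float:
    """Total CPU dispatch time for one forward pass, in milliseconds."""
    return layers * kernels_per_layer * launch_us / 1000.0


def speedup_if_removed(step_ms: float, overhead_ms: float) -> float:
    """Best-case throughput gain if the dispatch overhead disappears entirely."""
    return step_ms / (step_ms - overhead_ms)


# 80-layer model, 10-25 kernels per layer, 10 us per launch
low = dispatch_overhead_ms(80, 10, 10.0)    # 8.0 ms
high = dispatch_overhead_ms(80, 25, 10.0)   # 20.0 ms

# Share of a 50 ms decode step, and the gain from removing that share
for overhead in (low, high):
    share = overhead / 50.0
    gain = speedup_if_removed(50.0, overhead)
    print(f"{overhead:.0f} ms = {share:.0%} of step, up to {gain:.2f}x if eliminated")
```

The speedup is step time divided by what remains, so removing 40% of a step is worth about 1.67×, not 2×.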
Kernel count by model size
The kernel count scales with layer count, not directly with parameter count, because most of the kernels are per-layer ops. A rough comparison:
| Model | Layers | Kernels per forward (naive) | Kernels after compile |
|---|---|---|---|
| Llama 3 8B | 32 | ~600 | ~120 |
| Llama 3 70B | 80 | ~1500 | ~280 |
| Llama 3 405B | 126 | ~2400 | ~440 |
| DeepSeek-V3 (MoE) | 61 | ~3000 (MoE all-to-alls) | ~700 |
For MoE the kernel count includes the dispatch and combine all-to-alls per MoE layer, multiplying the per-layer count. CUDA Graphs help proportionally more for MoE because the larger kernel count amplifies the dispatch overhead.
A concrete dispatch tax measurement
Run the same Llama-3-70B decode step on H100 SXM at batch 1 with FP16 weights:
| Configuration | Step time | GPU active | CPU dispatch | Notes |
|---|---|---|---|---|
| Eager PyTorch | 64 ms | 38 ms | 26 ms | Baseline; 40% wasted |
| CUDA Graphs | 41 ms | 38 ms | 3 ms | Dispatch tax mostly gone |
| torch.compile + Graphs | 32 ms | 30 ms | 2 ms | Fewer, fused kernels too |
| TRT-LLM (engine built for batch 1) | 28 ms | 27 ms | 1 ms | Plus FP8 kernels |
The eager-to-graphs improvement is purely dispatch overhead; the graphs-to-compile improvement is kernel count and HBM traffic; the compile-to-TRT improvement is FP8 plus more aggressive scheduling. At batch 32, the same exercise yields 50% smaller relative gains because dispatch tax is amortized.
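The dispatch share and the speedup of each row can be recomputed from the table. This sketch only re-derives ratios from the numbers listed above; the dictionary keys are illustrative labels.

```python
# (step_ms, gpu_active_ms, cpu_dispatch_ms) as listed in the table above
configs = {
    "eager": (64, 38, 26),
    "cuda_graphs": (41, 38, 3),
    "compile_plus_graphs": (32, 30, 2),
    "trt_llm": (28, 27, 1),
}

baseline_ms = configs["eager"][0]
summary = {}
for name, (step, gpu, cpu) in configs.items():
    summary[name] = {
        "dispatch_share": cpu / step,            # fraction of the step spent dispatching
        "speedup_vs_eager": baseline_ms / step,  # end-to-end gain over eager
    }
    print(f"{name:20s} dispatch {cpu / step:5.1%}  speedup {baseline_ms / step:.2f}x")
```

Reading the output row by row separates the three effects: graphs remove dispatch time while GPU-active time stays at 38 ms, and only the compiled rows reduce GPU-active time itself.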
Why decode is launch-bound
Prefill: each kernel does substantial work over a long sequence. Kernel-launch cost is dominated by the actual compute. Launch overhead is small relative to total step time.
Decode: each kernel does tiny work — one token's worth of matmul. The launch overhead becomes proportionally significant.
The arithmetic-intensity angle
Decode at small batch is bandwidth-bound — the GPU is mostly idle compute-side, waiting for memory. Launch overhead doesn't hurt the bandwidth; it leaves the GPU idle while the CPU prepares the next kernel.
If you fix the launch overhead, the GPU spends more cycles actually reading weights and producing tokens.
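A rough way to see why small-batch decode is bandwidth-bound is to compare a GEMM's arithmetic intensity (FLOPs per byte of weights read) with the GPU's ridge point. The peak figures below (about 990 dense FP16 TFLOPS and 3.35 TB/s of HBM bandwidth for an H100 SXM) are assumptions for illustration, and the model ignores activation traffic.

```python
def arithmetic_intensity(batch: int, d_in: int, d_out: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weight traffic for a (batch x d_in) @ (d_in x d_out) GEMM."""
    flops = 2 * batch * d_in * d_out               # one multiply and one add per weight per row
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes


# Assumed H100 SXM peaks; substitute the numbers for your own hardware
PEAK_FLOPS = 990e12
PEAK_BYTES_PER_S = 3.35e12
RIDGE = PEAK_FLOPS / PEAK_BYTES_PER_S  # FLOPs/byte needed to become compute-bound


def is_bandwidth_bound(batch: int, d_in: int = 8192, d_out: int = 8192) -> bool:
    return arithmetic_intensity(batch, d_in, d_out) < RIDGE


for batch in (1, 32, 512):
    ai = arithmetic_intensity(batch, 8192, 8192)
    print(f"batch {batch:3d}: {ai:6.1f} FLOPs/byte, bandwidth-bound: {is_bandwidth_bound(batch)}")
```

With FP16 weights the intensity equals the batch size, so under these assumptions a decode GEMM stays bandwidth-bound until the batch reaches a few hundred rows.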
The pipeline-stall problem
A subtle effect: when the CPU spends 10 µs preparing the next kernel launch, the GPU's tensor cores stall for that entire window. On a Hopper-class GPU at 1.8 GHz, 10 µs is 18,000 clock cycles of idle tensor cores per launch. Multiplied across 2000 launches per forward pass, that's 36 million wasted cycles on the GPU side — physically not visible in any FLOPs counter, but visible as a 40% throughput loss. CUDA Graphs solve this directly by retiring kernels in a tight back-to-back pipeline, eliminating the per-launch pipeline stall.
Why this is invisible in training
Training does prefill-shaped passes on long sequences in big batches. Launches are amortized. Adding launch-elimination optimizations to training yields small wins.
Inference, especially decode, is the workload where launch overhead is biggest, and where launch-elimination yields the largest absolute speedup.
Why batch 1 decode is so painful
At batch 1, the GPU is doing tiny matmuls per layer: a 1×8192 vector times an 8192×8192 weight matrix. The matmul is bound by reading the weights, not by math: 8192×8192 FP16 weights are 128 MiB, which takes roughly 40 µs to stream from HBM at H100's ~3.35 TB/s. The kernel launch takes ~5-15 µs, so launching costs a large fraction of the kernel's own runtime and buys exactly one token of work. Increase the batch to 32 and the GEMM takes about the same time, because it is still dominated by the same weight read, but the launch cost is now shared by 32 tokens and becomes a small fraction of the useful work. This is why bigger batches shrink dispatch overhead disproportionately, and why decode at batch 1 is the worst-case workload for any serving stack.
CUDA graphs
A CUDA graph captures a sequence of kernel launches once and replays it cheaply on subsequent invocations.
How it works
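import torch
# Capture phase (once); example_input is a placeholder for a representative input
graph = torch.cuda.CUDAGraph()
static_input = torch.zeros_like(example_input)  # fixed-address input buffer
with torch.cuda.graph(graph):
    static_output = model(static_input)
# Replay phase (many times)
static_input.copy_(new_input)
graph.replay()
# static_output now contains the result for new_input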
The graph object stores the dependency graph of kernels — which kernel reads which buffer, which produces which output. Replaying dispatches the entire sequence in one operation, bypassing per-kernel CPU overhead.
What CUDA graphs save
The capture-once-replay-many pattern eliminates the per-kernel CPU work on replay. For a forward pass with 2000 kernels and 10 µs per launch, that's ~20 ms saved per replay. Combined with the GPU keeping its pipelines fuller (no kernel-to-kernel CPU gap), the wall-clock speedup is typically more than just the saved overhead.
Empirically, on small-batch decode: 1.5-2× throughput improvement from CUDA graphs alone.
The second-order win is GPU pipeline fullness. Without graphs, the GPU spends micro-pauses waiting for the next kernel's args to arrive from CPU. With graphs, the dependency graph is pre-known and the driver issues kernels as soon as their dependencies retire, keeping the compute units saturated. On a high-clock-rate H100 this can recover 5-10% beyond the raw dispatch savings.
Constraints
- Fixed shapes. The graph captures specific tensor shapes. Different shapes need different graphs.
- No dynamic control flow. Operations with data-dependent branching can't be captured cleanly.
- Predictable memory. Allocations inside the captured region must be deterministic.
- Stream-bound. The graph captures one CUDA stream's worth of work. Multi-stream patterns need adaptation.
- No CPU sync. Any CPU-side synchronization or Python callback breaks the capture. All work must stay on the GPU.
- Capture stream isolation. The stream used for capture should not be used for other work during capture, or unrelated kernels could be inadvertently captured.
CUDA Graphs example pseudocode
import torch
model = MyLLM().eval().cuda()  # MyLLM: placeholder for your model class
# One static input buffer per batch-size bucket
inputs = {b: torch.zeros(b, 8192, device="cuda") for b in [1, 2, 4, 8, 16, 32]}
graphs = {}
for b, x in inputs.items():
    # Warm up on a side stream so allocators and library workspaces settle
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            model(x)
    torch.cuda.current_stream().wait_stream(s)
    g = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(g):
        out = model(x)
    graphs[b] = (g, x, out)
# Inference path
def infer(real_input):
    n = real_input.size(0)
    # Smallest captured bucket that fits the request
    b = next(b for b in sorted(graphs) if b >= n)
    g, x_buf, out_buf = graphs[b]
    x_buf[:n].copy_(real_input)
    if n < b:
        x_buf[n:].zero_()  # pad
    g.replay()
    return out_buf[:n].clone()
The pattern: pre-capture per bucket, route real requests to the nearest bucket, pad with zeros, replay the graph, clone the relevant slice of the output. Real production stacks add error handling and memory pool management around this skeleton.
How serving stacks handle this
Pre-capture graphs for a set of canonical shapes (typically batch sizes 1, 2, 4, 8, 16, 32, 64; or KV cache sizes at common rounding levels). At runtime, route each request to the nearest captured shape — pad if necessary.
vLLM's V1 scheduler manages this automatically: at startup, it captures graphs for a configurable list of batch sizes up to a maximum capture size, then routes requests to the smallest captured batch that fits. SGLang and TRT-LLM follow similar patterns. Hand-built serving stacks need to implement this logic explicitly; it is non-trivial to get right but standard once implemented.
The padding costs a small amount of wasted compute. The dispatch savings dwarf it.
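The routing step can be sketched without any GPU code. The bucket list matches the canonical batch sizes mentioned above; `pick_bucket` and `padding_waste` are hypothetical helper names, not functions from vLLM or any other stack.

```python
from bisect import bisect_left

BUCKETS = [1, 2, 4, 8, 16, 32, 64]  # captured batch sizes, sorted ascending


def pick_bucket(batch_size: int, buckets=BUCKETS) -> int:
    """Return the smallest captured batch size that fits the request."""
    i = bisect_left(buckets, batch_size)
    if i == len(buckets):
        raise ValueError(f"batch {batch_size} exceeds largest captured bucket")
    return buckets[i]


def padding_waste(batch_size: int, buckets=BUCKETS) -> float:
    """Fraction of the replayed batch that is padding rather than real rows."""
    bucket = pick_bucket(batch_size, buckets)
    return (bucket - batch_size) / bucket


for n in (1, 5, 17, 33):
    print(n, "->", pick_bucket(n), f"({padding_waste(n):.0%} padding)")
```

With power-of-two buckets the padded fraction can approach half of the rows just above a bucket boundary. At small batch the step is dominated by reading weights, so those extra rows add little wall-clock time; a denser bucket list trades less padding for more captured graphs.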
CUDA Graph memory pinning and gotchas
Captured graphs assume the input and output buffers stay at fixed addresses. If you reallocate the input tensor between captures, the graph either fails or — worse — silently runs on stale memory. Production stacks pin a small pool of input/output buffers per graph and copy fresh data into those buffers at request time. The copy is fast (a few microseconds) and avoids the pitfall. The first time you write a CUDA Graph integration without this pattern, you will hit memory corruption that is hard to debug.
Per-shape graph capture cost
Capturing a graph for one shape on a 70B model takes ~50-200 ms (the capture pass runs the model once and records). For 10 bucketed shapes, total capture cost is 0.5-2 seconds at startup. This is one-time, well-amortized, and worth budgeting into your warm-up pipeline. Reducing shapes (via bucketing) reduces capture cost as well as runtime cost.
Memory footprint of captured graphs
Each captured graph reserves its own input/output buffers in HBM. For a 70B model at batch 32, the per-graph buffer footprint is around 100-300 MB depending on KV cache layout. Across 30 captured shapes, the total can reach roughly 3-9 GB of HBM dedicated to graph workspaces. Budget for it; if HBM is tight, reduce the number of buckets before disabling graphs entirely.
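Both budgets are linear in the number of buckets, so a startup plan can be estimated up front. This is a simple model using the per-graph ranges quoted above (50-200 ms to capture, 100-300 MB of buffers); `graph_budget` is an illustrative helper, and it treats every bucket as if it cost as much as the batch-32 one, which overestimates the small buckets.

```python
def graph_budget(num_buckets: int, capture_ms=(50, 200), buffer_mb=(100, 300)):
    """Return ((min_s, max_s), (min_gb, max_gb)) for capture time and HBM use."""
    capture_s = tuple(num_buckets * ms / 1000 for ms in capture_ms)
    memory_gb = tuple(num_buckets * mb / 1000 for mb in buffer_mb)
    return capture_s, memory_gb


time_10, mem_10 = graph_budget(10)   # the 10-bucket capture-time example
time_30, mem_30 = graph_budget(30)   # the 30-bucket memory example
print(f"10 buckets: {time_10[0]}-{time_10[1]} s capture, {mem_10[0]}-{mem_10[1]} GB")
print(f"30 buckets: {time_30[0]}-{time_30[1]} s capture, {mem_30[0]}-{mem_30[1]} GB")
```

In PyTorch, `torch.cuda.graph` also accepts a `pool` argument so that several captures can share one memory pool, which can bring the real footprint well below this upper estimate.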
torch.compile
A different lever: instead of accepting whatever kernel sequence PyTorch's eager mode produces, generate a better one.
What torch.compile does
torch.compile traces the model, builds a graph of operations, and lowers it through a compilation pipeline. The output is optimized code (by default generated by the Inductor backend, which emits Triton kernels on GPU) with:
- Fewer kernels (operations fused into single larger ones).
- Better memory access patterns.
- Reduced redundant computation.
- Specialization for the actual tensor shapes seen.
The mode argument controls the optimization aggressiveness: "default" is fast to compile with reasonable wins; "reduce-overhead" adds CUDA Graphs automatically and is the recommended mode for inference; "max-autotune" runs an extensive kernel-tuning search that can take minutes but produces the best runtime. For production serving, mode="reduce-overhead" is the typical choice.
Inductor versus other backends
Inductor is the default backend; alternative backends exist for specific use cases. aot_eager runs the model eagerly after AOT graph capture (useful for testing whether dynamo graph capture is the bottleneck). cudagraphs directly applies CUDA Graphs without compilation. Custom backends can be plugged in via the backend= argument. For 99% of production inference, the default Inductor backend is the right choice; alternatives are useful for debugging or research.
Tracing modes
- TorchDynamo: dynamic tracing that handles Python control flow. Most general.
- TorchScript: older static tracing. Limited but still used.
- Inductor: the default backend rather than a tracer. It consumes the graphs Dynamo captures and compiles them to Triton kernels for GPU.
TorchScript is largely deprecated for new work in 2026; TorchDynamo + Inductor is the path forward. The legacy TorchScript codebases that remain in production are typically older serving stacks that have not yet migrated to torch.compile. The migration is usually straightforward but worth scheduling deliberately.
What kernel fusion does
If you have a sequence of element-wise operations:
y = x.relu()
z = y * scale
w = z + bias
Naive: three separate kernels, three round-trips to HBM (each kernel reads its input from HBM and writes its output to HBM).
Fused: one kernel that performs (relu(x) * scale + bias) in one pass. One read, one write. Three operations for the price of one's worth of HBM traffic.
For element-wise operations and small reductions, fusion is the dominant win — these are bandwidth-bound, and the fewer HBM round-trips, the better.
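The traffic difference can be counted explicitly. The sketch below models the relu, multiply, add chain on plain Python lists: each "kernel" is one pass over the data, and the byte counts assume 2-byte elements and ignore the small scale and bias reads. It is a counting model, not a GPU kernel.

```python
def hbm_bytes_unfused(n_elems: int, n_ops: int, bytes_per_elem: int = 2) -> int:
    """Each element-wise kernel reads its input from HBM and writes its output back."""
    return n_ops * 2 * n_elems * bytes_per_elem


def hbm_bytes_fused(n_elems: int, bytes_per_elem: int = 2) -> int:
    """One fused kernel: read x once, write w once."""
    return 2 * n_elems * bytes_per_elem


def unfused(xs, scale, bias):
    ys = [max(x, 0.0) for x in xs]   # kernel 1: relu
    zs = [y * scale for y in ys]     # kernel 2: multiply
    return [z + bias for z in zs]    # kernel 3: add


def fused(xs, scale, bias):
    # Single pass: intermediates stay in registers and never touch HBM
    return [max(x, 0.0) * scale + bias for x in xs]


xs = [-1.5, 0.0, 2.0, 3.25]
assert unfused(xs, 0.5, 1.0) == fused(xs, 0.5, 1.0)  # same math, same result

ratio = hbm_bytes_unfused(1 << 20, 3) / hbm_bytes_fused(1 << 20)
print(f"traffic ratio: {ratio:.1f}x")
```

For a chain of k element-wise ops the ratio is simply k, which is why long element-wise chains are where fusion pays off most.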
Fusion examples from Inductor logs
A typical Inductor compilation of a Llama-3 MLP block produces logs like (paraphrased): "fused gate_proj + silu + up_proj_mul into kernel triton_mlp_fused_act", "fused layer_norm + residual_add into kernel triton_norm_residual". Reading these logs is useful for understanding what compile is and isn't catching. If you see an expected fusion missing, it usually indicates a guard (dtype, shape, or stride) preventing the fusion. Inspecting TORCH_LOGS=output_code gives you the generated Triton source for verification.
Why fusion is hard around matmul
Element-wise fusion happens via dataflow analysis on operations that share inputs and outputs. Matmul is special: its tensor cores require specific tile sizes and memory layouts, and the kernel reads its inputs in a very specific access pattern. Fusing a pre-matmul operation (like RoPE) means rewriting the matmul kernel itself to do the pre-op inline — which is exactly what hand-written kernels (FlashAttention, TRT-LLM's MLP kernels) do. General compilers struggle here because the matmul kernel's internal tile structure is opaque to them. The future direction: tile-aware fusion (CUTLASS Python, Triton's matmul, FlexAttention-style declarative kernel composition) makes this easier.
Compilation cost
First-time compilation can take seconds to minutes for a large model. The compiled artifact is cached, and subsequent invocations reuse it. For a 70B model with mode="reduce-overhead", expect 60-180 seconds on first call. With mode="max-autotune", multiply by 5-10×. By default the cache lives in a per-user directory under the system temp directory (/tmp/torchinductor_<username> on Linux) and can be relocated with the TORCHINDUCTOR_CACHE_DIR environment variable. It survives process restarts as long as that directory does; for CI/CD it can be pre-populated and shipped with the deployment artifact.
For serving, you compile once at startup and amortize over many requests. The startup cost is tolerable.
Dynamo guards and recompilation
Dynamo traces Python bytecode and emits "guards" — conditions under which the compiled graph is valid. Common guards: tensor dtype, tensor shape (rank and per-dim size), Python value of integer arguments. If any guard fails at runtime, Dynamo falls back to eager mode or recompiles. The fast path requires no guard violations; debug-level logging (TORCH_LOGS="guards") shows what is being guarded against. The single most common cause of unexpected slow torch.compile performance is guard violations causing recompilation churn, which can dominate runtime if the workload triggers it often.
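Recompilation churn can be observed directly by plugging in a backend that counts how often Dynamo hands it a new graph. This is a minimal sketch that runs on CPU: `counting_backend` and `fn` are illustrative names, and the backend simply returns the captured graph unchanged instead of calling Inductor.

```python
import torch

compile_count = 0


def counting_backend(gm, example_inputs):
    """Custom backend: count compilations, then run the captured FX graph as-is."""
    global compile_count
    compile_count += 1
    return gm.forward


def fn(x):
    return torch.relu(x) * 2 + 1


# dynamic=False forces shape specialization, so each new shape fails a guard
compiled = torch.compile(fn, backend=counting_backend, dynamic=False)

for n in (4, 8, 4, 16):
    x = torch.randn(n)
    assert torch.allclose(compiled(x), fn(x))

print("compilations:", compile_count)
```

Each previously unseen shape fails the shape guard and triggers a new compilation, while the repeated shape is served from the existing entry. Passing dynamic=True, or marking a dimension with torch._dynamo.mark_dynamic, asks Dynamo for a shape-generic graph instead.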
Inductor and FX-graph caching
Inductor caches compiled kernels keyed on the FX graph hash plus shape signature. The cache survives process restarts (by default under /tmp/torchinductor_<username> on Linux, or wherever TORCHINDUCTOR_CACHE_DIR points) but does NOT survive PyTorch version upgrades: every PyTorch upgrade invalidates it. For production, deploy a pre-populated cache as part of your container image to avoid cold-start penalties on first request.
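One way to ship a pre-populated cache is to point Inductor at a fixed directory inside the image. The sketch below assumes the cache location is controlled through the TORCHINDUCTOR_CACHE_DIR environment variable; the path is hypothetical, and whether the FX-graph cache needs to be switched on explicitly depends on your PyTorch version.

```python
import os

# Hypothetical path baked into the container image at build time
CACHE_DIR = "/opt/app/inductor-cache"

# Set these before importing torch so Inductor sees them from the start
os.environ["TORCHINDUCTOR_CACHE_DIR"] = CACHE_DIR
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"  # enable the FX-graph cache explicitly

# import torch  # import only after the environment is configured
```

Running the model's warm-up once during the image build, with the same PyTorch version and GPU type as production, fills that directory so the first live request does not pay the compile cost.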
Inductor backend choices
Inductor can target Triton (default for GPU), C++/OpenMP (default for CPU), Halide (experimental), or vendor-specific paths. For NVIDIA GPUs the Triton path is mature and produces near-hand-tuned kernels for element-wise and small-reduction patterns. For matmul, Inductor calls cuBLAS by default; under mode="max-autotune" it can also benchmark Triton template GEMMs (and, if enabled, CUTLASS) and pick the fastest. In the default mode, then, torch.compile does not improve raw GEMM performance; the wins come from fusion around matmuls and from reducing kernel count overall.
Kernel fusion
The deepest source of speedup from compiled paths is kernel fusion. Worth understanding what gets fused and what doesn't.
What fuses well
- Element-wise operations: relu, gelu, multiplication, addition, normalization.
- Small reductions: sum, mean, layernorm.
- Linear sequences with no branching.
- Operations on the same tensor shapes.
What doesn't fuse easily
- Operations across very different shapes.
- Matmuls (large GEMMs use vendor-optimized kernels; fusing into them is hard).
- Operations with complex data dependencies.
- Anything requiring synchronization across the GPU.
- Operations crossing collective communication boundaries (all-reduce, all-to-all). Fusion stops at the comm.
- Operations on tensors with non-contiguous strides — Inductor often falls back to copy + op rather than fused op.
Specific high-value fusions in transformers
- Fused QKV projection: combine the three linear projections (query, key, value) into one matmul.
- Fused MLP: GELU activation fused with the subsequent linear projection.
- Fused LayerNorm + projection: norm and matmul in one kernel.
- Fused attention: FlashAttention is essentially the most important fusion — combining matmul, softmax, and another matmul into one IO-aware kernel.
- Fused RoPE + KV write: rotary position embedding application combined with the KV cache append. Saves one kernel launch and one HBM round-trip per layer per token.
- Fused sampling: top-k filtering + softmax + sample, in one kernel. Small per-token win but multiplies across long generations.
Production inference stacks ship with these fusions hardcoded; torch.compile can discover others automatically. For hand-written fusions, see our Triton kernel primer.
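The fused QKV projection is the easiest of these to verify by hand: stacking the three weight matrices and doing one projection gives the same numbers as three separate ones. A toy sketch in plain Python, with arbitrary weights and a hypothetical `matvec` helper:

```python
def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Toy weights: hidden size 2, projection size 2 (values are arbitrary)
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[2.0, 1.0], [1.0, 2.0]]
W_v = [[0.5, -1.0], [3.0, 0.0]]
x = [1.0, 2.0]

# Unfused: three separate projections, three kernel launches
q, k, v = matvec(W_q, x), matvec(W_k, x), matvec(W_v, x)

# Fused: stack the weights row-wise, do one projection, then split the result
W_qkv = W_q + W_k + W_v
qkv = matvec(W_qkv, x)
q2, k2, v2 = qkv[0:2], qkv[2:4], qkv[4:6]

assert (q, k, v) == (q2, k2, v2)
print(qkv)
```

On a GPU the fused version reads the input activation once instead of three times and replaces three small GEMM launches with one larger one.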
Fusion limits: what compile cannot do
Inductor's fusion is local — it sees a window of operations and decides whether to fuse them. It does not restructure the algorithm. FlashAttention is the canonical example: combining matmul, softmax, and matmul into one IO-aware kernel requires algorithmic insight (tiling, online softmax) that no general fusion compiler discovers. Inductor produces FA-quality code only by calling FA as a black-box library, not by generating it. The lesson: compiler-driven fusion is great for element-wise and simple-reduction patterns; algorithmic optimizations like FlashAttention require human-written kernels.
Quantitative fusion wins
For a 70B model's MLP block (gate, up, down projections + activation + residual), eager mode launches ~8 kernels; torch.compile fuses some and produces ~3-4 kernels; hand-written fused MLP (in TRT-LLM or Triton) produces 1-2 kernels. The HBM traffic in eager mode is roughly 5× the model's MLP weight footprint per forward pass; fused is ~1.2×. The throughput improvement from this single block can be 15-25% at small batch decode.
FlashAttention generations explained
FlashAttention is the most important single kernel fusion in deep learning. Three generations, each unlocking a new hardware capability.
FlashAttention 1 (Dao et al., 2022)
arXiv:2205.14135. The key insight: standard attention writes the full N×N attention matrix to HBM, reads it back, softmaxes, writes again, then multiplies by V. For long sequences this is O(N²) HBM traffic.
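To make the O(N²) cost concrete, the short calculation below sizes the materialized score matrix for a single head and a single sequence. The sequence lengths and the 2-byte (FP16/BF16) element size are illustrative assumptions, not figures from any benchmark.

```python
def attn_matrix_bytes(seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to materialize one N x N attention-score matrix."""
    return seq_len * seq_len * bytes_per_elem


for n in (2_048, 8_192, 32_768):
    mib = attn_matrix_bytes(n) / 2**20
    print(f"N={n:>6}: {mib:>6.0f} MiB per head")
```

Standard attention writes and re-reads a matrix of this size several times per layer, so quadrupling N multiplies that traffic by sixteen.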
FA-1 tiles Q, K, V and runs the entire attention computation (QKᵀ, softmax, ×V) within SRAM for each tile, never materializing the attention matrix. The result: linear HBM traffic in N, exact (not approximate) attention, and ~3× speedup at typical sequence lengths.
This is the canonical "kernel fusion as algorithm redesign" success.
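The online-softmax trick that makes the tiling exact can be sketched in plain Python for a single query row. This is a toy, not the FA-1 kernel: it omits the 1/√d scaling, masking, and everything GPU-specific, and the function names and the tile size are assumptions made for illustration.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: materialize every score, softmax, then weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention_row(q, keys, values, tile=2):
    """Online softmax: visit K/V one tile at a time, keeping only a running
    max, a running denominator, and an unnormalized output accumulator."""
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new running max.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention_row(q, keys, values))
print(tiled_attention_row(q, keys, values, tile=2))
```

The two functions agree to floating-point rounding for any tile size, which is the sense in which the tiled computation is exact: the full row of scores never has to exist at once.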
FlashAttention 2 (Dao, 2023)
arXiv:2307.08691. Fixed inefficiencies in FA-1's parallelism:
- Split work across thread blocks along the sequence dimension (FA-1 split along batch / head only).
- Better warp scheduling within each block.
- Reduced non-matmul FLOPs (softmax rescaling).
~2× over FA-1 on Ampere and Hopper. As of 2024 this was the default in all major frameworks.
FlashAttention 3 (Shah et al., 2024)
arXiv:2407.08608. Hopper-specific. Uses three Hopper hardware features that FA-1/2 ignore:
- WGMMA: asynchronous tensor-core matrix multiply that overlaps with other warp work.
- TMA (Tensor Memory Accelerator): async bulk transfers between HBM and shared memory.
- FP8 tensor cores: 8-bit floating-point attention with quality recovery via per-tile scaling.
FA-3 runs at ~75% of H100 peak FLOPS — within striking distance of pure GEMM kernels, which was previously thought infeasible for a softmax-coupled operation. On long-context decoding the win compounds with KV-cache compression; see long-context attention for how this combines with sliding windows and chunked prefill.
When to care which generation
- Training on A100 / older: FA-2 is the ceiling.
- Training on H100 / B200: FA-3.
- Inference: vLLM and SGLang ship FA-2 / FA-3 depending on hardware; you generally don't pick this manually.
- Hand-writing kernels in this space: ThunderKittens is the most accessible starting point; ~100 lines of C++ for an FA-3-class kernel.
FlashAttention vs xFormers vs FlexAttention
Three open-source attention kernel stacks compete in 2026. FlashAttention is the most-cited reference and the production default in vLLM, SGLang, TRT-LLM. xFormers (Meta) wraps similar kernels with a slightly more flexible API and good support for memory-efficient attention masks. FlexAttention (PyTorch 2.5+) lets you express custom attention masks declaratively and generates fast Triton kernels — useful for non-standard patterns (block-sparse, dynamic masks). For 90%+ of production workloads, FlashAttention is the answer. For research with custom mask patterns, FlexAttention is the right starting point.
Hopper-specific optimizations FA3 exploits
WGMMA (Warp Group Matrix Multiply Accumulate) is Hopper's async matmul: a warpgroup (four warps) issues the matmul, other warps continue with non-matmul work, and the result lands later. FA3 uses this to overlap the softmax pass with the next tile's matmul. TMA (Tensor Memory Accelerator) is a dedicated copy unit between HBM and shared memory; FA3 issues async TMA loads so the next tile is in shared memory before the current tile's compute finishes. Both features required FA3 to be restructured substantially relative to FA2; the rewrite was published in 2024 and reports roughly 75% of H100 peak in FP16, with higher absolute throughput (close to 1.2 PFLOPS) in FP8 on long sequences.
What ThunderKittens is and is not
ThunderKittens is a research-grade C++ DSL that compiles to CUDA, designed specifically for Hopper tensor cores. Its claim to fame: an FA3-class attention kernel in ~100 lines of code, vs FlashAttention's ~2000 lines of hand-tuned CUDA. The implication is not that you should rewrite production attention in ThunderKittens — FA3 is more battle-tested — but that the productivity of kernel engineering can be much higher with the right abstractions. Production stacks have not adopted ThunderKittens broadly; it remains a tool for research kernel engineering and one-off custom kernels for novel attention patterns. See our Triton kernel primer for a related approach.
CUDA Graphs vs torch.compile — when to use which
The two are commonly confused. They are not substitutes; they solve different problems and you usually want both. But if you're forced to pick one:
Pick CUDA Graphs first if
- Decode at small batch is your bottleneck.
- Shapes are predictable (you can bucket).
- You care about latency (P50 / P99) more than peak throughput.
- You're already using fused kernels (FlashAttention, fused QKV) and the residual cost is dispatch overhead.
- Compilation time at startup is unacceptable (e.g., serverless cold start).
Pick torch.compile first if
- You have not yet adopted fused attention / QKV kernels (compile can route standard attention patterns to fused SDPA kernels and fuses much of the surrounding element-wise work automatically).
- Element-wise operations dominate (normalization, activations, residual chains).
- Your model has unusual operators that vendor kernels don't cover.
- You can afford 30 s – several minutes of startup compilation.
- Your hardware is non-NVIDIA — Inductor targets multiple backends; CUDA Graphs is CUDA-only.
What each does not fix
- CUDA Graphs does not reduce HBM traffic. If you're bandwidth-bound rather than launch-bound, graphs do nothing.
- torch.compile does not reduce per-launch dispatch overhead — Inductor still emits separate Triton kernels which are launched individually unless wrapped in a graph.
Diagnostic rule
Profile in Nsight Systems. If you see GPU idle gaps between kernels on the timeline, you're launch-bound — graphs first. If you see fully back-to-back small kernels with high HBM bandwidth utilization, you're fusion-bound — compile first.
Practical decision tree
| Symptom | Diagnosis | Action |
|---|---|---|
| GPU at < 30% utilization in decode | Launch-bound | Add CUDA Graphs |
| Many kernels < 50 µs each | Launch-bound | Add CUDA Graphs |
| HBM bandwidth at 80%+ | Bandwidth-bound | Add quantization, compile for fusion |
| HBM bandwidth at 30-60% with launch gaps | Mixed | Both CUDA Graphs and compile |
| Eager works fine, compile makes it slower | Recompilation churn | Bucket shapes or use dynamic=True |
| First request slow, then fast | JIT compilation | Pre-warm or use AOTInductor |
This is the standard triage flow. Spending an hour with nsys before optimizing is almost always cheaper than guessing.
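The table can be encoded as a small rule function to run over summary numbers pulled from a trace. This is a sketch: the ProfileSummary fields and the order in which the rules are checked are assumptions, while the thresholds are the ones from the table.

```python
from dataclasses import dataclass


@dataclass
class ProfileSummary:
    sm_utilization: float       # 0..1, averaged over the decode window
    hbm_utilization: float      # 0..1, fraction of peak bandwidth
    median_kernel_us: float     # median kernel duration in microseconds
    has_launch_gaps: bool       # idle gaps visible between kernels
    slower_than_eager: bool = False
    slow_first_request_only: bool = False


def triage(p: ProfileSummary) -> str:
    """Return the table's diagnosis and action for one profile summary."""
    if p.slower_than_eager:
        return "recompilation churn: bucket shapes or use dynamic=True"
    if p.slow_first_request_only:
        return "JIT compilation: pre-warm or use AOTInductor"
    if p.hbm_utilization >= 0.80:
        return "bandwidth-bound: add quantization, compile for fusion"
    if 0.30 <= p.hbm_utilization <= 0.60 and p.has_launch_gaps:
        return "mixed: both CUDA Graphs and compile"
    if p.sm_utilization < 0.30 or p.median_kernel_us < 50:
        return "launch-bound: add CUDA Graphs"
    return "no clear symptom: profile further"


print(triage(ProfileSummary(0.20, 0.10, 20.0, True)))
print(triage(ProfileSummary(0.70, 0.85, 300.0, False)))
```

Checking the compile-specific symptoms first is a design choice: they describe regressions introduced by the tooling itself, so they are worth ruling out before reading the hardware counters.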
AOTInductor for production
torch.compile is JIT: it traces and compiles the first time the model runs. AOTInductor is the AOT (ahead-of-time) variant: compile once, ship a shared library, then load and run it from C++ with no Python interpreter in the serving process (the runtime links against libtorch rather than a pip-installed PyTorch).
Why AOT matters
- Cold start: JIT compilation of a 70B-class model takes 30 s to several minutes. Unacceptable for serverless / autoscale where instances start frequently.
- Deployment surface: an AOTInductor .so loads in a C++ runtime. No Python interpreter, no pip-installed PyTorch, so it is easier to certify for regulated environments and yields smaller container images.
- Reproducibility: the compiled artifact is bit-stable. JIT can recompile differently on driver / kernel version changes.
Workflow
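import torch
model = MyModel().eval().cuda()
example_inputs = (torch.randn(1, 512, device="cuda"),)
# aot_compile takes an exported graph module, so export the nn.Module first.
# (Private API as of PyTorch 2.3-2.5; newer releases use torch._inductor.aoti_compile_and_package.)
gm = torch.export.export(model, example_inputs).module()
torch._inductor.aot_compile(gm, example_inputs, options={"aot_inductor.output_path": "model.so"})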
At serving time, load model.so from C++ or Python and call it like a regular function. Shape specialization works the same way as JIT compile — bucket and pre-compile for each shape.
Trade-offs
- One .so per shape bucket: deployment ships N artifacts.
- Less flexibility for live experimentation.
- Tracing constraints stricter than JIT — data-dependent control flow that JIT can guard around becomes harder.
For low-latency inference (where the JIT compile cost is a multi-minute startup tax) and for on-device deployment (where Python isn't available), AOTInductor is the standard answer in 2026.
AOTInductor vs ExecuTorch vs ONNX
Three production-deployment formats with overlapping but distinct niches. AOTInductor produces a CUDA-targeting .so; great for server-side deployment, less useful for mobile or embedded. ExecuTorch is PyTorch's mobile/embedded export format; targets CPU and mobile GPUs (Metal, Vulkan), poor fit for data center. ONNX is a vendor-portable graph format consumed by ONNX Runtime, TensorRT, OpenVINO; mature but lossier on PyTorch-specific operators. For data-center GPU serving in 2026, AOTInductor is the modern choice; for cross-vendor or edge, ONNX still dominates.
Cross-vendor parity
ROCm has hipGraph (CUDA Graphs equivalent), torch.compile with the Inductor backend targeting ROCm, and the same FlashAttention algorithms ported to AMD via Composable Kernel. The 2026 gap is small for common LLM workloads (within 10-20% of NVIDIA on equivalent silicon), but the long tail of less-common kernels still favors NVIDIA. For Intel hardware (Gaudi3 accelerators in the data center, Arc GPUs on the desktop), tooling exists but performance and feature completeness lag further. The portability story has improved enormously since 2023 but is still imperfect.
Engine builds vs JIT in practice
TRT-LLM's engine build is conceptually similar to AOTInductor: compile once offline, ship a binary artifact, load it at startup. The differences: TRT-LLM builds are NVIDIA-specific and bound to a specific GPU architecture (an engine built for H100 will not run on A100), while AOTInductor produces more portable artifacts. TRT-LLM engine builds also take longer (5-30 minutes for a 70B model) than AOTInductor compilation (1-5 minutes), but produce faster runtime kernels. The choice depends on whether you can absorb the longer build time and the NVIDIA lock-in.
Profiling tools (Nsight, PyTorch Profiler)
You cannot optimize what you can't measure. The 2026 toolchain:
Nsight Systems (nsys)
NVIDIA's system-wide profiler. The single most important tool for diagnosing launch-bound vs bandwidth-bound vs compute-bound workloads.
nsys profile -o trace --trace=cuda,nvtx python infer.py
nsys-ui trace.nsys-rep
What to look at:
- GPU timeline gaps: visible white space between kernels = launch overhead. Fix with CUDA Graphs.
- SM utilization: low (<50%) at small batch usually means launch-bound.
- Stream concurrency: parallel streams should be visible if you're using them.
Nsight Compute (ncu)
Per-kernel deep dive. Tells you whether a single kernel is compute-bound, memory-bound, or latency-bound. Use after you've narrowed the problem to a specific kernel.
Key metrics: roofline analysis (FLOPS achieved vs HBM bandwidth used), warp occupancy, memory access patterns.
PyTorch Profiler
In-process profiler, usable from Python.
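import torch
from torch.profiler import profile, ProfilerActivity

# Record CPU-side operators and CUDA kernels for one forward pass.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True, with_stack=True,
) as prof:
    model(inputs)
prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto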
Best for: integration with eager / compiled PyTorch, attributing time to PyTorch operator names rather than raw CUDA kernels. Use alongside Nsight, not as a replacement.
torch.compile debugging
When compilation goes wrong:
- TORCH_LOGS="+dynamo,+inductor" for verbose compile-time logs.
- torch._dynamo.explain(model)(*inputs) to show graph breaks (the main reason compile is slow).
- Recompilation log (TORCH_LOGS="recompiles") catches accidental shape-driven recompilation churn.
Differential profiling
A useful technique: profile the same workload before and after a change (a new compile mode, a different bucket layout, an added kernel) and diff the kernel-time breakdowns. Nsight Compute's baseline comparison or simple per-kernel time deltas reveal whether the change actually helped, hurt, or shifted time around. Always confirm optimization wins with measurement; intuition is unreliable in this domain.
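A per-kernel delta needs nothing more than two dictionaries of kernel name to total time. The sketch below assumes those totals have already been exported from a profiler; the kernel names and timings are hypothetical.

```python
def diff_kernel_times(before: dict, after: dict) -> list:
    """Return (kernel, before_us, after_us, delta_us), largest absolute change first."""
    rows = []
    for name in sorted(set(before) | set(after)):
        b = before.get(name, 0.0)
        a = after.get(name, 0.0)
        rows.append((name, b, a, a - b))
    rows.sort(key=lambda r: abs(r[3]), reverse=True)
    return rows


# Hypothetical totals in microseconds for one decode step.
before = {"rmsnorm": 120.0, "silu_mul": 90.0, "gemm_qkv": 400.0}
after = {"fused_rmsnorm_silu": 130.0, "gemm_qkv": 405.0}

rows = diff_kernel_times(before, after)
for name, b, a, delta in rows:
    print(f"{name:<22} {b:>8.1f} {a:>8.1f} {delta:>+8.1f}")
print("net change (us):", sum(r[3] for r in rows))
```

Kernels that appear on only one side show up with a zero on the other, which is how a fusion that replaces two kernels with one becomes visible as two negative rows and one positive row.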
Production telemetry
Inference servers should emit per-request:
- Step time (decode latency).
- Kernel-launch count (proxied via cudaLaunchKernel counters).
- HBM bandwidth utilization.
Track these continuously. A regression in any of them is usually traceable to a missing graph capture, a kernel that fell out of fusion, or a shape that started recompiling. For broader eval infrastructure, the same principle applies one level up.
Reading an nsys trace: what matters
When you open an nsys trace for the first time, the relevant signals are: (1) the gap between kernels — large gaps are launch-bound symptoms; (2) the per-kernel duration — kernels under 100 µs are the dispatch-bound region; (3) the HBM bandwidth gauge — saturated means bandwidth-bound, low means launch or compute bound; (4) the CPU-side cudaLaunchKernel events — high CPU activity here is the smoking gun for launch-bound. A trained eye spots launch-bound workloads within 30 seconds of opening the trace.
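The first two signals can be computed from exported kernel intervals. The sketch below assumes a list of (start, end) timestamps in microseconds for a single stream; the helper name and the sample trace are hypothetical.

```python
from statistics import median


def summarize_timeline(kernels):
    """kernels: (start_us, end_us) pairs for one stream, in any order.
    Returns (idle_fraction, median_kernel_us) between first start and last end."""
    spans = sorted(kernels)
    first_start = spans[0][0]
    busy = 0.0
    cursor = first_start
    for start, end in spans:
        start = max(start, cursor)  # do not double-count overlapping kernels
        if end > start:
            busy += end - start
            cursor = end
    wall = cursor - first_start
    return 1.0 - busy / wall, median(e - s for s, e in spans)


# Hypothetical decode step: four 10 us kernels separated by 20 us of idle time.
trace = [(0, 10), (30, 40), (60, 70), (90, 100)]
idle, med = summarize_timeline(trace)
print(f"idle fraction: {idle:.0%}, median kernel: {med} us")
```

A high idle fraction combined with a median kernel far below 100 µs is the launch-bound signature described above.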
Cost of profiling itself
Nsight Systems adds 5-15% overhead and can serialize CUDA streams in some configurations; do not profile production traffic at scale. PyTorch Profiler is heavier (20-50% overhead, mostly on the CPU side from event records). Both should be run on synthetic workloads or low-traffic windows. For continuous production telemetry, integrate cheaper counters (CUDA event timing per request) into your serving stack.
The shape-specialization problem
The shared difficulty: CUDA graphs and torch.compile both work best on fixed shapes, but inference sees variable shapes.
Where shapes vary
- Batch size: 1, 2, 4, ..., 64. Different graphs for different batches.
- Sequence length: each request has a different prompt length and different generation length.
- KV cache size: grows during generation.
Bucket and pad
Pre-compile / pre-capture for a set of shapes:
- Batch sizes: powers of 2.
- KV lengths: rounded to nearest 256 or 512 tokens.
- Prefill chunks: chunked prefill in fixed-size pieces.
At runtime, route requests to the nearest pre-compiled shape, padding the input if necessary.
Costs: a small amount of wasted compute on padding (typically <5%). Gains: dispatch overhead eliminated.
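A minimal routing helper for this scheme might look like the following. It assumes power-of-two batch buckets and KV lengths rounded up to a multiple of 256; the function names are hypothetical and the waste figure counts padded batch × length slots only.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def route(batch: int, kv_len: int, kv_multiple: int = 256):
    """Map a request shape to its pre-captured bucket and report padding waste."""
    b_bucket = next_pow2(batch)
    kv_bucket = round_up(kv_len, kv_multiple)
    waste = 1.0 - (batch * kv_len) / (b_bucket * kv_bucket)
    return b_bucket, kv_bucket, waste


for batch, kv_len in [(8, 1000), (5, 1000)]:
    b, kv, waste = route(batch, kv_len)
    print(f"batch={batch} kv={kv_len} -> bucket=({b}, {kv}), padded slots wasted: {waste:.1%}")
```

Padding waste depends on how closely traffic lines up with the buckets: a full batch of 8 wastes very little, while a batch of 5 padded to 8 wastes far more, which is one reason schedulers try to fill batches before dispatch.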
Dynamic compilation
Some stacks compile on-demand for new shapes, caching results. First request at a new shape pays the cost; subsequent requests don't.
Recompilation pitfalls
A common production trap: a model field gets accessed in a way Dynamo cannot trace cleanly, causing a graph break, or a guard on that field fails and subsequent calls recompile. The result is high CPU usage and unpredictable latency. Detection: TORCH_LOGS="recompiles" shows the recompiles, and TORCH_LOGS="graph_breaks" or torch._dynamo.explain shows the breaks. Fixes: move data-dependent control flow into Python helper functions called outside the compiled region, use @torch.compiler.disable on operations that should not be compiled, or restructure the model to avoid the break.
CUDA Graphs for paged attention
KV cache organized in fixed-size pages (paged attention). A request with N tokens uses ceil(N / page_size) pages. Graphs are captured per page-count bucket, not per token-count, drastically reducing the number of shapes needed.
This is one of the architectural insights behind vLLM's performance: paging + per-page-count graphs = small number of shapes to capture, near-zero dispatch overhead. See our KV cache memory math for the paging mechanics and the vLLM PagedAttention deep-dive for how this fits the broader stack.
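The page-count mapping is a ceiling division followed by a bucket lookup. The sketch below assumes a page size of 16 tokens and an illustrative set of captured page-count buckets; neither is taken from vLLM's actual configuration.

```python
import bisect
import math

PAGE_SIZE = 16                               # assumed tokens per KV page
PAGE_BUCKETS = [16, 32, 64, 128, 256, 512]   # assumed captured page counts


def pages_needed(num_tokens: int, page_size: int = PAGE_SIZE) -> int:
    return math.ceil(num_tokens / page_size)


def page_bucket(num_tokens: int) -> int:
    """Smallest captured page-count bucket that fits the request."""
    pages = pages_needed(num_tokens)
    i = bisect.bisect_left(PAGE_BUCKETS, pages)
    if i == len(PAGE_BUCKETS):
        raise ValueError(f"{num_tokens} tokens exceed the largest captured bucket")
    return PAGE_BUCKETS[i]


for tokens in (1000, 1024, 1025):
    print(tokens, "tokens ->", pages_needed(tokens), "pages -> bucket", page_bucket(tokens))
```

Every token count from 513 to 1024 lands in the same 64-page bucket here, so one captured graph serves all of them.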
Concrete bucketing recipes
Production bucketing schedules look like this. For a 70B chat workload with up to 32k context:
| Dimension | Buckets |
|---|---|
| Batch size | 1, 2, 4, 8, 16, 32, 64 |
| KV cache pages | 16, 32, 64, 128, 256, 512 (page_size 16 = up to 8192 tokens) |
| Prefill chunk size | 512, 1024, 2048, 4096 |
| Sequence length (decode) | n/a (covered by page count) |
Total graph count: 7 batch × 6 KV bucket × 1 (decode only) = 42 graphs for decode. Chunked prefill adds 4 chunk sizes × 6 batch sizes = 24 graphs (this assumes one batch bucket is not captured for prefill; capturing all 7 gives 28). That is ~66 graphs in total at ~1 second total capture cost. The set covers production traffic with high padding efficiency.
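The decode count is just the cross product of the two bucketed dimensions in the table, which a few lines of Python can enumerate. The variable names are illustrative.

```python
from itertools import product

batch_sizes = [1, 2, 4, 8, 16, 32, 64]
kv_page_buckets = [16, 32, 64, 128, 256, 512]
page_size = 16

# One captured decode graph per (batch size, page-count bucket) pair.
decode_shapes = list(product(batch_sizes, kv_page_buckets))
longest_kv = max(kv_page_buckets) * page_size

print("decode graphs to capture:", len(decode_shapes))
print("longest KV length covered by the largest bucket:", longest_kv)
```

Adding a bucket to either dimension multiplies the capture set, so it is worth checking real traffic before extending the schedule.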
When to use dynamic=True
If your workload has unpredictable shapes (custom user inputs, variable sequence lengths that don't fit buckets), Inductor can emit dynamic-shape code with torch.compile(..., dynamic=True). The generated kernels handle any shape within a guarded range, at the cost of 5-15% slower runtime vs shape-specialized code. The decision: predictable shapes → bucket + specialize. Unpredictable shapes → dynamic.
Combining graphs and compile
The two techniques are complementary and stack:
- torch.compile produces fewer kernels (via fusion) with better memory access.
- CUDA graphs dispatch those kernels with near-zero overhead.
Pipeline:
- Compile the model: produces a graph of fused, optimized kernels.
- Capture CUDA graphs for each target shape: records the sequence of compiled-kernel launches.
- At runtime: route requests to the nearest shape, replay the graph.
Empirically, the combination yields more than either alone:
| Setup | Decode throughput (relative) |
|---|---|
| Eager PyTorch | 1.0× |
| torch.compile only | 1.3-1.5× |
| CUDA graphs only | 1.5-2.0× |
| Both | 2.0-3.0× |
Numbers vary by model, GPU, and batch size. Direction is consistent.
What changes when you add FP8 quantization on top
The decode throughput math compounds with quantization. A naive sequence: eager FP16 → 1.0×. Compile + Graphs in FP16 → 2.5×. Add FP8 weights → 4-5×. Add FP8 KV → 4.5-6×. Add speculative decoding (EAGLE-2) → 6-10×. The compounding holds because each technique addresses a different bottleneck: dispatch, fusion, bandwidth, and verification respectively. A production stack with all four delivers an order-of-magnitude over naive PyTorch generate(). See quantization tradeoffs and speculative decoding for the other pieces.
Trade-offs and limitations
CUDA graphs
- Pros: large dispatch-overhead reduction, simple mental model, well-tested.
- Cons: shape-fixed, doesn't help with HBM traffic, requires careful memory management.
torch.compile
- Pros: kernel fusion (HBM savings), automated, often matches hand-written code for element-wise and simple-reduction patterns.
- Cons: long compilation times, recompilation on shape mismatch, sometimes generates suboptimal code, debugging compiled paths is harder.
Combined
- Pros: largest practical decode throughput improvement available.
- Cons: setup complexity, more shapes to pre-compile, more failure modes.
Cross-architecture portability
Code optimized for Hopper (FA3, TMA, WGMMA) may run on Blackwell with minor changes but typically needs re-tuning for peak performance. AOTInductor .so artifacts are GPU-architecture-specific; TRT-LLM engine files even more so. Production deployments serving across mixed GPU generations need separate artifacts per architecture, plus the dispatch logic to route to the right one. This is a real operational cost for organizations with heterogeneous hardware.
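The dispatch logic can be as small as a lookup keyed on CUDA compute capability. In the sketch below the artifact paths are hypothetical; in a real server the capability tuple would come from torch.cuda.get_device_capability().

```python
# Hypothetical artifact layout: one compiled model per GPU architecture.
ARTIFACTS = {
    (8, 0): "artifacts/sm80/model.so",    # A100
    (9, 0): "artifacts/sm90/model.so",    # H100 / H200
    (10, 0): "artifacts/sm100/model.so",  # B200
}


def select_artifact(capability: tuple) -> str:
    """Return the artifact built for this (major, minor) compute capability."""
    try:
        return ARTIFACTS[capability]
    except KeyError:
        raise RuntimeError(f"no artifact built for compute capability {capability}") from None


print(select_artifact((9, 0)))
```

Failing loudly on an unknown architecture is deliberate: silently loading an artifact built for a different GPU generation is the failure mode this routing exists to prevent.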
Future: compiler-driven attention
Today FlashAttention is a hand-written kernel. The trend toward compiler-driven attention (FlexAttention in PyTorch 2.5+, declarative attention APIs in JAX Pallas) would let users express custom attention patterns and have the compiler generate FA-class kernels. Early results are promising but not production-mature in 2026. By 2027-2028, we expect compiler-generated attention to close most of the gap with hand-tuned kernels, freeing engineering effort for the next bottleneck.
CUTLASS as the matmul layer
Underneath both torch.compile-generated kernels and TRT-LLM's compiled engines, matmuls usually dispatch to cuBLAS or to CUTLASS, NVIDIA's open-source C++ template library for GEMMs. CUTLASS provides the per-tile, per-shape, per-precision implementations that achieve near-peak FLOPs on H100/H200/B200. The 2026 version (CUTLASS 3.x) has dedicated paths for FP8, FP4, NVFP4, mixed-precision (FP8 × INT4 weight-only), and grouped GEMM (for MoE). Production inference stacks dispatch to CUTLASS for nearly all matmul work; the rest of the kernel ecosystem (Triton, hand-written) handles the non-matmul fusions around it.
For production serving, the trade-offs almost always favor adoption.
Common pitfalls
Four failure modes that show up repeatedly in production:
- Silent recompilation. A subtle Python-level change (a tensor type annotation, a method order) invalidates Dynamo's guards and triggers a recompile on each call. Throughput plummets without obvious cause. Detection: TORCH_LOGS=recompiles. Fix: stabilize the call site.
- Stale graph after weight update. A CUDA Graph replays against the memory addresses it captured (graphs capture pointers, not values). In-place updates to the same storage are picked up, but anything that rebinds parameters to new tensors (for example model.load_state_dict(..., assign=True), or swapping in a re-quantized module) leaves the graph reading the old buffers. Detection: outputs do not change after the update. Fix: update weights in place, or re-capture after any change that reallocates them.
- Cross-stream synchronization missing. Captured graphs assume a specific stream ordering. If your code uses extra streams (for async data movement), make sure they are properly synchronized with the captured graph's stream. Detection: occasional incorrect output. Fix: explicit torch.cuda.synchronize around capture boundaries.
- Inductor not enabled at all. A common one: someone added torch.compile(model) but the compiled wrapper it returns is never called (because serving goes through a different code path or still calls the original model), so all the throughput numbers reflect eager mode. Detection: kernel count in profiling shows the original count, not the reduced one. Fix: verify that compilation actually ran by running with TORCH_LOGS=output_code and confirming generated kernels are logged.
Profiling for launch overhead
If you suspect launch overhead is your problem:
Run with Nsight Systems
NVIDIA's profiler shows kernel-by-kernel timing and CPU/GPU overlap. Launch overhead appears as gaps between kernels on the GPU timeline.
A practical workflow: start nsys profile -o trace --capture-range=cudaProfilerApi python infer.py, wrap a single forward pass in torch.cuda.profiler.start() / torch.cuda.profiler.stop(), then open the resulting .nsys-rep in nsys-ui. The timeline view shows the kernel sequence; the histogram view shows total time per kernel category. Both are essential for diagnosing where time goes.
Indicators of launch-bound workload
- GPU SM utilization low (< 50%) at small batch sizes.
- Many short kernels (< 100 µs each).
- CPU showing high activity preparing kernel launches.
- HBM bandwidth not saturated even though the workload is decode.
What to do
- If launch-bound: enable CUDA graphs and torch.compile.
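- If HBM-bandwidth-bound: quantize weights (and the KV cache) so each step moves fewer bytes; reduce batch size only when per-request latency, not throughput, is the target.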
- If compute-bound: scale up the GPU or pipeline.
Production usage in 2026
vLLM. Uses CUDA graphs extensively for decode. Paged attention + CUDA graphs = small number of shapes. Inductor compilation is integrated into the vLLM V1 engine and controlled through its compilation config (-O / --compilation-config).
SGLang. CUDA graphs are first-class. RadixAttention works alongside. The prefix-cache hits are graph-friendly because they reuse the same captured shapes; SGLang's design pre-dates broad torch.compile integration but the two compose cleanly.
TensorRT-LLM. Compiles the model with NVIDIA's TensorRT compiler — fundamentally similar to torch.compile but more aggressive and NVIDIA-specific. Plus CUDA graphs. TRT-LLM's distinctive feature is the engine-build phase which selects per-shape kernels from CUTLASS at build time; this is what produces the latency advantage over torch.compile's runtime selection.
llama.cpp. Hand-tuned kernels per backend. Less reliant on automated compilation; the kernels are themselves highly optimized. It serves the CPU / consumer-GPU niche, where launch overhead is less of an issue because each kernel already runs close to its hardware limit.
Hugging Face TGI. Mix of compiled paths and CUDA graphs. The default fallback when teams need a hosted-API-compatible serving layer with broad model coverage; less aggressive on performance than vLLM or TRT-LLM but easier to set up for non-frontier models.
Hosted providers. All use some form of compilation + graph capture. Specifics not public. Latency patterns consistent with aggressive engine compilation (TRT-LLM-class) plus CUDA Graphs plus FP8 throughout.
Stack feature matrix
| Feature | vLLM 0.8 | SGLang 0.4 | TRT-LLM 0.18 | llama.cpp |
|---|---|---|---|---|
| CUDA Graphs | Mature | Mature | Mature | n/a |
| torch.compile integration | Beta | Yes | n/a | n/a |
| AOTInductor | Beta | No | n/a (TRT engine) | n/a |
| FlashAttention-3 | Yes | Yes | Yes | n/a |
| Custom Triton kernels | Yes | Yes | Limited (CUTLASS instead) | n/a |
| Per-shape bucketing | Auto | Auto | Build-time | n/a |
| Multi-stream | Limited | Yes | Yes | Single-stream |
The pragmatic choice: vLLM for general production, SGLang for prefix-heavy workloads, TRT-LLM for NVIDIA-only frontier latency targets. llama.cpp lives in a different niche (consumer/CPU) where these techniques mostly do not apply.
Real-world startup cost examples
- vLLM with --enforce-eager: instant startup, slowest decode.
- vLLM with default CUDA Graphs: ~10-20 s startup, fast decode.
- vLLM with torch.compile enabled: ~60-180 s startup, slightly faster decode.
- TRT-LLM engine build: 5-30 minutes for a 70B model, then 1-2 seconds to load the prebuilt engine per instance.
The startup cost matters disproportionately for autoscaling deployments where instances start and stop frequently. For stable long-running deployments, the once-off cost is irrelevant.
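One way to reason about the trade-off is to compare useful work over an instance's lifetime. The sketch below is a back-of-envelope model: the startup times are picked from the ranges above, and the relative decode speeds are assumed midpoints chosen for illustration, not measurements.

```python
# (startup seconds, relative decode speed). Both values are assumptions.
CONFIGS = {
    "eager": (0.0, 1.0),
    "cuda_graphs": (15.0, 1.75),
    "graphs_plus_compile": (120.0, 2.5),
}


def useful_work(lifetime_s: float, startup_s: float, relative_speed: float) -> float:
    """Eager-equivalent seconds of decode delivered before the instance stops."""
    return max(0.0, lifetime_s - startup_s) * relative_speed


def best_config(lifetime_s: float) -> str:
    return max(CONFIGS, key=lambda name: useful_work(lifetime_s, *CONFIGS[name]))


for lifetime in (10, 300, 3600):
    print(f"{lifetime:>5} s lifetime -> {best_config(lifetime)}")
```

Under these assumptions a five-minute autoscaled instance does more total work with graphs alone, while an hour-long instance is better off paying the compile cost; the crossover moves with the real numbers for your model.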
When it doesn't help
Cases where this is unnecessary:
- Very large batch sizes: launch overhead is amortized, work-per-kernel dominates. Win is small.
- Pure prefill workloads: chunked prefill at long sequences has high work-per-kernel. Compile may still help via fusion, but Graphs have almost no effect.
- Embedding generation / retrieval inference: single-shot forward passes on short inputs. Launch overhead exists but the workload is rarely launch-bound enough to matter.
- One-shot scripts: compilation cost may exceed the runtime saved.
- Extremely small models: total step time so small that even bad overhead is acceptable.
- Dynamic shapes that don't fit any bucket: recompilation churn negates the savings.
- Heavily Python-bound applications: if Python interpretation is the bottleneck, GPU kernel overhead is not the limit.
For everything else — production serving with realistic batch sizes — eliminating launch overhead through some combination of CUDA graphs and compile is one of the largest single throughput wins available.
When kernel fusion hurts more than helps
A few cases where aggressive fusion is the wrong call:
Debugging. Fused kernels are opaque to per-operation profiling and harder to debug numerically. During development, eager mode is easier to reason about; only enable fusion for production.
Frequently changing models. Each model change triggers recompilation. For research workloads where the model changes daily, the JIT compile cost may exceed the runtime saved. Keep compile off until the model stabilizes.
Heterogeneous shape distributions. If your traffic has thousands of distinct shapes that cannot be bucketed cleanly, dynamic-shape Inductor code or eager mode may beat aggressive shape-specialized compilation.
Mixed-precision experiments. When prototyping precision choices, each combination triggers recompilation. Disable compile until the precision is settled.
Multi-tenant serving with graphs
When serving multiple distinct models from one process (e.g., a routing tier that serves several smaller models), CUDA Graphs must be captured per model. Memory pressure can be significant: each captured graph holds onto its working buffers in HBM. For deployments with 5+ resident models per GPU, graph capture for all of them may exceed available HBM. Mitigations: capture lazily on first request per model, share buffer pools across captures, or run a smaller subset of bucketed shapes per model.
torch.compile internals: Dynamo, AOTAutograd, Inductor
torch.compile is not one component; it is a three-stage pipeline whose stages can each fail differently. Understanding the boundary between them is the difference between a five-minute debug session and a five-day one.
Stage 1: TorchDynamo (the bytecode tracer)
Dynamo runs at the level of CPython bytecode. When a function decorated by torch.compile is first called, Dynamo intercepts the frame at the CPython evaluator level, walks the bytecode, and builds an FX graph corresponding to the Python operations it can prove are side-effect-free under the input types and shapes observed. Anything Dynamo cannot prove — a print statement, a numpy interop call, tensor.item() reading a scalar back to Python, a Python-level dict mutation — terminates that graph (a graph break), the partial graph is compiled, eager Python runs the unsupported piece, and a new graph begins after it.
Dynamo emits guards alongside each graph. Guards encode the conditions under which the trace is valid: tensor dtype, rank, per-dim shape, stride contiguity, integer scalar value, NN module class identity, even Python global variable identities for closures. On every subsequent call to the compiled function, Dynamo checks guards first. If they pass, the cached compiled graph runs. If any guard fails, Dynamo either re-traces (a recompile, expensive) or falls back to eager.
The single most common production failure mode is guard violation churn: a model that looks fine in eager mode but recompiles on every request because some incidental Python value differs each time. Symptoms: CPU pegged, throughput collapsing, occasional TORCH_LOGS=recompiles lines mentioning "tensor stride changed" or "size mismatch on dim 1". The diagnosis tool is torch._dynamo.explain(model)(*inputs), which lists graph breaks with line numbers and recommended fixes.
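The guard mechanism can be pictured with a toy cache. This is a deliberately simplified model written for illustration, not Dynamo's implementation: it guards only on dtype and shape, and all class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Guard:
    dtype: str
    shape: tuple  # None in a slot means "any size" (a dynamic dim)

    def matches(self, dtype: str, shape: tuple) -> bool:
        return (
            dtype == self.dtype
            and len(shape) == len(self.shape)
            and all(g is None or g == s for g, s in zip(self.shape, shape))
        )


@dataclass
class ToyCompileCache:
    dynamic_dims: frozenset = frozenset()
    entries: list = field(default_factory=list)
    compiles: int = 0

    def call(self, dtype: str, shape: tuple) -> str:
        for guard in self.entries:
            if guard.matches(dtype, shape):
                return "cache hit"
        # Guard miss: "trace" again and install a guard for this input.
        guarded = tuple(None if i in self.dynamic_dims else s for i, s in enumerate(shape))
        self.entries.append(Guard(dtype, guarded))
        self.compiles += 1
        return "recompile"


static = ToyCompileCache()
dynamic = ToyCompileCache(dynamic_dims=frozenset({0}))
for batch in (1, 2, 3, 4):
    static.call("float16", (batch, 4096))
    dynamic.call("float16", (batch, 4096))
print("compiles with static guards:", static.compiles)
print("compiles with dim 0 dynamic:", dynamic.compiles)
```

The churn failure mode is the first case at production scale: every request carries some value the guards were specialized on, so every request misses.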
Stage 2: AOTAutograd (the functionalization layer)
After Dynamo hands off an FX graph, AOTAutograd performs three transformations that are invisible to users but crucial for backend compilers. First, it functionalizes the graph: in-place ops (add_, relu_, view mutations) are rewritten as out-of-place ops with explicit copies, because most backends cannot reason about in-place mutation. Second, it traces the backward pass at the same time as the forward pass, producing a joint graph whose forward and backward are co-optimized; this is what enables Inductor to schedule activation reuse across the forward-backward boundary. Third, it decomposes high-level ops such as nn.LayerNorm into their constituent primitives so Inductor sees a normalized, low-level op set. Fused-attention calls such as F.scaled_dot_product_attention are the notable exception: they are generally kept as calls into the fused library kernel rather than broken apart.
The "fast eager, slow compile" anti-pattern often originates here. If your model uses a custom op that AOTAutograd cannot functionalize (a torch.utils.cpp_extension op with no functionalization rule), the entire path drops back to eager despite Dynamo successfully tracing. Fix: define the op through torch.library.custom_op (declaring mutates_args) and register a fake implementation with torch.library.register_fake so the compile stack can reason about its mutations and output shapes.
Stage 3: Inductor (the codegen backend)
Inductor receives the functionalized, decomposed FX graph and lowers it to Triton (for CUDA), C++/OpenMP (for CPU), or a vendor path. It does three jobs: scheduling (deciding which ops fuse together based on a cost model of HBM traffic and register pressure), codegen (emitting Triton source), and autotuning (searching over Triton meta-parameters like BLOCK_SIZE, num_warps, num_stages).
Inductor's autotuner is where the 30-180 seconds of compile time for a 70B model goes. For each generated Triton kernel, Inductor tries 4-12 configurations, benchmarks each, and picks the fastest. The on-disk cache (by default /tmp/torchinductor_<username>/, overridable with TORCHINDUCTOR_CACHE_DIR) keys results on kernel hash plus device capability, so subsequent calls skip the search.
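The benchmark-and-cache loop has a simple shape, sketched below. This is a toy model of the idea rather than Inductor's autotuner: the cost function stands in for a real GPU benchmark, and the config tuples, kernel key, and function names are hypothetical.

```python
def autotune(kernel_key, configs, benchmark, cache):
    """Pick the fastest config for a kernel, reusing a cached result when present."""
    if kernel_key in cache:
        return cache[kernel_key]
    best = min(configs, key=benchmark)
    cache[kernel_key] = best
    return best


benchmark_calls = []


def fake_benchmark(config):
    """Stand-in for timing a kernel on the GPU; lower is better."""
    block_size, num_warps = config
    benchmark_calls.append(config)
    return abs(block_size - 256) / 256 + abs(num_warps - 4) * 0.1


configs = [(64, 2), (128, 4), (256, 4), (256, 8), (512, 8)]  # (BLOCK_SIZE, num_warps)
cache = {}

first = autotune("fused_add_rmsnorm", configs, fake_benchmark, cache)
second = autotune("fused_add_rmsnorm", configs, fake_benchmark, cache)
print(first, second, "benchmarks run:", len(benchmark_calls))
```

The second call returns without benchmarking anything, which is the effect a warm cache has on restart time.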
Dynamic shapes: the 2026 state
The chronic complaint of 2023-2024 — that torch.compile recompiles for every batch size — has largely been fixed. Modern Dynamo (PyTorch 2.4+) supports symbolic shape tracking: shapes become symbolic s0, s1 variables, the FX graph is parameterized over them, and Inductor emits code that handles a range of shapes. The trade-off is some runtime overhead (shape arithmetic at kernel-launch time, typically 1-3 µs per kernel) and reduced specialization (Inductor cannot hardcode constants that depend on the shape). For production LLM serving, the win — one compile vs dozens — is overwhelming.
torch._dynamo.mark_dynamic(tensor, dim_index) tells Dynamo a specific dimension should be treated as symbolic from the first trace, avoiding the "specialize first, recompile dynamic" two-pass compile that wastes startup time. Use it for batch and sequence dimensions in LLM workloads; leave it off for head dimension and feature size, which never vary.
Recompile triggers worth knowing
The full list of things that invalidate a cached graph: tensor dtype change, tensor rank change, tensor size change on a non-dynamic dim, stride change (sometimes — depends on guard granularity), Python integer argument value change (unless wrapped as torch.SymInt), nn.Module class identity change, module's training attribute flipping, autograd state change (requires_grad flipping). For production, the practical rule is: keep your input pipeline boringly consistent, and use torch.compiler.allow_in_graph for ops you know are safe.
Per-backend tour: Inductor, IPEX, ONNXRT, TensorRT, vLLM/SGLang/TGI compile modes
torch.compile is backend-pluggable via the backend= argument. The choice matters more than most teams realize.
Inductor (default)
Targets Triton on CUDA and ROCm, C++/OpenMP on CPU. The default for nearly all production workloads. Best at element-wise fusion, normalization fusion, small reductions. Calls cuBLAS / CUTLASS for matmul; does not generate matmul kernels itself. Mature for LLMs in PyTorch 2.4+, with the FlexAttention extension (PyTorch 2.5+) handling custom attention masks via Triton codegen.
Inductor CPU
Same Inductor frontend, C++/OpenMP backend, with auto-vectorization for AVX-512 and AVX2. For CPU inference of small models (under 7B) this can match or beat naive ONNX Runtime. For LLM-class serving on CPU, llama.cpp is still faster due to hand-tuned kernels per ISA.
IPEX (Intel Extension for PyTorch)
Intel's path for both CPU (with VNNI / AMX optimizations) and Intel data-center GPUs (Gaudi3, Arc). Plugs in as a torch.compile backend via backend="ipex". The CPU path is the strongest case: significant wins on Sapphire Rapids and Granite Rapids where AMX matrix instructions can accelerate INT8 and BF16 GEMMs. On Gaudi3 the SynapseAI compiler typically does more aggressive work than IPEX-as-backend, and most production Gaudi3 deployments use the SynapseAI path directly.
ONNXRT backend
Wraps ONNX Runtime as a torch.compile backend. Exports the FX graph to ONNX, hands it to ORT, which can dispatch to TensorRT, CUDA EP, ROCm EP, DirectML, or others. Most useful when you want ORT's ecosystem (broad model coverage, vendor-portable) but with PyTorch as the development interface. Slower than native Inductor for the common LLM case; relevant for cross-vendor deployment or for using ORT's quantization toolchain.
TensorRT backend (torch-tensorrt)
A first-class TRT backend that compiles FX subgraphs to TRT engines and stitches the result back into a PyTorch-callable module. Faster than torch.compile + Inductor for NVIDIA-only deployments — typically 20-40% on Hopper, larger on Blackwell — at the cost of much longer build times (5-30 minutes for a 70B model) and bound-to-architecture engine files. Not the same as TensorRT-LLM, which is a full serving framework with its own runtime; torch-tensorrt is the lighter-weight option for embedding TRT inside a PyTorch program.
vLLM compile mode
vLLM 0.8 introduced a --enable-torch-compile flag that runs the model under torch.compile(mode="reduce-overhead") for both prefill and decode. The integration handles bucketing for batch and KV-page count automatically. Adoption is gated by stability; vLLM still defaults to its CUDA Graphs path without compile because the compile path has occasionally regressed with new PyTorch nightlies.
SGLang compile mode
SGLang has a more aggressive compile integration: it compiles per-layer with FX graph capture, applies fusion (RoPE + KV-write, residual + LayerNorm) at its own scheduler level, and falls back to CUDA Graphs for the dispatch layer. On Llama-3-70B decode at batch 8 we have measured a 15-20% throughput advantage for SGLang compile mode over vLLM compile mode at equivalent settings.
TGI compile mode
Hugging Face TGI ships with torch.compile integration in mode="reduce-overhead", primarily for the decoder path. The mode is conservative — TGI prioritizes broad model coverage over peak throughput — but the wins for typical chat workloads are 1.4-1.7× over eager. For frontier-latency targets, TGI is rarely the right choice; for "I need to serve a Hugging Face model in a hosted-API-compatible way," it remains the path of least resistance.
Backend selection matrix
| Backend | Best fit | Compile cost | Runtime perf vs Inductor | NVIDIA-only? |
|---|---|---|---|---|
| Inductor | General | 30-180 s | 1.0× (baseline) | No (also ROCm, CPU) |
| Inductor CPU | Small models, CPU serving | 30-90 s | n/a (different target) | No |
| IPEX | Intel CPU/GPU | 60-180 s | 1.0-1.2× on Intel hw | No |
| ONNXRT | Cross-vendor, ORT toolchain | 60-180 s | 0.8-1.0× | No |
| torch-tensorrt | NVIDIA-only PyTorch programs | 5-30 min | 1.2-1.4× | Yes |
| vLLM compile | Production LLM serving | 60-180 s | 1.1-1.2× over vLLM eager | No (also ROCm) |
| SGLang compile | Prefix-heavy LLM serving | 60-180 s | 1.15-1.25× over SGLang eager | No |
| TGI compile | Broad-coverage hosted serving | 60-120 s | 1.4-1.7× over TGI eager | No |
CUDA Graphs mechanics: stream capture, allocators, mutable args
The CUDA Graphs API has more sharp edges than the simple capture_begin / capture_end pseudocode suggests. The mechanics that matter in production:
Stream capture mode
The default capture mode is cudaStreamCaptureModeGlobal: while a capture is in progress, potentially unsafe CUDA API calls (for example cudaMalloc or a synchronizing copy) made from any thread in the process fail and invalidate the capture. This is the safest default and what torch.cuda.graph uses. The alternative modes relax that check: ThreadLocal applies it only to the capturing thread, and Relaxed drops it, leaving the user to guarantee that nothing interferes. Production stacks use the default; the relaxed modes invite race conditions that are hard to debug.
Memory allocator interaction
PyTorch's CUDA allocator participates in capture: allocations made during capture are tracked and re-used on replay. To prevent fragmentation across graphs, captured graphs can be told to use a private allocator pool via torch.cuda.graphs.graph_pool_handle. This is what mode="reduce-overhead" in torch.compile does automatically when it wires in CUDA Graphs — it creates one pool per captured graph and isolates allocations.
The trade-off: a private pool reserves memory that cannot be used by other graphs or by eager-mode code, increasing peak memory usage. For LLM serving with dozens of graphs (batch × KV bucket combinations), the extra memory can reach several GB. Mitigation: share a single pool across graphs that have compatible lifetimes (typically all decode graphs).
Mutable args and the input-buffer trick
CUDA Graphs capture kernel arguments by value at capture time. If your kernel takes a tensor pointer, the graph remembers that exact pointer. If the tensor is freed and reallocated at a different address, the graph either fails or silently runs on garbage.
The production pattern is to allocate fixed-address input/output buffers once, capture graphs that reference those buffers, and at runtime copy_ new data into the buffers and replay. The copy is fast (a few microseconds), and the graph runs against stable memory.
For mutable per-step state (the KV cache, in particular), production stacks structure the cache so its pointers are stable across decode steps — paged attention assigns each request a fixed set of page pointers, which are captured into the graph. New requests assigned to the same slot reuse the same pointers.
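The fixed-address buffer pattern from this section can be sketched with torch.cuda.graph. This is a minimal illustration, assuming a CUDA device and an inference-only model; the helper name capture_with_static_buffers and the toy Linear layer are hypothetical. The side-stream warmup follows the pattern shown in the PyTorch CUDA Graphs notes.

```python
import torch


def capture_with_static_buffers(model, example_input, warmup_iters=3):
    """Capture one forward pass against fixed-address input/output buffers."""
    static_input = example_input.clone()  # address stays fixed for the graph's lifetime

    # Warm up on a side stream so lazy initialization happens before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(warmup_iters):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)

    def replay(new_input):
        static_input.copy_(new_input)  # refresh data, keep the pointer
        graph.replay()
        return static_output.clone()   # clone so the next replay cannot overwrite it

    return replay


if torch.cuda.is_available():
    model = torch.nn.Linear(64, 64).cuda().eval()
    run = capture_with_static_buffers(model, torch.randn(8, 64, device="cuda"))
    x = torch.randn(8, 64, device="cuda")
    with torch.no_grad():
        expected = model(x)
    result = run(x)
```

The clone on the way out is a deliberate choice: returning static_output directly is faster, but the caller then holds a tensor that the next replay silently overwrites.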
In-place ops
In-place ops are fine in captured graphs, but their semantics are subtle. A tensor.add_(other) inside a graph mutates a specific memory location captured at capture time. Replaying executes the same mutation against the same address. This is exactly what you want for the KV cache append (mutate slot at position i) and is exactly wrong for any tensor whose identity you expected to be re-created each step.
Allocator pool sizing
Capturing graphs eagerly without sizing the pool leads to fragmentation: each new graph allocates from the general pool, leaves holes, and the next graph cannot find a contiguous range. The fix: use set_per_process_memory_fraction to reserve a fraction of HBM for graph capture, plus the pool-handle pattern above. For a 70B model with 30+ graphs, reserving 8-12 GB for graph pools is typical.
Dynamic shape interaction
CUDA Graphs themselves do not support dynamic shapes — every graph is one shape. The dynamic-shape support in PyTorch comes from capturing multiple graphs (the bucketing pattern) and dispatching at runtime. There is no way around this in the current CUDA Graphs API; it is a hardware-driver feature, not a PyTorch limitation.
Mutating KV cache safely
The KV cache is the trickiest tensor to handle. Common pitfall: capture a graph that reads from a KV cache slot at position N, then at runtime serve a longer sequence whose decode now reads from position N+1. The graph reads the old position. The fix is paged attention's design — the graph reads from page pointers via an index table, and the index table is updated outside the graph, so the graph's behavior automatically follows the new layout. This is one of the deeper reasons paged attention won as the production KV cache design.
CUDA Graph gotchas: cuBLAS warmup, NCCL streams, allocator pools
A non-exhaustive but production-grade list of CUDA Graph traps:
cuBLAS warmup
cuBLAS lazily initializes per-shape kernels on first use. If your graph contains a matmul whose shape cuBLAS has never seen, the first capture will include an initialization step that allocates workspace and selects a kernel. Subsequent replays use the cached kernel, but the first replay can be slower or even fail if the workspace allocation is non-deterministic.
The fix: warm up before capturing. Run the model eagerly through every shape you plan to capture, then capture. This is what vLLM's startup pipeline does — there is a warmup loop that touches every batch and KV bucket before any graph is captured.
NCCL stream interactions
In multi-GPU serving, NCCL collectives (all-reduce, all-to-all for MoE) run on dedicated streams. Capturing a graph that includes a collective requires the NCCL communicator to be initialized and the collective stream to be synced with the capture stream. Naive captures often fail with cryptic NCCL errors that mention stream priority or context mismatch.
The pattern that works: synchronize all relevant streams before capture_begin, perform the collective inside the captured region with explicit stream pass-through, synchronize again before capture_end. PyTorch's distributed module handles this for FSDP and DDP under torch.compile; hand-rolled distributed code needs to replicate it.
Allocator pool sizing (continued)
A second-order issue: the pool allocator may grant a captured graph more memory than it needs (due to power-of-2 rounding), creating per-graph waste. For deployments with many shapes, total waste can reach 15-25%. Mitigation: use max_split_size_mb=128 (or similar small values) in the allocator config to reduce the maximum allocation granularity, at a small cost in fragmentation elsewhere.
Dynamic shapes inside captured regions
A graph captured with shape X cannot replay with shape Y. Trying does not error; it produces garbage. Always assert shapes match before replay. Production stacks structure this as a dispatch table keyed on (batch, kv_pages, prefill_chunk) and refuse to route requests that do not match any captured shape.
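A dispatch table of this kind can be sketched in a few lines. The sketch below is an assumption-laden toy: it keys on (batch, kv_pages) only, the KV-page buckets are invented, and the class name GraphDispatcher is hypothetical. Requests are rounded up to the nearest captured bucket (the caller is expected to pad), and anything that fits no bucket is refused rather than replayed.

```python
import bisect
from typing import Dict, Optional, Sequence, Tuple

BATCH_BUCKETS = (1, 2, 4, 8, 16, 32, 64)
KV_PAGE_BUCKETS = (16, 64, 256)  # hypothetical bucket sizes

ShapeKey = Tuple[int, int]


def round_up(value: int, buckets: Sequence[int]) -> Optional[int]:
    """Smallest bucket >= value, or None if value exceeds every bucket."""
    i = bisect.bisect_left(buckets, value)
    return buckets[i] if i < len(buckets) else None


class GraphDispatcher:
    def __init__(self) -> None:
        self._graphs: Dict[ShapeKey, object] = {}

    def register(self, batch: int, kv_pages: int, graph: object) -> None:
        self._graphs[(batch, kv_pages)] = graph

    def route(self, batch: int, kv_pages: int) -> ShapeKey:
        """Return the captured shape to pad to; refuse if none was captured."""
        key = (round_up(batch, BATCH_BUCKETS), round_up(kv_pages, KV_PAGE_BUCKETS))
        if None in key or key not in self._graphs:
            raise LookupError(f"no captured graph for batch={batch}, kv_pages={kv_pages}")
        return key


dispatcher = GraphDispatcher()
for b in BATCH_BUCKETS:
    for p in KV_PAGE_BUCKETS:
        dispatcher.register(b, p, graph=f"graph_{b}_{p}")  # stand-in for a CUDAGraph

routed = dispatcher.route(batch=3, kv_pages=50)
```

Raising on a miss, instead of falling through to the closest graph, is what prevents the silent-garbage failure mode described above.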
KV cache pointer churn
If your KV cache changes its allocation pattern between captures (because of dynamic page assignment), captured graphs become stale. Production fix: pre-allocate the full KV cache at startup, hand out slot indices rather than fresh allocations, and capture graphs that reference fixed base pointers + variable offsets.
Multi-stream coordination
If your serving stack uses multiple streams (one for compute, one for prefetching weights, one for output transfer), the captured graph must include the synchronization primitives that coordinate them. Otherwise the graph's compute proceeds while a prefetch has not landed, producing incorrect output. The cleanest pattern is to do all of the in-graph work on a single stream and use eager-mode synchronization around the graph boundary for multi-stream coordination.
Stale graph on weight update
LoRA adapter swap, weight reload, or quantization toggle all change weights. Captured graphs reference the old weights by pointer. After any weight update, captured graphs must be invalidated and re-captured. Production stacks expose a hook for this — vLLM and SGLang both have reload_weights paths that drop the graph cache.
Pinned host memory for input copy
When copying input data into the pre-allocated input buffer (the pattern from the previous section), the host-side copy is faster from pinned memory. Allocating the request's working buffers in pinned memory (torch.zeros(..., pin_memory=True)) drops the copy time by 2-3× compared to pageable memory. For sub-millisecond latency budgets, this matters.
Per-stack CUDA Graph adoption (vLLM, SGLang, TRT-LLM)
The three major serving stacks have arrived at different CUDA Graphs strategies based on their architectural priorities.
vLLM: CUDA Graphs for decode only
vLLM's V1 scheduler captures graphs for decode steps only; prefill is run eagerly with FlashAttention. The rationale: prefill steps are large enough that dispatch overhead is small relative to compute (a 4096-token prefill is ~25 ms of GPU work on a 70B model; saving 10 ms of dispatch is helpful but not transformative). Decode at small batch is the bandwidth-and-launch-bound regime where graphs pay off most.
vLLM captures graphs for batch sizes {1, 2, 4, 8, 16, 32, 64} and for KV-page counts at common sizes. Total captured graphs: typically 30-60 depending on configuration. Startup cost: 15-30 seconds for a 70B model.
SGLang: CUDA Graphs for prefill AND decode
SGLang takes the opposite stance: it captures graphs for both prefill and decode. The reasoning is that SGLang's RadixAttention often hits prefix-cache reuse, where the "prefill" is really a small uncached suffix on top of a cached prefix. These small prefills behave more like decode (launch-bound) than like long prefills (compute-bound), so capturing them helps.
The downside: more graphs to capture (chunk size × batch size = larger product). SGLang startup with full graph capture is 30-60 seconds for a 70B model. The throughput gain on prefix-heavy workloads (agentic, multi-turn chat, long system prompts) typically pays for it within minutes of serving.
TRT-LLM: graphs baked into engine
TRT-LLM does not use PyTorch's CUDA Graphs API directly. Instead, the engine build phase produces a TRT engine that includes the graph-equivalent dispatch optimization at the TRT runtime level. From the user's perspective, TRT-LLM is "always graphed" — there is no eager mode, and the engine binary includes the captured-equivalent state.
The implication: TRT-LLM's launch overhead is structurally minimized, but it is not actually using cudaGraphExec underneath. The TRT runtime has its own dispatch optimization that achieves similar results.
Cross-stack feature comparison
| Stack | Prefill graphs | Decode graphs | Chunked prefill graphs | MoE graph support |
|---|---|---|---|---|
| vLLM 0.8 | No (eager) | Yes | Partial (per-chunk size) | Yes |
| SGLang 0.4 | Yes | Yes | Yes | Yes |
| TRT-LLM 0.18 | Engine-baked | Engine-baked | Engine-baked | Engine-baked |
| TGI | Limited | Yes | No | Limited |
| llama.cpp | n/a | n/a | n/a | n/a |
Inductor Triton template kernels and the autotune cache
Inductor's codegen is not pure synthesis; it uses a set of hand-written Triton templates for high-value patterns (matmul, reductions, normalizations) and synthesizes only the glue.
Template kernels
For matmul, Inductor calls cuBLAS or CUTLASS rather than generating Triton. For matmul-adjacent fusions (matmul + bias + activation), it uses Triton templates parameterized over the fusion pattern. The template approach gives Inductor near-CUTLASS performance for common matmul + epilogue patterns without having to solve the full kernel-synthesis problem.
For attention, Inductor does not template — it dispatches to FlashAttention or to FlexAttention's Triton codegen. Inductor's job is to recognize the attention pattern and route to the right kernel.
Triton autotuner
For non-matmul Triton kernels, Inductor runs an autotuning search over BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, num_warps, num_stages, and per-kernel scheduling hints. The search is bounded — typically 4-12 configurations per kernel — and benchmarked on the actual GPU. The winner is cached.
Cache location and invalidation
Default: ~/.cache/torch/inductor/ (or TORCHINDUCTOR_CACHE_DIR if set). Cache keys include the FX graph hash, the PyTorch version, the Triton version, the CUDA toolkit version, the GPU compute capability, and the kernel hash. Any change to any of these invalidates affected cache entries.
For CI/CD: pre-populate the cache on a representative GPU, archive it as part of the container image, and ship it to production. The first request after a deploy now starts with a warm cache, eliminating the 30-180 second JIT penalty.
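One way to script the archive-and-restore step, using only the standard library. The helper names and the temporary directories are illustrative; the only PyTorch-specific piece is the TORCHINDUCTOR_CACHE_DIR variable mentioned above, which has to be set before the process compiles anything.

```python
import os
import tarfile
import tempfile
from pathlib import Path


def archive_cache(cache_dir, archive_path):
    """Pack a populated Inductor cache directory into a tarball (build side)."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(str(cache_dir), arcname=".")
    return Path(archive_path)


def restore_cache(archive_path, cache_dir):
    """Unpack the tarball and point Inductor at it (deploy side)."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(str(cache_dir))
    # Must happen before the first torch.compile call in this process.
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = str(cache_dir)
    return cache_dir


# Demo with a throwaway directory standing in for a real cache.
with tempfile.TemporaryDirectory() as tmp:
    build_cache = Path(tmp) / "build_cache"
    build_cache.mkdir()
    (build_cache / "kernel_abc.py").write_text("# tuned kernel placeholder\n")

    archive = archive_cache(build_cache, Path(tmp) / "inductor_cache.tar.gz")
    restored = restore_cache(archive, Path(tmp) / "prod_cache")
    restored_files = sorted(p.name for p in restored.iterdir())
    cache_env = os.environ["TORCHINDUCTOR_CACHE_DIR"]
```

Because cache keys include the GPU compute capability and library versions, an archive built on a different SKU or PyTorch version simply produces cache misses rather than wrong results.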
Deterministic builds
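By default, autotuning is non-deterministic: small wall-clock variations during the benchmark can pick different winners across runs. For reproducible production deployments, pin the autotune output by archiving the cache so that every replica loads the same winners. TORCHINDUCTOR_BENCHMARK_KERNEL=1 is a useful companion here: it makes Inductor emit a benchmark harness with each generated kernel, which helps when re-checking a pinned choice, but it does not make the search itself deterministic. The deterministic mode (torch.use_deterministic_algorithms(True)) further constrains kernel choice; this is necessary for bit-exact reproduction but may forgo 5-15% performance.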
torch.compile with FSDP, DDP, Megatron, and custom ops
FSDP
FSDP (Fully Sharded Data Parallel) shards parameters across ranks and reconstructs them on the fly for each forward / backward step. The all_gather and reduce_scatter collectives that implement this are now compile-friendly in PyTorch 2.4+ via torch.compile(model, fullgraph=False) plus FSDP's use_orig_params=True configuration. Graph breaks occur at collective boundaries by default; the fragmented graphs still benefit from per-fragment Inductor fusion. fullgraph=True is not supported with FSDP today; the integration is improving but expect graph breaks.
DDP
DDP (Distributed Data Parallel) replicates the model and all-reduces gradients in the backward pass. The all-reduce is fused with backward computation via bucketing; torch.compile plays well with this because the all-reduce is on a separate stream and does not block the compute path. Practical experience: torch.compile + DDP delivers the expected speedups; this is the most boring and most-tested distributed-compile pattern.
Megatron-LM tensor parallelism
Megatron's tensor parallelism splits matmuls across ranks with all-reduce in the forward and backward. The all-reduce sits between two matmuls within a layer and is hard for Inductor to optimize across — fusion stops at the collective. The wins from torch.compile + Megatron-TP are smaller than for DDP/FSDP because the per-layer collective limits the fusion window. Still positive, typically 10-15% throughput.
Custom ops via torch.library
When your model uses a custom CUDA op (a hand-written kernel via torch.utils.cpp_extension), Inductor sees it as an opaque function call and refuses to fuse around it. The fix: register the op with torch.library.custom_op, supply a register_fake rule (the abstract output shape function), and optionally a register_kernel rule for the actual implementation. Inductor can then schedule around the op even if it cannot fuse into it. This is the path for production custom kernels that need to coexist with torch.compile.
Quantized / FP8 compile
torch.compile supports FP8 and INT4 weights via the torchao library. The FP8 path requires a GPU with FP8 tensor cores, meaning Ada (SM 8.9), Hopper, or newer; Ampere has none, so cuBLASLt cannot run FP8 matmuls there. The INT4 weight-only path works on any GPU; it uses dequantize-then-matmul Triton kernels generated by Inductor.
The relevant pitfall: quantization introduces extra tensors (scales, zeros) that participate in fusion. If Inductor cannot see them as part of the matmul template, it generates a separate dequant kernel. Apply torchao.quantization.quant_api.quantize_ to the model first and call torch.compile afterwards, in that order, so that Dynamo traces the already-quantized weights and Inductor's templates can recognize the quantized weight pattern.
Numerical precision, FP8/quantized compile, reproducibility
Numerical differences after compile
Compiled kernels are not bit-identical to eager. Sources of difference: kernel fusion changes the order of floating-point additions (FP is non-associative), Triton's matmul accumulator may use a different precision than cuBLAS's default, and tensor-core paths can use TF32 where eager used FP32.
Practical magnitude: per-token logit differences of 1e-5 to 1e-3 between eager and compile, occasionally producing different argmax decisions on borderline tokens. For most workloads this is irrelevant — sampling temperature dominates. For deterministic regression tests, expect to need a tolerance of 1e-4 on logit comparison.
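A regression check along these lines can be sketched without any framework dependency. The function name compare_logits and the three-token logit vectors are made up for illustration; the point is that a comparison can pass the 1e-4 tolerance and still flip the argmax on a borderline token, so it is worth reporting both.

```python
from typing import Sequence, Tuple


def compare_logits(eager: Sequence[float], compiled: Sequence[float],
                   atol: float = 1e-4) -> Tuple[float, bool, bool]:
    """Return (max abs difference, within tolerance?, same argmax?)."""
    if len(eager) != len(compiled):
        raise ValueError("logit vectors must have the same length")
    max_diff = max(abs(a - b) for a, b in zip(eager, compiled))
    argmax_eager = max(range(len(eager)), key=eager.__getitem__)
    argmax_compiled = max(range(len(compiled)), key=compiled.__getitem__)
    return max_diff, max_diff <= atol, argmax_eager == argmax_compiled


# Two near-tied tokens: differences stay under 1e-4, yet the winner changes.
eager_logits = [2.1500, 2.1499, -0.3]
compiled_logits = [2.14995, 2.14998, -0.3]

max_diff, within_tol, same_argmax = compare_logits(eager_logits, compiled_logits)
```

In a real test suite the same comparison would run over full logit tensors, for example with torch.testing.assert_close and an explicit atol.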
FP8 quantized compile
FP8 (E4M3 for weights and activations, E5M2 for some pathways) requires per-tensor or per-channel scales. The compiler must thread these scales through fusion correctly. Inductor's FP8 support in PyTorch 2.4+ handles this via the torch.float8_e4m3fn dtype and torch._scaled_mm op; if you see "no matching kernel" errors on FP8, it usually means you are on a too-old PyTorch or your GPU does not support FP8 (Ampere does not).
Reproducibility
For deterministic builds: torch.use_deterministic_algorithms(True) plus CUBLAS_WORKSPACE_CONFIG=:4096:8 plus pinning all library versions. Inductor's autotuner is non-deterministic by default; setting TORCHINDUCTOR_DETERMINISTIC=1 makes it deterministic at the cost of forgoing some autotune wins. Full bit-exact reproduction across machines also requires matching CUDA toolkit version, driver version, and GPU compute capability — practically unattainable across heterogeneous fleets, achievable on a single hardware SKU.
Benchmarks: eager vs compile vs graphs on Llama 3, Mistral
Concrete numbers from our internal benchmarking, May 2026 on H100 SXM 80GB with PyTorch 2.6, CUDA 12.6, FA3 enabled. Throughput in decode tokens / second / GPU at batch 1 unless noted.
Llama 3 8B (BF16)
| Configuration | Batch 1 decode | Batch 16 decode | Notes |
|---|---|---|---|
| Eager PyTorch | 78 tok/s | 720 tok/s | Baseline |
| torch.compile (default) | 102 tok/s | 880 tok/s | +30% / +22% |
| torch.compile (reduce-overhead) | 138 tok/s | 940 tok/s | Adds CUDA Graphs |
| vLLM 0.8 (graphs only) | 145 tok/s | 1180 tok/s | Production stack |
| vLLM 0.8 + compile | 156 tok/s | 1240 tok/s | Compile beta path |
| TRT-LLM 0.18 (FP16 engine) | 168 tok/s | 1340 tok/s | Engine-built |
Llama 3 70B (BF16)
| Configuration | Batch 1 decode | Batch 16 decode | Notes |
|---|---|---|---|
| Eager PyTorch | 12 tok/s | 145 tok/s | Baseline |
| torch.compile (reduce-overhead) | 21 tok/s | 195 tok/s | +75% / +35% |
| vLLM 0.8 (graphs only) | 23 tok/s | 230 tok/s | Production stack |
| vLLM 0.8 + compile | 25 tok/s | 245 tok/s | Compile beta |
| SGLang 0.4 (graphs + compile) | 27 tok/s | 260 tok/s | Prefix-aware |
| TRT-LLM 0.18 (FP8 engine) | 38 tok/s | 380 tok/s | FP8 included |
The FP8 row in TRT-LLM is doing two things at once, engine compile plus FP8 quantization, and is not directly comparable to the BF16 rows. Holding precision constant, the engine-build advantage of TRT-LLM over compile + graphs is roughly 15-25%.
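The speedup figures for the 70B table can be recomputed directly from the throughput columns. The snippet below only restates the table's numbers and divides each row by the eager baseline; nothing in it is newly measured.

```python
# (batch 1, batch 16) decode throughput in tok/s, copied from the 70B table.
decode_tok_s = {
    "Eager PyTorch": (12, 145),
    "torch.compile (reduce-overhead)": (21, 195),
    "vLLM 0.8 (graphs only)": (23, 230),
    "vLLM 0.8 + compile": (25, 245),
    "SGLang 0.4 (graphs + compile)": (27, 260),
    "TRT-LLM 0.18 (FP8 engine)": (38, 380),
}

base_b1, base_b16 = decode_tok_s["Eager PyTorch"]

# Speedup over eager for each configuration, per batch size.
speedups = {
    name: (b1 / base_b1, b16 / base_b16)
    for name, (b1, b16) in decode_tok_s.items()
}

for name, (s1, s16) in speedups.items():
    print(f"{name:34s} batch 1: {s1:.2f}x   batch 16: {s16:.2f}x")
```

Keeping the raw numbers in one structure like this also makes it easy to re-run the comparison after a PyTorch or driver upgrade.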
Mistral 7B Instruct
Mistral 7B's GQA (grouped-query attention) reduces KV cache size and shifts the decode bottleneck slightly. The wins from compile + graphs are proportional but slightly larger than for Llama 3 8B because the smaller KV cache leaves more room for launch overhead to dominate. We have measured 1.9× decode throughput from eager → vLLM + compile at batch 1.
Production deployment workflow: AOT, warmup, version pinning
A production-grade deployment of compile + graphs looks like this:
Step 1: Pin versions
PyTorch, Triton, CUDA toolkit, NVIDIA driver, FlashAttention, and your serving framework. Any nightly is forbidden in production. Reference 2026 stack: PyTorch 2.6.0, CUDA 12.6, driver 560.x, Triton 3.1, FlashAttention 2.7 with FA3 enabled.
Step 2: Build with AOTInductor (where appropriate)
For latency-critical deployments, build a per-shape AOTInductor .so per bucketed shape, archive them in object storage, and ship them with the container image. For autoscale-friendly cold start, this is the highest-impact change you can make — startup drops from 60-180 s to 2-5 s.
Step 3: Pre-populate Inductor cache
Run a representative workload on the same GPU SKU as production, archive ~/.cache/torch/inductor/, ship it with the container. First-request latency drops by the JIT compile cost.
Step 4: Warm up at startup
On instance start, run synthetic requests across all bucketed shapes to ensure CUDA Graphs are captured, cuBLAS is warmed up, and any lazy-initialized library state is settled. The warmup should run inside the readiness probe; only mark the instance ready when warmup completes.
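The warmup-inside-readiness idea can be sketched as a small gate object. Everything here is hypothetical scaffolding: the class name WarmupGate, the bucket values, and the warm_one callback, which in a real server would run a synthetic request (and trigger graph capture) for one shape.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple


class WarmupGate:
    """Reports ready only after every bucketed shape has been warmed."""

    def __init__(self, batch_buckets: Iterable[int], kv_buckets: Iterable[int]):
        self._pending = set(product(batch_buckets, kv_buckets))
        self.ready = False

    def run(self, warm_one: Callable[[int, int], None]) -> None:
        for shape in sorted(self._pending):
            warm_one(*shape)              # synthetic request for this shape
            self._pending.discard(shape)
        self.ready = not self._pending

    def readiness_probe(self) -> int:
        """HTTP-style status for the orchestrator's readiness check."""
        return 200 if self.ready else 503


warmed: List[Tuple[int, int]] = []
gate = WarmupGate(batch_buckets=(1, 8, 64), kv_buckets=(16, 256))

status_before = gate.readiness_probe()
gate.run(lambda batch, kv: warmed.append((batch, kv)))
status_after = gate.readiness_probe()
```

If warm_one raises for any shape, the gate never flips to ready, which keeps a half-warmed instance out of the load balancer.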
Step 5: Continuous monitoring
Track per-request decode latency, kernel-launch count (via NVTX or cudaLaunchKernel counters), and the recompilation log. Alert on regressions; recompilation churn is the most common silent failure mode.
Step 6: Validate after every PyTorch upgrade
PyTorch upgrades invalidate the Inductor cache and can change kernel selection in ways that affect both performance and numerical output. Treat PyTorch upgrades as deployment events: run a benchmark suite plus a numerical regression test before promoting to production.
CPU-bound vs GPU-bound regimes and Blackwell changes
When compile/graphs help
- Decode at small batch on any high-clock-rate GPU.
- Prefill on long prompts with element-wise-heavy post-attention paths.
- MoE inference (many small kernels per expert × layer multiply launch overhead).
- Cold-start-sensitive serverless workloads (use AOT).
When they don't
- Compute-bound prefill (the kernels are big enough to amortize launch).
- CPU-only workloads dominated by Python interpretation (Python is the bottleneck, not kernel dispatch).
- Heavily Python-side code (data preprocessing, custom samplers in pure Python) — the GPU is idle for reasons unrelated to dispatch.
- Embedding-only workloads on short inputs.
- Toy models whose end-to-end time is already negligible, where launch overhead is not even the issue worth fixing.
Blackwell-specific changes
B200 brings two changes that affect this stack: (1) lower kernel-launch overhead in the driver (CUDA 12.6 ships with a faster cudaLaunchKernel path), so the baseline launch tax is lower; (2) NVFP4 / MXFP4 tensor cores that require new fusion patterns in Inductor.
The net effect on the compile + graphs ROI:
- The relative win from CUDA Graphs alone is smaller on B200 — launch overhead is lower in the baseline. Typical batch-1 decode wins drop from 1.5-2× on H100 to 1.3-1.7× on B200.
- The relative win from compile (kernel fusion) is larger on B200 — the new precision formats benefit more from fusion around the dequant path. Typical batch-1 decode wins increase from 1.3-1.5× on H100 to 1.4-1.7× on B200.
- The combined win is roughly preserved: 2-3× on both.
NVL72 (B200 in 72-GPU NVLink domain) changes the calculus for distributed serving but not for single-GPU compile + graphs. The intra-rack NVLink bandwidth (1.8 TB/s per GPU) is large enough that tensor-parallel inference can run with negligible communication overhead, but the compile stack still operates per-rank and the same caveats apply.
torch.compile decision matrix and the "fast eager vs slow compile" trade-off
A practical framework for deciding whether to invest in torch.compile, CUDA Graphs, both, or neither.
The decision matrix
| Workload | Use torch.compile? | Use CUDA Graphs? | Notes |
|---|---|---|---|
| Inference, fixed batch + shape | Yes | Yes | Best case for both |
| Inference, dynamic shapes (varying batch) | Yes with dynamic=True | Multiple captured graphs | Compile recompiles; graphs need shape bucketing |
| Inference with KV cache decode loop | Yes (cautious) | Yes (vLLM-style) | Graph capture during decode only |
| Training, dense | Yes | Rare | Compile helps, graphs typically only inference |
| Training, FSDP / DDP | Yes (with dynamic=True) | Difficult | Compile + FSDP improving; graphs hard with collectives |
| Training, gradient accumulation | Yes | Yes (capture step) | Possible for stable shapes |
| Research, frequent code changes | Optional | No | Compile time can exceed productivity gain |
| Edge / CPU inference | Yes (cpu backend) | n/a | Inductor CPU is competitive with onnxruntime |
| Models with many graph breaks | Marginal | No | Fix graph breaks first |
The "fast eager vs slow compile" trap
A frequent disappointment: a developer profiles eager mode at 50 ms/iteration, runs torch.compile(), sees the first iteration take 30 seconds, and walks away thinking compile is broken. The reality:
- The first call triggers compilation (Dynamo trace → AOTAutograd → Inductor → Triton template tuning → CUDA caching). This takes seconds to minutes.
- Subsequent calls run the compiled artifact. These are usually 20–60% faster than eager.
- If you measure only the first iteration, you measure compilation, not speed.
The fix: always warm up the model (5–10 forward passes) before measuring. For production, the warmup happens at server startup, so the cold-start cost is paid once.
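A small timing helper makes the distinction concrete. This is a sketch with hypothetical names: the benchmark function, the fake model with an artificially slow first call, and the optional sync hook. For GPU work the hook should be torch.cuda.synchronize, since kernel launches return before the GPU finishes.

```python
import statistics
import time
from typing import Callable, Optional, Tuple


def benchmark(fn: Callable[[], None], warmup: int = 5, iters: int = 20,
              sync: Optional[Callable[[], None]] = None) -> Tuple[float, float]:
    """Return (first-call seconds, median steady-state seconds)."""
    def timed() -> float:
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # e.g. torch.cuda.synchronize for GPU work
        return time.perf_counter() - start

    first = timed()                       # includes any one-time compile cost
    for _ in range(max(warmup - 1, 0)):   # remaining warmup calls, discarded
        timed()
    steady = statistics.median(timed() for _ in range(iters))
    return first, steady


# Fake "compiled model": slow on the first call, fast afterwards.
state = {"compiled": False}

def fake_model() -> None:
    if not state["compiled"]:
        time.sleep(0.05)   # stands in for compilation
        state["compiled"] = True
    else:
        time.sleep(0.001)  # stands in for the compiled artifact

first_call, steady_state = benchmark(fake_model)
```

Reporting both numbers separately is the useful part: the first tells you about cold start, the second about throughput.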
When compile actively hurts
A small list of cases where compile is the wrong move:
- Models with many graph breaks (Python control flow, .item() calls, in-place ops mixed with autograd). The compile overhead exceeds the speedup.
- Highly dynamic shapes where every batch is a recompile.
- Single-shot inference where compile time exceeds run time.
- Models that depend on specific eager-mode behaviour (rare, but exists).
- Models whose time is already dominated by large, compute-bound kernels; compile mostly helps Python and launch overhead.
When CUDA Graphs alone is enough
If your bottleneck is purely launch overhead (small kernels, small batches, decode-bound LLM inference), CUDA Graphs can deliver the full speedup without compile's complexity. vLLM and SGLang both use CUDA Graphs aggressively for decode. The downside is the shape rigidity — you need to capture a graph per shape bucket.
When both pay off
Production LLM serving: compile during model loading (using torch.compile on the model), then capture CUDA Graphs of the compiled forward pass for decode. vLLM and SGLang can run this way in their compile modes; TensorRT-LLM reaches a comparable result through its engine build rather than through torch.compile and PyTorch's CUDA Graphs API. The combination delivers 1.5–3× over eager for typical decoding workloads.
For the underlying mechanics: see vLLM PagedAttention for the production decode loop and disaggregated inference for the architectural decomposition.
Reproducibility and determinism in compiled code
A practical question that bites teams shipping production inference: can compiled PyTorch be deterministic? Mostly yes, with caveats.
Sources of non-determinism
In any GPU code, non-determinism comes from:
- Floating-point reduction order (different kernels sum tiles in different orders).
- Atomic operations (scatter, certain reductions).
- cuBLAS algorithm selection (different algorithms produce slightly different results).
- Triton autotuner picking different kernels across runs.
- TensorFloat-32 (TF32) lossy precision on Ampere+ unless disabled.
- FP8 / INT8 quantisation with different scaling.
Achieving determinism
For inference reproducibility:
- torch.use_deterministic_algorithms(True) turns on PyTorch's deterministic mode.
- CUBLAS_WORKSPACE_CONFIG=:4096:8 is required for cuBLAS determinism.
- torch.backends.cuda.matmul.allow_tf32 = False disables TF32 matmul.
- Pin Triton autotune cache or use @triton.heuristics with fixed configs.
- For compile: torch.compile(..., mode="reduce-overhead") with consistent shapes.
The result: outputs reproducible to the bit across runs on the same hardware. Cross-hardware reproducibility (H100 vs B200) is much harder — different hardware uses different kernels.
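The first three settings can be applied in a few lines at process start. This is a minimal sketch: the helper name enable_determinism and the toy Linear model are illustrative, and the example runs on CPU, where it only demonstrates that the switches are accepted and that repeated runs match.

```python
import os

# cuBLAS reads this when it initializes, so set it before any CUDA work.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch


def enable_determinism() -> None:
    """Apply the deterministic-mode switches listed above."""
    torch.use_deterministic_algorithms(True)
    torch.backends.cuda.matmul.allow_tf32 = False  # no TF32 in matmuls


enable_determinism()

model = torch.nn.Linear(16, 16).eval()
x = torch.randn(4, 16)
with torch.no_grad():
    first = model(x)
    second = model(x)

print("bit-identical across runs:", torch.equal(first, second))
```

With deterministic mode on, PyTorch raises an error when an operation has no deterministic implementation, which is usually preferable to silently losing reproducibility.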
When determinism costs you
Deterministic algorithms are slower. The expected cost is 5–20% of throughput. For most inference use cases, accepting near-determinism (outputs that agree to within a few ULP rather than bit for bit) is fine. For high-stakes use (medical, financial, regulatory), full determinism is sometimes required.
Compile and determinism interaction
The torch.compile cache is keyed on input shapes and (in some modes) on autotune results. If autotune picks different kernels on different machines, two compiled artifacts can produce slightly different outputs. Pinning the autotune cache solves this; the cost is occasional sub-optimal kernels.
Practical workflow for reproducible production inference
- Pin model weights, code version, CUDA version, PyTorch version, Triton version.
- Disable TF32 if exact precision matters.
- Set deterministic algorithms.
- Share the Triton autotune cache across replicas.
- Validate reproducibility with a golden-output suite.
- Re-validate after any dependency update.
This workflow is overkill for most products and essential for some. The latter are usually products that ship to regulated industries.
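A golden-output check can be sketched with the standard library alone. The example below assumes outputs have already been flattened to Python floats that hold float32 values (for a tensor, something like calling .float().flatten().tolist() on it); the function names are illustrative, not part of any library.

```python
import hashlib
import struct


def _ordered_bits(x):
    """Map a float32 value to an integer whose ordering matches the float ordering."""
    bits = struct.unpack("<i", struct.pack("<f", x))[0]
    return bits if bits >= 0 else -(bits & 0x7FFFFFFF)


def ulp_distance(a, b):
    """How many float32 steps separate a and b (0 means the same value)."""
    return abs(_ordered_bits(a) - _ordered_bits(b))


def fingerprint(values):
    """SHA-256 over the float32 bytes: equal fingerprints mean bit-identical outputs."""
    return hashlib.sha256(struct.pack(f"<{len(values)}f", *values)).hexdigest()


def check_against_golden(output, golden, max_ulp=0):
    """Return the indices where output drifts from golden by more than max_ulp."""
    if len(output) != len(golden):
        raise ValueError("length mismatch")
    return [i for i, (o, g) in enumerate(zip(output, golden))
            if ulp_distance(o, g) > max_ulp]


golden = [1.0, -2.5, 0.125]
drifted = [1.0 + 2 ** -23, -2.5, 0.125]  # first element is one float32 ULP above 1.0
```

With max_ulp=0 the check enforces full determinism; a small positive tolerance expresses the near-determinism setting, and the fingerprint gives a cheap value to store alongside each golden case.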
Profiling compiled code: what's different
Profiling compiled PyTorch requires slightly different tools and techniques than eager mode. A practical workflow.
Tools
- torch.profiler. PyTorch's built-in profiler. Works with compiled code, but the trace can be hard to read because operators are fused.
- nsys (Nsight Systems). NVIDIA's system-level profiler. Shows the CUDA kernel timeline, including graph captures and replays.
- ncu (Nsight Compute). Kernel-level profiler. For deep analysis of specific Triton-generated kernels.
- torch._dynamo.config.verbose = True. Reveals graph breaks and recompiles.
- TORCH_LOGS=+dynamo,inductor. Verbose logging of what the compiler is doing.
What to look for
In a compiled forward pass:
- Graph breaks. Each break is a Python re-entry, which costs hundreds of microseconds and prevents fusion across the break. Aim for zero graph breaks in the hot path.
- Recompiles. Every recompile costs seconds. If you see recompiles per iteration, you have unstable shapes.
- Kernel time breakdown. What fraction of time is in matmul vs attention vs Python? Compile should reduce Python to <5%.
- Memory copies. Inductor sometimes inserts copies for alignment or layout. Big copies are a flag.
- CPU-GPU sync points. .item(), .cpu(), and host-side conditionals on tensor values each force a sync.
A typical "why isn't compile faster" investigation
A team has a model that's only 5% faster after compile. The investigation:
- Run with TORCH_LOGS=+dynamo,inductor and check for graph breaks.
- Check whether the compile mode is reduce-overhead (fusion plus CUDA Graph capture) or default (fusion only, so per-kernel launch overhead remains).
- Profile with nsys and compare kernel times eager vs compile.
- Look for any .item() or Python-level operations in the hot path.
- Check whether the workload is fundamentally kernel-bound (matmul-dominated). If so, compile can't help much.
The common finding: graph breaks at unexpected places (often due to operator implementations that fall back to eager). Fixing them recovers most of the missing speedup.
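The logging step of that investigation is easiest to get right when TORCH_LOGS is set before the Python process starts. Here is a small sketch, assuming a hypothetical benchmark script name and helper names of my own choosing; it only builds the environment and launches the child process.

```python
import os
import subprocess
import sys


def compile_debug_env(logs="+dynamo,inductor", base=None):
    """Copy of the environment with compiler logging switched on."""
    env = dict(os.environ if base is None else base)
    env["TORCH_LOGS"] = logs
    return env


def run_with_compile_logs(script_args, logs="+dynamo,inductor"):
    """Run `python <script_args>` with TORCH_LOGS set; returns the captured stderr text."""
    result = subprocess.run(
        [sys.executable, *script_args],
        env=compile_debug_env(logs),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stderr  # Python logging handlers write to stderr by default
```

A call such as run_with_compile_logs(["bench_decode.py"]) (script name hypothetical) returns the log text, which can then be searched for graph-break and recompile messages.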
For deeper kernel-level analysis, see Triton kernel primer.
CUDA graphs for training: the rarer use case
Most production discussion of CUDA Graphs focuses on inference because decode loops are launch-bound. Training is usually kernel-bound and benefits less, but there are specific cases where graphs help training too.
When training is launch-bound
- Small models with small batch sizes (uncommon in production training).
- Gradient accumulation steps with many small ops.
- Models with many small Python-side operations in the forward pass.
- Optimizer step on small models.
Capturing a training step with graphs
The full forward + backward + optimizer step can be captured if:
- Shapes are stable across iterations.
- No control flow that depends on tensor values.
- Collectives (DDP, FSDP) are either captured or excluded from the graph.
Megatron-LM has had support for graph capture of training steps for years. Capturing FSDP is harder because of its dynamic communication patterns, but FSDP2 and 2026 PyTorch versions have improved support.
Practical wins
- Gradient accumulation: typically 5–15% speedup.
- Small-model training: occasionally 20–40% speedup.
- Optimizer step: usually a minor win (5%) unless the optimizer has many small ops (Lion, Sophia).
For most large-scale training (multi-GPU, multi-node, FSDP/Megatron), CUDA Graphs in training are marginal. The kernel-level work (matmul, attention, layer norm) dominates and graphs don't accelerate the kernels themselves.
The gradient checkpointing interaction
Gradient checkpointing recomputes activations during backward. This recomputation can be captured in a graph if checkpointing is at a stable boundary. Some configurations work; others break the capture.
The takeaway: CUDA Graphs in training are a worthwhile optimisation for specific workloads but not the universal win they are in inference.
Blackwell-specific compile considerations
Blackwell (B100, B200, GB200 NVL72) shipped through 2024–2026 and introduces architectural changes that affect compile behaviour.
What's new on Blackwell
- TCGen5 tensor cores. Fifth-generation tensor cores with a new matrix-multiplication instruction set (exposed in PTX as tcgen05), including native FP4 and MXFP8 support.
- Partition-aware scheduling. Each SM is split into multiple partitions; compile needs to schedule across partitions.
- Larger shared memory. More SMEM per SM allows larger Triton tile sizes.
- Higher HBM bandwidth and more capacity. B200's 192GB HBM3e changes the memory hierarchy.
- NVLink 5. Faster inter-GPU communication affects sequence-parallel and TP communication patterns.
Compile support timeline
- PyTorch 2.5 / 2.6 (late 2024 to early 2025): pre-release Blackwell support through nightly and source builds, missing some optimisations.
- PyTorch 2.7 (April 2025): first stable release with official Blackwell support; improved Inductor on Blackwell, including FP8 paths.
- PyTorch 2.8 / 2.9 (second half of 2025): full TCGen5 support, FP4 quantisation, partition-aware scheduling.
By mid-2026, torch.compile on Blackwell is roughly at parity with H100 maturity — production-ready but newer than the H100 codebase.
Workloads that benefit most
- FP8 / FP4 inference: Blackwell's native low-precision tensor cores deliver large speedups when paired with appropriate quantisation.
- Long-context attention: more SMEM allows larger FlashAttention tiles.
- MoE serving: NVLink 5 changes the expert-routing communication cost.
Migration considerations
Code that works on H100 with torch.compile should work on Blackwell without changes, but optimal performance often requires:
- Recompilation with Blackwell-specific kernel templates.
- Quantisation passes (FP8 or FP4) where appropriate.
- Tile size tuning for the larger SMEM.
For production deployments moving from H100 to B200, the code runs unchanged but the upgrade isn't drop-in for performance. Plan a re-benchmarking and tuning pass.
For the underlying hardware: H100, H200, B200 architecture.
Comparison tables
Three consolidating tables that anchor the trade-offs discussed throughout the guide.
Table A: techniques by use case
| Technique | Inference launch-bound | Inference kernel-bound | Training | Research |
|---|---|---|---|---|
| Eager only | Slow | Acceptable | Acceptable | Best DX |
| torch.compile (default) | Faster | Slightly faster | Faster | Slow compile, fine after warmup |
| torch.compile (reduce-overhead) | Much faster | Slightly faster | Sometimes | Less flexible |
| CUDA Graphs only | Much faster | No change | Marginal | Painful DX |
| compile + graphs | Best | Slightly faster | Good for stable | Tedious |
| AOTInductor | Best for production | Best for production | n/a | n/a |
Table B: speedup typical ranges (eager = 1×)
| Workload | torch.compile | CUDA Graphs | compile + graphs |
|---|---|---|---|
| Llama 3 8B decode | 1.2–1.4× | 1.3–1.6× | 1.5–2.0× |
| Llama 3 70B decode | 1.1–1.2× | 1.2–1.4× | 1.3–1.5× |
| Mistral 7B prefill | 1.1–1.3× | 1.0× | 1.1–1.3× |
| Small CNN inference | 1.3–1.8× | 1.5–2.5× | 2.0–3.0× |
| ResNet-50 training | 1.05–1.15× | 1.0× | 1.05–1.15× |
Table C: when each tool is the right answer
| If your goal is | Use |
|---|---|
| "Make my eager PyTorch faster with one line" | torch.compile() (default mode) |
| "Reduce decode launch overhead in my LLM" | CUDA Graphs (often via vLLM/SGLang) |
| "Best inference throughput on a fixed model" | TensorRT-LLM or vLLM with compile + graphs |
| "Production deployment as a standalone artifact" | AOTInductor |
| "Custom kernel I can profile easily" | Write in Triton directly |
| "Squeeze the last 10% on a hot path" | CUTLASS or hand-tuned CUDA |
| "Research that changes frequently" | Stay in eager; add compile last |
These tables condense the practical guidance. For deeper dives on specific patterns, see the Triton kernel primer and vLLM PagedAttention posts.
Production deployment patterns by stack
How each major inference stack actually uses torch.compile and CUDA Graphs in 2026. Versions referenced are mid-2026 publicly-known capabilities.
vLLM
- CUDA Graphs: enabled by default for decode since vLLM 0.5. Captured per batch-size bucket. Capture happens at server startup.
- torch.compile: enabled for the model forward in vLLM 0.6+. Graph capture wraps the compiled forward.
- Quantisation: FP8 KV cache + FP8 weights supported. INT4 via GPTQ / AWQ. Compile interacts cleanly with all.
- Multi-GPU: tensor parallel works with compile + graphs; pipeline parallel partially.
- Cold start: 30–90 seconds for a 70B model with full compile + graph capture.
SGLang
- CUDA Graphs: captured for both prefill and decode (vLLM only does decode). The dual capture is part of SGLang's throughput advantage.
- torch.compile: integrated with the kernel selection layer. Some custom kernels bypass compile.
- RadixAttention: tree-based KV-cache reuse interacts subtly with graph capture; SGLang has specific patches.
- Cold start: comparable to vLLM.
TensorRT-LLM
- No torch.compile: TRT-LLM uses NVIDIA's TensorRT engine builder, separate from PyTorch.
- Engine build: produces a self-contained .engine file. The build itself takes 10 minutes to multiple hours depending on model size and tuning.
- AOT artifact: the engine is shippable as a binary, much like AOTInductor.
- Performance: typically 1.1–1.3× over vLLM for the same model and hardware, at the cost of less flexibility.
Triton Inference Server (NVIDIA Triton, not the kernel language)
- PyTorch backend: supports torch.compile via the backend's config.
- TensorRT backend: uses TRT engines.
- Ensembling: orchestrates multiple models, each potentially using a different runtime.
Modal, RunPod, Together, Fireworks (managed)
- Most managed inference providers in 2026 use vLLM or SGLang under the hood.
- They handle the compile / graph warmup as part of the deployment workflow, so users see fast inference without managing cold starts.
- Custom-fine-tuned models can be deployed; the provider handles compile + graph capture transparently.
Llama.cpp / MLX / Ollama
- These target consumer hardware (CPU, Apple Silicon, modest GPUs).
- Do not use torch.compile (different runtime).
- CUDA Graphs are not used (these stacks have their own batching strategies).
- Achieve comparable per-token throughput on consumer hardware via different optimisations (kernel fusion in custom CUDA / Metal kernels, quantisation, batch-size-1 specialisation).
| Stack | torch.compile | CUDA Graphs | Quantisation | Best for |
|---|---|---|---|---|
| vLLM | Yes | Decode | FP8, INT4 | General production LLM serving |
| SGLang | Yes | Prefill + decode | FP8, INT4 | High-throughput multi-tenant |
| TensorRT-LLM | No | No (engine-level) | FP8, FP4, INT4 | Maximum throughput, willing to build |
| TGI (HuggingFace) | Yes (newer) | Yes (newer) | FP8, INT4 | HuggingFace-centric deployments |
| Llama.cpp | No | No | GGUF quant | CPU and consumer GPU |
| MLX | No | n/a | INT4, INT8 | Apple Silicon |
| Modal/Together/Fireworks | Hidden | Hidden | Provider-managed | Managed inference |
The pattern across the industry: compile + graphs is now the production default for GPU LLM serving. Consumer-stack alternatives use different optimisations but reach competitive single-user throughput via other paths.
For the broader serving architecture context, see LLM serving and agent serving infrastructure.
Common pitfalls and how to avoid them
A consolidated list of mistakes that show up over and over in support channels and code reviews.
Pitfall 1: measuring cold start as "compile speed"
The first compiled iteration includes compilation time, which can be 10–60 seconds. Treating this as the runtime is the single most common torch.compile mistake. Always warm up before benchmarking.
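A benchmarking helper that separates warmup from measurement makes this mistake hard to repeat. The sketch below is framework-agnostic: fn is any zero-argument callable, and sync is an optional callable that blocks until device work finishes (for CUDA that would be torch.cuda.synchronize). FakeCompiledFn is a made-up stand-in that simulates a one-off compile delay.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=20, sync=None):
    """Median seconds per call, measured only after `warmup` untimed calls."""
    for _ in range(warmup):          # compilation and autotuning happen here
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:         # without a sync, async launches return too early
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


class FakeCompiledFn:
    """Stand-in for a compiled model: the first call pays a one-off 'compile' delay."""

    def __init__(self, compile_seconds=0.05):
        self.compile_seconds = compile_seconds
        self.compiled = False

    def __call__(self):
        if not self.compiled:
            time.sleep(self.compile_seconds)
            self.compiled = True
        return sum(range(1000))


cold = benchmark(FakeCompiledFn(), warmup=0, iters=1)   # includes the compile delay
warm = benchmark(FakeCompiledFn(), warmup=1, iters=20)  # steady state only
```

Reporting cold and warm as separate numbers keeps startup cost visible without letting it contaminate the throughput figure.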
Pitfall 2: dynamic shapes with mode="reduce-overhead"
reduce-overhead mode uses CUDA Graphs, which require stable shapes. If shapes vary, every new shape forces a recompile and a new capture, or the run fails outright. Use mode="default" for dynamic shapes, or pre-bucket shapes if you need graph capture.
Pitfall 3: graph breaks in the hot path
A print() statement, a .item() call, or a Python conditional on a tensor value inside the model's forward will cause a graph break. The break costs hundreds of microseconds and prevents fusion. Audit the hot path for these.
Pitfall 4: not clearing the Inductor cache after major changes
The Inductor cache is keyed conservatively, but corner cases exist where a stale cache is used after code changes. If compile behaviour seems off after a refactor, clear the Inductor cache directory (by default /tmp/torchinductor_<username> on Linux, or whatever TORCHINDUCTOR_CACHE_DIR points to) and recompile.
Pitfall 5: forgetting to warmup cuBLAS before graph capture
cuBLAS allocates workspace on first call. If the first call happens inside a captured graph, capture fails. Always run a representative matmul before graph capture.
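The warmup-then-capture order can be sketched with PyTorch's torch.cuda.CUDAGraph and torch.cuda.graph APIs. This is a minimal sketch rather than what any serving stack actually ships: the function names are my own, it assumes an inference-only model with a fixed input shape, and it needs a CUDA device to run.

```python
def capture_decode_graph(model, static_input, warmup_iters=3):
    """Warm up outside capture, then record one forward pass into a CUDA Graph."""
    import torch  # imported lazily so the module loads on machines without torch

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA Graph capture needs a CUDA device")

    # Warmup on a side stream: lazy initialisation (cuBLAS workspaces,
    # autotuning, allocator growth) happens here, not inside the capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(warmup_iters):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)
    return graph, static_output


def replay(graph, static_input, static_output, new_input):
    """Copy new data into the captured input buffer, replay, and return a copy of the result."""
    static_input.copy_(new_input)   # the graph always reads from the same address
    graph.replay()
    return static_output.clone()
```

The captured graph is tied to the exact tensors it saw, so new data is always copied into static_input and results are read from static_output.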
Pitfall 6: mixing eager and compiled forward passes
If you sometimes call the model in eager mode and sometimes via compile, you're effectively re-compiling between modes. For consistent performance, commit to one mode after a baseline.
Pitfall 7: ignoring the recompile log
TORCH_LOGS=+recompiles shows every recompile. If you see recompiles per iteration, your shapes are unstable. Fix the upstream shape inconsistency rather than trying to make compile tolerate it.
Pitfall 8: assuming compile always helps
For workloads that are kernel-bound (matmul-dominated), compile's improvement is small. The big wins are in launch-bound workloads. If your model is compute-saturated, compile won't help much.
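A toy model shows why the win depends on which side is the bottleneck. It assumes launches are asynchronous, so each kernel costs roughly the slower of CPU dispatch and GPU execution; the microsecond figures are illustrative assumptions, not measurements.

```python
def per_kernel_time_us(dispatch_us, kernel_us):
    """Steady-state cost of one kernel: the slower of CPU dispatch and GPU execution."""
    return max(dispatch_us, kernel_us)


def dispatch_speedup(dispatch_us, kernel_us, reduced_dispatch_us):
    """Speedup from cutting per-kernel dispatch cost (graphs, fewer launches)."""
    before = per_kernel_time_us(dispatch_us, kernel_us)
    after = per_kernel_time_us(reduced_dispatch_us, kernel_us)
    return before / after


# Illustrative numbers only (assumptions, not measurements):
decode = dispatch_speedup(dispatch_us=10.0, kernel_us=4.0, reduced_dispatch_us=1.0)
prefill = dispatch_speedup(dispatch_us=10.0, kernel_us=200.0, reduced_dispatch_us=1.0)
print(f"decode-like:  {decode:.2f}x")   # → decode-like:  2.50x
print(f"prefill-like: {prefill:.2f}x")  # → prefill-like: 1.00x
```

When kernels already run longer than the dispatch cost, removing dispatch changes nothing, which is the compute-saturated case described above.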
Pitfall 9: ignoring numerical differences
Compile-generated kernels may differ in floating-point order from eager kernels. Outputs can differ by a few ULP. For most uses this doesn't matter; for high-precision regression tests it can. Adjust test tolerances or pin a kernel selection.
Pitfall 10: not version-pinning in production
PyTorch and Triton updates can change compile behaviour. Pin versions for production and validate that compile behaviour is preserved on upgrade.
Pitfall 11: trying to compile training code with many in-place ops
In-place operations (x.add_(y), x += y) can interact poorly with compile's autograd handling. If you see autograd errors after enabling compile, audit for in-place ops.
Pitfall 12: capturing graphs with allocator-pool collisions
If two graphs are captured with overlapping allocator pools, replay can corrupt memory. The fix is to use separate pools (PyTorch handles this automatically in most cases, but custom CUDA stream usage can break it).
Pitfall 13: deploying compile to production without disk persistence
The Inductor cache lives on disk. If your production environment has ephemeral containers without persistent storage, every container restart pays the full compile cost. Mount a persistent volume or pre-build artifacts.
Pitfall 14: assuming CUDA Graphs work with multi-stream code
Custom CUDA streams interact subtly with graph capture. The PyTorch defaults work; deviating from them requires careful capture configuration. When in doubt, stick to the single-stream default.
Pitfall 15: profiling without TORCH_LOGS
The default profiler output is hard to read for compiled code because operations are fused. Combine nsys with TORCH_LOGS=+inductor,+dynamo for actionable profiling output.
These fifteen pitfalls together account for the majority of "compile / graphs aren't working" support requests. Working through them is most of the practical learning curve for new adopters.
The bottom line
The launch-overhead tax is what makes a fast GPU look idle on decode. The single biggest lever is using CUDA Graphs and torch.compile together: graphs strip dispatch cost to near zero, compile shrinks the kernel count it's stripping, and the combined decode speedup at production batch sizes is 2–3× on Hopper-class hardware. Either one alone leaves most of the win on the table.
- Decode is launch-bound; prefill is compute-bound. Optimize them as different workloads.
- Use torch.compile(mode="reduce-overhead") as the default starting point. It pulls in graphs automatically.
- Bucketed shapes are the price of admission. Pin a small set, pre-compile, recompile on misses.
- FlashAttention is orthogonal and additive — never compete with it, always pair with it.
- AOTInductor lets you ship a compiled binary so production startup isn't paying compile time on every restart.
For the kernel layer underneath the compiler, see Triton kernel primer. For the bandwidth side of decode this combination unblocks, see quantization tradeoffs and KV cache.
FAQ
Do I need both graphs and compile, or just one? For best results, both. For simple deployments, CUDA graphs alone capture most of the dispatch-overhead win. Compile adds kernel fusion on top.
Does this work for training? Yes, but with smaller relative wins. Training does prefill-shaped passes on large batches, where launch overhead is small relative to compute. Compile + graphs help by 10-20% in training; 100-200% in decode.
Can I capture a graph that handles variable batch size? Not directly. Workaround: capture multiple graphs for different batch sizes and route based on incoming traffic. Pad batches up to the nearest captured size.
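The routing half of that workaround is plain bookkeeping. The sketch below assumes a hypothetical set of captured batch sizes and pads a request list up to the smallest bucket that fits; the graph lookup and replay themselves are left out.

```python
import bisect

CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 32]  # hypothetical bucket set, sorted ascending


def pick_bucket(batch_size, buckets=CAPTURED_BATCH_SIZES):
    """Smallest captured batch size that fits, or None if the batch exceeds every bucket."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    i = bisect.bisect_left(buckets, batch_size)
    return buckets[i] if i < len(buckets) else None


def pad_batch(requests, pad_item, buckets=CAPTURED_BATCH_SIZES):
    """Pad a list of requests up to the chosen bucket; returns (padded, real_count)."""
    bucket = pick_bucket(len(requests), buckets)
    if bucket is None:
        raise ValueError("batch larger than the largest captured graph; split it or run eager")
    return requests + [pad_item] * (bucket - len(requests)), len(requests)
```

The returned real_count lets the caller discard the padded rows after replay. Denser buckets waste less compute on padding but cost more capture time and memory at startup.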
What if my model uses dynamic control flow (e.g., early exit)? CUDA graphs don't handle data-dependent branches. Options: capture multiple graphs for each branch, or use a hybrid path (eager for the branch-decision, graph for the rest).
Is torch.compile production-ready? For inference: yes, broadly. The Inductor backend is mature. For exotic models or unusual ops, expect debugging.
How long does compilation take? Seconds for small models. Minutes for 70B-class. Hours for some extreme cases. Cached afterward.
Does this work with custom kernels? Yes. Custom Triton kernels can be called from compiled paths; CUDA graphs capture them like any other kernel.
What about the JIT in TensorRT? TensorRT is NVIDIA's commercial inference compiler. It does more aggressive optimization than torch.compile but is NVIDIA-only and has a steeper learning curve. For NVIDIA-only deployments at scale, it is often worth using.
Does this matter on AMD GPUs? The same principles apply. ROCm has its equivalents (hipGraph as the HIP counterpart of CUDA Graphs; torch.compile with Triton on ROCm for fusion). Kernel-launch overhead exists everywhere.
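Should I use FlashAttention 2 or 3? FA-3 on Hopper, FA-2 on Ampere (A100) or older. FA-3 relies on Hopper-specific instructions, so on Blackwell the serving stack selects a Blackwell-targeted attention kernel instead. Major serving stacks pick automatically based on hardware; you rarely choose manually. Training has the same rule.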
What's the difference between torch.compile and AOTInductor?
torch.compile is JIT — compiles at first run, caches. AOTInductor compiles offline into a .so you ship and load without Python. Use AOTInductor when cold-start latency matters or when you can't run a Python interpreter at serving time.
When should I write my own Triton kernel? When profiling shows a specific hot path that Inductor isn't fusing well, and the workload is high-value enough to justify engineering. For most production stacks: don't bother. For frontier serving, hand-tuned kernels on the 1–2 hottest paths can recover another 10–30%.
What is ThunderKittens and is it production-ready? A research C++ DSL for Hopper attention kernels from Stanford Hazy Research. It produces FlashAttention-3-class kernels in ~100 lines. Used in research and a few high-end production stacks; not a default. Worth watching.
Does TensorRT-LLM replace torch.compile? For NVIDIA-only deployments at hyperscale, often yes. TensorRT-LLM does more aggressive optimization but requires engine builds per shape and is NVIDIA-locked. For multi-vendor or research use, torch.compile is more flexible.
How does this interact with quantization? Compilation and graph capture work on whatever precision the model runs in. FP8 and INT4 inference paths benefit just as much (often more, because the smaller kernels are even more launch-overhead-sensitive). See quantization tradeoffs.
Why does my torch.compile slow down on every new batch size?
Recompilation churn from shape changes. Either bucket inputs to a small set of shapes or use dynamic=True / mark_dynamic so Inductor emits dynamic-shape code. Check TORCH_LOGS=recompiles to confirm.
Should I pre-warm CUDA Graphs before serving traffic? Yes, always. Cold CUDA Graphs add 100-200 ms of capture time to the first request that hits each shape. Pre-warming the capture during startup eliminates this from production latency. Most serving stacks do this automatically; if yours does not, run synthetic requests at each bucketed shape during health-check warmup.
How do CUDA Graphs interact with multi-stream execution? Capture begins on a single stream. Work on other streams becomes part of the graph only if those streams fork from the capturing stream and join back to it through events before capture ends. Multi-stream patterns (overlapping compute with data movement) therefore need explicit setup. Most LLM inference uses a single primary stream because the work is sequential per token, so this rarely matters.
Why isn't my compile fusing the attention? Attention kernels (FlashAttention) are already custom CUDA/Triton kernels that bypass the standard PyTorch operator path. They show up to Inductor as black-box function calls and are not fused with surrounding operations. This is correct — FlashAttention's internal optimization is much better than any fusion Inductor would discover. The lack of fusion around attention is by design.
Does this work for vision models and ViTs? Yes. The same techniques apply: CUDA Graphs for dispatch overhead, torch.compile for fusion. Vision models often have heavier element-wise post-processing (color space, normalization, attention pooling) where compile-time fusion wins are larger relative to total compute. See our multimodal serving guide.
What's the impact on autoscaling? Cold start gets worse: JIT compilation adds 30-180 seconds of warm-up before a new instance can serve traffic. AOTInductor solves this by shipping pre-compiled artifacts. For autoscale-heavy deployments where instances start frequently, AOTInductor is the right answer; for stable long-running deployments, JIT is fine because the startup cost is paid once.
How does this interact with multi-tenant LoRA serving? LoRA adapters change the linear layers' weights at request time. CUDA Graphs assume fixed weights; swapping adapters between requests requires either re-capture (slow) or a graph design that includes the adapter merge as a fused operation (fast). Production multi-LoRA stacks (Punica, vLLM's LoRA path) implement this via dedicated adapter-aware kernels. See multi-tenant LoRA serving.
Can I use torch.compile on AMD ROCm or Intel GPUs? Yes, with caveats. ROCm has Triton support and Inductor targets it; performance varies by kernel pattern but the common LLM ones work. Intel GPUs are supported through Inductor's Triton path on a SYCL-based runtime, which is less mature. The non-NVIDIA paths have closed most of the gap by 2026 but still trail NVIDIA on the long tail of kernel patterns.
What if my model has data-dependent control flow that I cannot avoid?
Use torch.compile with the default fullgraph=False, which tolerates graph breaks in every mode. Each break causes a fall-back to eager for that operation, which loses some optimization but does not fail. For CUDA Graphs, capture the deterministic regions of the model and use eager mode for the branching parts; the typical result is 70-80% of the all-graph win with much simpler engineering.
Is there a downside to AOTInductor I should know about?
Two: (1) the compiled .so is shape-bucketed and you ship N artifacts; if you forget a shape, it falls back to eager or fails. (2) AOTInductor compiles work that JIT could have skipped (e.g., dead branches in your model). The .so is sometimes larger than the equivalent JIT'd in-memory artifacts. Neither is a dealbreaker; just budget for both.
How does TensorRT-LLM compare to torch.compile in performance? TRT-LLM is typically 20-50% faster than torch.compile + CUDA Graphs on the same hardware for typical LLM decode, due to more aggressive kernel selection, fused attention with custom kernels, and FP8 paths that PyTorch's stack does not fully exploit. The gap has narrowed steadily; in 2024 it was 2-3×, by 2026 it is in the tens of percent. For NVIDIA-only deployments with high stability, TRT-LLM remains the throughput leader.
What is the future of this stack? The trend is consolidation: torch.compile becoming the default, AOTInductor taking the production-deployment slot, CUDA Graphs handled implicitly inside compile. Triton continuing as the kernel-writing language. Vendor-specific paths (TRT-LLM) closing the performance gap with the open ecosystem. By 2027, expect "PyTorch nightly + compile + AOT" to be the default production path with TRT-LLM reserved for the highest-scale NVIDIA deployments.
Does mark_dynamic eliminate all recompiles?
No. It tells Dynamo a specific dim is symbolic from the first trace, so you avoid one round of specialize-then-generalize. But other shape changes (rank, dtype, stride pattern) still trigger recompilation. The advice is to mark batch and sequence dims dynamic, leave head and feature dims static, and watch TORCH_LOGS=recompiles during initial deployment.
What is the difference between Dynamo, AOTAutograd, and Inductor in practice? Dynamo is the Python-bytecode tracer that produces the FX graph. AOTAutograd functionalizes the graph, captures the backward, and decomposes high-level ops. Inductor lowers the result to Triton or C++ code. When you see compile errors, the stage is usually identifiable from the traceback: Dynamo errors mention "graph break" or bytecode opcodes; AOTAutograd errors mention "decomposition" or "functionalization"; Inductor errors mention Triton or codegen failures.
Should I use fullgraph=True?
fullgraph=True makes Dynamo error on any graph break instead of falling back to eager. For development it is useful — it surfaces hidden graph breaks that silently degrade performance. For production it depends; if your model has a hard-to-eliminate break (a custom op, a NumPy interop), fullgraph=True prevents the model from running at all. The practical pattern: use fullgraph=True in CI to catch regressions, leave it off in production.
How do I share the Inductor cache across containers in a CI/CD pipeline?
Point TORCHINDUCTOR_CACHE_DIR at a persistent volume shared across CI runs (the default location is /tmp/torchinductor_<username>, which usually does not survive a container restart), or pre-populate the cache once on a representative GPU and bake the result into the container image. The cache directory is typically a few hundred MB for a 70B model with full bucketing, which is manageable for container layers.
What's the relationship between FlexAttention and FlashAttention? FlexAttention is a PyTorch 2.5+ abstraction that lets you express custom attention masks declaratively (per-token bias functions, block-sparse patterns) and have Inductor generate a fast Triton kernel for that exact pattern. FlashAttention is a hand-written kernel for the standard attention pattern; faster than anything Inductor can generate today for the standard case. FlexAttention is the right choice when the pattern is non-standard; FlashAttention when it is standard.
Why are my compile times so long on a 70B model?
The dominant cost is usually autotuning Triton kernels. With max-autotune enabled (mode="max-autotune" or TORCHINDUCTOR_MAX_AUTOTUNE=1), Inductor tries 4-12 configurations per kernel, benchmarks each, and picks the winner. For a 70B model with ~700 distinct kernels, this can take 60-180 seconds. Leaving max-autotune off (TORCHINDUCTOR_MAX_AUTOTUNE=0, the default) uses default configs, cutting compile to 10-20 seconds at a 5-15% runtime cost. The cache means you only pay this once per (model, GPU, PyTorch version).
Do CUDA Graphs work with MoE expert dispatch? Yes, but the all-to-all collectives that implement expert dispatch must be captured inside the graph. This requires NCCL initialization before capture and synchronization on the collective stream. vLLM and SGLang both support this for DeepSeek-V3 and similar MoE models. Hand-rolled stacks need to replicate the NCCL-aware capture pattern carefully.
What's the impact of torch.compile on autograd? For training, torch.compile traces both forward and backward together (via AOTAutograd) and fuses across the forward-backward boundary. The wins are smaller than for inference (10-20% for training vs 100-200% for decode) because training is compute-bound rather than launch-bound, but they exist. The main caveat is gradient correctness — verify on a small validation step that gradients match eager mode within numerical tolerance.
How do I debug "compile makes my model slower"?
First check TORCH_LOGS=recompiles for recompilation churn. Second check torch._dynamo.explain(model)(*inputs) for graph breaks. Third profile with Nsight Systems and look for unexpectedly large kernels (autotune may have picked a bad config). Fourth, try mode="default" vs "reduce-overhead" vs "max-autotune" to see if a different aggressiveness setting helps. Fifth, disable compile on specific submodules with @torch.compiler.disable to isolate the regression.
Can I use torch.compile in a Jupyter notebook? Yes, with the caveat that re-running cells often triggers recompilation because Dynamo treats each invocation as potentially having different module identities. For development this is fine; for benchmarking inside a notebook, expect inflated times on the first cell run and use the cached run for measurements.
What is TORCHINDUCTOR_CACHE_DIR used for?
It overrides the default location of the Inductor compiled-kernel cache (a torchinductor_<username> directory under the system temp directory). Use it to share caches across containers (mount the directory), to isolate caches per environment (different paths for dev vs prod), or to relocate the cache to faster storage. Setting it to a tmpfs-backed path is a common pattern for ephemeral CI runners that want fast first-build performance without persistence.
How do I AOT compile for multiple shapes?
Call torch._inductor.aot_compile once per shape, producing N .so files. At deployment, ship all of them and dispatch at runtime based on input shape. Some teams build a wrapper Python module that lazy-loads the right .so per request; vLLM has a similar pattern under development for production AOT serving.
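The lazy-loading wrapper mentioned above can be sketched independently of the loading API. In this sketch the class name, the shape-to-path mapping, and the injected loader callable are all hypothetical; in a real deployment the loader would wrap whichever AOTInductor loading call your PyTorch version provides.

```python
class ArtifactRouter:
    """Lazily load one compiled artifact per input shape and dispatch to it."""

    def __init__(self, paths_by_shape, loader):
        # paths_by_shape: {(batch, seq_len): "path/to/artifact.so"}  (hypothetical layout)
        # loader: callable(path) -> callable(inputs), injected by the caller
        self._paths = dict(paths_by_shape)
        self._loader = loader
        self._loaded = {}

    def __call__(self, shape, inputs):
        shape = tuple(shape)
        if shape not in self._paths:
            raise KeyError(f"no artifact compiled for shape {shape}")
        if shape not in self._loaded:   # first request for this shape: load once
            self._loaded[shape] = self._loader(self._paths[shape])
        return self._loaded[shape](inputs)
```

Raising on an unknown shape makes a forgotten bucket fail loudly in testing; a production wrapper might instead pad to the nearest compiled shape.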
Why does torch.compile sometimes silently fall back to eager?
Dynamo encounters a Python construct it can't trace and aborts the trace at that point. The compiled segment runs, the un-traceable code runs in eager, and then compilation may resume after. Each fallback is a graph break. The runtime survives but the speedup is reduced. Set TORCH_LOGS=+dynamo to see graph breaks. Common causes: .item() calls, data-dependent control flow, third-party C extensions, certain in-place ops, Python printing in the hot path.
How does torch.compile interact with FSDP2?
Better than with FSDP1. FSDP2 (introduced in PyTorch 2.4) uses per-parameter sharding, which is easier to compile across. Compile + FSDP2 works in most cases; the typical speedup is 10–20% over eager FSDP2. For FSDP1, compile support is partial and the gain is smaller; the recommended migration path is to move to FSDP2 if you're starting with compile.
Can I share compile artifacts across machines?
Yes, with caveats. The Inductor cache (TORCHINDUCTOR_CACHE_DIR) is keyed on PyTorch version, CUDA version, GPU SM version, and code hashes. Machines with the same versions can share the cache. Cross-GPU-generation sharing (H100 → B200) requires recompilation since SM version differs.
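One way to keep mismatched machines from sharing a directory is to derive the cache path from the same version fields. The layout below is a hypothetical convention for a shared volume, not Inductor's internal key; Inductor still applies its own keys inside whichever directory TORCHINDUCTOR_CACHE_DIR names.

```python
import hashlib
import os


def shared_cache_dir(root, torch_version, cuda_version, sm_version):
    """Directory path that is identical only for machines with matching versions."""
    key = f"torch={torch_version}|cuda={cuda_version}|sm={sm_version}"
    digest = hashlib.sha256(key.encode()).hexdigest()[:12]
    return os.path.join(root, f"inductor-{digest}")


def cache_env(root, torch_version, cuda_version, sm_version, base=None):
    """Environment for a worker with TORCHINDUCTOR_CACHE_DIR pointed at the shared dir."""
    env = dict(os.environ if base is None else base)
    env["TORCHINDUCTOR_CACHE_DIR"] = shared_cache_dir(
        root, torch_version, cuda_version, sm_version)
    return env


h100_a = shared_cache_dir("/mnt/cache", "2.8.0", "12.8", "sm_90")
h100_b = shared_cache_dir("/mnt/cache", "2.8.0", "12.8", "sm_90")
b200 = shared_cache_dir("/mnt/cache", "2.8.0", "12.8", "sm_100")
```

Two H100 replicas on the same versions resolve to the same directory, while a B200 replica gets its own and recompiles there.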
What's the difference between torch.compile and torch.jit (TorchScript)?
TorchScript was PyTorch's earlier (2018-era) attempt at ahead-of-time compilation. It used a separate scripting language and frequently required code modifications. torch.compile (introduced in PyTorch 2.0, 2023) is more permissive: it traces native Python code without requiring rewrites. TorchScript is deprecated in 2026; torch.compile is the current path. AOTInductor is the production replacement for TorchScript-style standalone artifacts.
Why does CUDA graph capture sometimes fail with cuBLAS errors?
cuBLAS uses workspace memory that may be allocated on first use. If the workspace isn't ready before graph capture, capture fails. The fix is to warm up cuBLAS before capture: run a few representative matmuls outside the captured region. Many production stacks (vLLM, SGLang) do this automatically.
Can I use torch.compile with quantised models (INT8, FP8)?
Yes, with caveats. FP8 support is mature on H100/B200 with PyTorch 2.5+. INT8 and INT4 quantisation support is partial: basic patterns work, exotic ones may fall back to eager. The torchao library provides quantisation primitives that are compile-friendly. For production INT4 inference (Marlin-style), compile + custom kernels is often the right pattern.
What does it mean when Inductor says "fallback to ATen op"?
Inductor couldn't generate a Triton kernel for a specific operation, so it called the ATen (PyTorch C++) implementation. This is usually fine for performance but means that operation isn't fused with its neighbours. If you see many ATen fallbacks in your hot path, consider whether you can rewrite using ops Inductor supports better.
How do I know if my compile cache is being hit?
Set TORCH_LOGS=+inductor and look for cache-hit messages. The first run of a model populates the cache; subsequent runs (same code, shapes, versions) hit it. Cache hits skip the multi-second compilation step. In production, ensure the cache directory persists across container restarts.
Is torch.compile production-ready for serving?
Yes, as of mid-2026 it's used in production by vLLM, SGLang, Modal, Together AI, and many others. The combination of torch.compile + CUDA Graphs is the standard production pattern for LLM decode. Caveats: dynamic shape handling has edge cases, debugging is harder than in eager mode, and the cold-start cost requires warmup discipline.
What's the overhead of graph capture itself?
Capturing a single CUDA graph: hundreds of milliseconds to seconds depending on the graph's complexity. Replaying a captured graph: microseconds of dispatch overhead. The trade-off favours capture-and-replay for any workload where you'll run the same shape more than a handful of times. For one-shot workloads with constantly-changing shapes, capture overhead exceeds the speedup.
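The trade-off reduces to simple arithmetic: capture pays off once the accumulated per-step saving exceeds the one-time capture cost. The function below is a sketch of that calculation; the name `break_even_replays` and the timings in the usage note are illustrative assumptions, not measurements.

```python
import math


def break_even_replays(capture_us: float, eager_step_us: float, replay_step_us: float) -> float:
    """Number of replays after which a captured graph has paid for itself.

    All arguments are in microseconds. Returns math.inf when replay is not
    faster than eager, because capture then never pays off.
    """
    saving_per_step = eager_step_us - replay_step_us
    if saving_per_step <= 0:
        return math.inf
    return math.ceil(capture_us / saving_per_step)
```

With an assumed 300 ms capture, a 20 ms eager step, and a 5 ms replayed step, `break_even_replays(300_000, 20_000, 5_000)` returns 20: the graph is ahead from the twentieth replay of that shape onward.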
Can I use multiple CUDA graphs in one process?
Yes. vLLM captures one graph per batch-size bucket (typically 8 sizes), each holding KV-cache references. Switching between them is fast (microseconds). The memory cost is one allocator pool per graph; on H100 with 80GB, this is usually negligible.
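Selecting among per-bucket graphs can be sketched as a lookup for the smallest bucket that fits the incoming batch, with the batch padded up to that size. The bucket list and the name `pick_bucket` below are hypothetical; real stacks choose their own sizes and padding policy.

```python
from bisect import bisect_left
from typing import Optional, Sequence

# Hypothetical bucket sizes, one captured graph each.
CAPTURED_BATCH_SIZES = (1, 2, 4, 8, 16, 32, 64, 128)


def pick_bucket(
    batch_size: int,
    buckets: Sequence[int] = CAPTURED_BATCH_SIZES,
) -> Optional[int]:
    """Return the smallest captured bucket that can hold `batch_size`.

    `buckets` must be sorted ascending. None means no captured graph is
    large enough, so the caller should run that batch without a graph.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    i = bisect_left(buckets, batch_size)
    return buckets[i] if i < len(buckets) else None
```

A batch of 3 maps to the size-4 graph, so one padded slot is wasted; denser buckets waste less compute but cost more capture time and memory.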
What's a "memory allocator pool" in the CUDA graph context?
A CUDA graph's memory allocations must be stable: the addresses captured at capture time must still be valid at replay time. PyTorch achieves this by giving each captured graph its own allocator pool, separate from the main allocator. This isolates the graph's memory from re-use by other ops. The cost: more memory fragmentation across many graphs.
Does compile help with attention kernels specifically?
Marginally for FlashAttention-class kernels (already hand-optimised in CUDA / Triton). More for surrounding ops (softmax, masking, rotary embedding application). The biggest compile win in transformers is fusing pre- and post-attention operations into single kernels, not the attention compute itself.
What's the relationship between compile and TensorRT?
Different products solving overlapping problems. torch.compile is in-PyTorch, easier to use, and supports more PyTorch operators. TensorRT (specifically TensorRT-LLM for LLMs) is a separate engine that takes a model and produces an optimised binary; it often produces faster kernels but requires a model export step and supports fewer ops. The 2026 industry pattern: torch.compile for dev and many production deployments; TensorRT-LLM for the last 10–20% of throughput at scale.
Can torch.compile reduce memory usage?
Sometimes. By fusing kernels and avoiding intermediate materialisation, compile can reduce peak memory by 10–30% in some workloads. The reductions are most visible in models with many small ops (transformer with many small intermediates). For matmul-dominated models, memory savings are smaller.
How do I debug a model that's slower under torch.compile?
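First check that you're measuring after warmup (5–10 iterations), so compilation time isn't counted. Then look for graph breaks (TORCH_LOGS=+dynamo) and for recompilations on every iteration (TORCH_LOGS=recompiles). Then profile with nsys and compare the kernel timelines of the eager and compiled runs. Common culprits: graph breaks, recompiles per iteration, or a workload that's already GPU-bound rather than CPU-bound, which leaves little launch and Python overhead for compile to remove.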
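The first step, measuring only after warmup, can be sketched as a small timing harness. `time_after_warmup` is a hypothetical helper; the optional `sync` argument is where a GPU barrier such as torch.cuda.synchronize would be passed so that asynchronous kernels are included in the measurement.

```python
import statistics
import time
from typing import Callable, Optional


def time_after_warmup(
    step: Callable[[], object],
    warmup: int = 10,
    iters: int = 50,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Median seconds per call of `step`, ignoring the first `warmup` calls.

    `sync` is an optional barrier invoked after the warmup loop and after
    every timed call, so pending asynchronous work is finished before the
    clock is read.
    """
    def barrier() -> None:
        if sync is not None:
            sync()

    for _ in range(warmup):  # compilation and cache population happen here
        step()
    barrier()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        barrier()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Running the same harness over the eager and the compiled callable gives a like-for-like comparison. The median is used here instead of the mean so that a single late recompile does not dominate the result.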
What happens when PyTorch is upgraded? Does the compile cache survive?
The cache is keyed on PyTorch version, so an upgrade invalidates it. All compilations re-run on first use after the upgrade. This is a meaningful production consideration: schedule warmup runs after deployments.
Glossary
- CUDA graph — captured sequence of GPU operations replayable with low overhead.
- Dispatch overhead — CPU-side cost of launching a kernel.
- Eager mode — PyTorch's default execution mode; kernels launched one at a time.
- FX graph — PyTorch's intermediate representation used by Dynamo and Inductor; a directed acyclic graph of operations.
- Graph break — when Dynamo cannot trace a piece of code (data-dependent control flow, opaque library call) and falls back to eager mode for that segment. Frequent breaks defeat compilation.
- Inductor — torch.compile's default compilation backend, generates Triton kernels.
- Kernel fusion — combining multiple operations into one kernel.
- Paged attention — KV cache organized in fixed-size pages.
- Shape specialization — compiling or capturing for a specific input size.
- Trace — recorded sequence of operations the compiler analyzes.
- TMA — Tensor Memory Accelerator, a Hopper feature for async bulk HBM↔SRAM transfers.
- WGMMA — Warp Group Matrix Multiply Accumulate, Hopper's async tensor core matmul instruction.
- TorchDynamo — torch.compile's tracing frontend.
- Triton — GPU kernel programming language used as backend by Inductor.
References
- PagedAttention / vLLM — Kwon et al., 2023. "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv:2309.06180. Paged KV plus CUDA graphs is the foundational pattern.
- FlashAttention — Dao et al., 2022. arXiv:2205.14135. The dominant kernel-fusion success story.
- Triton — Tillet, Kung, Cox, 2019. "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations." MAPL 2019. See triton-lang.org.
- NVIDIA CUDA Graphs documentation — see CUDA C Programming Guide, "CUDA Graphs" section.
- PyTorch torch.compile / TorchInductor — see PyTorch 2.x release notes and the Inductor RFC at github.com/pytorch/pytorch.
- TensorRT-LLM documentation — NVIDIA's serving stack, deeply integrated graph capture.
- NSight Systems — NVIDIA's profiler; the primary tool for diagnosing launch-bound workloads.
- "Getting Started with CUDA Graphs" — NVIDIA Developer Blog, 2019. The canonical introduction explaining capture/replay and instantiation costs. See developer.nvidia.com/blog/cuda-graphs/.
- TorchDynamo and TorchInductor design — Ansel et al., 2024. "PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation." ASPLOS 2024. The reference paper for the modern torch.compile stack.
- FlashAttention — Dao, Fu, Ermon, Rudra, Ré, 2022. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." arXiv:2205.14135. The original IO-aware tiled attention algorithm.
- FlashAttention-2 — Dao, 2023. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." arXiv:2307.08691. Improved warp / block parallelism.
- FlashAttention-3 — Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao, 2024. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision." arXiv:2407.08608. Hopper-specific async tensor cores and FP8 path.
- Triton (MAPL 2019) — Tillet, Kung, Cox, 2019. "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations." ACM DL. The Python-like DSL behind Inductor.
- ThunderKittens — Spector, Arora, Singhal, Fu, Ré, 2024. hazyresearch.stanford.edu/blog/2024-05-12-tk. Minimalist C++ DSL for Hopper attention.
- PyTorch 2.0 release notes — pytorch.org/blog/pytorch-2.0-release/. Reference for torch.compile / Dynamo / Inductor.
- AOTInductor documentation — pytorch.org/docs/stable/torch.compiler_aot_inductor.html. Ahead-of-time compilation path for production serving.
- CUTLASS — NVIDIA, 2017–present. github.com/NVIDIA/cutlass. C++ template library for high-performance GEMM building blocks.