
GPU Interconnects and Rack-Scale Topology: NVLink, NVSwitch, NVL72, Topology Choices — The Complete Guide

The definitive guide to GPU interconnects in 2026: NVLink generations 3/4/5, NVSwitch chips, HGX baseboards, GB200 NVL72 rack-scale fabric, DGX SuperPOD, AMD Infinity Fabric, UALink, Ultra Ethernet — how scale-up vs scale-out works, how parallelism maps to topology, and why what fits in one rack defines what frontier AI models can be.

By Prompt20 Editorial · 110 min read

A modern GPU has terabytes per second of HBM bandwidth and hundreds to thousands of teraflops of low-precision tensor compute. Put two of them in different boxes connected by an ordinary network and the link between them is so slow they may as well be on different planets. The story of frontier AI hardware over the past five years is the story of pushing the boundary of the fast fabric outward, one generation at a time, so that bigger and bigger models can be tightly coupled. The 2024–2026 shift from 8-GPU NVSwitch islands to 72-GPU GB200 NVL72 racks is the most consequential interconnect change since NVLink itself shipped on Pascal.

The take: what fits in one fast-fabric domain defines what frontier models can be. The shift from 8-GPU NVSwitch to rack-scale NVL72 isn't just a bandwidth upgrade — it's a change in the size of "tightly coupled" that has direct consequences for what tensor-parallel and expert-parallel groups are practical, and therefore what model architectures are viable. Anyone planning a serious deployment should know which side of the fast-fabric boundary their collectives sit on, because the throughput cliff is real.

This guide is the authoritative answer to "how does GPU-to-GPU interconnect actually work in 2026?" It covers every generation of NVLink (3, 4, 5) and NVSwitch (1, 2, 3, 4), how the HGX baseboard and DGX SuperPOD reference architectures map to those chips, what GB200 NVL72 actually does at the cable level, where AMD's Infinity Fabric and the new UALink standard fit in, and how the broader Ultra Ethernet Consortium roadmap interacts with scale-up interconnect. For the inter-node companion to this guide, read AI cluster networking alongside.

Table of contents

  1. Key takeaways
  2. Mental model: GPU interconnects in one minute
  3. The interconnect landscape in 2026
  4. The bandwidth hierarchy
  5. NVLink: the basic fast link
  6. NVSwitch: the in-node crossbar
  7. Scale-up vs scale-out
  8. Rack-scale fabrics (NVL72 and friends)
  9. Mapping parallelism to topology
  10. Collectives and bandwidth profiles
  11. Inter-node fabrics: InfiniBand vs Ethernet
  12. AMD Infinity Fabric and others
  13. Topology failure modes
  14. Why this defines what AI can be
  15. Production deployments
  16. GB200 NVL72 deep dive
  17. Multi-rack scaling: beyond NVL72
  18. Cross-vendor: AMD, UALink, Ultra Ethernet, the future
  19. Power, cooling, and physical constraints of rack-scale
  20. Failure isolation and the blast radius of rack-scale
  21. Sizing exercise: parallelism layout for a 405B-parameter run
  22. NVLink generations 3/4/5: lane-level numbers
  23. NVSwitch generations 1/2/3/4: silicon, ports, SHARP
  24. GH200, HGX H100/H200, B200, GB200 SuperPod compared
  25. Liquid cooling: CDU, rear-door HX, direct-to-chip
  26. Cabling and reliability: cabled vs PCB-embedded NVLink
  27. TPU pods, Cerebras WSE-3, and non-NVIDIA scale-up
  28. Collective performance with topology: ring, tree, hierarchical, SHARP
  29. Expert parallelism and pipeline parallelism across racks
  30. NCCL on NVLink: protocols, channels, what to tune
  31. Benchmark numbers: real measurements from NVL72
  32. NVL72 day-2 operations
  33. Frontier deployments: Meta Llama-3, xAI Colossus, Microsoft, Stargate
  34. Topology-aware scheduling: Slurm, Kubernetes, MPI
  35. Scale-up vs scale-out economics
  36. SHARP in-network aggregation and the case for offload
  37. DeepSeek-V3 expert parallelism on NVL72: a case study
  38. The GB200 NVL72 hardware-engineering story
  39. NVLink-C2C and Grace-Hopper memory coherence
  40. Failure blast radius: GPU vs NVSwitch vs rack vs row
  41. Reference rack design: power, cooling, networking, ops
  42. Cross-DC training: when one site isn't enough
  43. TPU v5p, Trillium, and Ironwood interconnect details
  44. The bottom line
  45. FAQ
  46. Glossary
  47. References

Key takeaways

  • HBM > NVLink > PCIe ≈ InfiniBand ≈ Ethernet in per-GPU bandwidth. HBM to NVLink is a single-digit factor; NVLink to PCIe or the network is a drop of more than 10×.
  • NVLink is GPU-to-GPU at ~900 GB/s (NVLink 4) to ~1.8 TB/s (NVLink 5) aggregate per GPU; NVSwitch is the crossbar joining the 8 GPUs of a node into one fabric.
  • Rack-scale fabrics (NVL72-class) extend NVLink bandwidth to 72 GPUs in one rack. The "fast-fabric domain" is no longer a single node.
  • Tensor parallelism and expert parallelism must stay inside the fast-fabric domain — they're bandwidth-hungry.
  • Pipeline parallelism and data parallelism tolerate slower links — they cross out of the fast domain.
  • The size of one fast-fabric domain partly determines what models can run at all. A model whose tensor-parallel group exceeds the domain has to fall back to slower fabric, with substantial throughput penalty.
  • Inter-node: InfiniBand at 400 Gbps+ with RDMA is the standard. RoCE (RDMA over Ethernet) is increasingly viable.

Mental model: GPU interconnects in one minute

The problem has a name: the cross-rack collapse. The moment a collective walks off the fast fabric onto PCIe or Ethernet, per-GPU bandwidth drops 10–20×, and tensor-parallel and expert-parallel groups stop scaling. Everything else in this guide — NVSwitch generations, NVL72, UALink, parallelism mapping — is downstream of that one cliff.

The cleanest analogy is a chassis-as-CPU: an NVL72 rack is a 72-way crossbar inside one machine. NVLink 5 + NVSwitch 4 turns the rack into a single SMP-like fabric where any GPU can reach any other at ~1.8 TB/s aggregate. Once your tensor-parallel group crosses out of that chassis, it pays the InfiniBand tax — a real 50 GB/s ceiling instead of a 1.8 TB/s one.

Aspect | 8-GPU HGX node (without rack-scale) | NVL72 rack (with rack-scale)
Fast-fabric domain | 8 GPUs | 72 GPUs
Aggregate NVLink in domain | ~7.2 TB/s | ~130 TB/s
HBM in domain | ~640 GB (H100), ~1.1 TB (H200) | ~13.5 TB
Practical TP group size | ≤8 | up to 72
Expert-parallel reach | one node | one rack
MoE all-to-all bandwidth | InfiniBand-bound | NVLink-bound

In NCCL terms, the production one-liner is: keep NCCL_ALGO=NVLS collectives inside the fast-fabric domain and let the slow tiers carry pipeline and data-parallel gradient sync. A pseudocode sketch of parallelism placement:

# fast fabric (NVLink/NVSwitch): bandwidth-hungry
tp_group = ranks_in_same_nvlink_domain()   # ≤72 on NVL72, ≤8 on HGX
ep_group = ranks_in_same_nvlink_domain()
# slow fabric (IB / Ethernet): tolerant
pp_group = ranks_across_racks()
dp_group = all_ranks()
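
The placement sketch above can be made runnable in plain Python. This is a minimal sketch, not a framework API: it assumes ranks are numbered so that consecutive ranks share an NVLink domain (a common launcher convention, but an assumption here), and the helper names `build_groups` and `domain_of` are hypothetical.

```python
def build_groups(world_size, domain_size, tp_size):
    """Split ranks into TP groups (inside one domain) and DP groups (across domains)."""
    if world_size % domain_size or domain_size % tp_size:
        raise ValueError("sizes must divide evenly")
    ranks = list(range(world_size))
    # TP groups: consecutive ranks, so a group never crosses a domain boundary
    tp_groups = [ranks[i:i + tp_size] for i in range(0, world_size, tp_size)]
    # DP groups: ranks that hold the same TP shard, strided across the whole job
    dp_groups = [ranks[i::tp_size] for i in range(tp_size)]
    return tp_groups, dp_groups


def domain_of(rank, domain_size):
    """Index of the NVLink domain a rank lives in (assumes contiguous numbering)."""
    return rank // domain_size


# Two NVL72 racks, TP=8: every TP group stays inside one rack, DP spans both.
tp_groups, dp_groups = build_groups(world_size=144, domain_size=72, tp_size=8)
```

Because `tp_size` divides `domain_size`, the bandwidth-hungry groups land on NVLink by construction, and only the strided DP groups touch the inter-rack network.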

The sticky number to remember: NVL72 delivers ~130 TB/s of aggregate NVLink bandwidth in one rack, roughly 18× what an 8-GPU HGX H100 node provides. That single number explains why trillion-parameter MoE training moved from "research curiosity" to "production routine" in 2024–2025: the all-to-all that used to land on InfiniBand now lands on NVLink.


The interconnect landscape in 2026

The GPU interconnect market used to be a one-vendor story. It still mostly is — NVLink has no equal at the high end — but 2024–2026 has produced the first credible alternative roadmaps in a decade.

NVLink (NVIDIA, proprietary): the gold standard. Five generations have shipped; the three that matter for current fleets, plus the next one on the roadmap:

  • NVLink 3 on A100 (2020): ~600 GB/s aggregate per GPU.
  • NVLink 4 on H100 / H200 (2022–2024): ~900 GB/s aggregate per GPU.
  • NVLink 5 on B200 / GB200 / GB300 (2024–): ~1.8 TB/s aggregate per GPU.
  • NVLink 6 is on the public Rubin roadmap for 2026–2027 with a step up again.

NVSwitch (NVIDIA, proprietary): the crossbar that turns NVLink point-to-point links into a fully-connected fabric. Four generations:

  • NVSwitch 1 (Volta DGX-2): the original 16-GPU fabric.
  • NVSwitch 2 (Ampere DGX A100): 8-GPU all-to-all.
  • NVSwitch 3 (Hopper DGX H100): 8-GPU at 900 GB/s, with SHARP in-network reductions.
  • NVSwitch 4 (Blackwell GB200 NVL72): the chip that turned NVLink from a node fabric into a rack fabric — 72 GPUs in one NVLink domain.

HGX baseboards: NVIDIA's reference 8-GPU motherboard layout. Every OEM datacenter SXM system you've seen (HGX H100, HGX H200, HGX B200) is the same baseboard with different SXM modules. HGX is what makes the 8-GPU NVLink island the unit of deployment in 2022–2024-era datacenters.

GB200 NVL72: the 2024 product that broke the 8-GPU ceiling. One rack contains 18 compute trays of 4 Blackwell GPUs each (72 GPUs total) plus 9 NVSwitch trays, all wired together with NVLink 5 over a copper-cable backplane. The result is one NVLink domain holding ~13.5 TB of HBM addressable as a single fast-fabric pool. See nvidia.com/en-us/data-center/gb200-nvl72/ for the reference architecture.

DGX SuperPOD: NVIDIA's reference design that stitches NVL72 racks (or DGX H100 nodes for older deployments) into thousand-to-multi-thousand-GPU pods with InfiniBand or Spectrum-X Ethernet between racks.

AMD Infinity Fabric: AMD's GPU-to-GPU fabric on the MI300X / MI325X / MI350X family. Bandwidth per GPU is broadly competitive with the NVLink 4 generation, packaged as 8-GPU OAM platforms. The software story (ROCm + RCCL) has narrowed the gap to NCCL but still trails.

UALink (Ultra Accelerator Link): industry-standard scale-up interconnect backed by AMD, Apple, Astera Labs, AWS, Cisco, Google, HPE, Intel, Meta, Microsoft (notably not NVIDIA). The 1.0 specification published in 2025 targets up to 1024 accelerators in one scale-up domain at NVLink-class per-link bandwidth. First silicon expected 2026–2027. See ualinkconsortium.org.

Ultra Ethernet Consortium (UEC): the scale-out companion to UALink — an Ethernet-based transport designed to replace InfiniBand at the inter-rack layer, with comparable tail latency and collective behavior. UALink + UEC is the explicit "open alternative to NVLink + InfiniBand" stack. See ultraethernet.org.

Google ICI (Inter-Chip Interconnect) and AWS NeuronLink: vertically integrated alternatives. Google's TPU pods use a 3D torus ICI; AWS Trainium2 UltraServers use NeuronLink in similar topologies. These are closed ecosystems with first-party software stacks.

Quick-reference: GPU interconnects at a glance

Interconnect | Per-GPU aggregate BW | Topology | Max scale-up domain | Vendor / standard
NVLink 3 + NVSwitch 2 | ~600 GB/s | 8-GPU all-to-all | 8 GPUs (DGX A100) | NVIDIA, proprietary
NVLink 4 + NVSwitch 3 | ~900 GB/s | 8-GPU all-to-all | 8 GPUs (HGX H100/H200) | NVIDIA, proprietary
NVLink 5 + NVSwitch 4 | ~1.8 TB/s | rack-scale all-to-all | 72 GPUs (NVL72); 576 with NVL576 reference | NVIDIA, proprietary
AMD Infinity Fabric (MI300X class) | ~896 GB/s | 8-GPU OAM mesh | 8 GPUs | AMD, proprietary
UALink 1.0 | NVLink-class per link | switch-based scale-up | up to 1024 accelerators (spec) | Open consortium
Google ICI (TPU v5p / Trillium) | hundreds of GB/s per link | 3D torus | thousands of chips per pod | Google, closed
AWS NeuronLink (Trainium2 UltraServer) | hundreds of GB/s per link | switch + mesh | 64 chips per UltraServer | AWS, closed
PCIe Gen5 x16 | ~64 GB/s | point-to-point | n/a (CPU↔GPU) | PCI-SIG
InfiniBand NDR / Spectrum-X 400G | ~50 GB/s | rail/fat-tree (inter-rack) | thousands of GPUs | NVIDIA / open
Ultra Ethernet (UEC) | ~50–100 GB/s | rail/fat-tree (inter-rack) | targets IB scale | Open consortium

Two things to notice. First, the "scale-up" tier (NVLink, Infinity Fabric, UALink, ICI, NeuronLink) and the "scale-out" tier (InfiniBand, UEC) differ by roughly 20× to 35× in per-GPU bandwidth (~900 GB/s to ~1.8 TB/s versus ~50 GB/s). That gap is what makes the edge of the scale-up domain the most important number in your cluster spec. Second, with NVL72 the scale-up domain outgrew the single node and filled a whole rack for the first time in a volume NVIDIA product; UALink's 1024-accelerator ambition would, if delivered, push it across multiple racks.

For the inter-rack side of this picture (InfiniBand Quantum-2/3, RoCEv2 at Meta scale, Falcon, EFA, Ultra Ethernet topology choices), see AI cluster networking.


The bandwidth hierarchy

A rough mental model, in 2026 numbers per GPU:

Tier | Bandwidth | Used for
HBM3e | 4–8 TB/s | weights, activations on the same GPU
NVLink 5 (B200) | ~1.8 TB/s aggregate | direct GPU-to-GPU in fast fabric
NVLink 4 (H100/H200) | ~900 GB/s aggregate | direct GPU-to-GPU in fast fabric
PCIe Gen5 x16 | ~64 GB/s | CPU-to-GPU, GPU-to-NIC
InfiniBand NDR (400G) | ~50 GB/s unidirectional | inter-node
400G Ethernet | ~50 GB/s | inter-node (similar to IB on bandwidth)

The key feature: HBM and NVLink are within a single-digit factor of each other; everything else is at least 10× slower than NVLink.

Across these tiers, the slowest one a piece of data has to traverse determines the wall-clock time. A collective that has to cross PCIe sees PCIe speed regardless of how fast HBM is on either end.

This is why topology matters so much: where you place the bytes determines what speed they move at.
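
The "slowest tier wins" rule is easy to put numbers on. The sketch below is a back-of-the-envelope estimate using the approximate bandwidths from the table in this section; it ignores latency, protocol overhead, and the aggregate-versus-unidirectional distinction, and the function name `transfer_seconds` is our own.

```python
# Approximate per-GPU bandwidths from the table above, in GB/s
TIER_GB_PER_S = {
    "HBM3e": 4000,
    "NVLink 5": 1800,
    "NVLink 4": 900,
    "PCIe Gen5 x16": 64,
    "InfiniBand NDR": 50,
}


def transfer_seconds(gigabytes, path):
    """Time to move `gigabytes` along `path`, bounded by the slowest tier on it."""
    bottleneck = min(TIER_GB_PER_S[tier] for tier in path)
    return gigabytes / bottleneck


# 10 GB between two GPUs in the same NVLink domain
fast = transfer_seconds(10, ["HBM3e", "NVLink 5", "HBM3e"])
# The same 10 GB between GPUs in different nodes: out through PCIe and the NIC
slow = transfer_seconds(10, ["HBM3e", "PCIe Gen5 x16", "InfiniBand NDR",
                             "PCIe Gen5 x16", "HBM3e"])
print(f"in-domain: {fast * 1e3:.1f} ms, cross-node: {slow * 1e3:.1f} ms")
```

With these inputs the cross-node path is bounded by the 50 GB/s network hop no matter how fast the HBM at either end is.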


NVLink: the basic fast link

NVLink is NVIDIA's high-bandwidth, low-latency, point-to-point GPU interconnect. Successive generations have grown aggregate bandwidth per GPU by roughly 1.5–2× each.

  • NVLink 1 (P100): ~160 GB/s aggregate.
  • NVLink 3 (A100): ~600 GB/s aggregate.
  • NVLink 4 (H100/H200): ~900 GB/s aggregate.
  • NVLink 5 (B200): ~1.8 TB/s aggregate.

Each GPU exposes a fixed number of discrete NVLink links: 18 on H100 and B200, 12 on A100, fewer on older parts. Each link is a high-speed serial connection. The aggregate is the sum of all of them used at once.

What "GPU-to-GPU" means

NVLink directly transfers tensor data between HBM on two GPUs, bypassing CPU and system memory. From the software perspective, a tensor.to(other_gpu) call uses NVLink when available, falling back to PCIe when not.

Direct vs through-NVSwitch

NVLink alone is point-to-point. With 8 GPUs in a node there are 28 GPU pairs, so wiring every pair directly means splitting each GPU's links seven ways, and any single pair sees only a fraction of the aggregate bandwidth. NVSwitch removes that trade-off by switching every link.
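
The arithmetic behind that trade-off, as a small sketch. It assumes an H100-like GPU with 18 links at ~50 GB/s each (both directions combined, matching the ~900 GB/s aggregate quoted in this guide) and an idealized even split of links across peers; `direct_mesh` is a hypothetical helper, not a real tool.

```python
def direct_mesh(num_gpus, links_per_gpu, gb_per_s_per_link):
    """Pair count and per-pair bandwidth if GPUs were wired directly, with no switch."""
    pairs = num_gpus * (num_gpus - 1) // 2
    # Each GPU must divide its links evenly among its (num_gpus - 1) peers
    links_per_pair = links_per_gpu / (num_gpus - 1)
    return pairs, links_per_pair * gb_per_s_per_link


pairs, pair_bw = direct_mesh(num_gpus=8, links_per_gpu=18, gb_per_s_per_link=50)
switched_bw = 18 * 50  # through a switch, one pair can use all of a GPU's links
print(f"{pairs} pairs, {pair_bw:.0f} GB/s per pair direct vs {switched_bw} GB/s switched")
```

A direct mesh caps any one pair at roughly a seventh of the GPU's aggregate, while a switched fabric lets a pair burst to the full figure when other peers are idle.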


NVSwitch: the in-node crossbar

NVSwitch is a chip that routes NVLink traffic. With NVSwitch, every GPU has effectively a full-bandwidth path to every other GPU in the node — a "fully connected" topology.

A typical 8-GPU DGX-class node has all 8 GPUs connected via NVSwitch fabric. Software sees the 8 GPUs as a single tightly-coupled cluster.

Why this matters for collectives

Most distributed training collectives — all-reduce, all-gather, all-to-all — involve communication patterns where every GPU exchanges data with many or all others. Without NVSwitch, you route through intermediate GPUs, which costs bandwidth and adds latency. With NVSwitch, every pair has a direct path.

For tensor-parallel matmul, an all-reduce after the computation joins results from all participating GPUs. With NVSwitch, this all-reduce runs at full NVLink speed. Without it, ring-topology or tree-topology all-reduce algorithms have to share links, reducing effective bandwidth.

Limit: one node

Until recently, NVSwitch was a node-level fabric. 8 GPUs in one DGX-class system, one NVSwitch domain. Beyond that, you crossed into network territory.

This made the natural unit of "tight coupling" a single 8-GPU server. Parallelism strategies aligned to this.


Scale-up vs scale-out

Two ways to make a multi-GPU system:

Scale-up: tightly couple a small number of GPUs with the fastest possible interconnect. Looks like a single big GPU to the application. NVLink and NVSwitch are scale-up.

Scale-out: connect many independent systems with a slower network. Each subsystem is autonomous; the network coordinates. InfiniBand and Ethernet are scale-out.

For training the largest models, you need both:

  • Scale-up to provide enough fast-fabric coupling for the tensor-parallel and expert-parallel operations.
  • Scale-out to multiply that into thousands of GPUs.

The trade-off has shifted dramatically as scale-up domains have grown.


Rack-scale fabrics (NVL72 and friends)

The recent shift is that scale-up domains now extend across an entire rack. NVL72 is NVIDIA's banner example: 72 Blackwell GPUs in one rack, all connected via NVLink-class bandwidth, behaving like one big tightly-coupled system.

What changes

The unit of fast-fabric coupling is no longer 8 GPUs but 72 (or whatever the rack size is). This enables:

  • Tensor-parallel groups up to 72 wide, instead of 8.
  • Expert-parallel groups of 32-64 with full bandwidth between experts.
  • Larger models fitting in one fast-fabric domain.
  • Longer ring-attention contexts with low communication latency.

For training, this lets you use bigger tensor-parallel groups and fewer pipeline stages, which usually improves utilization (smaller pipeline bubbles) and simplifies scheduling. See our distributed LLM training guide for how TP, PP, DP, and FSDP compose.

For inference, this is the key enabler for MoE expert parallelism at scale and for disaggregated serving with high-bandwidth KV-cache transfer between pools.

The cost

A rack-scale system is one capital purchase, and it is priced accordingly. The economics shift from "many DGX nodes" to "fewer, very large racks."

Power, cooling, and physical infrastructure also change. NVL72 is liquid-cooled; the rack is purpose-built. Retrofitting existing data centers is non-trivial.

Other rack-scale designs

  • NVL72 (NVIDIA): 72 Blackwell GPUs, NVLink Switch System fabric.
  • NVL36 (NVIDIA): smaller variant, 36 GPUs.
  • Various AMD and custom designs are emerging with similar goals (rack-scale Infinity Fabric, custom optical interconnects).

The competition in 2026-2027 is partly about how big a single fast-fabric domain can get.


Mapping parallelism to topology

The defining constraint: which parallelism strategies live where in the topology hierarchy.

Tensor parallelism (TP)

Splits each layer's matrices across GPUs. Every forward pass requires all-reduces or all-gathers across the TP group at every layer.

  • Bandwidth-hungry: very.
  • Right placement: inside the fast-fabric domain. NVSwitch (or rack-scale NVLink) is required for high TP-degree.
  • Typical scale: TP=2 to TP=8 in node-only NVSwitch; TP=16 to TP=64 with rack-scale fabrics.

Expert parallelism (EP)

Places different MoE experts on different GPUs. Every MoE layer requires an all-to-all to dispatch tokens to experts.

  • Bandwidth-hungry: yes.
  • Right placement: inside the fast-fabric domain.
  • Typical scale: EP=8 in node-only; up to EP=64 with rack-scale.

Pipeline parallelism (PP)

Splits the model's layers across GPU groups. Each microbatch flows through pipeline stages.

  • Bandwidth needs: modest. Activations cross stage boundaries; not as much data as TP.
  • Right placement: across fast-fabric domains. PP can span multiple racks.
  • Cost: pipeline bubbles (idle time while the pipeline fills and drains, which grows with the number of stages).
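
To put a number on the bubble cost, the sketch below uses the standard estimate for a simple synchronous (GPipe-style) schedule: with p stages and m microbatches, the idle fraction is (p − 1) / (m + p − 1). That formula is an addition here, not something this guide derives, and interleaved schedules do better.

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction of a simple synchronous pipeline schedule."""
    return (stages - 1) / (microbatches + stages - 1)


# Same 32 microbatches, different pipeline depths
for stages in (2, 8, 16):
    print(f"PP={stages:>2}: {bubble_fraction(stages, 32):.1%} of step time is bubble")
```

This is why trading pipeline stages for a wider TP group, when the fast-fabric domain allows it, tends to recover utilization.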

Data parallelism (DP)

Replicates the model; each replica processes different data. Gradients are all-reduced across replicas.

  • Bandwidth needs: moderate, but can overlap with compute (so visible bandwidth is lower).
  • Right placement: across the slowest links. DP can scale to thousands of replicas across many racks and data centers.

Combining them

A typical large training run might use:

  • TP within node (or within rack).
  • EP within the same domain (for MoE models).
  • PP across nodes or racks.
  • DP across the whole cluster.

The placement is rarely arbitrary — moving a TP group across a slow link drops training throughput by 5-10×.


Collectives and bandwidth profiles

Different collective communication patterns have different bandwidth characteristics.

All-reduce

Every participant ends with the sum (or other reduction) of all inputs. Used in data-parallel training to average gradients.

  • Bandwidth cost: ~2× the message size per participant.
  • Algorithm: ring or tree, depending on size and topology.
  • Tolerable across slower links because it can overlap with compute.

All-gather

Every participant ends with everyone's data concatenated.

  • Bandwidth cost: ~1× message size per participant.
  • Used in TP for full-tensor reconstruction.

All-to-all

Every participant sends some data to every other participant.

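  • Bandwidth cost: each participant sends a distinct slice to every other participant, so N(N−1) flows run at once and the collective is limited by the fabric's bisection bandwidth rather than by any single link.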
  • Used in MoE for token routing.
  • Most bandwidth-hungry common collective.

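Reduce-scatter

The mirror image of all-gather: inputs are reduced element-wise and each participant ends with one shard of the result. A reduce-scatter followed by an all-gather is equivalent to an all-reduce. Used in some TP and FSDP setups.

  • Bandwidth cost: ~1× message size per participant.

For NCCL-level tuning of these collectives, see our NCCL guide.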

The implications for topology:

  • All-to-all is the most fast-fabric-needy. MoE deployments need rack-scale (or in-node) bandwidth for it.
  • All-reduce is more tolerant. DP across slower links is fine.
  • TP collectives (all-gather, reduce-scatter) need fast fabric.
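
A rough cost model for the four collectives, as a sketch. The per-rank traffic factors are the usual ones for bandwidth-optimal algorithms (2(N−1)/N for all-reduce, (N−1)/N for the others); the model is bandwidth-only, ignores latency and overlap with compute, and treats the link figure as per-GPU unidirectional bandwidth, which is an assumption. `est_seconds` is a hypothetical helper.

```python
# Fraction of the message each rank must send, for N participants
TRAFFIC_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "all_to_all": lambda n: (n - 1) / n,
}


def est_seconds(collective, size_gb, n_ranks, link_gb_per_s):
    """Bandwidth-only time estimate for one collective call."""
    return TRAFFIC_FACTOR[collective](n_ranks) * size_gb / link_gb_per_s


# A 1 GB MoE all-to-all over 64 ranks: NVLink 5 (~900 GB/s per direction) vs 400G IB
on_nvlink = est_seconds("all_to_all", 1, 64, 900)
on_ib = est_seconds("all_to_all", 1, 64, 50)
print(f"NVLink: {on_nvlink * 1e3:.2f} ms, InfiniBand: {on_ib * 1e3:.2f} ms")
```

Byte counts alone understate the all-to-all problem: unlike all-reduce it cannot be reduced in the network or hidden as easily behind compute, so the 18× gap in this example lands directly on step time.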

Inter-node fabrics: InfiniBand vs Ethernet

Beyond the fast-fabric domain, the GPU cluster's interconnect determines scale-out behavior.

InfiniBand

  • 400 Gbps NDR: current standard. ~50 GB/s unidirectional per port.
  • 800 Gbps XDR: emerging.
  • Optimized for HPC and AI workloads.
  • Native RDMA: GPU-to-GPU direct transfers, bypassing CPU.
  • Mature, well-tuned, expensive.

Ethernet (RoCE — RDMA over Converged Ethernet)

  • 400G: matches IB in bandwidth.
  • 800G: emerging.
  • Uses standard Ethernet hardware with RDMA software stack.
  • Increasingly competitive with InfiniBand on AI workloads.
  • Better integration with existing data-center networks.

What you actually need

For inference (decode at scale, KV cache transfer between disaggregated pools): 400G+ with RDMA.

For training (data-parallel gradient sync, pipeline parallel activation transfer): 200G+ with RDMA; 400G preferred for the largest runs.

For inference at modest scale: 100G can work but is increasingly the bottleneck.

See our AI training networking guide for depth on inter-node fabrics (InfiniBand vs RoCE, congestion control, NIC tuning).


AMD Infinity Fabric and others

NVIDIA isn't the only player.

AMD Infinity Fabric: AMD's GPU-to-GPU interconnect for MI300X-class systems. Bandwidth competitive with NVLink generations of equivalent timing. Software ecosystem (ROCm) is catching up; production deployments exist.

UALink (Ultra Accelerator Link): an open industry standard for scale-up accelerator interconnect, backed by AMD, Intel, and others. Designed to compete with NVLink at rack scale.

Custom interconnects: Google's TPU pods use a custom torus topology; AWS has Trainium with its own interconnect. These are typically closed systems with fully tuned software stacks.

The trend: more competition with NVIDIA on interconnect, partly motivated by avoiding NVIDIA-specific lock-in.


How NCCL turns topology into bandwidth

The hardware fabric is half the story; the collective library that uses it is the other half. NCCL's job is to take "we have GPUs A through H connected by NVLink + NVSwitch within a node and InfiniBand across nodes" and emit a collective implementation that approaches the theoretical maximum given the topology.

Topology detection

At process startup, NCCL queries each GPU's nvidia-smi topo-equivalent information plus PCIe and NIC discovery. It builds an internal topology graph: GPU ↔ NVLink ↔ NVSwitch ↔ NIC ↔ remote node. Every collective is planned against this graph. The detection is automatic but can be wrong — IOMMU misconfigurations, container-namespace boundaries, and unusual PCIe layouts trip it up. The diagnostic is NCCL_DEBUG=INFO at startup, which dumps the discovered topology.
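
One way to capture that diagnostic from a Python launcher, as a sketch. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_TOPO_DUMP_FILE` are documented NCCL environment variables; the function name and the dump path are placeholders of ours, and the variables must be set before the process creates its first NCCL communicator.

```python
import os


def enable_nccl_topology_debug(dump_path="/tmp/nccl_topo.xml"):
    """Ask NCCL to log its discovered topology and write it to an XML file."""
    os.environ["NCCL_DEBUG"] = "INFO"               # log initialization details
    os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # limit output to init and topology search
    os.environ["NCCL_TOPO_DUMP_FILE"] = dump_path   # hypothetical path; pick your own
    return {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}


settings = enable_nccl_topology_debug()
for name, value in sorted(settings.items()):
    print(f"{name}={value}")
```

Comparing the dumped XML against the output of nvidia-smi topo -m is usually the quickest way to spot a GPU that NCCL thinks is reachable only over PCIe.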

Ring vs tree vs SHARP

NCCL picks an algorithm per collective based on message size and topology:

  • Ring all-reduce. GPUs form a logical ring; data passes once around for the reduce-scatter phase and once for the all-gather phase. Bandwidth-optimal for large messages, which makes it the usual pick for DP gradient all-reduce.
  • Tree all-reduce. Hierarchical reduction tree; better latency for small messages, which suits TP-layer all-reduce (smaller messages, latency-sensitive).
  • SHARP in-network. NVSwitch 3+ can perform reduction in the switch silicon. Eliminates the GPU round-trips. Best for medium-message all-reduce at scale.

The choice is automatic but can be forced: NCCL_ALGO=Ring, NCCL_ALGO=Tree, NCCL_ALGO=NVLS (NVLink SHARP). See NCCL tuning for the full set of knobs.
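
To see what the ring algorithm actually does, here is a toy single-process simulation in plain Python. It is a sketch of the textbook reduce-scatter plus all-gather ring, not NCCL's implementation: real NCCL pipelines chunks across multiple channels and runs on GPUs, and `ring_all_reduce` is our own name.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over per-rank lists using a logical ring."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) % n to
    # rank r + 1, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r - s) % n, data[r][span((r - s) % n)])
                 for r in range(n)]
        for dst, c, payload in sends:
            data[dst][span(c)] = [a + b for a, b in zip(data[dst][span(c)], payload)]

    # Rank r now owns the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: in step s, rank r forwards chunk (r + 1 - s) % n.
    for s in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - s) % n, data[r][span((r + 1 - s) % n)])
                 for r in range(n)]
        for dst, c, payload in sends:
            data[dst][span(c)] = payload
    return data


inputs = [[rank + 1] * 8 for rank in range(4)]  # rank r contributes the value r + 1
outputs = ring_all_reduce(inputs)
```

Each rank sends 2(N−1) chunks of size 1/N, which is where the "~2× the message size" cost for all-reduce comes from, and every step uses only neighbor-to-neighbor links.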

Why this matters for topology choice

A workload that uses NCCL well can saturate ~80% of theoretical NVLink bandwidth on all-reduce. A workload that misconfigures it (wrong algorithm, wrong protocol, P2P disabled) achieves ~20-30%. The hardware bandwidth ceiling is the same; the realized bandwidth depends entirely on the collective library's choices. Most "my NVLink isn't fast" issues are NCCL configuration issues, not hardware issues.
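
A quick way to turn a timed all-reduce into a utilization figure, as a sketch. The bus-bandwidth correction factor 2(N−1)/N is the one the nccl-tests project documents for all-reduce; the timing below is a made-up example, not a measurement, and comparing against ~450 GB/s (half of NVLink 4's ~900 GB/s aggregate, i.e. one direction) is our assumption about the right peak.

```python
def allreduce_bus_bandwidth(size_bytes, seconds, n_ranks):
    """Bus bandwidth in GB/s for an all-reduce, nccl-tests style."""
    alg_bw = size_bytes / seconds / 1e9          # what the caller observes
    return alg_bw * 2 * (n_ranks - 1) / n_ranks  # what each link actually carried


# Hypothetical timing: an 8 GB all-reduce across 8 GPUs finishing in 40 ms
bus_bw = allreduce_bus_bandwidth(size_bytes=8e9, seconds=0.040, n_ranks=8)
peak = 450  # GB/s per direction on NVLink 4 (assumed comparison point)
print(f"busbw = {bus_bw:.0f} GB/s, {bus_bw / peak:.0%} of peak")
```

Tracking this number per node over time also surfaces slow degradation: a node that used to hit ~80% and now sits at 50% deserves a hardware look before anyone touches NCCL settings.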


Topology failure modes

Several problems recur:

Misplaced parallelism. Tensor-parallel group inadvertently straddling a slow link. Throughput cliff (sometimes 10× slower) but no error message. Diagnosis: profile the collectives, find the bottleneck.

Degraded NVSwitch. A failed link or cable in the NVSwitch fabric reduces aggregate bandwidth, sometimes invisibly. Periodic bandwidth tests catch this.

Cross-rack accident. A scheduler places a TP group across two racks. Performance drops sharply. Topology-aware schedulers prevent this.

Stragglers. Some GPUs slower than others (degraded hardware, thermal throttling). The synchronous collective is bounded by the slowest. Detection requires per-GPU monitoring.

Network congestion. Multiple jobs sharing inter-node fabric. Aggregate bandwidth not what you expected. Quality-of-service or job isolation mitigates.
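
Misplaced parallelism and cross-rack accidents are both detectable before the job starts. The check below is a minimal sketch: it assumes ranks are numbered contiguously within each fast-fabric domain, so `rank // gpus_per_domain` identifies the domain, and `straddling_groups` is a hypothetical helper rather than part of any scheduler.

```python
def straddling_groups(groups, gpus_per_domain):
    """Return (group, domains) for every group that spans more than one domain."""
    offenders = []
    for group in groups:
        domains = sorted({rank // gpus_per_domain for rank in group})
        if len(domains) > 1:
            offenders.append((group, domains))
    return offenders


# Two 8-GPU nodes. The first layout is aligned; the second is shifted by four ranks.
aligned = [list(range(0, 8)), list(range(8, 16))]
shifted = [list(range(4, 12))]

for group, domains in straddling_groups(aligned + shifted, gpus_per_domain=8):
    print(f"TP group {group} spans domains {domains}: expect a throughput cliff")
```

Running a check like this at launch turns a silent 10× slowdown into an explicit error.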


NVLink generation deep dive: what actually changed each time

Each NVLink generation has different per-link bandwidth, different link counts per GPU, and different switching topology. The aggregate-per-GPU number gets cited but hides the design choices.

Generation | Per-link BW (each direction) | Links / GPU | Aggregate BW / GPU | First GPU | NVSwitch
NVLink 1 | 20 GB/s | 4 | 160 GB/s | P100 (2016) | none
NVLink 2 | 25 GB/s | 6 | 300 GB/s | V100 (2017) | NVSwitch 1 (DGX-2)
NVLink 3 | 25 GB/s | 12 | 600 GB/s | A100 (2020) | NVSwitch 2
NVLink 4 | ~25 GB/s (faster signaling) | 18 | 900 GB/s | H100 (2022) | NVSwitch 3
NVLink 5 | ~50 GB/s | 18 | 1800 GB/s | B200 (2024) | NVSwitch 4
NVLink 6 (Rubin generation) | TBD | TBD | TBD | Rubin (2027) | TBD

The pattern: aggregate bandwidth grows 1.5–2× every generation, link counts grow then plateau, and per-link signaling rate takes over as the lever. NVLink 5 was the first generation to ship at rack scale in volume; NVSwitch 4 extends the same per-link technology across the rack.
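
The aggregate column follows directly from the other two: per-link bandwidth in each direction, times two directions, times the link count. A short sketch that recomputes it from the table's own figures:

```python
# (per-link GB/s in each direction, links per GPU), taken from the table above
GENERATIONS = {
    "NVLink 1": (20, 4),
    "NVLink 2": (25, 6),
    "NVLink 3": (25, 12),
    "NVLink 4": (25, 18),
    "NVLink 5": (50, 18),
}


def aggregate_gb_per_s(per_direction, links):
    """Bidirectional aggregate per GPU: both directions of every link at once."""
    return per_direction * 2 * links


aggregates = {name: aggregate_gb_per_s(*spec) for name, spec in GENERATIONS.items()}
previous = None
for name, total in aggregates.items():
    step = f" ({total / previous:.2f}x)" if previous else ""
    print(f"{name}: {total} GB/s{step}")
    previous = total
```

The generation-over-generation ratios make the lever visible: NVLink 2 through 4 grew by adding links, while NVLink 5 doubled the per-link rate at a constant 18 links.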

What "scales to a rack" actually required

Going from 8-GPU NVSwitch domains to 72-GPU NVL72 wasn't just "add more NVSwitch chips." Three things had to align:

  1. Per-link signaling rate. NVLink 5's ~50 GB/s/direction required new SerDes technology that could maintain signal integrity over the cable lengths needed for a rack-scale fabric (vs the centimeters of trace on an HGX baseboard).
  2. NVSwitch port count. NVSwitch 4 has substantially more ports per chip than NVSwitch 3, allowing one chip to serve more GPUs and reducing the chip count needed for full-mesh 72-way connectivity.
  3. Physical packaging. The copper-cable backplane is a non-trivial engineering exercise — dozens of differential pairs per GPU, all maintaining sub-nanosecond skew tolerances across the rack.

Each of these was a 3-5 year engineering investment. The shipping NVL72 product reflects roughly that timeline of work since H100 launch.

Why bandwidth gains aren't free

Doubling NVLink bandwidth raises interconnect power roughly in proportion, and total system power per GPU has climbed with it: an NVL72 rack draws ~120 kW for 72 GPUs (~1.7 kW per GPU, including Grace CPUs and switch trays), versus ~10 kW for an 8-GPU DGX H100 node (~1.3 kW per GPU). The interconnect's share of total system power has grown from a few percent (V100 era) to 15-25% (Blackwell era). This is one reason the Rubin generation will introduce co-packaged optics for the inter-rack tier: the per-bit power of pluggable optics doesn't scale with NVLink 6's ambitions.


Why this defines what AI can be

The size of one fast-fabric domain partly determines what models can run.

A model whose tensor-parallel group exceeds the fast-fabric domain has to fall back to slower fabric, with substantial throughput penalty. So the practical maximum TP-degree is bounded by the topology.

A model whose total parameter count requires more HBM than one fast-fabric domain holds has to use pipeline parallelism, which adds latency.

When fast-fabric domains expand (8 → 72 → larger), models that previously didn't fit cleanly suddenly do. This is part of why frontier model architectures evolve in lockstep with hardware generations — they're constrained by what's currently fast.
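
A rough fit check makes the 8 → 72 jump concrete. The sketch assumes ~16 bytes per parameter for mixed-precision Adam training state (weights, gradients, and optimizer moments, before activations), which is a common rule of thumb rather than a figure from this guide; the 80% headroom factor and the name `fits_in_domain` are also our own assumptions.

```python
def fits_in_domain(params_billion, bytes_per_param, domain_hbm_gb, headroom=0.8):
    """Does the training state fit in one fast-fabric domain's HBM?"""
    needed_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return needed_gb <= domain_hbm_gb * headroom, needed_gb


DOMAINS_GB = {"HGX H100 (8 GPUs)": 640, "NVL72 (72 GPUs)": 13_500}

for name, hbm_gb in DOMAINS_GB.items():
    ok, needed = fits_in_domain(405, 16, hbm_gb)
    verdict = "fits" if ok else "needs pipeline parallelism across domains"
    print(f"405B in {name}: {needed / 1000:.1f} TB of state, {verdict}")
```

Under these assumptions a 405B-parameter dense model needs about 6.5 TB of training state: far beyond one HGX node, comfortably inside one NVL72 rack.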

The forward look: rack-scale is here; multi-rack fast-fabric (via optical interconnects) is the next frontier. When that lands, the practical model scale shifts again.


Production deployments

Hosted model providers and hyperscalers (OpenAI, Anthropic, Google, AWS): mix of DGX-class nodes, custom systems, TPUs, and increasingly rack-scale NVIDIA systems.

Frontier AI labs: pre-orders for the largest rack-scale systems. Multi-thousand-GPU clusters across hundreds of racks.

Open-source training collaboratives: typically older DGX or H100/H200 nodes with InfiniBand. Less rack-scale.

Enterprise AI infrastructure: H100/H200 nodes typical, rack-scale emerging.

On-prem and edge: smaller deployments, mostly single-node, sometimes paired via PCIe (rare for serious training, common for inference).

Real-world cluster sizing examples

A few representative public-domain deployment shapes, to ground the abstractions:

Deployment | GPUs | Topology | Use case
Llama 3 training (Meta, 2024) | 16,384 H100 | 2048 nodes, 400G RoCE Ethernet | Pretraining, 405B dense
DeepSeek-V3 training | 2,048 H800 | 8-GPU NVLink nodes + InfiniBand, custom all-to-all kernels | Pretraining, 671B MoE
xAI Colossus | 100,000+ H100/H200 | Spectrum-X 400G, Memphis facility | Grok pretraining
Anthropic Project Rainier | undisclosed (rumored 10⁵+ Trainium2) | NeuronLink + AWS fabric | Claude training
OpenAI Stargate phase 1 | undisclosed (rumored 100k+) | NVL72 / NVL576 mix | Frontier pretraining

The pattern across publicly described frontier deployments: increasingly rack-scale fast-fabric, increasingly large per-cluster GPU counts (100k+), increasingly purpose-built facilities. Pre-2023 clusters were ~thousands of GPUs in repurposed datacenters; 2025-2026 clusters are 100k+ in greenfield facilities. The interconnect topology is what makes this scale work at all.


GB200 NVL72 deep dive

GB200 NVL72 deserves its own treatment because it is the first product to make rack-scale NVLink real at volume. The reference architecture is documented at nvidia.com/en-us/data-center/gb200-nvl72/; the salient details for cluster designers are below.

What is actually inside a rack

A GB200 NVL72 rack contains:

  • 18 compute trays, each with 2 Grace-Blackwell GB200 Superchips = 4 Blackwell GPUs and 2 Grace CPUs per tray.
  • 9 NVSwitch trays, each holding NVSwitch 4 silicon and the cable interfaces.
  • A copper-cable NVLink backplane wiring every GPU to every NVSwitch — there's no PCB long enough, so it's done in cable. This is one reason NVL72 is liquid-cooled and physically dense: shorter cable runs are the only way to keep the signal integrity at NVLink 5 speeds.
  • 72 GPUs × ~1.8 TB/s = ~130 TB/s aggregate NVLink bandwidth in the rack.
  • 72 × 192 GB ≈ 13.8 TB of nominal HBM (this guide uses the more conservative ~13.5 TB figure), addressable as one fast-fabric memory pool.
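
The rack-level totals in the list above are just products of the per-tray and per-GPU figures. A small sketch that recomputes them, using only numbers quoted in this section:

```python
COMPUTE_TRAYS = 18
GPUS_PER_TRAY = 4          # 2 GB200 Superchips x 2 Blackwell GPUs each
NVLINK_TB_PER_S = 1.8      # aggregate NVLink 5 bandwidth per GPU
HBM_GB_PER_GPU = 192       # nominal; usable capacity is somewhat lower

gpus = COMPUTE_TRAYS * GPUS_PER_TRAY
rack_nvlink_tb_per_s = gpus * NVLINK_TB_PER_S
rack_hbm_tb = gpus * HBM_GB_PER_GPU / 1000

print(f"{gpus} GPUs, ~{rack_nvlink_tb_per_s:.0f} TB/s NVLink, ~{rack_hbm_tb:.1f} TB HBM")
```

Swapping in 36 GPUs gives the NVL36 half-rack totals under the same per-GPU assumptions.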

What it changes for software

For training, NVL72 unlocks TP=72 or EP=72 at full NVLink bandwidth. Pre-NVL72, anything past TP=8 had to go through InfiniBand at ~50 GB/s and the throughput cliff was severe. The practical consequence: trillion-parameter dense models and 100B+ MoE models with large expert counts (DeepSeek-V3, GPT-5-class architectures) suddenly have a topology that matches their parallelism.

For inference, NVL72 is the substrate for disaggregated inference at frontier scale and large-expert MoE serving: KV cache transfer between prefill and decode pools, or token routing across many experts, both want NVLink-class bandwidth that no inter-rack fabric provides.

Variants and the bigger family

  • NVL36: half-rack variant, 36 GPUs. Same NVSwitch-4 fabric, fewer compute trays.
  • NVL72 (GB200): the headline product. 72 GPUs, liquid-cooled, ~120 kW per rack.
  • GB300 NVL72: 2025 refresh with the upgraded Blackwell Ultra silicon — same fabric topology, higher per-GPU compute and HBM.
  • NVL576 reference: NVIDIA has published reference architectures that connect 8 NVL72 racks via NVLink Switch System into a 576-GPU NVLink domain. This is a forward-looking design point; deployments are early and rare.

Operational reality

NVL72 is not a drop-in replacement for DGX racks. Power density (120 kW), liquid cooling, weight (1.4 tonnes), and the copper-cable backplane all impose new datacenter requirements. The economics shift from "many small purchases" to "few very large purchases" — see decentralized GPU compute for how this is reshaping who can host frontier hardware. For the NVIDIA AI GPU lineup including B200, H100, H200, and DGX Spark, the rack-scale story is the most important variable that doesn't fit on a spec sheet.


Multi-rack scaling: beyond NVL72

NVL72 raises the rack-boundary question rather than answering it: once you have a 72-GPU fast-fabric domain, what does "the next rack" look like?

NVLink Switch System (multi-rack NVLink)

NVIDIA's NVLink Switch System extends NVLink across multiple NVL72 racks using external NVLink switches and optical cables. The published reference is the NVL576: 8 NVL72 racks (576 GPUs) in a single NVLink domain. The bandwidth between racks is lower than intra-rack (cable distance, signal integrity), but still NVLink-class — much higher than InfiniBand inter-rack.

For the largest training runs in 2026 (trillion-parameter dense, multi-trillion MoE), NVL576-class is where the math starts working. For most teams, even a single NVL72 is overkill.

InfiniBand / Spectrum-X / Ultra Ethernet as the inter-rack fabric

The more common pattern: NVL72 (or DGX H200) racks connected by InfiniBand Quantum-2/3 (NDR/XDR) or NVIDIA Spectrum-X 800G Ethernet. This is the DGX SuperPOD reference architecture in its 2026 form. The fast-fabric domain stops at the rack; pipeline-parallel and data-parallel collectives cross out of NVLink and into the scale-out network.

Sizing rule of thumb: every NVL72 rack needs roughly 8 × 800 Gb/s = ~800 GB/s of inter-rack bandwidth to keep DP all-reduce from dominating step time at >10k-GPU scale. This is well within Quantum-3 / Spectrum-X capability per rack but constrains the overall switch radix at thousands of racks.
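The gigabit-versus-gigabyte conversion in that rule of thumb is easy to fumble, so here is a small sketch of it, together with the ratio between bandwidth inside the rack and bandwidth leaving it. The 8 × 800 Gb/s and ~130 TB/s inputs come from this guide; the function names are made up for illustration:

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert Gb/s to GB/s (8 bits per byte, ignoring encoding overhead)."""
    return gbit_per_s / 8


def rack_scale_out_gbytes(ports: int = 8, gbit_per_port: float = 800) -> float:
    """Aggregate scale-out bandwidth leaving one rack, in GB/s."""
    return ports * gbit_to_gbyte(gbit_per_port)


def taper_ratio(nvlink_tbytes: float = 130.0, scale_out_gbytes: float = 800.0) -> float:
    """How many times more bandwidth exists inside the rack than out of it."""
    return nvlink_tbytes * 1000 / scale_out_gbytes


print(rack_scale_out_gbytes(), "GB/s out of the rack")
print(f"{taper_ratio():.1f}x more bandwidth inside the rack than leaving it")
```

A taper of two orders of magnitude is the quantitative version of the "throughput cliff": collectives that cross the rack boundary need to be the ones that tolerate it.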

Optical NVLink (and the CPO horizon)

NVL72's copper backplane works because the rack is small. For NVL576 and beyond, you need optics, and the power and cost of pluggable optics at NVLink bandwidth are non-trivial. Co-packaged optics (CPO), where the optics are integrated into the switch ASIC package, is the path forward. NVIDIA's Quantum-X Photonics (announced at GTC 2025 for shipment 2026–2027) is the first generation; the Rubin platform extends it.

Cross-datacenter NVLink

A question we get a lot: can NVLink span buildings or campuses? Short answer: no, not today. NVLink is a low-latency synchronous-style fabric; even fiber between adjacent buildings adds RTT that breaks the abstraction. The right model for cross-DC training is asynchronous (DiLoCo-style); see distributed LLM training for the federated approaches.

Where this leaves cluster designers

The practical hierarchy in 2026:

  1. Inside an NVL72 rack: NVLink 5 at 1.8 TB/s, TP and EP up to 72.
  2. Across NVL72 racks within a hall: InfiniBand XDR or Spectrum-X 800G; or NVLink Switch System for NVL576-class.
  3. Across halls within a datacenter: InfiniBand or Ultra Ethernet, multi-hop.
  4. Across datacenters: asynchronous training, federated learning, no synchronous collectives.

Knowing which tier your collectives live in is the single most important topology question.


Cross-vendor: AMD MI300X, UALink, Ultra Ethernet, and the future

NVIDIA's interconnect dominance is the elephant in every cluster design meeting. In 2026, the credible alternatives are converging on a two-part open stack: UALink for scale-up, Ultra Ethernet for scale-out.

AMD MI300X / MI325X / MI350X today

AMD's Instinct MI300X-class platforms ship as 8-GPU OAM systems with Infinity Fabric between GPUs at ~896 GB/s aggregate per GPU. Per-link bandwidth is competitive with NVLink 4; the gap to NVLink 5 / NVL72 is mostly about scale of the fast-fabric domain, not per-link speed. RCCL (AMD's NCCL fork) implements the standard collectives; in production it's ~70–85% of NCCL on equivalent hardware depending on workload, with the gap closing.

UALink: the open scale-up bet

UALink is what AMD, Intel, AWS, Google, HPE, Meta, Microsoft and others put on the table as the answer to "NVLink, but not vendor-locked." The 1.0 spec (2025) defines:

  • A scale-up switched fabric for accelerators.
  • Up to 1024 accelerators in one domain (vs NVLink 5's 72 in NVL72, or 576 in NVL576 reference).
  • Per-link bandwidth comparable to NVLink 5.
  • Memory semantics suitable for tensor and expert parallelism at scale.

First-silicon UALink switches and accelerator endpoints are expected in 2026–2027. Whether UALink achieves NVLink-class operational maturity in the same window is the open question.

Ultra Ethernet: the scale-out partner

UEC's transport (see the AI cluster networking guide for full details) is designed to be the inter-rack fabric for UALink-based scale-up domains. The full stack is UALink within the scale-up domain, UEC between scale-up domains — a deliberate mirror of NVLink + InfiniBand, but multi-vendor.

Google TPU and AWS Trainium

The closed alternatives are operationally proven but ecosystem-locked. Google's TPU v5p and Trillium use a 3D-torus ICI; AWS Trainium2 UltraServers wire 64 chips together with NeuronLink. Both achieve frontier-scale training performance, but only inside their respective clouds and software stacks (JAX/XLA for TPU, Neuron SDK for Trainium).

What this means practically

For deployments in 2026, NVLink + InfiniBand (or Spectrum-X) remains the lowest-risk choice with the deepest software ecosystem. UALink + UEC is the credible 2027+ alternative that buyers should be tracking — particularly large enterprises with multi-vendor procurement requirements and hyperscalers building their own silicon. AMD is the most viable single-vendor alternative today for inference and mid-scale training; for frontier pretraining the NVLink scale-up gap still bites.

The forward look: by 2028, expect at least one production frontier model trained on a UALink-based system, and serious cross-vendor competition at the rack scale for the first time since GPUs became AI accelerators.


Power, cooling, and physical constraints of rack-scale

The fast-fabric bandwidth story doesn't run without the physical infrastructure that makes the density possible. Rack-scale fast-fabric requires rack-scale power and cooling, and these are where most datacenter retrofits actually fail.

Power density per rack

Rack class | GPUs | Power | Cooling | Floor weight
DGX H100 (8-GPU node × 4) | 32 | ~40 kW | air, hybrid liquid | ~1 tonne
DGX H200 (8-GPU node × 4) | 32 | ~45 kW | air, hybrid liquid | ~1 tonne
GB200 NVL72 | 72 | ~120 kW | direct-to-chip liquid (mandatory) | ~1.4 tonnes
GB300 NVL72 | 72 | ~140 kW | direct-to-chip liquid | ~1.4 tonnes
Rubin NVL144 (announced) | 144 | ~250+ kW (projected) | DTC liquid + rear-door | ~2 tonnes

A typical pre-2024 datacenter is provisioned for 10-20 kW/rack. Hosting NVL72-class equipment requires either a purpose-built facility or a substantial retrofit. The cost of the retrofit (power feeds, cooling distribution, structural reinforcement) is typically 30-50% of the equipment cost — sometimes more for older facilities. This is why frontier AI deployments are increasingly clustered in a small number of purpose-built sites: not because the GPUs are scarce, but because the facilities that can power and cool them are.

Cooling specifics

NVL72 uses direct-to-chip (DTC) cold-plate liquid cooling: a coolant loop circulates through cold plates that sit directly on the Blackwell die. The rack has integrated coolant distribution units (CDUs) that handle the loop within the rack; facility water (separated by a heat exchanger) carries heat to outdoor cooling towers or chillers. Failure modes include CDU pump failure (entire rack throttles or shuts down), coolant leaks (catastrophic if uncaught — the racks have leak detection but real-world incidents have happened), and facility-water supply interruption (rack thermal-throttles within minutes).

Networking and cabling

NVL72's copper-cable NVLink backplane is one of the most distinctive features of the design. Each GPU connects to all 9 NVSwitch trays via dozens of SerDes lanes routed in dense copper bundles. The cable runs are kept short (tens of centimeters up to a couple of meters, all inside the rack) because NVLink 5's signaling rate (200 Gbps per lane, ~50 GB/s per direction per link) won't tolerate longer passive copper. Optical NVLink (NVLink 6 era) will allow longer runs and multi-rack scale-up, but increases per-link power by ~5-10× over copper.

Density vs serviceability tradeoff

Frontier racks trade serviceability for density. Swapping a failed compute tray in NVL72 requires breaking the coolant loop, removing the tray, replacing it, and re-priming. Single-tray service windows are measured in hours, not minutes. The implication for capacity planning: a 1% rack-failure rate translates to substantial throughput loss because each rack-down event costs hours.

For the operational side of this — how training jobs survive a rack going offline — see checkpoint storage and recovery, which covers the resilience patterns frontier labs use to absorb these events.


Failure isolation and the blast radius of rack-scale

A 72-GPU NVLink domain is also a 72-GPU failure blast radius. The bigger the fast-fabric domain, the larger the unit of "things that can go wrong together." This is a real operational tradeoff that gets underweighted in glossy reference architecture diagrams.

What can take out a rack

  • One NVSwitch tray failure. NVL72 has 9 NVSwitch trays. Losing one degrades aggregate bandwidth but doesn't kill the rack. Losing two on certain failure patterns disconnects subsets of GPUs and effectively kills the rack as a single NVLink domain.
  • CDU or coolant failure. Affects the entire rack within minutes.
  • Rack-level power event. A breaker trip, an upstream PDU failure. Affects the entire rack.
  • One bad GPU. Doesn't kill the rack but interrupts any job using that rank. Job restart from checkpoint; ~10-30 min loss.
  • NVLink cable failure. Rare but happens, especially in early hardware lots. Reduces effective bandwidth on the affected GPU until replaced.

Probability math

If a node-class failure happens every 1-3 years per node (MTBF across drivers, GPUs, NICs, and host hardware combined), and rack-class failures (CDU, power, correlated NVSwitch failure) occur at maybe 1/10th of the cluster-wide node-failure rate, a 100-rack NVL72 cluster (1,800 compute trays) sees:

  • Node failures: ~2-5 per day at the cluster level (1,800 nodes spread over 1-3 years of MTBF).
  • Rack failures: roughly one to three per week.
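The failure-rate estimate is simple enough to recompute for your own cluster. A minimal sketch, assuming each NVL72 compute tray counts as one node (18 per rack) and that rack-class events occur at one-tenth of the cluster-wide node-failure rate; both are modelling assumptions, not measured data:

```python
DAYS_PER_YEAR = 365


def failures_per_day(units: int, mtbf_years: float) -> float:
    """Expected failures per day across `units` independent units."""
    return units / (mtbf_years * DAYS_PER_YEAR)


RACKS = 100
NODES_PER_RACK = 18        # assumption: one NVL72 compute tray = one node
RACK_TO_NODE_RATIO = 0.1   # assumption: rack-class events at 1/10th the node rate

for mtbf_years in (1, 3):
    node_rate = failures_per_day(RACKS * NODES_PER_RACK, mtbf_years)
    rack_rate_per_week = node_rate * RACK_TO_NODE_RATIO * 7
    print(f"MTBF {mtbf_years} y: {node_rate:.1f} node failures/day, "
          f"{rack_rate_per_week:.1f} rack-class events/week")
```

Under these assumptions the cluster sees on the order of a couple to five node failures a day and one to a few rack-class events a week. The exact figures matter less than the conclusion that both are routine, not exceptional.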

The implication: parallelism layouts that span multiple racks must tolerate rack-level failure. A training job pinning TP=72 to a single rack loses 72 GPUs whenever that rack fails — including the optimizer state for those ranks if the checkpoint replication didn't include them. Frontier deployments cross-rack-replicate critical state precisely because of this.

Mitigation patterns

  • Cross-rack checkpoint replication. See checkpoint storage and recovery for the patterns. The blast radius of rack failure shapes which ranks need replicated state.
  • Hot-spare racks. A few percent of cluster capacity held in reserve for fast replacement of failed racks. Idle capacity that pays for itself when used.
  • Failure-aware schedulers. A job scheduler that places tensor-parallel groups within a rack but data-parallel replicas across racks is intrinsically more resilient than one that doesn't think about topology.
  • Health-check cadence. Per-GPU NCCL bandwidth tests on a daily cron catch degrading links before they cause job hangs. Combined with NCCL tuning, these are the operational hygiene that keeps clusters running.

Sizing exercise: parallelism layout for a 405B-parameter run

Walking through the math on a real example fixes the abstract topology arguments. Take a Llama-3-405B-class training run, BF16, on a 16k-GPU cluster.

The constraints

  • Model: 405B parameters, BF16 weights → 810 GB.
  • Optimizer state (Adam, FP32 master weights): ~5 TB total.
  • Activation memory per micro-batch at seq=8192: several GB per rank under FSDP.
  • One NVL72 rack: 13.5 TB HBM, fast-fabric domain of 72 GPUs.
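The memory constraints above follow from bytes-per-parameter accounting. A rough sketch, assuming BF16 weights and gradients plus three FP32 tensors per parameter for Adam (master weights, first moment, second moment); activation memory is deliberately left out, and the names are illustrative:

```python
PARAMS = 405e9
BYTES_BF16 = 2
BYTES_FP32 = 4


def training_state_bytes(params: float) -> dict:
    """Static training state in bytes, excluding activations."""
    return {
        "weights": params * BYTES_BF16,
        "grads": params * BYTES_BF16,          # assumption: BF16 gradients
        "optimizer": params * BYTES_FP32 * 3,  # FP32 master weights + Adam m + Adam v
    }


state = training_state_bytes(PARAMS)
total_tb = sum(state.values()) / 1e12
per_gpu_gb = sum(state.values()) / 72 / 1e9   # sharded evenly over one NVL72 rack

print({name: f"{size / 1e12:.2f} TB" for name, size in state.items()})
print(f"total {total_tb:.2f} TB, about {per_gpu_gb:.0f} GB per GPU across 72 GPUs")
```

With these assumptions the static state is about 6.5 TB, roughly half of one rack's HBM. That is why a single rack can in principle hold the whole training state if it is fully sharded.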

The layout

The standard Megatron-class layout for 405B on rack-scale hardware:

Dimension | Degree | Where it lives | Rationale
Tensor parallel (TP) | 8 | Within one node (8-GPU NVSwitch island) | All-gather + reduce-scatter per layer; needs NVLink
Pipeline parallel (PP) | 9 | Across nodes within rack | Activation passing across stages; tolerates ~100 GB/s
Data parallel (DP) | ~222 | Across racks via InfiniBand / Spectrum-X | Gradient all-reduce; overlaps with compute

Total: TP=8 × PP=9 × DP=222 = ~16,000 GPUs. The TP group fits in one node. The TP+PP combination (one full pipeline) fits in one rack of 72 GPUs (8×9). The DP replicates the pipeline across the cluster.
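One way to see why this layout keeps the expensive collectives local is to write out the rank-to-hardware mapping. The sketch below is an illustrative placement function, not Megatron's actual rank ordering: it makes TP the fastest-varying index and assumes 8-GPU nodes grouped nine to a 72-GPU rack.

```python
from typing import Tuple

TP, PP, DP = 8, 9, 222
GPUS_PER_NODE, GPUS_PER_RACK = 8, 72


def place(dp: int, pp: int, tp: int) -> Tuple[int, int, int]:
    """Map a (dp, pp, tp) coordinate to (rack, node_in_rack, gpu_in_node).

    TP is the fastest-varying index, so each TP group fills one 8-GPU node
    and each TP x PP pipeline fills one 72-GPU rack.
    """
    global_rank = (dp * PP + pp) * TP + tp
    rack, slot = divmod(global_rank, GPUS_PER_RACK)
    node, gpu = divmod(slot, GPUS_PER_NODE)
    return rack, node, gpu


total_gpus = TP * PP * DP
# Every rank of one pipeline replica should land in the same rack.
racks_used_by_replica_5 = {place(5, pp, tp)[0] for pp in range(PP) for tp in range(TP)}
print(total_gpus, racks_used_by_replica_5)
```

A topology-aware scheduler enforces exactly this property: TP peers share a node, pipeline peers share a rack, and only data-parallel traffic crosses the scale-out network.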

Why this layout

The TP all-reduces (one per transformer layer) need the highest bandwidth — they go on NVLink within a node. PP activation transfers happen at stage boundaries, much rarer, and tolerate NVLink-to-NVLink within a rack (still fast). DP gradient sync is large in aggregate but happens once per training step and overlaps with backprop compute; InfiniBand at ~50 GB/s per port is plenty.

What rack-scale unlocks

Pre-NVL72, TP was bounded at 8 by the NVSwitch domain. Pipeline parallelism had to fit the model into the available TP×PP×rack-shape. On NVL72, TP can go to 72; PP shrinks accordingly. A 405B model with TP=72, PP=1 fits in one rack with room to spare for activation memory, and avoids pipeline bubbles entirely. The result is fewer pipeline stages, higher utilization, and faster step time. For contrast, DeepSeek-V3 (arXiv:2412.19437) was trained on H800 nodes with 8-GPU NVLink domains and reports avoiding tensor parallelism altogether, relying instead on 16-way PP and 64-way EP spread across nodes: the kind of workaround a 72-GPU fast-fabric domain is meant to remove.

What this means for cluster economics

A 16k-GPU cluster built as 2,000 DGX H200 nodes vs 222 NVL72 racks costs differently per training step. The NVL72 setup has higher per-rack capex but completes the same step ~15-25% faster on a large MoE due to expert-parallel bandwidth and zero pipeline bubbles. The break-even depends on the model architecture; for trillion-parameter MoE training, NVL72 wins decisively. For dense ≤200B training, DGX H200 stays cost-competitive. See distributed LLM training for the parallelism math that shapes this choice.


NVLink generations 3/4/5: lane-level numbers

NVLink is not one technology — it is five generations of differential serial links sharing a name. The lane speed, link count per GPU, signaling, and error characteristics changed each generation. Treating "NVLink" as a single thing leads to capacity-planning mistakes.

NVLink 3 (Ampere, A100, 2020)

Per-lane signaling: 50 Gbps NRZ × 4 differential pairs per direction per "link" → 25 GB/s per link unidirectional, 50 GB/s bidirectional. Each A100 has 12 NVLink 3 links → 600 GB/s aggregate bidirectional per GPU. The links connect to NVSwitch1 (described below) inside HGX A100 8-GPU baseboards. End-to-end any-GPU-to-any-GPU bidirectional: 600 GB/s in the 8-GPU fully-connected case.

Per-lane bit error rate (BER) target: 1e-12 raw. NVLink 3 uses no FEC (forward error correction); errors are caught by CRC at the link layer and retransmitted. At exactly 1e-12 BER, a fully loaded link (50 GB/s bidirectional, about 4×10^11 bit/s) would see a bit error every few seconds; links that run well inside the target see proportionally fewer. Either way, errors are handled transparently by retransmission and are visible only as small latency hits.
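The relationship between BER, link rate, and error interval is a one-line formula worth checking directly. A minimal sketch, assuming errors are independent and the link is fully loaded; the function name is illustrative:

```python
def seconds_between_bit_errors(ber: float, gbytes_per_s: float) -> float:
    """Mean time between bit errors on a fully loaded link."""
    bits_per_s = gbytes_per_s * 1e9 * 8
    return 1.0 / (ber * bits_per_s)


# One NVLink 3 link: 50 GB/s bidirectional at the 1e-12 raw BER target.
at_target = seconds_between_bit_errors(1e-12, 50)
# The same link running four orders of magnitude better than the target.
well_inside = seconds_between_bit_errors(1e-16, 50)

print(f"{at_target:.1f} s between errors at 1e-12")
print(f"{well_inside / 3600:.1f} h between errors at 1e-16")
```

At the target BER the interval is a matter of seconds. Error intervals measured in hours imply a link running several orders of magnitude below the target, so link-level retry counters are a better health signal than the spec number.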

NVLink 4 (Hopper, H100/H200, 2022/2024)

Per-lane signaling: 100 Gbps PAM4 × 2 differential pairs per "link" (note: NVIDIA changed the link-counting). Each H100 has 18 NVLink 4 links of 50 GB/s bidirectional → 900 GB/s aggregate bidirectional per GPU.

PAM4 vs NRZ matters: PAM4 doubles bits-per-symbol at the cost of much tighter SNR margins. Pre-FEC BER on NVLink 4 is ~1e-9 (much worse than NVLink 3 raw), but RS (Reed-Solomon) FEC brings effective BER to 1e-15 post-FEC. FEC adds ~10 ns of latency per link traversal — the practical cost of going to PAM4.

H100 SXM5 connects to the baseboard NVSwitch chips in HGX H100 (covered in the NVSwitch section). H200 uses the same NVLink 4 generation; the H200 upgrade is HBM (141 GB HBM3e), not interconnect.

NVLink 5 (Blackwell, B100/B200/GB200/B300, 2024–2025)

Per-lane signaling: 200 Gbps PAM4 × 2 differential pairs per direction per "link", double the NVLink 4 lane speed. Each B200 has 18 NVLink 5 links → 100 GB/s bidirectional each → 1,800 GB/s aggregate bidirectional per GPU. GB200 (one Grace CPU plus two B200 GPUs on a single superchip module) exposes the same per-GPU NVLink budget; the Grace CPU attaches to its GPUs over NVLink-C2C rather than sitting in the NVLink 5 path.

NVLink 5 uses tighter RS FEC plus a stronger inner code; effective BER 1e-17, FEC latency 10–15 ns. NVLink 5 also adds SHARPv3 support: in-network aggregation of allreduce traffic across the NVSwitch4 fabric (more in the NVSwitch section).

NVLink 6 (Rubin, 2026 reference designs, GA Q4 2026)

NVIDIA disclosed at GTC March 2026: per-link 200 GB/s bidirectional, 18 links per Rubin GPU → 3.6 TB/s aggregate bidirectional. Co-packaged optics begin appearing for inter-rack scale-up. SHARPv4 in NVSwitch5. Production deployments through 2027.

Generational summary

Generation | Year | Per-lane signaling | Per-link bidir | Links per GPU | Per-GPU aggregate | FEC latency | Effective BER
NVLink 3 | 2020 | 50 G NRZ | 50 GB/s | 12 | 600 GB/s | 0 (CRC + retry) | 1e-12 raw
NVLink 4 | 2022 | 100 G PAM4 | 50 GB/s | 18 | 900 GB/s | ~10 ns | 1e-15 post-FEC
NVLink 5 | 2024 | 200 G PAM4 | 100 GB/s | 18 | 1,800 GB/s | 10–15 ns | 1e-17 post-FEC
NVLink 6 | 2026 | 400 G PAM4 (CPO) | 200 GB/s | 18 | 3,600 GB/s (Rubin) | <15 ns | 1e-18 target

The trend: 2× lane speed per generation roughly every 2 years, paid for in FEC latency and tighter signal integrity requirements (which is why NVLink 5 cabling and reach are more constrained than NVLink 3 — covered in the cabling section).
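The per-GPU aggregates in the summary follow mechanically from lane rate, pairs per direction, and link count. A small sketch for NVLink 3, 4, and 5, assuming 4 pairs per direction for NVLink 3 and 2 for NVLink 4 and 5 (the counts that reproduce the per-link figures above); NVLink 6 is omitted because its lane layout is not described here:

```python
GENERATIONS = {
    # name: (Gbit/s per pair, pairs per direction per link, links per GPU)
    "NVLink 3": (50, 4, 12),
    "NVLink 4": (100, 2, 18),
    "NVLink 5": (200, 2, 18),
}


def link_and_gpu_bandwidth(lane_gbit: float, pairs_per_dir: int, links: int):
    """Return (GB/s bidirectional per link, GB/s bidirectional per GPU)."""
    per_direction = lane_gbit * pairs_per_dir / 8   # GB/s, one direction
    per_link = per_direction * 2
    return per_link, per_link * links


for name, spec in GENERATIONS.items():
    per_link, per_gpu = link_and_gpu_bandwidth(*spec)
    print(f"{name}: {per_link:.0f} GB/s per link, {per_gpu:.0f} GB/s per GPU")
```

Running the numbers this way makes it obvious when a quoted figure mixes unidirectional and bidirectional bandwidth, which is the most common source of 2× errors in capacity planning.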


NVSwitch generations 1/2/3/4: silicon, ports, SHARP

NVSwitch is the crossbar chip that lets all-to-all GPU communication happen inside a "fast-fabric domain." Each generation matches an NVLink generation, but with different chip-level port counts and aggregate switch bandwidth.

NVSwitch1 (Volta, 2018, then HGX A100 with NVLink 3)

The first generation, deployed in DGX-2 (16-GPU V100 box) and later in HGX A100. 18 NVLink ports per chip; 6 NVSwitch1s on an HGX A100 baseboard fully interconnect 8 A100s with non-blocking bandwidth. Aggregate switch fabric: 4.8 TB/s bidirectional. No in-network compute.

NVSwitch2 (HGX H100, 2022)

Updated for NVLink 4. 64 NVLink ports per chip; 4 NVSwitch2s on HGX H100 fully interconnect 8 H100s. Aggregate fabric bisection: 7.2 TB/s. Still no SHARP support in NVSwitch2 silicon.

NVSwitch3 (HGX H100/H200 SuperPOD external, 2023)

A second-tier switch used externally between HGX baseboards in SuperPOD configurations. NVSwitch3 enables a 256-GPU SuperPOD with NVLink scale-up between nodes — but in practice this rarely shipped at scale because the external NVLink cabling cost and complexity made InfiniBand more practical for inter-node. Most H100/H200 production sites kept the 8-GPU NVLink domain and used InfiniBand or RoCE for inter-node.

NVSwitch3 added SHARPv2 — Scalable Hierarchical Aggregation and Reduction Protocol. SHARPv2 lets the switch perform allreduce in the network (sum tensors as they pass through), eliminating the need for endpoint GPUs to relay reduced values. For small-tensor allreduce (typical in distributed training when gradients are bucketed small), SHARP can deliver 2–3× the effective bandwidth.
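A simple bandwidth-only model shows where part of the in-network reduction win comes from. This is a textbook approximation, not NVIDIA's performance model: it assumes a ring all-reduce moves 2(N-1)/N of the buffer through each GPU's link, while a switch-side reduction needs each GPU to send the buffer once and receive the result once over a full-duplex link.

```python
def ring_allreduce_seconds(size_gb: float, n_gpus: int, gb_per_s_per_dir: float) -> float:
    """Bandwidth term of a ring all-reduce: reduce-scatter plus all-gather."""
    return 2 * (n_gpus - 1) / n_gpus * size_gb / gb_per_s_per_dir


def in_network_allreduce_seconds(size_gb: float, gb_per_s_per_dir: float) -> float:
    """Each GPU sends its buffer up once and receives the reduced result once."""
    return size_gb / gb_per_s_per_dir


def speedup(n_gpus: int) -> float:
    """Ratio of ring time to in-network time for the same buffer and link."""
    return ring_allreduce_seconds(1.0, n_gpus, 1.0) / in_network_allreduce_seconds(1.0, 1.0)


for n in (8, 72):
    print(f"{n} GPUs: {speedup(n):.2f}x from bandwidth alone")
```

The bandwidth term alone tops out just under 2×. Gains beyond that on small tensors would have to come from latency, since a ring needs 2(N-1) sequential steps and a switch-side reduction does not, which this sketch does not model.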

NVSwitch4 (GB200 NVL72, 2024–2025)

The big one. NVSwitch4 is the silicon that makes NVL72 possible. 144 NVLink 5 ports per chip; 9 NVSwitch4s in an NVL72 rack interconnect 72 B200 GPUs in a non-blocking flat fabric. Aggregate bidirectional fabric bandwidth: 130 TB/s.

NVSwitch4 includes SHARPv3 with FP8 and FP16 in-network reduction, increasing allreduce efficiency on the lower-precision dtypes that dominate modern training. Empirical benchmarks (DGX SuperPod H100→GB200 comparison published in NVIDIA's MLPerf v4.1 entries): allreduce bandwidth utilisation in NVL72 hits 85–95% of theoretical peak vs 60–75% in an 8-GPU NVLink island doing the same reduction.

NVSwitch5 (Rubin, disclosed 2026)

NVLink 6 era. 288 NVLink ports per chip target. SHARPv4 with broader dtype support. Volumes ship Q4 2026 / 2027.

Port count, throughput summary

Switch gen | Year | NVLink gen | Ports per chip | Switches per HGX/NVL | Domain size | Bisection BW | SHARP
NVSwitch1 | 2018 | 2/3 | 18 | 6 (HGX A100) | 8 GPUs | 4.8 TB/s | No
NVSwitch2 | 2022 | 4 | 64 | 4 (HGX H100) | 8 GPUs | 7.2 TB/s | No
NVSwitch3 | 2023 | 4 | 64 | external | 256 GPUs (SuperPod) | 57.6 TB/s | v2
NVSwitch4 | 2024 | 5 | 144 | 9 (NVL72) | 72 GPUs | 130 TB/s | v3 (FP8/16)
NVSwitch5 | 2026 | 6 | 288 | (Rubin) | 144+ GPUs | 260+ TB/s | v4
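A quick sanity check on any switched fabric is whether the switches expose at least as many ports as the GPUs have links. The sketch below runs that check on the figures from the summary above, using the generation numbering and port counts as this guide gives them; the dictionary layout is illustrative:

```python
FABRICS = {
    "HGX A100": dict(gpus=8, links_per_gpu=12, switches=6, ports_per_switch=18),
    "HGX H100": dict(gpus=8, links_per_gpu=18, switches=4, ports_per_switch=64),
    "GB200 NVL72": dict(gpus=72, links_per_gpu=18, switches=9, ports_per_switch=144),
}


def port_budget(fabric: dict):
    """Return (GPU-side links needed, switch-side ports available)."""
    needed = fabric["gpus"] * fabric["links_per_gpu"]
    available = fabric["switches"] * fabric["ports_per_switch"]
    return needed, available


for name, fabric in FABRICS.items():
    needed, available = port_budget(fabric)
    print(f"{name}: {needed} GPU links vs {available} switch ports")
```

On these figures NVL72 is exactly balanced, with 1,296 GPU links against 1,296 switch ports, while the 8-GPU baseboards have ports to spare.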

GH200, HGX H100/H200, B200, GB200 SuperPod compared

The reference architecture names confuse newcomers. Here is the 2026 family laid out clearly.

HGX H100 8-GPU (the workhorse 2022–2024)

A baseboard with 8 H100 SXM5 GPUs and 4 NVSwitch2 chips. Total NVLink domain: 8 GPUs at 900 GB/s each. Power: 6–7 kW per node (~700 W per GPU). Cooling: typically air for the 350–400 W variant, hybrid air-liquid for 700 W SXM5. Most H100 production capacity from 2022–2024 was HGX H100.

HGX H200 8-GPU (2024)

Same NVSwitch2 baseboard architecture; H200 GPUs replace H100. H200 differs in HBM (141 GB HBM3e vs H100's 80 GB HBM3) but has identical NVLink and NVSwitch. Drop-in upgrade.

GH200 Grace-Hopper Superchip (2023–2024)

Different beast. GH200 puts a Grace ARM CPU and an H100 GPU on the same package, connected by NVLink-C2C (chip-to-chip, ~900 GB/s). GH200 racks deployed at scale at Lambda, CoreWeave, others. Use case: workloads with heavy CPU-GPU coordination (multi-modal preprocessing, large embedding lookups) where the C2C link is the differentiator. Not a frontier-training default — that role stayed with HGX H100.

DGX H100 SuperPod (2023)

Reference architecture combining HGX H100 nodes into 256-GPU NVLink-connected SuperPods via external NVSwitch3. Most customers used the building blocks (HGX H100) but with InfiniBand inter-node instead, since external NVLink scaling was operationally complex.

HGX B200 8-GPU (2024)

Blackwell generation. 8 × B200 (each 192 GB HBM3e). NVLink 5 internally via NVSwitch (B200 generation switch chips). Power: 12–14 kW per node (~1.4 kW per B200 at TDP, plus host CPU and supporting infrastructure). Liquid cooling effectively mandatory.

GB200 NVL72 (the 2025 frontier rack)

Not a baseboard, a rack. 18 compute trays per rack, each tray has 2 × GB200 "superchips" (1 Grace + 2 B200), totaling 72 B200 GPUs per rack. 9 NVSwitch4 trays interconnect them in a non-blocking flat NVLink fabric. Rack power: 132 kW typical, 140 kW peak. Cooling: direct liquid-to-chip, mandatory.

The big-deal feature: all 72 B200 in the rack are in one NVLink domain at 1,800 GB/s per GPU. Tensor parallelism, expert parallelism, and pipeline parallelism can span up to 72 GPUs at NVLink bandwidths. This changed what model architectures are viable — 256-expert MoE designs like DeepSeek-V3 became practical because expert-parallel groups fit cleanly in one rack.

GB200 NVL36 (2025, lower-power variant)

Half-rack variant. 9 compute trays, 36 GPUs, ~65 kW. Deployed where 132 kW is unavailable (older datacenters, air-cooled facilities with limited liquid). Common configuration in Tier-2 cloud providers.

GB300 NVL72 (2025–2026)

GB300 = B300 GPUs paired with the Grace CPU. B300 has more HBM (288 GB) and 1.5× the FLOPs of B200 at a similar power envelope. The NVL72 rack architecture is identical to GB200 NVL72; GPUs are drop-in replacements. NVL72-B300 ships in volume Q2 2026.

NVLink-Switch System (DGX GB200 SuperPod)

The 2025–2026 multi-rack scale-up product. Connects 8 NVL72 racks (576 GPUs) into one NVLink-switched fabric via external NVSwitch4 trays. Cabling between racks uses copper or active optical cables (AOC). This is the largest "single NVLink domain" available in 2026 production.

Rubin (2026 reference)

Rubin GPU + Vera CPU. NVLink 6, NVSwitch5. Reference rack "Rubin Ultra" targets 576 GPUs in one fabric. Production deployments late 2026 / 2027.

Platform | Year | GPUs in domain | Per-GPU NVLink BW | Power | Cooling | Status
HGX A100 | 2020 | 8 | 600 GB/s | ~6.5 kW | Air | Legacy
HGX H100 | 2022 | 8 | 900 GB/s | ~7 kW | Air / hybrid | Mainstream
HGX H200 | 2024 | 8 | 900 GB/s | ~7 kW | Air / hybrid | Mainstream
GH200 | 2023 | 8 (NVLink-C2C to CPU) | 900 GB/s | ~6 kW | Air / hybrid | Niche
DGX H100 SuperPod | 2023 | 256 (external NVSwitch3) | 900 GB/s | ~32 kW/rack | Hybrid | Rare in production
HGX B200 | 2024 | 8 | 1,800 GB/s | ~14 kW | Liquid | Frontier 2024
GB200 NVL72 | 2024 | 72 | 1,800 GB/s | 132 kW | Liquid (direct-to-chip) | Frontier 2025
GB200 NVL36 | 2024 | 36 | 1,800 GB/s | ~65 kW | Liquid | Volume
GB300 NVL72 | 2026 | 72 | 1,800 GB/s | ~140 kW | Liquid | Q2 2026+
GB200 SuperPod (8 racks) | 2025 | 576 | 1,800 GB/s | ~1,100 kW total | Liquid | Q3 2025+
Rubin Ultra | 2026 | 576+ | 3,600 GB/s | TBD | Liquid + CPO | Q4 2026+

Liquid cooling: CDU, rear-door HX, direct-to-chip

NVL72 changed cooling from a comfortable engineering problem into a hard one. 132 kW in 42U is more than 3× the air-cooled limit of a standard rack.

Direct liquid-to-chip (DLC, the NVL72 baseline)

NVL72 ships with cold plates bonded directly to the B200 dies, NVSwitch4 chips, and Grace CPUs. Coolant (a treated water-glycol mix, typically 25% propylene glycol) flows at 1.5–3 L/s through the rack at supply temperatures of 30–45 °C. Return temperatures: 45–60 °C. Heat capture: 95%+ — meaning only ~5% of the rack's heat exits as air; this is why NVL72 racks deploy in rows without conventional air-cooled CRAC units in the immediate vicinity.
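The flow rate and temperature figures above can be cross-checked with the basic heat balance Q = ṁ · c_p · ΔT. A rough sketch, assuming approximate properties for a 25% propylene glycol mix (c_p ≈ 3.9 kJ/kg·K, density ≈ 1.02 kg/L); these are textbook-order values, not vendor data:

```python
def coolant_flow_l_per_s(heat_kw: float,
                         delta_t_k: float,
                         cp_kj_per_kg_k: float = 3.9,      # assumed, ~25% propylene glycol
                         density_kg_per_l: float = 1.02):  # assumed
    """Volumetric coolant flow needed to carry `heat_kw` at a given temperature rise."""
    mass_flow_kg_per_s = heat_kw / (cp_kj_per_kg_k * delta_t_k)
    return mass_flow_kg_per_s / density_kg_per_l


liquid_heat_kw = 132 * 0.95   # 95% of a 132 kW rack captured in liquid
flow = coolant_flow_l_per_s(liquid_heat_kw, delta_t_k=15)
print(f"{flow:.2f} L/s at a 15 K rise")
```

With a 15 K rise between supply and return, the required flow sits inside the 1.5–3 L/s range quoted above. Halving the temperature rise doubles the flow, which is why supply and return temperatures are a facility-design decision and not just a rack setting.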

Coolant Distribution Unit (CDU)

The interface between the rack and the facility's chilled water loop. A CDU pulls facility water (12 °C typical supply), transfers heat through a brazed-plate heat exchanger to the rack's closed liquid loop, and pumps the rack-side fluid. Sized typically 200–400 kW per CDU; one CDU serves 1–3 NVL72 racks depending on configuration. Modern CDUs include leak detection, particulate filtration, and conductivity monitoring (a high reading suggests contamination).

Rear-door heat exchanger (RDHX)

An older / hybrid pattern. Air leaves the back of an HGX H100 / B200 rack at 50–60 °C, passes through a finned heat exchanger embedded in the rear door, exits the door at ~25 °C. Captures ~70–85% of rack heat into liquid. RDHX is the path for retrofitting existing datacenters to support 30–70 kW racks (HGX B200, GB200 NVL36) without full DLC plumbing.

Immersion cooling

Less common at NVL72 scale but used in some niche deployments. Two-phase immersion (3M Novec or similar dielectric) submerges the boards entirely; the fluid boils at chip-junction temperatures and condenses on a top coil. Pros: extremely high heat capture, no per-chip cold plate manufacturing. Cons: maintenance complexity, fluid cost ($1,000+/kg historically though prices dropping), regulatory uncertainty around PFAS chemicals (3M committed to exit production by end-2025, accelerating shift to alternative fluids).

Datacenter implications

Pre-NVL72 datacenters typically supported 8–15 kW per rack with air cooling, occasional 20–30 kW with rear-door HX. NVL72 at 132 kW requires:

  • Liquid infrastructure to the rack (supply + return manifolds, isolation valves, leak sensors).
  • Adequate chilled water capacity at the facility (each NVL72 ≈ 37.5 tons of cooling at 132 kW; 1 ton ≈ 3.517 kW).
  • Floor structural capacity (NVL72 weighs ~1,400 kg fully loaded).
  • Power density (132 kW per rack means PDU and busbar sized accordingly).

Major colos retrofitted for liquid through 2024–2025: Equinix LD11/AM11, Digital Realty multiple sites, Iron Mountain Northern Virginia, NTT Hillsboro. Hyperscalers built greenfield (Microsoft Mt Pleasant WI, Meta Richland Parish LA, AWS Ohio expansion). Tier-2 cloud providers (CoreWeave, Lambda, Crusoe, Together) leased liquid-ready space at premium $/MW rates.

Power-to-cooling design

Rule of thumb 2026: 1.0 W of IT load costs ~1.10 W at the facility meter (PUE 1.10 for direct-to-chip facilities), i.e. about 0.10 W of cooling and power-distribution overhead per IT watt. Pre-NVL72 air-cooled datacenters ran PUE 1.4–1.6. The shift to DLC reduces facility-side energy meaningfully, one of the few things that's actually getting more efficient as AI scales.
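Both the cooling-capacity and PUE figures reduce to unit conversions. A small sketch, using the standard definition of a ton of refrigeration (12,000 BTU/h ≈ 3.517 kW) and treating PUE as total facility power divided by IT power; the function names are illustrative:

```python
KW_PER_TON = 3.517   # 1 ton of refrigeration = 12,000 BTU/h


def cooling_tons(heat_kw: float) -> float:
    """Cooling capacity, in tons of refrigeration, needed to reject `heat_kw`."""
    return heat_kw / KW_PER_TON


def facility_kw(it_kw: float, pue: float) -> float:
    """Total facility draw for a given IT load and PUE."""
    return it_kw * pue


rack_kw = 132
print(f"{cooling_tons(rack_kw):.1f} tons of cooling per rack")
for pue in (1.10, 1.50):
    total = facility_kw(rack_kw, pue)
    print(f"PUE {pue:.2f}: {total:.1f} kW at the meter, {total - rack_kw:.1f} kW overhead")
```

At PUE 1.10 a 132 kW rack carries about 13 kW of overhead, against roughly 66 kW at PUE 1.5. Across hundreds of racks that difference is a substantial share of the site power budget.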


Cabling and reliability: cabled vs PCB-embedded NVLink

The dirty secret of rack-scale fabrics: a lot of money goes to cables, and cables fail.

PCB-embedded NVLink (HGX, intra-baseboard)

The 8-GPU HGX baseboard runs NVLink across PCB traces — no separate cables. Reach: ~30 cm max. Reliability: very high (PCB traces don't unplug themselves). Effectively zero ongoing cable maintenance.

Cabled NVLink (NVL72 internal, between trays)

NVL72 connects 18 compute trays to 9 NVSwitch trays via cables, because the trays are physically separate within the rack. Each tray has multiple NVLink port connectors; ~5,000 individual NVLink connections per rack. Cables: short passive copper twinax, i.e. direct attach copper (DAC), with 2–3 m max reach within the rack. NVIDIA has cited roughly 2 miles (~3.2 km) of NVLink cabling per rack.

Failure rate: industry rule of thumb is 10–100 cable-related events per rack per year, mostly transient (link goes down, retransmit, recovers). Hard failures (cable replacement required) are rarer: single digits per rack per year. Field service is part of running NVL72 racks.

External NVLink (SuperPod cross-rack)

When connecting multiple NVL72 racks into a SuperPod (576 GPUs), the inter-rack NVLink uses active optical cables (AOC) or active electrical cables (AEC). AOC has longer reach (up to 30 m typical, 100 m specialty) but adds 5–10 ns of latency per direction and costs $2,000–$5,000 per cable. AEC retimes the signal at the cable ends and reaches 7–10 m at lower cost.

Co-packaged optics (Rubin era)

NVIDIA disclosed co-packaged optics (CPO) for Rubin — the optical transceiver moves from a pluggable cage onto the chip package itself, reducing the electrical-to-optical transition distance to millimetres. Benefits: lower power per bit (3–5× reduction), lower latency, higher density. Costs: thermal complexity (lasers don't like 85 °C), manufacturing complexity, no field-pluggability.

CPO production volumes 2026–2028 will be the limiting factor for the next NVSwitch generation. Roadmaps from Intel, Broadcom, NVIDIA, Marvell all show CPO ramping through 2027.

Cabled UALink and Ultra Ethernet (the alternative camp)

UALink (Ultra Accelerator Link consortium, formed mid-2024, v1.0 spec released 2025) targets the same scale-up problem as NVLink but as an open standard backed by AMD, Broadcom, Cisco, Intel, Meta, Microsoft. UALink v1.0 spec: 200 Gbps per lane, 64 GB/s per link bidirectional, scales to 1024-accelerator domains. Same general approach as NVLink: load/store semantics over a switched fabric.

UALink targets the 2026–2027 product cycle. AMD's MI355X (Q4 2025) implements pre-spec Infinity Fabric scale-up; MI400X (2027) targets full UALink. The political picture: the non-NVIDIA accelerator vendors are aligned around UALink as the way to avoid NVLink lock-in.

Ultra Ethernet Consortium (UEC, formed 2023) is the inter-node parallel — re-engineering Ethernet for AI workloads with packet trimming, congestion control suited to RDMA, and 800G/1.6T line rates. Ships through 2025–2027 across vendors. Complementary to UALink, not competitive.

Reliability math

A 72-GPU NVLink fabric has ~5000 internal connections. At a per-connection annual failure rate of 1e-3 (optimistic), expected failures per rack-year: 5. Real-world rates skew higher; large-scale operators (Microsoft, Meta) cite tens of NVLink-related events per rack-year, most auto-recovered. Plan field service capacity accordingly; have spare NVL72 inventory; design training jobs with checkpointing aggressive enough to survive rack-level events.
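The expected-failure figure is a product, and the chance of a rack getting through a year with no connection failure at all follows from a Poisson approximation. A minimal sketch, assuming independent failures at a constant rate, which real cable failures only roughly satisfy:

```python
import math


def expected_failures(connections: int, annual_rate: float, years: float = 1.0) -> float:
    """Expected number of connection failures over the period."""
    return connections * annual_rate * years


def p_at_least_one(connections: int, annual_rate: float, years: float = 1.0) -> float:
    """Poisson approximation of the probability of one or more failures."""
    return 1.0 - math.exp(-expected_failures(connections, annual_rate, years))


per_rack = expected_failures(5000, 1e-3)
print(f"{per_rack:.1f} expected failures per rack-year")
print(f"P(at least one in a year)  = {p_at_least_one(5000, 1e-3):.3f}")
print(f"P(at least one in a month) = {p_at_least_one(5000, 1e-3, 1 / 12):.3f}")
```

Even at the optimistic per-connection rate, a rack is close to certain to see at least one connection event per year and has roughly a one-in-three chance in any given month. Multiply by the rack count to size the field-service queue.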


TPU pods, Cerebras WSE-3, and non-NVIDIA scale-up

NVIDIA isn't the only path to rack-scale fabrics. The alternatives matter.

Google TPU Trillium (v6e, 2024) and Ironwood (v7, late 2025)

Google's TPU pods use ICI (Inter-Chip Interconnect), a 3D torus or 2D mesh of optical links between TPU dies. Trillium (TPU v6e, the successor to v5e/v5p): 256-chip "pod" = full ICI mesh, 8960-chip "superpod" = multiple pods linked via DCN (optical inter-pod). Per-chip ICI: 3.4 TB/s aggregate. Ironwood (TPU v7, late 2025): 4.6 TB/s per chip ICI, 9216-chip superpods, optical switching.

ICI's distinctive feature: the topology is a fixed 3D torus, not a flat NVLink-style any-to-any fabric. Collectives that map well to a torus (allreduce via 2D ring decomposition) work great; arbitrary all-to-all is more constrained. JAX/XLA compilers know how to lay out collectives for ICI; PyTorch on TPU was never first-class.

Pod scale matters: Gemini 2.5 Pro training reportedly used multi-superpod (50,000+ TPU v5p) configurations through 2024–2025. Reasoning model variants reportedly trained on Ironwood.

Amazon Trainium2 (Dec 2024)

Trainium2 instances use NeuronLink, AWS's chip-to-chip fabric. 64-chip "Trn2 UltraServer" gives a flat NeuronLink domain of 64 Trainium2 chips. Per-chip NeuronLink: 1.8 TB/s. Larger configurations use EFAv3 (AWS's RDMA fabric) between UltraServers. Anthropic Claude training in 2024–2025 reportedly used Trainium and Trainium2 at scale (Project Rainier).

Cerebras WSE-3 (2024)

Different paradigm entirely. WSE-3 is a wafer-scale chip — one piece of silicon ~46,225 mm² with 900,000 cores and 44 GB on-chip SRAM. There is no NVLink equivalent because there are no separate GPUs to link inside a "node" — the node is a chip. Inter-WSE communication uses SwarmX, Cerebras's external fabric, with much lower bandwidth than the on-wafer mesh.

Implications: WSE-3 wins on model-parallel workloads where activations stay on-wafer (LLM training with models sized to fit on one wafer or a few). It loses on inference economics relative to GPUs because the wafer is dedicated; you can't share it across small models efficiently. Production deployments: G42's Condor Galaxy clusters, several R&D-heavy AI labs.

Groq LPU, SambaNova RDU, Tenstorrent

The long tail. Groq LPU optimises inference latency with deterministic dataflow; scale-up via the GroqRack pattern (a deterministic interconnect across 8 LPUs). SambaNova RDU uses reconfigurable dataflow and a Cardinal interconnect. Tenstorrent Wormhole and Blackhole use Ethernet-based scale-up with explicit topology awareness. All viable in specific use cases; none have NVIDIA-scale ecosystem.

Comparing scale-up domains

| Platform | Scale-up domain | Per-chip BW | Topology | Notes |
|---|---|---|---|---|
| NVIDIA GB200 NVL72 | 72 GPUs | 1.8 TB/s | Flat (switched) | Frontier 2025 |
| NVIDIA GB200 SuperPod | 576 GPUs | 1.8 TB/s | 2-level NVLink | Largest single NVLink domain |
| Google TPU Trillium | 256 in pod | 3.4 TB/s | 3D torus (ICI) | Up to 8960 in superpod |
| Google TPU Ironwood | 256 in pod | 4.6 TB/s | 3D torus (ICI) | Up to 9216 in superpod |
| AWS Trainium2 UltraServer | 64 chips | 1.8 TB/s | NeuronLink | Anthropic deployments |
| AMD MI355X | 8 GPUs (Pollara) | 1.6 TB/s | Infinity Fabric | UALink pre-spec |
| Cerebras WSE-3 | 1 wafer (no scale-up in same sense) | n/a (on-wafer) | On-wafer mesh | Wafer-scale paradigm |

Collective performance with topology: ring, tree, hierarchical, SHARP

Topology determines algorithm choice. The same allreduce on the same hardware runs at radically different effective bandwidth depending on the algorithm NCCL picks.

Ring algorithm

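The tensor is split into N chunks. In a reduce-scatter pass, each GPU forwards one chunk per step to its neighbour, which adds it to its own copy; after N-1 steps every GPU holds one fully reduced chunk. An all-gather pass of another N-1 steps then circulates the reduced chunks to everyone. Bandwidth utilisation: (N-1)/N of link bandwidth for large tensors. Latency: 2(N-1) hops. Ring is optimal for large reductions on a single-tier fabric and is used inside an HGX 8-GPU island for any tensor large enough to amortise the 14-hop latency.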
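The two passes of the ring algorithm are easier to follow in a toy simulation. The sketch below is a pure-Python model, not NCCL code: it assumes each rank's buffer has exactly N one-element chunks and performs a sum reduction, and `ring_allreduce` is an illustrative name.

```python
def ring_allreduce(buffers):
    """Simulate a sum allreduce over a ring of N ranks.

    buffers[r] is rank r's data and must have length N (one element per chunk).
    Returns the final per-rank data and the number of communication steps.
    """
    n = len(buffers)
    data = [list(b) for b in buffers]
    steps = 0

    # Pass 1: reduce-scatter. At step s, rank r sends chunk (r - s) % n to
    # rank r + 1, which accumulates it. After n - 1 steps, rank r holds the
    # fully reduced chunk (r + 1) % n.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value
        steps += 1

    # Pass 2: all-gather. Each rank forwards the reduced chunk it most
    # recently completed or received; the receiver overwrites its copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value
        steps += 1

    return data, steps


# 8 ranks, rank r contributes the value r + 1 in every chunk.
result, steps = ring_allreduce([[rank + 1] * 8 for rank in range(8)])
print(result[0], steps)  # every rank ends with the sum 36 in each chunk, after 14 steps
```

Each rank sends 2(N-1) chunks of size 1/N, which is where both the 2(N-1) hop count and the (N-1)/N bandwidth factor come from.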

Tree algorithm

Reduce up a binary tree, broadcast back down. Latency: 2·log₂(N) hops — much lower than ring for small tensors. Bandwidth utilisation: lower than ring (effective bandwidth caps at link/2 because each node sends and receives simultaneously on one link). Tree wins for small tensors (sub-MB) where latency dominates.
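The hop-count formulas for ring and tree can be compared directly. A minimal sketch using the two expressions from the text; rounding log₂(N) up for rank counts that are not powers of two is an assumption made here for illustration:

```python
import math


def ring_hops(n: int) -> int:
    """Latency of ring allreduce in hops: 2(N-1)."""
    return 2 * (n - 1)


def tree_hops(n: int) -> int:
    """Latency of tree allreduce in hops: 2*log2(N), rounded up (assumption)."""
    return 2 * math.ceil(math.log2(n))


for n in (8, 72, 576):
    print(n, ring_hops(n), tree_hops(n))
# 8 ranks: 14 vs 6 hops; 72 ranks: 142 vs 14; 576 ranks: 1150 vs 20
```

The gap grows linearly with N for ring and only logarithmically for tree, which is why the latency argument for tree gets stronger as the domain grows from 8 to 72 GPUs.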

Hierarchical (CollNet-style)

Combine: ring inside each rack, tree across racks. NCCL with CollNet does this automatically when the topology has multiple tiers. For a 72-GPU NVL72 + 8-rack SuperPod (576 GPUs), hierarchical algorithm uses NVLink ring within rack and InfiniBand tree across racks — getting bandwidth-saturating throughput inside and latency-efficient routing across.

SHARP (in-network reduction)

NVSwitch3+ supports SHARPv2; NVSwitch4 supports SHARPv3 with FP8/FP16. The switch does the addition: each GPU sends its slice; the switch sums on-the-fly; the result is broadcast back. Bandwidth: approaches 2× the per-GPU link bandwidth for small-to-medium reductions because the GPU sends once and receives once (vs ring's "send N-1 times").

Measured impact (NCCL benchmarks, GB200 NVL72, May 2026): allreduce of 64 MB FP16 tensor across 72 GPUs:

  • Ring: ~1.6 TB/s effective bus bandwidth
  • Tree: ~1.1 TB/s effective bus bandwidth, lower latency
  • SHARPv3 in-network: ~3.0 TB/s effective bus bandwidth

The roughly 2× SHARP advantage shows up clearly in training step time for medium-batch jobs. For very large tensors (>1 GB) the gap narrows: ring's per-hop latency is amortised over more data, while the switch's aggregation engines have finite capacity.

NCCL topology hints

NCCL discovers the PCIe/NVLink topology itself at init (nvidia-smi topo -m prints a comparable view for inspection) and picks an algorithm. Key knobs:

  • NCCL_ALGO=Ring,Tree,CollNet,NVLS — force algorithm
  • NCCL_PROTO=Simple,LL,LL128 — protocol (LL for tiny tensors, Simple for large)
  • NCCL_NVLS_ENABLE=1 — enable SHARPv3 (NVLink SHARP)
  • NCCL_COLLNET_ENABLE=1 — enable hierarchical collnet
  • NCCL_TOPO_FILE=/path/to/topo.xml — manual override

For NVL72 production deployments, the default NCCL 2.21+ behavior auto-detects NVSwitch4 and enables NVLS for SHARPv3 — recommended. Force-disable only when debugging.
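These knobs are plain environment variables, so a launcher can set them per job. A minimal sketch of a helper that builds such an environment, using only the variable names listed above; the `nccl_env` helper and its parameters are hypothetical, and the values must be in place before the training process initialises NCCL:

```python
import os


def nccl_env(algo=None, proto=None, nvls=None, collnet=None, topo_file=None, base=None):
    """Return a copy of the environment with the requested NCCL overrides.

    Arguments left as None are not set, so NCCL keeps its own defaults.
    """
    env = dict(os.environ) if base is None else dict(base)
    if algo is not None:
        env["NCCL_ALGO"] = algo
    if proto is not None:
        env["NCCL_PROTO"] = proto
    if nvls is not None:
        env["NCCL_NVLS_ENABLE"] = "1" if nvls else "0"
    if collnet is not None:
        env["NCCL_COLLNET_ENABLE"] = "1" if collnet else "0"
    if topo_file is not None:
        env["NCCL_TOPO_FILE"] = topo_file
    return env


# Debugging run: force ring and switch NVLS off, leave everything else alone.
debug_env = nccl_env(algo="Ring", nvls=False, base={})
print(debug_env)
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching the job, which keeps debugging overrides out of the shared shell profile.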

All-gather, reduce-scatter, all-to-all

  • All-gather on NVL72: ring-based, ~1.5 TB/s effective for large tensors.
  • Reduce-scatter on NVL72: similar to allreduce in cost, ring-based; SHARP doesn't help (no reduction across all endpoints).
  • All-to-all on NVL72: critical for MoE expert routing. NVL72 sustains ~1.2 TB/s aggregate per-direction across 72 GPUs — enough to run DeepSeek-V3-scale expert routing inside one rack.

For the deep dive on NCCL tuning, see NCCL tuning in plain English.


Expert parallelism and pipeline parallelism across racks

The parallelism layout question dominates frontier model deployment. Here's how it plays out on 2026 hardware.

Expert parallelism (EP) — when MoE needs more than one rack

DeepSeek-V3 (December 2024) has 256 experts, 8 active per token. The all-to-all expert-routing collective is the bottleneck. At 1 rack (72 GPUs), each GPU hosts 3–4 experts; the all-to-all stays inside the NVLink domain at 1.2 TB/s effective. Latency: ~1 ms for a typical batch.
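The "3–4 experts per GPU" figure comes from dividing 256 experts over 72 GPUs. A minimal sketch, assuming a plain even split that ignores shared experts and any redundant expert copies (the function name is illustrative):

```python
def experts_per_gpu(num_experts: int, num_gpus: int) -> list:
    """Spread experts as evenly as possible; the first `extra` GPUs get one more."""
    base, extra = divmod(num_experts, num_gpus)
    return [base + 1 if gpu < extra else base for gpu in range(num_gpus)]


counts = experts_per_gpu(256, 72)
print(counts.count(4), counts.count(3))  # 40 GPUs host 4 experts, 32 GPUs host 3
```

Because 256 is not a multiple of 72, the split is uneven, so some GPUs carry a third more expert weights and routed traffic than others unless the placement is balanced some other way.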

Llama-4 MoE variants (released 2025) push to 128 experts, 4 active. With fewer experts but larger per-expert capacity, the routing collective is smaller — fits comfortably in 1 NVL72 rack.

When does EP escape a single rack? Two regimes:

  1. Very large MoE (1024+ experts, hypothesised next-gen designs). The expert population doesn't fit in 72 GPUs of HBM at any reasonable expert size.
  2. Combined EP + TP at very high batch where TP needs to span more than 8 GPUs and EP needs the remaining domain.

For both, multi-rack EP via NVLink-Switch System (576-GPU SuperPod) keeps routing at NVLink speed. Without the SuperPod, expert routing across InfiniBand hits 5–10× higher latency and cuts effective MoE throughput by 30–50%.

Pipeline parallelism (PP) across racks

PP tolerates inter-rack latency better than TP or EP. Pipeline stages communicate sequentially (activations forward, gradients backward) — bandwidth requirements are activation × micro-batch, much smaller than allreduce volumes. PP routinely spans racks via InfiniBand without throughput hit.

Llama-3-405B (Meta, July 2024) trained on 24,576 H100 GPUs (Meta's published number). The reported parallelism layout: TP=8 (inside HGX), PP=16 (across racks), DP=192 (across PP groups). Inter-rack traffic dominated by DP allreduce (handled by InfiniBand) and PP activations (sequential, low-bandwidth).
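A parallelism layout is only valid if its degrees multiply out to the GPU count. A minimal check using the figures quoted above (the helper names are illustrative):

```python
def world_size(tp: int, pp: int, dp: int) -> int:
    """Total GPUs needed for a TP x PP x DP layout."""
    return tp * pp * dp


def gpus_per_replica(tp: int, pp: int) -> int:
    """GPUs holding one full copy of the model."""
    return tp * pp


total = world_size(tp=8, pp=16, dp=192)
replica = gpus_per_replica(tp=8, pp=16)
print(total, replica)  # 24576 GPUs in total, 128 GPUs per model replica
```

With TP=8 each tensor-parallel group fits inside one 8-GPU HGX node, so only the PP and DP traffic has to leave the node.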

Tensor parallelism (TP) — capped by NVLink domain

TP requires per-layer allreduce of activations. Bandwidth scales with activation size × layers; any TP that crosses NVLink to InfiniBand drops throughput sharply. Pre-NVL72: TP capped at 8 (inside HGX). NVL72 era: TP can extend to 16, 32, or 64 inside one rack — enabling larger per-layer models (e.g., 70B+ dense models that previously needed PP to fit can now use TP-only at the cost of some HBM pressure).

DeepSeek-V3's technical report describes training on H800-class hardware in 2024 with pipeline, expert and data parallelism and no TP at all; the 2025–2026 generation may use TP more heavily as NVL72 deploys.

Combined parallelism: a frontier-model recipe

A 1T-parameter MoE model on a 576-GPU NVL72 SuperPod (8 racks × 72 GPUs):

  • TP=8 (inside HGX-style 8-GPU island; activations stay tight)
  • EP=72 (expert routing inside one NVL72 rack; full-rack all-to-all at NVLink speed)
  • PP=8 (across racks; activations sequential via InfiniBand or external NVLink)
  • DP=multiple replicas as fits

Aggregate: enough to fit the 1T params (params + opt states + activations) and run with high MFU. The detailed math is in the sizing exercise section.

Failure-blast-radius math

Single GPU failure: stops a TP group, recoverable via DP replica. Single NVSwitch failure: degrades one HGX or partial NVL72 fabric; may halt the rack. Single NVL72 rack failure: 72 GPUs unavailable; PP stage missing; restart pipeline. Single SuperPod failure: 576 GPUs unavailable; large job affected.

Typical mitigation: aggressive checkpointing (every N minutes), warm spare GPUs (5–10% spare capacity), automatic re-discovery and re-route in distributed training framework. See checkpoint storage and recovery.


NCCL on NVLink: protocols, channels, and what to tune

NCCL's behaviour on NVL72-class hardware differs from 8-GPU HGX in ways worth understanding before tuning.

Protocols: LL, LL128, Simple

  • LL (Low Latency). 8-byte chunks with inline flag bits. Best for tensors <1 MB; pays a 50% bandwidth tax (half the payload is flag/sync data). Used automatically for small tensors.
  • LL128. 128-byte chunks, more efficient than LL. Best for 1–16 MB. Available on NVLink 3+.
  • Simple. No inline flags; full bandwidth. Best for tensors >16 MB. The decision point shifts with NVLink generation — NVL72 with SHARPv3 prefers Simple at lower thresholds than older generations.

Channels

NCCL splits a tensor across multiple parallel "channels", each running an independent ring or tree. More channels = more parallelism = better latency-hiding but higher per-channel overhead. NVL72 defaults to ~28 channels for ring; HGX H100 defaults to 16. NCCL_MIN_NCHANNELS and NCCL_MAX_NCHANNELS bound the count; the default is usually correct.

NVLS / NVLSTree (NVLink SHARP)

The 2024+ algorithms that use SHARPv3. NVLS for allreduce, NVLSTree for hierarchical allreduce across multi-rack SuperPods. Enabled by default on NVSwitch4+; gated by NCCL_NVLS_ENABLE. Provides the 2× allreduce speedup mentioned in the SHARP section.

Network buffer sizes

NCCL_BUFFSIZE (default 4 MB on H100, 8 MB on B200 in NCCL 2.21+) controls staging buffer per channel. Larger buffers help large-tensor throughput; smaller saves HBM. The HBM tax of NCCL is ~100 MB per GPU per ring at default — usually noise relative to model state, occasionally matters for memory-tight serving.

Topology files

For non-standard configurations (NVL72 with disabled GPUs, SuperPod with mixed rack generations), provide NCCL_TOPO_FILE with explicit graph. Auto-detection works for standard NVL72; deviation requires manual config. The XML schema is documented in the NCCL repository.

Debugging hangs

NCCL_DEBUG=INFO logs init and per-collective decisions. NCCL_DEBUG_SUBSYS=COLL filters to collective execution. For hangs, py-spy dump on each process plus the NCCL log usually identifies the stuck collective and the slowest endpoint. See NCCL tuning for the full playbook.

Common NVL72 NCCL pitfalls

  • Forgetting to enable NVLS. Some early NCCL versions had it off by default; check version (need 2.20+) and env var.
  • Cross-rack TP. Scheduler accidentally splits a TP group across two NVL72s; throughput drops 5×. Use topology-aware scheduling.
  • Mismatched NCCL versions across the cluster. Always upgrade in lockstep.
  • Memory pressure from too-large NCCL_BUFFSIZE × many channels × many devices. Tune buffer size per workload.

Benchmark numbers: real measurements from NVL72 and prior generations

Concrete numbers from public sources (MLPerf v4.0, v4.1, NVIDIA technical blogs, customer case studies) — not theoretical peak.

MLPerf Training v4.0 / v4.1 highlights

  • GPT-3 175B training on 11,616 H100s (MLPerf v4.0): 3.5 minutes to reference convergence; per-GPU MFU ~50%.
  • Llama-2-70B fine-tuning on 1,024 H100s (MLPerf v4.1): 0.7 minutes; ~52% MFU.
  • Llama-2-70B fine-tuning on 64 GB200 (MLPerf v4.1 closed): 1.2 minutes — 3.5× per-GPU throughput vs H100 due to NVLink 5 + B200 compute.
  • Stable Diffusion v2 training on 1,024 H100s: 1.3 minutes — bandwidth-bound on cross-node allreduce.

NCCL allreduce micro-benchmarks (nccl-tests)

NVL72 (72 × B200, NVSwitch4 + SHARPv3):

| Tensor size | Ring busBW | SHARPv3 busBW | Latency (small) |
|---|---|---|---|
| 1 MB | 0.4 TB/s | 0.9 TB/s | 18 µs |
| 16 MB | 1.2 TB/s | 2.6 TB/s | 35 µs |
| 256 MB | 1.7 TB/s | 3.1 TB/s | 165 µs |
| 4 GB | 1.8 TB/s | 2.9 TB/s | 2.4 ms |

HGX H100 (8 × H100, NVSwitch3):

| Tensor size | Ring busBW | Latency (small) |
|---|---|---|
| 1 MB | 0.25 TB/s | 14 µs |
| 16 MB | 0.55 TB/s | 32 µs |
| 256 MB | 0.7 TB/s | 380 µs |
| 4 GB | 0.72 TB/s | 5.8 ms |
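The busBW columns in these tables use the nccl-tests convention, where allreduce bus bandwidth is the algorithm bandwidth (bytes divided by time) scaled by 2(N-1)/N so that results are comparable across rank counts. A minimal sketch of that conversion; the sample measurement below is hypothetical and not taken from the tables:

```python
def allreduce_bus_bw(size_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth in bytes/s for an allreduce, per the nccl-tests convention."""
    alg_bw = size_bytes / seconds
    return alg_bw * 2 * (n_ranks - 1) / n_ranks


# Hypothetical run: 1 GB allreduce finishing in 2 ms on 8 ranks.
bus_bw = allreduce_bus_bw(1e9, 2e-3, 8)
print(bus_bw / 1e9)  # 875 GB/s (algorithm bandwidth of 500 GB/s times 1.75)
```

When comparing your own nccl-tests output against published figures, check which of the two bandwidths is being quoted; they differ by almost 2× at large rank counts.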

Inference latency benchmarks

Public Llama-70B-Instruct serving on 8 × H100 vs 1 NVL72:

  • HGX H100, TP=8, batch=8: 35 tokens/sec/user
  • NVL72, TP=8 in NVLink island within rack, batch=8: 105 tokens/sec/user
  • NVL72, TP=8 in rack + multiple replicas via DP: throughput proportional to GPUs allocated

The 3× per-user improvement comes from B200 compute (2.5×) plus NVLink 5 bandwidth (2×) interacting in TP communication.

Storage bandwidth requirements at scale

Checkpoint write for a 405B model with optimiser state (~5 TB at BF16) needs:

  • 60-second target: 83 GB/s sustained aggregate.
  • 30-second target: 167 GB/s.
  • Per node (24-node training): 3.5–7 GB/s. Within range of NVMe / parallel filesystem.
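The bandwidth targets above are the checkpoint size divided by the write window, then by the node count. A minimal sketch of that arithmetic, using decimal units (1 TB = 1000 GB); the function name is illustrative:

```python
def checkpoint_bandwidth(size_tb: float, window_s: float, nodes: int):
    """Return (aggregate GB/s, per-node GB/s) needed to write a checkpoint in time."""
    aggregate = size_tb * 1000.0 / window_s
    return aggregate, aggregate / nodes


for window in (60, 30):
    aggregate, per_node = checkpoint_bandwidth(5.0, window, 24)
    print(window, round(aggregate, 1), round(per_node, 2))
# 60 s: about 83 GB/s aggregate, about 3.5 GB/s per node
# 30 s: about 167 GB/s aggregate, about 6.9 GB/s per node
```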

Inference KV-cache loads (for multi-tenant serving) need ~50–200 GB/s per inference node depending on traffic pattern. Local NVMe handles this; remote storage requires careful tiering.


NVL72 day-2 operations: what running this hardware actually looks like

The product brochure ends at "rack arrives." Day-2 operations are where production realities live.

Bring-up

  1. Physical install. 1,400 kg per rack; floor reinforcement check; liquid plumbing tie-in; busbar connection. Typical install: 1–3 days per rack with vendor assistance.
  2. Coolant fill and pressure test. ~200 L of treated propylene-glycol mix per rack; vacuum-pump-and-fill to remove air; pressure-test to 6 bar; leak-check.
  3. Power-on sequence. Per-shelf power-on with thermal ramp; firmware boot; NVSwitch fabric initialisation; topology discovery.
  4. NCCL/NVLink sanity. nvidia-smi topo -m validates topology; nccl-tests at 4 MB to 4 GB tensor sizes verifies allreduce bandwidth matches spec ±5%.
  5. Burn-in. 72-hour stress test (typically full-mesh allreduce loops at high tensor sizes plus matrix-multiply workload to thermally exercise GPUs). Surfaces infant-mortality failures.

Monitoring telemetry

Production NVL72 deployments stream:

  • Per-GPU. Temperature (junction + memory), power, clock, ECC error counts, NVLink link state.
  • Per-NVSwitch. Port state, error counts, throughput per port, SHARP aggregation counters.
  • Per-tray. Coolant flow rate, temperature in/out, power draw.
  • Per-rack. CDU coolant supply/return temperature, pressure differential, leak sensors.

Volume: ~50–200 metrics per second per rack. Storage: Prometheus or VictoriaMetrics for time-series, Grafana for visualisation, alerting on any link-state change, ECC error rate spikes, thermal excursions.

Failure handling

  • Single GPU degradation. ECC errors above threshold or thermal events → mark GPU offline, retire from training jobs. Spare-GPU swap during maintenance window. Failure rate: ~1% annualised per GPU.
  • NVLink port flap. Transient link errors auto-recover via retransmit; logged. If a port flaps >N times/hour, mark suspect; investigate cable / connector.
  • NVSwitch failure. Rare but happens. Reduces fabric bandwidth (partial mesh); may stop the rack depending on the switch role. Field swap with vendor assistance.
  • Coolant event. Leak detection triggers automatic isolation valve closure; rack powers down to safe state. Field service required.

Maintenance windows

Frontier datacenters target 99.5%+ rack-uptime. Planned maintenance windows (firmware updates, NCCL upgrades, OS patches): ~4 hours per quarter typical, with workloads migrated to spare capacity. Unplanned maintenance: 1–4 hours per quarter for typical hardware events.
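It is worth checking that the maintenance hours above fit inside a 99.5% uptime target. A minimal sketch, assuming a quarter is exactly one fourth of a 365-day year (the helper names are illustrative):

```python
HOURS_PER_QUARTER = 24 * 365 / 4  # 2190 hours


def availability(downtime_hours_per_quarter: float) -> float:
    """Fraction of the quarter the rack is up."""
    return 1.0 - downtime_hours_per_quarter / HOURS_PER_QUARTER


def downtime_budget_hours(target_availability: float) -> float:
    """Downtime allowed per quarter for a given availability target."""
    return (1.0 - target_availability) * HOURS_PER_QUARTER


best = availability(4 + 1)    # 4 h planned + 1 h unplanned
worst = availability(4 + 4)   # 4 h planned + 4 h unplanned
budget = downtime_budget_hours(0.995)
print(round(best * 100, 2), round(worst * 100, 2), round(budget, 2))
# roughly 99.77% and 99.63% availability against a budget of about 11 hours per quarter
```

Five to eight hours of combined downtime leaves a few hours of margin per quarter before the 99.5% target is breached.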

Total cost of ownership

Approximate TCO for 1 × GB200 NVL72 rack over 4 years (2025 procurement):

  • Hardware: $3–4M
  • Power (132 kW × 4 years × $0.08/kWh): ~$370k
  • Cooling (10% PUE overhead): ~$37k
  • Datacenter rent (1 rack × 4 years × $20k/yr): ~$80k
  • Maintenance + support (15% of hardware/year): ~$1.8–2.4M
  • Total: ~$5.3–6.9M over 4 years

Per-GPU per-hour amortised cost: ~$2.60–$3.40 on a 72-GPU rack at 80% utilisation. Compares to ~$4–6/hour wholesale pricing and ~$8–12/hour retail (CoreWeave, Lambda) for GB200 cloud access. The build-vs-rent math favours rent for short-duration workloads, build for sustained 70%+ utilisation over 24+ months.
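The TCO line items can be recomputed for any hardware price. A minimal sketch that takes the rates from the list above as defaults and treats maintenance as 15% of the hardware price for each of the four years; the function names are illustrative:

```python
HOURS_PER_YEAR = 24 * 365


def rack_tco(hardware_usd, years=4, rack_kw=132.0, usd_per_kwh=0.08,
             cooling_overhead=0.10, rent_per_year=20_000, maintenance_rate=0.15):
    """Total cost of ownership for one rack over `years`, in USD."""
    power = rack_kw * HOURS_PER_YEAR * years * usd_per_kwh
    cooling = power * cooling_overhead
    rent = rent_per_year * years
    maintenance = hardware_usd * maintenance_rate * years
    return hardware_usd + power + cooling + rent + maintenance


def usd_per_gpu_hour(total_usd, gpus=72, years=4, utilisation=0.8):
    """Amortised cost per utilised GPU-hour."""
    return total_usd / (gpus * HOURS_PER_YEAR * years * utilisation)


for hardware in (3e6, 4e6):
    total = rack_tco(hardware)
    print(round(total / 1e6, 2), round(usd_per_gpu_hour(total), 2))
```

Hardware and maintenance dominate the result; power, cooling and rent together are under 10% of the total, so the per-GPU-hour figure is far more sensitive to purchase price and utilisation than to electricity rates.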


Frontier deployments: Meta Llama-3, xAI Colossus, Microsoft, Stargate

Public statements and reported configurations from frontier deployments through 2025–early 2026.

Meta Llama-3-405B (training July 2024)

24,576 H100 80 GB GPUs. Two parallel 24K clusters: one on RoCE (Ethernet RDMA), one on InfiniBand. RoCE-based for the bulk; InfiniBand for higher-bisection experiments. Trained over ~54 days. Inter-node fabric: 400 Gbps per H100. Parallelism: TP=8, PP=16, DP=192 (per Meta engineering blog, July 2024). Power: ~30 MW peak.

xAI Colossus (Memphis, 2024–2025)

Reported scale: 100,000 H100 GPUs initially (2024), expanded to 200,000+ by mid-2025. Network: NVIDIA Spectrum-X (Ethernet-based RDMA, 800 Gbps). Power: stood up using gas turbines on-site due to grid limitations. Built in ~120 days for the initial 100K, according to Elon Musk's public claim.

Microsoft AI infrastructure (2024–2026)

Multiple sites including Mt Pleasant WI, Quincy WA, Phoenix AZ. Frontier model training (GPT-4o, GPT-5) reportedly used 100,000+ H100 / B200 GPUs in single distributed jobs. InfiniBand-based NDR (400 Gbps) inter-node. Specific configurations not publicly detailed but indicated to use HGX H100 / B200 building blocks with NVLink intra-node and InfiniBand inter-node.

Stargate (OpenAI/SoftBank/Oracle, 2025–)

Announced January 2025. Multi-site, multi-year commitment to ~$500B total infrastructure spend. First site (Abilene, TX) under construction targeting ~100,000 B200/B300 GPUs initial population, ramping to 400,000+ over 2026. NVL72-based racks. Liquid cooling throughout. Inter-rack InfiniBand or NVLink-Switch (configuration not publicly detailed). Power: ~1 GW for first site.

Anthropic / AWS Project Rainier

Anthropic's training infrastructure partner is AWS. Project Rainier (announced Dec 2024) targets 400,000+ Trainium2 chips for Anthropic frontier training through 2025–2026. NeuronLink scale-up, EFAv3 scale-out. Total power scale ~700 MW.

Google Gemini infrastructure

Trillium TPU (v6e, 2024) and Ironwood TPU (v7, late 2025) at multi-superpod scale. Reported deployments: 100,000+ TPU v5p for Gemini Ultra; Ironwood for Gemini 2.5 Deep Think and successor reasoning models. ICI within superpods (256+ chips), DCN across superpods.

Tesla Dojo (2024–2025)

Tesla's custom training chip (D1) in Dojo systems. Used for self-driving and Optimus training. Tile-scale architecture analogous to wafer-scale but with separate dies linked by a custom mesh. Deployments smaller than NVIDIA-based competitors but at distinctive thermal/power density.

What the numbers say about topology choice

Frontier training in 2024–2026 split between NVIDIA NVL-class racks (Microsoft, Meta on H100, OpenAI on B200/GB200), Google TPU (Gemini), and AWS Trainium (Anthropic). All three share the pattern: maximum scale-up domain that fits in one rack/pod (72 GPUs / 256 TPUs / 64 Trainium), tiered scale-out across racks (InfiniBand / DCN / EFAv3). The bet on rack-scale fabrics (NVL72, TPU pods, Trainium UltraServers) is the consensus path; the open question is whether UALink + Ultra Ethernet erode NVIDIA's lead through 2026–2028.


Topology-aware scheduling: how Slurm, Kubernetes, and orchestrators get this right (or wrong)

A single misplaced job kills NVL72 economics. Topology-aware scheduling is the operational layer that prevents that.

Slurm with topology plugin

Slurm has supported topology-aware scheduling for years via topology.conf. For NVL72 deployments, the file describes the rack as one switch layer and inter-rack links as upper layers. Jobs that request --nodes=8 --gres=gpu:8 get scheduled with rack affinity; jobs that exceed one rack get placed to minimise upper-tier hops. NVIDIA publishes topology.conf reference templates for NVL72 + SuperPod.

Pitfall: many sites copy the default topology.conf from an older HGX cluster and forget to update for NVL72's 72-GPU domain. Result: TP groups split across racks. Validate the topology file before production deployment.
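One way to catch this pitfall before production is to check a job's rank placement directly. A minimal sketch, assuming consecutive ranks form a TP group and that you can obtain each rank's rack from the scheduler; `tp_groups_spanning_racks` and the `rank_to_rack` list are hypothetical, not part of Slurm or NCCL:

```python
def tp_groups_spanning_racks(rank_to_rack, tp_size):
    """Return the indices of TP groups whose ranks land in more than one rack.

    rank_to_rack[r] is the rack ID hosting global rank r; ranks
    [g*tp_size, (g+1)*tp_size) are assumed to form TP group g.
    """
    bad = []
    for start in range(0, len(rank_to_rack), tp_size):
        racks = set(rank_to_rack[start:start + tp_size])
        if len(racks) > 1:
            bad.append(start // tp_size)
    return bad


# Two NVL72 racks, TP=8. A correct placement fills rack 0 before rack 1.
good = [rank // 72 for rank in range(144)]
# A placement shifted by 4 GPUs, e.g. from a stale topology description.
shifted = [(rank + 4) // 72 for rank in range(144)]

print(tp_groups_spanning_racks(good, 8))     # no groups cross a rack boundary
print(tp_groups_spanning_racks(shifted, 8))  # groups 8 and 17 straddle racks
```

Running a check like this against a dry-run allocation is cheap compared with discovering the split from a throughput regression.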

Kubernetes with NVIDIA GPU Operator + DRA

Dynamic Resource Allocation (DRA, alpha from K8s 1.26, beta in 1.32, GA in 1.34 in 2025) lets pods declare topology constraints ("all GPUs in same NVLink domain") that the scheduler honours. Combined with the NVIDIA GPU Operator's NVLink-aware allocation, Kubernetes can place pods within NVL72 racks.

Pre-DRA, Kubernetes relied on node-level GPU counts and labels, with crude rack-affinity via topologyKey: kubernetes.io/hostname and node labels. Worked but required manual rack tagging and was awkward for multi-pod jobs.

MPI integration

For multi-pod / multi-rack training, MPI ranks must be assigned to match the topology. mpirun --rank-by node vs --rank-by slot produces different communication patterns; combined with NCCL_TOPO_FILE, the placement determines whether a collective uses ring (in-rack) or hierarchical (cross-rack) algorithms.

Job scheduling for shared NVL72 capacity

When multiple smaller jobs share an NVL72 rack, the scheduler must avoid fragmentation: a 16-GPU job and a 24-GPU job and a 32-GPU job add to 72, but if placed without topology awareness they each get a slice that splits NVLink domains. Pack jobs by powers of 2 (1, 2, 4, 8, 16, 32, 64) along NVLink boundaries; reserve "headroom" for the 72-GPU full-rack job.

Best practice config

  • Reservation policy. Reserve full racks for frontier training jobs; share fragmented capacity across small inference jobs.
  • Anti-affinity for noisy neighbours. Inference latency-sensitive workloads should avoid sharing a rack with throughput-heavy training (different power profiles cause clock-throttling).
  • Drain-on-failure. A flapping NVLink port or hot GPU should drain the rack from the scheduler until investigation completes.
  • Continuous validation. Periodic NCCL benchmark runs (nccl-tests at 4 MB + 4 GB sizes) against fresh allocations validate topology assumptions and surface gradual degradation.

Scale-up vs scale-out economics: when does NVL72 pay for itself?

A practical decision framework: when does the NVL72 premium make sense vs HGX B200 with InfiniBand inter-node?

Workload categories

Pure data parallelism, small models (≤30B). DP-only across small models with TP=1 or TP=2. No benefit from NVL72; HGX H100/B200 with InfiniBand is more cost-efficient.

Medium dense models (70B–200B). TP=4 or TP=8 fits in 8-GPU HGX. NVL72 enables larger TP (16, 32) reducing PP stages — but the gain is 1.1–1.3× throughput, may not justify the premium for inference. For training at this scale, HGX is fine.

Large dense models (200B–500B). TP=8 with PP=8 in HGX vs TP=16 with PP=4 in NVL72. NVL72 cuts PP overhead and pipeline bubbles. Training throughput improves 1.4–1.8×. NVL72 justified.

MoE models with many experts (256+ experts). All-to-all expert routing is bandwidth-critical. NVL72 keeps routing inside the 1,800 GB/s NVLink domain; HGX falls back to InfiniBand for cross-node routing at 5–10× higher latency. Throughput delta: 1.6–2.5× in NVL72's favor. NVL72 strongly preferred for MoE training and serving.

Frontier model training (500B+ dense, multi-trillion MoE). Multi-rack required. NVL72 + SuperPod is the only viable path; HGX + InfiniBand alone hits inter-node bandwidth limits during DP allreduce.

Reasoning model inference. Long thinking traces need KV cache; large concurrent batches benefit from cache-coherent NVLink domain. NVL72 wins on per-user throughput by 2–3× for o1/o3-style reasoning workloads.

Cost-per-token math

For 70B Instruct serving:

  • HGX H100 (8 GPUs at $4/hr each = $32/hr): ~280 tok/s aggregate, ~$0.032 per 1k tok output.
  • HGX B200 (8 GPUs at $8/hr each = $64/hr): ~720 tok/s aggregate, ~$0.025 per 1k tok output.
  • NVL72 (per 8 of 72 GPUs at amortised $24/hr): ~840 tok/s for that slice, ~$0.008 per 1k tok output.

Cost per 1k tokens here is hourly cost ÷ (tok/s × 3,600) × 1,000. The two HGX figures land within about 25% of each other; the NVL72 figure is lower largely because $24/hr is an amortised ownership rate rather than a market price. The marginal cost of NVL72 is justified mainly by capabilities (larger TP, MoE all-to-all, reasoning workloads) rather than per-token efficiency on dense small models.
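Per-token cost follows from hourly cost and aggregate throughput. A minimal sketch of that conversion, applied to the hourly rates and tok/s figures from the list above (the function name is illustrative):

```python
def usd_per_1k_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    """Cost of 1,000 output tokens at a given hourly rate and sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1000


platforms = {
    "HGX H100": (32.0, 280.0),
    "HGX B200": (64.0, 720.0),
    "NVL72 8-GPU slice": (24.0, 840.0),
}
for name, (rate, throughput) in platforms.items():
    print(name, round(usd_per_1k_tokens(rate, throughput), 4))
```

The hourly rates here mix rental and amortised ownership pricing, so the comparison is only as good as those inputs; rerun it with your own contract rates and measured throughput.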

Capex vs opex

For workloads with sustained 70%+ utilisation over 2+ years, capex (buy NVL72) beats opex (rent GB200 hours at retail). Below 50% utilisation or shorter horizons, opex wins because retail pricing is competitive and avoids the depreciation risk on next-gen hardware (Rubin in 2026–2027 will outperform GB200; new-rack purchasers carry that delta).

Sustainability accounting

132 kW per rack × 24×365 × 0.8 utilisation = 925 MWh per rack-year. At average US grid intensity (370 gCO2/kWh in 2025), that's ~340 metric tons CO2 equivalent per rack-year. Hyperscalers offset via renewable PPAs; smaller buyers should plan for the carbon disclosure that EU CSRD requires for in-scope companies.
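The energy and emissions figures can be recomputed for a different grid or utilisation. A minimal sketch using the rack power, utilisation and grid intensity quoted above as defaults (the function names are illustrative):

```python
def rack_energy_mwh(rack_kw=132.0, utilisation=0.8, hours=24 * 365):
    """Energy drawn by one rack over the period, in MWh."""
    return rack_kw * hours * utilisation / 1000.0


def co2_tonnes(energy_mwh, grams_per_kwh=370.0):
    """Emissions in metric tons of CO2 for a given energy use and grid intensity."""
    return energy_mwh * 1000.0 * grams_per_kwh / 1e6


energy = rack_energy_mwh()
print(round(energy), round(co2_tonnes(energy)))  # about 925 MWh and about 342 t CO2
```

Grid intensity is the dominant lever: the same rack on a 100 gCO2/kWh grid comes out under 100 t per year.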


SHARP in-network aggregation and the case for offload

NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is one of the under-discussed accelerators in 2026 frontier training. The idea: instead of having GPUs exchange gradients via the network and aggregate locally, the network switches themselves perform the aggregation. The compute happens on the wire.

How SHARP works

Each NVSwitch (and certain InfiniBand switches) includes aggregation engines. During an all-reduce, GPUs send their partial gradients to the switch; the switch aggregates and sends the reduced result back. The wire-time of the collective drops because each GPU sends and receives less data — the switch is doing the math.

Performance impact

For all-reduce on small-to-mid-size tensors, SHARP can cut latency by 30–50% versus the same operation done in software via ring or tree all-reduce. For very large tensors (where the bandwidth dominates over latency), the gain is smaller — SHARP doesn't add bandwidth, it removes round-trips.

Where it pays off

  • Frequent small all-reduces in distributed training (gradient averaging at small batch sizes)
  • Latency-sensitive collectives (synchronizing model state in RL training)
  • All-to-all in MoE training: SHARP-style aggregation can help though all-to-all is bandwidth-heavy

What it requires

  • SHARP-capable switches (NVSwitch 3/4 in NVL72 / HGX; certain InfiniBand HDR/NDR switches with SHARP licenses)
  • NCCL configured to use SHARP (NCCL_COLLNET_ENABLE=1 and related)
  • SHARP credits — limits on simultaneous aggregations per switch

Configuration gotchas

  • SHARP works best on homogeneous trees; mixed-generation hardware can cause fallback to software path
  • The aggregation engines have finite memory; large all-reduces may spill to software
  • Operational visibility into SHARP usage is limited; profile with NCCL traces to confirm it's being used

DeepSeek-V3 expert parallelism on NVL72: a case study

DeepSeek-V3's training (Dec 2024) and inference deployments are an underappreciated case study in how NVL72-class fabrics enable specific model architectures. The model has 671B total parameters, 37B active per token, with 256 routed experts plus shared experts.

Why NVL72 matters here

Expert parallelism (EP) places different experts on different GPUs. For each token, the router selects ~8 experts; the activations get sent to those experts, computed, and gathered back. The all-to-all pattern this creates is bandwidth-intensive — and unlike data-parallel gradient sync (which is periodic), it happens on every forward pass.

On a 72-GPU NVL72 rack with NVLink 5 (1.8 TB/s per GPU): all-to-all between 72 GPUs at full bisection bandwidth supports the activation exchange without becoming the bottleneck. On an 8-GPU HGX H100 with NVLink 4 (900 GB/s) connected via InfiniBand: the inter-node all-to-all becomes the bottleneck, and throughput drops 3–5×.

The architecture implication

DeepSeek's 256-expert MoE is designed for rack-scale interconnects. The expert count and the routing top-k were chosen with all-to-all bandwidth in mind. Replicate the model on a different topology (say, 4-GPU nodes connected by Ethernet) and the same architecture serves at a fraction of the throughput.

The general lesson

Frontier MoE architectures are co-designed with the interconnect. NVL72 enabled 256-expert models with reasonable training and inference economics; pre-NVL72 the practical expert count was 8–32. As UALink and Ultra Ethernet mature, the same co-design will shift — what fits in a "rack" defines what models can be.

See MoE serving for the MoE inference side.


The GB200 NVL72 hardware-engineering story

The GB200 NVL72 (announced at GTC 2024, shipped late 2024 / 2025) is worth a section as a hardware-engineering milestone independent of the raw specs.

What's novel

Two things that no prior NVIDIA rack-scale product had:

  • Liquid-cooled, fully connected NVLink fabric across 72 GPUs in a single rack. Previous designs (HGX) had 8 GPUs per node, NVLinked locally, connected via InfiniBand. NVL72 makes all 72 GPUs visible to each other as if they were in the same node, with 130 TB/s of aggregate NVLink bandwidth (72 × 1.8 TB/s) in a single rack.
  • Grace-Blackwell Superchips in every compute tray. Each GB200 Superchip pairs 2× Blackwell GPUs with 1× Grace CPU over NVLink-C2C (900 GB/s coherent memory); each compute tray holds two Superchips, and 18 trays make up the 72 GPUs. The CPU is on the NVLink fabric, not an afterthought via PCIe.

The cooling story

The rack draws ~120 kW. Air cooling at that density is not feasible; the rack ships with liquid-cooled cold plates on every GPU and CPU. The coolant distribution unit (CDU) at the rack base manages the loop; cooling capacity must come from the data center's secondary loop. Many existing datacenters cannot accept this density without infrastructure work; the rack is a forcing function for liquid-cooled facility designs.

The cabling story

72 GPUs fully connected over NVLink requires staggering amounts of high-speed cabling. NVIDIA's design uses copper cables routed through a passive cartridge at the rack rear, with NVSwitch trays mounted in the middle of the rack. The cable count is in the thousands; the manufacturing tolerance is tight (every cable's electrical length matters for NVLink synchronization). Field-replaceable units (FRUs) and serviceability were designed in from the start.

The power-delivery story

At 120 kW per rack, each rack needs ~250 A at 480 V by simple P/V arithmetic (roughly 145 A per phase on a balanced three-phase feed at unity power factor). Most existing datacenters were designed for racks in the 5–15 kW range. Power delivery to NVL72 racks typically requires dedicated PDUs and often new feeder runs. The deployment timeline is gated more on facility readiness than chip availability.
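
The current figure above is easy to sanity-check. The sketch below is a back-of-envelope calculation, not an electrical design: it assumes a balanced load and unity power factor, and the function names are illustrative. It shows that the ~250 A figure is what plain P/V gives at 480 V, while the per-phase line current on a three-phase feed is lower.

```python
import math

def simple_amps(power_w: float, volts: float) -> float:
    """Current from plain P / V, ignoring phase count and power factor."""
    return power_w / volts

def three_phase_line_amps(power_w: float, line_volts: float,
                          power_factor: float = 1.0) -> float:
    """Line current of a balanced three-phase load: P / (sqrt(3) * V_LL * PF)."""
    return power_w / (math.sqrt(3) * line_volts * power_factor)

RACK_W = 120_000  # ~120 kW rack, the figure used in the text

print(f"P/V at 480 V:         {simple_amps(RACK_W, 480):.0f} A")
print(f"three-phase at 480 V: {three_phase_line_amps(RACK_W, 480):.0f} A per phase")
```

Real feeds are sized with derating and redundancy on top of this, which is why breaker ratings are well above the computed line current.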

Why it matters

NVL72 changes the unit of compute. Before NVL72, a "frontier training cluster" was thousands of 8-GPU nodes connected by InfiniBand. After NVL72, the same compute is hundreds of 72-GPU racks connected by InfiniBand or Ethernet. Bandwidth between GPUs in the same rack is 10–20× higher than the InfiniBand links that previously joined those GPUs across nodes, which enables new model architectures (large MoE, very long context, expert parallelism at scale) and new training patterns.


NVLink-C2C and Grace-Hopper memory coherence

NVLink-C2C (chip-to-chip) is the variant of NVLink that connects the Grace CPU to the Hopper or Blackwell GPU within the same package or board. The headline number is 900 GB/s bidirectional.

What's different from regular NVLink

  • Memory coherence: the CPU and GPU share a single coherent address space. The CPU can read GPU memory at NVLink speeds; the GPU can read CPU memory at NVLink speeds. No explicit copies needed for most workloads.
  • Packaging: NVLink-C2C runs on a PCB-embedded high-speed link rather than cabled. Tight electrical tolerances; not field-serviceable in the way cabled NVLink is.
  • Use cases: graph workloads that don't fit in HBM (large embedding tables, knowledge graphs), CPU-side preprocessing pipelines, KV-cache offload to system DRAM for very long context.

Why it matters for training and inference

  • Long-context inference: KV cache offload to Grace's LPDDR5X memory becomes viable; instead of an 80 GB HBM ceiling per Hopper GPU, you have effectively 480 GB+ per Grace-Hopper module (HBM + LPDDR5X via NVLink-C2C). The LPDDR5X bandwidth is much lower than HBM, but it's high enough for offload-friendly attention algorithms.
  • Training of very large embedding models: graph neural networks, recommendation models with terabyte-scale embedding tables benefit directly.
  • Disaggregated inference: in disaggregated patterns (prefill + decode on separate stages), NVLink-C2C lets the CPU stage prefills cheaply.
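
To make the long-context point concrete, here is a rough capacity sketch. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical 70B-class configuration chosen for illustration, and the calculation ignores the memory taken by weights and activations, so treat the token counts as upper bounds.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one token: a K and a V vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

def tokens_that_fit(capacity_bytes: int, per_token_bytes: int) -> int:
    """Whole tokens of KV cache that fit in the given capacity."""
    return capacity_bytes // per_token_bytes

# Hypothetical model shape (assumption, not taken from any specific model card).
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

for label, gb in (("80 GB, HBM only", 80), ("480 GB, with LPDDR5X", 480)):
    print(f"{label}: {tokens_that_fit(gb * 10**9, per_token):,} tokens of KV cache")
```

The ratio between the two results is what matters: the same module holds about six times as much cached context once offload to LPDDR5X is in play.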

See disaggregated inference for the inference patterns and KV cache for the memory math.


Failure blast radius: GPU vs NVSwitch vs rack vs row

The bigger the unit of compute, the bigger the unit of failure. NVL72 shifts the blast-radius math meaningfully.

Per-failure-domain analysis

  • Single GPU failure: one rank dies. With elastic training (TorchElastic, Ray Train), one rank can be replaced; without it, the entire job restarts. Frequency: roughly per-day at 10k-GPU scale.
  • NVSwitch failure: an entire NVL72 rack loses some-to-most of its internal bandwidth. The rack is effectively offline until repair. Frequency: weeks-to-months per rack.
  • Rack-level failure (CDU failure, power loss, network uplink): 72 GPUs unavailable simultaneously. The training job loses ~1% of capacity in a typical 10k-GPU deployment. With proper sharding (failures localize to specific TP/PP/DP shards), the job can restart from checkpoint with the remaining racks.
  • Row-level failure (cooling loop failure, facility power event): hundreds of GPUs unavailable. Recovery is hours-to-days; the run is paused, not failed, if checkpointing is working.
  • Site-level failure (datacenter event): all GPUs at the site unavailable. The run pauses indefinitely or fails over to a secondary site if cross-DC checkpointing is in place.

Mitigation strategies

  • Hot spare racks: provision 5–10% extra capacity so a rack failure doesn't reduce job throughput
  • Topology-aware sharding: place TP groups within a rack so an inter-rack network failure doesn't tear apart a TP group
  • Checkpoint cadence calibrated to rack-failure rate: at NVL72 scale, rack failures may bound your cadence math more tightly than per-GPU failures
  • Cross-site replication: see the cross-DC training section below
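
One common way to turn a failure rate into a checkpoint cadence is Young's approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). This is a textbook rule of thumb swapped in here for illustration, not a method this guide prescribes, and the per-rack MTBF and checkpoint cost below are made-up inputs.

```python
import math

def system_mtbf_s(component_mtbf_s: float, n_components: int) -> float:
    """MTBF of a system that fails when any one of n independent components fails."""
    return component_mtbf_s / n_components

def young_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval that minimises lost work."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical inputs: 139 racks, each failing about once every 60 days,
# and a checkpoint that stalls training for 60 seconds.
rack_mtbf = 60 * 24 * 3600
cluster_mtbf = system_mtbf_s(rack_mtbf, n_components=139)
interval = young_interval_s(checkpoint_cost_s=60, mtbf_s=cluster_mtbf)

print(f"cluster MTBF:        {cluster_mtbf / 3600:.1f} h")
print(f"checkpoint interval: {interval / 60:.0f} min")
```

Plugging in per-GPU failure rates instead of per-rack rates shows which failure class actually bounds the cadence for a given cluster.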

See checkpoint storage and recovery for the recovery side of all of this.


Reference rack design: power, cooling, networking, ops

A reference 2026 GB200 NVL72-class rack deployment, with the numbers you need to plan against:

Physical

  • Footprint: ~600mm × 1200mm (standard 19-inch rack footprint, but much heavier; floor-load checks required)
  • Weight: ~1500 kg per rack populated; pre-deployment structural review needed
  • Height: 48U typically; some configurations 42U or custom

Power

  • Power draw: 120–132 kW per rack (NVL72); ~65 kW NVL36 variant
  • Feed: typically dual 250A 480V three-phase or equivalent in EU power
  • PDUs: rack-internal PDUs sized for the GPU and switch trays; cable-management space is non-trivial

Cooling

  • Coolant flow: ~80–100 LPM (liters per minute) per rack at the cold plate inlet
  • Inlet temperature: typically 25–35°C; outlet 40–55°C after passing through cold plates
  • CDU sizing: each rack's CDU sized for full rack thermal load plus margin
  • Secondary loop: facility chilled water at appropriate flow and temperature
  • Air cooling: still present for in-rack components not directly liquid-cooled (NICs, management cards); ambient ~25–30°C is sufficient

Networking

  • Inter-rack uplinks: 8–16 InfiniBand NDR (400 Gbps each) per rack, going to a scale-out spine
  • Management network: separate 100 Gbps Ethernet for cluster management, console access, OOB monitoring
  • Storage network: typically converged with the InfiniBand fabric for parallel-FS access

Day-2 operations

  • Monitoring: per-GPU telemetry (temperature, ECC errors, NVLink link status), per-NVSwitch port stats, CDU flow/temp/pressure
  • Alerts: thermal margin, NVLink link-down, ECC error rate, CDU pump fault
  • Repair workflow: documented FRU-replacement procedures; mean time to repair (MTTR) of a failed NVSwitch under 4 hours with on-site staff
  • Firmware updates: scheduled, coordinated; downtime per update typically tens of minutes per rack

Reference comparison

Component | NVL72 spec | HGX H100 8-GPU spec
GPU count | 72 (Blackwell) | 8 (H100)
Intra-rack bandwidth | NVLink 5 (130 TB/s aggregate) | NVLink 4 (900 GB/s per GPU within node)
Power | 120 kW | 10.5 kW per node
Cooling | Liquid (mandatory) | Air or liquid
Inter-node fabric (per rack) | 8–16 IB NDR | 4–8 IB HDR/NDR per node
Reference footprint | 1 rack | 1U–4U per node

Cross-DC training: when one site isn't enough

Frontier training runs in 2026 are bumping against the power limits of single datacenters. A 100k-GPU cluster at NVL72 density is ~1,400 racks × ~120 kW ≈ 167 MW of IT load (a 10k-GPU cluster is ~140 racks, or ~17 MW); very few existing datacenters have that available capacity for a single tenant. xAI's Colossus (Memphis), Stargate (Oracle/OpenAI), Anthropic's clusters, and the leading hyperscalers' newest sites are all in the 100+ MW range, custom-built.
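
The rack-count and power arithmetic is worth checking at more than one cluster size. The sketch below uses 72 GPUs and ~120 kW per rack from the text and counts IT load only; facility overhead such as cooling and power conversion is deliberately left out.

```python
import math

def racks_needed(gpus: int, gpus_per_rack: int = 72) -> int:
    """Whole racks required to house the given GPU count."""
    return math.ceil(gpus / gpus_per_rack)

def it_power_mw(racks: int, kw_per_rack: float = 120.0) -> float:
    """IT load in megawatts, excluding cooling and distribution losses."""
    return racks * kw_per_rack / 1000.0

for gpus in (10_000, 100_000):
    racks = racks_needed(gpus)
    print(f"{gpus:>7,} GPUs -> {racks:>5,} racks -> {it_power_mw(racks):6.1f} MW")
```

A 10k-GPU cluster lands in the high teens of megawatts; a 100k-GPU cluster is an order of magnitude above that, which is what pushes frontier sites into purpose-built 100+ MW facilities.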

For training that needs to span multiple sites, the connectivity becomes the bottleneck:

Latency

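Inter-site latency over fiber: ~5 ms per 1000 km, one way. East coast to west coast (~4,000 km): ~20 ms one way, or ~40 ms round trip before routing and equipment overhead. Training collectives that synchronize every step are impractical at that latency; pipeline-parallel stages spanning sites remain workable because the pipeline absorbs the latency.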
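
The per-kilometre figure follows from the speed of light in glass. A minimal sketch, assuming a group refractive index of about 1.468 (typical for standard single-mode fiber) and a straight-line route, which real fiber paths never quite achieve:

```python
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in vacuum

def one_way_latency_ms(distance_km: float, refractive_index: float = 1.468) -> float:
    """Propagation delay through fiber, ignoring switching and routing overhead."""
    return distance_km * refractive_index / C_VACUUM_KM_PER_S * 1000.0

for km in (1_000, 4_000):
    one_way = one_way_latency_ms(km)
    print(f"{km:>5,} km: {one_way:5.1f} ms one way, {2 * one_way:5.1f} ms round trip")
```

Measured latencies are higher because fiber routes are longer than the great-circle distance and every hop adds equipment delay.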

Bandwidth

Inter-site bandwidth via leased fiber: 100s of Gbps to single-digit Tbps for the whole site. Compared to ~1.8 TB/s (14.4 Tbps) per GPU intra-rack, the entire inter-site link carries less than one GPU's NVLink bandwidth, so the per-GPU share is 1000× or more slower. Designs that work: model parallel within a site, data parallel across sites (with delayed gradient sync), separate jobs per site with periodic checkpoint exchange.
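
A quick way to see why cross-site gradient sync has to be delayed or infrequent is to time one full gradient exchange over the link. The sketch assumes a hypothetical 70B-parameter model with BF16 gradients and an uncontended link; the link speeds are illustrative points in the range quoted above.

```python
def bf16_gradient_bytes(n_params: float) -> float:
    """Size of one full gradient copy at 2 bytes per parameter."""
    return n_params * 2

def transfer_seconds(payload_bytes: float, link_gbps: float) -> float:
    """Time to push the payload through a link of the given speed (bandwidth only)."""
    return payload_bytes * 8 / (link_gbps * 1e9)

grads = bf16_gradient_bytes(70e9)  # hypothetical 70B-parameter model

for gbps in (400, 1_000, 4_000):
    print(f"{gbps:>5} Gbps inter-site link: {transfer_seconds(grads, gbps):5.2f} s per exchange")

nvlink_gbps = 1.8e12 * 8 / 1e9  # 1.8 TB/s per GPU expressed in Gbps
print(f"one GPU's NVLink:        {transfer_seconds(grads, nvlink_gbps) * 1000:5.1f} ms")
```

If a training step takes a few seconds, a per-step exchange over the inter-site link would consume a large share of it, which is why the workable patterns sync less often.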

Patterns

  • Single-site training, cross-site checkpoint replication: simplest. The training run lives at one site; checkpoints are replicated to another site for disaster recovery.
  • Data-parallel across sites: each site trains on a shard of the data with periodic gradient sync. Tolerates latency but throughput is bounded by sync cadence.
  • Pipeline-parallel across sites: stage 0 at site A, stage 1 at site B, pipeline bubbles absorb the latency. Requires careful staging to avoid stalls.
  • Federated training: each site trains independently; final model is averaged. Mostly research territory; not standard for frontier pretraining.

By 2027, the leading labs are likely to be running across multiple geographic sites by necessity; in 2026, single-site is still the dominant pattern with cross-site mostly for resilience.


TPU v5p, Trillium, and Ironwood interconnect details

Google's TPU lineage uses a fundamentally different interconnect architecture (Inter-Chip Interconnect, ICI) that's worth understanding for comparison.

TPU v5p

Pods of up to 8,960 chips with 3D torus ICI topology. Each chip has roughly 2.8 TB/s of HBM bandwidth (somewhat below H100's HBM3) and ICI links at ~4.8 Tbps aggregate per chip. The 3D torus shape keeps nearest-neighbor latency uniform and naturally supports collectives via ring algorithms on each axis. Used heavily for Gemini training.
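
The torus property that matters for collectives is that every chip has the same six neighbors (two per axis, with wraparound), so a ring can run along each axis. The sketch below models a small hypothetical 4×4×4 torus; the dimensions are for illustration and are not a real pod shape.

```python
from itertools import product

def torus_neighbors(coord, dims):
    """The six wraparound neighbors of a chip in a 3D torus (two per axis)."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors

def torus_hops(a, b, dims):
    """Minimum hop count between two chips, taking the shorter way around each axis."""
    return sum(min((x - y) % d, (y - x) % d) for x, y, d in zip(a, b, dims))

dims = (4, 4, 4)  # hypothetical small torus
origin = (0, 0, 0)
diameter = max(torus_hops(origin, c, dims) for c in product(*(range(d) for d in dims)))

print("neighbors of origin:", torus_neighbors(origin, dims))
print("network diameter in hops:", diameter)
```

Because of the wraparound links, the worst-case distance on each axis is half the axis length, which is what keeps multi-hop latency bounded as the pod grows.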

Trillium (TPU v6)

Announced 2024. Improved compute per chip; ICI generation upgrade. Pods scale to 256 chips. Optimized for training and inference of frontier Google models including Gemini 2.x.

Ironwood (announced 2025 for 2025–2026 deployment)

Google's inference-optimized TPU. Different ICI topology choices favoring all-to-all and attention patterns. By 2026 expected to be the inference workhorse for Gemini production.

Comparison to NVLink/NVL72

  • TPU pods scale to thousands of chips with uniform interconnect (vs NVL72's 72 chips at full bandwidth, with InfiniBand beyond)
  • TPU ICI bandwidth per chip is lower than NVLink 5 per-GPU bandwidth, but the pod-scale bandwidth is competitive due to topology
  • TPU pods are Google-only; no third-party access except via Google Cloud
  • Software stack is JAX/TPU-native rather than CUDA/PyTorch-native; portability is the main downside

When TPU is preferable

  • Workloads already JAX-native (Gemini, Imagen, Veo)
  • Long-running training jobs that fit the pod size
  • Inference at very high throughput where the cost model favors TPU's per-chip economics

When NVIDIA wins

  • Workloads that need CUDA-only libraries (most research code)
  • Multi-vendor portability requirements
  • Cases where NVL72-class single-rack bandwidth matters

The bottom line

The problem is the cross-rack collapse: collectives that cross out of the NVLink domain pay a 10–20× bandwidth penalty, and that single fact governs which parallelism strategies survive at scale. The solution is to size your fast-fabric domain to the model architecture, not the other way around. NVL72 made the rack — not the 8-GPU node — the new unit of "tightly coupled," and that is the biggest lever in 2026 hardware planning.

  • Place TP and EP inside one NVLink domain. Always. Crossing the rack boundary with bandwidth-hungry collectives is the most common preventable throughput loss.
  • Use PP and DP to cross racks. Pipeline parallelism is latency-tolerant; DP gradient sync is overlap-friendly.
  • For dense ≤200B models, 8-GPU HGX is still competitive. Rack-scale is the right answer for trillion-parameter MoE, not every workload.
  • Watch UALink + Ultra Ethernet in 2027. First silicon will reset the "open vs proprietary" question and may change procurement math.
  • Budget the blast radius. A whole NVL72 rack is one failure domain; checkpoint cadence and topology-aware scheduling matter more, not less.

For the inter-rack side of this picture, read AI cluster networking; for how parallelism shapes which fabric you actually need, read distributed LLM training.


FAQ

Do I need NVSwitch for 2 GPUs? No. Direct NVLink between 2 GPUs is sufficient. NVSwitch matters when 4+ GPUs need full mesh.

Can I use NVLink across nodes? Not directly with traditional NVLink. NVLink Switch System and similar rack-scale fabrics extend it, but the rack boundary still matters.

What about PCIe-only deployments? Possible for small-scale inference. Bandwidth between GPUs is limited to PCIe Gen5 (~64 GB/s per direction on an x16 link), so collectives are slow. Not recommended for serious multi-GPU work.

Does NVLink work between consumer GPUs? Most consumer GPUs don't support NVLink (or support limited variants). Production AI uses data-center GPUs (H100/H200/B200, MI300X, etc.).

Does TP=2 need NVLink? Highly recommended. TP=2 across PCIe is much slower than across NVLink.

How big is the in-rack power requirement for NVL72-class? ~120 kW for a fully loaded rack. Liquid cooling typically required. Substantial data-center modifications.

What's the bottleneck when I scale data parallel across many nodes? Usually the all-reduce for gradients. At 1000+ GPUs, network bandwidth becomes the limit. Mitigations: gradient compression, asynchronous updates (with quality cost).

Can I mix NVIDIA and AMD in one cluster? Possible but software-painful. Different libraries, different toolchains. Rare in production today.

What's the difference between NVLink 4 and NVLink 5? Per-link bandwidth roughly doubled (NVLink 4 at ~25 GB/s per direction per link, NVLink 5 at ~50 GB/s), and with 18 links per GPU in both generations the aggregate per-GPU bandwidth went from ~900 GB/s on H100/H200 to ~1.8 TB/s on B200/GB200. The bigger structural change is that NVLink 5 plus NVSwitch 4 is the first generation to ship at rack scale in volume rather than inside an 8-GPU node; it's what makes NVL72 possible.

What is HGX, and how is it different from DGX? HGX is NVIDIA's reference baseboard — the 8-SXM motherboard with NVSwitch silicon — that OEMs (Supermicro, Foxconn, Dell, Lenovo) ship in their own server chassis. DGX is NVIDIA's first-party fully integrated system built around an HGX baseboard. Same GPUs, same NVLink fabric; DGX adds NVIDIA-blessed BIOS, networking, cooling and support contracts.

Is GB200 NVL72 worth the premium over 9× DGX H100? For frontier training where TP > 8 or EP > 8 is required: yes. The NVLink fast-fabric advantage at TP=32+ is enormous for trillion-parameter dense or large-expert MoE training. For most workloads (≤200B, TP=8, conventional DP+PP), 9× DGX H100 or H200 is more flexible and significantly cheaper per GPU.

What is UALink and should I care? UALink is the open scale-up interconnect standard from AMD, Intel, AWS, Google, Meta, Microsoft and others — explicitly designed as an alternative to NVLink. The 1.0 spec landed in 2025, first silicon in 2026–2027. If you're planning purchases for 2027+ with multi-vendor procurement requirements, track UALink. For 2026 production deployments, NVLink remains the only mature scale-up option.

What's Ultra Ethernet and how does it relate to UALink? Ultra Ethernet (UEC) is the scale-out fabric standard — the inter-rack companion to UALink's scale-up. Together they form the open alternative to NVIDIA's NVLink + InfiniBand stack. See the AI cluster networking guide for the scale-out side.

Does NVSwitch SHARP do anything useful? Yes — SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) lets the NVSwitch chip itself perform reductions (sum, min, max) on data passing through it, instead of bouncing the data back to GPUs. For all-reduce-heavy workloads (DP gradient sync) this can cut collective time meaningfully. NCCL transparently uses SHARP when available; see the NCCL guide for the env vars.

Can NVLink span multiple racks? With NVIDIA's NVLink Switch System (the NVL576 reference design), yes — up to 8 NVL72 racks (576 GPUs) in one NVLink domain via external NVLink switches and optical cables. Bandwidth between racks is lower than intra-rack but still NVLink-class. Production deployments at NVL576 scale are early and rare in 2026.

How does NVLink interact with NCCL topology detection? NCCL detects NVLink topology automatically and builds rings/trees that maximize NVLink utilization while minimizing PCIe and IB hops. If nvidia-smi topo -m shows PHB instead of NV# between GPUs in your node, NVLink isn't being used — usually a misconfigured container, IOMMU issue, or wrong PCIe slot.

What's NVL36 and when would you pick it over NVL72? NVL36 is the half-rack variant: 36 GPUs in one NVLink domain, same NVSwitch 4 fabric topology. The use case is datacenters that can't accommodate ~120 kW racks but can do ~60 kW. Per-GPU cost is similar; per-GPU NVLink bandwidth is the same architecturally, while aggregate bandwidth per rack is half because there are half the GPUs. For workloads that don't need 72-wide TP or EP (most inference, mid-scale training), NVL36 is the easier physical fit.

How does NVL576 actually work — is it really a single NVLink domain? Yes, with caveats. The NVL576 reference connects 8 NVL72 racks via external NVLink switches and optical NVLink cables. From software's perspective, all 576 GPUs are one NVLink domain — NCCL sees them as a flat fabric, you can run TP=576 if you want. The caveat: inter-rack NVLink bandwidth is lower than intra-rack (optical SerDes power and cost), so practical TP groups still tend to stay within one rack. The cross-rack bandwidth is still much higher than InfiniBand. NVL576 deployments are early; most "NVL576-class" buyers in 2026 are running fewer racks per domain.

What changes with co-packaged optics (CPO)? CPO integrates the optical transceivers directly into the switch ASIC package, eliminating the pluggable optic. Lower power per gigabit, higher density, lower cost at scale. NVIDIA's Quantum-X Photonics (announced at GTC 2025, shipping 2026-2027) is the first generation; CPO becomes the default for NVLink 6 / Rubin-era hardware. The implication for cluster designers: NVLink scale-up domains can grow further (multi-rack at scale, with NVLink-class bandwidth) while inter-rack optical-NVLink power becomes manageable. Pluggable optics in the AI fabric will be a 2024-2027 generation phenomenon; post-CPO is the future.

Can I run TP=16 on an 8-GPU node? No, not in a useful way. TP=16 requires 16 GPUs in one fast-fabric domain. On an HGX H200 (8 GPUs, NVSwitch 3 domain), TP is bounded at 8. Going to TP=16 forces the second half of the TP group across InfiniBand at ~50 GB/s — the resulting throughput is 5-10× slower than TP=8 within the node. Either accept TP≤8 (use PP or DP for the rest) or upgrade to rack-scale hardware where TP=16+ is in the same fabric.
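
A bandwidth-only ring all-reduce model shows where the slowdown comes from. This is a simplified sketch: it assumes the ring is limited by its slowest link, uses ~450 GB/s per direction for NVLink 4 (half of the 900 GB/s bidirectional figure) and ~50 GB/s for the InfiniBand hop, and ignores latency, protocol overhead and compute overlap. The 64 MB payload is an arbitrary example.

```python
def ring_allreduce_seconds(payload_bytes: float, n_ranks: int,
                           slowest_link_gb_s: float) -> float:
    """Bandwidth-only ring all-reduce: each rank moves 2*(n-1)/n of the payload
    through the slowest link in the ring."""
    return 2.0 * (n_ranks - 1) / n_ranks * payload_bytes / (slowest_link_gb_s * 1e9)

payload = 64e6  # hypothetical 64 MB activation all-reduce

tp8 = ring_allreduce_seconds(payload, n_ranks=8, slowest_link_gb_s=450)
tp16 = ring_allreduce_seconds(payload, n_ranks=16, slowest_link_gb_s=50)

print(f"TP=8 inside the node:  {tp8 * 1e6:7.1f} us")
print(f"TP=16 across the node: {tp16 * 1e6:7.1f} us")
print(f"communication slowdown: {tp16 / tp8:.1f}x")
```

The model covers communication time only; end-to-end throughput loss depends on how much of the step is spent in collectives.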

What's the role of SHARP for inference workloads (vs training)? SHARP's all-reduce-in-network helps any workload where the all-reduce is on the critical path. For training: data-parallel gradient all-reduce, which is large but tolerates overlap with compute, so SHARP gives modest wins (~5-15% on step time at large DP). For inference: tensor-parallel all-reduce on every forward pass, smaller payloads but on the critical-path. SHARP cuts TP all-reduce latency, improving TTFT and per-token latency at higher TP degrees. See vLLM PagedAttention for where this matters in serving.

How does scale-up topology affect MoE serving? Expert-parallel MoE serving relies on an all-to-all to dispatch tokens to experts every layer. The number of pairwise flows in that all-to-all grows as O(N²) in the EP group size N, and with evenly spread routing a fraction (N−1)/N of each GPU's tokens leaves the device. On 8-GPU NVSwitch, EP=8 is the ceiling before InfiniBand crushes the all-to-all. On NVL72, EP=64 is comfortable. This is why DeepSeek-V3's serving stack and GPT-class MoE inference depend on rack-scale NVLink: without it, the expert-parallel layout has to be narrower, which forces smaller per-expert routing or more layers, both quality-negative tradeoffs.
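
The scaling of the all-to-all can be counted directly. The sketch assumes tokens are routed uniformly across experts and one expert shard per GPU, which real routers only approximate.

```python
def all_to_all_flows(ep_size: int) -> int:
    """Directed GPU-to-GPU flows in one all-to-all: every rank sends to every other."""
    return ep_size * (ep_size - 1)

def off_device_fraction(ep_size: int) -> float:
    """Share of a GPU's tokens that leave the device under uniform routing."""
    return (ep_size - 1) / ep_size

for ep in (8, 16, 64):
    print(f"EP={ep:>2}: {all_to_all_flows(ep):>5} flows, "
          f"{off_device_fraction(ep):.1%} of tokens leave each GPU")
```

Per-GPU traffic saturates quickly, but the flow count keeps growing quadratically, so the fabric has to sustain many simultaneous point-to-point transfers at full rate.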

What's the difference between InfiniBand and Spectrum-X for inter-rack AI fabrics? InfiniBand (Quantum-2 NDR, Quantum-3 XDR) is the historical default, lossless and tuned for HPC. Spectrum-X is NVIDIA's RoCEv2-based Ethernet alternative with AI-specific congestion control (Adaptive Routing, BlueField-3 DPUs at the endpoints). Performance is comparable at the bandwidth level (both ship 400G/800G); Spectrum-X integrates better with existing Ethernet management tooling and avoids InfiniBand's specialized switch stack. Hyperscalers increasingly prefer Spectrum-X for greenfield deployments; HPC-traditional buyers stay on InfiniBand. See the InfiniBand vs RoCE guide for the deeper comparison.

How does AMD MI355X / MI400X position against B200 / Rubin? MI355X (2025 refresh of the MI350 line) is competitive with H200 on raw FLOPs and HBM capacity; per-GPU NVLink-equivalent (Infinity Fabric) bandwidth lags NVLink 5 / B200 by ~30-50%. MI400X (2026 expected) targets B200 parity on compute but Rubin will likely be ahead by then. For inference, AMD is competitive on TCO at mid-scale. For frontier pretraining, the rack-scale gap and ROCm software maturity still favor NVIDIA, though the gap narrows each generation.

Should I be worried about the 5-year horizon if I'm buying NVLink today? Less than you might think. The NVLink + InfiniBand stack has a 10+ year head start in production maturity. UALink + UEC needs not just first silicon but the entire software ecosystem — collective libraries, fault-tolerance patterns, scheduler integrations — to match. That's a 2028-2030 horizon for parity. Buying NVLink today for 2026-2028 service is the lowest-risk choice. Track UALink for 2029+ procurement.

What's the actual cost of NVL72 vs equivalent DGX H200? NVL72 list pricing is heavily negotiable and varies by buyer; reported numbers cluster around $3-4M per rack for GB200 NVL72. An equivalent in raw GPU count (~9 DGX H200 nodes, 72 GPUs) is roughly $2-2.5M. The NVL72 premium reflects the rack-scale fabric, liquid cooling integration, and Grace CPU integration. The premium pays back if you need TP > 8 or EP > 8 — for typical training, it's a 20-40% real performance win. For workloads that don't, the DGX H200 layout is more flexible and cheaper.

Where does the decentralized GPU economics story interact with rack-scale? Decentralized GPU networks (Akash, io.net, Render Network etc.) almost universally operate at the single-node level, not rack-scale. The 8-GPU node is the largest tightly-coupled unit available on these networks; NVL72-class systems exist only in dedicated frontier-lab datacenters. The implication: decentralized inference for mid-size models (≤70B with TP=8) works; decentralized training at frontier scale doesn't, because the fast-fabric requirement is incompatible with the decentralized model. The gap is structural.

Why does NVLink 5 use PAM4 instead of NRZ? Bandwidth per pin. NRZ encodes one bit per symbol; PAM4 encodes two bits per symbol at the cost of much tighter SNR margins. To double per-lane bit rate without doubling the analog frequency (which is hard to do reliably above 50 GHz), you either add pins (more wires, more area, more thermal cost) or move to PAM4. NVIDIA chose PAM4 with strong FEC. The downside is the FEC latency tax (~10–15 ns per link traversal) and tighter cabling tolerances — a real but manageable cost.
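
The bits-per-symbol trade can be shown in a few lines. The sketch uses a conventional Gray-coded PAM4 mapping and a hypothetical 200 Gb/s lane rate for illustration; it does not model NVLink's actual line coding or FEC.

```python
import math

# Gray-coded PAM4: adjacent amplitude levels differ by one bit.
GRAY_PAM4 = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}

def symbol_rate_gbaud(bit_rate_gbps: float, levels: int) -> float:
    """Symbol rate needed to carry a bit rate with the given number of levels."""
    return bit_rate_gbps / math.log2(levels)

def pam4_encode(bits):
    """Map pairs of bits onto one of four amplitude levels."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [GRAY_PAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

lane_gbps = 200  # hypothetical lane rate
print("NRZ  symbol rate:", symbol_rate_gbaud(lane_gbps, 2), "GBd")
print("PAM4 symbol rate:", symbol_rate_gbaud(lane_gbps, 4), "GBd")
print("8 bits as PAM4 symbols:", pam4_encode([0, 0, 0, 1, 1, 1, 1, 0]))
```

Halving the symbol rate is the gain; the cost is that four levels share the same voltage swing, which is where the tighter SNR margin and the need for FEC come from.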

Why is NVL72 only 72 GPUs and not 128 or 256? Power and physics. 72 B200 GPUs at ~1.2 kW each plus NVSwitches, Grace CPUs, and supporting infrastructure hits ~132 kW per rack — already at the edge of what direct liquid-to-chip cooling can extract in standard 42U form factor. Doubling to 144 GPUs hits ~265 kW per rack, which exceeds what the busbar, CDU, and floor-load can support in a standard rack. The 576-GPU SuperPod is the answer for "bigger than 72": multiple racks linked by external NVLink. NVSwitch silicon could support more ports; the limiting factor is rack-level power/cooling, not silicon.

What's SHARPv3 and does it actually help my workload? SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) lets NVSwitch perform allreduce inside the switch silicon rather than relaying data among endpoints. SHARPv3 in NVSwitch4 adds FP8 and FP16 dtype support — the precisions that dominate modern training. Empirically: allreduce throughput improves 1.5–2× on medium-sized tensors (1–256 MB) where ring's latency dominates. For very large tensors (>1 GB) ring approaches SHARP's effective bandwidth. Enable via NCCL_NVLS_ENABLE=1; benchmark on your tensor sizes.
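
One way to run the "enable and benchmark" step is to build the environment for the benchmark process explicitly. In the sketch below, NCCL_NVLS_ENABLE and NCCL_DEBUG are real NCCL environment variables; the helper name and the benchmark command in the comment are placeholders to adapt to your own setup.

```python
import os

def nccl_env_for_nvls(base=None):
    """Copy an environment and request NVLink SHARP plus INFO-level NCCL logging."""
    env = dict(os.environ if base is None else base)
    env["NCCL_NVLS_ENABLE"] = "1"   # ask NCCL to use NVLink SHARP where supported
    env["NCCL_DEBUG"] = "INFO"      # log initialisation so the choice can be inspected
    return env

env = nccl_env_for_nvls()
print({k: v for k, v in env.items() if k.startswith("NCCL_")})

# Placeholder launch, to be replaced with your own benchmark command:
# subprocess.run(["<your-allreduce-benchmark>", "<args>"], env=env, check=True)
```

Running the same benchmark with the variable set to 0 gives the comparison point for your tensor sizes.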

How much of my training cost goes to interconnect? For frontier training on NVL72: roughly 25–35% of the total system cost is interconnect (NVSwitches, NVLink cabling, InfiniBand inter-rack). The compute (GPUs) is 50–60%; the rest (host CPUs, memory, storage, cooling, power infrastructure) is 10–20%. The interconnect share has grown over generations as scale-up domain has grown — NVL72 has proportionally more switching than HGX H100.

Why don't I see ICI/TPU details in NCCL? ICI (Google's TPU interconnect) is a separate stack — XLA's collective library targets ICI directly. NCCL targets NVIDIA GPUs. If you're on TPU, you use JAX or PyTorch/XLA, and the compiler handles topology. There is no equivalent of NCCL_TOPO_FILE on TPU because XLA's compiler is the topology mapper.

How does inference parallelism differ from training on rack-scale? Inference uses much less collective traffic than training. TP for inference handles per-layer activation slicing (the all-reduce per layer); no DP allreduce for gradients. So inference fits comfortably in smaller domains. A 405B inference deployment uses TP=8 in one HGX or TP=16 in NVL72; no need to span multiple racks for any single inference request. Where rack-scale matters for inference: serving many concurrent requests with shared KV cache and prompt cache benefits from a large single fabric for cache coherency.

Will UALink replace NVLink? Long-term, possibly. Short-term (2026–2028), no — UALink v1.0 hardware ships in mid-2026 at AMD MI400X scale; software stack maturity is 2027+. NVIDIA has a 10+ year head start in the production fabric for AI. UALink will become the open alternative for non-NVIDIA accelerators (AMD, Intel, custom silicon). The realistic 2030 picture: NVIDIA NVLink dominant in frontier; UALink dominant in non-NVIDIA segment; both ecosystems mature.

What's co-packaged optics (CPO) and why does it matter for Rubin? CPO moves the optical transceiver onto the chip package, replacing pluggable optical cages. Benefits: 3–5× lower power per bit, lower latency, higher density. Trade-offs: thermal complexity (lasers don't like hot environments), manufacturing complexity, lack of field pluggability. NVIDIA Rubin (late 2026) is the first NVIDIA generation with CPO at scale. The strategic implication: CPO production volume (limited 2026–2028) is the bottleneck for the next NVLink generation, not silicon design.

How does AWS EFAv3 compare to InfiniBand for AI? EFAv3 (announced 2024, GA 2025) is AWS's RDMA-over-Ethernet implementation tuned for AI. Per-link: 400 Gbps. Comparable raw bandwidth to InfiniBand NDR. Differences: EFA uses SRD (Scalable Reliable Datagrams) instead of InfiniBand's RC. SRD does out-of-order delivery with reordering at the endpoint — better for AI all-to-all patterns but requires application support. NCCL on AWS uses libfabric / OFI to bridge — works well but with slight latency tax vs native InfiniBand on equivalent hardware.

What's the typical allreduce bandwidth efficiency in a real cluster? For ring allreduce on NCCL inside an 8-GPU HGX H100: 75–85% of peak NVLink bandwidth on tensors >32 MB. On NVL72 with SHARPv3: 85–95% on tensors 1–256 MB, 75–85% on tensors >1 GB. Inter-rack allreduce on InfiniBand NDR: 55–70% of peak — the drop reflects the higher-tier fabric cost. Multi-rack training jobs design DP scheduling to overlap allreduce with compute to hide the inter-rack inefficiency.
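
Efficiency figures like these are usually derived from the "bus bandwidth" convention used by nccl-tests, where all-reduce bus bandwidth is the algorithm bandwidth scaled by 2(n−1)/n. The sketch below applies that formula to a hypothetical timing; the 4.7 ms measurement and the 450 GB/s per-direction peak are assumptions for illustration, not benchmark results.

```python
def allreduce_algbw_gb_s(payload_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth: payload size divided by wall-clock time."""
    return payload_bytes / seconds / 1e9

def allreduce_busbw_gb_s(payload_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2*(n-1)/n."""
    return allreduce_algbw_gb_s(payload_bytes, seconds) * 2 * (n_ranks - 1) / n_ranks

def efficiency(busbw_gb_s: float, peak_gb_s: float) -> float:
    """Fraction of the assumed peak link bandwidth actually achieved."""
    return busbw_gb_s / peak_gb_s

# Hypothetical measurement: 1 GB all-reduce over 8 ranks in 4.7 ms.
busbw = allreduce_busbw_gb_s(1e9, 4.7e-3, n_ranks=8)
print(f"busbw: {busbw:.0f} GB/s, efficiency vs 450 GB/s peak: {efficiency(busbw, 450):.0%}")
```

Be explicit about which peak you divide by (per-direction or bidirectional), since that choice alone moves the percentage by a factor of two.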

Can I run NVL72 in an existing datacenter? Probably not without retrofit. Standard datacenters provide 8–15 kW per rack with air cooling. NVL72 requires 132 kW with direct liquid-to-chip. Retrofit costs are real: liquid plumbing infrastructure, CDU installation, floor reinforcement, power distribution upgrades. Most colos that support NVL72 in 2026 either built liquid-ready greenfield (Equinix LD11 Phase 2, Digital Realty multiple sites) or charge a premium per kW. Plan 6–12 months lead time for procurement and installation.

How do checkpoints work at NVL72 scale? Critical concern: a single NVL72 rack failure loses 72 GPUs' worth of state. Frontier training checkpoints every 15–60 minutes depending on cost-to-restart math. Checkpoint volume: a 1T-parameter model with optimiser states ≈ 12 TB at roughly 12 bytes per parameter (for example FP32 master weights plus two FP32 Adam moments). Writing that in a 60-second window takes ~200 GB/s in aggregate, which at ~5–20 GB/s sustained per node means spreading the write across tens of nodes. Modern stacks use sharded checkpoints (each GPU writes its slice) to parallel storage (Lustre, GPFS, or S3-compatible object storage with multipart upload). See checkpoint storage and recovery.
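
The checkpoint sizing can be worked through directly. The sketch assumes 12 bytes of state per parameter, which is one common accounting that yields 12 TB for a 1T-parameter model; other optimiser and precision choices change that constant.

```python
def checkpoint_bytes(n_params: float, bytes_per_param: float = 12) -> float:
    """Total checkpoint size for the given per-parameter state footprint."""
    return n_params * bytes_per_param

def aggregate_write_gb_s(total_bytes: float, target_seconds: float) -> float:
    """Cluster-wide write rate needed to finish within the target window."""
    return total_bytes / target_seconds / 1e9

def per_writer_gb_s(aggregate_gb_s: float, n_writers: int) -> float:
    """Share of the aggregate rate each writer node must sustain."""
    return aggregate_gb_s / n_writers

size = checkpoint_bytes(1e12)              # 1T parameters
agg = aggregate_write_gb_s(size, 60)       # 60-second checkpoint target
print(f"checkpoint size: {size / 1e12:.0f} TB, aggregate rate: {agg:.0f} GB/s")
for writers in (10, 40, 160):
    print(f"{writers:>4} writer nodes -> {per_writer_gb_s(agg, writers):5.2f} GB/s each")
```

The per-node requirement falls as more nodes write in parallel, which is the practical argument for sharded checkpoints.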

Why do TPU pods use a 3D torus instead of a switched fabric? Trade-off: torus uses fixed per-chip optical links (no central switch). Pros: lower cost per chip than building a switched fabric (no NVSwitch equivalent), lower latency for nearest-neighbour comms. Cons: arbitrary all-to-all is multi-hop; the compiler must lay out collectives to match the topology. XLA does this well for JAX; for PyTorch, the integration is younger. Both topologies are viable for AI; the choice reflects different design philosophies.

What's UALink's realistic timeline against NVLink? The UALink consortium (AMD, Broadcom, Cisco, Google, HPE, Intel, Meta, Microsoft) was announced in May 2024, and the UALink 1.0 specification was released in April 2025. First silicon implementations are expected in 2026-2027, starting with AMD MI400-class products; Intel's Falcon Shores, once a candidate, was cancelled as a commercial product in early 2025. Realistic production deployment at scale: 2027 or later. NVIDIA maintains a multi-year lead in actual rack-scale deployments; UALink's competitive position improves as the ecosystem ships hardware. The most likely outcome is a duopoly where some hyperscalers prefer NVIDIA, others (notably Microsoft, Google for some workloads) adopt UALink-based alternatives.

How does Ultra Ethernet Consortium fit alongside UALink? UALink handles intra-rack scale-up; Ultra Ethernet Consortium (UEC) handles inter-rack scale-out. Both are aimed at NVIDIA-stack alternatives. The UEC 1.0 specification was published in June 2025; silicon and switches are shipping 2025-2026. It is a different problem space than UALink: UEC is replacing InfiniBand or RoCE; UALink is replacing NVLink. A complete non-NVIDIA stack would use UALink within racks and UEC between racks. See InfiniBand vs RoCE for the inter-rack context.

What's the actual bandwidth difference between NVLink 4 (H100) and NVLink 5 (B200)? H100 NVLink 4: 900 GB/s per GPU bidirectional total. B200 NVLink 5: 1.8 TB/s per GPU bidirectional total, 2× the per-GPU bandwidth. The per-link bidirectional bandwidth doubled (50 GB/s on NVLink 4 vs 100 GB/s on NVLink 5), and the number of links per GPU stayed at 18, giving the 2× increase. Within a rack, the aggregate bandwidth scales with per-GPU bandwidth and GPU count. NVLink 6 (Rubin generation) is expected to push this further.
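
The headline numbers follow from link count times per-link bandwidth. A short check using the figures quoted in this guide (18 links per GPU, 50 and 100 GB/s bidirectional per link, 72 GPUs per rack):

```python
def per_gpu_gb_s(links: int, per_link_bidir_gb_s: float) -> float:
    """Aggregate bidirectional NVLink bandwidth of one GPU."""
    return links * per_link_bidir_gb_s

def rack_aggregate_tb_s(gpus: int, gpu_gb_s: float) -> float:
    """Sum of per-GPU NVLink bandwidth across a rack, in TB/s."""
    return gpus * gpu_gb_s / 1000.0

nvlink4 = per_gpu_gb_s(18, 50)    # H100
nvlink5 = per_gpu_gb_s(18, 100)   # B200

print(f"NVLink 4 per GPU: {nvlink4:.0f} GB/s")
print(f"NVLink 5 per GPU: {nvlink5:.0f} GB/s")
print(f"NVL72 aggregate:  {rack_aggregate_tb_s(72, nvlink5):.1f} TB/s")
```

The rack figure rounds to the 130 TB/s quoted for NVL72, which confirms that number is a sum over GPUs.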

Does NVL36 (the half-rack variant) make sense for any deployments? For some. NVL36 fits in a single rack at ~65 kW — closer to existing datacenter power budgets without major retrofit. The intra-rack bandwidth is half of NVL72 but still 8–10× HGX. Good fit for sites that can't accept 120+ kW racks; suboptimal for workloads where the full 72-GPU bisection bandwidth matters (very large MoE, very long context).

How do I think about NVLink failures vs NIC failures in the same cluster? Different blast radii. A single NIC failure kills inter-node comms for one node (8 GPUs at HGX, contributing to a TP/PP shard). A single NVSwitch failure within an NVL72 rack reduces intra-rack bandwidth and may take an entire rack offline depending on the topology. Statistically: NIC failures are more frequent per-component; NVSwitch failures are less frequent but more impactful per-event. Plan for both with redundancy and elastic resharding.

What's the difference between SHARP and "regular" all-reduce? SHARP does the aggregation inside the switch silicon, not in the GPU. For small all-reduces (latency-bound), SHARP saves 30–50% of the round-trip time. For large all-reduces (bandwidth-bound), the difference is smaller — the switch can't add bandwidth, only remove the GPU-local computation step. In practice: enable SHARP on supported hardware, profile, and confirm NCCL is actually using it (NCCL trace logs).

How does NCCL's algorithm choice change with NVL72? NCCL detects the topology and selects an algorithm. On NVL72: a single tree across all 72 GPUs is now feasible (the intra-rack bandwidth supports it); previously, NCCL would have used a hierarchical tree (intra-node tree, then inter-node ring). The simpler topology often gives better throughput. Tune via NCCL_ALGO and NCCL_TOPO_FILE if needed; in most cases the auto-detection is correct. See NCCL guide for the tuning surface.

Are there workloads that NVL72 doesn't help? Yes. Anything embarrassingly parallel (independent inference requests, small fine-tunes that fit in 8 GPUs) doesn't benefit from rack-scale bandwidth. Wall-clock time is unlikely to improve, and per-token cost may even be worse on NVL72 because the chip premium is real. NVL72 is justified by workloads that need the bandwidth: large MoE training, very long context attention, expert parallelism, training of trillion-parameter dense models.

What's the right way to benchmark a new NVL72 rack on delivery? Standard suite: NVIDIA's nccl-tests for collectives (all-reduce, all-gather, all-to-all at various sizes), HBM bandwidth (cuda-samples bandwidthTest), CPU-GPU bandwidth (NVLink-C2C via memcpy benchmarks), end-to-end with a known training workload (Llama-2 7B / 70B reference runs). Capture baseline numbers; compare to NVIDIA's published specs and to other rack deliveries; use as the regression baseline for subsequent firmware/driver updates.
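
When recording those baselines it helps to normalize collective results the way nccl-tests does, by converting algorithm bandwidth into bus bandwidth. The sketch below applies the correction factors documented for nccl-tests; the function names, the 5% tolerance, and the sample measurement are assumptions for illustration, not measured NVL72 numbers.

```python
# Bus-bandwidth correction factors as documented for nccl-tests.
BUS_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "all_to_all": lambda n: (n - 1) / n,
}


def alg_bw_gb_s(size_bytes, time_s):
    """Algorithm bandwidth: data size divided by elapsed time, in GB/s."""
    return size_bytes / time_s / 1e9


def bus_bw_gb_s(collective, size_bytes, time_s, n_ranks):
    """Bus bandwidth: algorithm bandwidth scaled by the collective's factor."""
    return alg_bw_gb_s(size_bytes, time_s) * BUS_FACTOR[collective](n_ranks)


def regressed(measured, baseline, tolerance=0.05):
    """True if a measurement falls more than `tolerance` below the baseline."""
    return measured < baseline * (1 - tolerance)


# Made-up sample: an 8 GB all-reduce across 72 ranks finishing in 20 ms.
sample = bus_bw_gb_s("all_reduce", 8e9, 0.020, 72)
print(f"bus bandwidth: {sample:.1f} GB/s")
print("regression vs. baseline:", regressed(sample * 0.9, baseline=sample))
```

Bus bandwidth is the better regression metric because it is comparable across rank counts, so a baseline captured on a full rack remains meaningful when a later run uses fewer GPUs.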

How do I plan for the next NVLink generation (Rubin/NVLink 6)? Expect another 2× per-GPU bandwidth increase, larger per-rack scale-up (possibly 144 GPUs per rack), increased power density. Plan facility readiness now: ensure liquid cooling infrastructure can scale to 200+ kW per rack; ensure power feeds can accommodate next-gen densities. Most existing datacenters that handle NVL72 will need further retrofit for Rubin-class deployments; greenfield buildouts often skip ahead.

Can I mix NVL72 racks and HGX H100 racks in one cluster? Technically yes; operationally messy. The InfiniBand or Ethernet fabric connects them, but the topology is non-uniform — collectives that span NVL72 and HGX have to be planned carefully. NCCL handles the heterogeneity but it's not as efficient as a homogeneous fabric. Recommendation: schedule jobs per-fabric (NVL72 jobs use NVL72 racks; older HGX serves older workloads or fine-tuning).

How does NVLink-C2C change the cost equation for KV-cache-heavy serving? Significantly. KV cache offload to Grace LPDDR5X via NVLink-C2C (900 GB/s) enables ~480 GB of usable memory per Grace-Hopper module — 6× the HBM-only capacity at lower bandwidth. For long-context serving with high concurrency, this changes the GPU sizing math entirely. See KV cache for the memory math and long-context attention for the algorithmic side.
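
To see why the sizing math shifts, here is a back-of-envelope sketch of KV cache capacity. The model shape (80 layers, 8 KV heads, head dimension 128, FP16) and the 80 GB HBM budget are assumptions chosen for illustration; the 480 GB figure is the Grace-attached capacity quoted above. It ignores weights, activations, and the lower bandwidth of the offloaded tier.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token: a key and a value vector
    per layer per KV head (hence the factor of 2)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(capacity_gb, per_token_bytes):
    """Whole tokens of KV cache that fit in `capacity_gb` (decimal GB)."""
    return int(capacity_gb * 1e9) // per_token_bytes


# Assumed model shape for illustration (grouped-query attention, FP16 cache).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

hbm_only = tokens_that_fit(80, per_token)     # assumed HBM budget for KV
with_grace = tokens_that_fit(480, per_token)  # capacity quoted in the text

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"tokens in 80 GB:  {hbm_only:,}")
print(f"tokens in 480 GB: {with_grace:,}")
```

Dividing the token budget by the target context length gives the number of concurrent sequences a module can hold, which is the quantity that drives GPU count for long-context serving.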

What does co-packaged optics change about rack-scale interconnects? Co-packaged optics (CPO) integrate optical I/O into the same package as the switch silicon, eliminating the pluggable optical transceiver and its energy cost. NVIDIA and others have demoed CPO products; volume deployment is expected in 2026–2027. Implication for NVL72-class systems: longer-reach NVLink (potentially across racks) becomes more thermally viable; the rack-scale fabric could expand to a "row-scale" fabric. Watch for Rubin-generation NVLink-over-CPO announcements.

How are training-job schedulers handling topology-aware placement? Slurm with topology plugins, Kubernetes with custom scheduling (Volcano, Kueue, NVIDIA Run:ai), and internal hyperscaler schedulers (Borg variants, NVIDIA Base Command) all attempt topology-aware placement. The challenge: workloads ask for "N GPUs" without specifying topology constraints; schedulers must infer or be told. 2026 best practice: explicit topology hints (e.g. --gpus-per-node, --num-nodes, --topology=nvl72_rack; exact flag names vary by scheduler and site) so the scheduler can place TP groups within a rack.
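
The core constraint, that a TP group must never straddle a rack boundary, can be sketched as a small best-fit placement routine. This is a toy model under stated assumptions, not how Slurm or any Kubernetes scheduler implements it; the function name, rack names, and free-GPU counts are hypothetical.

```python
def place_tp_groups(free_gpus_per_rack, tp_size, num_groups):
    """Assign each TP group to a single rack (best-fit).

    Returns a list of rack names, one per group, or None if some group
    cannot be placed without spanning racks.
    """
    free = dict(free_gpus_per_rack)  # work on a copy
    placement = []
    for _ in range(num_groups):
        candidates = [rack for rack, n in free.items() if n >= tp_size]
        if not candidates:
            return None  # placing this group would cross the rack boundary
        # Best fit: use the tightest rack first to limit fragmentation.
        rack = min(candidates, key=lambda r: (free[r], r))
        free[rack] -= tp_size
        placement.append(rack)
    return placement


# Hypothetical cluster state: free GPUs per NVL72 rack.
racks = {"rack-a": 72, "rack-b": 40, "rack-c": 8}
print(place_tp_groups(racks, tp_size=8, num_groups=6))
print(place_tp_groups(racks, tp_size=48, num_groups=2))
```

Best-fit keeps the emptiest racks intact for jobs that need a full 72-GPU domain, and returning None rather than splitting a group makes the rack-boundary violation explicit to the caller.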

Is there a clear roadmap to "the entire training run in one rack"? For non-frontier models: yes, NVL72 already fits most production fine-tuning and small-frontier training in a single rack. For frontier models (100B+ parameters, multi-trillion-token training corpora): no — even with Rubin's expected 144 GPUs per rack, frontier training requires hundreds-to-thousands of racks. The single-rack-everything vision is for smaller-scale workloads; frontier-scale will continue to span many racks for the foreseeable future.

What's the practical bandwidth degradation when a single NVSwitch fails? Depends on the topology. NVL72 has redundant NVSwitch ports; a single switch failure typically degrades bisection bandwidth by 10–20%, not 100%. A two-switch failure within a short window can take the rack offline. Operational practice: alert on single-switch failures immediately; replace within 24–48 hours; treat two-switch-in-72-hours as an incident requiring rack drain.
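
That operational rule is simple enough to encode directly in alerting logic. The sketch below is one hypothetical way to express it; the function and action names are made up for illustration, and it simplifies by looking only at failures inside the 72-hour window.

```python
from datetime import datetime, timedelta

INCIDENT_WINDOW = timedelta(hours=72)  # two failures inside this window: drain


def classify(failure_times, now):
    """Map recent NVSwitch failure timestamps for one rack to an action."""
    recent = [t for t in failure_times
              if timedelta(0) <= now - t <= INCIDENT_WINDOW]
    if len(recent) >= 2:
        return "drain_rack"
    if len(recent) == 1:
        return "alert_and_schedule_replacement"  # replace within 24-48 hours
    return "ok"


now = datetime(2026, 1, 10, 12, 0)
print(classify([], now))
print(classify([now - timedelta(hours=5)], now))
print(classify([now - timedelta(hours=5), now - timedelta(hours=60)], now))
```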


Glossary

  • All-reduce / all-gather / all-to-all — collective communication primitives.
  • Fabric — the interconnect topology and protocol of a GPU cluster.
  • Infinity Fabric — AMD's GPU-to-GPU interconnect.
  • InfiniBand — high-performance network commonly used for inter-node GPU clusters.
  • NVLink — NVIDIA's GPU-to-GPU interconnect.
  • NVSwitch — crossbar chip joining multiple NVLink-equipped GPUs.
  • NVL72 — NVIDIA rack-scale system with 72 Blackwell GPUs in one NVLink fabric.
  • RDMA — Remote Direct Memory Access.
  • RoCE — RDMA over Converged Ethernet.
  • Scale-up / scale-out — tightly-coupled vs network-coupled GPU systems.
  • Tensor / pipeline / data / expert parallelism — ways of splitting a model across GPUs.
  • UALink — Ultra Accelerator Link, industry-standard scale-up interconnect.

References

  • Megatron-LM — Narayanan et al., 2021. "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM." arXiv:2104.04473. The canonical reference for tensor/pipeline/data parallelism layout.
  • NVIDIA NVLink and NVSwitch documentation. nvidia.com/en-us/data-center/nvlink/ and developer docs.
  • NVIDIA GB200 NVL72 reference architecture. nvidia.com/en-us/data-center/gb200-nvl72/. Rack-scale NVLink architecture, NVSwitch 4, and the NVL576 multi-rack reference.
  • Ultra Ethernet Consortium. ultraethernet.org. The scale-out partner to UALink for open AI fabrics.
  • NCCL Tuning Guide. docs.nvidia.com/deeplearning/nccl/. The collective library that turns NVLink/NVSwitch topology into actual collective bandwidth.
  • Meta — RDMA over Ethernet for Distributed AI Training at Meta Scale — Gangidi et al., SIGCOMM 2024. ACM Digital Library. The inter-rack scale-out reference for very large RoCE clusters.
  • NCCL — NVIDIA Collective Communications Library. github.com/NVIDIA/nccl.
  • Pathways — Barham et al., 2022. "Pathways: Asynchronous Distributed Dataflow for ML." arXiv:2203.12533. Google's TPU pod approach.
  • AMD ROCm and Infinity Fabric documentation — AMD developer docs.
  • UALink consortium. ualinkconsortium.org.
  • InfiniBand Trade Association — official IB specs at infinibandta.org.
  • DeepSeek-V3 technical report — DeepSeek-AI, 2024. arXiv:2412.19437. Section on training infrastructure documents how custom all-to-all kernels were used to fit expert-parallel groups inside the fast-fabric domain.
  • ZeRO / DeepSpeed — Rajbhandari et al., 2019. "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." arXiv:1910.02054. Foundational reference for sharding choices that interact with topology.