Robotics Foundation Models & VLAs: The Ultimate Guide (2026 Edition)

For two decades robotics research progressed by hand-engineering pipelines: perception modules, planning modules, motion controllers, each developed separately and stitched together. In 2023 Google's RT-2 and the RT-X collaboration showed that a single end-to-end neural model, trained on internet-scale vision-language data plus more than a million trajectories from real robots across labs, could outperform robot-specific specialist policies on many manipulation tasks. The Vision-Language-Action (VLA) paradigm was born. In 2024-2025 it consolidated: Physical Intelligence's π-zero showed generalist manipulation on hundreds of household tasks; NVIDIA's GR00T (Project GR00T) launched as the humanoid foundation model; OpenVLA proved open weights could be competitive; Octo, RDT-2, RoboFlamingo, RoboFlamingo-Plus, Diffusion Policy, and dozens of other variants matured. By 2026 there's a coherent stack: foundation VLA → fine-tuned policy → robot embodiment, with a small set of leaders and a vibrant open-source layer. Simultaneously the humanoid robot companies (Figure, 1X, Tesla, Apptronik, Agility, Sanctuary, Unitree, Xpeng, Booster) raced to ship hardware that could use these models, with mixed results: most humanoids in commercial deployment in 2026 are doing narrow industrial tasks, not folding laundry.

The take: in 2026 the VLA / robotics-foundation-model field looks like LLMs in 2022 — a clear paradigm shift, demonstrated capabilities, a small handful of frontier labs, but commercial deployment lagging research by years. The frontier closed VLAs are Physical Intelligence π-zero / π-1 / Hi and Figure Helix. The frontier open VLAs are NVIDIA GR00T N1.5, OpenVLA / OpenVLA-OFT+, Octo, and RDT-2 — all within striking distance of the closed leaders on academic benchmarks. The remaining bottlenecks are data (real-robot trajectories are expensive), evaluation (no GPQA-equivalent in robotics — every paper uses different sim/real splits), and embodiment generalization (a policy trained on one robot rarely transfers cleanly to another). This guide is the map: what a VLA actually is, the closed and open leaders, the humanoid hardware landscape, the benchmarks worth tracking, the data flywheel problem, and the research questions that will determine when robotics has its "ChatGPT moment."

Companion reading: open weights ultimate guide for the LLM half (VLA bases are often VLM-derived), world models for the simulation half, multimodal serving (vision + audio) for the inference-serving side, and synthetic data and distillation for the data-scarcity tactics that dominate robotics.

Key takeaways
What a VLA actually is
The architectural lineage: VLM → VLA → action chunking
The 2026 VLA roster: closed and open
The data problem: Open-X Embodiment and what's missing
Humanoid robot companies racing to ship
Non-humanoid foundation-model robotics (arms, mobile, drones)
The benchmark landscape (CALVIN, LIBERO, SimplerEnv, RoboCasa)
Training a VLA: pretraining, post-training, on-robot RL
Inference: how a VLA runs on real hardware
Sim-to-real and the reality gap
Embodiment generalization (the cross-robot transfer problem)
Safety in physical AI
The 2026 → 2027 outlook

Key takeaways

VLA = Vision-Language-Action model. Takes images + language instruction → outputs continuous robot actions. Architecturally a VLM + action head trained on internet pretraining + robot trajectories.
Closed frontier: Physical Intelligence π-zero / π-1 / Hi (generalist manipulation), Figure Helix (humanoid manipulation), Google DeepMind RT-2 / Gemini Robotics (general robotics). Closed labs dominate commercial deployment.
Open frontier: NVIDIA GR00T N1.5, OpenVLA / OpenVLA-OFT+, Octo, RDT-2, RoboFlamingo-Plus, Diffusion Policy variants. Within 5-15 success-rate points of closed on most benchmarks.
The data flywheel is the bottleneck. Open-X Embodiment (RT-X collaboration) released ~2.4M trajectories from 22 robot embodiments. That's tiny relative to the internet-scale data LLMs train on. Synthetic-data generation via world models is the most promising direction (see world models guide).
Humanoid hardware progressed faster than expected in 2025-2026 but commercial deployments remain narrow: Figure (BMW factory pilots), 1X Neo (research/early consumer), Tesla Optimus (Tesla factories), Apptronik Apollo (logistics), Agility Digit (logistics), Unitree H1/G1/H2 (research market), Xpeng Iron, Booster T1, Sanctuary Phoenix.
Most "humanoid demo" videos are teleoperated or scripted. Real autonomous capability is far behind public perception. Verify any demo by asking: was the model end-to-end on this task, or were there human-in-the-loop / teleop assists?
Action chunking (predicting N future actions instead of 1) is the standard technique that made VLAs work at 30-50Hz instead of 1Hz. ACT (from the ALOHA project) and Diffusion Policy popularized it; π-zero refined it.
Inference latency is the binding constraint. Robot control needs >10Hz for most tasks; >30Hz for delicate manipulation. Most VLAs run on a single H100 or A6000-class GPU at 30-100Hz with action chunking.
Benchmarks are immature. CALVIN, LIBERO, SimplerEnv, RoboCasa, Open-X are the most-cited but each measures a narrow capability. No "MMLU-equivalent" yet.
Sim-to-real transfer remains hard. Massive Isaac Sim / Genesis / MuJoCo training helps, but real-world deployment still requires either real-data fine-tuning or domain randomization.
Embodiment generalization is the open research question. Cross-robot transfer (policy trained on Franka arm → works on UR5) works partially via action-normalization tricks; cross-form (arm → humanoid) barely works.
Safety frameworks for physical AI are nascent. ISO/TS 15066 (cobots) and emerging ISO/IEC JTC 1/SC 42 working groups address parts; a full safety case for autonomous-VLA humanoids does not yet exist.

What a VLA actually is

A Vision-Language-Action model takes:

One or more camera images (often multi-view: front, wrist, gripper)
A natural-language instruction ("pick up the red block and put it on the green plate")
Optional proprioception (joint angles, end-effector pose)

And outputs:

A sequence of robot actions (typically 7-DoF arm + 1-DoF gripper, or 26+ joints for a humanoid)
Each action is a continuous vector (joint targets, end-effector deltas, or motor commands)

The model is trained on:

Internet pretraining (vision-language data — same corpora as VLMs).
Robot trajectory data — episodes of (images, language, action) tuples collected from real robots, simulation, or teleoperation.
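
The input/output contract above can be written down as plain data structures. This is a minimal sketch, not any particular dataset's schema: the class names (`Observation`, `Step`, `Episode`) and field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Observation:
    images: Dict[str, bytes]  # camera name -> encoded frame, e.g. "front", "wrist"
    instruction: str          # natural-language task description
    proprioception: Optional[List[float]] = None  # joint angles / end-effector pose


@dataclass
class Step:
    observation: Observation
    action: List[float]       # one continuous action vector


@dataclass
class Episode:
    steps: List[Step] = field(default_factory=list)

    def action_dim(self) -> int:
        """Length of the action vector, or 0 for an empty episode."""
        return len(self.steps[0].action) if self.steps else 0


# Hypothetical single step: 7-DoF arm + 1-DoF gripper gives an 8-dimensional action.
obs = Observation(
    images={"front": b"", "wrist": b""},
    instruction="pick up the red block and put it on the green plate",
    proprioception=[0.0] * 7,
)
episode = Episode(steps=[Step(observation=obs, action=[0.0] * 8)])
```

A humanoid would use the same structure with a longer action vector; only the dimensionality changes.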

The single-model end-to-end pattern replaces the older robotics-pipeline stack (perception → planning → control), collapsing the robot into a single embodied AI agent rather than a chain of hand-tuned modules. It is the same paradigm shift seen elsewhere in ML: learned end-to-end models replaced hand-engineered computer vision, image-to-text models replaced OCR pipelines, and LLMs replaced multi-stage NLP pipelines.

The architectural lineage

VLAs descended from VLMs (Vision-Language Models), the canonical multimodal architecture. The standard recipe:

Start with a VLM base (PaLI, CLIP+LLaMA, Flamingo, BLIP-2, etc.). The model already understands images and language.
Add an action head — a small MLP or transformer that maps the VLM's final hidden state to continuous action vectors.
Fine-tune on robot trajectories, often with action chunking (predict N future actions in one forward pass).

Key architectural innovations over 2023-2026:

Action chunking (ACT from the ALOHA project, then Diffusion Policy and π-zero): predict N actions in one forward pass instead of one. Crucial for hitting >10Hz control rates with billion-parameter models.

Diffusion Policy — train the action head as a denoising diffusion model. Improved multi-modal action distributions (the model can express "either of two valid grasps" rather than averaging them into a wrong middle).

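OFT (Optimized Fine-Tuning): the OpenVLA-OFT recipe, which combines parallel decoding, action chunking, and continuous L1-regressed actions for faster and more effective adaptation to new robot setups.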

Flow matching — newer alternative to diffusion for action prediction; faster inference.

Cross-embodiment heads — separate action heads per robot type, sharing the VLM trunk.

Text-tokenized actions (RT-2 style, also used by the original OpenVLA): discretize each action dimension into bins and encode the bins as tokens so the LLM can predict them directly. Simpler but limits action precision.

Continuous action prediction (π-zero via flow matching, OpenVLA-OFT via L1 regression): direct continuous output; more precise.
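
The precision cost of tokenized actions is easy to see in a toy example. The sketch below assumes a uniform binning scheme over a normalized range of [-1, 1] with 256 bins per dimension (a bin count commonly reported for RT-2-style tokenization); the function name `make_tokenizer` and the range are illustrative assumptions, not any model's actual code.

```python
def make_tokenizer(low: float, high: float, n_bins: int = 256):
    """Return (encode, decode, bin_width) for uniform action binning."""
    width = (high - low) / n_bins

    def encode(x: float) -> int:
        clipped = min(max(x, low), high)  # out-of-range values saturate
        return min(int((clipped - low) / width), n_bins - 1)

    def decode(token: int) -> float:
        return low + (token + 0.5) * width  # centre of the bin

    return encode, decode, width


encode, decode, width = make_tokenizer(-1.0, 1.0)

x = 0.1234                  # a continuous action value
x_hat = decode(encode(x))   # what the robot receives after the round trip
max_error = width / 2       # worst-case quantization error per dimension
```

Every value inside a bin decodes to the same bin centre, so the round trip can be off by up to half a bin width. Continuous heads avoid this error at the cost of a separate output module.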

The 2026 VLA roster

Closed (commercial / research):

Physical Intelligence π-zero / π-0.5 / π-1 / Hi: the strongest generalist VLA family of 2025-2026. π-zero (Oct 2024) demonstrated hundreds of household tasks; π-1 (mid-2025) scaled. Hi (Hi Robot, early 2025) is the hierarchical long-horizon-reasoning variant. The lab is the most-cited "frontier-VLA-from-a-startup" story.
Figure Helix: Figure's in-house VLA for the Figure 02 humanoid. Notable for fast inference (200Hz upper-body control) and a dual-system architecture (System 1: fast visuomotor policy; System 2: slower vision-language model).
Google DeepMind Gemini Robotics (formerly RT-2 lineage) — Gemini-based VLA; tightly integrated with Google's broader robotics research. Closed; demos shown in 2025 papers.
Tesla Optimus: Tesla's humanoid model; closed; vertically integrated with Tesla hardware and Dojo / FSD-derived training.
Sanctuary Carbon (Phoenix) — closed; cognitive architecture combining classical AI + neural for the Phoenix humanoid.
1X World Model + policy — proprietary; powers Neo humanoid.
Apptronik Apollo policy — proprietary; NASA / Mercedes / GXO partnerships.

Open weights:

NVIDIA GR00T N1 / N1.5 — open VLA. N1 (Mar 2025) released as a base model; N1.5 (mid-2025) improved. Apache-style licence. Designed to be the open foundation for humanoids; tightly tied to Isaac Sim training pipeline. NVIDIA's Project GR00T also includes the Cosmos world model and a broader humanoid stack.
OpenVLA / OpenVLA-OFT+: Stanford et al., 2024-2025. 7B-parameter VLA built on Llama 2 with fused SigLIP + DINOv2 vision encoders. The reference open VLA; widely fine-tuned by the community.
Octo — Berkeley, 2024. Smaller and faster than OpenVLA; transformer policy over multimodal observations. Open under Apache.
RDT-2 (Robotics Diffusion Transformer 2) — Tsinghua et al. Strong open-source diffusion-based VLA. Apache.
RoboFlamingo / RoboFlamingo-Plus — open VLA built on Flamingo.
Diffusion Policy (Columbia, MIT) — not a single model but a recipe; many open implementations.
ACT (Action Chunking with Transformers): ALOHA project (Stanford). Open implementation widely used for bimanual manipulation.
CogACT / Hi-Robot / Helix-style open clones: emerging open replications of closed architectures.
GR-1 / GR-2 / GR-3 (ByteDance) — open VLA-style releases from Chinese labs.
Pi0 community ports (RoboPi, etc.) — research replications of Physical Intelligence's recipe.
OpenPi — community open implementation of π-style architecture.

Chinese frontier labs entering:

DeepSeek Robotics (rumored, unconfirmed at time of writing).
Alibaba DAMO Embodied AI — research releases.
Xpeng XBrain — for Iron humanoid.
Booster Robotics — T1 humanoid + open paper releases.
Unitree — releases models alongside H1/G1/H2 hardware; mix of open and proprietary.

The data problem

The fundamental bottleneck for VLAs is data. Compare:

LLMs train on trillions of tokens from the internet — essentially free at scale.
VLAs need (image, language, action) triplets from physical robots — each robot-hour produces ~100s-1000s of trajectories at best, and requires expensive hardware + teleoperation labour.

Open-X Embodiment (RT-X collaboration, Google + ~30 labs, 2023) released ~2.4M trajectories across 22 robot embodiments, a watershed dataset that enabled most subsequent open-source VLA work. But 2.4M trajectories is tiny relative to the internet pretraining LLMs enjoy.

Data-collection tactics in 2026:

Teleoperation farms — Physical Intelligence, Tesla, 1X, Figure all operate large teleop facilities where humans drive robots through tasks. Expensive but high-quality.
In-the-wild deployment — 1X Neo, Apptronik Apollo collect data while doing real customer work. Slow but realistic.
Simulation: Isaac Sim, MuJoCo, Genesis, Habitat, AI2-THOR, PyBullet. Cheap to scale, but the sim-to-real gap is real.
World-model-generated synthetic — generative video models (Sora, Veo, Cosmos, Genie, Lumiere) as "physics simulators" for VLA training. Active research direction; see world models guide.
Cross-embodiment pooling — use trajectories from many robots to train one model that generalizes via action normalization.
Human video as training data: large-scale human-activity video (Ego4D, EPIC-KITCHENS) as VLA pretraining. Limited because of the action-decoding gap (human video carries no robot action labels).

The data flywheel pattern: deploy a robot → collect trajectories → improve the model → deploy more capable robots → more trajectories. The companies winning the flywheel race in 2026 are Tesla (highest robot count via Optimus rollout), Figure (deepest enterprise pilots), Physical Intelligence (most teleop labour), and 1X (consumer-data-collection narrative).

Humanoid robot companies racing to ship

The 2026 humanoid roster, with deployment status:

US:

Tesla Optimus: Gen 3 in production; >1000 units in Tesla factories according to Elon Musk's claims (verification spotty). Bipedal + dexterous hands. Vertically integrated with Tesla AI / FSD inference stack.
Figure (Figure AI, founded by Brett Adcock): Figure 02 + Helix VLA. BMW factory deployment + commercial partnerships. Strong investor backing ($2B+ raised by 2026).
1X Technologies — Neo + Neo Beta (consumer/research). Norwegian-American; Eve (mobile) deployed in early settings; Neo focused on home.
Apptronik Apollo — industrial focus (Mercedes-Benz, GXO). NASA partnership.
Agility Digit: bipedal logistics; ~hundreds in deployment with Amazon and GXO (including GXO's Spanx facility).
Sanctuary AI Phoenix — Canadian; cognitive architecture; smaller deployments.
Boston Dynamics Atlas (electric) — research → early commercial Hyundai factory pilots.
Persona AI — newer, smaller US entrant.

China:

Unitree H1, G1, H2 — affordable research market; H2 (2025) the latest. Unitree leads China on volume of units sold.
Xpeng Iron — auto-OEM-backed; XBrain VLA; production line integration ambitions.
Booster Robotics T1 — Tsinghua-affiliated; strong research output.
UBTech Walker S / S2 — long-running Chinese humanoid program; deployed in EV factories.
Fourier GR-1 / GR-2 — research market.
Agibot (Zhiyuan) — newer Chinese frontier humanoid; major investor backing.
LimX Dynamics — research and quadruped + humanoid; growing.
Galbot — bimanual manipulation focus; partnership with Tsinghua.

Other:

Mentee Robotics (Israel) — Menteebot; bipedal generalist.
Engineered Arts Ameca — UK; expressive face / interaction focus.
HMND (UK) — early-stage entrant.

Reality check: most humanoid demos in 2026 are still narrow. Folding laundry remains hard. Door-handle generalization remains hard. Mobile manipulation with dynamic obstacles remains hard. The deployments that work in production are: factory pick-and-place with strong scene constraints, logistics tote handling, simple inspection rounds. Real "general purpose home robot" capability remains years away.

Live data: /leaderboard/physical tracks humanoid companies + VLA models.

Non-humanoid foundation-model robotics

VLAs are not just for humanoids:

Single arm + gripper (Franka, UR5, ALOHA) — the easiest target; most academic VLA work happens here.
Bimanual (ALOHA, Mobile ALOHA, Yumi) — folding laundry, cooking demos.
Mobile manipulation (Stretch by Hello Robot, mobile ALOHA bases, Spot with arm) — manipulation while moving.
Quadrupeds (Boston Dynamics Spot, Anymal, Unitree Go2) — locomotion + tool use; less arm-manipulation focused.
Drones (Skydio, Anduril, autonomous DJI) — VLA-style perception-action models for aerial; less mature than ground.
Surgical robots (Intuitive da Vinci, Medtronic Hugo) — narrow VLA-style models for specific procedures; regulatory bar much higher.
Industrial / factory (KUKA, ABB, Fanuc + VLA wrappers) — incumbents adding learning layers.

Foundation models for non-humanoid robotics see less hype but more real deployment.

The benchmark landscape

The "best" robotics benchmarks in 2026 — each measuring something narrow:

CALVIN — long-horizon language-conditioned manipulation; 34-task suite; sim-only. Standard for VLA papers.
LIBERO — lifelong robot manipulation; 130 tasks across 4 task suites. Strong for studying continual learning.
SimplerEnv (Real-Sim eval) — sim that's calibrated to closely match real-robot results for several benchmarks. Bridges sim-to-real gap for evaluation.
RoboCasa — large-scale household manipulation in MuJoCo, with 100+ kitchen tasks.
Open-X Embodiment evaluation suites — multiple cross-embodiment benchmarks built on the dataset.
Habitat 3.0 / HSSD — embodied AI in house environments; navigation + manipulation.
AI2-THOR / RoboTHOR / ManipulaTHOR — older but still cited.
Genesis Bench — newer; built on the Genesis simulator.
Bridge / BridgeData V2 — real-robot trajectory dataset + accompanying eval.
DROID: large real-robot dataset (76k demos, collected over 12 months); also serves as eval.

Caveats:

Most papers report on different splits or different sets of tasks. Cross-paper comparison is hard.
Sim performance often doesn't transfer to real performance; SimplerEnv and BridgeData try to address this.
No held-out hidden benchmark exists yet — every benchmark's tasks are public, so contamination through training-data inclusion is possible.
Real-robot evaluation is expensive — most papers use sim for primary results and report a small real-robot eval as a sanity check.
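
One way to see why small real-robot evals only work as sanity checks is to put a confidence interval on the reported success rate. The sketch below uses the standard Wilson score interval for a binomial proportion; the function name and the trial counts are illustrative assumptions, not figures from any paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Same 80% headline success rate, very different certainty.
real_lo, real_hi = wilson_interval(successes=16, trials=20)     # small real-robot eval
sim_lo, sim_hi = wilson_interval(successes=800, trials=1000)    # large sim eval
```

With 20 trials the interval spans more than 30 percentage points, so two papers reporting 80% and 70% on small real-robot evals may not be distinguishable. The 1000-trial interval is only a few points wide.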

The field needs an "MMLU" / "MTEB" — a standardized aggregate benchmark with hidden tasks. As of 2026, this does not exist.

Training a VLA

Standard 2026 training recipe:

VLM pretraining — start from a strong open-weight VLM (Qwen-VL, LLaVA, PaliGemma, SigLIP+Llama). Skip if you're using a closed VLM you don't control.
Action-head initialization — add a small transformer or diffusion head for action prediction.
Trajectory fine-tuning — supervised fine-tuning on Open-X + your own trajectory data. Often with action chunking (predict 50 actions per inference).
Embodiment normalization — apply scaling / centering to actions across robot types so the same model handles multiple embodiments.
Optional: on-robot RL — RL on the deployed robot to fix specific failure modes. Slow because of sample cost; growing as RFT (reinforcement fine-tuning) techniques mature.
Optional: distillation from teacher VLA — train a smaller VLA to mimic a larger one for deployment.
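
The embodiment-normalization step can be sketched as per-robot, per-dimension standardization. This is a minimal illustration under stated assumptions: it uses mean and standard deviation (some recipes use quantile bounds instead), and the robot names and action values are made-up toy data.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Stats = Tuple[List[float], List[float]]  # (per-dimension mean, per-dimension std)


def fit_stats(actions: List[List[float]], eps: float = 1e-6) -> Stats:
    """Compute per-dimension mean and std over one robot's actions."""
    dims = list(zip(*actions))
    return [mean(d) for d in dims], [max(pstdev(d), eps) for d in dims]


def normalize(action: List[float], stats: Stats) -> List[float]:
    mu, sigma = stats
    return [(a - m) / s for a, m, s in zip(action, mu, sigma)]


def denormalize(action: List[float], stats: Stats) -> List[float]:
    mu, sigma = stats
    return [a * s + m for a, m, s in zip(action, mu, sigma)]


# Hypothetical raw actions: two robots with very different units and ranges.
raw: Dict[str, List[List[float]]] = {
    "franka": [[0.0, 0.10], [0.2, 0.30], [0.4, 0.50]],
    "ur5": [[10.0, -5.0], [20.0, 0.0], [30.0, 5.0]],
}
stats = {robot: fit_stats(acts) for robot, acts in raw.items()}

franka_first = normalize(raw["franka"][0], stats["franka"])
ur5_first = normalize(raw["ur5"][0], stats["ur5"])
```

After normalization both robots' actions live on the same scale, so one model can be trained on the pooled data. At deployment the predicted action is passed through `denormalize` with the target robot's statistics.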

Compute requirements: training a 7B-parameter VLA from a strong VLM base takes ~1000-10000 H100-hours depending on data size and recipe. Frontier closed labs (Physical Intelligence, Figure, NVIDIA) likely spend 100k+ H100-hours per generation.

Inference: how a VLA runs on real hardware

A typical VLA inference loop on a robot:

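from collections import deque

def control_loop(vla, robot, cameras, language, hz=30):
    # vla, robot, and cameras are placeholder interfaces, not a real library API.
    action_buffer = deque()  # FIFO: the oldest predicted action runs first
    while robot.is_active():
        images = [camera.capture() for camera in cameras]
        proprioception = robot.read_state()  # joint angles + end-effector pose
        if not action_buffer:  # refill only when the previous chunk is used up
            action_buffer.extend(vla.predict(images, language, proprioception))  # e.g. 50 actions
        robot.execute(action_buffer.popleft())
        robot.wait_for_next_tick(hz)  # hold the loop at the control rate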

Action chunking means the VLA runs at ~1-3 Hz (model inference) while the robot runs at 30-100 Hz (action execution). The mismatch is bridged by predicting many actions per inference.

Hardware:

Most academic VLAs run on a single H100 (80GB VRAM) or RTX A6000 (48GB VRAM).
Real robot deployments use NVIDIA Jetson AGX Orin (32-64 GB unified memory) or Jetson Thor (up to 128 GB), both in an edge form factor.
Figure reports that Helix runs entirely on dual onboard embedded GPUs: one for System 1 (fast control) and one for System 2 (slower vision-language reasoning).
Tesla Optimus uses Tesla's HW4 / HW5 inference silicon.

Latency budget:

VLA inference: 50-500ms per prediction depending on model size.
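With action chunking (50 actions per prediction): the sustainable action rate is chunk size divided by prediction time, so 1-2 predictions per second yield an effective action rate of 50-100Hz.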
Camera capture + preprocessing: ~10-30ms.
Action execution: depends on robot controller (typically <10ms loop).
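
The latency budget reduces to a few lines of arithmetic. The sketch below is illustrative: `chunking_budget` and its field names are assumptions, and it ignores camera and controller overhead so that only model latency, chunk size, and control rate are in play.

```python
import math


def chunking_budget(inference_s: float, chunk_size: int, control_hz: float) -> dict:
    """Relate model latency, chunk size, and control rate for a chunked policy."""
    if inference_s <= 0 or chunk_size <= 0 or control_hz <= 0:
        raise ValueError("all inputs must be positive")
    chunk_duration_s = chunk_size / control_hz  # how long one chunk lasts on the robot
    return {
        "chunk_duration_s": chunk_duration_s,
        # Highest action rate the model could feed if inference were the only cost.
        "max_sustained_hz": chunk_size / inference_s,
        # Actions the robot consumes while one prediction is being computed.
        "actions_used_during_inference": math.ceil(inference_s * control_hz),
        # True if a new chunk can be ready before the current one runs out.
        "keeps_up": inference_s <= chunk_duration_s,
    }


slow_model = chunking_budget(inference_s=0.5, chunk_size=50, control_hz=30)
small_chunk = chunking_budget(inference_s=0.5, chunk_size=10, control_hz=30)
```

Under these assumptions a 500ms prediction with 50-action chunks keeps a 30Hz loop fed, provided the next prediction is requested while about 15 actions are still buffered. With 10-action chunks the same model cannot keep up, because a chunk lasts only a third of a second.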

Sim-to-real and the reality gap

The reality gap — the divergence between simulated and real physics — is the single biggest blocker for sim-trained policies.

Tactics that work:

Domain randomization — randomize friction, mass, lighting, textures, sensor noise during sim training so the policy learns robust behaviors.
Real-data fine-tuning — pretrain in sim, fine-tune on real-robot data. The 2026 standard for most production deployments.
Hybrid sim — simulators calibrated to specific robot hardware (Isaac Sim's GPU-accelerated physics; Genesis's differentiable physics; MuJoCo MJX).
System ID: estimate the simulator's physical parameters from real-robot data, or learn the remaining dynamics gap as a residual model and add it back to the simulator.
World-model generated data — use Cosmos, Veo, Sora as "video simulators" for VLA training. Active research direction.
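
Domain randomization, the first tactic in the list above, amounts to resampling simulator parameters at the start of every training episode. The sketch below is simulator-agnostic; the parameter names and ranges are assumptions for illustration and would need tuning for a real robot and simulator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    friction: float
    mass_scale: float
    light_intensity: float
    sensor_noise_std: float


# Hypothetical (low, high) ranges; real values depend on the hardware being modelled.
RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "light_intensity": (0.3, 1.0),
    "sensor_noise_std": (0.0, 0.02),
}


def sample_params(rng: random.Random) -> SimParams:
    """Draw one independent, uniformly sampled parameter set."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)  # seeded so a training run is reproducible
episode_params = [sample_params(rng) for _ in range(1000)]
```

Each sampled `SimParams` would be applied to the simulator before the episode starts, so the policy never sees the same physics and rendering twice and cannot overfit to one configuration.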

The simulators that matter in 2026:

Isaac Sim / Isaac Lab (NVIDIA) — the leading commercial sim platform; GPU-accelerated; tight integration with Project GR00T.
MuJoCo (DeepMind) — the academic standard; MJX is the JAX-accelerated version.
Genesis — newer; differentiable physics with first-class sim-to-real focus.
PyBullet — older but still widely used in research.
Habitat 3.0 — embodied AI in house environments.
Cosmos (NVIDIA, January 2025): generative world model platform for physical AI simulation.
Genie 3 (DeepMind, mid-2025) — neural world model for game/sim environments; potential robotics use.

Embodiment generalization

The cross-robot transfer problem — train on Franka, deploy on UR5 — is the open research question.

What works partially:

Action normalization (scale + center actions per robot).
Cross-embodiment training (Open-X) — train on many robots at once.
VLA → small adapter per robot (parameter-efficient transfer).
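
A common companion to these tricks is padding every robot's action vector to a shared width, with a mask so the loss ignores the unused dimensions. The sketch below is a minimal illustration; the embodiment names, action sizes, and helper names are assumptions, not a specific model's implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical embodiments: 7-DoF arm + gripper, and a two-arm setup.
ACTION_DIMS: Dict[str, int] = {"single_arm": 8, "bimanual": 16}
MAX_DIM = max(ACTION_DIMS.values())


def pad_action(robot: str, action: List[float]) -> Tuple[List[float], List[int]]:
    """Zero-pad an action to MAX_DIM and return (padded_action, mask)."""
    dim = ACTION_DIMS[robot]
    if len(action) != dim:
        raise ValueError(f"{robot} expects {dim} action dims, got {len(action)}")
    pad = MAX_DIM - dim
    return list(action) + [0.0] * pad, [1] * dim + [0] * pad


def unpad_action(robot: str, padded: List[float]) -> List[float]:
    """Keep only the dimensions that exist on this robot."""
    return padded[: ACTION_DIMS[robot]]


def masked_l1(pred: List[float], target: List[float], mask: List[int]) -> float:
    """Mean absolute error over the real (unmasked) dimensions only."""
    return sum(m * abs(p - t) for p, t, m in zip(pred, target, mask)) / sum(mask)


target, mask = pad_action("single_arm", [0.1] * 8)
prediction = [0.1] * 8 + [9.0] * 8  # garbage in the padded slots is ignored
loss = masked_l1(prediction, target, mask)
```

This lets one output head serve several robots. It does not by itself solve transfer, since the meaning of each dimension still differs between embodiments.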

What doesn't work well yet:

Cross-form transfer (arm → humanoid → quadruped). Some recent results show partial transfer but it's far from solved.
Cross-sensor (RGB-only → RGB+depth) without fine-tuning.
Cross-task transfer in fully novel domains.

Physical Intelligence's π-zero paper is the most-cited recent demonstration of strong cross-embodiment behavior; it works on multiple robot types from a single model, but with embodiment-specific fine-tuning rather than zero-shot generalization.

Safety in physical AI

VLA-driven robots introduce real-world risk that LLMs don't:

Physical injury to humans nearby.
Property damage.
Goal misspecification at scale (a humanoid misinterpreting "clean up the kitchen" can be expensive).
Adversarial perturbations to visual inputs (the robotics analog of a sticker on a stop sign).

2026 safety frameworks:

ISO/TS 15066 — cobots (older but still applicable to industrial humanoid arms).
ISO 13482: safety standard for personal care robots, a category of service robot.
EU AI Act Article 14 / 15 — high-risk AI human oversight + robustness; applies to robotics.
NIST AI Risk Management Framework — voluntary US framework.
ISO/IEC JTC 1/SC 42: emerging working groups on AI safety.

Operational mitigations:

Force limits + emergency stop reachable.
Geofenced operating zones; humans excluded during autonomous operation in many deployments.
Tele-supervision for novel tasks.
Recorded video + sensor logs for post-incident analysis.
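
In software, the first mitigation often appears as a guard between the policy and the controller. The sketch below is an illustrative assumption about how such a wrapper might look: it clamps commanded per-step deltas rather than measured forces, the names `make_guard` and `estop_pressed` are hypothetical, and real force limiting and emergency stops live in certified hardware and controller layers.

```python
from typing import Callable, List, Sequence


def make_guard(max_delta: Sequence[float], estop_pressed: Callable[[], bool]):
    """Build a filter that clamps each action dimension and honours an e-stop flag."""

    def guard(action: Sequence[float]) -> List[float]:
        if len(action) != len(max_delta):
            raise ValueError("action and limit dimensions differ")
        if estop_pressed():
            return [0.0] * len(action)  # command no motion while the stop is active
        return [max(-lim, min(lim, a)) for a, lim in zip(action, max_delta)]

    return guard


# Hypothetical two-dimensional example with a software-visible e-stop flag.
estop = {"pressed": False}
guard = make_guard(max_delta=[0.1, 0.2], estop_pressed=lambda: estop["pressed"])
safe_action = guard([0.5, -0.5])  # policy output exceeds both limits
```

Every policy output passes through `guard` before execution, so an out-of-range prediction is bounded regardless of what the model produced.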

No equivalent of LLM red-teaming exists yet for physical VLAs. The field is early.

The 2026 → 2027 outlook

Closed VLAs continue to lead on commercial deployment; Physical Intelligence, Figure, NVIDIA-via-partners hold the frontier.
Open VLAs continue to close the gap on academic benchmarks; expect GR00T-class open models within ~6 months of closed leaders.
Humanoid deployments grow but remain narrow. Factory pick-and-place, logistics tote-handling, simple inspection are the workloads that work. Home humanoids remain mostly demo-only.
World-model-generated training data matures and becomes a real input to VLA training pipelines (Cosmos, Genie, Veo etc.).
Cross-embodiment transfer improves but doesn't get solved.
A standardized "MMLU-of-robotics" benchmark likely emerges by 2027; current candidates include extensions of Open-X eval suites and the SimplerEnv-Lite project.
Safety regulation tightens, especially in EU and California. Expect emerging requirements for incident reporting, audit trails, and human oversight for autonomous humanoids.
Investor sentiment moderates from 2024-25 hype; companies with real revenue + factory deployments (Figure, Apptronik, Agility) consolidate share.
Chinese humanoid programs continue to scale on volume (Unitree, Xpeng, Booster, UBTech, Agibot); software gap remains real but narrowing.