World Models: The Ultimate Guide (2026 Edition)
Comprehensive 2026 guide to world models — what they are (vs video generators, vs simulators), the closed and open roster (Sora 2, Veo 3, Cosmos, Genie 3, Lumiere, Kling 2, Hailuo, V-JEPA 2, DINO World Model), how they're trained, the physics-fidelity question, applications in robotics / agents / games, the benchmarks (VBench, WorldVQA, Genesis Bench), and the open research questions about whether 'real' world models are emerging or whether what we have is just very good video generation.
The phrase "world model" has been used to mean several different things in 2024-2026, and the conflation matters because the engineering trade-offs are very different. In its strictest Yann LeCun / JEPA-school sense, a world model is a system that predicts future states of an environment in a learned latent space and can be queried for what would happen if X were done — a planner-friendly representation, not a renderer. In its OpenAI / Sora sense, a world model is a generative video model that displays emergent physics-like behavior (objects persist, occluders work, gravity sometimes holds) and is therefore a "simulator of the world" at the pixel level. In its NVIDIA / Cosmos sense, a world model is a video / physics generator specifically trained to produce useful synthetic data for physical AI (robotics, AVs). In its DeepMind / Genie sense, a world model is an interactive environment generator: input a few seconds of video or a single image and get a playable, action-conditional environment back. By 2026 all four meanings have product-market fit somewhere, and the underlying tech has converged in interesting ways — but they answer different questions and you should know which one you actually need.
The take: in 2026 the term "world model" usefully splits into four camps. Generative video models (Sora 2, Veo 3, Kling 2, MiniMax Hailuo, Runway Gen-4, Pika 2.x, LTX-Video, Open-Sora, CogVideoX) produce video from text or an image; they are useful for content creation but only loosely useful as "simulators." Physical-AI world models (NVIDIA Cosmos with its Predict / Transfer / Reason variants, V-JEPA 2) generate physically plausible video specifically for robot and AV training data. Interactive world models (DeepMind Genie 2 / 3, Decart Oasis, World Labs' Large World Models, Wayve GAIA-2) produce playable, action-conditional environments. Latent-space (planner-friendly) world models (DreamerV3, JEPA-family, MuZero / EfficientZero) predict in a compact latent space and are used in reinforcement learning and, increasingly, robotics. The closed frontier on raw video generation belongs to OpenAI (Sora 2) and Google DeepMind (Veo 3); the Chinese frontier (Kling 2, Hailuo 02, Wan) is competitive on most public benchmarks; the open-weight frontier (Open-Sora 2, CogVideoX-5B, Mochi 1, LTX-Video, HunyuanVideo, Wan 2.1) has narrowed the gap to roughly 6 months behind closed. Whether any of these are actually world models in the rigorous sense (predicting outcomes, supporting planning, capturing physics) is a contested empirical question. This guide is the map.
Companion reading: robotics foundation models / VLA ultimate guide for the consumer of synthetic video, multimodal serving (vision + audio) for the inference side, open weights ultimate guide for the LLM half, and synthetic data and distillation for the broader data-generation pattern.
Table of contents
- Key takeaways
- Four definitions of 'world model'
- Generative video models (Sora, Veo, Kling, Hailuo, Pika, Runway, open weights)
- Physical-AI world models (Cosmos, V-JEPA, World Labs)
- Interactive world models (Genie 2/3, Oasis, GAIA-2)
- Latent-space world models (DreamerV3, JEPA family, MuZero)
- How they're trained
- The physics-fidelity question
- Applications: robotics, AVs, games, content
- Benchmarks (VBench, WorldVQA, Genesis Bench, MagicPath)
- The data and compute requirements
- Open research questions
- The 2026 → 2027 outlook
Key takeaways
- "World model" means at least four different things in 2026 usage. Conflating them causes confused product decisions. Clarify: generative-video, physical-AI-data-generator, interactive-environment, or latent-space-planner.
- Generative video frontier (closed): Sora 2 (OpenAI, late 2025), Veo 3 (Google DeepMind, mid-2025), Kling 2 / Kling 2.0 Master (Kuaishou), Hailuo 02 (MiniMax), Pika 2.x, Runway Gen-4. Within a small range on most public benchmarks; differentiate on price, length, motion, prompt-adherence.
- Generative video frontier (open weights): HunyuanVideo (Tencent, 13B, Tencent Hunyuan Community License), Wan 2.1 (Alibaba, 14B, Apache), Open-Sora 2 (HPC-AI Tech), CogVideoX-5B (Zhipu AI / Tsinghua), Mochi 1 (Genmo, Apache), LTX-Video (Lightricks, Apache). Open weights ~6 months behind closed on quality; comparable at smaller resolutions/lengths.
- Physical-AI world models: NVIDIA Cosmos-Predict / Transfer / Reason (released Jan 2025; open under permissive licence). Designed specifically for robot/AV training data. V-JEPA 2 (Meta, 2025) is the JEPA-style alternative.
- Interactive world models: DeepMind Genie 2 (Dec 2024) and Genie 3 (Aug 2025), following the original Genie (Feb 2024); Decart Oasis (Minecraft-style playable model); World Labs (Fei-Fei Li's startup, "spatial intelligence" framing); Wayve GAIA-2 (driving simulator). Real progress on "play a generated game."
- Latent-space models quietly dominate in reinforcement learning: DreamerV3 still SOTA on many control benchmarks; V-JEPA 2 extends JEPA to video; MuZero and its EfficientZero variants carry on AlphaGo-lineage planning.
- Whether generative video is "actually" a world model remains empirically contested. Sora-style models produce physically-plausible video most of the time but fail on rigid-body physics, multi-step causality, and out-of-distribution scenes. They are useful as data generators even if they are not correct simulators.
- Cosmos and similar physical-AI world models already feed into robotics training data pipelines in 2026; the cycle of "VLA needs data" → "world model generates synthetic data" → "VLA trains" is starting to close.
- Compute: training a frontier video model costs tens of millions of GPU-hours; serving frontier video is the most expensive AI inference category by far (~$2-15 per 5-second clip retail).
- Evaluation: VBench is the standard for general video quality; WorldVQA tests factual / physical knowledge; Genesis Bench tests sim-grade physics; nothing yet evaluates "true world-model-ness" rigorously.
- The open question for 2026-2027: do these systems converge into genuinely useful world models (planner-friendly, physically faithful, action-controllable) or stay as expensive content-generation services with emergent simulator-like behavior?
Four definitions of 'world model'
1. Generative video model: takes text and/or image input, outputs a video clip. Trained on internet-scale video data. Examples: Sora 2, Veo 3, Kling 2, HunyuanVideo. Whether they are "world models" depends on whether you think pixel-level coherence implies world understanding.
2. Physical-AI world model: a video / state generator specifically trained to produce data useful for physical-AI training (robotics, AVs). Constrained to be physically plausible; often action-conditional; tightly integrated with simulator pipelines. Examples: NVIDIA Cosmos, V-JEPA 2.
3. Interactive world model: takes input (image, video, prompt) and produces a playable environment in which the user provides actions over time and the model responds with video / state. Examples: DeepMind Genie, Decart Oasis, Wayve GAIA, World Labs Marble.
4. Latent-space world model: trained to predict next-state in a compact learned latent representation. Used for planning and reinforcement learning. Examples: DreamerV3, JEPA-family, MuZero, EfficientZero.
These are not mutually exclusive. Cosmos fits definitions 1 and 2. Genie 3 fits definitions 1 and 3. The JEPA / V-JEPA family targets definitions 2 and 4. The boundaries are fuzzy and the field hasn't agreed on terminology.
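One way to keep the four definitions apart is to look at the call signature each one would expose. The sketch below is illustrative only: every class and method name is hypothetical and does not correspond to any vendor's API, and frames and latents are reduced to plain lists of numbers.

```python
from typing import List, Protocol, Sequence

Frame = List[float]   # stand-in for an image; real systems use tensors
Latent = List[float]  # stand-in for a compact learned state


class GenerativeVideoModel(Protocol):
    """Definition 1: prompt in, finished clip out (no actions)."""
    def generate(self, prompt: str, num_frames: int) -> List[Frame]: ...


class PhysicalAIWorldModel(Protocol):
    """Definition 2: clip conditioned on context frames and planned actions."""
    def generate(self, context: List[Frame], actions: Sequence[int]) -> List[Frame]: ...


class InteractiveWorldModel(Protocol):
    """Definition 3: one action in, one frame out, called in a loop."""
    def step(self, action: int) -> Frame: ...


class LatentWorldModel(Protocol):
    """Definition 4: predicts the next latent state, never pixels."""
    def predict(self, latent: Latent, action: int) -> Latent: ...


class ToyLatentModel:
    """Hypothetical toy: the latent is a 1-D position, the action a displacement."""
    def predict(self, latent: Latent, action: int) -> Latent:
        return [latent[0] + action]
```

The practical difference shows up in what a caller can do: only the last two signatures accept an action at every step, which is what a planner or a player needs.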
Generative video models
The 2026 frontier in raw text-to-video and image-to-video:
Closed:
- Sora 2 (OpenAI): late 2025. Up to 20-second clips at 1080p; native synchronised audio (dialogue and sound effects); native vertical/horizontal/square; C2PA provenance metadata plus a visible watermark (SynthID is Google's watermark, used on Veo output, not an OpenAI scheme). The strongest perception+motion model for cinematic shots.
- Veo 3 (Google DeepMind) — mid-2025. 8-second clips, 4K-ish quality, native audio (dialogue, SFX, music). Tight integration with Vertex AI and YouTube Shorts; strong on photo-realism + camera motion.
- Kling 2 / Kling 2.0 Master (Kuaishou) — mid-2025. 10-second 1080p; strong on cinematic motion. China-developed but globally available. Most-cited "Chinese frontier video model."
- Hailuo 02 (MiniMax) — late 2024 / 2025 lineage. Strong prompt adherence; one of the cheaper closed options.
- Pika 2.x (Pika Labs) — consumer-focused; strong creative-effects ecosystem.
- Runway Gen-4 — pioneer; strong on cinematic features (Motion Brush, camera control). Less raw quality than Sora/Veo, deeper creative-tools UX.
- Adobe Firefly Video — Adobe-stack-integrated; copyright-clean training data.
- Luma Ray 2 — multimodal generation; strong on stylized output.
- DreamMachine (Luma) — older Luma product; mostly superseded by Ray 2.
- Wan 2.1 (Alibaba) — released as both API and open weights (see open section).
Open weights:
- HunyuanVideo (Tencent): 13B parameters, Dec 2024; released under the Tencent Hunyuan Community License rather than a standard permissive licence. Leading open video model on most benchmarks. Strong base for fine-tuning.
- Wan 2.1 (Alibaba) — 14B parameters, early 2025; Apache 2.0. Includes text-to-video, image-to-video, and video editing capabilities.
- CogVideoX-5B (Zhipu AI / Tsinghua): 5B parameters, late 2024; released under the custom CogVideoX License (the smaller 2B variant is Apache 2.0). Strong open baseline.
- Mochi 1 (Genmo) — 10B parameters, late 2024; Apache 2.0. Optimized for adherence to prompt.
- LTX-Video (Lightricks) — fast inference, open weights, Apache 2.0.
- Open-Sora 2 (HPC-AI Tech / community): open replication of a Sora-style architecture; ~11B parameters.
- Allegro (Rhymes AI) — open; tighter compute footprint.
- AnimateDiff — older but still cited; motion modules over Stable Diffusion.
- EasyAnimate (Alibaba) — open; good documentation.
Evaluation note: open video models typically perform comparably to closed at lower resolutions and shorter durations; the gap shows up in longer clips (10s+), consistent multi-shot scenes, and prompt-adherence on complex inputs.
Physical-AI world models
Designed specifically to generate training data for robotics and AVs:
- NVIDIA Cosmos (Jan 2025, with continuous updates through 2026) — open under permissive NVIDIA Open Model Licence. Cosmos-Predict (text-to-video, image-to-video), Cosmos-Transfer (sim-to-real translation), Cosmos-Reason (reasoning about future states). Designed to plug into Isaac Sim and feed training data to GR00T. The most commercially-integrated "physical AI world model" in 2026.
- V-JEPA 2 (Meta) — mid-2025. Self-supervised video model in JEPA-style latent prediction. Released open. Aims to capture physical-world structure without per-pixel prediction.
- DINO-WM — DINOv2-based world model; latent-prediction style; open.
- DriveWorld / GAIA-2 (Wayve) — driving-specific world model for AV simulation. Closed.
- CARLA + Maps2DV — older simulators with growing learned components.
- Tesla World Model (rumored, closed) — Tesla's internal world model for FSD simulation; some details leaked but no public release.
The physical-AI camp is where "world model" most rigorously matches its name — these systems are evaluated on their utility as training data for downstream policies, and they're tuned for physical plausibility over visual fidelity.
Interactive world models
Models you can play, not just watch:
- DeepMind Genie 2 / Genie 3 (Dec 2024, Aug 2025): input is an image (Genie 2) or a text prompt (Genie 3); output is a playable 3D environment that responds to keyboard input. Genie 3 extended to longer episodes and more realistic environments. The original Genie (Feb 2024) produced 2D platformer-style worlds and supplied the headline demo: generate a video game from a hand-drawn sketch.
- Decart Oasis — Minecraft-like playable world model; runs at >20 FPS; demonstrably a "playable AI world."
- World Labs (Fei-Fei Li) — "spatial intelligence" startup; Marble (announced 2025) generates explorable 3D environments from a single image. Tight academic + commercial position.
- Wayve GAIA-2 — driving-specific interactive world model; AV simulation.
- WHAM (Microsoft Research, 2024-2025) — "world model" for game environments; can generate Bleeding Edge-style game play.
- Sora 2 with action conditioning (research demo) — Sora variants that take simulated controller inputs.
Interactive world models are the youngest of the four categories but produced the most visually striking demos of 2025. Whether they generalize beyond demo conditions remains an open question.
Latent-space world models
Older and less hype-laden but quietly dominant in RL research:
- DreamerV3 (DeepMind, 2023; widely deployed since). Latent-space world model + policy + value function. SOTA on many continuous-control and game benchmarks. Strong baselines extended into 2025-2026.
- JEPA-family (Meta / LeCun's group) — I-JEPA (images), V-JEPA / V-JEPA 2 (video). Latent-space prediction; trained self-supervised; aimed at "understanding without per-pixel reconstruction."
- MuZero (DeepMind) / EfficientZero (Tsinghua-led academic group): model-based RL via learned latent dynamics; MuZero continues the AlphaGo planning lineage, and EfficientZero is its sample-efficient variant.
- TD-MPC / TD-MPC2 — model-predictive-control with learned dynamics; strong on continuous-control.
- PlaNet: the older ancestor of the Dreamer family; still cited.
- Dynalang — language-conditioned latent world models.
This camp is the one most clearly answering "what is a world model" in LeCun's framing: a learned predictor of latent future states usable for planning. The other camps generate pixels; this one generates representations.
How they're trained
Generative video models:
- Diffusion architectures (DiT — Diffusion Transformer) trained on hundreds of millions to billions of video clips.
- Two-stage: VAE encoder compresses video → latent space → diffusion in latent → VAE decode.
- Conditioning: text encoder (CLIP, T5, custom), optional image conditioning, optional motion / camera-trajectory conditioning.
- Training compute: ~1M-50M H100-hours per generation for frontier models.
- Data: scraped + licensed video; the legal landscape is contested (NYT v. OpenAI, Disney v. Midjourney, etc.).
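The encode, add noise, predict, compare loop in the two-stage recipe above can be sketched in a few lines. This is a toy under stated assumptions: it uses a DDPM-style noise-prediction loss (production models may use other objectives such as flow matching), the "VAE encoder" is a hypothetical pair-averaging function, text conditioning is omitted, and a "video" is a flat list of numbers.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def encode(video: Vector) -> Vector:
    """Toy stand-in for a VAE encoder: average adjacent pairs to halve the size."""
    return [(video[i] + video[i + 1]) / 2 for i in range(0, len(video) - 1, 2)]


def add_noise(z0: Vector, eps: Vector, alpha_bar: float) -> Vector:
    """Forward diffusion: z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]


def training_loss(denoiser: Callable[[Vector, float], Vector],
                  video: Vector, alpha_bar: float, rng: random.Random) -> float:
    """One noise-prediction step: MSE between the true and the predicted noise."""
    z0 = encode(video)                            # compress to latent space
    eps = [rng.gauss(0.0, 1.0) for _ in z0]       # sample Gaussian noise
    z_t = add_noise(z0, eps, alpha_bar)           # corrupt the latent
    pred = denoiser(z_t, alpha_bar)               # model guesses the noise
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(eps)
```

A real denoiser would be a Diffusion Transformer that also receives the text embedding; here any callable with the same shape works, for example `lambda z, a: [0.0] * len(z)` as a trivial baseline. At sampling time the trained denoiser is run in reverse and the VAE decoder maps the cleaned latent back to pixels.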
Physical-AI world models:
- Often diffusion-based but trained on simulation outputs (Isaac Sim, CARLA) + real-robot / AV-video data.
- Action-conditional: input includes the proposed action; output includes the resulting future state.
- Often paired with a discriminator that filters physically implausible outputs.
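The filtering step can be pictured as rejection sampling: generate candidates, score each one, keep those that pass. The sketch below assumes a hand-written heuristic (largest frame-to-frame jump of a tracked object) in place of a learned discriminator; the function names, the trajectory format, and the threshold are all hypothetical.

```python
from typing import Callable, List, Sequence

Trajectory = List[float]  # tracked object height per frame, in metres (toy stand-in)


def max_jump(traj: Trajectory) -> float:
    """Largest frame-to-frame displacement; large values suggest 'teleporting'."""
    return max((abs(b - a) for a, b in zip(traj, traj[1:])), default=0.0)


def filter_plausible(samples: Sequence[Trajectory],
                     score: Callable[[Trajectory], float],
                     threshold: float) -> List[Trajectory]:
    """Keep only samples whose implausibility score is at or below the threshold."""
    return [s for s in samples if score(s) <= threshold]


candidates = [
    [1.0, 0.9, 0.8],  # smooth motion
    [1.0, 0.2, 0.9],  # object jumps down and back up between frames
]
kept = filter_plausible(candidates, max_jump, threshold=0.3)
```

In a production pipeline the `score` callable would be a trained model rather than a one-line heuristic, but the surrounding control flow stays the same.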
Interactive world models:
- Trained on labeled gameplay video (input action + resulting frame). Genie's innovation was learning latent actions from unlabeled video.
- Roll-out architecture: predict next frame; feed it back; predict next; etc.
- Heavy compute for training; light-ish for inference (real-time playability is required).
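The roll-out architecture is a plain feedback loop. The sketch below uses a hypothetical hand-coded "model" (a one-dimensional frame in which a single 1 marks the player) purely to show the control flow; a real system would call a neural network at the same point.

```python
from typing import Callable, List, Sequence

Frame = List[int]  # toy 1-D 'frame': a single 1 marks the player position


def toy_next_frame(frame: Frame, action: int) -> Frame:
    """Hypothetical stand-in for a learned model: shift the player by `action`."""
    pos = frame.index(1)
    new_pos = min(max(pos + action, 0), len(frame) - 1)  # clamp at the walls
    out = [0] * len(frame)
    out[new_pos] = 1
    return out


def rollout(model: Callable[[Frame, int], Frame],
            first_frame: Frame, actions: Sequence[int]) -> List[Frame]:
    """Autoregressive loop: each predicted frame becomes the next input."""
    frames = [first_frame]
    for action in actions:
        frames.append(model(frames[-1], action))
    return frames
```

Because every output is fed back as input, any prediction error is carried into all later frames, which is one reason long-horizon consistency is hard for this family of models.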
Latent-space world models:
- Self-supervised training: predict latent representation of t+1 given t and action.
- Much cheaper to train than pixel-level video models (10× to 100× less compute).
- Standard component of model-based RL since ~2018.
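A minimal sketch of "predict the latent at t+1 given t and the action", followed by planning in that latent space. The assumptions are heavy and deliberate: the latent is a single number, the dynamics are linear and fitted by ordinary least squares, and the planner is an exhaustive search over short action sequences. None of this is how DreamerV3 or MuZero work internally (they use neural networks with actor-critic learning or tree search); it only shows the shape of the idea.

```python
import itertools
from typing import Sequence, Tuple

Transition = Tuple[float, float, float]  # (z_t, action, z_next)


def fit_linear_dynamics(data: Sequence[Transition]) -> Tuple[float, float]:
    """Least-squares fit of z_next ~= a * z + b * u via the 2x2 normal equations."""
    szz = sum(z * z for z, u, _ in data)
    szu = sum(z * u for z, u, _ in data)
    suu = sum(u * u for _, u, _ in data)
    szy = sum(z * y for z, _, y in data)
    suy = sum(u * y for _, u, y in data)
    det = szz * suu - szu * szu
    a = (szy * suu - suy * szu) / det
    b = (suy * szz - szy * szu) / det
    return a, b


def plan(a: float, b: float, z0: float, goal: float,
         actions: Sequence[float], horizon: int) -> Tuple[float, ...]:
    """Exhaustive search over action sequences, scored entirely in latent space."""
    def final_state(seq: Tuple[float, ...]) -> float:
        z = z0
        for u in seq:
            z = a * z + b * u  # imagined step; no environment or pixels involved
        return z
    return min(itertools.product(actions, repeat=horizon),
               key=lambda seq: abs(final_state(seq) - goal))
```

The planner never renders a frame: it only rolls the fitted dynamics forward, which is why this family is so much cheaper than pixel-level prediction.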
The physics-fidelity question
Are generative video models "actually" learning physics?
Evidence for:
- Sora-class models produce plausible cloth dynamics, fluid simulation, occlusion handling, soft-body physics.
- Cosmos demonstrates physically-grounded multi-second predictions of robot arms manipulating objects.
- VBench-physics sub-scores show steady improvement (~30% to ~55% on physical-realism subsets from 2023 to 2026).
Evidence against:
- Persistent failures: object permanence over occlusions, rigid-body collisions, multi-step causality (knock over a domino → all subsequent dominoes), explicit counting.
- "Physics-y" outputs often fall apart at 5+ second time horizons.
- Adversarial prompts (a glass that should shatter when dropped; an object that should fall) often fail.
- Pixel-level coherence ≠ physical understanding; the models pattern-match plausibility, they don't solve physics equations.
The synthesis: generative video models do learn statistical regularities that look like physics. They do not learn physics in the sense that a physics engine does (energy conservation, momentum, exact rigid-body dynamics). They are useful as data generators for training other models (RL, VLA) but their outputs need to be filtered or post-processed for tasks where rigor matters.
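One concrete form such filtering could take is a narrow physics check on a single quantity. The sketch below assumes that an object's height has already been tracked across frames and converted to metres (a substantial assumption for generated video), then tests whether a dropped object accelerates at roughly g. The function names and the 20% tolerance are hypothetical.

```python
from typing import List

G = 9.81  # m/s^2, standard gravity


def estimate_acceleration(heights: List[float], dt: float) -> float:
    """Mean second finite difference of tracked heights (metres, fixed frame gap dt)."""
    diffs = [(heights[i + 1] - 2 * heights[i] + heights[i - 1]) / dt ** 2
             for i in range(1, len(heights) - 1)]
    return sum(diffs) / len(diffs)


def passes_free_fall_check(heights: List[float], dt: float, tol: float = 0.2) -> bool:
    """True if the downward acceleration is within `tol` (relative) of g."""
    accel = estimate_acceleration(heights, dt)
    return abs(-accel - G) / G <= tol


# A physically correct drop from 10 m, sampled every 0.1 s.
good = [10.0 - 0.5 * G * (0.1 * k) ** 2 for k in range(6)]
# A 'plausible-looking' drop that falls at constant speed instead of accelerating.
bad = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0]
```

A clip can look entirely natural to a viewer and still fail this kind of test, which is the gap between pattern-matched plausibility and simulator-grade dynamics.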
Applications
Content creation: the biggest market by revenue. Sora, Veo, Runway, Pika serve ad agencies, YouTube creators, film pre-visualization. Adoption growing but constrained by quality (5-10s clips), price ($2-15/clip), and IP concerns.
Robotics training data: Cosmos + GR00T pipeline. World-model-generated synthetic trajectories supplement real-robot data for VLA training. Early-stage but real. See robotics ultimate guide.
AV simulation: Wayve GAIA-2, Waabi Worldwide Simulator, Tesla's internal world model. Synthetic driving scenarios for AV safety testing.
Game environments: Genie, Decart Oasis, WHAM. Generative game worlds; early commercial use.
Sim-to-real research: V-JEPA 2, Cosmos-Transfer aim to bridge sim and real for downstream policy training.
Education / training simulations: emerging — interactive medical training, mechanical-repair walkthroughs, etc.
Benchmarks
The evaluation suite for world models is fragmented:
- VBench — 16 sub-metrics for video quality (motion smoothness, subject consistency, physical realism, etc.). The most-cited general benchmark.
- VBench++ — extended VBench with more diverse evaluation prompts.
- WorldVQA (used on /leaderboard/visual) — factual / physical knowledge in video models.
- Genesis Bench — sim-grade physics fidelity evaluation.
- EvalCrafter — 17-metric video evaluation suite.
- FVD (Fréchet Video Distance) — distribution-level video quality metric; lower is better.
- VideoScore / FETV — semantic quality + temporal consistency.
- VBench-Long — extension for longer-duration video.
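To make the FVD entry concrete: the metric fits a Gaussian to features of real clips and another to features of generated clips, then measures the Fréchet distance between the two. The full formula is ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) with full covariance matrices, computed on features from a pretrained video network. The sketch below is a simplification that assumes diagonal covariances so it can run on the standard library alone; the feature vectors are made-up numbers.

```python
import math
from typing import List, Sequence, Tuple


def mean_and_var(features: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population variance of a set of feature vectors."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    var = [sum((f[j] - mean[j]) ** 2 for f in features) / n for j in range(d)]
    return mean, var


def frechet_distance_diag(real: Sequence[Sequence[float]],
                          generated: Sequence[Sequence[float]]) -> float:
    """Fréchet distance between two Gaussians, assuming diagonal covariance."""
    mu_r, var_r = mean_and_var(real)
    mu_g, var_g = mean_and_var(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term
```

Identical feature sets give a distance of zero, and shifting or rescaling the generated distribution increases it, which is why lower is better.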
For interactive world models:
- Genie Eval Suite — playability + coherence.
- WorldBench / MapBench — emerging benchmarks for generated environment quality.
For latent-space / RL world models:
- DMC suite — DeepMind control benchmark.
- Atari 100k — sample-efficient RL benchmark.
- Crafter — open-world RL benchmark.
For physical-AI world models:
- Most evaluation is downstream task performance (does the VLA trained on this data do better than baseline?) rather than direct world-model evaluation. This is the cleanest signal but expensive.
Live data: /leaderboard/visual tracks video gen + WorldVQA scores.
Compute and data requirements
Training:
- Frontier video model: ~$50-300M training cost; tens of millions of H100-hours.
- Open-weight video model (5-15B): $1-10M training cost; at typical cloud rates of a few dollars per H100-hour, that implies hundreds of thousands to a few million H100-hours.
- Latent-space world model (DreamerV3-class): $10k-1M depending on environment / data volume.
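The dollar and GPU-hour figures for the frontier tier can be cross-checked with one division. The hourly rate below is an assumption chosen for illustration; real contract prices vary widely.

```python
def gpu_hours(budget_usd: float, usd_per_gpu_hour: float) -> float:
    """GPU-hours that a training budget buys at a given hourly rate."""
    return budget_usd / usd_per_gpu_hour


# Assumed rate: $2.50 per H100-hour (illustrative only).
RATE = 2.50
low = gpu_hours(50e6, RATE)    # $50M budget
high = gpu_hours(300e6, RATE)  # $300M budget
print(f"{low / 1e6:.0f}M to {high / 1e6:.0f}M H100-hours")
# prints "20M to 120M H100-hours"
```

At that assumed rate the $50-300M range lands in the tens of millions of H100-hours, consistent with the frontier estimate.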
Inference:
- Sora-class 5-second 1080p clip: ~$2-10 retail; underlying compute ~30-180 H100-seconds.
- Open-weight HunyuanVideo / CogVideoX: serve on 4-8× H100; ~1-3 clips/min aggregate; serving cost ~$0.10-0.50/clip.
- Interactive world models (Genie, Oasis): must run at >20 FPS; large hardware demands; usually demoed on H100/B200-class.
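The relationship between accelerator time and price per clip is simple arithmetic. The sketch below converts the 30-180 H100-second estimate for a Sora-class clip into raw compute cost, assuming an illustrative rate of $2.50 per H100-hour; that rate is not a quoted price.

```python
def compute_cost_per_clip(gpu_seconds: float, usd_per_gpu_hour: float) -> float:
    """Raw accelerator cost of one clip, ignoring idle time and other overheads."""
    return gpu_seconds * usd_per_gpu_hour / 3600.0


RATE = 2.50  # assumed USD per H100-hour (illustrative only)
cheap = compute_cost_per_clip(30, RATE)       # lower bound of the estimate
expensive = compute_cost_per_clip(180, RATE)  # upper bound of the estimate
print(f"${cheap:.3f} to ${expensive:.3f} per clip")
```

Under these assumptions the raw compute comes to cents per clip, well below the retail price; the difference would have to cover utilisation losses, safety filtering, storage, and margin, none of which this sketch models.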
Data:
- Video datasets: WebVid (~10M), Panda-70M (70M), LVD-2B (LumaLabs, 2B clips), InternVid (~234M clips), various proprietary YouTube derivatives. Licensing contested.
- Physical-AI: Open X-Embodiment (2.4M trajectories), DROID (76k demos), various simulation data.
Open research questions
The questions that will determine 2027-2030:
- Do generative video models converge to true world models with more scale and better architectures, or do they remain "very good plausibility engines"?
- Is action-conditioning sufficient to make video models useful for planning, or do you need an explicit dynamics model?
- Can JEPA-style latent-space models scale to internet-scale video while preserving their planner-friendly properties?
- What is the right benchmark for world-model quality? A single number doesn't capture the full capability surface.
- Can world-model-generated data substitute for real-robot data at scale? Early evidence is positive but bounded.
- Are interactive world models a real category or a research demo that won't scale?
- How does generative-video copyright shake out? The legal landscape will significantly constrain training data.
- Will Chinese labs lead the open frontier here as they have for LLMs and image gen? Wan 2.1 and HunyuanVideo suggest yes; Cosmos and V-JEPA 2 show NVIDIA / Meta still investing heavily.
The 2026 → 2027 outlook
- Sora and Veo continue to lead closed video generation; the gap to Kling / Hailuo / Wan stays at ~3-6 months.
- Open-weight video models close to within ~3 months of closed on most metrics; HunyuanVideo / Wan / CogVideoX / Mochi successors lead.
- Cosmos and V-JEPA 2 (and successors) become standard components in robotics training pipelines.
- Genie / Decart Oasis interactive demos mature into early game / education products.
- VBench / EvalCrafter consolidate as standard evaluation. A "WorldMMLU" benchmark likely emerges by 2027.
- Latent-space world models quietly dominate RL and increasingly robotics policy training, without much public attention.
- The physics-fidelity question doesn't get definitively resolved; instead, the community develops more granular sub-evals (rigid body, fluid, multi-step causality, conservation).
- Legal / IP landscape tightens — expect major settlements or rulings in 2026-2027 that constrain training data access for generative-video models.
- Watermarking + provenance become regulatory requirements (EU AI Act Article 50 from Aug 2026; expected US executive-order successors).
Further reading
Internal:
- Robotics foundation models & VLAs: ultimate guide
- Open weights: the ultimate guide
- Multimodal serving (vision + audio)
- Synthetic data and distillation
- Vector search & embeddings
- AI coding agents: ultimate guide