NVIDIA’s SANA-WM Is Not a World Model. The Paper Knows.

By Maxim Starkweather

NVIDIA Labs dropped a paper last week titled SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer. The headline numbers are real: a 2.6B-parameter open-source model that synthesizes one-minute 720p video with precise 6-DoF camera control on a single H100, trained on roughly 213K public clips across 15 days. The paper is on arXiv and the project page is up, though the repo still says the weights are coming. The trade press has already labeled it the next entry in the “world model race.” That last move is the one worth pausing on. “World model” is now stretched across three different research programs that don’t share a problem definition, a method, or even a frontier. SANA-WM belongs to the third. It is the most diluted use of the term, and it is the one most worth being precise about.

What SANA-WM Actually Does

Read the paper without the headline and the substance comes back into focus. SANA-WM is a hybrid linear diffusion transformer, and the four architectural pillars the authors stress are concrete. A frame-wise Gated DeltaNet aggregator is paired with periodic softmax blocks for memory-efficient long-context modeling. A dual-branch camera control system combines a latent-rate UCPE branch capturing global 6-DoF trajectory with raw-frame Plücker mixing that recovers fine motion inside each VAE temporal stride. A two-stage generation pipeline pushes a stage-1 sequence through an independent refiner for quality and consistency. And a pose-annotation pipeline reconstructs metric-scale 6-DoF camera trajectories from public video, yielding the training corpus.
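
For readers who have not met Plücker coordinates: a ray with origin o and unit direction d is encoded as the six numbers (d, o × d). The sketch below is a generic, plain-Python illustration of that encoding only. The function names are invented for this example, and it makes no attempt to reproduce how SANA-WM mixes such embeddings into raw frames.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Standard 3D cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v: Vec3) -> Vec3:
    """Scale v to unit length; a zero vector has no direction."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("ray direction must be non-zero")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def plucker_ray(origin: Vec3, direction: Vec3) -> Tuple[float, ...]:
    """Return the 6-vector (d, o x d) for the ray through `origin` along `direction`."""
    d = normalize(direction)
    moment = cross(origin, d)
    return d + moment


# Hypothetical ray: camera one unit along x, looking down the z axis.
ray = plucker_ray((1.0, 0.0, 0.0), (0.0, 0.0, 2.0))
# Same line, origin slid five units along the viewing direction.
same_line = plucker_ray((1.0, 0.0, 5.0), (0.0, 0.0, 2.0))
```

The moment term o × d does not change when the origin slides along the ray, so `ray` and `same_line` come out identical. The encoding names the line itself, not one point on it, which is what makes it a convenient per-pixel description of where a camera is looking.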

The numbers in the full paper are honest. SANA-WM with refiner scores VBench Overall 80.62/81.89 on the 60-second benchmark. LingBot-World scores 81.82/81.89 but requires eight GPUs and runs at 480p. Matrix-Game 3.0 scores 78.53/78.79 at 720p. Rotation error sits at 4.50°/8.34°. Throughput is 22 videos per hour on a single H100, roughly 36x faster than the closest baseline. The few-step distilled variant runs in 34 seconds on an RTX 5090 with quantization. None of these numbers are oversold.
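
The throughput figure is easier to judge once it is turned into wall-clock time. The arithmetic below assumes the 22-videos-per-hour number refers to the one-minute clips the paper targets and that the 36x speedup is a like-for-like throughput ratio. Both are assumptions of this example, and the variable names are illustrative.

```python
# Figures quoted above; their interpretation is an assumption of this sketch.
VIDEOS_PER_HOUR = 22      # single H100
CLIP_SECONDS = 60         # assumed: each video is a one-minute clip
SPEEDUP_VS_BASELINE = 36  # assumed: ratio of videos per hour

seconds_per_video = 3600 / VIDEOS_PER_HOUR
# How many seconds of compute per second of generated video.
slowdown_vs_realtime = seconds_per_video / CLIP_SECONDS
# Throughput the closest baseline would need for the 36x claim to hold.
baseline_videos_per_hour = VIDEOS_PER_HOUR / SPEEDUP_VS_BASELINE
baseline_minutes_per_video = 60 / baseline_videos_per_hour

print(f"{seconds_per_video:.1f} s per clip")
print(f"{slowdown_vs_realtime:.2f}x slower than real time")
print(f"baseline: {baseline_minutes_per_video:.0f} min per clip")
```

Under those assumptions one clip takes about 164 seconds, roughly 2.7 times its own running length, and the implied baseline needs more than an hour and a half per clip. Fast for an offline generator, in other words, but still offline.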

Two details rarely surface in the recaps. First, the model is licensed under CC BY-NC-SA 4.0: non-commercial use only, with attribution and share-alike. Second, the SANA-WM code and weights aren’t actually out yet. The NVlabs/Sana repository on GitHub hosts the earlier SANA image and video models, but SANA-WM appears in the documentation as “coming soon” and as an unchecked item on the project’s to-do list. As of this writing, the paper is the artifact. The model is a roadmap with a publication date attached.

Now look at what the paper does not claim. Nowhere in the abstract or the design sections do the authors argue for physics-aware or causal modeling. They describe scene persistence and camera-conditioned motion through architecture rather than explicit physics simulation. They state their limitations directly: SANA-WM “remains scale-limited, lacks explicit 3D scene memory, and can drift in dynamic scenes, rare viewpoints, or longer rollouts.” That is a candid paragraph from authors who built something specific and know what it can’t do. It is also the paragraph the press summaries left out.

Long-context video with a threaded camera trajectory.

What Other People Mean By “World Model”

This is where the conflation gets expensive. Walk three meters in any direction from SANA-WM in the 2026 literature and you land on two completely different programs claiming the same name.

DeepMind’s Genie 3 announcement, released as a research preview in August 2025, opens with a definition. World models, DeepMind writes, are “AI systems that can use their understanding of the world to simulate aspects of it, enabling agents to predict both how an environment will evolve and how their actions will affect it.” Genie 3 generates 720p worlds at 24 frames per second that users navigate in real time. Consistency holds for a few minutes. Visual memory extends back about a minute. Promptable events let a user change weather, drop in objects, alter the dynamics with text. The load-bearing claim is interactivity: user inputs processed multiple times per second, action conditioning, agent prediction. SANA-WM does none of that. SANA-WM is an offline generator. Camera control is not action conditioning.

Walk in the other direction and you hit LeWorldModel, co-authored in March 2026 by Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWM is a Joint Embedding Predictive Architecture: a JEPA. It encodes observations into a shared latent space and predicts future embeddings without ever generating pixels. It trains end-to-end on raw pixel data with two loss terms, a next-embedding prediction loss and a regularizer. It uses roughly 15 million parameters and trains on a single GPU in hours. It demonstrates competitive performance on 2D and 3D control tasks, and its learned representations can detect physically implausible events. The point of LeWM, and of LeCun’s broader argument, is that pixel-level generative models spend their capacity on visual fidelity instead of on the predictive structure that matters for agents. SANA-WM is exactly the kind of pixel-generator that argument is aimed at.
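
The shape of a two-term objective like that can be shown in a few lines. This is a toy sketch in plain Python, not LeWM's code: the prediction term is an ordinary mean squared error between embeddings, and the regularizer here is a simple stand-in that penalizes embedding dimensions whose spread across a batch collapses toward zero. LeWM's actual regularizer may be entirely different, and every name below is hypothetical.

```python
from statistics import pstdev
from typing import Sequence

Batch = Sequence[Sequence[float]]  # one embedding vector per sample


def prediction_loss(predicted: Batch, target: Batch) -> float:
    """Mean squared error between predicted and actual next embeddings."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def collapse_penalty(embeddings: Batch, target_std: float = 1.0) -> float:
    """Stand-in regularizer: hinge on per-dimension spread across the batch.

    Without a term like this, an encoder can make prediction trivial by
    mapping every observation to the same point.
    """
    dims = len(embeddings[0])
    penalties = []
    for d in range(dims):
        spread = pstdev([vec[d] for vec in embeddings])
        penalties.append(max(0.0, target_std - spread))
    return sum(penalties) / dims


def jepa_style_loss(predicted: Batch, target: Batch, weight: float = 0.1) -> float:
    """Prediction term plus weighted regularizer; no pixels are reconstructed."""
    return prediction_loss(predicted, target) + weight * collapse_penalty(predicted)


collapsed = [[0.0, 0.0], [0.0, 0.0]]   # every sample mapped to one point
spread = [[1.0, 1.0], [-1.0, -1.0]]    # samples stay distinguishable
```

A collapsed encoder predicts perfectly and still pays the penalty, while well-spread embeddings that are predicted perfectly cost nothing. Nothing in the objective rewards visual fidelity, which is the contrast with a pixel generator.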

So three things are now labeled “world models” with a straight face: a long-form video diffusion model with camera control from NVIDIA, a real-time interactive environment simulator from DeepMind, and a small latent-space predictor from LeCun’s collaborators. The architectures differ. The training objectives differ. The intended downstream user differs. The metrics they report aren’t comparable to each other. MIT Technology Review noted in April that “definitions of the term ‘world model’ vary, but they all center on the ways in which intelligent systems represent the external world.” That is true in the sense that the word “vehicle” centers on the way things move. It also covers a unicycle, a 747, and a barge.

Three different problems in three different orbits under one label.

Why the Conflation Matters

The lazy version of this argument is that words don’t matter and the work speaks for itself. The lazy version is wrong, because the work doesn’t speak for itself when investors, infrastructure decisions, and competitive narrative all key off the same label.

Builders need to know what they’re picking up. A team integrating a “world model” into an agent loop typically wants a system that takes a state and an action and returns a next state. That is partially what Genie 3 offers, with real-time inputs and consequences. It is not what SANA-WM offers. SANA-WM produces a video sequence given a starting frame and a camera trajectory, with no notion of arbitrary action conditioning and no notion of the agent existing in the simulated world. The two systems are not interchangeable, and a team that buys SANA-WM expecting agent-training infrastructure is buying a different product than the label implies.
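
The mismatch is easiest to see as two interfaces. The sketch below uses hypothetical protocol and class names with toy stand-in types (floats for states, strings for frames). It describes neither Genie 3's nor SANA-WM's real API, only the shape of what an agent loop expects versus what a trajectory-conditioned generator provides.

```python
from typing import Callable, List, Protocol, Sequence, Tuple, runtime_checkable


@runtime_checkable
class StepModel(Protocol):
    """What an agent loop wants: state and action in, next state out."""

    def step(self, state: float, action: float) -> float: ...


@runtime_checkable
class TrajectoryVideoGenerator(Protocol):
    """What an offline generator offers: first frame and camera path in, clip out."""

    def generate(self, first_frame: str, camera_poses: Sequence[Tuple[float, ...]]) -> List[str]: ...


class ToyStepModel:
    def step(self, state: float, action: float) -> float:
        return state + action  # toy dynamics: the action shifts the state


class ToyVideoGenerator:
    def generate(self, first_frame: str, camera_poses: Sequence[Tuple[float, ...]]) -> List[str]:
        # The whole camera path is fixed up front; nothing reacts mid-clip.
        return [f"{first_frame}@pose{i}" for i in range(len(camera_poses))]


def rollout(model: StepModel, policy: Callable[[float], float],
            state: float, horizon: int) -> List[float]:
    """Closed loop: each action depends on the state the model just returned."""
    states = [state]
    for _ in range(horizon):
        state = model.step(state, policy(state))
        states.append(state)
    return states


trajectory = rollout(ToyStepModel(), lambda s: 1.0, 0.0, 3)
clip = ToyVideoGenerator().generate("frame0", [(0.0,), (0.1,), (0.2,)])
fits_agent_loop = isinstance(ToyVideoGenerator(), StepModel)
```

The rollout loop needs `step`, and the generator has none: its only input channel is a camera path decided before the first frame is rendered. Wrapping one to look like the other would not add the missing feedback from action to outcome.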

Funders are making different bets on each program. AMI Labs, the company LeCun co-founded after leaving Meta, is now pursuing the JEPA agenda as a funded venture. NVIDIA Labs spent 64 H100s for 15 days on SANA-WM. DeepMind released Genie 3 to a research preview cohort. These are three different programs at three different stages of distribution, with three different research bets. When the banner over all of them reads “world model race,” capital allocation gets mispriced. The investor who funds AMI for predictive latent structure and the investor who funds an NVIDIA partner for video diffusion are funding work that does not blur in the lab even if it blurs in the headline.

The most uncomfortable part of the conflation is that it makes one program — the pixel-diffusion one — look like progress on the JEPA program, which it isn’t. Genie 3 is at least adjacent: it makes an interactive system that an agent could plug into, even if it isn’t yet the predictive simulator LeCun has in mind. SANA-WM is not adjacent. Calling SANA-WM a world model is rhetorically equivalent to calling a high-end driving simulator a step toward self-driving cars. Both involve cars and graphics. Only one of them is solving the relevant problem.

The Paper Is Better Than Its Coverage

The irony of this episode is that the SANA-WM authors mostly didn’t do this to themselves. The paper is candid. The abstract says “world model” but the architectural claims are about long-context video generation with camera trajectory control. The limitations section names what the model can’t do. The benchmarks compare against other video-generation systems — LingBot-World, Matrix-Game 3.0, HY-WorldPlay — and not against Genie 3 or any JEPA predictor, because those comparisons would be apples-to-fish. The project page, the press summaries, and the social discourse around the launch did the heavy stretching.

If NVIDIA Labs had called the project SANA-Cine — efficient long-form video synthesis with cinematic camera control on a single GPU — the announcement would read as a real engineering result aimed at filmmakers, simulation researchers, and game-asset pipelines. Which is what it actually is. The model isn’t less interesting under that framing. It is more interesting, because the field gets a usable open-source video generator with strong camera conditioning instead of yet another contested claim in a race the system isn’t running in.

Three groups of researchers are doing real, different work on three real, different problems. One of those problems, predicting how the world changes when an agent acts on it, has hardly moved for all 2.6 billion parameters of video diffusion. Another, building interactive simulators users can pilot in real time, has moved a great deal at DeepMind. A third, learning compact latent predictors that capture physical structure without rendering, is being pursued by LeCun’s collaborators on a compute budget that fits on a single GPU. Treating all three as the same problem helps no one. The honest version of NVIDIA’s announcement is the one the paper already tells: this is a fast, efficient, camera-aware video generator. The next time the trade press writes the recap, it should let the paper do the talking.

AI-generated editorial illustration · TemperatureZero · May 16, 2026
