NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU
NVIDIA has introduced SANA-WM, a 2.6B-parameter open-source world model that generates minute-scale 720p video on a single GPU and is aimed at embodied AI, simulation, and robotics research.

World models, systems that synthesize realistic video sequences from an initial image and a set of actions, are rapidly becoming central to embodied AI, simulation, and robotics research. The core challenge is scaling these systems to generate minute-long, high-resolution video without requiring prohibitively large clusters for training and inference. Most competitive open-source baselines either require multi-GPU inference or sacrifice resolution to stay within compute budgets.

NVIDIA's SANA-WM directly targets these bottlenecks. Built on the SANA-Video codebase and available through the NVlabs/Sana GitHub repository, SANA-WM is a 2.6B-parameter Diffusion Transformer (DiT) trained natively for one-minute generation at 720p with metric-scale 6-DoF camera control. It supports three single-GPU inference variants: a bidirectional generator for high-quality offline synthesis, a chunk-causal autoregressive generator for sequential rollout, and a few-step distilled autoregressive generator for faster deployment. The distilled variant denoises a 60-second 720p clip in 34 seconds on a single RTX 5090 with NVFP4 quantization.

A major obstacle to generating long videos is standard softmax attention, whose memory and compute costs grow quadratically with sequence length. SANA-WM addresses this by replacing most attention blocks with frame-wise Gated DeltaNet (GDN), which processes one entire latent frame per recurrent step. Combined with an algebraic key-scaling technique and a two-stage refinement pipeline, this lets SANA-WM generate high-quality, minute-scale videos on a single GPU.

Developing SANA-WM also required a robust data annotation pipeline, which involved modifying the VIPE pose annotation engine and training on a diverse dataset of 212,975 clips with metric-scale pose annotations. The model was trained on 64 H100 GPUs over approximately 18.5 days, including a VAE pre-adaptation step and a four-stage progressive DiT training schedule.

SANA-WM's capabilities extend to camera-controlled world modeling, where it faithfully follows a continuous 6-DoF trajectory. The model uses two complementary branches operating at different temporal rates and achieves a Camera Motion Consistency (CamMC) of 0.2047 on the OmniWorld benchmark. The second-stage refiner significantly reduces long-horizon visual drift, from 3.79 to 1.17 on the Simple-Trajectory split and from 3.09 to 0.31 on the Hard-Trajectory split.

SANA-WM is open source and available through the NVlabs/Sana GitHub repository in all three inference variants: bidirectional, chunk-causal autoregressive, and distilled autoregressive. The model could broaden access to advanced video generation, enabling applications in robotics, simulation, and embodied AI.
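
To make the contrast between quadratic attention and a recurrent Gated DeltaNet layer more concrete, here is a minimal sketch of a gated delta-rule update in plain Python. It assumes the per-token recurrence commonly given for Gated DeltaNet, S_t = alpha_t * S_{t-1} * (I - beta_t * k_t * k_t^T) + beta_t * v_t * k_t^T, with unit-norm keys. The function names (`gated_delta_step`, `read_state`, `rollout`) are hypothetical, and the sketch does not reproduce SANA-WM's frame-wise grouping or its key-scaling step, neither of which the article specifies.

```python
def gated_delta_step(state, k, v, alpha, beta):
    """One gated delta-rule update of a fixed-size (d_v x d_k) state matrix.

    Assumed form (not taken from the SANA-WM code):
        S_new = alpha * S * (I - beta * k k^T) + beta * v k^T
    which is algebraically equal to
        S_new = alpha * S + beta * (v - alpha * S k) k^T
    """
    d_v, d_k = len(state), len(k)
    # Gate: decay the whole memory by alpha in [0, 1].
    decayed = [[alpha * state[i][j] for j in range(d_k)] for i in range(d_v)]
    # What the decayed memory currently returns for key k.
    predicted = [sum(decayed[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Delta rule: move the stored value for k toward v by a step of size beta.
    return [
        [decayed[i][j] + beta * (v[i] - predicted[i]) * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]


def read_state(state, q):
    """Query the memory: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in state]


def rollout(steps, d_k, d_v):
    """Process a sequence of (k, v, alpha, beta) steps with constant memory."""
    state = [[0.0] * d_k for _ in range(d_v)]
    for k, v, alpha, beta in steps:
        state = gated_delta_step(state, k, v, alpha, beta)
    return state


if __name__ == "__main__":
    # Write v=[3, 4] under key [1, 0], then overwrite it with v=[5, 6].
    steps = [
        ([1.0, 0.0], [3.0, 4.0], 1.0, 1.0),
        ([1.0, 0.0], [5.0, 6.0], 1.0, 1.0),
    ]
    final_state = rollout(steps, d_k=2, d_v=2)
    print(read_state(final_state, [1.0, 0.0]))
```

The state stays d_v x d_k however many steps are processed, whereas a softmax attention score matrix grows with the square of the sequence length. That constant-size memory is what makes a recurrent layer of this kind attractive for minute-long rollouts.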
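
The figures quoted above can be put in perspective with a little arithmetic. The sketch below derives a real-time factor, a rough training budget in GPU-hours, and the relative drift reduction from the article's numbers. The helper names are hypothetical. The GPU-hour figure assumes all 64 GPUs were busy for the full 18.5 days, which the article gives only as an approximate duration. The 34-second figure is reported for denoising, so end-to-end latency may differ.

```python
# Numbers as reported in the article.
CLIP_SECONDS = 60.0      # length of the generated clip
DENOISE_SECONDS = 34.0   # distilled variant, single RTX 5090, NVFP4
NUM_GPUS = 64            # H100 GPUs used for training
TRAINING_DAYS = 18.5     # approximate, per the article

# Long-horizon drift before and after the second-stage refiner.
DRIFT = {
    "Simple-Trajectory": (3.79, 1.17),
    "Hard-Trajectory": (3.09, 0.31),
}


def realtime_factor(clip_seconds, wall_seconds):
    """Seconds of video produced per second of wall-clock time."""
    return clip_seconds / wall_seconds


def gpu_hours(num_gpus, days):
    """Upper-bound estimate, assuming every GPU runs for the whole period."""
    return num_gpus * days * 24


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    print(f"real-time factor: {realtime_factor(CLIP_SECONDS, DENOISE_SECONDS):.2f}x")
    print(f"training budget: {gpu_hours(NUM_GPUS, TRAINING_DAYS):,.0f} GPU-hours")
    for split, (before, after) in DRIFT.items():
        print(f"{split}: drift reduced by {relative_reduction(before, after):.0%}")
```

Under these assumptions the distilled variant denoises faster than real time (about 1.76 seconds of video per second of compute), and the training run comes to roughly 28,000 H100 GPU-hours.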
Source: MarkTechPost