NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation
NVIDIA has released Cosmos 3, a family of omnimodal world models for physical AI that unifies physical reasoning, world generation, and action generation in a single open model.

NVIDIA's AI team has released Cosmos 3, a family of omnimodal world models designed for physical AI. The model combines physical reasoning, world generation, and action generation within a single open framework, targeting applications such as robotics, autonomous vehicles, and warehouse monitoring.
A key feature of Cosmos 3 is its two-tower Mixture-of-Transformers (MoT) architecture, comprising a reasoner tower and a generator tower. The reasoner tower, described by NVIDIA as the model's 'brain,' is a vision-language model (VLM) that interprets images, videos, and text through an autoregressive architecture. This enables it to understand complex physical context, including motion and object interactions.
The generator tower produces future observations and action sequences using a physics-aware, diffusion-based process conditioned on the reasoner tower's understanding. Information flows one way, from the reasoner to the generator: the reasoner can operate independently, while generation always activates both towers so that the reasoner guides the generator.
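The one-way flow between the towers can be illustrated with a toy Python sketch. Everything here is hypothetical: the class names `ReasonerTower` and `GeneratorTower`, the word-length "context," and the fixed starting sample are invented for illustration and are not NVIDIA's API or the actual Cosmos 3 implementation. The refinement loop is only a stand-in for diffusion-style denoising. The sketch shows just the call pattern: the reasoner runs alone, while the generator always invokes the reasoner first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasonerTower:
    """Toy stand-in for the autoregressive VLM tower (hypothetical)."""
    calls: int = 0

    def understand(self, prompt: str) -> List[float]:
        self.calls += 1
        # Hypothetical "context": one number per word (the word's length).
        return [float(len(word)) for word in prompt.split()]


@dataclass
class GeneratorTower:
    """Toy stand-in for the diffusion-based tower (hypothetical)."""
    reasoner: ReasonerTower
    calls: int = 0

    def generate(self, prompt: str, steps: int = 4) -> List[float]:
        self.calls += 1
        # One-way flow: the generator reads the reasoner's context,
        # and nothing is passed back to the reasoner.
        context = self.reasoner.understand(prompt)
        if not context:
            raise ValueError("prompt must contain at least one word")
        target = sum(context) / len(context)
        # Fixed placeholder for the initial noise so the sketch is deterministic.
        sample = [1.0, -1.0, 0.0]
        for _ in range(steps):
            # Each step moves the sample halfway toward the conditioned target,
            # loosely mimicking iterative refinement.
            sample = [x + 0.5 * (target - x) for x in sample]
        return sample


if __name__ == "__main__":
    reasoner = ReasonerTower()
    generator = GeneratorTower(reasoner)

    # Understanding only: the generator is never touched.
    print(reasoner.understand("robot lifts box"))
    print("generator calls:", generator.calls)

    # Generation: both towers are active.
    print(generator.generate("robot lifts box"))
    print("reasoner calls:", reasoner.calls, "generator calls:", generator.calls)
```

In this sketch the dependency is structural: `GeneratorTower` holds a reference to `ReasonerTower`, but not the other way around, which is one simple way to express a one-way conditioning path.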
NVIDIA has released three model scales: Edge, Nano, and Super, each using the two-tower MoT design. The Nano and Super models are currently available, along with task-specific variants such as Super Text2Image, Super Image2Video, and Nano-Policy-DROID. The models lead VANTAGE-Bench and the Traffic Anomaly Reasoning (TAR) leaderboard, and achieve state-of-the-art results among open-source models on R-Bench, PAI-Bench, Physics-IQ, and RoboLab.
To evaluate the generated videos, NVIDIA introduced the Cosmos Human Evaluation (HUE) framework, which assesses four dimensions across seven physical AI domains. Potential limitations include temporal inconsistency, unstable motion, and sound-video misalignment. Safety-critical control requires validation, guardrails, and system-level analysis.
The Cosmos 3 release includes open-sourced checkpoints, training scripts, deployment tools, and datasets. Users can work with the models through Diffusers and Transformers for research, and through vLLM-Omni and vLLM for serving.
Source: MarkTechPost