Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents
NVIDIA unveils Nemotron 3 Nano Omni, a cutting-edge AI model capable of processing and understanding long-context multimodal data, including documents, audio, and video.
NVIDIA has announced Nemotron 3 Nano Omni, a significant advancement in multimodal AI technology. Building on Nemotron Nano V2 VL, the new model delivers substantial visual gains and adds entirely new audio and video+audio capabilities. It also leads Qwen3-Omni, another open-weights omni model, across many domains.

Nemotron 3 Nano Omni is aimed at five classes of workloads, including long, messy, high-value documents; speech understanding; and mixed audio and visual evidence. It is built to understand complex documents such as contracts, technical papers, and reports, handling files of 100+ pages, and its strong speech understanding enables high-quality transcription across diverse audio conditions. The model is also designed to reason over mixed audio and visual inputs, such as screen recordings with narration, training videos, and meetings with slides.
Nemotron 3 Nano Omni uses a unified encoder-projector-decoder design, with a language backbone paired with vision and audio encoders. Its architecture includes 23 Mamba selective state-space layers, 23 MoE layers with 128 experts, and 6 grouped-query attention layers. This design allows the model to maintain strong reasoning performance while remaining practical for long, multimodal contexts. The model also features dynamic resolution processing for images and a dedicated Conv3D tubelet embedding path for video.
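To illustrate the video path, here is a minimal PyTorch sketch of a Conv3D tubelet embedding layer. The tubelet size, embedding dimension, and class name are illustrative assumptions, not Nemotron's actual configuration.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Illustrative Conv3D tubelet embedding: split a video clip into
    non-overlapping spatio-temporal patches and project each to a token.
    Tubelet size and embed_dim are assumed values, not the model's."""
    def __init__(self, in_channels=3, embed_dim=1024, tubelet=(2, 16, 16)):
        super().__init__()
        # A 3D convolution with kernel_size == stride yields one embedding
        # per non-overlapping (time, height, width) tubelet.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        x = self.proj(video)              # (B, D, T', H', W')
        x = x.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D) token sequence
        return x

# Example: a 16-frame 224x224 clip becomes a sequence of video tokens.
clip = torch.randn(1, 3, 16, 224, 224)
tokens = TubeletEmbedding()(clip)
print(tokens.shape)  # torch.Size([1, 1568, 1024]) -> 8 * 14 * 14 tokens
```

The kernel-equals-stride trick is what makes the patches non-overlapping, so the number of video tokens scales with clip length and resolution rather than with raw pixel count.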
One of the key features of Nemotron 3 Nano Omni is its ability to process and reason over long-context multimodal data. It can analyze and reason over long documents, perform joint audio-visual analysis, and interpret and reason about general audio. The model can also be integrated into agentic computer-use systems to reason over user intents, analyze GUI elements, and execute actions to accomplish tasks. NVIDIA has open-sourced substantial parts of the training code and introduced multi-environment text and omni training.
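As a rough sketch of how such an agentic computer-use loop could be wired up, the function below steps a model over screenshots until it reports completion. The callables, action schema, and prompt are hypothetical placeholders, not NVIDIA's released interface.

```python
import json
from typing import Callable

def run_computer_use_agent(
    user_intent: str,
    generate: Callable[[bytes, str], str],    # placeholder: multimodal model call
    capture_screenshot: Callable[[], bytes],  # placeholder: grab current GUI frame
    execute_action: Callable[[dict], None],   # placeholder: click/type/scroll
    max_steps: int = 10,
) -> bool:
    """Hypothetical agentic loop: each step the model sees the current
    screenshot plus the user's goal and returns one JSON-encoded GUI action."""
    for _ in range(max_steps):
        frame = capture_screenshot()
        reply = generate(
            frame,
            f"Goal: {user_intent}\n"
            "Look at the screen and return the next GUI action as JSON, "
            'e.g. {"action": "click", "x": 412, "y": 88} or {"action": "done"}.',
        )
        action = json.loads(reply)
        if action.get("action") == "done":
            return True
        execute_action(action)
    return False
```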
The model was trained on an enhanced dataset that emphasizes high-quality reasoning across multiple modalities, including synthetic data for complex reasoning scenarios where public datasets are limited. For example, approximately 11.4M synthetic QA pairs (~45B tokens) were generated from a large corpus of real-world PDFs using NeMo Data Designer; this dataset delivers a 2.19× improvement in overall accuracy on MMLongBench-Doc.
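For a sense of scale, the reported figures imply roughly 3,900 tokens per synthetic QA pair on average, consistent with long, document-grounded examples:

```python
# Back-of-the-envelope check on the reported synthetic-data scale.
qa_pairs = 11.4e6      # ~11.4M synthetic QA pairs
total_tokens = 45e9    # ~45B tokens in the synthetic corpus
print(f"{total_tokens / qa_pairs:,.0f} tokens per QA pair on average")
# -> 3,947 tokens per pair
```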
Nemotron 3 Nano Omni demonstrates impressive capabilities in applications such as retrieving financial metrics from a 100+ page document, performing joint audio-visual analysis, and interpreting and reasoning about general audio. It can analyze charts and figures shown in images alongside audio files to identify commonalities and discrepancies across the media. The model represents a significant advancement in multimodal AI technology, with potential applications in document understanding, video analysis, and agentic computer use.
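A minimal sketch of how such a mixed image-plus-audio query might look through the Hugging Face transformers interface is shown below. The model ID, processor arguments, and chat format are assumptions based on common multimodal model cards; consult the official model card for the exact usage.

```python
# Hypothetical usage sketch -- model ID and processor/chat details are
# assumptions, not the confirmed API for this model.
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "nvidia/Nemotron-3-Nano-Omni"  # placeholder identifier
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

# Mixed evidence: a revenue chart image plus an earnings-call audio clip.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "revenue_chart.png"},
        {"type": "audio", "audio": "earnings_call.wav"},
        {"type": "text", "text": "Compare the chart with the call: where do "
                                 "the reported figures agree or disagree?"},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```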
Source: Hugging Face