NVIDIA Unveils Nemotron 3.5 ASR: A Breakthrough in Real-Time Speech Recognition
NVIDIA's Nemotron Speech team has released Nemotron 3.5 ASR, a 600M-parameter streaming Automatic Speech Recognition model that can transcribe 40 language-locales in real time.

NVIDIA's Nemotron Speech team has released Nemotron 3.5 ASR, a streaming Automatic Speech Recognition (ASR) model with 600 million parameters. A single checkpoint transcribes 40 language-locales in real time. The model also produces punctuation and capitalization itself, which NVIDIA positions as making its output production-ready.
Nemotron 3.5 ASR uses a Cache-Aware FastConformer-RNNT architecture, an efficient evolution of the Conformer. It extends the earlier nvidia/nemotron-speech-streaming-en-0.6b model by adding prompt-based language-ID conditioning, which lets one model serve multiple languages. This removes the need for per-language models or model swapping.
Nemotron 3.5 ASR is designed for two primary workloads: low-latency streaming for live audio and high-throughput batch transcription. In both cases the output is text with proper casing and punctuation, so no separate punctuation-restoration step is needed. The model's efficiency is largely due to its "cache-aware" design: it caches encoder self-attention and convolution activations and reuses those cached states as new audio arrives, rather than recomputing them for audio it has already seen.
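The caching idea can be illustrated with a toy example. The sketch below is not NVIDIA's implementation and does not use the model or any NeMo API. It assumes a single causal convolution stands in for the encoder, and the names `causal_conv`, `StreamingCausalConv`, and `stream_in_chunks` are hypothetical. It shows only the general principle: if the layer keeps the left context it needs in a cache, processing audio chunk by chunk gives the same result as processing the whole signal at once, and no earlier audio is reprocessed.

```python
from typing import Iterable, List


def causal_conv(signal: List[float], kernel: List[float]) -> List[float]:
    """Offline reference: causal convolution over the whole signal at once.

    The signal is left-padded with zeros so each output only depends on the
    current sample and the len(kernel) - 1 samples before it.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(signal))
    ]


class StreamingCausalConv:
    """Toy cache-aware layer: remembers the last len(kernel) - 1 inputs."""

    def __init__(self, kernel: List[float]) -> None:
        self.kernel = list(kernel)
        # The cache starts as zeros, matching the offline zero padding.
        self.cache: List[float] = [0.0] * (len(kernel) - 1)

    def process(self, chunk: List[float]) -> List[float]:
        k = len(self.kernel)
        # Prepend cached left context instead of re-reading old audio.
        padded = self.cache + list(chunk)
        out = [
            sum(self.kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(chunk))
        ]
        # Keep only the most recent k - 1 inputs for the next chunk.
        if k > 1:
            self.cache = padded[-(k - 1):]
        return out


def stream_in_chunks(
    signal: List[float], kernel: List[float], chunk_sizes: Iterable[int]
) -> List[float]:
    """Feed the signal to the streaming layer in chunks of the given sizes."""
    layer = StreamingCausalConv(kernel)
    outputs: List[float] = []
    start = 0
    for size in chunk_sizes:
        outputs.extend(layer.process(signal[start:start + size]))
        start += size
    return outputs


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    kernel = [0.25, 0.25, 0.5]
    offline = causal_conv(signal, kernel)
    streamed = stream_in_chunks(signal, kernel, [3, 2, 2])
    print(offline == streamed)
```

A real cache-aware encoder caches activations at every layer, for self-attention as well as convolution, but the trade is the same: a small amount of stored state in exchange for not re-encoding past audio on every new chunk.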
The model can be fine-tuned for specific languages, domains, or accents because its weights are open: they are available on Hugging Face under the OpenMDW-1.1 license. NVIDIA has published a worked example of fine-tuning the model for Greek and Bulgarian, with a lower Word Error Rate (WER) in both cases. Greek WER fell from 35 to 24, a reported 32% relative improvement, and Bulgarian WER fell from 22 to 15, a reported 31% relative improvement.
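For readers who want to check the arithmetic, relative improvement is the drop in WER divided by the baseline WER. The short sketch below applies that formula to the rounded figures quoted above. The function name `relative_wer_reduction` is illustrative. The rounded inputs give about 31.4% for Greek and 31.8% for Bulgarian, close to the reported 32% and 31%. The small gap is presumably because the reported percentages were computed from unrounded WER values, though that is an assumption.

```python
def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Return the relative WER improvement as a percentage of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - finetuned_wer) / baseline_wer * 100.0


if __name__ == "__main__":
    # Rounded WER figures as quoted in the article.
    for language, before, after in [("Greek", 35.0, 24.0), ("Bulgarian", 22.0, 15.0)]:
        gain = relative_wer_reduction(before, after)
        print(f"{language}: {before:.0f} -> {after:.0f} ({gain:.1f}% relative)")
```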
Potential applications span a wide range of industries, from live audio transcription to voice assistants. Nemotron 3.5 ASR combines multilingual coverage, streaming efficiency, and built-in punctuation and capitalization in a single open checkpoint.
Source: MarkTechPost