AI Models

How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent

AI News Desk

Hugging Face

Jun 04, 2026

2 min read

NVIDIA releases Nemotron 3.5 ASR, a 600M-parameter speech-to-text model that transcribes 40 language-locales in real-time, with punctuation and capitalization built-in.

How to Fine-Tune Nemotron 3.5 ASR for Your Language, Domain, or Accent">

['Introducing NVIDIA Nemotron 3.5 ASR, a groundbreaking speech-to-text model that can transcribe 40 language-locales in real-time, with punctuation and capitalization built-in. This 600M-parameter model is the successor to the popular Nemotron 3 ASR model, which was released on Hugging Face and as a NIM earlier this year. Nemotron 3 ASR has been validated by independent benchmarks at Artificial Analysis, where it ranks 2nd in latency among all streaming ASR models, with just 0.07 seconds to final transcript after end of speech.', "Nemotron 3.5 ASR uses a Cache-Aware FastConformer-RNNT architecture that streams audio without redundant recomputation, allowing for low latency and high accuracy.

The model ships as open weights on Hugging Face, enabling users to inspect, fine-tune, and deploy it without API dependencies or per-call billing. The model's architecture consists of a Cache-Aware FastConformer encoder and an RNNT decoder, which work together to provide a robust and efficient transcription system.", 'The model was trained on a massive speech dataset spanning all supported languages, using a blend of public and proprietary data normalized to punctuated, properly-cased text. Nemotron 3.5 ASR provides a single multilingual model that can be deployed, customized, and fine-tuned for specific use cases.

The model can be fine-tuned for a particular language, domain, or accent, making it a highly versatile tool for developers.', "To demonstrate the model's fine-tuning capabilities, NVIDIA assembled a balanced, ~2000-hour mix of Greek and Bulgarian audio from public multilingual corpora. A straightforward full fine-tune of the streaming RNNT model resulted in significant improvements in Word Error Rate on the held-out FLEURS test set, especially for languages that started out weakest. The fine-tuned model can be deployed in the same serving path as the base model, allowing users to pick their latency/accuracy operating point at inference time.", "The full walkthrough of the fine-tuning process, including data prep scripts, training configs, and benchmark numbers, is available in the companion GitHub repo.

For production serving, a NIM release is expected later this month, providing gRPC streaming and support across various NVIDIA architectures. Whether you're building voice agents, multilingual captioning systems, or on-device speech applications, Nemotron 3.5 ASR provides a powerful tool for developers to create accurate and efficient speech-to-text systems."]

Share this article

X LinkedIn Telegram

Source: Hugging Face