Towards Speed-of-Light Text Generation with Nemotron-Labs Diffusion Language Models

Large language models (LLMs) have become the default interface for code generation, math problem solving, summarization, document understanding, and many other developer workflows. Under the hood, though, many LLMs still generate text the same way: one token at a time, with each token conditioned on the tokens that came before it. These models are called autoregressive because they feed their own outputs back in as input for the next step.
That autoregressive (AR) approach has been remarkably successful. It is stable to train, simple to serve, and responsible for much of the progress in modern language modeling. But it also creates a hard limit: every new token requires a full model pass, and every weight has to be loaded from memory before computation can start. For developers building latency-sensitive applications, running smaller batch sizes, or trying to make better use of modern GPUs, token-by-token generation can leave performance on the table, because most of the GPU’s time is spent on memory operations rather than computation.
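To make the memory-bound argument concrete, here is a rough back-of-envelope sketch. It assumes an 8B-parameter model stored in 2-byte weights and a hypothetical GPU with 2 TB/s of memory bandwidth; these numbers are illustrative assumptions, not measurements of any Nemotron-Labs Diffusion model, and the estimate ignores the KV cache, activations, and compute time. The function names are made up for this example.

```python
def min_pass_latency_s(num_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Lower bound on one forward pass at batch size 1:
    every weight must be read from memory at least once."""
    return num_params * bytes_per_param / mem_bandwidth_bytes_per_s


def max_tokens_per_s(num_params, bytes_per_param, mem_bandwidth_bytes_per_s,
                     tokens_per_pass=1.0):
    """Upper bound on throughput if each pass commits `tokens_per_pass` tokens.
    tokens_per_pass=1 models token-by-token autoregressive decoding."""
    latency = min_pass_latency_s(num_params, bytes_per_param,
                                 mem_bandwidth_bytes_per_s)
    return tokens_per_pass / latency


# Illustrative assumptions: 8e9 parameters, 2 bytes each, 2e12 bytes/s bandwidth.
PARAMS, BYTES, BANDWIDTH = 8e9, 2, 2e12

ar_bound = max_tokens_per_s(PARAMS, BYTES, BANDWIDTH)              # one token per pass
parallel_bound = max_tokens_per_s(PARAMS, BYTES, BANDWIDTH, 4.0)   # four tokens per pass
print(f"AR upper bound:       {ar_bound:.0f} tokens/s")
print(f"4-tokens-per-pass:    {parallel_bound:.0f} tokens/s")
```

Under these assumptions, a single pass cannot finish faster than about 8 ms, so one-token-per-pass decoding tops out near 125 tokens per second regardless of how much compute the GPU has. Committing several tokens per pass raises that ceiling proportionally, which is the headroom that parallel generation targets.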
Additionally, once an autoregressive model generates a token, that token is final: the model has no inherent ability to revise earlier tokens. Consequently, mistakes can propagate through the rest of the generation.
Nemotron-Labs Diffusion introduces a new path forward: diffusion language models (DLMs) that generate multiple tokens in parallel, then iteratively refine them over several steps. These models can make better use of the compute available on modern GPUs and offer significant runtime performance benefits. They can also revise generated tokens, which makes them better suited to editing existing text and to fill-in-the-middle objectives. This generate-and-refine property also offers a built-in way to control the inference budget: reducing the number of refinement steps reduces the compute these models require at runtime.
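The generate-and-refine loop and its step budget can be illustrated with a toy sketch. This is not the Nemotron-Labs Diffusion sampler: it uses a generic confidence-based unmasking scheme, and `toy_denoiser` is a hypothetical stand-in for a model forward pass that simply returns the right token with a made-up confidence score. The point is only to show how the number of refinement steps sets the number of model passes.

```python
import math

MASK = "<mask>"


def toy_denoiser(tokens, target):
    """Hypothetical stand-in for one diffusion LM forward pass.
    Returns a (prediction, confidence) pair for every position at once.
    Confidence is invented: higher when neighbors are already filled in."""
    preds = []
    for i in range(len(tokens)):
        neighbors = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        known = sum(t != MASK for t in neighbors)
        conf = 0.5 + 0.25 * known - 1e-3 * i  # small offset breaks ties
        preds.append((target[i], conf))
    return preds


def generate(target, num_steps):
    """Start fully masked, then commit the most confident predictions
    each step until nothing is masked. Returns (tokens, model_passes)."""
    n = len(target)
    tokens = [MASK] * n
    per_step = math.ceil(n / num_steps)  # tokens committed per pass
    passes = 0
    while MASK in tokens:
        preds = toy_denoiser(tokens, target)  # one parallel pass over all positions
        passes += 1
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens, passes


words = "the quick brown fox jumps over the lazy dog".split()
for steps in (9, 3, 1):
    out, passes = generate(words, num_steps=steps)
    print(steps, passes, " ".join(out))
```

In this toy, nine steps behave like one-token-per-pass decoding, while three steps fill the same nine-token sequence in three passes. A real model trades quality against the step count, and a fuller sampler could also re-mask and revise low-confidence tokens, which this sketch omits.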
The Nemotron-Labs Diffusion family includes text models at 3B, 8B, and 14B scales, all available under the commercially friendly NVIDIA Nemotron Open Model License, as well as an 8B vision-language model (VLM), available under the NVIDIA Source Code License, which grants broad research flexibility. Across the lineup, NVIDIA is releasing both base models and instruction-tuned chat variants. NVIDIA is also releasing the code for training these models through the NVIDIA Megatron Bridge framework.
Nemotron-Labs Diffusion is designed around a simple idea: autoregressive and diffusion generation should not be separate model families. They should be capabilities of the same model. The model supports three generation modes:
Source: Hugging Face