NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6× Tokens Per Forward Over Qwen3-8B

AI News Desk

MarkTechPost

May 20, 2026

NVIDIA researchers have released Nemotron-Labs-Diffusion, a language model family that unifies three decoding modes in one architecture. The model supports autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding. It is available in 3B, 8B, and 14B parameter sizes. The family includes base, instruct, and vision-language variants.

Standard AR language models generate text one token at a time, left to right. Each token depends on all previous tokens. This sequential dependency limits GPU parallelism per generation step. The result is low hardware utilization at low batch sizes, the typical setting for single-user or edge deployment.

Diffusion language models (LMs) offer a different approach. Instead of generating tokens sequentially, they denoise multiple tokens in parallel per forward pass. This enables higher throughput. The tradeoff has been accuracy: diffusion LMs have consistently lagged behind AR models on benchmarks, requiring substantially more data to reach comparable performance. A key reason is that diffusion training treats all token permutations uniformly, rather than leveraging the strong left-to-right prior inherent in natural language.

Nemotron-Labs-Diffusion is trained on a joint AR-diffusion objective. At inference time, it operates in three modes depending on the deployment context. There are no mode-specific architectural modifications: the same weights serve all three modes.

AR mode is standard left-to-right autoregressive decoding using causal attention. This mode is best suited for high-concurrency cloud serving.
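
To make "causal attention" concrete, the sketch below builds a boolean attention mask in plain Python. The helpers `block_causal_mask` and `render` and the `block_size` parameter are hypothetical names chosen for illustration, not part of any NVIDIA release. With `block_size=1` the mask is the ordinary lower-triangular causal mask of AR mode; larger values give the block-wise pattern that the article goes on to describe for diffusion mode.

```python
def block_causal_mask(seq_len: int, block_size: int = 1) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Illustrative only: position q sees every position in its own block and in
    all earlier blocks. block_size=1 gives the ordinary causal (AR) mask.
    """
    if seq_len < 0 or block_size < 1:
        raise ValueError("seq_len must be >= 0 and block_size must be >= 1")
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        # A key is visible if its block index does not exceed the query's block index.
        mask.append([(k // block_size) <= q_block for k in range(seq_len)])
    return mask


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed attention and '.' for blocked."""
    return "\n".join("".join("x" if allowed else "." for allowed in row) for row in mask)
```

For example, `render(block_causal_mask(4, block_size=2))` yields the rows `xx..`, `xx..`, `xxxx`, `xxxx`: each token sees its whole 2-token block plus everything before it.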

Diffusion mode denoises multiple tokens in parallel within a fixed-length block. The sequence is partitioned into contiguous blocks. Within each block, tokens attend bidirectionally. Across blocks, attention remains causal, so prior blocks can reuse their KV cache.

A lightweight trained sampler predicts, per masked position, whether the model’s top-1 prediction at the current denoising step is correct. Positions predicted as correct are committed in that step.
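
One denoising step of this commit rule can be sketched as follows. Everything here is a toy stand-in: the function name `commit_step`, the `MASK` placeholder, the 0.5 threshold, and the hand-supplied probabilities are assumptions for illustration, since the real sampler is a trained component whose interface the article does not specify. The fallback that commits the single most confident position when none clears the threshold is likewise an assumption, added so the loop always makes progress.

```python
MASK = None  # hypothetical placeholder for a still-masked position


def commit_step(block, top1_tokens, p_correct, threshold=0.5):
    """Commit top-1 predictions at masked positions the sampler judges correct.

    block        -- current block; MASK marks positions not yet committed
    top1_tokens  -- the model's top-1 token for every position in the block
    p_correct    -- sampler's estimated probability that each top-1 is correct
    Returns a new block and the number of tokens committed in this step.
    """
    masked = [i for i, tok in enumerate(block) if tok is MASK]
    if not masked:
        return list(block), 0
    accepted = [i for i in masked if p_correct[i] >= threshold]
    if not accepted:
        # Assumption: always commit at least one token so decoding terminates.
        accepted = [max(masked, key=lambda i: p_correct[i])]
    new_block = list(block)
    for i in accepted:
        new_block[i] = top1_tokens[i]
    return new_block, len(accepted)
```

With a fully masked 4-token block and sampler scores `[0.9, 0.8, 0.3, 0.6]`, a single call commits three positions and leaves the third masked for a later step.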

Because every position the sampler accepts is committed in the same step, the model can commit multiple tokens per forward pass.

Self-speculation mode uses the diffusion pathway to draft candidate tokens and the AR pathway to verify them, all within a single model. No auxiliary draft model or separate prediction head is required. The diffusion pathway generates a block of k candidate tokens in parallel. The AR pathway then runs a second forward pass over those candidates using causal attention and accepts the longest contiguous prefix that matches its own predictions. Each cycle produces between 1 and k+1 verified tokens.
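
The verification step can be sketched as below, assuming greedy decoding and the usual speculative-decoding convention in which the AR pass over k drafted tokens yields k+1 next-token predictions (one per draft position, plus one for the position after the last draft). Under that assumption each cycle emits the matching prefix plus one AR token, which is consistent with the 1 to k+1 range the article gives. The function name `verify_draft` and the token strings are hypothetical, and the article does not confirm that the released model verifies in exactly this way.

```python
def verify_draft(draft, ar_predictions):
    """Return the tokens verified in one self-speculation cycle.

    draft          -- k tokens proposed by the diffusion pathway
    ar_predictions -- k + 1 greedy AR predictions; ar_predictions[i] is the
                      AR choice for draft position i given draft[:i], and the
                      last entry is the AR choice after the whole draft
    """
    if len(ar_predictions) != len(draft) + 1:
        raise ValueError("expected one more AR prediction than draft tokens")
    n_match = 0
    for drafted, predicted in zip(draft, ar_predictions):
        if drafted != predicted:
            break
        n_match += 1
    # Accepted prefix, plus the AR token at the first mismatch (or the bonus
    # token after the draft when everything matched).
    return list(draft[:n_match]) + [ar_predictions[n_match]]
```

A draft that the AR pathway rejects at its first position still yields one token (the AR pathway's own choice), while a fully accepted draft of k tokens yields k+1.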

Source: MarkTechPost