NVIDIA Releases Nemotron-Labs-TwoTower: an Open-Weight Diffusion Language Model Built on a Frozen Autoregressive Nemotron-3-Nano-30B-A3B Backbone
NVIDIA has released Nemotron-Labs-TwoTower , a diffusion language model built on a pretrained autoregressive backbone.

NVIDIA has released Nemotron-Labs-TwoTower , a diffusion language model built on a pretrained autoregressive backbone. It ships as open weights under the NVIDIA Nemotron Open Model License. The release targets a throughput bottleneck in text generation.
Autoregressive (AR) models decode one token at a time. That serial process caps generation throughput. Discrete diffusion language models take another route. They generate tokens in parallel and refine them iteratively.
Most diffusion language models use one network for two jobs. It represents clean tokens and denoises corrupted ones at every step. TwoTower separates these jobs into two towers. It keeps 98.7% of the AR baseline’s aggregate benchmark quality. It also reports 2.42× higher wall-clock generation throughput.
TwoTower is a block-wise autoregressive diffusion model. It is instantiated on Nemotron-3-Nano-30B-A3B, an open-weight hybrid backbone. That backbone interleaves Mamba-2, self-attention, and mixture-of-experts (MoE) layers.
Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The released checkpoint ships both towers, roughly 60B total parameters. Active parameters per token are about 3B per tower. The MoE uses 128 routable experts, of which 6 activate, plus 2 shared experts.
Both towers start as copies of the same backbone checkpoint. Only the denoiser tower is trained. The AR context tower stays frozen. The denoiser was trained on ~2.1T tokens, a fraction of the backbone’s 25T-token pretraining.
The AR context tower runs causally over the prompt and committed tokens. It produces per-layer KV cache and final Mamba-2 states. It preserves the backbone’s autoregressive capability.
The diffusion denoiser tower refines noisy blocks. Within a block, it uses bidirectional in-block attention. It stays causal with respect to past clean blocks.
The towers connect layer-by-layer. Denoiser layer i cross-attends to context tower layer i . This layer-aligned cross-attention gives multi-scale access to the backbone’s representations. Prior approaches broadcast only the last hidden state.
Two more denoiser modifications matter. Mamba-2 layers seed their initial state from the context tower’s Mamba state. The diffusion timestep modulates each layer through adaLN-single time conditioning. That adaLN module adds only ~1.5M parameters.
Generation runs block by block. Each block starts as S [MASK] tokens. The denoiser refines it over T steps, then commits it. The context tower then processes committed tokens to update its caches.
Source: MarkTechPost