Google AI Releases DiffusionGemma, a 26B MoE Open Model Using Text Diffusion for Up to 4x Faster Generation

A Google AI team, including Google DeepMind researchers, has released DiffusionGemma, an experimental open model for text generation. It uses text diffusion instead of standard autoregressive decoding. The model ships under the permissive Apache 2.0 license. Google positions it for developers and researchers exploring speed-critical, interactive local workflows, such as in-line editing, rapid iteration, and generating non-linear text structures.
Most language models in use today are autoregressive. They generate one token at a time, left to right, and each new token depends on the tokens that came before it. DiffusionGemma works differently. It generates entire blocks of text in parallel. On dedicated GPUs, this delivers up to 4x faster generation.
DiffusionGemma is a 26B Mixture of Experts (MoE) model that activates only 3.8B parameters during inference. It is built on the Gemma 4 backbone, specifically the 26B-A4B architecture, to which Google added a diffusion head.
The model is multimodal. It processes interleaved text, image, and video inputs. It generates text outputs from those inputs. The context window is 256K tokens, and it supports 140+ languages.
Quantized, the model fits within 18GB of VRAM. That places it inside high-end consumer GPU limits. On a single NVIDIA H100, it reaches 1000+ tokens per second. On an NVIDIA GeForce RTX 5090, it reaches 700+ tokens per second.
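The article does not say which quantization format produces the 18GB figure. As a rough back-of-envelope check, the sketch below estimates memory for the weights alone at a few common bit widths. It assumes that all 26B parameters must stay resident in memory (an MoE model activates only a subset per token but typically keeps every expert loaded) and it ignores activations, caches, and runtime overhead. The function name and the chosen bit widths are illustrative, not taken from Google's documentation.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = total_params * bits_per_weight / 8
    return bytes_total / 1e9


TOTAL_PARAMS = 26e9  # total parameter count reported in the article

# Assumed bit widths for illustration; the article does not name a format.
estimates = {bits: weight_memory_gb(TOTAL_PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    print(f"{bits:>2}-bit weights: ~{gb:.0f} GB")
```

Under these assumptions, 8-bit weights alone (about 26 GB) would exceed 18GB, while 4-bit weights (about 13 GB) would leave some headroom for everything else. That is consistent with the reported figure but does not confirm the actual format used.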
Google is direct about the trade-off. DiffusionGemma prioritizes speed and parallel layout generation, and its overall output quality is lower than that of standard Gemma 4. For maximum-quality production work, Google still recommends autoregressive Gemma 4.
Text diffusion borrows its core idea from AI image generators. Those models start with visual static and refine it iteratively. DiffusionGemma applies the same pattern to text generation.
The process runs in three conceptual stages. First, the model starts with a canvas of random placeholder tokens. Second, it makes multiple passes over that canvas. It locks in high-confidence tokens and uses them as context. Third, the text converges into the final output.
Google calls the core mechanism Uniform State Diffusion. Highly confident tokens help resolve adjacent positions during denoising. The full sequence then snaps into focus over several passes.
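The three stages described above can be illustrated with a toy loop. This is a minimal sketch, not Google's implementation: `toy_denoiser` is a hypothetical stand-in for a model forward pass, it always proposes tokens from a fixed target sentence, and its confidence scores are random numbers boosted when a neighboring position is already locked. Only the control flow (random canvas, repeated passes, locking the most confident positions) mirrors the description.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]
TARGET = ["the", "cat", "sat", "on", "a", "mat"]  # what the toy "model" converges to


def toy_denoiser(canvas, locked, rng):
    """Hypothetical stand-in for one forward pass over the whole canvas.

    Returns a (token, confidence) proposal for every position at once.
    """
    proposals = []
    for i in range(len(canvas)):
        if locked[i]:
            proposals.append((canvas[i], 1.0))
            continue
        # Toy rule: confidence grows when adjacent positions are already locked.
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        support = sum(locked[j] for j in neighbours)
        confidence = rng.random() * 0.5 + 0.25 * support
        proposals.append((TARGET[i], confidence))
    return proposals


def denoise(length, tokens_per_pass, seed=0):
    rng = random.Random(seed)
    # Stage 1: start from a canvas of random placeholder tokens.
    canvas = [rng.choice(VOCAB) for _ in range(length)]
    locked = [False] * length
    passes = 0
    # Stage 2: repeated passes, locking in the most confident open positions.
    while not all(locked):
        proposals = toy_denoiser(canvas, locked, rng)
        open_positions = [i for i in range(length) if not locked[i]]
        open_positions.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in open_positions[:tokens_per_pass]:
            canvas[i] = proposals[i][0]
            locked[i] = True
        passes += 1
    # Stage 3: every position is locked, so the canvas is the final text.
    return canvas, passes


final, passes = denoise(length=len(TARGET), tokens_per_pass=2)
print(" ".join(final), f"({passes} passes)")
```

With six positions and two tokens locked per pass, the loop finishes in three passes. A real model would replace the toy scoring rule with learned per-position probabilities.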
In practice, the model denoises a 256-token canvas in parallel. It finalizes roughly 15-20 tokens per forward pass. That parallelism is what drives the throughput gains.
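A quick calculation shows what those figures imply for the number of forward passes. The sketch below assumes a constant number of tokens finalized per pass, which is a simplification of the "roughly 15-20" range quoted above; the helper name is illustrative.

```python
import math

CANVAS_TOKENS = 256  # canvas size quoted in the article


def passes_needed(canvas_tokens: int, tokens_per_pass: int) -> int:
    """Forward passes required if each pass finalizes a fixed number of tokens."""
    return math.ceil(canvas_tokens / tokens_per_pass)


autoregressive_passes = passes_needed(CANVAS_TOKENS, 1)   # one token per pass
diffusion_passes_low = passes_needed(CANVAS_TOKENS, 15)   # lower end of the range
diffusion_passes_high = passes_needed(CANVAS_TOKENS, 20)  # upper end of the range

print("autoregressive:", autoregressive_passes)
print("diffusion, 15 tokens/pass:", diffusion_passes_low)
print("diffusion, 20 tokens/pass:", diffusion_passes_high)
```

Under this assumption, a 256-token canvas takes about 13 to 18 passes instead of 256. The drop in pass count does not translate one-to-one into wall-clock speedup, because each denoising pass processes the whole canvas rather than a single new token.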
The model uses bidirectional attention during denoising. Every token on the canvas can attend to every other token. This is a sharp break from autoregressive models. Those models can only look backward at prior tokens.
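The difference between the two attention patterns is easiest to see as a mask, where entry (i, j) says whether position i may attend to position j. The following sketch builds both masks as plain Python lists; it illustrates the general concept and is not taken from the DiffusionGemma code.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Autoregressive pattern: position i sees only positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[bool]]:
    """Bidirectional pattern: every position sees every other position."""
    return [[True] * n for _ in range(n)]


def render(mask: list[list[bool]]) -> str:
    """Draw a mask with 'x' for visible and '.' for hidden positions."""
    return "\n".join(" ".join("x" if v else "." for v in row) for row in mask)


print("causal:")
print(render(causal_mask(4)))
print("bidirectional:")
print(render(bidirectional_mask(4)))
```

The causal mask is lower-triangular, so the first token sees only itself. The bidirectional mask is fully populated, which lets a confident token anywhere on the canvas inform positions both before and after it.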
Source: MarkTechPost