Google Releases DiffusionGemma, a Diffusion-Based Language Model
Google's DiffusionGemma generates 256 tokens in parallel, self-correcting as it goes, with speeds up to 4x faster than standard models on GPUs.

Releases DiffusionGemma, a Diffusion-Based Language Model">
GenAI image generators like Stable Diffusion do not draw a picture pixel by pixel from left to right. They start with noise and iteratively refine the entire image in parallel until it converges, in a process known as diffusion. For years, applying that same principle to text generation had remained out of reach at scale.
Standard language models work like a typewriter: one token at a time, left to right, with no ability to revise a committed output. That pattern works in the cloud, where batch sizes keep GPUs saturated. For local inference or low-concurrency deployments, the GPU is idle most of the time.
Google's DiffusionGemma, released this week, is an open source experimental model that applies diffusion to text generation at production scale. Built on the Gemma 4 backbone and released under the Apache 2.0 license, it is the first diffusion language model natively supported in the open source vLLM inference platform. It generates a 256-token block in parallel rather than sequentially, with every token position attending to every other.
Google says DiffusionGemma generates text up to 4x faster than standard models on GPUs. At batch size 1 on a single Nvidia H100, the FP8 version reaches 1,008 tokens per second. On H200, it hits 1,288 — roughly six times a standard autoregressive baseline, according to vLLM benchmark results published today.
Despite the speed gains, Google did not oversell the release. The company's launch post acknowledged directly that DiffusionGemma's overall output quality is lower than standard Gemma 4, adding "For applications that demand maximum quality, we recommend deploying standard Gemma 4." DiffusionGemma does not generate tokens in order. It starts with a block of 256 random placeholder tokens, effectively a blank canvas, and runs multiple refinement passes over the entire block at once.
On each pass, it evaluates every position and locks in the ones it is most confident about. Uncertain positions get randomized and reconsidered on the next pass, with the model using what it resolved in the previous round to inform the next attempt. The block converges progressively until enough positions stabilize to anchor the rest.
Two things follow from that architecture. Self-correction. An autoregressive model that commits to a wrong token is stuck with it, because subsequent tokens are already conditioned on the mistake.
DiffusionGemma can identify low-confidence positions and re-evaluate them on the next pass. Bidirectional context. Every position attends to every other position in the block simultaneously, including tokens that appear later in the sequence.
Source: VentureBeat