AI Research

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

AI News Desk

MarkTechPost

Jun 24, 2026

4 min read

Autoregressive large language models generate text one token at a time.

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

Autoregressive large language models generate text one token at a time. Each token waits for the one before it. This serial loop leaves modern GPUs underused and keeps inference slow. The cost grows worse with long Chain-of-Thought reasoning models. Their lengthy outputs make latency the dominant part of generation.

Speculative decoding is the standard fix. A small draft model proposes future tokens. The large target model verifies those tokens in parallel. Accepted tokens are kept, so the output stays lossless. But most methods, including the state-of-the-art EAGLE-3, still draft autoregressively. That serial drafting caps real-world speedups near 2–3×.

DFlash , introduced by research team from UC San Diego team (z-lab), takes a different route. It is a lightweight block diffusion model built for drafting. Instead of drafting tokens one at a time, it proposes a whole block in a single forward pass. The target model then verifies that block in parallel.

The research team reports over 6× lossless acceleration across a range of models and tasks. It reaches up to 2.5× higher speedup than EAGLE-3. On NVIDIA Blackwell, NVIDIA engineering team reports up to 15× higher throughput for gpt-oss-120b. That figure holds at the same user interactivity target.

Block diffusion models denoise a block of masked tokens at once. They blend parallel generation with autoregressive block structure. DFlash applies this idea only to the drafting stage. Verification stays with the trusted autoregressive target model.

This split matters for quality. Standalone diffusion LLMs often trail autoregressive models on accuracy. They also need many denoising steps, which slows their raw inference speed. DFlash sidesteps both problems. The draft only needs to be good enough to be accepted. The target’s parallel verification guarantees the final output distribution.

A second benefit is drafting cost. An autoregressive drafter’s cost grows linearly with the number of speculative tokens. A diffusion drafter generates all tokens in one parallel pass. So drafting latency stays largely flat as the block grows. This frees DFlash to use deeper, more expressive draft models without adding latency.

This separates DFlash from earlier diffusion-drafter work. Methods like DiffuSpec and SpecDiff-2 used massive 7B drafters, capping speedups near 3–4×. DFlash instead uses a small five-layer drafter (eight layers for Qwen3-Coder).

DFlash’s core idea is simple: the target knows best. Large autoregressive models’ hidden features encode information about multiple future tokens. DFlash extracts hidden states from several target layers. It fuses them into one compact target context feature. This feature then conditions the draft model.

Share this article

X LinkedIn Telegram

Source: MarkTechPost