DeepSeek Releases DSpark, a Speculative Decoding Framework That Accelerates DeepSeek-V4 Per-User Generation 60–85% Over MTP-1

DeepSeek released DSpark, a speculative decoding framework, with open-source checkpoints and training code. It is a serving optimization, not a new model. The checkpoints DeepSeek-V4-Pro-DSpark and DeepSeek-V4-Flash-DSpark reuse the existing V4 weights, with a draft module attached.
The DeepSeek research team also open-sourced DeepSpec, an MIT-licensed codebase for training and evaluating speculative decoding drafters. The work targets one problem: faster large-model inference in busy production serving.
Speculative decoding splits generation into two roles. A small draft model proposes a block of tokens. The full target model then verifies that block in one forward pass.
Rejection sampling accepts the longest valid prefix and appends one bonus token. Because the rule preserves the target distribution exactly, there is no quality loss. DSpark keeps this guarantee. It changes how tokens are drafted and how many get verified.
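To make the acceptance rule concrete, here is a minimal sketch of the standard speculative-sampling verification step, not DSpark's own code. The function name `verify_block` and its arguments are hypothetical. It assumes the textbook formulation: each drafted token is accepted with probability min(1, p/q), a rejected position is resampled from the normalized residual max(p - q, 0), and a fully accepted block earns one extra token drawn from the target distribution.

```python
import random

def verify_block(draft_tokens, q_dists, p_dists, rng):
    """Textbook speculative-sampling verification (illustrative, hypothetical API).

    draft_tokens: k drafted token ids
    q_dists:      k draft distributions, one per drafted position
    p_dists:      k + 1 target distributions; the last covers the extra token
    Returns the accepted prefix plus exactly one additional token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample this position from the residual max(p - q, 0),
        # then stop. This correction keeps the output distributed as p.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every drafted token was accepted: draw the bonus token from the target.
    last = p_dists[-1]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out

rng = random.Random(0)
uniform = [0.5, 0.5]
# Draft and target agree, so both drafted tokens pass and one bonus token follows.
print(verify_block([0, 1], [uniform, uniform], [uniform, uniform, uniform], rng))
```

Each call emits at least one token, so a cycle never makes negative progress even when the first drafted token is rejected.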
Per-token latency follows one equation from the paper: $L = (T_{\text{draft}} + T_{\text{verify}}) / \tau$. Here $\tau$ is the number of tokens accepted per cycle. Speedup comes from three levers only.
You can draft faster, lowering $T_{\text{draft}}$. You can draft better, raising $\tau$. Or you can verify smarter, reducing wasted $T_{\text{verify}}$. DSpark pulls all three levers at once.
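A quick arithmetic sketch shows how each lever moves the latency equation. The millisecond timings and acceptance lengths below are made-up illustrative numbers, not measurements from the paper, and `per_token_latency` is a hypothetical helper name.

```python
def per_token_latency(t_draft_ms, t_verify_ms, tau):
    """L = (T_draft + T_verify) / tau, in milliseconds per generated token."""
    return (t_draft_ms + t_verify_ms) / tau

# Hypothetical baseline: 2 ms drafting, 30 ms verification, 2 tokens per cycle.
baseline = per_token_latency(2.0, 30.0, 2.0)
faster_draft = per_token_latency(1.0, 30.0, 2.0)    # lever 1: lower T_draft
better_draft = per_token_latency(2.0, 30.0, 4.0)    # lever 2: raise tau
cheaper_verify = per_token_latency(2.0, 24.0, 2.0)  # lever 3: lower T_verify

print(baseline)        # 16.0
print(faster_draft)    # 15.5
print(better_draft)    # 8.0
print(cheaper_verify)  # 13.0
```

With these assumed numbers, verification dominates the cycle, so raising the accepted length has far more effect than shaving drafting time.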
Earlier drafters force a trade-off. Autoregressive drafters like Eagle3 condition each token on prior ones. That gives strong acceptance, but drafting cost grows with block size.
Parallel drafters like DFlash produce the whole block in one pass. Drafting stays cheap, but each position ignores its neighbors, so adjacent tokens can be drawn from different plausible continuations. The result is ‘multi-modal collision’ and rapid acceptance decay along the suffix.
DSpark splits drafting into two stages. A heavy parallel backbone, DFlash in their setup, produces base logits for every position. Then a lightweight sequential head adds a prefix-dependent bias before sampling each token.
The default sequential head is a Markov head. It only looks at the immediately preceding token. A low-rank factorization (rank 256) keeps it cheap, even with large vocabularies.
For example, once position one samples ‘of’, the head boosts ‘course’ and suppresses ‘problem’ at position two. An optional RNN head tracks the full block prefix. It adds only marginal gains, so the Markov head ships as the default.
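The two-stage idea can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the released implementation: the names `markov_bias` and `draft_block` are hypothetical, the bias is assumed to be a product of two small factors (a per-token embedding of width `rank` and a projection back to the vocabulary), and the first block position is assumed to condition on the last context token.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def markov_bias(head, prev_token):
    """Low-rank bias over the vocabulary, conditioned on the previous token only."""
    embed, project = head  # embed: vocab x rank, project: rank x vocab
    row = embed[prev_token]
    vocab_size = len(project[0])
    return [sum(row[r] * project[r][v] for r in range(len(row)))
            for v in range(vocab_size)]

def draft_block(base_logits, head, last_context_token, rng):
    """base_logits holds one logit vector per position from the parallel backbone."""
    prev, tokens = last_context_token, []
    for logits in base_logits:
        # Sequential step: shift the precomputed logits by a prefix-dependent bias.
        bias = markov_bias(head, prev)
        probs = softmax([x + b for x, b in zip(logits, bias)])
        prev = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(prev)
    return tokens

# Toy setup: 3-token vocabulary, rank 1. Token 0 strongly boosts token 1.
embed = [[1.0], [0.0], [0.0]]
project = [[0.0, 50.0, 0.0]]
base_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 50.0]]
print(draft_block(base_logits, (embed, project), 0, random.Random(0)))
```

The factorization is what keeps the head small. For a hypothetical vocabulary of 100,000 tokens, a full token-to-token bias table would need 10 billion entries, while two rank-256 factors need about 51 million.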
The payoff shows up position by position. DSpark inherits the parallel backbone’s high first-token accuracy. The sequential head then holds acceptance steady deep into the block.
Training freezes the target model and reuses its embedding and output head. A total-variation loss is the key term. Minimizing the total-variation distance between the draft and target token distributions directly maximizes the draft’s acceptance rate.
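The link between total variation and acceptance is a standard identity for a single position: a token drawn from draft distribution q passes the min(1, p/q) test with probability equal to the sum of min(p, q), which is 1 minus the total-variation distance. The small check below uses made-up three-token distributions, and the function names are illustrative.

```python
def total_variation(p, q):
    """TV(p, q) = 0.5 * sum |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def acceptance_rate(p, q):
    """Chance that a token sampled from q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

target = [0.6, 0.3, 0.1]  # hypothetical target distribution p
draft = [0.4, 0.4, 0.2]   # hypothetical draft distribution q

tv = total_variation(target, draft)
acc = acceptance_rate(target, draft)
print(round(tv, 6), round(acc, 6))  # 0.2 0.8
assert abs(acc - (1.0 - tv)) < 1e-12
```

Because the two quantities sum to one, any reduction in total variation shows up one-for-one as a higher per-position acceptance probability.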
Source: MarkTechPost