Meet EAGLE 3.1: The Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference
Speculative decoding speeds up large language model inference, and the EAGLE 3.1 upgrade tackles 'attention drift' for more reliable performance.

Speculative decoding accelerates large language model inference by having a small, fast draft model propose several tokens, which the large target model then verifies in parallel. When the proposed tokens are accepted, the system emits several tokens per target pass and inference is faster; when a token is rejected, the system falls back to the target model's own prediction. The EAGLE series (EAGLE 1, EAGLE 2, and EAGLE 3) has become one of the most widely adopted families of speculative decoding algorithms across both research and production systems.
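The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of plain chain-style speculative decoding with greedy verification, not the EAGLE implementation (which drafts from the target model's hidden states). The toy `draft` and `target` functions, and the names `speculative_step` and `generate`, are hypothetical stand-ins invented for this example.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens emitted."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposed token. A real system scores all
    #    positions in one batched forward pass; this toy version loops instead.
    emitted: List[int] = []
    for tok in proposed:
        expected = target(prefix + emitted)
        if tok != expected:
            emitted.append(expected)  # rejection: fall back to the target's token
            return emitted
        emitted.append(tok)

    # 3. Every draft token was accepted, so the target pass yields a bonus token.
    emitted.append(target(prefix + emitted))
    return emitted

def generate(prompt: List[int], draft: Model, target: Model,
             k: int, max_new_tokens: int) -> Tuple[List[int], List[int]]:
    """Generate tokens and record how many were emitted per verification step."""
    tokens = list(prompt)
    emitted_per_step: List[int] = []
    while len(tokens) - len(prompt) < max_new_tokens:
        emitted = speculative_step(tokens, draft, target, k)
        emitted_per_step.append(len(emitted))
        tokens.extend(emitted)
    return tokens[: len(prompt) + max_new_tokens], emitted_per_step

# Toy models (assumptions for illustration): the target counts upward modulo 10,
# and the draft agrees except that it guesses wrong right after token 4.
def target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10

def draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10

tokens, emitted_per_step = generate([0], draft, target, k=3, max_new_tokens=8)
```

Because every emitted token is either confirmed or supplied by the target, the output matches what the target alone would produce under greedy decoding; the draft model only changes how many target passes are needed.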
Today, the family gets a targeted reliability upgrade with the introduction of EAGLE 3.1.

While speculative decoding performs well in controlled settings, its performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts. The EAGLE team traced this fragility to a phenomenon called attention drift: as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens. In simpler terms, the drafter, a small model that predicts future tokens, starts attending to its own prior outputs instead of the original context, which reduces acceptance length and output stability.

The EAGLE team identified two underlying issues contributing to attention drift.
First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitude grows across speculation steps because the residual path is unnormalized. To address attention drift, EAGLE 3.1 makes two key architectural changes: it normalizes each target hidden state before the fully connected (FC) layer, and it feeds post-norm hidden states into the next decoding step.
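The two issues and their fixes can be illustrated with a toy calculation. The sketch below is not the EAGLE 3.1 code: it assumes an RMS-style normalization without a learned gain (the source does not specify the norm type), uses made-up hidden-state values, and replaces the drafter with a hypothetical `toy_residual_block`. It shows how per-layer normalization balances the fused input, and how carrying a post-norm state forward keeps magnitude bounded across steps.

```python
import math

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-6):
    """Rescale a vector to roughly unit RMS (learned gain omitted for brevity)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def energy_share(layers, normalize):
    """Share of the concatenated vector's squared norm owned by each layer."""
    parts = [rms_norm(h) if normalize else h for h in layers]
    energies = [sum(v * v for v in p) for p in parts]
    total = sum(energies)
    return [e / total for e in energies]

# Hypothetical low-, mid-, and high-layer target hidden states.
# The values are invented so that magnitude grows with depth.
low = [0.1, -0.2, 0.15, -0.05]
mid = [1.0, -0.8, 1.2, -0.9]
high = [12.0, -9.0, 15.0, -11.0]

# Issue 1: without normalization the high layer dominates the fused input.
raw_share = energy_share([low, mid, high], normalize=False)
# Fix 1: normalizing each hidden state before fusion balances the three layers.
norm_share = energy_share([low, mid, high], normalize=True)

def toy_residual_block(h):
    """Stand-in for a drafter block: residual path plus a branch output of 0.5 * h."""
    return [v + 0.5 * v for v in h]

def run_steps(h, steps, feed_post_norm):
    """Record the hidden-state RMS seen at the input of each speculation step."""
    magnitudes = []
    for _ in range(steps):
        magnitudes.append(rms(h))
        h = toy_residual_block(h)
        if feed_post_norm:
            h = rms_norm(h)  # Fix 2: pass the post-norm state to the next step
    return magnitudes

start = rms_norm(mid)
# Issue 2: the unnormalized residual path lets magnitude grow step after step.
unnormalized = run_steps(start, steps=5, feed_post_norm=False)
post_norm = run_steps(start, steps=5, feed_post_norm=True)
```

In this toy setup the high layer owns more than 99% of the unnormalized fused vector, while each layer owns about a third after normalization. In a real drafter the concatenated vector would then pass through the FC layer, which is omitted here.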
These changes stabilize the hidden states that the drafter receives from the target model and make the method behave more like recursively invoking the drafter across decoding steps.

EAGLE 3.1 demonstrates several improvements over EAGLE 3, including better training-time to inference-time extrapolation, stronger long-context robustness, higher resilience to chat template and system prompt variation, and more stable acceptance length across diverse serving environments. In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared with EAGLE 3. TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms, lowering training overhead and simplifying experimentation workflows.

The research team also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6, available on HuggingFace.
The model serves as an example of deploying EAGLE 3.1 with TorchSpec training and vLLM serving support on a real-world serving model. EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation, preserving backward compatibility with existing EAGLE 3 checkpoints. The team benchmarked the Kimi K2.6 EAGLE 3.1 draft model, finding that EAGLE 3.1 delivers 2.03× higher per-user output throughput at concurrency 1, with meaningful speedups as concurrency scales.
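To see why acceptance length matters for throughput, a back-of-envelope model can help. The sketch below assumes each verification step costs one target forward pass plus a fixed number of draft passes, each costing a fraction of a target pass. The function names and all numbers are illustrative assumptions, not figures from the Kimi K2.6 benchmark, and the model ignores batching effects that appear at higher concurrency.

```python
def mean_acceptance_length(tokens_per_step):
    """Average number of tokens emitted per target verification step."""
    return sum(tokens_per_step) / len(tokens_per_step)

def idealized_speedup(acceptance_length, draft_steps, draft_cost_ratio):
    """Estimated speedup over plain decoding (one token per target pass).

    Assumes one step costs 1 target pass plus `draft_steps` draft passes,
    each costing `draft_cost_ratio` of a target pass.
    """
    return acceptance_length / (1.0 + draft_steps * draft_cost_ratio)

# Hypothetical per-step counts, as might be logged during serving.
tau = mean_acceptance_length([4, 1, 4, 3])

# Hypothetical settings: 3 draft passes per step, each 5% of a target pass.
speedup = idealized_speedup(tau, draft_steps=3, draft_cost_ratio=0.05)
```

Under these assumptions the estimate scales linearly with acceptance length at a fixed draft cost, which is why a drafter that stays accurate across long contexts and varied prompts translates into faster per-user decoding.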
Source: MarkTechPost