Parallax: A Parameterized Local Linear Attention That Keeps Softmax and Adds a Learned Covariance Correction Branch
Researchers introduce Parallax, a new attention mechanism that builds on Local Linear Attention and adds a learned covariance correction branch, achieving better performance and efficiency.

['The Transformer’s attention mechanism has remained largely unchanged since its introduction in 2017. While numerous efforts have focused on replacing softmax attention with more efficient alternatives, a team of researchers from Northwestern University, Tilde Research, and University of Washington has taken a different approach. They have developed a parameterized Local Linear Attention called Parallax, which keeps softmax attention and adds a correction branch.', 'Parallax is designed to scale to large language model (LLM) pretraining and is codesigned with the Muon optimizer.
Unlike other efficiency-focused methods that aim to reduce compute, Parallax deliberately adds compute and then makes it cheaper to run on modern GPUs. The researchers prove that Parallax yields strictly smaller integrated mean squared error, providing better bias-variance tradeoffs for associative memory.', 'The key innovation in Parallax is the replacement of the per-query solve with a learned, query-like projector. This makes Parallax simpler, more efficient, and easier to implement.
Parallax reformulates LLA as softmax attention plus an additive correction, which is the KV covariance multiplied by a learned probe. The researchers also found that dropping the boundary amplification factor, setting it to zero, is necessary for stability.', 'The researchers prototyped a decode kernel in CuTeDSL on NVIDIA Hopper GPUs and profiled it against FlashAttention 2 and 3 on H200 GPUs. The results show that Parallax matches or outperforms FlashAttention across all configurations, with speedups of 1.54× in the compute-matched setting and 1.14× in the I/O-matched setting.
Parallax also achieved state-of-the-art results on synthetic tasks and LLM pretraining at 0.6B and 1.7B scales.', 'A core finding of the research is the importance of optimizer-architecture codesign. Parallax shows a significant advantage when used with the Muon optimizer, but this advantage shrinks or disappears when used with AdamW. The researchers argue that this points to the mechanism itself, rather than extra parameters or compute, as the source of Parallax’s gains.']
Source: MarkTechPost