MiniMax Releases MSA, a Sparse Attention Method for Long Contexts
MiniMax releases MSA, a sparse attention method built on Grouped Query Attention, targeting the quadratic cost of softmax attention at long context.

MiniMax has released MSA (MiniMax Sparse Attention), a sparse attention method built directly on Grouped Query Attention (GQA). It targets one bottleneck: the quadratic cost of softmax attention at long context. The MiniMax research team tested it inside a 109B-parameter Mixture-of-Experts model trained with native multimodal data. They also open-sourced an inference kernel and shipped a production model, MiniMax-M3.
MSA factors attention into two stages: an Index Branch and a Main Branch. The Index Branch decides which key-value blocks each query should read.
The Main Branch then runs exact softmax attention over only those blocks. Selection happens at block granularity, not per token. The default block size is B_k = 128 tokens. Each query keeps k = 16 blocks per GQA group, which fixes the per-query budget at k·B_k = 2,048 key-value tokens.
The two cost structures differ in how they scale with the context length N. Dense GQA attention costs O(N) per query, the full context. MSA costs O(k·B_k) per query, which stays fixed as N grows. The compute gap therefore widens as context length increases.
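The budget arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are illustrative, not from the MiniMax release, and the count covers only the key-value tokens read by the Main Branch, ignoring whatever the Index Branch spends on scoring.

```python
def msa_budget(block_size: int = 128, top_k: int = 16) -> int:
    """Key-value tokens one query reads when it keeps top_k blocks."""
    return top_k * block_size


def per_query_reduction(context_len: int, block_size: int = 128, top_k: int = 16) -> float:
    """Ratio of dense to sparse key-value tokens for a query at the end of the context.

    Assumption: a dense query at the last position reads all context_len tokens,
    while a sparse query reads at most the fixed budget.
    """
    sparse_tokens = min(context_len, msa_budget(block_size, top_k))
    return context_len / sparse_tokens


if __name__ == "__main__":
    for n in (2_048, 32_768, 1_048_576):
        print(f"N={n:>9,}  budget={msa_budget():,}  reduction={per_query_reduction(n):.0f}x")
```

With the default settings the ratio is 1x at 2,048 tokens, 16x at 32,768 tokens, and 512x at 1,048,576 tokens, which is the widening gap the paragraph above describes.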
Selection is shared inside each GQA group but independent across groups. One key-value head serves several query heads, and they share one block set. Different groups can attend to different long-range regions.
The Index Branch adds only two projection matrices to a standard GQA layer. It defines one index query head per GQA group and one shared index key head. It scores visible key tokens, then max-pools those scores to the block level.
A Top-k operator then selects the highest-scoring blocks per query and group. The local block containing the query is always included. This prevents the selector from dropping the query's immediate neighborhood.
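The selection step described above can be sketched in plain Python for a single query in a single GQA group. This is a minimal illustration, not the released kernel: the function name `select_blocks` is hypothetical, the per-token index scores are taken as given because the article does not spell out how they are computed, and the sketch assumes the always-included local block counts toward the k-block budget.

```python
from typing import List, Sequence


def select_blocks(token_scores: Sequence[float], query_pos: int,
                  block_size: int = 128, top_k: int = 16) -> List[int]:
    """Return the sorted block indices one query keeps in one GQA group.

    token_scores[j] is the index score of key token j for this query.
    """
    if not 0 <= query_pos < len(token_scores):
        raise ValueError("query_pos must index into token_scores")

    # Causal visibility: the query only sees keys at positions <= query_pos.
    visible = token_scores[: query_pos + 1]
    n_blocks = (len(visible) + block_size - 1) // block_size

    # Max-pool token scores down to one score per block.
    block_scores = [max(visible[b * block_size:(b + 1) * block_size])
                    for b in range(n_blocks)]

    # The block containing the query is always kept
    # (assumption: it uses one of the top_k slots).
    local = query_pos // block_size
    others = sorted((b for b in range(n_blocks) if b != local),
                    key=lambda b: block_scores[b], reverse=True)
    return sorted([local] + others[: top_k - 1])


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.2, 0.0, 0.3, 0.1, 0.2, 0.1, 0.5, 0.4, 0.0, 9.9]
    # The query at position 10 cannot see token 11, so its high score is ignored.
    print(select_blocks(scores, query_pos=10, block_size=4, top_k=2))  # prints [0, 2]
```

In a GQA setting, this function would run once per query and group, and every query head in the group would reuse the returned block list.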
The Main Branch gathers causally visible tokens from the selected blocks. It applies scaled dot-product softmax attention restricted to those tokens. Each query head keeps its own query projection but shares the group's block set.
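A rough sketch of the Main Branch for one query head follows, again in plain Python. The name `sparse_attention` and its argument layout are assumptions made for illustration; the real implementation is a fused inference kernel. The sketch gathers the causally visible tokens of the selected blocks and runs ordinary scaled dot-product softmax attention over just those tokens.

```python
import math
from typing import List, Sequence, Tuple


def sparse_attention(q: Sequence[float],
                     keys: Sequence[Sequence[float]],
                     values: Sequence[Sequence[float]],
                     query_pos: int,
                     blocks: Sequence[int],
                     block_size: int = 128) -> Tuple[List[float], List[int], List[float]]:
    """Exact softmax attention restricted to the selected blocks.

    Returns (output vector, gathered token indices, attention weights).
    """
    # Gather token indices from the selected blocks, cut off at the query position.
    idx = [j for b in blocks
           for j in range(b * block_size, min((b + 1) * block_size, query_pos + 1))]
    if not idx:
        raise ValueError("no visible tokens; the local block should be selected")

    # Scaled dot-product logits over the gathered tokens only.
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, keys[j])) for j in idx]

    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Weighted sum of the gathered value vectors.
    dim_v = len(values[0])
    out = [sum(p * values[j][d] for p, j in zip(probs, idx)) for d in range(dim_v)]
    return out, idx, probs


if __name__ == "__main__":
    keys = [[0.0, 0.0] for _ in range(8)]
    values = [[float(j)] for j in range(8)]
    out, idx, probs = sparse_attention([1.0, 0.0], keys, values,
                                       query_pos=6, blocks=[0, 3], block_size=2)
    print(idx, out)
```

To mimic the group-level sharing, several query heads would call this function with their own `q` but the same `keys`, `values`, and `blocks`.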
A visualization in the report shows what the learned indexer selects. Heads concentrate on the local diagonal and the first block. They reserve the rest of the budget for a few long-range stripes.
Top-k selection is non-differentiable, so the language-modeling loss cannot train the index projections. MSA solves this with a KL alignment loss, which matches the Index Branch distribution to the Main Branch attention pattern. The teacher is the group-averaged Main Branch distribution over the selected tokens.
Three mechanisms stabilize sparse training. Gradient Detach applies stop-gradient to the Index Branch input.
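As a hedged illustration of the KL alignment loss, the sketch below computes a loss value for one query in one GQA group. Several details are assumptions rather than facts from the report: the direction KL(teacher || student), the softmax normalization of the index scores over the selected tokens, and the function name `kl_alignment_loss`. Plain Python has no autodiff, so this shows only how the quantity is formed, not how gradients flow.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl_alignment_loss(index_logits: Sequence[float],
                      head_probs: Sequence[Sequence[float]],
                      eps: float = 1e-12) -> float:
    """KL(teacher || student) over the selected tokens (direction is an assumption).

    index_logits: Index Branch scores for the selected tokens.
    head_probs:   one Main Branch attention distribution per query head in the
                  group, each defined over the same selected tokens.
    """
    n_heads = len(head_probs)
    # Teacher: Main Branch attention averaged over the heads of the group.
    teacher = [sum(h[j] for h in head_probs) / n_heads
               for j in range(len(index_logits))]
    # Student: Index Branch scores normalized over the same tokens.
    student = softmax(index_logits)
    return sum(t * (math.log(t + eps) - math.log(s + eps))
               for t, s in zip(teacher, student) if t > 0.0)


if __name__ == "__main__":
    heads = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
    print(kl_alignment_loss([0.0, 0.0, 0.0], heads))  # uniform student, positive loss
```

The loss is zero when the normalized index scores already match the averaged head attention, and grows as the two distributions diverge.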
Source: MarkTechPost