MiniMax Unveils M3 Model with Groundbreaking MSA Architecture
MiniMax releases MiniMax M3, a cutting-edge model featuring the innovative MSA architecture, supporting 1M-token context, native multimodality, and agentic coding.

MiniMax officially launched MiniMax M3 on June 1, 2026. The latest model in the company's M-series, M3 introduces the MiniMax Sparse Attention (MSA) architecture, which enables a 1M-token context window. M3 also accepts image and video input natively and can operate desktop computers.
The M3 API is now live, giving developers access to the model. The MSA architecture is a key innovation in M3, addressing the limitations of standard full attention mechanisms. Traditional full attention has quadratic computational complexity: the compute cost grows with the square of the context length, so doubling the context roughly quadruples the attention cost.
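To make the quadratic scaling concrete, here is a minimal back-of-the-envelope sketch. The function name `attention_score_ops` and the head dimension of 128 are illustrative assumptions, not figures from MiniMax; the sketch counts only the multiply-adds needed to form the query-key score matrix for a single attention head.

```python
def attention_score_ops(context_len: int, head_dim: int = 128) -> int:
    """Approximate multiply-adds to build the full query-key score matrix
    for one attention head (head_dim=128 is an assumed, illustrative value)."""
    # Every token's query is compared with every token's key: n * n dot
    # products, each costing head_dim multiply-adds.
    return context_len * context_len * head_dim


baseline = attention_score_ops(1_000)
for n in (1_000, 10_000, 100_000, 1_000_000):
    ratio = attention_score_ops(n) / baseline
    print(f"{n:>9,} tokens: {ratio:,.0f}x the cost at 1,000 tokens")
```

Under these assumptions, a 1,000-fold increase in context length raises the score-matrix cost a million-fold, which is the growth that sparse attention schemes aim to avoid.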
MSA is designed to mitigate this cost by partitioning the KV cache into blocks more precisely, which MiniMax says yields higher effective context coverage than approaches such as DSA and MoBA. At the operator level, MSA employs a 'KV outer gather Q' approach, which enables more efficient memory access and computation.
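The article does not describe MSA's internals, so the following is only a generic sketch of block-selection sparse attention, not MiniMax's implementation. It assumes a single query, splits the keys and values into fixed-size blocks, ranks blocks by the query's dot product with each block's mean key (a summary chosen purely for illustration), and attends only to tokens in the top-ranked blocks. The names `block_sparse_attend`, `block_size`, and `top_k` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Standard scaled dot-product attention for a single query."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def block_sparse_attend(query, keys, values, block_size, top_k):
    """Attend only to the top_k key/value blocks (illustrative, not MSA)."""
    blocks = [(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    def block_score(span):
        lo, hi = span
        # Assumed block summary: the mean of the keys in the block.
        mean_key = [sum(k[i] for k in keys[lo:hi]) / (hi - lo)
                    for i in range(len(query))]
        return dot(query, mean_key)

    chosen = sorted(blocks, key=block_score, reverse=True)[:top_k]
    # Gather the selected token indices in their original order.
    idx = [i for lo, hi in sorted(chosen) for i in range(lo, hi)]
    out = attend(query, [keys[i] for i in idx], [values[i] for i in idx])
    return out, idx


query = [1.0, 0.0]
keys = [[float(i % 3), float(i)] for i in range(8)]
values = [[float(i)] for i in range(8)]
output, used = block_sparse_attend(query, keys, values, block_size=2, top_k=2)
print(used)
```

With `top_k` equal to the number of blocks, the sketch reduces to full attention; with a smaller `top_k`, the per-query cost depends on the number of selected tokens rather than the full context length. Production kernels operate on batched GPU tensors and differ substantially from this pure-Python illustration.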
MiniMax reports that M3's per-token compute is 1/20th that of the previous-generation M2 models at a context length of 1 million tokens. This translates to a speedup of more than 9× in the prefill stage and over 15× in the decoding stage. In multiple ablation studies, MSA matched full attention on the majority of capabilities.
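How the two stage-level speedups combine depends on how a workload splits its time between prefill and decoding. The sketch below is simple arithmetic under stated assumptions: it treats the reported 9× and 15× figures as exact (the article gives them as lower bounds at a 1M-token context), and `prefill_share`, the fraction of baseline time spent in prefill, is a hypothetical input rather than a number from MiniMax.

```python
def end_to_end_speedup(prefill_share: float,
                       prefill_speedup: float = 9.0,
                       decode_speedup: float = 15.0) -> float:
    """Overall speedup when `prefill_share` of the baseline time is prefill
    and the remainder is decoding (both shares are assumptions)."""
    decode_share = 1.0 - prefill_share
    # New runtime as a fraction of the baseline runtime.
    new_time = prefill_share / prefill_speedup + decode_share / decode_speedup
    return 1.0 / new_time


for share in (0.25, 0.5, 0.75):
    print(f"prefill share {share:.2f}: {end_to_end_speedup(share):.2f}x overall")
```

The overall figure always falls between the two stage speedups, so a prefill-heavy workload would see gains closer to 9× and a decode-heavy one closer to 15×.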
M3 also exhibits significant improvements in coding and agentic capabilities. On OmniDocBench, a multimodal document understanding benchmark, M3 scores above Gemini 3.1 Pro. Additionally, M3 achieves a 70.06% task completion rate for computer use on OSWorld-Verified (361 samples).
MiniMax has built an interactive user simulator framework for training and evaluation that simulates multi-turn developer collaboration. M3 was trained on mixed modalities from step 0, combining text, images, and video. MiniMax emphasizes that interleaved data, in which text and images are naturally intermixed, is important to model performance.
After rebuilding its entire data pipeline for interleaved formats, MiniMax scaled the training data to the order of 100 trillion tokens. MiniMax has also tested M3 on internal tasks that include reproducing a research paper, optimizing a CUDA kernel, and autonomous model training.
The model weights and technical report are scheduled for release within 10 days of launch, making M3 an open-weight model that combines frontier-level coding performance, a 1M-token context window, and native multimodal input in a single architecture.
Source: MarkTechPost