DeepSeek-V4: A Million-Token Context That Agents Can Actually Use
DeepSeek releases V4 with a 1M-token context window, competitive benchmark numbers, and an architecture designed for efficient long-context support.
DeepSeek has released V4 of its AI model, boasting a 1M-token context window that could revolutionize agentic tasks. Two MoE checkpoints are now available on the Hub: DeepSeek-V4-Pro, with 1.6T total parameters and 49B active, and DeepSeek-V4-Flash, with 284B total parameters and 13B active. Both models post competitive, though not state-of-the-art, benchmark numbers.
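For readers who want to try a checkpoint, here is a minimal loading sketch using transformers. The repo id and the custom-code flag are assumptions based on the names in the post, not confirmed details, and a model of this size needs multiple GPUs or an offloading setup in practice.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V4-Flash"  # assumed repo id; check the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the checkpoint's native dtype
    device_map="auto",       # shard across available GPUs
    trust_remote_code=True,  # DeepSeek releases often ship custom modeling code
)

messages = [{"role": "user", "content": "Summarize this repository."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
print(tokenizer.decode(model.generate(inputs, max_new_tokens=256)[0]))
```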
The real innovation lies in how DeepSeek V4 is designed for efficient long-context support, making it one of the best candidates for agentic tasks. Today, running a frontier open model as an agent often breaks in predictable ways: the model stops and needs reprompting, or the trace blows past the context budget. V4 aims to fix these known failure modes and pave the way for the community to follow.

DeepSeek V4's architecture does things differently to make long-context inference cheap. It introduces Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), which split attention into two mechanisms and interleave them across layers.
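The post does not spell out how CSA and HCA work internally, so the sketch below illustrates only the interleaving pattern it describes. Both attention modules are stand-ins under stated assumptions: CSA attends over a sparse subset of past positions, HCA over a pooled, heavily compressed summary. Causal masking and other production details are omitted.

```python
import torch
import torch.nn as nn

class CSA(nn.Module):
    """Stand-in for Compressed Sparse Attention: attend over a strided
    (sparse/compressed) subset of positions. Internals are assumed."""
    def __init__(self, d_model: int, n_heads: int, stride: int = 4):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        kv = x[:, :: self.stride, :]          # keep every stride-th position
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out

class HCA(nn.Module):
    """Stand-in for Heavily Compressed Attention: pool the sequence into a
    short summary before attending, shrinking the KV set further. Assumed."""
    def __init__(self, d_model: int, n_heads: int, pool: int = 16):
        super().__init__()
        self.pool = nn.AvgPool1d(pool)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        kv = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (B, T/pool, D)
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out

class Stack(nn.Module):
    """Interleave the two mechanisms across layers, as the post describes."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            CSA(d_model, n_heads) if i % 2 == 0 else HCA(d_model, n_heads)
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = x + layer(x)                  # residual connection
        return x

x = torch.randn(1, 1024, 64)
print(Stack()(x).shape)                       # torch.Size([1, 1024, 64])
```

Whatever the real mechanisms look like, the design goal is the same: every layer attends over far fewer keys and values than full attention would, which is where the FLOPs and KV-cache savings below come from.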
This yields large reductions in single-token inference FLOPs and KV cache size: at 1M tokens, DeepSeek-V4-Pro needs only 27% of the single-token inference FLOPs of DeepSeek-V3.2 and 10% of its KV cache memory.

The model also makes post-training and infrastructure choices that target agent use cases directly. V4 preserves reasoning content across user message boundaries when the conversation contains tool calls, allowing a coherent, cumulative chain of thought over long-horizon agent tasks, and it introduces a |DSML| special token together with an XML-based tool-call format to reduce parsing errors.
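A hedged illustration of what those two choices might look like in practice: neither the message schema nor the real DSML/XML syntax is shown in the post, so every field and tag below is an assumption.

```python
# A long-horizon agent trace. With V4, the "reasoning" fields would be kept
# across the user/tool boundaries below instead of being dropped each turn.
history = [
    {"role": "user", "content": "Find the failing test and fix it."},
    {
        "role": "assistant",
        "reasoning": "The traceback points at parser.py; read it first.",
        # Hypothetical DSML-delimited, XML-based tool call. Explicit tags make
        # call boundaries unambiguous, which is what reduces parsing errors.
        "content": (
            "|DSML|<tool_call>"
            "<name>read_file</name>"
            "<arg key='path'>src/parser.py</arg>"
            "</tool_call>|DSML|"
        ),
    },
    {"role": "tool", "name": "read_file", "content": "...file contents..."},
    {
        "role": "assistant",
        # Because earlier reasoning is preserved, this step can build on it
        # rather than re-deriving the plan from scratch.
        "reasoning": "parser.py confirms the hypothesis; patch the off-by-one.",
        "content": "The bug is an off-by-one in parser.py; applying a fix.",
    },
]
```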
On performance, DeepSeek-V4-Pro-Max achieves a 67% pass rate on an internal R&D coding benchmark, versus 47% for Sonnet 4.5 and 70% for Opus 4.5. Long-context retrieval holds up as well: MRCR 8-needle accuracy stays above 0.82 through 256K tokens and still reaches 0.59 at 1M. In total, four checkpoints are available on the Hub, supporting various reasoning modes and sampling parameters.
Source: Hugging Face