Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
JetBrains unveils Mellum2, an open Mixture-of-Experts model optimized for low-latency text-and-code workloads.

Today, JetBrains announced the release of Mellum2, an open Mixture-of-Experts model designed for low-latency text-and-code workloads. Its predecessor, Mellum, began as a code completion model; Mellum2 broadens that scope to a wider range of natural language and software engineering tasks while keeping the focus on efficient inference and deployability. Modern AI systems increasingly rely on multiple model calls for routing, retrieval, summarization, planning, validation, and tool use.
However, many of these operations are latency-sensitive and do not require the largest available model. Mellum2 targets exactly these workloads. According to the technical report released by JetBrains, Mellum2 has been evaluated on benchmarks covering code generation, reasoning, science, and math.
The results show that Mellum2 is competitive with similarly sized open models while delivering more than 2x faster inference, which makes it a strong candidate for high-throughput production workloads. Mellum2 uses a Mixture-of-Experts (MoE) architecture, which keeps total model capacity high while activating only a subset of parameters for each token.
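The idea of activating only a subset of parameters per token can be illustrated with a toy top-k router. This is a minimal sketch of generic MoE gating, not Mellum2's actual design: the article does not state the number of experts, the value of k, or how the router is built, so `NUM_EXPERTS`, `top_k`, `make_expert`, and the random router weights below are all illustrative assumptions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_weights, experts, top_k=2):
    """Send one token vector through only the top_k highest-scoring experts."""
    # Router: one logit per expert (dot product of the token with a router row).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep the indices of the top_k experts; all other experts are never run.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalise the gate weights over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, chosen


def make_expert(scale):
    """Hypothetical toy expert: multiplies the input by a constant."""
    return lambda vec: [scale * v for v in vec]


# Illustrative sizes only; not taken from the Mellum2 technical report.
random.seed(0)
DIM, NUM_EXPERTS = 4, 8
router_weights = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_forward(token, router_weights, experts, top_k=2)
print(f"active experts: {chosen} ({len(chosen)} of {NUM_EXPERTS})")
```

All eight experts count toward total capacity, but each token pays the compute cost of only two of them, which is the trade-off the article describes.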
Sparse activation makes inference more efficient and helps reduce serving costs for real-time workloads. Because Mellum2 specializes in text and code rather than multimodal tasks, it remains compact and efficient for software engineering workloads. It can also serve as a lightweight routing and orchestration model in multi-model systems, handling prompt classification, tool selection, and intermediate control-flow steps.
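A routing step of this kind might look roughly like the sketch below. It assumes a small model that returns a single label per prompt; `stub_small_model` is a keyword-based stand-in so the example runs offline, and the label names and handlers are hypothetical, not part of any Mellum2 API.

```python
from typing import Callable, Dict, Tuple


def stub_small_model(prompt: str) -> str:
    """Stand-in for a call to a small, fast classifier model.

    A real system would send the prompt to the model and parse its reply;
    this keyword check exists only to keep the sketch self-contained.
    """
    text = prompt.lower()
    if "find" in text or "where is" in text:
        return "code_search"
    if "summarize" in text or "tl;dr" in text:
        return "summarize"
    return "unsure"


def route(
    prompt: str,
    classify: Callable[[str], str],
    handlers: Dict[str, Callable[[str], str]],
    fallback: str = "escalate",
) -> Tuple[str, str]:
    """Classify the prompt, then dispatch it to the matching handler."""
    label = classify(prompt).strip()
    # Any label without a registered handler goes to the fallback route.
    if label not in handlers:
        label = fallback
    return label, handlers[label](prompt)


# Hypothetical handlers; each would wrap a real tool or model call.
handlers = {
    "code_search": lambda p: f"[search tool] {p}",
    "summarize": lambda p: f"[summarizer] {p}",
    "escalate": lambda p: f"[larger model] {p}",
}

for prompt in ("Where is the retry logic defined?", "Design a new billing architecture"):
    label, result = route(prompt, stub_small_model, handlers)
    print(label, "->", result)
```

The fallback route reflects the division of labor: the small model handles the frequent, cheap decision, and only prompts it cannot place are passed to a larger model.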
The model is also well suited to latency-sensitive retrieval pipelines, including context compression, summarization, and retrieval post-processing. It can likewise take on agent subtasks such as planning, validation, transformation, and context preparation, reducing the need to invoke larger models for intermediate operations. Because it is open and efficient to serve, Mellum2 can be deployed in self-hosted environments that involve proprietary code or internal data.
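As one illustration of retrieval post-processing, the sketch below deduplicates retrieved chunks and shortens each one to fit a shared word budget. The `summarize` callable stands for a call to a small model; `stub_summarize` simply truncates so the example runs without any model, and both names, along with the word-based budget, are assumptions made for this example.

```python
from typing import Callable, List


def compress_context(
    chunks: List[str],
    summarize: Callable[[str, int], str],
    budget_words: int,
) -> str:
    """Deduplicate retrieved chunks, then shorten each to a share of the budget."""
    # Drop empty chunks and exact duplicates while preserving retrieval order.
    unique = list(dict.fromkeys(c.strip() for c in chunks if c.strip()))
    if not unique:
        return ""
    # Split the budget evenly. Each chunk gets at least one word, so the
    # budget can be exceeded when there are more chunks than budgeted words.
    per_chunk = max(1, budget_words // len(unique))
    return "\n".join(summarize(c, per_chunk) for c in unique)


def stub_summarize(text: str, max_words: int) -> str:
    """Stand-in for a small-model summarization call: keeps the first words."""
    return " ".join(text.split()[:max_words])


retrieved = [
    "alpha beta gamma delta",
    "alpha beta gamma delta",  # duplicate hit from a second index
    "one two three four five six",
]
compressed = compress_context(retrieved, stub_summarize, budget_words=6)
print(compressed)
```

In a real pipeline the truncation stub would be replaced by a model call, and the compressed context would then be passed on to whichever model answers the user's request.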
As AI systems mature, the most effective architectures are moving away from monolithic designs. A single frontier model can be powerful, but production systems often require several specialized components working together. JetBrains envisions Mellum2 as a 'focal' model: a fast, well-scoped model optimized for high-frequency tasks inside larger AI systems.
The goal is not to replace every model in the stack but to make the stack faster, cheaper, and easier to control. Teams building AI systems for software engineering can try Mellum2 now.
Source: Hugging Face