EMO: A Pretrained Mixture of Experts Model that Emerges with Modularity
Researchers release EMO, a new mixture-of-experts model pretrained end-to-end to allow modular structure to emerge directly from data, enabling selective expert use without sacrificing performance.

Researchers at the Allen Institute for Artificial Intelligence have developed EMO, a new mixture-of-experts (MoE) model that is pretrained end-to-end so that modular structure emerges directly from data. With all experts in use, EMO is a strong general-purpose model; it also supports selective expert use with little loss in performance. The model has 1 billion active parameters and 14 billion total parameters, with 8 of its 128 experts active.
It was trained on 1 trillion tokens. Large language models are typically trained and deployed as monolithic systems, but applications often require only a subset of their capabilities. EMO addresses this by letting a small subset of experts handle a given task while retaining near-full-model performance.
Using fewer experts can reduce computational costs and memory requirements. Mixture-of-experts models seem like a natural solution, but existing MoEs still need the full set of experts to work well. EMO instead encourages modularity by routing tokens to experts based on document boundaries rather than on predefined semantic domains.
This approach allows the model to organize itself into coherent groups of experts that can be selectively used and composed. These groups correspond to semantically meaningful domains, such as "Health, Medical & Wellness", "News Reporting", and "US Politics & Elections". Users can therefore select a small subset of experts for a given task or domain while retaining near-full-model performance.
The researchers also developed an interactive visualization to showcase the clustering results. They are releasing the full EMO-trained model, a matched standard-MoE baseline trained on the same data, and the training code. This release aims to facilitate further research into emergent modularity in MoEs and the development of more modular language models that are easier to deploy, adapt, inspect, and compose.
The key to EMO's success lies in its routing mechanism, which requires all tokens within a document to choose their active experts from a shared expert pool. This constraint encourages groups of experts to specialize by domain. The researchers applied load balancing globally across many documents, which helped stabilize training.
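The shared-pool constraint can be sketched in a few lines of plain Python. This is an illustration only, not the released training code: the function names, the `pool_size` parameter, and the choice to pick the pool by averaging router scores over the document are all assumptions made for this example, since the article does not say how the pool is chosen.

```python
from typing import List, Tuple


def top_indices(scores: List[float], k: int, allowed: List[int]) -> List[int]:
    """Return the k expert indices from `allowed` with the highest scores."""
    return sorted(allowed, key=lambda i: scores[i], reverse=True)[:k]


def route_document(
    token_scores: List[List[float]], pool_size: int, top_k: int
) -> Tuple[List[int], List[List[int]]]:
    """Route every token of one document through a single shared expert pool.

    token_scores[t][e] is the router score of expert e for token t.
    Assumption (not from the article): the pool is the `pool_size` experts
    with the highest mean score across the document's tokens.
    """
    num_tokens = len(token_scores)
    num_experts = len(token_scores[0])
    doc_scores = [
        sum(tok[e] for tok in token_scores) / num_tokens
        for e in range(num_experts)
    ]
    pool = top_indices(doc_scores, pool_size, list(range(num_experts)))
    # Each token still picks its own top_k experts, but only from the pool.
    choices = [top_indices(tok, top_k, pool) for tok in token_scores]
    return pool, choices


if __name__ == "__main__":
    # Three tokens, four experts; tokens 1 and 2 would individually prefer
    # experts 1 and 3, but the document-level pool excludes them.
    scores = [
        [0.9, 0.1, 0.8, 0.0],
        [0.2, 0.95, 0.7, 0.1],
        [0.6, 0.0, 0.5, 0.9],
    ]
    pool, choices = route_document(scores, pool_size=2, top_k=1)
    print("pool:", pool, "per-token experts:", choices)
```

In the toy input, two of the three tokens are steered away from the expert they would have picked on their own, which is the pressure that pushes experts used by the same document to specialize together.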
EMO was evaluated on a range of benchmarks and was found to match the performance of a standard MoE model. When only a subset of experts was used, EMO remained robust, losing only about 1 percentage point of absolute performance across all benchmarks when keeping 25% of the experts. In contrast, the matched standard MoE degraded sharply as the expert subset got smaller.
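To make the idea of keeping a subset of experts concrete, here is a minimal sketch in plain Python. It is hypothetical throughout: `select_experts` and `route_with_subset` are names invented for this example, ranking experts by how often they were used on sample text from the target domain is an assumed selection rule, and renormalizing the routing weights with a softmax over the chosen experts is likewise an assumption rather than a description of the released model.

```python
import math
from typing import Dict, List, Sequence


def select_experts(usage_counts: Sequence[int], keep_fraction: float) -> List[int]:
    """Keep the most-used experts (assumed selection rule, for illustration)."""
    num_keep = max(1, int(len(usage_counts) * keep_fraction))
    ranked = sorted(
        range(len(usage_counts)), key=lambda e: usage_counts[e], reverse=True
    )
    return sorted(ranked[:num_keep])


def route_with_subset(
    router_logits: Sequence[float], kept_experts: Sequence[int], top_k: int
) -> Dict[int, float]:
    """Pick top_k experts among the kept ones and normalize their weights."""
    if top_k > len(kept_experts):
        raise ValueError("top_k cannot exceed the number of kept experts")
    chosen = sorted(kept_experts, key=lambda e: router_logits[e], reverse=True)[:top_k]
    # Softmax over the chosen experts only; dropped experts get no weight.
    max_logit = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - max_logit) for e in chosen]
    total = sum(exps)
    return {e: w / total for e, w in zip(chosen, exps)}


if __name__ == "__main__":
    # Keeping 25% of 128 experts leaves 32, matching the setting in the text.
    kept = select_experts(list(range(128)), keep_fraction=0.25)
    print(len(kept), "experts kept")
    logits = [0.01 * e for e in range(128)]
    print(route_with_subset(logits, kept, top_k=8))
```

Under this sketch, only the kept experts need to be loaded, which is where the memory savings mentioned earlier would come from.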
Source: Hugging Face