Miso Labs Unveils MisoTTS: A Revolutionary 8B Emotive Text-to-Speech Model with Open Weights
Miso Labs has released MisoTTS, an open-weights 8-billion-parameter text-to-speech model capable of generating expressive speech from text and audio context.

Miso Labs has unveiled MisoTTS, a text-to-speech model that uses residual vector quantization (RVQ) to produce expressive speech. With 8 billion parameters, the model generates speech from both text and audio context, which lets it respond to the speaker's tone and produce more natural-sounding dialogue.
MisoTTS is built on the Sesame CSM architecture and pairs a Llama 3.2-style backbone with a smaller audio decoder. This architecture enables the model to condition on both text and prior audio, allowing it to capture the subtleties of human speech. The text vocabulary consists of 128,256 tokens, and there are 32 audio codebooks.
The model uses Mimi as the audio tokenizer, with a maximum sequence length of 2,048.
One of the key challenges in text-to-speech synthesis is vocabulary size. Standard transformers rely on a fixed vocabulary of discrete tokens, which is limiting for something as varied as human speech. Miso Labs argues that expanding the audio vocabulary is essential, but in a standard transformer a larger vocabulary typically requires more parameters. MisoTTS addresses this with RVQ, which lets the model emit a vector of indices, one per codebook, rather than a single token index.
Miso Labs claims a latency of 110 ms. For comparison, the figures given for ElevenLabs and Sesame are 700 ms and 300 ms, respectively. MisoTTS reportedly reaches this latency without sacrificing speech quality.
The model is available with open weights, and users can set it up using uv with Python 3.10.
The weights can be downloaded from Hugging Face, and generated audio is watermarked by default via SilentCipher. Notably, one-shot voice cloning is possible with a clip of about 10 seconds.
Potential applications include voice assistants, audiobooks, and customer service. With open weights, a low claimed latency, and speech that adapts to audio context, MisoTTS is a notable release in text-to-speech research.
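For readers unfamiliar with RVQ, the idea of emitting a vector of indices instead of a single token can be illustrated with a toy example. The sketch below is not MisoTTS or Mimi code: the three codebooks, their two-dimensional entries, and the function names are all made up for illustration. It only shows the general technique, in which each codebook quantizes the residual left over by the previous one.

```python
# Toy residual vector quantization (RVQ) in pure Python.
# All codebooks and values are hypothetical; this is not MisoTTS or Mimi code.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next codebook refines what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruct the vector by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Three hypothetical codebooks, coarse to fine, each with 4 two-dimensional entries.
CODEBOOKS = [
    [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]],
    [[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]],
    [[-0.25, -0.25], [-0.25, 0.25], [0.25, -0.25], [0.25, 0.25]],
]

target = [0.8, -1.6]
codes = rvq_encode(CODEBOOKS, target)   # one index per codebook
approx = rvq_decode(CODEBOOKS, codes)   # sum of the chosen entries
print(codes, approx)                    # prints [2, 0, 2] [0.75, -1.75]
```

In this toy setup, 3 codebooks of 4 entries each store only 12 vectors yet address 4^3 = 64 distinct combinations. That is the sense in which a vector of indices sidesteps the need for one very large output vocabulary. MisoTTS is described as using 32 audio codebooks; the sizes here are chosen only to keep the example readable.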
Source: MarkTechPost