Stability AI Releases Stable Audio 3: A Family of Fast Latent Diffusion Models for Audio Generation and Editing
Stability AI has released open weights for Stable Audio 3, a family of fast latent diffusion models for audio generation and editing, along with a technical research paper.

Stability AI has unveiled Stable Audio 3, a groundbreaking family of latent diffusion models designed for audio generation and editing. The models can produce high-quality stereo audio at 44.1 kHz and support variable-length outputs, inpainting-based editing, and fast inference. The open weights for Stable Audio 3 have been released, accompanied by a comprehensive technical research paper that delves into the model's architecture and capabilities.

Stable Audio 3 comprises three model scales: small, medium, and large. These models are built on a latent diffusion framework that generates audio by progressively removing noise from a compressed representation of audio, known as a latent. The models differ in capacity and maximum generation length, with the small and medium models available on Hugging Face and the large model accessible under an enterprise license.

A key component of Stable Audio 3 is the Semantically-Aligned Music autoEncoder (SAME), which converts stereo 44.1 kHz audio into a compact latent representation and back. SAME achieves a high 4096× compression ratio through a two-stage process involving patching and a Transformer Resampling Block. This enables efficient long-form generation on consumer hardware. The diffusion transformer, the other main component, generates latent sequences conditioned on text, duration, and inpainting masks.

The training process for Stable Audio 3 involves a multi-stage approach, including flow matching pre-training, distillation warmup, and adversarial post-training. This process allows the model to generate high-quality audio in a single forward pass without requiring classifier-free guidance at inference. The model's performance is highlighted by its ability to generate 20 seconds of audio in approximately 0.62 seconds on an H200 GPU and 380 seconds of audio in 1.31 seconds on the same hardware.

Evaluations of Stable Audio 3 have shown impressive results across various metrics, including Fréchet Audio Distance (FAD) and CLAP score. For instrumental music generation, the large model achieves FAD 0.101 and CLAP 0.393, while the medium model achieves FAD 0.107 and CLAP 0.390. The model also excels in sound effects generation and audio editing tasks, demonstrating its versatility and effectiveness in handling different types of audio content.
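
To put the reported figures in perspective, the short sketch below works through the arithmetic behind the 4096× compression ratio and the H200 timings. It is an illustration only, not code from Stability AI: it assumes the 4096× figure can be read as a temporal downsampling factor applied to the 44.1 kHz sample rate (the article does not spell out how the ratio is measured), and the helper names `latent_frames` and `realtime_factor` are made up for this example.

```python
import math

SAMPLE_RATE_HZ = 44_100    # output sample rate reported in the article
COMPRESSION_RATIO = 4096   # reported SAME ratio; ASSUMED here to be temporal downsampling


def latent_frames(duration_s: float) -> int:
    """Latent frames needed to cover `duration_s` seconds of audio, rounded up."""
    return math.ceil(duration_s * SAMPLE_RATE_HZ / COMPRESSION_RATIO)


def realtime_factor(audio_s: float, wall_s: float) -> float:
    """Seconds of audio generated per second of wall-clock time."""
    return audio_s / wall_s


if __name__ == "__main__":
    # (audio duration in seconds, reported generation time in seconds on an H200)
    for audio_s, wall_s in [(20, 0.62), (380, 1.31)]:
        print(
            f"{audio_s:>3} s of audio: ~{latent_frames(audio_s)} latent frames, "
            f"~{realtime_factor(audio_s, wall_s):.0f}x faster than real time"
        )
```

Under that assumption, the latent sequence runs at roughly 10.8 frames per second, so a 380-second clip corresponds to about 4,092 latent frames. That short sequence length is one plausible reason long-form generation stays cheap. The reported timings work out to roughly 32× real time for the 20-second case and roughly 290× for the 380-second case.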
Source: MarkTechPost