smol-audio: A Colab-Friendly Notebook Collection for Fine-Tuning Whisper, Parakeet, Voxtral, Granite Speech, and Audio Flamingo 3
A new repository, smol-audio, provides a collection of Jupyter notebooks designed to simplify the fine-tuning of various audio AI models, including Whisper, Parakeet, and Audio Flamingo 3.

Audio AI has had a breakout year, with significant advancements in automatic speech recognition, audio understanding, and text-to-speech technology. Models like OpenAI's Whisper, NVIDIA's Parakeet, and Mistral's Voxtral have dramatically improved speech recognition capabilities. Meanwhile, NVIDIA's Audio Flamingo 3 and Meta's Perception Encoder Audiovisual (PE-AV) have pushed the boundaries of audio understanding and multimodal processing.
However, a major challenge remains: the practical knowledge required to work with these models is scattered across GitHub issues, research blogs, and private notebooks. smol-audio addresses this gap with a set of self-contained Jupyter notebooks designed to make fine-tuning and adapting these models more accessible. Released under the Apache-2.0 license by the Deep-unlearning team, the repository is organized as a flat collection of notebooks, each focused on a single practical audio AI task.
These notebooks are designed to be opened directly in Google Colab, require no local GPU setup, and are built entirely on the Hugging Face ecosystem. The repository's design is deliberate, exposing every step of the process to provide transparency and facilitate learning. The repository currently features notebooks for fine-tuning various models, including Whisper, Parakeet, Voxtral, Granite Speech, and Audio Flamingo 3.
Each notebook provides a worked example of adapting one of these models to a specific task, such as ASR fine-tuning, dialogue-style text-to-speech, or audio captioning. For instance, the Whisper notebook demonstrates fine-tuning with the transformers and datasets libraries, while the Parakeet notebook covers both full fine-tuning and LoRA (Low-Rank Adaptation). The repository also includes notebooks for more advanced models, such as Meta's PE-AV, which enables zero-shot video classification and audio-text retrieval.
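The LoRA technique mentioned for Parakeet freezes the pretrained weights and learns only a small low-rank correction on top of them. In practice the notebooks would rely on Hugging Face tooling such as the peft library, but the core idea fits in a few lines of NumPy. The sketch below is illustrative only, not code from the repository:

```python
import numpy as np

# LoRA in a nutshell: instead of updating a frozen weight matrix
# W (d_out x d_in), learn a low-rank correction B @ A with rank
# r << min(d_out, d_in). Real fine-tuning would apply this to a
# model's attention projection layers via the peft library.

rng = np.random.default_rng(0)
d_out, d_in, r = 512, 512, 8

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, rank r
B = np.zeros((d_out, r))                    # trainable, initialized to zero

def forward(x, alpha=16):
    # Base output plus scaled low-rank update: y = Wx + (alpha/r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = forward(x)

# Because B starts at zero, the adapted model is initially identical
# to the frozen base model:
assert np.allclose(y, W @ x)

full_params = W.size            # 262,144 if we fine-tuned W directly
lora_params = A.size + B.size   # 8,192 trainable parameters instead
print(f"trainable fraction: {lora_params / full_params:.3%}")
```

The payoff is the last two lines: only about 3% of the parameters are trainable for this layer, which is what makes LoRA fine-tuning feasible on a free Colab GPU.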
The repository's focus on transparency and education makes it a useful resource for ML engineers and researchers working with these audio AI models. By collecting practical, easy-to-follow notebooks in one place, smol-audio lowers the barrier to entry for audio AI and can help accelerate progress in the field and enable new applications.
Source: MarkTechPost