Meet OmniVoice Studio: A Local, Open-Source Alternative to ElevenLabs
OmniVoice Studio offers a local, open-source alternative to ElevenLabs' voice AI services, which charge between $5 and $330 per month.

For anyone seeking a more affordable and transparent option for voice AI tasks, OmniVoice Studio is a promising alternative to ElevenLabs. ElevenLabs charges between $5 and $330 per month, and every audio file it processes passes through the company's cloud servers. OmniVoice Studio, by contrast, is an open-source desktop application that performs a range of voice AI tasks locally, with no cloud processing required.
OmniVoice Studio handles voice cloning, video dubbing, real-time dictation, vocal isolation, and speaker diarization, among the six distinct capabilities the desktop application bundles. Voice cloning needs only a 3-second audio clip: the approach is zero-shot, so it can clone voices the model was never trained on.
This is made possible by conditioning a diffusion-based TTS (text-to-speech) model on the short reference audio. The underlying model, OmniVoice from k2-fsa, supports more than 600 languages. Beyond voice cloning, OmniVoice Studio offers a voice design feature that builds new voices from parameters such as gender, age, accent, pitch, speed, emotion, and dialect, without cloning an existing voice.
The video dubbing feature takes a YouTube URL or a local video file, transcribes it with WhisperX, translates the transcript, synthesizes new audio with the TTS engine, and exports an MP4, all within a local pipeline. The dictation widget is a system-wide floating overlay that streams transcription over WebSocket and auto-pastes the result into the focused application. A Batch Queue processes up to 50 videos at once, with a progress bar for each job.
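The dubbing steps and the batch limit described above can be sketched as a small staged pipeline. This is only an illustration of the flow, not OmniVoice Studio's actual code: the names `DubbingJob`, `STAGES`, `run_job`, and `run_batch` are hypothetical, and the stage handlers are placeholders where the real application would call WhisperX, a translator, the TTS engine, and an MP4 exporter.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage names mirror the steps named in the article; the identifiers are hypothetical.
STAGES = ["transcribe", "translate", "synthesize", "export_mp4"]
MAX_BATCH_SIZE = 50  # batch limit stated in the article


@dataclass
class DubbingJob:
    source: str  # YouTube URL or local video path
    target_language: str
    completed_stages: List[str] = field(default_factory=list)

    @property
    def progress(self) -> float:
        """Fraction of stages finished, suitable for a per-job progress bar."""
        return len(self.completed_stages) / len(STAGES)


StageHandler = Callable[[DubbingJob], None]


def run_job(job: DubbingJob, handlers: Dict[str, StageHandler]) -> DubbingJob:
    """Run every stage in order, recording each one as it completes."""
    for name in STAGES:
        handlers[name](job)  # placeholder for the real stage implementation
        job.completed_stages.append(name)
    return job


def run_batch(jobs: List[DubbingJob], handlers: Dict[str, StageHandler]) -> List[DubbingJob]:
    """Run a batch sequentially, rejecting batches above the stated limit."""
    if len(jobs) > MAX_BATCH_SIZE:
        raise ValueError(f"batch of {len(jobs)} exceeds limit of {MAX_BATCH_SIZE}")
    return [run_job(job, handlers) for job in jobs]
```

Keeping the stages in a single ordered list means the progress fraction and the execution order cannot drift apart, which is one simple way to drive a per-job progress bar.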
Under the hood, a React frontend communicates with a FastAPI backend that exposes 97 API endpoints, uses Server-Sent Events (SSE) for streaming updates, and stores data in SQLite. The project relies on four core ML libraries and is built with Tauri, a Rust-based framework for cross-platform native apps. The codebase consists of 56% Python, 23.6% JavaScript, 11% CSS, 3.4% Shell, 3.3% Rust, and 2.6% TypeScript.
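To make the SSE streaming concrete, here is a minimal sketch of how progress updates could be framed as Server-Sent Events using only the Python standard library. The event names (`progress`, `done`), the payload fields, and the function names are assumptions for illustration; the article does not describe the project's actual event schema.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: dict, event: Optional[str] = None) -> str:
    """Encode one message in the SSE wire format.

    Each frame is an optional "event:" line, one or more "data:" lines,
    and a blank line that terminates the message.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps without indent yields a single line, but splitting keeps
    # the function correct if a multi-line payload is ever passed through.
    for line in json.dumps(data).splitlines():
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


def progress_stream(job_id: str, percents: Iterable[int]) -> Iterator[str]:
    """Yield one SSE frame per progress update, then a final 'done' frame."""
    for percent in percents:
        yield format_sse({"job_id": job_id, "percent": percent}, event="progress")
    yield format_sse({"job_id": job_id}, event="done")
```

In a FastAPI backend, a generator like this would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`; whether OmniVoice Studio does it this way is not stated in the source.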
For GPU support, the backend auto-detects CUDA (NVIDIA), MPS (Apple Silicon Metal), and ROCm (AMD), and automatically offloads transcription to the CPU on systems with 8 GB of VRAM or less. OmniVoice Studio also has a pluggable multi-engine TTS backend, so users can switch engines without changing the rest of the workflow. It ships with six built-in engines: OmniVoice (default, 600+ languages), CosyVoice 3 (9 languages plus 18 dialects), MLX-Audio (Apple Silicon-only), VoxCPM2 (30 languages), MOSS-TTS-Nano (20 languages), and KittenTTS (English-only, CPU-only).
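The device selection and low-VRAM offloading rule can be illustrated with a small decision function. This is a sketch under stated assumptions: the `Hardware` record, the function names, and the CUDA, ROCm, MPS priority order are hypothetical, and only the three backends and the 8 GB threshold come from the article. A real backend would fill in the flags by querying its ML framework.

```python
from dataclasses import dataclass
from typing import Optional

LOW_VRAM_THRESHOLD_GB = 8.0  # threshold stated in the article


@dataclass(frozen=True)
class Hardware:
    """Detected capabilities; hypothetical record for illustration."""
    has_cuda: bool = False
    has_mps: bool = False
    has_rocm: bool = False
    vram_gb: Optional[float] = None  # None when VRAM size is unknown


def pick_accelerator(hw: Hardware) -> str:
    """Pick the accelerator for general inference (priority order is assumed)."""
    if hw.has_cuda:
        return "cuda"
    if hw.has_rocm:
        return "rocm"
    if hw.has_mps:
        return "mps"
    return "cpu"


def pick_transcription_device(hw: Hardware) -> str:
    """Offload transcription to the CPU when VRAM is 8 GB or less."""
    accelerator = pick_accelerator(hw)
    if accelerator == "cpu":
        return "cpu"
    if hw.vram_gb is not None and hw.vram_gb <= LOW_VRAM_THRESHOLD_GB:
        return "cpu"
    return accelerator
```

For example, `pick_transcription_device(Hardware(has_cuda=True, vram_gb=8))` returns `"cpu"`, while the same call with `vram_gb=24` returns `"cuda"`. Treating an unknown VRAM size as "do not offload" is a design choice made here, not something the source specifies.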
Source: MarkTechPost