OpenAI Unveils Three Realtime Audio Models for Advanced Voice Applications
OpenAI has released three new audio models through its Realtime API, enabling developers to build more sophisticated voice applications with capabilities like live speech translation and streaming transcription.

OpenAI has taken a significant step forward in voice technology with the release of three new audio models through its Realtime API. These models are designed to enhance live voice applications, allowing for more natural and interactive conversations. The flagship model, GPT-Realtime-2, boasts GPT-5-class reasoning and can process complex requests, manage interruptions, and maintain context over longer conversations.
Longer conversations are supported by an expanded context window of 128K tokens, up from 32K, which lets the model handle more intricate tasks without losing context. GPT-Realtime-2 also introduces several features aimed at improving user experience, including the ability to use short preamble phrases to signal that the agent is working on a request. This helps to fill awkward silences, a common issue in voice interactions.
Developers can also adjust the model's reasoning effort across five levels, allowing for a balance between performance and latency. This feature is particularly useful for production builders who need to fine-tune the model's performance based on specific use cases. Additionally, GPT-Realtime-2 offers tone control, enabling it to adjust its speaking style according to the situation, and it has improved understanding of industry-specific terminology.
The model's capabilities are reflected in benchmark results. On Big Bench Audio, GPT-Realtime-2 with high reasoning scored 96.6%, compared to 81.4% for its predecessor, GPT-Realtime-1.5, an improvement of 15.2 percentage points. On Audio MultiChallenge instruction following, GPT-Realtime-2 with xhigh reasoning scored 48.5%, compared to 34.7% for GPT-Realtime-1.5, an improvement of 13.8 percentage points.
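To put the benchmark figures in perspective, the short Python sketch below computes both the absolute gain (in percentage points) and the relative gain for each benchmark. The scores are the ones quoted in this article; the `improvement` helper is an illustrative name, not part of any OpenAI tooling. Note that the two benchmarks were reported at different reasoning levels (high and xhigh), so the rows are not directly comparable with each other.

```python
# Benchmark scores (percent) as quoted in the article.
# Big Bench Audio was reported with "high" reasoning,
# Audio MultiChallenge with "xhigh" reasoning.
scores = {
    "Big Bench Audio": {"GPT-Realtime-1.5": 81.4, "GPT-Realtime-2": 96.6},
    "Audio MultiChallenge": {"GPT-Realtime-1.5": 34.7, "GPT-Realtime-2": 48.5},
}


def improvement(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 1 decimal."""
    points = new - old
    relative = points / old * 100
    return round(points, 1), round(relative, 1)


for name, s in scores.items():
    points, relative = improvement(s["GPT-Realtime-1.5"], s["GPT-Realtime-2"])
    print(f"{name}: +{points} points ({relative}% relative)")
# Big Bench Audio: +15.2 points (18.7% relative)
# Audio MultiChallenge: +13.8 points (39.8% relative)
```

The relative figure shows that the smaller absolute gain on Audio MultiChallenge is the larger proportional jump, because the predecessor started from a much lower score.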
GPT-Realtime-2 is priced at $32 per 1M audio input tokens and $64 per 1M audio output tokens.
Alongside GPT-Realtime-2, OpenAI has introduced GPT-Realtime-Translate, a live translation model capable of translating speech from 70+ input languages into 13 output languages in real time. This model is designed for applications requiring live interpretation, such as bilingual customer support or live events. It is priced at $0.034 per minute.
The third model, GPT-Realtime-Whisper, is a streaming speech-to-text model built for low-latency transcription. It allows for controllable latency, making it suitable for live broadcasts, meeting notes, and voice agents that need to understand users in real time. GPT-Realtime-Whisper is priced at $0.017 per minute.
All three models are available through the OpenAI Realtime API, which has exited beta and is now generally available. Developers can choose between voice-agent, translation, and transcription sessions, depending on their application's needs.
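The prices quoted above can be turned into a rough cost estimate. The Python sketch below uses only the rates stated in this article; the function names and the example workload (token counts and session length) are hypothetical, and the article does not say how many audio tokens correspond to a minute of speech, so token counts have to come from actual usage data.

```python
# Prices in USD, as quoted in the article.
AUDIO_INPUT_PER_1M_TOKENS = 32.0    # GPT-Realtime-2 audio input
AUDIO_OUTPUT_PER_1M_TOKENS = 64.0   # GPT-Realtime-2 audio output
TRANSLATE_PER_MINUTE = 0.034        # GPT-Realtime-Translate
WHISPER_PER_MINUTE = 0.017          # GPT-Realtime-Whisper


def realtime2_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate GPT-Realtime-2 audio cost from token counts."""
    cost = (
        input_tokens * AUDIO_INPUT_PER_1M_TOKENS
        + output_tokens * AUDIO_OUTPUT_PER_1M_TOKENS
    ) / 1_000_000
    return round(cost, 4)


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Estimate cost for the per-minute models (translation, transcription)."""
    return round(minutes * rate_per_minute, 4)


# Hypothetical workload: the token counts below are made up for illustration.
print(realtime2_cost(input_tokens=50_000, output_tokens=20_000))  # 2.88
print(per_minute_cost(60, TRANSLATE_PER_MINUTE))                  # 2.04
print(per_minute_cost(60, WHISPER_PER_MINUTE))                    # 1.02
```

At these rates, an hour of live translation costs about $2.04 and an hour of streaming transcription about $1.02, while voice-agent cost depends on how many audio tokens a session consumes.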
The release also includes two new voices, Cedar and Marin, available exclusively through the API. Together, these releases give developers tools for building more sophisticated and natural voice interactions.
Source: MarkTechPost