StepFun Unveils StepAudio 2.5 Realtime: A Revolutionary Voice Model with Advanced Roleplay Capabilities
StepFun, a Shanghai-based AI lab, has released StepAudio 2.5 Realtime, an end-to-end real-time speech large language model with customizable persona capabilities and advanced paralinguistic comprehension.

StepFun, a Shanghai-based AI lab, has unveiled StepAudio 2.5 Realtime, a voice model that operates in real time and handles audio input and output in a single unified system. Unlike traditional pipeline-based systems, which split speech recognition, reasoning, and synthesis into sequential steps, StepAudio 2.5 Realtime is an end-to-end model. It supports both Chinese and English. The model is accessed through a WebSocket API at the endpoint wss://api.stepfun.com/v1/realtime, using the model string step-2.5-realtime.
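As a rough illustration of how the endpoint and model string quoted above might be combined, here is a minimal Python sketch that only assembles connection parameters and does not open a socket. The endpoint and model string come from the article. Everything else is an assumption made for illustration: the article does not say how the model is selected or how requests are authenticated, so the `model` query parameter, the Bearer-token header, and the helper name `build_connection_params` are hypothetical. StepFun's own API documentation is the authority on the actual handshake.

```python
from urllib.parse import urlencode, urlsplit

# Both values below are quoted in the article.
REALTIME_ENDPOINT = "wss://api.stepfun.com/v1/realtime"
MODEL_NAME = "step-2.5-realtime"


def build_connection_params(api_key: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for opening the realtime WebSocket.

    ASSUMPTIONS (not confirmed by the article): the model is selected with a
    `model` query parameter, and the API key is sent as a Bearer token.
    """
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    # Append the model string to the endpoint as a query parameter (assumed).
    url = f"{REALTIME_ENDPOINT}?{urlencode({'model': MODEL_NAME})}"
    # Standard Bearer-token header (assumed authentication scheme).
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


if __name__ == "__main__":
    url, headers = build_connection_params("sk-placeholder")
    print(url)
    print(urlsplit(url).scheme)
```

The returned URL and headers could then be handed to whichever WebSocket client the application already uses. Keeping parameter assembly separate from the connection itself makes the assumed pieces easy to swap once the real handshake is known.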
According to StepFun's research team, three core architectural innovations drive the model's capabilities. First, the team built a million-scale persona feature matrix, starting from more than 10,000 high-quality, natively authored personas, and combined it with millions of real-world conversational samples for training. The goal is better generalization, in particular stable performance on difficult, long-tail conversational topics.
Second, to address "out-of-character" (OOC) behavior, a known failure mode in conversational AI where a model drifts away from its defined persona mid-conversation, StepFun's team ran dedicated Reinforcement Learning from Human Feedback (RLHF) optimization aimed specifically at persona consistency in roleplay scenarios. This targeted training is intended to keep the persona consistent throughout an interaction. Third, StepAudio 2.5 Realtime inherits the capabilities of StepAudio 2.5 TTS and uses reinforcement learning to tightly fuse speech understanding and generation.
This fusion enables what StepFun calls "global scene-level tonal setting" and "intra-sentence detail sculpting": the model sets an overall emotional register for a response while adjusting finer acoustic details within individual sentences. A distinguishing capability of the model is paralinguistic perception, meaning the analysis of non-verbal acoustic information in speech such as tone, speaking rate, pauses, sighs, and laughter. By capturing these signals, the model can perceive the user's mood and underlying intentions.
For example, it can identify fatigue from a low tone or frustration from a rapid speech rate. StepAudio 2.5 Realtime scored 82.18 on the paralinguistic comprehension benchmark, demonstrating perception of vocal speed, emotion, age, and other acoustic features. The StepFun research team conducted a comprehensive suite of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five dimensions.
Human evaluation was conducted through real mobile app conversations scored by human raters, with detailed scores available for review.
Source: MarkTechPost