Gradium Launches stt-translate and s2s-translate, Real-Time Speech Translation Models Beating gpt-realtime-translate on Accuracy and Latency
Gradium today released two real-time speech translation models: stt-translate and s2s-translate. Both run across five languages and stream results live in the browser.
Gradium claims a better accuracy-latency tradeoff than gpt-realtime-translate and gemini-3.5-live-translate. It also adds output voice control, including cloning, that gpt-realtime-translate lacks.
stt-translate takes speech in one language and returns text in another. It supports English (EN), French (FR), German (DE), Spanish (ES), and Portuguese (PT).
Any source maps to any target across that set. That is 20 language pairs in total, in every direction.
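The pair count follows directly from the language list: five languages, each paired with the four others as a target. A minimal sketch that enumerates them (the variable names are illustrative, not part of any Gradium API):

```python
from itertools import permutations

# Language codes as listed in the article.
LANGUAGES = ["EN", "FR", "DE", "ES", "PT"]

# Every ordered (source, target) pair where source and target differ.
# Direction matters, so EN->FR and FR->EN are counted separately.
PAIRS = list(permutations(LANGUAGES, 2))

print(len(PAIRS))  # 20
```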
The key design choice is collapsing two steps into one. Transcription and translation happen in a single pass, inside the speech model. There is no intermediate transcript to wait on and no handoff between systems.
According to Gradium, the approach draws on the Hibiki-Zero framework. The model is trained with reinforcement learning to optimize jointly for low latency and high accuracy. The single-pass design also means fewer moving parts in the pipeline.
s2s-translate turns spoken audio in one language into spoken audio in another, end to end. It builds on stt-translate and pairs it with a Gradium TTS model in one service.
You stream audio in over a WebSocket. You receive both the synthesized output audio and the translated transcript as they are produced.
That removes integration work. You do not wire STT and TTS together yourself or manage two connections. The server runs the pipeline and streams results back.
Input audio is PCM at 24 kHz, 16-bit signed mono. Output audio is PCM at 48 kHz, 16-bit signed mono. WAV, Opus, mu-law, and A-law are also supported.
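For a sense of scale, the raw PCM formats above translate into fixed byte rates. The sketch below splits an input buffer into fixed-duration chunks of the kind a client might send over a streaming connection. The 80 ms chunk duration and the helper names are assumptions made for illustration; the article does not describe Gradium's message format or preferred chunk size, so no WebSocket calls are shown.

```python
INPUT_SAMPLE_RATE = 24_000   # Hz, input format stated in the article
OUTPUT_SAMPLE_RATE = 48_000  # Hz, output format stated in the article
BYTES_PER_SAMPLE = 2         # 16-bit signed samples
CHANNELS = 1                 # mono
CHUNK_MS = 80                # ASSUMPTION: chunk duration is not given in the article


def bytes_per_chunk(sample_rate: int, chunk_ms: int = CHUNK_MS) -> int:
    """Size in bytes of one chunk of raw PCM at the given sample rate."""
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def iter_chunks(pcm: bytes, sample_rate: int = INPUT_SAMPLE_RATE,
                chunk_ms: int = CHUNK_MS):
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    size = bytes_per_chunk(sample_rate, chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# One second of silence in the input format (all-zero samples).
one_second = bytes(INPUT_SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS)
chunks = list(iter_chunks(one_second))
print(bytes_per_chunk(INPUT_SAMPLE_RATE), len(chunks))
```

At these rates, one second of input is 48,000 bytes and one second of output is 96,000 bytes, before any container or codec such as WAV or Opus is applied.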
Translation quality is not one number, so Gradium reports two complementary metrics:
BLEU (Bilingual Evaluation Understudy) is the long-standing machine translation standard (Papineni et al.). It measures n-gram overlap between model output and human reference translations. It runs from 0 to 100, where higher is better.
BLEU is fast, reproducible, and comparable across systems. Its limit is that it rewards surface word matching. A correct translation using different wording can be penalized.
MetricX is a learned, neural quality metric developed by Google (Juraska et al.). It predicts how a human would rate a translation. It is an error score, so lower is better, and it tracks human judgment more closely than BLEU.
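To make BLEU's surface-matching limit concrete, here is a simplified, sentence-level version of the metric. It is a sketch, not the evaluation Gradium ran: standard BLEU is computed over a whole corpus, and the add-one smoothing, whitespace tokenization, and example sentences here are assumptions chosen to keep the code short.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU on a 0-100 scale against a single reference."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped overlap: each n-gram is credited at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # ASSUMPTION: add-one smoothing so one missing n-gram order
        # does not force the whole score to zero.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sits on the mat"
exact = sentence_bleu("the cat sits on the mat", reference)
paraphrase = sentence_bleu("a feline is sitting upon the rug", reference)
print(exact)       # 100.0 for an exact match
print(paraphrase)  # much lower, although the meaning is preserved
```

The paraphrase conveys the same meaning but shares almost no n-grams with the reference, so it scores far below the exact match. That gap is the reason for pairing BLEU with a learned metric such as MetricX.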
Source: MarkTechPost