Gradium Launches stt-translate and s2s-translate, Real-Time Speech Translation Models Beating gpt-realtime-translate on Accuracy and Latency

June 24, 2026, 20:00

Key Points

AI-compiled summary (generated by this site)

Gradium today released two real-time speech translation models: stt-translate and s2s-translate. Both run across five languages and stream results live in the browser. Gradium claims a better accuracy-latency tradeoff than gpt-realtime-translate and gemini-3.5-live-translate. It also adds output voice control, including cloning, that gpt-realtime-translate lacks.

TL;DR

- Gradium launched two real-time speech translation models: stt-translate (speech → text) and s2s-translate (speech → speech).
- They cover five languages (EN, FR, DE, ES, PT) and 20 pairs, collapsing the usual 3-model cascade into 2.
- Accuracy leads gemini-3.5-live-translate on BLEU and MetricX, and beats gpt-realtime-translate on BLEU (comparable on MetricX).
- Latency averages 3.0s, ahead of gpt-realtime-translate (3.6s) and just behind gemini-3.5-live-translate (2.9s).
- Unlike gpt-realtime-translate, you pick the output voice or clone your own, all over one duplex WebSocket.

stt-translate

stt-translate takes speech in one language and returns text in another. It supports English (EN), French (FR), German (DE), Spanish (ES), and Portuguese (PT). Any source maps to any other target in that set, which gives 20 directed language pairs (5 sources × 4 targets).

The key design choice is collapsing two steps into one. Transcription and translation happen in a single pass, inside the speech model. There is no intermediate transcript to wait on and no handoff between systems.

According to Gradium, the approach draws on the Hibiki-Zero framework. The model is trained with reinforcement learning to optimize low latency and high accuracy jointly. The result is fewer moving parts in the pipeline.

s2s-translate

s2s-translate turns spoken audio in one language into spoken audio in another, end to end. It builds on stt-translate and pairs it with a Gradium TTS model in one service. You stream audio in over a WebSocket. You receive both the synthesized output audio and the translated transcript as they are produced. That removes integration work.
You do not wire STT and TTS together yourself or manage two connections. The server runs the pipeline and streams results back.

Input audio is PCM at 24 kHz, 16-bit signed mono. Output audio is PCM at 48 kHz, 16-bit signed mono. WAV, Opus, mu-law, and A-law are also supported.

How Gradium Measures Quality: BLEU and MetricX

Translation quality is not one number, so Gradium reports two complementary metrics.

BLEU (Bilingual Evaluation Understudy) is the long-standing machine translation standard (Papineni et al.). It measures n-gram overlap between model output and human reference translations. It runs from 0 to 100, where higher is better. BLEU is fast, reproducible, and comparable across systems. Its limit is that it rewards surface word matching. A correct translation using different wording can be penalized.

MetricX is a learned, neural quality metric developed by Google (Juraska et al.). It predicts how a human would rate a translation. It is an error score, so lower is better, and it tracks human judgment more closely than BLEU.

The two catch different failures. BLEU checks lexical fidelity; MetricX checks semantic adequacy.

Benchmark

Gradium benchmarks on a proprietary dataset of conversational speech. The data reflects everyday topics like work, travel, and weather, rather than scripted text.

Against gemini-3.5-live-translate, Gradium leads on both BLEU and MetricX. Against gpt-realtime-translate, Gradium leads on BLEU and is comparable on MetricX.
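As an aside on the BLEU limitation described above, the sketch below computes clipped n-gram precision, the building block of BLEU, for a literal translation and a paraphrase. It is a simplified illustration, not full BLEU (there is no brevity penalty, no geometric mean over 1- to 4-grams, and no corpus-level aggregation), and it is not Gradium's evaluation code. The example sentences and function names are invented for illustration.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Made-up sentences: both candidates carry the same meaning.
reference = "the meeting starts at nine tomorrow"
literal = "the meeting starts at nine tomorrow"
paraphrase = "tomorrow the session begins at nine"

if __name__ == "__main__":
    for name, cand in [("literal", literal), ("paraphrase", paraphrase)]:
        p1 = clipped_precision(cand, reference, 1)
        p2 = clipped_precision(cand, reference, 2)
        print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
```

On these sentences the paraphrase matches 4 of 6 unigrams and only 1 of 5 bigrams, even though a human would likely accept it. That gap is the kind of case a learned metric such as MetricX is meant to catch.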
| Capability | Gradium | gpt-realtime-translate | gemini-3.5-live-translate |
| --- | --- | --- | --- |
| Average latency (all pairs) | 3.0s | 3.6s | 2.9s |
| BLEU (higher is better) | Leads both | Lower than Gradium | Lower than Gradium |
| MetricX (lower error is better) | Comparable to GPT; leads Gemini | Comparable to Gradium | Higher error than Gradium |
| Choose output voice | Yes (catalogue) | No | Not stated |
| Clone your own voice | Yes | No | Not stated |
| Languages | 5 languages, 20 pairs | Not stated | Not stated |

Accuracy (BLEU and MetricX) is measured on stt-translate's translation; latency is for the full s2s-translate pipeline. Read it as a tradeoff, not a clean sweep. Gemini is fractionally faster; Gradium is more accurate and adds voice control.

Why Two Models Beat Three

The standard speech-to-speech stack uses three models: Speech-To-Text, then Text-To-Text translation, then Text-To-Speech. Each stage is a separate inference call. Each adds processing time and a handoff.

Gradium uses two. stt-translate performs transcription and translation in a single pass. The dedicated Text-To-Text stage disappears entirely. That removes one full model from the critical path, along with its latency and handoff. The end-to-end path is shorter than a three-model cascade at equivalent quality.

The numbers back the design. s2s-translate averages 3.0s across all language pairs. That beats gpt-realtime-translate at 3.6s and sits near gemini-3.5-live-translate at 2.9s.

Use Cases With Examples

- Live dubbing and localization: Clone a presenter's voice once. Translate a French keynote into Spanish that still sounds like the original speaker.
- Multilingual voice agents: Route a support call through s2s-translate. An English agent hears a German caller in English, and replies stream back in German.
- Real-time meetings: Pipe microphone audio in over the WebSocket. Each participant receives translated speech and transcript in their own language.
- Accessibility and captioning: Use stt-translate alone when you only need text. Render live translated captions without generating audio.
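For the real-time meeting use case, one practical question is how many translation streams a single speaker needs. The sketch below is a hypothetical planning helper, not part of the Gradium SDK: it assumes one stream per distinct target language and skips listeners who share the speaker's language. The key names `model_name`, `target_language`, and `voice_id` mirror the article's own code example; the language codes other than `"en"` and the voice IDs are assumptions.

```python
# "en" appears in the article's example; the other four codes are
# assumed here to follow the same two-letter convention.
SUPPORTED = {"en", "fr", "de", "es", "pt"}


def plan_targets(speaker_lang: str, listener_langs: list[str]) -> list[str]:
    """Distinct target languages that need a translation stream for one speaker."""
    unknown = ({speaker_lang} | set(listener_langs)) - SUPPORTED
    if unknown:
        raise ValueError(f"unsupported language(s): {sorted(unknown)}")
    # Listeners who share the speaker's language need no translation.
    return sorted(set(listener_langs) - {speaker_lang})


def build_setups(
    speaker_lang: str,
    listener_langs: list[str],
    voice_by_lang: dict[str, str],
) -> list[dict[str, str]]:
    """One setup dict per target language (hypothetical helper)."""
    return [
        {
            "model_name": "s2s-translate",
            "target_language": target,
            # Caller-supplied ID; the article notes the voice must
            # match the target language.
            "voice_id": voice_by_lang[target],
        }
        for target in plan_targets(speaker_lang, listener_langs)
    ]


if __name__ == "__main__":
    voices = {"en": "VOICE_EN", "fr": "VOICE_FR"}  # placeholders, not real voice IDs
    for setup in build_setups("de", ["en", "de", "fr", "en"], voices):
        print(setup)
```

Summing the targets over every possible speaker language recovers the 20 directed pairs quoted in the table.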
Translate in a Few Lines of Code

The Python SDK streams audio through the Speech-To-Speech endpoint and returns translated audio plus transcript.

```python
import asyncio
import numpy as np
from gradium import client as gradium_client

grc = gradium_client.GradiumClient()  # reads GRADIUM_API_KEY from the environment

setup = {
    "model_name": "s2s-translate",
    "input_format": "pcm_24000",     # 24 kHz, 16-bit signed mono input
    "output_format": "pcm_48000",    # 48 kHz, 16-bit signed mono output
    "voice_id": "cLONiZ4hQ8VpQ4Sz",  # must be a voice in the target language
    "stt_model_name": "stt-translate",
    "tts_model_name": "default",
    "target_language": "en",
}

# Raw 24 kHz, 16-bit mono PCM bytes (from a file, buffer, or microphone).
with open("input_24k_mono.pcm", "rb") as f:
    pcm = f.read()


async def main() -> np.ndarray:
    audio_out: list[bytes] = []
    async with grc.s2s_realtime(wait_for_ready_on_start=True, **setup) as s2s:

        async def send_loop():
            for i in range(0, len(pcm), 1920):  # 1920 bytes = 40 ms at 24 kHz
                await s2s.send_audio(pcm[i : i + 1920])
            await s2s.send_eos()  # signal end of input

        async def recv_loop():
            async for msg in s2s:
                if msg["type"] == "audio":
                    audio_out.append(msg["audio"])  # translated speech (bytes)
                elif msg["type"] == "text":
                    print(msg["text"], end=" ", flush=True)  # translated transcript
                elif msg["type"] == "end_of_stream":
                    break

        async with asyncio.TaskGroup() as tg:
            tg.create_task(send_loop())
            tg.create_task(recv_loop())

    return np.frombuffer(b"".join(audio_out), dtype=np.int16)  # 48 kHz mono PCM


translated_pcm = asyncio.run(main())
```

The SDK exposes three ways to drive S2S. Use s2s_realtime for live sources, s2s_stream for finite iterables, and s2s for buffered files. All three talk to wss://api.gradium.ai/api/speech/s2s.
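The 1920-byte chunk size in the example follows from the input format: 24,000 samples per second × 2 bytes per sample × 1 channel × 0.040 s = 1920 bytes. The sketch below checks that arithmetic and shows one way to pull raw PCM out of a WAV file using only the Python standard library, for cases where you want to feed the raw-PCM path. The helper names are hypothetical and not part of the Gradium SDK.

```python
import wave
from typing import BinaryIO, Iterator, Union


def bytes_per_chunk(
    sample_rate_hz: int,
    chunk_ms: int,
    sample_width_bytes: int = 2,
    channels: int = 1,
) -> int:
    """Bytes of PCM audio covering chunk_ms milliseconds."""
    return sample_rate_hz * sample_width_bytes * channels * chunk_ms // 1000


def iter_chunks(pcm: bytes, chunk_bytes: int) -> Iterator[bytes]:
    """Yield fixed-size slices of a PCM buffer; the last one may be shorter."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def read_wav_as_pcm(source: Union[str, BinaryIO], expected_rate: int = 24000) -> bytes:
    """Return raw frames from a 16-bit mono WAV, rejecting other layouts."""
    with wave.open(source, "rb") as wav:
        layout = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        if layout != (1, 2, expected_rate):
            raise ValueError(f"expected mono 16-bit {expected_rate} Hz, got {layout}")
        return wav.readframes(wav.getnframes())


if __name__ == "__main__":
    size = bytes_per_chunk(24000, 40)
    print(f"40 ms at 24 kHz, 16-bit mono = {size} bytes")
```

Rejecting mismatched files up front is a deliberate choice here: audio sent at the wrong sample rate or channel count would still be accepted as bytes, but would be interpreted incorrectly downstream.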
Strengths and Weaknesses

Strengths

- Single-pass stt-translate removes one model from the latency path
- Leads gemini-3.5-live-translate on both BLEU and MetricX
- Output voice choice and cloning, which gpt-realtime-translate lacks
- One duplex WebSocket replaces a hand-wired STT-plus-TTS pipeline

Weaknesses

- Five languages at launch, with 20 pairs only across that set
- gemini-3.5-live-translate has fractionally lower latency at 2.9s
- MetricX is only comparable to, not ahead of, gpt-realtime-translate
- Benchmarks use a proprietary dataset, so external replication is limited

Original source: MarkTechPost AI
