Hugging Face Blog · AI Agent

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

June 9, 2026, 19:38

Key Takeaways

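More than half of the world's population speaks more than one language. For many bilingual speakers, code-switching (switching fluidly between languages within a conversation or even mid-sentence) is a natural part of everyday communication. Whether in casual conversation, contact centers, or IT service desks, speakers adapt to whichever language feels most natural in the moment. Although bilingual speakers are common worldwide, research on how voice agents handle code-switched speech remains limited.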

On-site AI-compiled draft

Enterprise Article, published June 9, 2026. By Shama Gupta, Lindsay Brin, and Fanny Riols (ServiceNow-AI).

Introduction

Over half of the world's population speaks more than one language. For many bilingual speakers, code-switching (switching seamlessly between languages, even mid-sentence) is a natural part of everyday communication. Whether in casual conversations, contact centers, or IT helpdesks, speakers fluidly adapt to whichever language feels most natural in the moment.

Despite the prevalence of bilingual speakers across the world, there has been little work focused on how voice agents handle code-switched speech in enterprise settings. So, when a customer asked us how our voice agents would perform for their largely bilingual customer base, who routinely code-switch, we decided to build our own benchmark and dataset to evaluate models.

We focused on automatic speech recognition (ASR), the first step in any voice agent pipeline, because transcription errors propagate forward into every downstream component. In enterprise settings, where a misrouted ticket or misunderstood policy question has real operational consequences, getting the transcript right is an especially important step of the voice agent pipeline.

Our benchmark covers the four language pairs most relevant to our customer base: Spanish-English, French-English, Canadian French-English, and German-English. It uses the non-English language as the matrix (framing) language, with English embedded at varying lengths. The data covers a wide range of Human Resources (HR) and IT Service Management (ITSM) scenarios, including employee inquiries about benefits or payroll, and support requests such as password resets, VPN access, or device troubleshooting.
To measure how various models perform, we report three metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). We chose these metrics to capture both (1) the models' exact transcription accuracy and (2) their ability to preserve the meaning of the utterance for downstream tasks.

We release our benchmark and data through AU-Harness, our harness for evaluating voice models. We also provide results from seven ASR systems, including some Large Audio Language Models (LALMs), frontier ASRs, and open-source ASRs.

Our main finding is that the cost of code-switching varies depending on the language pair and model tested. ElevenLabs Scribe V2, Gemini 3 Flash, and AssemblyAI Universal 3-Pro surface as the top models across metrics for the task.

The Benchmark

Data Pipeline

We start with an internal corpus of IT support and HR interactions. To create each code-switched utterance, we begin with parallel user utterances in English and one of our four non-English languages, then filter for good code-switching candidates. We keep utterances between 12 and 40 words: short enough to be natural spoken turns, long enough to contain real switching opportunities. We also exclude utterances where entities dominate, such as emails, phone numbers, IDs, or URLs that make text half-English by necessity rather than by bilingual choice. Finally, we require at least three switchable content words (nouns, verbs, or adjectives that are not entities or product names) to give the generation model enough material to produce a meaningful code-switched version.

From here, we tested various strategies for combining languages in a realistic way and ultimately selected a simple persona prompt sent to an LLM (OpenAI/GPT-5) to produce the code-switched text. We then used an LLM verbalization pass to convert the text into its spoken form and used ElevenLabs Multilingual V2 to synthesize the audio.
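The candidate filters described in this section (length, entity dominance, and a minimum number of switchable content words) could be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the regular expressions in `ENTITY_RE`, the `max_entity_ratio` threshold of 0.3, and the assumption that each token already carries a part-of-speech tag from some upstream tagger are all hypothetical choices made here. Only the 12 to 40 word range and the minimum of three content words come from the article.

```python
import re

# Hypothetical patterns for entity-like tokens; the article does not say
# how entities were detected, so these are assumptions for illustration.
ENTITY_RE = re.compile(
    r"""(?:
        [\w.+-]+@[\w-]+\.[\w.-]+     # email address
      | https?://\S+                 # URL with scheme
      | www\.\S+                     # URL without scheme
      | \+?\d[\d\-().]{6,}           # phone-like run of digits
      | [A-Za-z]*\d[\w-]*            # ID containing a digit, e.g. INC0012345
    )""",
    re.VERBOSE,
)

# Content-word classes named in the article: nouns, verbs, adjectives.
CONTENT_POS = {"NOUN", "VERB", "ADJ"}


def is_candidate(tagged, min_words=12, max_words=40,
                 max_entity_ratio=0.3, min_switchable=3):
    """Return True if an utterance passes all three filters.

    `tagged` is a list of (token, pos) pairs; POS tags are assumed to be
    supplied by the caller. `max_entity_ratio` is an assumed threshold.
    """
    n = len(tagged)
    # Filter 1: keep utterances of 12 to 40 words.
    if not (min_words <= n <= max_words):
        return False
    # Filter 2: reject utterances dominated by entity-like tokens.
    entity_flags = [bool(ENTITY_RE.fullmatch(tok)) for tok, _ in tagged]
    if sum(entity_flags) / n > max_entity_ratio:
        return False
    # Filter 3: require enough non-entity nouns, verbs, or adjectives.
    switchable = sum(
        1 for (_, pos), is_ent in zip(tagged, entity_flags)
        if pos in CONTENT_POS and not is_ent
    )
    return switchable >= min_switchable


tokens = "I need to reset my password because the laptop keeps locking me".split()
tags = ["PRON", "VERB", "PART", "VERB", "PRON", "NOUN",
        "SCONJ", "DET", "NOUN", "VERB", "VERB", "PRON"]
example = list(zip(tokens, tags))
print(is_candidate(example))  # True under these assumed thresholds
```

Passing pre-tagged tokens keeps the sketch free of any particular NLP library. In practice the tags, and the exclusion of product names, would come from whatever tagger and entity list the pipeline already uses.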
Every utterance is then reviewed by an AI/NLP linguist who is a native speaker of the matrix language; flagged utterances are excluded or regenerated and re-reviewed. The final dataset has 259 Spanish-English records, 298 French-English records, 188 Canadian French-English records, and 173 German-English records.

Evaluation Methodology

We report three metrics per model per language pair, chosen to capture transcription accuracy, meaning preservation, and downstream task performance:

Word Error Rate (WER). Along with overall WER per language pair, we report WER by individual language.

Semantic WER (SWER). This score represents the rate of errors that are judged as semantically meaningful. Our implementation is largely based on Pipecat's STT benchmark, and we use Gemma-4-31B as our judge.

Answer Error Rate (AER). This metric directly captures whether transcription errors propagate into downstream failures. It is a question-answer metric that follows the methodology in Bhushan et al. (IISc/ARTPARK, arXiv 2507.16456). For each utterance, we generate three downstream comprehension questions and measure whether an LLM reading the ASR transcript can answer them correctly. The flow is shown in the diagram below.

Findings

We evaluated the following models:

AssemblyAI / Universal 3-Pro
Deepgram / Nova 3 Multilang
ElevenLabs / Scribe V2
Google / Gemini 3 Flash
Mistral AI / Voxtral Small 24B-2507
Nvidia / Parakeet TDT 0.6b V3
OpenAI / Whisper Large V3 Turbo

A. How well do models perform on our benchmark for code-switching?

We analyzed errors along two dimensions:

Word-level accuracy, measured through WER. WER is the standard approach: it aligns the ground truth transcript with the model's output and quantifies the distance between them. Although it is simple and widely used, it can't distinguish a minor spelling difference from a completely wrong word.

Semantic accuracy, captured through SWER and AER.
SWER gives us a holistic view of utterance-level performance, though it reflects a judge model's assessment rather than a direct downstream test. AER, by contrast, is a functional test: for each utterance, three comprehension questions measure whether the most consequential details (case numbers, names, dates, the reason for a request) were preserved in the transcription.

The differences between metrics become most meaningful when models diverge across them.

WER results (lower is better)

ElevenLabs/Scribe V2 and AssemblyAI/Universal-3 Pro are the top two models on transcription accuracy. They are tied on Spanish-English and separated by 0.02 to 0.13 percentage points across all other language pairs, with Scribe taking a narrow lead on each. Google/Gemini 3 Flash follows closely in every language pair, trailing most on Canadian French-English, where it falls 0.14 points behind Scribe and 0.12 points behind AssemblyAI.

Deepgram/Nova-3, Mistral/Voxtral, and Nvidia/Parakeet occupy the middle ranks, each pulling ahead on at least one language pair. Parakeet is the weakest of the three overall but closes the gap on German-English, where it outperforms both Nova-3 and Voxtral.

OpenAI/Whisper Large V3 Turbo sits at the bottom, with WER ranging from 0.16 to 0.61. This is a significant drop, but it reflects a known limitation of Whisper: when called without an explicit language parameter on code-switched audio, Whisper defaults to translating into English rather than transcribing, failing to preserve the language spoken in the audio.

SWER and AER results (lower is better)

The semantic metrics tell a broadly similar story to WER, with a few inversions. Scribe V2 remains in first place, with very low SWER and AER scores. While AssemblyAI ranked first or second across language pairs in WER, Gemini 3 Flash consistently outperforms it in AER and pushes AssemblyAI down to third place.
The same pattern appears in SWER, although AssemblyAI outperforms Gemini on Spanish-English. As an LALM, Gemini is optimized for language understanding and reasoning, which likely gives it an advantage on meaning-sensitive metrics even where its raw transcription accuracy falls short. A similar shift i

Original source: Hugging Face Blog ↗


TechWeb · AI Agent

NetEase Youdao Pivots Fully to AI; Full-Scenario Agent Matrix Debuts at the Book Fair


Just now · Read analysis

Hugging Face Blog · AI Agent

MosaicLeaks: Can your research agent keep a secret?

Enterprise Article, published June 18, 2026. By Alexander Gurung and Rafael Pardinas (ServiceNow).

TL;DR

Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information. MosaicLeaks proposes a new deep-research task with multi-hop questions that interleave public and private information. Across the models we tested, agents frequently leaked private information, and training only for task performance made it worse. We propose a mosaic-leakage-aware RL training method, Privacy-Aware Deep Research (PA-DR), which raises strict chain success (the share of chains

17 hours ago · Read analysis