Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech
Over half of the world's population speaks more than one language.

Over half of the world's population speaks more than one language. And for many bilingual speakers, code-switching — seamlessly switching between languages, even mid-sentence — is a natural part of everyday communication. Whether in casual conversations, contact centers, or IT helpdesks, speakers fluidly adapt to whichever language feels most natural in the moment.
Despite the prevalence of bilingual speakers across the world, there has been little work focused on how voice agents handle code-switched speech in enterprise settings. So, when a customer asked us how our voice agents would perform for their largely bilingual customer base who routinely code-switched, we decided to build our own benchmark and dataset to evaluate models. We focused on automatic speech recognition (ASR) — the first step in any voice agent pipeline — because transcription errors propagate forward into every downstream component. In enterprise settings, where a misrouted ticket or misunderstood policy question has real operational consequences, getting the transcript right is an especially important step of the voice agent pipeline.
Our benchmark covers four language pairs that were most relevant for our customer base: Spanish-English, French-English, Canadian French-English, and German-English. It uses the non-English language as the matrix framing, with English embedded at varying lengths. The data covers a wide range of Human Resources (HR) and IT Service management (ITSM) scenarios, including employee inquiries about benefits or payroll, and support requests such as password resets, VPN access, or device troubleshooting. To measure how various models perform, we report three metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). We choose these metrics to capture both (1) the models' exact accuracy in transcription, as well as (2) their ability to preserve the meaning of the utterance for downstream tasks.
We release our benchmark and data through our harness for evaluating voice models, AU-Harness. We also provide results from seven ASR systems, including some Large Audio Language Models (LALMs), frontier ASRs, and open-source ASRs. Our main finding is that the cost of codeswitching varies depending on the language-pair and model tested. ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro surface as the top models across metrics for the task.
We start with an internal corpus of IT support and HR interactions. To create each code-switched utterance, we begin with parallel user utterances in English and one of our four non-English languages, then filter for good code-switching candidates. We keep utterances between 12 and 40 words — short enough to be natural spoken turns, long enough to contain real switching opportunities. We also exclude utterances where entities dominate — emails, phone numbers, IDs, or URLs that make text half-English by necessity rather than bilingual choice. Finally, we require at least three switchable content words — nouns, verbs, or adjectives that are not entities or product names — to give the generation model enough material to produce a meaningful code-switched version.
Source: Hugging Face