
Guide

Building Voice AI for Non-English Languages: A Developer’s Guide

Published: March 2026 · Author: Speko Research Team

Data verified: March 2026 · Next scheduled update: June 2026

The English-First Bias in Voice AI Infrastructure

The voice AI stack was built by English-speaking companies, for English-speaking markets. Deepgram trained Nova-3 primarily on English. ElevenLabs’s best voices are English. The entire ecosystem of benchmarks, tutorials, and integration guides assumes English as the baseline. If you are building for a non-English market, you are operating with second-class tooling by default.

This matters more than most developers realize. The accuracy gap between English and non-English performance is not minor. For some providers on some languages, WER can be 2–5x worse than English. A provider that delivers 4% WER on English might give you 18% WER on Hindi. That 18% error rate degrades your LLM’s ability to understand intent, and the damage compounds through the entire pipeline.

There is also a market reality that voice AI investors and most English-first vendors will not tell you: the non-English voice AI opportunity is larger than the English one. There are approximately 380 million native English speakers. There are 600 million Hindi speakers, 1.3 billion Mandarin speakers, 300 million Arabic speakers, and 500 million Spanish speakers. The market is waiting, but the infrastructure is not ready out of the box.

ElevenLabs will never recommend Cartesia for Thai. We will. This guide covers what you actually need to build production voice AI in the languages your users speak. See our multilingual benchmark dataset for the underlying numbers.

The Core Challenge: Language Diversity

The reason non-English voice AI is harder is not a resource problem. More data does not automatically fix it. The challenge is that non-English languages have fundamentally different phonological and morphological properties that English-optimized models handle poorly. Understanding why helps you make better provider decisions.

Tonal Languages (Mandarin, Thai, Vietnamese, Cantonese)

In tonal languages, pitch contour is phonemic — it distinguishes meaning. In Mandarin, the syllable “ma” means mother (flat tone), hemp (rising tone), horse (falling-rising tone), or scold (falling tone), depending on how you say it. An STT model trained primarily on English, where pitch signals emotion and emphasis but not lexical meaning, will systematically misclassify tonal distinctions.

Thai has five tones. Vietnamese has six. A missed tone in Thai can completely change a sentence’s meaning, and the error cascades from there. This is not a minor accuracy concern — it can produce nonsensical output that your LLM has no way to recover from. Google Chirp-3 is currently the strongest option for tonal languages, with architecture specifically designed to capture pitch contour features.

Agglutinative Languages (Turkish, Finnish, Korean, Japanese, Hungarian)

Agglutinative languages build complex words by chaining morphemes. In Turkish, the single word “Muvaffakiyetsizleştiremeyebileceklerimizdenmişsinizcesine” encodes a complete clause that requires a full sentence in English. Japanese verb forms attach suffixes for tense, politeness level, negation, conditionality, and aspect — all in a single word.

English STT models are optimized for short, discrete word units. Agglutinative languages produce long acoustic sequences that map to a single semantic unit. Word segmentation — where one word ends and another begins — is a hard problem for models that were not trained on these languages extensively. Deepgram Nova-3 has made meaningful progress on Japanese and Korean, with 6.2% WER on Japanese, which is competitive with purpose-built Japanese STT providers.

Diglossia (Arabic, Tamil, Modern Greek)

Diglossia is the coexistence of a formal written standard and divergent spoken dialects within the same language community. Arabic is the canonical example: Modern Standard Arabic (MSA) is used in formal writing, news, and official contexts, but no one speaks MSA as a native language. Real Arabic speakers speak Egyptian Arabic, Levantine Arabic, Gulf Arabic, Moroccan Arabic, or one of dozens of other dialects that differ substantially in pronunciation, vocabulary, and grammar.

Most Arabic STT models are trained on MSA because it is what exists in large text corpora. Deploying such a model for Egyptian users produces systematic errors. AssemblyAI Universal-3 Pro has the broadest Arabic dialect coverage of any provider we tested, though meaningful gaps remain for less common dialects.

Code-Switching (Hinglish, Spanglish, Taglish)

Real users in multilingual societies do not stay in one language. An educated Indian professional speaking Hindi will naturally insert English technical terms, entire English clauses, and English filler words into their Hindi speech. This is not a sign of limited proficiency — it is how urban, educated speakers actually communicate. The same pattern holds for US Latinos switching between Spanish and English, Filipinos switching between Filipino and English, and dozens of other populations.

Most STT models are designed to transcribe one language per audio clip. Code-switched audio that flips between Hindi and English mid-sentence creates an identity problem for the model: which language’s phonology is the right reference frame? AssemblyAI Universal-3 Pro is the only provider in our benchmark with explicit, tested support for code-switched speech. For Hinglish-speaking users, it is not merely competitive with other providers — it is categorically better.

Provider Performance by Language

The following breakdown reflects our benchmark testing of Deepgram Nova-3, AssemblyAI Universal-3 Pro, Google Chirp-3, and OpenAI Whisper v3 for STT, plus ElevenLabs and Cartesia for TTS. Numbers are verified as of March 2026. Provider recommendations reflect accuracy as the primary criterion, with latency as a tiebreaker.

Japanese

Japanese is linguistically demanding for Western STT systems: no word boundary markers in writing, three scripts (hiragana, katakana, kanji), complex honorific morphology, and heavy agglutination. Deepgram Nova-3 leads our benchmark at 6.2% WER on Japanese, significantly better than English-first competitors.

See the Japanese voice AI guide for a complete provider breakdown including TTS naturalness scores.

Hindi

Hindi is spoken by approximately 600 million people and is the primary language for voice AI deployments targeting India. The key complication for India deployments is not Hindi accuracy per se — it is code-switching. Urban Indian users speak Hinglish: Hindi sentence structure with freely inserted English words, phrases, and technical vocabulary. A model that handles pure Hindi well may fail badly on realistic Indian user audio.

See the Hindi voice AI guide for a full provider comparison including code-switching benchmarks.

Arabic

Arabic is a high-stakes language for voice AI: 300+ million native speakers, rapidly growing smartphone penetration in MENA, and strong demand for customer service and educational voice agents. The dialect fragmentation makes it one of the hardest languages to deploy correctly.

See the Arabic voice AI guide for dialect-specific benchmark data and provider recommendations.

Thai

Thai is a tonal language with five tones, complex consonant clusters, and no spaces between words in written form. It presents a combination of the challenges found in Mandarin (tones) and Japanese (word segmentation). Google Chirp-3 is the benchmark leader for Thai, with architecture specifically designed for tonal language accuracy.

Spanish

Spanish has the advantage of being a high-priority language for US-based voice AI providers. Deepgram, AssemblyAI, and ElevenLabs all have strong Spanish support. The primary consideration is variant: Castilian Spanish (Spain), Latin American Spanish, and the many regional variants within Latin America (Mexican Spanish, Argentine Spanish, Caribbean Spanish) differ in pronunciation and vocabulary.

See the Spanish voice AI guide for regional variant benchmark data.

Portuguese (PT-BR)

Brazilian Portuguese has become a high-priority language for US voice AI providers due to Brazil’s large and growing tech market. PT-BR and European Portuguese (PT-PT) are distinct enough that models trained on one perform worse on the other.

Chinese (Mandarin)

Mandarin Chinese presents the tonal language challenge at massive scale. Both Deepgram Nova-3 and Google Chirp-3 have competitive Mandarin support. The choice between them often comes down to infrastructure preferences.

Korean

Korean is an agglutinative language with a consistent phonology and a well-defined script (Hangul). The relatively consistent phonological rules make it more tractable for STT than languages with more variable pronunciation. Deepgram Nova-3 is the benchmark leader.

The 5 Mistakes Developers Make with Non-English Voice AI

These are the patterns we see most often when teams come to us after a failed non-English voice AI deployment.

Mistake 1: Testing only on clean audio

Clean audio — studio recording, native speaker, standard dialect — is what providers use in their benchmarks. It is not what your users produce. Real non-English audio has regional accents, background noise, microphone variability, and dialectal features that differ significantly from the training distribution. A Japanese provider benchmark number measured on formal Tokyo Japanese will not predict performance for Kansai-accented users calling on a noisy train.

The fix: collect 100–200 audio samples from your actual target users before choosing a provider. If you cannot collect real user audio pre-launch, at minimum use regionally diverse audio that matches your target market’s accent, background noise level, and speech style.
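
One way to keep that collection honest is to tag each clip with the variables that matter for non-English accuracy, so you can check coverage before you benchmark. A hypothetical per-sample schema in Python — the field names are illustrative, not a Speko format:

from dataclasses import dataclass

@dataclass
class AudioSample:
    path: str       # local path or URL to the audio file
    reference: str  # ground-truth transcription, used later for WER scoring
    dialect: str    # e.g. "kansai", "egyptian-arabic", "mexican-spanish"
    noise: str      # e.g. "quiet", "street", "call-center"
    device: str     # e.g. "smartphone", "headset", "landline"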

Mistake 2: Picking the English winner

Deepgram wins on English. That fact does not transfer to other languages. Provider rankings change significantly by language family. For Hindi, AssemblyAI beats Deepgram. For Thai, Google beats both. For Mandarin, the gap narrows significantly. If you are picking Deepgram for your Hindi deployment because you read a benchmark showing it wins on English, you are making a decision based on irrelevant data.

The fix: run your own benchmark for your specific target language, not a general English benchmark. Use the Speko benchmark tool to compare providers on your specific language and audio samples in minutes, not weeks.

Mistake 3: Ignoring code-switching

Code-switching is the norm, not the exception, for educated users in multilingual markets. Indian software engineers asking a voice AI about their account balance will say something like “Mera balance kya hai and can I increase my credit limit?” — Hindi sentence structure, English technical vocabulary, English relative clause. US Latino consumers will mix Spanish and English constantly within a single conversation.

STT models that are not designed for code-switching will produce systematic errors on this input: either transcribing the English words as garbled Hindi phonology, or switching language models mid-utterance with predictable accuracy loss. AssemblyAI Universal-3 Pro is currently the only provider with explicit, benchmarked code-switching support.

Mistake 4: Not testing TTS naturalness

TTS quality differences are larger and more perceptually obvious for non-English than for English. English TTS from the top providers (ElevenLabs, Cartesia, PlayHT) all sounds acceptably natural to most English listeners. The same is not true for Japanese, Arabic, or Thai: the gap between a natural-sounding TTS voice and a robotic one is immediately apparent to native speakers and directly affects trust and conversion.

Japanese TTS must get pitch accent right. Arabic TTS must handle the Hamza and various diacritics that affect pronunciation. Thai TTS must produce correct tonal patterns. A TTS provider that scores well on English MOS (Mean Opinion Score) benchmarks may score poorly on the language you care about.

The fix: run a TTS naturalness evaluation with 5–10 native speaker evaluators for your target language before committing to a TTS provider. Ask them to rate naturalness and specific linguistic features (correct tones for Mandarin/Thai, correct pitch accent for Japanese, correct dialect features for Arabic).

Mistake 5: Assuming one stack for all languages

The correct STT + TTS combination for Japanese is different from the correct combination for Arabic. There is no single “multilingual” stack that wins across all languages. Teams building products for multiple language markets need to accept that their voice infrastructure may use different providers for different language routes — Deepgram for Spanish, AssemblyAI for Hindi, Google for Thai — with routing logic that selects the appropriate provider based on the user’s language.
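
The routing logic itself can be very small. A sketch, assuming a generic transcribe() wrapper around each vendor’s SDK; the picks mirror the per-language recommendations in this guide, and you should adjust them to your own benchmark results:

# Map language codes to (provider, model) pairs per language route.
STT_ROUTE = {
    "hi": ("assemblyai", "universal-3-pro"),  # Hindi / Hinglish
    "ar": ("assemblyai", "universal-3-pro"),  # Arabic
    "th": ("google", "chirp-3"),              # Thai
    "es": ("deepgram", "nova-3"),             # Spanish
    "ja": ("deepgram", "nova-3"),             # Japanese
}

def route_stt(language_code):
    # Fall back to a default stack for languages without an explicit route.
    return STT_ROUTE.get(language_code, ("deepgram", "nova-3"))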

This adds operational complexity, but it is the correct trade-off. The alternative is accepting suboptimal accuracy across all languages to avoid the complexity, which in practice means losing users who notice that your voice AI does not understand them as well as it should.

A Practical Benchmarking Framework

If you have the time to run a manual evaluation, here is the framework we recommend. This is what Speko automates, but you can run it manually for smaller-scale evaluations.

Step 1: Collect representative audio samples

Collect 100 audio samples that represent your target users. Include:

- Regional accents that match your target market, not just the standard dialect
- Realistic background noise and a range of microphones and devices (smartphone, headset, speakerphone)
- Code-switched speech if your users mix languages; in most multilingual markets, they do
- The speech style your product will actually hear: spontaneous conversation, not read-aloud prompts

If you cannot source 100 real samples pre-launch, use public corpora appropriate to your target language. Common Voice (Mozilla) has multilingual audio with a permissive license. OpenSLR has academic corpora for many languages.

Step 2: Run transcriptions across providers

Send your 100 samples through Deepgram Nova-3, AssemblyAI Universal-3 Pro, Google Chirp-3, and OpenAI Whisper v3. Record the raw transcription output for each provider for each sample.

# Run every sample through each of the four STT providers and keep the
# raw output for scoring in Step 3. transcribe() is a thin wrapper you
# write around each vendor's SDK.
for sample in test_set:
    deepgram_output = transcribe(sample, provider="deepgram")
    assemblyai_output = transcribe(sample, provider="assemblyai")
    google_output = transcribe(sample, provider="google")
    whisper_output = transcribe(sample, provider="whisper")

Step 3: Calculate WER for each provider

Compare each provider’s output against the ground truth transcription. WER is calculated as:

WER = (Substitutions + Deletions + Insertions) / Total Words in Reference

Note: WER calculation for non-Latin scripts (Arabic, Japanese, Chinese, Hindi in Devanagari) requires appropriate tokenization. Use language-specific tokenizers rather than simple whitespace splitting.
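
A minimal WER implementation over pre-tokenized text, assuming you supply tokens from a language-appropriate tokenizer (for example MeCab for Japanese or jieba for Mandarin; plain whitespace splitting is only safe for Latin-script languages):

def wer(reference_tokens, hypothesis_tokens):
    # Standard edit-distance WER: substitutions + deletions + insertions,
    # divided by the number of reference tokens.
    r, h = reference_tokens, hypothesis_tokens
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution (or match)
                d[i - 1][j] + 1,                           # deletion
                d[i][j - 1] + 1,                           # insertion
            )
    return d[len(r)][len(h)] / max(len(r), 1)

For example, wer("mera balance kya hai".split(), "mera balance kya he".split()) returns 0.25: one substitution over four reference tokens.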

Step 4: Test TTS naturalness

Generate TTS output for 10 representative phrases from each TTS provider you are evaluating. Include:

- Phrases that exercise language-specific features: tones for Mandarin and Thai, pitch accent for Japanese, dialect features and diacritics for Arabic
- Numbers, dates, and the domain-specific technical terms your agent will actually say
- At least one long, multi-clause sentence, where unnatural prosody is most audible

Have 3–5 native speakers rate each sample on a 1–5 scale for naturalness and intelligibility. A one-day evaluation with native speaker consultants will give you more signal than any automated MOS benchmark.
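
Aggregating those ratings takes a few lines. A minimal sketch with illustrative scores — one 1–5 naturalness rating per (evaluator, phrase) pair:

from statistics import mean

# Illustrative ratings; replace with your evaluators' actual scores.
ratings = {
    "elevenlabs": [5, 4, 4, 5, 3, 4],
    "cartesia": [4, 4, 5, 4, 4, 3],
}

# Rank providers by mean naturalness, highest first.
for provider, scores in sorted(ratings.items(), key=lambda kv: -mean(kv[1])):
    print(f"{provider}: mean naturalness {mean(scores):.2f} over {len(scores)} ratings")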

Step 5: Run a 7-day production pilot with monitoring

Lab benchmarks do not fully predict production performance. Run a controlled pilot with your chosen provider stack for 7 days, with:

- Transcript logging, so you can review what real users actually said and what the model heard (a minimal sketch follows this list)
- Daily WER spot-checks on a sample of production audio, scored against human transcriptions
- Latency and error-rate monitoring at both the STT and TTS layers
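
One way to structure that per-turn log, assuming the same transcribe() wrapper as in Step 2; log_turn() and its field names are illustrative, not part of any provider SDK:

import json
import time

def log_turn(log_file, audio_id, provider, transcript, started_at):
    # One JSON line per turn: enough to compute latency percentiles and
    # to pull transcripts for manual review and WER spot-checks.
    record = {
        "audio_id": audio_id,
        "provider": provider,
        "transcript": transcript,
        "latency_ms": round((time.time() - started_at) * 1000),
        "timestamp": time.time(),
    }
    log_file.write(json.dumps(record, ensure_ascii=False) + "\n")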

The Speko Approach to Non-English Voice AI

The five-step framework above works, but it takes time — typically 2–4 weeks from audio collection to production pilot results. For teams moving fast, that timeline is too slow. Choosing a provider without benchmarking is also unacceptable for non-English deployments, because the cost of picking the wrong provider is measured in user trust and product quality, not just dollars.

Speko automates the benchmarking framework. Upload your audio samples (or use our language-specific test sets for 12+ languages), and we run your audio through all major STT and TTS providers in parallel, returning ranked results with WER, latency, and cost for each provider. For non-English deployments, we benchmark 240+ provider combinations including language-specific configurations (dialect settings, vocabulary boosting, model variants) that are not covered in generic provider benchmarks.

The output is a ranked list of provider stacks for your specific language and audio, not a generic recommendation. If you are building for Hindi-speaking users in Mumbai, you get a stack ranked for that specific case. If you are building for Gulf Arabic, you get a stack ranked for Gulf Arabic audio with Gulf Arabic dialect features.

See the benchmark tool to run a comparison on your target language in under 15 minutes.

Non-English Provider Summary

Language | Best STT | Best TTS | Key Challenge
Japanese | Deepgram Nova-3 | ElevenLabs | Pitch accent, agglutination
Hindi | AssemblyAI Universal-3 Pro | ElevenLabs | Code-switching (Hinglish)
Arabic | AssemblyAI Universal-3 Pro | ElevenLabs | Dialect fragmentation
Thai | Google Chirp-3 | ElevenLabs | Tonal language, 5 tones
Spanish | Deepgram Nova-3 | ElevenLabs | Regional variant diversity
Portuguese (PT-BR) | Deepgram Nova-3 | ElevenLabs | PT-BR vs PT-PT distinction
Mandarin | Deepgram Nova-3 | Cartesia / ElevenLabs | Tonal language, no spaces
Korean | Deepgram Nova-3 | ElevenLabs | Honorific register

Conclusion

Non-English voice AI is genuinely harder than English voice AI. The challenges are real: tonal languages require pitch-aware models, agglutinative languages break naive word segmentation, diglossia creates a gap between training data and real speech, and code-switching is the default behavior for educated users in multilingual markets.

But harder does not mean unsolvable. The right provider stack for Thai is different from the right stack for Spanish, which is different from the right stack for Hindi. Each of those stacks exists today and performs well for production deployments. The problem is not that the technology is missing — it is that the tooling for finding the right stack for your specific language is missing.

The market opportunity is larger for non-English than for English. Every non-English market that gets a high-quality voice AI experience is a market where that product has a durable advantage — because most competitors used the English-first default and delivered a mediocre experience. Building for non-English correctly is a competitive moat disguised as a technical challenge.

Start with the right benchmarks. Benchmark specifically for your language. Use the provider that wins on your audio, not the provider that wins on English. The infrastructure to do this right exists. Use the Speko benchmark tool to find it.

