
Guide

Building Voice AI for Non-English Languages: A Developer’s Guide

Published: March 2026 · Author: Speko Research Team

Data verified: March 2026 · Next scheduled update: June 2026

The English-First Bias in Voice AI Infrastructure

The voice AI stack was built by English-speaking companies, for English-speaking markets. Deepgram trained Nova-3 primarily on English. ElevenLabs’s best voices are English. The entire ecosystem of benchmarks, tutorials, and integration guides assumes English as the baseline. If you are building for a non-English market, you are operating with second-class tooling by default.

This matters more than most developers realize. The accuracy gap between English and non-English performance is not minor. For some providers on some languages, WER can be 2–5x worse than English. A provider that delivers 4% WER on English might give you 18% WER on Hindi. That 18% error rate degrades your LLM’s ability to understand intent, and the damage compounds through the entire pipeline.

There is also a market reality that voice AI investors and most English-first vendors will not tell you: the non-English voice AI opportunity is larger than the English one. There are approximately 380 million native English speakers. There are 600 million Hindi speakers, 1.3 billion Mandarin speakers, 300 million Arabic speakers, and 500 million Spanish speakers. The market is waiting, but the infrastructure is not ready out of the box.

ElevenLabs will never recommend Cartesia for Thai. We will. This guide covers what you actually need to build production voice AI in the languages your users speak. See our multilingual benchmark dataset for the underlying numbers.

The Core Challenge: Language Diversity

The reason non-English voice AI is harder is not a resource problem. More data does not automatically fix it. The challenge is that non-English languages have fundamentally different phonological and morphological properties that English-optimized models handle poorly. Understanding why helps you make better provider decisions.

Tonal Languages (Mandarin, Thai, Vietnamese, Cantonese)

In tonal languages, pitch contour is phonemic — it distinguishes meaning. In Mandarin, the syllable “ma” means mother (flat tone), hemp (rising tone), horse (falling-rising tone), or scold (falling tone), depending on how you say it. An STT model trained primarily on English, where pitch signals emotion and emphasis but not lexical meaning, will systematically misclassify tonal distinctions.

Thai has five tones. Vietnamese has six. A missed tone in Thai can completely change a sentence’s meaning, and the error cascades from there. This is not a minor accuracy concern — it can produce nonsensical output that your LLM has no way to recover from. Google Chirp-3 is currently the strongest option for tonal languages, with architecture specifically designed to capture pitch contour features.

Agglutinative Languages (Turkish, Finnish, Korean, Japanese, Hungarian)

Agglutinative languages build complex words by chaining morphemes. In Turkish, the single word “Muvaffakiyetsizleştiremeyebileceklerimizdenmişsinizcesine” encodes a complete clause that requires a full sentence in English. Japanese verb forms attach suffixes for tense, politeness level, negation, conditionality, and aspect — all in a single word.

English STT models are optimized for short, discrete word units. Agglutinative languages produce long acoustic sequences that map to a single semantic unit. Word segmentation — where one word ends and another begins — is a hard problem for models that were not trained on these languages extensively. Deepgram Nova-3 has made meaningful progress on Japanese and Korean, with 6.2% WER on Japanese, which is competitive with purpose-built Japanese STT providers.

Diglossia (Arabic, Tamil, Modern Greek)

Diglossia is the coexistence of a formal written standard and divergent spoken dialects within the same language community. Arabic is the canonical example: Modern Standard Arabic (MSA) is used in formal writing, news, and official contexts, but no one speaks MSA as a native language. Real Arabic speakers speak Egyptian Arabic, Levantine Arabic, Gulf Arabic, Moroccan Arabic, or one of dozens of other dialects that differ substantially in pronunciation, vocabulary, and grammar.

Most Arabic STT models are trained on MSA because it is what exists in large text corpora. Deploying such a model for Egyptian users produces systematic errors. AssemblyAI Universal-3 Pro has the broadest Arabic dialect coverage of any provider we tested, though meaningful gaps remain for less common dialects.

Code-Switching (Hinglish, Spanglish, Taglish)

Real users in multilingual societies do not stay in one language. An educated Indian professional speaking Hindi will naturally insert English technical terms, entire English clauses, and English filler words into their Hindi speech. This is not a sign of limited proficiency — it is how urban, educated speakers actually communicate. The same pattern holds for US Latinos switching between Spanish and English, Filipinos switching between Filipino and English, and dozens of other populations.

Most STT models are designed to transcribe one language per audio clip. Code-switched audio that flips between Hindi and English mid-sentence creates an identity problem for the model: which language’s phonology is the right reference frame? AssemblyAI Universal-3 Pro is the only provider in our benchmark with explicit, tested support for code-switched speech. For Hinglish-speaking users, it is not merely competitive with other providers — it is categorically better.

Provider Performance by Language

The following breakdown reflects our benchmark testing of Deepgram Nova-3, AssemblyAI Universal-3 Pro, Google Chirp-3, and OpenAI Whisper v3 for STT, plus ElevenLabs and Cartesia for TTS. Numbers are verified as of March 2026. Provider recommendations reflect accuracy as the primary criterion, with latency as a tiebreaker.

Japanese

Japanese is linguistically demanding for Western STT systems: no word boundary markers in writing, three scripts (hiragana, katakana, kanji), complex honorific morphology, and heavy agglutination. Deepgram Nova-3 leads our benchmark at 6.2% WER on Japanese, significantly better than English-first competitors.

See the Japanese voice AI guide for a complete provider breakdown including TTS naturalness scores.

Hindi

Hindi is spoken by approximately 600 million people and is the primary language for voice AI deployments targeting India. The key complication for India deployments is not Hindi accuracy per se — it is code-switching. Urban Indian users speak Hinglish: Hindi sentence structure with freely inserted English words, phrases, and technical vocabulary. A model that handles pure Hindi well may fail badly on realistic Indian user audio.

See the Hindi voice AI guide for a full provider comparison including code-switching benchmarks.

Arabic

Arabic is a high-stakes language for voice AI: 300+ million native speakers, rapidly growing smartphone penetration in MENA, and strong demand for customer service and educational voice agents. The dialect fragmentation makes it one of the hardest languages to deploy correctly.

See the Arabic voice AI guide for dialect-specific benchmark data and provider recommendations.

Thai

Thai is a tonal language with five tones, complex consonant clusters, and no spaces between words in written form. It presents a combination of the challenges found in Mandarin (tones) and Japanese (word segmentation). Google Chirp-3 is the benchmark leader for Thai, with architecture specifically designed for tonal language accuracy.

Spanish

Spanish has the advantage of being a high-priority language for US-based voice AI providers. Deepgram, AssemblyAI, and ElevenLabs all have strong Spanish support. The primary consideration is variant: Castilian Spanish (Spain), Latin American Spanish, and the many regional variants within Latin America (Mexican Spanish, Argentine Spanish, Caribbean Spanish) differ in pronunciation and vocabulary.

See the Spanish voice AI guide for regional variant benchmark data.

Portuguese (PT-BR)

Brazilian Portuguese has become a high-priority language for US voice AI providers due to Brazil’s large and growing tech market. PT-BR and European Portuguese (PT-PT) are distinct enough that models trained on one perform worse on the other.

Chinese (Mandarin)

Mandarin Chinese presents the tonal language challenge at massive scale. Both Deepgram Nova-3 and Google Chirp-3 have competitive Mandarin support. The choice between them often comes down to infrastructure preferences.

Korean

Korean is an agglutinative language with a consistent phonology and a well-defined script (Hangul). The relatively consistent phonological rules make it more tractable for STT than languages with more variable pronunciation. Deepgram Nova-3 is the benchmark leader.

The 5 Mistakes Developers Make with Non-English Voice AI

These are the patterns we see most often when teams come to us after a failed non-English voice AI deployment.

Mistake 1: Testing only on clean audio

Clean audio — studio recording, native speaker, standard dialect — is what providers use in their benchmarks. It is not what your users produce. Real non-English audio has regional accents, background noise, microphone variability, and dialectal features that differ significantly from the training distribution. A Japanese provider benchmark number measured on formal Tokyo Japanese will not predict performance for Kansai-accented users calling on a noisy train.

The fix: collect 100–200 audio samples from your actual target users before choosing a provider. If you cannot collect real user audio pre-launch, at minimum use regionally diverse audio that matches your target market’s accent, background noise level, and speech style.
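
One way to keep that collection honest is to tag each clip with the variables that matter for non-English accuracy, so you can check coverage before you benchmark. A hypothetical per-sample schema in Python — the field names are illustrative, not a Speko format:

from dataclasses import dataclass

@dataclass
class AudioSample:
    path: str       # local path or URL to the audio file
    reference: str  # ground-truth transcription, used later for WER scoring
    dialect: str    # e.g. "kansai", "egyptian-arabic", "mexican-spanish"
    noise: str      # e.g. "quiet", "street", "call-center"
    device: str     # e.g. "smartphone", "headset", "landline"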

Mistake 2: Picking the English winner

Deepgram wins on English. That fact does not transfer to other languages. Provider rankings change significantly by language family. For Hindi, AssemblyAI beats Deepgram. For Thai, Google beats both. For Mandarin, the gap narrows significantly. If you are picking Deepgram for your Hindi deployment because you read a benchmark showing it wins on English, you are making a decision based on irrelevant data.

The fix: run your own benchmark for your specific target language, not a general English benchmark. Use the Speko benchmark tool to compare providers on your specific language and audio samples in minutes, not weeks.

Mistake 3: Ignoring code-switching

Code-switching is the norm, not the exception, for educated users in multilingual markets. Indian software engineers asking a voice AI about their account balance will say something like “Mera balance kya hai and can I increase my credit limit?” — Hindi sentence structure, English technical vocabulary, English relative clause. US Latino consumers will mix Spanish and English constantly within a single conversation.

STT models that are not designed for code-switching will produce systematic errors on this input: either transcribing the English words as garbled Hindi phonology, or switching language models mid-utterance with predictable accuracy loss. AssemblyAI Universal-3 Pro is currently the only provider with explicit, benchmarked code-switching support.

Mistake 4: Not testing TTS naturalness

TTS quality differences are larger and more perceptually obvious for non-English than for English. English TTS from the top providers (ElevenLabs, Cartesia, PlayHT) all sounds acceptably natural to most English listeners. The same is not true for Japanese, Arabic, or Thai: the gap between a natural-sounding TTS voice and a robotic one is immediately apparent to native speakers and directly affects trust and conversion.

Japanese TTS must get pitch accent right. Arabic TTS must handle the Hamza and various diacritics that affect pronunciation. Thai TTS must produce correct tonal patterns. A TTS provider that scores well on English MOS (Mean Opinion Score) benchmarks may score poorly on the language you care about.

The fix: run a TTS naturalness evaluation with 5–10 native speaker evaluators for your target language before committing to a TTS provider. Ask them to rate naturalness and specific linguistic features (correct tones for Mandarin/Thai, correct pitch accent for Japanese, correct dialect features for Arabic).

Mistake 5: Assuming one stack for all languages

The correct STT + TTS combination for Japanese is different from the correct combination for Arabic. There is no single “multilingual” stack that wins across all languages. Teams building products for multiple language markets need to accept that their voice infrastructure may use different providers for different language routes — Deepgram for Spanish, AssemblyAI for Hindi, Google for Thai — with routing logic that selects the appropriate provider based on the user’s language.
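
The routing logic itself can be very small. A sketch, assuming a generic transcribe() wrapper around each vendor’s SDK; the picks mirror the per-language recommendations in this guide, and you should adjust them to your own benchmark results:

# Map language codes to (provider, model) pairs per language route.
STT_ROUTE = {
    "hi": ("assemblyai", "universal-3-pro"),  # Hindi / Hinglish
    "ar": ("assemblyai", "universal-3-pro"),  # Arabic
    "th": ("google", "chirp-3"),              # Thai
    "es": ("deepgram", "nova-3"),             # Spanish
    "ja": ("deepgram", "nova-3"),             # Japanese
}

def route_stt(language_code):
    # Fall back to a default stack for languages without an explicit route.
    return STT_ROUTE.get(language_code, ("deepgram", "nova-3"))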

This adds operational complexity, but it is the correct trade-off. The alternative is accepting suboptimal accuracy across all languages to avoid the complexity, which in practice means losing users who notice that your voice AI does not understand them as well as it should.

A Practical Benchmarking Framework

If you have the time to run a manual evaluation, here is the framework we recommend. This is what Speko automates, but you can run it manually for smaller-scale evaluations.

Step 1: Collect representative audio samples

Collect 100 audio samples that represent your target users. Include:

- Regional accents that match your target market, not just the standard dialect
- Realistic background noise and a range of microphones and devices (smartphone, headset, speakerphone)
- Code-switched speech if your users mix languages; in most multilingual markets, they do
- The speech style your product will actually hear: spontaneous conversation, not read-aloud prompts

If you cannot source 100 real samples pre-launch, use public corpora appropriate to your target language. Common Voice (Mozilla) has multilingual audio with a permissive license. OpenSLR has academic corpora for many languages.

Step 2: Run transcriptions across providers

Send your 100 samples through Deepgram Nova-3, AssemblyAI Universal-3 Pro, Google Chirp-3, and OpenAI Whisper v3. Record the raw transcription output for each provider for each sample.

# Run every sample through each of the four STT providers and keep the
# raw output for scoring in Step 3. transcribe() is a thin wrapper you
# write around each vendor's SDK.
for sample in test_set:
    deepgram_output = transcribe(sample, provider="deepgram")
    assemblyai_output = transcribe(sample, provider="assemblyai")
    google_output = transcribe(sample, provider="google")
    whisper_output = transcribe(sample, provider="whisper")

Step 3: Calculate WER for each provider

Compare each provider’s output against the ground truth transcription. WER is calculated as:

WER = (Substitutions + Deletions + Insertions) / Total Words in Reference

Note: WER calculation for non-Latin scripts (Arabic, Japanese, Chinese, Hindi in Devanagari) requires appropriate tokenization. Use language-specific tokenizers rather than simple whitespace splitting.
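
A minimal WER implementation over pre-tokenized text, assuming you supply tokens from a language-appropriate tokenizer (for example MeCab for Japanese or jieba for Mandarin; plain whitespace splitting is only safe for Latin-script languages):

def wer(reference_tokens, hypothesis_tokens):
    # Standard edit-distance WER: substitutions + deletions + insertions,
    # divided by the number of reference tokens.
    r, h = reference_tokens, hypothesis_tokens
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution (or match)
                d[i - 1][j] + 1,                           # deletion
                d[i][j - 1] + 1,                           # insertion
            )
    return d[len(r)][len(h)] / max(len(r), 1)

For example, wer("mera balance kya hai".split(), "mera balance kya he".split()) returns 0.25: one substitution over four reference tokens.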

Step 4: Test TTS naturalness

Generate TTS output for 10 representative phrases from each TTS provider you are evaluating. Include:

- Phrases that exercise language-specific features: tones for Mandarin and Thai, pitch accent for Japanese, dialect features and diacritics for Arabic
- Numbers, dates, and the domain-specific technical terms your agent will actually say
- At least one long, multi-clause sentence, where unnatural prosody is most audible

Have 3–5 native speakers rate each sample on a 1–5 scale for naturalness and intelligibility. A one-day evaluation with native speaker consultants will give you more signal than any automated MOS benchmark.
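
Aggregating those ratings takes a few lines. A minimal sketch with illustrative scores — one 1–5 naturalness rating per (evaluator, phrase) pair:

from statistics import mean

# Illustrative ratings; replace with your evaluators' actual scores.
ratings = {
    "elevenlabs": [5, 4, 4, 5, 3, 4],
    "cartesia": [4, 4, 5, 4, 4, 3],
}

# Rank providers by mean naturalness, highest first.
for provider, scores in sorted(ratings.items(), key=lambda kv: -mean(kv[1])):
    print(f"{provider}: mean naturalness {mean(scores):.2f} over {len(scores)} ratings")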

Step 5: Run a 7-day production pilot with monitoring

Lab benchmarks do not fully predict production performance. Run a controlled pilot with your chosen provider stack for 7 days, with:

- Transcript logging, so you can review what real users actually said and what the model heard (a minimal sketch follows this list)
- Daily WER spot-checks on a sample of production audio, scored against human transcriptions
- Latency and error-rate monitoring at both the STT and TTS layers
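
One way to structure that per-turn log, assuming the same transcribe() wrapper as in Step 2; log_turn() and its field names are illustrative, not part of any provider SDK:

import json
import time

def log_turn(log_file, audio_id, provider, transcript, started_at):
    # One JSON line per turn: enough to compute latency percentiles and
    # to pull transcripts for manual review and WER spot-checks.
    record = {
        "audio_id": audio_id,
        "provider": provider,
        "transcript": transcript,
        "latency_ms": round((time.time() - started_at) * 1000),
        "timestamp": time.time(),
    }
    log_file.write(json.dumps(record, ensure_ascii=False) + "\n")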

The Speko Approach to Non-English Voice AI

The five-step framework above works, but it takes time — typically 2–4 weeks from audio collection to production pilot results. For teams moving fast, that timeline is too slow. Choosing a provider without benchmarking is also unacceptable for non-English deployments, because the cost of picking the wrong provider is measured in user trust and product quality, not just dollars.

Speko automates the benchmarking framework. Upload your audio samples (or use our language-specific test sets for 12+ languages), and we run your audio through all major STT and TTS providers in parallel, returning ranked results with WER, latency, and cost for each provider. For non-English deployments, we benchmark 240+ provider combinations including language-specific configurations (dialect settings, vocabulary boosting, model variants) that are not covered in generic provider benchmarks.

The output is a ranked list of provider stacks for your specific language and audio, not a generic recommendation. If you are building for Hindi-speaking users in Mumbai, you get a stack ranked for that specific case. If you are building for Gulf Arabic, you get a stack ranked for Gulf Arabic audio with Gulf Arabic dialect features.

See the benchmark tool to run a comparison on your target language in under 15 minutes.

Non-English Provider Summary

Language | Best STT | Best TTS | Key Challenge
Japanese | Deepgram Nova-3 | ElevenLabs | Pitch accent, agglutination
Hindi | AssemblyAI Universal-3 Pro | ElevenLabs | Code-switching (Hinglish)
Arabic | AssemblyAI Universal-3 Pro | ElevenLabs | Dialect fragmentation
Thai | Google Chirp-3 | ElevenLabs | Tonal language, 5 tones
Spanish | Deepgram Nova-3 | ElevenLabs | Regional variant diversity
Portuguese (PT-BR) | Deepgram Nova-3 | ElevenLabs | PT-BR vs PT-PT distinction
Mandarin | Deepgram Nova-3 | Cartesia / ElevenLabs | Tonal language, no spaces
Korean | Deepgram Nova-3 | ElevenLabs | Honorific register

Conclusion

Non-English voice AI is genuinely harder than English voice AI. The challenges are real: tonal languages require pitch-aware models, agglutinative languages break naive word segmentation, diglossia creates a gap between training data and real speech, and code-switching is the default behavior for educated users in multilingual markets.

But harder does not mean unsolvable. The right provider stack for Thai is different from the right stack for Spanish, which is different from the right stack for Hindi. Each of those stacks exists today and performs well for production deployments. The problem is not that the technology is missing — it is that the tooling for finding the right stack for your specific language is missing.

The market opportunity is larger for non-English than for English. Every non-English market that gets a high-quality voice AI experience is a market where that product has a durable advantage — because most competitors used the English-first default and delivered a mediocre experience. Building for non-English correctly is a competitive moat disguised as a technical challenge.

Start with the right benchmarks. Benchmark specifically for your language. Use the provider that wins on your audio, not the provider that wins on English. The infrastructure to do this right exists. Use the Speko benchmark tool to find it.

