Best Vietnamese Text-to-Speech API (2026): independent benchmark

Northern Vietnamese has six lexical tones, and two of them (ngã and nặng) are defined by glottalization rather than pitch. A pitch-only analysis collapses ngã into sắc and nặng into huyền, so our analyzer adds a reference-free creak axis to separate them. To our knowledge, no other public TTS benchmark measures this axis.
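
A minimal sketch of how a creak axis can separate the two collapsed pairs, assuming a pitch-only classifier has already produced one of four pitch labels and a separate creak score in [0, 1]. The names `resolveTone`, `creakScore`, and `creakThreshold`, and the 0.5 default, are hypothetical illustrations, not the benchmark's actual analyzer.

```typescript
// Tones a pitch-only analysis can output.
type PitchTone = 'ngang' | 'huyền' | 'sắc' | 'hỏi';
// Full six-tone inventory of Northern Vietnamese.
type Tone = PitchTone | 'ngã' | 'nặng';

// Pitch label -> glottalized tone it hides, per the collapse described above
// (ngã collapses into sắc, nặng collapses into huyền).
const GLOTTALIZED_COUNTERPART: Partial<Record<PitchTone, Tone>> = {
  'sắc': 'ngã',
  'huyền': 'nặng',
};

// Hypothetical resolver: upgrade the pitch label to its glottalized
// counterpart only when the creak score clears the threshold.
function resolveTone(
  pitchTone: PitchTone,
  creakScore: number,
  creakThreshold = 0.5, // assumed value, for illustration only
): Tone {
  const counterpart = GLOTTALIZED_COUNTERPART[pitchTone];
  if (counterpart !== undefined && creakScore >= creakThreshold) {
    return counterpart;
  }
  return pitchTone;
}

console.log(resolveTone('sắc', 0.9));   // creaky rising contour
console.log(resolveTone('huyền', 0.1)); // modal low-falling contour
```

With creak ignored (score 0), the function returns only the four pitch labels, which is the collapse the paragraph above describes.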

First run (2026-06-03): xAI / Grok TTS resolved 4 of 6 tones correctly, the best result in the panel. Every system tested missed at least one of the two glottalized tones.

Preliminary: n=1 per tone, isolated words, no native Hanoi rater yet. Read this as a diagnostic, not a ranking. The ElevenLabs v3 result (accuracy 0 with glottal contrast 1.079) is an isolated-word artifact; the carrier-frame v2 probe and a native rater are the next steps.

Vietnamese TTS measurements

Sorted by tone-identification accuracy. Preliminary diagnostic (n=1 per tone, isolated words): treat the confusion structure as the signal, not the order.

System | Tones resolved | Tone-ID accuracy | Glottal contrast
xAI / Grok TTS | 4/6 | 0.667 | -0.026
MiniMax Speech 2.6 HD | 3/6 | 0.5 | 0.169
GPT-4o mini TTS | 1/6 (1 unscored) | 0.2 | -0.065
ElevenLabs v3 | 0/6 | 0 | 1.079

Per-tone resolution

Each cell shows the tone the analyzer resolved for the expected tone in the column header; a cell is correct when it matches its header. The ngã and nặng columns are the glottalized tones that pitch-only analysis cannot resolve.

System | ngang | huyền | sắc | hỏi | ngã | nặng
xAI / Grok TTS | ngang | huyền | sắc | hỏi | hỏi | sắc
MiniMax Speech 2.6 HD | sắc | huyền | huyền | hỏi | huyền | nặng
GPT-4o mini TTS | huyền | huyền | huyền | huyền | hỏi | unscored
ElevenLabs v3 | sắc | ngang | ngang | huyền | ngang | ngang
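
The accuracy column can be recomputed from the per-tone rows. The sketch below assumes that an unscored cell is dropped from the denominator, which is inferred from the GPT-4o mini TTS row (1 correct out of 5 scored gives 0.2) rather than stated by the benchmark. The function name `toneIdAccuracy` and the use of `null` for an unscored cell are illustrative choices.

```typescript
// Expected tone for each column, in table order.
const EXPECTED = ['ngang', 'huyền', 'sắc', 'hỏi', 'ngã', 'nặng'];

// Resolved tones copied from the per-tone table; null marks an unscored cell.
const ROWS: { system: string; resolved: (string | null)[] }[] = [
  { system: 'xAI / Grok TTS', resolved: ['ngang', 'huyền', 'sắc', 'hỏi', 'hỏi', 'sắc'] },
  { system: 'MiniMax Speech 2.6 HD', resolved: ['sắc', 'huyền', 'huyền', 'hỏi', 'huyền', 'nặng'] },
  { system: 'GPT-4o mini TTS', resolved: ['huyền', 'huyền', 'huyền', 'huyền', 'hỏi', null] },
  { system: 'ElevenLabs v3', resolved: ['sắc', 'ngang', 'ngang', 'huyền', 'ngang', 'ngang'] },
];

// Count correct cells over scored cells (assumption: unscored cells are excluded).
function toneIdAccuracy(resolved: (string | null)[]): {
  correct: number;
  scored: number;
  accuracy: number;
} {
  let correct = 0;
  let scored = 0;
  resolved.forEach((tone, i) => {
    if (tone === null) return; // skip unscored cells
    scored += 1;
    if (tone === EXPECTED[i]) correct += 1;
  });
  return { correct, scored, accuracy: scored === 0 ? 0 : correct / scored };
}

for (const row of ROWS) {
  const { correct, scored, accuracy } = toneIdAccuracy(row.resolved);
  console.log(`${row.system}: ${correct}/${scored} scored, accuracy ${accuracy.toFixed(3)}`);
}
```

Under that assumption the four rows reproduce the published accuracies of 0.667, 0.5, 0.2, and 0.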

How we measured

Full interactive panels, audio clips, and the complete methodology: benchmarks.speko.ai

Use the best Vietnamese voice without lock-in

Speko is one API in front of every system on this page: it routes each request to the measured-best provider for your language and fails over automatically when a provider degrades. When the next run reshuffles this table, your integration does not change.

curl
curl -X POST https://api.speko.dev/v1/synthesize \
  -H "Authorization: Bearer $SPEKO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Xin chào, cảm ơn bạn đã gọi.", "intent": {"language": "vi"}}' \
  --output reply.audio
TypeScript
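import { Speko } from '@spekoai/sdk';

// Read the API key from the environment.
const speko = new Speko({ apiKey: process.env.SPEKO_API_KEY! });

// Request Vietnamese synthesis and destructure the audio, provider, and model fields from the response.
const { audio, provider, model } = await speko.synthesize('Xin chào, cảm ơn bạn đã gọi.', {
  language: 'vi',
});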
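
The routing-with-failover behavior described above can be pictured with a generic sketch: try providers in ranked order and fall through to the next one when a call fails. This is not Speko's implementation. `RankedProvider`, `synthesizeWithFailover`, and the stub providers are hypothetical names used only to illustrate the pattern.

```typescript
// Hypothetical provider shape: a name plus an async synthesis call.
interface RankedProvider {
  name: string;
  synthesize: (text: string) => Promise<Uint8Array>;
}

// Try each provider in ranked order; return the first success.
// Collects failure messages so the final error explains what was attempted.
async function synthesizeWithFailover(
  text: string,
  ranked: RankedProvider[],
): Promise<{ provider: string; audio: Uint8Array }> {
  const failures: string[] = [];
  for (const candidate of ranked) {
    try {
      const audio = await candidate.synthesize(text);
      return { provider: candidate.name, audio };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${candidate.name}: ${reason}`);
    }
  }
  throw new Error(`All providers failed: ${failures.join('; ')}`);
}

// Stub providers for illustration: the first always fails, the second succeeds.
const degraded: RankedProvider = {
  name: 'provider-a',
  synthesize: async () => {
    throw new Error('degraded');
  },
};
const healthy: RankedProvider = {
  name: 'provider-b',
  synthesize: async () => new Uint8Array([1, 2, 3]),
};

synthesizeWithFailover('Xin chào', [degraded, healthy]).then((result) => {
  console.log(`served by ${result.provider}, ${result.audio.length} bytes`);
});
```

The caller sees one function and one result shape regardless of which provider answered, which is why a change in the ranked order does not require a change in calling code.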

FAQ

What is the best Vietnamese text-to-speech API?

We do not yet publish a Vietnamese TTS ranking: the first analyzer run (2026-06-03) is n=1 per tone on isolated words with no native Hanoi rater, so it is a diagnostic, not a ranking. In that diagnostic, xAI / Grok TTS resolved the most tones (4 of 6). A carrier-frame probe and a native rater are the next steps.

Why is Vietnamese hard for text-to-speech?

Two of the six Northern Vietnamese tones, ngã and nặng, are carried by glottalization (creaky voice), not pitch. A system can hit every pitch contour and still collapse ngã into sắc and nặng into huyền, which changes word meaning.

Is ElevenLabs bad at Vietnamese?

Its 0-of-6 tone score in this run is a textbook isolated-word artifact (its glottal contrast of 1.079 is far above every other system), not a verdict on its Vietnamese. Treat it as unscored until the carrier-frame v2 probe runs.

What about Vietnamese speech-to-text?

That board is measured and ranked: ElevenLabs Scribe v2 leads our Vietnamese STT benchmark at 1.9% WER. See the Vietnamese STT page for the full table.

More language benchmarks

Best Vietnamese STT API
Best Thai TTS API
Best Filipino TTS API
STT benchmarks by language
Full interactive TTS benchmark