Best Vietnamese Text-to-Speech API (2026): independent benchmark
Northern Vietnamese has six lexical tones, and two of them (ngã and nặng) are defined by glottalization rather than pitch. A pitch-only analysis collapses ngã into sắc and nặng into huyền, so our analyzer adds a reference-free creak axis to resolve them. To our knowledge no other public TTS benchmark measures this.
First run (2026-06-03): xAI / Grok TTS resolved 4 of 6 tones correctly, the best result in the panel. Every system tested missed at least one glottalized tone: none resolved ngã, and only MiniMax Speech 2.6 HD resolved nặng.
Vietnamese TTS measurements
Sorted by tone-identification accuracy. Preliminary diagnostic (n=1 per tone, isolated words): treat the confusion structure as the signal, not the order.
| System | Tones resolved | Tone-ID accuracy | Glottal contrast |
|---|---|---|---|
| xAI / Grok TTS | 4/6 | 0.667 | -0.026 |
| MiniMax Speech 2.6 HD | 3/6 | 0.5 | 0.169 |
| GPT-4o mini TTS | 1/6 (1 unscored) | 0.2 | -0.065 |
| ElevenLabs v3 | 0/6 | 0 | 1.079 |
Per-tone resolution
Each cell shows the tone the analyzer resolved for the expected tone in the column header. A cell is correct when it matches its column header. The ngã and nặng columns are the glottalized tones that pitch-only analysis cannot resolve.
| System | ngang | huyền | sắc | hỏi | ngã | nặng |
|---|---|---|---|---|---|---|
| xAI / Grok TTS | ngang | huyền | sắc | hỏi | hỏi | sắc |
| MiniMax Speech 2.6 HD | sắc | huyền | huyền | hỏi | huyền | nặng |
| GPT-4o mini TTS | huyền | huyền | huyền | huyền | hỏi | unscored |
| ElevenLabs v3 | sắc | ngang | ngang | huyền | ngang | ngang |
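The accuracy column in the first table can be re-derived from the rows above. The following is a minimal TypeScript sketch, not the benchmark's own code: `ROWS` and `scoreRow` are hypothetical names, and it assumes an unscored cell is dropped from the denominator. That assumption is inferred from GPT-4o mini TTS scoring 0.2 (1 correct of 5 scored); the methodology does not state it.

```typescript
type Tone = 'ngang' | 'huyền' | 'sắc' | 'hỏi' | 'ngã' | 'nặng';
type Resolved = Tone | 'unscored';

// Column order of the per-tone resolution table.
const EXPECTED: Tone[] = ['ngang', 'huyền', 'sắc', 'hỏi', 'ngã', 'nặng'];

// Rows copied from the per-tone resolution table (first run, 2026-06-03).
const ROWS: { system: string; resolved: Resolved[] }[] = [
  { system: 'xAI / Grok TTS', resolved: ['ngang', 'huyền', 'sắc', 'hỏi', 'hỏi', 'sắc'] },
  { system: 'MiniMax Speech 2.6 HD', resolved: ['sắc', 'huyền', 'huyền', 'hỏi', 'huyền', 'nặng'] },
  { system: 'GPT-4o mini TTS', resolved: ['huyền', 'huyền', 'huyền', 'huyền', 'hỏi', 'unscored'] },
  { system: 'ElevenLabs v3', resolved: ['sắc', 'ngang', 'ngang', 'huyền', 'ngang', 'ngang'] },
];

interface ToneScore {
  correct: number;  // cells that match their column header
  scored: number;   // cells the analyzer could score at all
  accuracy: number; // correct / scored, rounded to 3 decimals
}

function scoreRow(resolved: Resolved[]): ToneScore {
  let correct = 0;
  let scored = 0;
  resolved.forEach((tone, i) => {
    // Assumption: unscored cells are excluded from the denominator.
    if (tone === 'unscored') return;
    scored += 1;
    if (tone === EXPECTED[i]) correct += 1;
  });
  const accuracy = scored === 0 ? 0 : Math.round((correct / scored) * 1000) / 1000;
  return { correct, scored, accuracy };
}

for (const row of ROWS) {
  const { correct, scored, accuracy } = scoreRow(row.resolved);
  console.log(`${row.system}: ${correct} correct of ${scored} scored, accuracy ${accuracy}`);
}
```

With n=1 per tone, a single cell moves a system's accuracy by roughly 0.17, which is why the table is better read as a confusion structure than as a ranking.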
How we measured
- Probe: the six-way Vietnamese minimal set (ma, mà, má, mả, mã, mạ) synthesized as isolated words, first run 2026-06-03.
- Pitch analysis: F0 contours are matched against the six Northern (Hanoi) tone templates.
- Glottalization: the ngã/sắc and nặng/huyền pairs are identical in pitch shape, so a reference-free creak index, measured relative to each system's own modal ceiling, breaks those two ties.
- Status: preliminary diagnostic. n=1 per tone, isolated words (which carry a sentence-final fall that distorts lexical tone), no native rater yet. Published as a confusion structure, not a ranking.
- Data is synced from the published run at benchmarks.speko.ai (snapshot 2026-06-05).
Full interactive panels, audio clips, and the complete methodology: benchmarks.speko.ai
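The tie-break described in the glottalization step can be sketched as follows. This is a minimal TypeScript illustration under stated assumptions, not the published analyzer: `modalCeiling` and `resolveTone` are hypothetical names, the ceiling is assumed to be the highest creak index the system produces on the four non-glottalized probes (ma, mà, má, mả), and the creak values in the usage lines are made up for illustration.

```typescript
type Tone = 'ngang' | 'huyền' | 'sắc' | 'hỏi' | 'ngã' | 'nặng';

// Assumption: the "modal ceiling" is the largest creak index observed on the
// system's own non-glottalized probes, so no external reference voice is needed.
function modalCeiling(modalCreak: number[]): number {
  if (modalCreak.length === 0) {
    throw new Error('need at least one modal-voice measurement');
  }
  return Math.max(...modalCreak);
}

// pitchTone is the tone chosen by pitch-template matching alone, which can
// only return sắc for a ngã contour and huyền for a nặng contour.
function resolveTone(pitchTone: Tone, creak: number, ceiling: number): Tone {
  const glottalized = creak > ceiling;
  if (glottalized && pitchTone === 'sắc') return 'ngã';
  if (glottalized && pitchTone === 'huyền') return 'nặng';
  return pitchTone; // other tones, or no creak above the ceiling: keep the pitch result
}

// Illustrative values only, not measurements from the run.
const ceiling = modalCeiling([0.05, 0.12, 0.08, 0.15]);
console.log(resolveTone('sắc', 0.42, ceiling));   // creak above ceiling: ngã
console.log(resolveTone('huyền', 0.1, ceiling));  // creak below ceiling: huyền
```

Anchoring the threshold to each system's own modal voice, rather than to a fixed value, keeps a generally breathy or generally creaky voice from being scored as glottalized on every tone.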
Use the best Vietnamese voice without lock-in
Speko is one API in front of every system on this page: it routes each request to the measured-best provider for your language and fails over automatically when a provider degrades. When the next run reshuffles this table, your integration does not change.
curl -X POST https://api.speko.dev/v1/synthesize \
  -H "Authorization: Bearer $SPEKO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Xin chào, cảm ơn bạn đã gọi.", "intent": {"language": "vi"}}' \
  --output reply.audio

import { Speko } from '@spekoai/sdk';

const speko = new Speko({ apiKey: process.env.SPEKO_API_KEY! });
const { audio, provider, model } = await speko.synthesize('Xin chào, cảm ơn bạn đã gọi.', {
  language: 'vi',
});

FAQ
What is the best Vietnamese text-to-speech API?
We do not yet publish a Vietnamese TTS ranking: the first analyzer run (2026-06-03) is n=1 per tone on isolated words with no native Hanoi rater, so it is a diagnostic, not a ranking. In that diagnostic, xAI / Grok TTS resolved the most tones (4 of 6). A carrier-frame probe and a native rater are the next steps.
Why is Vietnamese hard for text-to-speech?
Two of the six Northern Vietnamese tones, ngã and nặng, are carried by glottalization (creaky voice), not pitch. A system can hit every pitch contour and still collapse ngã into sắc and nặng into huyền, which changes word meaning.
Is ElevenLabs bad at Vietnamese?
Its 0-of-6 tone score in this run is a textbook isolated-word artifact (its glottal contrast of 1.079 is far above every other system's), not a verdict on its Vietnamese. Treat it as unscored until the carrier-frame v2 probe runs.
What about Vietnamese speech-to-text?
That board is measured and ranked: ElevenLabs Scribe v2 leads our Vietnamese STT benchmark at 1.9% WER. See the Vietnamese STT page for the full table.