Best Vietnamese Text-to-Speech API (2026): independent benchmark

Northern Vietnamese has six lexical tones, and two of them (ngã and nặng) are defined by glottalization rather than pitch. A pitch-only analysis collapses ngã into sắc and nặng into huyền, so our analyzer adds a reference-free creak axis to resolve them. To our knowledge no other public TTS benchmark measures this.

First run (2026-06-03): xAI / Grok TTS resolved 4 of 6 tones correctly, the best result in the panel. Every system tested struggles with at least one glottalized tone.

Preliminary: n=1 per tone, isolated words, no native Hanoi rater yet. This is a diagnostic, not a ranking. The ElevenLabs v3 result (accuracy 0 with glottal contrast 1.079) is an isolated-word artifact; a carrier-frame v2 probe and a native rater are the next steps.

Vietnamese TTS measurements

Sorted by tone-identification accuracy. Preliminary diagnostic (n=1 per tone, isolated words): treat the confusion structure as the signal, not the order.

System	Tones resolved	Tone-ID accuracy	Glottal contrast
xAI / Grok TTS	4/6	0.667	-0.026
MiniMax Speech 2.6 HD	3/6	0.5	0.169
GPT-4o mini TTS	1/6 (1 unscored)	0.2	-0.065
ElevenLabs v3	0/6	0	1.079

Per-tone resolution

Each cell shows the tone the analyzer resolved for the expected tone in the column header. A cell is correct when it matches its column header. The ngã and nặng columns are the glottalized tones that pitch-only analysis cannot resolve.

System	ngang	huyền	sắc	hỏi	ngã	nặng
xAI / Grok TTS	ngang	huyền	sắc	hỏi	hỏi	sắc
MiniMax Speech 2.6 HD	sắc	huyền	huyền	hỏi	huyền	nặng
GPT-4o mini TTS	huyền	huyền	huyền	huyền	hỏi	unscored
ElevenLabs v3	sắc	ngang	ngang	huyền	ngang	ngang
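
The accuracy column can be recomputed from the rows above. The sketch below is a minimal TypeScript illustration, not the analyzer's code: the names `EXPECTED`, `RESOLVED`, and `toneIdAccuracy` are made up for this example, and dropping unscored cells from the denominator is an assumption inferred from the published figures (GPT-4o mini TTS shows 1 correct and 0.2 accuracy, which matches 1 of 5 scored tones).

```typescript
type Tone = 'ngang' | 'huyền' | 'sắc' | 'hỏi' | 'ngã' | 'nặng';
type Resolved = Tone | 'unscored';

// Expected tones, in the column order of the per-tone table.
const EXPECTED: Tone[] = ['ngang', 'huyền', 'sắc', 'hỏi', 'ngã', 'nặng'];

// Rows copied from the per-tone table.
const RESOLVED: Record<string, Resolved[]> = {
  'xAI / Grok TTS': ['ngang', 'huyền', 'sắc', 'hỏi', 'hỏi', 'sắc'],
  'MiniMax Speech 2.6 HD': ['sắc', 'huyền', 'huyền', 'hỏi', 'huyền', 'nặng'],
  'GPT-4o mini TTS': ['huyền', 'huyền', 'huyền', 'huyền', 'hỏi', 'unscored'],
  'ElevenLabs v3': ['sắc', 'ngang', 'ngang', 'huyền', 'ngang', 'ngang'],
};

function toneIdAccuracy(resolved: Resolved[]): { correct: number; scored: number; accuracy: number } {
  let correct = 0;
  let scored = 0;
  resolved.forEach((tone, i) => {
    // Assumption: unscored cells are excluded from the denominator.
    if (tone === 'unscored') return;
    scored += 1;
    if (tone === EXPECTED[i]) correct += 1;
  });
  return { correct, scored, accuracy: scored === 0 ? 0 : correct / scored };
}

for (const [system, row] of Object.entries(RESOLVED)) {
  const { correct, scored, accuracy } = toneIdAccuracy(row);
  console.log(`${system}: ${correct} correct of ${scored} scored, accuracy ${accuracy.toFixed(3)}`);
}
```

With n=1 per tone, a single changed cell moves a system's accuracy by about 0.17, which is why the table is best read as a confusion structure and not an ordering.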

How we measured

Probe: the six-way Vietnamese minimal set (ma, mà, má, mả, mã, mạ) synthesized as isolated words, first run 2026-06-03.
Pitch analysis: F0 contours are matched against the six Northern (Hanoi) tone templates.
Glottalization: the ngã/sắc and nặng/huyền pairs share a pitch shape, so a reference-free creak index breaks each tie, measured relative to each system's own modal ceiling.
Status: preliminary diagnostic. n=1 per tone, isolated words (which carry a sentence-final fall that distorts lexical tone), no native rater yet. Published as a confusion structure, not a ranking.
Data is synced from the published run at benchmarks.speko.ai (snapshot 2026-06-05).
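
The two-stage idea (pitch template first, creak second) can be sketched as follows. This is an illustration under stated assumptions, not the published analyzer: `resolveTone`, `ToneEvidence`, `creakIndex`, and `modalCeiling` are hypothetical names, the pitch stage is assumed to return only the modal member of each pitch-identical pair, and "above the modal ceiling" is assumed to mean a simple greater-than comparison against the highest creak value that system produces on modal tones.

```typescript
type Tone = 'ngang' | 'huyền' | 'sắc' | 'hỏi' | 'ngã' | 'nặng';

// Assumption: the pitch stage can only return these four, because
// ngã/sắc and nặng/huyền are indistinguishable by pitch shape alone.
type PitchTone = 'ngang' | 'huyền' | 'sắc' | 'hỏi';

// Pairs named in the methodology: modal tone -> glottalized counterpart.
const GLOTTALIZED_COUNTERPART: Partial<Record<PitchTone, Tone>> = {
  'sắc': 'ngã',
  'huyền': 'nặng',
};

interface ToneEvidence {
  pitchMatch: PitchTone; // best-matching F0 template
  creakIndex: number;    // hypothetical creak measure for this token
}

// modalCeiling is hypothetical: the highest creak index the same system
// shows on tones expected to be modal, so no external reference is needed.
function resolveTone(evidence: ToneEvidence, modalCeiling: number): Tone {
  const glottalized = GLOTTALIZED_COUNTERPART[evidence.pitchMatch];
  if (glottalized !== undefined && evidence.creakIndex > modalCeiling) {
    return glottalized; // creak breaks the tie toward ngã or nặng
  }
  return evidence.pitchMatch; // no creak evidence: keep the pitch-only answer
}

// Same pitch match, different creak evidence (values are made up).
console.log(resolveTone({ pitchMatch: 'sắc', creakIndex: 0.8 }, 0.3)); // ngã
console.log(resolveTone({ pitchMatch: 'sắc', creakIndex: 0.1 }, 0.3)); // sắc
```

Comparing against each system's own ceiling, instead of a fixed threshold, keeps a voice that is creaky throughout from being scored as glottalized on every token.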

Full interactive panels, audio clips, and the complete methodology: benchmarks.speko.ai

Use the best Vietnamese voice without lock-in

Speko is one API in front of every system on this page: it routes each request to the measured-best provider for your language and fails over automatically when a provider degrades. When the next run reshuffles this table, your integration does not change.

curl

curl -X POST https://api.speko.dev/v1/synthesize \
  -H "Authorization: Bearer $SPEKO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Xin chào, cảm ơn bạn đã gọi.", "intent": {"language": "vi"}}' \
  --output reply.audio

TypeScript

import { Speko } from '@spekoai/sdk';

const speko = new Speko({ apiKey: process.env.SPEKO_API_KEY! });

const { audio, provider, model } = await speko.synthesize('Xin chào, cảm ơn bạn đã gọi.', {
  language: 'vi',
});


FAQ

What is the best Vietnamese text-to-speech API?

We do not yet publish a Vietnamese TTS ranking: the first analyzer run (2026-06-03) is n=1 per tone on isolated words with no native Hanoi rater, so it is a diagnostic, not a ranking. In that diagnostic, xAI / Grok TTS resolved the most tones (4 of 6). A carrier-frame probe and a native rater are the next steps.

Why is Vietnamese hard for text-to-speech?

Two of the six Northern Vietnamese tones, ngã and nặng, are carried by glottalization (creaky voice), not pitch. A system can hit every pitch contour and still collapse ngã into sắc and nặng into huyền, which changes word meaning.

Is ElevenLabs bad at Vietnamese?

Its 0-of-6 tone score in this run is a textbook isolated-word artifact (its glottal contrast of 1.079 is far above that of every other system), not a verdict on its Vietnamese. Treat it as unscored until the carrier-frame v2 probe runs.

What about Vietnamese speech-to-text?

That board is measured and ranked: ElevenLabs Scribe v2 leads our Vietnamese STT benchmark at 1.9% WER. See the Vietnamese STT page for the full table.
