Best Vietnamese Speech-to-Text API (2026): independent benchmark
Vietnamese speech-to-text accuracy, measured. 30 Vietnamese read-speech clips from FLEURS, every provider routed through the same gateway and scored identically on 2026-06-03. Lower WER is better.
ElevenLabs Scribe v2 posted the lowest Vietnamese WER at 1.9%, ahead of Alibaba Qwen3-ASR at 2.4%. Two of the six providers on our English board do not support Vietnamese at all, and the supported field spreads from 1.9% to 4.7%, so picking a provider by its English score alone is a mistake.
Vietnamese STT leaderboard
30 Vietnamese clips (FLEURS), measured 2026-06-03 through the Speko gateway, loudness-normalized to -16 LUFS. WER measured on Vietnamese; latency and list price from the same gateway setup on the English board. Lower is better.
| Provider / model | WER | p50 latency | List price |
|---|---|---|---|
| ElevenLabs Scribe v2 | 1.9% | 1,353 ms | $0.0067/min |
| Alibaba Qwen3-ASR | 2.4% | 2,195 ms | - |
| OpenAI GPT-4o Transcribe | 2.5% | 1,084 ms | $0.006/min |
| xAI Grok STT | 4.7% | 996 ms | - |
| Cartesia Ink-2 | does not support | - | - |
| Gradium | does not support | - | - |
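One way to read the table programmatically is to drop unsupported providers, optionally apply a latency budget, and sort by WER. The sketch below is only an illustration of that reading: `BoardRow`, `vietnameseBoard`, and `rankProviders` are hypothetical names, not part of any Speko API, and the latency figures come from the English board, so a latency filter is approximate for Vietnamese audio.

```typescript
// Hypothetical row shape for the leaderboard above (not a Speko type).
interface BoardRow {
  provider: string;
  werPct: number | null;       // null = marked "does not support"
  p50LatencyMs: number | null; // null = no value published in the table
}

// Values copied from the Vietnamese leaderboard table.
const vietnameseBoard: BoardRow[] = [
  { provider: 'ElevenLabs Scribe v2', werPct: 1.9, p50LatencyMs: 1353 },
  { provider: 'Alibaba Qwen3-ASR', werPct: 2.4, p50LatencyMs: 2195 },
  { provider: 'OpenAI GPT-4o Transcribe', werPct: 2.5, p50LatencyMs: 1084 },
  { provider: 'xAI Grok STT', werPct: 4.7, p50LatencyMs: 996 },
  { provider: 'Cartesia Ink-2', werPct: null, p50LatencyMs: null },
  { provider: 'Gradium', werPct: null, p50LatencyMs: null },
];

// Return supported providers within the latency budget, lowest WER first.
function rankProviders(
  board: BoardRow[],
  maxP50LatencyMs: number = Infinity,
): string[] {
  return board
    .filter((row) => row.werPct !== null)
    .filter(
      (row) => row.p50LatencyMs !== null && row.p50LatencyMs <= maxP50LatencyMs,
    )
    .sort((a, b) => (a.werPct as number) - (b.werPct as number))
    .map((row) => row.provider);
}

console.log(rankProviders(vietnameseBoard));
console.log(rankProviders(vietnameseBoard, 1100));
```

With no budget, the ranking follows the WER column. With a 1,100 ms budget, only the two fastest providers remain, which shows how accuracy and latency can pull in different directions.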
How we measured
- Dataset: 30 Vietnamese read-speech clips from FLEURS (30 clips per language across all languages benchmarked).
- Scoring: mean word error rate (WER), lower is better. Audio is loudness-normalized to -16 LUFS before scoring so input-gain handling does not contaminate the accuracy column.
- Every provider is measured the same way: through the Speko gateway (POST /v1/transcribe, provider pinned), from a single location.
- Latency and list price columns come from the same gateway setup measured on the English board (n=50); the WER column is measured on Vietnamese audio.
- Run date: 2026-06-03.
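For readers unfamiliar with the metric, the scoring step can be sketched as a word-level edit distance divided by the reference length, averaged over clips. This is a minimal sketch, not the benchmark's scoring code: the text normalization in `tokenize` (lowercasing, stripping a few ASCII punctuation marks) and the choice to average per clip rather than pool errors across the corpus are assumptions, and all function names are hypothetical.

```typescript
// Assumed normalization: NFC, lowercase, strip basic punctuation, split on whitespace.
function tokenize(text: string): string[] {
  return text
    .normalize('NFC')
    .toLowerCase()
    .replace(/[.,!?;:"()]/g, '')
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Word-level Levenshtein distance (substitutions, deletions, insertions).
function wordEditDistance(ref: string[], hyp: string[]): number {
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;
      const insertion = curr[j - 1] + 1;
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length];
}

// WER for one clip: edit distance divided by the number of reference words.
function wer(reference: string, hypothesis: string): number {
  const refWords = tokenize(reference);
  if (refWords.length === 0) {
    throw new Error('reference transcript is empty');
  }
  return wordEditDistance(refWords, tokenize(hypothesis)) / refWords.length;
}

// Mean WER over clips (per-clip average; an assumption about "mean WER").
function meanWer(pairs: { reference: string; hypothesis: string }[]): number {
  if (pairs.length === 0) {
    throw new Error('no clips to score');
  }
  const total = pairs.reduce((sum, p) => sum + wer(p.reference, p.hypothesis), 0);
  return total / pairs.length;
}
```

For example, a four-word reference transcribed with one word dropped scores 0.25 (25% WER). Because WER counts insertions too, a single clip can exceed 100%.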
Full interactive table, every territory, and the complete methodology: benchmarks.speko.ai
Use the winner without lock-in
The best Vietnamese provider today is one benchmark run away from being second best. Speko is one API in front of every provider on this table: it routes each request to the measured-best provider for your language and fails over automatically when a provider degrades. No per-vendor integration, no migration when the leaderboard flips.
curl -X POST https://api.speko.dev/v1/transcribe \
  -H "Authorization: Bearer $SPEKO_API_KEY" \
  -H "Content-Type: audio/wav" \
  -H "x-speko-intent: {\"language\":\"vi\"}" \
  --data-binary @call.wav

import { Speko } from '@spekoai/sdk';
import { readFile } from 'node:fs/promises';

const speko = new Speko({ apiKey: process.env.SPEKO_API_KEY! });
const audio = await readFile('./call.wav');
const { text, provider, confidence } = await speko.transcribe(audio, {
  language: 'vi',
});

FAQ
What is the most accurate Vietnamese speech-to-text API?
On Speko's 2026-06-03 FLEURS benchmark (30 Vietnamese clips), ElevenLabs Scribe v2 posted the lowest WER at 1.9%, followed by Alibaba Qwen3-ASR at 2.4%.
Does ElevenLabs support Vietnamese speech-to-text?
Yes. ElevenLabs Scribe v2 scored 1.9% WER on our Vietnamese run, the best result on the board.
Which providers do not support Vietnamese transcription?
Cartesia Ink-2 and Gradium are English-only on our board: on Vietnamese input they return text in the wrong script (roughly 76-100% error), so we mark them "does not support" instead of publishing a misleading number.
How was Vietnamese STT accuracy measured?
We took 30 Vietnamese read-speech clips from FLEURS, loudness-normalized them to -16 LUFS, sent them through the Speko gateway with the provider pinned, and scored the output as word error rate on 2026-06-03. Support is checked first: a provider is only benchmarked on a language it actually transcribes in the native script.
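The loudness-normalization step amounts to applying a fixed gain that moves each clip's measured loudness to the -16 LUFS target. The sketch below shows only the gain arithmetic, under the assumption that the clip's integrated loudness has already been measured elsewhere (for example with an ITU-R BS.1770 meter). The function names are hypothetical, and the hard clamp to [-1, 1] is a simplification rather than a description of the benchmark's pipeline.

```typescript
// Target loudness stated in the methodology.
const TARGET_LUFS = -16;

// Gain in dB needed to move a clip from its measured loudness to the target.
function gainDb(measuredLufs: number, targetLufs: number = TARGET_LUFS): number {
  return targetLufs - measuredLufs;
}

// Convert a dB gain to a linear amplitude multiplier.
function dbToLinear(db: number): number {
  return Math.pow(10, db / 20);
}

// Apply the gain to float PCM samples in [-1, 1].
// Assumption: measuredLufs was computed by an external loudness meter.
function normalizeLoudness(
  samples: Float32Array,
  measuredLufs: number,
  targetLufs: number = TARGET_LUFS,
): Float32Array {
  const gain = dbToLinear(gainDb(measuredLufs, targetLufs));
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Hard clamp as a simple guard against clipping (a simplification).
    out[i] = Math.max(-1, Math.min(1, samples[i] * gain));
  }
  return out;
}
```

A clip measured at -22 LUFS needs +6 dB, roughly a 2x amplitude multiplier, while a clip already at -16 LUFS passes through unchanged.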