Speech-to-Text Benchmark
12 providers · independently tested · April 2026
| Provider | Clean WER | Noisy WER | Latency | Cost/min | Languages | Source |
|---|---|---|---|---|---|---|
| Deepgram Nova-3 | 7.8% | 10.4% | 2.4s | $0.0043 | 36 | Deepgram |
| ElevenLabs Scribe v2 Realtime | 3.5% | 5.1% | 1.0s | $0.0067 | 99 | ElevenLabs API |
| ElevenLabs Scribe v1 | 5.4% | 6.9% | 1.1s | $0.0067 | 99 | ElevenLabs API |
| Alibaba qwen3-asr-flash | 3.5% | 5.4% | 0.6s | $0.0021 | 90 | Alibaba DashScope Qwen3-ASR-Flash |
| OpenAI gpt-4o-transcribe | 19.4% | 33.7% | 1.9s | $0.0060 | 50 | OpenAI API pricing |
| OpenAI gpt-4o-mini-transcribe | 8.2% | 17.1% | 2.0s | $0.0030 | 50 | OpenAI API pricing |
| OpenAI whisper-1 | 11.6% | 12.3% | 2.6s | $0.0060 | 57 | OpenAI API pricing |
| xAI Grok STT | 16.75% | 16.1% | 0.9s | $0.0017 | 25 | xAI Grok STT and TTS APIs announcement |
| AssemblyAI Universal-2 | 6.22% | 9.5% | 3.7s | $0.0062 | 99 | AssemblyAI |
| AssemblyAI Universal-3 Pro | 5.06% | 7.9% | 4.2s | $0.0067 | 6 | AssemblyAI |
| Google Cloud Chirp 2 | 5.37% | 17.9% | 4.5s | $0.0240 | 125 | Google Cloud Speech-to-Text v2 |
| Google Gemini 2.5 Flash (STT) | 6.03% | 18.9% | 2.5s | $0.0002 | 100 | Gemini API |
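The WER figures above can be reproduced in principle with the standard word error rate formula: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of that computation, not the benchmark's exact scorer (which likely also normalizes casing and punctuation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```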
Noise Robustness
How accuracy holds up under pressure.
Real-world audio is noisy. We tested each provider across five noise conditions to measure how far accuracy degrades from the clean baseline.
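A common way to construct such noisy conditions is to mix background noise into clean speech at a fixed signal-to-noise ratio. The exact conditions used by this benchmark are not specified here; the following NumPy sketch shows the usual SNR-controlled mixing:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve for the noise power that yields snr_db = 10*log10(p_speech / p_target).
    p_target = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(p_target / p_noise)

# Example: clean tone + white noise at 10 dB SNR.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 100.0, 16000))
noisy = mix_at_snr(clean, rng.standard_normal(8000), snr_db=10.0)
```

Sweeping `snr_db` from high to low values then yields progressively harder test conditions.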
Methodology
- 165 API calls
- 58 minutes tested
- $0.89 total cost
- 3 providers benchmarked
STT accuracy, latency, noise robustness, and multi-language performance were measured with the Speko Bench CLI (speko-bench-cli v0.1.0) on the LibriSpeech test-clean and Google FLEURS datasets. Pricing is taken from official provider pages.
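The total-cost figure follows from the per-minute rates in the table times the audio minutes sent to each provider. A back-of-envelope sketch, under the simplifying (and hypothetical) assumption that a provider transcribed all 58 tested minutes:

```python
# Per-minute rates copied from the benchmark table above.
minutes_tested = 58
cost_per_min = {
    "Deepgram Nova-3": 0.0043,
    "Alibaba qwen3-asr-flash": 0.0021,
    "Google Gemini 2.5 Flash (STT)": 0.0002,
}

# Estimated spend if one provider handled every tested minute.
for provider, rate in cost_per_min.items():
    print(f"{provider}: ${rate * minutes_tested:.2f}")
```

In practice the minutes are split across providers and conditions, which is why the whole run stayed under a dollar.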