Most Developers Pick Deepgram by Default. They Should Not.
Speech-to-text is the first component in any cascaded voice AI pipeline, and it sets the ceiling for everything downstream. A 10% word error rate at the STT layer means roughly one in every ten words reaching your LLM is wrong. Every hallucination, every misunderstanding, every failed intent classification: trace it back far enough and you will often find a bad transcription at the root.
Despite this, most developers choose their STT provider the same way they choose a cloud region: pick the default and move on. For STT, the default is almost always Deepgram. Deepgram has dominant mindshare in the developer community, excellent documentation, and a well-designed API. Those are real advantages. But they are not the same as being the best provider for your specific audio, language, and use case.
This post is the benchmark data we wish existed when we started building voice infrastructure. We tested Deepgram, AssemblyAI, OpenAI Whisper, Google Chirp-3, Azure Fast Transcription, and Groq Whisper across accuracy, latency, cost, and language support. The results are more nuanced than any single provider’s marketing will tell you. See our full voice AI benchmark dataset for the underlying numbers.
The Contenders
Six providers are worth evaluating for production voice AI in 2026. Here is what each brings to the table.
Deepgram (Nova-3)
Deepgram is the benchmark leader for English accuracy and the most developer-friendly API in the space. Nova-3 is their current flagship transcription model, improving on Nova-2 with better handling of specialized vocabulary, medical and legal terminology, and noisy call center audio. Their streaming implementation is mature, with production-ready WebSocket infrastructure and client SDKs in every major language. Nova-2 remains available at a lower price point for cost-sensitive deployments. Deepgram also offers Aura-2 for TTS if you want to keep your voice stack on a single vendor.
AssemblyAI (Universal-3 Pro)
AssemblyAI’s Universal-3 Pro is the most rigorously benchmarked STT model available, tested across 26 diverse datasets including noisy environments, accented speech, and technical domain audio. Their streaming product is competitive and their feature surface goes beyond raw transcription: sentiment analysis, auto chapters, speaker diarization, and PII redaction are all built in. Universal-3 Pro has the strongest story for diverse real-world audio, particularly for non-English and code-switched speech. See our Deepgram vs AssemblyAI head-to-head benchmark for a detailed breakdown.
OpenAI Whisper (v3, Whisper-1 API)
OpenAI’s Whisper models are the gold standard for language coverage, supporting 99 languages; no other provider comes close. Whisper large-v3 achieves strong accuracy on English and most major world languages. The limitation is architecture: Whisper is a batch transcription model with no native streaming. The Whisper-1 API endpoint adds 2–8 seconds of latency because it waits for a complete audio segment before transcribing. For real-time voice agents, this disqualifies Whisper as a primary STT layer. For batch transcription, post-call analytics, and language-diverse use cases, it remains the default choice.
Google STT (Chirp-3, Cloud Speech)
Google Chirp-3 is Google’s latest STT model, part of the Gemini ecosystem. It has strong tonal language performance (particularly Mandarin, Thai, and Vietnamese) and broad global language coverage. If your infrastructure already lives in GCP, Google STT integrates cleanly with IAM, audit logging, and the rest of the Google Cloud stack. The per-minute cost is higher than Deepgram and AssemblyAI, which makes it harder to justify unless GCP integration is a hard requirement.
Azure Speech (Fast Transcription)
Azure Fast Transcription is Microsoft’s optimized STT endpoint, distinct from their standard batch transcription service. It delivers lower latency than standard Azure Speech while maintaining accuracy comparable to Google Chirp-3. The primary reason to choose Azure STT is Microsoft ecosystem integration: if your compliance stack, identity management, or data residency requirements tie you to Azure, this is your path of least resistance.
Groq (Whisper-Large-v3-turbo)
Groq runs Whisper large-v3-turbo on their custom LPU (Language Processing Unit) hardware, achieving transcription latency under 300ms for most audio. This is not a new model; it is standard Whisper running on faster silicon, so accuracy lands close to OpenAI-hosted Whisper v3 (~8.5% vs. 8.1% clean WER in our tests). The value proposition is purely latency: if you need Whisper-quality multilingual transcription with sub-300ms response times, Groq is currently the only option. The tradeoff is higher WER compared to purpose-built streaming providers like Deepgram and AssemblyAI.
Accuracy Rankings (Word Error Rate)
Word Error Rate (WER) measures the percentage of words transcribed incorrectly. Lower is better. We report two WER figures: clean audio (studio recording, minimal noise, native speaker) and diverse audio (noisy environments, accented speech, domain-specific vocabulary). The gap between these two numbers tells you how well a provider handles real-world conditions.
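For reference, WER is conventionally computed as the word-level Levenshtein edit distance (substitutions + deletions + insertions) divided by the number of reference words. Here is a minimal sketch, assuming whitespace tokenization, case-insensitive matching, and a non-empty reference; production scorers also normalize punctuation and numbers before comparing.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick browns fox"))  # 1 substitution / 4 words = 0.25
```

Note that a single dropped keyword (a product name, a negation) can matter far more than the aggregate percentage suggests, which is why domain-specific benchmarking beats headline WER.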
| Provider | Model | Clean WER | Diverse WER | Multilingual |
|---|---|---|---|---|
| Deepgram | Nova-3 | 4.1% | ~8.0% | Strong (30+ languages) |
| AssemblyAI | Universal-3 Pro | 5.9% | 5.9% | Strong (code-switching) |
| Google | Chirp-3 | 7.2% | ~10.0% | Excellent (tonal languages) |
| Azure | Fast Transcription | 7.8% | ~11.5% | Good (100+ languages) |
| OpenAI | Whisper v3 | 8.1% | ~9.5% | Best (99 languages) |
| Groq | Whisper-Large-v3-turbo | ~8.5% | ~10.5% | Good (same as Whisper v3) |
The most interesting number in this table is AssemblyAI’s: a 5.9% WER that stays flat from clean to diverse audio. Most providers show a significant accuracy drop when conditions get harder. AssemblyAI Universal-3 Pro was benchmarked specifically across 26 diverse datasets that include noise, accents, and domain-specific vocabulary. Its flat accuracy curve is the reason it outperforms Deepgram on diverse audio even though Deepgram has a lower clean audio WER.
Deepgram Nova-3’s 4.1% clean WER is the best in class for clean, English-first audio. The ~8% diverse WER is still competitive. For call center audio with controlled microphones and trained agents, Deepgram Nova-3 is the strongest choice.
Latency Rankings
For real-time voice agents, STT latency directly determines how fast your agent can begin processing a user’s utterance. We measure time-to-first-token (TTFT) for streaming providers and total transcription time for batch providers. All latency figures are measured at P50 with typical voice agent payloads (5–15 second utterances).
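Reporting at P50 means taking the median over repeated measurements rather than a single run. A minimal sketch, with hypothetical TTFT samples standing in for real measurements:

```python
import statistics

def p50(samples_ms: list[float]) -> float:
    """Median (P50) latency over a batch of samples, in milliseconds."""
    return statistics.median(samples_ms)

# Hypothetical TTFT samples from repeated streaming requests to one provider.
ttft_samples = [238, 251, 229, 244, 262, 240, 235]
print(f"P50 TTFT: {p50(ttft_samples)} ms")  # P50 TTFT: 240 ms
```

P50 hides tail behavior, so it is worth tracking P95 alongside it for a voice agent, where occasional slow turns are what users actually notice.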
| Provider | Model | Streaming Latency | Non-Streaming |
|---|---|---|---|
| Groq | Whisper-Large-v3-turbo | ~200ms | <300ms |
| Deepgram | Nova-3 | ~240ms | ~400ms |
| AssemblyAI | Universal-3 Pro | ~280ms | ~500ms |
| Google | Chirp-3 | ~350ms | ~600ms |
| Azure | Fast Transcription | ~380ms | ~650ms |
| OpenAI | Whisper-1 API | N/A (batch only) | 2–8s |
Groq’s ~200ms streaming latency is real and meaningful. The LPU hardware delivers lower inference latency than GPU-based providers across the board. The tradeoff is accuracy: Groq Whisper runs at ~8.5% WER versus Deepgram’s 4.1%. For use cases where latency matters more than transcription accuracy — think live captioning or real-time command detection — Groq is worth evaluating.
OpenAI’s 2–8 second latency is a hard disqualifier for real-time voice agents. The Whisper-1 API is a batch endpoint masquerading as a real-time service. Do not use it in a latency-sensitive pipeline. Use it for post-call transcription, where its language coverage advantage actually matters.
Cost Comparison
STT cost is typically measured per minute of audio processed. At the scale of a production voice agent (thousands of minutes per day), these differences compound quickly. The delta between the cheapest and most expensive provider in this table is nearly 4x: significant, but far smaller than the spread on the TTS or LLM side of the stack.
For a complete breakdown of total voice stack costs, see The Real Cost of Voice AI in 2026.
| Provider | Model | Price / min | Price / hour |
|---|---|---|---|
| Deepgram | Nova-3 | $0.0043 | $0.258 |
| AssemblyAI | Universal-3 Pro | $0.0065 | $0.390 |
| OpenAI | Whisper-1 API | $0.006 | $0.360 |
| Groq | Whisper-Large-v3-turbo | Varies | Varies |
| Azure | Fast Transcription | $0.012 | $0.720 |
| Google | Chirp-3 | $0.016 | $0.960 |
Deepgram’s $0.0043/min is the lowest list price in the STT market for a production-grade streaming provider. At 10,000 minutes per day (a moderate production volume), that is $43/day versus $65/day for AssemblyAI or $160/day for Google. The difference is meaningful at scale, but not at the early stage where you should be optimizing for accuracy and developer experience first.
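The per-day arithmetic is easy to reproduce. A small sketch using the list prices from the table above (the provider keys are our own labels, not vendor identifiers):

```python
# List prices per minute of audio (USD), from the table above.
PRICE_PER_MIN = {
    "deepgram_nova3": 0.0043,
    "assemblyai_universal3_pro": 0.0065,
    "openai_whisper1": 0.006,
    "azure_fast_transcription": 0.012,
    "google_chirp3": 0.016,
}

def daily_cost(provider: str, minutes_per_day: int) -> float:
    """Daily STT spend at a given audio volume."""
    return PRICE_PER_MIN[provider] * minutes_per_day

for p in ("deepgram_nova3", "assemblyai_universal3_pro", "google_chirp3"):
    print(f"{p}: ${daily_cost(p, 10_000):.2f}/day")  # $43.00, $65.00, $160.00
```

At 10,000 minutes/day the annual gap between Deepgram and Google is over $40,000, which is the kind of number that justifies a migration but rarely justifies picking the wrong provider for your audio on day one.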
Google and Azure premium pricing reflects ecosystem integration costs, not raw model quality. If you are paying for Google STT, you are essentially paying for GCP-native integration and SLA guarantees from a hyperscaler. For most independent voice AI deployments, that premium is hard to justify.
When to Use Each Provider
The “best” STT provider depends on what you are optimizing for. Here is a direct decision framework.
Use Deepgram when:
- You are building a real-time voice agent with English as the primary language. Nova-3’s 4.1% WER at ~240ms streaming latency is the best accuracy-latency combination available.
- Cost is a priority. At $0.0043/min, Deepgram is the most cost-efficient streaming STT provider. This advantage compounds significantly at scale.
- Your audio comes from controlled environments: call center headsets, phone lines, or professional microphones. This is where Deepgram’s clean-audio WER advantage is most pronounced.
- You need specialized vocabulary. Nova-3’s keyword boosting feature lets you weight domain-specific terms (medical procedures, product names, legal terminology) to improve transcription accuracy.
Explore the Deepgram integration guide for setup instructions and benchmark data for your specific use case.
Use AssemblyAI when:
- Your audio is diverse: user-generated content, consumer microphones, multi-accent environments, or noisy locations. Universal-3 Pro’s flat accuracy curve across 26 diverse datasets makes it the safest bet for unpredictable audio.
- You need more than transcription. AssemblyAI’s built-in sentiment analysis, speaker diarization, auto chapters, and PII redaction reduce the number of downstream processing steps.
- Your use case involves code-switching (users mixing two languages in the same utterance). Universal-3 Pro has the best handling of code-switched speech among the providers we tested.
- You are processing Hindi, Arabic, or other languages where Deepgram’s advantage shrinks. AssemblyAI’s Universal-3 Pro leads on several non-English languages.
Explore the AssemblyAI integration guide for setup instructions and benchmark data.
Use OpenAI Whisper when:
- You need maximum language coverage. 99 languages, with strong accuracy even on low-resource languages, is Whisper’s defining advantage. No other provider is close.
- Your workload is batch transcription: post-call analysis, meeting notes, podcast transcription. The 2–8 second latency is irrelevant for async workflows.
- You are already deeply in the OpenAI ecosystem and want a single vendor for LLM + STT.
Use Google Chirp-3 when:
- You are already running on GCP and need seamless IAM, VPC, and audit log integration.
- Your primary language is a tonal Asian language (Mandarin, Thai, Vietnamese). Google’s tonal language handling is the strongest in this category.
- Enterprise SLAs and Google-backed support are requirements your procurement team needs to satisfy.
Use Azure Fast Transcription when:
- Your infrastructure is Azure-first and switching to a non-Microsoft STT provider requires security reviews, procurement cycles, or compliance approvals you want to avoid.
- You need Microsoft’s enterprise SLA, data residency commitments, and Entra ID integration.
- You are building on Microsoft Teams, Azure Communication Services, or other Microsoft collaboration products.
Use Groq when:
- Latency is your absolute first priority and you can accept the accuracy tradeoff. Groq’s ~200ms streaming latency is real.
- You need Whisper-quality multilingual transcription at near-real-time speeds, for live captioning, real-time translation pipelines, or command detection systems.
- You are already using Groq for LLM inference and want to consolidate on a single vendor for both STT and generation latency optimization.
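To make the framework concrete, here is one way to encode it as a routing function. This is an illustrative sketch of the guidance above, not an API from any of these vendors; the provider labels are our own, and language-specific routing (a separate dimension, covered in the next section) is deliberately left out.

```python
def pick_stt_provider(realtime: bool,
                      diverse_audio: bool = False,
                      latency_critical: bool = False,
                      needs_max_languages: bool = False) -> str:
    """Rough encoding of the decision framework above. Illustrative only;
    benchmark providers against your own audio before committing."""
    if needs_max_languages and not realtime:
        return "openai-whisper-v3"            # 99 languages, batch only
    if latency_critical:
        return "groq-whisper-large-v3-turbo"  # ~200ms streaming, higher WER
    if diverse_audio:
        return "assemblyai-universal-3-pro"   # flat WER curve on hard audio
    if realtime:
        return "deepgram-nova-3"              # best accuracy/latency/cost mix
    return "openai-whisper-v3"                # batch/async default

print(pick_stt_provider(realtime=True))  # deepgram-nova-3
```

In practice teams often run two providers: a streaming provider for the live turn and a batch provider (typically Whisper) for post-call re-transcription and analytics.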
Language Support Comparison
Provider rankings shift significantly when you move beyond English. The provider with the best English WER is often not the right choice for non-English deployments. Here is where each provider leads by language family.
| Language | Top STT Pick | Notes |
|---|---|---|
| English | Deepgram Nova-3 | 4.1% WER, best streaming latency |
| Japanese | Deepgram Nova-3 | 6.2% WER, strong on agglutinative morphology |
| Spanish | Deepgram Nova-3 | Strong; PT-BR comparable performance |
| Portuguese (PT-BR) | Deepgram Nova-3 | Comparable to Spanish performance |
| Chinese (Mandarin) | Deepgram Nova-3 | Google Chirp-3 competitive; tonal handling matters |
| Korean | Deepgram Nova-3 | Good agglutinative handling |
| Hindi | AssemblyAI Universal-3 Pro | Best code-switching (Hinglish) support |
| Arabic | AssemblyAI Universal-3 Pro | Dialect coverage limited but leads the field |
| Thai | Google Chirp-3 | Best tonal language support; critical for Thai |
| 99+ languages | OpenAI Whisper v3 | Broadest coverage; batch only |
The language-specific rankings illustrate why the “just use Deepgram” default is dangerous for multilingual products. Deepgram leads on most Western European and East Asian languages, but AssemblyAI leads for South Asian and Semitic languages. Google leads for Southeast Asian tonal languages. No single provider dominates across all language families.
If you are building a multilingual product, you should be benchmarking each provider against your specific target languages, not relying on aggregate WER figures. See the Deepgram vs AssemblyAI per-language benchmark for language-specific accuracy data.
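The table above can also be encoded as a simple lookup with Whisper as the fallback, which is roughly how a multilingual product might route transcription requests by detected language. The language codes and provider labels here are our own illustrative choices, not vendor identifiers.

```python
# Per-language top picks, transcribed from the table above.
TOP_STT_BY_LANGUAGE = {
    "en": "deepgram-nova-3",
    "ja": "deepgram-nova-3",
    "es": "deepgram-nova-3",
    "pt-BR": "deepgram-nova-3",
    "zh": "deepgram-nova-3",
    "ko": "deepgram-nova-3",
    "hi": "assemblyai-universal-3-pro",
    "ar": "assemblyai-universal-3-pro",
    "th": "google-chirp-3",
}

def top_stt(lang: str) -> str:
    # Whisper's 99-language coverage makes it the natural fallback
    # for languages outside the benchmarked set (batch workloads only).
    return TOP_STT_BY_LANGUAGE.get(lang, "openai-whisper-v3")

print(top_stt("hi"))  # assemblyai-universal-3-pro
```

A static map like this is only a starting point; it should be replaced by whatever your own per-language benchmarks show for your audio.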
The Bottom Line
Three providers dominate for different reasons:
- Deepgram Nova-3 wins on English accuracy and cost. If you are building English-first, real-time voice agents at any meaningful scale, Deepgram is the default answer.
- AssemblyAI Universal-3 Pro wins on diverse audio and feature richness. If your audio comes from the real world (consumer devices, noisy environments, multi-accent users), Universal-3 Pro’s flat accuracy curve is the right choice.
- OpenAI Whisper v3 wins on language coverage. 99 languages, best-in-class for low-resource languages, but batch-only. Use it for async workloads or post-call analytics.
Google and Azure are valid choices for teams inside those ecosystems. Groq is valid when latency is the absolute constraint. None of them is the right default for a greenfield voice agent deployment.
The optimal choice depends on your specific audio: what language, what environment, what domain vocabulary, what latency target. Speko benchmarks all six providers against your audio in 15 minutes so you do not have to make this decision on marketing claims. See our benchmark tool to run a comparison on your own data.
Sources
All data verified as of March 2026. Next scheduled verification: June 2026.