Most Developers Pick Deepgram by Default. They Should Not.
Speech-to-text is the first component in any cascaded voice AI pipeline, and it sets the ceiling for everything downstream. A 10% word error rate at the STT layer means roughly one in every ten words reaching your LLM is wrong. Every hallucination, every misunderstanding, every failed intent classification: trace it back far enough and you will often find a bad transcription at the root.
Despite this, most developers choose their STT provider the same way they choose a cloud region: pick the default and move on. For STT, the default is almost always Deepgram. Deepgram has dominant mindshare in the developer community, excellent documentation, and a well-designed API. Those are real advantages. But they are not the same as being the best provider for your specific audio, language, and use case.
This post is the benchmark data we wish existed when we started building voice infrastructure. We tested Deepgram, AssemblyAI, OpenAI Whisper, Google Chirp-3, Azure Fast Transcription, and Groq Whisper across accuracy, latency, cost, and language support. The results are more nuanced than any single provider’s marketing will tell you. See our full voice AI benchmark dataset for the underlying numbers.
The Contenders
Six providers are worth evaluating for production voice AI in 2026. Here is what each brings to the table.
Deepgram (Nova-3)
Deepgram is the benchmark leader for English accuracy and the most developer-friendly API in the space. Nova-3 is their current flagship transcription model, improving on Nova-2 with better handling of specialized vocabulary, medical and legal terminology, and noisy call center audio. Their streaming implementation is mature, with production-ready WebSocket infrastructure and client SDKs in every major language. Nova-2 remains available at a lower price point for cost-sensitive deployments. Deepgram also offers Aura-2 for TTS if you want to keep your voice stack on a single vendor.
AssemblyAI (Universal-3 Pro)
AssemblyAI’s Universal-3 Pro is the most rigorously benchmarked STT model available, tested across 26 diverse datasets including noisy environments, accented speech, and technical domain audio. Their streaming product is competitive and their feature surface goes beyond raw transcription: sentiment analysis, auto chapters, speaker diarization, and PII redaction are all built in. Universal-3 Pro has the strongest story for diverse real-world audio, particularly for non-English and code-switched speech. See our Deepgram vs AssemblyAI head-to-head benchmark for a detailed breakdown.
OpenAI Whisper (v3, Whisper-1 API)
OpenAI’s Whisper models are the gold standard for language coverage, supporting 99 languages; no other provider comes close. Whisper large-v3 achieves strong accuracy on English and most major world languages. The limitation is architecture: Whisper is a batch transcription model with no native streaming. The Whisper-1 API endpoint adds 2–8 seconds of latency because it waits for a complete audio segment before transcribing. For real-time voice agents, this disqualifies Whisper as a primary STT layer. For batch transcription, post-call analytics, and language-diverse use cases, it remains the default choice.
Google STT (Chirp-3, Cloud Speech)
Google Chirp-3 is Google’s latest STT model, part of the Gemini ecosystem. It has strong tonal language performance (particularly Mandarin, Thai, and Vietnamese) and broad global language coverage. If your infrastructure already lives in GCP, Google STT integrates cleanly with IAM, audit logging, and the rest of the Google Cloud stack. The per-minute cost is higher than Deepgram and AssemblyAI, which makes it harder to justify unless GCP integration is a hard requirement.
Azure Speech (Fast Transcription)
Azure Fast Transcription is Microsoft’s optimized STT endpoint, distinct from their standard batch transcription service. It delivers lower latency than standard Azure Speech while maintaining accuracy comparable to Google Chirp-3. The primary reason to choose Azure STT is Microsoft ecosystem integration: if your compliance stack, identity management, or data residency requirements tie you to Azure, this is your path of least resistance.
Groq (Whisper-Large-v3-turbo)
Groq runs Whisper large-v3-turbo on their custom LPU (Language Processing Unit) hardware, achieving transcription latency under 300ms for most audio. This is not a new model; it is standard Whisper running on faster silicon, so accuracy lands close to OpenAI-hosted Whisper v3 (~8.5% vs. 8.1% clean WER in our tests). The value proposition is purely latency: if you need Whisper-quality multilingual transcription with sub-300ms response times, Groq is currently the only option. The tradeoff is higher WER compared to purpose-built streaming providers like Deepgram and AssemblyAI.
Accuracy Rankings (Word Error Rate)
Word Error Rate (WER) measures the percentage of words transcribed incorrectly. Lower is better. We report two WER figures: clean audio (studio recording, minimal noise, native speaker) and diverse audio (noisy environments, accented speech, domain-specific vocabulary). The gap between these two numbers tells you how well a provider handles real-world conditions.
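For reference, WER is conventionally computed as the word-level Levenshtein edit distance (substitutions + deletions + insertions) divided by the number of reference words. Here is a minimal sketch, assuming whitespace tokenization, case-insensitive matching, and a non-empty reference; production scorers also normalize punctuation and numbers before comparing.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick browns fox"))  # 1 substitution / 4 words = 0.25
```

Note that a single dropped keyword (a product name, a negation) can matter far more than the aggregate percentage suggests, which is why domain-specific benchmarking beats headline WER.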
| Provider | Model | Clean WER | Diverse WER | Multilingual |
|---|---|---|---|---|
| Deepgram | Nova-3 | 4.1% | ~8.0% | Strong (30+ languages) |
| AssemblyAI | Universal-3 Pro | 5.9% | 5.9% | Strong (code-switching) |
| Google | Chirp-3 | 7.2% | ~10.0% | Excellent (tonal languages) |
| Azure | Fast Transcription | 7.8% | ~11.5% | Good (100+ languages) |
| OpenAI | Whisper v3 | 8.1% | ~9.5% | Best (99 languages) |
| Groq | Whisper-Large-v3-turbo | ~8.5% | ~10.5% | Good (same as Whisper v3) |
The most interesting number in this table is AssemblyAI’s: a 5.9% WER that stays flat from clean to diverse audio. Most providers show a significant accuracy drop when conditions get harder. AssemblyAI Universal-3 Pro was benchmarked specifically across 26 diverse datasets that include noise, accents, and domain-specific vocabulary. Its flat accuracy curve is the reason it outperforms Deepgram on diverse audio even though Deepgram has a lower clean audio WER.
Deepgram Nova-3’s 4.1% clean WER is the best in class for clean, English-first audio. The ~8% diverse WER is still competitive. For call center audio with controlled microphones and trained agents, Deepgram Nova-3 is the strongest choice.
Latency Rankings
For real-time voice agents, STT latency directly determines how fast your agent can begin processing a user’s utterance. We measure time-to-first-token (TTFT) for streaming providers and total transcription time for batch providers. All latency figures are measured at P50 with typical voice agent payloads (5–15 second utterances).
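Reporting at P50 means taking the median over repeated measurements rather than a single run. A minimal sketch, with hypothetical TTFT samples standing in for real measurements:

```python
import statistics

def p50(samples_ms: list[float]) -> float:
    """Median (P50) latency over a batch of samples, in milliseconds."""
    return statistics.median(samples_ms)

# Hypothetical TTFT samples from repeated streaming requests to one provider.
ttft_samples = [238, 251, 229, 244, 262, 240, 235]
print(f"P50 TTFT: {p50(ttft_samples)} ms")  # P50 TTFT: 240 ms
```

P50 hides tail behavior, so it is worth tracking P95 alongside it for a voice agent, where occasional slow turns are what users actually notice.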
| Provider | Model | Streaming Latency | Non-Streaming |
|---|---|---|---|
| Groq | Whisper-Large-v3-turbo | ~200ms | <300ms |
| Deepgram | Nova-3 | ~240ms | ~400ms |
| AssemblyAI | Universal-3 Pro | ~280ms | ~500ms |
| Google | Chirp-3 | ~350ms | ~600ms |
| Azure | Fast Transcription | ~380ms | ~650ms |
| OpenAI | Whisper-1 API | N/A (batch only) | 2–8s |
Groq’s ~200ms streaming latency is real and meaningful. The LPU hardware delivers lower inference latency than GPU-based providers across the board. The tradeoff is accuracy: Groq Whisper runs at ~8.5% WER versus Deepgram’s 4.1%. For use cases where latency matters more than transcription accuracy — think live captioning or real-time command detection — Groq is worth evaluating.
OpenAI’s 2–8 second latency is a hard disqualifier for real-time voice agents. The Whisper-1 API is a batch endpoint masquerading as a real-time service. Do not use it in a latency-sensitive pipeline. Use it for post-call transcription, where its language coverage advantage actually matters.
Cost Comparison
STT cost is typically measured per minute of audio processed. At the scale of a production voice agent (thousands of minutes per day), these differences compound quickly. The delta between the cheapest and most expensive provider in this table is nearly 4x: significant, but far smaller than the spread on the TTS or LLM side of the stack.
For a complete breakdown of total voice stack costs, see The Real Cost of Voice AI in 2026.
| Provider | Model | Price / min | Price / hour |
|---|---|---|---|
| Deepgram | Nova-3 | $0.0043 | $0.258 |
| AssemblyAI | Universal-3 Pro | $0.0065 | $0.390 |
| OpenAI | Whisper-1 API | $0.006 | $0.360 |
| Groq | Whisper-Large-v3-turbo | Varies | Varies |
| Azure | Fast Transcription | $0.012 | $0.720 |
| Google | Chirp-3 | $0.016 | $0.960 |
Deepgram’s $0.0043/min is the lowest list price in the STT market for a production-grade streaming provider. At 10,000 minutes per day (a moderate production volume), that is $43/day versus $65/day for AssemblyAI or $160/day for Google. The difference is meaningful at scale, but not at the early stage where you should be optimizing for accuracy and developer experience first.
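The per-day arithmetic is easy to reproduce. A small sketch using the list prices from the table above (the provider keys are our own labels, not vendor identifiers):

```python
# List prices per minute of audio (USD), from the table above.
PRICE_PER_MIN = {
    "deepgram_nova3": 0.0043,
    "assemblyai_universal3_pro": 0.0065,
    "openai_whisper1": 0.006,
    "azure_fast_transcription": 0.012,
    "google_chirp3": 0.016,
}

def daily_cost(provider: str, minutes_per_day: int) -> float:
    """Daily STT spend at a given audio volume."""
    return PRICE_PER_MIN[provider] * minutes_per_day

for p in ("deepgram_nova3", "assemblyai_universal3_pro", "google_chirp3"):
    print(f"{p}: ${daily_cost(p, 10_000):.2f}/day")  # $43.00, $65.00, $160.00
```

At 10,000 minutes/day the annual gap between Deepgram and Google is over $40,000, which is the kind of number that justifies a migration but rarely justifies picking the wrong provider for your audio on day one.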
Google and Azure premium pricing reflects ecosystem integration costs, not raw model quality. If you are paying for Google STT, you are essentially paying for GCP-native integration and SLA guarantees from a hyperscaler. For most independent voice AI deployments, that premium is hard to justify.
When to Use Each Provider
The “best” STT provider depends on what you are optimizing for. Here is a direct decision framework.
Use Deepgram when:
- You are building a real-time voice agent with English as the primary language. Nova-3’s 4.1% WER at ~240ms streaming latency is the best accuracy-latency combination available.
- Cost is a priority. At $0.0043/min, Deepgram is the most cost-efficient streaming STT provider. This advantage compounds significantly at scale.
- Your audio comes from controlled environments: call center headsets, phone lines, or professional microphones. This is where Deepgram’s clean-audio WER advantage is most pronounced.
- You need specialized vocabulary. Nova-3’s keyword boosting feature lets you weight domain-specific terms (medical procedures, product names, legal terminology) to improve transcription accuracy.
Explore the Deepgram integration guide for setup instructions and benchmark data for your specific use case.
Use AssemblyAI when:
- Your audio is diverse: user-generated content, consumer microphones, multi-accent environments, or noisy locations. Universal-3 Pro’s flat accuracy curve across 26 diverse datasets makes it the safest bet for unpredictable audio.
- You need more than transcription. AssemblyAI’s built-in sentiment analysis, speaker diarization, auto chapters, and PII redaction reduce the number of downstream processing steps.
- Your use case involves code-switching (users mixing two languages in the same utterance). Universal-3 Pro has the best handling of code-switched speech among the providers we tested.
- You are processing Hindi, Arabic, or other languages where Deepgram’s advantage shrinks. AssemblyAI’s Universal-3 Pro leads on several non-English languages.
Explore the AssemblyAI integration guide for setup instructions and benchmark data.
Use OpenAI Whisper when:
- You need maximum language coverage. 99 languages, with strong accuracy even on low-resource languages, is Whisper’s defining advantage. No other provider is close.
- Your workload is batch transcription: post-call analysis, meeting notes, podcast transcription. The 2–8 second latency is irrelevant for async workflows.
- You are already deeply in the OpenAI ecosystem and want a single vendor for LLM + STT.
Use Google Chirp-3 when:
- You are already running on GCP and need seamless IAM, VPC, and audit log integration.
- Your primary language is a tonal Asian language (Mandarin, Thai, Vietnamese). Google’s tonal language handling is the strongest in this category.
- Enterprise SLAs and Google-backed support are requirements your procurement team needs to satisfy.
Use Azure Fast Transcription when:
- Your infrastructure is Azure-first and switching to a non-Microsoft STT provider requires security reviews, procurement cycles, or compliance approvals you want to avoid.
- You need Microsoft’s enterprise SLA, data residency commitments, and Entra ID integration.
- You are building on Microsoft Teams, Azure Communication Services, or other Microsoft collaboration products.
Use Groq when:
- Latency is your absolute first priority and you can accept the accuracy tradeoff. Groq’s ~200ms streaming latency is real.
- You need Whisper-quality multilingual transcription at near-real-time speeds, for live captioning, real-time translation pipelines, or command detection systems.
- You are already using Groq for LLM inference and want to consolidate on a single vendor for both STT and generation latency optimization.
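To make the framework concrete, here is one way to encode it as a routing function. This is an illustrative sketch of the guidance above, not an API from any of these vendors; the provider labels are our own, and language-specific routing (a separate dimension, covered in the next section) is deliberately left out.

```python
def pick_stt_provider(realtime: bool,
                      diverse_audio: bool = False,
                      latency_critical: bool = False,
                      needs_max_languages: bool = False) -> str:
    """Rough encoding of the decision framework above. Illustrative only;
    benchmark providers against your own audio before committing."""
    if needs_max_languages and not realtime:
        return "openai-whisper-v3"            # 99 languages, batch only
    if latency_critical:
        return "groq-whisper-large-v3-turbo"  # ~200ms streaming, higher WER
    if diverse_audio:
        return "assemblyai-universal-3-pro"   # flat WER curve on hard audio
    if realtime:
        return "deepgram-nova-3"              # best accuracy/latency/cost mix
    return "openai-whisper-v3"                # batch/async default

print(pick_stt_provider(realtime=True))  # deepgram-nova-3
```

In practice teams often run two providers: a streaming provider for the live turn and a batch provider (typically Whisper) for post-call re-transcription and analytics.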
Language Support Comparison
Provider rankings shift significantly when you move beyond English. The provider with the best English WER is often not the right choice for non-English deployments. Here is where each provider leads by language family.
| Language | Top STT Pick | Notes |
|---|---|---|
| English | Deepgram Nova-3 | 4.1% WER, best streaming latency |
| Japanese | Deepgram Nova-3 | 6.2% WER, strong on agglutinative morphology |
| Spanish | Deepgram Nova-3 | Strong; PT-BR comparable performance |
| Portuguese (PT-BR) | Deepgram Nova-3 | Comparable to Spanish performance |
| Chinese (Mandarin) | Deepgram Nova-3 | Google Chirp-3 competitive; tonal handling matters |
| Korean | Deepgram Nova-3 | Good agglutinative handling |
| Hindi | AssemblyAI Universal-3 Pro | Best code-switching (Hinglish) support |
| Arabic | AssemblyAI Universal-3 Pro | Dialect coverage limited but leads the field |
| Thai | Google Chirp-3 | Best tonal language support; critical for Thai |
| 99+ languages | OpenAI Whisper v3 | Broadest coverage; batch only |
The language-specific rankings illustrate why the “just use Deepgram” default is dangerous for multilingual products. Deepgram leads on most Western European and East Asian languages, but AssemblyAI leads for South Asian and Semitic languages. Google leads for Southeast Asian tonal languages. No single provider dominates across all language families.
If you are building a multilingual product, you should be benchmarking each provider against your specific target languages, not relying on aggregate WER figures. See the Deepgram vs AssemblyAI per-language benchmark for language-specific accuracy data.
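The table above can also be encoded as a simple lookup with Whisper as the fallback, which is roughly how a multilingual product might route transcription requests by detected language. The language codes and provider labels here are our own illustrative choices, not vendor identifiers.

```python
# Per-language top picks, transcribed from the table above.
TOP_STT_BY_LANGUAGE = {
    "en": "deepgram-nova-3",
    "ja": "deepgram-nova-3",
    "es": "deepgram-nova-3",
    "pt-BR": "deepgram-nova-3",
    "zh": "deepgram-nova-3",
    "ko": "deepgram-nova-3",
    "hi": "assemblyai-universal-3-pro",
    "ar": "assemblyai-universal-3-pro",
    "th": "google-chirp-3",
}

def top_stt(lang: str) -> str:
    # Whisper's 99-language coverage makes it the natural fallback
    # for languages outside the benchmarked set (batch workloads only).
    return TOP_STT_BY_LANGUAGE.get(lang, "openai-whisper-v3")

print(top_stt("hi"))  # assemblyai-universal-3-pro
```

A static map like this is only a starting point; it should be replaced by whatever your own per-language benchmarks show for your audio.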
The Bottom Line
Three providers dominate for different reasons:
- Deepgram Nova-3 wins on English accuracy and cost. If you are building English-first, real-time voice agents at any meaningful scale, Deepgram is the default answer.
- AssemblyAI Universal-3 Pro wins on diverse audio and feature richness. If your audio comes from the real world (consumer devices, noisy environments, multi-accent users), Universal-3 Pro’s flat accuracy curve is the right choice.
- OpenAI Whisper v3 wins on language coverage. 99 languages, best-in-class for low-resource languages, but batch-only. Use it for async workloads or post-call analytics.
Google and Azure are valid choices for teams inside those ecosystems. Groq is valid when latency is the absolute constraint. None of them is the right default for a greenfield voice agent deployment.
The optimal choice depends on your specific audio: what language, what environment, what domain vocabulary, what latency target. Speko benchmarks all six providers against your audio in 15 minutes so you do not have to make this decision on marketing claims. See our benchmark tool to run a comparison on your own data.
Sources
All data verified as of March 2026. Next scheduled verification: June 2026.