The Number on the Pricing Page Is Not Your Cost
Most voice AI pricing pages show you one number. The real cost is 3–6x higher. Vapi advertises $0.05/min as its platform fee. But once you add STT, TTS, LLM inference, and telephony, the effective cost lands between $0.30 and $0.33 per minute. That is not a rounding error. That is the difference between a viable product and a money pit.
This post breaks down every cost component in a voice AI stack using verified 2026 pricing data. Every number comes from official pricing pages and API documentation, verified as of March 4, 2026. No estimates, no “approximately.” If you are building a voice agent, this is what you will actually pay.
Speech-to-Text: $0.0025 to $0.018 per Minute
STT is typically the cheapest component in a voice AI stack, but the range is wider than most people expect. The cheapest option, AssemblyAI Universal-3 Pro at $0.0025/min, costs about one-seventh as much as the most expensive, Azure Speech-to-Text at $0.018/min (a 7.2x spread).
| Provider | Model | Cost/min | Billing |
|---|---|---|---|
| AssemblyAI | Universal-3 Pro | $0.0025 | Per-second |
| OpenAI | Whisper | $0.006 | Per-minute |
| Google Cloud | Speech-to-Text | $0.006 | Per-15s block |
| Deepgram | Nova-3 | $0.0077 | Per-second |
| Azure | Speech-to-Text | $0.018 | Per-minute |
The billing model matters more than most builders realize. Per-second billing (Deepgram, AssemblyAI) saves 30–40% compared to block or per-minute billing for typical voice agent conversations. In a voice agent, individual utterances are short, often 3 to 8 seconds. With per-minute billing, a 4-second utterance costs you a full minute. With per-second billing, you pay for 4 seconds.
Google Cloud’s 15-second block billing sits between the two. That 4-second utterance rounds up to 15 seconds, not 60, but you are still paying for 11 seconds of silence.
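A small sketch shows how much the billing increment alone changes the bill for identical audio. The rates come from the table above; the utterance lengths are illustrative, not measured data:

```python
import math

# Sketch: same audio, three billing granularities (rates from the table above).
def billed_seconds(duration_s: float, increment_s: int) -> int:
    """Round an utterance up to the provider's billing increment."""
    return math.ceil(duration_s / increment_s) * increment_s

utterances = [4, 6, 3, 8, 5]  # seconds of speech per user turn (illustrative)

for label, rate_per_min, increment in [
    ("per-second (AssemblyAI, $0.0025/min)", 0.0025, 1),
    ("15s blocks (Google, $0.006/min)", 0.006, 15),
    ("per-minute (Whisper, $0.006/min)", 0.006, 60),
]:
    total = sum(billed_seconds(u, increment) for u in utterances)
    print(f"{label}: billed {total}s, ${total / 60 * rate_per_min:.4f}")
```

For these five short utterances (26 seconds of actual speech), per-second billing bills 26 seconds, 15-second blocks bill 75, and per-minute billing bills a full 300, which is where the 30–40% savings figure comes from.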
Text-to-Speech: $0.006 to $0.160 per Minute
TTS has the widest cost spread of any component. At the low end, Cartesia Sonic-3 costs $0.006/min. At the high end, Google Cloud Studio premium voices reach $0.160/min. That is a 26x spread.
| Provider | Model | Cost/min | TTFB |
|---|---|---|---|
| Cartesia | Sonic-3 | $0.006 | 40–90ms |
| Azure | Neural TTS | $0.0125 | — |
| Google Cloud | WaveNet | $0.014 | — |
| ElevenLabs | Flash v2.5 | $0.015 | 100–200ms |
| Deepgram | Aura-2 | $0.018 | 90–200ms |
| PlayHT | Standard | $0.020 | — |
| Rime | Arcana v3 | $0.030 | — |
| ElevenLabs | Multilingual v2 | $0.030 | — |
| Google Cloud | Studio (premium) | $0.160 | — |
Cartesia deserves attention. At $0.006/min, it is roughly one-fifth the cost of ElevenLabs, and its TTFB of 40–90ms is the fastest in the industry. For latency-sensitive voice agents where time to first audio byte matters, Cartesia is currently the best combination of speed and cost.
The Google Cloud Studio premium tier at $0.160/min is an outlier. Most builders will never need studio-quality voices at that price point. But if your use case requires the highest available quality for pre-recorded or low-volume synthesis, the option exists.
LLM Inference: $0.00075 to $0.135 per Minute
LLM cost is the most variable component in a voice AI stack. At the low end, Gemini 2.0 Flash costs $0.00075/min in a voice conversation context. At the high end, Claude 3.5 Sonnet runs $0.135/min. That is a 180x spread.
| Provider | Model | Cost/min (voice) | Note |
|---|---|---|---|
| Google | Gemini 2.0 Flash | $0.00075 | $0.10/M in, $0.40/M out |
| OpenAI | GPT-4o mini | $0.0015 | $0.15/M in, $0.60/M out |
| Groq | Llama 4 Maverick | $0.002 | 1,200 tok/s, sub-100ms TTFT |
| Cerebras | WSE-3 (Llama-70B) | $0.005 | Up to 4,000 tok/s |
| OpenAI | GPT-4.1 | $0.015 | $2.00/M in, $8.00/M out |
| Anthropic | Claude 3.5 Sonnet | $0.135 | $15/M in, $75/M out |
The cost estimates above assume a typical voice conversation context of 1,000–2,000 tokens per minute (500–1,000 input, 500–1,000 output), plus system prompt overhead of 200–500 tokens sent with each turn.
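Per-minute figures can be derived from per-token rates once you fix a token budget. A minimal sketch of the conversion, assuming roughly 1,500 input and 1,500 output tokens per minute (the upper end of the ranges above with system prompt overhead folded in); under that assumption the Gemini and Claude rows reproduce exactly:

```python
# Sketch: converting $/million-token LLM pricing into a voice $/min figure.
# ASSUMPTION: ~1,500 input and ~1,500 output tokens per conversation minute.
def llm_cost_per_min(in_rate_per_m: float, out_rate_per_m: float,
                     in_tok: int = 1500, out_tok: int = 1500) -> float:
    """Estimated $/min given $ per million input/output tokens."""
    return (in_tok * in_rate_per_m + out_tok * out_rate_per_m) / 1_000_000

print(f"{llm_cost_per_min(0.10, 0.40):.5f}")  # Gemini 2.0 Flash -> 0.00075
print(f"{llm_cost_per_min(15.0, 75.0):.3f}")  # Claude 3.5 Sonnet -> 0.135
```

Your own numbers will shift with verbosity, turn frequency, and system prompt size, which is why the table labels these as voice-context estimates rather than fixed rates.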
What matters here: in a budget stack, the LLM is roughly 10% of total per-minute cost. In an ultra-premium stack using Claude 3.5 Sonnet, the LLM becomes 78% of the total. Your LLM choice has a larger impact on total cost than any other single component.
Platform Fees: The Hidden Multiplier
If you use a voice agent platform instead of building your own pipeline, the platform fee is often the largest single line item, and the advertised number rarely tells the full story.
| Platform | Base fee/min | Effective cost/min | Model |
|---|---|---|---|
| Vapi | $0.05 | $0.315 | Platform + separate providers |
| Retell AI | $0.07 | $0.13 | All-inclusive base + add-ons |
| Bland AI | $0.11 | $0.14 | Monthly plan + usage |
Vapi’s unbundled model is the most expensive in practice. Its $0.05/min base fee is just the beginning: you still pay separately for STT, TTS, LLM, and telephony, which means five separate invoices and an effective cost of $0.30–$0.33/min. At 10,000 minutes per month, that is $3,000–$3,300, versus about $1,300 at Retell’s effective rate of $0.13/min. Retell is roughly 59% cheaper at that scale.
For enterprise deployments, Vapi customers should budget $40,000 to $70,000 per year for stable operations. That is a realistic operating budget, not a worst-case estimate.
Telephony: $0.01 to $0.02 per Minute
If your voice agent connects to the phone network (PSTN), telephony adds $0.01–$0.02/min on top of everything else. Twilio charges $0.0085–$0.022/min for inbound US calls and $0.013–$0.030/min for outbound. SignalWire is slightly cheaper at $0.005–$0.015/min inbound.
Telephony costs are relatively small compared to other components, but they add up. For a WebRTC-only deployment (no phone network), this cost disappears entirely. Phone numbers run about $2.00/month from either provider.
Four Stack Scenarios: $0.0095 to $0.1727 per Minute
Here is what a complete cascaded pipeline (STT + LLM + TTS) costs at four different quality tiers. These are component costs only; they exclude telephony and platform fees.
| Tier | STT | LLM | TTS | Total/min |
|---|---|---|---|---|
| Budget | AssemblyAI ($0.0025) | Gemini 2.0 Flash ($0.001) | Cartesia ($0.006) | $0.0095 |
| Quality | Deepgram ($0.0077) | GPT-4o mini ($0.002) | PlayHT ($0.02) | $0.0297 |
| Premium | Deepgram ($0.0077) | GPT-4.1 ($0.015) | ElevenLabs ($0.015) | $0.0377 |
| Ultra | Deepgram ($0.0077) | Claude 3.5 Sonnet ($0.135) | ElevenLabs v2 ($0.03) | $0.1727 |
The budget stack at $0.0095/min is remarkably cheap, under a penny per minute for full voice AI. At 10,000 minutes per month, that is $95. The ultra-premium stack at $0.1727/min would cost $1,727 for the same volume. The 18x gap between them is almost entirely driven by the LLM choice.
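These totals follow directly from the component rates in the table; a quick sketch that recomputes them, along with the LLM cost share discussed earlier:

```python
# Sketch: recomputing the tier totals from the component rates above.
tiers = {
    "Budget":  {"stt": 0.0025, "llm": 0.001, "tts": 0.006},
    "Quality": {"stt": 0.0077, "llm": 0.002, "tts": 0.020},
    "Premium": {"stt": 0.0077, "llm": 0.015, "tts": 0.015},
    "Ultra":   {"stt": 0.0077, "llm": 0.135, "tts": 0.030},
}
for name, c in tiers.items():
    total = sum(c.values())
    print(f"{name}: ${total:.4f}/min (LLM share {c['llm'] / total:.0%})")
```

Running this confirms the LLM's swing from around a tenth of the bill in the budget tier to nearly four-fifths in the ultra tier.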
The Speech-to-Speech Alternative
Speech-to-speech (S2S) models skip the STT and TTS steps entirely, processing audio end-to-end in a single model. The cost picture is dramatically different from cascaded pipelines.
| Provider | Model | Cost/min | Note |
|---|---|---|---|
| Google | Gemini 2.0 Flash Live | $0.00165 | Cheapest S2S option available |
| Google | Gemini 2.5 Flash Live | $0.01125 | 30 HD voices, 24 languages |
| OpenAI | Realtime API (mini) | $0.084 | $10/M in + $20/M out audio tokens |
| OpenAI | Realtime API (GPT-4o) | $0.30 | $100/M in + $200/M out audio tokens |
Gemini 2.0 Flash Live at $0.00165/min is cheaper than even the budget cascaded stack ($0.0095/min). That is a striking number. It means S2S is no longer the “expensive option,” at least not with Google’s pricing. There is a 182x spread between Gemini 2.0 Flash Live and OpenAI Realtime GPT-4o.
The critical caveat: S2S costs scale with conversation length due to context accumulation. Audio tokens accumulate throughout a conversation, so a 10-minute call costs more than 10x a 1-minute call. For short interactions, S2S can be extremely cost-effective. For long conversations, the math changes. We cover this in detail in our S2S vs Cascaded architecture comparison.
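A toy model makes the accumulation effect concrete. The per-token rates are the Realtime GPT-4o audio rates from the table; the 600-tokens-per-minute figure and the assumption that each minute re-bills the full accumulated context are illustrative simplifications, not OpenAI's exact session accounting:

```python
# Toy model: S2S cost vs. call length when audio context accumulates.
# ASSUMPTIONS: ~600 audio tokens generated per minute, and each minute the
# full accumulated context is re-billed as input. Illustrative only.
IN_RATE = 100 / 1_000_000   # $/audio input token (Realtime GPT-4o)
OUT_RATE = 200 / 1_000_000  # $/audio output token
TOK_PER_MIN = 600

def call_cost(minutes: int) -> float:
    cost, context = 0.0, 0
    for _ in range(minutes):
        context += TOK_PER_MIN  # context grows every minute of the call
        cost += context * IN_RATE + TOK_PER_MIN * OUT_RATE
    return cost

ratio = call_cost(10) / call_cost(1)
print(f"1 min: ${call_cost(1):.2f}, 10 min: ${call_cost(10):.2f}, {ratio:.0f}x")
```

Under these assumptions a 10-minute call costs 25x a 1-minute call, not 10x, which is the superlinear scaling the caveat describes.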
The Deflation Trend: Everything Is Getting Cheaper
Voice AI costs are falling fast across every component. Here is how pricing has moved from 2024 to 2025/2026:
| Component | 2024 Cost | 2025/26 Cost | Change |
|---|---|---|---|
| STT | $0.012–$0.024/min | $0.004–$0.010/min | -50% to -67% |
| TTS | $0.020–$0.040/min | $0.006–$0.030/min | -25% to -70% |
| LLM | $0.020–$0.050/min | $0.001–$0.020/min | -60% to -95% |
| Platform | $0.12/min | $0.07–$0.10/min | -17% to -42% |
LLM costs have fallen the most, down 60% to 95%. This is driven by new model releases (Gemini 2.0 Flash, GPT-4o mini) and intense competition, with 485+ models now available on OpenRouter alone. TTS deflation is driven by new entrants like Cartesia undercutting incumbents. STT deflation comes from model improvements and per-second billing replacing per-minute billing.
The implication: what costs $0.17/min today in an ultra-premium configuration will likely cost under $0.05/min within 18 months if these trends continue. If you are making build-vs-buy decisions, factor in that the “buy” option is getting cheaper faster than most projections assume.
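That projection can be sanity-checked with simple compound decay. The 60%-per-year rate below is an assumption taken from the low end of the observed LLM decline, not a forecast:

```python
# Sketch: the 18-month projection above, assuming costs keep falling at
# ~60% per year (ASSUMPTION: low end of the observed LLM decline).
annual_retention = 0.40  # a 60%/year decline retains 40% of the price
projected = 0.1727 * annual_retention ** (18 / 12)
print(f"${projected:.3f}/min")  # ~$0.044, under the $0.05 mark
```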
What 10,000 Minutes Actually Costs
To make these numbers concrete, here is the monthly bill for 10,000 minutes of voice agent usage at each tier. We include a $0.015/min telephony estimate (midpoint of Twilio/SignalWire range) and note platform fees separately.
| Tier | Components | + Telephony | Monthly (10K min) |
|---|---|---|---|
| Budget | $0.0095 | $0.0245 | $245 |
| Quality | $0.0297 | $0.0447 | $447 |
| Premium | $0.0377 | $0.0527 | $527 |
| Ultra | $0.1727 | $0.1877 | $1,877 |
| S2S (Gemini Live) | $0.00165 | $0.01665 | $167 |
These figures exclude platform fees. If you add a platform like Vapi ($0.05/min), Retell ($0.07/min), or Bland ($0.11/min), the monthly bill increases by $500 to $1,100 for 10,000 minutes. On Vapi with the quality-optimized stack, total monthly spend comes to roughly $947 for 10,000 minutes: $447 in components and telephony plus $500 in platform fees.
Gemini 2.0 Flash Live as a standalone S2S solution delivers the lowest total cost at $167/month for 10,000 minutes including telephony. This assumes short conversations where context accumulation remains manageable. For longer calls, multiply accordingly.
Practical Recommendations
Based on this data, here is what we recommend for builders at different stages:
- Prototyping: Start with the budget stack (AssemblyAI + Gemini 2.0 Flash + Cartesia) at $0.0095/min. At this cost, you can run thousands of test calls for under $100. Do not optimize prematurely.
- Production: Move to the quality-optimized stack (Deepgram + GPT-4o mini + PlayHT) at $0.0297/min. Deepgram’s 150–300ms latency and per-second billing make it the production workhorse. GPT-4o mini offers the best quality-to-cost ratio for general voice agents.
- Cost-constrained: Consider Gemini 2.0 Flash Live ($0.00165/min) for S2S if cost is your primary constraint and you can accept the tradeoffs around debugging, compliance, and context accumulation costs.
- Always: Calculate your fully loaded cost, not headline prices. Include telephony, platform fees, and realistic per-minute rates for your actual usage patterns. The difference between advertised and effective cost is often 3–6x.
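That last point can be made concrete with a minimal fully-loaded calculator. The rates are the quality-tier picks from this post's tables plus the $0.015/min telephony midpoint; the $0.07/min platform fee uses Retell's base rate as an illustrative example:

```python
# Sketch: fully loaded $/min = STT + LLM + TTS + telephony + platform fee.
# Rates below are the quality-tier picks from the tables above (assumptions).
def fully_loaded(stt: float, llm: float, tts: float,
                 telephony: float = 0.015, platform: float = 0.0) -> float:
    return stt + llm + tts + telephony + platform

diy = fully_loaded(0.0077, 0.002, 0.020)                     # self-assembled
hosted = fully_loaded(0.0077, 0.002, 0.020, platform=0.07)   # e.g. Retell base

print(f"DIY: ${diy:.4f}/min -> ${diy * 10_000:,.0f}/mo at 10K min")
print(f"Hosted: ${hosted:.4f}/min -> ${hosted * 10_000:,.0f}/mo at 10K min")
```

The same component mix lands at $447/month self-assembled versus $1,147/month behind a $0.07/min platform fee, which is exactly the kind of gap headline pricing hides.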
For the full benchmark data behind these numbers, see our Voice AI Benchmark Report 2026.
Sources
All pricing data verified as of March 4, 2026. Next scheduled verification: June 2026.
- Deepgram Pricing
- AssemblyAI Pricing Breakdown (Brass Transcripts)
- OpenAI API Pricing
- Google Cloud STT Pricing
- Azure Speech Pricing
- Cartesia Pricing
- ElevenLabs Pricing
- PlayHT Pricing
- Rime Pricing
- Google Cloud TTS Pricing
- Google AI Gemini Pricing
- Vapi Pricing
- Retell AI Pricing
- Bland AI Billing
- Twilio Voice Pricing
- SignalWire AI Agent Pricing
- Retell Voice AI Platform Pricing Comparison
- Softcery Voice AI Cost Calculator