The Number on the Pricing Page Is Not Your Cost
Most voice AI pricing pages show you one number. The real cost is 3–6x higher. Vapi advertises $0.05/min as its platform fee. But once you add STT, TTS, LLM inference, and telephony, the effective cost lands between $0.30 and $0.33 per minute. That is not a rounding error. That is the difference between a viable product and a money pit.
This post breaks down every cost component in a voice AI stack using verified 2026 pricing data. Every number comes from official pricing pages and API documentation, verified as of March 4, 2026. No estimates, no “approximately.” If you are building a voice agent, this is what you will actually pay.
Speech-to-Text: $0.0025 to $0.018 per Minute
STT is typically the cheapest component in a voice AI stack, but the range is wider than most people expect. The cheapest option, AssemblyAI Universal-3 Pro at $0.0025/min, costs about one-seventh as much as the most expensive, Azure Speech-to-Text at $0.018/min (a 7.2x spread).
| Provider | Model | Cost/min | Billing |
|---|---|---|---|
| AssemblyAI | Universal-3 Pro | $0.0025 | Per-second |
| OpenAI | Whisper | $0.006 | Per-minute |
| Google Cloud | Speech-to-Text | $0.006 | Per-15s block |
| Deepgram | Nova-3 | $0.0077 | Per-second |
| Azure | Speech-to-Text | $0.018 | Per-minute |
The billing model matters more than most builders realize. Per-second billing (Deepgram, AssemblyAI) saves 30–40% compared to block or per-minute billing for typical voice agent conversations. In a voice agent, individual utterances are short, often 3 to 8 seconds. With per-minute billing, a 4-second utterance costs you a full minute. With per-second billing, you pay for 4 seconds.
Google Cloud’s 15-second block billing sits between the two. That 4-second utterance rounds up to 15 seconds, not 60, but you are still paying for 11 seconds of silence.
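A small sketch shows how much the billing increment alone changes the bill for identical audio. The rates come from the table above; the utterance lengths are illustrative, not measured data:

```python
import math

# Sketch: same audio, three billing granularities (rates from the table above).
def billed_seconds(duration_s: float, increment_s: int) -> int:
    """Round an utterance up to the provider's billing increment."""
    return math.ceil(duration_s / increment_s) * increment_s

utterances = [4, 6, 3, 8, 5]  # seconds of speech per user turn (illustrative)

for label, rate_per_min, increment in [
    ("per-second (AssemblyAI, $0.0025/min)", 0.0025, 1),
    ("15s blocks (Google, $0.006/min)", 0.006, 15),
    ("per-minute (Whisper, $0.006/min)", 0.006, 60),
]:
    total = sum(billed_seconds(u, increment) for u in utterances)
    print(f"{label}: billed {total}s, ${total / 60 * rate_per_min:.4f}")
```

For these five short utterances (26 seconds of actual speech), per-second billing bills 26 seconds, 15-second blocks bill 75, and per-minute billing bills a full 300, which is where the 30–40% savings figure comes from.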
Text-to-Speech: $0.006 to $0.160 per Minute
TTS has the widest cost spread of any component. At the low end, Cartesia Sonic-3 costs $0.006/min. At the high end, Google Cloud Studio premium voices reach $0.160/min. That is a 26x spread.
| Provider | Model | Cost/min | TTFB |
|---|---|---|---|
| Cartesia | Sonic-3 | $0.006 | 40–90ms |
| Azure | Neural TTS | $0.0125 | — |
| Google Cloud | WaveNet | $0.014 | — |
| ElevenLabs | Flash v2.5 | $0.015 | 100–200ms |
| Deepgram | Aura-2 | $0.018 | 90–200ms |
| PlayHT | Standard | $0.020 | — |
| Rime | Arcana v3 | $0.030 | — |
| ElevenLabs | Multilingual v2 | $0.030 | — |
| Google Cloud | Studio (premium) | $0.160 | — |
Cartesia deserves attention. At $0.006/min, it is roughly one-fifth the cost of ElevenLabs, and its TTFB of 40–90ms is the fastest in the industry. For latency-sensitive voice agents where time to first audio byte matters, Cartesia is currently the best combination of speed and cost.
The Google Cloud Studio premium tier at $0.160/min is an outlier. Most builders will never need studio-quality voices at that price point. But if your use case requires the highest available quality for pre-recorded or low-volume synthesis, the option exists.
LLM Inference: $0.00075 to $0.135 per Minute
LLM cost is the most variable component in a voice AI stack. At the low end, Gemini 2.0 Flash costs $0.00075/min in a voice conversation context. At the high end, Claude 3.5 Sonnet runs $0.135/min. That is a 180x spread.
| Provider | Model | Cost/min (voice) | Note |
|---|---|---|---|
| Google | Gemini 2.0 Flash | $0.00075 | $0.10/M in, $0.40/M out |
| OpenAI | GPT-4o mini | $0.0015 | $0.15/M in, $0.60/M out |
| Groq | Llama 4 Maverick | $0.002 | 1,200 tok/s, sub-100ms TTFT |
| Cerebras | WSE-3 (Llama-70B) | $0.005 | Up to 4,000 tok/s |
| OpenAI | GPT-4.1 | $0.015 | $2.00/M in, $8.00/M out |
| Anthropic | Claude 3.5 Sonnet | $0.135 | $15/M in, $75/M out |
The cost estimates above assume a typical voice conversation context of 1,000–2,000 tokens per minute (500–1,000 input, 500–1,000 output), plus system prompt overhead of 200–500 tokens sent with each turn.
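Per-minute figures can be derived from per-token rates once you fix a token budget. A minimal sketch of the conversion, assuming roughly 1,500 input and 1,500 output tokens per minute (the upper end of the ranges above with system prompt overhead folded in); under that assumption the Gemini and Claude rows reproduce exactly:

```python
# Sketch: converting $/million-token LLM pricing into a voice $/min figure.
# ASSUMPTION: ~1,500 input and ~1,500 output tokens per conversation minute.
def llm_cost_per_min(in_rate_per_m: float, out_rate_per_m: float,
                     in_tok: int = 1500, out_tok: int = 1500) -> float:
    """Estimated $/min given $ per million input/output tokens."""
    return (in_tok * in_rate_per_m + out_tok * out_rate_per_m) / 1_000_000

print(f"{llm_cost_per_min(0.10, 0.40):.5f}")  # Gemini 2.0 Flash -> 0.00075
print(f"{llm_cost_per_min(15.0, 75.0):.3f}")  # Claude 3.5 Sonnet -> 0.135
```

Your own numbers will shift with verbosity, turn frequency, and system prompt size, which is why the table labels these as voice-context estimates rather than fixed rates.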
What matters here: in a budget stack, the LLM is roughly 10% of total per-minute cost. In an ultra-premium stack using Claude 3.5 Sonnet, the LLM becomes 78% of the total. Your LLM choice has a larger impact on total cost than any other single component.
Platform Fees: The Hidden Multiplier
If you use a voice agent platform instead of building your own pipeline, the platform fee is often the largest single line item, and the advertised number rarely tells the full story.
| Platform | Base fee/min | Effective cost/min | Model |
|---|---|---|---|
| Vapi | $0.05 | $0.315 | Platform + separate providers |
| Retell AI | $0.07 | $0.13 | All-inclusive base + add-ons |
| Bland AI | $0.11 | $0.14 | Monthly plan + usage |
Vapi’s unbundled model is the most expensive in practice. Its $0.05/min base fee is just the beginning: you still pay separately for STT, TTS, LLM, and telephony, which means five separate invoices and an effective cost of $0.30–$0.33/min. At 10,000 minutes per month, that is $3,000–$3,300, versus about $1,300 at Retell’s effective rate of $0.13/min. Retell is roughly 59% cheaper at that scale.
For enterprise deployments, Vapi customers should budget $40,000 to $70,000 per year for stable operations. That is a realistic operating budget, not a worst-case estimate.
Telephony: $0.01 to $0.02 per Minute
If your voice agent connects to the phone network (PSTN), telephony adds $0.01–$0.02/min on top of everything else. Twilio charges $0.0085–$0.022/min for inbound US calls and $0.013–$0.030/min for outbound. SignalWire is slightly cheaper at $0.005–$0.015/min inbound.
Telephony costs are relatively small compared to other components, but they add up. For a WebRTC-only deployment (no phone network), this cost disappears entirely. Phone numbers run about $2.00/month from either provider.
Four Stack Scenarios: $0.0095 to $0.1727 per Minute
Here is what a complete cascaded pipeline (STT + LLM + TTS) costs at four different quality tiers. These are component costs only; they exclude telephony and platform fees.
| Tier | STT | LLM | TTS | Total/min |
|---|---|---|---|---|
| Budget | AssemblyAI ($0.0025) | Gemini 2.0 Flash ($0.001) | Cartesia ($0.006) | $0.0095 |
| Quality | Deepgram ($0.0077) | GPT-4o mini ($0.002) | PlayHT ($0.02) | $0.0297 |
| Premium | Deepgram ($0.0077) | GPT-4.1 ($0.015) | ElevenLabs ($0.015) | $0.0377 |
| Ultra | Deepgram ($0.0077) | Claude 3.5 Sonnet ($0.135) | ElevenLabs v2 ($0.03) | $0.1727 |
The budget stack at $0.0095/min is remarkably cheap, under a penny per minute for full voice AI. At 10,000 minutes per month, that is $95. The ultra-premium stack at $0.1727/min would cost $1,727 for the same volume. The 18x gap between them is almost entirely driven by the LLM choice.
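These totals follow directly from the component rates in the table; a quick sketch that recomputes them, along with the LLM cost share discussed earlier:

```python
# Sketch: recomputing the tier totals from the component rates above.
tiers = {
    "Budget":  {"stt": 0.0025, "llm": 0.001, "tts": 0.006},
    "Quality": {"stt": 0.0077, "llm": 0.002, "tts": 0.020},
    "Premium": {"stt": 0.0077, "llm": 0.015, "tts": 0.015},
    "Ultra":   {"stt": 0.0077, "llm": 0.135, "tts": 0.030},
}
for name, c in tiers.items():
    total = sum(c.values())
    print(f"{name}: ${total:.4f}/min (LLM share {c['llm'] / total:.0%})")
```

Running this confirms the LLM's swing from around a tenth of the bill in the budget tier to nearly four-fifths in the ultra tier.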
The Speech-to-Speech Alternative
Speech-to-speech (S2S) models skip the STT and TTS steps entirely, processing audio end-to-end in a single model. The cost picture is dramatically different from cascaded pipelines.
| Provider | Model | Cost/min | Note |
|---|---|---|---|
| Google | Gemini 2.0 Flash Live | $0.00165 | Cheapest S2S option available |
| Google | Gemini 2.5 Flash Live | $0.01125 | 30 HD voices, 24 languages |
| OpenAI | Realtime API (mini) | $0.084 | $10/M in + $20/M out audio tokens |
| OpenAI | Realtime API (GPT-4o) | $0.30 | $100/M in + $200/M out audio tokens |
Gemini 2.0 Flash Live at $0.00165/min is cheaper than even the budget cascaded stack ($0.0095/min). That is a striking number. It means S2S is no longer the “expensive option,” at least not with Google’s pricing. There is a 182x spread between Gemini 2.0 Flash Live and OpenAI Realtime GPT-4o.
The critical caveat: S2S costs scale with conversation length due to context accumulation. Audio tokens accumulate throughout a conversation, so a 10-minute call costs more than 10x a 1-minute call. For short interactions, S2S can be extremely cost-effective. For long conversations, the math changes. We cover this in detail in our S2S vs Cascaded architecture comparison.
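A toy model makes the accumulation effect concrete. The per-token rates are the Realtime GPT-4o audio rates from the table; the 600-tokens-per-minute figure and the assumption that each minute re-bills the full accumulated context are illustrative simplifications, not OpenAI's exact session accounting:

```python
# Toy model: S2S cost vs. call length when audio context accumulates.
# ASSUMPTIONS: ~600 audio tokens generated per minute, and each minute the
# full accumulated context is re-billed as input. Illustrative only.
IN_RATE = 100 / 1_000_000   # $/audio input token (Realtime GPT-4o)
OUT_RATE = 200 / 1_000_000  # $/audio output token
TOK_PER_MIN = 600

def call_cost(minutes: int) -> float:
    cost, context = 0.0, 0
    for _ in range(minutes):
        context += TOK_PER_MIN  # context grows every minute of the call
        cost += context * IN_RATE + TOK_PER_MIN * OUT_RATE
    return cost

ratio = call_cost(10) / call_cost(1)
print(f"1 min: ${call_cost(1):.2f}, 10 min: ${call_cost(10):.2f}, {ratio:.0f}x")
```

Under these assumptions a 10-minute call costs 25x a 1-minute call, not 10x, which is the superlinear scaling the caveat describes.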
The Deflation Trend: Everything Is Getting Cheaper
Voice AI costs are falling fast across every component. Here is how pricing has moved from 2024 to 2025/2026:
| Component | 2024 Cost | 2025/26 Cost | Change |
|---|---|---|---|
| STT | $0.012–$0.024/min | $0.004–$0.010/min | -50% to -67% |
| TTS | $0.020–$0.040/min | $0.006–$0.030/min | -25% to -70% |
| LLM | $0.020–$0.050/min | $0.001–$0.020/min | -60% to -95% |
| Platform | $0.12/min | $0.07–$0.10/min | -17% to -42% |
LLM costs have fallen the most, down 60% to 95%. This is driven by new model releases (Gemini 2.0 Flash, GPT-4o mini) and intense competition, with 485+ models now available on OpenRouter alone. TTS deflation is driven by new entrants like Cartesia undercutting incumbents. STT deflation comes from model improvements and per-second billing replacing per-minute billing.
The implication: what costs $0.17/min today in an ultra-premium configuration will likely cost under $0.05/min within 18 months if these trends continue. If you are making build-vs-buy decisions, factor in that the “buy” option is getting cheaper faster than most projections assume.
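That projection can be sanity-checked with simple compound decay. The 60%-per-year rate below is an assumption taken from the low end of the observed LLM decline, not a forecast:

```python
# Sketch: the 18-month projection above, assuming costs keep falling at
# ~60% per year (ASSUMPTION: low end of the observed LLM decline).
annual_retention = 0.40  # a 60%/year decline retains 40% of the price
projected = 0.1727 * annual_retention ** (18 / 12)
print(f"${projected:.3f}/min")  # ~$0.044, under the $0.05 mark
```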
What 10,000 Minutes Actually Costs
To make these numbers concrete, here is the monthly bill for 10,000 minutes of voice agent usage at each tier. We include a $0.015/min telephony estimate (midpoint of Twilio/SignalWire range) and note platform fees separately.
| Tier | Components | + Telephony | Monthly (10K min) |
|---|---|---|---|
| Budget | $0.0095 | $0.0245 | $245 |
| Quality | $0.0297 | $0.0447 | $447 |
| Premium | $0.0377 | $0.0527 | $527 |
| Ultra | $0.1727 | $0.1877 | $1,877 |
| S2S (Gemini Live) | $0.00165 | $0.01665 | $167 |
These figures exclude platform fees. If you add a platform like Vapi ($0.05/min), Retell ($0.07/min), or Bland ($0.11/min), the monthly bill increases by $500 to $1,100 for 10,000 minutes. On Vapi with the quality-optimized stack, total monthly spend comes to roughly $947 for 10,000 minutes: $447 in components and telephony plus $500 in platform fees.
Gemini 2.0 Flash Live as a standalone S2S solution delivers the lowest total cost at $167/month for 10,000 minutes including telephony. This assumes short conversations where context accumulation remains manageable. For longer calls, multiply accordingly.
Practical Recommendations
Based on this data, here is what we recommend for builders at different stages:
- Prototyping: Start with the budget stack (AssemblyAI + Gemini 2.0 Flash + Cartesia) at $0.0095/min. At this cost, you can run thousands of test calls for under $100. Do not optimize prematurely.
- Production: Move to the quality-optimized stack (Deepgram + GPT-4o mini + PlayHT) at $0.0297/min. Deepgram’s 150–300ms latency and per-second billing make it the production workhorse. GPT-4o mini offers the best quality-to-cost ratio for general voice agents.
- Cost-constrained: Consider Gemini 2.0 Flash Live ($0.00165/min) for S2S if cost is your primary constraint and you can accept the tradeoffs around debugging, compliance, and context accumulation costs.
- Always: Calculate your fully loaded cost, not headline prices. Include telephony, platform fees, and realistic per-minute rates for your actual usage patterns. The difference between advertised and effective cost is often 3–6x.
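That last point can be made concrete with a minimal fully-loaded calculator. The rates are the quality-tier picks from this post's tables plus the $0.015/min telephony midpoint; the $0.07/min platform fee uses Retell's base rate as an illustrative example:

```python
# Sketch: fully loaded $/min = STT + LLM + TTS + telephony + platform fee.
# Rates below are the quality-tier picks from the tables above (assumptions).
def fully_loaded(stt: float, llm: float, tts: float,
                 telephony: float = 0.015, platform: float = 0.0) -> float:
    return stt + llm + tts + telephony + platform

diy = fully_loaded(0.0077, 0.002, 0.020)                     # self-assembled
hosted = fully_loaded(0.0077, 0.002, 0.020, platform=0.07)   # e.g. Retell base

print(f"DIY: ${diy:.4f}/min -> ${diy * 10_000:,.0f}/mo at 10K min")
print(f"Hosted: ${hosted:.4f}/min -> ${hosted * 10_000:,.0f}/mo at 10K min")
```

The same component mix lands at $447/month self-assembled versus $1,147/month behind a $0.07/min platform fee, which is exactly the kind of gap headline pricing hides.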
For the full benchmark data behind these numbers, see our Voice AI Benchmark Report 2026.
Sources
All pricing data verified as of March 4, 2026. Next scheduled verification: June 2026.
- Deepgram Pricing
- AssemblyAI Pricing Breakdown (Brass Transcripts)
- OpenAI API Pricing
- Google Cloud STT Pricing
- Azure Speech Pricing
- Cartesia Pricing
- ElevenLabs Pricing
- PlayHT Pricing
- Rime Pricing
- Google Cloud TTS Pricing
- Google AI Gemini Pricing
- Vapi Pricing
- Retell AI Pricing
- Bland AI Billing
- Twilio Voice Pricing
- SignalWire AI Agent Pricing
- Retell Voice AI Platform Pricing Comparison
- Softcery Voice AI Cost Calculator