
Speech-to-Speech vs Cascaded Pipelines: Which Architecture for Voice AI?

Published: March 2026 | Author: Speko Research Team

Data verified: March 4, 2026 · Next scheduled update: June 2026

Two Ways to Build a Voice AI Pipeline

There are two ways to build a voice AI pipeline. The first converts speech to text, processes the text with an LLM, then converts the text back to speech. The second processes audio directly, end to end, in a single model. Neither is strictly better.

The cascaded approach (STT → LLM → TTS) has dominated production deployments since voice agents became practical. It offers control, debuggability, and predictable costs. The speech-to-speech (S2S) approach is newer, faster, and increasingly cost-competitive, but it comes with real operational tradeoffs that most marketing materials do not mention.

This post compares both architectures using verified 2026 data. Every number comes from our benchmark dataset. We will cover how each works, what each costs, and when to use which.

How Cascaded Works

A cascaded pipeline processes voice in three discrete steps:

Audio → STT → Text → LLM → Text → TTS → Audio

The user speaks. A speech-to-text model transcribes the audio into text. That text goes to an LLM, which generates a response as text. The text response goes to a text-to-speech model, which synthesizes audio. The user hears the response.

Each component operates independently. You can swap Deepgram for AssemblyAI in the STT slot without touching anything else. You can switch from GPT-4o mini to Gemini 2.0 Flash for the LLM. You can replace ElevenLabs with Cartesia for TTS. This modularity is the defining advantage of cascaded pipelines.
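That modularity can be made concrete with typed interfaces. The sketch below is illustrative, not any vendor's SDK: `STT`, `LLM`, `TTS`, and `CascadedPipeline` are hypothetical names, and each slot accepts any implementation that satisfies its protocol.

```python
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def respond(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class CascadedPipeline:
    """Each stage is an independent, swappable component."""

    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts

    def turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt.transcribe(audio_in)  # Audio -> Text
        reply_text = self.llm.respond(transcript)   # Text -> Text
        return self.tts.synthesize(reply_text)      # Text -> Audio
```

Swapping Deepgram for AssemblyAI, or ElevenLabs for Cartesia, then becomes a one-line change at construction time rather than a rewrite of the pipeline.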

The text intermediate between each step is critical. It enables logging every word the user says and every word the agent responds with. It enables compliance auditing, analytics, content moderation, and debugging. When something goes wrong in a cascaded pipeline, you can inspect the exact text at each stage and identify which component failed.

Latency: A naive cascaded pipeline has 2–4 seconds of total latency. But with streaming optimizations (where STT streams partial transcripts to the LLM, and the LLM streams tokens to TTS as they are generated), sub-1-second end-to-end latency is achievable in production today.

Cost: Cascaded costs range from $0.0095/min (budget) to $0.1727/min (ultra-premium), and they are consistent regardless of conversation length. A 10-minute call costs exactly 10x a 1-minute call.

How Speech-to-Speech Works

A speech-to-speech model processes audio directly:

Audio → Single Model → Audio

There is no intermediate text representation. The model takes raw audio input and produces raw audio output. OpenAI’s Realtime API and Google’s Gemini Live are the two main production S2S options available today.

Because the model works directly with audio, it preserves information that text-based pipelines discard. Tone, emotion, pacing, emphasis, and hesitation all survive the processing step. The model can respond not just to what the user said, but to how they said it. Gemini 2.5 Flash Live takes this further with “affective dialogue,” interpreting tone, emotion, and pace from raw audio and adjusting its response accordingly.

Latency: S2S achieves an 85% reduction in latency compared to non-streaming cascaded pipelines in benchmarks. By eliminating the STT and TTS steps entirely, the only latency is the model’s own inference time. Both OpenAI Realtime and Gemini Live are designed for sub-second response times.

Cost: S2S pricing ranges from $0.00165/min (Gemini 2.0 Flash Live) to $0.30/min (OpenAI Realtime GPT-4o). But here is the catch: costs increase with conversation length because audio tokens accumulate in the model’s context window. A 10-minute call costs significantly more than 10x a 1-minute call.

Head-to-Head Comparison

| Dimension | Cascaded | S2S |
| --- | --- | --- |
| Latency | 2–4s naive, sub-1s with streaming | 85% reduction vs non-streaming cascaded |
| Cost predictability | Fixed per-minute, conversation-length independent | Increases with conversation length (context accumulation) |
| Debugging | Text intermediates at every stage | Audio in, audio out; limited inspection |
| Emotional nuance | Lost during text conversion | Preserved: tone, pacing, prosody |
| Component flexibility | Swap any component independently | Locked to one provider’s model |
| Compliance | Text logs for auditing, PII redaction | Requires additional transcription step |
| Provider options | 5+ STT, 7+ TTS, dozens of LLMs | OpenAI Realtime, Gemini Live |

The cascaded approach wins on control, transparency, and operational maturity. S2S wins on latency and emotional expressiveness. Neither advantage is trivial.

Cost Comparison with Real Numbers

The cost story is more nuanced than “S2S is expensive.” It depends entirely on which S2S model you use and how long your conversations run.

| Stack | Type | Cost/min | 5-min call |
| --- | --- | --- | --- |
| Gemini 2.0 Flash Live | S2S | $0.00165 | $0.008 |
| Budget cascaded | Cascaded | $0.0095 | $0.048 |
| Gemini 2.5 Flash Live | S2S | $0.01125 | $0.056 |
| Quality cascaded | Cascaded | $0.0297 | $0.149 |
| Premium cascaded | Cascaded | $0.0377 | $0.189 |
| OpenAI Realtime (mini) | S2S | $0.084 | $0.420 |
| Ultra cascaded | Cascaded | $0.1727 | $0.864 |
| OpenAI Realtime (GPT-4o) | S2S | $0.30 | $1.50 |

For a 5-minute call, Gemini 2.0 Flash Live costs $0.008. Budget cascaded costs $0.048. OpenAI Realtime GPT-4o costs $1.50. The cheapest S2S option is 6x cheaper than the cheapest cascaded option. But the most expensive S2S option is 31x more expensive than budget cascaded.
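The 5-minute figures and the ratios above follow directly from the per-minute rates. A minimal check, using the rates from the table (the `call_cost` helper and the dictionary keys are ours, and linear scaling is assumed as in the table):

```python
# Per-minute rates from the comparison table above (USD)
RATES_PER_MIN = {
    "Gemini 2.0 Flash Live (S2S)": 0.00165,
    "Budget cascaded": 0.0095,
    "OpenAI Realtime GPT-4o (S2S)": 0.30,
}


def call_cost(rate_per_min: float, minutes: float) -> float:
    # Assumes linear scaling, as the table does
    return rate_per_min * minutes


cheapest_s2s = RATES_PER_MIN["Gemini 2.0 Flash Live (S2S)"]
cheapest_cascaded = RATES_PER_MIN["Budget cascaded"]
priciest_s2s = RATES_PER_MIN["OpenAI Realtime GPT-4o (S2S)"]
```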

The 182x spread between Gemini 2.0 Flash Live ($0.00165/min) and OpenAI Realtime GPT-4o ($0.30/min) is the widest price gap in the entire voice AI stack. Choosing the wrong S2S provider is the single most expensive mistake you can make.

Remember that S2S costs increase with conversation length due to context accumulation. The 5-minute call figures above assume linear scaling, but in practice, S2S costs grow faster than linearly as audio tokens accumulate. For long-running customer support calls (15–30 minutes), this effect becomes significant and should be modeled explicitly before committing to S2S in production.
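One way to model the accumulation effect explicitly is to assume each minute's inference re-reads all prior audio tokens as context, which makes total input tokens grow quadratically with call length. This is an illustrative model with made-up parameters, not any provider's published billing formula:

```python
def s2s_cost_with_accumulation(minutes: int,
                               tokens_per_min: float,
                               usd_per_input_token: float) -> float:
    """Illustrative model (an assumption, not a provider formula):
    at minute m, the context holds ~m minutes of accumulated audio,
    so total billed input tokens = tokens_per_min * (1 + 2 + ... + n)."""
    total_tokens = sum(m * tokens_per_min for m in range(1, minutes + 1))
    return total_tokens * usd_per_input_token


def s2s_cost_linear(minutes: int,
                    tokens_per_min: float,
                    usd_per_input_token: float) -> float:
    # Naive linear extrapolation from a 1-minute call
    return minutes * tokens_per_min * usd_per_input_token
```

Under this model a 10-minute call bills 5.5x the tokens of the linear extrapolation, and a 30-minute call 15.5x, which is why long support calls deserve their own cost projection before you commit to S2S.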

Current Production Reality (2026)

As of early 2026, cascaded pipelines dominate enterprise voice AI deployments. This is not because S2S is bad. It is because cascaded pipelines solve problems that enterprises care about deeply: control, debuggability, and compliance.

When a voice agent says something incorrect to a customer, the first question from the product team is “what did it actually say?” In a cascaded pipeline, you have the exact text at every stage. In an S2S pipeline, you have audio in and audio out, and reconstructing what happened requires running a separate transcription step after the fact.

Compliance adds another dimension. In regulated industries (healthcare, finance, insurance), you may need to redact PII, log conversations for auditing, or prove that certain disclosures were made. Text intermediates make all of this straightforward. Audio-only pipelines require additional tooling.

S2S adoption remains limited by three factors: operational challenges (debugging audio-to-audio systems is harder), cost unpredictability (context accumulation makes long calls expensive), and fewer provider options (you are choosing between OpenAI and Google, whereas cascaded gives you 5+ STT providers, 7+ TTS providers, and dozens of LLMs).

That said, Google’s Gemini Live is changing the equation. At $0.00165/min for Gemini 2.0 Flash Live, the cost barrier is gone. If Google delivers on production reliability and the operational tooling catches up, S2S adoption could accelerate significantly in the second half of 2026. Gemini 2.5 Flash Live at $0.01125/min adds 30 HD voices in 24 languages with affective dialogue, interpreting tone, emotion, and pace from raw audio, which starts to address the quality gap as well.

Decision Framework

Based on the data, here is when to use each architecture:

Use cascaded if:

- You operate in a regulated industry and need text logs, PII redaction, or auditable disclosures
- You need to inspect exactly what the agent said at every stage when something goes wrong
- You want predictable, conversation-length-independent costs
- You want the freedom to swap STT, LLM, or TTS providers independently

Use S2S if:

- Sub-second latency is your top priority
- Emotional nuance (tone, pacing, prosody) is central to the experience
- Your conversations are short to medium length, keeping context-accumulation costs manageable
- Gemini 2.0 Flash Live's $0.00165/min pricing fits your quality requirements

Consider a hybrid approach:

Some teams are experimenting with using S2S for the initial response (where latency matters most, because the user is waiting) and switching to cascaded processing for complex tasks that require text-based reasoning, tool use, or structured output. This gets you the latency benefit of S2S where it matters most, with the control of cascaded where you need it.
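The hybrid routing described above can be sketched as a simple policy. Everything here is hypothetical: `HybridAgent` and its callable backends stand in for real S2S and cascaded integrations.

```python
from typing import Callable


class HybridAgent:
    """Route each turn to an S2S or cascaded backend (both are
    hypothetical callables taking and returning audio bytes)."""

    def __init__(self,
                 s2s_backend: Callable[[bytes], bytes],
                 cascaded_backend: Callable[[bytes], bytes]):
        self.s2s = s2s_backend
        self.cascaded = cascaded_backend
        self.turn = 0

    def handle(self, audio: bytes, needs_tools: bool = False) -> bytes:
        # Opening turn, no tools needed: the user is waiting, so
        # latency wins -> S2S. Later turns, tool use, or structured
        # output: control wins -> cascaded.
        use_s2s = self.turn == 0 and not needs_tools
        self.turn += 1
        backend = self.s2s if use_s2s else self.cascaded
        return backend(audio)
```

The routing condition is the interesting design surface: teams can key it on turn index, detected intent, or required output structure, but each added branch is more architecture to test and debug.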

Hybrid approaches add architectural complexity. They are worth considering if you have the engineering bandwidth, but they are not a requirement. For most teams shipping their first voice agent, picking one architecture and optimizing within it will produce better results than trying to bridge both.

The Bottom Line

If you are building a voice agent today and need production reliability, start with cascaded. The tooling is mature, the costs are predictable, and you can debug problems with text logs. A streaming cascaded pipeline with Deepgram + GPT-4o mini + Cartesia gives you sub-1-second latency at $0.0297/min, which is good enough for most production use cases.

If you are optimizing for cost or emotional expressiveness, evaluate Gemini 2.0 Flash Live seriously. At $0.00165/min, it eliminates cost as a concern entirely for short-to-medium conversations. But go in with your eyes open about the operational tradeoffs: limited debugging, fewer provider options, and cost growth on long calls.

The architecture decision is not permanent. As S2S tooling matures and costs continue falling across the board, the calculus will shift. Build with clean abstractions now, and you can migrate later when the tradeoffs change.

For the complete benchmark data referenced in this post, see our Voice AI Benchmark Report 2026. For a detailed breakdown of every cost component, see The Real Cost of Voice AI in 2026.

Sources

All data verified as of March 4, 2026. Next scheduled verification: June 2026.
