
Speech-to-Speech vs Cascaded Pipelines: Which Architecture for Voice AI?

Published: March 2026 | Author: Speko Research Team

Data verified: March 4, 2026 · Next scheduled update: June 2026

Two Ways to Build a Voice AI Pipeline

There are two ways to build a voice AI pipeline. The first converts speech to text, processes the text with an LLM, then converts the text back to speech. The second processes audio directly, end to end, in a single model. Neither is strictly better.

The cascaded approach (STT → LLM → TTS) has dominated production deployments since voice agents became practical. It offers control, debuggability, and predictable costs. The speech-to-speech (S2S) approach is newer, faster, and increasingly cost-competitive, but it comes with real operational tradeoffs that most marketing materials do not mention.

This post compares both architectures using verified 2026 data. Every number comes from our benchmark dataset. We will cover how each works, what each costs, and when to use which.

How Cascaded Works

A cascaded pipeline processes voice in three discrete steps:

Audio → STT → Text → LLM → Text → TTS → Audio

The user speaks. A speech-to-text model transcribes the audio into text. That text goes to an LLM, which generates a response as text. The text response goes to a text-to-speech model, which synthesizes audio. The user hears the response.

Each component operates independently. You can swap Deepgram for AssemblyAI in the STT slot without touching anything else. You can switch from GPT-4o mini to Gemini 2.0 Flash for the LLM. You can replace ElevenLabs with Cartesia for TTS. This modularity is the defining advantage of cascaded pipelines.
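That modularity can be made concrete with typed interfaces. The sketch below is illustrative, not any vendor's SDK: `STT`, `LLM`, `TTS`, and `CascadedPipeline` are hypothetical names, and each slot accepts any implementation that satisfies its protocol.

```python
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def respond(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class CascadedPipeline:
    """Each stage is an independent, swappable component."""

    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts

    def turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt.transcribe(audio_in)  # Audio -> Text
        reply_text = self.llm.respond(transcript)   # Text -> Text
        return self.tts.synthesize(reply_text)      # Text -> Audio
```

Swapping Deepgram for AssemblyAI, or ElevenLabs for Cartesia, then becomes a one-line change at construction time rather than a rewrite of the pipeline.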

The text intermediate between each step is critical. It enables logging every word the user says and every word the agent responds with. It enables compliance auditing, analytics, content moderation, and debugging. When something goes wrong in a cascaded pipeline, you can inspect the exact text at each stage and identify which component failed.

Latency: A naive cascaded pipeline has 2–4 seconds of total latency. But with streaming optimizations (where STT streams partial transcripts to the LLM, and the LLM streams tokens to TTS as they are generated), sub-1-second end-to-end latency is achievable in production today.

Cost: Cascaded costs range from $0.0095/min (budget) to $0.1727/min (ultra-premium), and they are consistent regardless of conversation length. A 10-minute call costs exactly 10x a 1-minute call.

How Speech-to-Speech Works

A speech-to-speech model processes audio directly:

Audio → Single Model → Audio

There is no intermediate text representation. The model takes raw audio input and produces raw audio output. OpenAI’s Realtime API and Google’s Gemini Live are the two main production S2S options available today.

Because the model works directly with audio, it preserves information that text-based pipelines discard. Tone, emotion, pacing, emphasis, and hesitation all survive the processing step. The model can respond not just to what the user said, but to how they said it. Gemini 2.5 Flash Live takes this further with “affective dialogue,” interpreting tone, emotion, and pace from raw audio and adjusting its response accordingly.

Latency: S2S achieves an 85% reduction in latency compared to non-streaming cascaded pipelines in benchmarks. By eliminating the STT and TTS steps entirely, the only latency is the model’s own inference time. Both OpenAI Realtime and Gemini Live are designed for sub-second response times.

Cost: S2S pricing ranges from $0.00165/min (Gemini 2.0 Flash Live) to $0.30/min (OpenAI Realtime GPT-4o). But here is the catch: costs increase with conversation length because audio tokens accumulate in the model’s context window. A 10-minute call costs significantly more than 10x a 1-minute call.

Head-to-Head Comparison

| Dimension | Cascaded | S2S |
| --- | --- | --- |
| Latency | 2–4s naive, sub-1s with streaming | 85% reduction vs non-streaming cascaded |
| Cost predictability | Fixed per-minute, conversation-length independent | Increases with conversation length (context accumulation) |
| Debugging | Text intermediates at every stage | Audio in, audio out; limited inspection |
| Emotional nuance | Lost during text conversion | Preserved: tone, pacing, prosody |
| Component flexibility | Swap any component independently | Locked to one provider’s model |
| Compliance | Text logs for auditing, PII redaction | Requires additional transcription step |
| Provider options | 5+ STT, 7+ TTS, dozens of LLMs | OpenAI Realtime, Gemini Live |

The cascaded approach wins on control, transparency, and operational maturity. S2S wins on latency and emotional expressiveness. Neither advantage is trivial.

Cost Comparison with Real Numbers

The cost story is more nuanced than “S2S is expensive.” It depends entirely on which S2S model you use and how long your conversations run.

| Stack | Type | Cost/min | 5-min call |
| --- | --- | --- | --- |
| Gemini 2.0 Flash Live | S2S | $0.00165 | $0.008 |
| Budget cascaded | Cascaded | $0.0095 | $0.048 |
| Gemini 2.5 Flash Live | S2S | $0.01125 | $0.056 |
| Quality cascaded | Cascaded | $0.0297 | $0.149 |
| Premium cascaded | Cascaded | $0.0377 | $0.189 |
| OpenAI Realtime (mini) | S2S | $0.084 | $0.420 |
| Ultra cascaded | Cascaded | $0.1727 | $0.864 |
| OpenAI Realtime (GPT-4o) | S2S | $0.30 | $1.50 |

For a 5-minute call, Gemini 2.0 Flash Live costs $0.008. Budget cascaded costs $0.048. OpenAI Realtime GPT-4o costs $1.50. The cheapest S2S option is 6x cheaper than the cheapest cascaded option. But the most expensive S2S option is 31x more expensive than budget cascaded.
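The 5-minute figures and the ratios above follow directly from the per-minute rates. A minimal check, using the rates from the table (the `call_cost` helper and the dictionary keys are ours, and linear scaling is assumed as in the table):

```python
# Per-minute rates from the comparison table above (USD)
RATES_PER_MIN = {
    "Gemini 2.0 Flash Live (S2S)": 0.00165,
    "Budget cascaded": 0.0095,
    "OpenAI Realtime GPT-4o (S2S)": 0.30,
}


def call_cost(rate_per_min: float, minutes: float) -> float:
    # Assumes linear scaling, as the table does
    return rate_per_min * minutes


cheapest_s2s = RATES_PER_MIN["Gemini 2.0 Flash Live (S2S)"]
cheapest_cascaded = RATES_PER_MIN["Budget cascaded"]
priciest_s2s = RATES_PER_MIN["OpenAI Realtime GPT-4o (S2S)"]
```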

The 182x spread between Gemini 2.0 Flash Live ($0.00165/min) and OpenAI Realtime GPT-4o ($0.30/min) is the widest price gap in the entire voice AI stack. Choosing the wrong S2S provider is the single most expensive mistake you can make.

Remember that S2S costs increase with conversation length due to context accumulation. The 5-minute call figures above assume linear scaling, but in practice, S2S costs grow faster than linearly as audio tokens accumulate. For long-running customer support calls (15–30 minutes), this effect becomes significant and should be modeled explicitly before committing to S2S in production.
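One way to model the accumulation effect explicitly is to assume each minute's inference re-reads all prior audio tokens as context, which makes total input tokens grow quadratically with call length. This is an illustrative model with made-up parameters, not any provider's published billing formula:

```python
def s2s_cost_with_accumulation(minutes: int,
                               tokens_per_min: float,
                               usd_per_input_token: float) -> float:
    """Illustrative model (an assumption, not a provider formula):
    at minute m, the context holds ~m minutes of accumulated audio,
    so total billed input tokens = tokens_per_min * (1 + 2 + ... + n)."""
    total_tokens = sum(m * tokens_per_min for m in range(1, minutes + 1))
    return total_tokens * usd_per_input_token


def s2s_cost_linear(minutes: int,
                    tokens_per_min: float,
                    usd_per_input_token: float) -> float:
    # Naive linear extrapolation from a 1-minute call
    return minutes * tokens_per_min * usd_per_input_token
```

Under this model a 10-minute call bills 5.5x the tokens of the linear extrapolation, and a 30-minute call 15.5x, which is why long support calls deserve their own cost projection before you commit to S2S.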

Current Production Reality (2026)

As of early 2026, cascaded pipelines dominate enterprise voice AI deployments. This is not because S2S is bad. It is because cascaded pipelines solve problems that enterprises care about deeply: control, debuggability, and compliance.

When a voice agent says something incorrect to a customer, the first question from the product team is “what did it actually say?” In a cascaded pipeline, you have the exact text at every stage. In an S2S pipeline, you have audio in and audio out, and reconstructing what happened requires running a separate transcription step after the fact.

Compliance adds another dimension. In regulated industries (healthcare, finance, insurance), you may need to redact PII, log conversations for auditing, or prove that certain disclosures were made. Text intermediates make all of this straightforward. Audio-only pipelines require additional tooling.

S2S adoption remains limited by three factors: operational challenges (debugging audio-to-audio systems is harder), cost unpredictability (context accumulation makes long calls expensive), and fewer provider options (you are choosing between OpenAI and Google, whereas cascaded gives you 5+ STT providers, 7+ TTS providers, and dozens of LLMs).

That said, Google’s Gemini Live is changing the equation. At $0.00165/min for Gemini 2.0 Flash Live, the cost barrier is gone. If Google delivers on production reliability and the operational tooling catches up, S2S adoption could accelerate significantly in the second half of 2026. Gemini 2.5 Flash Live at $0.01125/min adds 30 HD voices in 24 languages with affective dialogue, interpreting tone, emotion, and pace from raw audio, which starts to address the quality gap as well.

Decision Framework

Based on the data, here is when to use each architecture:

Use cascaded if:

- You operate in a regulated industry and need text logs, PII redaction, or auditable disclosures
- You need to inspect exactly what the agent said at every stage when something goes wrong
- You want predictable, conversation-length-independent costs
- You want the freedom to swap STT, LLM, or TTS providers independently

Use S2S if:

- Sub-second latency is your top priority
- Emotional nuance (tone, pacing, prosody) is central to the experience
- Your conversations are short to medium length, keeping context-accumulation costs manageable
- Gemini 2.0 Flash Live's $0.00165/min pricing fits your quality requirements

Consider a hybrid approach:

Some teams are experimenting with using S2S for the initial response (where latency matters most, because the user is waiting) and switching to cascaded processing for complex tasks that require text-based reasoning, tool use, or structured output. This gets you the latency benefit of S2S where it matters most, with the control of cascaded where you need it.
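The hybrid routing described above can be sketched as a simple policy. Everything here is hypothetical: `HybridAgent` and its callable backends stand in for real S2S and cascaded integrations.

```python
from typing import Callable


class HybridAgent:
    """Route each turn to an S2S or cascaded backend (both are
    hypothetical callables taking and returning audio bytes)."""

    def __init__(self,
                 s2s_backend: Callable[[bytes], bytes],
                 cascaded_backend: Callable[[bytes], bytes]):
        self.s2s = s2s_backend
        self.cascaded = cascaded_backend
        self.turn = 0

    def handle(self, audio: bytes, needs_tools: bool = False) -> bytes:
        # Opening turn, no tools needed: the user is waiting, so
        # latency wins -> S2S. Later turns, tool use, or structured
        # output: control wins -> cascaded.
        use_s2s = self.turn == 0 and not needs_tools
        self.turn += 1
        backend = self.s2s if use_s2s else self.cascaded
        return backend(audio)
```

The routing condition is the interesting design surface: teams can key it on turn index, detected intent, or required output structure, but each added branch is more architecture to test and debug.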

Hybrid approaches add architectural complexity. They are worth considering if you have the engineering bandwidth, but they are not a requirement. For most teams shipping their first voice agent, picking one architecture and optimizing within it will produce better results than trying to bridge both.

The Bottom Line

If you are building a voice agent today and need production reliability, start with cascaded. The tooling is mature, the costs are predictable, and you can debug problems with text logs. A streaming cascaded pipeline with Deepgram + GPT-4o mini + Cartesia gives you sub-1-second latency at $0.0297/min, which is good enough for most production use cases.

If you are optimizing for cost or emotional expressiveness, evaluate Gemini 2.0 Flash Live seriously. At $0.00165/min, it eliminates cost as a concern entirely for short-to-medium conversations. But go in with your eyes open about the operational tradeoffs: limited debugging, fewer provider options, and cost growth on long calls.

The architecture decision is not permanent. As S2S tooling matures and costs continue falling across the board, the calculus will shift. Build with clean abstractions now, and you can migrate later when the tradeoffs change.

For the complete benchmark data referenced in this post, see our Voice AI Benchmark Report 2026. For a detailed breakdown of every cost component, see The Real Cost of Voice AI in 2026.

Sources

All data verified as of March 4, 2026. Next scheduled verification: June 2026.
