Humans tolerate about 200 ms of silence in natural conversation. Research on conversational AI usability finds that users unconsciously perceive delays past ~300 ms and consciously notice them past 500 ms, with abandonment spiking beyond one second. OpenAI’s own Realtime telemetry shows the same behavioral cliff: under ~550 ms people talk normally; over 800 ms they start repeating themselves. If your voice agent feels off and you don’t know why, it is almost always latency. This is the playbook we use at Speko to push a cascaded pipeline under 500 ms voice-to-voice — and where the architecture itself has to change when you cannot.

The pipeline, with a latency budget

A cascaded voice agent has six serial hops. Every one of them charges you milliseconds:

[Mic] -> [VAD / endpointing] -> [STT streaming] -> [LLM] -> [TTS streaming] -> [Speaker]
   ~20ms        300-500ms           150-300ms      200-500ms      75-200ms        ~30ms

Add them up honestly and you are already at 775 ms to 1,550 ms before the first word of synthesized audio leaves your server. Sub-500 ms voice-to-voice is not achieved by making one stage fast. It is achieved by overlapping stages and cutting the two biggest offenders: endpointing and LLM time-to-first-token.
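To make the arithmetic explicit, here is a minimal sketch that sums the per-stage ranges from the diagram above as if every stage ran strictly in series. The stage keys and the helper names `serial_budget` and `largest_offenders` are ours, chosen for illustration; the numbers are the ones in the diagram.

```python
# Per-stage latency ranges in milliseconds, copied from the pipeline diagram.
# Fixed-cost stages (mic, speaker) use the same value for both bounds.
STAGES = {
    "mic": (20, 20),
    "vad_endpointing": (300, 500),
    "stt_streaming": (150, 300),
    "llm": (200, 500),
    "tts_streaming": (75, 200),
    "speaker": (30, 30),
}


def serial_budget(stages):
    """Sum best-case and worst-case latency when no stages overlap."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def largest_offenders(stages, n=2):
    """Return the n stage names with the highest worst-case latency."""
    return sorted(stages, key=lambda name: stages[name][1], reverse=True)[:n]


best, worst = serial_budget(STAGES)
print(f"serial budget: {best}-{worst} ms")
print("biggest offenders:", largest_offenders(STAGES))
```

Even the best case of a strictly serial pipeline sits well above 500 ms, which is why the rest of this playbook is about overlap rather than per-stage tuning.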

Endpointing is the silent killer

The biggest latency hit in almost every shipped agent is not the model. It is the VAD waiting to declare the user done talking. Most frameworks ship with a 500-800 ms silence threshold, and reducing it naively causes the agent to barge in on natural pauses. This is where teams burn the most budget. Turn-taking research across production platforms converges on a 300-500 ms silence threshold as the safe floor — go lower and you clip the user; stay higher and you lose half your sub-500 ms budget to dead air.

Two tactics buy you back ~150-250 ms here. First, use a semantic endpointer, not a silence-only VAD — predict end-of-turn from the partial transcript’s syntactic completeness, so a user trailing off mid-sentence waits longer than one who just said “okay, book it.” Second, start the LLM request on partial transcripts rather than waiting for the final. You will occasionally throw away a speculative completion, but you shave 200-400 ms off the critical path because the LLM’s TTFT now runs during the endpointing window, not after it.
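As a rough illustration of the first tactic, the sketch below picks a silence threshold from the partial transcript. It is a crude lexical stand-in, not a real semantic endpointer: production systems would typically use a trained end-of-turn model. The word list, the constants, and the function names are all assumptions made for this example; only the 300 ms floor and 800 ms upper bound echo the ranges quoted above.

```python
import re

# Hypothetical tuning values for illustration.
MIN_SILENCE_MS = 300   # transcript looks complete: respond quickly
MAX_SILENCE_MS = 800   # transcript looks unfinished: give the user more time

# Illustrative list of words that often signal an unfinished sentence.
TRAILING_WORDS = {
    "and", "but", "or", "so", "because", "to", "the", "a",
    "with", "for", "um", "uh", "like",
}


def required_silence_ms(partial_transcript: str) -> int:
    """Choose how long to wait based on how complete the transcript looks."""
    words = re.findall(r"[a-z']+", partial_transcript.lower())
    if not words:
        return MAX_SILENCE_MS      # nothing recognized yet, keep listening
    if words[-1] in TRAILING_WORDS:
        return MAX_SILENCE_MS      # speaker is probably trailing off mid-sentence
    return MIN_SILENCE_MS


def is_end_of_turn(partial_transcript: str, silence_ms: int) -> bool:
    """True once the observed silence exceeds the transcript-dependent threshold."""
    return silence_ms >= required_silence_ms(partial_transcript)


print(is_end_of_turn("okay, book it", 320))
print(is_end_of_turn("I want to book a flight to", 320))
```

The design point is that the threshold becomes a function of the transcript rather than a single global constant, so short confirmations get a fast response without clipping users who pause mid-thought.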

STT: streaming beats batch by an order of magnitude

A batch STT call on a 10-second utterance cannot return until the utterance ends, and then adds its own processing time (real-time factor times audio length) on top. Streaming STT returns interim transcripts every 100-250 ms. Deepgram Nova-3 reports sub-300 ms time-to-first-token on streaming, with US-region deployments hitting ~150 ms. Switching from batch to streaming is typically a 500-1,500 ms win on its own, and it is the prerequisite for speculative LLM calls.
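Speculative LLM calls on interim transcripts can be sketched with plain `asyncio`. This is a minimal illustration under stated assumptions: `fake_llm` is a stand-in for whatever streaming LLM client you use, and `SpeculativeResponder`, `on_interim`, and `on_final` are hypothetical names, not part of any STT or LLM SDK.

```python
import asyncio


async def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (hypothetical); sleeps to simulate latency."""
    await asyncio.sleep(0.05)
    return f"reply to: {prompt}"


class SpeculativeResponder:
    """Start an LLM request on each interim transcript; reuse it if the final matches."""

    def __init__(self, llm=fake_llm):
        self._llm = llm
        self._task = None
        self._prompt = None

    def on_interim(self, transcript: str) -> None:
        """Call from inside a running event loop whenever STT emits an interim result."""
        if transcript == self._prompt:
            return  # a request for this exact text is already in flight
        self._cancel()
        self._prompt = transcript
        self._task = asyncio.create_task(self._llm(transcript))

    async def on_final(self, transcript: str) -> str:
        """Return the reply, reusing the speculative request when the text matches."""
        if self._task is None or transcript != self._prompt:
            # Speculation missed: discard it and pay the full LLM latency.
            self._cancel()
            self._prompt = transcript
            self._task = asyncio.create_task(self._llm(transcript))
        return await self._task

    def _cancel(self) -> None:
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None


async def demo() -> str:
    responder = SpeculativeResponder()
    responder.on_interim("book a table")
    responder.on_interim("book a table for two")
    return await responder.on_final("book a table for two")


print(asyncio.run(demo()))
```

When the last interim transcript matches the final one, the LLM request is already running by the time the endpointer fires. When it does not match, the speculative request is cancelled and you pay for tokens you never used, so it is worth tracking the hit rate.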

Second-order STT tactics: co-locate your STT and LLM in the same region (cross-region round-trips add 60-120 ms per hop), disable unnecessary features like language detection when you already know the language, and keep a persistent WebSocket open instead of opening a fresh HTTP connection per turn. The TLS handshake alone costs 100-200 ms on cold connections.

LLM: TTFT is the number that matters, not tokens/sec

For voice, throughput is almost irrelevant once you are streaming. What matters is time-to-first-token (TTFT). Gemini 2.5 Flash-Lite reports ~240 ms TTFT; the full Flash model reports ~370 ms. Claude Haiku is in the same band on Anthropic-native infra. GPT-4o-mini and GPT-4.1-mini land around 300-500 ms depending on region and prompt size.

The win here is not “pick a faster model”. It is three things: keep your system prompt small (every 1,000 tokens of prompt adds measurable prefill time), cache it aggressively (prompt caching on Anthropic and OpenAI cuts prefill by 60-80% on repeat turns), and stream output tokens into the TTS as they arrive. Do not wait for the full response. The TTS should start speaking the first clause while the LLM is still generating the second.
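If you want to track TTFT separately from total generation time, a small wrapper around the token stream is enough. The sketch below is provider-agnostic: `measure_ttft` and `fake_stream` are hypothetical names, and `tokens` is assumed to be a lazy iterator so that the request is actually issued inside the timed region.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a token stream and return (ttft_ms, total_ms, full_text)."""
    start = clock()
    first = None
    parts = []
    for token in tokens:
        if first is None:
            first = clock()        # timestamp of the first token only
        parts.append(token)
    end = clock()
    # An empty stream has no first token; report total time as TTFT in that case.
    ttft_ms = ((first if first is not None else end) - start) * 1000
    return ttft_ms, (end - start) * 1000, "".join(parts)


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM response (hypothetical)."""
    time.sleep(0.05)               # simulated prefill before the first token
    for token in ["Sure", ",", " booking", " now", "."]:
        time.sleep(0.01)
        yield token


ttft_ms, total_ms, text = measure_ttft(fake_stream())
print(f"TTFT {ttft_ms:.0f} ms, total {total_ms:.0f} ms, text={text!r}")
```

Logging both numbers per turn makes it obvious when a prompt change has increased prefill time even though tokens/sec looks unchanged.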

TTS: sentence-chunked streaming, not utterance

ElevenLabs Flash reports ~75 ms model inference and 150-200 ms end-to-end TTFB on optimal paths. Cartesia Sonic and PlayHT Turbo sit in the same range. The anti-pattern we see most often: teams call the non-streaming TTS endpoint, wait for the full MP3, then play it. That single decision adds 500-2,000 ms depending on response length, and it is the most common reason a voice agent feels sluggish when every individual metric looks fine in isolation.

Stream TTS by clause, splitting at the first punctuation boundary. Send the first chunk the moment the LLM emits a comma or period. Pair this with WebRTC (not HTTP long-poll) for the audio path: WebRTC’s jitter buffer and Opus encoding shave another 80-150 ms off perceived delay versus chunked HTTP.
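Clause chunking is simple to sketch as a generator that buffers LLM tokens and flushes at punctuation. The function name `clause_chunks`, the boundary set, and the `min_chars` parameter are assumptions for this example; `min_chars` exists because very short chunks may sound choppy with some TTS engines, and the right value depends on yours.

```python
from typing import Iterable, Iterator

# Characters treated as clause boundaries (illustrative choice).
BOUNDARY_CHARS = (",", ".", ";", ":", "!", "?")


def clause_chunks(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into clauses that can be sent to TTS right away."""
    buffer = ""
    for token in tokens:
        buffer += token
        candidate = buffer.rstrip()
        # Flush once the buffer ends on punctuation and is long enough to speak.
        if len(candidate) >= min_chars and candidate.endswith(BOUNDARY_CHARS):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()       # flush whatever remains when the stream ends


llm_tokens = ["Sure", ",", " I", " can", " book", " that", ".", " What", " time", "?"]
for chunk in clause_chunks(llm_tokens):
    print(repr(chunk))
```

This naive splitter will also break on punctuation inside numbers or abbreviations if the tokenizer happens to end a token there, so a production version would need a smarter boundary check.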

When cascaded hits its floor: go end-to-end

A well-tuned cascaded stack can hit 450-700 ms voice-to-voice in-region. Below that, you are fighting physics. Kyutai’s Moshi reports 160 ms theoretical and ~200 ms practical end-to-end latency because it collapses STT+LLM+TTS into a single speech-native model with full-duplex audio streaming. OpenAI Realtime and Gemini Live sit between these two regimes — OpenAI reports ~480-520 ms voice-to-voice from the US. If your use case needs interruption handling and sub-400 ms, cascaded is not the right architecture.

What Speko recommends

For a conversational agent, budget this way: 300 ms endpointing, 200 ms STT+LLM overlap, 150 ms TTS TTFB, 50 ms network and playback. That is a 700 ms ceiling, with 500 ms as the target. For command-style agents where the user expects a beat of thinking, 800 ms is fine. Below 400 ms voice-to-voice, stop optimizing cascaded and evaluate a speech-to-speech (S2S) model.

We benchmark every major STT, TTS, and S2S provider on exactly these numbers, measured end-to-end in production conditions.

Sources: Deepgram latency docs, ElevenLabs latency optimization, OpenAI Realtime API: The Missing Manual, Moshi paper (Kyutai), Vapi speech latency guide, TringTring 500ms threshold analysis.

Cutting voice agent latency to sub-500ms — a practical playbook