Voice Agent Engineering · Empirical

We Tried to Break Four Voice Agents with a Cough. We Failed.

Published: April 2026 | Author: Speko Research Team

The canonical voice-agent-is-broken anecdote goes like this: a user coughs, the agent thinks they barged in, it stops mid-sentence, apologizes, and the thread is lost. This is the story everyone tells at conferences. So we tried to reproduce it. We fed the same 400 ms cough (attenuated 6 dB, fired 1.2 seconds into the agent’s response) into four production stacks at once, in parallel LiveKit rooms, and recorded the listener’s side of each call.

Nothing broke. Four for four absorbed the cough and kept answering. Listen:

OpenAI Realtime — stock VAD

working

gpt-realtime with default server VAD (threshold 0.5, silence 500 ms). Cough fires ~1.2 s into the agent's response.

Agent absorbs the cough and keeps speaking.

OpenAI Realtime — tuned VAD

working

Same model, VAD bumped to threshold 0.65, silence_duration_ms 800.

No observable difference from stock at this stimulus level.

Cascaded — Deepgram Nova-3

working

STT: Deepgram Nova-3 · LLM: GPT-4o-mini · TTS: ElevenLabs Flash v2.5.

Cascaded stack with a different STT also holds together.

Cascaded — ElevenLabs Scribe v2

working

Same pipeline, STT swapped to ElevenLabs Scribe v2 Realtime.

Four-for-four. Nothing broke at a 400 ms cough, -6 dB, 1.2 s into the response.

Same caller audio, same cough stimulus, same timing. Four different stacks. Zero false interrupts. The cough is ducked 15 dB under the agent’s voice in the mix so you can hear what the listener would have heard on the other end of the call.
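For anyone reproducing the tuned run, the VAD settings quoted above can be written down as a small config sketch. `turn_detection`, `server_vad`, `threshold`, and `silence_duration_ms` are the Realtime API's option names, but where that object sits inside the `session.update` event has changed between API versions, so treat the nesting below as an assumption and check it against the current reference. The helper names are ours, for illustration only.

```python
import json

# Stock values as quoted in this post.
STOCK_VAD = {"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}


def tuned_vad(threshold: float = 0.65, silence_duration_ms: int = 800) -> dict:
    """Return a copy of the stock settings with the tuned values applied."""
    return {**STOCK_VAD, "threshold": threshold,
            "silence_duration_ms": silence_duration_ms}


def session_update_event(turn_detection: dict) -> str:
    """Serialize a session.update event (nesting is an assumption, see above)."""
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": turn_detection},
    })


event = session_update_event(tuned_vad())
```

The resulting JSON string would be sent over the already-open Realtime connection before the first response starts.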

This isn’t an indictment of the anecdote. It is an artifact of how aggressively modern voice-agent platforms have engineered around it. The reason four stacks with very different internals all shrugged off the same cough is that every one of them ships with the same set of defenses: hysteresis on the voice-activity detector, an acoustic echo canceller with a double-talk detector, and a minimum-duration gate before an interrupt fires. Take any one of those out and a false interrupt becomes easy to trigger. Leave them in, which everyone does now, and you have to work much harder than a single cough to cause one.

The three defenses

VAD with hysteresis. The voice-activity detector doesn’t flip to “user is speaking” on a single frame. Silero (or OpenAI’s server VAD) requires ~80 ms of continuous high-confidence speech frames before declaring an interrupt. A 400 ms cough has the wrong envelope — a sharp attack and immediate decay — so it doesn’t sustain the threshold. WebRTC’s 2011 GMM VAD would have fired on it. Modern DNN VADs don’t.
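A minimal sketch of that gating logic, assuming the VAD model emits one speech probability per 20 ms frame. The class name, thresholds, and hangover time are illustrative choices, not Silero's or OpenAI's actual internals; only the ~80 ms sustain requirement comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class HysteresisGate:
    """Turn per-frame speech probabilities into a debounced speaking flag."""
    enter_threshold: float = 0.5   # probability needed to count as speech
    exit_threshold: float = 0.35   # lower bar for staying in the speaking state
    frame_ms: int = 20             # assumed frame duration
    min_speech_ms: int = 80        # sustained speech required before triggering
    hangover_ms: int = 200         # sustained quiet required before releasing
    speaking: bool = False
    _run_ms: int = 0               # current streak toward a state flip

    def update(self, prob: float) -> bool:
        if not self.speaking:
            # Count consecutive frames above the entry bar; any miss resets.
            hit = prob >= self.enter_threshold
            self._run_ms = self._run_ms + self.frame_ms if hit else 0
            if self._run_ms >= self.min_speech_ms:
                self.speaking, self._run_ms = True, 0
        else:
            # Count consecutive frames below the (lower) exit bar.
            quiet = prob < self.exit_threshold
            self._run_ms = self._run_ms + self.frame_ms if quiet else 0
            if self._run_ms >= self.hangover_ms:
                self.speaking, self._run_ms = False, 0
        return self.speaking


# Invented probability traces, for illustration only.
burst = [0.9, 0.8, 0.2, 0.1]            # sharp attack, quick decay
sustained = [0.9, 0.9, 0.9, 0.9, 0.9]   # held above the entry bar

burst_gate = HysteresisGate()
burst_fired = any(burst_gate.update(p) for p in burst)

speech_gate = HysteresisGate()
speech_fired = any(speech_gate.update(p) for p in sustained)
```

With these made-up traces the short burst never accumulates 80 ms above the entry threshold, while the sustained run does. The split between entry and exit thresholds is what keeps the flag from chattering once speech has started.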

Acoustic echo cancellation with double-talk detection. The agent’s own TTS coming through the caller’s speaker and looping back through the mic would trigger a self-interrupt every few seconds without AEC. Every real voice-agent platform ships this on by default — WebRTC’s AEC3 in the browser, server-side AEC in LiveKit/Daily, proprietary implementations in Retell and Vapi. The cough survived it because the cough came from the mic, not the reference signal; AEC had nothing to subtract.
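One classic way to detect double-talk is the Geigel detector, which compares the microphone level against the recent peak of the far-end reference. The sketch below is illustrative only: it is not what AEC3 or any platform named above necessarily ships, and the threshold and window length are assumptions.

```python
from collections import deque


class GeigelDetector:
    """Flag near-end activity when the mic exceeds a fraction of the recent reference peak."""

    def __init__(self, threshold: float = 0.5, window: int = 256):
        # threshold 0.5 assumes the echo path attenuates by roughly 6 dB or more
        self.threshold = threshold
        # recent far-end (TTS reference) sample magnitudes
        self.ref_history = deque(maxlen=window)

    def update(self, ref_sample: float, mic_sample: float) -> bool:
        self.ref_history.append(abs(ref_sample))
        ref_peak = max(self.ref_history)
        # Pure echo should stay below threshold * ref_peak; anything louder
        # is attributed to a near-end source such as speech or a cough.
        return abs(mic_sample) > self.threshold * ref_peak


detector = GeigelDetector(threshold=0.5, window=4)
echo_only = detector.update(ref_sample=1.0, mic_sample=0.3)
near_end = detector.update(ref_sample=1.0, mic_sample=0.9)
```

In a full canceller, a positive result typically freezes filter adaptation so the near-end sound passes through untouched, which is why a cough reaches the VAD stage intact.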

Interrupt debounce before the TTS cancels. Even if the VAD fires, cascaded platforms wait for the STT to return a non-empty token from the caller’s audio before canceling TTS. That round trip through STT and the provider’s turn-taking model adds another 200 to 400 ms of buffer. A cough transcribes to silence or garbage, so the interrupt doesn’t actually land.
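A rough sketch of that confirmation step, assuming timestamps in milliseconds and a transcript callback from the STT. The class, method names, and the 400 ms window are hypothetical; real platforms also layer a turn-taking model on top, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InterruptDebouncer:
    """Confirm a VAD trigger only if a non-empty transcript arrives in time."""
    window_ms: int = 400                     # assumed upper bound on the STT round trip
    pending_since_ms: Optional[int] = None   # time of the unconfirmed VAD trigger

    def on_vad_trigger(self, now_ms: int) -> None:
        # Remember the first trigger; later ones within the window are ignored.
        if self.pending_since_ms is None:
            self.pending_since_ms = now_ms

    def on_transcript(self, text: str, now_ms: int) -> bool:
        """Return True if TTS should be cancelled."""
        if self.pending_since_ms is None:
            return False
        expired = now_ms - self.pending_since_ms > self.window_ms
        confirmed = bool(text.strip()) and not expired
        if confirmed or expired:
            self.pending_since_ms = None     # resolve the pending trigger
        return confirmed


debouncer = InterruptDebouncer()
debouncer.on_vad_trigger(now_ms=1000)
cough_result = debouncer.on_transcript("", now_ms=1200)       # empty transcript
speech_result = debouncer.on_transcript("wait", now_ms=1300)  # real words
```

An empty transcript leaves playback running, while a non-empty one inside the window cancels it. A cough that happens to transcribe to a stray token would still pass this check, which is one reason production stacks add a turn-taking model behind it.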

Each defense handles a different category of false positive. Remove the VAD hysteresis and every sharp noise triggers. Remove the AEC and the agent talks over itself. Remove the debounce and you get interrupts on throat-clears. Everyone building voice agents in 2026 ships all three, which is why you have to work much harder than a single cough to actually break the system.

The minimum viable stack

If you’re building on top of OpenAI Realtime, Gemini Live, Retell, Vapi, or any LiveKit-based stack, you inherit these defenses. If you’re rolling raw voice infrastructure, you need all of them yourself:

AEC with double-talk detection removes echo before noise suppression, Silero VAD gates the mic, and the interrupt path cancels TTS within one audio frame while preserving the unspoken tail for the next turn.
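The last piece, cancelling within one frame while keeping the unspoken tail, can be sketched as a playout queue that checks a cancel flag between frames. This is a simplified model under our own assumptions: strings stand in for audio frames, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TtsPlayout:
    """Frame-by-frame TTS playout that can be cancelled between frames."""
    frames: list = field(default_factory=list)  # queued frames (strings stand in for audio)
    cursor: int = 0                              # index of the next frame to play
    cancelled: bool = False

    def next_frame(self) -> Optional[str]:
        """Return the next frame to play, or None when finished or cancelled."""
        if self.cancelled or self.cursor >= len(self.frames):
            return None
        frame = self.frames[self.cursor]
        self.cursor += 1
        return frame

    def interrupt(self) -> list:
        """Stop playout and return the frames that were never spoken."""
        self.cancelled = True
        return self.frames[self.cursor:]


playout = TtsPlayout(frames=["Your", "order", "ships", "Tuesday"])
spoken = [playout.next_frame(), playout.next_frame()]
unspoken_tail = playout.interrupt()
```

Because the flag is checked on every `next_frame` call, at most the frame already in flight finishes after an interrupt. The returned tail can be handed to the next turn so the agent knows what the caller did not hear.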

We benchmark these behaviors across the major providers at speko.ai — latency, false-trigger rates, and recovery quality by use case. If you are shipping voice and getting burned by false interrupts, that is where to start.
