Streaming vs batch STT: when each one wins
Most of you shouldn't be using streaming STT. Four questions to answer honestly before you open another WebSocket.
Somewhere along the way, “voice AI” became shorthand for “real-time,” and real-time became shorthand for “open a WebSocket.” So every new voice app starts with a streaming endpoint, a partial-transcript render loop, and a whole category of failure modes that didn’t need to exist. Half of those apps are transcribing meetings, generating show notes, or indexing call recordings for compliance. None of that work benefits from a 300ms partial. All of it pays a tax for pretending it does.
The tax is real, and it’s bigger than most developers assume. On Deepgram’s own Nova-3 benchmarks, the batch model lands around 5.26% WER while the streaming model sits at 6.84%: a 1.58-point absolute gap, or about 30% more errors in relative terms, and the gap tends to widen on hard audio. A broader 2024 cross-provider study (Kuhn et al.) put typical batch WER at roughly 9.37% against 10.9% for streaming. Same direction, and worse once you stack domain noise on top. The reason is mechanical: a streaming model has to commit to each token from left context only, whereas a batch model sees the whole utterance and can reason globally. You can’t beam-search audio you haven’t heard yet.
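To make the difference between an absolute and a relative gap concrete, here is a minimal Python sketch that runs the arithmetic on the WER figures quoted above. The function name wer_gap is made up for illustration; only the four WER numbers come from the text.

```python
def wer_gap(batch_wer: float, streaming_wer: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative increase in percent)."""
    absolute = streaming_wer - batch_wer
    relative = absolute / batch_wer * 100
    return round(absolute, 2), round(relative, 1)


# WER figures as quoted in the article, in percent.
nova3 = wer_gap(batch_wer=5.26, streaming_wer=6.84)
cross_provider = wer_gap(batch_wer=9.37, streaming_wer=10.9)

print(f"Nova-3: +{nova3[0]} points, +{nova3[1]}% relative")
print(f"Cross-provider: +{cross_provider[0]} points, +{cross_provider[1]}% relative")
```

The relative number is usually the one to watch: a point and a half sounds small until you read it as roughly one extra wrong word for every three the batch model already gets wrong.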
The cost tax shows up in the invoice, too. AssemblyAI charges $0.37/hr for async and $0.47/hr for real-time streaming, roughly a 27% streaming premium. OpenAI’s classic Whisper endpoint runs $0.36/hr for batch file transcription; the Realtime API uses token pricing that lands anywhere from $0.38 to $1.15/hr depending on whether you’re also generating audio output. OpenAI’s Batch API takes another 50% off if you can tolerate a 24-hour turnaround. Deepgram’s per-second billing softens the hit, but Nova-3 streaming still runs about $0.0077/min (roughly $0.46/hr) on pay-as-you-go. Multiply those deltas across a million minutes a month and the premium you’re paying for “real-time” on a workload nobody’s watching in real time becomes a line item your CFO will find.
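If you want to run that multiplication yourself, a rough sketch follows. It uses the AssemblyAI hourly rates quoted above and the million-minutes-a-month volume from the text; the helper name monthly_cost is hypothetical, and real invoices will depend on your plan and any volume discounts.

```python
def monthly_cost(rate_per_hour: float, minutes_per_month: float) -> float:
    """Cost in dollars for a month of audio billed at an hourly rate."""
    return rate_per_hour * minutes_per_month / 60


MINUTES = 1_000_000        # volume used in the article
ASYNC_RATE = 0.37          # $/hr, as quoted above
STREAMING_RATE = 0.47      # $/hr, as quoted above

async_cost = monthly_cost(ASYNC_RATE, MINUTES)
streaming_cost = monthly_cost(STREAMING_RATE, MINUTES)
premium = streaming_cost - async_cost
premium_pct = (STREAMING_RATE / ASYNC_RATE - 1) * 100

print(f"async:     ${async_cost:,.2f}")
print(f"streaming: ${streaming_cost:,.2f}")
print(f"premium:   ${premium:,.2f} ({premium_pct:.0f}%)")
```

Swap in your own provider’s rates and volume; the shape of the calculation stays the same.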
So when is streaming worth it? Answer four questions honestly.
One: is a human waiting on the output inside a single turn? If the user will see a partial transcript in under two seconds and make a decision from it — a voice agent answering, a captioned live stream, a dictation UI — streaming is the correct choice. If the output lands in a dashboard, an email, a Notion doc, or a compliance archive, nobody is waiting. Batch it.
Two: can your UX absorb transcript instability? Streaming models emit partials that get rewritten as context accumulates. The literature tracks this as Unstable Partial Word Ratio (UPWR) and Unstable Partial Segment Ratio (UPSR), and both spike on conversational audio. If your frontend shows raw partials and downstream code (LLM calls, intent classification, database writes) fires on them, you’re going to see duplicated actions, half-heard commands, and LLMs hallucinating around words that later disappear. Either debounce aggressively, wait for is_final, or run batch. Features like Amazon Transcribe’s partial-results stabilization help, but there’s a cap on how much stability they can buy you.
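The “wait for is_final” option can be sketched in a few lines of Python. This is an illustration under assumptions, not any provider’s SDK: the event is assumed to be a dict with a "transcript" string and an "is_final" flag, and the class name FinalOnlyGate is made up. Check your provider’s message schema before copying the field names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FinalOnlyGate:
    """Show partials in the UI, but trigger side effects only on finals."""

    on_final: Callable[[str], None]   # e.g. LLM call, intent classifier, DB write
    latest_partial: str = ""          # safe to render, never safe to act on

    def handle(self, event: dict) -> None:
        # Assumed event shape: {"transcript": str, "is_final": bool}
        text = event.get("transcript", "").strip()
        if not text:
            return
        if event.get("is_final"):
            self.latest_partial = ""
            self.on_final(text)       # fires once per finalized segment
        else:
            self.latest_partial = text  # may still be rewritten by the model
```

Wire the UI to latest_partial and everything with consequences to on_final. The partial can flicker as much as it likes without a half-heard command ever reaching your database.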
Three: is the network stable enough? Streaming STT runs over a persistent WebSocket. Every reconnection is a torn context window and often a dropped final. If your users are on mobile networks in South Asia, on airplane Wi-Fi, or in call-center VPNs with aggressive idle timeouts, your streaming reliability is not going to match your provider’s SLA. It’s going to match the worst link in the path. Batch uploads retry cleanly. WebSockets don’t.
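“Retry cleanly” is worth spelling out: a batch upload is a single request that either succeeded or didn’t, so a generic retry loop with exponential backoff covers most of it. The sketch below is provider-agnostic. The upload callable stands in for whatever HTTP call your provider needs, and the function name and defaults are assumptions for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def upload_with_retry(
    upload: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call upload() until it succeeds, backing off exponentially with jitter."""
    for attempt in range(attempts):
        try:
            return upload()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Delays grow as 1x, 2x, 4x... of base_delay, plus random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise ValueError("attempts must be at least 1")
```

There is no equivalent five-line fix on the streaming side, because a dropped socket also loses whatever audio context the model had accumulated.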
Four: do you need aggressive contextual biasing? Streaming keyword/term boosting exists (Deepgram, AssemblyAI, Google all support it), but the biasing window is smaller and the penalty for a missed domain term is permanent — there’s no second pass. Batch lets you run a specialized model, inject a long keyword list, re-decode with a domain LM, or chain a correction pass. For medical, legal, or compliance transcripts, that second pass is often the difference between 94% and 98% accuracy.
If you answered “no” to the first question, stop reading and go batch. If you answered “yes” (a live voice agent, a live captioning product, a dictation tool), streaming is the right call, and the WER and cost premiums are simply the price of the category.
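One possible reading of the four questions as code, for teams that like their checklists executable: question one picks the mode, and the other three turn into engineering caveats you take on if you stream. The Workload fields and the caveat wording are an interpretation for illustration, not a formal rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    human_waiting_in_turn: bool          # question one
    ux_absorbs_unstable_partials: bool   # question two
    network_is_stable: bool              # question three
    needs_heavy_biasing: bool            # question four


def recommend(w: Workload) -> tuple[str, list[str]]:
    """Return (mode, caveats). Question one alone decides the mode."""
    if not w.human_waiting_in_turn:
        return "batch", []
    caveats: list[str] = []
    if not w.ux_absorbs_unstable_partials:
        caveats.append("act only on final results, or debounce partials")
    if not w.network_is_stable:
        caveats.append("plan for reconnects and dropped finals")
    if w.needs_heavy_biasing:
        caveats.append("add a batch second pass for domain terms")
    return "streaming", caveats


meeting_notes = Workload(False, True, True, True)
voice_agent = Workload(True, False, False, False)
print(recommend(meeting_notes))
print(recommend(voice_agent))
```

An empty caveat list on a streaming recommendation is the easy case. A long one is a hint about where the engineering time will go.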
There’s a third pattern most teams miss: the hybrid. Stream for the live UX, then re-run the same audio through batch overnight for the compliance copy, the analytics pipeline, or the LLM summarization step. You pay the streaming premium only on the first pass, where it buys you user-perceived latency, and you pay batch rates on the archival pass, where accuracy matters more than speed. For call centers this is close to mandatory: the agent-assist needs sub-second partials, but the QA team reading the transcript next week needs the good one. Running both is cheaper than tuning streaming to do a job it wasn’t built for.
The short version: streaming STT is a specialized tool, not a default. If a human isn’t inside the loop waiting on tokens, batch is faster to build, cheaper to run, more accurate, and more reliable. The field has been conditioned to reach for streaming because the demos are louder, but in production most voice workloads are quietly batch-shaped.
We built Speko’s routing layer around exactly this decision — matching workloads to the right mode across providers using benchmarked WER, latency, and cost per use case, not marketing claims.