Evaluating voice AI quality in production: beyond WER

Your benchmark says 5% WER. Your users say the agent can’t understand them. Both are correct.

This is the most common failure mode we see when teams ship voice AI into production: the offline transcription scorecard looks clean, the customer calls say otherwise, and nobody can reconcile the gap. The gap is not a measurement error. It is a metric error. Word Error Rate was designed decades ago to compare speech recognition systems on read speech, and it treats every word the same. It does not know that your user said “do not refill my prescription” and your model heard “now refill my prescription.” The two transcripts are two small word edits apart: a dropped “do” and “not” heard as “now.” One of them triggers a lawsuit.

If you are running a production voice agent, WER is a useful regression signal and a terrible quality signal. You need more.

Why WER alone misleads in production

WER computes a normalized edit distance between a reference transcript and a hypothesis. That framing hides three problems that matter in production.

The first is word weighting. A missed article (“the”) and a flipped negation (“not” to “now”) both cost one substitution. Downstream, the article is noise and the negation is the entire meaning. Gladia’s WER explainer and Deepgram’s production-metrics guide both make this point: WER treats semantically critical words and filler words identically, which is almost never what your application wants.
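To make the uniform weighting concrete, here is a minimal sketch of WER as word-level edit distance divided by reference length. The function name `wer` and the example sentences are illustrative assumptions, and the normalization is limited to lowercasing and whitespace splitting, which is cruder than what production scoring tools do.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "please do not refill the prescription today"
dropped_article = "please do not refill prescription today"
flipped_negation = "please do now refill the prescription today"

# Both hypotheses contain exactly one word error out of seven reference words,
# so both print the same score.
print(wer(reference, dropped_article))
print(wer(reference, flipped_negation))
```

The metric has no way to express that the second hypothesis reverses the instruction while the first leaves it intact.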

The second is entity collapse. A phone number with one wrong digit is unusable. A drug name misspelled by one character is dangerous. On a 200-word call, that single entity error shows up as a 0.5% WER delta that no one notices — and a 100% failure rate on the actual task. Deepgram formalizes this as Missed Entity Rate, scoring accuracy only on high-value tokens like proper nouns, numbers, dates, and domain terms. In our internal runs, a provider that looked 2 points better on aggregate WER lost by more than 10 points on entity-WER for medical terminology.

The third is hallucination. WER cannot tell you that your model invented a sentence that was never spoken. Whisper-family models, still the most widely deployed open-source STT, do this at a rate that should stop anyone shipping them into a regulated workflow. The Koenecke et al. 2024 study “Careless Whisper” found that roughly 1.2% of Whisper transcriptions on control-group audio contained fully fabricated text — and 1.7% on audio from speakers with aphasia. Of those hallucinations, 38% contained explicit harms: invented violence, made-up associations, or false claims of authority. A 1% fabrication rate on millions of daily calls is not a rounding error. It is a class-action filing waiting to be indexed.
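One rough way to surface candidates for a hallucination audit, when reference transcripts exist for a sample of calls, is to look for long runs of hypothesis words that align to nothing in the reference. The sketch below uses Python's standard `difflib`. The function name and the four-word threshold are assumptions for illustration, not a method taken from the study cited above, and flagged spans still need human review.

```python
from difflib import SequenceMatcher


def suspected_fabrications(reference: str, hypothesis: str,
                           min_extra_words: int = 4) -> list[str]:
    """Return hypothesis spans that add many words with no reference counterpart."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # autojunk=False disables difflib's popularity heuristic on long sequences.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        extra = (j2 - j1) - (i2 - i1)  # hypothesis words beyond what the reference had
        if tag in ("insert", "replace") and extra >= min_extra_words:
            flagged.append(" ".join(hyp[j1:j2]))
    return flagged


spoken = "thank you for calling the clinic"
transcribed = "thank you for calling the clinic he took a big knife and left"
print(suspected_fabrications(spoken, transcribed))
```

A single-word substitution produces no flag under this rule, so the check complements WER rather than duplicating it.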

The metrics that actually predict production failures

Once you accept that WER is a starting point, the interesting question becomes which additional signals to instrument. Three families matter.

Entity-level accuracy is the first. Instead of averaging errors across the whole transcript, you score accuracy on the tokens that drive the downstream task: names, numbers, product SKUs, dates, medical terms. You can compute this with a lightweight NER pass over both reference and hypothesis, or tag a sample of production calls and run a weekly scorecard. When teams do this, the correlation between entity-WER and user-reported failures is usually much tighter than the correlation with aggregate WER.
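As a sketch of the entity-level scoring described above, the snippet below stands in for the NER pass with a hypothetical domain lexicon plus a pattern for numbers, then reports the share of reference entities missing from the hypothesis. `DOMAIN_TERMS`, `extract_entities`, and `missed_entity_rate` are illustrative names, and the calculation only approximates the idea of a missed-entity metric rather than reproducing any vendor's definition.

```python
import re
from collections import Counter

# Hypothetical lexicon; in practice this would be your domain's top entities.
DOMAIN_TERMS = {"metformin", "lisinopril", "atorvastatin"}


def extract_entities(text: str) -> Counter:
    """Count domain terms and number-like tokens (e.g. 500, 555-0142)."""
    tokens = re.findall(r"[a-z]+|\d+(?:[.\-]\d+)*", text.lower())
    return Counter(t for t in tokens if t in DOMAIN_TERMS or t[0].isdigit())


def missed_entity_rate(reference: str, hypothesis: str) -> float:
    """Fraction of reference entities that do not appear in the hypothesis."""
    ref_entities = extract_entities(reference)
    hyp_entities = extract_entities(hypothesis)
    total = sum(ref_entities.values())
    if total == 0:
        return 0.0
    # Counter subtraction keeps only entities the hypothesis is short of.
    missed = sum((ref_entities - hyp_entities).values())
    return missed / total


reference = "refill metformin 500 mg and call 555-0142"
hypothesis = "refill metformin 500 mg and call 555-0143"
print(missed_entity_rate(reference, hypothesis))  # one of three entities is wrong
```

In this example one word in eight is wrong, but one entity in three is, and it is the one the callback depends on.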

Semantic similarity is the second. BERTScore and SemDist compare the meaning of the hypothesis to the reference using contextual embeddings. They are forgiving of paraphrase (good) and are meant to be harsh on negation flips and meaning changes (also good), although embedding scores can under-penalize a single flipped word, so calibrate your thresholds on known negation pairs before trusting them. They are not a replacement for WER. They are the metric aimed at the failure mode WER is blind to, where a low-edit-distance transcript means the opposite of what was said.
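Embedding metrics need a model to run, so the sketch below covers only the negation-flip case, using a cheap lexical check that could sit alongside BERTScore or SemDist. The cue list and function names are assumptions for illustration, and a count mismatch is a flag for review, not proof of a meaning change.

```python
import re

# Hypothetical, deliberately small list of negation cues.
NEGATION_CUES = {"not", "no", "never", "cannot", "without",
                 "don't", "can't", "won't", "didn't", "doesn't"}


def negation_count(text: str) -> int:
    """Count negation cue words after normalizing curly apostrophes."""
    words = re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))
    return sum(w in NEGATION_CUES for w in words)


def negation_flip(reference: str, hypothesis: str) -> bool:
    """Flag a turn when reference and hypothesis disagree on negation count."""
    return negation_count(reference) != negation_count(hypothesis)


print(negation_flip("do not refill my prescription", "now refill my prescription"))
print(negation_flip("please refill my prescription", "please refill the prescription"))
```

The first pair is flagged and the second is not, even though both differ from their references by small edits.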

Task-level correctness is the third, and the one that matters most for voice agents. If the job of the transcript is to populate a CRM field, book an appointment, or trigger a tool call, the only metric that counts is whether the downstream system produced the right action. An LLM-as-judge grader comparing the executed intent against a labeled expected intent is now a standard pattern. Hamming’s voice-agent evaluation guide lays out the taxonomy: task success rate, entity accuracy, intent accuracy, and user-frustration proxies — re-asks, barge-ins, abandonment — sit above WER, not alongside it.
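A minimal sketch of task-level scoring follows, with exact matching on intent and slots standing in for the LLM-as-judge grader. The `Action` type, the intent names, and the slot keys are hypothetical; a real grader would likely tolerate equivalent slot values that exact comparison rejects.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """What the downstream system was expected to do, or actually did."""
    intent: str
    slots: dict = field(default_factory=dict)


def score_calls(pairs: list[tuple[Action, Action]]) -> dict[str, float]:
    """pairs holds (expected, executed) actions, one per labeled call."""
    if not pairs:
        raise ValueError("need at least one labeled call")
    n = len(pairs)
    intent_ok = sum(expected.intent == executed.intent for expected, executed in pairs)
    task_ok = sum(expected == executed for expected, executed in pairs)  # intent and slots
    return {"intent_accuracy": intent_ok / n, "task_success_rate": task_ok / n}


labeled = [
    (Action("book_appointment", {"date": "2025-03-14"}),
     Action("book_appointment", {"date": "2025-03-14"})),  # correct
    (Action("book_appointment", {"date": "2025-03-14"}),
     Action("book_appointment", {"date": "2025-03-15"})),  # right intent, wrong date
    (Action("cancel_refill", {"drug": "metformin"}),
     Action("refill", {"drug": "metformin"})),             # wrong intent
]
print(score_calls(labeled))
```

Reporting intent accuracy and task success separately shows whether failures come from misunderstanding the request or from getting a detail wrong.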

TTS has the same problem, one layer down

On the synthesis side, raw MOS is as misleading as raw WER. Systems now routinely clear 4.5 MOS on read prompts and still sound uncanny on long-form domain content. The modern answer is distributional metrics: TTSDS2 compares generated speech against real speech across prosody, speaker identity, and intelligibility factors, and was the only metric out of 16 evaluated that correlated above 0.50 Spearman with human judgments across every domain and language tested. UTMOSv2, which placed first on 7 of 16 evaluation metrics in Track 1 of the VoiceMOS Challenge 2024, covers per-utterance naturalness where TTSDS2 covers system-level distributional realism. Stacking both, plus a triple-STT round-trip WER to catch synthesis errors that destroy intelligibility, is the approach we run inside the Speko benchmark.
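The round-trip idea can be sketched as follows: synthesize the text, transcribe the audio with several independent STT systems, and score each transcript against the original text. Everything here is an assumption for illustration, including the callable interfaces, the choice of the median as the aggregate, and the stub engines used in the demo. It is not a description of how the Speko benchmark is implemented.

```python
from statistics import median
from typing import Callable, Sequence


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level edit distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def round_trip_wer(text: str,
                   synthesize: Callable[[str], bytes],
                   transcribers: Sequence[Callable[[bytes], str]]) -> float:
    """Median WER of several STT systems listening to the same synthesized audio."""
    ref = text.lower().split()
    if not ref or not transcribers:
        raise ValueError("need non-empty text and at least one transcriber")
    audio = synthesize(text)
    scores = [word_errors(ref, stt(audio).lower().split()) / len(ref)
              for stt in transcribers]
    # Using the median means one flaky recognizer does not dominate the score.
    return median(scores)


# Stub engines so the sketch runs without any speech services.
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

def stt_exact(audio: bytes) -> str:
    return audio.decode("utf-8")

def stt_mishears(audio: bytes) -> str:
    return audio.decode("utf-8").replace("fifteen", "fifty")

print(round_trip_wer("take fifteen milligrams daily",
                     fake_tts, [stt_exact, stt_mishears, stt_mishears]))
```

When two of three recognizers stumble on the same word, the synthesized audio is the more likely culprit than any single recognizer.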

The minimum viable monitoring stack

If I had to pick three metrics for a team shipping voice AI tomorrow, it would be these. For STT: entity-WER on the top 50 entities in your domain, computed on a rolling sample of production calls. For semantics: BERTScore or an LLM-as-judge grader on full turns, flagging negation flips and meaning drift. For behavior: a user-frustration composite — re-ask rate, barge-in rate, call abandonment — piped into the same dashboard as your model metrics. WER stays in the stack as a regression signal, not a quality signal.
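The behavioral composite can be sketched as a weighted sum of the three rates. The `CallLog` fields, the per-turn normalization, and the equal default weights are assumptions for illustration; the right weights depend on which signal tracks complaints best in your own data.

```python
from dataclasses import dataclass


@dataclass
class CallLog:
    user_turns: int   # number of user turns in the call
    re_asks: int      # turns where the agent asked the user to repeat
    barge_ins: int    # turns where the user interrupted the agent
    abandoned: bool   # user hung up before the task completed


def frustration_composite(calls: list[CallLog],
                          weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)
                          ) -> dict[str, float]:
    """Re-ask and barge-in rates are per user turn; abandonment is per call."""
    total_turns = sum(c.user_turns for c in calls)
    if not calls or total_turns == 0:
        raise ValueError("need at least one call with at least one user turn")
    re_ask_rate = sum(c.re_asks for c in calls) / total_turns
    barge_in_rate = sum(c.barge_ins for c in calls) / total_turns
    abandonment_rate = sum(c.abandoned for c in calls) / len(calls)
    w_re_ask, w_barge_in, w_abandon = weights
    return {
        "re_ask_rate": re_ask_rate,
        "barge_in_rate": barge_in_rate,
        "abandonment_rate": abandonment_rate,
        "composite": (w_re_ask * re_ask_rate
                      + w_barge_in * barge_in_rate
                      + w_abandon * abandonment_rate),
    }


sample = [CallLog(user_turns=10, re_asks=2, barge_ins=1, abandoned=False),
          CallLog(user_turns=10, re_asks=0, barge_ins=1, abandoned=True)]
print(frustration_composite(sample))
```

Keeping the component rates in the output alongside the composite makes it easier to see which behavior moved when the dashboard number changes.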

Everything else — hallucination audits, MOS, latency percentiles, cost — is additive. But those three, watched together, catch the failures that WER alone will never surface.