

How We Benchmark Voice AI: Our Methodology

Published: March 2026 · Author: Speko Research Team

Last full audit: March 4, 2026 · Next scheduled update: June 2026

Why We Built This

The voice AI market is moving at a pace that makes informed decision-making genuinely difficult. New providers launch every month. Existing ones update pricing and models quarterly. Everyone publishes numbers that make them look great, and almost nobody publishes the numbers that would help you compare them fairly against each other.

We have talked to dozens of engineering teams building voice agents, and the story is always the same: they spend two to four weeks evaluating providers, cobbling together data from pricing pages, blog posts, and Discord channels, only to realize the metrics they collected are not even comparable. One provider quotes latency as time-to-first-byte. Another quotes it as end-to-end round trip. A third just says “low latency” and leaves it at that.

We built Speko’s benchmark reports because we wanted a single, honest source of truth. Not marketing-optimized highlights. Not cherry-picked demos. Just verified data, normalized to common units, with every source linked and dated.

This page explains exactly how we do it, including what we measure, what we do not, and where our data comes from. We believe transparency about our process is the only way to earn your trust.

Data Sources

Let us be upfront about something: we do not run our own benchmarks. Not yet. What we do is curate, verify, and normalize publicly available data from official provider documentation and pricing pages, provider-published benchmark results, independent third-party evaluations, and, where available, community-reported figures.

Every data point in our benchmark reports links to its source with a “last verified” date. When we say Cartesia’s TTFB is 40–90ms, that comes from their official documentation, verified on March 4, 2026. When we say Deepgram Nova-3’s WER is 6.84% streaming, that figure comes from Deepgram’s published data, cross-referenced against AssemblyAI’s independent benchmark which reported 8.1% on batch processing.

We actively cross-reference claims against independent data wherever available. When provider-published numbers conflict with third-party results, we report both and note the discrepancy. You deserve to see the full picture, not a curated one.
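To show what "every data point links to its source" can look like in practice, here is a minimal Python sketch of a verified benchmark record. The `DataPoint` fields and the `has_discrepancy` helper are our own illustration of the idea, not Speko's actual internal schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class DataPoint:
    """One verified benchmark figure together with its provenance."""
    provider: str
    metric: str            # e.g. "ttfb_ms", "wer_pct"
    value_low: float       # lower bound of the reported range
    value_high: float      # upper bound (equal to value_low for a single figure)
    source_url: str
    last_verified: date
    independent_value: Optional[float] = None  # cross-referenced third-party figure

    def has_discrepancy(self, tolerance_pct: float = 10.0) -> bool:
        """Flag records where an independent figure falls outside the
        provider's published range by more than the given tolerance."""
        if self.independent_value is None:
            return False
        if self.value_low <= self.independent_value <= self.value_high:
            return False
        nearest = min(abs(self.independent_value - self.value_low),
                      abs(self.independent_value - self.value_high))
        return nearest / self.independent_value * 100 > tolerance_pct
```

With a record like this, the "report both and note the discrepancy" rule becomes a mechanical check rather than an editorial judgment call.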

What We Measure

Our benchmark reports currently track four core metrics across providers. Each metric was chosen because it directly impacts the end-user experience of a voice AI application.

TTFB (Time to First Byte)

Time to First Byte measures the gap between when a request is sent to a provider and when the first byte of the response comes back. For TTS providers, this is the single most important latency metric because it determines how long the user waits before hearing audio.

In conversational AI, TTFB under 200ms feels instantaneous. Between 200ms and 500ms, users start to notice the delay. Above 500ms, the conversation feels broken. This is why we set our target at under 200ms for conversational use cases and flag providers that consistently exceed it.

We report TTFB as a range (e.g., 40–90ms for Cartesia Sonic-3) because latency varies based on region, input length, and server load. These ranges come from provider documentation and, where available, community-reported figures.
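To make the measurement concrete, here is a minimal Python sketch of a TTFB probe. `measure_ttfb` times the arrival of the first chunk from any byte stream, and `classify_ttfb` applies the 200ms/500ms bands described above. Wiring the probe to a real provider's streaming endpoint is left to the reader, since each SDK exposes streaming differently:

```python
import time
from typing import Callable, Iterator

def measure_ttfb(start_stream: Callable[[], Iterator[bytes]]) -> float:
    """Time from issuing the request to receiving the first response byte, in ms."""
    t0 = time.monotonic()
    stream = start_stream()
    next(stream)                      # block until the first chunk arrives
    return (time.monotonic() - t0) * 1000.0

def classify_ttfb(ttfb_ms: float) -> str:
    """Bucket a TTFB figure using the perceptual thresholds described above."""
    if ttfb_ms < 200:
        return "instantaneous"
    if ttfb_ms <= 500:
        return "noticeable"
    return "broken"
```

Note the use of `time.monotonic()` rather than `time.time()`: wall-clock time can jump, while a monotonic clock is the correct tool for measuring intervals.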

TTFT (Time to First Token)

Time to First Token is particularly important for LLM inference in voice pipelines. It measures the total time from when a request is sent to when the first token of the response is returned, including connection setup, queueing, and prompt processing.

In a cascaded voice pipeline (STT → LLM → TTS), TTFT adds directly to the user-perceived latency. A sub-200ms TTFT is excellent and approaches human reaction speed. Under 500ms is good for most applications. Anything above that starts to feel sluggish in real-time conversation.

For context, Groq reports sub-100ms TTFT on their LPU hardware, while Cerebras targets 80–150ms with speculative decoding. These are the fastest production options available as of March 2026.
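Because the stages of a cascaded pipeline add serially, a back-of-the-envelope latency budget is simple arithmetic. This sketch uses a 50ms default for network overhead, which is our assumption for illustration, not a measured figure:

```python
def pipeline_latency(stt_ms: float, llm_ttft_ms: float, tts_ttfb_ms: float,
                     network_overhead_ms: float = 50.0) -> float:
    """User-perceived response latency for a cascaded STT -> LLM -> TTS
    pipeline: each stage adds serially, plus transport overhead."""
    return stt_ms + llm_ttft_ms + tts_ttfb_ms + network_overhead_ms
```

For example, 150ms of STT finalization plus a 100ms TTFT and a 65ms TTS TTFB already lands at 365ms before any network variance, which is why every stage's headline number matters.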

WER (Word Error Rate)

Word Error Rate is the percentage of words that are incorrectly transcribed by an STT provider. It is calculated as (substitutions + insertions + deletions) / total reference words. Lower is better. Under 10% is generally considered production-ready.
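The formula above can be computed with a standard word-level edit distance. This is a self-contained sketch of the calculation, not any provider's scoring code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # WER is undefined for an empty reference; treat any hypothesis
        # words as pure insertions against nothing.
        return 0.0 if not hyp else float("inf")
    # Standard edit-distance dynamic program over words, not characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

Real scoring pipelines also normalize text first (casing, punctuation, number formatting), and that normalization choice alone can shift reported WER by a meaningful margin.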

We should be honest about WER’s limitations. It is an imperfect metric: it treats all errors equally (missing “the” counts the same as missing a proper noun), it does not account for punctuation or formatting, and it can be gamed by testing on easy datasets. Despite these flaws, WER remains the industry standard for STT accuracy because there is no widely adopted alternative that is better.

We also note the difference between streaming WER and batch WER. Streaming WER is typically higher than batch WER because the model has less context to work with. Published figures do not always show this cleanly: Deepgram reports 6.84% WER for Nova-3 in streaming mode, while AssemblyAI's independent benchmark measured 8.1% in batch processing. Because those two numbers come from different evaluators and test sets, they are not directly comparable, which is precisely why we report both. The streaming-versus-batch distinction matters for real-time voice applications.

Cost per Minute

This is where things get surprisingly complicated. Voice AI providers bill in at least four different units: per second (Deepgram, AssemblyAI), per 15-second block (Google Cloud), per minute (Azure, OpenAI Whisper), and per character (most TTS providers). Some charge per audio token (OpenAI Realtime API). Some have flat monthly plans with usage caps (Bland AI, PlayHT).

We normalize everything to cost per minute of processed audio at pay-as-you-go rates. This gives you a fair, apples-to-apples comparison. For character-based TTS pricing, we estimate based on average speech rate (approximately 150 words or 750 characters per minute of output audio).
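The normalization described above can be sketched in a few lines. The unit labels are our own naming for illustration, and the 750-characters-per-minute conversion is the estimate from the text:

```python
def cost_per_minute(rate: float, unit: str) -> float:
    """Normalize a pay-as-you-go rate to USD per minute of processed audio.

    Character-based TTS pricing assumes roughly 750 characters (about 150
    words) per minute of output audio.
    """
    if unit == "per_second":
        return rate * 60
    if unit == "per_15s_block":
        return rate * 4              # four 15-second blocks per minute
    if unit == "per_minute":
        return rate
    if unit == "per_character":
        return rate * 750
    raise ValueError(f"unknown billing unit: {unit}")
```

One subtlety the sketch glosses over: providers that bill per 15-second block round partial blocks up, so short utterances cost proportionally more than this per-minute figure suggests.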

We deliberately exclude enterprise discounts, committed-use agreements, and volume tiers from our headline numbers. These vary wildly between customers and would make the comparison meaningless. Where notable discounts exist, we include them as a cost note (e.g., “$0.0065 on Growth plan ($4K/yr min)” for Deepgram).

What About MOS (Mean Opinion Score)?

Mean Opinion Score is the gold standard for evaluating voice quality in TTS. It is a subjective rating (1–5) collected from human listeners evaluating naturalness, clarity, and expressiveness. We do not currently include MOS in our benchmarks for two reasons.

First, running proper MOS tests requires a statistically significant panel of human evaluators listening to standardized test sets across all providers. This is expensive and time-consuming to do right, and doing it poorly would be worse than not doing it at all. Second, most providers do not publish MOS scores, and the ones that do use different evaluation sets, making cross-provider comparison unreliable.

MOS testing is on our roadmap for later in 2026. When we add it, we will publish our full evaluation protocol alongside the results.

How We Compare Providers

Fair comparison requires consistent normalization. Our approach: every cost is converted to price per minute of processed audio at pay-as-you-go rates, latency is reported as a range rather than a single best-case figure, streaming and batch WER are kept distinct, and every number links to a dated source.

What We Don’t Do (Yet)

Honesty about our limitations is as important as the data itself. Our benchmarks currently do not include first-party measurements (we do not yet run our own tests), MOS-based voice quality evaluation, or enterprise and committed-use pricing.

We plan to address each of these gaps over time. When we do, we will publish our testing methodology alongside the results, just as we are doing here.

Our Commitment

We hold ourselves to a specific set of commitments regarding how we maintain and publish this data: every figure links to its source with a last-verified date, conflicting numbers are reported side by side rather than silently resolved, the full dataset is re-audited quarterly, and errors are corrected promptly once reported.

Found an error or have updated data? Email us at bek@speko.ai. We take corrections seriously and appreciate the help.

Limitations & Disclaimers

We want to be explicit about what our benchmarks are and what they are not.

All data reflects publicly available information as of the verification date. Actual performance may vary based on workload, region, and configuration. We are not affiliated with any provider listed.

Provider-published numbers are, by nature, marketing-optimized. When a provider says their STT has “5.9% WER,” that is likely measured on a favorable dataset under ideal conditions. Your real-world WER will almost certainly be higher, especially with accented speech, background noise, domain-specific terminology, or non-English languages.

Similarly, published latency figures represent best-case scenarios. Your actual latency depends on your geographic region, the provider’s current load, your network conditions, and the complexity of your request. A provider that claims 100ms TTFB might deliver 300ms from your data center during peak hours.

Cost estimates in our full-stack scenarios assume typical voice agent conversation patterns (average utterance length, turn-taking frequency). Your actual costs may differ based on your specific use case. High-volume applications should negotiate enterprise pricing directly with providers rather than relying on our pay-as-you-go figures.

This methodology and the data it produces are our best effort at providing useful, honest information to the voice AI developer community. It is not perfect, and we are continuously working to improve it.

For our complete benchmark data, see the Voice AI Benchmark Report 2026.
