

How We Benchmark Voice AI: Our Methodology

Published: March 2026 · Author: Speko Research Team

Last full audit: March 4, 2026 · Next scheduled update: June 2026

Why We Built This

The voice AI market is moving at a pace that makes informed decision-making genuinely difficult. New providers launch every month. Existing ones update pricing and models quarterly. Everyone publishes numbers that make them look great, and almost nobody publishes the numbers that would help you compare them fairly against each other.

We have talked to dozens of engineering teams building voice agents, and the story is always the same: they spend two to four weeks evaluating providers, cobbling together data from pricing pages, blog posts, and Discord channels, only to realize the metrics they collected are not even comparable. One provider quotes latency as time-to-first-byte. Another quotes it as end-to-end round trip. A third just says “low latency” and leaves it at that.

We built Speko’s benchmark reports because we wanted a single, honest source of truth. Not marketing-optimized highlights. Not cherry-picked demos. Just verified data, normalized to common units, with every source linked and dated.

This page explains exactly how we do it, including what we measure, what we do not, and where our data comes from. We believe transparency about our process is the only way to earn your trust.

Data Sources

Let us be upfront about something: we do not run our own benchmarks. Not yet. What we do is curate, verify, and normalize publicly available data from official provider documentation and pricing pages, provider-published benchmark results, independent third-party evaluations, and, where available, community-reported figures.

Every data point in our benchmark reports links to its source with a “last verified” date. When we say Cartesia’s TTFB is 40–90ms, that comes from their official documentation, verified on March 4, 2026. When we say Deepgram Nova-3’s WER is 6.84% streaming, that figure comes from Deepgram’s published data, cross-referenced against AssemblyAI’s independent benchmark which reported 8.1% on batch processing.

We actively cross-reference claims against independent data wherever available. When provider-published numbers conflict with third-party results, we report both and note the discrepancy. You deserve to see the full picture, not a curated one.
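To show what "every data point links to its source" can look like in practice, here is a minimal Python sketch of a verified benchmark record. The `DataPoint` fields and the `has_discrepancy` helper are our own illustration of the idea, not Speko's actual internal schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class DataPoint:
    """One verified benchmark figure together with its provenance."""
    provider: str
    metric: str            # e.g. "ttfb_ms", "wer_pct"
    value_low: float       # lower bound of the reported range
    value_high: float      # upper bound (equal to value_low for a single figure)
    source_url: str
    last_verified: date
    independent_value: Optional[float] = None  # cross-referenced third-party figure

    def has_discrepancy(self, tolerance_pct: float = 10.0) -> bool:
        """Flag records where an independent figure falls outside the
        provider's published range by more than the given tolerance."""
        if self.independent_value is None:
            return False
        if self.value_low <= self.independent_value <= self.value_high:
            return False
        nearest = min(abs(self.independent_value - self.value_low),
                      abs(self.independent_value - self.value_high))
        return nearest / self.independent_value * 100 > tolerance_pct
```

With a record like this, the "report both and note the discrepancy" rule becomes a mechanical check rather than an editorial judgment call.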

What We Measure

Our benchmark reports currently track four core metrics across providers. Each metric was chosen because it directly impacts the end-user experience of a voice AI application.

TTFB (Time to First Byte)

Time to First Byte measures the gap between when a request is sent to a provider and when the first byte of the response comes back. For TTS providers, this is the single most important latency metric because it determines how long the user waits before hearing audio.

In conversational AI, TTFB under 200ms feels instantaneous. Between 200ms and 500ms, users start to notice the delay. Above 500ms, the conversation feels broken. This is why we set our target at under 200ms for conversational use cases and flag providers that consistently exceed it.

We report TTFB as a range (e.g., 40–90ms for Cartesia Sonic-3) because latency varies based on region, input length, and server load. These ranges come from provider documentation and, where available, community-reported figures.
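To make the measurement concrete, here is a minimal Python sketch of a TTFB probe. `measure_ttfb` times the arrival of the first chunk from any byte stream, and `classify_ttfb` applies the 200ms/500ms bands described above. Wiring the probe to a real provider's streaming endpoint is left to the reader, since each SDK exposes streaming differently:

```python
import time
from typing import Callable, Iterator

def measure_ttfb(start_stream: Callable[[], Iterator[bytes]]) -> float:
    """Time from issuing the request to receiving the first response byte, in ms."""
    t0 = time.monotonic()
    stream = start_stream()
    next(stream)                      # block until the first chunk arrives
    return (time.monotonic() - t0) * 1000.0

def classify_ttfb(ttfb_ms: float) -> str:
    """Bucket a TTFB figure using the perceptual thresholds described above."""
    if ttfb_ms < 200:
        return "instantaneous"
    if ttfb_ms <= 500:
        return "noticeable"
    return "broken"
```

Note the use of `time.monotonic()` rather than `time.time()`: wall-clock time can jump, while a monotonic clock is the correct tool for measuring intervals.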

TTFT (Time to First Token)

Time to First Token is particularly important for LLM inference in voice pipelines. It measures the total time from when a request is sent to when the first token of the response is returned, including connection setup, queueing, and prompt processing.

In a cascaded voice pipeline (STT → LLM → TTS), TTFT adds directly to the user-perceived latency. A sub-200ms TTFT is excellent and approaches human reaction speed. Under 500ms is good for most applications. Anything above that starts to feel sluggish in real-time conversation.

For context, Groq reports sub-100ms TTFT on their LPU hardware, while Cerebras targets 80–150ms with speculative decoding. These are the fastest production options available as of March 2026.
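Because the stages of a cascaded pipeline add serially, a back-of-the-envelope latency budget is simple arithmetic. This sketch uses a 50ms default for network overhead, which is our assumption for illustration, not a measured figure:

```python
def pipeline_latency(stt_ms: float, llm_ttft_ms: float, tts_ttfb_ms: float,
                     network_overhead_ms: float = 50.0) -> float:
    """User-perceived response latency for a cascaded STT -> LLM -> TTS
    pipeline: each stage adds serially, plus transport overhead."""
    return stt_ms + llm_ttft_ms + tts_ttfb_ms + network_overhead_ms
```

For example, 150ms of STT finalization plus a 100ms TTFT and a 65ms TTS TTFB already lands at 365ms before any network variance, which is why every stage's headline number matters.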

WER (Word Error Rate)

Word Error Rate is the percentage of words that are incorrectly transcribed by an STT provider. It is calculated as (substitutions + insertions + deletions) / total reference words. Lower is better. Under 10% is generally considered production-ready.
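The formula above can be computed with a standard word-level edit distance. This is a self-contained sketch of the calculation, not any provider's scoring code:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # WER is undefined for an empty reference; treat any hypothesis
        # words as pure insertions against nothing.
        return 0.0 if not hyp else float("inf")
    # Standard edit-distance dynamic program over words, not characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

Real scoring pipelines also normalize text first (casing, punctuation, number formatting), and that normalization choice alone can shift reported WER by a meaningful margin.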

We should be honest about WER’s limitations. It is an imperfect metric: it treats all errors equally (missing “the” counts the same as missing a proper noun), it does not account for punctuation or formatting, and it can be gamed by testing on easy datasets. Despite these flaws, WER remains the industry standard for STT accuracy because there is no widely adopted alternative that is better.

We also note the difference between streaming WER and batch WER. Streaming WER is typically higher than batch WER because the model has less context to work with. Published figures do not always show this cleanly: Deepgram reports 6.84% WER for Nova-3 in streaming mode, while AssemblyAI's independent benchmark measured 8.1% in batch processing. Because those two numbers come from different evaluators and test sets, they are not directly comparable, which is precisely why we report both. The streaming-versus-batch distinction matters for real-time voice applications.

Cost per Minute

This is where things get surprisingly complicated. Voice AI providers bill in at least four different units: per second (Deepgram, AssemblyAI), per 15-second block (Google Cloud), per minute (Azure, OpenAI Whisper), and per character (most TTS providers). Some charge per audio token (OpenAI Realtime API). Some have flat monthly plans with usage caps (Bland AI, PlayHT).

We normalize everything to cost per minute of processed audio at pay-as-you-go rates. This gives you a fair, apples-to-apples comparison. For character-based TTS pricing, we estimate based on average speech rate (approximately 150 words or 750 characters per minute of output audio).
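The normalization described above can be sketched in a few lines. The unit labels are our own naming for illustration, and the 750-characters-per-minute conversion is the estimate from the text:

```python
def cost_per_minute(rate: float, unit: str) -> float:
    """Normalize a pay-as-you-go rate to USD per minute of processed audio.

    Character-based TTS pricing assumes roughly 750 characters (about 150
    words) per minute of output audio.
    """
    if unit == "per_second":
        return rate * 60
    if unit == "per_15s_block":
        return rate * 4              # four 15-second blocks per minute
    if unit == "per_minute":
        return rate
    if unit == "per_character":
        return rate * 750
    raise ValueError(f"unknown billing unit: {unit}")
```

One subtlety the sketch glosses over: providers that bill per 15-second block round partial blocks up, so short utterances cost proportionally more than this per-minute figure suggests.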

We deliberately exclude enterprise discounts, committed-use agreements, and volume tiers from our headline numbers. These vary wildly between customers and would make the comparison meaningless. Where notable discounts exist, we include them as a cost note (e.g., “$0.0065 on Growth plan ($4K/yr min)” for Deepgram).

What About MOS (Mean Opinion Score)?

Mean Opinion Score is the gold standard for evaluating voice quality in TTS. It is a subjective rating (1–5) collected from human listeners evaluating naturalness, clarity, and expressiveness. We do not currently include MOS in our benchmarks for two reasons.

First, running proper MOS tests requires a statistically significant panel of human evaluators listening to standardized test sets across all providers. This is expensive and time-consuming to do right, and doing it poorly would be worse than not doing it at all. Second, most providers do not publish MOS scores, and the ones that do use different evaluation sets, making cross-provider comparison unreliable.

MOS testing is on our roadmap for later in 2026. When we add it, we will publish our full evaluation protocol alongside the results.

How We Compare Providers

Fair comparison requires consistent normalization. Our approach: every cost is converted to price per minute of processed audio at pay-as-you-go rates, latency is reported as a range rather than a single best-case figure, streaming and batch WER are kept distinct, and every number links to a dated source.

What We Don’t Do (Yet)

Honesty about our limitations is as important as the data itself. Our benchmarks currently do not include first-party measurements (we do not yet run our own tests), MOS-based voice quality evaluation, or enterprise and committed-use pricing.

We plan to address each of these gaps over time. When we do, we will publish our testing methodology alongside the results, just as we are doing here.

Our Commitment

We hold ourselves to a specific set of commitments regarding how we maintain and publish this data: every figure links to its source with a last-verified date, conflicting numbers are reported side by side rather than silently resolved, the full dataset is re-audited quarterly, and errors are corrected promptly once reported.

Found an error or have updated data? Email us at bek@speko.ai. We take corrections seriously and appreciate the help.

Limitations & Disclaimers

We want to be explicit about what our benchmarks are and what they are not.

All data reflects publicly available information as of the verification date. Actual performance may vary based on workload, region, and configuration. We are not affiliated with any provider listed.

Provider-published numbers are, by nature, marketing-optimized. When a provider says their STT has “5.9% WER,” that is likely measured on a favorable dataset under ideal conditions. Your real-world WER will almost certainly be higher, especially with accented speech, background noise, domain-specific terminology, or non-English languages.

Similarly, published latency figures represent best-case scenarios. Your actual latency depends on your geographic region, the provider’s current load, your network conditions, and the complexity of your request. A provider that claims 100ms TTFB might deliver 300ms from your data center during peak hours.

Cost estimates in our full-stack scenarios assume typical voice agent conversation patterns (average utterance length, turn-taking frequency). Your actual costs may differ based on your specific use case. High-volume applications should negotiate enterprise pricing directly with providers rather than relying on our pay-as-you-go figures.

This methodology and the data it produces are our best effort at providing useful, honest information to the voice AI developer community. It is not perfect, and we are continuously working to improve it.

For our complete benchmark data, see the Voice AI Benchmark Report 2026.
