Speko is a voice AI benchmarking and optimization platform. It connects to 18+ voice AI providers and automatically tests 240+ STT, LLM, and TTS combinations against your specific language, use case, and cost constraints — returning ranked results in minutes.

Which voice AI providers does Speko support?

Speko supports 18+ providers including Deepgram, AssemblyAI, ElevenLabs, Cartesia, PlayHT, OpenAI, Gemini, Groq, Cerebras, Vapi, Retell, Bland AI, Hume AI, and more. New providers are added regularly.

How does Speko benchmark voice AI providers?

Speko runs STT, LLM, and TTS providers in combination against your specific inputs, measuring latency, accuracy, cost, and quality. Every benchmark number is cited with source URLs and verification dates. See our methodology at speko.ai/blog/methodology.

Which STT provider is most accurate for English?

Based on our March 2026 benchmarks, Deepgram Nova-3 and AssemblyAI Universal-3 Pro lead for English accuracy. Deepgram Nova-3 achieves 4.1% WER on clean audio; AssemblyAI Universal-3 Pro averages 5.9% WER across 26 diverse datasets. The best choice depends on your audio conditions and latency requirements.

What is the cheapest voice AI stack in 2026?

The lowest-cost production-ready stack is approximately $0.0095/minute, combining Deepgram Nova-3 ($0.0043/min) + Gemini 2.0 Flash ($0.0007/min) + Cartesia Sonic ($0.0045/min). See our full cost breakdown at speko.ai/blog/voice-ai-cost-2026.
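
The per-minute figure is a plain sum of the three component prices. A minimal sketch of that arithmetic follows; the function names and the 100,000-minute monthly volume are illustrative assumptions, not Speko figures.

```python
# Per-minute prices quoted in the answer above (USD per minute of audio).
STACK_PRICES_USD_PER_MIN = {
    "Deepgram Nova-3 (STT)": 0.0043,
    "Gemini 2.0 Flash (LLM)": 0.0007,
    "Cartesia Sonic (TTS)": 0.0045,
}


def stack_cost_per_minute(prices):
    """Sum the component prices; round to 4 decimals to hide float noise."""
    return round(sum(prices.values()), 4)


def monthly_cost(prices, minutes_per_month):
    """Project the per-minute cost onto a monthly call volume."""
    return round(stack_cost_per_minute(prices) * minutes_per_month, 2)


print(stack_cost_per_minute(STACK_PRICES_USD_PER_MIN))
# Hypothetical volume, chosen only to illustrate the projection.
print(monthly_cost(STACK_PRICES_USD_PER_MIN, 100_000))
```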

How is Speko different from Vapi or Retell?

Vapi and Retell are voice agent platforms that lock you into their provider choices. Speko is provider-agnostic infrastructure that benchmarks all providers against your requirements and helps you choose and switch freely. Speko integrates with any platform including Vapi, Retell, and custom stacks.

Voice AI Is Not As Global As You Think

If you are building a voice agent and you want to know what latency your users will actually experience, there is a question no provider answers in their marketing: where are you measuring from? Every latency number published on a vendor’s website, every “150 ms TTFP” claim in a launch announcement, quietly assumes you and your users live near their primary data center. Usually that data center is in the United States.

We wanted to know how much that assumption costs. So we shipped the same 35 audio clips through four streaming STT providers from three Google Cloud regions — us-east4 (Virginia), europe-west3 (Frankfurt), and asia-southeast1 (Singapore) — in parallel, and measured time-to-first-partial (TTFP) from each vantage.

The result: two of the four providers are fastest from the US, slower from Europe, and slowest from Asia. A third is also US-hosted, yet it is faster from Europe than from its own home region, a counterintuitive finding that says more about inference fleet load than about network routing. The fourth is close to geography-invariant.

What the providers actually say

Before we ship our numbers, we should establish what each vendor publicly claims about their infrastructure. We pulled this directly from their own docs and announcements.

Deepgram

Until recently, Deepgram was US-hosted by default. Their own 2026 announcement is literally titled “Deepgram Goes Global” — implying, correctly, that they were not before. The post describes a newly-launched EU-hosted endpoint in early access and a roadmap covering Asia-Pacific, EMEA, and LATAM. For most customers today, Deepgram is a US-region service.

ElevenLabs Scribe v2

Multi-region data residency exists for US, EU, and India — but it is gated behind an Enterprise contract. Every developer on the self-serve API hits the default path, which is US-hosted. The choice to offer multi-region at all is real. The choice to hide it behind a sales gate means the typical Scribe v2 integration routes to US inference regardless of where the end user lives.

OpenAI GPT-4o Transcribe

Microsoft’s own developer Q&A confirms gpt-4o-transcribe is “hosted in regions like East US 2” with no region-locked data residency support. For compliance-regulated EU customers, OpenAI has been explicit: this model does not yet offer EU inference residency. Global standard deployment only.

Alibaba qwen3-asr-flash

Alibaba publishes endpoints in Singapore, US Virginia, Beijing, Hong Kong, and Frankfurt. Their International Deployment Mode explicitly states data storage and access points are in Singapore, with inference compute dynamically scheduled globally. This is not marketing spin — it is a published architectural choice. The default developer path routes through a nearby POP regardless of where you stand on Earth.

Three of four providers say they are global. Only one architected for it.

What the numbers actually show

Here is the 12-cell matrix from our multi-region probe. Same audio, same probe code, same timing convention. Only the GCP region changed.

Provider	Vantage	TTFP p50	Δ vs US-East
Deepgram Nova-3	us-east4	2059 ms	baseline
Deepgram Nova-3	europe-west3	2170 ms	+111 ms (+5.4%)
Deepgram Nova-3	asia-southeast1	2201 ms	+142 ms (+6.9%)
ElevenLabs Scribe v2	us-east4	2120 ms	baseline
ElevenLabs Scribe v2	europe-west3	2215 ms	+95 ms (+4.5%)
ElevenLabs Scribe v2	asia-southeast1	2319 ms	+199 ms (+9.4%)
OpenAI GPT-4o Transcribe	us-east4	835 ms	baseline
OpenAI GPT-4o Transcribe	europe-west3	775 ms	−60 ms (−7.2%)
OpenAI GPT-4o Transcribe	asia-southeast1	916 ms	+81 ms (+9.7%)
Alibaba qwen3-asr-flash	us-east4	6889 ms	baseline
Alibaba qwen3-asr-flash	europe-west3	6906 ms	+17 ms (+0.2%)
Alibaba qwen3-asr-flash	asia-southeast1	6953 ms	+64 ms (+0.9%)
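
The Δ column can be recomputed from the p50 values alone. The sketch below does that; the dictionary layout and function name are our own choices for illustration, and only the numbers come from the table.

```python
# p50 TTFP in milliseconds, copied from the table above.
P50_TTFP_MS = {
    "Deepgram Nova-3": {"us-east4": 2059, "europe-west3": 2170, "asia-southeast1": 2201},
    "ElevenLabs Scribe v2": {"us-east4": 2120, "europe-west3": 2215, "asia-southeast1": 2319},
    "OpenAI GPT-4o Transcribe": {"us-east4": 835, "europe-west3": 775, "asia-southeast1": 916},
    "Alibaba qwen3-asr-flash": {"us-east4": 6889, "europe-west3": 6906, "asia-southeast1": 6953},
}
BASELINE_REGION = "us-east4"


def deltas_vs_baseline(p50_by_region, baseline=BASELINE_REGION):
    """Return {region: (delta_ms, delta_percent)} for every non-baseline region."""
    base = p50_by_region[baseline]
    return {
        region: (ms - base, round(100 * (ms - base) / base, 1))
        for region, ms in p50_by_region.items()
        if region != baseline
    }


for provider, by_region in P50_TTFP_MS.items():
    for region, (delta_ms, delta_pct) in deltas_vs_baseline(by_region).items():
        print(f"{provider:26s} {region:16s} {delta_ms:+5d} ms ({delta_pct:+.1f}%)")
```

A negative delta means the vantage is faster than us-east4, which is how the OpenAI row for europe-west3 stands out.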

The three surprises

Deepgram and Scribe v2 validate their own marketing

Both providers are fastest from us-east4 and degrade with distance. Deepgram’s Asia penalty is +142 ms. Scribe v2’s is +199 ms. For a live voice-agent deployment, that pushes the first-partial past the conversational-latency threshold where users start feeling lag. The numbers agree with the docs: these are US-first systems.

OpenAI is faster from Europe than from its own home region

On paper, gpt-4o-transcribe runs in East US 2. Our measurement from Frankfurt is 60 ms faster than from Virginia. This is strange until you remember that streaming-mode TTFP measures the gap from “first byte sent” to “first transcript delta received”, which includes TCP setup, TLS handshake, and chunked-upload overhead before any inference happens. A better-provisioned edge PoP or a less-loaded regional ingress can produce exactly this shape. What it means for builders: an EU voice agent on OpenAI gets better latency than a US one, even though OpenAI’s inference is nominally US-hosted. That contradicts the marketing, but it is what the data shows.

Alibaba is the only provider whose latency does not depend on your location

From all three continents, Alibaba qwen3-asr-flash lands within a 64-millisecond band on p50 TTFP. Frankfurt to Singapore, coast to coast — it does not matter. This is what global architecture actually looks like at the network layer: either distributed POPs, aggressive anycast, or compute-scheduling that routes to the nearest GPU fleet. Their docs say all three.

The catch: Alibaba’s TTFP is 6.9 seconds. Three times slower than Deepgram, eight times slower than OpenAI. Whatever they are doing in their commit pipeline dominates network time so completely that geography becomes invisible. That is a different kind of problem — but it means Alibaba is the only provider in this benchmark where an APAC voice agent does not pay a geography tax.
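
The 64 ms band and the "three times" and "eight times" comparisons follow directly from the p50 table. A small sketch of both calculations, with helper names that are ours rather than part of any probe code:

```python
# p50 TTFP in ms per provider, copied from the table above,
# ordered (us-east4, europe-west3, asia-southeast1).
P50_MS = {
    "Deepgram Nova-3": (2059, 2170, 2201),
    "ElevenLabs Scribe v2": (2120, 2215, 2319),
    "OpenAI GPT-4o Transcribe": (835, 775, 916),
    "Alibaba qwen3-asr-flash": (6889, 6906, 6953),
}


def regional_spread_ms(p50s):
    """Width of the band between the fastest and slowest vantage."""
    return max(p50s) - min(p50s)


def slowdown_vs(provider, other, region_index=0):
    """How many times slower `provider` is than `other` at one vantage (default: us-east4)."""
    return P50_MS[provider][region_index] / P50_MS[other][region_index]


for name, p50s in P50_MS.items():
    print(f"{name}: spread {regional_spread_ms(p50s)} ms")

print(round(slowdown_vs("Alibaba qwen3-asr-flash", "Deepgram Nova-3"), 1))
print(round(slowdown_vs("Alibaba qwen3-asr-flash", "OpenAI GPT-4o Transcribe"), 1))
```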

“Inference compute resources are dynamically scheduled globally.” (Alibaba Cloud DashScope documentation)

What to do with this information

If you are building a voice agent today, here is the short version:

US deployments. Use OpenAI GPT-4o Transcribe for partial-driven UX (live captions, typing indicators). Use Deepgram Nova-3 or ElevenLabs Scribe v2 for finalization-driven UX (barge-in, turn-taking). All three are fast in us-east. Pick on TTF behavior, not TTFP.

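EU deployments. OpenAI GPT-4o Transcribe is the sleeper winner: 775 ms p50 TTFP from Frankfurt, faster than Deepgram, faster than Scribe v2, and faster than OpenAI’s own US vantage. ElevenLabs offers EU data residency on Enterprise contracts if you need it for compliance, but on the default path we measured, Scribe v2 reaches 2215 ms p50 from Frankfurt, about 95 ms slower than its own US vantage and roughly 1.4 seconds behind OpenAI.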

APAC deployments. You have a real problem with the US-primaries. Deepgram adds 142 ms, Scribe v2 adds 199 ms, OpenAI adds 81 ms. Alibaba qwen3-asr-flash is the only provider in our benchmark with genuine Asia-Pacific parity. The tradeoff is that its roughly 7-second TTFP makes it unsuitable for tight conversational loops, but if the use case is post-call transcription or batch-like streaming, it is the honest choice.

Everywhere else. No provider we tested has inference in Latin America, Africa, South Asia, or the Middle East. If your users are there, you are paying an untested geography tax on every turn. We cannot help you yet, and neither can the provider’s marketing page.

The bigger picture

Voice AI is a thirty-billion-dollar industry marketing itself as global. Almost none of it is. The infrastructure is centered on three US cities and a handful of European capitals. For a buyer in Jakarta, São Paulo, Istanbul, or Nairobi, the published latency numbers are aspirational. The real latency is the one we just measured — and even our numbers understate it, because we probed from inside Google Cloud regions with decent peering. The end user on mobile LTE in Medan or Curitiba has it worse.

The providers know this. Their expansion roadmaps are public. Deepgram is building out EU now; APAC is “coming.” ElevenLabs has multi-region, but behind enterprise sales. OpenAI has nothing announced for gpt-4o-transcribe residency. Only Alibaba shipped global-first, and the tradeoff — a three-fold TTFP penalty — is a different kind of cost.

At Speko, we keep measuring because the gap between what the marketing says and what the network does is the single largest source of unhappy voice-agent buyers. The region dropdown on speko.ai/benchmark/stt is live as of today. Flip between US-East, EU-West, and APAC to see what your users will actually experience — not what the provider’s homepage claims.

Methodology

Probes: four streaming STT providers (Deepgram Nova-3, ElevenLabs Scribe v2 Realtime, OpenAI GPT-4o Transcribe, Alibaba qwen3-asr-flash-realtime) were tested via their WebSocket or HTTP streaming endpoints from Cloud Run jobs deployed in three GCP regions: us-east4, europe-west3, and asia-southeast1.

Corpus: 35 clips per provider per region, drawn from our trial-long corpus (60–120 second clips, 7 languages: en, zh, ja, ko, id, th, vi). All clips streamed at real-time pacing (20 ms PCM16 chunks).
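
Real-time pacing means each 20 ms chunk is released on a 20 ms schedule rather than as fast as the socket allows. The sketch below shows one way to do that. The 16 kHz mono sample rate is an assumption (the corpus description does not state it), and `paced_chunks` is a hypothetical helper, not the probe's actual code.

```python
import time
from typing import Callable, Iterator

SAMPLE_RATE_HZ = 16_000  # assumption: sample rate is not stated in the article
BYTES_PER_SAMPLE = 2     # PCM16 means 16-bit (2-byte) samples; mono assumed
CHUNK_MS = 20            # chunk duration from the corpus description


def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ, chunk_ms: int = CHUNK_MS) -> int:
    """Bytes of mono PCM16 audio contained in one chunk."""
    return sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE


def paced_chunks(
    pcm: bytes,
    chunk_ms: int = CHUNK_MS,
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield fixed-size chunks, releasing chunk i no earlier than i * chunk_ms after the start."""
    size = chunk_size_bytes(sample_rate_hz, chunk_ms)
    start = time.monotonic()
    for i, offset in enumerate(range(0, len(pcm), size)):
        # Schedule against the start time so per-chunk jitter does not accumulate.
        delay = start + i * chunk_ms / 1000 - time.monotonic()
        if delay > 0:
            sleep(delay)
        yield pcm[offset:offset + size]
```

Under these assumptions a chunk is 640 bytes, so a 60-second clip becomes 3,000 chunks sent over 60 seconds.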

Timing: TTFP = time from first audio byte sent to first non-empty partial transcript received. Measured in the client, not on the server.
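
The timing convention can be expressed as a small client-side timer that a probe hooks into its send and receive paths. This is a transport-agnostic sketch, not the probe code itself: `TtfpTimer` and its method names are hypothetical, and treating whitespace-only partials as empty is our assumption.

```python
import statistics
import time
from typing import Callable, Optional


class TtfpTimer:
    """Client-side timer: first audio byte sent -> first non-empty partial received."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._sent_at: Optional[float] = None
        self._first_partial_at: Optional[float] = None

    def on_audio_sent(self) -> None:
        # Only the first chunk starts the clock; later chunks are ignored.
        if self._sent_at is None:
            self._sent_at = self._clock()

    def on_transcript(self, text: str) -> None:
        # Ignore events before audio was sent and anything after the first hit.
        if self._sent_at is None or self._first_partial_at is not None:
            return
        # Assumption: whitespace-only partials count as empty.
        if text.strip():
            self._first_partial_at = self._clock()

    @property
    def ttfp_ms(self) -> Optional[float]:
        """TTFP in milliseconds, or None if no non-empty partial arrived."""
        if self._sent_at is None or self._first_partial_at is None:
            return None
        return (self._first_partial_at - self._sent_at) * 1000.0


def p50(samples_ms):
    """Median of the per-clip TTFP samples for one provider and region."""
    return statistics.median(samples_ms)
```

A probe would call `on_audio_sent()` right after writing each chunk and `on_transcript(text)` for each transcript event, then feed the per-clip `ttfp_ms` values into `p50`.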

Caveat we won’t hide: this run executed five probes in parallel per Cloud Run container to cut wall-clock time. Parallel WebSockets sharing one NIC may inflate absolute values versus a serial run: our Deepgram p50 is ~2 seconds here versus ~1.1 seconds in a sequential smoke test earlier the same day. The relative deltas between regions are trustworthy (same contention in every region). The absolute values are a ceiling, not a floor. A serial re-run is on the roadmap.
