
Voice AI Infrastructure · Empirical

Voice AI Is Not As Global As You Think.

We sent the same audio through four leading streaming STT providers from three continents. Three of the four market themselves as global. Only one actually is.

Published: April 2026 | Author: Speko Research Team
Figure: Streaming TTFP (p50) measured from three GCP Cloud Run regions (North Virginia, Frankfurt, Singapore) · April 24, 2026 · n≈35 per cell

If you are building a voice agent and you want to know what latency your users will actually experience, there is a question no provider answers in their marketing: where are you measuring from? Every latency number published on a vendor’s website, every “150 ms TTFP” claim in a launch announcement, quietly assumes you and your users live near their primary data center. Usually that data center is in the United States.

We wanted to know how much that assumption costs. So we shipped the same 35 audio clips through four streaming STT providers from three Google Cloud regions — us-east4 (Virginia), europe-west3 (Frankfurt), and asia-southeast1 (Singapore) — in parallel, and measured time-to-first-partial (TTFP) from each vantage.

The result: two of four providers are fastest from the US, slower from Europe, and slowest from Asia. A third, nominally US-hosted, is faster from Europe than from its own home region, a counterintuitive finding that says more about inference fleet load than about network routing. The fourth is effectively geography-invariant.

What the providers actually say

Before we ship our numbers, we should establish what each vendor publicly claims about their infrastructure. We pulled this directly from their own docs and announcements.

Deepgram

Until recently, Deepgram was US-hosted by default. Their own 2026 announcement is literally titled “Deepgram Goes Global” — implying, correctly, that they were not before. The post describes a newly-launched EU-hosted endpoint in early access and a roadmap covering Asia-Pacific, EMEA, and LATAM. For most customers today, Deepgram is a US-region service.

ElevenLabs Scribe v2

Multi-region data residency exists for US, EU, and India — but it is gated behind an Enterprise contract. Every developer on the self-serve API hits the default path, which is US-hosted. The choice to offer multi-region at all is real. The choice to hide it behind a sales gate means the typical Scribe v2 integration routes to US inference regardless of where the end user lives.

OpenAI GPT-4o Transcribe

Microsoft’s own developer Q&A confirms gpt-4o-transcribe is “hosted in regions like East US 2” with no region-locked data residency support. For compliance-regulated EU customers, OpenAI has been explicit: this model does not yet offer EU inference residency. Global standard deployment only.

Alibaba qwen3-asr-flash

Alibaba publishes endpoints in Singapore, US Virginia, Beijing, Hong Kong, and Frankfurt. Their International Deployment Mode explicitly states data storage and access points are in Singapore, with inference compute dynamically scheduled globally. This is not marketing spin — it is a published architectural choice. The default developer path routes through a nearby POP regardless of where you stand on Earth.

Three of four providers say they are global. Only one architected for it.

What the numbers actually show

Here is the 12-cell matrix from our multi-region probe. Same audio, same probe code, same timing convention. Only the GCP region changed.

| Provider | Vantage | TTFP p50 | Δ vs US-East |
|---|---|---|---|
| Deepgram Nova-3 | us-east4 | 2059 ms | baseline |
| Deepgram Nova-3 | europe-west3 | 2170 ms | +111 ms (+5.4%) |
| Deepgram Nova-3 | asia-southeast1 | 2201 ms | +142 ms (+6.9%) |
| ElevenLabs Scribe v2 | us-east4 | 2120 ms | baseline |
| ElevenLabs Scribe v2 | europe-west3 | 2215 ms | +95 ms (+4.5%) |
| ElevenLabs Scribe v2 | asia-southeast1 | 2319 ms | +199 ms (+9.4%) |
| OpenAI GPT-4o Transcribe | us-east4 | 835 ms | baseline |
| OpenAI GPT-4o Transcribe | europe-west3 | 775 ms | −60 ms (−7.2%) |
| OpenAI GPT-4o Transcribe | asia-southeast1 | 916 ms | +81 ms (+9.7%) |
| Alibaba qwen3-asr-flash | us-east4 | 6889 ms | baseline |
| Alibaba qwen3-asr-flash | europe-west3 | 6906 ms | +17 ms (+0.2%) |
| Alibaba qwen3-asr-flash | asia-southeast1 | 6953 ms | +64 ms (+0.9%) |
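
For readers who want to check the Δ column themselves, here is a minimal Python sketch that recomputes it from the p50 values in the matrix. The dictionary layout and the function name `delta_vs_baseline` are our own choices for illustration, not part of the probe code.

```python
# p50 TTFP in milliseconds, copied from the matrix above.
P50_MS = {
    "Deepgram Nova-3": {"us-east4": 2059, "europe-west3": 2170, "asia-southeast1": 2201},
    "ElevenLabs Scribe v2": {"us-east4": 2120, "europe-west3": 2215, "asia-southeast1": 2319},
    "OpenAI GPT-4o Transcribe": {"us-east4": 835, "europe-west3": 775, "asia-southeast1": 916},
    "Alibaba qwen3-asr-flash": {"us-east4": 6889, "europe-west3": 6906, "asia-southeast1": 6953},
}
BASELINE = "us-east4"


def delta_vs_baseline(p50_by_region, baseline=BASELINE):
    """Return {region: (delta_ms, delta_pct)} relative to the baseline region."""
    base = p50_by_region[baseline]
    return {
        region: (value - base, round(100.0 * (value - base) / base, 1))
        for region, value in p50_by_region.items()
        if region != baseline
    }


if __name__ == "__main__":
    for provider, cells in P50_MS.items():
        for region, (d_ms, d_pct) in delta_vs_baseline(cells).items():
            print(f"{provider:26s} {region:16s} {d_ms:+5d} ms ({d_pct:+.1f}%)")
```

The percentages are relative to each provider's own us-east4 number, which is why a 64 ms shift on Alibaba reads as under 1% while an 81 ms shift on OpenAI reads as nearly 10%.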

The three surprises

Deepgram and Scribe v2 validate their own marketing

Both providers are fastest from us-east4 and degrade with distance. Deepgram’s Asia penalty is +142 ms. Scribe v2’s is +199 ms. For a live voice-agent deployment, that is extra delay on every turn, stacked on a first partial that already took about two seconds in this run. The numbers agree with the docs: these are US-first systems.

OpenAI is faster from Europe than from its own home region

On paper, gpt-4o-transcribe runs in East US 2. Our measurement from Frankfurt is 60 ms faster than from Virginia. This is strange until you remember that streaming-mode TTFP measures the gap from “first byte sent” to “first transcript delta received”, which includes TCP setup, TLS handshake, and chunked-upload overhead before any inference happens. A better-provisioned edge PoP or a less-loaded regional ingress can produce exactly this shape. What it means for builders: in this run, an EU voice agent on OpenAI saw better latency than a US one, even though OpenAI’s inference is nominally US-hosted. That runs against what the hosting region would predict, but it is what the data shows.

Alibaba is the only provider whose latency does not depend on your location

From all three continents, Alibaba qwen3-asr-flash lands within a 64-millisecond band on p50 TTFP. Frankfurt to Singapore, coast to coast — it does not matter. This is what global architecture actually looks like at the network layer: either distributed POPs, aggressive anycast, or compute-scheduling that routes to the nearest GPU fleet. Their docs say all three.

The catch: Alibaba’s p50 TTFP is about 6.9 seconds, more than three times Deepgram’s and more than eight times OpenAI’s. Whatever they are doing in their commit pipeline dominates network time so completely that geography becomes invisible. That is a different kind of problem, but it means Alibaba is the only provider in this benchmark where an APAC voice agent does not pay a geography tax.

“Inference compute resources are dynamically scheduled globally.” (Alibaba Cloud DashScope documentation)
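
The two claims in this section, a narrow regional band and a large absolute penalty, can be reproduced from the matrix with a few lines of Python. This is an illustrative sketch: the helper names `regional_spread_ms` and `slowdown_vs` are ours, and the ratios use the us-east4 vantage as an assumed reference point.

```python
# p50 TTFP in milliseconds, copied from the matrix earlier in the post.
P50_MS = {
    "Deepgram Nova-3": {"us-east4": 2059, "europe-west3": 2170, "asia-southeast1": 2201},
    "ElevenLabs Scribe v2": {"us-east4": 2120, "europe-west3": 2215, "asia-southeast1": 2319},
    "OpenAI GPT-4o Transcribe": {"us-east4": 835, "europe-west3": 775, "asia-southeast1": 916},
    "Alibaba qwen3-asr-flash": {"us-east4": 6889, "europe-west3": 6906, "asia-southeast1": 6953},
}


def regional_spread_ms(cells):
    """Width of the band between the fastest and slowest vantage for one provider."""
    return max(cells.values()) - min(cells.values())


def slowdown_vs(provider, reference, region="us-east4"):
    """How many times slower `provider` is than `reference` from one vantage."""
    return P50_MS[provider][region] / P50_MS[reference][region]


if __name__ == "__main__":
    for name, cells in P50_MS.items():
        print(f"{name:26s} spread = {regional_spread_ms(cells)} ms")
    alibaba = "Alibaba qwen3-asr-flash"
    print(f"vs Deepgram: {slowdown_vs(alibaba, 'Deepgram Nova-3'):.1f}x")
    print(f"vs OpenAI:   {slowdown_vs(alibaba, 'OpenAI GPT-4o Transcribe'):.1f}x")
```

Note that OpenAI's absolute spread is close to Deepgram's; what sets Alibaba apart is that its band is both the narrowest in milliseconds and tiny relative to its own baseline.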

What to do with this information

If you are building a voice agent today, here is the short version:

US deployments. Use OpenAI GPT-4o Transcribe for partial-driven UX (live captions, typing indicators). Use Deepgram Nova-3 or ElevenLabs Scribe v2 for finalization-driven UX (barge-in, turn-taking). All three are fast in us-east. Pick on TTF behavior, not TTFP.

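EU deployments. OpenAI GPT-4o Transcribe is the sleeper winner: 775 ms p50 TTFP from Frankfurt, faster than Deepgram, faster than Scribe v2, and faster than OpenAI’s own US vantage. ElevenLabs offers EU data residency on Enterprise contracts if you need it for compliance, but in our default-path measurements Scribe v2’s 2215 ms p50 from Frankfurt is about 1.4 seconds behind OpenAI, and roughly 100 ms behind its own US number.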

APAC deployments. You have a real problem with the US-primaries. From Singapore, Deepgram adds 142 ms, Scribe v2 adds 199 ms, and OpenAI adds 81 ms over their us-east4 numbers. Alibaba qwen3-asr-flash is the only provider in our benchmark with genuine Asia-Pacific parity. The tradeoff is that its roughly 7-second TTFP makes it unsuitable for tight conversational loops, but if the use case is post-call transcription or batch-like streaming, it is the honest choice.

Everywhere else. No provider we tested offers self-serve inference in Latin America, Africa, South Asia, or the Middle East (ElevenLabs’ India residency sits behind an Enterprise contract). If your users are there, you are paying a geography tax on every turn that we have not measured. We cannot help you yet, and neither can the provider’s marketing page.

The bigger picture

Voice AI is a thirty-billion-dollar industry marketing itself as global. Almost none of it is. The infrastructure is centered on three US cities and a handful of European capitals. For a buyer in Jakarta, São Paulo, Istanbul, or Nairobi, the published latency numbers are aspirational. The real latency is the one we just measured — and even our numbers understate it, because we probed from inside Google Cloud regions with decent peering. The end user on mobile LTE in Medan or Curitiba has it worse.

The providers know this. Their expansion roadmaps are public. Deepgram is building out EU now; APAC is “coming.” ElevenLabs has multi-region, but behind enterprise sales. OpenAI has nothing announced for gpt-4o-transcribe residency. Only Alibaba shipped global-first, and the tradeoff, a p50 TTFP three to eight times higher than the others, is a different kind of cost.

At Speko, we keep measuring because the gap between what the marketing says and what the network does is the single largest source of unhappy voice-agent buyers. The region dropdown on speko.ai/benchmark/stt is live as of today. Flip between US-East, EU-West, and APAC to see what your users will actually experience — not what the provider’s homepage claims.

Methodology

Probes: four streaming STT providers (Deepgram Nova-3, ElevenLabs Scribe v2 Realtime, OpenAI GPT-4o Transcribe, Alibaba qwen3-asr-flash-realtime) were tested via their WebSocket or HTTP streaming endpoints from Cloud Run jobs deployed in three GCP regions: us-east4, europe-west3, and asia-southeast1.

Corpus: 35 clips per provider per region, drawn from our trial-long corpus (60–120 second clips, 7 languages: en, zh, ja, ko, id, th, vi). All clips streamed at real-time pacing (20 ms PCM16 chunks).
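
To make "real-time pacing" concrete, here is a hedged Python sketch of how a client might slice PCM16 audio into 20 ms chunks and send them on a fixed schedule. The 16 kHz mono format is an assumption (the sample rate is not stated above), and `stream_paced` and its `send` callback are hypothetical names, not the actual probe code.

```python
import time
from typing import Callable, Iterator

SAMPLE_RATE_HZ = 16_000  # assumption: the sample rate is not stated in the methodology
CHANNELS = 1             # assumption: mono audio
BYTES_PER_SAMPLE = 2     # PCM16 means 16-bit (2-byte) samples
CHUNK_MS = 20            # chunk duration from the methodology


def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     chunk_ms: int = CHUNK_MS,
                     channels: int = CHANNELS) -> int:
    """Bytes in one chunk: samples per chunk times bytes per sample times channels."""
    return sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE * channels


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `size` bytes; the last one may be shorter."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


def stream_paced(pcm: bytes,
                 send: Callable[[bytes], None],
                 chunk_ms: int = CHUNK_MS,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> int:
    """Send chunks at real-time pace and return how many were sent.

    Each chunk is scheduled against the start time rather than the previous
    send, so small sleep overshoots do not accumulate into drift.
    """
    size = chunk_size_bytes(chunk_ms=chunk_ms)
    start = clock()
    sent = 0
    for index, chunk in enumerate(iter_chunks(pcm, size)):
        wait = start + index * chunk_ms / 1000.0 - clock()
        if wait > 0:
            sleep(wait)
        send(chunk)
        sent += 1
    return sent
```

Under these assumptions one chunk is 640 bytes, and a 60-second clip becomes 3,000 sends. Pacing matters for the measurement: uploading the whole clip at once would hand the provider more audio up front and could change when the first partial arrives.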

Timing: TTFP = time from first audio byte sent to first non-empty partial transcript received. Measured in the client, not on the server.
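
The timing convention can be sketched as a small client-side recorder. This is a minimal illustration of the definition above, not our probe implementation; the class name `TtfpTimer` and its callback names are hypothetical, and wiring them to a real WebSocket client is left out.

```python
import time
from typing import Callable, Optional


class TtfpTimer:
    """Records time from first audio byte sent to first non-empty partial received."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        # A monotonic clock avoids jumps from wall-clock adjustments.
        self._clock = clock
        self.first_send: Optional[float] = None
        self.first_partial: Optional[float] = None

    def on_audio_sent(self, chunk: bytes) -> None:
        """Call right after each chunk is written; only the first non-empty one counts."""
        if self.first_send is None and chunk:
            self.first_send = self._clock()

    def on_partial(self, text: str) -> None:
        """Call for every partial transcript; empty or whitespace-only partials are ignored."""
        if self.first_send is not None and self.first_partial is None and text.strip():
            self.first_partial = self._clock()

    @property
    def ttfp_ms(self) -> Optional[float]:
        """TTFP in milliseconds, or None if either endpoint was never observed."""
        if self.first_send is None or self.first_partial is None:
            return None
        return (self.first_partial - self.first_send) * 1000.0
```

Because both timestamps are taken in the client, the result includes the network round trip from the probe's region, which is exactly the component that changes between vantages.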

Caveat we won’t hide: this run executed five probes in parallel per Cloud Run container to cut wall-clock time. Parallel WebSockets sharing one NIC may inflate absolute values versus a serial run: our Deepgram p50 is ~2 seconds here versus ~1.1 seconds in a sequential smoke test earlier the same day. The relative deltas between regions should be trustworthy, assuming the contention is the same in every region. The absolute values are a ceiling, not a floor. A serial re-run is on the roadmap.
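
The parallelism described in this caveat can be sketched with a concurrency gate. The example below is an assumption-laden illustration in Python using `asyncio.Semaphore`; `run_probes` and the stand-in `InFlightCounter.probe` are hypothetical names, and the real probes open network connections instead of sleeping.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


async def run_probes(probe: Callable[[Any], Awaitable[Any]],
                     clips: Sequence[Any],
                     max_parallel: int) -> List[Any]:
    """Run probe(clip) for every clip with at most max_parallel in flight."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(clip: Any) -> Any:
        async with gate:
            return await probe(clip)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(clip) for clip in clips))


class InFlightCounter:
    """Stand-in probe that tracks how many calls overlap (for illustration only)."""

    def __init__(self) -> None:
        self.current = 0
        self.peak = 0

    async def probe(self, clip: Any) -> Any:
        self.current += 1
        self.peak = max(self.peak, self.current)
        await asyncio.sleep(0.001)  # placeholder for streaming one clip
        self.current -= 1
        return clip


if __name__ == "__main__":
    counter = InFlightCounter()
    asyncio.run(run_probes(counter.probe, list(range(35)), max_parallel=5))
    print("peak concurrent probes:", counter.peak)
```

In this sketch, `max_parallel=5` mirrors the run reported here and `max_parallel=1` corresponds to the planned serial re-run, so the same harness could produce both sets of numbers for comparison.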
