Best English Speech-to-Text API (2026): independent benchmark

English speech-to-text accuracy, measured. 50 English read-speech clips from FLEURS, every provider routed through the same gateway and scored identically on 2026-06-03. Lower WER is better.

OpenAI GPT-4o Transcribe posted the lowest WER at 2.4%, with Alibaba Qwen3-ASR (2.6%) close behind. With n=50 the confidence intervals are about plus or minus 2 points, so the top four are a statistical tie. The spread below the top is real: the board runs from 2.4% to 13.2%.

English STT leaderboard

50 English clips (FLEURS), measured 2026-06-03 through the Speko gateway, loudness-normalized to -16 LUFS. WER is measured on English; latency comes from the same gateway setup, and price is the provider's list price. Lower is better in every column.

Provider / model	WER	p50 latency	List price
OpenAI GPT-4o Transcribe	2.4%	1,084 ms	$0.006/min
Alibaba Qwen3-ASR	2.6%	2,195 ms	-
ElevenLabs Scribe v2	2.9%	1,353 ms	$0.0067/min
xAI Grok STT *	4.8%	996 ms	-
Cartesia Ink-2	6.1%	966 ms	$0.0022/min
Gradium	13.2%	2,528 ms	-

* xAI Grok STT returns an empty transcript on low-level audio: it dropped about half of the un-normalized quiet clips. The 4.8% figure is on loudness-normalized input.
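If your own audio may be quiet, a pre-flight level check is one way to spot clips at risk of an empty transcript. The sketch below is a rough proxy only: it computes plain RMS level in dBFS, which is not the same measure as LUFS, and the -40 dBFS threshold and the function names are hypothetical, not part of the benchmark or of any provider's API.

```typescript
// RMS level of decoded PCM samples in dBFS. Samples are assumed to be
// floats in the range [-1, 1]. This is NOT a LUFS measurement.
function rmsDbfs(samples: number[]): number {
  if (samples.length === 0) return -Infinity;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / samples.length);
  // 20 * log10(rms); digital silence maps to -Infinity.
  return rms === 0 ? -Infinity : (20 * Math.log(rms)) / Math.LN10;
}

// Hypothetical cut-off for "quiet"; tune it against your own recordings.
const QUIET_THRESHOLD_DBFS = -40;

// Flag clips that may need gain or loudness normalization before transcription.
function isQuiet(samples: number[]): boolean {
  return rmsDbfs(samples) < QUIET_THRESHOLD_DBFS;
}
```

A flagged clip can then be normalized before it is sent, or routed to a provider that tolerates low input levels.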

How we measured

Dataset: 50 English read-speech clips from FLEURS.
Scoring: mean word error rate (WER), lower is better. Audio is loudness-normalized to -16 LUFS before scoring so input-gain handling does not contaminate the accuracy column.
Every provider is measured the same way: through the Speko gateway (POST /v1/transcribe, provider pinned), from a single location.
Latency is p50 end-to-end and includes the gateway hop for every provider. Cost is the provider list price per minute of audio; a dash means no public list price.
Run date: 2026-06-03. n=50, so confidence intervals are about plus or minus 2 points: the top four are a statistical tie.
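The scoring described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the text normalization, the choice to average per-clip WER (rather than pool errors across the corpus), and the normal-approximation interval are all assumptions, and the function names are made up for the example.

```typescript
// Lowercase and strip common punctuation before comparing words.
// Assumption: the benchmark's real normalization rules are not stated here.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[.,!?;:"()]/g, ' ')
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Word-level Levenshtein distance: the minimum number of substitutions,
// deletions, and insertions needed to turn ref into hyp.
function wordEditDistance(ref: string[], hyp: string[]): number {
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[hyp.length];
}

// WER for one clip: edit distance divided by the reference word count.
function wer(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;
  return wordEditDistance(ref, hyp) / ref.length;
}

// Mean of per-clip scores plus a 95% half-width from a normal approximation.
// Requires at least two scores.
function summarize(scores: number[]): { mean: number; halfWidth: number } {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance = scores.reduce((a, b) => a + (b - mean) * (b - mean), 0) / (n - 1);
  return { mean, halfWidth: 1.96 * Math.sqrt(variance / n) };
}
```

Because the half-width shrinks with the square root of n, halving it takes about four times as many clips.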

Full interactive table, every territory, and the complete methodology: benchmarks.speko.ai

Use the winner without lock-in

The best English provider today is one benchmark run away from being second best. Speko is one API in front of every provider on this table: it routes each request to the measured-best provider for your language and fails over automatically when a provider degrades. No per-vendor integration, no migration when the leaderboard flips.

curl

curl -X POST https://api.speko.dev/v1/transcribe \
  -H "Authorization: Bearer $SPEKO_API_KEY" \
  -H "Content-Type: audio/wav" \
  -H "x-speko-intent: {\"language\":\"en\"}" \
  --data-binary @call.wav

TypeScript

import { Speko } from '@spekoai/sdk';
import { readFile } from 'node:fs/promises';

const speko = new Speko({ apiKey: process.env.SPEKO_API_KEY! });
const audio = await readFile('./call.wav');

const { text, provider, confidence } = await speko.transcribe(audio, {
  language: 'en',
});


FAQ

What is the most accurate English speech-to-text API?

On Speko's 2026-06-03 FLEURS benchmark (50 English clips), OpenAI GPT-4o Transcribe posted the lowest WER at 2.4%, followed by Alibaba Qwen3-ASR at 2.6%. With n=50 the confidence intervals are about plus or minus 2 points, so the top four are a statistical tie.

Does ElevenLabs support English speech-to-text?

Yes. ElevenLabs Scribe v2 scored 2.9% WER on our English run.

Which English STT API is fastest?

Cartesia Ink-2 had the lowest p50 latency on our run at 966 ms. Every latency number includes the gateway hop, so the comparison is like-for-like across providers.
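To reproduce a p50 figure on your own traffic, collect one end-to-end duration per request and take the median. A minimal sketch, with a hypothetical function name and made-up sample values in the usage comment:

```typescript
// Median (p50) of latency samples in milliseconds.
// Each sample is assumed to be a wall-clock duration measured around the
// whole request, so it includes any gateway hop.
function p50(samplesMs: number[]): number {
  if (samplesMs.length === 0) throw new Error('p50 needs at least one sample');
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd count: the middle value. Even count: the mean of the two middle values.
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Illustrative values only, not benchmark data:
// p50([1200, 900, 1000]) returns 1000
```

The median ignores outliers by design, so a provider with occasional multi-second stalls can still post a low p50; track a higher percentile alongside it if tail latency matters to you.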

Which English STT API is cheapest?

By list price, Cartesia Ink-2 at about $0.0022/min. OpenAI GPT-4o Transcribe lists at $0.006/min and ElevenLabs Scribe v2 lists at $0.0067/min. Providers without a public per-minute list price show a dash.
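To turn per-minute list prices into a budget, multiply by your monthly audio volume. The sketch below uses the three list prices quoted above; the 10,000-minute volume and the helper name are illustrative assumptions, and real invoices may differ from list price.

```typescript
// List prices in USD per minute of audio, as quoted in the answer above.
const listPricePerMin: Record<string, number> = {
  'OpenAI GPT-4o Transcribe': 0.006,
  'ElevenLabs Scribe v2': 0.0067,
  'Cartesia Ink-2': 0.0022,
};

// Cost in USD for a given number of audio minutes, rounded to cents.
function costUsd(pricePerMin: number, minutes: number): number {
  return Math.round(pricePerMin * minutes * 100) / 100;
}

// Hypothetical volume: 10,000 minutes of audio per month.
// costUsd(listPricePerMin['OpenAI GPT-4o Transcribe'], 10000) returns 60
// costUsd(listPricePerMin['ElevenLabs Scribe v2'], 10000) returns 67
// costUsd(listPricePerMin['Cartesia Ink-2'], 10000) returns 22
```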

Why does the xAI row carry a note?

xAI Grok STT returns an empty transcript on low-level audio: it dropped about half of the un-normalized quiet clips. The 4.8% WER in the table is on loudness-normalized input.

More language benchmarks

Best Thai STT API
Best Indonesian STT API
Best Vietnamese STT API
TTS benchmarks by language
Full interactive STT benchmark