Best Filipino Text-to-Speech API (2026): independent benchmark
Filipino (Tagalog) breaks the usual TTS scoring playbook: it natively code-switches with English (Taglish), so English-sounding phones are expected content, not an accent tell. We measured 10 systems on Speko's Filipino eval set with the checks that stay objective: an intelligibility gate, round-trip CER, pacing, and signal hygiene.
8 of 10 systems produce intelligible Filipino. Polly Generative and Deepgram Aura 2 fail the gate: their "Filipino" output comes back detected as English. Among the systems that pass, xAI / Grok TTS posted the lowest round-trip CER at 1.5%.
Filipino TTS measurements
Sorted by round-trip CER, gate failures last. Objective checks only (intelligibility, pacing, hygiene): no acoustic metric validly ranks Filipino naturalness, so none is shown.
| System | Type | Gate (detected) | Round-trip CER | Pacing (w/s) | True peak (dBTP) |
|---|---|---|---|---|---|
| xAI / Grok TTS | tts | pass (tl) | 1.5% | 2.51 | -4.25 |
| Cartesia Sonic 3.5 | tts | pass (tl) | 2.2% | 2.91 | -0.86 (hot) |
| ElevenLabs v3 | tts | pass (tl) | 2.4% | 2.27 | -0.37 (hot) |
| Inworld TTS 2 | tts | pass (tl) | 3.0% | 2.78 | -4.50 |
| GPT Realtime | realtime | pass (tl) | 3.5% | 2.38 | -3.51 |
| GPT Realtime v2 | realtime | pass (tl) | 3.7% | 2.21 | -5.27 |
| GPT-4o mini TTS | tts | pass (tl) | 5.6% | 2.06 | -11.78 |
| MiniMax Speech 2.6 HD | tts | pass (tl) | 5.6% | 2.02 | -1.47 |
| Polly Generative | tts | fail (detected English) | 15.2% | 2.08 | -5.81 |
| Deepgram Aura 2 | tts | fail (detected English) | 57.4% | 1.09 (outside band) | -4.71 |
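The ordering rule used for the table (gate passes first, then ascending round-trip CER) can be sketched as a comparator. This is a minimal illustration only: the `Row` shape and its field names are hypothetical, not Speko's published schema, and the values are copied from four rows of the table.

```typescript
// Hypothetical row shape for illustration; not Speko's actual data schema.
interface Row {
  system: string;
  gatePass: boolean;
  cerPercent: number;
}

// Gate passes sort ahead of gate failures; within each group, lower CER comes first.
function compareRows(a: Row, b: Row): number {
  if (a.gatePass !== b.gatePass) return a.gatePass ? -1 : 1;
  return a.cerPercent - b.cerPercent;
}

// Four rows taken from the table, deliberately out of order.
const rows: Row[] = [
  { system: "Polly Generative", gatePass: false, cerPercent: 15.2 },
  { system: "ElevenLabs v3", gatePass: true, cerPercent: 2.4 },
  { system: "Deepgram Aura 2", gatePass: false, cerPercent: 57.4 },
  { system: "xAI / Grok TTS", gatePass: true, cerPercent: 1.5 },
];

// Copy before sorting so the input array is left untouched.
const sorted = [...rows].sort(compareRows);
console.log(sorted.map((r) => r.system));
```

Sorting on the gate first keeps a failing system with a moderate CER (Polly Generative at 15.2%) from being read as a ranked result.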
How we measured
- Eval set: Speko's Filipino TTS eval set v1 (fixed Filipino prompts synthesized by every system).
- Gate: language detection plus round-trip transcription. Output fails if it is detected as English or if its round-trip CER exceeds 50% (either way, the model did not produce intelligible Filipino).
- Round-trip CER: synthesized audio is transcribed back and compared to the prompt. Lower is better; this is an intelligibility check, not a naturalness ranking.
- Pacing: speech rate in words/sec against a comfortable 2.0-3.2 band. A flag, not a rank.
- Signal hygiene: clipping, true peak (danger line -1.0 dBTP), and DC-offset checks on the delivered audio.
- Data is synced from the published run at benchmarks.speko.ai (snapshot 2026-06-05).
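The gate, CER, and pacing checks above can be sketched in a few lines. This is an illustration under stated assumptions, not Speko's pipeline: CER is taken as character-level Levenshtein distance divided by reference length, the text normalization (lowercasing, whitespace collapsing) is a guess, the `"en"` language code is assumed, and the pacing band is treated as inclusive.

```typescript
// Levenshtein edit distance over character arrays (two-row dynamic programming).
function editDistance(ref: string[], hyp: string[]): number {
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[hyp.length];
}

// Assumed normalization: lowercase, trim, collapse whitespace. The real run may differ.
function normalize(text: string): string[] {
  return Array.from(text.toLowerCase().trim().replace(/\s+/g, " "));
}

// Character error rate: edits needed to turn the transcript into the prompt,
// divided by the prompt length in characters.
function cer(reference: string, transcript: string): number {
  const ref = normalize(reference);
  if (ref.length === 0) throw new Error("reference must not be empty");
  return editDistance(ref, normalize(transcript)) / ref.length;
}

// Gate as described: fail when detected as English (assumed code "en") or CER above 50%.
function passesGate(detectedLanguage: string, cerValue: number): boolean {
  return detectedLanguage !== "en" && cerValue <= 0.5;
}

// Pacing flag against the 2.0-3.2 words/sec band (bounds assumed inclusive).
function pacingInBand(wordsPerSecond: number): boolean {
  return wordsPerSecond >= 2.0 && wordsPerSecond <= 3.2;
}

console.log(cer("salamat", "salamet")); // one substitution over 7 characters
console.log(passesGate("en", 0.152), pacingInBand(1.09));
```

Under this sketch, a system can fail the gate on language detection alone even when its CER sits well under the 50% cutoff, which matches the Polly Generative row.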
Full interactive panels, audio clips, and the complete methodology: benchmarks.speko.ai
Use the best Filipino voice without lock-in
Speko is one API in front of every system on this page: it routes each request to the measured-best provider for your language and fails over automatically when a provider degrades. When the next run reshuffles this table, your integration does not change.
curl -X POST https://api.speko.dev/v1/synthesize \
  -H "Authorization: Bearer $SPEKO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Kumusta! Salamat sa pagtawag.", "intent": {"language": "fil"}}' \
  --output reply.audio

import { Speko } from '@spekoai/sdk';

const speko = new Speko({ apiKey: process.env.SPEKO_API_KEY! });
const { audio, provider, model } = await speko.synthesize('Kumusta! Salamat sa pagtawag.', {
  language: 'fil',
});

FAQ
What is the best Filipino text-to-speech API?
No acoustic metric validly ranks Filipino naturalness (Taglish code-switching inverts the usual accent signals), so we publish objective checks instead. 8 of 10 systems pass the intelligibility gate, and xAI / Grok TTS has the lowest round-trip CER at 1.5%. Accent and naturalness judgments are left to native raters.
Does AWS Polly support Filipino?
Polly Generative failed our Filipino intelligibility gate: its output was detected as English with a 15.2% round-trip CER, so it is excluded rather than ranked.
Does ElevenLabs support Filipino text-to-speech?
Yes. ElevenLabs v3 passes the gate with a 2.4% round-trip CER. One flag: its audio peaks at -0.37 dBTP, above our -1.0 dBTP danger line, which leaves little headroom and risks clipping downstream.
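To put the true-peak flag in linear terms, here is a small sketch that uses only the standard decibel-to-amplitude relation. The function names are illustrative, and the sketch works on already-reported dBTP values rather than measuring true peak from audio (which requires oversampling).

```typescript
// Convert a true-peak level in dBTP to a linear amplitude ratio (1.0 = full scale).
function dbtpToLinear(dbtp: number): number {
  return Math.pow(10, dbtp / 20);
}

// Flag audio whose reported true peak sits above the -1.0 dBTP danger line.
function isHot(dbtp: number, limitDbtp = -1.0): boolean {
  return dbtp > limitDbtp;
}

// Reported peaks from the table: ElevenLabs v3, Cartesia Sonic 3.5, MiniMax Speech 2.6 HD.
for (const peak of [-0.37, -0.86, -1.47]) {
  console.log(peak, isHot(peak), dbtpToLinear(peak).toFixed(3));
}
```

A peak of -0.37 dBTP corresponds to roughly 96% of full scale, against roughly 89% at the -1.0 dBTP line.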
Why is there no Filipino naturalness ranking?
The rhythm metric that works for Thai inverts on Filipino (correlation -0.53), and English-phone intrusion correlates the wrong way (+0.44) because Taglish makes English phones legitimate content. No deterministic feature separates the "conyo" accent native speakers penalize from legitimate loanwords, so naturalness stays human-rated.