
How to Build a Voice Agent in 2026

Step-by-step architecture guide with provider recommendations, latency targets, and real cost estimates.

Last updated: April 2026

To build a production voice agent in 2026, you need three components: an STT engine (Deepgram Nova-3 recommended for speed), an LLM (Gemini 2.0 Flash for cost, GPT-4o for quality), and a TTS engine (Cartesia Sonic for latency). According to Speko's benchmarks, this stack achieves sub-800ms round-trip latency at $0.0095/minute. Speko helps you benchmark all provider combinations to find the optimal stack for your specific use case.

You have two architecture choices: the cascaded STT+LLM+TTS pipeline (more control, lower cost, better debuggability) or speech-to-speech models like OpenAI Realtime API (lower latency, less flexibility). This guide covers both approaches with honest tradeoffs.

Architecture Comparison

There are two approaches to building voice agents. Most production systems use the cascaded pipeline for its flexibility.

| Dimension | Cascaded (STT+LLM+TTS) | Speech-to-Speech |
|---|---|---|
| Latency | 600-800ms | 300-500ms |
| Cost/minute | $0.0095-0.025 | $0.05-0.10 |
| Provider flexibility | Mix and match any provider | Single provider lock-in |
| Debuggability | Full visibility per stage | Black box |
| Voice customization | Choose any TTS voice | Limited to provider voices |
| Production readiness | Mature, well-understood | Emerging, improving fast |
| Best for | Production systems | Demos and prototypes |

Step-by-Step: Building a Cascaded Voice Agent

Step 1: Choose Your STT Provider

The STT provider converts user speech to text in real time. For voice agents, streaming latency matters more than batch accuracy.

Recommended: Deepgram Nova-3 (sub-300ms, $0.0043/min) for most use cases. AssemblyAI Universal-3 if you need speaker diarization or sentiment analysis built in.
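Streaming STT APIs typically emit interim (partial) hypotheses followed by a final result for each utterance. The sketch below shows the consumption pattern, assuming a generic result dict with `is_final` and `text` fields; the exact field names vary by provider, so treat this as illustrative rather than any one SDK's interface.

```python
def consume_transcripts(results):
    """Fold a stream of STT results into finalized utterance text.

    Interim results overwrite each other (the hypothesis is unstable);
    final results are committed and the interim buffer is cleared.
    """
    committed = []   # finalized utterance segments
    interim = ""     # latest unstable hypothesis
    for result in results:
        if result["is_final"]:
            committed.append(result["text"])
            interim = ""
        else:
            interim = result["text"]
    return " ".join(committed), interim

# Simulated provider output: two interim updates, then a final result.
final, pending = consume_transcripts([
    {"is_final": False, "text": "book a"},
    {"is_final": False, "text": "book a table"},
    {"is_final": True,  "text": "book a table for two"},
])
```

The interim results are what let the agent detect end-of-turn quickly; only final results should be passed to the LLM.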

Step 2: Select Your LLM

The LLM generates the agent's response. Time-to-first-token (TTFT) is critical — you need the response to start streaming before the TTS can begin speaking.

Recommended: Gemini 2.0 Flash ($0.0007/min, ~150ms TTFT) for cost efficiency. GPT-4o for complex reasoning tasks. Groq-hosted Llama 3.3 for the fastest inference speed.
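TTFT can be measured directly off any streaming response: record the request time and timestamp the first chunk. A minimal sketch, assuming an iterable of streamed text chunks; the fake generator here stands in for a real SDK stream.

```python
import time

def measure_ttft(stream):
    """Consume a streaming LLM response; return (full_text, ttft_seconds)."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time-to-first-token
        chunks.append(chunk)
    return "".join(chunks), ttft

def fake_stream():
    """Stand-in for a provider stream with a ~50ms first-token delay."""
    time.sleep(0.05)
    yield "Hello, "
    yield "how can I help?"

text, ttft = measure_ttft(fake_stream())
```

Instrumenting TTFT per provider this way, against your real prompts, is how you verify the ~150ms figures quoted above hold for your workload.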

Step 3: Pick Your TTS Provider

The TTS provider converts the LLM's text response into natural-sounding speech. Time-to-first-byte (TTFB) determines how quickly the agent starts speaking.

Recommended: Cartesia Sonic (sub-150ms TTFB, $0.0045/min) for voice agents. ElevenLabs Turbo v3 if voice quality is the top priority. Deepgram Aura for the lowest cost.
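To keep TTS TTFB low, cascaded pipelines don't wait for the full LLM response: they cut the token stream at sentence boundaries and hand each complete sentence to TTS as soon as it closes. A minimal sketch of that chunker; the boundary rule is a simplification (production systems also handle abbreviations, numbers, and trailing punctuation).

```python
import re

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_stream(chunks):
    """Yield complete sentences from a stream of LLM text chunks,
    so TTS synthesis can start before the full response is generated."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()

out = list(sentences_from_stream(["Your table is boo", "ked. See you at 7pm", "!"]))
```

This is the main lever for perceived responsiveness: the agent starts speaking after the first sentence, not after the full reply.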

Step 4: Add Orchestration

The orchestration layer handles WebSocket connections, turn-taking, interruption handling, and audio streaming between components.

Options: Build custom with WebSockets (full control), use LiveKit for real-time infrastructure, or use a platform like Vapi/Retell for managed orchestration ($0.05-0.15/min markup).
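At the heart of any orchestration layer, custom or managed, is a turn-taking state machine: the agent listens, speaks, and must stop speaking the moment the user talks over it (barge-in). A minimal framework-independent sketch:

```python
class TurnManager:
    """Tiny turn-taking state machine: LISTENING <-> SPEAKING,
    with barge-in cancelling agent speech when the user interrupts."""

    def __init__(self):
        self.state = "LISTENING"

    def on_user_speech_start(self):
        if self.state == "SPEAKING":
            self.state = "LISTENING"
            return "cancel_tts"  # stop audio playback immediately (barge-in)
        return None

    def on_agent_response_ready(self):
        if self.state == "LISTENING":
            self.state = "SPEAKING"
            return "start_tts"
        return None

    def on_agent_speech_end(self):
        self.state = "LISTENING"

tm = TurnManager()
tm.on_agent_response_ready()        # agent starts speaking
action = tm.on_user_speech_start()  # user barges in -> "cancel_tts"
```

Real orchestrators layer voice-activity detection, audio buffering, and telephony on top of this, but getting interruption handling right is what separates a usable agent from an annoying one.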

Step 5: Benchmark and Optimize

The optimal provider combination depends on your specific language, audio conditions, and latency requirements. What works for English customer support may not work for Japanese healthcare.

Use Speko: Benchmark 240+ STT+LLM+TTS combinations against your actual inputs. Get ranked results by latency, accuracy, and cost in minutes instead of weeks of manual testing.

Recommended Stacks

Based on Speko benchmark data, March 2026. Optimized for different priorities.

| Optimized For | STT | LLM | TTS | Latency | Cost/min |
|---|---|---|---|---|---|
| Lowest cost | Deepgram Nova-3 | Gemini 2.0 Flash | Cartesia Sonic | ~700ms | $0.0095 |
| Best quality | ElevenLabs Scribe | GPT-4o | ElevenLabs Turbo v3 | ~1200ms | $0.038 |
| Lowest latency | Deepgram Nova-3 | Groq Llama 3.3 | Cartesia Sonic | ~500ms | $0.012 |
| Balanced | Deepgram Nova-3 | GPT-4o | Cartesia Sonic | ~800ms | $0.018 |

Why Use Speko for Voice Agent Development?

Building a voice agent means choosing the right combination from 240+ possible stacks. Speko finds the best one for your use case.

240+ Stack Combinations

Test every STT+LLM+TTS combination against your audio, language, and domain. Get ranked results in minutes.

End-to-End Latency Testing

Measure real round-trip latency for each stack: STT time + LLM TTFT + TTS TTFB. Find the stack that hits your latency target.

Cost-per-Call Analysis

See exact per-minute costs for every combination. Optimize for your budget without sacrificing quality.

Frequently Asked Questions

What is the tech stack for a voice agent?
A voice agent requires three core components: Speech-to-Text (STT) to transcribe user speech, a Large Language Model (LLM) to generate responses, and Text-to-Speech (TTS) to speak the response. The standard production stack in 2026 is Deepgram Nova-3 (STT) + GPT-4o or Gemini 2.0 Flash (LLM) + Cartesia Sonic (TTS), achieving sub-800ms total round-trip latency at approximately $0.0095/minute.
How much does it cost to build a voice agent?
The lowest-cost production voice agent stack runs approximately $0.0095/minute: Deepgram Nova-3 ($0.0043/min) + Gemini 2.0 Flash ($0.0007/min) + Cartesia Sonic ($0.0045/min). At 10,000 minutes/month, that is $95/month in API costs. Platform solutions like Vapi or Retell charge $0.05-0.15/minute, which includes infrastructure but costs 5-15x more.
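Because per-minute rates in a cascaded stack compose linearly, monthly API cost is just the sum of component rates times usage. A quick sanity check of the figures quoted above:

```python
def monthly_cost(per_min_rates, minutes_per_month):
    """Total monthly API cost for a cascaded stack, given per-minute
    rates for each component and expected monthly call volume."""
    return sum(per_min_rates.values()) * minutes_per_month

stack = {  # per-minute rates cited in this article
    "stt_deepgram_nova3": 0.0043,
    "llm_gemini_2_flash": 0.0007,
    "tts_cartesia_sonic": 0.0045,
}
cost = monthly_cost(stack, 10_000)  # approximately $95/month
```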
Should I use speech-to-speech or cascaded STT+LLM+TTS?
For most production use cases in 2026, the cascaded STT+LLM+TTS pipeline is recommended. It gives you provider flexibility, better debuggability, and lower cost. Speech-to-speech models (OpenAI Realtime API, Gemini Live) offer lower latency but lock you into a single provider and limit customization. Use speech-to-speech for demos and prototypes; use cascaded for production.
What latency should a voice agent achieve?
A natural-feeling voice agent should achieve sub-1-second total round-trip latency (from end of user speech to start of agent speech). The target breakdown: STT < 300ms, LLM first token < 200ms, TTS first byte < 200ms. The best production stacks in 2026 achieve 600-800ms total latency.
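The budget in the answer above can be verified stage by stage. A small sketch that flags whichever component blows its target, as well as the overall round-trip limit:

```python
def check_latency_budget(measured_ms, targets_ms, total_target_ms=1000):
    """Compare measured per-stage latencies (ms) against per-stage targets
    and an overall round-trip target; return a list of violations."""
    violations = [
        f"{stage}: {measured_ms[stage]}ms > {limit}ms target"
        for stage, limit in targets_ms.items()
        if measured_ms[stage] > limit
    ]
    total = sum(measured_ms.values())
    if total > total_target_ms:
        violations.append(f"total: {total}ms > {total_target_ms}ms target")
    return violations

# Targets from the breakdown above: STT < 300ms, LLM TTFT < 200ms, TTS TTFB < 200ms.
targets = {"stt": 300, "llm_ttft": 200, "tts_ttfb": 200}
issues = check_latency_budget({"stt": 280, "llm_ttft": 350, "tts_ttfb": 150}, targets)
```

Here the total (780ms) is under the 1-second ceiling, but the LLM stage alone exceeds its 200ms TTFT target and gets flagged.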
Do I need a voice agent platform like Vapi or Retell?
Platforms like Vapi and Retell simplify development by handling WebSocket orchestration, turn-taking, and telephony integration. They are ideal for getting to production quickly. If you need full control over provider selection, cost optimization, or custom orchestration, building on individual APIs with a benchmarking layer like Speko gives you more flexibility at lower cost.
How do I prevent my voice agent from hallucinating?
Voice agent hallucination is primarily an LLM problem. Key mitigation strategies: (1) Use retrieval-augmented generation (RAG) or knowledge graphs to ground responses in verified data, (2) Set strict system prompts with explicit boundaries, (3) Implement confidence scoring and fallback to human agents for uncertain queries, (4) Use Speko's Knowledge Graph approach which reduces hallucinations by constraining LLM outputs to verified provider and domain data.

Methodology

Stack recommendations are based on Speko's automated benchmarking pipeline, which tests provider combinations against standardized voice agent scenarios (customer support, appointment booking, FAQ handling). Latency, accuracy, and cost data verified March 2026.

Benchmark Your Voice Agent Stack

240+ provider combinations. Ranked by latency, accuracy, and cost. Find the optimal stack for your use case in minutes.

Ready to try Speko?

Stop guessing which voice AI stack is best. Benchmark every combination and ship with confidence.

Get Started