
How to Build a Voice Agent in 2026

Step-by-step architecture guide with provider recommendations, latency targets, and real cost estimates.

Last updated: April 2026

To build a production voice agent in 2026, you need three components: an STT engine (Deepgram Nova-3 recommended for speed), an LLM (Gemini 2.0 Flash for cost, GPT-4o for quality), and a TTS engine (Cartesia Sonic for latency). According to Speko's benchmarks, this stack achieves sub-800ms round-trip latency at $0.0095/minute. Speko helps you benchmark all provider combinations to find the optimal stack for your specific use case.

You have two architecture choices: the cascaded STT+LLM+TTS pipeline (more control, lower cost, better debuggability) or speech-to-speech models like OpenAI Realtime API (lower latency, less flexibility). This guide covers both approaches with honest tradeoffs.

Architecture Comparison

There are two approaches to building voice agents. Most production systems use the cascaded pipeline for its flexibility.

| Dimension | Cascaded (STT+LLM+TTS) | Speech-to-Speech |
|---|---|---|
| Latency | 600-800ms | 300-500ms |
| Cost/minute | $0.0095-0.025 | $0.05-0.10 |
| Provider flexibility | Mix and match any provider | Single provider lock-in |
| Debuggability | Full visibility per stage | Black box |
| Voice customization | Choose any TTS voice | Limited to provider voices |
| Production readiness | Mature, well-understood | Emerging, improving fast |
| Best for | Production systems | Demos and prototypes |

Step-by-Step: Building a Cascaded Voice Agent

Step 1: Choose Your STT Provider

The STT provider converts user speech to text in real time. For voice agents, streaming latency matters more than batch accuracy.

Recommended: Deepgram Nova-3 (sub-300ms, $0.0043/min) for most use cases. AssemblyAI Universal-3 if you need speaker diarization or sentiment analysis built in.
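Streaming STT APIs typically emit interim (partial) hypotheses followed by a final result for each utterance. The sketch below shows the consumption pattern, assuming a generic result dict with `is_final` and `text` fields; the exact field names vary by provider, so treat this as illustrative rather than any one SDK's interface.

```python
def consume_transcripts(results):
    """Fold a stream of STT results into finalized utterance text.

    Interim results overwrite each other (the hypothesis is unstable);
    final results are committed and the interim buffer is cleared.
    """
    committed = []   # finalized utterance segments
    interim = ""     # latest unstable hypothesis
    for result in results:
        if result["is_final"]:
            committed.append(result["text"])
            interim = ""
        else:
            interim = result["text"]
    return " ".join(committed), interim

# Simulated provider output: two interim updates, then a final result.
final, pending = consume_transcripts([
    {"is_final": False, "text": "book a"},
    {"is_final": False, "text": "book a table"},
    {"is_final": True,  "text": "book a table for two"},
])
```

The interim results are what let the agent detect end-of-turn quickly; only final results should be passed to the LLM.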

Step 2: Select Your LLM

The LLM generates the agent's response. Time-to-first-token (TTFT) is critical — you need the response to start streaming before the TTS can begin speaking.

Recommended: Gemini 2.0 Flash ($0.0007/min, ~150ms TTFT) for cost efficiency. GPT-4o for complex reasoning tasks. Groq-hosted Llama 3.3 for the fastest inference speed.
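TTFT can be measured directly off any streaming response: record the request time and timestamp the first chunk. A minimal sketch, assuming an iterable of streamed text chunks; the fake generator here stands in for a real SDK stream.

```python
import time

def measure_ttft(stream):
    """Consume a streaming LLM response; return (full_text, ttft_seconds)."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time-to-first-token
        chunks.append(chunk)
    return "".join(chunks), ttft

def fake_stream():
    """Stand-in for a provider stream with a ~50ms first-token delay."""
    time.sleep(0.05)
    yield "Hello, "
    yield "how can I help?"

text, ttft = measure_ttft(fake_stream())
```

Instrumenting TTFT per provider this way, against your real prompts, is how you verify the ~150ms figures quoted above hold for your workload.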

Step 3: Pick Your TTS Provider

The TTS provider converts the LLM's text response into natural-sounding speech. Time-to-first-byte (TTFB) determines how quickly the agent starts speaking.

Recommended: Cartesia Sonic (sub-150ms TTFB, $0.0045/min) for voice agents. ElevenLabs Turbo v3 if voice quality is the top priority. Deepgram Aura for the lowest cost.
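To keep TTS TTFB low, cascaded pipelines don't wait for the full LLM response: they cut the token stream at sentence boundaries and hand each complete sentence to TTS as soon as it closes. A minimal sketch of that chunker; the boundary rule is a simplification (production systems also handle abbreviations, numbers, and trailing punctuation).

```python
import re

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_stream(chunks):
    """Yield complete sentences from a stream of LLM text chunks,
    so TTS synthesis can start before the full response is generated."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()

out = list(sentences_from_stream(["Your table is boo", "ked. See you at 7pm", "!"]))
```

This is the main lever for perceived responsiveness: the agent starts speaking after the first sentence, not after the full reply.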

Step 4: Add Orchestration

The orchestration layer handles WebSocket connections, turn-taking, interruption handling, and audio streaming between components.

Options: Build custom with WebSockets (full control), use LiveKit for real-time infrastructure, or use a platform like Vapi/Retell for managed orchestration ($0.05-0.15/min markup).
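At the heart of any orchestration layer, custom or managed, is a turn-taking state machine: the agent listens, speaks, and must stop speaking the moment the user talks over it (barge-in). A minimal framework-independent sketch:

```python
class TurnManager:
    """Tiny turn-taking state machine: LISTENING <-> SPEAKING,
    with barge-in cancelling agent speech when the user interrupts."""

    def __init__(self):
        self.state = "LISTENING"

    def on_user_speech_start(self):
        if self.state == "SPEAKING":
            self.state = "LISTENING"
            return "cancel_tts"  # stop audio playback immediately (barge-in)
        return None

    def on_agent_response_ready(self):
        if self.state == "LISTENING":
            self.state = "SPEAKING"
            return "start_tts"
        return None

    def on_agent_speech_end(self):
        self.state = "LISTENING"

tm = TurnManager()
tm.on_agent_response_ready()        # agent starts speaking
action = tm.on_user_speech_start()  # user barges in -> "cancel_tts"
```

Real orchestrators layer voice-activity detection, audio buffering, and telephony on top of this, but getting interruption handling right is what separates a usable agent from an annoying one.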

Step 5: Benchmark and Optimize

The optimal provider combination depends on your specific language, audio conditions, and latency requirements. What works for English customer support may not work for Japanese healthcare.

Use Speko: Benchmark 240+ STT+LLM+TTS combinations against your actual inputs. Get ranked results by latency, accuracy, and cost in minutes instead of weeks of manual testing.

Recommended Stacks

Based on Speko benchmark data, March 2026. Optimized for different priorities.

| Optimized For | STT | LLM | TTS | Latency | Cost/min |
|---|---|---|---|---|---|
| Lowest cost | Deepgram Nova-3 | Gemini 2.0 Flash | Cartesia Sonic | ~700ms | $0.0095 |
| Best quality | ElevenLabs Scribe | GPT-4o | ElevenLabs Turbo v3 | ~1200ms | $0.038 |
| Lowest latency | Deepgram Nova-3 | Groq Llama 3.3 | Cartesia Sonic | ~500ms | $0.012 |
| Balanced | Deepgram Nova-3 | GPT-4o | Cartesia Sonic | ~800ms | $0.018 |

Why Use Speko for Voice Agent Development?

Building a voice agent means choosing the right combination from 240+ possible stacks. Speko finds the best one for your use case.

240+ Stack Combinations

Test every STT+LLM+TTS combination against your audio, language, and domain. Get ranked results in minutes.

End-to-End Latency Testing

Measure real round-trip latency for each stack: STT time + LLM TTFT + TTS TTFB. Find the stack that hits your latency target.

Cost-per-Call Analysis

See exact per-minute costs for every combination. Optimize for your budget without sacrificing quality.

Frequently Asked Questions

What is the tech stack for a voice agent?
A voice agent requires three core components: Speech-to-Text (STT) to transcribe user speech, a Large Language Model (LLM) to generate responses, and Text-to-Speech (TTS) to speak the response. The standard production stack in 2026 is Deepgram Nova-3 (STT) + GPT-4o or Gemini 2.0 Flash (LLM) + Cartesia Sonic (TTS), achieving sub-800ms total round-trip latency at approximately $0.0095/minute.
How much does it cost to build a voice agent?
The lowest-cost production voice agent stack runs approximately $0.0095/minute: Deepgram Nova-3 ($0.0043/min) + Gemini 2.0 Flash ($0.0007/min) + Cartesia Sonic ($0.0045/min). At 10,000 minutes/month, that is $95/month in API costs. Platform solutions like Vapi or Retell charge $0.05-0.15/minute, which includes infrastructure but costs 5-15x more.
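Because per-minute rates in a cascaded stack compose linearly, monthly API cost is just the sum of component rates times usage. A quick sanity check of the figures quoted above:

```python
def monthly_cost(per_min_rates, minutes_per_month):
    """Total monthly API cost for a cascaded stack, given per-minute
    rates for each component and expected monthly call volume."""
    return sum(per_min_rates.values()) * minutes_per_month

stack = {  # per-minute rates cited in this article
    "stt_deepgram_nova3": 0.0043,
    "llm_gemini_2_flash": 0.0007,
    "tts_cartesia_sonic": 0.0045,
}
cost = monthly_cost(stack, 10_000)  # approximately $95/month
```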
Should I use speech-to-speech or cascaded STT+LLM+TTS?
For most production use cases in 2026, the cascaded STT+LLM+TTS pipeline is recommended. It gives you provider flexibility, better debuggability, and lower cost. Speech-to-speech models (OpenAI Realtime API, Gemini Live) offer lower latency but lock you into a single provider and limit customization. Use speech-to-speech for demos and prototypes; use cascaded for production.
What latency should a voice agent achieve?
A natural-feeling voice agent should achieve sub-1-second total round-trip latency (from end of user speech to start of agent speech). The target breakdown: STT < 300ms, LLM first token < 200ms, TTS first byte < 200ms. The best production stacks in 2026 achieve 600-800ms total latency.
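The budget in the answer above can be verified stage by stage. A small sketch that flags whichever component blows its target, as well as the overall round-trip limit:

```python
def check_latency_budget(measured_ms, targets_ms, total_target_ms=1000):
    """Compare measured per-stage latencies (ms) against per-stage targets
    and an overall round-trip target; return a list of violations."""
    violations = [
        f"{stage}: {measured_ms[stage]}ms > {limit}ms target"
        for stage, limit in targets_ms.items()
        if measured_ms[stage] > limit
    ]
    total = sum(measured_ms.values())
    if total > total_target_ms:
        violations.append(f"total: {total}ms > {total_target_ms}ms target")
    return violations

# Targets from the breakdown above: STT < 300ms, LLM TTFT < 200ms, TTS TTFB < 200ms.
targets = {"stt": 300, "llm_ttft": 200, "tts_ttfb": 200}
issues = check_latency_budget({"stt": 280, "llm_ttft": 350, "tts_ttfb": 150}, targets)
```

Here the total (780ms) is under the 1-second ceiling, but the LLM stage alone exceeds its 200ms TTFT target and gets flagged.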
Do I need a voice agent platform like Vapi or Retell?
Platforms like Vapi and Retell simplify development by handling WebSocket orchestration, turn-taking, and telephony integration. They are ideal for getting to production quickly. If you need full control over provider selection, cost optimization, or custom orchestration, building on individual APIs with a benchmarking layer like Speko gives you more flexibility at lower cost.
How do I prevent my voice agent from hallucinating?
Voice agent hallucination is primarily an LLM problem. Key mitigation strategies: (1) Use retrieval-augmented generation (RAG) or knowledge graphs to ground responses in verified data, (2) Set strict system prompts with explicit boundaries, (3) Implement confidence scoring and fallback to human agents for uncertain queries, (4) Use Speko's Knowledge Graph approach which reduces hallucinations by constraining LLM outputs to verified provider and domain data.

Methodology

Stack recommendations are based on Speko's automated benchmarking pipeline, which tests provider combinations against standardized voice agent scenarios (customer support, appointment booking, FAQ handling). Latency, accuracy, and cost data verified March 2026.

Benchmark Your Voice Agent Stack

240+ provider combinations. Ranked by latency, accuracy, and cost. Find the optimal stack for your use case in minutes.

Ready to try Speko?

Stop guessing which voice AI stack is best. Benchmark every combination and ship with confidence.

Get Started