How to Build a Voice Agent in 2026
Step-by-step architecture guide with provider recommendations, latency targets, and real cost estimates.
Last updated: April 2026
To build a production voice agent in 2026, you need three components: an STT engine (Deepgram Nova-3 recommended for speed), an LLM (Gemini 2.0 Flash for cost, GPT-4o for quality), and a TTS engine (Cartesia Sonic for latency). According to Speko's benchmarks, this stack achieves sub-800ms round-trip latency at $0.0095/minute, which is the sum of the Deepgram Nova-3, Gemini 2.0 Flash, and Cartesia Sonic prices listed in the steps below. Speko helps you benchmark all provider combinations to find the optimal stack for your specific use case.
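As a quick sanity check, the per-minute cost and the latency budget can be reproduced from the per-component figures quoted in this guide. The sketch below is only arithmetic on those quoted numbers; the dictionary layout and function names are illustrative assumptions, and the latency values are the guide's stated bounds rather than measurements.

```python
# Prices and latency bounds as quoted in this guide (not independently measured).
STACK = {
    "stt": {"name": "Deepgram Nova-3", "usd_per_min": 0.0043, "latency_ms": 300},
    "llm": {"name": "Gemini 2.0 Flash", "usd_per_min": 0.0007, "latency_ms": 150},
    "tts": {"name": "Cartesia Sonic", "usd_per_min": 0.0045, "latency_ms": 150},
}

TARGET_ROUND_TRIP_MS = 800  # the round-trip target stated in this guide


def stack_cost_per_min(stack):
    """Sum the per-minute price of every component."""
    return round(sum(c["usd_per_min"] for c in stack.values()), 4)


def stack_latency_ms(stack):
    """Sum the per-component latency bounds (STT + LLM TTFT + TTS TTFB)."""
    return sum(c["latency_ms"] for c in stack.values())


cost = stack_cost_per_min(STACK)
latency = stack_latency_ms(STACK)
headroom = TARGET_ROUND_TRIP_MS - latency

print(f"cost: ${cost}/min, component latency: {latency} ms, headroom: {headroom} ms")
```

The component bounds add up to 600 ms, which leaves roughly 200 ms of the 800 ms target for network hops, end-of-speech detection, and orchestration overhead.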
You have two architecture choices: the cascaded STT+LLM+TTS pipeline (more control, lower cost, better debuggability) or speech-to-speech models such as the OpenAI Realtime API (lower latency, less flexibility). This guide covers both approaches with honest tradeoffs.
Architecture Comparison
Two approaches to building voice agents. Most production systems use the cascaded pipeline for flexibility.
Step-by-Step: Building a Cascaded Voice Agent
Step 1: Choose Your STT Provider
The STT provider converts user speech to text in real time. For voice agents, streaming latency matters more than batch accuracy.
Recommended: Deepgram Nova-3 (sub-300ms, $0.0043/min) for most use cases. AssemblyAI Universal-3 if you need speaker diarization or sentiment analysis built in.
Step 2: Select Your LLM
The LLM generates the agent's response. Time-to-first-token (TTFT) is critical because the TTS cannot begin speaking until the response has started streaming.
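One common way to take advantage of a low TTFT is to forward the LLM output to the TTS engine one sentence at a time instead of waiting for the full response. The sketch below is a minimal, provider-agnostic illustration: `sentences_from_tokens` is a hypothetical helper name, the token list stands in for a real streaming response, and the punctuation-based split is a simplification that will mis-split abbreviations such as "Dr.".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (a deliberate simplification).
_SENTENCE_END = re.compile(r"[.!?]+\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever is left once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


# Simulated LLM token stream; a real one would come from a streaming API.
stream = ["Sure", ", I can", " help.", " What", " day", " works", "?"]
for sentence in sentences_from_tokens(stream):
    print(sentence)  # each sentence could be handed to the TTS engine here
```

With this approach the TTS can start on the first sentence while the LLM is still generating the second.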
Recommended: Gemini 2.0 Flash ($0.0007/min, ~150ms TTFT) for cost efficiency. GPT-4o for complex reasoning tasks. Groq-hosted Llama 3.3 for the fastest inference speed.
Step 3: Pick Your TTS Provider
The TTS provider converts the LLM's text response into natural-sounding speech. Time-to-first-byte (TTFB) determines how quickly the agent starts speaking.
Recommended: Cartesia Sonic (sub-150ms TTFB, $0.0045/min) for voice agents. ElevenLabs Turbo v3 if voice quality is the top priority. Deepgram Aura for the lowest cost.
Step 4: Add Orchestration
The orchestration layer handles WebSocket connections, turn-taking, interruption handling, and audio streaming between components.
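Interruption handling (often called barge-in) usually comes down to racing audio playback against a "user started speaking" signal and cancelling playback when the signal wins. The following is a minimal sketch using only Python's standard `asyncio`; playback is simulated with `asyncio.sleep`, and every function name here is a hypothetical stand-in rather than part of any provider SDK.

```python
import asyncio

CHUNKS = ["a", "b", "c", "d", "e"]  # placeholder audio chunks


async def play_response(chunks, played, seconds_per_chunk=0.05):
    """Stand-in for TTS playback: 'plays' one audio chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(seconds_per_chunk)  # simulated playback time
        played.append(chunk)


async def run_agent_turn(chunks, user_started_speaking):
    """Play the response, but stop as soon as the user starts talking."""
    played = []
    playback = asyncio.create_task(play_response(chunks, played))
    barge_in = asyncio.create_task(user_started_speaking.wait())
    done, pending = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # cancels playback on barge-in, or the waiter otherwise
    await asyncio.gather(*pending, return_exceptions=True)
    interrupted = playback not in done
    return played, interrupted


async def demo(interrupt_after=None):
    """Run one turn; optionally simulate the user speaking after a delay."""
    user_started_speaking = asyncio.Event()
    if interrupt_after is not None:
        loop = asyncio.get_running_loop()
        loop.call_later(interrupt_after, user_started_speaking.set)
    return await run_agent_turn(CHUNKS, user_started_speaking)


if __name__ == "__main__":
    print(asyncio.run(demo(interrupt_after=0.12)))
```

In a real agent the event would be set by voice activity detection on the incoming audio, and the list of chunks actually played tells the LLM how much of its response the user heard.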
Options: Build custom with WebSockets (full control), use LiveKit for real-time infrastructure, or use a platform like Vapi/Retell for managed orchestration ($0.05-0.15/min markup).
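To see what the managed-platform markup means in practice, it helps to add it to the provider cost quoted earlier. The sketch below assumes the markup is charged on top of the $0.0095/min provider cost, and the 10,000 minutes per month is a purely hypothetical volume chosen for illustration; it also ignores the engineering cost of building custom orchestration.

```python
PROVIDER_COST_PER_MIN = 0.0095       # STT + LLM + TTS figure quoted in this guide
MARKUP_RANGE_PER_MIN = (0.05, 0.15)  # managed platform markup quoted in this guide
MINUTES_PER_MONTH = 10_000           # hypothetical call volume, for illustration only


def managed_cost_range(provider_cost, markup_range):
    """Per-minute cost range when the markup is added on top of provider cost."""
    low, high = markup_range
    return round(provider_cost + low, 4), round(provider_cost + high, 4)


def monthly_cost(cost_per_min, minutes):
    """Total monthly spend in USD for a given per-minute cost."""
    return round(cost_per_min * minutes, 2)


managed_low, managed_high = managed_cost_range(PROVIDER_COST_PER_MIN, MARKUP_RANGE_PER_MIN)

print("self-built:", monthly_cost(PROVIDER_COST_PER_MIN, MINUTES_PER_MONTH))
print("managed (low):", monthly_cost(managed_low, MINUTES_PER_MONTH))
print("managed (high):", monthly_cost(managed_high, MINUTES_PER_MONTH))
```

Under these assumptions the markup, not the underlying providers, dominates the per-minute bill on a managed platform.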
Step 5: Benchmark and Optimize
The optimal provider combination depends on your specific language, audio conditions, and latency requirements. What works for English customer support may not work for Japanese healthcare.
Use Speko: Benchmark 240+ STT+LLM+TTS combinations against your actual inputs. Get ranked results by latency, accuracy, and cost in minutes instead of weeks of manual testing.
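However the measurements are collected, ranking usually means filtering out stacks that miss the latency target and then sorting the rest by cost. The sketch below shows that logic only; it is not Speko's API, and apart from the first entry (built from figures quoted in this guide) the stack names and numbers are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackResult:
    name: str
    latency_ms: float   # measured round-trip latency
    usd_per_min: float  # combined STT + LLM + TTS price


def rank_stacks(results, max_latency_ms):
    """Keep stacks that meet the latency target, cheapest first."""
    eligible = [r for r in results if r.latency_ms <= max_latency_ms]
    return sorted(eligible, key=lambda r: (r.usd_per_min, r.latency_ms))


results = [
    # From this guide: sum of the quoted component bounds and prices.
    StackResult("nova3+gemini-flash+sonic", latency_ms=600, usd_per_min=0.0095),
    # Placeholder entries with invented numbers, for illustration only.
    StackResult("placeholder-fast-expensive", latency_ms=450, usd_per_min=0.0300),
    StackResult("placeholder-cheap-slow", latency_ms=1200, usd_per_min=0.0060),
]

ranked = rank_stacks(results, max_latency_ms=800)
for position, stack in enumerate(ranked, start=1):
    print(position, stack.name, stack.latency_ms, stack.usd_per_min)
```

Treating latency as a hard constraint and cost as the sort key is one reasonable choice; an accuracy threshold could be added as a second filter in the same way.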
Recommended Stacks
Based on Speko benchmark data, March 2026. Optimized for different priorities.
Why Use Speko for Voice Agent Development?
Building a voice agent means choosing the right combination from 240+ possible stacks. Speko finds the best one for your use case.
240+ Stack Combinations
Test every STT+LLM+TTS combination against your audio, language, and domain. Get ranked results in minutes.
End-to-End Latency Testing
Measure real round-trip latency for each stack: STT time + LLM TTFT + TTS TTFB. Find the stack that hits your latency target.
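The round-trip formula above can be computed per conversational turn from four timestamps. The sketch below is one way to record it, assuming your pipeline can log the moment the user stops speaking, the final transcript, the first LLM token, and the first TTS audio byte; the `TurnTiming` class and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TurnTiming:
    """Timestamps for one turn, in seconds from an arbitrary start."""
    user_stopped: float
    transcript_final: float
    llm_first_token: float
    tts_first_audio: float

    def breakdown_ms(self):
        """Split the round trip into STT time, LLM TTFT, and TTS TTFB."""
        return {
            "stt": round((self.transcript_final - self.user_stopped) * 1000, 1),
            "llm_ttft": round((self.llm_first_token - self.transcript_final) * 1000, 1),
            "tts_ttfb": round((self.tts_first_audio - self.llm_first_token) * 1000, 1),
        }

    def round_trip_ms(self):
        """Time from end of user speech to first agent audio."""
        return round((self.tts_first_audio - self.user_stopped) * 1000, 1)


def share_within_target(turns, target_ms=800):
    """Fraction of turns whose round trip meets the latency target."""
    if not turns:
        return 0.0
    return sum(t.round_trip_ms() <= target_ms for t in turns) / len(turns)


# Hypothetical timings for two turns, for illustration only.
turns = [
    TurnTiming(0.00, 0.25, 0.40, 0.52),
    TurnTiming(5.00, 5.40, 5.70, 5.95),
]
for turn in turns:
    print(turn.breakdown_ms(), turn.round_trip_ms())
print("share within 800 ms:", share_within_target(turns))
```

Looking at the share of turns within target, rather than a single average, makes slow outlier turns visible.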
Cost-per-Call Analysis
See exact per-minute costs for every combination. Optimize for your budget without sacrificing quality.
Frequently Asked Questions
What is the tech stack for a voice agent?
How much does it cost to build a voice agent?
Should I use speech-to-speech or cascaded STT+LLM+TTS?
What latency should a voice agent achieve?
Do I need a voice agent platform like Vapi or Retell?
How do I prevent my voice agent from hallucinating?
Methodology
Stack recommendations are based on Speko's automated benchmarking pipeline, which tests provider combinations against standardized voice agent scenarios (customer support, appointment booking, FAQ handling). Latency, accuracy, and cost data were verified in March 2026.
Benchmark Your Voice Agent Stack
240+ provider combinations. Ranked by latency, accuracy, and cost. Find the optimal stack for your use case in minutes.