Voice agents in 2026 finally feel natural. Sub-second responses, clean interruption handling, multilingual, in-context memory. The Realtime APIs from OpenAI, Google, and ElevenLabs collapsed an architecture that took 8 months in 2023 into something you can ship in a week. This post is the working guide.
What changed
Old pipeline (2023):
Mic → ASR → text → LLM → text → TTS → Speaker
+500ms +2000ms +500ms = 3 seconds
New pipeline (2026, Realtime APIs):
Mic → Realtime API → Speaker
end-to-end ~700ms
The realtime APIs do speech-to-speech directly. The model “thinks in voice” rather than transcribing → generating text → synthesizing. Result: lower latency, more natural prosody, faster interruption handling.
The contenders
| Provider | Strengths | Latency | Best for |
|---|---|---|---|
| OpenAI Realtime | Mature, function calling, broad voice options | 600–900ms | General voice agents |
| Gemini Live | Long context, video understanding | 700–1000ms | Multimodal agents |
| ElevenLabs Conversational | Best TTS quality | 800–1200ms | Production-grade voice |
| Anthropic Realtime (rolling out 2026) | Tool use, long reasoning | 800–1100ms | Reasoning-heavy agents |
| Custom (ASR + LLM + TTS) | Full control | 1500–3000ms | Niche needs only |
For most production work in 2026, OpenAI Realtime or ElevenLabs are the right starting points.
A working OpenAI Realtime client
import asyncio
import base64
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2024-12"

async def voice_agent(speaker):
    # `speaker` is your audio output object; it needs write(bytes)
    # and flush_and_clear().
    async with websockets.connect(
        URL,
        # Newer websockets releases (14+) call this `additional_headers`.
        extra_headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        },
    ) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": "You are a helpful voice assistant. Be concise.",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad", "threshold": 0.5},
            }
        }))
        async for raw in ws:
            event = json.loads(raw)
            t = event.get("type")
            if t == "response.audio.delta":
                # The delta is base64 PCM16; decode before playing it
                audio_chunk = base64.b64decode(event["delta"])
                speaker.write(audio_chunk)
            elif t == "response.audio_transcript.delta":
                print(event["delta"], end="", flush=True)  # for debugging
            elif t == "input_audio_buffer.speech_started":
                # User started speaking: stop output
                speaker.flush_and_clear()
The model handles ASR + thinking + TTS in one stream. Your client sends mic audio, plays output audio, and reacts to events.
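The client above only covers the receiving half. Here is a minimal sketch of the sending half, assuming the `input_audio_buffer.append` event with a base64 `audio` field as I understand OpenAI's Realtime docs. The helper names are mine, not part of any SDK.

```python
import base64
import json


def make_append_event(pcm16_chunk: bytes) -> str:
    """Wrap raw PCM16 mic bytes in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })


def decode_audio_delta(event: dict) -> bytes:
    """Turn a response.audio.delta event back into raw PCM16 bytes."""
    return base64.b64decode(event["delta"])
```

In the receive loop you would call `await ws.send(make_append_event(chunk))` for each mic chunk, typically from a separate task so sending never blocks playback.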
Voice Activity Detection (VAD)
The realtime APIs ship server-side VAD. The server detects when the user starts speaking and when they stop. On stop, it triggers the response.
turn_detection.type = "server_vad" is the default. Two parameters that matter:
- threshold: how sensitive detection is (0.0–1.0). Higher = more conservative; needs louder speech to trigger.
- silence_duration_ms: how long after the user stops speaking before the turn is considered over. Lower = more responsive but more false cuts; higher = more polite but feels slow.
For phone-quality audio, threshold ~0.5 + silence ~500ms works. For studio audio, threshold ~0.3 + silence ~700ms.
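Those two starting points can live in code as presets. This is a sketch: the profile names and helper functions are hypothetical, and the numbers are only the starting values suggested above, so expect to tune them.

```python
# Starting values from the text above; tune them on real traffic.
VAD_PRESETS = {
    "phone": {"threshold": 0.5, "silence_duration_ms": 500},
    "studio": {"threshold": 0.3, "silence_duration_ms": 700},
}


def vad_config(profile: str) -> dict:
    """Build a turn_detection block for a named audio profile."""
    if profile not in VAD_PRESETS:
        raise ValueError(f"unknown VAD profile: {profile!r}")
    return {"type": "server_vad", **VAD_PRESETS[profile]}


def session_update(profile: str) -> dict:
    """Build a session.update event that only changes turn detection."""
    return {
        "type": "session.update",
        "session": {"turn_detection": vad_config(profile)},
    }
```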
Interruption handling
The killer feature. When the user starts speaking mid-response:
- Server emits input_audio_buffer.speech_started.
- Your client must immediately stop playing audio output.
- Server cancels the in-progress response automatically.
- After the user finishes, a new response is generated.
Without correct handling, the agent talks over the user. With it, the conversation flows naturally.
elif t == "input_audio_buffer.speech_started":
    speaker.flush_and_clear()  # stop playback
    # Optionally: send a cancel for the in-flight response
    await ws.send(json.dumps({"type": "response.cancel"}))
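The `speaker` object in these snippets is never defined in this post. One possible shape for it is a thread-safe queue that the audio callback drains, so clearing it drops everything not yet played. The class and its `next_chunk` method are my own illustration, not a real audio library API.

```python
import threading
from collections import deque
from typing import Optional


class PlaybackBuffer:
    """Queue of PCM chunks waiting to be played."""

    def __init__(self) -> None:
        self._chunks = deque()
        self._lock = threading.Lock()

    def write(self, pcm: bytes) -> None:
        """Queue a decoded audio chunk for playback."""
        with self._lock:
            self._chunks.append(pcm)

    def next_chunk(self) -> Optional[bytes]:
        """Called by the audio output loop; returns None when idle."""
        with self._lock:
            return self._chunks.popleft() if self._chunks else None

    def flush_and_clear(self) -> int:
        """Drop all queued audio; returns how many chunks were dropped."""
        with self._lock:
            dropped = len(self._chunks)
            self._chunks.clear()
            return dropped
```

Keeping queued chunks small (tens of milliseconds each) matters here: whatever is already handed to the sound card still plays after the clear.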
Function calling in voice
"tools": [{
    "type": "function",
    "name": "lookup_order",
    "description": "Look up an order by ID",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"]
    }
}]
The voice agent can speak “let me check that for you” while calling lookup_order in parallel. When the function returns, the result feeds back into the conversation. Same shape as tool use in text APIs (see Anthropic Claude API + Tool Use Guide).
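Feeding the result back can be sketched as follows. I'm assuming the `conversation.item.create` event with a `function_call_output` item, followed by `response.create`, as I read OpenAI's Realtime docs; check the current reference before relying on it. The `lookup_order` body and the `TOOLS` registry are stand-ins.

```python
import json


def lookup_order(order_id: str) -> dict:
    # Hypothetical stand-in for a real order lookup.
    return {"order_id": order_id, "status": "shipped"}


TOOLS = {"lookup_order": lookup_order}


def handle_function_call(name: str, call_id: str, arguments_json: str) -> list:
    """Run the requested tool and build the events to send back."""
    args = json.loads(arguments_json)
    result = TOOLS[name](**args)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),
            },
        },
        # Ask the model to continue now that it has the tool result.
        {"type": "response.create"},
    ]
```

Each returned dict would go out through `ws.send(json.dumps(event))` in order.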
Latency budget
End-to-end (user finished speaking → agent first audio):
| Component | Budget |
|---|---|
| Mic → server | 50–100ms |
| Server VAD detect end | 300–500ms |
| LLM generate first token | 100–200ms |
| TTS generate first audio | 50–150ms |
| Server → speaker | 50–100ms |
| Total | ~600–1000ms |
The biggest variable is VAD silence threshold. Lowering it cuts latency at the cost of cutting users off. Tune on your domain.
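To see how much of the budget VAD eats, add up the table rows. The sketch below sums the low and high ends (550–1050ms, which the table rounds to ~600–1000ms) and computes the VAD share. The dictionary keys are just labels I chose.

```python
# (low_ms, high_ms) per component, copied from the table above.
BUDGET_MS = {
    "mic_to_server": (50, 100),
    "vad_end_detect": (300, 500),
    "llm_first_token": (100, 200),
    "tts_first_audio": (50, 150),
    "server_to_speaker": (50, 100),
}


def total_budget(budget: dict) -> tuple:
    """Sum the low and high ends of every component."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def vad_share(budget: dict) -> tuple:
    """Fraction of the total spent waiting for VAD, at each end."""
    low, high = total_budget(budget)
    vad_lo, vad_hi = budget["vad_end_detect"]
    return vad_lo / low, vad_hi / high
```

At both ends VAD accounts for roughly half of the total, which is why the silence setting is the first knob to turn.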
When to roll your own
Rare. But cases where stitched ASR + LLM + TTS still wins:
- You need a specific model not on a realtime API.
- You’re running self-hosted (no realtime model is open-source-good-enough yet in 2026).
- You’re building for embedded / offline.
- Compliance requires data not leaving your VPC.
If you go this route, the rough latency is 1.5–3 seconds end-to-end. Components: Whisper (or WhisperX) for ASR; an LLM via vLLM (see Self-Hosted LLMs); Coqui or Piper for TTS.
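A stitched pipeline is just three calls in sequence, so it is worth timing each stage from day one. This sketch takes the stages as plain callables rather than calling Whisper, vLLM, or Piper directly; `run_turn` and the timing keys are names I made up.

```python
import time
from typing import Callable, Dict, Tuple


def run_turn(
    audio: bytes,
    asr: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run ASR -> LLM -> TTS once and record per-stage latency in ms."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    text = asr(audio)
    timings["asr_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    reply = llm(text)
    timings["llm_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    speech = tts(reply)
    timings["tts_ms"] = (time.perf_counter() - start) * 1000

    timings["total_ms"] = sum(timings.values())
    return speech, timings
```

This version waits for each stage to finish before starting the next, which is where most of the 1.5–3 seconds comes from; streaming between stages is the usual next optimization.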
Production architecture
[Mobile / Web Client]
│ WebRTC / WebSocket
▼
[Voice Gateway]
• Auth
• Rate limit
• Recording (legal / training)
• Routing to provider
│
▼
[Realtime API provider]
│
▼
[Tool servers (MCP / HTTP)]
• Lookup customer
• Place order
• Update calendar
│
▼
[Postgres + Redis state]
Auth and rate-limit at the gateway. Tools as separate services (or MCP servers). All the boring web infra still applies.
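For the rate-limit box, a token bucket per caller is one common choice. This is a minimal in-process sketch; a real gateway would likely keep the bucket state in Redis so it works across instances. The class and its parameters are illustrative.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate_per_sec` sessions on average, with bursts up to `capacity`."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: int,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self._now = now
        self.last = now()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        current = self._now()
        elapsed = current - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The injectable `now` argument exists only so the limiter can be tested without sleeping.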
Recording, transcripts, compliance
Voice + AI = subpoena bait. Plan for:
- Recording with consent (“calls may be recorded”).
- Transcripts for review.
- PII redaction before storing or training.
- Retention policies.
- Right to delete (GDPR).
The realtime APIs surface transcripts as events; persist them with the audio.
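A first pass at redaction before a transcript is stored can be done with patterns. This is a deliberately simple sketch that catches emails and phone-like numbers only; names and street addresses need something stronger, such as an NER model. The patterns and placeholder tokens are my own.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# A digit, then at least seven digits/separators, ending on a digit.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

The phone pattern will also match other long digit runs (card numbers, long order IDs), which is usually acceptable for storage but worth knowing about.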
Cost in 2026
Rough numbers (will move):
| Provider | $ / minute |
|---|---|
| OpenAI Realtime | ~$0.06 input + $0.24 output |
| Gemini Live | ~$0.10 / minute |
| ElevenLabs Conversational | ~$0.30 / minute (premium voices) |
| Self-hosted (Whisper + Llama + Piper) | ~$0.02 / minute compute |
For a customer support voice agent doing 1000 calls × 3 min = 3000 minutes/month: $180–900 per month depending on provider. Compare to a human agent: $4000+ per month. The economics work even at premium pricing.
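The arithmetic behind that range, as a sketch you can rerun when prices move. The rates are the rough numbers from the table. Treating OpenAI as a band between the input-only rate and the input-plus-output rate is my simplification, since the real blend depends on who talks more.

```python
# Rough $ per minute from the table above; these will move.
PER_MINUTE_USD = {
    "openai_realtime_low": 0.06,   # input rate only (assumed lower bound)
    "openai_realtime_high": 0.30,  # input + output rates added together
    "gemini_live": 0.10,
    "elevenlabs_conversational": 0.30,
    "self_hosted": 0.02,
}


def monthly_cost(calls: int, minutes_per_call: float, rate: float) -> float:
    """Total monthly spend in dollars, rounded to cents."""
    return round(calls * minutes_per_call * rate, 2)


def cost_table(calls: int, minutes_per_call: float) -> dict:
    """Monthly cost per provider for the given call volume."""
    return {
        name: monthly_cost(calls, minutes_per_call, rate)
        for name, rate in PER_MINUTE_USD.items()
    }
```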
What surprises new builders
- Small latency differences are felt. A 1100ms response feels noticeably worse than an 800ms one. Optimize aggressively.
- Background noise wrecks VAD. Phone audio is noisy; studio audio isn’t. Test in real environments.
- Users interrupt all the time. Without good interruption handling, the agent feels rude.
- Long responses get monotonous. Cap the model’s verbosity (“be concise”) in the system prompt.
- Function calls add latency. A tool call mid-conversation can add 1–2s. Prompt the model to “speak while looking that up.”
- Multilingual is harder than text. Code-switching (English + Hindi mid-sentence) is common in some markets and breaks naive ASR.
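The "speak while looking that up" pattern from the function-call bullet can be sketched with asyncio: start the tool call as a task, say the acknowledgement, then await the result. `slow_lookup` and `say` are hypothetical stand-ins for a real tool and a real speech output.

```python
import asyncio
from typing import Awaitable, Callable


async def slow_lookup(order_id: str) -> dict:
    # Stand-in for a tool call that takes a while.
    await asyncio.sleep(0.05)
    return {"order_id": order_id, "status": "shipped"}


async def answer_with_ack(
    order_id: str, say: Callable[[str], Awaitable[None]]
) -> None:
    """Start the lookup, fill the silence, then report the result."""
    task = asyncio.create_task(slow_lookup(order_id))
    await say("Let me check that for you.")
    result = await task
    await say(f"Order {order_id} is {result['status']}.")
```

The lookup runs while the acknowledgement is being spoken, so the user hears something immediately instead of a silent gap.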
Common mistakes
1. No interruption handling
The number-one beginner bug. Handle speech_started events; stop playback immediately.
2. Over-long system prompts
System prompts cost the same as text APIs. A 5k-token prompt + voice = ~$0.30 per call before the user says anything. Trim.
3. Synchronous tool calls
A 3-second tool call mid-conversation is a 3-second silence. Tell the model to acknowledge before calling. Stream a “let me look that up” first.
4. No fallback
The realtime API is down for 5 minutes. Now what? Have a fallback (text mode, “we’re having issues, please call back”).
5. Trusting the transcript
Realtime ASR makes mistakes, especially on names and numbers. For high-stakes (orders, addresses), confirm by reading back: “I have your address as 123 Main Street, is that right?”
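For IDs and numbers, reading back character by character helps the user catch ASR errors. This is a small sketch for building that prompt text; the function names and the spoken-symbol table are my own choices.

```python
# How a few symbols should be spoken aloud; extend as needed.
SPOKEN = {"-": "dash", ".": "dot", "@": "at"}


def spell_for_readback(value: str) -> str:
    """Space out characters so TTS reads them one at a time."""
    parts = []
    for ch in value:
        if ch.isspace():
            continue
        parts.append(SPOKEN.get(ch, ch.upper()))
    return " ".join(parts)


def confirmation_prompt(label: str, value: str) -> str:
    """Build a read-back question for a high-stakes value."""
    return f"I have your {label} as {spell_for_readback(value)}, is that right?"
```

For free-form values such as street addresses, read the value back as normal text instead; spelling is mainly useful for codes.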
What I’d build today
For a small voice agent project (say, customer support for a SaaS):
- OpenAI Realtime API as the engine.
- A FastAPI gateway for auth, recording, MCP tool dispatch.
- MCP server for the SaaS’s domain tools (lookup customer, update ticket).
- Postgres for transcripts and state.
- Twilio Media Streams for phone call ingress (or web-based via WebRTC).
End-to-end in 1–2 weeks of focused work. Pair with Build an MCP Server for Your SaaS.
Read this next
- AI Agents with LangGraph in 2026
- Anthropic Claude API + Tool Use Guide
- Build an MCP Server for Your SaaS
- SSE vs WebSockets in 2026
If you want a working FastAPI + OpenAI Realtime + Twilio voice agent starter, it’s at rajpoot.dev.
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work; see my projects and ways to reach me at rajpoot.dev.