What latency budget do voice agents need?

End-to-end (user stops talking → agent starts speaking) under 800ms feels natural; under 1.2s is acceptable; over 2s feels broken. The realtime APIs put you in the 600–900ms range out of the box.

Should I use a realtime API or stitch ASR + LLM + TTS myself?

For 2026 production, use a realtime API (OpenAI Realtime, Gemini Live, ElevenLabs Conversational) unless you have specific reasons not to. Stitched pipelines are slower and make it harder to handle interruptions cleanly.

How do you handle the user interrupting the agent?

Voice activity detection (VAD) detects user speech mid-response; the realtime API exposes interruption events. On interruption, cancel the current TTS, clear the queued response, return to listening. The realtime APIs handle most of this; rolling your own requires careful state machines.

Voice Agents and Realtime LLM APIs in 2026 — How They Actually Work

Voice agents in 2026 finally feel natural: sub-second responses, clean interruption handling, multilingual support, in-context memory. The Realtime APIs from OpenAI, Google, and ElevenLabs collapsed an architecture that took 8 months to build in 2023 into something you can ship in a week. This post is the working guide.

What changed

Old pipeline (2023):

Mic → ASR → text → LLM → text → TTS → Speaker
        +500ms     +2000ms   +500ms = 3 seconds

New pipeline (2026, Realtime APIs):

Mic → Realtime API → Speaker
       end-to-end ~700ms

The realtime APIs do speech-to-speech directly. The model “thinks in voice” rather than transcribing → generating text → synthesizing. Result: lower latency, more natural prosody, faster interruption handling.

The contenders

Provider	Strengths	Latency	Best for
OpenAI Realtime	Mature, function calling, broad voice options	600–900ms	General voice agents
Gemini Live	Long context, video understanding	700–1000ms	Multimodal agents
ElevenLabs Conversational	Best TTS quality	800–1200ms	Production-grade voice
Anthropic Realtime (rolling out 2026)	Tool use, long reasoning	800–1100ms	Reasoning-heavy agents
Custom (ASR + LLM + TTS)	Full control	1500–3000ms	Niche needs only

For most production work in 2026, OpenAI Realtime or ElevenLabs are the right starting points.

A working OpenAI Realtime client

import asyncio
import base64
import json
import os

import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2024-12"

async def voice_agent(speaker):
    # `speaker` is your audio output object; it needs write(bytes)
    # and flush_and_clear(). It is a placeholder, not a library class.
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # websockets >= 14 renamed this keyword to `additional_headers`.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": "You are a helpful voice assistant. Be concise.",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad", "threshold": 0.5},
            }
        }))

        async for raw in ws:
            event = json.loads(raw)
            t = event.get("type")
            if t == "response.audio.delta":
                # delta is base64-encoded PCM16; decode before playback
                speaker.write(base64.b64decode(event["delta"]))
            elif t == "response.audio_transcript.delta":
                print(event["delta"], end="", flush=True)  # for debugging
            elif t == "input_audio_buffer.speech_started":
                # User started speaking: drop any queued output
                speaker.flush_and_clear()

The model handles ASR + thinking + TTS in one stream. Your client sends mic audio, plays output audio, and reacts to events.
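The loop above only covers the receive side. Here is a minimal sketch of the send side, assuming your microphone capture yields raw PCM16 bytes. Each chunk is base64-encoded and wrapped in an `input_audio_buffer.append` event. The helper names, the 24 kHz sample rate, and the 20 ms chunk size are my assumptions for illustration; check them against your provider's audio format docs.

```python
import base64
import json

SAMPLE_RATE = 24_000   # assumed: 24 kHz mono PCM16
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHUNK_MS = 20          # assumed chunk length per event


def chunk_size_bytes(sample_rate=SAMPLE_RATE, chunk_ms=CHUNK_MS):
    """Number of bytes in one chunk of mono PCM16 audio."""
    return sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE


def split_pcm(pcm: bytes, size: int):
    """Yield consecutive slices of at most `size` bytes."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def mic_chunk_to_event(pcm_bytes: bytes) -> str:
    """Wrap raw PCM16 bytes in a JSON event ready for ws.send()."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })
```

In practice you would run a second task next to the receive loop that calls `await ws.send(mic_chunk_to_event(chunk))` for every chunk the microphone produces.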

Voice Activity Detection (VAD)

The realtime APIs ship server-side VAD. The server detects when the user starts speaking and when they stop. On stop, it triggers the response.

turn_detection.type = "server_vad" is the default. Two parameters that matter:

threshold — how sensitive (0.0–1.0). Higher = more conservative; needs louder speech to trigger.
silence_duration_ms — how long after user stops to consider it a turn end. Lower = more responsive but more false-cuts; higher = more polite but feels slow.

For phone-quality audio, threshold ~0.5 + silence ~500ms works. For studio audio, threshold ~0.3 + silence ~700ms.
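To make those two knobs concrete, here is a small sketch that builds a `session.update` payload from a named preset. The preset names and the `vad_session_update` helper are hypothetical; the numbers are just the starting points quoted above, not tuned values.

```python
# Hypothetical presets mirroring the starting points in the text.
VAD_PRESETS = {
    "phone": {"threshold": 0.5, "silence_duration_ms": 500},
    "studio": {"threshold": 0.3, "silence_duration_ms": 700},
}


def vad_session_update(preset: str) -> dict:
    """Build a session.update event that sets server-side VAD."""
    params = VAD_PRESETS[preset]
    if not 0.0 <= params["threshold"] <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {"type": "server_vad", **params},
        },
    }
```

Send it with `await ws.send(json.dumps(vad_session_update("phone")))` and adjust from there based on real call recordings.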

Interruption handling

The killer feature. When the user starts speaking mid-response:

Server emits input_audio_buffer.speech_started.
Your client must immediately stop playing audio output.
Server cancels the in-progress response automatically.
After the user finishes, a new response is generated.

Without correct handling, the agent talks over the user. With it, the conversation flows naturally.

elif t == "input_audio_buffer.speech_started":
    speaker.flush_and_clear()           # stop playback
    # Optionally: send a cancel for the in-flight response
    await ws.send(json.dumps({"type": "response.cancel"}))
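The `speaker` object in these snippets is never defined, so here is one possible shape for it: a thread-safe queue that the receive loop fills and an audio callback drains. `PlaybackQueue` is a hypothetical class of mine, not part of any SDK, and it leaves the actual audio device out. The point is that `flush_and_clear()` must drop everything still queued, otherwise the agent keeps talking for as long as the buffer lasts.

```python
import base64
import threading
from collections import deque
from typing import Optional, Union


class PlaybackQueue:
    """Hypothetical stand-in for `speaker`: buffers PCM until played."""

    def __init__(self) -> None:
        self._chunks = deque()
        self._lock = threading.Lock()

    def write(self, chunk: Union[bytes, str]) -> None:
        """Queue audio; a str is treated as base64-encoded PCM."""
        if isinstance(chunk, str):
            chunk = base64.b64decode(chunk)
        with self._lock:
            self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Called by the audio output side; None means nothing queued."""
        with self._lock:
            return self._chunks.popleft() if self._chunks else None

    def flush_and_clear(self) -> int:
        """Drop all queued audio; returns the number of bytes dropped."""
        with self._lock:
            dropped = sum(len(c) for c in self._chunks)
            self._chunks.clear()
        return dropped
```

The byte count returned by `flush_and_clear()` is handy for logging how much audio each interruption cut off.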

Function calling in voice

"tools": [{
    "type": "function",
    "name": "lookup_order",
    "description": "Look up an order by ID",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"]
    }
}]

The voice agent can speak “let me check that for you” while calling lookup_order in parallel. When the function returns, the result feeds back into the conversation. The shape is the same as text tool calling; see the Anthropic Claude API + Tool Use Guide.
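Here is a sketch of the "result feeds back" step. I am assuming the server reports a finished call as an item of type `function_call` carrying `name`, `call_id`, and a JSON string in `arguments` (in OpenAI Realtime this arrives inside a `response.output_item.done` event), and that you answer with a `function_call_output` item followed by `response.create`. Verify the field names against your provider's event reference. The `lookup_order` body is a stub.

```python
import json


def lookup_order(order_id: str) -> dict:
    # Stub for illustration; a real version would query your order system.
    return {"order_id": order_id, "status": "shipped"}


TOOLS = {"lookup_order": lookup_order}


def events_for_function_call(item: dict, tools=TOOLS) -> list:
    """Run the requested tool and build the events to send back."""
    args = json.loads(item.get("arguments") or "{}")
    handler = tools.get(item["name"])
    if handler is None:
        result = {"error": "unknown tool: " + item["name"]}
    else:
        result = handler(**args)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],
                "output": json.dumps(result),
            },
        },
        # Ask the model to continue speaking with the tool result in context.
        {"type": "response.create"},
    ]
```

In the receive loop you would send each returned event with `await ws.send(json.dumps(ev))`.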

Latency budget

End-to-end (user finished speaking → agent first audio):

Component	Budget
Mic → server	50–100ms
Server VAD detect end	300–500ms
LLM generate first token	100–200ms
TTS generate first audio	50–150ms
Server → speaker	50–100ms
Total	~600–1000ms

The biggest variable is the VAD end-of-turn wait (silence_duration_ms). Lowering it cuts latency at the cost of cutting users off. Tune it on your domain.
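A quick way to sanity-check the table is to sum the component ranges. The sketch below uses the numbers from the table (the dictionary keys are my own labels): the rows add up to 550–1050 ms, which the table rounds to ~600–1000 ms, and the VAD wait alone accounts for roughly half of the total.

```python
# (low, high) in milliseconds, copied from the table above.
BUDGET_MS = {
    "mic_to_server": (50, 100),
    "vad_end_of_turn": (300, 500),
    "llm_first_token": (100, 200),
    "tts_first_audio": (50, 150),
    "server_to_speaker": (50, 100),
}


def total_ms(budget):
    """Sum the best-case and worst-case ends of every component."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def share_of_total(budget, component):
    """Fraction of the best-case and worst-case total one component uses."""
    low_total, high_total = total_ms(budget)
    lo, hi = budget[component]
    return lo / low_total, hi / high_total


print(total_ms(BUDGET_MS))  # (550, 1050)
```

Swapping in your own measured numbers shows quickly whether the VAD setting or the network is the thing worth tuning.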

When to roll your own

Rare. But cases where stitched ASR + LLM + TTS still wins:

You need a specific model not on a realtime API.
You’re running self-hosted (no open-source realtime model is good enough yet in 2026).
You’re building for embedded / offline.
Compliance requires data not leaving your VPC.

If you go this route, the rough latency is 1.5–3 seconds end-to-end. Components: Whisper (or WhisperX) for ASR; an LLM via vLLM (see Self-Hosted LLMs); Coqui or Piper for TTS.

Production architecture

[Mobile / Web Client]
       │ WebRTC / WebSocket
       ▼
[Voice Gateway]
   • Auth
   • Rate limit
   • Recording (legal / training)
   • Routing to provider
       │
       ▼
[Realtime API provider]
       │
       ▼
[Tool servers (MCP / HTTP)]
   • Lookup customer
   • Place order
   • Update calendar
       │
       ▼
[Postgres + Redis state]

Auth and rate-limit at the gateway. Tools as separate services (or MCP servers). All the boring web infra still applies.
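For the rate-limit box, a token bucket is one common choice. It is my pick for illustration, not something this architecture requires. The sketch below is in-memory and single-process; a real gateway would likely keep the counters in Redis so that all instances share them. The `clock` parameter exists only to make the class testable.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits the budget."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that passed, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

You would keep one bucket per API key or caller ID and reject a new session when `allow()` returns False.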

Recording, transcripts, compliance

Voice + AI = subpoena bait. Plan for:

Recording with consent (“calls may be recorded”).
Transcripts for review.
PII redaction before storing or training.
Retention policies.
Right to delete (GDPR).

The realtime APIs surface transcripts as events; persist them with the audio.
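For the redaction step, a first pass can be as simple as regex substitution before the transcript is written to storage. The patterns below are deliberately naive assumptions of mine (emails and phone-like digit runs only). They will miss names and addresses, so treat this as a placeholder for a proper PII detection service.

```python
import re

# Naive patterns for illustration; real PII detection needs much more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text
```

Run it on each transcript event before persisting, and keep the unredacted audio under a stricter retention policy if you need it at all.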

Cost in 2026

Rough numbers (will move):

Provider	$ / minute
OpenAI Realtime	~$0.06 input + $0.24 output
Gemini Live	~$0.10 / minute
ElevenLabs Conversational	~$0.30 / minute (premium voices)
Self-hosted (Whisper + Llama + Piper)	~$0.02 / minute compute

For a customer support voice agent doing 1000 calls × 3 min = 3000 minutes/month, that is roughly $300–900 per month on the hosted providers at the rates above, or about $60 of compute if self-hosted. Compare to a human agent: $4000+ per month. The economics work even at premium pricing.
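The monthly figure depends on how you apply the per-minute rates, so it helps to write the arithmetic down. The sketch below uses the rates from the table and one assumption of mine: for OpenAI, input minutes are the time the user talks and output minutes are the time the agent talks, split 50/50. Real billing is token-based, so treat the result as an estimate.

```python
MINUTES_PER_MONTH = 1000 * 3  # 1000 calls x 3 minutes

# Flat per-minute rates from the table above.
FLAT_RATES = {
    "Gemini Live": 0.10,
    "ElevenLabs Conversational": 0.30,
    "Self-hosted": 0.02,
}
OPENAI_INPUT_RATE = 0.06
OPENAI_OUTPUT_RATE = 0.24


def flat_cost(rate: float, minutes: int = MINUTES_PER_MONTH) -> float:
    return round(rate * minutes, 2)


def openai_cost(minutes: int = MINUTES_PER_MONTH, agent_talk_share: float = 0.5) -> float:
    """Assumes the agent speaks `agent_talk_share` of each call."""
    output_minutes = minutes * agent_talk_share
    input_minutes = minutes - output_minutes
    return round(input_minutes * OPENAI_INPUT_RATE
                 + output_minutes * OPENAI_OUTPUT_RATE, 2)


for name, rate in FLAT_RATES.items():
    print(name, flat_cost(rate))
print("OpenAI Realtime", openai_cost())
```

Under these assumptions Gemini Live comes to $300, OpenAI Realtime to $450, ElevenLabs to $900, and self-hosted compute to $60 per month. Changing `agent_talk_share` shows how sensitive the OpenAI number is to how talkative the agent is.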

What surprises new builders

Small latency differences are felt. Users notice gaps of around 100ms, and a 1100ms response feels noticeably worse than 800ms. Optimize aggressively.
Background noise wrecks VAD. Phone audio is noisy; studio audio isn’t. Test in real environments.
Users interrupt all the time. Without good interruption handling, the agent feels rude.
Long responses get monotonous. Cap the model’s verbosity (“be concise”) in the system prompt.
Function calls add latency. A tool call mid-conversation can add 1–2s. Prompt the model to “speak while looking that up.”
Multilingual is harder than text. Code-switching (English + Hindi mid-sentence) is common in some markets and breaks naive ASR.

Common mistakes

1. No interruption handling

The number-one beginner bug. Handle input_audio_buffer.speech_started events and stop playback immediately.

2. Over-long system prompts

System prompts cost the same as text APIs. A 5k-token prompt + voice = ~$0.30 per call before the user says anything. Trim.

3. Synchronous tool calls

A 3-second tool call mid-conversation is a 3-second silence. Tell the model to acknowledge before calling. Stream a “let me look that up” first.

4. No fallback

The realtime API is down for 5 minutes. Now what? Have a fallback (text mode, “we’re having issues, please call back”).
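One way to wire this in is to wrap session startup in a timeout and hand any failure to a fallback coroutine. This is a sketch under my own assumptions: `primary` and `fallback` are hypothetical callables you supply, and only timeouts and OS-level connection errors are caught. Your WebSocket library may raise its own exception types that you would need to add.

```python
import asyncio


async def run_with_fallback(primary, fallback, timeout_s: float = 3.0):
    """Await primary(); on timeout or connection error, await fallback(exc)."""
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except (asyncio.TimeoutError, OSError) as exc:
        # ConnectionError and friends are subclasses of OSError.
        return await fallback(exc)
```

Here `primary` would open the realtime session, and `fallback` would switch the client to text mode or play the “we’re having issues, please call back” message.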

5. Trusting the transcript

Realtime ASR makes mistakes, especially on names and numbers. For high-stakes details (orders, addresses), confirm by reading back: “I have your address as 123 Main Street, is that right?”

What I’d build today

For a small voice agent project (say, customer support for a SaaS):

OpenAI Realtime API as the engine.
A FastAPI gateway for auth, recording, MCP tool dispatch.
MCP server for the SaaS’s domain tools (lookup customer, update ticket).
Postgres for transcripts and state.
Twilio Media Streams for phone call ingress (or web-based via WebRTC).

End-to-end in 1–2 weeks of focused work. Pair with Build an MCP Server for Your SaaS.

Read this next

If you want a working FastAPI + OpenAI Realtime + Twilio voice agent starter, it’s at rajpoot.dev.

