What latency target for voice?

Sub-1s end-to-end (user finishes speaking → first audio out) feels conversational. <500ms is excellent. Above 2s feels stilted. Each component (STT, LLM, TTS) should target <300ms.

Single model (e.g., GPT-4o real-time) or pipeline?

Single model is simplest, lower latency, but less customizable. Pipeline (STT + LLM + TTS) is more controllable, supports tools / RAG, and decouples vendor choices. Most production agents use pipelines.

Voice Agents in 2026 — STT, LLM, TTS, and Latency That Doesn't Hurt

Voice AI matured in 2025-2026 from impressive demos to production agents handling support calls, scheduling, and voice ordering. The architecture is well known; latency and reliability discipline is where teams trip. This post is the working playbook.

The pipeline

[User audio] → [Streaming STT] → [LLM] → [Streaming TTS] → [User]
                     │              │            │
                  partial         tools         streaming
                  transcripts     /RAG          chunks

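Each stage streams, so words flow through without waiting for the previous stage to finish. Time to first audio is roughly the sum of each stage's time to first output plus network overhead, not the sum of each stage's full processing time.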

Latency budget

Stage	Target
STT (first word)	<200ms
STT (final)	<300ms after user stops
LLM (first token)	<300ms
TTS (first audio)	<200ms
Network RTT	~50-100ms
Total (user → audio)	<1s

Getting a full sentence out takes 1.5-2s, which is about the limit for conversation. Anything slower feels robotic.
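The budget can be checked with simple arithmetic. This is a minimal sketch, assuming the stages on the critical path (end of user speech to first audio) add up one after another and taking the upper end of each range in the table. `first_audio_budget_ms` is a hypothetical helper, not part of any SDK.

```python
# Upper-bound budget per stage in milliseconds, taken from the table above.
# "STT (first word)" is left out: it happens while the user is still talking,
# so it is not on the path from end of speech to first audio.
BUDGET_MS = {
    "stt_final": 300,        # after the user stops speaking
    "llm_first_token": 300,
    "tts_first_audio": 200,
    "network_rtt": 100,      # upper end of the ~50-100ms range
}

def first_audio_budget_ms(budget: dict) -> int:
    """Sum the stages on the critical path from end of speech to first audio."""
    return sum(budget.values())

total = first_audio_budget_ms(BUDGET_MS)
print(total, "ms")
```

With these numbers the worst case lands just under the 1s target, which leaves little slack for tool calls or a slow network hop.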

STT providers

Provider	Strengths
Deepgram	Lowest latency; streaming-first
AssemblyAI	Quality; speaker diarization
OpenAI Whisper API	Commodity; sometimes higher latency
AWS Transcribe	AWS integration
Self-hosted Whisper	No vendor lock-in; ops cost

Deepgram is the de facto default for production voice agents in 2026.

STT integration

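import asyncio
import json

async def transcribe(microphone):
    # `deepgram` is an already-configured client; the exact connect call
    # depends on your SDK version.
    async with deepgram.live.connect({
        "model": "nova-3",
        "punctuate": True,
        "interim_results": True,
    }) as ws:
        async def send_audio():
            # Forward raw audio chunks as soon as they are captured.
            async for chunk in microphone:
                await ws.send(chunk)

        async def recv_transcripts():
            async for msg in ws:
                data = json.loads(msg)
                alternatives = data.get("channel", {}).get("alternatives", [])
                if not alternatives:
                    continue  # metadata message, no transcript to handle
                text = alternatives[0]["transcript"]
                if data.get("is_final"):
                    await on_final_transcript(text)
                else:
                    await on_interim(text)

        # Run both directions concurrently on the same socket.
        await asyncio.gather(send_audio(), recv_transcripts())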

Stream both ways. Interim transcripts let you start the LLM call early; finals confirm the text.
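One way to use interim transcripts is to start the LLM call speculatively and keep the result only if the final transcript matches. The sketch below is an illustration under assumptions: `SpeculativeResponder` and the `generate` callable are hypothetical names, and a real system would likely compare transcripts more loosely than exact string equality.

```python
import asyncio

class SpeculativeResponder:
    """Start the LLM on an interim transcript; reuse it only if the final matches."""

    def __init__(self, generate):
        self._generate = generate   # async callable: transcript -> reply text
        self._task = None           # in-flight speculative call, if any
        self._text = None           # transcript the speculative call was started on

    def on_interim(self, text: str) -> None:
        text = text.strip()
        if not text or text == self._text:
            return                  # nothing new to speculate on
        if self._task:
            self._task.cancel()     # transcript changed: drop the stale call
        self._text = text
        self._task = asyncio.create_task(self._generate(text))

    async def on_final(self, text: str) -> str:
        text = text.strip()
        if self._task and text == self._text:
            task = self._task       # speculation was right: reuse the head start
        else:
            if self._task:
                self._task.cancel()
            task = asyncio.create_task(self._generate(text))
        self._task = self._text = None
        return await task
```

The trade-off is cost: every discarded speculative call is still billed, so it may be worth speculating only on interims that look complete.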

LLM integration

async def generate_response(transcript):
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": transcript}],
        system=VOICE_SYSTEM_PROMPT,
        max_tokens=500,
    ) as stream:
        async for token in stream.text_stream:
            yield token

Stream tokens to TTS as they arrive. Don’t wait for full response.

System prompt for voice:

“Respond in spoken language; avoid markdown.”
“Keep responses brief (1-3 sentences).”
“Use simple, conversational tone.”
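The three rules can be assembled into the `VOICE_SYSTEM_PROMPT` constant that the LLM snippet refers to. This is a small sketch; `VOICE_RULES` and `build_voice_system_prompt` are hypothetical names, and the bullet formatting is just one reasonable choice.

```python
VOICE_RULES = [
    "Respond in spoken language; avoid markdown.",
    "Keep responses brief (1-3 sentences).",
    "Use simple, conversational tone.",
]

def build_voice_system_prompt(rules: list) -> str:
    """Join the voice rules into one system prompt, one rule per line."""
    return "\n".join(f"- {rule}" for rule in rules)

VOICE_SYSTEM_PROMPT = build_voice_system_prompt(VOICE_RULES)
print(VOICE_SYSTEM_PROMPT)
```

Keeping the rules in a list makes it easy to add call-specific rules (for example a persona or a language) without editing one long string.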

TTS providers

Provider	Strengths
ElevenLabs	Highest quality voices; streaming
OpenAI TTS	Fast; many voices
Cartesia (Sonic)	Lowest latency (~100ms first audio)
PlayHT	Voice cloning

For lowest latency: Cartesia. For voice quality: ElevenLabs.

TTS integration

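import asyncio

async def speak(llm_tokens):
    # send()/get_audio() follow the simplified interface used in this post;
    # check your SDK for the exact streaming calls.
    async with elevenlabs.text_to_speech.stream(
        voice_id="...",
        optimize_streaming_latency=4,  # max
        output_format="mp3_22050_32",
    ) as tts_stream:
        async def feed():
            # Push LLM tokens as they arrive, without waiting for audio.
            async for token in llm_tokens:
                await tts_stream.send(token)
        feeder = asyncio.create_task(feed())  # send and receive concurrently
        try:
            async for audio_chunk in tts_stream.get_audio():
                yield audio_chunk
        finally:
            feeder.cancel()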

LLM token in → TTS audio out. First audio chunk in <500ms with good config.

Barge-in / interruption

User talks while agent talks → cancel agent’s TTS, listen.

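import asyncio

async def manage_call():
    speaking_task = None  # the in-flight respond() task, if any

    async def respond(transcript):
        # Cancelling the task interrupts this loop at the next await,
        # which stops playback mid-stream.
        async for audio in pipeline(transcript):
            await play_audio(audio)

    async for transcript in stt_stream():
        # Any user speech while the agent is talking is a barge-in: stop TTS.
        if speaking_task and not speaking_task.done():
            speaking_task.cancel()
        if is_final(transcript):
            # Run as a task so this loop keeps listening during playback;
            # awaiting respond() here would block barge-in detection.
            speaking_task = asyncio.create_task(respond(transcript))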

Critical for natural conversation.

VAD (voice activity detection)

Don’t send silence to STT:

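import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3
HANGOVER_FRAMES = 10    # ~300ms of silence at 30ms frames ends an utterance; tune this

async def speech_chunks(audio_stream):
    # Frames must be 10, 20, or 30ms of 16-bit mono PCM.
    speaking = False
    silent = 0
    async for chunk in audio_stream:
        if vad.is_speech(chunk, sample_rate=16000):
            speaking, silent = True, 0
            yield chunk
        elif speaking:
            silent += 1
            if silent < HANGOVER_FRAMES:
                yield chunk       # keep short pauses inside the utterance
            else:
                speaking = False  # end of utterance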

Reduces STT cost; clarifies utterance boundaries.
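`webrtcvad` only accepts frames of 10, 20, or 30ms of 16-bit mono PCM, so incoming audio usually has to be re-cut into fixed-size frames first. A minimal sketch of that framing step, assuming 16kHz audio as in the snippet above; `frame_bytes` and `split_frames` are hypothetical helper names.

```python
from typing import Iterator, Optional

SAMPLE_RATE = 16000      # Hz, matching the VAD snippet
BYTES_PER_SAMPLE = 2     # 16-bit PCM
FRAME_MS = 30            # webrtcvad accepts 10, 20, or 30

def frame_bytes(sample_rate: int = SAMPLE_RATE, frame_ms: int = FRAME_MS) -> int:
    """Size in bytes of one VAD frame."""
    return sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE

def split_frames(pcm: bytes, size: Optional[int] = None) -> Iterator[bytes]:
    """Yield fixed-size frames; a trailing partial frame is dropped."""
    size = size or frame_bytes()
    for start in range(0, len(pcm) - size + 1, size):
        yield pcm[start:start + size]
```

In a live stream the dropped tail would normally be carried over and prepended to the next buffer rather than discarded.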

Tools / RAG

For agents with knowledge:

async def respond(transcript):
    # Fast path: most queries don't need tools
    if simple_chat(transcript):
        async for chunk in stream_llm(transcript):
            yield chunk
        return  # an async generator cannot return a value

    # Tool path
    async for chunk in stream_llm_with_tools(transcript, tools):
        yield chunk

Trade-off: tools add latency (RAG retrieval, tool execution). For voice: aggressive caching, parallel calls, minimal tools.
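The caching and parallel-call advice can be sketched with `asyncio.gather` and a small in-memory TTL cache. The names here (`cached`, `gather_context`, `CACHE_TTL_S`) and the 300s TTL are assumptions for illustration; a production system would more likely use a shared cache such as Redis.

```python
import asyncio
import time

_cache = {}          # key -> (stored_at, value)
CACHE_TTL_S = 300    # hypothetical freshness window

async def cached(key, fetch):
    """Return a fresh cached value, or call fetch() and store the result."""
    hit = _cache.get(key)
    if hit and time.monotonic() - hit[0] < CACHE_TTL_S:
        return hit[1]
    value = await fetch()
    _cache[key] = (time.monotonic(), value)
    return value

async def gather_context(query, retrievers):
    """Run every retriever concurrently: the wait is the slowest call, not the sum."""
    return await asyncio.gather(
        *(
            cached(f"{name}:{query}", lambda fn=fn: fn(query))
            for name, fn in retrievers.items()
        )
    )
```

Here `retrievers` is a dict mapping a name to an async function of the query; results come back in the same order as the dict.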

Telephony

For phone calls:

Twilio Media Streams — bidirectional WebSocket audio.
Vapi / Bland.ai — voice agent platforms abstracting telephony.
LiveKit Agents — open-source framework.

Twilio + your pipeline = direct integration. LiveKit Agents = scaffolding included.
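For the direct Twilio route, the WebSocket delivers JSON text messages, and audio arrives in `media` events as a base64 payload (8kHz mu-law by default, to my understanding; check Twilio's documentation for the current format). A minimal sketch of extracting the audio bytes; `audio_from_twilio_message` is a hypothetical helper name.

```python
import base64
import json
from typing import Optional

def audio_from_twilio_message(raw: str) -> Optional[bytes]:
    """Return raw audio bytes for a "media" event, else None."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None  # other events, e.g. "connected", "start", "stop"
    return base64.b64decode(msg["media"]["payload"])
```

The decoded bytes still need to match what the STT provider expects, so either configure the STT stream for telephone-quality mu-law audio or convert before forwarding.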

Multi-turn state

Conversation context persists across turns:

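class CallSession:
    def __init__(self):
        # Per-instance history; a class-level list would be shared by every call.
        self.messages = []

    async def turn(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        # stream_response is assumed to return the full reply text once streaming ends.
        response = await stream_response(self.messages)
        self.messages.append({"role": "assistant", "content": response})
        return response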

Cap history (sliding window) to control LLM cost. See LLM Context Windows.
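A sliding window can be as simple as keeping the last few user/assistant pairs. This is a sketch under assumptions: `trim_history` is a hypothetical helper, the default of 6 turns is arbitrary, and it counts messages rather than tokens.

```python
def trim_history(messages: list, max_turns: int = 6) -> list:
    """Keep the last max_turns user/assistant pairs, starting on a user message."""
    if max_turns <= 0:
        return []
    recent = messages[-2 * max_turns:]
    # Drop leading assistant messages so the window starts with the user.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent
```

Call it on the message list just before each LLM request. A token-based cap is more precise but needs a tokenizer for the chosen model.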

Cost

Typical 5-min call:

STT: ~$0.05
LLM: ~$0.10-0.30
TTS: ~$0.10
Telephony: ~$0.05

~$0.30-0.50 per call. At high volume, optimize the two largest line items: TTS, and the LLM (use a cheaper model where quality permits).
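The per-call range follows directly from the line items, and the same arithmetic scales to a monthly estimate. A small sketch using the figures above; the helper names and the 10,000 calls per month are illustrative assumptions.

```python
# (low, high) USD per 5-minute call, from the list above.
PER_CALL_USD = {
    "stt": (0.05, 0.05),
    "llm": (0.10, 0.30),
    "tts": (0.10, 0.10),
    "telephony": (0.05, 0.05),
}

def call_cost_range(costs=PER_CALL_USD):
    """Sum the low and high ends of every line item."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return round(low, 2), round(high, 2)

def monthly_cost_range(calls_per_month, costs=PER_CALL_USD):
    """Scale the per-call range to a monthly volume."""
    low, high = call_cost_range(costs)
    return round(low * calls_per_month, 2), round(high * calls_per_month, 2)

print(call_cost_range())
print(monthly_cost_range(10_000))
```

Swapping in your own provider prices makes it easy to see which line item dominates before deciding where to optimize.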

Quality / eval

Latency per stage; alert on regressions.
Word error rate sample monitoring.
Conversation eval (LLM judge on transcripts): did the agent help?
Hangup rate as signal.

See LLM Evaluation.
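Per-stage latency alerting can start as a percentile check against the budget. This is a minimal sketch, assuming latency samples are already collected per stage in milliseconds; `p95`, `regressions`, and the stage keys are hypothetical names, and the targets are the ones from the latency budget table.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]

# Per-stage targets in ms, from the latency budget table.
TARGETS_MS = {"stt_final": 300, "llm_first_token": 300, "tts_first_audio": 200}

def regressions(stage_samples, targets_ms=TARGETS_MS):
    """Return {stage: p95} for every stage whose p95 exceeds its target."""
    over = {}
    for stage, samples in stage_samples.items():
        if samples and stage in targets_ms:
            value = p95(samples)
            if value > targets_ms[stage]:
                over[stage] = value
    return over
```

Percentiles matter more than averages here: a stage that is fast on average but slow on one call in ten still produces awkward pauses.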

Common mistakes

1. Non-streaming pipeline

User waits 5s for “Hello, how can I help?” Streaming end-to-end is non-negotiable.

2. No barge-in

User interrupts; agent talks over them. Frustrating. Implement cancellation.

3. Long LLM responses

A 500-token answer is roughly 375 words, which takes about two and a half minutes to speak at a typical 150 words per minute. Prompt: “1-3 sentences.”

4. Markdown in TTS

“Here are some **bold** options *italic*.” TTS reads the markup characters aloud or stumbles over them. Strip formatting before synthesis.
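Even with a "no markdown" system prompt, a defensive filter in front of TTS is cheap. The sketch below covers common markers only and is not a full markdown parser; `strip_markdown` is a hypothetical helper name.

```python
import re

def strip_markdown(text: str) -> str:
    """Remove common markdown markers so TTS does not read them aloud."""
    # Drop fenced code blocks entirely; they are not speakable.
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)
    # [label](url) -> label
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Leading heading markers, bullets, and list numbers.
    text = re.sub(r"^\s{0,3}(#{1,6}|[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    # Emphasis, inline code, and strikethrough markers.
    text = re.sub(r"[*_`~]+", "", text)
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(strip_markdown("Here are some **bold** options *italic*."))
```

Note that it also removes underscores, so identifiers like `order_id` would be run together; that is usually acceptable for speech but worth knowing.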

5. No fallback

Provider outage; call breaks mid-sentence. Have a backup STT/LLM/TTS.
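A fallback chain can be sketched as "try each provider in order, with a timeout". This is an illustration under assumptions: `with_fallback` is a hypothetical helper, each provider is an async callable with the same signature, and the 2s timeout is arbitrary.

```python
import asyncio

async def with_fallback(providers, *args, timeout_s: float = 2.0):
    """Try each provider in order; return the first result that arrives in time."""
    last_error = None
    for provider in providers:
        try:
            return await asyncio.wait_for(provider(*args), timeout=timeout_s)
        except Exception as exc:  # includes timeouts; task cancellation is not caught
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

This covers request/response failures at the start of a turn. Recovering from a stream that dies mid-sentence needs extra state, such as tracking how much text has already been spoken.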

What I’d ship today

For a voice agent:

Deepgram Nova for STT.
Claude Sonnet with caching for LLM (or Haiku for trivial).
Cartesia or ElevenLabs for TTS.
LiveKit Agents as scaffolding.
Twilio Media Streams if phone-based.
VAD for utterance detection.
Barge-in / cancellation.
Sliding window context for cost.
Tracing per call for debugging.

Read this next

If you want my voice agent reference (Deepgram + Claude + Cartesia + Twilio), it’s at rajpoot.dev.

