Voice AI matured in 2025-2026 from impressive demos to production agents handling support calls, scheduling, and voice ordering. The architecture is well known; latency and reliability discipline is where teams trip. This post is the working playbook.

The pipeline

[User audio] → [Streaming STT] → [LLM] → [Streaming TTS] → [User]
                      ↓             ↓             ↓
                   partial        tools        streaming
                   transcripts    /RAG         chunks

Each stage streams, so words flow through without waiting for the previous stage to finish. Time to first audio is roughly the sum of each stage's time to first output, plus a bit of overhead.

Latency budget

Stage                   Target
STT (first word)        <200ms
STT (final)             <300ms after user stops
LLM (first token)       <300ms
TTS (first audio)       <200ms
Network RTT             ~50-100ms
Total (user → audio)    <1s

For a full sentence, expect 1.5-2s. That is about the floor for conversation: anything slower feels robotic.
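
As a rough sanity check, the budget can be added up along the critical path. This is a minimal sketch that assumes the stages hand off in sequence, takes the upper end of the network range, and leaves out the STT first-word figure because it overlaps with the user still speaking. The dictionary keys are illustrative names, not provider fields.

```python
# Illustrative budget in milliseconds, taken from the targets above.
BUDGET_MS = {
    "stt_final": 300,        # final transcript after the user stops
    "llm_first_token": 300,
    "tts_first_audio": 200,
    "network_rtt": 100,      # upper end of the 50-100ms range
}


def time_to_first_audio(budget_ms: dict[str, int]) -> int:
    """Stages hand off in sequence, so their first-output delays add up."""
    return sum(budget_ms.values())


total = time_to_first_audio(BUDGET_MS)
print(f"time to first audio: {total}ms")  # 900ms, inside the 1s target
```

If a measured stage blows its line in this table, the total tells you how much slack the other stages have left.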

STT providers

Provider              Strengths
Deepgram              Lowest latency; streaming-first
AssemblyAI            Quality; speaker diarization
OpenAI Whisper API    Commodity; sometimes higher latency
AWS Transcribe        AWS integration
Self-hosted Whisper   No vendor lock-in; ops cost

Deepgram is the de facto default for production voice agents in 2026.

STT integration

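import asyncio
import json

# Schematic: the exact connect call depends on your Deepgram SDK version.
async def transcribe(microphone):
    async with deepgram.live.connect({
        "model": "nova-3",
        "punctuate": True,
        "interim_results": True,
    }) as ws:
        async def send_audio():
            async for chunk in microphone:
                await ws.send(chunk)

        async def recv_transcripts():
            async for msg in ws:
                data = json.loads(msg)
                if "channel" not in data:
                    continue  # skip non-transcript messages such as metadata
                transcript = data["channel"]["alternatives"][0]["transcript"]
                if data.get("is_final"):
                    await on_final_transcript(transcript)
                else:
                    await on_interim(transcript)

        # Send audio and receive transcripts concurrently on the same socket.
        await asyncio.gather(send_audio(), recv_transcripts())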

Stream in both directions. Interim transcripts let you start the LLM call early; final transcripts confirm it.
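
One way to act on interim transcripts is to start generating speculatively and keep the result only if the final transcript matches the guess. The sketch below assumes a hypothetical `generate` coroutine that maps text to a reply; `SpeculativeResponder` and its method names are illustrative, not part of any SDK.

```python
import asyncio


class SpeculativeResponder:
    """Start generating on an interim transcript; keep it only if the final matches."""

    def __init__(self, generate):
        self._generate = generate  # hypothetical coroutine: text -> reply
        self._task = None
        self._text = None

    def on_interim(self, text: str) -> None:
        if text == self._text:
            return  # same guess, keep the running task
        self._cancel()
        self._text = text
        self._task = asyncio.create_task(self._generate(text))

    async def on_final(self, text: str) -> str:
        if self._task is None or text != self._text:
            self._cancel()  # the guess was wrong: start over on the final text
            self._text = text
            self._task = asyncio.create_task(self._generate(text))
        task, self._task, self._text = self._task, None, None
        return await task

    def _cancel(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None
```

The trade-off is wasted LLM calls whenever the interim guess changes, so this is only worth it when the saved latency matters more than the extra spend.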

LLM integration

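from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def generate_response(transcript):
    async with client.messages.stream(
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": transcript}],
        system=VOICE_SYSTEM_PROMPT,
        max_tokens=500,
    ) as stream:
        # text_stream yields text deltas as they arrive
        async for text in stream.text_stream:
            yield text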

Stream tokens to TTS as they arrive. Don’t wait for full response.
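
Some TTS engines produce better prosody from phrase- or sentence-sized input than from single tokens. If yours does, a small buffer between the LLM and TTS is a middle ground that still avoids waiting for the full response. This is a sketch under that assumption; `sentence_chunks` and `demo` are illustrative names, and the split is naive (an abbreviation such as "Dr. Smith" would be cut in two).

```python
import asyncio
import re

# Split after sentence-ending punctuation that is followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def sentence_chunks(tokens):
    """Group streamed LLM text into sentence-sized chunks for TTS."""
    buffer = ""
    async for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


async def demo():
    async def tokens():
        for t in ["Sure", ", I can", " help. ", "What", " time", " works?"]:
            yield t

    return [s async for s in sentence_chunks(tokens())]


print(asyncio.run(demo()))
```

The first sentence reaches TTS as soon as its closing punctuation arrives, so time to first audio grows by one sentence of LLM output at most.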

System prompt for voice:

  • “Respond in spoken language; avoid markdown.”
  • “Keep responses brief (1-3 sentences).”
  • “Use simple, conversational tone.”

TTS providers

Provider            Strengths
ElevenLabs          Highest quality voices; streaming
OpenAI TTS          Fast; many voices
Cartesia (Sonic)    Lowest latency (~100ms first audio)
PlayHT              Voice cloning

For lowest latency: Cartesia. For voice quality: ElevenLabs.

TTS integration

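import asyncio

# Schematic: an input-streaming TTS session; the exact call depends on your SDK version.
async def speak(llm_tokens):
    async with elevenlabs.text_to_speech.stream(
        voice_id="...",
        optimize_streaming_latency=4,  # max
        output_format="mp3_22050_32",
    ) as tts_stream:
        async def feed_text():
            async for token in llm_tokens:
                await tts_stream.send(token)
        # Feed text and read audio concurrently; nesting the audio loop in the token loop stalls.
        feeder = asyncio.create_task(feed_text())
        async for audio_chunk in tts_stream.get_audio():
            yield audio_chunk
        await feeder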

LLM tokens in, TTS audio out. With a good configuration, the first audio chunk arrives in under 500ms.

Barge-in / interruption

User talks while agent talks → cancel agent’s TTS, listen.

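import asyncio

async def manage_call():
    state = CallState()
    state.response_task = None  # in-flight respond() task, if any

    async def respond(transcript):
        state.is_speaking = True
        try:
            async for audio in pipeline(transcript):
                await play_audio(audio)
        finally:
            # Runs on normal completion and on cancellation.
            state.is_speaking = False

    async def listen():
        async for transcript in stt_stream():
            if state.is_speaking and state.response_task:
                state.response_task.cancel()  # barge-in: stop TTS playback
            state.last_user_text = transcript
            if is_final(transcript):
                # Run as a task so this loop keeps listening while the agent speaks.
                state.response_task = asyncio.create_task(respond(transcript))

    await listen()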

Critical for natural conversation.

VAD (voice activity detection)

Don’t send silence to STT:

import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3

async def speech_chunks(audio_stream):
    """Yield only speech frames, then None at the end of each utterance."""
    speaking = False
    async for chunk in audio_stream:
        # webrtcvad needs 10, 20, or 30 ms frames of 16-bit mono PCM.
        if vad.is_speech(chunk, sample_rate=16000):
            speaking = True
            yield chunk
        elif speaking:
            # End of utterance
            speaking = False
            yield None

Reduces STT cost; clarifies utterance boundaries.
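
Ending an utterance on the first silent frame tends to split speech at every short pause. A common remedy is a hangover: require several consecutive non-speech frames before declaring the utterance over. The sketch below works on per-frame speech flags (for example, the booleans a VAD returns); `utterance_ends` and the default of 10 frames are illustrative, and at 30ms frames that default is about 300ms of silence.

```python
def utterance_ends(speech_flags, hangover_frames=10):
    """Yield the frame index at which each utterance ends.

    An utterance ends only after `hangover_frames` consecutive non-speech
    frames, so short pauses between words do not split it.
    """
    speaking = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            speaking = True
            silent_run = 0
        elif speaking:
            silent_run += 1
            if silent_run >= hangover_frames:
                speaking = False
                silent_run = 0
                yield i


flags = [True, True, False, True, False, False, False, True]
print(list(utterance_ends(flags, hangover_frames=3)))  # [6]
```

The hangover adds directly to end-of-turn latency, so it is worth tuning against the STT final budget rather than leaving at a default.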

Tools / RAG

For agents with knowledge:

async def respond(transcript):
    # Quick: most queries don't need tools
    if simple_chat(transcript):
        # An async generator cannot `return` a value, so re-yield the stream.
        async for chunk in stream_llm(transcript):
            yield chunk
        return

    # Tool path
    async for chunk in stream_llm_with_tools(transcript, tools):
        yield chunk

Trade-off: tools add latency (RAG retrieval, tool execution). For voice: aggressive caching, parallel calls, minimal tools.

Telephony

For phone calls:

  • Twilio Media Streams — bidirectional WebSocket audio.
  • Vapi / Bland.ai — voice agent platforms abstracting telephony.
  • LiveKit Agents — open-source framework.

Twilio + your pipeline = direct integration. LiveKit Agents = scaffolding included.
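
For the direct Twilio route, the WebSocket delivers JSON messages, and only those with event "media" carry audio, as a base64 payload (8kHz mu-law by default). A minimal sketch of pulling the audio bytes out of one message follows; `twilio_audio_payload` is an illustrative helper name, and resampling to whatever your STT expects is left out.

```python
import base64
import json


def twilio_audio_payload(message: str):
    """Return raw audio bytes from a Media Streams "media" event, else None."""
    data = json.loads(message)
    if data.get("event") != "media":
        return None  # "connected", "start", "stop", and "mark" carry no audio
    return base64.b64decode(data["media"]["payload"])


example = json.dumps(
    {"event": "media", "media": {"payload": base64.b64encode(b"\xff\x7f").decode()}}
)
print(twilio_audio_payload(example))
```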

Multi-turn state

Conversation context persists across turns:

class CallSession:
    def __init__(self):
        # Per-call history; a class-level list would be shared across all calls.
        self.messages = []

    async def turn(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        response = await stream_response(self.messages)
        self.messages.append({"role": "assistant", "content": response})
        return response

Cap history with a sliding window to control LLM cost. See LLM Context Windows.
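
A sliding window can be as simple as keeping the last few exchanges. This sketch assumes messages alternate user and assistant, as in the session class above; `trim_history` and the default of 6 turns are illustrative choices.

```python
def trim_history(messages, max_turns=6):
    """Keep only the most recent `max_turns` user/assistant exchanges."""
    if max_turns <= 0:
        return []
    keep = max_turns * 2  # one user + one assistant message per turn
    trimmed = messages[-keep:]
    # Drop a leading assistant message so the window starts with the user.
    if trimmed and trimmed[0]["role"] == "assistant":
        trimmed = trimmed[1:]
    return trimmed
```

Call it on the history before each LLM request rather than mutating the stored list, so the full transcript stays available for evaluation.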

Cost

Typical 5-min call:

  • STT: ~$0.05
  • LLM: ~$0.10-0.30
  • TTS: ~$0.10
  • Telephony: ~$0.05

~$0.30-0.50 per call. For high volume: the LLM is the largest line item in this breakdown, so use a cheaper LLM where quality permits, then optimize TTS.
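
The per-call figures can be turned into a quick volume estimate. The numbers below are the rough estimates from the list above, not provider pricing, and the function names are illustrative.

```python
# Per-call cost estimates in USD as (low, high), from the list above.
COST_PER_CALL = {
    "stt": (0.05, 0.05),
    "llm": (0.10, 0.30),
    "tts": (0.10, 0.10),
    "telephony": (0.05, 0.05),
}


def call_cost_range(costs):
    """Sum the low and high ends across all line items."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return round(low, 2), round(high, 2)


def monthly_cost_range(costs, calls_per_month):
    """Scale the per-call range to a monthly call volume."""
    low, high = call_cost_range(costs)
    return round(low * calls_per_month, 2), round(high * calls_per_month, 2)


print(call_cost_range(COST_PER_CALL))             # (0.3, 0.5)
print(monthly_cost_range(COST_PER_CALL, 10_000))  # (3000.0, 5000.0)
```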

Quality / eval

  • Latency per stage; alert on regressions.
  • Word error rate, monitored on a sample of calls.
  • Conversation eval (LLM judge on transcripts): did the agent help?
  • Hangup rate as signal.

See LLM Evaluation.
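
Per-stage latency alerting can start small: record samples per stage and flag any stage whose 95th percentile exceeds its target. This is a sketch with illustrative names (`StageLatency`, the stage keys); a real deployment would more likely feed a metrics system. Stages with fewer than two samples are skipped because `statistics.quantiles` needs at least two data points on older Python versions.

```python
import statistics
from collections import defaultdict


class StageLatency:
    """Collect per-stage latencies and flag stages whose p95 exceeds a target."""

    def __init__(self, targets_ms):
        self.targets_ms = targets_ms
        self.samples = defaultdict(list)

    def record(self, stage, latency_ms):
        self.samples[stage].append(latency_ms)

    def p95(self, stage):
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        return statistics.quantiles(self.samples[stage], n=20)[-1]

    def regressions(self):
        return [
            stage
            for stage, target in self.targets_ms.items()
            if len(self.samples[stage]) >= 2 and self.p95(stage) > target
        ]
```

Tracking a high percentile rather than the mean matters here because a handful of slow turns is what callers actually notice.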

Common mistakes

1. Non-streaming pipeline

User waits 5s for “Hello, how can I help?” Streaming end-to-end is non-negotiable.

2. No barge-in

User interrupts; agent talks over them. Frustrating. Implement cancellation.

3. Long LLM responses

A 500-token answer is roughly 375 words, which takes over two minutes to speak aloud. Prompt: “1-3 sentences.”

4. Markdown in TTS

“Here are some **bold** and *italic* options.” TTS either reads the markup aloud or stumbles over it. Strip formatting before synthesis.
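
Even with a prompt that forbids markdown, a defensive filter in front of TTS is cheap. The sketch below covers only common cases (fenced code, links, headings, bullets, asterisks, backticks) and is an assumption about what is worth stripping, not a full markdown parser; `strip_markdown` is an illustrative name. Underscores are left alone so identifiers such as order_id survive.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove common markdown markup so TTS does not read it aloud."""
    text = re.sub(r"`{3}.*?`{3}", " ", text, flags=re.DOTALL)  # fenced code
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)       # [label](url) -> label
    text = re.sub(                                             # headings and bullets
        r"^\s{0,3}(#{1,6}|[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE
    )
    text = re.sub(r"[*`]+", "", text)                          # emphasis, inline code
    return re.sub(r"\s+", " ", text).strip()


print(strip_markdown("Here are some **bold** and *italic* options"))
```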

5. No fallback

Provider outage; call breaks mid-sentence. Have a backup STT/LLM/TTS.
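
A basic failover wrapper tries providers in order and moves on when one errors or times out. This sketch assumes each provider is exposed as an async callable with the same signature; `with_fallback` and the 2s timeout are illustrative. It only covers failures before a result is returned; recovering from a stream that dies mid-sentence needs extra state and is not shown.

```python
import asyncio


async def with_fallback(providers, *args, timeout_s=2.0):
    """Try each (name, call) pair in order; return the first result in time."""
    last_error = None
    for name, call in providers:
        try:
            return name, await asyncio.wait_for(call(*args), timeout_s)
        except Exception as exc:  # includes asyncio.TimeoutError
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Returning the provider name alongside the result makes it easy to log how often the backup path is actually used.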

What I’d ship today

For a voice agent:

  • Deepgram Nova for STT.
  • Claude Sonnet with caching for LLM (or Haiku for trivial).
  • Cartesia or ElevenLabs for TTS.
  • LiveKit Agents as scaffolding.
  • Twilio Media Streams if phone-based.
  • VAD for utterance detection.
  • Barge-in / cancellation.
  • Sliding window context for cost.
  • Tracing per call for debugging.

Read this next

If you want my voice agent reference (Deepgram + Claude + Cartesia + Twilio), it’s at rajpoot.dev.

