Voice AI matured in 2025-2026 from impressive demos to production agents handling support calls, scheduling, and voice ordering. The architecture is well known; latency and reliability discipline is where teams trip up. This post is the working playbook.
The pipeline
    [User audio] → [Streaming STT] → [LLM] → [Streaming TTS] → [User]
                          │            │            │
                       partial       tools      streaming
                     transcripts     /RAG        chunks
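Each stage streams, so words flow through without waiting for the previous stage to finish. Time to first audio is roughly the sum of each stage's time to first output, plus network overhead; streaming keeps each stage's full duration off the critical path.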
Latency budget
| Stage | Target |
|---|---|
| STT (first word) | <200ms |
| STT (final) | <300ms after user stops |
| LLM (first token) | <300ms |
| TTS (first audio) | <200ms |
| Network RTT | ~50-100ms |
| Total (user → audio) | <1s |
For a full spoken sentence: 1.5-2s. That is the upper bound for conversation; beyond it the agent feels robotic.
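A minimal sketch of how the table's targets add up on the critical path, from the moment the user stops talking to the first audio they hear. The stage keys and the `over_budget` helper are hypothetical names for illustration; the numbers are the upper-bound targets from the table.

```python
# Upper-bound targets from the table, in milliseconds.
# STT "first word" is left out: it overlaps with the user still speaking.
BUDGET_MS = {
    "stt_final": 300,        # after the user stops
    "llm_first_token": 300,
    "tts_first_audio": 200,
    "network_rtt": 100,
}


def time_to_first_audio(stage_ms):
    """Stages run back to back until first audio, so their delays add up."""
    return sum(stage_ms.values())


def over_budget(measured_ms, budget_ms=BUDGET_MS):
    """Return the stages whose measured delay exceeds the target."""
    return [stage for stage, ms in measured_ms.items() if ms > budget_ms[stage]]


print(time_to_first_audio(BUDGET_MS))  # 900, inside the <1s total
```

Feeding measured per-stage numbers into `over_budget` shows which stage to fix first when the total drifts past one second.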
STT providers
| Provider | Strengths |
|---|---|
| Deepgram | Lowest latency; streaming-first |
| AssemblyAI | Quality; speaker diarization |
| OpenAI Whisper API | Commodity; sometimes higher latency |
| AWS Transcribe | AWS integration |
| Self-hosted Whisper | No vendor lock-in; ops cost |
Deepgram is the de facto choice for production voice agents in 2026.
STT integration
    import asyncio
    import json

    async with deepgram.live.connect({
        "model": "nova-3",
        "punctuate": True,
        "interim_results": True,
    }) as ws:
        async def send_audio():
            # Push microphone chunks upstream as they are captured
            async for chunk in microphone:
                await ws.send(chunk)

        async def recv_transcripts():
            # Interim and final results arrive on the same socket
            async for msg in ws:
                data = json.loads(msg)
                if data.get("is_final"):
                    await on_final_transcript(data["channel"]["alternatives"][0]["transcript"])
                else:
                    await on_interim(data...)

        # Run both directions concurrently
        await asyncio.gather(send_audio(), recv_transcripts())
Stream both ways. Interim transcripts let you start the LLM call early; finals confirm or correct the guess.
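One way to use interim transcripts, sketched with plain asyncio: start generating on the latest interim text, and restart only if the final transcript turns out different. `SpeculativeResponder` and the `generate` coroutine are hypothetical names, not part of any SDK.

```python
import asyncio


class SpeculativeResponder:
    """Starts generating on interim transcripts; restarts if the final differs."""

    def __init__(self, generate):
        self._generate = generate  # hypothetical coroutine: text -> reply
        self._task = None
        self._text = None

    def on_interim(self, text):
        if text == self._text:
            return  # same guess as before, keep the running task
        if self._task is not None:
            self._task.cancel()  # the guess changed, discard stale work
        self._text = text
        self._task = asyncio.create_task(self._generate(text))

    async def on_final(self, text):
        if self._task is None or text != self._text:
            self.on_interim(text)  # the guess was wrong, start over
        task, self._task, self._text = self._task, None, None
        return await task
```

When the last interim already matches the final, the reply has a head start equal to the STT finalization delay. The cost is wasted LLM calls on guesses that get discarded.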
LLM integration
    async def generate_response(transcript):
        async with client.messages.stream(
            model="claude-sonnet-4-6",
            messages=[{"role": "user", "content": transcript}],
            system=VOICE_SYSTEM_PROMPT,
            max_tokens=500,
        ) as stream:
            async for token in stream.text_stream:
                yield token
Stream tokens to TTS as they arrive. Don’t wait for full response.
System prompt for voice:
- “Respond in spoken language; avoid markdown.”
- “Keep responses brief (1-3 sentences).”
- “Use simple, conversational tone.”
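The rules above could be assembled into the `VOICE_SYSTEM_PROMPT` constant that the LLM call references. This is a sketch: the framing sentence and the `VOICE_RULES` name are my assumptions, and the rule text is copied from the list.

```python
# Rule text taken verbatim from the list above.
VOICE_RULES = [
    "Respond in spoken language; avoid markdown.",
    "Keep responses brief (1-3 sentences).",
    "Use simple, conversational tone.",
]

# The opening sentence is an assumed framing; adjust it to your agent's role.
VOICE_SYSTEM_PROMPT = (
    "You are a voice agent on a live call. Follow these rules:\n"
    + "\n".join(f"- {rule}" for rule in VOICE_RULES)
)
```

Keeping the rules in a list makes it easy to add or A/B test one rule at a time.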
TTS providers
| Provider | Strengths |
|---|---|
| ElevenLabs | Highest quality voices; streaming |
| OpenAI TTS | Fast; many voices |
| Cartesia (Sonic) | Lowest latency (~100ms first audio) |
| PlayHT | Voice cloning |
For lowest latency: Cartesia. For voice quality: ElevenLabs.
TTS integration
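    import asyncio

    async def synthesize(llm_tokens):
        async with elevenlabs.text_to_speech.stream(
            voice_id="...",
            optimize_streaming_latency=4,  # max
            output_format="mp3_22050_32",
        ) as tts_stream:
            async def feed_tokens():
                async for token in llm_tokens:
                    await tts_stream.send(token)

            # Feed text in the background so audio is read while tokens still arrive
            feeder = asyncio.create_task(feed_tokens())
            async for audio_chunk in tts_stream.get_audio():
                yield audio_chunk
            await feeder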
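LLM tokens go in, TTS audio comes out. With a good configuration the first audio chunk arrives in under 500ms; the latency budget above targets under 200ms, so measure this stage.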
Barge-in / interruption
User talks while agent talks → cancel agent’s TTS, listen.
    import asyncio

    async def manage_call():
        state = CallState()

        async def listen():
            async for transcript in stt_stream():
                if state.is_speaking:
                    await state.cancel_speaking()  # stop TTS
                state.last_user_text = transcript
                if is_final(transcript):
                    # Respond in a background task so this loop keeps listening for barge-in
                    state.reply_task = asyncio.create_task(respond(transcript))

        async def respond(transcript):
            state.is_speaking = True
            try:
                async for audio in pipeline(transcript):
                    if state.cancel_requested:
                        return
                    await play_audio(audio)
            finally:
                state.is_speaking = False

        await listen()
Critical for natural conversation.
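A self-contained sketch of the cancellation half, using task cancellation instead of a polled flag. `Speaker` and `play_chunk` are hypothetical names; `play_chunk` stands in for whatever writes audio to the caller.

```python
import asyncio


class Speaker:
    """Plays reply audio as a task so new user speech can cancel it."""

    def __init__(self, play_chunk):
        self._play_chunk = play_chunk  # hypothetical coroutine: chunk -> None
        self._task = None

    @property
    def is_speaking(self):
        return self._task is not None and not self._task.done()

    def start(self, audio_chunks):
        """Begin playback of an async iterator of audio chunks."""
        self._task = asyncio.create_task(self._run(audio_chunks))

    async def _run(self, audio_chunks):
        async for chunk in audio_chunks:
            await self._play_chunk(chunk)

    async def barge_in(self):
        """Stop playback right away; safe to call when idle."""
        if self.is_speaking:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass  # expected: the playback task was cancelled on purpose
```

Cancelling the task interrupts playback at the next `await`, so the agent stops within one audio chunk. Audio already buffered downstream (for example at the telephony provider) still needs to be flushed separately.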
VAD (voice activity detection)
Don’t send silence to STT:
    import webrtcvad

    vad = webrtcvad.Vad(2)  # aggressiveness 0-3

    async def speech_chunks(audio_stream):
        # webrtcvad expects 10, 20, or 30 ms frames of 16-bit mono PCM
        speaking = False
        async for chunk in audio_stream:
            is_speech = vad.is_speech(chunk, sample_rate=16000)
            if is_speech:
                speaking = True
                yield chunk  # forward only speech frames to STT
            elif speaking:
                # End of utterance
                speaking = False
Reduces STT cost; clarifies utterance boundaries.
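Ending an utterance on the first silent frame cuts people off mid-pause, so a hangover window is common. The sketch below uses a simple energy threshold in place of a real VAD so that it runs with the standard library only; `frame_is_speech`, `utterances`, the threshold, and the hangover length are all illustrative assumptions.

```python
from array import array


def frame_is_speech(frame, threshold=500):
    """Crude energy check on 16-bit mono PCM (native byte order assumed)."""
    samples = array("h", frame)
    if not samples:
        return False
    return sum(abs(s) for s in samples) / len(samples) > threshold


def utterances(frames, is_speech=frame_is_speech, hangover_frames=10):
    """Group frames into utterances, ending one after N silent frames in a row."""
    buffer, silent, speaking = [], 0, False
    for frame in frames:
        if is_speech(frame):
            speaking, silent = True, 0
            buffer.append(frame)
        elif speaking:
            silent += 1
            buffer.append(frame)  # keep short pauses inside the utterance
            if silent >= hangover_frames:
                yield b"".join(buffer[:len(buffer) - silent])  # trim trailing silence
                buffer, silent, speaking = [], 0, False
    if buffer:
        yield b"".join(buffer[:len(buffer) - silent])  # flush at end of stream
```

Swapping in a real detector only requires passing a different `is_speech` callable. With 20 ms frames, `hangover_frames=10` means 200 ms of silence ends the turn.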
Tools / RAG
For agents with knowledge:
    async def respond(transcript):
        # Quick: most queries don't need tools
        if simple_chat(transcript):
            # An async generator cannot return a value, so re-yield the stream
            async for chunk in stream_llm(transcript):
                yield chunk
            return
        # Tool path
        async for chunk in stream_llm_with_tools(transcript, tools):
            yield chunk
Trade-off: tools add latency (RAG retrieval, tool execution). For voice: aggressive caching, parallel calls, minimal tools.
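The "parallel calls" part can be sketched with `asyncio.gather` plus a hard per-tool deadline, so one slow lookup cannot stall the reply. `call_tools` and the 300 ms default are hypothetical choices for illustration.

```python
import asyncio


async def call_tools(calls, timeout_s=0.3):
    """Run tool coroutines concurrently; drop any that miss the deadline.

    `calls` maps a tool name to an un-awaited coroutine.
    """

    async def guarded(name, coro):
        try:
            return name, await asyncio.wait_for(coro, timeout_s)
        except asyncio.TimeoutError:
            return name, None  # too slow for a live call, answer without it

    pairs = await asyncio.gather(*(guarded(n, c) for n, c in calls.items()))
    return {name: result for name, result in pairs if result is not None}
```

Total tool latency becomes the slowest call that finishes in time, capped at `timeout_s`, instead of the sum of all calls.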
Telephony
For phone calls:
- Twilio Media Streams — bidirectional WebSocket audio.
- Vapi / Bland.ai — voice agent platforms abstracting telephony.
- LiveKit Agents — open-source framework.
Twilio + your pipeline = direct integration. LiveKit Agents = scaffolding included.
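For the direct Twilio route, the WebSocket carries JSON messages with base64-encoded audio (8 kHz mu-law). The sketch below reflects my understanding of the documented message shape (`event`, `streamSid`, `media.payload`); verify the field names against Twilio's current Media Streams reference before relying on them. The function names are hypothetical.

```python
import base64
import json


def decode_twilio_message(raw):
    """Return caller audio bytes for 'media' events, else None."""
    msg = json.loads(raw)
    if msg.get("event") == "media":
        return base64.b64decode(msg["media"]["payload"])
    return None  # 'connected', 'start', 'stop', etc. carry no audio


def encode_twilio_media(stream_sid, audio):
    """Wrap agent audio bytes in a 'media' message to send back to the caller."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })


def encode_twilio_clear(stream_sid):
    """Ask Twilio to drop audio it has buffered, useful on barge-in."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

Decoded bytes still need conversion from mu-law to whatever format the STT provider expects, unless the provider accepts mu-law directly.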
Multi-turn state
Conversation context persists across turns:
    class CallSession:
        def __init__(self):
            # Per-call history; a class-level list would be shared by every call
            self.messages = []

        async def turn(self, user_text):
            self.messages.append({"role": "user", "content": user_text})
            response = await stream_response(self.messages)
            self.messages.append({"role": "assistant", "content": response})
Cap history (sliding window) to control LLM cost. See LLM Context Windows.
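A sliding window can be as small as the function below. `trim_history` and the default of 20 messages are hypothetical; the extra loop keeps the window starting on a user turn, which chat APIs that require alternating roles generally expect.

```python
def trim_history(messages, max_messages=20):
    """Keep the most recent messages, starting on a user turn."""
    if max_messages <= 0:
        return []
    window = messages[-max_messages:]
    # Drop leading assistant turns left over from cutting mid-exchange
    while window and window[0]["role"] != "user":
        window = window[1:]
    return window
```

Call it on the stored history just before each LLM request, so the full transcript is still available for logging and evaluation.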
Cost
Typical 5-min call:
- STT: ~$0.05
- LLM: ~$0.10-0.30
- TTS: ~$0.10
- Telephony: ~$0.05
~$0.30-0.50 per call. For high-volume: optimize the LLM and TTS (the two largest line items); use a cheaper LLM where quality permits.
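The per-call arithmetic, written out so the numbers can be swapped for your own. The figures are the estimates from the list above, in cents to avoid float rounding; the 10,000 calls per month in the usage line is a made-up volume for illustration.

```python
# (low, high) per 5-minute call, in US cents, from the list above.
COSTS_CENTS = {
    "stt": (5, 5),
    "llm": (10, 30),
    "tts": (10, 10),
    "telephony": (5, 5),
}


def per_call_range(costs):
    """Sum the low and high ends across all line items."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


def monthly_range_usd(costs, calls_per_month):
    """Scale the per-call range to a monthly bill in dollars."""
    low, high = per_call_range(costs)
    return low * calls_per_month / 100, high * calls_per_month / 100


print(per_call_range(COSTS_CENTS))             # (30, 50)
print(monthly_range_usd(COSTS_CENTS, 10_000))  # (3000.0, 5000.0)
```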
Quality / eval
- Latency per stage; alert on regressions.
- Word error rate sample monitoring.
- Conversation eval (LLM judge on transcripts): did the agent help?
- Hangup rate as signal.
See LLM Evaluation.
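For the word error rate sampling, the metric itself is a word-level edit distance divided by the reference length. A minimal sketch, assuming you have human-corrected reference transcripts for the sampled calls; the lowercase-and-split normalization is a simplification.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("book a table for two", "book the table for two"))  # 0.2
```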
Common mistakes
1. Non-streaming pipeline
User waits 5s for “Hello, how can I help?” Streaming end-to-end is non-negotiable.
2. No barge-in
User interrupts; agent talks over them. Frustrating. Implement cancellation.
3. Long LLM responses
A 500-token answer is roughly 375 words, which takes over two minutes to speak aloud. Prompt: “1-3 sentences.”
4. Markdown in TTS
“Here are some **bold** options *italic*.” TTS reads the markup characters aloud. Strip formatting before synthesis.
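Even with a prompt that forbids markdown, a small filter in front of TTS is a cheap safety net. This is a rough sketch, not a full markdown parser; `strip_markdown` is a hypothetical helper and it only covers inline code, links, headings, bullets, and asterisk or double-underscore emphasis.

```python
import re


def strip_markdown(text):
    """Remove common markdown markers so TTS does not read them aloud."""
    text = re.sub(r"`([^`]*)`", r"\1", text)                  # inline code
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)      # links: keep the label
    text = re.sub(r"^[ \t]{0,3}(#{1,6}|[-*+]|\d+\.)[ \t]+", "", text, flags=re.M)  # headings, bullets
    text = re.sub(r"(\*\*|\*|__)(.+?)\1", r"\2", text)        # bold and italic
    return re.sub(r"[ \t]+", " ", text).strip()


print(strip_markdown("Here are some **bold** options *italic*."))
# Here are some bold options italic.
```

When streaming, apply it per sentence or clause, since a marker pair can be split across tokens.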
5. No fallback
Provider outage; call breaks mid-sentence. Have a backup STT/LLM/TTS.
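A backup provider can be wired in with a small wrapper that tries each provider in order under a timeout. `with_fallback` is a hypothetical helper; it suits request/response calls, while a stream that dies halfway needs extra handling to resume mid-sentence.

```python
import asyncio


async def with_fallback(providers, *args, timeout_s=1.0):
    """Try each provider coroutine function in order; return the first success."""
    last_error = None
    for provider in providers:
        try:
            return await asyncio.wait_for(provider(*args), timeout_s)
        except Exception as exc:  # includes timeouts; cancellation still propagates
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Keep the timeout tight: on a live call, a provider that answers in three seconds is as bad as one that is down.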
What I’d ship today
For a voice agent:
- Deepgram Nova for STT.
- Claude Sonnet with caching for LLM (or Haiku for trivial).
- Cartesia or ElevenLabs for TTS.
- LiveKit Agents as scaffolding.
- Twilio Media Streams if phone-based.
- VAD for utterance detection.
- Barge-in / cancellation.
- Sliding window context for cost.
- Tracing per call for debugging.
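Per-call tracing does not need a full observability stack to start. A sketch with a context manager that records how long each stage took for one call; `CallTrace` and the stage names are hypothetical.

```python
import time
from contextlib import contextmanager


class CallTrace:
    """Collects per-stage wall-clock timings for a single call."""

    def __init__(self, call_id):
        self.call_id = call_id
        self.stages_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record even if the stage raised, so failures show up in the trace
            self.stages_ms[name] = (time.perf_counter() - start) * 1000

    def slowest(self):
        """Name of the stage that took longest, or None if nothing was timed."""
        return max(self.stages_ms, key=self.stages_ms.get, default=None)
```

Usage is `with trace.stage("llm_first_token"): ...` around each step; logging `trace.stages_ms` with the call ID at hangup gives a per-call latency breakdown to debug from.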
Read this next
If you want my voice agent reference (Deepgram + Claude + Cartesia + Twilio), it’s at rajpoot.dev.