Voice chat is harder than text. Latency budgets are tight, quality requirements are high, and users notice a 200ms delay. The architecture is specialized. This post walks through the design.
Architecture
Clients (browsers / native apps with WebRTC)
↓
Signaling server (WebSocket — exchange SDP / ICE)
↓
SFU (Selective Forwarding Unit) — receives + forwards audio streams
↓
Other clients in the room
WebRTC handles audio encoding, transport, jitter, NAT traversal. Your servers handle signaling and forwarding.
SFU
Each user publishes one audio stream. The SFU forwards it to the N other users in the room.
- CPU-cheap: forward, don’t mix.
- Selective: only send streams the consumer is actually listening to (audio-active speakers).
- Adaptive: lower bitrate to consumers on bad networks.
Open-source SFUs: LiveKit, mediasoup, ion-sfu, Janus.
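The "selective" part can be sketched as a small filter. This is a minimal illustration, not how any of the SFUs above implement it: it assumes the SFU keeps a smoothed audio level per publisher, and `PublisherLevel`, `selectForwardedStreams`, `maxSpeakers`, and `silenceThreshold` are hypothetical names and defaults.

```typescript
// Hypothetical shape: one entry per publisher in a room.
interface PublisherLevel {
  userId: string;
  muted: boolean;
  // Smoothed audio level in [0, 1]; how it is measured is an assumption.
  level: number;
}

// Pick which publishers' streams to forward to one subscriber:
// skip the subscriber's own stream, skip muted or silent publishers,
// and keep only the loudest `maxSpeakers`.
function selectForwardedStreams(
  publishers: PublisherLevel[],
  subscriberId: string,
  maxSpeakers = 3,
  silenceThreshold = 0.05,
): string[] {
  return publishers
    .filter((p) => p.userId !== subscriberId && !p.muted && p.level >= silenceThreshold)
    .sort((a, b) => b.level - a.level) // loudest first; filter() already copied the array
    .slice(0, maxSpeakers)
    .map((p) => p.userId);
}
```

Capping the forwarded set is what keeps per-subscriber bandwidth roughly flat as rooms grow.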
Signaling
Client A connects WebSocket → server
Server: "you're in room-42 with users B, C"
Client A creates an RTCPeerConnection to the SFU (not one per peer)
Exchange SDP offers/answers via WebSocket
ICE candidates flow
Audio streams established
Signaling is just WebSocket message exchange. Handle:
- Join / leave room.
- SDP exchange.
- ICE candidates.
- Mute / unmute.
- Speaker promotion (in stage-style rooms).
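One way to keep that list honest is a tagged message type plus a parser. A minimal sketch, assuming a JSON wire format with a `type` field; the message shapes and the names `SignalingMessage`, `parseSignalingMessage`, and `describeMessage` are made up for illustration.

```typescript
// Hypothetical wire format: every message is JSON with a `type` tag.
type SignalingMessage =
  | { type: "join"; roomId: string; token: string }
  | { type: "leave"; roomId: string }
  | { type: "sdp"; description: { type: "offer" | "answer"; sdp: string } }
  | { type: "ice"; candidate: string }
  | { type: "mute"; muted: boolean }
  | { type: "promote"; userId: string };

const KNOWN_TYPES = ["join", "leave", "sdp", "ice", "mute", "promote"];

// Parse a raw WebSocket frame; returns null for anything malformed.
// Only the `type` tag is checked here; a real server would validate every field.
function parseSignalingMessage(raw: string): SignalingMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const tag = (data as { type?: unknown }).type;
  if (typeof tag !== "string" || KNOWN_TYPES.indexOf(tag) === -1) return null;
  return data as SignalingMessage;
}

// Exhaustive dispatch: the compiler flags any message type left unhandled.
function describeMessage(msg: SignalingMessage): string {
  switch (msg.type) {
    case "join": return `join ${msg.roomId}`;
    case "leave": return `leave ${msg.roomId}`;
    case "sdp": return `sdp ${msg.description.type}`;
    case "ice": return "ice candidate";
    case "mute": return msg.muted ? "mute" : "unmute";
    case "promote": return `promote ${msg.userId}`;
  }
}
```

In a real handler each `case` would call into room state or relay to the SFU instead of returning a string.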
Room state
Per-room ephemeral state:
- Connected users.
- Mute states.
- Active speakers.
- Permissions (who can talk vs listen).
Lives in memory or Redis. In-memory state is lost on server restart; clients reconnect and the room is rebuilt.
For Cloudflare Durable Objects, one DO per room maps cleanly.
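A minimal in-memory sketch of that per-room state, independent of where it is hosted. The class and method names (`RoomState`, `setMuted`, `promote`) and the join-muted rule are assumptions made for illustration, not taken from any framework.

```typescript
type Role = "speaker" | "listener";

interface Participant {
  userId: string;
  role: Role;
  muted: boolean;
}

// Ephemeral per-room state, kept in memory.
class RoomState {
  private participants = new Map<string, Participant>();

  join(userId: string, role: Role): void {
    // Assumption: everyone joins muted; listeners cannot unmute until promoted.
    this.participants.set(userId, { userId, role, muted: true });
  }

  leave(userId: string): void {
    this.participants.delete(userId);
  }

  // Returns false if the user is absent or lacks permission to talk.
  setMuted(userId: string, muted: boolean): boolean {
    const p = this.participants.get(userId);
    if (!p) return false;
    if (!muted && p.role !== "speaker") return false;
    p.muted = muted;
    return true;
  }

  // Speaker promotion for stage-style rooms.
  promote(userId: string): boolean {
    const p = this.participants.get(userId);
    if (!p) return false;
    p.role = "speaker";
    return true;
  }

  // Users who are allowed to talk and currently unmuted.
  unmutedSpeakers(): string[] {
    const out: string[] = [];
    this.participants.forEach((p) => {
      if (p.role === "speaker" && !p.muted) out.push(p.userId);
    });
    return out;
  }

  get size(): number {
    return this.participants.size;
  }
}
```

Because all mutations go through one object, the same shape fits a single-threaded actor or one Durable Object per room.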
Presence
Voice chat needs presence (“who’s online”, “who’s speaking right now”). Same patterns as Design WhatsApp / Chat:
- WebSocket-driven.
- Coarse online/offline + fine speaking-or-not.
- Awareness API (Yjs-style) for cursor / state synchronization.
Latency
Budget end-to-end (mic → speaker):
- Encoding: ~20ms (Opus framing).
- Mic → server: 30–100ms (depends on geography).
- SFU forwarding: <10ms.
- Server → peer: 30–100ms.
- Decoding + jitter buffer: 30–80ms.
- Total: roughly 110–310ms when the ranges above are summed.
For a Discord-quality experience, < 200ms is the target. Geographic SFU placement is crucial — a user in Bangalore on a US-East SFU pays 250ms RTT just on physics.
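The budget is easy to check by summing the stage ranges. A small sketch using the figures listed above; treating "<10ms" as 0–10ms and encoding as a fixed 20ms are my assumptions, and the type and function names are illustrative.

```typescript
interface LatencyStage {
  stage: string;
  minMs: number;
  maxMs: number;
}

// Stage ranges copied from the budget list.
const budget: LatencyStage[] = [
  { stage: "encoding (Opus frame)", minMs: 20, maxMs: 20 },
  { stage: "mic -> server", minMs: 30, maxMs: 100 },
  { stage: "SFU forwarding", minMs: 0, maxMs: 10 },
  { stage: "server -> peer", minMs: 30, maxMs: 100 },
  { stage: "decode + jitter buffer", minMs: 30, maxMs: 80 },
];

// Sum best-case and worst-case latency across all stages.
function totalLatency(stages: LatencyStage[]): { minMs: number; maxMs: number } {
  return stages.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

const total = totalLatency(budget); // { minMs: 110, maxMs: 310 }

// How far the worst case overshoots the 200ms target.
const TARGET_MS = 200;
const worstCaseOverTarget = total.maxMs - TARGET_MS; // 110
```

The two network legs account for 200ms of the 310ms worst case, which is why SFU placement matters more than any server-side tuning.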
Multi-region
User in IN → IN-region SFU
User in US → US-region SFU
SFUs forward to each other for cross-region rooms.
This trades an extra hop for global reach. Cross-SFU forwarding adds roughly 100ms between distant regions, which is acceptable for multi-region rooms.
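Region assignment can be as simple as "lowest measured RTT wins". A sketch under that assumption; the probe mechanism, the region names, and the helper `crossRegionOneWayMs` are hypothetical, and the estimate covers network legs only (no encoding or jitter buffer).

```typescript
interface RegionProbe {
  region: string;
  rttMs: number; // round-trip time measured by the client, e.g. via a ping endpoint
}

// Pick the SFU region with the lowest measured RTT; null if nothing was probed.
function pickRegion(probes: RegionProbe[]): string | null {
  if (probes.length === 0) return null;
  let best = probes[0];
  for (const p of probes) {
    if (p.rttMs < best.rttMs) best = p;
  }
  return best.region;
}

// Rough one-way network latency between two users on different SFUs:
// half of each user's RTT to their home SFU, plus the inter-SFU hop.
function crossRegionOneWayMs(aRttMs: number, bRttMs: number, interSfuMs: number): number {
  return aRttMs / 2 + interSfuMs + bRttMs / 2;
}
```

For example, users with 40ms and 60ms RTT to their home SFUs and a 100ms inter-SFU hop see about 150ms of one-way network latency.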
Capacity
For 1M concurrent voice users in 100k rooms:
- Avg 10 users / room.
- Each user publishes 1 stream; receives ~10.
- SFU CPU: ~10k connections per node = 100 nodes.
- Bandwidth: ~50 kbps per stream × 1M users × ~10 peer streams = ~500 Gbps total egress.
Real numbers; serious infrastructure. Most products use managed SFU providers (LiveKit Cloud, Daily.co, Agora, Twilio Video) until volume justifies self-hosting.
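The arithmetic above, as a reusable back-of-envelope function. The inputs are the figures from the list; the names `CapacityInput` and `estimateCapacity` are illustrative, and the 10k connections per node is the post's assumption rather than a measured limit.

```typescript
interface CapacityInput {
  concurrentUsers: number;
  rooms: number;
  connectionsPerNode: number; // assumed SFU capacity per node
  kbpsPerStream: number;
  streamsReceivedPerUser: number;
}

interface CapacityEstimate {
  usersPerRoom: number;
  sfuNodes: number;
  egressGbps: number;
}

function estimateCapacity(input: CapacityInput): CapacityEstimate {
  const usersPerRoom = input.concurrentUsers / input.rooms;
  const sfuNodes = Math.ceil(input.concurrentUsers / input.connectionsPerNode);
  // kbps -> Gbps is a factor of 1e6.
  const egressGbps =
    (input.kbpsPerStream * input.concurrentUsers * input.streamsReceivedPerUser) / 1e6;
  return { usersPerRoom, sfuNodes, egressGbps };
}

const estimate = estimateCapacity({
  concurrentUsers: 1e6,
  rooms: 1e5,
  connectionsPerNode: 1e4,
  kbpsPerStream: 50,
  streamsReceivedPerUser: 10,
});
// estimate: { usersPerRoom: 10, sfuNodes: 100, egressGbps: 500 }
```

This is the worst case where every stream is forwarded to everyone. If selective forwarding caps each listener at 3 active speakers, the same formula gives 150 Gbps.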
Recording
A separate “egress” service joins as a hidden user, receives all streams, mixes, writes to S3.
For compliance / legal: consent banners, recording indicators, retention.
Codecs
Opus is the standard. 16–64 kbps for voice. Adaptive (bitrate ramps with available bandwidth). All major browsers support it.
Security
- TLS for signaling.
- DTLS-SRTP for media (WebRTC default).
- Room tokens (JWT) authorize join.
- Optional E2EE (frame-level encryption; not supported by all SFU implementations).
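The room-token check can be sketched as a pure function over already-verified claims. This assumes signature verification has been done by a JWT library beforehand. `sub` and `exp` are standard JWT claims; `room` and `canPublish` are hypothetical claim names chosen for this example.

```typescript
// Claims carried by a room token, after the signature has been verified.
interface RoomTokenClaims {
  sub: string; // user id
  exp: number; // expiry, seconds since the Unix epoch
  room: string; // hypothetical claim: the room this token is valid for
  canPublish: boolean; // hypothetical claim: may this user talk?
}

type JoinDecision =
  | { ok: true; userId: string; canPublish: boolean }
  | { ok: false; reason: string };

// Decide whether a join request is allowed for the requested room.
function authorizeJoin(
  claims: RoomTokenClaims,
  requestedRoom: string,
  nowSeconds: number,
): JoinDecision {
  if (claims.exp <= nowSeconds) {
    return { ok: false, reason: "token expired" };
  }
  if (claims.room !== requestedRoom) {
    return { ok: false, reason: "token is for a different room" };
  }
  return { ok: true, userId: claims.sub, canPublish: claims.canPublish };
}
```

Scoping the token to a single room means a leaked token cannot be replayed against other rooms, and a short `exp` limits how long it is useful at all.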
Common operational pitfalls
- STUN / TURN servers for NAT traversal. Without TURN, ~10% of connections fail behind strict NATs. Pay for a TURN service or run your own (coturn).
- Bandwidth costs: voice egress is a real line item at scale. Budget for it, and place SFUs close to users, CDN-style.
- Echo cancellation: handled by browser WebRTC for browser clients. Native apps must ship a good AEC.
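For the STUN / TURN point, the client needs a list of ICE servers. A minimal sketch that builds one, assuming a single host serves both STUN and TURN (as coturn can) on the default ports. The host name, the credentials, and the names `IceServerEntry` and `buildIceServers` are placeholders; the object shape mirrors the `urls` / `username` / `credential` fields of WebRTC's ICE server config.

```typescript
// Same field names as WebRTC's ICE server entries, declared locally
// so the sketch has no browser dependency.
interface IceServerEntry {
  urls: string[];
  username?: string;
  credential?: string;
}

// Build STUN + TURN entries for one host (hypothetical; supply your own).
function buildIceServers(
  host: string,
  username: string,
  credential: string,
): IceServerEntry[] {
  return [
    // STUN: lets the client discover its public address. No credentials needed.
    { urls: [`stun:${host}:3478`] },
    // TURN: relays media when direct connectivity fails.
    // UDP first, with TLS over TCP as a fallback for restrictive networks.
    {
      urls: [`turn:${host}:3478?transport=udp`, `turns:${host}:5349?transport=tcp`],
      username,
      credential,
    },
  ];
}

// In a browser: new RTCPeerConnection({ iceServers: buildIceServers(...) })
```

TURN credentials should be short-lived and issued by your signaling server rather than baked into the client.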
What I’d build today
For a small voice product:
- LiveKit Cloud (or self-hosted LiveKit) as SFU.
- WebSocket signaling wired into your existing auth.
- One Durable Object or per-room actor for room state.
- Postgres for persistent rooms / membership.
- Twilio / Cloudflare TURN for NAT traversal.
Scales to 10k concurrent users on minimal infra.
Read this next
- Voice Agents and Realtime LLM APIs
- Design WhatsApp / Chat
- SSE vs WebSockets in 2026
- Distributed Systems Fundamentals
If you want a LiveKit + Hono signaling server reference, it’s at rajpoot.dev.