Voice chat is harder than text chat. Latency budgets are tight, quality requirements are high, and users notice a delay of around 200ms. The architecture is specialized as a result; this post walks through the design.

Architecture

Clients (browsers / native apps with WebRTC)
Signaling server (WebSocket — exchange SDP / ICE)
SFU (Selective Forwarding Unit) — receives + forwards audio streams
Other clients in the room

WebRTC handles audio encoding, transport, jitter, NAT traversal. Your servers handle signaling and forwarding.

SFU

Each user publishes one audio stream. SFU forwards to N other users in the room.

  • CPU-cheap: forward, don’t mix.
  • Selective: only send streams the consumer is actually listening to (audio-active speakers).
  • Adaptive: lower bitrate to consumers on bad networks.
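
The "selective" part can be sketched as a small pure function. This is a minimal illustration, not how any particular SFU does it: the `StreamLevel` shape, the `maxStreams` cap of 3, and the silence threshold are all assumptions made up for the example.

```typescript
// Hypothetical shape: one entry per published stream,
// with a recent audio level normalized to [0, 1].
interface StreamLevel {
  userId: string;
  level: number;
}

// Pick the streams worth forwarding to one consumer: loudest first,
// skipping the consumer's own stream and anything below the silence threshold.
function selectStreams(
  levels: StreamLevel[],
  consumerId: string,
  maxStreams: number = 3,
  silenceThreshold: number = 0.05,
): string[] {
  return levels
    .filter((s) => s.userId !== consumerId && s.level >= silenceThreshold)
    .sort((a, b) => b.level - a.level) // filter() returned a copy, so the input is not mutated
    .slice(0, maxStreams)
    .map((s) => s.userId);
}
```

In a 10-person room where two people are talking, each listener would receive two streams rather than nine.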

Open-source SFUs: LiveKit, mediasoup, ion-sfu, Janus.

Signaling

Client A connects WebSocket → server
Server: "you're in room-42 with users B, C"
Client A creates an RTCPeerConnection to the SFU (not one per peer)
Exchange SDP offers/answers via WebSocket
ICE candidates flow
Audio streams established

Signaling is just WebSocket message exchange. Handle:

  • Join / leave room.
  • SDP exchange.
  • ICE candidates.
  • Mute / unmute.
  • Speaker promotion (in stage-style rooms).
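
One way to keep that list honest is to model the messages as a discriminated union. The message names and fields below are assumptions for illustration, not a standard wire format, and `parseSignal` only checks the `type` tag; a real server would validate every field.

```typescript
// Hypothetical wire format: one JSON object per WebSocket message.
type SignalMessage =
  | { type: "join"; roomId: string; userId: string }
  | { type: "leave"; roomId: string; userId: string }
  | { type: "sdp"; userId: string; kind: "offer" | "answer"; sdp: string }
  | { type: "ice"; userId: string; candidate: string }
  | { type: "mute"; userId: string; muted: boolean }
  | { type: "promote"; userId: string; targetId: string };

const KNOWN_TYPES: string[] = ["join", "leave", "sdp", "ice", "mute", "promote"];

// Parse a raw WebSocket payload. Returns null for invalid JSON or an unknown type.
// Only the type tag is checked here; the remaining fields are trusted.
function parseSignal(raw: string): SignalMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const type = (data as { type?: unknown }).type;
  if (typeof type !== "string" || KNOWN_TYPES.indexOf(type) === -1) return null;
  return data as SignalMessage;
}

// The switch is exhaustive: adding a new message type without handling it
// here becomes a compile error.
function describeSignal(msg: SignalMessage): string {
  switch (msg.type) {
    case "join":
      return `${msg.userId} joins ${msg.roomId}`;
    case "leave":
      return `${msg.userId} leaves ${msg.roomId}`;
    case "sdp":
      return `${msg.userId} sends SDP ${msg.kind}`;
    case "ice":
      return `${msg.userId} sends ICE candidate`;
    case "mute":
      return `${msg.userId} ${msg.muted ? "muted" : "unmuted"}`;
    case "promote":
      return `${msg.userId} promotes ${msg.targetId}`;
  }
}
```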

Room state

Per-room ephemeral state:

  • Connected users.
  • Mute states.
  • Active speakers.
  • Permissions (who can talk vs listen).

This lives in memory or in Redis. In-memory state is lost on server restart; clients reconnect and rejoin.
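
A minimal in-memory sketch of that state follows. The `Room` class, its method names, and the rule that everyone joins muted are assumptions for the example; the same shape would fit a per-room actor or a Redis hash.

```typescript
type Role = "speaker" | "listener";

interface Member {
  muted: boolean;
  role: Role;
}

// Ephemeral per-room state: nothing here is persisted.
class Room {
  private members = new Map<string, Member>();

  // Assumption for this sketch: everyone joins muted.
  join(userId: string, role: Role = "listener"): void {
    this.members.set(userId, { muted: true, role });
  }

  leave(userId: string): void {
    this.members.delete(userId);
  }

  // Stage-style promotion: a listener becomes a speaker.
  promote(userId: string): void {
    const member = this.members.get(userId);
    if (member) member.role = "speaker";
  }

  // Returns false if the user is unknown or lacks permission to unmute.
  setMuted(userId: string, muted: boolean): boolean {
    const member = this.members.get(userId);
    if (!member) return false;
    if (!muted && member.role !== "speaker") return false;
    member.muted = muted;
    return true;
  }

  snapshot(): Array<{ userId: string; muted: boolean; role: Role }> {
    return Array.from(this.members, ([userId, m]) => ({ userId, ...m }));
  }
}
```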

On Cloudflare Durable Objects, one DO per room maps cleanly.

Presence

Voice chat needs presence (“who’s online”, “who’s speaking right now”). Same patterns as Design WhatsApp / Chat:

  • WebSocket-driven.
  • Coarse online/offline + fine speaking-or-not.
  • Awareness API (Yjs-style) for cursor / state synchronization.
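
The fine-grained "speaking or not" signal tends to flicker if it follows raw audio levels, so a little hold time helps. Here is a rough sketch; the class name, the 0.1 threshold, and the 300ms hold are assumptions picked for illustration.

```typescript
// Turns a stream of audio levels into speaking / not-speaking transitions.
// A speaker is only marked quiet after `holdMs` without a loud sample,
// so short pauses between words do not generate presence events.
class SpeakingDetector {
  private speaking = false;
  private lastLoudAt = -Infinity;
  private threshold: number;
  private holdMs: number;

  constructor(threshold: number = 0.1, holdMs: number = 300) {
    this.threshold = threshold;
    this.holdMs = holdMs;
  }

  // Returns true or false when the state changes, null when it does not.
  // Only non-null results need to be broadcast over the WebSocket.
  update(level: number, nowMs: number): boolean | null {
    if (level >= this.threshold) {
      this.lastLoudAt = nowMs;
      if (!this.speaking) {
        this.speaking = true;
        return true;
      }
      return null;
    }
    if (this.speaking && nowMs - this.lastLoudAt >= this.holdMs) {
      this.speaking = false;
      return false;
    }
    return null;
  }
}
```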

Latency

Budget end-to-end (mic → speaker):

  • Encoding: ~20ms (Opus framing).
  • Mic → server: 30–100ms (depends on geography).
  • SFU forwarding: <10ms.
  • Server → peer: 30–100ms.
  • Decoding + jitter buffer: 30–80ms.
  • Total: roughly 110–310ms when you sum the ranges above.

For a Discord-quality experience, < 200ms is the target. Geographic SFU placement is crucial — a user in Bangalore on a US-East SFU pays 250ms RTT just on physics.
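
To see why placement dominates, it helps to sum the stages. The sketch below uses the ranges from the list above, treats "<10ms" SFU forwarding as 0 to 10ms (an assumption), and works out how much of a 200ms target is left for the two network legs once the worst-case fixed costs are paid.

```typescript
interface Stage {
  name: string;
  minMs: number;
  maxMs: number;
  network: boolean; // true for the legs that depend on geography
}

const stages: Stage[] = [
  { name: "encoding", minMs: 20, maxMs: 20, network: false },
  { name: "mic -> server", minMs: 30, maxMs: 100, network: true },
  { name: "SFU forwarding", minMs: 0, maxMs: 10, network: false }, // "<10ms" read as 0-10
  { name: "server -> peer", minMs: 30, maxMs: 100, network: true },
  { name: "decode + jitter buffer", minMs: 30, maxMs: 80, network: false },
];

// Sum the low and high end of every stage.
function totalBudget(all: Stage[]): { minMs: number; maxMs: number } {
  return all.reduce(
    (acc, s) => ({ minMs: acc.minMs + s.minMs, maxMs: acc.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

// Milliseconds left for all network legs combined, after paying the
// worst-case cost of every non-network stage.
function networkHeadroomMs(all: Stage[], targetMs: number): number {
  const fixedWorstCase = all
    .filter((s) => !s.network)
    .reduce((sum, s) => sum + s.maxMs, 0);
  return targetMs - fixedWorstCase;
}

const budget = totalBudget(stages);
const headroom = networkHeadroomMs(stages, 200);
```

With these inputs the envelope is 110 to 310ms, and a 200ms target leaves 90ms for both network legs combined, about 45ms each way. That is only achievable when the SFU is close to the user.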

Multi-region

User in IN → IN-region SFU
User in US → US-region SFU
SFUs forward to each other for cross-region rooms.

This trades an extra hop for global reach. Cross-SFU forwarding adds roughly 100ms, which is acceptable for multi-region rooms.
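
Region assignment can be as simple as probing each region and taking the lowest RTT. This is a sketch under that assumption; the region names are placeholders, and how the RTTs get measured is left out.

```typescript
// Hypothetical probe result: RTT from the client to one SFU region.
interface RegionProbe {
  region: string;
  rttMs: number;
}

// Pick the region with the lowest measured RTT, or null if nothing was probed.
function nearestRegion(probes: RegionProbe[]): string | null {
  let best: RegionProbe | null = null;
  for (const probe of probes) {
    if (best === null || probe.rttMs < best.rttMs) best = probe;
  }
  return best === null ? null : best.region;
}

// A room needs cross-SFU forwarding only when its members
// landed on more than one region.
function needsRelay(memberRegions: string[]): boolean {
  return new Set(memberRegions).size > 1;
}
```

Rooms whose members all share a region never pay for the extra hop.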

Capacity

For 1M concurrent voice users in 100k rooms:

  • Avg 10 users / room.
  • Each user publishes 1 stream; receives ~10.
  • SFU CPU: ~10k connections per node = 100 nodes.
  • Bandwidth: ~50 kbps per stream × 1M users × ~10 peer streams = ~500 Gbps total egress.
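
The arithmetic above, written out so the inputs are easy to change. The field names are made up for this sketch; the values are the ones from the list.

```typescript
interface CapacityInput {
  concurrentUsers: number;
  usersPerRoom: number;
  connectionsPerNode: number;
  kbpsPerStream: number;
  streamsReceivedPerUser: number;
}

interface CapacityEstimate {
  rooms: number;
  sfuNodes: number;
  egressGbps: number;
}

function estimateCapacity(input: CapacityInput): CapacityEstimate {
  const rooms = input.concurrentUsers / input.usersPerRoom;
  const sfuNodes = Math.ceil(input.concurrentUsers / input.connectionsPerNode);
  // Every user receives N streams at the given bitrate; 1 Gbps = 1,000,000 kbps.
  const egressKbps =
    input.kbpsPerStream * input.concurrentUsers * input.streamsReceivedPerUser;
  return { rooms, sfuNodes, egressGbps: egressKbps / 1_000_000 };
}

const estimate = estimateCapacity({
  concurrentUsers: 1_000_000,
  usersPerRoom: 10,
  connectionsPerNode: 10_000,
  kbpsPerStream: 50,
  streamsReceivedPerUser: 10,
});
```

This is the worst case where every stream is forwarded to every listener. Forwarding only active speakers lowers the egress figure considerably.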

Real numbers; serious infrastructure. Most products use managed SFU providers (LiveKit Cloud, Daily.co, Agora, Twilio Video) until volume justifies self-hosting.

Recording

A separate “egress” service joins as a hidden user, receives all streams, mixes, writes to S3.
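
The "mixes" step is conceptually just sample-wise addition with clipping. A minimal sketch, assuming the streams have already been decoded to 16-bit PCM at the same sample rate and aligned in time (a real egress service also has to handle all of that):

```typescript
// Mix several decoded PCM frames into one by summing samples.
// Sums are clamped to the int16 range so loud overlap clips instead of wrapping.
function mixFrames(frames: Int16Array[]): Int16Array {
  if (frames.length === 0) return new Int16Array(0);
  const length = Math.max(...frames.map((f) => f.length));
  const out = new Int16Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (const frame of frames) {
      sum += frame[i] ?? 0; // shorter frames contribute silence past their end
    }
    out[i] = Math.max(-32768, Math.min(32767, sum));
  }
  return out;
}
```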

For compliance / legal: consent banners, recording indicators, retention.

Codecs

Opus is the standard. 16–64 kbps for voice. Adaptive (bitrate ramps with available bandwidth). All major browsers support it.

Security

  • TLS for signaling.
  • DTLS-SRTP for media (WebRTC default).
  • Room tokens (JWT) authorize join.
  • Optional E2EE (frame-level encryption; not supported by every SFU implementation).
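
For the room-token check, the join decision might look like the sketch below. It assumes the JWT signature has already been verified by a JWT library and only inspects the decoded claims. `sub` and `exp` are standard JWT claims; `room` and `canPublish` are hypothetical names chosen for this example.

```typescript
// Decoded, signature-verified token payload.
interface RoomTokenClaims {
  sub: string; // user id (standard JWT claim)
  exp: number; // expiry, seconds since epoch (standard JWT claim)
  room: string; // hypothetical: the room this token is valid for
  canPublish: boolean; // hypothetical: talk vs listen-only
}

type JoinDecision =
  | { ok: true; userId: string; role: "speaker" | "listener" }
  | { ok: false; reason: string };

function authorizeJoin(
  claims: RoomTokenClaims,
  roomId: string,
  nowSeconds: number,
): JoinDecision {
  if (claims.exp <= nowSeconds) return { ok: false, reason: "token expired" };
  if (claims.room !== roomId) return { ok: false, reason: "token is for another room" };
  return {
    ok: true,
    userId: claims.sub,
    role: claims.canPublish ? "speaker" : "listener",
  };
}
```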

Common operational pitfalls

  • STUN / TURN servers for NAT traversal. Without TURN, ~10% of connections fail behind strict NATs. Pay for a TURN service or run your own (coturn).
  • Bandwidth costs: voice egress is a real line item at scale. Budget for it the way you would for CDN-style delivery.
  • Echo cancellation: handled by browser WebRTC for browser clients. Native apps must ship a good AEC.
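
On the client, the first and last points mostly come down to configuration. The objects below mirror the browser shapes that `new RTCPeerConnection(config)` and `navigator.mediaDevices.getUserMedia(constraints)` accept. The hostnames and credentials are placeholders, and the local `IceServer` interface exists only so the sketch type-checks outside a browser.

```typescript
// Local stand-in for the browser's RTCIceServer type.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

// STUN discovers the public address; TURN relays media when a direct path fails.
// Hostnames and credentials are placeholders, not real servers.
const rtcConfig: { iceServers: IceServer[] } = {
  iceServers: [
    { urls: "stun:stun.example.com:3478" },
    {
      urls: [
        "turn:turn.example.com:3478?transport=udp",
        "turns:turn.example.com:5349?transport=tcp", // TLS fallback for strict firewalls
      ],
      username: "short-lived-username",
      credential: "short-lived-credential",
    },
  ],
};

// Ask the browser for its built-in audio processing.
const audioConstraints = {
  audio: {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  },
  video: false,
};
```

TURN credentials are usually issued per session by your backend rather than hard-coded in the client.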

What I’d build today

For a small voice product:

  • LiveKit Cloud (or self-hosted LiveKit) as SFU.
  • WebSocket signaling wired into your existing auth.
  • One Durable Object or per-room actor for room state.
  • Postgres for persistent rooms / membership.
  • Twilio / Cloudflare TURN for NAT traversal.

Scales to 10k concurrent users on minimal infra.

Read this next

If you want a LiveKit + Hono signaling server reference, it’s at rajpoot.dev.

