An SFU (Selective Forwarding Unit) is the modern default: the server forwards each audio stream without mixing, which costs less CPU and avoids a decode and re-encode step that degrades quality. An MCU (Multipoint Control Unit) mixes streams server-side; it is rare in 2026 outside niche use cases.

Why is voice latency so unforgiving?

Humans notice voice delay above 150ms, and conversation feels broken above 300ms. The whole end-to-end path (mic → server → all peers' speakers) must fit inside that window. WebRTC with a nearby SFU usually delivers 80–200ms.

Design a Voice Chat System Like Discord — System Design Walkthrough

Voice chat is harder than text: latency budgets are tight, quality requirements are high, and users notice a 200ms delay. The architecture is specialized, and this post walks through the design.

Architecture

Clients (browsers / native apps with WebRTC)
    ↓
Signaling server (WebSocket — exchange SDP / ICE)
    ↓
SFU (Selective Forwarding Unit) — receives + forwards audio streams
    ↓
Other clients in the room

WebRTC handles audio encoding, transport, jitter buffering, and NAT traversal. Your servers handle signaling and forwarding.

SFU

Each user publishes one audio stream. SFU forwards to N other users in the room.

CPU-cheap: forward, don’t mix.
Selective: only send streams the consumer is actually listening to (audio-active speakers).
Adaptive: lower bitrate to consumers on bad networks.
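
The "selective" part can be sketched as picking the loudest few publishers for each consumer. This is a minimal sketch, assuming the SFU already tracks a smoothed audio level per publisher; `select_streams`, `max_speakers`, `silence_floor`, and the 0.0 to 1.0 level scale are hypothetical names and values, not the API of any particular SFU.

```python
from typing import Dict, List


def select_streams(
    levels: Dict[str, float],
    consumer_id: str,
    max_speakers: int = 3,
    silence_floor: float = 0.05,
) -> List[str]:
    """Return publisher ids whose audio should be forwarded to one consumer.

    levels maps publisher id -> smoothed audio level in [0.0, 1.0] (assumed scale).
    """
    candidates = [
        (level, user_id)
        for user_id, level in levels.items()
        # Never echo a user's own stream back, and skip near-silent publishers.
        if user_id != consumer_id and level >= silence_floor
    ]
    # Loudest first; ties broken by id so the result is deterministic.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [user_id for _, user_id in candidates[:max_speakers]]
```

Capping the forwarded set means a consumer's downlink stays roughly constant no matter how many people are in the room.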

Open-source SFUs: LiveKit, mediasoup, ion-sfu, Janus.

Signaling

Client A connects WebSocket → server
Server: "you're in room-42 with users B, C"
Client A creates an RTCPeerConnection to the SFU (one connection carries the room's audio, not one per peer)
Exchange SDP offers/answers via WebSocket
ICE candidates flow
Audio streams established

Signaling is just WebSocket message exchange. Handle:

Join / leave room.
SDP exchange.
ICE candidates.
Mute / unmute.
Speaker promotion (in stage-style rooms).
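
The message types above can be sketched as a small router that decides who receives what. This is a minimal sketch with the WebSocket transport, authentication, and speaker promotion left out; the message schema (`type`, `room`, `from`), the `SignalingRouter` class, and the `SFU_ID` constant are assumptions made for illustration.

```python
import json
from typing import Dict, List, Set, Tuple

Outbound = Tuple[str, dict]  # (recipient id, message to send)
SFU_ID = "sfu"  # hypothetical id for the media server's signaling endpoint


class SignalingRouter:
    """Decides who receives what; the WebSocket transport is left out."""

    def __init__(self) -> None:
        self.rooms: Dict[str, Set[str]] = {}

    def handle(self, sender: str, raw: str) -> List[Outbound]:
        msg = json.loads(raw)
        kind, room = msg.get("type"), msg.get("room")
        if not isinstance(room, str):
            return [(sender, {"type": "error", "reason": "missing room"})]
        members = self.rooms.get(room, set())
        others = sorted(members - {sender})

        if kind == "join":
            self.rooms.setdefault(room, set()).add(sender)
            joined = {"type": "joined", "room": room, "peers": others}
            return [(sender, joined)] + [
                (peer, {"type": "peer-joined", "user": sender}) for peer in others
            ]
        if kind == "leave":
            members.discard(sender)
            if not members:
                self.rooms.pop(room, None)  # room state is ephemeral
            return [(peer, {"type": "peer-left", "user": sender}) for peer in others]
        if kind in ("offer", "answer", "ice"):
            # With an SFU, SDP and ICE are negotiated with the server.
            return [(SFU_ID, {**msg, "from": sender})]
        if kind in ("mute", "unmute"):
            return [(peer, {"type": kind, "user": sender}) for peer in others]
        return [(sender, {"type": "error", "reason": f"unknown type {kind!r}"})]
```

Returning a list of outbound messages instead of writing to sockets keeps the routing logic testable without a network.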

Room state

Per-room ephemeral state:

Connected users.
Mute states.
Active speakers.
Permissions (who can talk vs listen).

Lives in memory or Redis. Lost on server restart; clients reconnect.
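
The per-room state above fits in one small in-memory object. This is a minimal sketch, assuming a single process owns each room; the `RoomState` class and its method names are hypothetical, and persistence (Redis or otherwise) is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class RoomState:
    room_id: str
    muted: Dict[str, bool] = field(default_factory=dict)  # user id -> muted?
    can_speak: Set[str] = field(default_factory=set)      # allowed to publish
    active_speakers: Set[str] = field(default_factory=set)

    @property
    def connected(self) -> Set[str]:
        return set(self.muted)

    def join(self, user_id: str, speaker: bool = True) -> None:
        self.muted[user_id] = not speaker  # listeners start muted
        if speaker:
            self.can_speak.add(user_id)

    def leave(self, user_id: str) -> None:
        self.muted.pop(user_id, None)
        self.can_speak.discard(user_id)
        self.active_speakers.discard(user_id)

    def set_muted(self, user_id: str, muted: bool) -> None:
        if user_id not in self.muted:
            raise KeyError(f"{user_id} is not in {self.room_id}")
        if not muted and user_id not in self.can_speak:
            raise PermissionError(f"{user_id} may only listen")
        self.muted[user_id] = muted
        if muted:
            self.active_speakers.discard(user_id)

    def mark_speaking(self, user_id: str, speaking: bool) -> None:
        # A muted or unknown user can never show up as an active speaker.
        if speaking and not self.muted.get(user_id, True):
            self.active_speakers.add(user_id)
        else:
            self.active_speakers.discard(user_id)
```

Because nothing here survives a restart, reconnecting clients simply call `join` again and the state rebuilds itself.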

For Cloudflare Durable Objects, one DO per room maps cleanly.

Presence

Voice chat needs presence (“who’s online”, “who’s speaking right now”). Same patterns as Design WhatsApp / Chat:

WebSocket-driven.
Coarse online/offline + fine speaking-or-not.
Awareness API (Yjs-style) for cursor / state synchronization.
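
The fine-grained "speaking or not" signal is noisy if it follows raw audio levels, so a short hold time helps. This is a minimal sketch of one way to debounce it; the `SpeakingDetector` class, the 0.1 threshold, and the 300ms hold are assumed values for illustration, not taken from any library.

```python
class SpeakingDetector:
    """Turns a noisy audio level into a stable speaking / not-speaking flag."""

    def __init__(self, threshold: float = 0.1, hold_ms: int = 300) -> None:
        self.threshold = threshold
        self.hold_ms = hold_ms
        self.speaking = False
        self._last_loud_ms = 0

    def update(self, level: float, now_ms: int) -> bool:
        if level >= self.threshold:
            self._last_loud_ms = now_ms
            self.speaking = True
        elif self.speaking and now_ms - self._last_loud_ms > self.hold_ms:
            # Only flip off after the level has stayed low for hold_ms.
            self.speaking = False
        return self.speaking
```

Broadcasting a presence update only when the returned flag changes keeps WebSocket traffic low even with many participants.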

Latency

Budget end-to-end (mic → speaker):

Encoding: ~20ms (Opus framing).
Mic → server: 30–100ms (depends on geography).
SFU forwarding: <10ms.
Server → peer: 30–100ms.
Decoding + jitter buffer: 30–80ms.
Total: 130–300ms typical.

For a Discord-quality experience, < 200ms is the target. Geographic SFU placement is crucial: a user in Bangalore on a US-East SFU pays roughly 250ms RTT from distance and routing alone.
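
The budget above can be checked with simple arithmetic. This is a small sketch using the per-stage figures from the list, with "<10ms" treated as a 0 to 10 range (an assumption); the stage names are just labels. Summed strictly, the stages give 110–310ms, which the article rounds to 130–300ms.

```python
# (low_ms, high_ms) per stage, copied from the budget above.
BUDGET_MS = {
    "encoding": (20, 20),
    "mic_to_server": (30, 100),
    "sfu_forwarding": (0, 10),  # "<10ms": treated as 0 to 10
    "server_to_peer": (30, 100),
    "decode_and_jitter": (30, 80),
}


def total_range(budget):
    """Sum the best-case and worst-case figures across all stages."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def fits(budget, target_ms):
    """True only if even the worst case stays within the target."""
    return total_range(budget)[1] <= target_ms
```

With these figures `fits(BUDGET_MS, 200)` is false: the two network legs dominate, which is why SFU placement matters more than any codec tuning.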

Multi-region

User in IN → IN-region SFU
User in US → US-region SFU
SFUs forward to each other for cross-region rooms.

Trades a hop for global reach. The cross-SFU forwarding adds roughly 100ms between distant regions, which is acceptable for multi-region rooms.
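
Region assignment can be sketched as "pick the SFU with the lowest measured RTT". This is a minimal sketch, assuming the client has already probed each region; the region names, `pick_region`, and `one_way_path_ms` are hypothetical, and the path estimate covers network delay only.

```python
from typing import Dict


def pick_region(rtt_ms: Dict[str, float]) -> str:
    """Return the region with the lowest measured round-trip time."""
    if not rtt_ms:
        raise ValueError("no RTT probes available")
    # Ties are broken by region name so the choice is deterministic.
    return min(rtt_ms, key=lambda region: (rtt_ms[region], region))


def one_way_path_ms(
    sender_rtt: float, receiver_rtt: float, inter_sfu_ms: float = 0.0
) -> float:
    """Approximate network-only delay: half of each client RTT plus the SFU-to-SFU hop."""
    return sender_rtt / 2 + inter_sfu_ms + receiver_rtt / 2
```

For example, a sender 40ms RTT from its SFU and a receiver 30ms RTT from theirs, joined by a 100ms inter-SFU hop, see about 135ms of network delay before encoding and jitter buffering are added.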

Capacity

For 1M concurrent voice users in 100k rooms:

Avg 10 users / room.
Each user publishes 1 stream; receives ~10 (one per other participant, rounded up).
SFU CPU: ~10k connections per node = 100 nodes.
Bandwidth: ~50 kbps per stream × 1M users × ~10 peer streams = ~500 Gbps total egress.
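
The estimate above, written out so the inputs are easy to change. This is a back-of-envelope sketch using only the figures from the list; the constant names are made up for illustration. It is an upper bound, since it assumes every stream is forwarded to every listener with no selective forwarding.

```python
CONCURRENT_USERS = 1_000_000
ROOMS = 100_000
CONNECTIONS_PER_NODE = 10_000
STREAM_KBPS = 50
STREAMS_RECEIVED_PER_USER = 10  # rounded; a 10-person room has 9 other publishers

users_per_room = CONCURRENT_USERS // ROOMS
nodes = -(-CONCURRENT_USERS // CONNECTIONS_PER_NODE)  # ceiling division

# kbps -> Gbps is a division by 1,000,000.
egress_gbps = STREAM_KBPS * CONCURRENT_USERS * STREAMS_RECEIVED_PER_USER / 1_000_000
egress_per_node_gbps = egress_gbps / nodes
```

The per-node figure works out to 5 Gbps of egress, so network capacity, not only CPU, constrains how many connections one node can carry.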

Real numbers; serious infrastructure. Most products use managed SFU providers (LiveKit Cloud, Daily.co, Agora, Twilio Video) until volume justifies self-hosting.

Recording

A separate “egress” service joins as a hidden user, receives all streams, mixes, writes to S3.
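
The mixing step can be sketched on raw samples. This is a minimal sketch, assuming the egress service has already decoded each Opus stream to time-aligned 16-bit PCM frames of equal length; `mix_frames` is a hypothetical helper, and decoding, resampling, and upload are left out.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(frames: Sequence[Sequence[int]]) -> List[int]:
    """Mix equal-length 16-bit PCM frames by summing samples and clipping."""
    if not frames:
        return []
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same number of samples")
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        # Clip to the int16 range so loud overlaps saturate instead of wrapping.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

Hard clipping is the simplest choice; a production mixer would more likely apply gain control before summing.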

For compliance / legal: consent banners, recording indicators, retention.

Codecs

Opus is the standard. 16–64 kbps for voice. Adaptive (bitrate ramps with available bandwidth). All major browsers support it.
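
Bitrate and frame size translate directly into packet size and packet rate. This is a small arithmetic sketch, assuming 20ms Opus frames and plain IPv4 + UDP + RTP headers with no extensions; the function names are hypothetical, and the SRTP authentication tag and any TURN overhead are ignored.

```python
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP, no extensions


def opus_payload_bytes(bitrate_kbps: float, frame_ms: float = 20.0) -> float:
    """Average encoded bytes per packet (kbps x ms = bits)."""
    return bitrate_kbps * frame_ms / 8


def packets_per_second(frame_ms: float = 20.0) -> float:
    return 1000 / frame_ms


def on_wire_kbps(bitrate_kbps: float, frame_ms: float = 20.0) -> float:
    """Approximate bitrate including IPv4/UDP/RTP headers (SRTP tag excluded)."""
    packet_bytes = opus_payload_bytes(bitrate_kbps, frame_ms) + IP_UDP_RTP_HEADER_BYTES
    return packet_bytes * 8 * packets_per_second(frame_ms) / 1000
```

At 32 kbps this gives 80-byte payloads at 50 packets per second and about 48 kbps on the wire, so headers are a large share of voice traffic at low bitrates.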

Security

TLS for signaling.
DTLS-SRTP for media (WebRTC default).
Room tokens (JWT) authorize join.
Optional E2EE (frame-level encryption; not supported by every SFU implementation).
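
Room tokens can be sketched with nothing but the standard library. This is a minimal HS256 JWT sketch to show the shape of the check; the `room` claim name, the function names, and the 300-second lifetime are assumptions, and a maintained JWT library is the safer choice in production.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_room_token(
    secret: bytes, user_id: str, room: str, ttl_s: int = 300, now: Optional[float] = None
) -> str:
    now = time.time() if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": user_id, "room": room, "exp": int(now) + ttl_s}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_room_token(
    secret: bytes, token: str, room: str, now: Optional[float] = None
) -> Optional[str]:
    """Return the user id if the token is valid for this room, else None."""
    now = time.time() if now is None else now
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        signed = f"{header_b64}.{claims_b64}".encode()
        expected = hmac.new(secret, signed, hashlib.sha256).digest()
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(claims_b64))
    except (ValueError, TypeError):
        return None
    if claims.get("room") != room or claims.get("exp", 0) < now:
        return None
    return claims.get("sub")
```

Binding the token to a single room and a short expiry limits what a leaked token can be used for.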

Common operational pitfalls

STUN / TURN servers for NAT traversal. Without TURN, ~10% of connections fail behind strict NATs. Pay for a TURN service or run your own (coturn).
Bandwidth costs: voice bandwidth is a real cost at scale. Think CDN-style delivery, with media served from a region close to each user.
Echo cancellation: handled by browser WebRTC for browser clients. Native apps must ship a good AEC.
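
If you run your own coturn, clients need TURN credentials that expire. This is a minimal sketch of the shared-secret scheme, assuming coturn is configured with `use-auth-secret` and a matching `static-auth-secret`; the `turn_credentials` function name and the one-hour lifetime are assumptions for illustration.

```python
import base64
import hashlib
import hmac
import time
from typing import Dict, Optional


def turn_credentials(
    shared_secret: str, user_id: str, ttl_s: int = 3600, now: Optional[float] = None
) -> Dict[str, str]:
    """Build time-limited TURN credentials from a secret shared with the TURN server."""
    now = time.time() if now is None else now
    # The username carries the expiry timestamp; the server rejects it once passed.
    username = f"{int(now) + ttl_s}:{user_id}"
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    return {"username": username, "credential": base64.b64encode(digest).decode("ascii")}
```

The signaling server can hand these out at join time, so the long-lived secret never reaches the client.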

What I’d build today

For a small voice product:

LiveKit Cloud (or self-hosted LiveKit) as SFU.
WebSocket signaling wired into your existing auth.
One Durable Object or per-room actor for room state.
Postgres for persistent rooms / membership.
Twilio / Cloudflare TURN for NAT traversal.

Scales to 10k concurrent users on minimal infra.

Read this next

If you want a LiveKit + Hono signaling server reference, it’s at rajpoot.dev.

