A chat system is the canonical “stateful at scale” design problem. Every WebSocket is a long-lived connection. Every message has to find its destination user across a fleet of tens of thousands of servers. Receipts, presence, group chat, and end-to-end encryption layer on top.
Here’s how I’d design it.
Requirements
Functional
- 1:1 chat: send / receive messages.
- Group chat: up to 256 (or 1024) members.
- Delivery + read receipts.
- Online / typing indicators.
- Multi-device (one user, many devices in sync).
- Message history.
- Optional: end-to-end encryption.
- Optional: media (images, videos) sharing.
Non-functional
- Sub-100ms message delivery (p95) when both users are online.
- Reliable delivery when recipient is offline (queue + push notification).
- Eventual ordering within a conversation (a single thread of messages).
- Privacy — message bodies opaque to the server (E2EE).
Capacity
| Number | |
|---|---|
| Total users | 1B |
| DAU | 500M |
| Concurrent connections | 200M |
| Messages per day | 100B |
| Messages per second (peak) | 5M |
| Avg message size | ~500 bytes |
| Daily storage growth (no E2EE: server stores plaintext copy) | ~50 TB |
The hard parts:
- 200M concurrent WebSocket connections. A single beefy node holds ~50–100k. So we need 2,000–4,000 connection nodes.
- 5M msg/sec peak. Routing each one to the right node.
- Partitioning state so a user’s messages always go to the right place.
Architecture overview
┌──────────────┐ ┌──────────────────────┐
│ Client (iOS, │ WebSocket / │ Edge / Load Balance │
│ Android, ├───────────────▶│ (sticky session by │
│ Web) │ │ user_id hash) │
└──────────────┘ └─────────┬────────────┘
│
▼
┌─────────────────────┐
│ Connection nodes │
│ (WebSocket fleet) │
│ ~50k conns / node │
└────┬───────────┬────┘
│ │
┌──────────▼───┐ ┌──▼───────────┐
│ Pub/Sub │ │ Message │
│ backplane │ │ store │
│ (NATS/Kafka)│ │ (Cassandra/ │
└──────────────┘ │ ScyllaDB) │
└──────────────┘
│
┌──────▼──────┐
│ Push (APNs/ │
│ FCM) │
└─────────────┘
Five subsystems: edge LB, connection nodes (stateful), pub/sub backplane, message store, push gateway.
The WebSocket fleet
Connection nodes are stateful — each holds tens of thousands of WebSocket connections.
The shape:
- Sticky routing. Hash
user_idto a node. The same user’s devices always connect to the same node so delivery is local. - Heartbeats. Every 30s the client sends a ping. After 90s without one, the server closes the connection.
- Multiple devices per user. A user’s connections are stored as a set in node-local memory; messages broadcast across all of them.
- Graceful drain on deploy. A node that’s about to be replaced refuses new connections, drains existing ones over 60s.
# Conceptual; would be Go or Rust in production for the perf
class ConnectionNode:
def __init__(self):
self.connections: dict[user_id, set[WebSocket]] = defaultdict(set)
async def add_connection(self, user_id, ws):
self.connections[user_id].add(ws)
async def deliver_to(self, user_id, message):
for ws in self.connections.get(user_id, ()):
await ws.send_json(message)
async def remove_connection(self, user_id, ws):
self.connections[user_id].discard(ws)
if not self.connections[user_id]:
del self.connections[user_id]
For the bidirectional reasons — see SSE vs WebSockets in 2026 — chat goes WebSocket. SSE doesn’t fit.
Routing across the fleet
User A is on node 17. User B is on node 1342. A sends to B. How does node 17 deliver to node 1342?
The answer: a pub/sub backplane.
A sends message to B
↓
Node 17 receives it
↓
Node 17 publishes "msg.user.<B_id>" to NATS / Kafka
↓
Node 1342 (subscribed to "msg.user.<B_id>") delivers to B's WebSockets
Two patterns to choose:
Topic per user (NATS-style)
Each user has their own subject. Node A subscribes to subjects for users connected to it. To send: publish on the recipient’s subject.
- Pros: Trivial routing. Cheap subscription per user.
- Cons: Node B has many subscriptions (one per connected user). NATS handles it; Kafka does not.
Topic per partition + indirection
Hash user_id → partition (say 4096 partitions). Each node owns some partitions. To send: publish to the recipient’s partition.
- Pros: Bounded subscriptions; works on Kafka.
- Cons: Need a coordination layer to track which node owns which partitions.
For chat at scale, NATS or NATS JetStream is the simplest fit. See Kafka vs NATS vs RabbitMQ .
Message storage
You need durable storage for:
- Recent messages (last 30–90 days, hot).
- Older messages (cold).
- Per-conversation order.
- Per-user inbox / “what haven’t I delivered yet.”
Schema (conceptual)
messages partition by conversation_id:
conversation_id (clustering key on ts)
message_id (snowflake-ordered)
sender_id
ts
ciphertext (or plaintext if non-E2EE)
ack_status
inbox partition by user_id:
user_id
message_id (clustering key on ts)
conversation_id
delivered_at
read_at
Why Cassandra/ScyllaDB:
- Massive write throughput.
- Partitioning by
conversation_idkeeps a thread’s messages together. - Time-clustered keys give you efficient “last 100 messages” reads.
Postgres can hold this up to ~50M users with partitioning, after which the move to Cassandra is the conventional path.
Delivery semantics
The send flow:
- Client A sends
{conversation_id, client_msg_id, ciphertext}to node A’s WebSocket. - Node A: a. Generates a server-side message ID (snowflake). b. Persists to message store + B’s inbox. c. ACKs back to client A with the server ID. d. Publishes to B’s pub/sub topic.
- Node B receives, looks up B’s connections, delivers to all of them.
- Client B ACKs receipt back through B’s WebSocket.
- Node B updates
delivered_atin B’s inbox.
If client B is offline:
- Step 3 finds no connections.
- The message stays in B’s inbox marked undelivered.
- A push notification (APNs / FCM) goes out.
- When B reconnects, sync gives B all messages with
delivered_at IS NULL.
The client_msg_id is for idempotency — if A retries the send because the network blipped, the server dedupes. Same pattern as Idempotency, Retries, and Exactly-Once Illusions
.
Receipts
Three states:
- Sent — server received the message.
- Delivered — recipient’s device received the WebSocket frame.
- Read — recipient’s app showed the message to the user (client tells the server when this happens).
Each one is its own event flowing back from the recipient’s device to the sender. Same pub/sub mechanism in reverse:
Client B → Node B → publish receipt → Node A subscribed → Client A
Receipts are usually batched per conversation per few seconds to avoid 1-receipt-per-message storms.
Online presence
Presence (“user is online”) is harder than it looks. Naive design: subscribe to every contact’s online status. With 200 contacts × 500M users, presence pub/sub is a firehose.
Better designs:
- Coarse presence updates every minute, not every connection event.
- On-demand fetch — when you open a chat, fetch the contact’s recent online status from a presence service.
- Last-seen-at stored in a key-value store, written debounced (every 30s).
For 1:1 chat, presence is a UX nice-to-have, not a hot path. Build it cheaply.
Group chat
A group of N members → a tweet posted by user U fans out to N-1 inboxes. Two flavors:
Server-side fanout (most chat apps)
Server expands group → individual messages → fanout to each member’s inbox.
For non-encrypted: cheap; one ciphertext per member is wasteful but tractable.
For E2EE: each recipient needs their own key wrap. Megolm (Matrix) and WhatsApp’s group protocol handle this with sender keys + per-member key distribution.
Client-side fanout
Client encrypts and sends one message per recipient. Used for very small groups in some E2EE designs. Doesn’t scale to 256-member groups.
End-to-end encryption (Signal Protocol)
E2EE means the server never sees plaintext. The protocol stack:
- X3DH (Extended Triple Diffie-Hellman) — initial asymmetric key exchange. Each user publishes long-term + medium-term + one-time prekeys to a server. To start a session, sender combines them into a shared secret.
- Double Ratchet — for each subsequent message, derive a fresh per-message key. Forward-secure: a leaked key reveals only past messages, not future ones.
The server stores:
- Each user’s public prekeys.
- Ciphertext of every message (it can’t decrypt).
The server cannot:
- Read message content.
- Forge messages between users.
The server can:
- See metadata — who messaged whom, when, message size.
- Block users.
- Enforce rate limits.
- Subpoena ciphertext (which is useless without the keys).
Open implementations: libsignal (Rust, used by Signal/WhatsApp), Olm/Megolm (Matrix). Don’t roll your own crypto.
Multi-device
A user has phone + laptop + tablet. All should receive every message; sending from one should sync to others.
The pattern:
- Each device is a separate principal in the protocol.
- Each session is per-device in E2EE — sender encrypts the message N times (one per recipient device).
- Sent-message sync — sender’s other devices receive the message via a special “sync” topic so the conversation stays consistent.
This is non-trivial. Signal’s docs describe it well; expect 2–3 weeks of design alone for the multi-device story.
Push notifications
When the recipient is offline, the server sends a push via APNs (iOS) or FCM (Android) / Web Push.
For E2EE: the push contains no message content — just “you have a new message from $sender”. The client wakes up, syncs ciphertext from the server, decrypts locally.
Operating push at scale:
- Token registration: every device registers its push token; the server stores
(user_id, device_id, push_token). - Batching — one batch call to APNs handles thousands of pushes.
- Backoff on push failures (token revoked, etc.).
- Cleanup — invalid tokens get pruned.
Capacity arithmetic
For 200M concurrent connections at 50k per node: 4,000 connection nodes. At 5M msg/sec peak split across them: each node handles ~1,250 msg/sec routing. Eminently doable.
Storage: 100B msg/day × 500B avg ≈ 50 TB/day. With 90-day hot retention: ~4.5 PB. Cassandra/ScyllaDB cluster of 50–100 nodes.
Push: 200M devices × 5 pushes/day on average = 1B pushes/day. Push gateway batches 1k per call to APNs/FCM. Manageable.
What interviewers love to dig into
- “What if a connection node dies?” → Active connections drop, clients reconnect (their sticky-routing hash points elsewhere), backplane redelivers any in-flight messages from the queue.
- “What if a user spams?” → Rate limit at the WebSocket level (token bucket per user — see Design a Rate Limiter ).
- “How do you ensure ordering in a conversation?” → Snowflake message IDs; clients render in ID order. The server’s per-conversation partition gives natural ordering.
- “How does the user sync history on a new device?” → For non-E2EE: server returns recent messages on auth. For E2EE: the new device joins existing sessions via a key-transfer protocol; old messages may not be re-encrypted to the new device (depends on app’s design).
- “How do you handle media?” → Media goes to S3 / blob store; chat message contains the URL + size. For E2EE, encrypt the file with a per-message key, store ciphertext in S3, share the key in the message.
What I’d build today
For a small chat product (10k users):
- One Postgres for messages, follows, devices.
- NATS JetStream for the backplane.
- A Go or Rust WebSocket server (single binary, ~50k conns/node).
- Web Push + FCM/APNs for offline delivery.
- Cloudflare in front for TLS + DDoS.
- Optional: Olm/Matrix for E2EE.
For WhatsApp scale, the architecture is the same — just thousands of nodes, Cassandra instead of Postgres, multi-region replication, and a small army of SREs.
Read this next
- Distributed Systems Fundamentals
- SSE vs WebSockets in 2026
- Kafka vs NATS vs RabbitMQ
- Design Twitter / News Feed
If you want a small reference implementation (Go WebSocket server + NATS + Postgres + push) you can clone and learn from, it’s at rajpoot.dev .
Building something AI-, backend-, or data-heavy and want a second pair of eyes? I do consulting and freelance work — see my projects and ways to reach me at rajpoot.dev .