Real-World System Design Case Studies

Design a Chat System

18 min Lesson 5 of 10

Design a Chat System

WhatsApp processes roughly 100 billion messages per day across two billion users. Telegram delivers messages to groups of up to 200,000 members. Behind both products is a deceptively difficult engineering problem: how do you move a tiny payload — a text message — from one phone to another, anywhere on earth, in under a second, reliably, at planetary scale?

This case study walks through the architecture of a WhatsApp-style chat system: one-to-one messaging, group messaging, online/offline presence, and message delivery guarantees. We will make explicit trade-off decisions at each step and justify them.

Requirements and Scale

Before drawing boxes, nail down the requirements:

Functional: send and receive text messages in real time; one-to-one and group chats (up to 500 members); online presence indicators; message delivery receipts (sent, delivered, read); offline message storage and delivery on reconnect.
Non-functional: 50 million daily active users; peak 500,000 concurrent WebSocket connections per datacenter; message latency under 200 ms (p99); 5 nines availability; messages stored for at least 30 days.
Out of scope (for this lesson): media files, end-to-end encryption key management, voice/video calls.

Key idea: Always clarify whether messages need to be stored server-side after delivery. WhatsApp historically deleted server-side messages once delivered; Telegram stores everything in the cloud. This single decision drives enormous differences in storage cost and architecture.

Choosing the Transport: WebSocket over HTTP Polling

HTTP polling (the client asks "any new messages?" every few seconds) wastes bandwidth and adds latency proportional to the poll interval. WebSockets solve this: after a single HTTP upgrade handshake, the connection becomes a full-duplex, persistent TCP channel. The server can push a message the instant it arrives — no polling loop needed.

Each mobile client opens one long-lived WebSocket to a Chat Server. Because TCP connections are stateful, the Chat Server that holds a user's socket must be the one that delivers messages to that user. This is the central architectural constraint that shapes everything else.

High-Level Architecture

High-level chat system architecture: WebSocket path, message queue, Cassandra persistence, Redis session store, and presence service.

The key flows:

Client A connects via WebSocket through the load balancer to Chat Server Instance 1.
A sends a message. Chat Server 1 publishes it to Kafka and immediately returns an ACK to Client A (the "sent" tick).
A Delivery Service (Kafka consumer) looks up which Chat Server instance holds Client B's open socket (via Redis session cache), then pushes the message down that connection.
Separately, the Persistence Service (another Kafka consumer) writes the message to Cassandra for history and offline delivery.

Session Routing: How Does the Delivery Service Find Client B?

This is the crux of the design. When Client B opens a WebSocket, the Chat Server it lands on writes a session record to Redis:

SET session:{userId} {chatServerHostname} EX 3600

When a message arrives for User B, the Delivery Service reads session:{userId} from Redis. If B is online, the key exists and names the Chat Server holding the socket. The Delivery Service sends an internal RPC to that server, which pushes the message down the open WebSocket. If the key is absent, B is offline — the message stays in Cassandra and is delivered in bulk when B reconnects and calls a "sync since last seen" endpoint.

Best practice: Keep the session store (Redis) strictly separate from the message store (Cassandra). Session data is hot, ephemeral, and tiny. Message history is cold, durable, and enormous. Mixing them forces you to over-provision your cache or under-provision your database.

Message Storage: Why Cassandra?

A chat message has a natural access pattern: give me all messages in conversation X, ordered by time, most recent first. Cassandra's data model fits this perfectly. A partition key of (conversation_id) with a clustering key of (created_at DESC, message_id) stores all messages for one conversation on the same node(s), enabling single-partition range scans.

At WhatsApp scale, the write rate is enormous — 100 billion messages per day is roughly 1.16 million writes per second. Cassandra's LSM-tree storage engine absorbs write bursts far better than a B-tree (PostgreSQL / MySQL) because writes always go to an in-memory MemTable first, batched to disk as SSTables. There is no random-I/O write amplification.

Trade-off: Cassandra provides eventual consistency by default. For a chat system this is almost always acceptable — a message appearing slightly out of order in a high-traffic group is tolerable. But if you need strong ordering guarantees (e.g. financial ledger chat), you would need to use a single-partition serial consistency level at the cost of throughput, or use a different database entirely.

Group Messaging: Fan-Out

Group messages require delivering one message to N recipients. There are two strategies:

Fan-out on write: When Alice sends a message in a 50-person group, the system immediately creates 50 delivery tasks (one per member). Simple to implement; latency is predictable. Becomes expensive at scale for large groups (200,000 members on Telegram).
Fan-out on read: Store one copy of the message. Each member's client polls for "messages in group G since my last read cursor." Scales storage well but increases read complexity.

Most chat systems use a hybrid: fan-out on write for small groups (under ~500 members) and fan-out on read for very large channels/broadcasts. The threshold depends on your write capacity — if you can absorb the write amplification, fan-out on write gives lower read latency.

Presence: Online / Offline / Last Seen

Heartbeat-driven presence: the client sends a ping every 30 s; the Presence Service refreshes a Redis key with TTL 65 s. Silence = offline.

Presence is deceptively tricky at scale. The naive approach of listening to WebSocket connect/disconnect events fails because a client may lose connectivity without sending a clean close frame (mobile networks, backgrounded apps). The robust pattern:

Client sends a heartbeat message every 30 seconds over the open WebSocket.
The Chat Server forwards this to the Presence Service, which calls SET presence:{userId} 1 EX 65 in Redis.
If no heartbeat arrives within 65 seconds, the TTL expires and the key disappears — the user is offline.
The Presence Service also writes the current timestamp to last_seen:{userId} on every heartbeat, so friends can see "Last seen 2 minutes ago."

At 50 million DAU with roughly 10 million concurrent users, 10 million Redis writes every 30 seconds is about 333,000 writes/sec to the presence store. This is non-trivial. Production systems shard the presence Redis cluster and batch-update presence rather than issuing individual SET calls per user per heartbeat.

Message Delivery Guarantees

Chat systems commonly use a three-state delivery model:

Sent (single tick): The server accepted and acknowledged the message. Kafka has it; it will be persisted.
Delivered (double tick): The message was pushed to the recipient's device and the device ACK'd receipt.
Read (blue ticks): The recipient's app reported that the message was viewed.

Each state transition requires the recipient's device to send a tiny acknowledgement back through its WebSocket. These ACKs are also queued and delivered asynchronously to the sender — the sender's Chat Server subscribes to a Kafka topic keyed by their user ID to receive them.

Best practice: Use monotonically increasing message IDs per conversation (not global UUIDs). On reconnect, the client sends its last-seen message ID and the server returns everything after that. This "sync cursor" pattern avoids paginating through all of history and is cheap to implement with a Cassandra range scan.

Handling Offline Users

When the Delivery Service finds no session key for a user in Redis, the message is already safely in Cassandra. When the user comes back online and reconnects, the client sends a SYNC {lastSeenMessageId} request to the REST API Gateway, which queries Cassandra for all messages in each of the user's conversations after that ID. This "catch-up on reconnect" pattern is far more efficient than push-retrying every offline message — it is a single read per conversation rather than N queued deliveries.

Key Design Decisions — Summary

WebSocket over polling — sub-second delivery, half-duplex not needed, persistent connections manageable with session routing via Redis.
Kafka as the message backbone — decouples the Chat Server (accept & ACK fast) from Delivery and Persistence consumers; provides durability if a consumer is down.
Cassandra for message storage — write-optimised, horizontally scalable, natural fit for time-ordered conversation partitions.
Redis for session and presence — sub-millisecond lookups, built-in TTL for presence expiry, trivially horizontally shardable.
Fan-out strategy depends on group size — write-time fan-out for small groups; read-time for large channels.