Case Study: Chat Application (WhatsApp/Discord)

Difficulty: Intermediate | Category: Real-Time | Similar Systems: Slack, Facebook Messenger

Designing a real-time chat application introduces the complexities of persistent connections, low-latency message delivery, and real-time presence (online/offline status).

1. Requirements Clarification

Functional Requirements

1-on-1 Chat: Users can send real-time text messages to each other.
Group Chat: Users can participate in groups (up to 500 members).
Online Presence: Users can see if their friends are online or offline.
Message History: Users can scroll up to view older messages across devices.

Non-Functional Requirements

Low Latency: Messages should be delivered with barely noticeable delay.
High Availability: The system must be robust; however, strict consistency is less critical than availability (eventual consistency for message history is acceptable).
Persistent Connections: Millions of concurrent open connections.

2. Back-of-the-Envelope Estimation

Assume 500 million Daily Active Users (DAU).

Traffic: Each user sends 40 messages a day.
- 500 M * 40 = 20 Billion messages/day.
- 20B / 86,400 = ~230,000 messages/second (peak could be 3x this).
Storage: Assume an average message is 100 bytes.
- 20B * 100 bytes = 2 TB / day.
Connections: At peak, we might have 100–200 million concurrent TCP/WebSocket connections open.
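
The arithmetic above can be reproduced with a short script. This is only a sanity check of the stated assumptions (500M DAU, 40 messages per user, 100 bytes per message, 3x peak factor); the constant names are illustrative.

```python
# Back-of-the-envelope numbers, using the assumptions stated in this section.
DAU = 500_000_000
MESSAGES_PER_USER_PER_DAY = 40
AVG_MESSAGE_BYTES = 100
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3  # "peak could be 3x" the average rate

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY
peak_messages_per_second = avg_messages_per_second * PEAK_FACTOR
storage_bytes_per_day = messages_per_day * AVG_MESSAGE_BYTES

print(f"messages/day:        {messages_per_day:,}")
print(f"avg messages/second: {avg_messages_per_second:,.0f}")
print(f"peak messages/second:{peak_messages_per_second:,.0f}")
print(f"storage/day (TB):    {storage_bytes_per_day / 10**12:.1f}")
```

Note that the storage figure covers raw message bodies only; replication and per-row metadata would increase it.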

3. High-Level Architecture

Because HTTP is stateless and request-driven (a client must pull data), it is inefficient for real-time chat where the server needs to push data to the client.

We use WebSockets for persistent, bi-directional communication.

Component Breakdown

Chat Servers: Each server holds thousands of open WebSocket connections. They route messages in real time but keep business logic to a minimum.
Message Queue (Kafka): Acts as the central nervous system. When User A sends a message on Chat Server 1 meant for User B on Chat Server 2, it is routed through a Pub/Sub queue.
API Servers: Traditional stateless HTTP servers for user login, changing profile pictures, and fetching historical messages.

4. Message Flow Deep Dive

1-on-1 Messaging

User A sends a message to User B.
Chat Server 1 receives the message via WebSocket.
Chat Server 1 looks up User B's connection status via a distributed Session/Presence Service (usually backed by Redis).
The Session Service says, "User B is connected to Chat Server 2".
Chat Server 1 pushes the message onto a message queue (or directly via RPC) targeted at Chat Server 2.
Chat Server 2 pushes the message down the WebSocket to User B.
In parallel, a Sync Worker reads from the queue and persists the message to the database.
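
The steps above can be sketched with in-memory stand-ins. This is a minimal illustration, not a production design: `SessionService`, `ChatServer`, the per-server queues, and the `persist_queue` are hypothetical names, a plain dictionary replaces Redis, `deque` objects replace the message queue, and a Python list stands in for each WebSocket.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Message:
    sender_id: str
    recipient_id: str
    content: str


class SessionService:
    """Stand-in for the Redis-backed lookup: user_id -> chat server name."""

    def __init__(self):
        self._server_by_user = {}

    def register(self, user_id, server_name):
        self._server_by_user[user_id] = server_name

    def lookup(self, user_id):
        return self._server_by_user.get(user_id)  # None means "not connected"


class ChatServer:
    def __init__(self, name, sessions, server_queues, persist_queue):
        self.name = name
        self.sessions = sessions
        self.server_queues = server_queues  # one queue per chat server
        self.persist_queue = persist_queue  # read by the Sync Worker
        self.sockets = {}                   # user_id -> list acting as a WebSocket

    def connect(self, user_id):
        self.sockets[user_id] = []
        self.sessions.register(user_id, self.name)

    def on_client_message(self, msg):
        # Hand the message to the persistence path regardless of delivery.
        self.persist_queue.append(msg)
        # Ask the session service where the recipient is connected.
        target = self.sessions.lookup(msg.recipient_id)
        if target is not None:
            self.server_queues[target].append(msg)

    def drain_queue(self):
        # Push every queued message down the recipient's socket.
        queue = self.server_queues[self.name]
        while queue:
            msg = queue.popleft()
            socket = self.sockets.get(msg.recipient_id)
            if socket is not None:
                socket.append(msg.content)


sessions = SessionService()
server_queues = defaultdict(deque)
persist_queue = deque()
server1 = ChatServer("chat-1", sessions, server_queues, persist_queue)
server2 = ChatServer("chat-2", sessions, server_queues, persist_queue)

server1.connect("user_a")
server2.connect("user_b")
server1.on_client_message(Message("user_a", "user_b", "hello"))
server2.drain_queue()
print(server2.sockets["user_b"])
```

Persisting before routing means an offline recipient can still fetch the message later through the history API; whether the real system orders these steps this way is an assumption of the sketch.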

Group Messaging (Fan-Out)

If User A sends a message to a group with 500 people:

We don't want the client to send 500 requests.
User A sends one message to the server.
A Group Message Service looks up all 500 group members.
It "fans out" the message, placing 500 individual messages into the respective queues of the chat servers hosting those members.
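
A minimal sketch of the fan-out step follows. The function name `fan_out` and the `server_by_user` mapping are hypothetical, and two behaviors are assumptions of this example rather than statements from the design: the sender does not receive a copy, and offline members are skipped (they would read the message from history later).

```python
from collections import defaultdict


def fan_out(group_members, sender_id, server_by_user, content):
    """Group one outgoing message into per-chat-server delivery batches.

    Returns {server_name: [(recipient_id, content), ...]}.
    """
    per_server = defaultdict(list)
    for member_id in group_members:
        if member_id == sender_id:
            continue  # assumption: no echo back to the sender
        server = server_by_user.get(member_id)
        if server is None:
            continue  # assumption: offline members rely on message history
        per_server[server].append((member_id, content))
    return dict(per_server)


# 500 members spread evenly over four hypothetical chat servers.
members = [f"user_{i}" for i in range(500)]
server_by_user = {m: f"chat-{i % 4}" for i, m in enumerate(members)}

batches = fan_out(members, "user_0", server_by_user, "hi all")
total_deliveries = sum(len(batch) for batch in batches.values())
print(total_deliveries, sorted(batches))
```

Grouping deliveries by destination server keeps the number of queue writes proportional to the number of servers involved rather than to the group size.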

5. Database Design

Why Cassandra / HBase?

We generate about 2 TB of append-only, write-heavy data per day. We rarely update or delete messages, but we constantly append them and read them back in chronological order. Wide-column stores such as Cassandra or HBase are a common choice for this workload:

Optimized for massive write throughput.
Data partitions (shards) naturally by conversation_id, so a conversation's messages are stored together.

Message Table Schema:

conversation_id (Partition Key - groups all messages for a chat on the same node)
message_id (Clustering Key - orders messages chronologically)
sender_id
content
created_at
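
To illustrate how the partition key and clustering key shape reads, here is a small in-memory model. It is a teaching sketch only, not Cassandra or HBase code: `MessageStore` and its methods are hypothetical, and `created_at` is omitted for brevity.

```python
import bisect
from collections import defaultdict


class MessageStore:
    """Toy model: one sorted list of rows per conversation_id partition."""

    def __init__(self):
        # conversation_id -> rows sorted by message_id (the clustering key)
        self._partitions = defaultdict(list)

    def append(self, conversation_id, message_id, sender_id, content):
        bisect.insort(self._partitions[conversation_id],
                      (message_id, sender_id, content))

    def latest(self, conversation_id, limit, before_message_id=None):
        """Return up to `limit` rows, newest first, older than the cursor."""
        rows = self._partitions.get(conversation_id, [])
        if before_message_id is None:
            end = len(rows)
        else:
            # First index whose message_id is >= the cursor.
            end = bisect.bisect_left(rows, (before_message_id,))
        return rows[max(0, end - limit):end][::-1]


store = MessageStore()
store.append("conv-1", 3, "user_a", "third")
store.append("conv-1", 1, "user_a", "first")
store.append("conv-1", 2, "user_b", "second")

newest_page = store.latest("conv-1", limit=2)
older_page = store.latest("conv-1", limit=2, before_message_id=2)
print(newest_page)
print(older_page)
```

Scrolling up in a chat maps to repeated `latest` calls with the oldest message_id seen so far as the cursor, which touches a single partition each time.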

TIP

Snowflake IDs for Messages: A plain timestamp cannot serve as a unique, sortable ID, because many messages can be created within the same millisecond. We use a Snowflake ID (a 64-bit integer combining a timestamp, worker machine ID, and sequence number) to ensure message_id is both globally unique and sortable by time.
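
A minimal sketch of such a generator is shown below. The bit split (41 bits of millisecond timestamp, 10 bits of worker ID, 12 bits of sequence) follows the commonly cited Snowflake layout; the class name and the custom epoch value are assumptions chosen for illustration.

```python
import threading
import time


class SnowflakeGenerator:
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_WORKER_ID = (1 << WORKER_BITS) - 1
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id, epoch_ms=1_700_000_000_000):
        # epoch_ms is an arbitrary custom epoch (assumption for this sketch).
        if not 0 <= worker_id <= self.MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    @staticmethod
    def _now_ms():
        return int(time.time() * 1000)

    def next_id(self):
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the sequence number.
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted: spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - self.epoch_ms
            return ((timestamp << (self.WORKER_BITS + self.SEQUENCE_BITS))
                    | (self.worker_id << self.SEQUENCE_BITS)
                    | self._sequence)


generator = SnowflakeGenerator(worker_id=7)
ids = [generator.next_id() for _ in range(10_000)]
print(ids[0], ids[-1])
```

IDs from a single worker are strictly increasing; IDs from different workers are ordered by millisecond, with ties broken by worker ID, which is usually sufficient for ordering chat messages.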

6. Managing Presence (Online/Offline)

Managing whether 500M users are online is notoriously challenging due to dropped connections and network flakiness.

Heartbeats: The client sends a "ping" over the WebSocket every 5 seconds.
Redis State: On each heartbeat, a Presence Service sets the Redis key user_status:{id} with a time-to-live (TTL) of 10 seconds.
Offline detection: If the WebSocket closes, the server can delete the key immediately; if heartbeats simply stop arriving, the key expires after 10 seconds. Either way, the removal triggers a Pub/Sub event that broadcasts "User Offline" to the user's friends.
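
The heartbeat-plus-TTL mechanism can be sketched without Redis. In the example below, `PresenceService` is a hypothetical in-memory stand-in: `heartbeat` plays the role of setting a key with a 10-second expiry, and `sweep` plays the role of the expiry notification. The clock is injectable so the behavior can be shown deterministically.

```python
import time


class PresenceService:
    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # user_id -> time at which the entry expires

    def heartbeat(self, user_id):
        # Each ping pushes the expiry forward by one TTL.
        self._expires_at[user_id] = self._clock() + self.ttl_seconds

    def is_online(self, user_id):
        expires_at = self._expires_at.get(user_id)
        return expires_at is not None and self._clock() < expires_at

    def sweep(self):
        """Remove expired entries and return the users who just went offline."""
        now = self._clock()
        expired = [u for u, t in self._expires_at.items() if t <= now]
        for user_id in expired:
            del self._expires_at[user_id]
        return expired


fake_now = [0.0]  # mutable cell so the lambda sees updates
presence = PresenceService(ttl_seconds=10.0, clock=lambda: fake_now[0])

presence.heartbeat("user_b")   # expires at t=10
fake_now[0] = 5.0
presence.heartbeat("user_b")   # expiry pushed to t=15
fake_now[0] = 14.0
online_at_14 = presence.is_online("user_b")
fake_now[0] = 15.0
went_offline = presence.sweep()
print(online_at_14, went_offline)
```

With a 5-second ping interval and a 10-second TTL, a single lost heartbeat does not flip a user to offline; only two consecutive misses do.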
