Case Study: Chat Application (WhatsApp/Discord)
Difficulty: Intermediate | Category: Real-Time | Similar Systems: Slack, Facebook Messenger
Designing a real-time chat application introduces the complexities of persistent connections, low-latency message delivery, and real-time presence (online/offline status).
1. Requirements Clarification
Functional Requirements
- 1-on-1 Chat: Users can send real-time text messages to each other.
- Group Chat: Users can participate in groups (up to 500 members).
- Online Presence: Users can see if their friends are online or offline.
- Message History: Users can scroll up to view older messages across devices.
Non-Functional Requirements
- Low Latency: Messages should be delivered with barely noticeable delay.
- High Availability: The system must be robust; however, strict consistency is less critical than availability (eventual consistency for message history is acceptable).
- Persistent Connections: Millions of concurrent open connections.
2. Back-of-the-Envelope Estimation
Assume 500 million Daily Active Users (DAU).
- Traffic: Each user sends 40 messages a day.
- 500 M * 40 = 20 Billion messages/day.
- 20B / 86,400 seconds = ~230,000 messages/second on average (peak could be 3x this, roughly 700,000 messages/second).
- Storage: Assume an average message is 100 bytes.
- 20B * 100 bytes = 2 TB / day.
- Connections: At peak, we might have 100–200 million concurrent TCP/WebSocket connections open.
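The arithmetic above can be reproduced in a few lines. This is only a sketch of the estimate: the constants are the assumptions stated in this section (500 M DAU, 40 messages per user per day, 100-byte messages, a 3x peak factor), and the variable names are illustrative.

```python
# Back-of-the-envelope numbers from this section, expressed as code.
DAU = 500_000_000                 # daily active users (assumed above)
MESSAGES_PER_USER_PER_DAY = 40    # assumed above
AVG_MESSAGE_BYTES = 100           # assumed above
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3                   # "peak could be 3x this"

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY
peak_messages_per_second = avg_messages_per_second * PEAK_FACTOR
storage_tb_per_day = messages_per_day * AVG_MESSAGE_BYTES / 10**12  # decimal TB

print(f"messages/day:      {messages_per_day:,}")
print(f"avg messages/sec:  {avg_messages_per_second:,.0f}")
print(f"peak messages/sec: {peak_messages_per_second:,.0f}")
print(f"storage/day:       {storage_tb_per_day:.1f} TB")
```

The exact average is closer to 231,000 messages/second; rounding to 230,000 is fine for capacity planning at this level of precision.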
3. High-Level Architecture
Because HTTP is stateless and request-driven (a client must pull data), it is inefficient for real-time chat where the server needs to push data to the client.
We use WebSockets for persistent, bi-directional communication.
Component Breakdown
- Chat Servers: Hold thousands of open WebSocket connections. They route messages in real-time but do not do heavy business logic.
- Message Queue (Kafka): Acts as the central nervous system. When User A sends a message on Chat Server 1 meant for User B on Chat Server 2, it is routed through a Pub/Sub queue.
- API Servers: Traditional stateless HTTP servers for user login, changing profile pictures, and fetching historical messages.
4. Message Flow Deep Dive
1-on-1 Messaging
- User A sends a message to User B.
- Chat Server 1 receives the message via WebSocket.
- Chat Server 1 looks up User B's connection status via a distributed Session/Presence Service (usually backed by Redis).
- The Session Service says, "User B is connected to Chat Server 2".
- Chat Server 1 pushes the message onto a message queue (or directly via RPC) targeted at Chat Server 2.
- Chat Server 2 pushes the message down the WebSocket to User B.
- In parallel, a Sync Worker reads from the queue and persists the message to the database.
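The steps above can be sketched with in-memory stand-ins. Everything here is hypothetical and simplified: `SessionService` replaces the Redis-backed lookup, a `deque` per server replaces the message queue, a plain list replaces the log that the Sync Worker would persist, and a per-user list replaces the WebSocket. Returning `False` for an offline recipient is an assumption, since this section does not describe offline delivery.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    recipient: str
    text: str


class SessionService:
    """In-memory stand-in for the Redis-backed Session/Presence Service."""

    def __init__(self):
        self._server_by_user = {}

    def register(self, user_id, server_id):
        self._server_by_user[user_id] = server_id

    def lookup(self, user_id):
        return self._server_by_user.get(user_id)


class ChatServer:
    def __init__(self, server_id, sessions, queues, storage_log):
        self.server_id = server_id
        self.sessions = sessions
        self.queues = queues              # server_id -> deque of Messages
        self.storage_log = storage_log    # what a Sync Worker would persist
        self.delivered = defaultdict(list)  # user_id -> messages "pushed" to the client

    def connect(self, user_id):
        # Record which server holds this user's connection.
        self.sessions.register(user_id, self.server_id)

    def receive_from_client(self, msg):
        # Always record the message so it can be persisted for history.
        self.storage_log.append(msg)
        target = self.sessions.lookup(msg.recipient)
        if target is None:
            return False  # recipient offline: history only (assumption)
        self.queues[target].append(msg)
        return True

    def drain_queue(self):
        # Deliver everything queued for this server to its connected users.
        queue = self.queues[self.server_id]
        while queue:
            msg = queue.popleft()
            self.delivered[msg.recipient].append(msg)


sessions = SessionService()
queues = defaultdict(deque)
storage_log = []
server1 = ChatServer("chat-1", sessions, queues, storage_log)
server2 = ChatServer("chat-2", sessions, queues, storage_log)

server1.connect("alice")
server2.connect("bob")
routed = server1.receive_from_client(Message("alice", "bob", "hi"))
server2.drain_queue()
print(routed, server2.delivered["bob"][0].text, len(storage_log))
```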
Group Messaging (Fan-Out)
If User A sends a message to a group with 500 people:
- We don't want the client to send 500 requests.
- User A sends one message to the server.
- A Group Message Service looks up all 500 group members.
- It "fans out" the message, placing 500 individual messages into the respective queues of the chat servers hosting those members.
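A minimal sketch of the fan-out step, under stated assumptions: `server_by_user` is a hypothetical mapping that a session service would supply, the per-server queues are plain `deque`s, and the sketch skips both the sender and offline members (so a 500-member group produces 499 copies here). A real system might choose differently on both points.

```python
from collections import defaultdict, deque


def fan_out(sender_id, group_members, server_by_user, text):
    """Place one copy of the message per recipient on that recipient's server queue."""
    queues = defaultdict(deque)
    for member in group_members:
        if member == sender_id:
            continue  # assumption: the sender does not receive their own message
        server = server_by_user.get(member)
        if server is None:
            continue  # assumption: offline members only see it via message history
        queues[server].append({"to": member, "from": sender_id, "text": text})
    return queues


members = [f"user{i}" for i in range(500)]
# Spread the members across four hypothetical chat servers.
server_by_user = {member: f"chat-{i % 4}" for i, member in enumerate(members)}

queues = fan_out("user0", members, server_by_user, "hello group")
total_copies = sum(len(q) for q in queues.values())
print(len(queues), total_copies)
```

The client still sends a single request; the multiplication into per-recipient messages happens entirely on the server side.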
5. Database Design
Why Cassandra / HBase?
We generate 2 TB of highly sequential, write-heavy data per day. We rarely update or delete messages, but we constantly append them and read them chronologically. A wide-column store such as Cassandra (or HBase) is a common choice for this access pattern:
- Optimized for massive write throughput.
- Data partitions (shards) naturally by conversation_id, so one conversation's messages are stored and read together.
Message Table Schema:
- conversation_id (Partition Key - groups all messages for a chat on the same node)
- message_id (Clustering Key - orders messages chronologically)
- sender_id
- content
- created_at
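To make the roles of the two keys concrete, here is a small in-memory model of the table. It is not Cassandra code: `MessageTable` and `MessageRow` are hypothetical names, a dictionary plays the role of partitioning by conversation_id, and a sorted list plays the role of clustering by message_id.

```python
import bisect
from collections import defaultdict
from typing import NamedTuple


class MessageRow(NamedTuple):
    message_id: int        # clustering key; listed first so rows sort by it
    conversation_id: str   # partition key
    sender_id: str
    content: str
    created_at: float


class MessageTable:
    def __init__(self):
        # One sorted list per partition, mimicking rows stored in clustering order.
        self._partitions = defaultdict(list)

    def insert(self, row):
        bisect.insort(self._partitions[row.conversation_id], row)

    def latest(self, conversation_id, limit):
        """Return up to `limit` rows for one conversation, newest first."""
        if limit <= 0:
            return []
        return self._partitions[conversation_id][-limit:][::-1]


table = MessageTable()
table.insert(MessageRow(3, "c1", "alice", "third", 3.0))
table.insert(MessageRow(1, "c1", "bob", "first", 1.0))
table.insert(MessageRow(2, "c1", "alice", "second", 2.0))
table.insert(MessageRow(9, "c2", "carol", "other chat", 9.0))

recent = [row.content for row in table.latest("c1", 2)]
print(recent)
```

Reading the latest messages touches a single partition and needs no sorting at read time, which is the property the schema is designed around.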
TIP
Snowflake IDs for Messages
Standard timestamps aren't precise enough to act as unique, sortable IDs for millions of concurrent messages. We use a Snowflake ID (a 64-bit integer combining a timestamp, worker machine ID, and sequence number) to ensure message_id is both globally unique and inherently sortable by time.
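A minimal sketch of such a generator follows. The 41/10/12 bit split (timestamp, worker ID, sequence) is the layout commonly associated with Snowflake-style IDs and is an assumption here, as are the arbitrary `CUSTOM_EPOCH_MS` value and the class name. The injectable `clock` parameter exists only to make the behavior easy to test.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1_700_000_000_000  # arbitrary fixed epoch (assumption)
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = (1 << WORKER_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock = clock          # returns the current time in milliseconds
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the sequence number.
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - CUSTOM_EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )


gen = SnowflakeGenerator(worker_id=7)
ids = [gen.next_id() for _ in range(1000)]
print(ids[0] < ids[-1], len(set(ids)))
```

Because the timestamp occupies the high bits, sorting by the integer value sorts by creation time, with the worker ID and sequence number breaking ties within a millisecond.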
6. Managing Presence (Online/Offline)
Tracking whether 500M users are online is notoriously challenging because connections drop and networks are flaky.
- Heartbeats: The client sends a "ping" over the WebSocket every 5 seconds.
- Redis State: A Presence Service updates a Redis key user_status:{id} with a time-to-live (TTL) of 10 seconds.
- Offline detection: If the WebSocket disconnects, or if the server doesn't receive a heartbeat for 10 seconds, the Redis key expires, triggering a Pub/Sub event that broadcasts "User Offline" to their friends.
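The heartbeat-plus-TTL mechanism can be sketched without Redis. In this hypothetical in-memory version, `heartbeat` plays the role of re-setting the key with a fresh 10-second expiry, and `sweep` stands in for the expiry event that would trigger the "User Offline" broadcast. The class name and the injectable `clock` are illustrative assumptions.

```python
import time


class PresenceService:
    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # key -> absolute expiry time

    @staticmethod
    def _key(user_id):
        return f"user_status:{user_id}"

    def heartbeat(self, user_id):
        # Each ping pushes the expiry another TTL into the future.
        self._expires_at[self._key(user_id)] = self._clock() + self._ttl

    def is_online(self, user_id):
        expires_at = self._expires_at.get(self._key(user_id))
        return expires_at is not None and self._clock() < expires_at

    def sweep(self):
        """Drop expired keys and return the user IDs that just went offline."""
        now = self._clock()
        expired = [k for k, t in self._expires_at.items() if t <= now]
        for key in expired:
            del self._expires_at[key]
        return [key.split(":", 1)[1] for key in expired]


# Drive the service with a fake clock so the timeline is explicit.
now = [0.0]
presence = PresenceService(ttl_seconds=10.0, clock=lambda: now[0])

presence.heartbeat("42")          # t = 0: key expires at t = 10
now[0] = 5.0
presence.heartbeat("42")          # t = 5: expiry pushed to t = 15
now[0] = 14.0
online_at_14 = presence.is_online("42")
now[0] = 15.0
online_at_15 = presence.is_online("42")
went_offline = presence.sweep()
print(online_at_14, online_at_15, went_offline)
```

With a 5-second heartbeat and a 10-second TTL, a single lost ping does not flip the user to offline; two consecutive misses do.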
