
Case Study: Chat Application (WhatsApp/Discord)

Difficulty: Intermediate | Category: Real-Time | Similar Systems: Slack, Facebook Messenger

Designing a real-time chat application introduces the complexities of persistent connections, low-latency message delivery, and real-time presence (online/offline status).


1. Requirements Clarification

Functional Requirements

  • 1-on-1 Chat: Users can send real-time text messages to each other.
  • Group Chat: Users can participate in groups (up to 500 members).
  • Online Presence: Users can see if their friends are online or offline.
  • Message History: Users can scroll up to view older messages across devices.

Non-Functional Requirements

  • Low Latency: Messages should be delivered with barely noticeable delay.
  • High Availability: The system must be robust; however, strict consistency is less critical than availability (eventual consistency for message history is acceptable).
  • Persistent Connections: Millions of concurrent open connections.

2. Back-of-the-Envelope Estimation

Assume 500 million Daily Active Users (DAU).

  • Traffic: Each user sends 40 messages a day.
    • 500 M * 40 = 20 Billion messages/day.
    • 20B / 86,400 = ~230,000 messages/second (peak could be 3x this).
  • Storage: Assume an average message is 100 bytes.
    • 20B * 100 bytes = 2 TB / day.
  • Connections: At peak, we might have 100–200 million concurrent TCP/WebSocket connections open.
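The arithmetic above can be reproduced with a few lines of Python. This is only a sanity check of the stated assumptions (500M DAU, 40 messages per user, 100 bytes per message, a 3x peak factor); the variable names are illustrative.

```python
# Back-of-the-envelope numbers taken from the estimates above.
DAU = 500_000_000            # daily active users
MESSAGES_PER_USER = 40       # messages sent per user per day
AVG_MESSAGE_BYTES = 100      # assumed average message size
PEAK_FACTOR = 3              # assumed peak-to-average ratio
SECONDS_PER_DAY = 86_400

messages_per_day = DAU * MESSAGES_PER_USER
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY
peak_messages_per_second = avg_messages_per_second * PEAK_FACTOR
storage_bytes_per_day = messages_per_day * AVG_MESSAGE_BYTES

print(f"{messages_per_day:,} messages/day")
print(f"{avg_messages_per_second:,.0f} messages/second on average")
print(f"{peak_messages_per_second:,.0f} messages/second at peak")
print(f"{storage_bytes_per_day / 1e12:.1f} TB/day")
```

Note that the 2 TB/day figure covers message bodies only; replication and per-row metadata would multiply it.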

3. High-Level Architecture

Plain HTTP is stateless and request-driven: the server can only respond when the client asks, so the client would have to poll for new messages. That is inefficient for real-time chat, where the server needs to push data to the client as soon as it arrives.

We use WebSockets for persistent, bi-directional communication.

Component Breakdown

  1. Chat Servers: Hold thousands of open WebSocket connections. They route messages in real-time but do not do heavy business logic.
  2. Message Queue (Kafka): Acts as the central nervous system. When User A sends a message on Chat Server 1 meant for User B on Chat Server 2, it is routed through a Pub/Sub queue.
  3. API Servers: Traditional stateless HTTP servers for user login, changing profile pictures, and fetching historical messages.

4. Message Flow Deep Dive

1-on-1 Messaging

  1. User A sends a message to User B.
  2. Chat Server 1 receives the message via WebSocket.
  3. Chat Server 1 looks up User B's connection status via a distributed Session/Presence Service (usually backed by Redis).
  4. The Session Service says, "User B is connected to Chat Server 2".
  5. Chat Server 1 pushes the message onto a message queue (or directly via RPC) targeted at Chat Server 2.
  6. Chat Server 2 pushes the message down the WebSocket to User B.
  7. In parallel, a Sync Worker reads from the queue and persists the message to the database.
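The seven steps above can be traced with a small in-memory simulation. This is a sketch under stated assumptions, not a production design: SessionService stands in for the Redis-backed lookup, plain deques stand in for the message queue, and all class and method names are hypothetical. It also assumes a message is persisted even when the recipient is offline, which the flow above does not spell out.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    recipient: str
    text: str


class SessionService:
    """Stand-in for the Redis-backed session lookup: user -> chat server."""

    def __init__(self):
        self._server_for_user = {}

    def register(self, user, server_id):
        self._server_for_user[user] = server_id

    def lookup(self, user):
        return self._server_for_user.get(user)  # None means not connected


class ChatCluster:
    def __init__(self):
        self.sessions = SessionService()
        self.server_queues = defaultdict(deque)  # one inbound queue per chat server
        self.persist_queue = deque()             # consumed by the sync worker
        self.delivered = defaultdict(list)       # user -> messages pushed to their socket
        self.database = []                       # stand-in for the message store

    def send(self, message):
        # Steps 2-5: receive, look up the recipient's server, enqueue for it.
        self.persist_queue.append(message)       # assumption: persist even if offline
        target = self.sessions.lookup(message.recipient)
        if target is not None:
            self.server_queues[target].append(message)
        return target

    def drain_server(self, server_id):
        # Step 6: the target chat server pushes queued messages to its users.
        queue = self.server_queues[server_id]
        while queue:
            message = queue.popleft()
            self.delivered[message.recipient].append(message)

    def run_sync_worker(self):
        # Step 7: persist messages independently of delivery.
        while self.persist_queue:
            self.database.append(self.persist_queue.popleft())


cluster = ChatCluster()
cluster.sessions.register("alice", "chat-1")
cluster.sessions.register("bob", "chat-2")
routed_to = cluster.send(Message("alice", "bob", "hi"))
cluster.drain_server("chat-2")
cluster.run_sync_worker()
print(routed_to, cluster.delivered["bob"], len(cluster.database))
```

Keeping persistence on its own queue means a slow database write never delays the push to User B.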

Group Messaging (Fan-Out)

If User A sends a message to a group with 500 people:

  • We don't want the client to send 500 requests.
  • User A sends one message to the server.
  • A Group Message Service looks up all 500 group members.
  • It "fans out" the message, placing 500 individual messages into the respective queues of the chat servers hosting those members.
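A minimal sketch of the fan-out step, assuming the Group Message Service already has the member list and a user-to-server mapping from the session lookup. The function name, the skipping of the sender, and the handling of offline members are illustrative assumptions rather than part of the design above.

```python
from collections import defaultdict


def fan_out(sender, text, members, server_for_user):
    """Return {chat_server_id: [(recipient, text), ...]} for one group message."""
    per_server = defaultdict(list)
    for member in members:
        if member == sender:
            continue  # the sender already has the message
        server_id = server_for_user.get(member)
        if server_id is None:
            continue  # assumption: offline members catch up from message history
        per_server[server_id].append((member, text))
    return dict(per_server)


# Hypothetical group of 500 members spread evenly over 4 chat servers.
members = [f"user-{i}" for i in range(500)]
server_for_user = {m: f"chat-{i % 4}" for i, m in enumerate(members)}
batches = fan_out("user-0", "hello team", members, server_for_user)

for server_id, batch in sorted(batches.items()):
    print(server_id, len(batch))
```

Grouping deliveries by chat server lets each server receive one batch instead of hundreds of separate queue writes, which matters more as group size grows.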

5. Database Design

Why Cassandra / HBase?

We generate 2 TB of append-heavy data per day. We rarely update or delete messages, but we constantly append them and read them chronologically. A wide-column store such as Cassandra (or HBase) is a common choice for this workload:

  • Optimized for massive write throughput.
  • Data partitions (shards) naturally by conversation_id, so reading one conversation touches a single partition.

Message Table Schema:

  • conversation_id (Partition Key - groups all messages for a chat on the same node)
  • message_id (Clustering Key - orders messages chronologically)
  • sender_id
  • content
  • created_at
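To illustrate how the partition key and clustering key shape reads, here is an in-memory model of the table in Python. It is not Cassandra client code: MessageTable, MessageRow, and the latest method are hypothetical names, and a sorted list per conversation stands in for a partition ordered by message_id.

```python
import bisect
from collections import defaultdict
from typing import NamedTuple


class MessageRow(NamedTuple):
    message_id: int   # clustering key; first field so rows sort by it
    sender_id: str
    content: str
    created_at: float


class MessageTable:
    """In-memory model: one sorted list of rows per conversation_id partition."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, conversation_id, row):
        # Keep each partition ordered by message_id, like a clustering column.
        bisect.insort(self._partitions[conversation_id], row)

    def latest(self, conversation_id, limit, before_id=None):
        """Newest-first page of one conversation, optionally older than before_id."""
        rows = self._partitions.get(conversation_id, [])
        if before_id is None:
            end = len(rows)
        else:
            end = bisect.bisect_left(rows, (before_id,))
        return rows[max(0, end - limit):end][::-1]


table = MessageTable()
for mid in (103, 101, 102, 104):
    table.insert("conv-1", MessageRow(mid, "alice", f"msg {mid}", 0.0))

page = table.latest("conv-1", limit=2)
older = table.latest("conv-1", limit=2, before_id=page[-1].message_id)
print([r.message_id for r in page], [r.message_id for r in older])
```

Scrolling up through history becomes a range scan within one partition, using the oldest message_id already shown as the cursor.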

TIP

Snowflake IDs for Messages: A wall-clock timestamp alone cannot serve as a unique, sortable ID, because many messages across many servers can share the same millisecond. We use a Snowflake ID (a 64-bit integer combining a timestamp, worker machine ID, and sequence number) so that message_id is globally unique and roughly sortable by time.
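A minimal sketch of a Snowflake-style generator follows. The bit split (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) follows the commonly cited Snowflake layout; the custom epoch value and the class name are arbitrary assumptions for illustration.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (assumption)
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = (1 << WORKER_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


def _now_ms():
    return int(time.time() * 1000)


class SnowflakeGenerator:
    def __init__(self, worker_id, clock=_now_ms):
        if not 0 <= worker_id <= MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp_part = (now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS)
            worker_part = self._worker_id << SEQUENCE_BITS
            return timestamp_part | worker_part | self._sequence


generator = SnowflakeGenerator(worker_id=7)
ids = [generator.next_id() for _ in range(5000)]
print(ids[0], ids[-1])
```

IDs from a single worker are strictly increasing. Across workers they are ordered only to the extent that the machines' clocks agree, which is why the ordering is approximate rather than exact.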


6. Managing Presence (Online/Offline)

Managing whether 500M users are online is notoriously challenging due to dropped connections and network flakiness.

  • Heartbeats: The client sends a "ping" over the WebSocket every 5 seconds.
  • Redis State: On each heartbeat, a Presence Service sets the Redis key user_status:{id} with an expiry (TTL) of 10 seconds.
  • Offline detection: If the WebSocket disconnects or heartbeats stop arriving, the key is no longer refreshed and expires within 10 seconds. The expiry triggers a Pub/Sub event that broadcasts "User Offline" to the user's friends.
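The heartbeat and TTL interplay can be sketched with an in-memory stand-in for Redis. This is not a Redis client: PresenceStore, FakeClock, and the sweep method are hypothetical, and the injectable clock exists only so the timeline can be stepped through deterministically.

```python
import time

HEARTBEAT_INTERVAL_S = 5  # client ping period from the text
PRESENCE_TTL_S = 10       # key lifetime from the text


class PresenceStore:
    """In-memory stand-in for Redis keys with a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires_at = {}  # user_id -> expiry time

    def heartbeat(self, user_id):
        # Similar in spirit to: SET user_status:{id} online EX 10
        self._expires_at[user_id] = self._clock() + PRESENCE_TTL_S

    def is_online(self, user_id):
        expires_at = self._expires_at.get(user_id)
        return expires_at is not None and self._clock() < expires_at

    def sweep(self):
        """Drop expired keys and return the users who just went offline."""
        now = self._clock()
        expired = [u for u, t in self._expires_at.items() if t <= now]
        for user_id in expired:
            del self._expires_at[user_id]
        return expired


class FakeClock:
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
presence = PresenceStore(clock)
presence.heartbeat("alice")                  # t=0, key expires at t=10
clock.now = float(HEARTBEAT_INTERVAL_S)
presence.heartbeat("alice")                  # t=5, expiry pushed to t=15
clock.now = 12.0
still_online = presence.is_online("alice")   # one missed ping is tolerated
clock.now = 15.0
went_offline = presence.sweep()              # these users get a "User Offline" event
print(still_online, went_offline)
```

A TTL of twice the heartbeat interval means a single lost ping does not flip a user to offline, at the cost of up to 10 seconds of stale "online" status.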

Released under the ISC License.