💬 System Design: WhatsApp / Chat Application ​
Real-time messaging at 2 billion user scale.
Step 1: Requirements ​
Functional ​
- 1:1 messaging
- Group chats (up to 256 members)
- Message delivery receipts (sent ✓, delivered ✓✓, read 🔵✓✓)
- Media sharing (images, video, voice)
- Last seen / online status
- End-to-end encryption
Non-Functional ​
- 2 billion users, 100B messages/day
- Delivery latency < 100ms
- High availability 99.99%
- No message loss
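The non-functional targets above translate into concrete numbers with a little arithmetic. The sketch below is a back-of-envelope estimate only; the peak factor is an assumed value for illustration and does not come from the requirements.

```javascript
// Back-of-envelope numbers derived from the stated requirements.
const MESSAGES_PER_DAY = 100e9; // 100B messages/day (from the requirements)
const USERS = 2e9; // 2 billion users (from the requirements)
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

const avgMessagesPerSecond = MESSAGES_PER_DAY / SECONDS_PER_DAY;
const messagesPerUserPerDay = MESSAGES_PER_DAY / USERS;

// ASSUMPTION (not from the source): peak traffic is roughly 3x the daily average.
const ASSUMED_PEAK_FACTOR = 3;
const peakMessagesPerSecond = avgMessagesPerSecond * ASSUMED_PEAK_FACTOR;

// 99.99% availability leaves 0.01% of the year as the downtime budget.
const downtimeMinutesPerYear = (1 - 0.9999) * 365 * 24 * 60;

console.log(Math.round(avgMessagesPerSecond)); // 1157407 (about 1.16M messages/sec)
console.log(messagesPerUserPerDay); // 50
console.log(Math.round(peakMessagesPerSecond)); // 3472222 under the assumed 3x peak
console.log(downtimeMinutesPerYear.toFixed(2)); // about 52.56 minutes per year
```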
Step 2: Core Protocol — WebSocket ​
```txt
Why WebSocket over HTTP polling?

HTTP Polling (bad):
  Client asks every 5 seconds: "Any new messages?"
  2B users × 1 req/5sec = 400M req/sec, mostly wasted on empty responses

WebSocket (good):
  Persistent TCP connection
  Server pushes messages instantly
  2B users × 1 connection = 2B open connections to maintain
```
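The polling-versus-WebSocket comparison can be checked numerically. In this rough sketch, the per-server connection capacity is an assumed figure chosen for illustration, and the server count is an upper bound that treats every user as online at the same time.

```javascript
// Polling cost vs. persistent-connection cost for the same user base.
const USERS = 2e9; // 2 billion users
const POLL_INTERVAL_SECONDS = 5; // each client polls every 5 seconds

// Polling: every user generates a request per interval, even with no new messages.
const pollingRequestsPerSecond = USERS / POLL_INTERVAL_SECONDS;

// ASSUMPTION: one connection server can hold about 1 million open sockets.
const ASSUMED_CONNECTIONS_PER_SERVER = 1e6;

// WebSocket: one idle socket per user; worst case is all users connected at once.
const connectionServersNeeded = Math.ceil(USERS / ASSUMED_CONNECTIONS_PER_SERVER);

console.log(pollingRequestsPerSecond); // 400000000
console.log(connectionServersNeeded); // 2000
```

The trade is request volume for connection state: polling burns CPU and bandwidth on empty responses, while WebSockets shift the cost to memory and file descriptors held per open socket.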
Step 3: High-Level Architecture ​
Example: Basic WebSocket Server for Chat Connections ​
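```javascript
const { WebSocketServer } = require("ws");
const Redis = require("ioredis");

const wss = new WebSocketServer({ port: 8080 });
const redis = new Redis();
const PRESENCE_TTL_SECONDS = 30;

// Keep track of connected users locally on this server instance
const activeConnections = new Map(); // userId -> ws

// SET ... EX (unlike EXPIRE) recreates the key if it has already expired
function markOnline(userId) {
  return redis
    .set(`presence:${userId}`, "online", "EX", PRESENCE_TTL_SECONDS)
    .catch((err) => console.error("Presence update failed:", err));
}

wss.on("connection", function connection(ws, req) {
  // 1. Authenticate and extract userId
  //    (extractUserId is an application-specific helper, not shown here)
  const userId = extractUserId(req);
  activeConnections.set(userId, ws);

  // 2. Update Presence in Redis (TTL-based heartbeat)
  //    Not awaited, so the handlers below are registered before any message arrives
  markOnline(userId);

  // 3. Listen for incoming messages
  ws.on("message", async function message(data) {
    try {
      const msg = JSON.parse(data.toString());
      await handleIncomingMessage(userId, msg);
    } catch (err) {
      console.error(`Failed to handle message from ${userId}:`, err);
    }
    // Refresh presence heartbeat
    markOnline(userId);
  });

  ws.on("close", () => {
    // Skip cleanup if the user has already reconnected on a newer socket
    if (activeConnections.get(userId) !== ws) return;
    activeConnections.delete(userId);
    // Explicitly mark offline instead of waiting for the Redis TTL to expire
    redis.del(`presence:${userId}`).catch(console.error);
  });
});
```
Step 4: Message Delivery Flow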
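Before the message reaches Kafka, it helps to picture what happens on the receiving side. The sketch below is a minimal in-memory model of one plausible delivery path: push over the WebSocket if the recipient is connected, otherwise queue the message and record a push notification. The names `deliverMessage`, `onReconnect`, `offlineInbox`, and `pushNotifications` are hypothetical, and a real system would persist the queue rather than hold it in process memory.

```javascript
// Hypothetical in-memory stand-ins for the real components.
const activeConnections = new Map(); // userId -> object with a send(payload) method
const offlineInbox = new Map(); // userId -> message events awaiting delivery
const pushNotifications = []; // pushes that would be handed to APNs/FCM

function toFrame(event) {
  return JSON.stringify({ type: "NEW_MESSAGE", message: event });
}

// Route one message event to its recipient.
function deliverMessage(event) {
  const socket = activeConnections.get(event.receiverId);
  if (socket) {
    socket.send(toFrame(event));
    return "PUSHED_OVER_WEBSOCKET";
  }
  // Recipient is offline: keep the message and wake the device via push.
  const queue = offlineInbox.get(event.receiverId) ?? [];
  queue.push(event);
  offlineInbox.set(event.receiverId, queue);
  pushNotifications.push({ userId: event.receiverId, messageId: event.messageId });
  return "QUEUED_FOR_LATER";
}

// When the recipient reconnects, drain everything queued while they were away.
function onReconnect(userId, socket) {
  activeConnections.set(userId, socket);
  const queued = offlineInbox.get(userId) ?? [];
  offlineInbox.delete(userId);
  for (const event of queued) socket.send(toFrame(event));
  return queued.length;
}

// Usage: Bob is offline for the first message and online for the second.
const bobSocket = { sent: [], send(payload) { this.sent.push(payload); } };
const first = deliverMessage({ messageId: "m1", senderId: "alice", receiverId: "bob" });
const drained = onReconnect("bob", bobSocket);
const second = deliverMessage({ messageId: "m2", senderId: "alice", receiverId: "bob" });
console.log(first, drained, second, bobSocket.sent.length);
```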
Example: Publishing Message to Kafka ​
```javascript
const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "chat-server", brokers: ["kafka1:9092"] });
const producer = kafka.producer();
let producerReady; // kafkajs requires connect() before send(); connect lazily, once

async function handleIncomingMessage(senderId, msgPayload) {
  const { receiverId, content, messageId } = msgPayload;

  // 1. Validate and prep the message payload
  if (receiverId == null || messageId == null) {
    throw new Error("Message payload must include receiverId and messageId");
  }
  const messageEvent = {
    messageId,
    senderId,
    receiverId,
    content, // In reality, this is an encrypted blob (Signal protocol)
    timestamp: Date.now(),
    status: "SENT",
  };

  // 2. Publish to Kafka topic
  //    Keying by receiverId sends all messages FOR a user to the same partition,
  //    which preserves their relative order
  producerReady ??= producer.connect();
  await producerReady;
  await producer.send({
    topic: "chat-messages",
    messages: [
      { key: String(receiverId), value: JSON.stringify(messageEvent) },
    ],
  });

  // Note: A separate consumer writes these sequentially to Cassandra
}
```
Step 5: Message Delivery Receipts
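Receipts can arrive out of order or more than once, for example a READ receipt overtaking a DELIVERED one after a reconnect. One way to keep the displayed tick state consistent is to treat the status as a value that only moves forward. The following is a small sketch of that idea; `applyReceipt` and `STATUS_RANK` are illustrative names, and the ranks mirror the sent/delivered/read ordering used elsewhere in this design.

```javascript
// Status ordering: a message can only move forward through these states.
const STATUS_RANK = new Map([
  ["SENT", 1], // ✓
  ["DELIVERED", 2], // ✓✓
  ["READ", 3], // 🔵✓✓
]);

// Returns the status to store after a receipt arrives.
// Stale or duplicate receipts leave the current status unchanged.
function applyReceipt(currentStatus, incomingStatus) {
  if (!STATUS_RANK.has(currentStatus) || !STATUS_RANK.has(incomingStatus)) {
    throw new Error(`Unknown status: ${currentStatus} -> ${incomingStatus}`);
  }
  return STATUS_RANK.get(incomingStatus) > STATUS_RANK.get(currentStatus)
    ? incomingStatus
    : currentStatus;
}

console.log(applyReceipt("SENT", "DELIVERED")); // DELIVERED
console.log(applyReceipt("READ", "DELIVERED")); // READ (late receipt is ignored)
console.log(applyReceipt("SENT", "READ")); // READ (DELIVERED was skipped or lost)
```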
Example: Client-side Receipt Handling ​
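```javascript
// On Bob's device (Browser or Mobile Client)
const ws = new WebSocket("wss://chat.whatsapp.com");

// send() throws while the socket is still connecting, so only send when it is open.
// A real client would queue unsent receipts and retry after reconnecting.
function sendReceipt(messageId, status, senderId) {
  if (ws.readyState !== WebSocket.OPEN) return;
  ws.send(JSON.stringify({ type: "RECEIPT", messageId, status, senderId }));
}

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "NEW_MESSAGE") {
    // 1. Message arrived on device
    //    (displayMessageInBackground is an app-specific UI helper, not shown here)
    displayMessageInBackground(data.message);
    // 2. Immediately send DELIVERED receipt back to server
    //    (messageId is the field name used in the server-side message event)
    sendReceipt(data.message.messageId, "DELIVERED", data.message.senderId);
  }
};

// When Bob actually opens the chat screen
function onChatOpened(activeChatId, unreadMessageIds) {
  unreadMessageIds.forEach((msgId) => {
    // 3. Send READ receipt
    //    (in a 1:1 chat, the chat ID here identifies the original sender)
    sendReceipt(msgId, "READ", activeChatId);
  });
}
```
Step 6: Group Messages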
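A common way to deliver a group message is fan-out on write: the server expands one incoming message into one copy per recipient and routes each copy like a 1:1 message. The sketch below assumes that approach; the source does not specify it. `fanOutGroupMessage` is a hypothetical helper that returns Kafka-style records keyed by recipient.

```javascript
// Expand one group message into per-recipient records (fan-out on write).
// The sender is skipped, so a full 256-member group yields 255 records.
function fanOutGroupMessage(message, memberIds) {
  return memberIds
    .filter((memberId) => memberId !== message.senderId)
    .map((receiverId) => ({
      key: String(receiverId), // same keying as 1:1 messages: one partition per recipient
      value: JSON.stringify({ ...message, receiverId, status: "SENT" }),
    }));
}

// Usage: a 256-member group where member 1 sends a message.
const members = Array.from({ length: 256 }, (_, i) => i + 1);
const records = fanOutGroupMessage(
  { messageId: "g-42", groupId: "family", senderId: 1, content: "<encrypted blob>" },
  members
);
console.log(records.length); // 255
console.log(records[0].key); // 2
```

With a 256-member cap the write amplification stays bounded, which is what makes per-recipient copies practical here; very large groups or channels usually need a different strategy.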
Example: Group Message Receipt Processing ​
```javascript
// Group Service handling incoming receipts for a group message
async function processGroupReceipt(messageId, groupId, memberId, newStatus) {
  // 1. Update this specific member's status in DB
  await updateMemberReceiptStatus(messageId, memberId, newStatus);

  // 2. Fetch all members' statuses for this message
  //    Example: { "member1": "READ", "member2": "DELIVERED", ... }
  //    Assumes one entry per member was created when the message was sent
  const statuses = Object.values(await getMessageStatuses(messageId));
  const totalMembers = statuses.length; // e.g., 256
  if (totalMembers === 0) return; // nothing to aggregate yet

  // 3. Check aggregate status (READ also counts as DELIVERED)
  const readCount = statuses.filter((s) => s === "READ").length;
  const deliveredCount = statuses.filter(
    (s) => s === "DELIVERED" || s === "READ"
  ).length;

  // 4. If EVERYONE has hit the milestone, emit an update to the sender
  if (readCount === totalMembers) {
    emitGroupStatusToSender(messageId, "READ"); // Upgrade to 🔵✓✓
  } else if (deliveredCount === totalMembers) {
    emitGroupStatusToSender(messageId, "DELIVERED"); // Upgrade to ✓✓
  }
}
```
Step 7: Database Schema
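The messages table in this step uses `chat_id` as the partition key and `message_id` as a descending clustering column, so "load the latest N messages of a chat" reads one partition from the top. The toy in-memory model below illustrates that access pattern only; `MessageStore` is a hypothetical class, and plain numbers stand in for TIMEUUID values.

```javascript
// Toy model of a partitioned, clustered table: one sorted list per chat.
class MessageStore {
  constructor() {
    this.partitions = new Map(); // chatId -> rows kept sorted by messageId, newest first
  }

  // Insert a row at its sorted position (mirrors CLUSTERING ORDER BY message_id DESC).
  append(chatId, row) {
    const rows = this.partitions.get(chatId) ?? [];
    const index = rows.findIndex((existing) => existing.messageId < row.messageId);
    if (index === -1) rows.push(row);
    else rows.splice(index, 0, row);
    this.partitions.set(chatId, rows);
  }

  // Reading the newest messages touches a single partition and needs no sort.
  latest(chatId, limit) {
    return (this.partitions.get(chatId) ?? []).slice(0, limit);
  }
}

// Usage: rows arrive out of order but are read back newest first.
const store = new MessageStore();
store.append("chat-a", { messageId: 1, content: "first" });
store.append("chat-a", { messageId: 3, content: "third" });
store.append("chat-a", { messageId: 2, content: "second" });
store.append("chat-b", { messageId: 9, content: "other chat" });

const latestIds = store.latest("chat-a", 2).map((row) => row.messageId);
console.log(latestIds); // [ 3, 2 ]
```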
```sql
-- Messages (Cassandra: append-only, high write volume)
CREATE TABLE messages (
  chat_id    UUID,
  message_id TIMEUUID,   -- sortable by time
  sender_id  BIGINT,
  content    TEXT,
  media_url  TEXT,
  status     TINYINT,    -- 1=sent, 2=delivered, 3=read
  created_at TIMESTAMP,
  PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- User presence (Redis: fast, TTL-based)
SET presence:{user_id} "online" EX 30  -- Expires in 30s if no heartbeat

-- Push tokens (MySQL)
CREATE TABLE push_tokens (
  user_id  BIGINT,
  token    VARCHAR(255),
  platform ENUM('ios', 'android'),
  PRIMARY KEY (user_id)
);
```
Step 8: End-to-End Encryption
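To make the "new key for each message" idea in this step concrete, here is a simplified sketch of a symmetric-key ratchet built on Node's built-in `crypto` module. It is not the full Double Ratchet, which also mixes in fresh Diffie-Hellman outputs, and it assumes both sides already hold the same shared secret. `ratchetStep`, `encrypt`, and `decrypt` are illustrative helper names.

```javascript
const crypto = require("crypto");

// One ratchet step: derive a one-time message key and the next chain key.
// The old chain key is discarded, so a leaked message key reveals nothing earlier.
function ratchetStep(chainKey) {
  const messageKey = crypto.createHmac("sha256", chainKey).update(Buffer.from([0x01])).digest();
  const nextChainKey = crypto.createHmac("sha256", chainKey).update(Buffer.from([0x02])).digest();
  return { messageKey, nextChainKey };
}

// Encrypt with AES-256-GCM using a fresh random 96-bit nonce.
function encrypt(messageKey, plaintext) {
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv("aes-256-gcm", messageKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

// Decrypt and verify the authentication tag; throws if the key or data is wrong.
function decrypt(messageKey, { iv, ciphertext, tag }) {
  const decipher = crypto.createDecipheriv("aes-256-gcm", messageKey, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// ASSUMPTION: Alice and Bob already agreed on this secret via key exchange.
const sharedSecret = crypto.randomBytes(32);
let aliceChainKey = sharedSecret;
let bobChainKey = sharedSecret;

// Alice advances her chain and encrypts; the server only ever sees `blob`.
const aliceStep = ratchetStep(aliceChainKey);
aliceChainKey = aliceStep.nextChainKey;
const blob = encrypt(aliceStep.messageKey, "hello Bob");

// Bob advances his chain in lockstep and derives the same message key.
const bobStep = ratchetStep(bobChainKey);
bobChainKey = bobStep.nextChainKey;
const decrypted = decrypt(bobStep.messageKey, blob);
console.log(decrypted); // hello Bob
```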
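```txt
WhatsApp uses the Signal Protocol:

Key Exchange:
  Alice and Bob exchange public keys via the server
  Server NEVER sees private keys

Encryption:
  Alice and Bob each derive the same shared secret from their own
  private keys and the other's public keys (Diffie-Hellman key agreement)
  Messages are encrypted with symmetric keys derived from that secret,
  so only the two endpoints can decrypt them

Double Ratchet Algorithm:
  New encryption key derived for each message
  A compromised key does not expose earlier messages (forward secrecy),
  and later messages become safe again once the ratchet advances

Result: Server stores ENCRYPTED blobs and cannot read messages
```
📊 Summary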
| Component | Technology |
|---|---|
| Client-Server | WebSocket (persistent) |
| Message Storage | Cassandra (chat_id as partition key) |
| Message Queue | Kafka |
| Presence | Redis (TTL-based) |
| Push Notifications | APNS (iOS), FCM (Android) |
| Media Storage | S3 + CDN |
| Encryption | Signal Protocol (E2E) |
Key insight: The hardest parts are maintaining billions of WebSocket connections and fanning out group messages. WhatsApp has publicly reported holding more than 1 million concurrent connections on a single Erlang-based server.
