Enterprise Notification System: Scaling to 1 Billion/Day
Designing a notification system that handles 1 billion messages per day is an exercise in extreme reliability and scalability. At this scale, even a 0.01% failure rate means 100,000 notifications fail to reach their recipients every day.
This guide explores the architecture of a high-performance, multi-channel (SMS, Email, Push) notification system used by enterprise giants like Uber, Airbnb, and Amazon.
1. Requirements & Goals
Functional Requirements
- Support Multi-channel: Push notifications (iOS/Android), SMS, and Email.
- Priority Handling: Critical alerts (2FA) must be delivered faster than marketing messages.
- Templating: Dynamic content hydration (e.g., "Hello [Name]").
- Analytics: Track "Sent", "Delivered", and "Opened" states.
- User Preferences: Support opt-outs and channel preferences.
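To make the templating requirement concrete, here is a minimal sketch of placeholder hydration using the `[Name]` bracket syntax from the example above. The function name `renderTemplate` and the choice to leave unknown placeholders untouched are assumptions for illustration, not part of any particular templating library.

```typescript
// Replaces [Key] placeholders with values from `values`.
// Placeholders with no matching key are left as-is so missing data is visible.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\[(\w+)\]/g, (placeholder: string, key: string) =>
    Object.prototype.hasOwnProperty.call(values, key) ? values[key] : placeholder,
  );
}

console.log(renderTemplate("Hello [Name]", { Name: "Ada" }));
```

A production system would likely also escape values per channel (HTML for email, length limits for SMS), which this sketch does not attempt.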
Non-Functional Requirements
- High Availability: The system must always be operational, with no single point of failure.
- Scalability: Handle 1 billion notifications per day with peaks.
- Durability: No notification should be lost (zero data loss).
- Latency: Critical notifications must be delivered in under 5 seconds.
2. Scale Estimation
Let's do the math for 1 billion notifications per day:
- Daily Volume: 1,000,000,000 (10^9) notifications.
- Average Throughput (TPS): 1,000,000,000 / 86,400 s ≈ 11,574, or roughly 12,000 messages per second.
- Peak Throughput (P99): Marketing "blasts" or news alerts can spike traffic 10x. We must design for ~120,000 TPS.
- Data Storage: Assuming 1KB per metadata log:
  - 1,000,000,000 × 1KB = 1TB of logs per day.
  - Retention for 30 days = 30TB (Requires a distributed DB like Cassandra or BigTable).
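The estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the constant names are made up for illustration, and 1KB is treated as 1,000 bytes so that the totals come out in decimal terabytes.

```typescript
// Capacity math for 1 billion notifications per day.
const DAILY_VOLUME = 1_000_000_000;   // notifications per day
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400
const PEAK_MULTIPLIER = 10;           // burst factor from the text
const BYTES_PER_LOG = 1_000;          // 1KB of metadata per notification (decimal)
const RETENTION_DAYS = 30;

const averageTps = DAILY_VOLUME / SECONDS_PER_DAY;        // about 11,574
const peakTps = averageTps * PEAK_MULTIPLIER;             // about 115,741
const dailyLogTb = (DAILY_VOLUME * BYTES_PER_LOG) / 1e12; // 1 TB
const retainedLogTb = dailyLogTb * RETENTION_DAYS;        // 30 TB

console.log({
  averageTps: Math.round(averageTps),
  peakTps: Math.round(peakTps),
  dailyLogTb,
  retainedLogTb,
});
```

The exact peak works out to about 115,741 TPS; rounding up to ~120,000 leaves a small amount of headroom.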
3. High-Level Architecture
We use an Asynchronous, Event-Driven Architecture to decouple ingestion from delivery providers.
4. Technical Deep Dive
A. Rate Limiting & Prioritization
We can't send 1B messages at once without getting blacklisted by providers.
- Per-User Limits: Prevent spamming a single user (e.g., max 5 marketing emails/day).
- Prioritization: Use separate Kafka Topics for different priorities:
  - `topic.priority.high`: 2FA, OTP, Transaction alerts.
  - `topic.priority.low`: Weekly newsletters, promotional offers.
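As a rough illustration of the two ideas above, the sketch below routes a request to one of the two topics and enforces a per-user daily cap. The category names, the `DailyUserLimiter` class, and the in-memory `Map` are assumptions made for the example; a real deployment would more likely keep the counters in a shared store such as Redis with a TTL, since this version never evicts old keys.

```typescript
type Category = "2FA" | "OTP" | "TRANSACTION" | "NEWSLETTER" | "PROMOTION";

const HIGH_PRIORITY = new Set<string>(["2FA", "OTP", "TRANSACTION"]);

// Maps a request category to the Kafka topic named in the text.
function topicFor(category: Category): string {
  return HIGH_PRIORITY.has(category) ? "topic.priority.high" : "topic.priority.low";
}

// Fixed-window limiter: at most `limit` sends per user per UTC calendar day.
class DailyUserLimiter {
  private counts = new Map<string, number>();

  constructor(private readonly limit: number) {}

  // Returns true and records the send if the user is still under the limit.
  tryAcquire(userId: string, now: Date = new Date()): boolean {
    const day = now.toISOString().slice(0, 10); // e.g. "2024-01-01"
    const key = `${userId}:${day}`;
    const used = this.counts.get(key) ?? 0;
    if (used >= this.limit) return false;
    this.counts.set(key, used + 1);
    return true;
  }
}

// Usage: cap marketing email at 5 per user per day.
const marketingLimiter = new DailyUserLimiter(5);
console.log(topicFor("OTP"), marketingLimiter.tryAcquire("user-42"));
```

Keeping high-priority traffic on its own topic means a marketing blast can back up `topic.priority.low` without delaying OTPs.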
B. Idempotency (The "Exactly-Once" Problem)
In distributed systems, retries are inevitable. To avoid sending the same SMS twice:
- Generate a unique `notification_id` for every request.
- Workers check a high-speed cache (Redis) before sending: `SET notification_id SENT NX EX 86400` (`EX` takes seconds, so 86400 is 24 hours).
- If the key already exists, skip the send.
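The check-then-send flow can be sketched without a Redis instance by modelling the `SET ... NX EX` semantics in memory. `TtlKeyStore` and `sendOnce` are hypothetical names for this illustration; in production the store would be Redis so that all workers share the same keys.

```typescript
// In-memory stand-in for Redis `SET key value NX EX seconds`.
class TtlKeyStore {
  private expiries = new Map<string, number>(); // key -> expiry time in ms

  // Returns true if the key was newly set, false if it exists and has not expired.
  setIfAbsent(key: string, ttlSeconds: number, nowMs: number = Date.now()): boolean {
    const expiresAt = this.expiries.get(key);
    if (expiresAt !== undefined && expiresAt > nowMs) return false;
    this.expiries.set(key, nowMs + ttlSeconds * 1000);
    return true;
  }
}

const ONE_DAY_SECONDS = 24 * 60 * 60;

// Calls `send` only the first time a notification id is seen within the TTL.
// Returns true if the send was performed, false if it was skipped as a duplicate.
function sendOnce(
  store: TtlKeyStore,
  notificationId: string,
  send: () => void,
  nowMs: number = Date.now(),
): boolean {
  if (!store.setIfAbsent(`notif:${notificationId}`, ONE_DAY_SECONDS, nowMs)) return false;
  send();
  return true;
}
```

Note that claiming the key before sending gives at-most-once behavior: if the send then fails, the retry path has to clear the key or the message is silently dropped.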
C. Worker Cluster & Circuit Breaker
Third-party APIs (Twilio, SendGrid) can fail or slow down.
- Circuit Breaker Pattern: If Twilio returns 500s, the circuit "opens" and the worker either retries later or switches to a backup provider (e.g., MessageBird).
- Exponential Backoff: If a send fails, retry in 1s, 2s, 4s, 8s... don't hammer the API.
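The backoff schedule above (1s, 2s, 4s, 8s...) can be sketched as a small retry helper. The names `backoffDelayMs` and `retryWithBackoff`, the 60-second cap, and the injectable `sleep` parameter are assumptions for this example rather than any library's API.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `operation` up to `maxAttempts` times, sleeping between failures.
// Rethrows the last error if every attempt fails.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // No point sleeping after the final attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

In practice a random jitter is usually added to each delay so that many workers retrying at once do not hit the provider in lockstep.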
5. Implementation Example: Delivery Worker
This example shows a simplified worker that consumes a notification task and uses a Circuit Breaker for reliability.
import CircuitBreaker from "opossum"; // Circuit breaker library

// Assumed to be initialized elsewhere in the service (not shown here):
// `redis` (an ioredis-style client), `db` (tracking database), `dlq` (dead letter queue producer).

interface NotificationTask {
  id: string;
  type: "SMS" | "EMAIL" | "PUSH";
  recipient: string;
  content: string;
}

class NotificationWorker {
  private breaker: CircuitBreaker<[NotificationTask], void>;

  constructor() {
    // Configure the circuit breaker; bind so `this` is preserved inside sendToProvider
    this.breaker = new CircuitBreaker(this.sendToProvider.bind(this), {
      timeout: 3000, // If provider takes > 3s, fail
      errorThresholdPercentage: 50, // Open circuit if 50% of calls fail
      resetTimeout: 30000, // Wait 30s before trying again
    });
  }

  // Actual API call to provider (e.g., Twilio)
  async sendToProvider(task: NotificationTask): Promise<void> {
    console.log(`Sending ${task.type} to ${task.recipient}...`);
    // Simulated API Call
    // await axios.post('https://api.provider.com/send', task);
  }

  async processTask(task: NotificationTask): Promise<void> {
    try {
      // 1. Check idempotency in Redis: SET ... NX returns "OK" only if the key was new
      const isFirstAttempt = await redis.set(task.id, "SENT", "EX", 86400, "NX");
      if (!isFirstAttempt) return; // Key already exists: duplicate, skip the send

      // 2. Execute via Circuit Breaker
      await this.breaker.fire(task);

      // 3. Log Success
      await db.logs.update(task.id, { status: "SENT" });
    } catch (error) {
      // opossum rejects with code "EOPENBREAKER" while the circuit is open
      if ((error as { code?: string }).code === "EOPENBREAKER") {
        console.error("Provider is down. Retrying through backup...");
        // Logic to route to backup provider
      } else {
        // Move to Dead Letter Queue (DLQ) for manual inspection.
        // The idempotency key stays set, so a replay from the DLQ must clear it first.
        await dlq.push(task);
      }
    }
  }
}
6. Feedback Loop (Webhooks)
Providers usually confirm delivery via Webhooks.
- Our system must expose a `/webhook/[provider]` endpoint.
- Webhooks update the status in the Tracking Database.
- If status is `BOUNCED`, we mark the email as invalid to protect our reputation score.
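The feedback loop can be sketched as a handler that records each status and maintains a suppression list for bounced addresses. The `WebhookEvent` shape here is a hypothetical normalized form; real provider payloads differ and would need per-provider parsing before reaching this point. The `Map` and `Set` stand in for the Tracking Database.

```typescript
type DeliveryStatus = "SENT" | "DELIVERED" | "OPENED" | "BOUNCED";

// Hypothetical normalized event, produced after parsing a provider-specific payload.
interface WebhookEvent {
  notificationId: string;
  recipient: string;
  status: DeliveryStatus;
}

class DeliveryTracker {
  readonly statuses = new Map<string, DeliveryStatus>(); // stand-in for the Tracking Database
  readonly invalidRecipients = new Set<string>();        // suppression list

  // Records the latest status and suppresses recipients that bounced.
  handleWebhook(event: WebhookEvent): void {
    this.statuses.set(event.notificationId, event.status);
    if (event.status === "BOUNCED") {
      this.invalidRecipients.add(event.recipient);
    }
  }

  // Workers would call this before sending to skip known-bad addresses.
  canSendTo(recipient: string): boolean {
    return !this.invalidRecipients.has(recipient);
  }
}
```

Checking `canSendTo` before every send is what keeps repeated bounces from eroding sender reputation.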
Key Metrics to Monitor
- Delivery Success Rate: % of messages delivered.
- Latency: Time from Ingestion to Delivery.
- Provider Health: Error rates per provider.
- Queue Backup: Number of messages waiting in Kafka.
7. Enterprise Considerations: Cost & Compliance
Cost Optimization
At 1 billion notifications, cost is as much an architecture problem as latency.
- Deduplication: Use Redis to prevent sending redundant marketing messages (Saves 5-10% in costs).
- Least-Cost Routing: Dynamically switch between Twilio, Plivo, and AWS SNS based on current per-country pricing.
- Tiered Storage: Move logs from the hot store (e.g., Cassandra or ClickHouse) to cold storage (S3/GCS) after 7 days to reduce storage costs.
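Least-cost routing reduces to a lookup over a price table, skipping providers that are unhealthy or do not serve the destination. The sketch below assumes a simple in-memory table; the prices in `examplePrices` are invented for illustration and are not real quotes from any provider.

```typescript
// Per-message SMS price in USD, keyed by provider, then ISO country code.
type PriceTable = Record<string, Record<string, number>>;

// Illustrative numbers only, not real pricing.
const examplePrices: PriceTable = {
  twilio: { US: 0.0079, IN: 0.004 },
  plivo: { US: 0.0055, IN: 0.0045 },
  sns: { US: 0.0065 },
};

// Returns the cheapest healthy provider serving `country`, or null if none does.
function cheapestProvider(
  prices: PriceTable,
  country: string,
  unhealthy: Set<string> = new Set<string>(),
): string | null {
  let best: string | null = null;
  let bestPrice = Infinity;
  for (const provider of Object.keys(prices)) {
    const price = prices[provider][country];
    if (price === undefined || unhealthy.has(provider)) continue;
    if (price < bestPrice) {
      best = provider;
      bestPrice = price;
    }
  }
  return best;
}

console.log(cheapestProvider(examplePrices, "US"));
```

Passing the set of providers whose circuit breakers are open as `unhealthy` lets the same function serve both cost routing and failover.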
Compliance (GDPR, TCPA, CCPA)
- Opt-Out Management: A centralized service must store "Do Not Disturb" lists.
- Data Anonymization: PII (Phone numbers, Emails) in logs should be masked or encrypted at rest.
- Rate Limits for Compliance: TCPA limits marketing calls and texts to between 8 AM and 9 PM in the recipient's local time, so quiet hours run from 9 PM to 8 AM. The Prioritization Service must buffer these messages during quiet hours.
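A quiet-hours check needs the recipient's local hour, not the server's. The sketch below derives it from an IANA time zone name using the built-in `toLocaleString`; the function names and the `QuietHours` shape are assumptions for this example, and the window itself is passed in as a parameter so it can follow whatever legal guidance applies.

```typescript
interface QuietHours {
  startHour: number; // inclusive, 0-23, recipient-local
  endHour: number;   // exclusive, 0-23, recipient-local
}

// Local hour (0-23) of `instant` in an IANA time zone such as "America/New_York".
function localHour(instant: Date, timeZone: string): number {
  const text = instant.toLocaleString("en-US", { timeZone, hour: "numeric", hour12: false });
  // Some runtimes render midnight as "24" with hour12: false, so normalize.
  return parseInt(text, 10) % 24;
}

// True if `hour` falls inside the window, including windows that wrap past midnight.
function isQuietHour(hour: number, window: QuietHours): boolean {
  return window.startHour <= window.endHour
    ? hour >= window.startHour && hour < window.endHour
    : hour >= window.startHour || hour < window.endHour;
}

// True if a marketing SMS should be held back until quiet hours end.
function shouldBufferMarketingSms(instant: Date, timeZone: string, window: QuietHours): boolean {
  return isQuietHour(localHour(instant, timeZone), window);
}

// Usage with an assumed 21:00 to 08:00 window.
const assumedWindow: QuietHours = { startHour: 21, endHour: 8 };
console.log(shouldBufferMarketingSms(new Date(), "America/New_York", assumedWindow));
```

High-priority messages such as OTPs would bypass this check; only traffic on the low-priority path needs to be buffered.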
Appendix: Provider Comparison
Choosing the right provider is critical for both cost and deliverability at scale. Below is a comparison of industry leaders for 2024-2026.
1. SMS Providers
| Provider | Key Strengths | Best For | Regional Reach | Pricing |
|---|---|---|---|---|
| Twilio | Best API & Docs, Global reach, highly reliable. | Scale (1B+), complex logic. | Global (Excellent) | $$$ |
| Plivo | Cheaper than Twilio, simple API, solid high volume. | Cost-conscious high volume. | Global (Good) | $$ |
| AWS SNS | Deep AWS integration, extremely low overhead. | Basic alerts, existing AWS apps. | Global (Standard) | $ |
| Vonage | Reliable international routing, high security. | Enterprise-grade global SMS. | Global (Excellent) | $$ |
2. Email Providers
| Provider | Key Strengths | Best For | Deliverability | Pricing |
|---|---|---|---|---|
| Amazon SES | Lowest cost per email ($0.10/1k), scales infinitely. | Bulk marketing, system alerts. | High (Requires warmup) | $ |
| Postmark | Fastest delivery, best-in-class transactional focus. | Password resets, OTPs. | Very High | $$$ |
| SendGrid | Excellent templates, robust marketing automation. | Mixed Marketing & Transactional. | High | $$ |
| Mailgun | Powerful inbound parsing, strong developer API. | Dynamic apps, dev-heavy teams. | High | $$ |
3. Push Notification Providers
| Provider | Key Strengths | Best For | Platform Support | Pricing |
|---|---|---|---|---|
| FCM | Free, native Android integration, cross-platform. | Most mobile apps, 2FA push. | Android, iOS, Web | Free / Paygo |
| OneSignal | Easy setup, advanced user segmentation. | Marketing-heavy engagement. | All platforms | Freemium |
| Airship | Enterprise orchestration, real-time analytics. | Large-scale high-end mobile apps. | Mobile, Web, Wallet | $$$ |
8. Interview Tips: Common Pitfalls & Gotchas
When designing this system in an interview, be sure to address these common interviewer follow-ups:
- "What if the Queue is full?": Discuss Backpressure. Tell the ingestion service to slow down or return HTTP 429 to client services.
- "How to handle duplicate notifications?": Emphasize Idempotency. Explain the Redis `SET NX` approach.
- "How do you handle provider outages?": Mention Circuit Breakers and Failover Routing (e.g., if SNS is down, use Twilio).
- "How to avoid being marked as Spam?": Discuss Deliverability. Handle `BOUNCES` and `UNSUBSCRIBES` immediately (Feedback Loop).
- "What about security?": Discuss encryption of PII at rest and API key rotation for third-party providers.
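For the backpressure answer, it can help to show how a full queue turns into an HTTP 429. This is a minimal sketch with a bounded in-memory buffer standing in for the real queue; `BoundedQueue` and `ingest` are hypothetical names, and 202 is assumed as the success status because the work is accepted for asynchronous delivery.

```typescript
// Bounded in-memory buffer standing in for the ingestion queue.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private readonly capacity: number) {}

  // Returns false instead of growing past capacity.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  // Removes and returns the oldest item, or undefined if empty.
  poll(): T | undefined {
    return this.items.shift();
  }
}

// Returns the HTTP status the ingestion API would send back to the caller.
function ingest<T>(queue: BoundedQueue<T>, request: T): 202 | 429 {
  return queue.offer(request) ? 202 : 429;
}
```

Clients that receive 429 are expected to retry later, which pushes the buffering back to the callers instead of letting the queue grow without bound.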
Summary
Building for 1 billion notifications isn't just about sending messages; it's about defensive engineering. By using distributed queues, circuit breakers, and aggressive idempotency checks, you can ensure that your system stays resilient even when third-party providers fail.
