
Enterprise Notification System: Scaling to 1 Billion/Day

Designing a notification system that handles 1 billion messages per day is an exercise in extreme reliability and scalability. At this scale, even a 0.01% failure rate means 100,000 notifications fail to reach their recipients every day.

This guide explores the architecture of a high-performance, multi-channel (SMS, Email, Push) notification system of the kind operated by large companies such as Uber, Airbnb, and Amazon.


1. Requirements & Goals

Functional Requirements

  • Support Multi-channel: Push notifications (iOS/Android), SMS, and Email.
  • Priority Handling: Critical alerts (2FA) must be delivered faster than marketing messages.
  • Templating: Dynamic content hydration (e.g., "Hello [Name]").
  • Analytics: Track "Sent", "Delivered", and "Opened" states.
  • User Preferences: Support opt-outs and channel preferences.
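
As a minimal sketch of the templating requirement, the function below fills `[Name]`-style placeholders from a map of variables. The `hydrate` name and the choice to leave unknown placeholders untouched are assumptions made for illustration, not part of any particular template engine.

```typescript
// Replace [Placeholder] tokens with values from `vars`.
// Unknown placeholders are left as-is so missing data is easy to spot.
function hydrate(template: string, vars: ReadonlyMap<string, string>): string {
  return template.replace(
    /\[(\w+)\]/g,
    (match: string, key: string) => vars.get(key) ?? match,
  );
}

const greeting = hydrate(
  "Hello [Name], your code is [Code]",
  new Map([
    ["Name", "Ada"],
    ["Code", "123456"],
  ]),
);
console.log(greeting); // Hello Ada, your code is 123456
```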

Non-Functional Requirements

  • High Availability: The system must remain operational at all times, including during provider outages.
  • Scalability: Handle 1 billion notifications per day with peaks.
  • Durability: No notification should be lost (zero data loss).
  • Latency: Critical notifications delivered within 5 seconds.

2. Scale Estimation

Let's do the math for 1 billion notifications per day:

  • Daily Volume: $10^9$ notifications.
  • Average Throughput (TPS): $10^9 / 86{,}400 \approx 11{,}574$ messages per second.
  • Peak Throughput (P99): Marketing "blasts" or news alerts can spike traffic 10x, to roughly 116,000 TPS. We must design for ~120,000 TPS.
  • Data Storage: Assuming 1KB per metadata log:
    • $10^9 \times 1\,\text{KB} = 1\,\text{TB}$ of logs per day.
    • Retention for 30 days = 30TB (Requires a distributed DB like Cassandra or BigTable).
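
The arithmetic above can be checked with a few lines of TypeScript. The variable names are illustrative, and 1 KB is taken as 1,000 bytes (decimal units), which is an assumption about the estimate.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400
const dailyVolume = 1_000_000_000; // 1 billion notifications per day

const avgTps = dailyVolume / SECONDS_PER_DAY; // about 11,574 messages per second
const peakTps = avgTps * 10; // about 115,741, rounded up to ~120,000 for design

const bytesPerLog = 1_000; // 1 KB per metadata log (decimal)
const dailyLogTB = (dailyVolume * bytesPerLog) / 1e12; // 1 TB per day
const retentionTB = dailyLogTB * 30; // 30 TB for 30 days

console.log({
  avgTps: Math.round(avgTps),
  peakTps: Math.round(peakTps),
  dailyLogTB,
  retentionTB,
});
```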

3. High-Level Architecture

We use an Asynchronous, Event-Driven Architecture to decouple ingestion from delivery providers.


4. Technical Deep Dive

A. Rate Limiting & Prioritization

We can't send 1B messages at once without getting blacklisted by providers.

  1. Per-User Limits: Prevent spamming a single user (e.g., max 5 marketing emails/day).
  2. Prioritization: Use separate Kafka Topics for different priorities:
    • topic.priority.high: 2FA, OTP, Transaction alerts.
    • topic.priority.low: Weekly newsletters, promotional offers.
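
The two ideas above can be sketched as follows. The topic names come from the list; the `Category` values, the `PerUserLimiter` class, and the fixed-window counting strategy are assumptions chosen for brevity. A production system would more likely keep these counters in a shared store than in process memory.

```typescript
type Category = "2FA" | "OTP" | "TRANSACTION" | "NEWSLETTER" | "PROMOTION";

// Map a message category to the Kafka topic for its priority.
function topicFor(category: Category): string {
  switch (category) {
    case "2FA":
    case "OTP":
    case "TRANSACTION":
      return "topic.priority.high";
    default:
      return "topic.priority.low";
  }
}

// Fixed-window counter: at most `limit` sends per user per window.
class PerUserLimiter {
  private readonly counts = new Map<string, { windowStart: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the send is allowed and counts it; false if the user is over the limit.
  tryAcquire(userId: string, now: number = Date.now()): boolean {
    const entry = this.counts.get(userId);
    if (entry === undefined || now - entry.windowStart >= this.windowMs) {
      this.counts.set(userId, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Example: max 5 marketing emails per user per day.
const DAY_MS = 24 * 60 * 60 * 1000;
const marketingLimiter = new PerUserLimiter(5, DAY_MS);
```

Only low-priority traffic would normally pass through `marketingLimiter`; 2FA and OTP messages should not be throttled by a marketing cap.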

B. Idempotency (The "Exactly-Once" Problem)

In distributed systems, retries are inevitable. To avoid sending the same SMS twice:

  • Generate a unique notification_id for every request.
  • Workers check a high-speed cache (Redis) before sending: SET notification_id SENT NX EX 86400 (EX takes seconds, so 86400 is 24 hours).
  • If the key exists, skip the send.
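
To make the check concrete, here is a small sketch that uses an in-memory map as a stand-in for Redis. The `IdempotencyStore` class and `sendOnce` helper are hypothetical names; `setIfAbsent` only mimics the set-if-absent-with-expiry behavior of the Redis command and is not a Redis client.

```typescript
// In-memory stand-in for Redis "SET key value NX EX seconds" (illustration only).
class IdempotencyStore {
  private readonly expiresAt = new Map<string, number>();

  // Returns true if the key was newly set, false if it exists and has not expired.
  setIfAbsent(key: string, ttlSeconds: number, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(key);
    if (expiry !== undefined && expiry > now) return false;
    this.expiresAt.set(key, now + ttlSeconds * 1000);
    return true;
  }
}

// Calls `send` only the first time a given notification_id is seen within 24 hours.
function sendOnce(store: IdempotencyStore, notificationId: string, send: () => void): boolean {
  if (!store.setIfAbsent(notificationId, 86_400)) return false; // duplicate: skip
  send();
  return true;
}

const store = new IdempotencyStore();
let smsSent = 0;
sendOnce(store, "notif-123", () => { smsSent += 1; }); // sends
sendOnce(store, "notif-123", () => { smsSent += 1; }); // retry of the same id: skipped
console.log(smsSent); // 1
```

Note that this gives at-most-once sending per key: if the send fails after the key is set, the retry path has to clear or bypass the key deliberately.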

C. Worker Cluster & Circuit Breaker

Third-party APIs (Twilio, SendGrid) can fail or slow down.

  • Circuit Breaker Pattern: If Twilio returns 500s, the circuit "opens" and the worker either retries later or switches to a backup provider (e.g., MessageBird).
  • Exponential Backoff: If a send fails, retry after 1s, 2s, 4s, 8s, and so on, rather than hammering the API.
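
Both patterns can be sketched without any library. The `backoffDelayMs` function, the `SimpleBreaker` class, the consecutive-failure threshold, and the 60-second cap are all assumptions for illustration; real circuit breaker libraries usually track a rolling error percentage instead.

```typescript
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, ... capped at `capMs`.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Opens after `failureThreshold` consecutive failures and allows trial
// requests again once `resetTimeoutMs` has passed.
class SimpleBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;

  constructor(failureThreshold: number, resetTimeoutMs: number) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // May a request go to this provider right now?
  allowRequest(now: number): boolean {
    if (this.state === "OPEN" && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = "HALF_OPEN"; // let trial requests through until a result is recorded
    }
    return this.state !== "OPEN";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "CLOSED";
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = now;
    }
  }
}

// Use the primary provider unless its circuit is open, then fall back to the backup.
function chooseProvider(primary: SimpleBreaker, now: number): "primary" | "backup" {
  return primary.allowRequest(now) ? "primary" : "backup";
}
```

A worker would call `recordSuccess` or `recordFailure` after each provider response and sleep for `backoffDelayMs(attempt)` before re-queuing a failed task.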

5. Implementation Example: Delivery Worker

This example shows a simplified worker that consumes a notification task and uses a Circuit Breaker for reliability.

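```typescript
import CircuitBreaker from "opossum"; // Circuit breaker library
import Redis from "ioredis"; // Redis client

interface NotificationTask {
  id: string;
  type: "SMS" | "EMAIL" | "PUSH";
  recipient: string;
  content: string;
}

// Minimal shapes for the tracking database and the Dead Letter Queue client
interface TrackingDb {
  logs: { update(id: string, fields: { status: string }): Promise<void> };
}
interface DeadLetterQueue {
  push(task: NotificationTask): Promise<void>;
}

class NotificationWorker {
  private breaker: CircuitBreaker<[NotificationTask], void>;

  constructor(
    private readonly redis: Redis,
    private readonly db: TrackingDb,
    private readonly dlq: DeadLetterQueue,
  ) {
    // Configure Circuit Breaker (the arrow function keeps `this` bound)
    this.breaker = new CircuitBreaker(
      (task: NotificationTask) => this.sendToProvider(task),
      {
        timeout: 3000, // If provider takes > 3s, fail
        errorThresholdPercentage: 50, // Open circuit if 50% fail
        resetTimeout: 30000, // Wait 30s before trying again
      },
    );
  }

  // Actual API call to provider (e.g., Twilio)
  async sendToProvider(task: NotificationTask): Promise<void> {
    console.log(`Sending ${task.type} to ${task.recipient}...`);
    // Simulated API Call
    // await axios.post('https://api.provider.com/send', task);
  }

  async processTask(task: NotificationTask): Promise<void> {
    try {
      // 1. Check Idempotency in Redis: SET ... NX returns "OK" only if the key was new
      const firstAttempt = await this.redis.set(task.id, "SENT", "EX", 86400, "NX");
      if (!firstAttempt) return; // Key already exists: duplicate, skip the send

      // 2. Execute via Circuit Breaker
      await this.breaker.fire(task);

      // 3. Log Success
      await this.db.logs.update(task.id, { status: "SENT" });
    } catch (error) {
      // opossum rejects with code "EOPENBREAKER" while the circuit is open
      if ((error as { code?: string }).code === "EOPENBREAKER") {
        console.error("Provider is down. Retrying through backup...");
        // Logic to route to backup provider
      } else {
        // Move to Dead Letter Queue (DLQ) for manual inspection.
        // The idempotency key stays set, so a replay must clear it first.
        await this.dlq.push(task);
      }
    }
  }
}
```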

6. Feedback Loop (Webhooks)

Providers usually confirm delivery via Webhooks.

  • Our system must expose a /webhook/[provider] endpoint.
  • Webhooks update the status in the Tracking Database.
  • If status is BOUNCED, we mark the email address as invalid and stop sending to it, which protects our sender reputation.
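
The feedback loop can be sketched as a small handler that runs behind the webhook endpoint. The `WebhookEvent` shape is an assumed, already-normalized format; each real provider sends its own payload and would need a mapping layer in front of this. `TrackingStore` is a hypothetical in-memory stand-in for the Tracking Database.

```typescript
type DeliveryStatus = "SENT" | "DELIVERED" | "OPENED" | "BOUNCED";

// Assumed normalized event, produced after parsing a provider-specific payload.
interface WebhookEvent {
  notificationId: string;
  recipient: string;
  status: DeliveryStatus;
}

class TrackingStore {
  readonly statuses = new Map<string, DeliveryStatus>();
  readonly invalidRecipients = new Set<string>();

  // Record the latest status; a bounce also marks the recipient as invalid.
  applyWebhook(event: WebhookEvent): void {
    this.statuses.set(event.notificationId, event.status);
    if (event.status === "BOUNCED") {
      this.invalidRecipients.add(event.recipient);
    }
  }

  // Checked before sending so bounced addresses are never retried.
  canSendTo(recipient: string): boolean {
    return !this.invalidRecipients.has(recipient);
  }
}

const tracking = new TrackingStore();
tracking.applyWebhook({ notificationId: "n1", recipient: "a@example.com", status: "DELIVERED" });
tracking.applyWebhook({ notificationId: "n2", recipient: "b@example.com", status: "BOUNCED" });
console.log(tracking.canSendTo("a@example.com"), tracking.canSendTo("b@example.com")); // true false
```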

Key Metrics to Monitor

  • Delivery Success Rate: % of messages delivered.
  • Latency: Time from Ingestion to Delivery.
  • Provider Health: Error rates per provider.
  • Queue Backlog: Number of messages waiting in Kafka (consumer lag).

7. Enterprise Considerations: Cost & Compliance

Cost Optimization

At 1 billion notifications, cost is as much an architecture problem as latency.

  • Deduplication: Use Redis to prevent sending redundant marketing messages (Saves 5-10% in costs).
  • Least-Cost Routing: Dynamically switch between Twilio, Plivo, and AWS SNS based on current per-country pricing.
  • Tiered Storage: Move logs from the hot log store (e.g., ClickHouse) to cold storage (S3/GCS) after 7 days to reduce storage costs.
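
Least-cost routing reduces to picking the cheapest quote for the destination country. In this sketch the `Quote` shape, the provider names, and the prices are placeholders, not real pricing; a real router would also weigh provider health and deliverability.

```typescript
interface Quote {
  provider: string;
  country: string; // ISO country code
  pricePerSms: number; // placeholder values, not real pricing
}

// Returns the cheapest provider for `country`, or undefined if none serves it.
function cheapestProvider(quotes: Quote[], country: string): string | undefined {
  let best: Quote | undefined;
  for (const q of quotes) {
    if (q.country !== country) continue;
    if (best === undefined || q.pricePerSms < best.pricePerSms) best = q;
  }
  return best?.provider;
}

const quotes: Quote[] = [
  { provider: "provider-a", country: "US", pricePerSms: 0.008 },
  { provider: "provider-b", country: "US", pricePerSms: 0.006 },
  { provider: "provider-a", country: "IN", pricePerSms: 0.003 },
  { provider: "provider-b", country: "IN", pricePerSms: 0.004 },
];

console.log(cheapestProvider(quotes, "US")); // provider-b
console.log(cheapestProvider(quotes, "IN")); // provider-a
```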

Compliance (GDPR, TCPA, CCPA)

  • Opt-Out Management: A centralized service must store "Do Not Disturb" lists.
  • Data Anonymization: PII (Phone numbers, Emails) in logs should be masked or encrypted at rest.
  • Rate Limits for Compliance: TCPA rules restrict marketing calls and texts to between 8 AM and 9 PM in the recipient's local time, so the quiet hours are 9 PM - 8 AM (some states are stricter). The Prioritization Service must buffer these messages during quiet hours.
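
Quiet-hours buffering comes down to a window check on the recipient's local hour. In this sketch the window bounds are parameters, because the exact hours depend on the regulation and jurisdiction; the example values (21 and 8) and both function names are assumptions, and resolving a recipient's time zone is left out.

```typescript
// True if `localHour` (0-23, recipient's local time) falls inside the quiet window.
// Handles windows that wrap past midnight, e.g. start = 21, end = 8.
function isQuietHour(localHour: number, startHour: number, endHour: number): boolean {
  if (startHour === endHour) return false; // empty window
  return startHour < endHour
    ? localHour >= startHour && localHour < endHour
    : localHour >= startHour || localHour < endHour;
}

// Whole hours to hold a marketing message until the quiet window ends (0 if sendable now).
function hoursUntilSendable(localHour: number, startHour: number, endHour: number): number {
  if (!isQuietHour(localHour, startHour, endHour)) return 0;
  return (endHour - localHour + 24) % 24;
}

const QUIET_START = 21; // assumed window start (9 PM local)
const QUIET_END = 8; // assumed window end (8 AM local)

console.log(isQuietHour(22, QUIET_START, QUIET_END)); // true
console.log(isQuietHour(12, QUIET_START, QUIET_END)); // false
console.log(hoursUntilSendable(22, QUIET_START, QUIET_END)); // 10
```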

Appendix: Provider Comparison

Choosing the right provider is critical for both cost and deliverability at scale. Below is a comparison of industry leaders for 2024-2026.

1. SMS Providers

| Provider | Key Strengths | Best For | Regional Reach | Pricing |
| --- | --- | --- | --- | --- |
| Twilio | Best API & Docs, Global reach, highly reliable. | Scale (1B+), complex logic. | Global (Excellent) | $$$ |
| Plivo | Cheaper than Twilio, simple API, solid high volume. | Cost-conscious high volume. | Global (Good) | $$ |
| AWS SNS | Deep AWS integration, extremely low overhead. | Basic alerts, existing AWS apps. | Global (Standard) | $ |
| Vonage | Reliable international routing, high security. | Enterprise-grade global SMS. | Global (Excellent) | $$ |

2. Email Providers

| Provider | Key Strengths | Best For | Deliverability | Pricing |
| --- | --- | --- | --- | --- |
| Amazon SES | Lowest cost per email ($0.10/1k), scales infinitely. | Bulk marketing, system alerts. | High (Requires warmup) | $ |
| Postmark | Fastest delivery, best-in-class transactional focus. | Password resets, OTPs. | Very High | $$$ |
| SendGrid | Excellent templates, robust marketing automation. | Mixed Marketing & Transactional. | High | $$ |
| Mailgun | Powerful inbound parsing, strong developer API. | Dynamic apps, dev-heavy teams. | High | $$ |

3. Push Notification Providers

| Provider | Key Strengths | Best For | Platform Support | Pricing |
| --- | --- | --- | --- | --- |
| FCM | Free, native Android integration, cross-platform. | Most mobile apps, 2FA push. | Android, iOS, Web | Free / Paygo |
| OneSignal | Easy setup, advanced user segmentation. | Marketing-heavy engagement. | All platforms | Freemium |
| Airship | Enterprise orchestration, real-time analytics. | Large-scale high-end mobile apps. | Mobile, Web, Wallet | $$$ |

8. Interview Tips: Common Pitfalls & Gotchas

When designing this system in an interview, be sure to address these common interviewer follow-ups:

  • "What if the Queue is full?": Discuss Backpressure. Tell the ingestion service to slow down or return HTTP 429 to client services.
  • "How to handle duplicate notifications?": Emphasize Idempotency. Explain the Redis SET NX approach.
  • "How do you handle provider outages?": Mention Circuit Breakers and Failover Routing (e.g., if SNS is down, use Twilio).
  • "How to avoid being marked as Spam?": Discuss Deliverability. Handle bounces and unsubscribes immediately (Feedback Loop).
  • "What about security?": Discuss encryption at rest for PII and API key rotation for third-party providers.
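
The backpressure answer from the first bullet can be sketched with a bounded queue in front of ingestion. `BoundedQueue`, `ingest`, the 202 response for accepted work, and the 1-second retry hint are assumptions for illustration; with Kafka the signal would more likely come from consumer lag or producer buffer limits.

```typescript
// Fixed-capacity FIFO queue that rejects new items when full.
class BoundedQueue<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Returns false instead of growing past capacity.
  offer(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  poll(): T | undefined {
    return this.items.shift();
  }
}

interface IngestResponse {
  status: number;
  retryAfterSeconds?: number;
}

// Ingestion handler: 202 when queued, 429 (Too Many Requests) when the queue is full.
function ingest(queue: BoundedQueue<string>, notificationId: string): IngestResponse {
  return queue.offer(notificationId)
    ? { status: 202 }
    : { status: 429, retryAfterSeconds: 1 };
}

const ingestQueue = new BoundedQueue<string>(2);
console.log(ingest(ingestQueue, "n1").status); // 202
console.log(ingest(ingestQueue, "n2").status); // 202
console.log(ingest(ingestQueue, "n3").status); // 429
```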

Summary

Building for 1 billion notifications isn't just about sending messages; it's about defensive engineering. By using distributed queues, circuit breakers, and aggressive idempotency checks, you can ensure that your system stays resilient even when third-party providers fail.

Released under the ISC License.