Message Queues & Pub-Sub — Core Concepts
Interview Relevance: Very High. Decoupling services with asynchronous messaging underpins most scalable microservice architectures. Know when and why to reach for a queue.
Why Message Queues?
The 3 core benefits:
- Decoupling — Producer doesn't know or care about consumers
- Async Processing — Producer returns immediately; work happens in background
- Backpressure — Queue absorbs traffic spikes; consumers process at sustainable rate
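A minimal sketch of the decoupling and async-processing benefits, assuming an in-memory `queue.Queue` as a stand-in for a real broker. The names `place_order` and `worker` and the `None` shutdown sentinel are hypothetical and chosen only for illustration.

```python
import queue
import threading

# In-memory stand-in for a broker; a real system would use Kafka, RabbitMQ, SQS, etc.
task_queue = queue.Queue()
processed = []

def place_order(order_id):
    """Producer: enqueue the work and return immediately."""
    task_queue.put({"order_id": order_id})
    return "accepted"

def worker():
    """Consumer: drains the queue at its own pace; None is a stop signal."""
    while True:
        msg = task_queue.get()
        if msg is None:
            break
        processed.append(msg["order_id"])

consumer = threading.Thread(target=worker)
consumer.start()

# The producer never waits on the worker and knows nothing about it.
responses = [place_order(i) for i in range(5)]
task_queue.put(None)  # hypothetical shutdown sentinel
consumer.join()
```

The producer only touches the queue, so the worker can be slowed down, restarted, or replaced without changing `place_order`.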
Message Queue vs. Pub-Sub: The Key Distinction
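The distinction can be sketched in a few lines: in a message queue each message is handled by exactly one consumer, while in pub-sub every subscriber gets its own copy. This is an illustrative in-memory model only; the round-robin choice and the names `deliver_queue` and `deliver_pubsub` are assumptions, and real brokers decide which competing consumer receives a message in their own ways.

```python
from itertools import cycle

def deliver_queue(messages, consumers):
    """Point-to-point: each message goes to exactly one consumer (round-robin here)."""
    received = {name: [] for name in consumers}
    targets = cycle(consumers)
    for msg in messages:
        received[next(targets)].append(msg)
    return received

def deliver_pubsub(messages, subscribers):
    """Pub-sub: every subscriber receives its own copy of every message."""
    return {name: list(messages) for name in subscribers}

messages = ["m1", "m2", "m3", "m4"]
queue_result = deliver_queue(messages, ["worker-a", "worker-b"])
pubsub_result = deliver_pubsub(messages, ["inventory", "email"])
```

In the queue case the two workers split the four messages between them; in the pub-sub case both subscribers see all four.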
The Major Systems
Apache Kafka
Kafka is a distributed commit log — a high-throughput, fault-tolerant, durable event streaming platform.
Key Kafka Concepts:
| Concept | Description | Interview Relevance |
|---|---|---|
| Topic | Named stream of events (like a table in a DB) | Always say "topic", not "queue" |
| Partition | Ordered, immutable log — unit of parallelism | More partitions = more throughput |
| Offset | Position of a message in a partition | Consumers track their own offset |
| Consumer Group | Group of consumers sharing work on a topic | Each partition → exactly 1 consumer in a group |
| Replication Factor | Copies of each partition across brokers | RF=3 → data survives 2 broker failures; with acks=all and min.insync.replicas=2, writes tolerate only 1 |
| Retention | How long messages are kept (default: 7 days) | Consumers can replay from any offset |
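To make partitions, offsets, and consumer groups concrete, here is a small in-memory sketch. It assumes CRC32 as the key hash purely for determinism (Kafka's default partitioner uses murmur2), and `partition_for` and `assign_partitions` are hypothetical helper names, not Kafka APIs.

```python
import zlib

NUM_PARTITIONS = 3  # illustrative value

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition. CRC32 is a stand-in for Kafka's murmur2 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign_partitions(consumers, num_partitions=NUM_PARTITIONS):
    """Give each partition to exactly one consumer in the group (simple round-robin)."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# Append events to per-partition logs; the list index plays the role of the offset.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
events = [("order-1", "placed"), ("order-2", "placed"),
          ("order-1", "paid"), ("order-1", "shipped")]
for key, event in events:
    partitions[partition_for(key)].append((key, event))

# All events for one key land in one partition, so their relative order is preserved.
order1_log = [e for k, e in partitions[partition_for("order-1")] if k == "order-1"]

# With 4 consumers and 3 partitions, one consumer is left with nothing to read.
group = assign_partitions(["c1", "c2", "c3", "c4"])
```

The idle fourth consumer is why adding consumers beyond the partition count does not increase throughput.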
Why Kafka is Special:
✅ Extremely high throughput (millions of msg/sec per broker)
✅ Durable — messages persisted to disk with replication
✅ Replayable — consumers can re-read old messages (replay from offset 0)
✅ Multiple consumer groups — same topic consumed independently by many services
✅ Ordered within a partition (use partition key for ordering guarantees)
❌ Complex to operate (ZooKeeper/KRaft, partition rebalancing)
❌ No per-message routing (unlike RabbitMQ)
❌ At-least-once delivery by default (handle idempotency in consumer)
RabbitMQ
RabbitMQ is a traditional message broker with rich routing, exchanges, and delivery semantics.
RabbitMQ vs Kafka:
| Feature | RabbitMQ | Kafka |
|---|---|---|
| Model | Message broker | Distributed log / event stream |
| Throughput | ~50K msg/sec | ~1M+ msg/sec per broker |
| Routing | ✅ Rich (Exchange types) | ❌ Topic-only (partition key) |
| Message retention | Deleted on ACK | Retained for TTL (replayable) |
| Consumer groups | Competing consumers per queue | Multiple groups, independent offsets |
| Ordering | Per-queue (single consumer) | Per-partition |
| Use case | Task queues, RPC, complex routing | Event streaming, audit logs, high-throughput |
| Delivery semantics | At-most-once or at-least-once (depends on ack mode) | At-least-once by default; at-most-once and exactly-once are configurable |
AWS SQS & SNS
Managed cloud-native alternatives — zero infrastructure to maintain.
| Feature | SQS | SNS |
|---|---|---|
| Type | Message Queue (point-to-point) | Pub-Sub (fan-out) |
| Consumers | Competing consumers (each message handled by one) | Multiple subscribers (each gets a copy) |
| Delivery | Pull-based (consumers poll) | Push-based (to endpoints/SQS) |
| Retention | Up to 14 days | No retention (fire-and-forget) |
| Ordering | Standard (best-effort) or FIFO | Standard (no ordering) or FIFO topics |
| Use case | Decoupling services, work queues | Fan-out, broadcast to many services |
| Pattern | SQS → one processor | SNS → multiple SQS/Lambda/HTTP |
Backpressure
Backpressure is the mechanism by which a slow consumer signals upstream to slow down production, preventing queue overflow and OOM crashes.
Backpressure Strategies
In Kafka specifically:
Kafka Producer Backpressure:
- buffer.memory: total bytes the producer can buffer (default 32MB)
- max.block.ms: how long producer blocks when buffer is full (default 60s)
- If buffer fills and max.block.ms exceeded → TimeoutException
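The producer-side behaviour above can be imitated with a bounded buffer and a blocking timeout. This is a toy analogue, not the Kafka client: `BufferedProducer`, `max_buffered`, and `max_block_s` are hypothetical names, and the buffer here counts messages rather than bytes.

```python
import queue

class BufferedProducer:
    """Toy producer with a bounded buffer; not the Kafka client."""

    def __init__(self, max_buffered, max_block_s):
        # Analogue of buffer.memory (counts messages here, not bytes).
        self.buffer = queue.Queue(maxsize=max_buffered)
        # Analogue of max.block.ms.
        self.max_block_s = max_block_s

    def send(self, message):
        try:
            self.buffer.put(message, timeout=self.max_block_s)
        except queue.Full:
            raise TimeoutError("buffer full: blocked longer than max_block_s")

producer = BufferedProducer(max_buffered=2, max_block_s=0.05)
producer.send("a")
producer.send("b")

try:
    producer.send("c")  # nothing drains the buffer, so this blocks, then fails
    outcome = "sent"
except TimeoutError:
    outcome = "timed out"
```

Blocking the caller is what pushes backpressure upstream: the code that calls `send` slows down or fails fast instead of letting memory grow without bound.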
Kafka Consumer Lag:
- Monitor: consumer lag = (latest offset) - (consumer offset)
- Alert when lag > threshold (e.g., > 100,000 messages)
- Remedy: add more consumers (up to partition count)
Delivery Guarantees
One of the most tested message queue concepts in FAANG interviews.
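A useful way to reason about at-most-once versus at-least-once is the order of "commit offset" and "process message" around a crash. The simulation below is a simplified sketch under that assumption; `consume` and its `crash_at` parameter are hypothetical and model a single consumer that crashes once and restarts from its last committed offset.

```python
def consume(log, commit_first, crash_at):
    """Replay a log through a consumer that crashes once while handling offset crash_at.

    commit_first=True  -> commit offset, then process (at-most-once)
    commit_first=False -> process, then commit offset (at-least-once)
    """
    committed = 0    # next offset to read after a restart
    handled = []     # side effects that actually happened
    crashed = False
    while committed < len(log):
        offset = committed
        if commit_first:
            committed = offset + 1
            if offset == crash_at and not crashed:
                crashed = True  # crash after commit, before processing: message lost
                continue
            handled.append(log[offset])
        else:
            handled.append(log[offset])
            if offset == crash_at and not crashed:
                crashed = True  # crash after processing, before commit: redelivered
                continue
            committed = offset + 1
    return handled

log = ["a", "b", "c"]
at_most_once = consume(log, commit_first=True, crash_at=1)    # "b" is never handled
at_least_once = consume(log, commit_first=False, crash_at=1)  # "b" is handled twice
```

Committing first loses the in-flight message; processing first repeats it, which is why at-least-once needs an idempotent consumer.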
Making At-Least-Once safe with Idempotency:
Problem: Consumer processes message → crashes before ACK → broker redelivers → duplicate!
Solution: Idempotent Consumer
- Add a unique message_id to every event
- Consumer checks: "Have I processed message_id X before?"
- YES → skip (already done)
- NO → process + record message_id in DB/Redis
- Redis SET message_id 1 NX EX 86400 ← atomic check-and-set with a 24h expiry
Dead Letter Queue (DLQ)
When a message fails repeatedly, it's moved to a DLQ instead of blocking the main queue.
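A minimal sketch of the retry-then-park behaviour, assuming a fixed retry limit and an in-memory list standing in for the DLQ. The function `process_with_dlq`, the `charge` handler, and the event shape are hypothetical; a real setup would typically rely on the broker's redelivery and dead-letter configuration.

```python
MAX_ATTEMPTS = 3  # illustrative retry limit

def process_with_dlq(messages, handler, max_attempts=MAX_ATTEMPTS):
    """Try each message up to max_attempts times; park persistent failures in a DLQ list."""
    succeeded, dead_letters = [], []
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(msg)
                succeeded.append(msg)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Keep the error with the message so it can be inspected and replayed.
                    dead_letters.append(
                        {"message": msg, "error": str(exc), "attempts": attempt}
                    )
    return succeeded, dead_letters

def charge(msg):
    """Hypothetical handler: rejects malformed payment events."""
    if msg.get("amount") is None:
        raise ValueError("missing amount")

events = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}, {"id": 3, "amount": 7}]
ok, dlq = process_with_dlq(events, charge)
```

Event 2 fails every attempt and is parked, while event 3 behind it is still processed, so one bad message does not stall the rest.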
Worked Example: Order Processing System
Design decisions explained:
| Decision | Choice | Reason |
|---|---|---|
| Partition key | order_id | All events for same order → same partition → ordered processing |
| Partition count | 3 (matches consumer count) | 1 partition per consumer = max parallelism |
| Delivery guarantee | At-least-once + idempotent consumers | Payment service deduplicates by event_id |
| Replication factor | 3 | Survive 2 broker failures |
| Retention | 7 days | Allows replaying events for new services |
| DLQ | order-events-dlq topic | Failed payment events don't block other orders |
Choosing the Right System
Interview Cheat Sheet
One-Line Summaries
Kafka: Distributed commit log — high throughput, replayable, event streaming
RabbitMQ: Message broker — rich routing, task queues, lower throughput
SQS: Managed AWS queue — point-to-point, pull-based, zero ops
SNS: Managed AWS pub-sub — fan-out broadcast to many subscribers
Decoupling: Producer doesn't know consumers exist → independent failure & scaling
Backpressure: Consumer slowing down signals producer to slow down → prevent OOM
At-most-once: May lose messages (metrics, logs)
At-least-once: May duplicate (use idempotent consumer to fix)
Exactly-once: Kafka transactions (expensive, for financial systems)
DLQ: Failed messages parked for inspection; never block the main queue
The Interview Phrase
"After the user places an order, the Order API publishes an
OrderPlaced event to a Kafka topic and returns 200 immediately.
Three independent consumer groups — Inventory, Email, and Payment —
each consume the event at their own pace. I partition by order_id
so all events for an order are ordered within a partition. Each
consumer is idempotent (deduplicates by event_id) to safely handle
Kafka's at-least-once delivery. Failed events go to a DLQ topic
after 3 retries so they can be replayed after a bug fix."
Red Flags vs. Green Flags
| 🔴 Red Flag | 🟢 Green Flag |
|---|---|
| Use Kafka for simple task queues | Use SQS/RabbitMQ for simple queues; Kafka for event streams |
| Ignore idempotency with at-least-once | Always discuss idempotent consumer pattern |
| No DLQ mentioned | Always add DLQ for failed messages |
| Forget about consumer lag / backpressure | Monitor lag; auto-scale consumers when lag grows |
| Use one partition for ordered processing | Partition by entity key (order_id) for parallelism + ordering |
| Treat SNS and SQS as the same thing | SNS = fan-out pub-sub; SQS = point-to-point queue |
IMPORTANT
Always mention idempotency when discussing at-least-once delivery. Say: "Since Kafka delivers at-least-once, my consumer deduplicates by recording each event_id in Redis with a 24h TTL and skipping any event_id it has already seen."
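The idempotent consumer can be sketched as follows. `DedupStore` is a hypothetical in-memory stand-in for Redis `SET key value NX EX ttl`, and `handle_payment` and the event fields are illustrative assumptions rather than any particular service's code.

```python
import time

class DedupStore:
    """In-memory stand-in for Redis `SET key value NX EX ttl`."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def set_if_absent(self, key, ttl_seconds):
        """Return True if the key was newly recorded, False if present and unexpired."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + ttl_seconds
        return True

store = DedupStore()
charges = []

def handle_payment(event):
    # Claim the event_id first; a redelivered event fails the claim and is skipped.
    if not store.set_if_absent(event["event_id"], ttl_seconds=86400):
        return "skipped"
    charges.append(event["amount"])
    return "processed"

results = [handle_payment(e) for e in [
    {"event_id": "evt-1", "amount": 50},
    {"event_id": "evt-1", "amount": 50},  # redelivery after a missed ACK
    {"event_id": "evt-2", "amount": 20},
]]
```

Claiming the ID before doing the work means a crash between the two steps would skip the event on redelivery; when that matters, recording the ID in the same database transaction as the side effect closes the gap.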
TIP
Mentioning consumer lag monitoring (e.g., via CloudWatch or Kafka's consumer group lag metric) and auto-scaling consumers when lag exceeds a threshold is a strong senior signal.
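The lag check and scaling decision can be sketched with plain arithmetic. The offsets below are made-up sample numbers, and `total_lag`, `desired_consumers`, and the `lag_per_consumer` threshold are hypothetical; a real deployment would read offsets from the broker's metrics and hand the target to its autoscaler.

```python
import math

def total_lag(latest_offsets, committed_offsets):
    """Sum of (latest offset - committed offset) across partitions."""
    return sum(latest_offsets[p] - committed_offsets.get(p, 0) for p in latest_offsets)

def desired_consumers(lag, current, partition_count, lag_per_consumer=100_000):
    """Scale out when lag exceeds the threshold per consumer, capped at partition count."""
    needed = max(1, math.ceil(lag / lag_per_consumer))
    return min(partition_count, max(current, needed))

# Illustrative per-partition offsets (not real measurements).
latest = {0: 500_000, 1: 420_000, 2: 310_000}
committed = {0: 380_000, 1: 400_000, 2: 150_000}

lag = total_lag(latest, committed)  # 120_000 + 20_000 + 160_000
target = desired_consumers(lag, current=1, partition_count=3)
```

The cap at `partition_count` reflects that consumers beyond the number of partitions in a group would sit idle.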
