Message Queues & Pub-Sub — Core Concepts

Interview Relevance: Very High. Decoupling services with async messaging underpins most scalable microservice architectures. Know when and why to reach for a queue.

Why Message Queues?

The 3 core benefits:

Decoupling — Producer doesn't know or care about consumers
Async Processing — Producer returns immediately; work happens in background
Backpressure — Queue absorbs traffic spikes; consumers process at sustainable rate

Message Queue vs. Pub-Sub: The Key Distinction

The Major Systems

Apache Kafka

Kafka is a distributed commit log — a high-throughput, fault-tolerant, durable event streaming platform.

Key Kafka Concepts:

Concept	Description	Interview Relevance
Topic	Named stream of events (like a table in a DB)	Always say "topic", not "queue"
Partition	Ordered, immutable log — unit of parallelism	More partitions = more throughput
Offset	Position of a message in a partition	Consumers track their own offset
Consumer Group	Group of consumers sharing work on a topic	Each partition → exactly 1 consumer in a group
Replication Factor	Copies of each partition across brokers	RF=3 → data survives up to 2 broker failures (with acks=all and min.insync.replicas=2, writes keep flowing through only 1)
Retention	How long messages are kept (default: 7 days)	Consumers can replay from any retained offset
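
To make the table concrete, here is a minimal in-memory sketch of topics, partitions, offsets, and consumer groups. It is not the Kafka client API: `Topic`, `ConsumerGroup`, and `partition_for` are hypothetical names, and CRC32 stands in for Kafka's own key hashing.

```python
from __future__ import annotations

import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Stable key -> partition mapping (CRC32 is an illustrative stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Topic:
    """Toy topic: one append-only list per partition."""

    def __init__(self, num_partitions: int = 3) -> None:
        self.partitions: list[list[str]] = [[] for _ in range(num_partitions)]

    def publish(self, key: str, value: str) -> tuple[int, int]:
        """Append to the key's partition and return (partition, offset)."""
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1


class ConsumerGroup:
    """Each group keeps its own offset per partition, independent of other groups."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offsets: dict[int, int] = defaultdict(int)

    def poll(self, partition: int) -> list[str]:
        """Return unread messages for one partition and advance this group's offset."""
        log = self.topic.partitions[partition]
        batch = log[self.offsets[partition]:]
        self.offsets[partition] = len(log)
        return batch

    def seek(self, partition: int, offset: int) -> None:
        """Rewind (or skip ahead) to replay from a chosen offset."""
        self.offsets[partition] = offset
```

Publishing two events with the same key lands them in the same partition at consecutive offsets, and two `ConsumerGroup` instances each read the full log at their own pace. Calling `seek(partition, 0)` models replay.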

Why Kafka is Special:

✅ Very high throughput (clusters commonly sustain millions of msg/sec; per-broker numbers depend on message size, batching, and hardware)
✅ Durable — messages persisted to disk with replication
✅ Replayable — consumers can re-read old messages (seek back to the earliest retained offset)
✅ Multiple consumer groups — same topic consumed independently by many services
✅ Ordered within a partition (use partition key for ordering guarantees)
❌ Complex to operate (ZooKeeper/KRaft, partition rebalancing)
❌ No per-message routing (unlike RabbitMQ)
❌ At-least-once delivery by default (handle idempotency in consumer)

RabbitMQ

RabbitMQ is a traditional message broker with rich routing, exchanges, and delivery semantics.

RabbitMQ vs Kafka:

Feature	RabbitMQ	Kafka
Model	Message broker	Distributed log / event stream
Throughput	~50K msg/sec (rough order of magnitude)	~1M+ msg/sec per broker (small, batched messages)
Routing	✅ Rich (Exchange types)	❌ Topic-only (partition key)
Message retention	Deleted on ACK	Retained for TTL (replayable)
Consumer groups	Competing consumers per queue	Multiple groups, independent offsets
Ordering	Per-queue (single consumer)	Per-partition
Use case	Task queues, RPC, complex routing	Event streaming, audit logs, high-throughput
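Delivery semantics	At-most-once or at-least-once (auto vs. manual ack)	At-least-once by default; at-most-once and exactly-once (transactions) configurable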
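
The "rich routing" entry refers to exchange types. As an illustration, the sketch below imitates how a topic exchange matches routing keys against binding patterns, where `*` matches exactly one dot-separated word and `#` matches zero or more. The function names and the binding list are hypothetical, not part of any RabbitMQ client library.

```python
from __future__ import annotations


def topic_matches(pattern: str, routing_key: str) -> bool:
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more words."""

    def match(p: list[str], k: list[str]) -> bool:
        if not p:
            return not k  # pattern exhausted: match only if the key is too
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb any number of words, including none
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


def route(bindings: list[tuple[str, str]], routing_key: str) -> list[str]:
    """Return the queues whose binding pattern matches the routing key."""
    return [queue for pattern, queue in bindings if topic_matches(pattern, routing_key)]
```

With bindings such as `("order.*", "billing")` and `("order.#", "audit")`, the key `order.created.eu` reaches only `audit`, while `order.created` reaches both. Kafka has no broker-side equivalent: consumers subscribe to whole topics.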

AWS SQS & SNS

Managed cloud-native alternatives — zero infrastructure to maintain.

Feature	SQS	SNS
Type	Message Queue (point-to-point)	Pub-Sub (fan-out)
Consumers	Competing consumers (each message handled by one)	Multiple subscribers
Delivery	Pull-based (consumers poll)	Push-based (to endpoints/SQS)
Retention	Up to 14 days	No retention (fire-and-forget)
Ordering	Standard (best-effort) or FIFO	Standard (no ordering) or FIFO topics
Use case	Decoupling services, work queues	Fan-out, broadcast to many services
Pattern	SQS → one processor	SNS → multiple SQS/Lambda/HTTP
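
The point-to-point versus fan-out difference can be sketched with two tiny in-memory classes. These are illustrative stand-ins, not the AWS SDK: `WorkQueue` plays the role of an SQS queue and `FanOutTopic` the role of an SNS topic with queues subscribed to it.

```python
from __future__ import annotations

from collections import deque


class WorkQueue:
    """Point-to-point: each message is received by exactly one caller."""

    def __init__(self) -> None:
        self._messages: deque[str] = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> str | None:
        """Pull-based: the consumer asks for the next message, if any."""
        return self._messages.popleft() if self._messages else None


class FanOutTopic:
    """Pub-sub: every subscribed queue gets its own copy of each message."""

    def __init__(self) -> None:
        self._subscribers: list[WorkQueue] = []

    def subscribe(self, queue: WorkQueue) -> None:
        self._subscribers.append(queue)

    def publish(self, message: str) -> None:
        """Push-based: the topic delivers to all subscribers and stores nothing itself."""
        for queue in self._subscribers:
            queue.send(message)
```

Subscribing one `WorkQueue` per downstream service to a single `FanOutTopic` mirrors the SNS → multiple SQS pattern in the last row: each service gets its own buffered copy and can fail or scale independently.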

Backpressure

Backpressure is the mechanism by which a slow consumer signals upstream to slow down production, preventing queue overflow and OOM crashes.
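
One simple form of backpressure is a bounded buffer: when it is full, the producer blocks for a limited time and then gives up. The sketch below uses Python's standard `queue.Queue`; the helper name `try_produce` and the timeout value are assumptions made for illustration.

```python
import queue


def try_produce(buffer: queue.Queue, item: object, max_block_s: float) -> bool:
    """Block up to max_block_s while the buffer is full; report failure instead of growing memory."""
    try:
        buffer.put(item, timeout=max_block_s)
        return True
    except queue.Full:
        return False  # caller can retry later, shed load, or surface an error


def consume_one(buffer: queue.Queue) -> object:
    """Taking an item frees a slot, which unblocks a waiting producer."""
    return buffer.get_nowait()
```

Because the buffer has a fixed `maxsize`, a slow consumer translates directly into a slower (or failing) producer rather than unbounded memory growth.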

Backpressure Strategies

In Kafka specifically:

Kafka Producer Backpressure:
  - buffer.memory: total bytes the producer can buffer (default 32MB)
  - max.block.ms: how long producer blocks when buffer is full (default 60s)
  - If buffer fills and max.block.ms exceeded → TimeoutException

Kafka Consumer Lag:
  - Monitor: consumer lag = (latest offset) - (consumer offset)
  - Alert when lag > threshold (e.g., > 100,000 messages)
  - Remedy: add more consumers (up to partition count)
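
The lag formula and the scaling limit above can be expressed in a few lines. This is a sketch over plain dictionaries of per-partition offsets; in practice those numbers come from the broker or a monitoring tool, and the function names here are hypothetical.

```python
from __future__ import annotations


def consumer_lag(latest: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Per-partition lag = latest offset minus the group's committed offset."""
    return {p: end - committed.get(p, 0) for p, end in latest.items()}


def should_alert(lag: dict[int, int], threshold: int = 100_000) -> bool:
    """Alert when total lag across partitions exceeds the threshold."""
    return sum(lag.values()) > threshold


def useful_consumers(desired: int, partition_count: int) -> int:
    """Consumers beyond the partition count sit idle, so cap the scale-out."""
    return min(desired, partition_count)
```

For example, with latest offsets `{0: 150_000, 1: 20_000}` and committed offsets `{0: 40_000, 1: 20_000}`, partition 0 lags by 110,000 messages and the alert fires. Asking for 5 consumers on a 3-partition topic yields only 3 that do work.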

Delivery Guarantees

One of the most tested message queue concepts in FAANG interviews.

Making At-Least-Once safe with Idempotency:

Problem: Consumer processes message → crashes before ACK → broker redelivers → duplicate!

Solution: Idempotent Consumer
  - Add a unique message_id to every event
  - Consumer checks: "Have I processed message_id X before?"
    - YES → skip (already done)
    - NO  → process + record message_id in DB/Redis
  - Redis: SET message_id 1 NX EX 86400  ← atomic check-and-set with a 24h TTL (SETNX alone cannot set an expiry)
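
The idempotent-consumer steps can be sketched as follows. `DedupStore` is an in-memory stand-in for a Redis-style "set if absent, with expiry" operation, and `handle` and `process` are hypothetical names; a real deployment would use a shared store so that all consumer instances see the same IDs.

```python
from __future__ import annotations

import time
from typing import Callable


class DedupStore:
    """In-memory stand-in for an atomic set-if-absent with a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._expiry: dict[str, float] = {}
        self._clock = clock

    def set_if_absent(self, key: str, ttl_s: float) -> bool:
        """Return True if the key was newly recorded, False if it is already present."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl_s
        return True

    def forget(self, key: str) -> None:
        self._expiry.pop(key, None)


def handle(message: dict, store: DedupStore, process: Callable[[dict], None],
           ttl_s: float = 86_400) -> bool:
    """Process a message at most once per TTL window; return False for duplicates."""
    key = message["message_id"]
    if not store.set_if_absent(key, ttl_s):
        return False  # already processed: skip the redelivery
    try:
        process(message)
    except Exception:
        store.forget(key)  # let a later redelivery retry the work
        raise
    return True
```

Recording the ID before processing keeps the check atomic, at the cost of a small window where a crash after the record but before the side effect would cause the redelivery to be skipped. Where that matters, one option is to write the ID in the same database transaction as the side effect.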

Dead Letter Queue (DLQ)

When a message fails repeatedly, it's moved to a DLQ instead of blocking the main queue.
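
A minimal sketch of that retry-then-park behaviour, using in-memory deques in place of real queues. The limit of 3 attempts and the names `drain` and `dead_letters` are assumptions for illustration.

```python
from __future__ import annotations

from collections import deque
from typing import Callable

MAX_ATTEMPTS = 3  # assumed retry budget before a message is parked


def drain(main_queue: deque, dead_letters: list,
          handler: Callable[[str], None], max_attempts: int = MAX_ATTEMPTS) -> None:
    """Process (message, attempts) pairs; park repeated failures in the DLQ."""
    while main_queue:
        message, attempts = main_queue.popleft()
        try:
            handler(message)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Keep the error alongside the message for later inspection
                dead_letters.append((message, repr(exc)))
            else:
                # Requeue at the tail so healthy messages are not blocked
                main_queue.append((message, attempts))
```

A poison message is tried three times and then moved aside, while the messages behind it are processed normally. Note that requeueing at the tail changes relative order, so a strictly ordered stream needs a different retry strategy.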

Worked Example: Order Processing System

Design decisions explained:

Decision	Choice	Reason
Partition key	`order_id`	All events for same order → same partition → ordered processing
Partition count	3 (matches consumer count)	1 partition per consumer = max parallelism
Delivery guarantee	At-least-once + idempotent consumers	Payment service deduplicates by `event_id`
Replication factor	3	Data survives up to 2 broker failures
Retention	7 days	Allows replaying events for new services
DLQ	`order-events-dlq` topic	Failed payment events don't block other orders

Choosing the Right System

Interview Cheat Sheet

One-Line Summaries

Kafka:       Distributed commit log — high throughput, replayable, event streaming
RabbitMQ:    Message broker — rich routing, task queues, lower throughput
SQS:         Managed AWS queue — point-to-point, pull-based, zero ops
SNS:         Managed AWS pub-sub — fan-out broadcast to many subscribers
Decoupling:  Producer doesn't know consumers exist → independent failure & scaling
Backpressure: Slow consumer signals the producer to slow down → prevents queue overflow and OOM
At-most-once:  May lose messages (metrics, logs)
At-least-once: May duplicate (use idempotent consumer to fix)
Exactly-once:  Kafka transactions (expensive, for financial systems)
DLQ:         Failed messages parked for inspection — never block the main queue

The Interview Phrase

"After the user places an order, the Order API publishes an
 OrderPlaced event to a Kafka topic and returns 200 immediately.
 Three independent consumer groups — Inventory, Email, and Payment —
 each consume the event at their own pace. I partition by order_id
 so all events for an order are ordered within a partition. Each
 consumer is idempotent (deduplicates by event_id) to safely handle
 Kafka's at-least-once delivery. Failed events go to a DLQ topic
 after 3 retries so they can be replayed after a bug fix."

Red Flags vs. Green Flags

🔴 Red Flag	🟢 Green Flag
Use Kafka for simple task queues	Use SQS/RabbitMQ for simple queues; Kafka for event streams
Ignore idempotency with at-least-once	Always discuss idempotent consumer pattern
No DLQ mentioned	Always add DLQ for failed messages
Forget about consumer lag / backpressure	Monitor lag; auto-scale consumers when lag grows
Use one partition for ordered processing	Partition by entity key (order_id) for parallelism + ordering
Treat SNS and SQS as the same thing	SNS = fan-out pub-sub; SQS = point-to-point queue

IMPORTANT

Always mention idempotency when discussing at-least-once delivery. Say: "Since Kafka delivers at-least-once, my consumer deduplicates on event_id: it records each processed ID in Redis with a 24h TTL and skips any event whose ID is already there."

TIP

Mentioning consumer lag monitoring (e.g., via CloudWatch or Kafka's consumer group lag metric) and auto-scaling consumers when lag exceeds a threshold is a strong senior signal.
