
Message Queues & Pub-Sub — Core Concepts

Interview Relevance: Very High. Decoupling services with asynchronous messaging underpins most scalable microservice architectures. Know when and why to reach for a queue.


Why Message Queues?

The 3 core benefits:

  1. Decoupling — Producer doesn't know or care about consumers
  2. Async Processing — Producer returns immediately; work happens in background
  3. Backpressure — Queue absorbs traffic spikes; consumers process at sustainable rate
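
The three benefits can be seen in miniature with an in-process queue. This is only a sketch: `queue.Queue` stands in for a real broker, and `place_order`, `consumer`, and the message format are hypothetical names chosen for illustration.

```python
import queue
import threading

# Bounded in-process queue standing in for a real broker (an assumption
# for illustration; a real system would use Kafka, RabbitMQ, SQS, ...).
work_queue = queue.Queue(maxsize=100)
processed = []

def consumer():
    """Drain the queue at the consumer's own pace until a None sentinel arrives."""
    while True:
        message = work_queue.get()
        if message is None:
            break
        processed.append(message.upper())  # placeholder for the slow work

def place_order(order_id):
    """Producer: enqueue the work and return without waiting for it to finish."""
    work_queue.put(f"order-{order_id}")
    return "accepted"

worker = threading.Thread(target=consumer)
worker.start()
responses = [place_order(i) for i in range(5)]
work_queue.put(None)  # sentinel: tell the consumer to stop
worker.join()
print(responses)
print(processed)
```

The producer never calls the consumer directly (decoupling), it returns as soon as the message is enqueued (async processing), and the `maxsize` bound means a sustained spike would eventually block `put` rather than exhaust memory.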

Message Queue vs. Pub-Sub: The Key Distinction

In a message queue (point-to-point), each message is processed by one consumer. In pub-sub (fan-out), each message is delivered to every subscriber.


The Major Systems

Apache Kafka

Kafka is a distributed commit log — a high-throughput, fault-tolerant, durable event streaming platform.

Key Kafka Concepts:

| Concept | Description | Interview Relevance |
| --- | --- | --- |
| Topic | Named stream of events (like a table in a DB) | Always say "topic", not "queue" |
| Partition | Ordered, immutable log; the unit of parallelism | More partitions = more throughput |
| Offset | Position of a message in a partition | Consumers track their own offset |
| Consumer Group | Group of consumers sharing work on a topic | Each partition → exactly 1 consumer in a group |
| Replication Factor | Copies of each partition across brokers | RF=3 → data survives 2 broker failures (with acks=all and min.insync.replicas=2, writes stay available through only 1) |
| Retention | How long messages are kept (default: 7 days) | Consumers can replay from any retained offset |
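
To make topics, partitions, offsets, and consumer groups concrete, here is a toy in-memory model. It is a sketch, not Kafka's implementation: `ToyTopic` and its methods are hypothetical, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner applies to keys.

```python
import zlib
from collections import defaultdict

class ToyTopic:
    """In-memory stand-in for a Kafka topic (illustrative only)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # group name -> partition index -> next offset to read
        self.committed = defaultdict(lambda: defaultdict(int))

    def partition_for(self, key):
        # Same key always hashes to the same partition (CRC32 is an assumption).
        return zlib.crc32(key.encode()) % len(self.partitions)

    def produce(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def poll(self, group, partition):
        """Return unread records for this group and advance its offset."""
        start = self.committed[group][partition]
        records = self.partitions[partition][start:]
        self.committed[group][partition] = len(self.partitions[partition])
        return records

    def seek(self, group, partition, offset):
        """Rewind (or skip ahead) so the group re-reads from `offset`."""
        self.committed[group][partition] = offset

topic = ToyTopic(num_partitions=3)
p1, o1 = topic.produce("order-42", "placed")
p2, o2 = topic.produce("order-42", "paid")

inventory = topic.poll("inventory", p1)  # first group reads both events
email = topic.poll("email", p1)          # second group reads them independently
again = topic.poll("inventory", p1)      # nothing new for the first group
topic.seek("inventory", p1, 0)
replayed = topic.poll("inventory", p1)   # replay from offset 0
```

Both events for `order-42` land in the same partition at consecutive offsets, each group keeps its own position, and rewinding the offset replays the log because reading never deletes records.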

Why Kafka is Special:

✅ Very high throughput (hundreds of thousands to millions of msg/sec, depending on message size, batching, and hardware)
✅ Durable — messages persisted to disk with replication
✅ Replayable — consumers can re-read old messages (replay from offset 0)
✅ Multiple consumer groups — same topic consumed independently by many services
✅ Ordered within a partition (use partition key for ordering guarantees)
❌ Complex to operate (ZooKeeper/KRaft, partition rebalancing)
❌ No per-message routing (unlike RabbitMQ)
❌ At-least-once delivery by default (handle idempotency in consumer)

RabbitMQ

RabbitMQ is a traditional message broker with rich routing, exchanges, and delivery semantics.

RabbitMQ vs Kafka:

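| Feature | RabbitMQ | Kafka |
| --- | --- | --- |
| Model | Message broker | Distributed log / event stream |
| Throughput | ~50K msg/sec (rough order of magnitude) | ~1M+ msg/sec (depends on message size, batching, and hardware) |
| Routing | ✅ Rich (exchange types: direct, topic, fanout, headers) | ❌ Topic-only (partition key) |
| Message retention | Deleted on ACK | Retained for the configured retention period (replayable) |
| Consumer groups | Competing consumers per queue | Multiple groups, independent offsets |
| Ordering | Per-queue (single consumer) | Per-partition |
| Use case | Task queues, RPC, complex routing | Event streaming, audit logs, high-throughput |
| Delivery semantics | At-most-once (auto-ack) or at-least-once (manual ack) | At-least-once by default; at-most-once and exactly-once are configurable |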
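
The "rich routing" row refers to exchanges that decide, per message, which queues receive it. As a sketch of how a topic exchange matches routing keys, the matcher below follows RabbitMQ's documented wildcard rules (`*` matches exactly one word, `#` matches zero or more words). The binding table and queue names are hypothetical, and this is a reimplementation for illustration, not RabbitMQ's code.

```python
def topic_matches(pattern, routing_key):
    """Return True if a dot-separated routing key matches a binding pattern."""
    def match(p, k):
        if not p:
            return not k          # pattern exhausted: key must be too
        if p[0] == "#":
            # '#' may consume zero or more words
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False          # words left in pattern but not in key
        if p[0] == "*" or p[0] == k[0]:
            return match(p[1:], k[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))

# Hypothetical bindings: (pattern, queue name)
bindings = [
    ("order.*", "billing"),
    ("order.#", "audit"),
    ("*.cancelled", "refunds"),
]

def route(routing_key):
    """List the queues that would receive a message with this routing key."""
    return [q for pattern, q in bindings if topic_matches(pattern, routing_key)]

print(route("order.cancelled"))
print(route("order.payment.failed"))
```

Kafka has no equivalent step: a producer picks a topic (and a partition key), and every consumer group subscribed to that topic sees every message.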

AWS SQS & SNS

Managed cloud-native alternatives — zero infrastructure to maintain.

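| Feature | SQS | SNS |
| --- | --- | --- |
| Type | Message Queue (point-to-point) | Pub-Sub (fan-out) |
| Consumers | Competing consumers; each message is processed by one of them | Multiple subscribers; each receives every message |
| Delivery | Pull-based (consumers poll) | Push-based (to endpoints/SQS) |
| Retention | Up to 14 days (default 4 days) | No retention (fire-and-forget) |
| Ordering | Standard (best-effort) or FIFO | Standard (no ordering guarantee) or FIFO topics |
| Use case | Decoupling services, work queues | Fan-out, broadcast to many services |
| Pattern | SQS → one processor | SNS → multiple SQS/Lambda/HTTP |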
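
The difference between the two columns, and the common "SNS → multiple SQS" pattern, can be sketched without any AWS calls. `ToyQueue` and `ToyFanoutTopic` are hypothetical in-memory stand-ins, not the boto3 API.

```python
from collections import deque

class ToyQueue:
    """Point-to-point: consumers poll, and each message goes to one of them."""
    def __init__(self):
        self.messages = deque()

    def send(self, message):
        self.messages.append(message)

    def receive(self):
        return self.messages.popleft() if self.messages else None

class ToyFanoutTopic:
    """Pub-sub: every published message is pushed to every subscriber."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, target_queue):
        self.subscribers.append(target_queue)

    def publish(self, message):
        for target_queue in self.subscribers:
            target_queue.send(message)

# Fan-out: one topic feeding one queue per downstream service.
order_events = ToyFanoutTopic()
email_queue, inventory_queue = ToyQueue(), ToyQueue()
order_events.subscribe(email_queue)
order_events.subscribe(inventory_queue)
order_events.publish("OrderPlaced:42")

email_copy = email_queue.receive()
inventory_copy = inventory_queue.receive()

# Competing consumers: two workers polling the same queue split the work.
email_queue.send("job-1")
email_queue.send("job-2")
worker_a = email_queue.receive()
worker_b = email_queue.receive()
```

Each service gets its own copy of the event through its own queue, while workers inside one service share that queue and never see the same message twice in this simplified model.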

Backpressure

Backpressure is the mechanism by which a slow consumer signals upstream to slow down production, preventing queue overflow and OOM crashes.

Backpressure Strategies

In Kafka specifically:

Kafka Producer Backpressure:
  - buffer.memory: total bytes the producer can buffer (default 32MB)
  - max.block.ms: how long producer blocks when buffer is full (default 60s)
  - If buffer fills and max.block.ms exceeded → TimeoutException
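
The block-then-fail behaviour described above can be imitated with a bounded queue. This is a sketch under simplifying assumptions: `ToyProducer` is hypothetical, the buffer is counted in slots rather than bytes (Kafka's `buffer.memory` is a byte budget), and Python's built-in `TimeoutError` stands in for Kafka's `TimeoutException`.

```python
import queue

class ToyProducer:
    """Bounded send buffer that blocks for a while, then gives up."""

    def __init__(self, buffer_slots, max_block_s):
        self.buffer = queue.Queue(maxsize=buffer_slots)  # stand-in for buffer.memory
        self.max_block_s = max_block_s                   # stand-in for max.block.ms

    def send(self, record):
        try:
            # Blocks while the buffer is full, up to max_block_s seconds.
            self.buffer.put(record, timeout=self.max_block_s)
        except queue.Full:
            raise TimeoutError(
                f"buffer still full after {self.max_block_s}s"
            ) from None

    def drain_one(self):
        """Simulate the background sender shipping one record to the broker."""
        return self.buffer.get_nowait()

producer = ToyProducer(buffer_slots=2, max_block_s=0.05)
producer.send("a")
producer.send("b")
try:
    producer.send("c")   # buffer is full and nothing drains it
    timed_out = False
except TimeoutError:
    timed_out = True

shipped = producer.drain_one()  # broker catches up, freeing a slot
producer.send("c")              # now succeeds
```

The caller feels the slowdown directly: `send` stalls first and only raises once the block time is exceeded, which is how the pressure reaches the producing code.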

Kafka Consumer Lag:
  - Monitor: consumer lag = (log end offset) - (committed consumer offset), per partition
  - Alert when lag > threshold (e.g., > 100,000 messages)
  - Remedy: add more consumers (up to partition count)
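
The lag arithmetic above is small enough to work through directly. In this sketch the offsets are made-up sample numbers, the alert compares total lag across partitions to the threshold (alerting per partition is an equally valid choice), and the function names are hypothetical.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }

def consumers_to_add(current_consumers, partition_count):
    """Extra consumers that would still be assigned a partition."""
    return max(0, partition_count - current_consumers)

LAG_THRESHOLD = 100_000  # example threshold from the text

# Sample numbers (illustrative only): partition -> offset
log_end = {0: 150_000, 1: 90_000, 2: 60_000}
committed = {0: 20_000, 1: 85_000, 2: 60_000}

lag = consumer_lag(log_end, committed)
total_lag = sum(lag.values())
alert = total_lag > LAG_THRESHOLD
headroom = consumers_to_add(current_consumers=1, partition_count=len(log_end))

print(lag, total_lag, alert, headroom)
```

With one consumer on three partitions there is room for two more. A fourth would sit idle, which is why the remedy is capped at the partition count.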

Delivery Guarantees

One of the most tested message queue concepts in FAANG interviews.

Making At-Least-Once safe with Idempotency:

Problem: Consumer processes message → crashes before ACK → broker redelivers → duplicate!

Solution: Idempotent Consumer
  - Add a unique message_id to every event
  - Consumer checks: "Have I processed message_id X before?"
    - YES → skip (already done)
    - NO  → process + record message_id in DB/Redis
  - Redis: SET message_id 1 NX EX 86400  ← atomic check-and-set with a 24h TTL
    (SETNX alone does not accept an expiry argument)
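
The idempotent-consumer steps above can be sketched with an in-memory store that mimics an atomic "set if absent, with expiry" operation. `DedupStore`, `handle`, and the event fields are hypothetical; a real deployment would use Redis or a database instead of a Python dict.

```python
import time

class DedupStore:
    """In-memory stand-in for an atomic set-if-absent with TTL."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}    # message_id -> expiry timestamp
        self._clock = clock

    def set_if_absent(self, key, ttl_seconds):
        """Return True if the key was newly recorded, False if already present."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True

def handle(event, store, charges):
    """Process an event at most once per message_id within the TTL window."""
    if not store.set_if_absent(event["message_id"], ttl_seconds=86_400):
        return "skipped"                 # duplicate delivery
    charges.append(event["amount"])      # placeholder for the real side effect
    return "processed"

store = DedupStore()
charges = []
event = {"message_id": "evt-123", "amount": 25}

first = handle(event, store, charges)    # original delivery
second = handle(event, store, charges)   # broker redelivers after a missed ACK
```

One trade-off to be aware of: this sketch records the ID before doing the work, so a crash between the two steps would leave the message marked as done. Where that matters, recording the ID in the same database transaction as the side effect closes the gap.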

Dead Letter Queue (DLQ)

When a message fails repeatedly, it's moved to a DLQ instead of blocking the main queue.
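
A minimal sketch of that retry-then-park behaviour follows. The limit of 3 attempts, the `consume` function, and the in-memory queues are assumptions for illustration; real brokers expose this through their own settings (for example a redrive policy or a dead-letter exchange).

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry limit

def consume(main_queue, dead_letters, handler, max_attempts=MAX_ATTEMPTS):
    """Process (message, attempts) pairs; park repeated failures in the DLQ."""
    done = []
    while main_queue:
        message, attempts = main_queue.popleft()
        try:
            handler(message)
            done.append(message)
        except Exception as error:
            attempts += 1
            if attempts >= max_attempts:
                # Give up: keep the message and the reason for later inspection.
                dead_letters.append((message, str(error)))
            else:
                # Requeue at the back so other messages are not blocked.
                main_queue.append((message, attempts))
    return done

def handler(message):
    """Hypothetical handler that always fails on one poison message."""
    if message == "bad":
        raise ValueError("cannot parse")

main_queue = deque([("bad", 0), ("ok-1", 0), ("ok-2", 0)])
dead_letters = []
done = consume(main_queue, dead_letters, handler)
print(done, dead_letters)
```

The poison message is tried three times, the healthy messages behind it still complete, and the failure ends up in `dead_letters` with its error text so it can be inspected and replayed after a fix.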


Worked Example: Order Processing System

Design decisions explained:

| Decision | Choice | Reason |
| --- | --- | --- |
| Partition key | order_id | All events for same order → same partition → ordered processing |
| Partition count | 3 (matches consumer count) | 1 partition per consumer = max parallelism |
| Delivery guarantee | At-least-once + idempotent consumers | Payment service deduplicates by event_id |
| Replication factor | 3 | No data loss with up to 2 broker failures |
| Retention | 7 days | Allows replaying events for new services |
| DLQ | order-events-dlq topic | Failed payment events don't block other orders |
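
The partition-count decision is easier to see with a simplified assignment function. This sketch uses plain round-robin, whereas Kafka's real assignors (range, round-robin, sticky) differ in detail. The consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over the consumers of one group (simplified)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

partitions = [0, 1, 2]

matched = assign_partitions(partitions, ["payment-1", "payment-2", "payment-3"])
too_few = assign_partitions(partitions, ["payment-1", "payment-2"])
too_many = assign_partitions(
    partitions, ["payment-1", "payment-2", "payment-3", "payment-4"]
)

print(matched)
print(too_few)
print(too_many)
```

With three consumers each one owns a partition. With two, one consumer carries double the load. With four, the extra consumer receives nothing, because a partition is never shared within a group.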

Choosing the Right System


Interview Cheat Sheet

One-Line Summaries

Kafka:       Distributed commit log — high throughput, replayable, event streaming
RabbitMQ:    Message broker — rich routing, task queues, lower throughput
SQS:         Managed AWS queue — point-to-point, pull-based, zero ops
SNS:         Managed AWS pub-sub — fan-out broadcast to many subscribers
Decoupling:  Producer doesn't know consumers exist → independent failure & scaling
Backpressure: Consumer slowing down signals producer to slow down → prevent OOM
At-most-once:  May lose messages (metrics, logs)
At-least-once: May duplicate (use idempotent consumer to fix)
Exactly-once:  Kafka idempotent producer + transactions (added overhead; covers Kafka-to-Kafka flows, external side effects still need idempotency)
DLQ:         Failed messages parked for inspection — never block the main queue

The Interview Phrase

"After the user places an order, the Order API publishes an
 OrderPlaced event to a Kafka topic and returns 200 immediately.
 Three independent consumer groups — Inventory, Email, and Payment —
 each consume the event at their own pace. I partition by order_id
 so all events for an order are ordered within a partition. Each
 consumer is idempotent (deduplicates by event_id) to safely handle
 Kafka's at-least-once delivery. Failed events go to a DLQ topic
 after 3 retries so they can be replayed after a bug fix."

Red Flags vs. Green Flags

| 🔴 Red Flag | 🟢 Green Flag |
| --- | --- |
| Use Kafka for simple task queues | Use SQS/RabbitMQ for simple queues; Kafka for event streams |
| Ignore idempotency with at-least-once | Always discuss idempotent consumer pattern |
| No DLQ mentioned | Always add DLQ for failed messages |
| Forget about consumer lag / backpressure | Monitor lag; auto-scale consumers when lag grows |
| Use one partition for ordered processing | Partition by entity key (order_id) for parallelism + ordering |
| Treat SNS and SQS as the same thing | SNS = fan-out pub-sub; SQS = point-to-point queue |

IMPORTANT

Always mention idempotency when discussing at-least-once delivery. Say: "Since Kafka delivers at-least-once, my consumer deduplicates by event_id: it atomically records each event_id in Redis with a 24h TTL and skips any event whose ID is already recorded."

TIP

Mentioning consumer lag monitoring (e.g., via CloudWatch or Kafka's consumer group lag metric) and auto-scaling consumers when lag exceeds a threshold is a strong senior signal.

Released under the ISC License.