🛒 E-Commerce Platform — Microservices Architecture & System Design

Difficulty: Advanced | Category: Write-Heavy, Event-Driven | Similar Systems: Amazon, Daraz, Shopify, Shopee

An end-to-end case study of how a production e-commerce platform handles millions of orders — covering domain decomposition, key user flows, distributed transactions, resilience patterns, and deployment infrastructure.

📚 Table of Contents

Requirements Clarification
Back-of-the-Envelope Estimation
API Design
Monolith vs Microservices — Why Switch?
High-Level Architecture Overview
Domain Breakdown — Every Service Explained
Key User Flows — Step-by-Step
Communication Patterns
Data Management — Database per Service
Database Schema — Deep Dive
Resilience Patterns
Infrastructure & Deployment
Trade-offs & Lessons Learned

1. Requirements Clarification

Functional Requirements

Browse & Search: Users can search and filter products by keyword, category, price, brand, and rating.
Place Orders: Users can add items to a cart and place orders with address and payment details.
Payment: Support multiple methods — card, mobile wallet (bKash), cash on delivery.
Inventory: Accurately track stock levels; prevent overselling even under concurrent purchases.
Seller Portal: Vendors can list products, manage inventory, and view payouts.
Order Tracking: Users can track their package from "Placed" → "Shipped" → "Delivered".
Notifications: Real-time SMS/email for order updates, payment receipts, and promotions.

Non-Functional Requirements

High Availability: Target 99.9% uptime; no single point of failure.
Low Latency: Product pages < 100ms; order placement < 500ms end-to-end.
Write-Heavy under bursts: 10× normal traffic during flash sales.
Consistency: Payment and inventory must be strongly consistent. Notifications can be eventual.
Scalability: Must scale individual bottlenecks (search, payments) without scaling everything.

2. Back-of-the-Envelope Estimation

Assume a mid-size marketplace like Daraz Bangladesh.

text

Daily Active Users (DAU):     5 million
Sellers:                      50,000
Products in catalog:          10 million
Orders per day (normal):      200,000
Orders per day (flash sale):  2,000,000 (10× spike)
Read-to-Write ratio:          ~100:1 (browsing >> purchasing)

Traffic Estimates

Operation	Normal (req/sec)	Peak Flash Sale
Product page views	~2,300	~23,000
Search queries	~580	~5,800
Order placements	~2.3	~23
Payment transactions	~2.3	~23

TIP

The read-to-write ratio of 100:1 tells us to invest heavily in caching (Redis) and read replicas for product and search data, while keeping transactional services (Order, Payment, Inventory) consistent and strongly isolated.
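The per-second figures in the table come from dividing daily volumes by the 86,400 seconds in a day. Here is a minimal sketch of that arithmetic. The per-user activity numbers (40 page views and 10 searches per day) are assumptions chosen to reproduce the table, not figures stated above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

DAU = 5_000_000
ORDERS_PER_DAY = 200_000
FLASH_SALE_MULTIPLIER = 10

# Assumed per-user activity (hypothetical; picked to match the table above).
PAGE_VIEWS_PER_USER = 40
SEARCHES_PER_USER = 10


def per_second(daily_total: float) -> float:
    """Average requests per second for a given daily volume."""
    return daily_total / SECONDS_PER_DAY


normal_rates = {
    "product_page_views": per_second(DAU * PAGE_VIEWS_PER_USER),
    "search_queries": per_second(DAU * SEARCHES_PER_USER),
    "order_placements": per_second(ORDERS_PER_DAY),
}
peak_rates = {
    name: rate * FLASH_SALE_MULTIPLIER for name, rate in normal_rates.items()
}

for name, rate in normal_rates.items():
    print(f"{name}: normal ~{rate:,.1f}/s, peak ~{peak_rates[name]:,.1f}/s")
```

These are daily averages. Traffic is rarely uniform across the day, so capacity planning would normally add headroom on top of them.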

Storage Estimates

text

Per Order record:    ~2 KB (items, address, status history)
Orders per year:     200,000/day × 365 = 73 million
Order storage/year:  73M × 2 KB = ~146 GB

Per Product record:  ~5 KB (name, description, images metadata)
Total catalog:       10M × 5 KB = ~50 GB in PostgreSQL
Images:              10M × 5 images × 200 KB = ~10 TB in S3/CDN

Kafka event log:     ~500 events/sec × 1 KB = ~500 KB/sec ≈ 1.8 GB/hour retained

NOTE

146 GB/year for orders is very manageable. The real challenge is the 10 TB of images (solved by S3 + CDN) and the 10× traffic spikes (solved by auto-scaling + circuit breakers).
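The order, catalog, and image totals above can be checked with a few lines of arithmetic. This sketch uses decimal units (1 KB = 1,000 bytes), which is an assumption. The estimate does not say whether it means decimal or binary units.

```python
KB = 1_000
GB = 1_000_000_000
TB = 1_000_000_000_000

orders_per_year = 200_000 * 365                  # 73 million orders
order_bytes_per_year = orders_per_year * 2 * KB  # ~2 KB per order record
catalog_bytes = 10_000_000 * 5 * KB              # ~5 KB per product record
image_bytes = 10_000_000 * 5 * 200 * KB          # 5 images x ~200 KB per product

print(f"Orders per year: {order_bytes_per_year / GB:.0f} GB")
print(f"Product catalog: {catalog_bytes / GB:.0f} GB")
print(f"Product images:  {image_bytes / TB:.0f} TB")
```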

3. API Design

All APIs are versioned under /api/v1/. The API Gateway enforces JWT auth before forwarding to services.

Core REST Endpoints

Product Service

http

GET  /api/v1/products?category=shoes&brand=Nike&minPrice=500&page=1
GET  /api/v1/products/:productId
POST /api/v1/products                    (Seller auth required)
PUT  /api/v1/products/:productId         (Seller auth required)

Order Service

http

POST /api/v1/orders

Request body:

json

{
  "items": [{ "productId": "P-101", "quantity": 2, "price": 1200 }],
  "shippingAddressId": "ADDR-55",
  "paymentMethod": "bkash",
  "idempotencyKey": "uuid-client-generated-key"
}

Response (201 Created):

json

{
  "orderId": "ORD-2024-00123",
  "status": "PENDING_PAYMENT",
  "totalAmount": 2400,
  "estimatedDelivery": "2024-12-18"
}

http

GET    /api/v1/orders/:orderId            (Customer: own orders)
GET    /api/v1/orders?userId=42&status=SHIPPED
DELETE /api/v1/orders/:orderId            (Cancel, if still PENDING_PAYMENT)

Inventory Service (internal only — not exposed to clients)

http

POST /internal/v1/inventory/reserve      (Called by Order Service)
POST /internal/v1/inventory/release      (Called on order cancel/fail)
GET  /internal/v1/inventory/:productId   (Check stock level)

Payment Controller

http

POST /api/v1/payments

Request:

json

{
  "orderId": "ORD-2024-00123",
  "method": "bkash",
  "amount": 2400,
  "idempotencyKey": "same-uuid-as-order-request"
}

Response (200 OK):

json

{
  "transactionId": "TXN-BKash-789XYZ",
  "status": "SUCCESS",
  "gatewayRef": "BKASH-TRX-2024"
}

Order State Machine

IMPORTANT

The idempotencyKey on both Order and Payment endpoints is critical. If a mobile client's request times out and it retries, the server uses this key (stored in Redis for 24 hours) to detect the duplicate and return the original response instead of charging the customer twice.
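The duplicate-detection logic can be sketched as follows. This is a minimal illustration, not the platform's actual code. `IdempotencyStore` is a hypothetical in-memory stand-in for Redis, and `handle_payment` and `fake_charge` are invented names.

```python
import time
from typing import Callable, Dict, Optional, Tuple

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as described above


class IdempotencyStore:
    """In-memory stand-in for Redis: key -> (expiry time, saved response)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat the key as unseen
            return None
        return response

    def put(self, key: str, response: dict) -> None:
        expires_at = self._clock() + IDEMPOTENCY_TTL_SECONDS
        self._entries[key] = (expires_at, response)


def handle_payment(store: IdempotencyStore, request: dict,
                   charge: Callable[[dict], dict]) -> dict:
    """Charge once per idempotency key; replay the saved response on retries."""
    key = request["idempotencyKey"]
    cached = store.get(key)
    if cached is not None:
        return cached
    response = charge(request)
    store.put(key, response)
    return response


charges = []


def fake_charge(request: dict) -> dict:
    """Hypothetical gateway call that records each real charge."""
    charges.append(request["orderId"])
    return {"transactionId": f"TXN-{len(charges)}", "status": "SUCCESS"}


store = IdempotencyStore()
request = {"orderId": "ORD-2024-00123", "amount": 2400,
           "idempotencyKey": "uuid-client-generated-key"}
first = handle_payment(store, request, fake_charge)
retry = handle_payment(store, request, fake_charge)  # simulated client retry
print(first == retry, len(charges))
```

The sketch is single-threaded. With real Redis, two concurrent retries could both miss the cache, so a production version would likely claim the key atomically first (for example with `SET key value NX EX seconds`) before calling the gateway.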

4. Monolith vs Microservices

The Core Challenge: Requirements and scale are defined — now, why does a monolith fail here, and why are microservices the right answer?

Starting Point — The Monolith

In the beginning, the team builds everything as one big application. This is called a Monolith.

What goes wrong as you grow?

Problem	Real Impact
One bug in Payments crashes the entire site	Black Friday sale ruined
Need to scale Product search → must scale everything	Expensive & wasteful
Two teams editing the same codebase	Conflicts, slow releases
One database gets overloaded	Entire platform slows down

The Fix — Break It Into Microservices

Each business capability becomes its own independent service with its own database.

Key Benefits

Feature	Monolith	Microservices
Deploy	Redeploy entire app	Deploy only changed service
Scale	Scale everything	Scale only what needs it
Failure	One bug = site down	One service fails, rest work
Team	All devs in one repo	Each team owns their service
Technology	Same language/DB for all	Best tool for each job

5. High-Level Architecture Overview

This is the full picture of how all services connect. Clients never talk directly to services — they go through the API Gateway.

NOTE

The API Gateway is the single entry point. It handles JWT verification, rate limiting, and routes requests to the right service. Services never expose their ports to the public internet.

6. Domain Breakdown — Every Service Explained

🛍️ Domain 1: Core Commerce

This is the heartbeat of the platform — the critical path for every purchase.

Service	Responsibility	Database
Product Service	Product catalog, images, categories, attributes	PostgreSQL + Redis (cache)
Order Service	Create, update, cancel orders (state machine)	PostgreSQL
Inventory Service	Track stock, lock inventory on order	PostgreSQL + Redis (locks)
Payment Controller	Accept payment request, return response to client	Redis (idempotency keys)
Payment Backend	Talk to Stripe/bKash/SSLCommerz, handle webhooks	PostgreSQL

🔍 Domain 2: Search & Discovery

Dedicated domain for helping users find products fast.

Search Service: Simple keyword search, autocomplete
Advanced Search: Faceted filters (brand, price, rating), AI-powered ranking, personalized results

TIP

Products are indexed into Elasticsearch asynchronously via Kafka events. When a seller updates a product, the Product Service publishes a product.updated event → Search Service consumes it and re-indexes.

🏪 Domain 3: Seller Domain

Everything the vendor/seller needs to manage their store.

Real-World Example: A seller uploads a CSV file with 5,000 products.

Bulk Uploader Service accepts the file and puts it in a job queue
Seller Worker picks up the job, validates rows, and publishes events to Kafka
Product Service consumes events and creates products
Seller gets an email notification when done
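The validation step in the Seller Worker might look like the sketch below. The CSV column names and the `product.create.requested` event type are hypothetical, since the upload format is not specified above. The function returns the events instead of publishing them to Kafka, so it stays self-contained.

```python
import csv
import io
from decimal import Decimal, InvalidOperation

REQUIRED_COLUMNS = ("name", "category_id", "base_price")  # assumed CSV layout


def validate_rows(csv_text: str):
    """Split CSV rows into product events and (line number, error) pairs."""
    events, errors = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        missing = [c for c in REQUIRED_COLUMNS if not (row.get(c) or "").strip()]
        if missing:
            errors.append((line_no, f"missing: {', '.join(missing)}"))
            continue
        try:
            price = Decimal(row["base_price"].strip())
        except InvalidOperation:
            errors.append((line_no, "base_price is not a number"))
            continue
        if not price.is_finite() or price <= 0:
            errors.append((line_no, "base_price must be positive"))
            continue
        events.append({
            "type": "product.create.requested",  # hypothetical event name
            "name": row["name"].strip(),
            "category_id": row["category_id"].strip(),
            "base_price": str(price),
        })
    return events, errors


sample = (
    "name,category_id,base_price\n"
    "Running Shoes,CAT-1,1200\n"
    ",CAT-1,500\n"
    "Sandals,CAT-2,abc\n"
    "Boots,CAT-2,-5\n"
)
events, errors = validate_rows(sample)
print(len(events), "valid rows;", len(errors), "rejected rows")
```

Collecting errors per row rather than failing the whole file lets the completion email tell the seller which lines to fix.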

👥 Domain 4: Customer & Support

Manages the buyer side and customer service operations.

Service	Responsibility
Customer Service Backend	User profiles, address book, order history, preferences
CS Backend Service	Support tickets, agent tools, resolution workflows
Chat Backend Service	Real-time chat (WebSocket) between customers and support/sellers

🚚 Domain 5: Operations & Logistics

Handles the physical movement of goods and platform security.

Fraud Engine: Every order runs through ML-based risk scoring before payment is processed
3PL Integration: Connects with external delivery partners via their APIs

💰 Domain 6: Finance & Reporting

The money flow and business intelligence layer.

⚙️ Domain 7: Infrastructure (Cross-cutting Concerns)

Services that every other domain depends on.

7. Key User Flows — Step-by-Step

🛒 Flow 1: User Places an Order

This is the most critical flow. Let's trace every step.

What happens in plain English:

User submits order → Gateway verifies their login token
Order Service starts the workflow
Inventory is locked (reserved) so no one else can buy the last item
Fraud Engine scores the order for risk
Payment is charged
Order is confirmed and an event is published to Kafka
Multiple services react independently and asynchronously: SMS sent, logistics notified, accounts updated

💳 Flow 2: Payment Processing

IMPORTANT

Idempotency is critical here. If the user clicks "Pay" twice, the system must not charge them twice. The Payment Controller stores a unique idempotency key in Redis for each request; when the same key arrives again, it skips the charge and returns the original response.

📦 Flow 3: Real-Time Inventory Update

Why Redis for inventory?

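DECR in Redis is atomic, so even with 1,000 concurrent buyers no decrement is lost. DECR alone can still push the counter below 0, so the service must check the returned value and undo the decrement (or use a Lua script that checks before decrementing) when stock would go negative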
PostgreSQL is updated asynchronously for durability
This prevents the "overselling problem"
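A minimal sketch of a reservation built on an atomic counter follows. `FakeRedisCounter` is a hypothetical in-memory stand-in for a single Redis integer key, not a real Redis client. The point it illustrates is that an atomic decrement needs a check and a compensating increment to keep stock from going negative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeRedisCounter:
    """In-memory stand-in for one Redis integer key (not a real client)."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._lock = threading.Lock()

    def decrby(self, amount: int) -> int:
        with self._lock:  # each call is atomic, like a single Redis command
            self._value -= amount
            return self._value

    def incrby(self, amount: int) -> int:
        with self._lock:
            self._value += amount
            return self._value

    def get(self) -> int:
        with self._lock:
            return self._value


def reserve(stock: FakeRedisCounter, quantity: int) -> bool:
    """Decrement first, then undo if the counter went negative."""
    remaining = stock.decrby(quantity)
    if remaining < 0:
        stock.incrby(quantity)  # compensate: this buyer gets nothing
        return False
    return True


# 1,000 concurrent buyers compete for 100 units, one unit each.
stock = FakeRedisCounter(100)
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(lambda _: reserve(stock, 1), range(1000)))

successes = sum(results)
print(successes, stock.get())
```

With mixed quantities, the decrement-then-compensate approach can briefly reject a buyer who would have fit, though it never oversells. A Lua script that checks the value before decrementing avoids that, at the cost of a slightly more complex call.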

🔔 Flow 4: Notification Delivery

The Notification Service is a pure consumer — it only listens to Kafka events and never gets called directly by other services. This means you can update, restart, or even replace it without affecting the Order or Payment flow.
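The decoupling described above can be illustrated with a small in-process event bus. This is a sketch only. `EventBus` is a hypothetical stand-in for Kafka, the `order.confirmed` topic name is an assumption, and real Kafka delivery is asynchronous and durable rather than a synchronous loop.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class EventBus:
    """In-process stand-in for a Kafka topic with independent consumers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.failures: List[str] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The publisher knows nothing about who is listening.
        for handler in self._subscribers[topic]:
            try:
                handler(event)
            except Exception as exc:
                # A failing consumer must not affect the others.
                self.failures.append(f"{handler.__name__}: {exc}")


sms_sent: List[str] = []
shipments: List[str] = []


def send_sms(event: dict) -> None:
    sms_sent.append(f"Order {event['orderId']} confirmed")


def notify_logistics(event: dict) -> None:
    shipments.append(event["orderId"])


def broken_consumer(event: dict) -> None:
    raise RuntimeError("consumer crashed")


bus = EventBus()
bus.subscribe("order.confirmed", broken_consumer)
bus.subscribe("order.confirmed", send_sms)
bus.subscribe("order.confirmed", notify_logistics)
bus.publish("order.confirmed", {"orderId": "ORD-2024-00123"})
print(sms_sent, shipments, bus.failures)
```

The Order Service side is only the `publish` call. Adding, removing, or crashing a consumer changes nothing in the publisher.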

8. Communication Patterns

The platform uses two types of communication between services: synchronous request/response (REST or gRPC) and asynchronous events (Kafka).

When to use each?

Scenario	Pattern	Why
Check if email exists during signup	REST	Need immediate yes/no answer
Place an order	REST (for sync steps) + Kafka (for downstream)	Need confirmation, but logistics/notifications can be async
Bulk product reindex after catalog update	Kafka	Fire-and-forget, Search Service handles it when ready
Inter-service calls needing response	gRPC	Strongly typed, low latency, internal network

9. Data Management — Database per Service

Each service owns its own data. No service reads another service's database directly.

Technology Choices

Service	Database	Reason
Order, Payment, Product	PostgreSQL	ACID transactions, relational data
Session, Cart, Rate Limits	Redis	Sub-millisecond reads, TTL support
Product Search	Elasticsearch	Full-text search, facets, ranking
Logs & Metrics	Elasticsearch + Kibana	Log analysis, dashboards
Job Queues	Redis / Kafka	Reliable async task processing

The Saga Pattern — Distributed Transactions

Problem: An order involves multiple services. What if inventory is reserved and the fraud check passes, but the payment then fails?

The Saga Pattern solves this by defining a sequence of local transactions, each with a compensating transaction to undo if something fails downstream.
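A minimal orchestration-style sketch of the idea is shown below. Each step pairs a local action with its compensation, and a failure triggers the compensations of the completed steps in reverse order. The step functions and the `ctx` dictionary are hypothetical; a real saga would call the Inventory and Payment services and persist its progress.

```python
from typing import Callable, List, Tuple

# (step name, action, compensation)
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: List[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[Tuple[str, Callable[[dict], None]]] = []
    for name, action, compensate in steps:
        try:
            action(ctx)
            completed.append((name, compensate))
        except Exception as exc:
            ctx["log"].append(f"{name} failed: {exc}")
            for done_name, undo in reversed(completed):
                undo(ctx)
                ctx["log"].append(f"compensated {done_name}")
            return False
    return True


def reserve_inventory(ctx: dict) -> None:
    ctx["stock"] -= ctx["quantity"]


def release_inventory(ctx: dict) -> None:
    ctx["stock"] += ctx["quantity"]


def charge_payment(ctx: dict) -> None:
    raise RuntimeError("gateway declined")  # simulated payment failure


def refund_payment(ctx: dict) -> None:
    ctx["charged"] = False


ctx = {"stock": 5, "quantity": 2, "log": []}
ok = run_saga(
    [
        ("reserve_inventory", reserve_inventory, release_inventory),
        ("charge_payment", charge_payment, refund_payment),
    ],
    ctx,
)
print(ok, ctx["stock"], ctx["log"])
```

Only steps that completed are compensated. The failed payment step never ran to completion, so no refund is issued.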

10. Database Schema — Deep Dive

Each service owns its schema completely. Below are the key table definitions for the most critical services.

Orders DB (PostgreSQL)

sql

-- Core order record
CREATE TABLE orders (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id       UUID NOT NULL,
    status        VARCHAR(30) NOT NULL DEFAULT 'PENDING_PAYMENT',
    -- PENDING_PAYMENT | CONFIRMED | PROCESSING | SHIPPED | DELIVERED
    -- CANCELLED | RETURN_REQUESTED | REFUNDED | FAILED
    total_amount  NUMERIC(12, 2) NOT NULL,
    currency      CHAR(3) NOT NULL DEFAULT 'BDT',
    idempotency_key VARCHAR(128) UNIQUE NOT NULL,   -- prevents duplicate orders
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Line items for each order
CREATE TABLE order_items (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id     UUID NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
    product_id   UUID NOT NULL,
    seller_id    UUID NOT NULL,
    quantity     INT NOT NULL CHECK (quantity > 0),
    unit_price   NUMERIC(10, 2) NOT NULL,
    total_price  NUMERIC(12, 2) GENERATED ALWAYS AS (quantity * unit_price) STORED
);

-- Full audit trail of every status change
CREATE TABLE order_status_history (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id   UUID NOT NULL REFERENCES orders(id),
    old_status VARCHAR(30),
    new_status VARCHAR(30) NOT NULL,
    reason     TEXT,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Indexes for common query patterns
CREATE INDEX idx_orders_user_id   ON orders(user_id);
CREATE INDEX idx_orders_status    ON orders(status);
CREATE INDEX idx_order_items_order ON order_items(order_id);
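The status values listed in the `orders` table comment imply a state machine. The sketch below shows one way the Order Service could enforce it. The transition map is an assumption, since the allowed transitions are not spelled out above; it only follows the statuses named in the schema and the rule that cancellation applies to orders still awaiting payment.

```python
from enum import Enum


class OrderStatus(str, Enum):
    PENDING_PAYMENT = "PENDING_PAYMENT"
    CONFIRMED = "CONFIRMED"
    PROCESSING = "PROCESSING"
    SHIPPED = "SHIPPED"
    DELIVERED = "DELIVERED"
    CANCELLED = "CANCELLED"
    RETURN_REQUESTED = "RETURN_REQUESTED"
    REFUNDED = "REFUNDED"
    FAILED = "FAILED"


S = OrderStatus

# Assumed transition map (hypothetical; adjust to the real business rules).
ALLOWED_TRANSITIONS = {
    S.PENDING_PAYMENT: {S.CONFIRMED, S.CANCELLED, S.FAILED},
    S.CONFIRMED: {S.PROCESSING},
    S.PROCESSING: {S.SHIPPED},
    S.SHIPPED: {S.DELIVERED},
    S.DELIVERED: {S.RETURN_REQUESTED},
    S.RETURN_REQUESTED: {S.REFUNDED},
    S.CANCELLED: set(),
    S.REFUNDED: set(),
    S.FAILED: set(),
}


class InvalidTransition(Exception):
    """Raised when a status change is not allowed by the state machine."""


def transition(current: OrderStatus, new: OrderStatus) -> OrderStatus:
    """Return the new status, or raise if the change is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {new.value}")
    return new


status = S.PENDING_PAYMENT
for target in (S.CONFIRMED, S.PROCESSING, S.SHIPPED, S.DELIVERED):
    status = transition(status, target)
print(status.value)
```

Each successful call to `transition` corresponds to one row in `order_status_history`, with `old_status` and `new_status` taken from its arguments.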

Products DB (PostgreSQL)

sql

CREATE TABLE products (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    seller_id    UUID NOT NULL,
    name         VARCHAR(500) NOT NULL,
    description  TEXT,
    category_id  UUID NOT NULL,
    base_price   NUMERIC(10, 2) NOT NULL,
    status       VARCHAR(20) NOT NULL DEFAULT 'ACTIVE',  -- ACTIVE | INACTIVE | DELETED
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Flexible key-value attributes (color, size, material, etc.)
CREATE TABLE product_attributes (
    product_id   UUID NOT NULL REFERENCES products(id),
    attr_key     VARCHAR(100) NOT NULL,
    attr_value   VARCHAR(500) NOT NULL,
    PRIMARY KEY (product_id, attr_key)
);

-- One row per image; only the CDN URL is stored (actual files live in S3)
CREATE TABLE product_images (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id   UUID NOT NULL REFERENCES products(id),
    image_url    TEXT NOT NULL,        -- CDN URL
    is_primary   BOOLEAN DEFAULT FALSE,
    display_order INT DEFAULT 0
);

CREATE INDEX idx_products_seller   ON products(seller_id);
CREATE INDEX idx_products_category ON products(category_id);
CREATE INDEX idx_products_status   ON products(status);

Inventory DB (PostgreSQL + Redis)

sql

CREATE TABLE inventory (
    product_id        UUID PRIMARY KEY,
    warehouse_id      UUID NOT NULL,
    total_stock       INT NOT NULL DEFAULT 0 CHECK (total_stock >= 0),
    reserved_stock    INT NOT NULL DEFAULT 0 CHECK (reserved_stock >= 0),
    -- available = total_stock - reserved_stock
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Log every reservation and release for auditability
CREATE TABLE inventory_transactions (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id   UUID NOT NULL,
    order_id     UUID,
    type         VARCHAR(20) NOT NULL,  -- RESERVE | RELEASE | RESTOCK | ADJUST
    quantity     INT NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Redis Key Pattern for Hot Inventory:

text

Key:   inventory:{productId}
Type:  String (integer)
Value: Available stock count (available = total - reserved)
TTL:   None (permanent, synced from PostgreSQL)

Operation: DECRBY inventory:P-101 2   → atomic, thread-safe reservation
           INCRBY inventory:P-101 2   → atomic release on cancel

NOTE

The Redis count is the real-time source of truth for availability checks. PostgreSQL is the durable source of truth for the actual stock ledger. A background job reconciles the two every 60 seconds and corrects any discrepancy it finds.

Payments DB (PostgreSQL)

sql

CREATE TABLE payments (
    id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id          UUID NOT NULL,
    method            VARCHAR(30) NOT NULL,  -- bkash | card | cod | sslcommerz
    amount            NUMERIC(12, 2) NOT NULL,
    currency          CHAR(3) NOT NULL DEFAULT 'BDT',
    status            VARCHAR(20) NOT NULL,  -- PENDING | SUCCESS | FAILED | REFUNDED
    gateway_txn_id    VARCHAR(200),          -- External gateway reference
    idempotency_key   VARCHAR(128) UNIQUE NOT NULL,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    settled_at        TIMESTAMPTZ            -- When funds confirmed by gateway
);

CREATE INDEX idx_payments_order_id ON payments(order_id);
-- Unique partial index: only one SUCCESS payment per order
CREATE UNIQUE INDEX idx_payments_order_success
    ON payments(order_id) WHERE status = 'SUCCESS';
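The effect of the partial unique index can be demonstrated with a small runnable sketch. It uses SQLite (which also supports partial indexes) as a stand-in for PostgreSQL, with a cut-down `payments` table. The `record_payment` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payments (
        id       INTEGER PRIMARY KEY,
        order_id TEXT NOT NULL,
        status   TEXT NOT NULL
    );
    -- Same idea as the PostgreSQL index above: uniqueness applies
    -- only to rows where status = 'SUCCESS'.
    CREATE UNIQUE INDEX idx_payments_order_success
        ON payments(order_id) WHERE status = 'SUCCESS';
    """
)


def record_payment(order_id: str, status: str) -> bool:
    """Insert a payment row; return False if the index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (order_id, status) VALUES (?, ?)",
                (order_id, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    record_payment("ORD-1", "FAILED"),   # failed attempts may repeat
    record_payment("ORD-1", "FAILED"),
    record_payment("ORD-1", "SUCCESS"),  # first success is accepted
    record_payment("ORD-1", "SUCCESS"),  # second success is rejected
    record_payment("ORD-2", "SUCCESS"),  # other orders are unaffected
]
print(results)
```

The index acts as a last line of defense: even if an application-level idempotency check is bypassed, the database refuses a second successful payment for the same order.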

11. Resilience Patterns

Circuit Breaker

Prevents a slow/failing service from crashing everything else.

Real Example: Payment gateway is slow during peak hours.

Without circuit breaker: Every order hangs for 30 seconds, threads pile up, site crashes
With circuit breaker: After 5 failures, the breaker opens → Orders get "Payment temporarily unavailable, try again" within milliseconds
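A minimal circuit breaker along these lines is sketched below. The threshold of 5 failures comes from the example above. The 30-second reset timeout, the class and method names, and the half-open behavior are assumptions. The sketch is not thread-safe.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised immediately while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Fail fast instead of waiting on a struggling dependency.
                raise CircuitOpenError("Payment temporarily unavailable, try again")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def slow_gateway():
    """Hypothetical payment gateway call that always times out."""
    raise TimeoutError("gateway timeout")


breaker = CircuitBreaker()
for _ in range(5):
    try:
        breaker.call(slow_gateway)
    except TimeoutError:
        pass

try:
    breaker.call(slow_gateway)
except CircuitOpenError as exc:
    print(exc)  # returned without touching the gateway
```

Injecting the clock keeps the breaker testable without real waiting. Production libraries usually add per-call timeouts and thread safety on top of this core logic.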

Rate Limiting at the API Gateway

Retry with Exponential Backoff

When a transient failure occurs (network hiccup), retry intelligently:

text

Attempt 1 → fails → wait 1s
Attempt 2 → fails → wait 2s
Attempt 3 → fails → wait 4s
Attempt 4 → success ✅
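The schedule above doubles the wait after each failure. A minimal sketch is shown below. The function name, its parameters, and the optional jitter are assumptions, and only errors listed in `retryable` are retried so that permanent failures surface immediately.

```python
import random
import time
from typing import Callable, Tuple, Type


def retry_with_backoff(
    operation: Callable,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    jitter: float = 0.0,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
):
    """Call operation, waiting 1s, 2s, 4s, ... between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller handle it
            delay = base_delay * 2 ** (attempt - 1)
            # Optional random jitter spreads out retries from many clients.
            delay += random.uniform(0, jitter * delay)
            sleep(delay)


attempts = []
delays = []


def flaky_call() -> str:
    """Hypothetical operation that fails three times, then succeeds."""
    attempts.append(1)
    if len(attempts) < 4:
        raise ConnectionError("network hiccup")
    return "success"


result = retry_with_backoff(flaky_call, sleep=delays.append)
print(result, delays)
```

Retries should only wrap operations that are safe to repeat, which is one more reason the order and payment endpoints carry an idempotency key.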

12. Infrastructure & Deployment

Container Orchestration with Kubernetes

Each microservice runs in its own Docker container, managed by Kubernetes.

CI/CD Pipeline

Observability Stack

Distributed Tracing Example: A single order request touches 6 services. Jaeger shows you the full call chain with timing for each hop — so you can pinpoint exactly which service is slow.

13. Trade-offs & Lessons Learned

The Good

Benefit	How It Helps
Independent Scaling	Scale Payment Service 10x during sale without touching others
Independent Deployment	Fix a bug in Notifications without redeploying the whole platform
Fault Isolation	Chat Service goes down → Orders still work perfectly
Tech Flexibility	Use Go for high-throughput services, Python for ML/fraud detection
Team Autonomy	6 teams work in parallel without blocking each other

The Hard Parts

Challenge	Solution
Distributed Transactions	Saga Pattern with compensating transactions
Data Consistency	Eventual consistency via Kafka events
Service Discovery	Kubernetes DNS or Consul
Debugging	Distributed tracing (Jaeger), correlation IDs on every request
Testing	Contract testing (Pact), integration tests per service
Operational Complexity	Kubernetes, Helm charts, centralized config (Vault/ConfigMap)

WARNING

Don't start with microservices. Many successful companies (such as Amazon, Netflix, and Uber) started as monoliths and evolved to microservices as the team and traffic scaled. Adopting microservices too early adds substantial operational complexity with little benefit at small scale.

🗺️ Summary — The Big Picture

This is the complete journey of a single order — touching 10+ microservices, using 4 external integrations, and processing 3 async Kafka events, all within a few seconds.

📖 Key Concepts Glossary

Term	Meaning
API Gateway	Single entry point that routes requests to the right service
Kafka	Distributed message bus for async event-driven communication
Saga Pattern	Distributed transaction handling with compensating rollbacks
Circuit Breaker	Pattern to stop calling a failing service and return fallback
Idempotency	Doing the same operation twice produces the same result (no double charges)
Eventual Consistency	Data will be consistent across services — just not instantly
Domain-Driven Design	Organizing services around business domains (Orders, Payments, etc.)
Service Mesh	Infrastructure layer (Istio) handling mTLS, tracing, retries between services
CQRS	Command Query Responsibility Segregation — separate read and write models
Database per Service	Each service owns its own database; no shared databases

