🛒 E-Commerce Platform: Microservices Architecture & System Design

Difficulty: Advanced | Category: Write-Heavy, Event-Driven | Similar Systems: Amazon, Daraz, Shopify, Shopee

An end-to-end case study of how a production e-commerce platform handles millions of orders, covering domain decomposition, key user flows, distributed transactions, resilience patterns, and deployment infrastructure.


📚 Table of Contents

  1. Requirements Clarification
  2. Back-of-the-Envelope Estimation
  3. API Design
  4. Monolith vs Microservices: Why Switch?
  5. High-Level Architecture Overview
  6. Domain Breakdown: Every Service Explained
  7. Key User Flows: Step-by-Step
  8. Communication Patterns
  9. Data Management: Database per Service
  10. Database Schema: Deep Dive
  11. Resilience Patterns
  12. Infrastructure & Deployment
  13. Trade-offs & Lessons Learned

1. Requirements Clarification ​

Functional Requirements ​

  • Browse & Search: Users can search and filter products by keyword, category, price, brand, and rating.
  • Place Orders: Users can add items to a cart and place orders with address and payment details.
  • Payment: Support multiple methods: card, mobile wallet (bKash), and cash on delivery.
  • Inventory: Accurately track stock levels; prevent overselling even under concurrent purchases.
  • Seller Portal: Vendors can list products, manage inventory, and view payouts.
  • Order Tracking: Users can track their package from "Placed" → "Shipped" → "Delivered".
  • Notifications: Real-time SMS/email for order updates, payment receipts, and promotions.

Non-Functional Requirements ​

  • High Availability: Target 99.9% uptime; no single point of failure.
  • Low Latency: Product pages < 100ms; order placement < 500ms end-to-end.
  • Write-Heavy under bursts: 10× normal traffic during flash sales.
  • Consistency: Payment and inventory must be strongly consistent. Notifications can be eventually consistent.
  • Scalability: Must scale individual bottlenecks (search, payments) without scaling everything.

2. Back-of-the-Envelope Estimation ​

Assume a mid-size marketplace like Daraz Bangladesh.

text
Daily Active Users (DAU):     5 million
Sellers:                      50,000
Products in catalog:          10 million
Orders per day (normal):      200,000
Orders per day (flash sale):  2,000,000 (10× spike)
Read-to-Write ratio:          ~100:1 (browsing >> purchasing)

Traffic Estimates ​

| Operation | Normal (req/sec) | Peak Flash Sale (req/sec) |
|---|---|---|
| Product page views | ~2,300 | ~23,000 |
| Search queries | ~580 | ~5,800 |
| Order placements | ~2.3 | ~23 |
| Payment transactions | ~2.3 | ~23 |

TIP

The read-to-write ratio of 100:1 tells us to invest heavily in caching (Redis) and read replicas for product and search data, while keeping transactional services (Order, Payment, Inventory) consistent and strongly isolated.

Storage Estimates ​

text
Per Order record:    ~2 KB (items, address, status history)
Orders per year:     200,000/day × 365 = 73 million
Order storage/year:  73M × 2 KB = ~146 GB

Per Product record:  ~5 KB (name, description, image metadata)
Total catalog:       10M × 5 KB = ~50 GB in PostgreSQL
Images:              10M × 5 images × 200 KB = ~10 TB in S3/CDN

Kafka event log:     ~500 events/sec × 1 KB = ~500 KB/sec (~1.8 GB/hour) retained

NOTE

146 GB/year for orders is very manageable. The real challenges are the 10 TB of images (solved by S3 + CDN) and the 10× traffic spikes (solved by auto-scaling + circuit breakers).
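
The arithmetic behind these estimates can be reproduced with a few lines of Python. This is only a sketch: the constant names are made up for illustration, the inputs are the figures quoted above, and decimal units are assumed (1 KB = 1,000 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: int) -> float:
    """Average events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


# Inputs taken from the estimates above (decimal units assumed).
ORDERS_PER_DAY = 200_000
FLASH_SALE_MULTIPLIER = 10
ORDER_BYTES = 2_000
PRODUCTS = 10_000_000
PRODUCT_BYTES = 5_000
IMAGES_PER_PRODUCT = 5
IMAGE_BYTES = 200_000
KAFKA_EVENTS_PER_SEC = 500
KAFKA_EVENT_BYTES = 1_000

orders_per_sec = per_second(ORDERS_PER_DAY)
peak_orders_per_sec = orders_per_sec * FLASH_SALE_MULTIPLIER
order_gb_per_year = ORDERS_PER_DAY * 365 * ORDER_BYTES / 1e9
catalog_gb = PRODUCTS * PRODUCT_BYTES / 1e9
image_tb = PRODUCTS * IMAGES_PER_PRODUCT * IMAGE_BYTES / 1e12
kafka_gb_per_hour = KAFKA_EVENTS_PER_SEC * KAFKA_EVENT_BYTES * 3600 / 1e9

print(f"orders/sec (normal): {orders_per_sec:.1f}")
print(f"orders/sec (peak):   {peak_orders_per_sec:.0f}")
print(f"order storage/year:  {order_gb_per_year:.0f} GB")
print(f"catalog size:        {catalog_gb:.0f} GB")
print(f"image storage:       {image_tb:.0f} TB")
print(f"Kafka log growth:    {kafka_gb_per_hour:.1f} GB/hour")
```

Changing a single input (for example the flash-sale multiplier) shows quickly which numbers dominate capacity planning.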


3. API Design ​

All public APIs are versioned under /api/v1/; internal service-to-service endpoints use /internal/v1/. The API Gateway enforces JWT auth before forwarding requests to services.

Core REST Endpoints ​

Product Service

http
GET  /api/v1/products?category=shoes&brand=Nike&minPrice=500&page=1
GET  /api/v1/products/:productId
POST /api/v1/products                    (Seller auth required)
PUT  /api/v1/products/:productId         (Seller auth required)

Order Service

http
POST /api/v1/orders

Request body:

json
{
  "items": [{ "productId": "P-101", "quantity": 2, "price": 1200 }],
  "shippingAddressId": "ADDR-55",
  "paymentMethod": "bkash",
  "idempotencyKey": "uuid-client-generated-key"
}

Response (201 Created):

json
{
  "orderId": "ORD-2024-00123",
  "status": "PENDING_PAYMENT",
  "totalAmount": 2400,
  "estimatedDelivery": "2024-12-18"
}

http
GET    /api/v1/orders/:orderId            (Customer: own orders)
GET    /api/v1/orders?userId=42&status=SHIPPED
DELETE /api/v1/orders/:orderId            (Cancel, if still PENDING)

Inventory Service (internal only; not exposed to clients)

http
POST /internal/v1/inventory/reserve      (Called by Order Service)
POST /internal/v1/inventory/release      (Called on order cancel/fail)
GET  /internal/v1/inventory/:productId   (Check stock level)

Payment Controller

http
POST /api/v1/payments

Request:

json
{
  "orderId": "ORD-2024-00123",
  "method": "bkash",
  "amount": 2400,
  "idempotencyKey": "same-uuid-as-order-request"
}

Response (200 OK):

json
{
  "transactionId": "TXN-BKash-789XYZ",
  "status": "SUCCESS",
  "gatewayRef": "BKASH-TRX-2024"
}

Order State Machine ​

IMPORTANT

The idempotencyKey on both Order and Payment endpoints is critical. If a mobile client's request times out and it retries, the server uses this key (stored in Redis for 24 hours) to detect the duplicate and return the original response instead of charging the customer twice.
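
A minimal sketch of this duplicate detection is shown below. It uses an in-memory dictionary as a stand-in for the Redis key with a 24-hour TTL; `IdempotencyStore`, `handle_payment`, and the `charge` callback are hypothetical names for illustration, not the platform's actual code.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class IdempotencyStore:
    """In-memory stand-in for a Redis key with a TTL (illustrative only)."""

    def __init__(self, ttl_seconds: float = 24 * 60 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        """Return the stored response, or None if absent or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return response

    def put(self, key: str, response: dict) -> None:
        """Remember the response for this key until the TTL elapses."""
        self._entries[key] = (self._clock() + self._ttl, response)


def handle_payment(store: IdempotencyStore, request: dict,
                   charge: Callable[[dict], dict]) -> dict:
    """Charge once per idempotency key; replay the first response on retries."""
    key = request["idempotencyKey"]
    cached = store.get(key)
    if cached is not None:
        return cached  # duplicate request: no second charge
    response = charge(request)
    store.put(key, response)
    return response
```

The lookup and the write are two separate steps here, so two simultaneous retries could both reach `charge`. A real deployment would need to claim the key atomically first, for example with the Redis `SET` command and its `NX` and `EX` options.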


4. Monolith vs Microservices ​

The Core Challenge: Requirements and scale are defined. Now, why does a monolith fail here, and why are microservices the right answer?

Starting Point: The Monolith

In the beginning, the team builds everything as one big application. This is called a Monolith.

What goes wrong as you grow?

| Problem | Real Impact |
|---|---|
| One bug in Payments crashes the entire site | Black Friday sale ruined |
| Need to scale Product search → must scale everything | Expensive & wasteful |
| Two teams editing the same codebase | Conflicts, slow releases |
| One database gets overloaded | Entire platform slows down |

The Fix: Break It Into Microservices

Each business capability becomes its own independent service with its own database.

Key Benefits

| Feature | Monolith | Microservices |
|---|---|---|
| Deploy | Redeploy entire app | Deploy only changed service |
| Scale | Scale everything | Scale only what needs it |
| Failure | One bug = site down | One service fails, rest work |
| Team | All devs in one repo | Each team owns their service |
| Technology | Same language/DB for all | Best tool for each job |

5. High-Level Architecture Overview ​

This is the full picture of how all services connect. Clients never talk directly to services; they go through the API Gateway.

NOTE

The API Gateway is the single entry point. It handles JWT verification, rate limiting, and routes requests to the right service. Services never expose their ports to the public internet.


6. Domain Breakdown: Every Service Explained

🛍️ Domain 1: Core Commerce

This is the heartbeat of the platform: the critical path for every purchase.

| Service | Responsibility | Database |
|---|---|---|
| Product Service | Product catalog, images, categories, attributes | PostgreSQL + Redis (cache) |
| Order Service | Create, update, cancel orders (state machine) | PostgreSQL |
| Inventory Service | Track stock, lock inventory on order | PostgreSQL + Redis (locks) |
| Payment Controller | Accept payment request, return response to client | Redis (idempotency keys) |
| Payment Backend | Talk to Stripe/bKash/SSLCommerz, handle webhooks | PostgreSQL |

🔍 Domain 2: Search & Discovery

Dedicated domain for helping users find products fast.

  • Search Service: Simple keyword search, autocomplete
  • Advanced Search: Faceted filters (brand, price, rating), AI-powered ranking, personalized results

TIP

Products are indexed into Elasticsearch asynchronously via Kafka events. When a seller updates a product, the Product Service publishes a product.updated event, which the Search Service consumes to re-index the product.


🏪 Domain 3: Seller Domain

Everything the vendor/seller needs to manage their store.

Real-World Example: A seller uploads a CSV file with 5,000 products.

  1. Bulk Uploader Service accepts the file and puts it in a job queue
  2. Seller Worker picks up the job, validates rows, and publishes events to Kafka
  3. Product Service consumes events and creates products
  4. Seller gets an email notification when done
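
Step 2 of this flow (validating rows before publishing events) might look roughly like the sketch below. The column names, the `product.created` event type, and the `validate_rows` helper are assumptions made for illustration; the real CSV layout and event schema are not specified here.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical required columns for an uploaded product CSV.
REQUIRED_COLUMNS = ("name", "category_id", "base_price")


def validate_rows(csv_text: str) -> Tuple[List[Dict], List[Tuple[int, str]]]:
    """Split CSV rows into publishable events and (line number, error) pairs."""
    events: List[Dict] = []
    errors: List[Tuple[int, str]] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        missing = [c for c in REQUIRED_COLUMNS if not (row.get(c) or "").strip()]
        if missing:
            errors.append((line_no, "missing " + ", ".join(missing)))
            continue
        try:
            price = float(row["base_price"])
        except ValueError:
            errors.append((line_no, "base_price is not a number"))
            continue
        if not price > 0:  # also rejects NaN
            errors.append((line_no, "base_price must be positive"))
            continue
        events.append({
            "type": "product.created",  # assumed event name
            "name": row["name"].strip(),
            "category_id": row["category_id"].strip(),
            "base_price": row["base_price"].strip(),
        })
    return events, errors
```

Returning the rejected rows with their line numbers lets the final notification email tell the seller exactly which rows to fix.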

👥 Domain 4: Customer & Support

Manages the buyer side and customer service operations.

| Service | Responsibility |
|---|---|
| Customer Service Backend | User profiles, address book, order history, preferences |
| CS Backend Service | Support tickets, agent tools, resolution workflows |
| Chat Backend Service | Real-time chat (WebSocket) between customers and support/sellers |

🚚 Domain 5: Operations & Logistics ​

Handles the physical movement of goods and platform security.

  • Fraud Engine: Every order runs through ML-based risk scoring before payment is processed
  • 3PL Integration: Connects with external delivery partners via their APIs

💰 Domain 6: Finance & Reporting

The money flow and business intelligence layer.


⚙️ Domain 7: Infrastructure (Cross-cutting Concerns)

Services that every other domain depends on.


7. Key User Flows: Step-by-Step

🛒 Flow 1: User Places an Order

This is the most critical flow. Let's trace every step.

What happens in plain English:

  1. User submits order → Gateway verifies their login token
  2. Order Service starts the workflow
  3. Inventory is locked (reserved) so no one else can buy the last item
  4. Fraud Engine scores the order for risk
  5. Payment is charged
  6. Order is confirmed and an event is published to Kafka
  7. Multiple services react independently and asynchronously: SMS sent, logistics notified, accounts updated

💳 Flow 2: Payment Processing

IMPORTANT

Idempotency is critical here. If the user clicks "Pay" twice, the system must not charge them twice. The Payment Controller stores a unique idempotency key in Redis for each request; a duplicate request gets the original response back instead of triggering a second charge.


📦 Flow 3: Real-Time Inventory Update

Why Redis for inventory?

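  • DECR/DECRBY in Redis is atomic, so even with 1,000 concurrent buyers no two requests act on the same stale count. The counter itself can still go negative, so the service must check the returned value and roll back (or run the check-and-decrement inside a Lua script) when it drops below 0
  • PostgreSQL is updated asynchronously for durability
  • Together, these prevent the "overselling problem"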
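
The reservation logic can be sketched as a decrement followed by a compensating increment when the result is negative. The `InMemoryCounter` class below is a stand-in so the example runs without a server; it mirrors the `decrby`/`incrby` calls that the redis-py client exposes. `reserve_stock` and `release_stock` are hypothetical helper names.

```python
from typing import Dict


class InMemoryCounter:
    """Stand-in for a Redis client, exposing only the calls used below."""

    def __init__(self) -> None:
        self._values: Dict[str, int] = {}

    def set(self, key: str, value: int) -> None:
        self._values[key] = int(value)

    def get(self, key: str) -> int:
        return self._values.get(key, 0)

    def decrby(self, key: str, amount: int) -> int:
        self._values[key] = self._values.get(key, 0) - amount
        return self._values[key]

    def incrby(self, key: str, amount: int) -> int:
        self._values[key] = self._values.get(key, 0) + amount
        return self._values[key]


def reserve_stock(client, product_id: str, quantity: int) -> bool:
    """Try to reserve stock; return False (and undo) if it would oversell."""
    key = f"inventory:{product_id}"
    remaining = client.decrby(key, quantity)
    if remaining < 0:
        client.incrby(key, quantity)  # roll back the failed reservation
        return False
    return True


def release_stock(client, product_id: str, quantity: int) -> None:
    """Give stock back when an order is cancelled or fails."""
    client.incrby(f"inventory:{product_id}", quantity)
```

Between the decrement and the rollback the counter is briefly negative, which can only cause other buyers to be refused, never an oversell. A Lua script that checks and decrements in one step avoids even that window.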

🔔 Flow 4: Notification Delivery

The Notification Service is a pure consumer: it only listens to Kafka events and is never called directly by other services. This means you can update, restart, or even replace it without affecting the Order or Payment flow.
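
The decoupling can be illustrated with a tiny in-memory publish/subscribe bus. It is only a stand-in for Kafka (real consumers run in separate processes and read at their own pace), and the `order.confirmed` topic name and handler names are assumptions for the example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class InMemoryBus:
    """Toy event bus: the publisher knows topics, never the consumers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Dict) -> None:
        for handler in self._subscribers[topic]:
            try:
                handler(event)
            except Exception as exc:
                # A broken consumer must not affect the publisher or its peers.
                print(f"consumer failed on {topic}: {exc}")


sent_sms: List[str] = []


def notification_consumer(event: Dict) -> None:
    """Pure consumer: reacts to the event, is never called directly."""
    sent_sms.append(f"Order {event['orderId']} confirmed")


def broken_logistics_consumer(event: Dict) -> None:
    raise RuntimeError("3PL API unreachable")


bus = InMemoryBus()
bus.subscribe("order.confirmed", broken_logistics_consumer)
bus.subscribe("order.confirmed", notification_consumer)
bus.publish("order.confirmed", {"orderId": "ORD-2024-00123"})
```

The publisher only calls `publish`; adding, removing, or breaking a consumer requires no change on the Order or Payment side.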


8. Communication Patterns ​

The platform uses two types of communication between services: synchronous request/response (REST or gRPC) and asynchronous events (Kafka).

When to use each? ​

| Scenario | Pattern | Why |
|---|---|---|
| Check if email exists during signup | REST | Need immediate yes/no answer |
| Place an order | REST (for sync steps) + Kafka (for downstream) | Need confirmation, but logistics/notifications can be async |
| Bulk product reindex after catalog update | Kafka | Fire-and-forget, Search Service handles it when ready |
| Inter-service calls needing response | gRPC | Strongly typed, low latency, internal network |

9. Data Management: Database per Service

Each service owns its own data. No service reads another service's database directly.

Technology Choices ​

| Service | Database | Reason |
|---|---|---|
| Order, Payment, Product | PostgreSQL | ACID transactions, relational data |
| Session, Cart, Rate Limits | Redis | Sub-millisecond reads, TTL support |
| Product Search | Elasticsearch | Full-text search, facets, ranking |
| Logs & Metrics | Elasticsearch + Kibana | Log analysis, dashboards |
| Job Queues | Redis / Kafka | Reliable async task processing |

The Saga Pattern: Distributed Transactions

Problem: An order involves multiple services. What if inventory is reserved but the payment then fails?

The Saga Pattern solves this by defining a sequence of local transactions, each with a compensating transaction to undo if something fails downstream.
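
A minimal orchestration-style sketch of this idea follows. Each step pairs an action with its compensation, and a failure runs the compensations of the already completed steps in reverse order. `run_saga`, `SagaFailed`, and the step names are illustrative assumptions, not the platform's actual implementation.

```python
from typing import Callable, List, Tuple

# (step name, local transaction, compensating transaction)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


class SagaFailed(Exception):
    """Raised after a step fails and earlier steps have been compensated."""


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure undo completed steps in reverse."""
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            for _done_name, undo in reversed(completed):
                undo()
            raise SagaFailed(f"step {name!r} failed: {exc}") from exc
        completed.append((name, compensate))
    return [name for name, _ in completed]
```

For example, with steps "reserve inventory" and "charge payment", a declined payment triggers the inventory release, leaving no stock locked for an order that was never paid. Compensations themselves can fail in practice, so production sagas usually persist their progress and retry.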


10. Database Schema: Deep Dive

Each service owns its schema completely. Below are the key table definitions for the most critical services.

Orders DB (PostgreSQL) ​

sql
-- Core order record
CREATE TABLE orders (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id       UUID NOT NULL,
    status        VARCHAR(30) NOT NULL DEFAULT 'PENDING_PAYMENT',
    -- PENDING_PAYMENT | CONFIRMED | PROCESSING | SHIPPED | DELIVERED
    -- CANCELLED | RETURN_REQUESTED | REFUNDED | FAILED
    total_amount  NUMERIC(12, 2) NOT NULL,
    currency      CHAR(3) NOT NULL DEFAULT 'BDT',
    idempotency_key VARCHAR(128) UNIQUE NOT NULL,   -- prevents duplicate orders
    created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Line items for each order
CREATE TABLE order_items (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id     UUID NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
    product_id   UUID NOT NULL,
    seller_id    UUID NOT NULL,
    quantity     INT NOT NULL CHECK (quantity > 0),
    unit_price   NUMERIC(10, 2) NOT NULL,
    total_price  NUMERIC(12, 2) GENERATED ALWAYS AS (quantity * unit_price) STORED
);

-- Full audit trail of every status change
CREATE TABLE order_status_history (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id   UUID NOT NULL REFERENCES orders(id),
    old_status VARCHAR(30),
    new_status VARCHAR(30) NOT NULL,
    reason     TEXT,
    changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Indexes for common query patterns
CREATE INDEX idx_orders_user_id   ON orders(user_id);
CREATE INDEX idx_orders_status    ON orders(status);
CREATE INDEX idx_order_items_order ON order_items(order_id);
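
The status values listed in the `orders.status` comment can be guarded by an explicit transition map, so that the Order Service rejects illegal jumps before writing to `order_status_history`. The allowed transitions below are assumptions for illustration (only "cancel while still pending" and the Placed → Shipped → Delivered progression are stated elsewhere in this document).

```python
from enum import Enum
from typing import Dict, FrozenSet


class OrderStatus(str, Enum):
    PENDING_PAYMENT = "PENDING_PAYMENT"
    CONFIRMED = "CONFIRMED"
    PROCESSING = "PROCESSING"
    SHIPPED = "SHIPPED"
    DELIVERED = "DELIVERED"
    CANCELLED = "CANCELLED"
    RETURN_REQUESTED = "RETURN_REQUESTED"
    REFUNDED = "REFUNDED"
    FAILED = "FAILED"


S = OrderStatus

# Assumed transition rules; statuses not listed as keys are terminal.
ALLOWED: Dict[OrderStatus, FrozenSet[OrderStatus]] = {
    S.PENDING_PAYMENT: frozenset({S.CONFIRMED, S.CANCELLED, S.FAILED}),
    S.CONFIRMED: frozenset({S.PROCESSING}),
    S.PROCESSING: frozenset({S.SHIPPED}),
    S.SHIPPED: frozenset({S.DELIVERED}),
    S.DELIVERED: frozenset({S.RETURN_REQUESTED}),
    S.RETURN_REQUESTED: frozenset({S.REFUNDED}),
}


class InvalidTransition(Exception):
    """Raised when a status change is not permitted by ALLOWED."""


def transition(current: OrderStatus, new: OrderStatus) -> OrderStatus:
    """Return the new status if the move is allowed, else raise."""
    if new not in ALLOWED.get(current, frozenset()):
        raise InvalidTransition(f"{current.value} -> {new.value} is not allowed")
    return new
```

Keeping the rules in one table makes it straightforward to record `old_status` and `new_status` together whenever `transition` succeeds.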

Products DB (PostgreSQL) ​

sql
CREATE TABLE products (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    seller_id    UUID NOT NULL,
    name         VARCHAR(500) NOT NULL,
    description  TEXT,
    category_id  UUID NOT NULL,
    base_price   NUMERIC(10, 2) NOT NULL,
    status       VARCHAR(20) NOT NULL DEFAULT 'ACTIVE',  -- ACTIVE | INACTIVE | DELETED
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Flexible key-value attributes (color, size, material, etc.)
CREATE TABLE product_attributes (
    product_id   UUID NOT NULL REFERENCES products(id),
    attr_key     VARCHAR(100) NOT NULL,
    attr_value   VARCHAR(500) NOT NULL,
    PRIMARY KEY (product_id, attr_key)
);

-- One row per image URL (actual files live in S3)
CREATE TABLE product_images (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id   UUID NOT NULL REFERENCES products(id),
    image_url    TEXT NOT NULL,        -- CDN URL
    is_primary   BOOLEAN DEFAULT FALSE,
    display_order INT DEFAULT 0
);

CREATE INDEX idx_products_seller   ON products(seller_id);
CREATE INDEX idx_products_category ON products(category_id);
CREATE INDEX idx_products_status   ON products(status);

Inventory DB (PostgreSQL + Redis) ​

sql
CREATE TABLE inventory (
    product_id        UUID PRIMARY KEY,
    warehouse_id      UUID NOT NULL,
    total_stock       INT NOT NULL DEFAULT 0 CHECK (total_stock >= 0),
    reserved_stock    INT NOT NULL DEFAULT 0 CHECK (reserved_stock >= 0),
    -- available = total_stock - reserved_stock
    updated_at        TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Log every reservation and release for auditability
CREATE TABLE inventory_transactions (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    product_id   UUID NOT NULL,
    order_id     UUID,
    type         VARCHAR(20) NOT NULL,  -- RESERVE | RELEASE | RESTOCK | ADJUST
    quantity     INT NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Redis Key Pattern for Hot Inventory:

text
Key:   inventory:{productId}
Type:  String (integer)
Value: Available stock count (available = total - reserved)
TTL:   None (permanent, synced from PostgreSQL)

Operation: DECRBY inventory:P-101 2   → atomic, thread-safe reservation
           INCRBY inventory:P-101 2   → atomic release on cancel

NOTE

The Redis count is the real-time source of truth for availability checks. PostgreSQL is the durable source of truth for the actual stock ledger. A background job reconciles the two every 60 seconds and whenever a discrepancy is detected.

Payments DB (PostgreSQL) ​

sql
CREATE TABLE payments (
    id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id          UUID NOT NULL,
    method            VARCHAR(30) NOT NULL,  -- bkash | card | cod | sslcommerz
    amount            NUMERIC(12, 2) NOT NULL,
    currency          CHAR(3) NOT NULL DEFAULT 'BDT',
    status            VARCHAR(20) NOT NULL,  -- PENDING | SUCCESS | FAILED | REFUNDED
    gateway_txn_id    VARCHAR(200),          -- External gateway reference
    idempotency_key   VARCHAR(128) UNIQUE NOT NULL,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    settled_at        TIMESTAMPTZ            -- When funds confirmed by gateway
);

CREATE INDEX idx_payments_order_id ON payments(order_id);
-- Unique partial index: only one SUCCESS payment per order
CREATE UNIQUE INDEX idx_payments_order_success
    ON payments(order_id) WHERE status = 'SUCCESS';

11. Resilience Patterns ​

Circuit Breaker ​

Prevents a slow/failing service from crashing everything else.

Real Example: Payment gateway is slow during peak hours.

  • Without circuit breaker: Every order hangs for 30 seconds, threads pile up, site crashes
  • With circuit breaker: After 5 failures, the breaker opens, and orders get "Payment temporarily unavailable, try again" within milliseconds
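
A simplified, single-threaded sketch of such a breaker is shown below. The threshold of 5 failures comes from the example above; the 30-second reset timeout, the class name, and the half-open trial behaviour are assumptions for illustration.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised immediately while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Fail fast instead of waiting on the slow dependency.
                raise CircuitOpenError("Payment temporarily unavailable, try again")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping the gateway call, for instance `breaker.call(charge_card, order)` with a hypothetical `charge_card` function, turns a 30-second hang into an immediate error once the breaker has opened. A production version would also need locking for concurrent callers.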

Rate Limiting at the API Gateway ​

Retry with Exponential Backoff ​

When a transient failure occurs (network hiccup), retry intelligently:

text
Attempt 1 → fails → wait 1s
Attempt 2 → fails → wait 2s
Attempt 3 → fails → wait 4s
Attempt 4 → success ✅
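
The schedule above (1s, 2s, 4s) corresponds to a delay of `base_delay * 2 ** (attempt - 1)`. The sketch below implements that schedule; the function name, the default exception types, and the injectable `sleep` parameter are illustrative choices rather than a prescribed API.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, doubling the wait after each transient failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

Only errors listed in `retry_on` are retried, so non-transient failures such as validation errors surface immediately. Adding a small random jitter to each delay is a common refinement that keeps many clients from retrying in lockstep, and retried writes should carry an idempotency key.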

12. Infrastructure & Deployment ​

Container Orchestration with Kubernetes ​

Each microservice runs in its own Docker container, managed by Kubernetes.

CI/CD Pipeline ​

Observability Stack ​

Distributed Tracing Example: A single order request touches 6 services. Jaeger shows you the full call chain with timing for each hop, so you can pinpoint exactly which service is slow.


13. Trade-offs & Lessons Learned ​

The Good ​

| Benefit | How It Helps |
|---|---|
| Independent Scaling | Scale Payment Service 10x during sale without touching others |
| Independent Deployment | Fix a bug in Notifications without redeploying the whole platform |
| Fault Isolation | Chat Service goes down → Orders still work perfectly |
| Tech Flexibility | Use Go for high-throughput services, Python for ML/fraud detection |
| Team Autonomy | 6 teams work in parallel without blocking each other |

The Hard Parts ​

| Challenge | Solution |
|---|---|
| Distributed Transactions | Saga Pattern with compensating transactions |
| Data Consistency | Eventual consistency via Kafka events |
| Service Discovery | Kubernetes DNS or Consul |
| Debugging | Distributed tracing (Jaeger), correlation IDs on every request |
| Testing | Contract testing (Pact), integration tests per service |
| Operational Complexity | Kubernetes, Helm charts, centralized config (Vault/ConfigMap) |

WARNING

Don't start with microservices. Most successful companies (Amazon, Netflix, Uber) started as monoliths and evolved to microservices as the team and traffic scaled. Starting with microservices too early adds enormous complexity for no benefit.


🗺️ Summary: The Big Picture

This is the complete journey of a single order. It touches 10+ microservices, uses 4 external integrations, and processes 3 async Kafka events, all within a few seconds.


📖 Key Concepts Glossary

| Term | Meaning |
|---|---|
| API Gateway | Single entry point that routes requests to the right service |
| Kafka | Distributed message bus for async event-driven communication |
| Saga Pattern | Distributed transaction handling with compensating rollbacks |
| Circuit Breaker | Pattern to stop calling a failing service and return fallback |
| Idempotency | Doing the same operation twice produces the same result (no double charges) |
| Eventual Consistency | Data will be consistent across services, just not instantly |
| Domain-Driven Design | Organizing services around business domains (Orders, Payments, etc.) |
| Service Mesh | Infrastructure layer (Istio) handling mTLS, tracing, retries between services |
| CQRS | Command Query Responsibility Segregation: separate read and write models |
| Database per Service | Each service owns its own database; no shared databases |

➑️ Next: Case Study: Netflix Video Streaming β€” a similar event-driven architecture with extreme bandwidth requirements.

πŸ”— Related: Level 6 β€” Microservices Patterns | Level 5 β€” Messaging & Kafka | Level 2 β€” Scalability

Released under the ISC License.