E-Commerce Platform: Microservices Architecture & System Design
Difficulty: Advanced | Category: Write-Heavy, Event-Driven | Similar Systems: Amazon, Daraz, Shopify, Shopee
An end-to-end case study of how a production e-commerce platform handles millions of orders, covering domain decomposition, key user flows, distributed transactions, resilience patterns, and deployment infrastructure.
Table of Contents
- Requirements Clarification
- Back-of-the-Envelope Estimation
- API Design
- Monolith vs Microservices: Why Switch?
- High-Level Architecture Overview
- Domain Breakdown: Every Service Explained
- Key User Flows: Step-by-Step
- Communication Patterns
- Data Management: Database per Service
- Resilience Patterns
- Infrastructure & Deployment
- Trade-offs & Lessons Learned
1. Requirements Clarification
Functional Requirements
- Browse & Search: Users can search and filter products by keyword, category, price, brand, and rating.
- Place Orders: Users can add items to a cart and place orders with address and payment details.
- Payment: Support multiple payment methods, namely card, mobile wallet (bKash), and cash on delivery.
- Inventory: Accurately track stock levels; prevent overselling even under concurrent purchases.
- Seller Portal: Vendors can list products, manage inventory, and view payouts.
- Order Tracking: Users can track their package from "Placed" → "Shipped" → "Delivered".
- Notifications: Real-time SMS/email for order updates, payment receipts, and promotions.
Non-Functional Requirements
- High Availability: Target 99.9% uptime; no single point of failure.
- Low Latency: Product pages < 100ms; order placement < 500ms end-to-end.
- Write-Heavy under bursts: 10× normal traffic during flash sales.
- Consistency: Payment and inventory must be strongly consistent. Notifications can be eventual.
- Scalability: Must scale individual bottlenecks (search, payments) without scaling everything.
2. Back-of-the-Envelope Estimation
Assume a mid-size marketplace like Daraz Bangladesh.
Daily Active Users (DAU): 5 million
Sellers: 50,000
Products in catalog: 10 million
Orders per day (normal): 200,000
Orders per day (flash sale): 2,000,000 (10× spike)
Read-to-Write ratio: ~100:1 (browsing >> purchasing)
Traffic Estimates
| Operation | Normal (req/sec) | Peak Flash Sale |
|---|---|---|
| Product page views | ~2,300 | ~23,000 |
| Search queries | ~580 | ~5,800 |
| Order placements | ~2.3 | ~23 |
| Payment transactions | ~2.3 | ~23 |
TIP
The read-to-write ratio of 100:1 tells us to invest heavily in caching (Redis) and read replicas for product and search data, while keeping transactional services (Order, Payment, Inventory) consistent and strongly isolated.
Storage Estimates
Per Order record: ~2 KB (items, address, status history)
Orders per year: 200,000/day × 365 = 73 million
Order storage/year: 73M × 2 KB = ~146 GB
Per Product record: ~5 KB (name, description, images metadata)
Total catalog: 10M × 5 KB = ~50 GB in PostgreSQL
Images: 10M × 5 images × 200 KB = ~10 TB in S3/CDN
Kafka event log: ~500 events/sec × 1 KB = ~500 KB/sec ≈ 1.8 GB/hour retained
NOTE
146 GB/year for orders is very manageable. The real challenges are the 10 TB of images (solved by S3 + CDN) and the 10× traffic spikes (solved by auto-scaling + circuit breakers).
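The arithmetic behind these estimates can be reproduced in a few lines. This is a minimal sketch that only uses the input figures stated above and assumes decimal units (1 KB = 1,000 bytes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
KB, GB, TB = 10**3, 10**9, 10**12  # decimal units, an assumption for simplicity


def per_second(events_per_day: int) -> float:
    """Average request rate for a daily volume spread evenly over 24 hours."""
    return events_per_day / SECONDS_PER_DAY


orders_per_day = 200_000
normal_order_rate = per_second(orders_per_day)       # about 2.3 orders/sec
flash_order_rate = per_second(orders_per_day * 10)   # about 23 orders/sec

orders_per_year = orders_per_day * 365               # 73 million
order_storage_gb = orders_per_year * 2 * KB / GB     # about 146 GB/year
catalog_gb = 10_000_000 * 5 * KB / GB                # about 50 GB
images_tb = 10_000_000 * 5 * 200 * KB / TB           # about 10 TB
kafka_gb_per_hour = 500 * 1 * KB * 3600 / GB         # about 1.8 GB/hour

if __name__ == "__main__":
    print(f"orders/sec normal: {normal_order_rate:.1f}, flash sale: {flash_order_rate:.1f}")
    print(f"order storage: {order_storage_gb:.0f} GB/year, catalog: {catalog_gb:.0f} GB")
    print(f"images: {images_tb:.0f} TB, Kafka log: {kafka_gb_per_hour:.1f} GB/hour")
```

Note that 500 events/sec at 1 KB each is 500 KB per second, which accumulates to roughly 1.8 GB per hour of retained log.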
3. API Design
All APIs are versioned under /api/v1/. The API Gateway enforces JWT auth before forwarding to services.
Core REST Endpoints
Product Service
GET /api/v1/products?category=shoes&brand=Nike&minPrice=500&page=1
GET /api/v1/products/:productId
POST /api/v1/products (Seller auth required)
PUT /api/v1/products/:productId (Seller auth required)
Order Service
POST /api/v1/orders
Request body:
{
"items": [{ "productId": "P-101", "quantity": 2, "price": 1200 }],
"shippingAddressId": "ADDR-55",
"paymentMethod": "bkash",
"idempotencyKey": "uuid-client-generated-key"
}
Response (201 Created):
{
"orderId": "ORD-2024-00123",
"status": "PENDING_PAYMENT",
"totalAmount": 2400,
"estimatedDelivery": "2024-12-18"
}
GET /api/v1/orders/:orderId (Customer: own orders)
GET /api/v1/orders?userId=42&status=SHIPPED
DELETE /api/v1/orders/:orderId (Cancel, if still PENDING_PAYMENT)
Inventory Service (internal only, not exposed to clients)
POST /internal/v1/inventory/reserve (Called by Order Service)
POST /internal/v1/inventory/release (Called on order cancel/fail)
GET /internal/v1/inventory/:productId (Check stock level)
Payment Controller
POST /api/v1/payments
Request:
{
"orderId": "ORD-2024-00123",
"method": "bkash",
"amount": 2400,
"idempotencyKey": "same-uuid-as-order-request"
}
Response (200 OK):
{
"transactionId": "TXN-BKash-789XYZ",
"status": "SUCCESS",
"gatewayRef": "BKASH-TRX-2024"
}
Order State Machine
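A state machine like this can be enforced with a small transition table. The sketch below reuses the status names listed for the orders table in this document; the allowed transitions themselves are an assumption made for illustration, not the platform's confirmed rules, and `transition` is a hypothetical helper.

```python
from enum import Enum


class OrderStatus(Enum):
    PENDING_PAYMENT = "PENDING_PAYMENT"
    CONFIRMED = "CONFIRMED"
    PROCESSING = "PROCESSING"
    SHIPPED = "SHIPPED"
    DELIVERED = "DELIVERED"
    CANCELLED = "CANCELLED"
    RETURN_REQUESTED = "RETURN_REQUESTED"
    REFUNDED = "REFUNDED"
    FAILED = "FAILED"


S = OrderStatus

# Assumed transition table: each status maps to the statuses it may move to.
ALLOWED_TRANSITIONS = {
    S.PENDING_PAYMENT: {S.CONFIRMED, S.CANCELLED, S.FAILED},
    S.CONFIRMED: {S.PROCESSING},
    S.PROCESSING: {S.SHIPPED},
    S.SHIPPED: {S.DELIVERED},
    S.DELIVERED: {S.RETURN_REQUESTED},
    S.RETURN_REQUESTED: {S.REFUNDED},
    S.CANCELLED: set(),  # terminal
    S.REFUNDED: set(),   # terminal
    S.FAILED: set(),     # terminal
}


def transition(current: OrderStatus, new: OrderStatus) -> OrderStatus:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Keeping the table in one place means the Order Service can reject an illegal move (for example, cancelling an order that has already shipped) before it writes anything.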
IMPORTANT
The idempotencyKey on both Order and Payment endpoints is critical. If a mobile client's request times out and it retries, the server uses this key (stored in Redis for 24 hours) to detect the duplicate and return the original response instead of charging the customer twice.
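The duplicate-detection logic can be sketched as follows. This is an in-memory stand-in for Redis, written under assumptions: `IdempotencyStore` and `run_once` are hypothetical names, the store is single-process, and it is not safe for concurrent requests. A real deployment would need an atomic set-if-absent operation (in Redis, `SET` with the `NX` and `EX` options) to close the race between two simultaneous retries.

```python
import time
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """Remembers the response for each key until its TTL expires."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, response)

    def run_once(self, key: str, handler: Callable[[], Any]) -> Any:
        """Run handler for a new key; replay the stored response for a repeat."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # duplicate request: do not run the handler again
        response = handler()
        self._entries[key] = (now + self._ttl, response)
        return response


if __name__ == "__main__":
    store = IdempotencyStore()
    charges = []

    def charge() -> dict:
        charges.append(2400)
        return {"status": "SUCCESS", "amount": 2400}

    first = store.run_once("uuid-client-generated-key", charge)
    retry = store.run_once("uuid-client-generated-key", charge)
    print(first == retry, len(charges))
```

The retry receives the same response object as the first call, and the charge handler runs only once.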
4. Monolith vs Microservices
The Core Challenge: Requirements and scale are now defined. Why does a monolith fail here, and why are microservices the right answer?
Starting Point: The Monolith
In the beginning, the team builds everything as one big application. This is called a Monolith.
What goes wrong as you grow?
| Problem | Real Impact |
|---|---|
| One bug in Payments crashes the entire site | Black Friday sale ruined |
| Need to scale Product search β must scale everything | Expensive & wasteful |
| Two teams editing the same codebase | Conflicts, slow releases |
| One database gets overloaded | Entire platform slows down |
The Fix: Break It Into Microservices
Each business capability becomes its own independent service with its own database.
Key Benefits
| Feature | Monolith | Microservices |
|---|---|---|
| Deploy | Redeploy entire app | Deploy only changed service |
| Scale | Scale everything | Scale only what needs it |
| Failure | One bug = site down | One service fails, rest work |
| Team | All devs in one repo | Each team owns their service |
| Technology | Same language/DB for all | Best tool for each job |
5. High-Level Architecture Overview
This is the full picture of how all services connect. Clients never talk directly to services; they go through the API Gateway.
NOTE
The API Gateway is the single entry point. It handles JWT verification, rate limiting, and routes requests to the right service. Services never expose their ports to the public internet.
6. Domain Breakdown: Every Service Explained
Domain 1: Core Commerce
This is the heartbeat of the platform: the critical path for every purchase.
| Service | Responsibility | Database |
|---|---|---|
| Product Service | Product catalog, images, categories, attributes | PostgreSQL + Redis (cache) |
| Order Service | Create, update, cancel orders (state machine) | PostgreSQL |
| Inventory Service | Track stock, lock inventory on order | PostgreSQL + Redis (locks) |
| Payment Controller | Accept payment request, return response to client | Redis (idempotency keys) |
| Payment Backend | Talk to Stripe/bKash/SSLCommerz, handle webhooks | PostgreSQL |
Domain 2: Search & Discovery
Dedicated domain for helping users find products fast.
- Search Service: Simple keyword search, autocomplete
- Advanced Search: Faceted filters (brand, price, rating), AI-powered ranking, personalized results
TIP
Products are indexed into Elasticsearch asynchronously via Kafka events. When a seller updates a product, the Product Service publishes a product.updated event; the Search Service consumes it and re-indexes the product.
Domain 3: Seller Domain
Everything the vendor/seller needs to manage their store.
Real-World Example: A seller uploads a CSV file with 5,000 products.
- Bulk Uploader Service accepts the file and puts it in a job queue
- Seller Worker picks up the job, validates rows, and publishes events to Kafka
- Product Service consumes events and creates products
- Seller gets an email notification when done
Domain 4: Customer & Support
Manages the buyer side and customer service operations.
| Service | Responsibility |
|---|---|
| Customer Service Backend | User profiles, address book, order history, preferences |
| CS Backend Service | Support tickets, agent tools, resolution workflows |
| Chat Backend Service | Real-time chat (WebSocket) between customers and support/sellers |
Domain 5: Operations & Logistics
Handles the physical movement of goods and platform security.
- Fraud Engine: Every order runs through ML-based risk scoring before payment is processed
- 3PL Integration: Connects with external delivery partners via their APIs
Domain 6: Finance & Reporting
The money flow and business intelligence layer.
Domain 7: Infrastructure (Cross-cutting Concerns)
Services that every other domain depends on.
7. Key User Flows: Step-by-Step
Flow 1: User Places an Order
This is the most critical flow. Let's trace every step.
What happens in plain English:
- User submits the order; the Gateway verifies their login token
- Order Service starts the workflow
- Inventory is locked (reserved) so no one else can buy the last item
- Fraud Engine scores the order for risk
- Payment is charged
- Order is confirmed and an event is published to Kafka
- Multiple services react independently and asynchronously: SMS sent, logistics notified, accounts updated
Flow 2: Payment Processing
IMPORTANT
Idempotency is critical here. If the user clicks "Pay" twice, the system must not charge them twice. The Payment Controller stores a unique idempotency key in Redis for each request; when a duplicate arrives, it returns the original response instead of charging again.
Flow 3: Real-Time Inventory Update
Why Redis for inventory?
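- DECR in Redis is atomic, so even with 1,000 concurrent buyers every request sees a distinct counter value. DECR itself can take the counter below 0, so the service must check the returned value and undo the decrement (or do the check and decrement together in a Lua script) once stock runs out
- PostgreSQL is updated asynchronously for durability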
- This prevents the "overselling problem"
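The reserve-then-check step can be sketched without a Redis server. In the sketch below, `AtomicCounter` is a hypothetical in-memory stand-in that mimics the semantics of Redis `DECRBY` and `INCRBY` (atomic, returns the new value); `reserve` and `simulate_flash_sale` are illustrative names, not part of the platform.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


class AtomicCounter:
    """In-memory stand-in for a Redis integer key."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._lock = threading.Lock()

    def decrby(self, amount: int) -> int:
        with self._lock:
            self._value -= amount
            return self._value

    def incrby(self, amount: int) -> int:
        with self._lock:
            self._value += amount
            return self._value

    def get(self) -> int:
        with self._lock:
            return self._value


def reserve(stock: AtomicCounter, quantity: int) -> bool:
    """Decrement first, then undo the decrement if stock went negative."""
    remaining = stock.decrby(quantity)
    if remaining < 0:
        stock.incrby(quantity)  # compensate: this buyer gets nothing
        return False
    return True


def simulate_flash_sale(initial_stock: int, buyers: int) -> Tuple[int, int]:
    """Let many buyers race for one unit each; return (units sold, stock left)."""
    stock = AtomicCounter(initial_stock)
    with ThreadPoolExecutor(max_workers=32) as pool:
        results = list(pool.map(lambda _: reserve(stock, 1), range(buyers)))
    return sum(results), stock.get()


if __name__ == "__main__":
    print(simulate_flash_sale(initial_stock=100, buyers=1000))
```

The counter may dip below zero for a moment, but no reservation succeeds on a negative result, so the number of units sold never exceeds the initial stock. When requests ask for different quantities, a large request that briefly overdraws the counter can cause a smaller concurrent request to fail spuriously; doing the check and the decrement in a single atomic script avoids that.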
Flow 4: Notification Delivery
The Notification Service is a pure consumer: it only listens to Kafka events and never gets called directly by other services. This means you can update, restart, or even replace it without affecting the Order or Payment flow.
8. Communication Patterns
The platform uses two types of communication between services: synchronous request/response calls (REST, gRPC) and asynchronous events (Kafka).
When to use each?
| Scenario | Pattern | Why |
|---|---|---|
| Check if email exists during signup | REST | Need immediate yes/no answer |
| Place an order | REST (for sync steps) + Kafka (for downstream) | Need confirmation, but logistics/notifications can be async |
| Bulk product reindex after catalog update | Kafka | Fire-and-forget, Search Service handles it when ready |
| Inter-service calls needing response | gRPC | Strongly typed, low latency, internal network |
9. Data Management: Database per Service
Each service owns its own data. No service reads another service's database directly.
Technology Choices
| Service | Database | Reason |
|---|---|---|
| Order, Payment, Product | PostgreSQL | ACID transactions, relational data |
| Session, Cart, Rate Limits | Redis | Sub-millisecond reads, TTL support |
| Product Search | Elasticsearch | Full-text search, facets, ranking |
| Logs & Metrics | Elasticsearch + Kibana | Log analysis, dashboards |
| Job Queues | Redis / Kafka | Reliable async task processing |
The Saga Pattern: Distributed Transactions
Problem: An order involves multiple services. What if an early step succeeds (inventory is reserved) but a later step fails (the payment is declined)?
The Saga Pattern solves this by defining a sequence of local transactions, each with a compensating transaction to undo if something fails downstream.
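An orchestrated saga can be sketched as a list of steps, each paired with its compensation. This is a simplified, in-process illustration: `run_saga` and the step names are hypothetical, and a production saga would persist its progress and call remote services rather than local functions.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Callable[[], None]] = []
    for _name, action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


if __name__ == "__main__":
    log: List[str] = []

    def charge_payment() -> None:
        raise RuntimeError("payment declined")

    ok = run_saga([
        ("reserve_inventory",
         lambda: log.append("inventory reserved"),
         lambda: log.append("inventory released")),
        ("charge_payment",
         charge_payment,
         lambda: log.append("payment refunded")),
        ("confirm_order",
         lambda: log.append("order confirmed"),
         lambda: log.append("order cancelled")),
    ])
    print(ok, log)
```

In this run the payment step fails, so only the inventory reservation is compensated: the failed step never completed and the confirmation step never started, so neither needs undoing.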
10. Database Schema: Deep Dive
Each service owns its schema completely. Below are the key table definitions for the most critical services.
Orders DB (PostgreSQL)
-- Core order record
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL,
status VARCHAR(30) NOT NULL DEFAULT 'PENDING_PAYMENT',
-- PENDING_PAYMENT | CONFIRMED | PROCESSING | SHIPPED | DELIVERED
-- CANCELLED | RETURN_REQUESTED | REFUNDED | FAILED
total_amount NUMERIC(12, 2) NOT NULL,
currency CHAR(3) NOT NULL DEFAULT 'BDT',
idempotency_key VARCHAR(128) UNIQUE NOT NULL, -- prevents duplicate orders
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Line items for each order
CREATE TABLE order_items (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
order_id UUID NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
product_id UUID NOT NULL,
seller_id UUID NOT NULL,
quantity INT NOT NULL CHECK (quantity > 0),
unit_price NUMERIC(10, 2) NOT NULL,
total_price NUMERIC(12, 2) GENERATED ALWAYS AS (quantity * unit_price) STORED
);
-- Full audit trail of every status change
CREATE TABLE order_status_history (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
order_id UUID NOT NULL REFERENCES orders(id),
old_status VARCHAR(30),
new_status VARCHAR(30) NOT NULL,
reason TEXT,
changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Indexes for common query patterns
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_order_items_order ON order_items(order_id);
Products DB (PostgreSQL)
CREATE TABLE products (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
seller_id UUID NOT NULL,
name VARCHAR(500) NOT NULL,
description TEXT,
category_id UUID NOT NULL,
base_price NUMERIC(10, 2) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'ACTIVE', -- ACTIVE | INACTIVE | DELETED
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Flexible key-value attributes (color, size, material, etc.)
CREATE TABLE product_attributes (
product_id UUID NOT NULL REFERENCES products(id),
attr_key VARCHAR(100) NOT NULL,
attr_value VARCHAR(500) NOT NULL,
PRIMARY KEY (product_id, attr_key)
);
-- Image URLs, one row per image (actual files live in S3)
CREATE TABLE product_images (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
product_id UUID NOT NULL REFERENCES products(id),
image_url TEXT NOT NULL, -- CDN URL
is_primary BOOLEAN DEFAULT FALSE,
display_order INT DEFAULT 0
);
CREATE INDEX idx_products_seller ON products(seller_id);
CREATE INDEX idx_products_category ON products(category_id);
CREATE INDEX idx_products_status ON products(status);
Inventory DB (PostgreSQL + Redis)
CREATE TABLE inventory (
product_id UUID PRIMARY KEY,
warehouse_id UUID NOT NULL,
total_stock INT NOT NULL DEFAULT 0 CHECK (total_stock >= 0),
reserved_stock INT NOT NULL DEFAULT 0 CHECK (reserved_stock >= 0),
-- available = total_stock - reserved_stock
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Log every reservation and release for auditability
CREATE TABLE inventory_transactions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
product_id UUID NOT NULL,
order_id UUID,
type VARCHAR(20) NOT NULL, -- RESERVE | RELEASE | RESTOCK | ADJUST
quantity INT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Redis Key Pattern for Hot Inventory:
Key: inventory:{productId}
Type: String (integer)
Value: Available stock count (available = total - reserved)
TTL: None (permanent, synced from PostgreSQL)
Operation: DECRBY inventory:P-101 2 → atomic, thread-safe reservation
           INCRBY inventory:P-101 2 → atomic release on cancel
NOTE
The Redis count is the real-time source of truth for availability checks. PostgreSQL is the durable source of truth for the actual stock ledger. A background job reconciles the two every 60 seconds and corrects any discrepancy it finds.
Payments DB (PostgreSQL)
CREATE TABLE payments (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
order_id UUID NOT NULL,
method VARCHAR(30) NOT NULL, -- bkash | card | cod | sslcommerz
amount NUMERIC(12, 2) NOT NULL,
currency CHAR(3) NOT NULL DEFAULT 'BDT',
status VARCHAR(20) NOT NULL, -- PENDING | SUCCESS | FAILED | REFUNDED
gateway_txn_id VARCHAR(200), -- External gateway reference
idempotency_key VARCHAR(128) UNIQUE NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
settled_at TIMESTAMPTZ -- When funds confirmed by gateway
);
CREATE INDEX idx_payments_order_id ON payments(order_id);
-- Unique partial index: only one SUCCESS payment per order
CREATE UNIQUE INDEX idx_payments_order_success
  ON payments(order_id) WHERE status = 'SUCCESS';
11. Resilience Patterns
Circuit Breaker
Prevents a slow/failing service from crashing everything else.
Real Example: Payment gateway is slow during peak hours.
- Without circuit breaker: Every order hangs for 30 seconds, threads pile up, site crashes
- With circuit breaker: After 5 failures, the breaker opens and orders get "Payment temporarily unavailable, try again" within milliseconds
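A minimal circuit breaker can be sketched as below. The threshold of 5 failures comes from the example above; the 30-second reset timeout, the class names, and the single-trial half-open behaviour are assumptions for illustration, and the sketch is not thread-safe.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised immediately while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("payment temporarily unavailable, try again")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers fail fast with `CircuitOpenError` instead of waiting on the slow gateway, which is what keeps threads from piling up.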
Rate Limiting at the API Gateway
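One common way to implement rate limiting is a token bucket. The source does not say which algorithm the gateway uses, so the sketch below is an assumption chosen for illustration; `TokenBucket` and its parameters are hypothetical names.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_sec
        self._capacity = float(capacity)
        self._tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits, else False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A gateway would typically keep one bucket per client (for example, keyed by user ID or API key) and answer with HTTP 429 when `allow()` returns False.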
Retry with Exponential Backoff
When a transient failure occurs (network hiccup), retry intelligently:
Attempt 1 → fails → wait 1s
Attempt 2 → fails → wait 2s
Attempt 3 → fails → wait 4s
Attempt 4 → success
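The schedule above (1s, 2s, 4s) doubles the wait after each failed attempt. A minimal sketch, assuming a hypothetical `retry_with_backoff` helper: the optional jitter and the choice of which exceptions count as transient are assumptions, not details from the source.

```python
import random
import time
from typing import Any, Callable, Tuple, Type


def retry_with_backoff(
    operation: Callable[[], Any],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    jitter: float = 0.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call operation, waiting base_delay * 2**(attempt - 1) between failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            delay += random.uniform(0, jitter)  # spreads out simultaneous retries
            sleep(delay)
```

Only errors listed in `retry_on` are retried; anything else propagates immediately, since retrying a non-transient failure (such as a declined card) only adds load. Retried operations should also be idempotent, which is why the payment endpoints carry an idempotency key.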
12. Infrastructure & Deployment
Container Orchestration with Kubernetes
Each microservice runs in its own Docker container, managed by Kubernetes.
CI/CD Pipeline
Observability Stack
Distributed Tracing Example: A single order request touches 6 services. Jaeger shows you the full call chain with timing for each hop, so you can pinpoint exactly which service is slow.
13. Trade-offs & Lessons Learned
The Good
| Benefit | How It Helps |
|---|---|
| Independent Scaling | Scale Payment Service 10x during sale without touching others |
| Independent Deployment | Fix a bug in Notifications without redeploying the whole platform |
| Fault Isolation | Chat Service goes down → Orders still work perfectly |
| Tech Flexibility | Use Go for high-throughput services, Python for ML/fraud detection |
| Team Autonomy | 6 teams work in parallel without blocking each other |
The Hard Parts
| Challenge | Solution |
|---|---|
| Distributed Transactions | Saga Pattern with compensating transactions |
| Data Consistency | Eventual consistency via Kafka events |
| Service Discovery | Kubernetes DNS or Consul |
| Debugging | Distributed tracing (Jaeger), correlation IDs on every request |
| Testing | Contract testing (Pact), integration tests per service |
| Operational Complexity | Kubernetes, Helm charts, centralized config (Vault/ConfigMap) |
WARNING
Don't start with microservices. Many successful companies (including Amazon, Netflix, and Uber) started as monoliths and evolved to microservices as the team and traffic scaled. Starting with microservices too early adds substantial operational complexity before there is a scaling problem to justify it.
Summary: The Big Picture
This is the complete journey of a single order: it touches 10+ microservices, uses 4 external integrations, and processes 3 async Kafka events, all within a few seconds.
Key Concepts Glossary
| Term | Meaning |
|---|---|
| API Gateway | Single entry point that routes requests to the right service |
| Kafka | Distributed message bus for async event-driven communication |
| Saga Pattern | Distributed transaction handling with compensating rollbacks |
| Circuit Breaker | Pattern to stop calling a failing service and return fallback |
| Idempotency | Doing the same operation twice produces the same result (no double charges) |
| Eventual Consistency | Data will be consistent across services, just not instantly |
| Domain-Driven Design | Organizing services around business domains (Orders, Payments, etc.) |
| Service Mesh | Infrastructure layer (Istio) handling mTLS, tracing, retries between services |
| CQRS | Command Query Responsibility Segregation: separate read and write models |
| Database per Service | Each service owns its own database; no shared databases |
