🗃️ Data Storage Patterns in System Design

How you store data is as critical as how you process it. The right storage pattern determines your system's scalability, reliability, and performance.

Overview

1. 📦 Normalization vs Denormalization

Normalization — Eliminate Redundancy

Organize data to reduce duplication. Data is split into multiple related tables.

✅ When to Use: Write-heavy systems, financial applications, when data integrity is critical.

sql

-- Normalized Query: Join required
SELECT u.name, p.name, oi.quantity
FROM users u
JOIN orders o ON u.user_id = o.user_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE u.user_id = 42;

Denormalization — Optimize for Reads

Intentionally duplicate data to avoid expensive JOINs.

✅ When to Use: Read-heavy workloads, analytics dashboards, reporting systems.

javascript

// MongoDB denormalized document — no joins needed
const order = {
  orderId: "ord_123",
  user: { id: 42, name: "Alice", email: "alice@example.com" },
  items: [
    { productId: "p1", name: "Laptop", price: 999.99, qty: 1 },
    { productId: "p2", name: "Mouse", price: 29.99, qty: 2 },
  ],
  totalAmount: 1059.97,
  createdAt: new Date(),
};

	Normalization	Denormalization
Storage	Less (no duplication)	More (data duplicated)
Read Speed	Slower (JOINs needed)	Faster (single query)
Write Speed	Faster (one place)	Slower (update many places)
Consistency	Easier	Harder

2. 🔑 Key-Value Store Pattern

Data stored as simple key → value pairs. Extremely fast lookups by key.

Real-world example: Session Management

javascript

// Store session — O(1) write
await redis.setex(
  `session:${userId}`,
  3600,
  JSON.stringify({
    userId,
    role: "admin",
    loginAt: Date.now(),
  })
);

// Retrieve session — O(1) read
const session = JSON.parse(await redis.get(`session:${userId}`));

// Rate limiting with atomic increment
const hits = await redis.incr(`rate:${ip}:${minute}`);
if (hits > 100) throw new Error("Rate limit exceeded");

✅ Best For: Sessions, caching, rate limiting, feature flags, leaderboards
❌ Not For: Complex queries, relationships, full-text search

3. 📄 Document Store Pattern

Stores self-contained JSON/BSON documents. No rigid schema required.

Real-world example: E-commerce Product Catalog

javascript

// Products can have completely different structures — schema-flexible
const smartphone = {
  _id: "prod_001",
  name: "iPhone 15",
  category: "electronics",
  specs: { storage: "256GB", camera: "48MP", battery: "3877mAh" },
  variants: [
    { color: "Black", stock: 50 },
    { color: "White", stock: 30 },
  ],
  price: 999,
  tags: ["apple", "5g", "flagship"],
};

const book = {
  _id: "prod_002",
  name: "Clean Code",
  category: "books",
  author: "Robert C. Martin",
  isbn: "978-0132350884",
  pages: 464,
  price: 35,
};

// Query with embedded arrays — no JOIN needed
const results = await db.products.find({
  category: "electronics",
  "variants.stock": { $gt: 0 },
  tags: "5g",
});

✅ Best For: Catalogs, CMS, user profiles, event logs
❌ Not For: Multi-entity transactions, complex reporting with aggregations

4. 🏛️ Wide-Column Store Pattern

Rows have dynamic columns grouped into column families. Optimized for time-series and heavy writes.

Real-world example: IoT Sensor Time-Series

javascript

// Cassandra data model — partition by device, sort by time
// Table: CREATE TABLE sensor_data (
//   device_id text,
//   recorded_at timestamp,
//   temperature float,
//   humidity float,
//   PRIMARY KEY (device_id, recorded_at)
// ) WITH CLUSTERING ORDER BY (recorded_at DESC);

// Insert — extremely fast, append-only
await client.execute(
  `INSERT INTO sensor_data (device_id, recorded_at, temperature, humidity)
   VALUES (?, ?, ?, ?)`,
  ["device_abc", new Date(), 23.5, 61.2],
  { prepare: true }
);

// Range query on one partition — millisecond response even at billions of rows
const result = await client.execute(
  `SELECT * FROM sensor_data
   WHERE device_id = 'device_abc'
   AND recorded_at > ? AND recorded_at < ?`,
  [yesterday, now],
  { prepare: true }
);

✅ Best For: Time-series, IoT, analytics, write-heavy append logs
❌ Not For: Ad-hoc queries, joins, secondary filtering

5. 🕸️ Graph Database Pattern

Entities are nodes; relationships are edges with properties. Perfect for connected data.

Real-world example: Social Network Friend Recommendations

cypher

// Neo4j Cypher Query — Find friends-of-friends (2nd degree connections)
MATCH (me:User {id: "alice"})
      -[:FOLLOWS]->(:User)
      -[:FOLLOWS]->(recommended:User)
WHERE NOT (me)-[:FOLLOWS]->(recommended)
  AND me <> recommended
RETURN recommended.name, COUNT(*) AS mutualConnections
ORDER BY mutualConnections DESC
LIMIT 10;

javascript

// Node.js with Neo4j driver
const session = driver.session();
const result = await session.run(
  `
  MATCH (u:User {id: $userId})-[:PURCHASED]->(p:Product)
        <-[:PURCHASED]-(similar:User)-[:PURCHASED]->(rec:Product)
  WHERE NOT (u)-[:PURCHASED]->(rec)
  RETURN rec.name, COUNT(*) AS score
  ORDER BY score DESC LIMIT 5
`,
  { userId: "user_42" }
);

const recommendations = result.records.map((r) => r.get("rec.name"));

✅ Best For: Social graphs, fraud detection, recommendation engines, knowledge graphs
❌ Not For: Simple key lookups, bulk analytics, high-throughput writes

6. ⚡ CQRS Pattern (Command Query Responsibility Segregation)

Separate the write model (Commands) from the read model (Queries).

Real-world example: Order Management System

typescript

// COMMAND — Write side (strict validation, normalized)
class PlaceOrderCommand {
  constructor(
    public readonly userId: string,
    public readonly items: OrderItem[]
  ) {}
}

class OrderCommandHandler {
  async handle(cmd: PlaceOrderCommand) {
    const order = Order.create(cmd.userId, cmd.items); // domain logic
    await this.orderRepo.save(order); // save to write DB
    await this.eventBus.publish(new OrderPlacedEvent(order)); // notify
  }
}

// QUERY — Read side (pre-aggregated, fast)
class GetUserOrdersQuery {
  constructor(public readonly userId: string) {}
}

class OrderQueryHandler {
  async handle(query: GetUserOrdersQuery) {
    // Read from pre-projected, denormalized read model
    return this.readModel.findByUserId(query.userId);
  }
}

// READ PROJECTOR — keeps read DB in sync
class OrderProjector {
  async on(event: OrderPlacedEvent) {
    await this.readModel.upsert({
      orderId: event.orderId,
      userName: await this.userService.getName(event.userId),
      itemCount: event.items.length,
      totalAmount: event.items.reduce((s, i) => s + i.price, 0),
      status: "pending",
      createdAt: event.timestamp,
    });
  }
}

✅ Benefits: Independent scaling of reads/writes, optimized query models, audit trail
⚠️ Tradeoff: Eventual consistency, added complexity

7. 📜 Event Sourcing Pattern

Store state as a sequence of events, not current values. Reconstruct state by replaying events.

Real-world example: Bank Account

typescript

// Events — immutable facts that happened
type AccountEvent =
  | { type: "AccountOpened"; initialBalance: number; timestamp: Date }
  | { type: "MoneyDeposited"; amount: number; timestamp: Date }
  | { type: "MoneyWithdrawn"; amount: number; timestamp: Date };

// Reconstruct state by replaying events (pure function)
function replayEvents(events: AccountEvent[]): { balance: number } {
  return events.reduce(
    (state, event) => {
      switch (event.type) {
        case "AccountOpened":
          return { balance: event.initialBalance };
        case "MoneyDeposited":
          return { balance: state.balance + event.amount };
        case "MoneyWithdrawn":
          return { balance: state.balance - event.amount };
      }
    },
    { balance: 0 }
  );
}

// Usage
const events = await eventStore.getEvents("account_42");
const currentState = replayEvents(events);
// { balance: 300 }

// Time travel — what was balance on Jan 10?
const jan10Events = events.filter((e) => e.timestamp <= new Date("2024-01-10"));
const historicState = replayEvents(jan10Events);

✅ Benefits: Full audit log, time-travel queries, event replay for debugging
⚠️ Tradeoff: Larger storage, complex to query current state

8. 🏗️ Data Lake vs Data Warehouse Pattern

	Data Lake	Data Warehouse
Schema	Schema-on-read	Schema-on-write
Format	Raw (JSON, CSV, Parquet)	Structured, columnar
Cost	Cheap storage (S3)	Expensive compute
Users	Data Scientists	Analysts, BI
Latency	Hours/batch	Seconds/interactive
Example	AWS S3 + Athena	BigQuery, Snowflake

9. 🔄 Replication Patterns

Primary-Replica (Read Scaling)

Multi-Primary (Write Scaling + High Availability)

Quorum-Based Replication (Consistency Control)

text

Formula: R + W > N (Quorum)
- N = total replicas (e.g., 3)
- W = write acknowledgments required (e.g., 2)
- R = read replicas to consult (e.g., 2)
- R + W > N = 4 > 3 ✅ Consistent reads guaranteed

10. 💾 Caching Storage Patterns

Cache-Aside (Most Common)

javascript

async function getUser(userId: string) {
  // 1. Check cache first
  const cached = await redis.get(`user:${userId}`);
  if (cached) return JSON.parse(cached); // Cache HIT

  // 2. Cache MISS — fetch from DB
  const user = await db.users.findOne({ id: userId });

  // 3. Populate cache for next time (TTL: 1 hour)
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(user));

  return user;
}

// On update — invalidate cache
async function updateUser(userId: string, data: Partial<User>) {
  await db.users.updateOne({ id: userId }, data);
  await redis.del(`user:${userId}`); // Invalidate!
}

Write-Through

javascript

async function saveUserProfile(userId: string, profile: UserProfile) {
  // Write to BOTH cache and DB simultaneously
  await Promise.all([
    redis.setex(`user:${userId}`, 3600, JSON.stringify(profile)),
    db.users.updateOne({ id: userId }, profile)
  ]);
}
// ✅ Cache always fresh  ⚠️ Higher write latency

Write-Behind (Write-Back)

javascript

// Write to cache immediately, flush to DB asynchronously
async function updateCounter(key: string, value: number) {
  await redis.set(key, value); // Instant return
  await queue.push({ type: "DB_WRITE", key, value }); // Async flush
}

// Background worker flushes every N seconds
setInterval(async () => {
  const batch = await queue.drain();
  await db.bulkWrite(batch); // Efficient bulk write
}, 5000);
// ✅ Very fast writes  ⚠️ Risk of data loss if cache crashes

11. 🧩 Polyglot Persistence Pattern

Use multiple database technologies for different parts of the same system.

Why polyglot? No single database is best for everything. Optimize each component with the right tool.

Component	Storage	Reason
User Auth	PostgreSQL	ACID, relational integrity
Product Catalog	MongoDB	Flexible schema, nested docs
Shopping Cart	Redis	Sub-millisecond speed, TTL
Search	Elasticsearch	Full-text, faceted search
Analytics	ClickHouse	Columnar, OLAP queries
Media	S3 + CDN	Cheap blob storage, global edge
Recommendations	Neo4j	Graph traversal

12. 🌊 Lambda vs Kappa Architecture

Lambda Architecture (Batch + Streaming)

Kappa Architecture (Streaming Only — Simpler)

	Lambda	Kappa
Complexity	High (two systems)	Lower (one system)
Accuracy	High (batch reprocessing)	Depends on stream
Latency	Mixed (batch=hours, stream=ms)	Low (ms)
Reprocessing	Easy (batch layer)	Replay from Kafka

13. 🗺️ Pattern Selection Guide

✅ Checklist

[ ] I understand when to normalize vs denormalize
[ ] I can explain Key-Value, Document, Wide-Column, and Graph stores
[ ] I know how CQRS separates reads from writes
[ ] I can explain Event Sourcing and its benefits (audit trail, time-travel)
[ ] I understand Primary-Replica vs Multi-Primary replication
[ ] I know the Quorum formula: R + W > N
[ ] I can explain Cache-Aside, Write-Through, and Write-Behind
[ ] I understand why Polyglot Persistence is used in large systems
[ ] I can distinguish Lambda from Kappa architecture
[ ] I can pick the right storage pattern for a given use case

➡️ Next: Level 4 — Caching

🗃️ Data Storage Patterns in System Design ​

Overview ​

1. 📦 Normalization vs Denormalization ​

Normalization — Eliminate Redundancy ​

Denormalization — Optimize for Reads ​

2. 🔑 Key-Value Store Pattern ​

3. 📄 Document Store Pattern ​

4. 🏛️ Wide-Column Store Pattern ​

5. 🕸️ Graph Database Pattern ​

6. ⚡ CQRS Pattern (Command Query Responsibility Segregation) ​

7. 📜 Event Sourcing Pattern ​

8. 🏗️ Data Lake vs Data Warehouse Pattern ​

9. 🔄 Replication Patterns ​

Primary-Replica (Read Scaling) ​

Multi-Primary (Write Scaling + High Availability) ​

Quorum-Based Replication (Consistency Control) ​

10. 💾 Caching Storage Patterns ​

Cache-Aside (Most Common) ​

Write-Through ​

Write-Behind (Write-Back) ​

11. 🧩 Polyglot Persistence Pattern ​

12. 🌊 Lambda vs Kappa Architecture ​

Lambda Architecture (Batch + Streaming) ​

Kappa Architecture (Streaming Only — Simpler) ​

13. 🗺️ Pattern Selection Guide ​

✅ Checklist ​

📚 Related Topics ​

🗃️ Data Storage Patterns in System Design

Overview

1. 📦 Normalization vs Denormalization

Normalization — Eliminate Redundancy

Denormalization — Optimize for Reads

2. 🔑 Key-Value Store Pattern

3. 📄 Document Store Pattern

4. 🏛️ Wide-Column Store Pattern

5. 🕸️ Graph Database Pattern

6. ⚡ CQRS Pattern (Command Query Responsibility Segregation)

7. 📜 Event Sourcing Pattern

8. 🏗️ Data Lake vs Data Warehouse Pattern

9. 🔄 Replication Patterns

Primary-Replica (Read Scaling)

Multi-Primary (Write Scaling + High Availability)

Quorum-Based Replication (Consistency Control)

10. 💾 Caching Storage Patterns

Cache-Aside (Most Common)

Write-Through

Write-Behind (Write-Back)

11. 🧩 Polyglot Persistence Pattern

12. 🌊 Lambda vs Kappa Architecture

Lambda Architecture (Batch + Streaming)

Kappa Architecture (Streaming Only — Simpler)

13. 🗺️ Pattern Selection Guide

✅ Checklist

📚 Related Topics