Database Scaling — Core Concepts
Interview Relevance: Critical — "How would you scale this to 10× traffic?" is asked in every FAANG interview. Know replication, sharding, and consistent hashing cold.
A single database server has hard limits. At scale, you need strategies to spread load across multiple machines. There are two fundamental dimensions:
- Read Scaling → Replication (more readers)
- Write Scaling → Sharding / Partitioning (more writers)
The Scaling Ladder
Interview tip: Always go through this ladder. Jumping straight to sharding is over-engineering and a red flag.
Strategy 1 — Master-Slave Replication
How It Works
One node (Master) accepts all writes. Changes are streamed to one or more Slave (Replica) nodes which serve reads only.
Synchronous vs. Asynchronous Replication
| Mode | Durability | Write Latency | Use Case |
|---|---|---|---|
| Synchronous | ✅ Zero data loss | Higher (waits for replica ACK) | Financial transactions, critical data |
| Asynchronous | ⚠️ Small replication lag | Lower (no wait) | Most web apps, analytics |
| Semi-synchronous | ✅ At least 1 replica ACK | Moderate | Middle ground (MySQL default option) |
Failover: What Happens When Master Dies?
Tools for automatic failover:
- MySQL: Orchestrator, MHA (Master High Availability)
- PostgreSQL: Patroni, repmgr
- Redis: Redis Sentinel, Redis Cluster
Master-Slave Trade-offs
✅ PROS:
- Read throughput scales linearly (add more replicas)
- Read replicas can serve reporting/analytics without impacting production
- Simple to set up (built into MySQL, PostgreSQL, MongoDB)
- Replicas can be used for backups and disaster recovery
❌ CONS:
- Master is still a single point of failure for WRITES
- Replication lag → reads from replica may return stale data
- If master dies and async replicas are behind → potential data loss
- All write traffic still funnels through one nodeStrategy 2 — Multi-Master Replication
Every node can accept both reads and writes. Changes are replicated between all masters.
The Core Problem: Write Conflicts
Conflict Resolution Strategies:
| Strategy | How It Works | Risk |
|---|---|---|
| Last-Write-Wins (LWW) | Timestamp determines winner | Clock skew → wrong value wins |
| Application-level | App code detects and merges | Complex logic required |
| CRDT | Data structures designed for merging | Limited to specific data types |
| Avoid-conflicts | Route same entity to same master | Partially loses multi-master benefit |
Multi-Master is RARELY the right answer in interviews.
Use it only for:
✅ Active-active geo-distributed writes (global apps)
✅ When regional write autonomy is required
✅ When you can accept eventual consistency with conflict resolution
For most systems: Prefer Master-Slave + Sharding instead.Strategy 3 — Sharding (Horizontal Partitioning)
Split the dataset itself across multiple independent database servers. Each server (shard) owns a subset of the data.
Without Sharding vs. With Sharding
Shard Key Selection (Critical)
The shard key (also called partition key) determines which shard stores a piece of data. This is the most important sharding decision.
Good shard key properties:
- ✅ High cardinality — many distinct values (user_id, order_id, URL hash)
- ✅ Evenly distributed — no hot spots
- ✅ Immutable — never changes (moving data between shards is expensive)
- ❌ Avoid: timestamp, status, country/region (low cardinality → hot shards)
Consistent Hashing for Sharding
Simple hash(key) % N breaks catastrophically when N changes (add/remove a shard).
Simple Modulo Hashing:
3 shards: hash("alice") % 3 = 0 → Shard 0
Add shard: hash("alice") % 4 = 3 → Shard 3 ← MOVED!
On adding 1 shard to 3: ~75% of all keys move!
→ Massive data migration + cache invalidationConsistent hashing solves this by arranging servers on a virtual ring:
Consistent Hashing in the Real World
| System | Uses Consistent Hashing For |
|---|---|
| Cassandra | Token ring — each node owns a token range |
| DynamoDB | Partition key hashed to virtual node |
| Redis Cluster | 16,384 hash slots distributed across nodes |
| Chord / Dynamo | Academic foundations of modern DHTs |
| Memcached | Client-side consistent hashing |
Sharding Strategies
Range-Based Sharding
✅ Range queries are efficient (e.g., "get all orders from Jan")
✅ Simple to reason about
❌ Hot shards: recent data always hits the last shard
❌ Uneven distribution if data isn't uniformly spreadHash-Based Sharding
✅ Even distribution by design
✅ No hot shards
❌ Range queries are expensive (must query all shards and merge)
❌ Adding/removing shards causes key redistribution (use consistent hashing!)Directory-Based Sharding
A lookup table (shard map) stores which shard holds which data.
✅ Maximum flexibility — can move data between shards without changing key scheme
✅ Supports heterogeneous shard sizes
❌ Lookup table is itself a bottleneck and SPOF (must be cached)
❌ Extra network hop per requestThe Sharding Challenges
Cross-Shard Joins
Solution: Design your shard key so that data that's queried together is stored together (co-location).
Resharding (Shard Split)
When one shard grows too large, it must be split:
Mitigation: Use consistent hashing from the start — adding nodes only remaps ~1/N keys.
Worked Example: URL Shortener Scaling Journey
Phase 3 Cassandra Details:
| Decision | Choice | Reason |
|---|---|---|
| Shard key | short_code hash | High cardinality, immutable, evenly distributed |
| Consistency level | QUORUM reads/writes | Balance between consistency and availability |
| Replication factor | 3 per DC | Survive 1 node failure without data loss |
| Virtual nodes | 256 vnodes/node | Even data distribution across heterogeneous nodes |
| Compaction | LeveledCompaction | Optimized for read-heavy redirect workload |
Interview Cheat Sheet
One-Line Summaries
Master-Slave: 1 writer, N readers — scale reads linearly
Multi-Master: N writers — avoids write SPOF, complex conflict resolution
Sharding: Split data across servers — scale writes and storage
Range Sharding: Data split by value ranges — great for scans, hot shard risk
Hash Sharding: Data split by hash(key) — even distribution, no range queries
Directory Sharding: Lookup table maps keys to shards — flexible, extra hop
Consistent Hashing: Hash ring — only 1/N keys remapped when topology changes
Virtual Nodes: Multiple ring positions per server — even load distributionThe Interview Phrase
"First I'd add read replicas and Redis caching — together these can
handle 10x read traffic with zero schema changes. If writes become
the bottleneck, I'd shard on user_id using consistent hashing so
adding new shards only remaps ~1/N keys and avoids a full data
migration. Cross-shard joins would be handled by denormalizing and
co-locating related data on the same shard."Red Flags vs. Green Flags
| 🔴 Red Flag | 🟢 Green Flag |
|---|---|
| Jump straight to sharding | Follow the ladder: cache → replicas → sharding |
Use created_at as shard key | Pick high-cardinality, immutable key (user_id, UUID) |
| Forget cross-shard join problem | Address it with co-location or denormalization |
| Say "just add nodes" | Explain consistent hashing to minimize resharding pain |
| Ignore replication lag | State whether async/sync and consequences for reads |
| Multi-master for everything | Reserve multi-master for active-active geo-distributed systems |
IMPORTANT
In an interview, always explain why you chose the shard key. Say: "I'm sharding on user_id because it's high cardinality, immutable, and ensures all of a user's data is co-located, which avoids cross-shard joins for user-centric queries."
TIP
Mentioning the replication lag problem with Master-Slave and how you'd handle it (e.g., reading from master for critical paths, replica for non-critical) demonstrates real-world experience.
