Deep Dive & Detailed Design (Step 4 Deep Dive)
Time Budget: 15–20 minutes — The heart of the interview. This is where senior candidates separate themselves.
The Deep Dive phase is where you zoom into specific components of your High-Level Design, address bottlenecks, justify trade-offs, and show that you understand why systems fail at scale — not just what they look like on a whiteboard.
Why It Matters
The interviewer is now evaluating:
- Technical depth — Do you know how Redis eviction actually works? What happens during a Cassandra partition split?
- Trade-off reasoning — Can you articulate why you chose one approach over another?
- Failure thinking — Can you identify single points of failure and mitigate them?
- Seniority signals — Do you bring up edge cases the interviewer didn't even ask about?
The Deep Dive Flow
The 5 Deep Dive Lenses
For any component, examine it through these five lenses:
Worked Example: URL Shortener — Full Deep Dive
Bottleneck #1 — Database Scaling (Write Path)
The Problem: At 1,200 writes/sec, a single database node becomes a bottleneck. What's the sharding strategy?
Trade-off discussion:
| Sharding Key | Pro | Con |
|---|---|---|
| short_code (hash) | Even distribution, no hotspots | Cannot range-scan by date |
| user_id | Queries per-user are fast | Power users create hotspots |
| created_at (time) | Easy time-range queries | Recent shard is always hottest |
Decision: Shard by short_code using consistent hashing. The primary access pattern is GET by short_code, so this is the natural partition key. We don't need time-range queries in the redirect path.
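To make the sharding decision concrete, here is a minimal sketch of a consistent-hash ring keyed by short_code. It is illustrative only: Cassandra's partitioner already does this placement for you, and the shard names, the virtual-node count, and the use of MD5 for spreading keys are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit ring position (MD5 is used for spread, not security)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Consistent-hash ring with virtual nodes to even out the key distribution."""

    def __init__(self, shards, vnodes=100):
        self._vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> shard name
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard):
        """Place `vnodes` points for this shard on the ring."""
        for i in range(self._vnodes):
            pos = _hash(f"{shard}#{i}")
            self._owner[pos] = shard
            bisect.insort(self._positions, pos)

    def shard_for(self, short_code):
        """Return the shard owning the first ring position clockwise from the key."""
        pos = _hash(short_code)
        idx = bisect.bisect_right(self._positions, pos) % len(self._positions)
        return self._owner[self._positions[idx]]


# Usage: add a fourth shard and count how many keys change owner.
ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])  # hypothetical shard names
codes = [f"code{i}" for i in range(10_000)]
before = {c: ring.shard_for(c) for c in codes}
ring.add_shard("shard-d")
moved = sum(1 for c in codes if ring.shard_for(c) != before[c])
print(f"{moved / len(codes):.0%} of keys moved")
```

The point to make in the interview is that growing from 3 to 4 shards moves only about a quarter of the keys in expectation, and every moved key lands on the new shard. With plain `hash(key) % N`, most keys would be reshuffled.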
Bottleneck #2 — Cache Deep Dive (Read Path)
The Problem: At 116,000 reads/sec, every cache miss falls through to the database. How exactly does the cache layer work?
Cache sizing math revisited:
Target hit ratio: 95%
Hot URLs (20% of 100M active = 20M URLs):
20,000,000 × 500 bytes = 10 GB
Redis config:
# 20% buffer above the 10 GB working set
maxmemory 12gb
maxmemory-policy allkeys-lru
Expected:
95% of 116,000 reads/sec = 110,200 served from cache
5% of 116,000 reads/sec = 5,800 reach Cassandra
→ Cassandra only sees ~5,800 reads/sec instead of 116,000 ✅
Cache Stampede Problem & Solution:
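A cache stampede happens when a hot key expires or is evicted and many concurrent requests miss at once, so they all hit the database for the same row. One common mitigation, not necessarily the only one worth mentioning here, is request coalescing ("single flight"): only the first caller for a key loads it, and the rest wait for that result. The sketch below is a minimal in-process version. The dict stands in for Redis, and `load_from_db` is a hypothetical database read.

```python
import threading


class SingleFlightCache:
    """Cache-aside wrapper that lets one caller per key reach the database on a miss."""

    def __init__(self, loader):
        self._loader = loader           # function: key -> value (the database read)
        self._cache = {}                # stand-in for Redis
        self._key_locks = {}            # key -> lock guarding the reload of that key
        self._guard = threading.Lock()  # protects _key_locks

    def get(self, key):
        value = self._cache.get(key)
        if value is not None:
            return value                # cache hit: no locking needed
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the cache while we waited.
            value = self._cache.get(key)
            if value is None:
                value = self._loader(key)
                self._cache[key] = value
        return value


# Usage: 50 concurrent requests for the same cold key.
db_calls = []


def load_from_db(key):                  # hypothetical database read
    db_calls.append(key)
    return f"https://example.com/{key}"


cache = SingleFlightCache(load_from_db)
threads = [threading.Thread(target=cache.get, args=("abc123",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(db_calls))  # 1: only the first caller reached the database
```

This only coalesces requests inside one process. Across many app servers you would need a shared lock (for example a Redis key set with `SET ... NX PX`) or staggered TTLs, at the cost of extra round trips. The sketch also never removes per-key locks and treats a cached `None` as a miss, both of which a real implementation would have to handle.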
Bottleneck #3 — ID Generation at Scale
The Problem: Multiple app servers generating IDs simultaneously — how do we avoid collisions?
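The trade-off table later in this section picks Snowflake IDs encoded in Base62. As a rough sketch of how that avoids coordination, each server combines a timestamp, its own worker ID, and a per-millisecond sequence number. The bit layout (41/10/12), the custom epoch, and the Base62 alphabet order below are assumptions borrowed from the commonly described Snowflake scheme, not details given in this document.

```python
import threading
import time

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
EPOCH_MS = 1_577_836_800_000  # assumed custom epoch: 2020-01-01T00:00:00Z


class SnowflakeGenerator:
    """64-bit IDs: 41-bit millisecond timestamp | 10-bit worker id | 12-bit sequence."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._clock = clock
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to issue IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:          # 4096 IDs already issued this ms
                    while now <= self._last_ms:  # wait for the next millisecond
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self._worker_id << 12) | self._sequence


def base62_encode(n):
    """Encode a non-negative integer using the BASE62 alphabet."""
    if n == 0:
        return BASE62[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))


# Usage: two app servers generating IDs with no shared state.
gen_a = SnowflakeGenerator(worker_id=1)
gen_b = SnowflakeGenerator(worker_id=2)
ids = [g.next_id() for _ in range(5000) for g in (gen_a, gen_b)]
short_codes = [base62_encode(i) for i in ids]
print(len(set(ids)) == len(ids))  # True: no collisions across workers
```

Uniqueness depends on every server having a distinct worker ID and a clock that does not run backwards, so be ready to say how worker IDs get assigned. A full 64-bit ID also encodes to as many as 11 Base62 characters, which is worth raising if the requirements call for shorter codes.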
Bottleneck #4 — High Availability & Failure Modes
The Problem: What are the single points of failure (SPOF) in our architecture?
Cassandra Replication Deep Dive:
Replication Factor (RF) = 3
→ Every row stored on 3 different nodes
Write Consistency: QUORUM (2 of 3 nodes must ACK)
→ If 1 node dies, writes still succeed ✅
Read Consistency: QUORUM (2 of 3 nodes must respond)
→ If 1 node dies, reads still succeed ✅
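The quorum settings above can be checked with a little arithmetic. QUORUM is a majority of replicas, and a read is guaranteed to overlap the latest acknowledged write whenever read replicas plus write replicas exceed the replication factor. A minimal sketch, with function names chosen only for this example:

```python
def quorum(replication_factor):
    """A majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def analyze(rf, write_acks, read_acks):
    """Summarize consistency and fault tolerance for a given RF and ack counts."""
    return {
        # Read and write replica sets must overlap in at least one node.
        "reads_see_latest_write": write_acks + read_acks > rf,
        "node_failures_tolerated_for_writes": rf - write_acks,
        "node_failures_tolerated_for_reads": rf - read_acks,
    }


q = quorum(3)
quorum_result = analyze(3, q, q)  # QUORUM writes + QUORUM reads
one_result = analyze(3, 1, 1)     # ONE writes + ONE reads
print(q, quorum_result, one_result)
```

With RF = 3, QUORUM is 2, so 2 + 2 > 3 and each operation survives one node failure. Dropping both sides to ONE tolerates two failures but loses the overlap guarantee, which is the consistency versus availability trade-off to narrate.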
NODE FAILURE SCENARIO:
- Node 2 goes down
- Writes → Node 1 + Node 3 ACK → QUORUM met → SUCCESS
- Node 2 recovers → Hinted Handoff replays missed writes (for outages within the hint window, 3 hours by default; longer ones need a repair)
- System self-heals with zero operator intervention ✅
Bottleneck #5 — URL Expiration at Scale
The Problem: How do we efficiently expire and clean up 100M URLs/day without hammering the database?
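Best approach: Use Cassandra's native TTL as the primary expiry mechanism. Setting a TTL is O(1) per row at write time, and the database handles expiry itself with no full-table scans. Expired rows become tombstones that compaction purges after gc_grace_seconds, so add a weekly cleanup Lambda for tombstone compaction hygiene.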
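As a sketch of what the write path might look like, the snippet below computes a TTL in seconds from an expiry timestamp and pairs it with a CQL `INSERT ... USING TTL` statement. The table name, column names, and `%s` placeholder style are assumptions for illustration, since no schema is defined here. Adjust the placeholders to whatever your driver expects.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table and columns; the source does not define a schema.
INSERT_WITH_TTL = (
    "INSERT INTO urls (short_code, long_url, created_at) "
    "VALUES (%s, %s, %s) USING TTL %s"
)


def ttl_seconds(expires_at, now):
    """Whole seconds until expiry; reject URLs that are already expired."""
    remaining = int((expires_at - now).total_seconds())
    if remaining <= 0:
        raise ValueError("URL is already expired")
    return remaining


def build_insert(short_code, long_url, expires_at, now=None):
    """Return the CQL statement and its parameters for a row that expires on its own."""
    now = now or datetime.now(timezone.utc)
    params = (short_code, long_url, now, ttl_seconds(expires_at, now))
    return INSERT_WITH_TTL, params


# Usage: a link that should disappear after 30 days.
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
statement, params = build_insert(
    "abc123", "https://example.com", created + timedelta(days=30), now=created
)
print(statement)
print(params[3])  # 2592000 seconds = 30 days
```

Because the TTL travels with the write, no background job has to find expired rows later. The remaining operational cost is tombstone cleanup during compaction.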
Master Trade-Off Table
This is the trade-off decision log that demonstrates seniority:
NOTE
Alternate production variant: The redirect type (302) and consistency level (QUORUM) selected below assume a requirement for strict server-side analytics and no stale reads. This diverges from the baseline 5-step story, which allows eventual consistency (ONE) and treats analytics as out of scope (allowing 301 redirects).
| Decision | Option A | Option B | We Chose | Reason |
|---|---|---|---|---|
| Redirect type | 301 Permanent | 302 Temporary | 302 | Need server-side analytics tracking |
| Database | MySQL (SQL) | Cassandra (NoSQL) | Cassandra | 182 TB scale, key-value access pattern, built-in sharding |
| Sharding key | user_id | short_code | short_code | Primary access is by short code, avoids hotspots |
| Consistency level | Strong (QUORUM reads and writes) | Eventual (ONE) | QUORUM | Stale reads on redirects are unacceptable |
| Cache eviction | LFU | LRU | LRU | Recency is a better proxy for redirect popularity |
| ID generation | Hash + truncate | Snowflake + Base62 | Snowflake | No collision risk, no DB lookup for uniqueness check |
| Expiry mechanism | Cron scan | Cassandra TTL | Cassandra TTL | Native O(1) expiry, no full-table scans |
Failure Scenario Walkthrough
Practice narrating what happens when things go wrong:
Deep Dive Conversation Patterns
Use these phrases to drive the interview like a senior engineer:
SURFACING TRADE-OFFS:
"I chose [X] over [Y] here because at our scale,
[specific reason tied to our estimates]."
ACKNOWLEDGING LIMITATIONS:
"This design has a trade-off: [limitation].
We could address it with [alternative], but that
adds complexity. Given our requirements, I think
[current choice] is the right balance."
PROBING FOR DIRECTION:
"I can go deeper on the caching strategy, the
sharding approach, or the ID generation mechanism.
Which area is most interesting to you?"
FAILURE THINKING:
"One failure mode I want to call out: if Redis
goes down entirely, here's what happens and how
we recover..."
SCALABILITY EXTENSION:
"If our traffic grew 10x to roughly 1.2 million reads/sec,
we'd need to [specific change], which would
require [trade-off]."
Red Flags vs. Green Flags
| 🔴 Red Flag | 🟢 Green Flag |
|---|---|
| Only describe what a component does | Explain why it's needed at this specific scale |
| Present one option | Present 2–3 options, pick one, justify the choice |
| Ignore failure scenarios | Proactively call out SPOFs and mitigations |
| Say "it scales horizontally" without detail | Explain how: sharding key, replication factor, quorum |
| Treat every trade-off as equally important | Prioritize trade-offs that match your non-functional requirements |
| Wait for the interviewer to ask about edge cases | Surface edge cases yourself (cache stampede, hot partitions) |
Next Steps
With bottlenecks addressed and trade-offs justified, move to Step 5: Wrap-Up, where you summarize, identify remaining risks, and discuss future improvements.
IMPORTANT
Let the interviewer guide the depth. After completing each deep dive, pause and ask: "I can go deeper here or move to another area — which would be more valuable?" This shows collaborative problem-solving, not monologue delivery.
TIP
The magic phrase: "This is a trade-off between X and Y. Given our requirement for [non-functional requirement], I'd choose X because..." — Use this template for every major decision. It makes your reasoning transparent and easy to follow.
