Deep Dive & Detailed Design (Step 4)

Time Budget: 15–20 minutes — The heart of the interview. This is where senior candidates separate themselves.

The Deep Dive phase is where you zoom into specific components of your High-Level Design, address bottlenecks, justify trade-offs, and show that you understand why systems fail at scale — not just what they look like on a whiteboard.


Why It Matters

The interviewer is now evaluating:

  • Technical depth — Do you know how Redis eviction actually works? What happens during a Cassandra partition split?
  • Trade-off reasoning — Can you articulate why you chose one approach over another?
  • Failure thinking — Can you identify single points of failure and mitigate them?
  • Seniority signals — Do you bring up edge cases the interviewer didn't even ask about?

The Deep Dive Flow


The 5 Deep Dive Lenses

For any component, examine it through these five lenses:


Worked Example: URL Shortener — Full Deep Dive

Bottleneck #1 — Database Scaling (Write Path)

The Problem: At 1,200 writes/sec and 182 TB of total data, a single database node becomes a bottleneck for both storage and throughput. What's the sharding strategy?

Trade-off discussion:

| Sharding Key | Pro | Con |
| --- | --- | --- |
| short_code (hash) | Even distribution, no hotspots | Cannot range-scan by date |
| user_id | Queries per-user are fast | Power users create hotspots |
| created_at (time) | Easy time-range queries | Recent shard is always hottest |

Decision: Shard by short_code using consistent hashing. The primary access pattern is GET by short_code, so this is the natural partition key. We don't need time-range queries in the redirect path.
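The sharding decision can be illustrated with a minimal consistent-hash ring. This is only a sketch under stated assumptions: the node names, the use of MD5 for spreading keys, and the count of 100 virtual nodes per server are illustrative choices, not part of the original design. Cassandra ships its own partitioner, so you would not write this yourself in production.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit ring position (MD5 used only for key spread)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Toy ring: each node owns the arc ending at each of its virtual positions."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes      # virtual nodes per physical node (assumed value)
        self._positions = []       # sorted ring positions
        self._owners = {}          # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self._vnodes):
            pos = _hash(f"{node}#{i}")
            self._owners[pos] = node
            bisect.insort(self._positions, pos)

    def get_node(self, short_code):
        if not self._positions:
            raise ValueError("ring has no nodes")
        # First position clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(self._positions, _hash(short_code))
        idx %= len(self._positions)
        return self._owners[self._positions[idx]]
```

The property worth pointing out in an interview: when a node is added, the only keys that change owner are the ones that move to the new node, so rebalancing touches roughly 1/N of the data rather than all of it.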


Bottleneck #2 — Cache Deep Dive (Read Path)

The Problem: At 116,000 reads/sec, every cache miss falls through to the database. How does the cache layer work precisely?

Cache sizing math revisited:

text
Target hit ratio: 95%

Hot URLs (20% of 100M active = 20M URLs):
  20,000,000 × 500 bytes = 10 GB

Redis config:
  maxmemory 12gb          # 20% buffer above working set
  maxmemory-policy allkeys-lru

Expected:
  95% of 116,000 reads/sec = 110,200 served from cache
  5% of 116,000 reads/sec  = 5,800 reach Cassandra
  → Cassandra only sees ~5,800 reads/sec instead of 116,000 ✅
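The sizing arithmetic above can be reproduced with a short script, which makes it easy to re-run when the interviewer changes an assumption. The function and parameter names are illustrative; the inputs are the figures from the estimate above, and integer percentages are used to avoid floating-point rounding.

```python
def cache_plan(active_urls, hot_percent, bytes_per_entry,
               reads_per_sec, hit_percent, headroom_percent=20):
    """Return working-set size, Redis maxmemory, and the read split."""
    hot_urls = active_urls * hot_percent // 100
    working_set_bytes = hot_urls * bytes_per_entry
    maxmemory_bytes = working_set_bytes * (100 + headroom_percent) // 100
    cache_reads = reads_per_sec * hit_percent // 100
    return {
        "hot_urls": hot_urls,
        "working_set_gb": working_set_bytes / 10**9,
        "maxmemory_gb": maxmemory_bytes / 10**9,
        "cache_reads_per_sec": cache_reads,
        "db_reads_per_sec": reads_per_sec - cache_reads,
    }


plan = cache_plan(
    active_urls=100_000_000,
    hot_percent=20,
    bytes_per_entry=500,
    reads_per_sec=116_000,
    hit_percent=95,
)
print(plan)
```

Dropping the hit ratio to 90% in this sketch doubles the database load to 11,600 reads/sec, which is a useful sensitivity point to mention.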

Cache Stampede Problem & Solution:
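A stampede happens when a hot key expires and many concurrent requests all miss at once, each one querying the database for the same row. One common mitigation is request coalescing (often called "single flight"): the first caller loads the value and everyone else waits for that result. The sketch below is an in-process illustration only, under the assumption that `loader` is a hypothetical function that reads one URL from the database. A plain dict stands in for Redis, and coordinating across app servers would need an additional mechanism such as a distributed lock.

```python
import threading


class SingleFlightCache:
    """Coalesce concurrent misses so each key is loaded at most once at a time."""

    def __init__(self, loader):
        self._loader = loader      # hypothetical DB lookup: key -> value
        self._cache = {}           # stand-in for the real cache
        self._in_flight = {}       # key -> Event set when the load finishes
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._cache:
                    return self._cache[key]
                event = self._in_flight.get(key)
                is_leader = event is None
                if is_leader:
                    event = threading.Event()
                    self._in_flight[key] = event
            if is_leader:
                try:
                    value = self._loader(key)   # only the leader hits the DB
                    with self._lock:
                        self._cache[key] = value
                    return value
                finally:
                    with self._lock:
                        del self._in_flight[key]
                    event.set()                 # wake the waiting followers
            else:
                event.wait()
                # Loop and re-check the cache. If the leader failed,
                # one follower becomes the new leader and retries.
```

With 20 threads requesting the same cold key, the loader in this sketch runs once instead of 20 times.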


Bottleneck #3 — ID Generation at Scale

The Problem: Multiple app servers generating IDs simultaneously — how do we avoid collisions?
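One way to avoid collisions without a database round trip is a Snowflake-style ID encoded in Base62, sketched below. The bit layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence) follows the commonly cited Snowflake scheme; the custom epoch, the alphabet ordering, and the assumption that each app server is assigned a unique `worker_id` are illustrative choices, not details from the original design.

```python
import threading
import time

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch (milliseconds)


def base62_encode(n: int) -> str:
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))


class SnowflakeGenerator:
    def __init__(self, worker_id: int):
        if not 0 <= worker_id < 1024:          # 10 bits for the worker
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        return time.time_ns() // 1_000_000

    def next_id(self) -> int:
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF  # 12 bits
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self._worker_id << 12) | self._sequence

    def next_short_code(self) -> str:
        return base62_encode(self.next_id())
```

Two generators with different worker IDs can never emit the same ID, because the worker bits differ. The trade-off to mention is length: a full 63-bit ID encodes to as many as 11 Base62 characters, longer than a typical 7-character hash-based code.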


Bottleneck #4 — High Availability & Failure Modes

The Problem: What are the single points of failure (SPOF) in our architecture?

Cassandra Replication Deep Dive:

text
Replication Factor (RF) = 3
  → Every row stored on 3 different nodes

Write Consistency: QUORUM (2 of 3 nodes must ACK)
  → If 1 node dies, writes still succeed ✅

Read Consistency: QUORUM (2 of 3 nodes must respond)
  → If 1 node dies, reads still succeed ✅

NODE FAILURE SCENARIO:
  - Node 2 goes down
  - Writes → Node 1 + Node 3 ACK → QUORUM met → SUCCESS
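  - Node 2 recovers → Hinted Handoff replays missed writes
    (only for outages within the hint window, 3 hours by default)
  - Short outages self-heal with no operator intervention ✅;
    longer outages need a repair to resync Node 2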
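The quorum reasoning above reduces to two small formulas: a quorum is a majority of replicas, and reads are guaranteed to see the latest acknowledged write when the read and write replica counts overlap (R + W > RF). The helper names below are illustrative.

```python
def quorum(rf: int) -> int:
    """Majority of replicas for a given replication factor."""
    return rf // 2 + 1


def reads_see_latest_write(rf: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return write_acks + read_acks > rf


def tolerated_node_failures(rf: int, required_acks: int) -> int:
    """How many replicas can be down while the operation still succeeds."""
    return rf - required_acks


RF = 3
q = quorum(RF)
print(q)                                   # replicas needed for QUORUM
print(reads_see_latest_write(RF, q, q))    # QUORUM writes + QUORUM reads
print(reads_see_latest_write(RF, 1, 1))    # ONE writes + ONE reads
print(tolerated_node_failures(RF, q))      # nodes that may be down
```

With RF = 3, QUORUM on both paths overlaps on at least one replica and tolerates one node failure; ONE on both paths tolerates two failures but can return stale data.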

Bottleneck #5 — URL Expiration at Scale

The Problem: How do we efficiently expire and clean up 100M URLs/day without hammering the database?

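Best approach: Use Cassandra's native TTL as the primary expiry mechanism. The TTL is set when the row is written and Cassandra stops returning the row once it elapses, so expiry costs O(1) per row with no full-table scans. Expired cells become tombstones that compaction purges after gc_grace_seconds, so add a weekly cleanup Lambda for tombstone compaction hygiene.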


Master Trade-Off Table

This is the trade-off decision log that demonstrates seniority:

NOTE

Alternate production variant: The redirect type (302) and consistency level (QUORUM) selected below assume a requirement for strict server-side analytics and no stale reads. This diverges from the baseline 5-step story, which allows eventual consistency (ONE) and treats analytics as out of scope (allowing 301 redirects).

| Decision | Option A | Option B | We Chose | Reason |
| --- | --- | --- | --- | --- |
| Redirect type | 301 Permanent | 302 Temporary | 302 | Need server-side analytics tracking |
| Database | MySQL (SQL) | Cassandra (NoSQL) | Cassandra | 182 TB scale, key-value access pattern, built-in sharding |
| Sharding key | user_id | short_code | short_code | Primary access is by short code, avoids hotspots |
| Consistency level | Strong (QUORUM reads and writes) | Eventual (ONE) | QUORUM | Stale reads on redirects are unacceptable |
| Cache eviction | LFU | LRU | LRU | Recency is a better proxy for redirect popularity |
| ID generation | Hash + truncate | Snowflake + Base62 | Snowflake | No collision risk, no DB lookup for uniqueness check |
| Expiry mechanism | Cron scan | Cassandra TTL | Cassandra TTL | Native O(1) expiry, no full-table scans |

Failure Scenario Walkthrough

Practice narrating what happens when things go wrong:


Deep Dive Conversation Patterns

Use these phrases to drive the interview like a senior engineer:

text
SURFACING TRADE-OFFS:
"I chose [X] over [Y] here because at our scale,
 [specific reason tied to our estimates]."

ACKNOWLEDGING LIMITATIONS:
"This design has a trade-off: [limitation].
 We could address it with [alternative], but that
 adds complexity. Given our requirements, I think
 [current choice] is the right balance."

PROBING FOR DIRECTION:
"I can go deeper on the caching strategy, the
 sharding approach, or the ID generation mechanism.
 Which area is most interesting to you?"

FAILURE THINKING:
"One failure mode I want to call out: if Redis
 goes down entirely, here's what happens and how
 we recover..."

SCALABILITY EXTENSION:
"If our traffic grew 10x to about 1.2 million reads/sec,
 we'd need to [specific change], which would
 require [trade-off]."

Red Flags vs. Green Flags

| 🔴 Red Flag | 🟢 Green Flag |
| --- | --- |
| Only describe what a component does | Explain why it's needed at this specific scale |
| Present one option | Present 2–3 options, pick one, justify the choice |
| Ignore failure scenarios | Proactively call out SPOFs and mitigations |
| Say "it scales horizontally" without detail | Explain how: sharding key, replication factor, quorum |
| Treat every trade-off as equally important | Prioritize trade-offs that match your non-functional requirements |
| Wait for the interviewer to ask about edge cases | Surface edge cases yourself (cache stampede, hot partitions) |

Next Steps

With bottlenecks addressed and trade-offs justified, move to Step 5: Wrap-Up, where you summarize, identify remaining risks, and discuss future improvements.

IMPORTANT

Let the interviewer guide the depth. After completing each deep dive, pause and ask: "I can go deeper here or move to another area — which would be more valuable?" This shows collaborative problem-solving, not monologue delivery.

TIP

The magic phrase: "This is a trade-off between X and Y. Given our requirement for [non-functional requirement], I'd choose X because..." — Use this template for every major decision. It makes your reasoning transparent and easy to follow.

Released under the ISC License.