Deep Dive & Detailed Design (Step 4 Deep Dive)

Time Budget: 15–20 minutes — The heart of the interview. This is where senior candidates separate themselves.

The Deep Dive phase is where you zoom into specific components of your High-Level Design, address bottlenecks, justify trade-offs, and show that you understand why systems fail at scale — not just what they look like on a whiteboard.

Why It Matters

The interviewer is now evaluating:

Technical depth — Do you know how Redis eviction actually works? What happens during a Cassandra partition split?
Trade-off reasoning — Can you articulate why you chose one approach over another?
Failure thinking — Can you identify single points of failure and mitigate them?
Seniority signals — Do you bring up edge cases the interviewer didn't even ask about?

The Deep Dive Flow

The 5 Deep Dive Lenses

For any component, examine it through these five lenses:

Worked Example: URL Shortener — Full Deep Dive

Bottleneck #1 — Database Scaling (Write Path)

The Problem: At 1,200 writes/sec, a single database node becomes a bottleneck. What's the sharding strategy?

Trade-off discussion:

| Sharding Key | Pro | Con |
| --- | --- | --- |
| `short_code` (hash) | Even distribution, no hotspots | Cannot range-scan by date |
| `user_id` | Queries per-user are fast | Power users create hotspots |
| `created_at` (time) | Easy time-range queries | Recent shard is always hottest |

Decision: Shard by short_code using consistent hashing. The primary access pattern is GET by short_code, so this is the natural partition key. We don't need time-range queries in the redirect path.
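
To make the decision concrete, here is a minimal sketch of a consistent hash ring keyed on `short_code`. It is an illustration under assumptions, not how any particular database does it: the shard names, the virtual-node count, and the `ConsistentHashRing` class are all hypothetical, and a store such as Cassandra handles token assignment internally through its own partitioner.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to shards via a hash ring with virtual nodes (illustrative)."""

    def __init__(self, shards, vnodes=100):
        # Each shard is placed on the ring `vnodes` times to smooth the distribution.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        # MD5 is used only for its even spread, not for security.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def shard_for(self, short_code):
        # Owner is the first ring position clockwise from the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._hashes, self._hash(short_code))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical shard names and short code.
ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
owner = ring.shard_for("aZ3kP9x")
```

The property worth pointing out in an interview is that adding a shard only moves the keys that the new shard takes over; keys owned by the other shards stay where they are.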

Bottleneck #2 — Cache Deep Dive (Read Path)

The Problem: At 116,000 reads/sec, every cache miss falls through to the database. How exactly does the cache layer work?

Cache sizing math revisited:

Target hit ratio: 95%

Hot URLs (20% of 100M active = 20M URLs):
  20,000,000 × 500 bytes = 10 GB

Redis config:
  # 20% buffer above the 10 GB working set
  maxmemory 12gb
  maxmemory-policy allkeys-lru

Expected:
  95% of 116,000 reads/sec = 110,200 served from cache
  5% of 116,000 reads/sec  = 5,800 reach Cassandra
  → Cassandra only sees ~5,800 reads/sec instead of 116,000 ✅
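
The sizing math above can be reproduced with a few lines of arithmetic, which is handy when the interviewer changes one of the inputs. This is a sketch only: the function name `cache_plan` and its parameters are made up for illustration, percentages are passed as integers to keep the arithmetic exact, and 1 GB is taken as 10^9 bytes to match the figures above.

```python
def cache_plan(active_urls, hot_percent, bytes_per_entry,
               reads_per_sec, hit_percent, buffer_percent=20):
    """Recomputes the cache sizing estimate from its inputs (illustrative)."""
    hot_urls = active_urls * hot_percent // 100
    working_set_bytes = hot_urls * bytes_per_entry
    # Leave headroom above the working set for the Redis maxmemory setting.
    maxmemory_bytes = working_set_bytes * (100 + buffer_percent) // 100
    cache_reads = reads_per_sec * hit_percent // 100
    return {
        "hot_urls": hot_urls,
        "working_set_gb": working_set_bytes / 10**9,
        "maxmemory_gb": maxmemory_bytes / 10**9,
        "cache_reads_per_sec": cache_reads,
        "db_reads_per_sec": reads_per_sec - cache_reads,
    }


plan = cache_plan(
    active_urls=100_000_000,
    hot_percent=20,
    bytes_per_entry=500,
    reads_per_sec=116_000,
    hit_percent=95,
)
```

With the inputs from this section, `plan` holds a 10 GB working set, a 12 GB `maxmemory`, and 5,800 reads/sec reaching the database.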

Cache Stampede Problem & Solution:
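
A cache stampede happens when a hot key expires or is evicted and many concurrent requests miss at the same moment, so they all query the database for the same row. One common mitigation is request coalescing (sometimes called single-flight): only one caller per key loads the value while the rest wait for it. The sketch below shows the idea inside a single process. The `SingleFlightCache` class and `slow_loader` function are hypothetical, the cache has no TTL, and a multi-server deployment would need a shared lock instead (for example, one built on Redis `SET` with the `NX` option).

```python
import threading
import time


class SingleFlightCache:
    """In-process cache where only one caller per key hits the backing store."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical function: key -> value
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()   # protects the per-key lock table

    def get(self, key):
        if key in self._values:
            return self._values[key]     # fast path: cache hit
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded the value while we waited.
            if key not in self._values:
                self._values[key] = self._loader(key)
            return self._values[key]


calls = []


def slow_loader(key):
    # Stand-in for a database read; records each call so we can count them.
    calls.append(key)
    time.sleep(0.05)
    return f"https://example.com/{key}"


cache = SingleFlightCache(slow_loader)
threads = [threading.Thread(target=cache.get, args=("abc123",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Twenty concurrent misses for the same key result in a single call to the loader; the other nineteen callers read the value filled in by the first.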

Bottleneck #3 — ID Generation at Scale

The Problem: Multiple app servers generating IDs simultaneously — how do we avoid collisions?
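
The trade-off table later in this section picks Snowflake + Base62. The following is a minimal sketch of that approach, assuming the commonly used 41/10/12 bit split (timestamp, worker, sequence). The custom epoch, the alphabet ordering, and the class and function names are assumptions made for illustration, and in practice each app server would need a unique `worker_id` assigned by some coordination mechanism that is not shown here.

```python
import threading
import time

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
EPOCH_MS = 1_577_836_800_000  # assumed custom epoch: 2020-01-01T00:00:00Z


def base62_encode(n):
    """Encodes a non-negative integer using the BASE62 alphabet."""
    if n == 0:
        return BASE62[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(BASE62[rem])
    return "".join(reversed(chars))


class SnowflakeGenerator:
    """IDs laid out as: 41 bits timestamp | 10 bits worker | 12 bits sequence."""

    def __init__(self, worker_id):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            # Never go backwards, even if the wall clock does.
            now = max(int(time.time() * 1000), self._last_ms)
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self._worker_id << 12) | self._sequence


gen = SnowflakeGenerator(worker_id=7)
short_code = base62_encode(gen.next_id())
```

Because the worker ID is embedded in every value, two servers never produce the same ID and no database lookup is needed to check uniqueness. One cost to mention: a 64-bit ID encodes to as many as 11 Base62 characters, which is longer than a truncated hash would be.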

Bottleneck #4 — High Availability & Failure Modes

The Problem: What are the single points of failure (SPOF) in our architecture?

Cassandra Replication Deep Dive:

Replication Factor (RF) = 3
  → Every row stored on 3 different nodes

Write Consistency: QUORUM (2 of 3 nodes must ACK)
  → If 1 node dies, writes still succeed ✅

Read Consistency: QUORUM (2 of 3 nodes must respond)
  → If 1 node dies, reads still succeed ✅

NODE FAILURE SCENARIO:
  - Node 2 goes down
  - Writes → Node 1 + Node 3 ACK → QUORUM met → SUCCESS
  - Node 2 recovers → Hinted Handoff replays missed writes
  - System self-heals with no operator intervention for short outages;
    an outage longer than the hint window (3 hours by default) needs a repair ✅
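
The quorum numbers above follow from simple arithmetic, sketched here so the same reasoning can be applied to other replication factors. The function names are illustrative and not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Number of replicas that must respond: a strict majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Replicas that can be down while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)


def reads_see_latest_write(replication_factor, read_acks, write_acks):
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_acks + write_acks > replication_factor


rf = 3
q = quorum(rf)                                   # 2 of 3 replicas
down_ok = tolerated_failures(rf)                 # 1 node may fail
strong = reads_see_latest_write(rf, q, q)        # QUORUM reads + QUORUM writes
weak = reads_see_latest_write(rf, 1, 1)          # ONE reads + ONE writes
```

With RF = 3, QUORUM on both reads and writes gives 2 + 2 > 3, so any read overlaps the most recent successful write on at least one replica. With consistency level ONE on both sides the overlap is not guaranteed, which is the eventual-consistency case.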

Bottleneck #5 — URL Expiration at Scale

The Problem: How do we efficiently expire and clean up 100M URLs/day without hammering the database?

Best approach: Use Cassandra's native TTL as the primary expiry mechanism, and add a weekly cleanup Lambda for tombstone compaction hygiene. TTL expiry is O(1) per row: the database manages expiry itself, with no full-table scans.
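
As a rough sketch of what TTL-based expiry looks like from the application side, the helper below converts an expiry timestamp into a TTL in seconds and builds a CQL `INSERT ... USING TTL` statement. The table name `urls`, its columns, and the helper names are assumptions for illustration, and the `%s` placeholders are shown only as one possible parameter style; nothing here is executed against a real cluster.

```python
from datetime import datetime, timedelta, timezone


def ttl_seconds(expires_at, now=None):
    """Seconds from `now` until `expires_at`; rejects expiries in the past."""
    now = now or datetime.now(timezone.utc)
    remaining = int((expires_at - now).total_seconds())
    if remaining <= 0:
        raise ValueError("expiry must be in the future")
    return remaining


def build_insert(short_code, long_url, expires_at, now=None):
    """Returns a CQL statement and its parameters (hypothetical schema)."""
    cql = (
        "INSERT INTO urls (short_code, long_url) "
        "VALUES (%s, %s) USING TTL %s"
    )
    return cql, (short_code, long_url, ttl_seconds(expires_at, now))


# Example: a link created at a fixed instant that expires 30 days later.
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
statement, params = build_insert(
    "aZ3kP9x",
    "https://example.com/some/long/path",
    expires_at=created + timedelta(days=30),
    now=created,
)
```

The row carries its own expiry, so no background job has to find and delete it. Expired rows still turn into tombstones that compaction removes later, which is why the cleanup hygiene mentioned above remains relevant.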

Master Trade-Off Table

This is the trade-off decision log that demonstrates seniority:

NOTE

Alternate production variant: The redirect type (302) and consistency level (QUORUM) selected below assume a requirement for strict server-side analytics and no dirty reads. This diverges from the baseline 5-step story, which allows eventual consistency (ONE) and treats analytics as out of scope (allowing 301 redirects).

| Decision | Option A | Option B | We Chose | Reason |
| --- | --- | --- | --- | --- |
| Redirect type | 301 Permanent | 302 Temporary | 302 | Need server-side analytics tracking |
| Database | MySQL (SQL) | Cassandra (NoSQL) | Cassandra | 182 TB scale, key-value access pattern, built-in sharding |
| Sharding key | `user_id` | `short_code` | `short_code` | Primary access is by short code, avoids hotspots |
| Consistency level | Strong (QUORUM reads and writes) | Eventual (ONE) | QUORUM | Dirty reads on redirects are unacceptable |
| Cache eviction | LFU | LRU | LRU | Recency is a better proxy for redirect popularity |
| ID generation | Hash + truncate | Snowflake + Base62 | Snowflake | No collision risk, no DB lookup for uniqueness check |
| Expiry mechanism | Cron scan | Cassandra TTL | Cassandra TTL | Native O(1) expiry, no full-table scans |

Failure Scenario Walkthrough

Practice narrating what happens when things go wrong:

Deep Dive Conversation Patterns

Use these phrases to drive the interview like a senior engineer:

SURFACING TRADE-OFFS:
"I chose [X] over [Y] here because at our scale,
 [specific reason tied to our estimates]."

ACKNOWLEDGING LIMITATIONS:
"This design has a trade-off: [limitation].
 We could address it with [alternative], but that
 adds complexity. Given our requirements, I think
 [current choice] is the right balance."

PROBING FOR DIRECTION:
"I can go deeper on the caching strategy, the
 sharding approach, or the ID generation mechanism.
 Which area is most interesting to you?"

FAILURE THINKING:
"One failure mode I want to call out: if Redis
 goes down entirely, here's what happens and how
 we recover..."

SCALABILITY EXTENSION:
"If our traffic grew 10x to roughly 1.2 million reads/sec,
 we'd need to [specific change], which would
 require [trade-off]."

Red Flags vs. Green Flags

| 🔴 Red Flag | 🟢 Green Flag |
| --- | --- |
| Only describe what a component does | Explain why it's needed at this specific scale |
| Present one option | Present 2–3 options, pick one, justify the choice |
| Ignore failure scenarios | Proactively call out SPOFs and mitigations |
| Say "it scales horizontally" without detail | Explain how: sharding key, replication factor, quorum |
| Treat every trade-off as equally important | Prioritize trade-offs that match your non-functional requirements |
| Wait for the interviewer to ask about edge cases | Surface edge cases yourself (cache stampede, hot partitions) |

Next Steps

With bottlenecks addressed and trade-offs justified, move to Step 5: Wrap-Up → where you summarize, identify remaining risks, and discuss future improvements.

IMPORTANT

Let the interviewer guide the depth. After completing each deep dive, pause and ask: "I can go deeper here or move to another area — which would be more valuable?" This shows collaborative problem-solving, not monologue delivery.

TIP

The magic phrase: "This is a trade-off between X and Y. Given our requirement for [non-functional requirement], I'd choose X because..." — Use this template for every major decision. It makes your reasoning transparent and easy to follow.
