Case Study: URL Shortener
Difficulty: Beginner | Category: Read-Heavy | Similar Systems: Pastebin
Designing a URL shortening service like TinyURL or Bitly is the most common beginner system design interview question. It teaches the fundamentals of hashing, database indexing, and caching.
1. Requirements Clarification
Functional Requirements
- Shorten: Given a long URL, return a shorter, unique alias.
- Redirect: When users click the short link, redirect them to the original long URL.
- Custom Links: Users can optionally pick a custom short link.
- Expiration: Links should expire after a standard timespan (e.g., 5 years) or a user-defined time.
Non-Functional Requirements
- High Availability: If the service goes down, links break across the internet.
- Low Latency: URL redirection should happen in real-time with minimal delay.
- Read-Heavy: Redirection requests will heavily outnumber URL creation requests (typically a 100:1 read-to-write ratio).
2. Back-of-the-Envelope Estimation
Assume 100 million new URLs are generated per month. Read-to-write ratio is 100:1.
- Write Operations: 100 M / (30 days _ 24 hrs _ 3600 sec) ≈ 40 URLs/second.
- Read Operations: 40 * 100 = 4,000 URLs/second.
- Storage: If we’ve stored URLs for 10 years: 100 M _ 12 months _ 10 years = 12 Billion records.
- Storage Size: Assuming each record is 500 bytes: 12 Billion * 500 bytes = 6 TB.
TIP
6 TB is easily manageable. A modern relational database or a NoSQL key-value store can handle this scale with basic sharding or a single robust cluster.
3. API Design
We need two primary REST endpoints:
1. Create Short URL:
POST /api/v1/data/shortenPayload: { "original_url": "https://www.example.com/very/long/path", "custom_alias": "my-link" }Response (201 Created): { "short_url": "https://tin.y/aB3x9" }
2. Redirect Short URL:
GET /aB3x9Response: HTTP 301 (Moved Permanently) or 302 (Found) with Location: https://www.example.com/very/long/path
301 vs 302 Redirect:
301 Moved Permanently: The browser caches the redirect. Reduces a load on our servers, but we can't track analytics (click rates).302 Found: The browser contacts our server every time. Better for analytics but increases server load.
4. The Shortening Algorithm (Base62)
To create a short URL, we need a unique hash. We use Base62 encoding (A-Z, a-z, 0–9 = 62 characters).
If we use a 7-character short URL, how many unique URLs can we generate? (More than enough for our 12 Billion record estimates).
Approach: Unique ID Generator + Base62
Instead of hashing the long URL (which causes collisions), we generate a unique globally incrementing Integer ID (using a database auto-increment or Twitter Snowflake) and convert that Base10 ID to Base62.
- ID 10000 → Base62 →
2Bi - ID 1000000 → Base62 →
4C92
5. High-Level Architecture
6. Database Design & Deep Dive
Database Choice: NoSQL (Key-Value)
Since the primary access pattern is simply looking up a long URL by a short hash (Key-Value lookup), a NoSQL database like DynamoDB or Cassandra is perfect. It provides high availability and easy horizontal scaling. Relational databases (PostgreSQL/MySQL) are also fine for 6 TB, but NoSQL is heavily optimized for fast single-key reads.
Table Schema (URL_Mapping):
short_url_hash(Primary Key, String, e.g., "aB3x9")original_url(String)created_at(Timestamp)expiration_date(Timestamp)
Caching Strategy (Critical)
Since the system is read-heavy (4,000 req/sec), hitting the database for every redirect is inefficient. We place a Redis or Memcached layer between the API servers and the Database.
- Policy: LRU (Least Recently Used). Since 20% of links usually generate 80% of the traffic (Pareto Principle), caching the most active links heavily reduces a DB load.
Bottleneck Mitigation
- ID Generator SPOF: If the ID generator goes down, we can't create new URLs. We can mitigate this using a distributed ID generation system like Twitter Snowflake, or by using ZooKeeper to allocate ranges of IDs to different API servers.
- Malicious Links: Integrate a rate limiter to prevent abuse and a background worker to scan
original_urlagainst databases of known malware/phishing sites.
