🎬 Netflix System Design & Architecture
A deep-dive into how Netflix serves 260M+ subscribers across 190 countries, streaming ~15% of global internet traffic at peak hours.
📐 Table of Contents
- High-Level Architecture Overview
- Client Layer
- API Gateway & Load Balancing
- Microservices Architecture
- Video Encoding Pipeline
- Content Delivery Network (Open Connect)
- Data Storage Strategy
- Recommendation Engine
- Real-Time Streaming Data Pipeline
- Fault Tolerance & Resilience
- Security Architecture
- Capacity & Scale Estimates
- Complete System Flow
- Technology Stack Summary
1. High-Level Architecture Overview
Netflix is a cloud-native platform whose backend runs on AWS as a set of microservices, while video delivery is handled by its own CDN. The system is split into three major zones:
- Client Zone — Apps (TV, mobile, web, game consoles)
- Backend Zone — AWS-hosted microservices + data stores
- CDN Zone — Open Connect Appliances (OCA) deployed at ISPs globally
2. Client Layer
Netflix supports 2000+ device types. Each client handles adaptive streaming differently but follows a common protocol.
Key Responsibilities
| Component | Role |
|---|---|
| Player SDK | Adaptive bitrate (ABR) switching using custom algorithms |
| DASH / HLS | Streaming protocols for video delivery |
| DRM Client | Widevine (Android), FairPlay (Apple), PlayReady (Microsoft) |
| Pre-fetching | Downloads next episode proactively on Wi-Fi |
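
As a minimal sketch of the ABR switching listed in the table, the function below picks the highest rung whose bitrate fits within a fraction of the measured throughput. The ladder values are borrowed from the encoding profile example in section 5. The `safety` headroom factor and the name `pick_rung` are assumptions made for illustration; production players also weigh buffer level and other signals.

```python
# Illustrative bitrate ladder (Mbps), lowest to highest, from the section 5 example.
LADDER = [
    ("240p", 0.235),
    ("360p", 0.560),
    ("480p", 1.050),
    ("720p", 2.800),
    ("1080p", 5.800),
    ("4K HDR", 15.600),
]


def pick_rung(throughput_mbps, safety=0.8):
    """Return the highest (name, bitrate) rung that fits within safety * throughput.

    `safety` is a hypothetical headroom factor. If nothing fits, the lowest
    rung is returned so playback can continue at minimum quality.
    """
    budget = throughput_mbps * safety
    best = LADDER[0]
    for rung in LADDER:
        if rung[1] <= budget:
            best = rung
    return best


if __name__ == "__main__":
    for measured in (25.0, 5.0, 1.0):
        print(measured, "Mbps ->", pick_rung(measured))
```

With these assumptions, a link measured at 5 Mbps leaves a 4 Mbps budget, so the player would settle on the 720p rung rather than 1080p.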
Example: Adaptive Bitrate Logic
User starts stream at 4K (15 Mbps)
└─> Network drops to 5 Mbps
└─> Player detects buffer under-run
└─> Switches to 720p (2.8 Mbps) segments, the highest rung that fits under 5 Mbps
└─> Network recovers
└─> Gradually steps back up to 4K
3. API Gateway & Load Balancing
Netflix open-sourced much of its infrastructure stack.
Netflix OSS Stack
| Tool | Function |
|---|---|
| Zuul | Edge proxy, routing, auth, rate limiting |
| Eureka | Service discovery registry |
| Ribbon | Client-side load balancing |
| Hystrix | Circuit breaker pattern |
| Archaius | Dynamic property management |
| Feign | Declarative HTTP client |
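
To make the split between service discovery and client-side load balancing concrete, here is a toy sketch. It does not use the Eureka or Ribbon APIs. `Registry`, `Instance`, and `choose_instance` are hypothetical names, and the least-connections rule is just one possible balancing policy.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    host: str
    healthy: bool = True
    active_connections: int = 0


@dataclass
class Registry:
    """Toy stand-in for a service registry: service name -> list of instances."""
    services: dict = field(default_factory=dict)

    def register(self, service, instance):
        self.services.setdefault(service, []).append(instance)

    def lookup(self, service):
        # Return a copy so callers cannot mutate the registry's list.
        return list(self.services.get(service, []))


def choose_instance(instances):
    """Client-side balancing: healthy instance with the fewest active connections."""
    healthy = [i for i in instances if i.healthy]
    if not healthy:
        raise RuntimeError("no healthy instances")
    return min(healthy, key=lambda i: i.active_connections)


if __name__ == "__main__":
    registry = Registry()
    registry.register("playback", Instance("10.0.0.1", active_connections=12))
    registry.register("playback", Instance("10.0.0.2", active_connections=3))
    registry.register("playback", Instance("10.0.0.3", healthy=False))
    print(choose_instance(registry.lookup("playback")).host)
```

The point of the design is that the caller, not a central balancer, makes the routing decision from registry data.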
Example Flow — User Clicks "Play"
1. Client sends GET /playback?contentId=tt1234 to Zuul
2. Zuul validates JWT token with Auth Service
3. Zuul looks up Playback Service instances in Eureka
4. Ribbon selects healthy instance (least connections)
5. Playback Service fetches manifest from Content Metadata
6. Returns CDN URLs for video chunks → Client starts streaming
4. Microservices Architecture
Each Netflix service owns its data and is independently deployable.
Decomposition Strategy
Services are split along domain boundaries (domain-driven design). An illustrative layout:
netflix-services/
├── user-service/ # Signup, login, profile CRUD
├── billing-service/ # Subscriptions, payments (Stripe)
├── content-service/ # Titles, metadata, thumbnails
├── recommendation-svc/ # ML-based recommendations
├── search-service/ # Elasticsearch-backed search
├── playback-service/ # Streaming manifest, DRM
├── encoding-service/ # Video transcoding pipeline
└── notification-service/ # Email, push, in-app alerts
5. Video Encoding Pipeline
Netflix encodes every title into ~1,200 format variants covering different devices, resolutions, and codecs.
Encoding Details
| Codec | Use Case | Compression |
|---|---|---|
| H.264 (AVC) | Legacy devices, wide compat | Baseline |
| H.265 (HEVC) | 4K HDR content | 40% better than H.264 |
| VP9 | Android, Chrome | 35% better than H.264 |
| AV1 | New smart TVs, future-proof | 50% better than H.264 |
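
One way to read the Compression column is as an approximate bitrate saving at similar visual quality. The short calculation below applies those percentages to a reference H.264 bitrate. This reading is an assumption; real savings vary by title and encoder settings, and `equivalent_bitrate` is a hypothetical helper.

```python
# Approximate savings versus H.264, taken from the Compression column above.
SAVINGS_VS_H264 = {"H.264": 0.00, "H.265": 0.40, "VP9": 0.35, "AV1": 0.50}


def equivalent_bitrate(h264_mbps, codec):
    """Bitrate (Mbps) a codec would need for quality similar to H.264 at h264_mbps."""
    return h264_mbps * (1.0 - SAVINGS_VS_H264[codec])


if __name__ == "__main__":
    reference = 5.8  # Mbps, an example H.264 bitrate for 1080p
    for codec in SAVINGS_VS_H264:
        print(f"{codec}: {equivalent_bitrate(reference, codec):.2f} Mbps")
```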
Per-Encoding Profiles (example for 1 title)
Resolution Bitrate Codec File Size
-----------------------------------------------
240p 0.235 Mbps H.264 ~100 MB
360p 0.560 Mbps H.264 ~240 MB
480p 1.050 Mbps H.264 ~450 MB
720p 2.800 Mbps H.264 ~1.2 GB
1080p 5.800 Mbps H.265 ~1.5 GB (HEVC savings)
4K HDR 15.600 Mbps AV1 ~3.0 GB (AV1 savings)
-----------------------------------------------
Total for 2hr movie: ~120 variants × avg 800 MB = ~96 GB/title
6. Content Delivery Network (Open Connect)
Netflix built its own CDN, Open Connect, rather than continuing to rely on commercial CDNs such as Akamai, Limelight, and Level 3.
How Steering Works
User clicks Play in New York:
1. DNS query → Netflix Steering Service
2. Netflix checks: which OCA is closest to user's IP?
3. Also checks: what's the OCA's load and cache hit rate?
4. Returns CDN URL pointing to nearest healthy OCA
5. Client fetches video chunks from OCA directly
If OCA Cache Miss:
OCA → IXP cluster → Origin (S3) → fill OCA cache → serve user
Open Connect Appliances (OCAs)
| Tier | Storage hardware | Role | Location |
|---|---|---|---|
| Small | 36TB HDD | Cache popular content | ISP co-location |
| Large | 100TB HDD + SSD | Cache entire catalog | IXP data centers |
| Flash | All-NVMe SSD | Ultra-low latency | Top-tier ISPs |
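
The steering steps described earlier in this section (proximity, load, cache state, health) can be sketched as a ranking function. This is a simplified illustration, not Netflix's steering logic. `Oca`, `steer`, the `distance_km` proximity proxy, the boolean `has_title` cache check, and the `max_load` cutoff are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Oca:
    name: str
    distance_km: float     # hypothetical proxy for network proximity
    load: float            # 0.0 (idle) to 1.0 (saturated)
    healthy: bool = True
    has_title: bool = True  # simplification of "cache hit rate"


def steer(ocas, max_load=0.9):
    """Pick an appliance: healthy and below max_load, cache hits first, then nearest."""
    candidates = [o for o in ocas if o.healthy and o.load < max_load]
    if not candidates:
        return None  # the caller would fall back to an upstream tier or origin
    # Tuples sort element by element: cached copies win, then distance, then load.
    return min(candidates, key=lambda o: (not o.has_title, o.distance_km, o.load))


if __name__ == "__main__":
    fleet = [
        Oca("isp-nyc-1", 5, 0.95),                 # nearest but overloaded
        Oca("isp-nyc-2", 12, 0.40),
        Oca("ixp-nj-1", 40, 0.20),
        Oca("isp-nyc-3", 3, 0.10, healthy=False),  # unhealthy, skipped
    ]
    print(steer(fleet).name)
```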
7. Data Storage Strategy
Netflix uses polyglot persistence — the right database for each use case.
Database Selection Rationale
| Database | Use Case | Why? |
|---|---|---|
| MySQL | Billing, user accounts | ACID, strong consistency needed |
| Cassandra | Viewing history, ratings | Write-heavy, globally distributed, eventual consistency OK |
| Redis | Sessions, rate limit counters | Sub-millisecond reads, TTL support |
| Elasticsearch | Title search, autocomplete | Full-text search, faceting |
| S3 | Video files, thumbnails | Virtually unlimited storage, cheap, durable |
| Kafka | Event streaming | High-throughput, durable log |
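
The "rate limit counters" use case in the Redis row usually comes down to a counter that expires on its own. The sketch below imitates that pattern (increment a key, let it expire after a window) with an in-memory dictionary so it runs without a Redis server. `FixedWindowLimiter` and its parameters are hypothetical names, and fixed-window counting is only one of several rate limiting strategies.

```python
import time


class FixedWindowLimiter:
    """In-memory stand-in for a Redis counter with a TTL (INCR + EXPIRE pattern)."""

    def __init__(self, limit, window_s, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock          # injectable so the behaviour can be tested
        self._counters = {}         # key -> (count, expires_at)

    def allow(self, key):
        """Count one request for `key`; return True while within the limit."""
        now = self.clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            # Counter expired or never existed: start a fresh window.
            count, expires_at = 0, now + self.window_s
        count += 1
        self._counters[key] = (count, expires_at)
        return count <= self.limit


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit=2, window_s=60)
    print([limiter.allow("usr_123") for _ in range(3)])
```

In a real deployment the counter would live in Redis so that every gateway instance sees the same count.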
Cassandra Data Model Example
Table: viewing_history
Partition Key: user_id
Clustering Key: watched_at (DESC)
user_id | watched_at | content_id | progress_pct | device
------------+----------------------+------------+--------------+--------
usr_123 | 2024-01-15 20:30:00 | tt0944947 | 45% | TV
usr_123 | 2024-01-14 21:00:00 | tt0903747 | 100% | Mobile
usr_123 | 2024-01-13 19:45:00 | tt1254207 | 23% | Web
8. Recommendation Engine
Netflix's recommendation system drives ~80% of content watched.
Recommendation Algorithms
| Algorithm | Description | Example |
|---|---|---|
| Collaborative Filtering | Users with similar taste | "Users who watched Stranger Things also liked Dark" |
| Content-Based | Match content attributes | Genre, actors, director, tone |
| Matrix Factorization | Latent factor decomposition | Finds hidden preference patterns |
| Contextual Bandits | Real-time exploration | Tests new content on similar users |
| Thumbnail Personalization | Shows best image per user | Action fan sees action shot vs. romance fan sees love scene |
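
To illustrate the collaborative filtering row, the sketch below scores unseen titles by how often they were watched alongside titles the user has already seen. It is a deliberately small co-occurrence model with made-up viewing histories, not the production recommender. `co_watch_counts` and `recommend` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations


def co_watch_counts(histories):
    """Count how often each pair of titles appears in the same user's history."""
    counts = defaultdict(lambda: defaultdict(int))
    for titles in histories.values():
        for a, b in combinations(sorted(set(titles)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, histories, top_n=3):
    """Rank titles the user has not seen by co-occurrence with titles they have."""
    counts = co_watch_counts(histories)
    seen = set(histories[user])
    scores = defaultdict(int)
    for title in seen:
        for other, n in counts[title].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]


if __name__ == "__main__":
    histories = {  # fabricated sample data for illustration only
        "alice": ["Stranger Things", "Dark"],
        "bob": ["Stranger Things", "Dark", "The Crown"],
        "carol": ["Stranger Things"],
        "dave": ["Stranger Things", "Dark"],
    }
    print(recommend("carol", histories))
```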
Thumbnail A/B Test Example
Title: "The Crown"
├── Thumbnail A: Queen Elizabeth II portrait → CTR: 3.2%
├── Thumbnail B: Scene of drama/conflict → CTR: 4.8%
└── Thumbnail C: Family sitting together → CTR: 2.9%
Winner: Thumbnail B → served to all users (until next test)
Personalization: History + period drama fans get Thumbnail A
9. Real-Time Streaming Data Pipeline
Netflix processes ~700 billion events per day using a unified streaming platform called Keystone.
Event Types & Volume
| Event Type | Approx Volume/Day | Use Case |
|---|---|---|
| play_start | 50M | Playback analytics |
| buffer_event | 200M | CDN quality monitoring |
| view_segment | 400M | Progress tracking |
| search_query | 20M | Search improvement |
| thumbnail_impression | 5B | A/B test measurement |
| payment_event | 2M | Billing reconciliation |
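
Much of what a stream processor does with events like these is windowed aggregation. As a rough sketch, and not Keystone's or Flink's API, the function below groups events into fixed (tumbling) windows and counts them by type. The `(timestamp, event_type)` tuple shape and the name `tumbling_counts` are assumptions.

```python
from collections import Counter


def tumbling_counts(events, window_s=60):
    """Group (timestamp_s, event_type) pairs into fixed windows and count by type.

    Returns {window_start_s: Counter({event_type: count})}.
    """
    windows = {}
    for ts, event_type in events:
        # Align each timestamp to the start of its window.
        start = int(ts // window_s) * window_s
        windows.setdefault(start, Counter())[event_type] += 1
    return windows


if __name__ == "__main__":
    sample = [
        (0, "play_start"),
        (10, "buffer_event"),
        (59, "buffer_event"),
        (60, "play_start"),
        (125, "buffer_event"),
    ]
    for start, counts in sorted(tumbling_counts(sample).items()):
        print(start, dict(counts))
```

A spike of `buffer_event` counts in one window is the kind of signal that CDN quality monitoring would alert on.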
10. Fault Tolerance & Resilience
Netflix pioneered chaos engineering and built tools, such as Chaos Monkey, that deliberately inject failures in production to validate resilience.
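
The circuit breaker pattern used for this kind of resilience can be sketched as a small state machine. The thresholds mirror the example that follows: more than 50% failures within 10 s opens the circuit, and a probe is allowed after 30 s. This is a single-threaded illustration and not Hystrix's implementation. The class, its parameters, and the `min_calls` guard are assumptions.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_s=10.0, open_s=30.0,
                 min_calls=4, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls   # hypothetical guard: do not trip on one call
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.results = deque()       # (timestamp, succeeded) inside the window

    def call(self, primary, fallback):
        now = self.clock()
        if self.state == "OPEN":
            if now - self.opened_at < self.open_s:
                return fallback()    # short-circuit without calling primary
            self.state = "HALF_OPEN"  # open period elapsed: let one probe through
        try:
            value = primary()
        except Exception:
            self._on_failure(now)
            return fallback()
        self._on_success(now)
        return value

    def _record(self, now, ok):
        self.results.append((now, ok))
        while self.results and now - self.results[0][0] > self.window_s:
            self.results.popleft()   # drop results older than the window

    def _on_success(self, now):
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"    # probe succeeded: resume normal traffic
            self.results.clear()
        self._record(now, True)

    def _on_failure(self, now):
        if self.state == "HALF_OPEN":
            self._trip(now)          # probe failed: reopen for another open_s
            return
        self._record(now, False)
        failures = sum(1 for _, ok in self.results if not ok)
        if (len(self.results) >= self.min_calls
                and failures / len(self.results) > self.failure_ratio):
            self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self.opened_at = now
        self.results.clear()
```

A caller would wrap the remote call, for example `breaker.call(fetch_personalized, popular_in_region)`, where both function names are placeholders for the primary request and its fallback.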
Circuit Breaker Example
Recommendation Service calls ML Model Service:
CLOSED state (normal):
Request → ML Model → Success → Return recommendations
OPEN state (ML Model failing >50% in 10s):
Request → Circuit OPEN → Return cached/fallback recommendations
(Fallback: "Popular in your region" list)
HALF-OPEN state (after 30s):
1 probe request → ML Model →
Success? → CLOSE circuit
Fail? → OPEN again for another 30s
Fallback Strategy Hierarchy
1. Primary: Personalized ML recommendations
↓ (if fails)
2. Secondary: Pre-computed popular list per genre
↓ (if fails)
3. Tertiary: Global trending (cached in Redis, 5 min TTL)
↓ (if fails)
4. Static: Hardcoded editorial picks (always works)
11. Security Architecture
12. Capacity & Scale Estimates
Traffic Assumptions
Users: 260 million subscribers
Daily active: ~130 million (~50% of subscribers)
Peak concurrent: ~20 million streams at once
Average bitrate: 5 Mbps (mix of 1080p / 4K)
Bandwidth Calculation
Peak bandwidth = 20M streams × 5 Mbps
= 100,000 Gbps
= 100 Tbps (at peak)
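
The same arithmetic as a small script, which makes it easy to vary the assumptions. The function name `peak_bandwidth_tbps` is hypothetical and decimal units are assumed (1 Tbps = 1,000,000 Mbps).

```python
def peak_bandwidth_tbps(concurrent_streams, avg_mbps):
    """Peak egress in Tbps for a number of concurrent streams at an average bitrate."""
    total_mbps = concurrent_streams * avg_mbps
    return total_mbps / 1_000_000  # Mbps -> Tbps (decimal units)


if __name__ == "__main__":
    print(peak_bandwidth_tbps(20_000_000, 5))  # the assumptions used above
    print(peak_bandwidth_tbps(20_000_000, 3))  # a lower average bitrate
```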
Netflix = ~15% of global internet traffic at peak
Storage Calculation
Catalog size: ~36,000 titles
Encoding variants: ~1,200 per title
Average variant: ~1.5 GB
Total storage: 36,000 × 1,200 × 1.5 GB
= ~65 Petabytes (video only)
User data (history, profiles): ~5 PB
Event logs (Kafka + S3): ~20 PB/year
Request Rates
API Gateway (Zuul): ~2 million requests/sec (peak)
Kafka events: ~8 million events/sec
Redis operations: ~50 million ops/sec
Cassandra reads: ~20 million reads/sec
13. Complete System Flow: "User Presses Play"
14. Technology Stack Summary
| Category | Technologies |
|---|---|
| Cloud | AWS (EC2, S3, RDS, Lambda, ECS) |
| API Gateway | Zuul 2.0 |
| Service Discovery | Eureka |
| Load Balancing | Ribbon, AWS ALB |
| Circuit Breaker | Hystrix, Resilience4j |
| Messaging | Apache Kafka (Keystone) |
| Stream Processing | Apache Flink, Spark Streaming |
| Batch Processing | Apache Spark, Hive |
| Databases | Cassandra, MySQL, CockroachDB |
| Cache | Redis, Memcached, EVCache |
| Search | Elasticsearch |
| CDN | Open Connect (proprietary) |
| Video Codecs | H.264, H.265, VP9, AV1 |
| DRM | Widevine, FairPlay, PlayReady |
| Monitoring | Atlas, Spectator, Mantis |
| Chaos Eng | Simian Army (Chaos Monkey, etc.) |
| Languages | Java, Python, JavaScript, Go |
| ML/AI | TensorFlow, PyTorch (recommendations) |
| Container | Docker, Titus (Netflix's own orchestrator) |
Sources: Netflix Tech Blog (netflixtechblog.com), AWS re:Invent sessions, Netflix OSS GitHub repositories.
