How I Learned System Design

Based on the popular system design roadmap and article "How I Learned System Design", this documentation breaks down the fundamental components, scalability strategies, and architectural decisions required to build large-scale applications.

1. Start with the Basics

Before designing scalable systems, you must have a strong grasp of foundational networking and internet concepts.

Network Protocols

  • TCP/IP: The fundamental suite of communication protocols used to interconnect network devices on the internet. IP handles addressing and routing of packets, while TCP adds reliable, ordered delivery on top through acknowledgments and retransmission.
  • UDP: A simpler, connectionless protocol that does not guarantee delivery or ordering but has lower overhead and latency than TCP. Ideal for live streaming or gaming, where speed is prioritized over perfect reliability.
  • HTTP/HTTPS: The foundation of data communication for the World Wide Web. HTTPS adds a crucial layer of security by encrypting traffic with TLS (the successor to SSL).

DNS (Domain Name System)

DNS acts as the phonebook of the internet. It translates human-readable domain names (like google.com) into machine-readable IP addresses (like 192.0.2.1) that computers use to route requests correctly.
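As a rough illustration of that lookup, the sketch below asks the operating system's resolver for the IPv4 addresses behind a name using Python's standard `socket.getaddrinfo`. The helper name `resolve_ipv4` is made up for this example, and the demo uses an IP literal so it runs without network access.

```python
from __future__ import annotations

import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated IPv4 addresses for a hostname.

    `resolve_ipv4` is a hypothetical helper name used only in this sketch.
    """
    # getaddrinfo delegates to the operating system's resolver, which checks
    # local configuration (hosts file, caches) before querying DNS servers.
    results = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each result is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (address, port) pair.
    return sorted({sockaddr[0] for *_, sockaddr in results})


if __name__ == "__main__":
    # An IP literal needs no lookup, so this line works offline.
    print(resolve_ipv4("127.0.0.1"))
    # Resolving a real domain name would require network access.
```

In practice, application code rarely calls the resolver directly; HTTP clients and database drivers perform this step internally before opening a connection.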

CDNs (Content Delivery Networks)

A CDN is a geographically distributed network of proxy servers and their data centers. By caching static content (HTML pages, JavaScript files, stylesheets, images, and videos) physically closer to end-users, CDNs drastically reduce latency, decrease origin server load, and improve overall page load times.

2. Scaling Your Application

As your application grows in users and data throughput, you need to handle more traffic effectively. There are two primary dimensions of scaling:

Vertical Scaling (Scale Up)

Vertical scaling involves adding more computational power (CPU, RAM, Storage) to an existing server machine.

  • Example: Upgrading your main PostgreSQL database server from 16GB of RAM to 128GB of RAM.

Pros: Easy to implement, no complex application code changes required.
Cons: Has a hard hardware limit (you can only buy a server so big), becomes extremely expensive at higher tiers, and leaves you with a single point of failure.

Horizontal Scaling (Scale Out)

Horizontal scaling involves adding more individual machines (servers) to your resource pool to distribute the overall load.

  • Example: Adding 5 new Node.js instances to your backend fleet so they can split the incoming web traffic.

Pros: Virtually limitless scaling capability, improved fault tolerance, and high availability.
Cons: Requires more complex backend architecture, intelligent software load balancing, and complex distributed data management.

3. Load Balancing & Proxies

When utilizing horizontal scaling, you need software mechanisms to route and manage traffic efficiently across your multiple servers.

Load Balancer

A Load Balancer distributes incoming network traffic across a group of backend servers (a server farm). This helps prevent any single server from being overwhelmed, which improves throughput, lowers response times, and reduces the risk of overload-related crashes.

  • Example: An AWS Elastic Load Balancer (ELB) receiving 10,000 requests per second and distributing them evenly, 2,500 requests per second each, across four EC2 instances using a Round Robin algorithm.
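The Round Robin idea in the example above can be sketched in a few lines. This is a toy, in-process model, not how ELB is implemented; the class name `RoundRobinBalancer`, the `simulate` helper, and the server labels are assumptions made for illustration.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backend servers in a fixed rotating order (hypothetical class)."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one backend server is required")
        # cycle() repeats the server list forever: a, b, c, d, a, b, ...
        self._rotation = cycle(servers)

    def next_server(self):
        """Return the server that should receive the next request."""
        return next(self._rotation)


def simulate(request_count, servers):
    """Route `request_count` requests and count how many each server received."""
    balancer = RoundRobinBalancer(servers)
    return Counter(balancer.next_server() for _ in range(request_count))


if __name__ == "__main__":
    # Illustrative server labels, not real instance IDs.
    counts = simulate(10_000, ["ec2-a", "ec2-b", "ec2-c", "ec2-d"])
    for server, handled in sorted(counts.items()):
        print(server, handled)
```

Round Robin ignores how busy each server is. Real load balancers often offer alternatives such as least-connections or weighted routing for uneven workloads.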

Proxies

  • Forward Proxy: Sits in front of client machines and routes their outgoing requests to the internet. It is often used for regional client caching, internet anonymity, or institutional access control.
  • Reverse Proxy: Sits in front of backend web servers and routes incoming client requests to those servers. It can provide load balancing, enhanced security, response compression, and SSL/TLS termination.

4. Caching Strategies

Caching involves storing frequently accessed data or computation results in fast, temporary storage (like RAM) to reduce expensive database queries and significantly improve read speeds. Popular in-memory caching solutions include Redis and Memcached.

Common Caching Strategies:

  1. Cache-Aside: The application checks the cache first; if a cache miss occurs, it queries the primary database, updates the cache with the retrieved data, and then returns it to the user.
    • Example: Storing frequently accessed user profiles in Redis. When a profile is requested, check Redis first. If it isn't there, fetch it from PostgreSQL, return it, and simultaneously store it in Redis for the next visitor.
  2. Read-Through: The application queries the cache directly; if a miss occurs, the cache component itself queries the database, updates its own entry, and returns the data.
  3. Write-Through: Each write goes to the cache and the primary database synchronously, and completes only after both are updated. Good for keeping the cache consistent with the database, but inherently adds write latency.
  4. Write-Back (Write-Behind): Data is written to the cache immediately and asynchronously persisted to the database at a later time. Offers the highest write performance but risks permanent data loss if the cache crashes before pending writes reach the database.
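Two of these strategies, Cache-Aside and Write-Back, can be sketched with plain dictionaries standing in for Redis and PostgreSQL. The class names, the `db_reads` counter, and the sample profile data are assumptions for illustration; a real deployment would also need expiry, invalidation, and error handling.

```python
class CacheAsideStore:
    """Cache-aside reads: the application checks the cache, then the database."""

    def __init__(self, database):
        self.database = database  # dict standing in for PostgreSQL
        self.cache = {}           # dict standing in for Redis
        self.db_reads = 0         # counts how often the database is queried

    def get(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit: no database query
        self.db_reads += 1
        value = self.database.get(key)  # cache miss: query the database
        if value is not None:
            self.cache[key] = value     # populate the cache for next time
        return value


class WriteBackStore:
    """Write-back writes: update the cache now, persist to the database later."""

    def __init__(self, database):
        self.database = database
        self.cache = {}
        self.dirty = set()  # keys written to the cache but not yet persisted

    def put(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        """Persist pending writes; anything still dirty is lost if the cache dies."""
        for key in self.dirty:
            self.database[key] = self.cache[key]
        self.dirty.clear()


if __name__ == "__main__":
    profiles = CacheAsideStore({"user:1": {"name": "Ada"}})  # sample data
    profiles.get("user:1")
    profiles.get("user:1")
    print("database reads:", profiles.db_reads)

    backing_db = {}
    writes = WriteBackStore(backing_db)
    writes.put("counter", 1)
    print("before flush:", backing_db)
    writes.flush()
    print("after flush:", backing_db)
```

The window between `put` and `flush` is exactly where Write-Back trades durability for speed.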

5. Dive into Databases

Choosing the correct database paradigm is one of the most critical system design decisions.

SQL (Relational) Databases

  • Examples: MySQL, PostgreSQL.
  • Characteristics: Data is structured tightly into tables with strict schemas. They adhere strictly to ACID properties (Atomicity, Consistency, Isolation, Durability), making them ideal for applications requiring robust multi-step transactions (e.g., banking/financial systems or CRMs).

NoSQL (Non-Relational) Databases

  • Examples: MongoDB (Document), Cassandra (Wide-Column), Redis (Key-Value), Neo4j (Graph).
  • Characteristics: Best suited for unstructured or semi-structured data, fast iteration on evolving schemas, and large-scale applications that need high availability and straightforward horizontal scaling.

Database Scaling Techniques

  • Replication: Continuously copying data across multiple servers (master-slave or master-master topologies) to improve read availability, reduce latency by serving users from nearby replicas, and provide a failover mechanism.
  • Sharding (Data Partitioning): Splitting a single massive database into smaller, faster, and more easily managed parts called "shards", which are then distributed across multiple independent servers.
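One common way to decide which shard holds a record is to hash its key. The sketch below assumes simple hash-modulo routing with a hypothetical `shard_for` helper and made-up key names; it is one possible scheme, not the only one.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a record key to a shard index in the range [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # A stable hash (unlike Python's built-in hash(), which is randomized
    # per process for strings) keeps routing consistent across restarts.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    # Illustrative keys; each one always routes to the same shard.
    for user_key in ("user:1", "user:2", "user:3"):
        print(user_key, "-> shard", shard_for(user_key, 4))
```

A known drawback of modulo routing is that changing the number of shards remaps most keys. Consistent hashing is a frequently used alternative when shards are added or removed often.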

6. Understand Distributed Systems

The CAP Theorem

The CAP Theorem states that a distributed data store cannot simultaneously provide more than two of the following three guarantees:

  1. Consistency: Every read receives the most recent write or an error.
  2. Availability: Every request receives a successful (non-error) response, without the guarantee that it contains the most recent write.
  3. Partition Tolerance: The system continues to operate even when an arbitrary number of messages between nodes are dropped or delayed by the network.

Note: Network partitions cannot be ruled out in a real-world distributed system spanning multiple servers, so Partition Tolerance is effectively mandatory. When a partition occurs, engineers must consciously trade off between strict Consistency (CP) and high Availability (AP).
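The CP versus AP choice can be made concrete with a toy replica that is cut off from its primary. Everything here (the `Replica` class, the `Unavailable` exception, the `partitioned` flag) is a simplified model invented for illustration, not a real replication protocol.

```python
class Unavailable(Exception):
    """Raised by a CP replica that cannot confirm it has the latest write."""


class Replica:
    """A toy replica that behaves as either CP or AP during a partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True while cut off from the primary

    def replicate(self, value):
        """Apply an update from the primary; updates are lost while partitioned."""
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise Unavailable("cannot confirm the latest write during a partition")
        # AP (or a healthy network): answer with whatever is stored locally.
        return self.value


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        replica = Replica(mode)
        replica.replicate("v1")
        replica.partitioned = True
        replica.replicate("v2")  # never arrives because of the partition
        try:
            print(mode, "read ->", replica.read())
        except Unavailable as error:
            print(mode, "read -> error:", error)
```

The AP replica stays responsive but serves the stale value, while the CP replica sacrifices availability to avoid doing so.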

Message Queues

Tools like Apache Kafka and RabbitMQ facilitate asynchronous communication between decoupled microservices. They act as buffers that smooth out heavy traffic spikes, coordinate event-driven architectures, and help ensure reliable message delivery.

  • Example: When a user uploads a high-definition video to YouTube, the web server immediately replies "Upload successful" to the user and places a "video encoding task" message into a Kafka queue. A separate fleet of heavily-powered worker servers eventually pulls messages from this queue to encode the video in the background without forcing the user to leave their browser tab open waiting.
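The upload flow above follows a producer/consumer pattern. The sketch below uses Python's in-process `queue.Queue` and threads as a stand-in for Kafka and a worker fleet; the function names and the `":encoded"` result format are assumptions for illustration, and a real broker would add persistence and delivery guarantees.

```python
import queue
import threading


def encode_worker(tasks, results):
    """Pull video IDs off the queue and 'encode' them until told to stop."""
    while True:
        video_id = tasks.get()
        if video_id is None:  # sentinel value: no more work for this worker
            tasks.task_done()
            break
        results.append(f"{video_id}:encoded")  # placeholder for real encoding
        tasks.task_done()


def handle_upload(tasks, video_id):
    """Web-server side: enqueue the task and reply without waiting for encoding."""
    tasks.put(video_id)
    return "Upload successful"


def run_demo(video_ids, worker_count=2):
    tasks = queue.Queue()
    results = []
    workers = [
        threading.Thread(target=encode_worker, args=(tasks, results))
        for _ in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    replies = [handle_upload(tasks, video_id) for video_id in video_ids]
    tasks.join()          # wait until every queued video has been processed
    for _ in workers:
        tasks.put(None)   # one sentinel per worker so each one exits
    for worker in workers:
        worker.join()
    return replies, sorted(results)


if __name__ == "__main__":
    replies, results = run_demo(["video-1", "video-2", "video-3"])
    print(replies)
    print(results)
```

The key property is that `handle_upload` returns as soon as the message is queued, so the user-facing response time does not depend on how long encoding takes.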

Microservices Architecture

This software architecture breaks a large monolithic application into smaller, loosely coupled services, each organized around a specific business capability. Each microservice can be developed, tested, deployed, and scaled independently, often by a separate team.


  • Books:
    • Designing Data-Intensive Applications by Martin Kleppmann (Widely considered the ultimate "bible" of system design).
    • System Design Interview by Alex Xu.
  • Courses:
    • Grokking the System Design Interview.

Released under the ISC License.