GitHub System Design: Hosting the World's Code
GitHub is more than just "Git hosting." It is a massive collaboration platform that manages petabytes of Git data while providing a high-availability interface for Pull Requests, Issues, and CI/CD.
1. Requirements
Functional
- Repository Hosting: Create, clone, push, and pull repositories (SSH/HTTPS).
- Collaboration: Pull Requests (PRs), Issues, and code reviews.
- Search: Search code across millions of repositories.
- Webhooks: Notify external services about repository events.
- Actions: Run CI/CD pipelines.
Non-Functional
- High Durability: Code must never be lost (Replication).
- High Availability: Developers need 24/7 access to their repositories.
- Scalability: Handle billions of Git objects and millions of concurrent users.
- Strong Consistency: Updates to a repository's refs must be applied in a strict order, so that every replica agrees on what each branch points to.
2. High-Level Architecture
GitHub's architecture is divided into the Git Storage Layer and the Application/Metadata Layer.
3. Technical Deep Dives
A. Repository Storage: Spokes
GitHub uses a custom proprietary system called Spokes to manage Git repositories.
- Replication: Every repository is replicated across multiple "Spokes" nodes.
- Consistency: Each repository is kept on three replicas. A push is applied to the replicas synchronously and is confirmed to the client only after a majority (at least two of the three) have accepted it.
- Routing: The Git proxy determines which nodes hold the repository for a given request, using either consistent hashing or a lookup table that maps each repository to its replicas.
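The routing and replication rules above can be sketched in a few lines. This is a minimal illustration, not Spokes itself: `Replica`, `RoutingTable`, `pushWithQuorum`, and `fakeReplica` are hypothetical names, routing is assumed to be a plain lookup table, and rollback of a failed write is left out.

```typescript
// Illustrative sketch only: every name here is hypothetical and is not
// part of Spokes or any GitHub API.
interface Replica {
  name: string;
  /** Returns true if the replica durably accepted the update. */
  apply(update: string): boolean;
}

// Lookup-table routing: repository name -> the replicas that hold it.
type RoutingTable = Map<string, Replica[]>;

/** Sends an update to every replica of a repository; confirms only on a majority. */
function pushWithQuorum(routes: RoutingTable, repo: string, update: string): boolean {
  const replicas = routes.get(repo);
  if (!replicas) throw new Error(`unknown repository: ${repo}`);

  let acks = 0;
  for (const replica of replicas) {
    try {
      if (replica.apply(update)) acks++;
    } catch {
      // An unreachable replica contributes no acknowledgement.
    }
  }
  // Majority of the replica set, e.g. 2 of 3. Undoing the write on the
  // replicas that did accept it is omitted from this sketch.
  return acks >= Math.floor(replicas.length / 2) + 1;
}

/** Test double: a replica that accepts or rejects every update. */
function fakeReplica(name: string, healthy: boolean): Replica {
  return { name, apply: () => healthy };
}
```

With three replicas, a push still succeeds when one of them is down, but it is rejected when only one replica can accept it.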
B. Git Internals in Storage
GitHub doesn't just store files; it stores Git Objects:
- Blobs: The content of a single file.
- Trees: A directory listing, mapping names to the hashes of blobs and of nested trees.
- Commits: A pointer to a tree and to zero or more parent commits, providing history.
Each object is stored compressed and is addressed by the hash of its content. Git's packfiles save further disk space by storing similar objects, such as successive versions of the same file, as deltas against one another.
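To make the object model concrete, the sketch below hashes trees and commits the way Git's loose-object format does: SHA-1 over a `<type> <size>\0` header followed by the body. The function names (`hashObject`, `hashTree`, `hashCommit`) and the `TreeEntry` type are illustrative. Entry sorting is simplified (Git compares raw bytes and treats directory names as if they ended in `/`), and compression and packfiles are left out.

```typescript
import * as crypto from "crypto";

/** Hashes a Git object: SHA-1 over "<type> <byte length>\0" followed by the body. */
function hashObject(type: "blob" | "tree" | "commit", body: Buffer): string {
  const header = Buffer.from(`${type} ${body.length}\0`);
  return crypto.createHash("sha1").update(Buffer.concat([header, body])).digest("hex");
}

// mode is e.g. "100644" for a regular file or "40000" for a subdirectory.
type TreeEntry = { mode: string; name: string; hash: string };

/** Serializes each entry as "<mode> <name>\0" plus the 20 raw hash bytes. */
function hashTree(entries: TreeEntry[]): string {
  // Simplified ordering: plain string comparison on the entry name.
  const sorted = [...entries].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  const parts = sorted.map((e) =>
    Buffer.concat([Buffer.from(`${e.mode} ${e.name}\0`), Buffer.from(e.hash, "hex")])
  );
  return hashObject("tree", Buffer.concat(parts));
}

/** Builds a commit body that points at one tree and any number of parents. */
function hashCommit(tree: string, parents: string[], who: string, message: string): string {
  const lines = [
    `tree ${tree}`,
    ...parents.map((p) => `parent ${p}`),
    `author ${who}`,
    `committer ${who}`,
    "",
    message,
  ];
  return hashObject("commit", Buffer.from(lines.join("\n") + "\n"));
}
```

Because a commit's hash covers its tree hash and its parents' hashes, changing any file or any earlier commit changes every hash after it.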
C. Metadata Management (Vitess)
While code is in Git, collaboration data is stored in MySQL (scaled via Vitess).
- PRs/Issues: These are relational. A PR links to a repo, a user, and a list of comments.
- Permissions: Complex RBAC (Role-Based Access Control) for organizations is managed here.
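As a rough illustration of why this data is relational, the sketch below models a pull request and a permission check in memory. All names (`PullRequest`, `Grant`, `ROLE_RANK`, `canPerform`) and the ranked-role model are assumptions made for the example, not GitHub's schema; in production these would be rows in MySQL tables, not arrays.

```typescript
// Hypothetical data model for illustration; not GitHub's actual schema.
type Role = "read" | "triage" | "write" | "maintain" | "admin";

// Higher rank implies every permission of the lower ranks.
const ROLE_RANK: Record<Role, number> = { read: 0, triage: 1, write: 2, maintain: 3, admin: 4 };

// A PR row references other rows by ID, as a foreign key would.
interface PullRequest {
  id: number;
  repoId: number;
  authorId: number;
  commentIds: number[];
}

// One row per (user, repository) pair, standing in for a grants table.
interface Grant {
  userId: number;
  repoId: number;
  role: Role;
}

/** Returns true if the user holds at least the required role on the repository. */
function canPerform(grants: Grant[], userId: number, repoId: number, required: Role): boolean {
  const grant = grants.find((g) => g.userId === userId && g.repoId === repoId);
  return grant !== undefined && ROLE_RANK[grant.role] >= ROLE_RANK[required];
}

/** Merging a PR is assumed here to require the "write" role on its repository. */
function canMerge(grants: Grant[], userId: number, pr: PullRequest): boolean {
  return canPerform(grants, userId, pr.repoId, "write");
}
```

Every check joins a user, a repository, and a role, which is the kind of lookup a relational database with indexes on those columns handles well.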
4. Implementation Example: Simplified Git Object Store
This TypeScript example shows how Git identifies and stores content using SHA-1 hashing.
```typescript
import * as crypto from "crypto";

class GitObjectStore {
  private store: Map<string, Buffer> = new Map();

  /**
   * Stores file content as a Git 'blob' and returns its object ID.
   */
  storeBlob(content: string): string {
    const body = Buffer.from(content, "utf8");
    // Git prefixes every object with "<type> <size in bytes>\0"
    const header = Buffer.from(`blob ${body.length}\0`);
    const fullContent = Buffer.concat([header, body]);

    // 1. Calculate the SHA-1 hash (the object ID)
    const hash = crypto.createHash("sha1").update(fullContent).digest("hex");

    // 2. Store the content indexed by its hash
    this.store.set(hash, fullContent);
    return hash;
  }

  /**
   * Retrieves content by its hash, or null if the object is unknown.
   */
  getBlob(hash: string): string | null {
    const data = this.store.get(hash);
    if (!data) return null;

    // The header ends at the first null byte; everything after it is the content
    const nullByteIndex = data.indexOf(0);
    return data.subarray(nullByteIndex + 1).toString("utf8");
  }
}

// Example usage:
const repo = new GitObjectStore();
const hash = repo.storeBlob("console.log('Hello GitHub!');");
console.log(`Blob stored with hash: ${hash}`);
console.log(`Content retrieved: ${repo.getBlob(hash)}`);
```
5. Summary: GitHub Architecture Trade-offs
| Component | Choice | Rationale |
|---|---|---|
| Storage | Repository Sharding | No single server can hold every repository; repositories must be spread across many file servers. |
| Consistency | Strong Consistency for Git | We cannot allow "eventual consistency" where a push might disappear. |
| Availability | Multi-Replica (Spokes) | With three replicas per repository, code stays accessible when a hardware node fails. |
| Search | Content Indexing | Reading raw Git blobs for every search is too slow; separate indexing is required. |
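
The search row is easiest to see with a toy inverted index: each token maps to the set of files that contain it, so a query becomes a map lookup instead of a scan over every blob. This is a minimal sketch; `CodeIndex` and `tokenize` are hypothetical names, and a production code search engine would use a far more elaborate index.

```typescript
/** Splits source text into lowercase identifier-like tokens. */
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
}

// Hypothetical in-memory index for illustration only.
class CodeIndex {
  // token -> paths of the files that contain it
  private postings = new Map<string, Set<string>>();

  /** Indexes one file once, at write time, so searches never reread it. */
  add(path: string, content: string): void {
    for (const token of tokenize(content)) {
      let paths = this.postings.get(token);
      if (!paths) {
        paths = new Set<string>();
        this.postings.set(token, paths);
      }
      paths.add(path);
    }
  }

  /** Returns the sorted paths of all files containing the term. */
  search(term: string): string[] {
    const paths = this.postings.get(term.toLowerCase());
    return paths ? Array.from(paths).sort() : [];
  }
}
```

The cost moves from query time to indexing time: each push triggers reindexing of the changed files, and searches stay fast regardless of how many repositories exist.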
