Google Search System Design: Indexing the Entire Web
Google Search is perhaps the most ambitious engineering feat in history. Its goal is to crawl trillions of web pages, index them, and retrieve the most relevant results in under 200 milliseconds.
1. Requirements
Functional
- Crawling: Discover and fetch billions of web pages.
- Indexing: Store processed page content for fast retrieval.
- Search: Parse user queries and return relevant documents.
- Ranking: Order results by quality and relevance (e.g., PageRank).
- Metadata: Handle images, snippets, and page metadata.
Non-Functional
- Scalability: Must handle the ever-growing World Wide Web.
- Freshness: New content should be discoverable quickly (News).
- Low Latency: Search results must appear almost instantly.
- High Availability: The search bar must always work.
2. High-Level Architecture
Google Search uses a "Pull-based" model for discovery and a "Retrieval" model for search.
3. Technical Deep Dives
A. Web Crawling: Googlebot
The crawler's job is to traverse the web.
- Discovery: It extracts links from known pages to find new ones.
- Freshness: It crawls popular sites (like news outlets) more frequently than static personal blogs.
- Politeness: It respects
robots.txtand ensures it doesn't overwhelm a site's server.
B. The Inverted Index: The Speed Secret
You cannot "search" the web by scanning trillions of documents. Instead, we use an Inverted Index.
- Normal Index: Document ID -> List of Words.
- Inverted Index: Word -> List of Document IDs. When you search for "System Design," the engine instantly retrieves the list of DocIDs associated with those words and calculates their intersection.
C. Ranking with PageRank
While modern Google uses thousands of signals (including AI), the foundation is PageRank:
- Concept: A page is important if many other important pages link to it.
- Mathematical Model: It treats the web as a graph where links are votes. The score for a page is distributed among the pages it links to.
4. Implementation Example: Basic Inverted Index
This TypeScript example shows how a simplified Inverted Index works for querying.
typescript
type DocID = number;
class InvertedIndex {
// Map of Word -> Array of DocIDs
private index: Map<string, DocID[]> = new Map();
private documents: Map<DocID, string> = new Map();
/**
* Adds a document to the index
*/
addDocument(id: DocID, content: string) {
this.documents.set(id, content);
const words = content.toLowerCase().split(/\W+/);
for (const word of new Set(words)) {
// Use Set to avoid duplicate DocIDs for one word
if (!this.index.has(word)) {
this.index.set(word, []);
}
this.index.get(word)?.push(id);
}
}
/**
* Searches for documents containing ALL provided words
*/
search(query: string): string[] {
const queryWords = query.toLowerCase().split(/\W+/);
// Get lists of DocIDs for each word
const resultsLists = queryWords.map((word) => this.index.get(word) || []);
if (resultsLists.length === 0) return [];
// Simple Intersection (Find DocIDs that appear in ALL lists)
const intersection = resultsLists.reduce((a, b) =>
a.filter((id) => b.includes(id))
);
return intersection.map((id) => this.documents.get(id) || "");
}
}
// Example Usage:
const engine = new InvertedIndex();
engine.addDocument(1, "System Design of YouTube");
engine.addDocument(2, "System Design of Google Search");
console.log(engine.search("Google Search")); // Output: ["System Design of Google Search"]5. Summary: Search Trade-offs
| Component | Choice | Rationale |
|---|---|---|
| Data Structure | Inverted Index | Essential for sub-second retrieval at web scale. |
| Consistency | Eventual Consistency | New pages don't need to be in the index instantly (minutes/hours is fine). |
| Storage | Distributed NoSQL | Relational DBs cannot handle the petabyte-scale web graph. |
| Efficiency | Document Sharding | The index is split across thousands of servers to parallelize query processing. |
