Skip to content

Google Search System Design: Indexing the Entire Web

Google Search is perhaps the most ambitious engineering feat in history. Its goal is to crawl trillions of web pages, index them, and retrieve the most relevant results in under 200 milliseconds.


1. Requirements

Functional

  • Crawling: Discover and fetch billions of web pages.
  • Indexing: Store processed page content for fast retrieval.
  • Search: Parse user queries and return relevant documents.
  • Ranking: Order results by quality and relevance (e.g., PageRank).
  • Metadata: Handle images, snippets, and page metadata.

Non-Functional

  • Scalability: Must handle the ever-growing World Wide Web.
  • Freshness: New content should be discoverable quickly (News).
  • Low Latency: Search results must appear almost instantly.
  • High Availability: The search bar must always work.

2. High-Level Architecture

Google Search uses a "Pull-based" model for discovery and a "Retrieval" model for search.


3. Technical Deep Dives

A. Web Crawling: Googlebot

The crawler's job is to traverse the web.

  • Discovery: It extracts links from known pages to find new ones.
  • Freshness: It crawls popular sites (like news outlets) more frequently than static personal blogs.
  • Politeness: It respects robots.txt and ensures it doesn't overwhelm a site's server.

B. The Inverted Index: The Speed Secret

You cannot "search" the web by scanning trillions of documents. Instead, we use an Inverted Index.

  • Normal Index: Document ID -> List of Words.
  • Inverted Index: Word -> List of Document IDs. When you search for "System Design," the engine instantly retrieves the list of DocIDs associated with those words and calculates their intersection.

C. Ranking with PageRank

While modern Google uses thousands of signals (including AI), the foundation is PageRank:

  • Concept: A page is important if many other important pages link to it.
  • Mathematical Model: It treats the web as a graph where links are votes. The score for a page is distributed among the pages it links to.

4. Implementation Example: Basic Inverted Index

This TypeScript example shows how a simplified Inverted Index works for querying.

typescript
type DocID = number;

class InvertedIndex {
  // Map of Word -> Array of DocIDs
  private index: Map<string, DocID[]> = new Map();
  private documents: Map<DocID, string> = new Map();

  /**
   * Adds a document to the index
   */
  addDocument(id: DocID, content: string) {
    this.documents.set(id, content);
    const words = content.toLowerCase().split(/\W+/);

    for (const word of new Set(words)) {
      // Use Set to avoid duplicate DocIDs for one word
      if (!this.index.has(word)) {
        this.index.set(word, []);
      }
      this.index.get(word)?.push(id);
    }
  }

  /**
   * Searches for documents containing ALL provided words
   */
  search(query: string): string[] {
    const queryWords = query.toLowerCase().split(/\W+/);

    // Get lists of DocIDs for each word
    const resultsLists = queryWords.map((word) => this.index.get(word) || []);

    if (resultsLists.length === 0) return [];

    // Simple Intersection (Find DocIDs that appear in ALL lists)
    const intersection = resultsLists.reduce((a, b) =>
      a.filter((id) => b.includes(id))
    );

    return intersection.map((id) => this.documents.get(id) || "");
  }
}

// Example Usage:
const engine = new InvertedIndex();
engine.addDocument(1, "System Design of YouTube");
engine.addDocument(2, "System Design of Google Search");

console.log(engine.search("Google Search")); // Output: ["System Design of Google Search"]

5. Summary: Search Trade-offs

ComponentChoiceRationale
Data StructureInverted IndexEssential for sub-second retrieval at web scale.
ConsistencyEventual ConsistencyNew pages don't need to be in the index instantly (minutes/hours is fine).
StorageDistributed NoSQLRelational DBs cannot handle the petabyte-scale web graph.
EfficiencyDocument ShardingThe index is split across thousands of servers to parallelize query processing.

Released under the ISC License.