
📁 System Design: Dropbox / Google Drive

A cloud storage and file synchronization service supporting offline sync, versioning, and real-time collaboration.


Step 1: Requirements

Functional

  • Upload/Download: Users can upload, download, and update files from any device.
  • Sync: Automated file synchronization across multiple devices (Desktop, Mobile, Web).
  • Offline Mode: Users can edit files offline, and edits automatically sync when the device comes back online.
  • File Sharing: Users can share files and folders with other users, managing read/write permissions.
  • Versioning: Support for file version history, allowing users to view changes and roll back to previous versions.

Non-Functional

  • Durability: High durability of data (99.999999999% / 11 9's). Files must never be lost.
  • Availability: High availability for file access (99.99%).
  • Consistency: Strong consistency for file metadata across devices (a user must not see older folder structures after syncing).
  • Low Latency: Real-time sync notifications with minimal data transfer (delta updates).
  • Storage Optimization: Support deduplication to avoid storing identical file chunks multiple times.

Step 2: Capacity Estimation

Traffic & Storage Calculations

text
Assume:
  - 500M registered users, 100M Daily Active Users (DAU).
  - Each active user uploads/syncs an average of 2 files daily.
  - Average file size: 5 MB.

Daily Writes (Uploads):
  100M DAU × 2 files = 200M uploads/day (~2,315 uploads/sec)

Daily Storage Volume:
  200M files/day × 5 MB = 1 PB of raw data per day!

With Deduplication & Compression:
  Assuming an average 30% reduction via chunk deduplication and compression:
  1 PB × 70% = 700 TB/day

Metadata Traffic:
  Assuming metadata updates are small (~500 bytes per file update):
  200M files/day × 500 bytes = 100 GB metadata writes/day
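
The arithmetic above can be reproduced with a short script. This is only a sketch of the same back-of-envelope estimate: the constant names are made up for illustration, and decimal units (1 TB = 1,000,000 MB, 1 GB = 10^9 bytes) are assumed.

```javascript
// Back-of-envelope capacity estimate using the assumptions listed above.
// Constant names are illustrative; decimal units are assumed throughout.
const DAU = 100_000_000;
const FILES_PER_USER_PER_DAY = 2;
const AVG_FILE_SIZE_MB = 5;
const KEPT_PERCENT_AFTER_DEDUP = 70; // i.e. a 30% reduction
const METADATA_BYTES_PER_UPDATE = 500;
const SECONDS_PER_DAY = 86_400;

const uploadsPerDay = DAU * FILES_PER_USER_PER_DAY;
const rawTBPerDay = (uploadsPerDay * AVG_FILE_SIZE_MB) / 1_000_000;

const estimate = {
  uploadsPerDay,
  uploadsPerSecond: Math.round(uploadsPerDay / SECONDS_PER_DAY),
  rawTBPerDay, // 1,000 TB = 1 PB
  storedTBPerDay: (rawTBPerDay * KEPT_PERCENT_AFTER_DEDUP) / 100,
  metadataGBPerDay: (uploadsPerDay * METADATA_BYTES_PER_UPDATE) / 1e9,
};

console.log(estimate);
```

Changing any single assumption (for example the average file size) shows how quickly the daily storage figure moves, which is why the deduplication ratio matters so much at this scale.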

Step 3: Core Problem — Delta Sync & Chunking

Bandwidth and storage are the dominant costs. Re-uploading an entire 100MB file after editing a single line wastes both, so the client should transfer only the parts of the file that actually changed.

text
SOLUTION: Block-Level Chunking & Delta Sync

1. File Chunking:
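   - Split files into smaller, fixed-size chunks (e.g., 4MB chunks).
   - Calculate a cryptographic hash (SHA-256) of each chunk; the hash doubles as the chunk's ID.

2. Delta Sync:
   - When a file is modified, recalculate the hashes of its chunks.
   - Compare the new hashes against the server-side metadata database.
   - Upload only the chunks whose hashes are not already present on the server.
   - Caveat: with fixed-size chunks, inserting or deleting bytes shifts every later chunk boundary, so all chunks after the edit get new hashes.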

3. Deduplication:
   - If User A and User B both upload the same 4MB chunk (e.g., a common library or system file), the block storage only stores it once.
   - Metadata entries for both users simply point to the same chunk ID.

Example: Node.js File Chunking & Hashing

javascript
import fs from "node:fs";
import crypto from "node:crypto";

const CHUNK_SIZE = 4 * 1024 * 1024; // 4MB chunks

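// Reads the file in CHUNK_SIZE pieces and returns one { hash, size, offset }
// entry per chunk. highWaterMark caps each read at CHUNK_SIZE bytes; for
// regular local files, every read except the last is normally a full chunk.
async function analyzeAndChunkFile(filePath) {
  const chunks = [];
  const fileStream = fs.createReadStream(filePath, {
    highWaterMark: CHUNK_SIZE,
  });
  let offset = 0; // byte position of the current chunk within the file

  for await (const data of fileStream) {
    // Hash each chunk independently so unchanged chunks keep the same ID.
    const hash = crypto.createHash("sha256").update(data).digest("hex");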
    chunks.push({
      hash,
      size: data.length,
      offset,
    });
    offset += data.length;
  }

  return chunks;
}

// Example result for a 5 MiB file (one full chunk plus a 1 MiB remainder):
// [
//   { hash: "e3b0c442...", size: 4194304, offset: 0 },
//   { hash: "f9b2c34d...", size: 1048576, offset: 4194304 }
// ]

Step 4: High-Level Architecture

Key Components

  1. Client App: Manages local workspace, monitors changes, processes file chunking, manages index cache, and uploads blocks.
  2. Metadata Service: Handles database operations for directory structures, file versions, sharing permissions, and users.
  3. Sync Service: Receives chunk hashes from clients, determines missing chunks, and orchestrates sync operations.
  4. Block Storage: Scalable object storage (like AWS S3 or custom distributed file system) to host raw chunks.
  5. Notification Service: Pushes change notifications to connected client apps in real time using long polling or WebSockets.
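
As a rough illustration of the Notification Service's fan-out, the sketch below models each connected device as a plain callback. `NotificationHub`, its method names, and the event shape are assumptions made for this example; a real service would hold WebSocket or long-polling connections instead.

```javascript
// Minimal in-memory fan-out (hypothetical). Each "connection" is a callback.
class NotificationHub {
  constructor() {
    this.devicesByUser = new Map(); // userId -> Map(deviceId -> callback)
  }

  connect(userId, deviceId, onEvent) {
    if (!this.devicesByUser.has(userId)) {
      this.devicesByUser.set(userId, new Map());
    }
    this.devicesByUser.get(userId).set(deviceId, onEvent);
  }

  disconnect(userId, deviceId) {
    this.devicesByUser.get(userId)?.delete(deviceId);
  }

  // Notify every connected device of the user except the one that made
  // the change. Returns the number of devices notified.
  publish(userId, event, originDeviceId) {
    const devices = this.devicesByUser.get(userId) ?? new Map();
    let delivered = 0;
    for (const [deviceId, onEvent] of devices) {
      if (deviceId === originDeviceId) continue;
      onEvent(event);
      delivered += 1;
    }
    return delivered;
  }
}

const hub = new NotificationHub();
const received = [];
hub.connect("user-1", "laptop", (e) => received.push(["laptop", e.fileId]));
hub.connect("user-1", "phone", (e) => received.push(["phone", e.fileId]));

// The laptop commits a change; only the phone needs to be told.
hub.publish("user-1", { type: "FILE_UPDATED", fileId: "f-101", version: 6 }, "laptop");
console.log(received); // [ [ 'phone', 'f-101' ] ]
```

Skipping the originating device avoids an echo: the device that made the change already has the new state locally.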

Step 5: Upload & Sync Flow

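Uploading a modified file requires coordination between the client's metadata cache, the server-side Sync Service, and Block Storage. The client sends its chunk hashes to the Sync Service, receives upload URLs for the chunks the server is missing, and uploads only those chunks directly to Block Storage.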

Example: Sync Service Chunk Check API Handler

javascript
import express from "express";
const app = express();
app.use(express.json());

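// Mock metadata DB client; a real one would query the Chunks table.
const metadataDb = {
  checkHashes: async (hashes) => {
    // Returns the hashes that DO NOT exist in the database. In this mock,
    // only the placeholder value below is treated as missing.
    return hashes.filter((h) => h === "missing-chunk-sha256");
  },
};

app.post("/api/sync/check-chunks", async (req, res) => {
  const { fileId, chunks } = req.body ?? {}; // chunks: [{ hash, size }]

  // Reject malformed requests before touching the database.
  if (!fileId || !Array.isArray(chunks)) {
    return res.status(400).json({ error: "fileId and chunks[] are required" });
  }

  try {
    // De-duplicate: a file can contain the same chunk more than once.
    const hashes = [...new Set(chunks.map((c) => c.hash))];
    const missingHashes = await metadataDb.checkHashes(hashes);

    // Placeholder URLs; a real service would issue short-lived signed URLs.
    const uploadUrls = missingHashes.map((hash) => ({
      hash,
      uploadUrl: `https://storage.provider.com/upload-chunk/${hash}?token=temp-signed-token`,
    }));

    res.json({
      status: "SUCCESS",
      missingHashes,
      uploadUrls, // Client will upload only these chunks directly to S3
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});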
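
Once the missing chunks are uploaded, the new file version still has to be committed to metadata. The sketch below shows one way that step could look, under stated assumptions: `commitFileVersion`, the status strings, and the in-memory `storedChunks` and `files` collections are all hypothetical, and a real service would run these checks inside a single database transaction.

```javascript
// In-memory stand-ins for the chunk store and the file metadata (hypothetical).
const storedChunks = new Set(["h1", "h2"]);
const files = new Map([["f-101", { version: 1, chunkHashes: ["h1"] }]]);

function commitFileVersion({ fileId, baseVersion, chunkHashes }) {
  const file = files.get(fileId);
  if (!file) return { status: "NOT_FOUND" };

  // Refuse a manifest that references chunks that were never uploaded.
  const missing = chunkHashes.filter((h) => !storedChunks.has(h));
  if (missing.length > 0) return { status: "MISSING_CHUNKS", missing };

  // A stale base version means another device committed first.
  if (baseVersion !== file.version) {
    return { status: "CONFLICT", serverVersion: file.version };
  }

  const version = file.version + 1;
  files.set(fileId, { version, chunkHashes: [...chunkHashes] });
  return { status: "COMMITTED", version };
}

// 1. Committing before chunk "h3" has been uploaded is rejected.
const early = commitFileVersion({ fileId: "f-101", baseVersion: 1, chunkHashes: ["h1", "h3"] });
console.log(early.status); // MISSING_CHUNKS

// 2. After the upload completes, the same commit succeeds.
storedChunks.add("h3");
const committed = commitFileVersion({ fileId: "f-101", baseVersion: 1, chunkHashes: ["h1", "h3"] });
console.log(committed); // { status: 'COMMITTED', version: 2 }

// 3. A second device still based on version 1 now gets a conflict.
const stale = commitFileVersion({ fileId: "f-101", baseVersion: 1, chunkHashes: ["h1", "h2"] });
console.log(stale); // { status: 'CONFLICT', serverVersion: 2 }
```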

Step 6: Conflict Resolution

Since users can edit files offline, synchronization conflicts occur when two devices modify the same file concurrently.

text
CONFLICT SCENARIO:
  1. File starts at Version 1.
  2. Device A goes offline, updates file (creates Local Version 2a).
  3. Device B updates file online (commits Server Version 2b).
  4. Device A goes online, attempts to commit Version 2a.

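RESOLUTION: Base-Version Check & Conflicted Copies
  - Every file has a version number; each commit records the base version it was edited from, plus a device timestamp.
  - When Device A tries to commit a change whose base version is older than the server's current version: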
    1. Server rejects direct override to prevent data loss.
    2. Server saves Device A's changes as a separate sibling file named:
       "filename (Device A's conflicted copy 2026-06-07).txt"
    3. User is notified to merge changes manually.

Example: Conflict Detection Logic

javascript
function resolveConflict(serverFile, clientCommit) {
  // If client's base version matches server's current version, update is clean
  if (clientCommit.baseVersion === serverFile.currentVersion) {
    return {
      action: "MERGE_CLEAN",
      newVersion: serverFile.currentVersion + 1,
    };
  }

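  // Conflict detected: server has moved ahead since the client pulled.
  // Insert the suffix before the extension so the copy keeps its file type.
  const dateStr = new Date().toISOString().split("T")[0];
  const dot = serverFile.name.lastIndexOf(".");
  const stem = dot > 0 ? serverFile.name.slice(0, dot) : serverFile.name;
  const ext = dot > 0 ? serverFile.name.slice(dot) : "";
  const conflictedName = `${stem} (Conflicted copy ${dateStr})${ext}`;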

  return {
    action: "CREATE_CONFLICT_COPY",
    originalFileId: serverFile.id,
    newName: conflictedName,
    reason: `Client committed from version ${clientCommit.baseVersion}, but server is at version ${serverFile.currentVersion}`,
  };
}

// Usage Example
const serverFile = { id: "f-101", name: "budget.xlsx", currentVersion: 5 };
const clientCommit = { baseVersion: 4, fileId: "f-101" }; // Client last synced at version 4, so its base is stale
console.log(resolveConflict(serverFile, clientCommit));

Step 7: Database Schema Design

Metadata lives in a strongly consistent relational database (such as sharded PostgreSQL or CockroachDB), because updates to files, versions, and chunk mappings need ACID transactions.

Table structures:

Files

| Column | Type | Description |
| --- | --- | --- |
| id | VARCHAR(64) [PK] | Unique identifier of the file |
| name | VARCHAR(255) | Name of the file |
| parent_folder_id | VARCHAR(64) | Parent directory reference |
| owner_id | VARCHAR(64) | Owner user ID |
| version | INT | Current version increment |
| is_deleted | BOOLEAN | Soft delete status |

FileChunks (Mapping chunks to files)

| Column | Type | Description |
| --- | --- | --- |
| id | VARCHAR(64) [PK] | Chunk relation ID |
| file_id | VARCHAR(64) [FK] | Reference to Files table |
| chunk_hash | VARCHAR(64) | SHA-256 hash of chunk |
| chunk_order | INT | Sequence placement (0, 1, 2...) |

Chunks (Global storage tracking for Deduplication)

| Column | Type | Description |
| --- | --- | --- |
| hash | VARCHAR(64) [PK] | Cryptographic signature of binary chunk |
| size | INT | Size in bytes |
| storage_path | VARCHAR(512) | S3 key/link to chunk payload |
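
To show how these tables fit together on download, the sketch below joins FileChunks rows to Chunks rows in plain JavaScript. The sample rows, the `chunks/...` storage paths, and the `buildDownloadPlan` helper are invented for illustration; in practice this would be a SQL join ordered by `chunk_order`.

```javascript
// Sample rows shaped like the FileChunks and Chunks tables (values invented).
const fileChunks = [
  { id: "fc-2", file_id: "f-101", chunk_hash: "bbb", chunk_order: 1 },
  { id: "fc-1", file_id: "f-101", chunk_hash: "aaa", chunk_order: 0 },
  { id: "fc-3", file_id: "f-202", chunk_hash: "aaa", chunk_order: 0 }, // shared chunk
];
const chunks = new Map([
  ["aaa", { size: 4194304, storage_path: "chunks/aaa" }],
  ["bbb", { size: 1048576, storage_path: "chunks/bbb" }],
]);

// Returns the ordered list of chunk payloads needed to rebuild a file.
function buildDownloadPlan(fileId) {
  const parts = fileChunks
    .filter((row) => row.file_id === fileId)
    .sort((a, b) => a.chunk_order - b.chunk_order)
    .map((row) => {
      const chunk = chunks.get(row.chunk_hash);
      if (!chunk) throw new Error(`Missing chunk ${row.chunk_hash}`);
      return { hash: row.chunk_hash, ...chunk };
    });

  const totalSize = parts.reduce((sum, part) => sum + part.size, 0);
  return { fileId, totalSize, parts };
}

const plan = buildDownloadPlan("f-101");
console.log(plan.totalSize); // 5242880
console.log(plan.parts.map((p) => p.storage_path)); // [ 'chunks/aaa', 'chunks/bbb' ]
```

Note that files `f-101` and `f-202` both reference chunk `aaa`, yet the Chunks collection holds it once: that is the deduplication the global table exists to support.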

Step 8: Client Architecture (Sync Engine)

The client-side daemon performs critical work to reduce server load:

text
Client Device
 ├── Watcher Service (detects local OS file changes in directory)
 ├── Index Builder (local SQLite database tracking metadata/hashes of local workspace)
 ├── Chunking Engine (slices files, computes cryptographic digests)
 └── Sync Thread (contacts Sync Service to push updates and downloads remote deltas)

By keeping a local SQLite database (the index), the client knows exactly what state its folder is in without querying the server for every file state comparison.
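
A minimal sketch of that local comparison follows. It assumes the index stores a size and modification time per path and treats any difference as a change; the `diffWorkspace` helper, the field names, and the sample data are hypothetical, and plain `Map` objects stand in for the SQLite index.

```javascript
// Compares the last-synced index with a fresh scan of the workspace.
// Both arguments are Maps of path -> { size, mtimeMs } (assumed shape).
function diffWorkspace(index, scan) {
  const changes = { added: [], modified: [], deleted: [] };

  for (const [path, stat] of scan) {
    const known = index.get(path);
    if (!known) {
      changes.added.push(path);
    } else if (known.size !== stat.size || known.mtimeMs !== stat.mtimeMs) {
      changes.modified.push(path);
    }
  }

  for (const path of index.keys()) {
    if (!scan.has(path)) changes.deleted.push(path);
  }

  return changes;
}

// State recorded at the last successful sync.
const localIndex = new Map([
  ["notes.txt", { size: 120, mtimeMs: 1000 }],
  ["budget.xlsx", { size: 9000, mtimeMs: 2000 }],
  ["old.log", { size: 50, mtimeMs: 500 }],
]);

// State reported by the watcher after some local edits.
const currentScan = new Map([
  ["notes.txt", { size: 120, mtimeMs: 1000 }], // untouched
  ["budget.xlsx", { size: 9100, mtimeMs: 3000 }], // edited
  ["photo.jpg", { size: 204800, mtimeMs: 3500 }], // new
]);

const changes = diffWorkspace(localIndex, currentScan);
console.log(changes);
// { added: [ 'photo.jpg' ], modified: [ 'budget.xlsx' ], deleted: [ 'old.log' ] }
```

Only the paths listed under `added` and `modified` need to go through the Chunking Engine, so untouched files cost no hashing and no network traffic.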


📊 Summary

| Component | Technology |
| --- | --- |
| Object / Block Storage | Amazon S3, Google Cloud Storage, Ceph |
| Metadata Database | CockroachDB, Sharded PostgreSQL |
| Caching Engine | Redis (hashing cache / session tracker) |
| Real-Time Notification | WebSockets / HTTP Long Polling |
| Message Broker | Apache Kafka / RabbitMQ |
| Local Client Database | SQLite (for local change comparisons) |

Key takeaway: Splitting files into hashed chunks and syncing only the chunks that changed keeps uploads small and lets identical data be stored once. The metadata database tracks which chunks make up each file version, while the chunk payloads themselves live in cost-effective object storage.

Released under the ISC License.