
What Happens When Servers Crash?

In a monolithic world, a server crash is a catastrophe. In distributed systems, it's just Tuesday.

Designing for failure is the core of modern system design. This guide explains how we detect, isolate, and recover from server crashes at scale.


1. Anatomy of a Server Crash

A "crash" can be caused by many factors:

  • OOM (Out of Memory): The process consumes more RAM than available.
  • Hardware Failure: A physical disk dies or a CPU overheats.
  • Network Partition: The server is still running, but it can't reach its peers, so the rest of the system treats it as crashed.
  • Software Bug: A segmentation fault or an unhandled exception.

The Impact

Without a strategy, a single crash causes:

  1. Service Downtime: Users get 502/504 errors.
  2. Data Loss: In-flight requests are lost.
  3. Cascading Failure: The remaining servers get overwhelmed by the shifted load and crash in turn. Clients that all retry at the same moment (a Thundering Herd) make this worse.

2. Detection: The Health Check

How does the system know a server is dead? We use Health Checks.

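A Load Balancer (LB) or Service Mesh periodically "pings" a health endpoint on each server. A server that fails several consecutive checks is marked unhealthy and taken out of rotation until it starts passing again.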
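
The bookkeeping behind a health check can be sketched as follows. This is a minimal illustration, not how any particular load balancer is implemented: the `HealthTracker` class, its thresholds, and the addresses are assumptions made up for this example. The network probe itself is left out, and only its result (`ok` or not) is recorded.

```javascript
// Tracks consecutive health-check results per server.
// Class name, thresholds, and addresses are illustrative assumptions.
class HealthTracker {
  constructor({ unhealthyAfter = 3, healthyAfter = 2 } = {}) {
    this.unhealthyAfter = unhealthyAfter; // consecutive failures before removal
    this.healthyAfter = healthyAfter;     // consecutive successes before re-adding
    this.servers = new Map();             // address -> { healthy, failures, successes }
  }

  register(address) {
    this.servers.set(address, { healthy: true, failures: 0, successes: 0 });
  }

  // Record the outcome of one probe (e.g. whether GET /health returned 200).
  record(address, ok) {
    const s = this.servers.get(address);
    if (!s) throw new Error(`unknown server: ${address}`);
    if (ok) {
      s.failures = 0;
      s.successes += 1;
      if (!s.healthy && s.successes >= this.healthyAfter) s.healthy = true;
    } else {
      s.successes = 0;
      s.failures += 1;
      if (s.healthy && s.failures >= this.unhealthyAfter) s.healthy = false;
    }
    return s.healthy;
  }

  // Addresses that should currently receive traffic.
  healthyServers() {
    return [...this.servers].filter(([, s]) => s.healthy).map(([addr]) => addr);
  }
}

const tracker = new HealthTracker();
tracker.register("10.0.0.1:3000");
tracker.register("10.0.0.2:3000");

// One failed probe is not enough to remove a server...
tracker.record("10.0.0.1:3000", false);
console.log(tracker.healthyServers()); // both servers still listed

// ...but three in a row are.
tracker.record("10.0.0.1:3000", false);
tracker.record("10.0.0.1:3000", false);
console.log(tracker.healthyServers()); // only 10.0.0.2:3000 remains
```

Requiring several consecutive failures trades detection speed for stability: a single slow response does not eject a healthy server, but a real crash is noticed only after `unhealthyAfter` probe intervals.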


3. High Availability Architecture

To survive crashes, we use Redundancy and Failover.

Pattern A: Active-Active (Standard)

Multiple servers handle traffic simultaneously. If one crashes, the others pick up the slack.
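
To make "pick up the slack" concrete, here is a toy round-robin pool that skips servers marked as down. The `RoundRobinPool` class and the server names are hypothetical. A real load balancer does far more, but the routing idea is the same.

```javascript
// A toy round-robin balancer over an active-active pool.
// Class and server names are illustrative assumptions.
class RoundRobinPool {
  constructor(addresses) {
    this.servers = addresses.map((address) => ({ address, healthy: true }));
    this.next = 0; // index of the next server to try
  }

  markDown(address) { this.setHealth(address, false); }
  markUp(address) { this.setHealth(address, true); }

  setHealth(address, healthy) {
    const server = this.servers.find((s) => s.address === address);
    if (server) server.healthy = healthy;
  }

  // Return the next healthy server, skipping any that are marked down.
  pick() {
    for (let i = 0; i < this.servers.length; i++) {
      const server = this.servers[this.next];
      this.next = (this.next + 1) % this.servers.length;
      if (server.healthy) return server.address;
    }
    throw new Error("no healthy servers available");
  }
}

const pool = new RoundRobinPool(["app-1", "app-2", "app-3"]);
console.log(pool.pick(), pool.pick(), pool.pick()); // app-1 app-2 app-3

pool.markDown("app-2"); // simulate a crash detected by health checks
console.log(pool.pick(), pool.pick(), pool.pick(), pool.pick()); // app-1 app-3 app-1 app-3
```

Note the capacity implication: with N servers, losing one means each survivor handles N/(N-1) of its previous load, so a pool of three must be able to run at 150% of normal per-server traffic.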

Pattern B: Statelessness

This is the most important rule. If a server is stateless (it keeps no user sessions or data in its own memory or on its own disk), it can be replaced instantly, because nothing is lost when it dies.

  • Bad: Storing user login sessions in Server RAM.
  • Good: Storing sessions in a shared Redis cache.
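
The difference between the two bullets can be shown with a small simulation. Everything here is an illustrative assumption: `SharedSessionStore` is a `Map`-based stand-in for a shared cache such as Redis (a real service would use a Redis client instead), and the server classes are hypothetical.

```javascript
// Stand-in for a shared store such as Redis. A real deployment would use a
// Redis client here; this Map-based class is only an illustrative assumption.
class SharedSessionStore {
  constructor() { this.data = new Map(); }
  set(sessionId, session) { this.data.set(sessionId, session); }
  get(sessionId) { return this.data.get(sessionId); }
}

// Bad: sessions live in this server's own memory and die with the process.
class StatefulServer {
  constructor() { this.sessions = new Map(); }
  login(sessionId, user) { this.sessions.set(sessionId, { user }); }
  whoAmI(sessionId) { return this.sessions.get(sessionId)?.user ?? null; }
}

// Good: the server holds no session data, so any instance can serve any user.
class StatelessServer {
  constructor(store) { this.store = store; }
  login(sessionId, user) { this.store.set(sessionId, { user }); }
  whoAmI(sessionId) { return this.store.get(sessionId)?.user ?? null; }
}

// Stateful: the user logs in on server A, A crashes, replacement B knows nothing.
const statefulA = new StatefulServer();
statefulA.login("sess-1", "alice");
const statefulB = new StatefulServer(); // replacement after the crash
console.log(statefulB.whoAmI("sess-1")); // null: the user is logged out

// Stateless: the same crash is invisible to the user.
const store = new SharedSessionStore();
const statelessA = new StatelessServer(store);
statelessA.login("sess-1", "alice");
const statelessB = new StatelessServer(store); // replacement after the crash
console.log(statelessB.whoAmI("sess-1")); // alice
```

The shared store now becomes the component that must be made highly available, but that is one well-understood piece of infrastructure rather than state scattered across every application server.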

4. Code Examples

A. The Server Side: Health Check Endpoint

Every service should expose a simple endpoint that returns 200 OK when the system is healthy and a 5xx status (typically 503 Service Unavailable) when it is not.

javascript
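// Express.js example
const express = require("express");
const app = express();

// Placeholder dependency check. checkDB is a hypothetical helper, not part of
// Express; replace it with a cheap real query (for example "SELECT 1").
async function checkDB() {
  return true;
}

app.get("/health", async (req, res) => {
  try {
    // Check DB connections, memory usage, etc.
    const isDBConnected = await checkDB();

    if (isDBConnected) {
      res.status(200).json({ status: "UP" });
    } else {
      // 503 tells the load balancer to stop routing traffic here.
      res.status(503).json({ status: "DOWN" });
    }
  } catch (err) {
    // A failing check must not crash the handler; report DOWN instead.
    res.status(503).json({ status: "DOWN" });
  }
});

app.listen(3000);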

B. The Client Side: Retry with Backoff

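If a server crashes while processing a request, the client (or an internal service) should retry, but carefully: wait longer before each attempt, give up after a few tries, and only retry requests that are safe to repeat (such as a GET).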

typescript
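import axios, { AxiosResponse } from "axios";

async function fetchWithRetry(url: string, retries = 3, delay = 1000): Promise<AxiosResponse> {
  try {
    return await axios.get(url);
  } catch (error) {
    // A crashed server often sends no response at all (connection refused or
    // reset), so treat a missing response like a 5xx. Never retry 4xx errors.
    const retryable =
      axios.isAxiosError(error) &&
      (error.response === undefined || error.response.status >= 500);

    if (retries > 0 && retryable) {
      console.log(`Server error. Retrying in ${delay}ms...`);
      await new Promise((res) => setTimeout(res, delay));
      return fetchWithRetry(url, retries - 1, delay * 2); // Exponential backoff
    }
    throw error;
  }
}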
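
The retry helper above doubles the delay on every attempt, but all clients that failed at the same moment will still retry at the same moments. A common refinement is to randomize the delay ("jitter"). The sketch below uses the "full jitter" variant. The `backoffDelay` function, its parameter names, and its default values are assumptions chosen for this example.

```javascript
// Full-jitter backoff: pick a random delay in [0, min(capMs, baseMs * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Each run prints different delays; only the upper bound grows exponentially.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait up to ${Math.min(30000, 1000 * 2 ** attempt)}ms, chose ${backoffDelay(attempt)}ms`);
}
```

Spreading retries across the whole window keeps a recovering server from being hit by every client at once. The cap prevents the delay from growing without bound.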

5. Summary: The Resiliency Checklist

When designing for crashes, always ask:

  • [ ] Can I detect it? (Do I have health checks?)
  • [ ] Can I isolate it? (Will the Load Balancer stop sending traffic?)
  • [ ] Can I recover it? (Is the service stateless so it can restart?)
  • [ ] Will it spread? (Do I have Circuit Breakers to stop cascading failures?)
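
The last item mentions Circuit Breakers. A circuit breaker wraps calls to a dependency and, after repeated failures, rejects further calls immediately for a cool-down period instead of piling more load onto a struggling server. Below is a minimal sketch, assuming synchronous calls for brevity (a production version would await promises and usually come from a library). The `CircuitBreaker` class, its option names, and its defaults are hypothetical.

```javascript
// Minimal circuit breaker. States: CLOSED (normal), OPEN (fail fast),
// HALF_OPEN (allow one trial call after the cool-down).
// Class name, options, and defaults are illustrative assumptions.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold; // consecutive failures before opening
    this.resetTimeoutMs = resetTimeoutMs;     // cool-down before a trial call
    this.now = now;                           // injectable clock, useful for tests
    this.state = "CLOSED";
    this.failures = 0;
    this.openedAt = 0;
  }

  // Run `fn` through the breaker. Throws immediately while the circuit is open.
  call(fn) {
    if (this.state === "OPEN") {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error("circuit open: failing fast");
      }
      this.state = "HALF_OPEN"; // cool-down elapsed, allow one trial call
    }
    try {
      const result = fn();
      this.state = "CLOSED"; // success closes the circuit and resets the count
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}

let clock = 0; // fake clock so the example is deterministic
const breaker = new CircuitBreaker({ failureThreshold: 2, resetTimeoutMs: 5000, now: () => clock });
const failing = () => { throw new Error("downstream is down"); };

for (let i = 0; i < 3; i++) {
  try { breaker.call(failing); } catch (err) { console.log(err.message); }
}
// downstream is down
// downstream is down
// circuit open: failing fast
```

The third call never reaches the failing dependency. Once the cool-down has passed, a single successful trial call closes the circuit again.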

Released under the ISC License.