What Happens When Servers Crash?
In a monolithic world, a server crash is a catastrophe. In distributed systems, it's just Tuesday.
Designing for failure is the core of modern system design. This guide explains how we detect, isolate, and recover from server crashes at scale.
1. Anatomy of a Server Crash
A "crash" can be caused by many factors:
- OOM (Out of Memory): The process needs more RAM than is available and is killed (on Linux, by the kernel's OOM killer).
- Hardware Failure: A physical disk dies or a CPU overheats.
- Network Partition: The server is still running, but it can't talk to anyone, so from the outside it looks crashed.
- Software Bug: A segmentation fault or an unhandled exception.
The Impact
Without a strategy, a single crash causes:
- Service Downtime: Users get 502/504 errors.
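- Data Loss: In-flight requests are lost, along with any state held only in that server's memory.
- Cascading Failure: The surviving servers are overwhelmed by the shifted load and crash in turn. A burst of simultaneous client retries (a Thundering Herd) makes this worse.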
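The arithmetic behind a cascading failure can be sketched in a few lines. This is an illustrative calculation only: it assumes traffic is spread evenly across servers, and the function names and request rates are hypothetical.

```javascript
// Load each surviving server carries after `failedCount` servers crash,
// assuming the load balancer spreads traffic evenly.
function loadPerSurvivor(totalRps, serverCount, failedCount = 1) {
  const survivors = serverCount - failedCount;
  if (survivors <= 0) {
    throw new Error("No servers left to handle traffic");
  }
  return totalRps / survivors;
}

// True if the survivors can absorb the shifted load without exceeding capacity.
function canAbsorbFailure(totalRps, serverCount, capacityPerServer, failedCount = 1) {
  return loadPerSurvivor(totalRps, serverCount, failedCount) <= capacityPerServer;
}

// 4 servers sharing 3000 req/s carry 750 req/s each.
// After one crash, the 3 survivors carry 1000 req/s each.
console.log(loadPerSurvivor(3000, 4));        // 1000
console.log(canAbsorbFailure(3000, 4, 900));  // false: survivors are overloaded
console.log(canAbsorbFailure(3000, 4, 1200)); // true: enough headroom
```

If each server can only handle 900 req/s, the first crash pushes every survivor past its limit, which is how one failure turns into several.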
2. Detection: The Health Check
How does the system know a server is dead? We use Health Checks.
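A Load Balancer (LB) or Service Mesh periodically "pings" a health endpoint on each server. A server that fails several checks in a row is marked unhealthy and taken out of rotation until it passes again.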
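Health checkers usually do not act on a single failed probe, since one slow response does not mean the server is dead. The sketch below shows one common approach, consecutive-failure thresholds. The class name and default thresholds are assumptions for illustration, not the behavior of any particular load balancer.

```javascript
// Tracks consecutive probe results for one backend and decides whether
// traffic should still be routed to it.
class HealthTracker {
  constructor({ unhealthyThreshold = 3, healthyThreshold = 2 } = {}) {
    this.unhealthyThreshold = unhealthyThreshold; // failures in a row before marking down
    this.healthyThreshold = healthyThreshold;     // successes in a row before marking up
    this.healthy = true;
    this.failures = 0;
    this.successes = 0;
  }

  // Record the outcome of one probe and return the current health state.
  record(probeSucceeded) {
    if (probeSucceeded) {
      this.successes += 1;
      this.failures = 0;
      if (!this.healthy && this.successes >= this.healthyThreshold) {
        this.healthy = true;
      }
    } else {
      this.failures += 1;
      this.successes = 0;
      if (this.healthy && this.failures >= this.unhealthyThreshold) {
        this.healthy = false;
      }
    }
    return this.healthy;
  }
}

const tracker = new HealthTracker();
tracker.record(false); // true  (1 failure, still in rotation)
tracker.record(false); // true  (2 failures)
tracker.record(false); // false (3 failures in a row: marked down)
tracker.record(true);  // false (needs 2 successes to come back)
tracker.record(true);  // true  (back in rotation)
```

Requiring several successes before re-admitting a server avoids "flapping", where a struggling instance is repeatedly added and removed.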
3. High Availability Architecture
To survive crashes, we use Redundancy and Failover.
Pattern A: Active-Active (Standard)
Multiple servers handle traffic simultaneously. If one crashes, the load balancer routes its share to the others, which need enough spare capacity to pick up the slack.
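As a rough sketch of how traffic shifts in an active-active setup, the round-robin picker below skips any server that has been marked down. The class, its methods, and the server names are hypothetical; real load balancers offer more strategies than this.

```javascript
// Round-robin picker that skips servers currently marked unhealthy.
class RoundRobinBalancer {
  constructor(servers) {
    this.servers = servers;     // e.g. ["app-1", "app-2", "app-3"]
    this.unhealthy = new Set();
    this.next = 0;              // index to try first on the next pick
  }

  markDown(server) { this.unhealthy.add(server); }
  markUp(server) { this.unhealthy.delete(server); }

  // Returns the next healthy server, or null if none is available.
  pick() {
    for (let i = 0; i < this.servers.length; i++) {
      const candidate = this.servers[(this.next + i) % this.servers.length];
      if (!this.unhealthy.has(candidate)) {
        this.next = (this.next + i + 1) % this.servers.length;
        return candidate;
      }
    }
    return null;
  }
}

const lb = new RoundRobinBalancer(["app-1", "app-2", "app-3"]);
lb.pick();            // "app-1"
lb.markDown("app-2"); // app-2 crashes
lb.pick();            // "app-3" (app-2 is skipped)
lb.pick();            // "app-1"
```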
Pattern B: Statelessness
This is the most important rule. If a server is stateless (does not store user sessions or data in its own memory or on its own disk), any other instance can take over its requests, and a replacement can start without migrating data.
- Bad: Storing user login sessions in Server RAM.
- Good: Storing sessions in a shared Redis cache.
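The difference can be sketched as follows. To keep the example runnable without external services, `FakeSharedStore` is an in-memory stand-in for a shared store such as Redis; it, `AppServer`, and the key format are all hypothetical names chosen for illustration, and a real deployment would pass in an actual Redis client instead.

```javascript
// Stand-in for a shared store such as Redis. This in-memory fake exists only
// so the sketch runs on its own; it is NOT shared across machines.
class FakeSharedStore {
  constructor() { this.data = new Map(); }
  async get(key) { return this.data.has(key) ? this.data.get(key) : null; }
  async set(key, value) { this.data.set(key, value); }
}

// An app server that keeps no session state of its own: every read and
// write goes through the shared store it was given.
class AppServer {
  constructor(name, store) {
    this.name = name;
    this.store = store;
  }
  async login(sessionId, user) {
    await this.store.set(`session:${sessionId}`, JSON.stringify({ user }));
  }
  async whoAmI(sessionId) {
    const raw = await this.store.get(`session:${sessionId}`);
    return raw ? JSON.parse(raw).user : null;
  }
}

async function demo() {
  const store = new FakeSharedStore();
  const serverA = new AppServer("app-1", store);
  const serverB = new AppServer("app-2", store);

  await serverA.login("abc123", "alice");
  // app-1 "crashes"; the next request lands on app-2, which still sees the session.
  return serverB.whoAmI("abc123");
}

demo().then((user) => console.log(user)); // alice
```

Had `login` written to a field on `serverA` instead, the session would have disappeared with that server.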
4. Code Examples
A. The Server Side: Health Check Endpoint
Every service should expose a lightweight endpoint that returns 200 OK when the instance is healthy and an error status such as 503 when it is not.
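// Express.js Example
const express = require("express");
const app = express();

// Placeholder dependency probe (hypothetical): replace with a real check,
// such as a cheap query against your database.
async function checkDB() {
  return true;
}

app.get("/health", async (req, res) => {
  try {
    // Check DB connections, memory usage, etc.
    const isDBConnected = await checkDB();
    if (isDBConnected) {
      res.status(200).json({ status: "UP" });
    } else {
      res.status(503).json({ status: "DOWN" });
    }
  } catch (err) {
    // A probe that throws should report unhealthy, not crash the handler.
    res.status(503).json({ status: "DOWN" });
  }
});

app.listen(3000);
B. The Client Side: Retry with Backoff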
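If a server crashes while processing a request, the client (or an internal service) should retry carefully: only for requests that are safe to repeat (idempotent), and with growing delays so the retries don't overload the remaining servers.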
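const axios = require("axios");

async function fetchWithRetry(url, retries = 3, delay = 1000) {
  try {
    return await axios.get(url);
  } catch (error) {
    // A crashed server often sends no response at all (connection refused,
    // reset, or timeout), so error.response is undefined in that case.
    const status = error.response ? error.response.status : undefined;
    const isRetryable = status === undefined || status >= 500;
    if (retries > 0 && isRetryable) {
      console.log(`Request failed. Retrying in ${delay}ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
      return fetchWithRetry(url, retries - 1, delay * 2); // Exponential Backoff
    }
    throw error; // Out of retries, or a 4xx error that retrying won't fix
  }
}
5. Summary: The Resiliency Checklist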
When designing for crashes, always ask:
- [ ] Can I detect it? (Do I have health checks?)
- [ ] Can I isolate it? (Will the Load Balancer stop sending traffic?)
- [ ] Can I recover it? (Is the service stateless so it can restart?)
- [ ] Will it spread? (Do I have Circuit Breakers to stop cascading failures?)
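The last item mentions Circuit Breakers. A minimal sketch of the idea follows, assuming a simple consecutive-failure policy. The class, option names, and defaults are hypothetical, and a production implementation would also limit how many trial calls pass through once the timeout has elapsed.

```javascript
// Minimal circuit breaker: after `failureThreshold` consecutive failures the
// circuit opens and calls fail fast until `resetTimeoutMs` has elapsed.
class CircuitBreaker {
  constructor(fn, { failureThreshold = 3, resetTimeoutMs = 5000, now = Date.now } = {}) {
    this.fn = fn;                 // the async operation to protect
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;               // injectable clock, handy for tests
    this.failures = 0;
    this.openedAt = null;         // null means the circuit is closed
  }

  async call(...args) {
    if (this.openedAt !== null &&
        this.now() - this.openedAt < this.resetTimeoutMs) {
      // Open: do not touch the struggling dependency at all.
      throw new Error("Circuit open: failing fast");
    }
    // Closed, or the timeout has elapsed (half-open): let the call through.
    try {
      const result = await this.fn(...args);
      this.failures = 0;
      this.openedAt = null;       // a success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // open (or re-open) the circuit
      }
      throw err;
    }
  }
}
```

Wrapping a call such as `fetchWithRetry` in a breaker like this means that once a dependency is clearly down, callers stop piling retries onto it and get an immediate error instead.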
