
Error Rate Alerts

In system design, an Error Rate describes the percentage of failed requests compared to the total number of requests handled by your service. Setting alerts based on raw error counts is a common trap; ten errors might be catastrophic if your system handles only twenty requests, but insignificant if it handles a million.

To create resilient alerting, we trigger alerts based on the Error Rate (Error Percentage) across a rolling time window.

What is an Error Rate Alert?

  • Error Rate = (Failed Requests / Total Requests) * 100
  • A Failed Request is usually defined by specific HTTP status codes (like 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, or 504 Gateway Timeout).
  • 4xx Client errors (like 400 Bad Request or 404 Not Found) are generally not included in server error rates because they are usually problems with the client's payload, not the server's health.
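
To make the formula concrete, here is a minimal sketch in plain JavaScript. The helper names `isServerError` and `errorRatePercent` and the sample counts are hypothetical, chosen only for illustration. It counts 5xx responses as failures and leaves 4xx responses out of the numerator, as described above.

```javascript
// Hypothetical helper: treat only 5xx status codes as server-side failures.
function isServerError(statusCode) {
  return statusCode >= 500 && statusCode <= 599;
}

// Error Rate = (Failed Requests / Total Requests) * 100
// `countsByStatus` maps an HTTP status code to the number of responses seen.
function errorRatePercent(countsByStatus) {
  let failed = 0;
  let total = 0;
  for (const [status, count] of Object.entries(countsByStatus)) {
    total += count;
    if (isServerError(Number(status))) failed += count;
  }
  // With no traffic there is no meaningful rate; return 0 instead of NaN.
  return total === 0 ? 0 : (failed / total) * 100;
}

const quietService = { 200: 10, 500: 10 };     // 10 failures in 20 requests
const busyService = { 200: 999990, 500: 10 };  // 10 failures in 1,000,000 requests
const clientErrors = { 200: 90, 400: 5, 404: 5 }; // only 4xx problems

console.log(errorRatePercent(quietService).toFixed(3)); // "50.000"
console.log(errorRatePercent(busyService).toFixed(3));  // "0.001"
console.log(errorRatePercent(clientErrors).toFixed(3)); // "0.000"
```

The same ten failures produce a 50% error rate on the quiet service and a negligible one on the busy service, which is why the alert should be based on the ratio rather than the raw count.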

Architectural Visualization

In a typical monitoring pipeline, the application exposes request counters, Prometheus scrapes them and evaluates alert rules, and firing alerts are dispatched in real time to engineering teams through PagerDuty or Slack.

Code Example: Tracking Error Rates

Below is an example of tracking total requests and error requests in an Express application using prom-client. We use a Counter metric, which only ever increases. The application never computes a rate itself; the monitoring tool (e.g., Prometheus) derives the rate from how quickly the counter grows over time.

javascript
const express = require("express");
const promClient = require("prom-client");

const app = express();
const registry = new promClient.Registry();

// 1. Define a Counter for all requests
const httpRequestsTotal = new promClient.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests made",
  labelNames: ["method", "route", "status_code"],
});

// Register the metric
registry.registerMetric(httpRequestsTotal);

// 2. Middleware to count API traffic and identify errors
app.use((req, res, next) => {
  // Hook into the finish event to capture the response status
  res.on("finish", () => {
    // Increment the total request counter and label the response status code
    httpRequestsTotal.inc({
      method: req.method,
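      // Use the matched route pattern; fall back to a fixed value so that
      // unmatched paths do not create an unbounded number of label values
      route: req.route ? req.route.path : "unmatched",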
      status_code: res.statusCode,
    });
  });

  next();
});

// 3. Application Routes
app.get("/api/success", (req, res) => {
  res.status(200).send({ message: "Everything is fine!" });
});

app.get("/api/flaky", (req, res) => {
  // Simulating a service that occasionally fails (server error)
  const isError = Math.random() < 0.2; // 20% chance to fail

  if (isError) {
    res.status(500).send({ error: "Internal Server Error" });
  } else {
    res.status(200).send({ message: "Success for now!" });
  }
});

app.get("/api/bad-request", (req, res) => {
  // A client error (should usually be excluded from critical server alerts)
  res.status(400).send({ error: "Invalid Payload" });
});

// 4. Metrics endpoint for Prometheus to scrape
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", registry.contentType);
  res.end(await registry.metrics());
});

app.listen(3000, () => {
  console.log("Server listening with Error Tracking enabled on port 3000");
});

How Prometheus Evaluates the Alert

Once the data is flowing into Prometheus, you can write an alert rule using PromQL to evaluate the error rate mathematically over a rolling window of time (e.g., the last 5 minutes).

A typical alerting rule, written in the YAML rule-file format used since Prometheus 2.0, looks like this:

yaml
# Alert if the 5xx error rate exceeds 5% over the last 5 minutes
groups:
  - name: error-rate
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected in API Gateway"
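
To build intuition for what the expression above computes, here is a simplified sketch in plain JavaScript. The sample data and the function names `windowRate` and `errorRatio` are hypothetical, and the calculation is deliberately naive: unlike Prometheus's real `rate()`, it does not extrapolate to the window boundaries or handle counter resets.

```javascript
// Hypothetical counter readings, one per scrape (time in seconds, sorted).
// Both counters only go up, like the http_requests_total series.
const samples = [
  { t: 0, total: 1000, errors: 10 },
  { t: 60, total: 1600, errors: 12 },
  { t: 120, total: 2200, errors: 15 },
  { t: 180, total: 2800, errors: 75 },
  { t: 240, total: 3400, errors: 135 },
  { t: 300, total: 4000, errors: 195 },
];

// Per-second increase of one counter over the window that ends at `now`.
function windowRate(samples, field, now, windowSeconds) {
  const inWindow = samples.filter(
    (s) => s.t >= now - windowSeconds && s.t <= now
  );
  if (inWindow.length < 2) return null; // two points are needed to measure growth
  const first = inWindow[0];
  const last = inWindow[inWindow.length - 1];
  return (last[field] - first[field]) / (last.t - first.t);
}

// Ratio of the error rate to the total request rate, or null without data.
function errorRatio(samples, now, windowSeconds) {
  const errorRate = windowRate(samples, "errors", now, windowSeconds);
  const totalRate = windowRate(samples, "total", now, windowSeconds);
  if (errorRate === null || !totalRate) return null; // no samples or no traffic
  return errorRate / totalRate;
}

const THRESHOLD = 0.05; // 5%, matching the alert rule

console.log(errorRatio(samples, 120, 300) > THRESHOLD); // false: 5 errors in 1,200 requests
console.log(errorRatio(samples, 300, 300) > THRESHOLD); // true: 185 errors in 3,000 requests
```

Because the ratio is taken between two rates over the same window, the absolute counter values never matter; only how fast each counter grew during the window does.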

Best Practices

  1. Calculate Ratios, Not Raw Totals: Always divide the rate of failed requests by the rate of all requests, applying PromQL's rate() function to both counters. 100 errors out of 1,000 requests is an emergency; 100 errors out of 1,000,000 requests is standard noise.
  2. Exclude 4xx Expected Errors: Do not penalize your microservice for 400 errors, which represent invalid input by the client. Group them separately unless you are tracking anomalies in client behavior.
  3. Use Sliding Time Windows: Apply rate() over a range selector such as [5m] or [1m] so that old failures drop out of the calculation and the alert recovers automatically once the system stabilizes.
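
The automatic recovery that a sliding window provides can be illustrated with a small in-memory sketch. The `SlidingWindowErrorRate` class below is hypothetical and is not part of prom-client or Prometheus; it only mimics the idea that requests older than the window stop counting toward the ratio.

```javascript
// Minimal in-memory sliding window (illustrative helper only).
class SlidingWindowErrorRate {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.events = []; // each entry: { time, failed }
  }

  // Record one finished request at the given timestamp (milliseconds).
  record(time, failed) {
    this.events.push({ time, failed });
  }

  // Fraction of failed requests among those still inside the window at `now`.
  ratio(now) {
    // Evict everything older than the window so old failures stop counting.
    this.events = this.events.filter((e) => e.time > now - this.windowMs);
    if (this.events.length === 0) return 0;
    const failures = this.events.filter((e) => e.failed).length;
    return failures / this.events.length;
  }
}

const ALERT_THRESHOLD = 0.05;
const tracker = new SlidingWindowErrorRate(5 * 60 * 1000); // 5-minute window

// A brief incident at the 1-minute mark: 8 successes and 2 failures.
for (let i = 0; i < 8; i++) tracker.record(60000, false);
for (let i = 0; i < 2; i++) tracker.record(60000, true);

// Healthy traffic at the 4-minute mark: 10 successes.
for (let i = 0; i < 10; i++) tracker.record(240000, false);

console.log(tracker.ratio(250000) > ALERT_THRESHOLD); // true: 2 of 20 requests failed
console.log(tracker.ratio(400000) > ALERT_THRESHOLD); // false: the failures have aged out
```

No one has to reset anything after the incident; once the failed requests fall outside the window, the ratio drops below the threshold on its own.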

Released under the ISC License.