Error Rate Alerts
In system design, an Error Rate describes the percentage of failed requests compared to the total number of requests handled by your service. Setting alerts based on raw error counts is a common trap; ten errors might be catastrophic if your system handles only twenty requests, but insignificant if it handles a million.
To create resilient alerting, we trigger alerts based on the Error Rate (Error Percentage) across a rolling time window.
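To make the idea of a rolling window concrete, here is a minimal in-memory sketch. The `RollingErrorRate` class and its method names are hypothetical and exist only for illustration; a production setup would normally leave this calculation to the monitoring system rather than the application.

```javascript
// Hypothetical helper: keeps request outcomes and reports the error
// percentage over the most recent `windowMs` milliseconds.
class RollingErrorRate {
  constructor(windowMs) {
    this.windowMs = windowMs;
    this.events = []; // { time, failed } in arrival order
  }

  // Record one request outcome; `now` is injectable to keep the example deterministic.
  record(failed, now = Date.now()) {
    this.events.push({ time: now, failed });
  }

  // Error percentage for the window ending at `now`, or null if there was no traffic.
  percentage(now = Date.now()) {
    const cutoff = now - this.windowMs;
    // Drop events that have aged out of the window.
    while (this.events.length > 0 && this.events[0].time <= cutoff) {
      this.events.shift();
    }
    if (this.events.length === 0) return null;
    const failures = this.events.filter((e) => e.failed).length;
    return (failures * 100) / this.events.length;
  }
}

const tracker = new RollingErrorRate(5 * 60 * 1000); // 5-minute window
tracker.record(true, 0);
tracker.record(false, 1000);
tracker.record(false, 2000);
tracker.record(false, 3000);
console.log(tracker.percentage(4000)); // 25
console.log(tracker.percentage(5 * 60 * 1000 + 500)); // 0, the failure has aged out
```

Because old events fall out of the window, the percentage drops back on its own once failures stop, which is what lets a window-based alert clear without manual intervention.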
What is an Error Rate Alert?
- Error Rate = (Failed Requests / Total Requests) * 100
- A Failed Request is usually defined by specific HTTP status codes (like 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, or 504 Gateway Timeout). 4xx client errors (like 400 Bad Request or 404 Not Found) are generally not included in server error rates because they usually indicate a problem with the client's request, not the server's health.
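The two definitions above can be sketched as a small calculation. The function names `isServerError` and `errorRatePercent` are illustrative assumptions, not part of any library, and the sample status codes are made up.

```javascript
// Hypothetical helper: only 5xx responses count as failed requests.
function isServerError(statusCode) {
  return statusCode >= 500 && statusCode <= 599;
}

// Error Rate = (Failed Requests / Total Requests) * 100.
// Returns null when there were no requests, since the ratio is undefined.
function errorRatePercent(statusCodes) {
  if (statusCodes.length === 0) return null;
  const failed = statusCodes.filter(isServerError).length;
  return (failed * 100) / statusCodes.length;
}

// Ten requests: two 5xx failures, two 4xx client errors, six successes.
const sample = [200, 200, 404, 500, 200, 503, 400, 200, 200, 200];
console.log(errorRatePercent(sample)); // 20
```

Note that the 400 and 404 responses still count toward the total but not toward the failures, so noisy clients do not inflate the server error rate.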
Architectural Visualization
Here is a typical monitoring pipeline designed to track errors and dispatch real-time alerts to engineering teams using PagerDuty or Slack.
Code Example: Tracking Error Rates
Below is an architectural example of tracking total requests vs. error requests in an Express application using prom-client. We use a Counter metric, which only goes up. The monitoring tool (e.g., Prometheus) calculates the rate over time.
const express = require("express");
const promClient = require("prom-client");
const app = express();
const registry = new promClient.Registry();
// 1. Define a Counter for all requests
const httpRequestsTotal = new promClient.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests made",
  labelNames: ["method", "route", "status_code"],
});
// Register the metric
registry.registerMetric(httpRequestsTotal);
// 2. Middleware to count API traffic and identify errors
app.use((req, res, next) => {
  // Hook into the finish event to capture the response status
  res.on("finish", () => {
    // Increment the total request counter and label the response status code
    httpRequestsTotal.inc({
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status_code: res.statusCode,
    });
  });
  next();
});
// 3. Application Routes
app.get("/api/success", (req, res) => {
  res.status(200).send({ message: "Everything is fine!" });
});
app.get("/api/flaky", (req, res) => {
  // Simulating a service that occasionally fails (server error)
  const isError = Math.random() < 0.2; // 20% chance to fail
  if (isError) {
    res.status(500).send({ error: "Internal Server Error" });
  } else {
    res.status(200).send({ message: "Success for now!" });
  }
});
app.get("/api/bad-request", (req, res) => {
  // A client error (should usually be excluded from critical server alerts)
  res.status(400).send({ error: "Invalid Payload" });
});
// 4. Metrics endpoint for Prometheus to scrape
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", registry.contentType);
  res.end(await registry.metrics());
});
app.listen(3000, () => {
  console.log("Server listening with Error Tracking enabled on port 3000");
});
How Prometheus Evaluates the Alert
Once the data is flowing into Prometheus, you can write an alert rule using PromQL to evaluate the error rate mathematically over a rolling window of time (e.g., the last 5 minutes).
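To show what evaluating a ratio of counters over a window amounts to, here is a simplified sketch. The `simpleRate` function is a hypothetical stand-in for PromQL's `rate()`: it only takes the per-second increase between the first and last sample, whereas the real function also handles counter resets and extrapolates to the window edges. The sample values are invented.

```javascript
// Simplified stand-in for rate(): per-second increase of a counter
// between the first and last sample in the window.
function simpleRate(samples) {
  const first = samples[0];
  const last = samples[samples.length - 1];
  const seconds = last.t - first.t;
  return seconds > 0 ? (last.value - first.value) / seconds : 0;
}

// Hypothetical scraped counter values at the edges of a 5-minute window (t in seconds).
const totalSamples = [{ t: 0, value: 10000 }, { t: 300, value: 13000 }];
const errorSamples = [{ t: 0, value: 40 }, { t: 300, value: 250 }];

// 210 new errors out of 3000 new requests in the window.
const errorRatio = simpleRate(errorSamples) / simpleRate(totalSamples);
console.log(errorRatio.toFixed(3)); // "0.070"
console.log(errorRatio > 0.05); // true, so a 5% threshold would be exceeded
```

Only the increase inside the window matters: the 10,000 requests counted before the window started have no influence on the result.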
A typical PromQL alert looks like this:
# Alert if 5xx Error Rate > 5% over the last 5 minutes
# Rule file in the Prometheus 2.x YAML format; the group name is arbitrary
groups:
  - name: error-rate
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected in API Gateway"
Best Practices
- Calculate Ratios, Not Raw Totals: Always divide your error rate by your total request rate, both computed with rate(). 100 errors out of 1,000 requests is an emergency; 100 errors out of 1,000,000 requests is standard noise.
- Exclude 4xx Expected Errors: Do not penalize your microservice for 400 errors, which represent invalid input by the client. Group them separately unless you are tracking anomalies in client behavior.
- Use Sliding Time Windows: Use range selectors like [5m] or [1m] so that the alert recovers automatically once the system stabilizes.
