Latency Tracking (P50, P95, P99)

In system design, measuring average latency can be deeply misleading. Instead, we use percentiles (like P50, P95, and P99) to accurately understand user experience.

What Are Percentiles?

P50 (Median): 50% of requests complete at or below this value. It represents the typical request.
P95: 95% of requests complete at or below this value. It marks the "tail end" of normal traffic: only the slowest 5% of requests take longer.
P99: 99% of requests complete at or below this value. This represents the near-worst case (the true worst case is the maximum). If a user hits a cold cache or a complex query, their experience usually falls in this slowest 1%.

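Why not average? An average hides outliers. If 99 users get a 10ms response but 1 user gets a 10,000ms response, the average comes out at about 110ms, a figure that describes no one's experience: it is 11 times slower than what 99 users saw and roughly 90 times faster than what the unlucky one saw.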
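The arithmetic in the example above can be checked with a short script. This is a minimal sketch: the helper names `mean` and `percentile` are illustrative, and the percentile uses the simple nearest-rank definition, which is only one of several common definitions.

```javascript
// Latencies from the example above: 99 requests at 10 ms, 1 request at 10,000 ms.
const latenciesMs = [...Array(99).fill(10), 10000];

// Arithmetic mean of the samples.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it (p in the range 1..100).
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const summary = {
  mean: mean(latenciesMs),
  p50: percentile(latenciesMs, 50),
  p99: percentile(latenciesMs, 99),
  max: percentile(latenciesMs, 100),
};

console.log(summary);
```

The mean comes out at about 109.9 ms while the P50 is 10 ms. With exactly one slow request in a hundred, the outlier sits just past the P99 under this definition and only shows up in the maximum; a second slow request would move the P99 to 10,000 ms.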

Architectural Visualization

Here is a typical monitoring pipeline for gathering latency metrics: the application server exposes its metrics over HTTP, Prometheus scrapes and stores them, and Grafana queries Prometheus to chart the percentiles.

How It Works Under the Hood

To calculate percentiles continuously at scale without storing millions of individual data points in memory, metrics systems like Prometheus use Histograms.

Histogram Buckets: The application sorts response times into predefined buckets (e.g., <= 10ms, <= 50ms, <= 100ms, <= 500ms). In Prometheus these buckets are cumulative: each one counts every observation at or below its upper bound.
Counters: Instead of saving each measured duration, the application just increments the counter of every bucket the observation fits into.
Scraping: Prometheus regularly pulls (scrapes) these bucket counts from the application.
Estimation: When calculating the P99, Prometheus finds the bucket that contains the 99th-percentile rank and interpolates linearly within it. The result is an estimate whose error is bounded by the width of that bucket, obtained without massive memory overhead.
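The four steps above can be sketched in plain JavaScript. This is a simplified illustration of cumulative buckets and linear interpolation, not Prometheus's actual implementation; the names `createHistogram`, `observe`, and `estimateQuantile` are hypothetical.

```javascript
// Create a histogram with cumulative buckets. A final +Infinity bucket
// is always appended so that every observation is counted somewhere.
function createHistogram(finiteUpperBounds) {
  const upperBounds = [...finiteUpperBounds, Infinity];
  return { upperBounds, counts: upperBounds.map(() => 0), total: 0 };
}

// Record one observation: increment every bucket whose upper bound
// is greater than or equal to the value (cumulative counting).
function observe(hist, seconds) {
  hist.upperBounds.forEach((bound, i) => {
    if (seconds <= bound) hist.counts[i] += 1;
  });
  hist.total += 1;
}

// Estimate quantile q (0 < q <= 1) by linear interpolation inside the
// first bucket whose cumulative count reaches the target rank.
function estimateQuantile(hist, q) {
  if (hist.total === 0) return NaN;
  const rank = q * hist.total;
  const i = hist.counts.findIndex((count) => count >= rank);
  // Rank falls in the +Infinity bucket: report the highest finite bound.
  if (hist.upperBounds[i] === Infinity) return hist.upperBounds[i - 1];
  const lower = i === 0 ? 0 : hist.upperBounds[i - 1];
  const below = i === 0 ? 0 : hist.counts[i - 1];
  const inBucket = hist.counts[i] - below;
  if (inBucket === 0) return lower;
  return lower + (hist.upperBounds[i] - lower) * ((rank - below) / inBucket);
}

// Example 1: 90 requests at 20 ms and 10 requests at 300 ms.
const hist = createHistogram([0.01, 0.05, 0.1, 0.5]);
for (let i = 0; i < 90; i++) observe(hist, 0.02);
for (let i = 0; i < 10; i++) observe(hist, 0.3);
const p99 = estimateQuantile(hist, 0.99);

// Example 2: 100 requests at 15 ms, but only coarse 100 ms / 500 ms buckets.
const coarse = createHistogram([0.1, 0.5]);
for (let i = 0; i < 100; i++) observe(coarse, 0.015);
const coarseP99 = estimateQuantile(coarse, 0.99);

console.log({ p99, coarseP99 });
```

In the first example the estimate is roughly 0.46 s although the true P99 is 0.3 s, because the rank lands in the wide 100ms-500ms bucket. In the second, the estimate is roughly 0.099 s for requests that all took 15 ms. Both results show that the estimate is only as good as the bucket layout.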

Code Example: Implementing Latency Tracking

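Here is an example of tracking request latency in a typical Node.js / Express backend using prom-client. The application only records histogram buckets; the percentiles themselves are computed later from those buckets in Prometheus.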

javascript

const express = require("express");
const promClient = require("prom-client");

const app = express();
const registry = new promClient.Registry();

// 1. Define a Histogram metric for HTTP requests
// The 'buckets' define the upper bounds of our latency intervals (in seconds)
const httpRequestDurationSeconds = new promClient.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  // Buckets for 10ms, 50ms, 100ms, 500ms, 1s, 2s, 5s
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
});

// Register the metric
registry.registerMetric(httpRequestDurationSeconds);

// 2. Middleware to track latency for every request
app.use((req, res, next) => {
  // Start the timer; calling end() later records the elapsed seconds
  const end = httpRequestDurationSeconds.startTimer();

  // Hook into the finish event of the response to record the time
  res.on("finish", () => {
    end({
      method: req.method,
      // Label with the route pattern, never the raw URL: arbitrary paths
      // (e.g. 404 probes) would create an unbounded number of label values
      route: req.route ? req.route.path : "unmatched",
      status_code: res.statusCode,
    });
  });

  next();
});

// 3. Application Routes
app.get("/api/fast", (req, res) => {
  // Simulating a fast response
  setTimeout(() => res.send({ status: "Fast!" }), 20);
});

app.get("/api/slow", (req, res) => {
  // Simulating a slow response (fixed 800ms delay)
  setTimeout(() => res.send({ status: "Slow!" }), 800);
});

// 4. Metrics Endpoint for Prometheus to Scrape
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", registry.contentType);
  res.end(await registry.metrics());
});

app.listen(3000, () => {
  console.log("Server is running on port 3000");
});

Best Practices

Focus on P99 over P50: While P50 tells you what the typical user feels, the P99 is where timeouts, cascading failures, and resource exhaustion usually show up first. Alerting on your P99 latency lets you catch these problems before they reach most users.
Define Sensible Buckets: The accuracy of percentiles calculated from Histograms heavily depends on the buckets you configure. If all your requests take 10ms-20ms, but your buckets are [100ms, 500ms], every request lands in the first bucket and linear interpolation reports a P99 of about 99ms, several times the true value. Place bucket boundaries around the latencies you actually expect and around your latency targets.
Trace Outliers: Percentiles tell you that something is slow, not why. Couple your metrics with distributed tracing (for example OpenTelemetry or Jaeger) to inspect why the P99 requests are slow.
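To act on the P99 practice, the percentile has to be read back from Prometheus. The sketch below builds a `histogram_quantile` expression for the `http_request_duration_seconds` metric and sends it to Prometheus's HTTP query API. The base URL, the 5m rate window, and the helper names are assumptions for illustration, and a Node.js runtime with a global `fetch` (Node 18 or later) is assumed.

```javascript
// Assumption: Prometheus is reachable at this address (hypothetical local setup).
const PROMETHEUS_URL = "http://localhost:9090";

// Build a PromQL expression that estimates a quantile from the histogram's
// bucket counters, aggregated across all label values except 'le'.
function quantileExpr(quantile, metric, window) {
  return `histogram_quantile(${quantile}, sum by (le) (rate(${metric}_bucket[${window}])))`;
}

// Build the instant-query URL for the Prometheus HTTP API (/api/v1/query).
function queryUrl(baseUrl, expr) {
  const url = new URL("/api/v1/query", baseUrl);
  url.searchParams.set("query", expr);
  return url.toString();
}

// Fetch the current estimate in seconds, or null when no data is returned.
async function fetchQuantile(quantile) {
  const expr = quantileExpr(quantile, "http_request_duration_seconds", "5m");
  const response = await fetch(queryUrl(PROMETHEUS_URL, expr));
  const body = await response.json();
  const sample = body.data.result[0];
  // Instant vectors carry values as [timestamp, "value-as-string"].
  return sample ? Number(sample.value[1]) : null;
}

const p99Expr = quantileExpr(0.99, "http_request_duration_seconds", "5m");
const p99Url = queryUrl(PROMETHEUS_URL, p99Expr);
console.log(p99Expr);
```

The same expression can be pasted into a Grafana panel or used as the condition of an alerting rule. `fetchQuantile(0.99)` is left uncalled here because it needs a running Prometheus server.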
