Application Performance Monitoring (APM): Prometheus + Grafana

While Elastic APM traces individual requests through your code, Prometheus + Grafana is the de facto open-source stack for metrics-based monitoring: tracking numeric measurements over time across your entire infrastructure, from Kubernetes node CPU utilization down to application-level request rates and error counts. Together they underpin observability for production systems at companies such as SoundCloud (where Prometheus originated), GitLab, and Uber.


The Metrics Paradigm

A metric is a numeric measurement captured at a point in time, identified by a name and a set of key-value labels:

http_requests_total{service="order-service", method="POST", status="500"} 47 1717900800000
│──────────────────┘│──────────────────────────────────────────────────────┘│──┘│───────────────┘
    Metric Name                        Labels                              Value  Timestamp (ms)

Unlike traces (which record individual requests) or logs (which record events as text), metrics are pre-aggregated numerical data that can be stored and queried at massive scale with minimal storage overhead.
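
As a rough illustration of this data model, the sketch below splits one exposition-style sample line into its four parts. The helper name `parse_sample` and both regular expressions are assumptions made for this example; they do not unescape label values and are not the parser Prometheus itself uses.

```python
import re

# Simplified patterns for one sample line: name{labels} value [timestamp_ms]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split a sample line into name, labels, value, and optional timestamp."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid sample line: {line!r}")
    timestamp = match.group("ts")
    return {
        "name": match.group("name"),
        "labels": dict(LABEL_RE.findall(match.group("labels") or "")),
        "value": float(match.group("value")),
        "timestamp_ms": int(timestamp) if timestamp is not None else None,
    }


sample = parse_sample(
    'http_requests_total{service="order-service", method="POST", status="500"} '
    '47 1717900800000'
)
```

The metric name plus the full label set identifies one time series, so changing any label value (for example `status="200"`) yields a different series.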

The Four Prometheus Metric Types

Key Design Insight: Use a Histogram (not a Summary) whenever possible. Histograms can be aggregated across multiple instances with histogram_quantile(). Summaries are computed per-client and cannot be correctly aggregated — a critical distinction in distributed systems.
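
A small worked example makes the aggregation problem concrete. The two latency lists are made-up data, and `nearest_rank_quantile` and `bucket_counts` are helpers written only for this sketch. It compares averaging per-instance P99 values (what combining Summaries amounts to) with the true P99 of all requests, and shows that cumulative bucket counts can simply be added.

```python
import math


def nearest_rank_quantile(q, values):
    """Nearest-rank quantile of a list of observations (0 < q <= 1)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical traffic: instance A is busy and fast, instance B is idle and slow.
instance_a = [0.01] * 1000
instance_b = [1.0] * 10

# Summary-style: each instance reports its own P99, and we average them.
per_instance_p99 = [nearest_rank_quantile(0.99, v) for v in (instance_a, instance_b)]
averaged_p99 = sum(per_instance_p99) / len(per_instance_p99)

# Ground truth: P99 over every request from both instances.
true_p99 = nearest_rank_quantile(0.99, instance_a + instance_b)

# Histogram-style: cumulative bucket counts from each instance add up exactly.
BOUNDS = [0.05, 0.5, float("inf")]


def bucket_counts(values):
    """Cumulative count of observations less than or equal to each bound."""
    return [sum(1 for v in values if v <= bound) for bound in BOUNDS]


merged = [a + b for a, b in zip(bucket_counts(instance_a), bucket_counts(instance_b))]
```

Here the averaged value is about 0.5 s while the true P99 is 0.01 s, because the average ignores how many requests each instance served. The merged bucket counts are identical to the counts computed over the combined data, which is why a quantile estimated from summed buckets stays meaningful.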


Full System Architecture

The following diagram shows the complete production architecture of a Prometheus + Grafana stack deployed in Kubernetes:


Core Components Deep Dive

1. Prometheus Server: Pull-Based Architecture

Prometheus uses a pull model — it actively scrapes /metrics HTTP endpoints on a configured interval, rather than waiting for services to push data to it. This is a deliberate architectural choice with major implications:

Why pull over push?

| Dimension | Pull (Prometheus) | Push (StatsD, InfluxDB) |
| --- | --- | --- |
| Health visibility | A failed scrape marks the target as down (up == 0), so an outage is visible within one scrape interval | Silent failure if the agent stops pushing |
| Configuration | Centralized in Prometheus config | Distributed: each service must know the collector address |
| Back-pressure | Prometheus controls the rate; no thundering herd | Services can overwhelm the collector |
| Service discovery | First-class: K8s, Consul, EC2, DNS | Must be baked into the push agent |
| Short-lived jobs | Misses jobs that complete between scrapes; needs the Pushgateway | Handled natively: the job pushes before it exits |
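
The health-visibility point can be sketched in a few lines. This is a simplified model, not Prometheus code: `scrape`, `healthy_fetch`, and `broken_fetch` are hypothetical names, and the `fetch` callable stands in for the HTTP GET that a real scraper would perform. It shows how a failed scrape becomes an `up` value of 0 for that target.

```python
import time


def scrape(target, fetch):
    """Scrape one target and record a synthetic 'up' value for it."""
    started = time.time()
    try:
        body = fetch(target)  # stands in for an HTTP GET of /metrics
        up = 1
    except Exception:
        body = ""
        up = 0  # the failure itself becomes a data point
    return {
        "target": target,
        "up": up,
        "body": body,
        "duration_seconds": time.time() - started,
    }


def healthy_fetch(target):
    """Hypothetical target that answers normally."""
    return "http_requests_total 42\n"


def broken_fetch(target):
    """Hypothetical target that refuses connections."""
    raise ConnectionError("connection refused")


ok = scrape("order-service:8080", healthy_fetch)
down = scrape("payment-service:8080", broken_fetch)
```

Because the scraper owns the schedule, it always knows which targets it expected to hear from, so an alert on `up == 0` needs no cooperation from the failing service.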

2. The /metrics Exposition Format

Every Prometheus-compatible service exposes a human-readable text endpoint:

# HELP http_requests_total Total number of HTTP requests processed
# TYPE http_requests_total counter
http_requests_total{service="order-service",method="GET",status="200"} 58432
http_requests_total{service="order-service",method="POST",status="201"} 12049
http_requests_total{service="order-service",method="POST",status="500"} 47

# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.005"} 8432
http_request_duration_seconds_bucket{le="0.01"}  15029
http_request_duration_seconds_bucket{le="0.025"} 28471
http_request_duration_seconds_bucket{le="0.05"}  41203
http_request_duration_seconds_bucket{le="0.1"}   52190
http_request_duration_seconds_bucket{le="0.25"}  57438
http_request_duration_seconds_bucket{le="0.5"}   58301
http_request_duration_seconds_bucket{le="1.0"}   58420
http_request_duration_seconds_bucket{le="+Inf"}  58432
http_request_duration_seconds_sum   4821.3
http_request_duration_seconds_count 58432

# HELP active_db_connections Current number of open database connections
# TYPE active_db_connections gauge
active_db_connections{pool="primary"} 23
active_db_connections{pool="replica"} 8

This format is simple enough to be generated by a shell script, and it is cheap for Prometheus to parse.
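
To show how little is needed to produce this format, here is a minimal sketch that renders one metric family as text. The function names `render_metric` and `escape_label_value` are assumptions for this example; a real service would normally use an official client library instead.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family; samples is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "active_db_connections",
    "Current number of open database connections",
    "gauge",
    [({"pool": "primary"}, 23), ({"pool": "replica"}, 8)],
)
```

The resulting `text` has the same shape as the `active_db_connections` lines in the listing above and could be served as the body of a `/metrics` response.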


3. Prometheus TSDB: Time-Series Database Internals

Prometheus stores data in its own embedded TSDB (Time-Series Database), optimized specifically for write-heavy, time-ordered data:

Storage efficiency: Prometheus compresses timestamps with delta-of-delta encoding and float sample values with XOR encoding, averaging approximately 1–2 bytes per sample. At a 15s scrape interval (5,760 samples per day), that is roughly 6–12 KB per time series per day.
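
The intuition behind timestamp compression can be shown with a short sketch. It computes the delta-of-delta of a series of scrape timestamps; the starting timestamp and the 7 ms of jitter are made-up values, and the sketch stops at the arithmetic rather than the bit-level encoding a real TSDB performs.

```python
def delta_of_delta(timestamps):
    """Second-order differences of a timestamp series."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


# Six scrapes at a 15 s interval (timestamps in milliseconds).
timestamps = [1717900800000 + i * 15000 for i in range(6)]
timestamps[3] += 7  # one scrape arrives 7 ms late

dod = delta_of_delta(timestamps)

# Sample volume for one series at a 15 s scrape interval.
samples_per_day = 24 * 60 * 60 // 15
bytes_per_day_low = samples_per_day * 1
bytes_per_day_high = samples_per_day * 2
```

With perfectly regular scrapes every delta-of-delta is zero, and small jitter produces small integers, so each timestamp needs only a few bits instead of a full 64-bit value.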


4. PromQL: The Query Language

PromQL (Prometheus Query Language) is a functional expression language purpose-built for time-series data. It is the query engine that powers every Grafana panel and every alerting rule.

Instant Vector vs. Range Vector

# Instant vector: current value of all http_requests_total series
http_requests_total

# Range vector: last 5 minutes of samples for that metric
http_requests_total[5m]

The Golden Signal Queries

Rate (Requests per second):

txt
# Per-second rate of requests over the last 5 minutes
sum(rate(http_requests_total[5m])) by (service)
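
To make the behaviour of `rate()` less abstract, here is a simplified sketch of the idea: sum the increases between consecutive counter samples, treat any drop as a counter reset, and divide by the elapsed time. The function `simple_rate` and the sample data are assumptions for this example, and the sketch leaves out the extrapolation to the window boundaries that Prometheus applies.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # The counter went down, so the process restarted: count from zero.
            increase += current
        else:
            increase += current - previous
    return increase / elapsed


# Hypothetical range vector: the counter resets between t=30 and t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
requests_per_second = simple_rate(window)
```

Reset handling is the reason counters should be wrapped in `rate()` before aggregation: summing raw counters first would hide a restart of a single instance.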

Error Rate (%):

txt
# Percentage of 5xx errors over all requests
100 * (
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
  sum(rate(http_requests_total[5m])) by (service)
)

Latency Percentile (P99):

txt
# 99th percentile request duration over the last 5 minutes
histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
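
The estimate that `histogram_quantile()` produces can be approximated by hand. The sketch below is a simplified re-implementation written for this example, not Prometheus source code: it finds the bucket containing the target rank and interpolates linearly inside it. It works on raw cumulative counts (the bucket values from the exposition listing earlier) rather than on per-second rates, and it assumes `0 < q < 1`.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
             with float("inf") as the final bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate into +Inf: fall back to the last finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return float("nan")


buckets = [
    (0.005, 8432), (0.01, 15029), (0.025, 28471), (0.05, 41203),
    (0.1, 52190), (0.25, 57438), (0.5, 58301), (1.0, 58420),
    (float("inf"), 58432),
]
p99 = histogram_quantile(0.99, buckets)
```

For this data the 99th-percentile rank falls in the 0.25 s to 0.5 s bucket, so the estimate lands between those bounds. The result is only as precise as the bucket layout, which is why bucket boundaries are usually placed near the latency thresholds that matter.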

Saturation (CPU utilization):

txt
# CPU utilization per node (instance) as a percentage
100 - (
  avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100
)

Recording Rules: Pre-computing Expensive Queries

For frequently-used, expensive PromQL expressions (like P99 across a large cluster), use recording rules to pre-compute and store results as a new metric:

yaml
# prometheus-rules.yaml
groups:
  - name: slo_recording_rules
    interval: 60s
    rules:
      # Pre-compute P99 latency for every service every 60s
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

      # Pre-compute error rate for every service every 60s
      - record: job:http_error_rate:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

Grafana then queries the pre-computed metric job:http_request_duration_p99:rate5m instead of recalculating it on every dashboard load — a massive performance win for large clusters.


5. Alertmanager: Alert Routing Pipeline

Prometheus fires alerts but does not send notifications itself. That job belongs to Alertmanager, which provides deduplication, grouping, inhibition, and routing:

Alertmanager Configuration Example

yaml
# alertmanager.yaml
global:
  slack_api_url: "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  receiver: "slack-default"
  group_by: ["alertname", "service", "namespace"]
  group_wait: 30s # Wait to batch related alerts
  group_interval: 5m # Wait before notifying about new alerts added to an existing group
  repeat_interval: 4h # Resend if still firing after 4h

  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      continue: true # Keep evaluating the sibling routes below after this match

    - match_re:
        team: "dba|infra"
      receiver: email-team

receivers:
  - name: "slack-default"
    slack_configs:
      - channel: "#alerts"
        title: "{{ .GroupLabels.alertname }} | {{ .GroupLabels.service }}"
        text: "{{ range .Alerts }}• {{ .Annotations.summary }}\n{{ end }}"

  - name: "pagerduty-oncall"
    pagerduty_configs:
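      # Alertmanager does not expand environment variables in its config;
      # read the key from a mounted secret file instead (example path)
      - routing_key_file: /etc/alertmanager/secrets/pd_routing_key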
        severity: "{{ .CommonLabels.severity }}"
        description: "{{ .CommonAnnotations.description }}"

inhibit_rules:
  # If the whole cluster is down, suppress individual pod alerts
  - source_match:
      alertname: ClusterDown
    target_match_re:
      alertname: "Pod.*"
    equal: ["cluster"]
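
The effect of `group_by` can be pictured with a short sketch. This is a simplified model written for this example, not Alertmanager's implementation: the function `group_alerts` and the alert dictionaries are assumptions, and timers, deduplication, and inhibition are left out. It only shows how alerts sharing the same values for the grouping labels end up in one notification group.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Partition alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical firing alerts from three pods.
alerts = [
    {"labels": {"alertname": "HighErrorRate", "service": "order-service",
                "namespace": "production", "pod": "order-1"}},
    {"labels": {"alertname": "HighErrorRate", "service": "order-service",
                "namespace": "production", "pod": "order-2"}},
    {"labels": {"alertname": "HighErrorRate", "service": "payment-service",
                "namespace": "production", "pod": "payment-1"}},
]

groups = group_alerts(alerts, ["alertname", "service", "namespace"])
```

The two `order-service` alerts share one group and therefore one notification, while the `payment-service` alert forms its own group.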

Prometheus Alerting Rules Example

yaml
# prometheus-alerts.yaml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          description: "Service {{ $labels.service }} has {{ $value | humanizePercentage }} error rate"

      - alert: HighP99Latency
        expr: job:http_request_duration_p99:rate5m > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms for {{ $labels.service }}"
          description: "P99 = {{ $value | humanizeDuration }} for {{ $labels.service }}"

      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[15m]) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
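
The `for:` clause in these rules is easiest to understand as a small state machine: an alert is pending while its expression is true, becomes firing once it has been true continuously for the configured duration, and resets as soon as the expression is false. The sketch below models that behaviour; `alert_states` and the evaluation history are assumptions for this example, not Prometheus code.

```python
def alert_states(condition_history, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None
    for index, condition in enumerate(condition_history):
        now = index * eval_interval_s
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Hypothetical history: a brief blip, a recovery, then a sustained breach.
history = [True, True, False, True, True, True, True, True, True]
states = alert_states(history, eval_interval_s=60, for_s=300)
```

The initial two-minute blip never leaves the pending state, which is how `for:` filters out short spikes at the cost of a delay before a genuine incident fires.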

Grafana: Visualization Architecture

Grafana is a multi-data-source visualization platform. It does not store metrics itself — it queries Prometheus (or Thanos, Mimir, Loki, etc.) and renders the results as panels on dashboards.
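
Since Grafana holds no metric data, each time-series panel ultimately becomes a range query against the data source. The sketch below builds such a request URL for the Prometheus HTTP API endpoint `/api/v1/query_range`. The base URL, the time range, and the helper name `query_range_url` are placeholders chosen for this example, and no request is actually sent.

```python
from urllib.parse import urlencode


def query_range_url(base_url, expr, start, end, step_seconds):
    """Build a Prometheus range-query URL for a PromQL expression."""
    params = urlencode({
        "query": expr,
        "start": start,        # Unix timestamp in seconds
        "end": end,            # Unix timestamp in seconds
        "step": step_seconds,  # resolution of the returned series
    })
    return f"{base_url}/api/v1/query_range?{params}"


url = query_range_url(
    "http://prometheus:9090",  # placeholder address
    "sum(rate(http_requests_total[5m])) by (service)",
    start=1717900800,
    end=1717904400,
    step_seconds=30,
)
```

A one-hour panel at a 30 s step asks for about 120 points per series, which is why dashboards over long time ranges benefit from larger steps or pre-computed recording rules.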

Grafana Architecture


Service Instrumentation: Code Examples

Go (prometheus/client_golang)

go
package main

import (
    "log"
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter: monotonically increasing request count
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests processed",
        },
        []string{"method", "path", "status"},
    )

    // Histogram: request duration with SLO-aligned buckets
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            // Buckets aligned to SLO thresholds (50ms to 2s), plus a 5s bucket for outliers
            Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0},
        },
        []string{"method", "path"},
    )

    // Gauge: current snapshot value
    activeConnections = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "active_db_connections",
            Help: "Current number of active database connections",
        },
        []string{"pool"},
    )
)

// responseWriter wraps http.ResponseWriter so the middleware can read the status code
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func newResponseWriter(w http.ResponseWriter) *responseWriter {
    // Default to 200: handlers that never call WriteHeader send 200 implicitly
    return &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

// Middleware that wraps any http.Handler with instrumentation
func instrumentHandler(path string, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Wrap ResponseWriter to capture status code
        rw := newResponseWriter(w)
        next.ServeHTTP(rw, r)

        // Record metrics after handler completes
        duration := time.Since(start).Seconds()
        // Use the numeric code ("500", not "Internal Server Error") so that
        // PromQL matchers such as status=~"5.." work
        status := strconv.Itoa(rw.statusCode)

        requestsTotal.WithLabelValues(r.Method, path, status).Inc()
        requestDuration.WithLabelValues(r.Method, path).Observe(duration)
    })
}

// handleOrders is a placeholder for the real order handler
func handleOrders() http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
}

func main() {
    mux := http.NewServeMux()
    mux.Handle("/orders", instrumentHandler("/orders", handleOrders()))

    // Prometheus scrape endpoint, exposed on a separate port so it is not publicly reachable
    metricsMux := http.NewServeMux()
    metricsMux.Handle("/metrics", promhttp.Handler())
    go func() {
        log.Fatal(http.ListenAndServe(":9090", metricsMux))
    }()

    log.Fatal(http.ListenAndServe(":8080", mux))
}

Java (Micrometer + Spring Boot Actuator)

groovy
// build.gradle
implementation 'io.micrometer:micrometer-registry-prometheus'
implementation 'org.springframework.boot:spring-boot-starter-actuator'

yaml
# application.yaml
management:
  endpoints:
    web:
      exposure:
        include: prometheus, health, info
  metrics:
    tags:
      application: order-service   # Common label on ALL metrics
      environment: production
    distribution:
      percentiles-histogram:
        http.server.requests: true  # Enable histogram buckets for latency
      slo:
        http.server.requests: 50ms, 100ms, 250ms, 500ms, 1s

java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

import java.time.Duration;

// Custom business metric in a Spring service
@Service
public class OrderService {

    private final Counter orderCounter;
    private final Timer  orderTimer;

    public OrderService(MeterRegistry registry) {
        this.orderCounter = Counter.builder("orders.processed.total")
            .description("Total orders processed")
            .tag("region", "us-east-1")
            .register(registry);

        this.orderTimer = Timer.builder("orders.processing.duration")
            .description("Time to fully process an order")
            .publishPercentileHistogram()   // enables histogram_quantile in PromQL
            .serviceLevelObjectives(Duration.ofMillis(200), Duration.ofMillis(500)) // replaces the deprecated sla()
            .register(registry);
    }

    public Order processOrder(OrderRequest req) {
        return orderTimer.record(() -> {
            Order order = validateAndPersist(req);
            orderCounter.increment();
            return order;
        });
    }
}

Python (prometheus_client + FastAPI)

python
from prometheus_client import Counter, Histogram, Gauge, make_asgi_app
from fastapi import FastAPI, Request
import time

# Define metrics at module level (registered globally)
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_code"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
IN_PROGRESS = Gauge(
    "http_requests_in_progress",
    "Number of HTTP requests currently being processed",
)

app = FastAPI()

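@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    IN_PROGRESS.inc()
    start = time.perf_counter()
    try:
        response = await call_next(request)
    finally:
        # Always decrement, even if the handler raises, so the gauge cannot drift upward
        IN_PROGRESS.dec()
    duration = time.perf_counter() - start

    # Caution: request.url.path is the raw path. Routes with path parameters
    # (e.g. /orders/{id}) create one series per distinct value.
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status_code=response.status_code,
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path,
    ).observe(duration)
    return response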

# Mount /metrics endpoint on the same app
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

Exporters: Instrumenting Third-Party Systems

Not all software can be modified to expose /metrics. Exporters are separate processes, often deployed as sidecars or per-node daemons, that translate a third-party system's native API into the Prometheus exposition format.

Commonly used exporters (some are maintained by the Prometheus project, others by the community):

| Exporter | Target System | Key Metrics Exposed |
| --- | --- | --- |
| node_exporter | Linux host | CPU, memory, disk I/O, network, filesystem |
| kube-state-metrics | Kubernetes API | Pod status, deployment replicas, PVC status, node taints |
| postgres_exporter | PostgreSQL | Connections, cache hit rate, table bloat, lock waits |
| redis_exporter | Redis | Hit rate, evictions, connected clients, memory usage |
| kafka_exporter | Apache Kafka | Consumer lag, topic offset, partition count |
| blackbox_exporter | External HTTP/TCP/DNS | Up/down status, SSL cert expiry, DNS resolution time |
| jmx_exporter | JVM applications | GC pause time, heap usage, thread count |
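
The translation step an exporter performs can be sketched briefly. The example below converts `key:value` status text (the general shape of output such as Redis `INFO`) into gauge lines. The function `info_to_exposition`, the `example` prefix, and the sample input are all hypothetical; real exporters define their own metric names, types, and labels.

```python
def info_to_exposition(info_text, prefix="example"):
    """Translate 'key:value' status lines into exposition-format gauges."""
    lines = []
    for raw in info_text.splitlines():
        raw = raw.strip()
        if not raw or raw.startswith("#") or ":" not in raw:
            continue  # skip blanks, section headers, and malformed lines
        key, value = raw.split(":", 1)
        try:
            number = float(value)
        except ValueError:
            continue  # non-numeric fields cannot become sample values
        name = f"{prefix}_{key}"
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {number}")
    return "\n".join(lines) + "\n"


# Hypothetical status output from a third-party system.
native_status = """
# Clients
connected_clients:12
# Memory
used_memory:1048576
version:7.2.4
"""

exposition = info_to_exposition(native_status)
```

A real exporter would run this translation on every scrape and serve the result over HTTP, so the monitored system itself needs no changes.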

Long-Term Storage: Thanos Architecture

Prometheus's local TSDB is limited by disk space. For long-term retention, global query views, and HA, the standard solution is Thanos (or Grafana Mimir/Cortex):


Grafana Dashboard Design Patterns

Dashboard as Code (Grafonnet / JSON)

Production teams store dashboards in Git, not manually in the UI. The standard approach is Grafonnet (Jsonnet library) or Grafana's JSON model:

json
{
  "title": "Order Service — SLO Dashboard",
  "uid": "order-slo-v2",
  "tags": ["production", "order-service", "slo"],
  "time": { "from": "now-1h", "to": "now" },
  "refresh": "30s",
  "panels": [
    {
      "id": 1,
      "title": "Request Rate (req/s)",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service='order-service'}[5m])) by (method)",
          "legendFormat": "{{ method }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 1000 }
            ]
          }
        }
      }
    },
    {
      "id": 2,
      "title": "P99 Latency (ms)",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "job:http_request_duration_p99:rate5m{service='order-service'} * 1000",
          "legendFormat": "P99"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 250 },
              { "color": "red", "value": 500 }
            ]
          }
        }
      }
    }
  ]
}
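
One way to keep dashboards in Git without hand-editing JSON is to generate the model from code. The sketch below does this in plain Python rather than Grafonnet; the helper names `timeseries_panel` and `service_dashboard` are assumptions for this example, and the output covers only a small subset of Grafana's dashboard JSON model.

```python
import json


def timeseries_panel(panel_id, title, expr, unit, legend):
    """Build one time-series panel backed by a PromQL expression."""
    return {
        "id": panel_id,
        "title": title,
        "type": "timeseries",
        "datasource": "Prometheus",
        "targets": [{"expr": expr, "legendFormat": legend}],
        "fieldConfig": {"defaults": {"unit": unit}},
    }


def service_dashboard(service):
    """Build a Rate / Errors / Duration dashboard for one service."""
    selector = f"{{service='{service}'}}"
    errors = f"{{service='{service}',status=~'5..'}}"
    return {
        "title": f"{service} RED Dashboard",
        "uid": f"{service}-red",
        "tags": ["generated", service],
        "time": {"from": "now-1h", "to": "now"},
        "refresh": "30s",
        "panels": [
            timeseries_panel(
                1, "Request Rate (req/s)",
                f"sum(rate(http_requests_total{selector}[5m])) by (method)",
                "reqps", "{{ method }}"),
            timeseries_panel(
                2, "Error Ratio",
                f"sum(rate(http_requests_total{errors}[5m]))"
                f" / sum(rate(http_requests_total{selector}[5m]))",
                "percentunit", "errors"),
            timeseries_panel(
                3, "P99 Latency (s)",
                "histogram_quantile(0.99, sum(rate("
                f"http_request_duration_seconds_bucket{selector}[5m])) by (le))",
                "s", "P99"),
        ],
    }


dashboard_json = json.dumps(service_dashboard("order-service"), indent=2)
```

Generating one dashboard per service from the same function keeps panels consistent across teams, and the rendered JSON can be committed and reviewed like any other artifact.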

The RED Method Dashboard Layout

Every service should have a standard RED dashboard (Rate, Errors, Duration):


Production Kubernetes Deployment

Prometheus Operator (kube-prometheus-stack)

The standard deployment uses the Prometheus Operator, which extends Kubernetes with CRDs for managing Prometheus instances declaratively:

yaml
# ServiceMonitor CRD — tells Prometheus to scrape order-service pods
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # Must match Prometheus' serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: order-service
  namespaceSelector:
    matchNames: ["production"]
  endpoints:
    - port: metrics # Named port in the Service spec
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
      relabelings:
        # Add a 'cluster' label to every scraped metric
        - targetLabel: cluster
          replacement: prod-us-east-1

yaml
# PrometheusRule CRD — defines alerting and recording rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: order-service-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: order-service.slo
      interval: 60s
      rules:
        - record: job:http_request_duration_p99:rate5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="order-service"}[5m]))
              by (le, service)
            )
        - alert: OrderServiceHighErrorRate
          # The recording rule aggregates by (service), so filter on that label
          expr: job:http_error_rate:rate5m{service="order-service"} > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "Order service error rate > 5%"

Helm Values for kube-prometheus-stack

yaml
# values.yaml (trimmed for key settings)
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "50GB"
    replicas: 2 # HA pair
    shards: 1
    resources:
      requests: { cpu: "1", memory: "4Gi" }
      limits: { cpu: "4", memory: "8Gi" }
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: premium-ssd
          resources:
            requests:
              storage: 200Gi
    remoteWrite:
      - url: "http://thanos-receive:10908/api/v1/receive"

grafana:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi
  grafana.ini:
    server:
      domain: grafana.company.com
    auth.google:
      enabled: true
      client_id: "${GOOGLE_CLIENT_ID}"
      client_secret: "${GOOGLE_CLIENT_SECRET}"
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: "git-dashboards"
          type: file
          options:
            path: /var/lib/grafana/dashboards
  dashboardsConfigMaps:
    order-service: "order-service-dashboards" # Mounted from ConfigMap

alertmanager:
  alertmanagerSpec:
    replicas: 3
    resources:
      requests: { memory: "256Mi" }

SLO Monitoring with Prometheus: Burn Rate Alerting

Modern SRE practices define Service Level Objectives (SLOs) and alert on burn rate — how fast you are consuming your error budget — rather than raw thresholds. This dramatically reduces alert fatigue.

Burn Rate Alert Rules (PromQL):

yaml
# 99.9% SLO = 0.001 error budget
# Burn rate = current error rate / (1 - SLO target)

- alert: SLOBudgetBurnCritical
  expr: |
    (
      sum(rate(http_requests_total{status=~"5..",service="order-service"}[1h]))
      /
      sum(rate(http_requests_total{service="order-service"}[1h]))
    ) > (14 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO burn rate critical — 14x budget burn over 1h window"

- alert: SLOBudgetBurnWarning
  expr: |
    (
      sum(rate(http_requests_total{status=~"5..",service="order-service"}[6h]))
      /
      sum(rate(http_requests_total{service="order-service"}[6h]))
    ) > (6 * 0.001)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "SLO burn rate elevated — 6x budget burn over 6h window"
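
The thresholds in these rules follow from simple arithmetic, sketched below. The functions `burn_rate` and `budget_consumed` are helpers written for this example, and the 30-day SLO window is an assumption, since the rules above do not state the window length.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def budget_consumed(rate, alert_window_hours, slo_window_days=30):
    """Fraction of the total error budget spent during the alert window."""
    return rate * alert_window_hours / (slo_window_days * 24)


# A 1.4% error ratio against a 99.9% SLO burns budget 14 times too fast.
critical_rate = burn_rate(0.014, 0.999)

# Budget spent if each alert's threshold is sustained for its full window.
consumed_by_critical = budget_consumed(14, alert_window_hours=1)
consumed_by_warning = budget_consumed(6, alert_window_hours=6)
```

Under the 30-day assumption, a 14x burn sustained for one hour uses about 2% of the monthly budget and a 6x burn sustained for six hours uses 5%, which is why the faster burn is paged at critical severity on a shorter window.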

Prometheus + Grafana vs Elastic APM: Complementary, Not Competing

These two stacks are designed to work together, not replace each other. They answer different questions:

| Question | Tool |
| --- | --- |
| Is our API error rate above 5%? | Prometheus + Grafana |
| Which specific endpoint is causing errors? | Prometheus + Grafana |
| Which line of code threw the exception? | Elastic APM |
| How long did the DB query in request #X take? | Elastic APM |
| What is the P99 latency across all 50 pods? | Prometheus + Grafana |
| What is the full trace of that 3s slow request? | Elastic APM |
| Is our Kafka consumer lag growing? | Prometheus + Grafana |
| Did this deploy increase error rates? | Prometheus + Grafana |

Summary: Prometheus + Grafana Architecture at a Glance

Decision Checklist: When to Use Prometheus + Grafana

  • ✅ You need infrastructure and application metrics monitoring (not traces)
  • ✅ You want 100% open-source, self-hosted, no vendor lock-in
  • ✅ You operate Kubernetes and need native cluster observability
  • ✅ You need SLO / burn rate alerting with multi-window rules
  • ✅ You have third-party systems (DBs, queues) needing exporters
  • ✅ Long-term metric retention (years) via Thanos/Mimir + object storage
  • ⚠️ For code-level request tracing → add Elastic APM or Jaeger
  • ⚠️ For log aggregation alongside metrics → add Grafana Loki


Released under the ISC License.