Application Performance Monitoring (APM): Prometheus + Grafana
While Elastic APM traces individual requests through your code, Prometheus + Grafana is the industry-standard stack for metrics-based monitoring — tracking numeric measurements over time across your entire infrastructure, from Kubernetes node CPU utilization down to application-level request rates and error counts. Together they form the backbone of observability for thousands of production systems at companies like Google, GitLab, Uber, and SoundCloud.
The Metrics Paradigm
A metric is a numeric measurement captured at a point in time, identified by a name and a set of key-value labels:
http_requests_total{service="order-service", method="POST", status="500"} 47 1717900800000

Metric name: http_requests_total
Labels: {service="order-service", method="POST", status="500"}
Value: 47
Timestamp (ms): 1717900800000

Unlike traces (which record individual requests) or logs (which record events as text), metrics are pre-aggregated numerical data that can be stored and queried at massive scale with minimal storage overhead.
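To make the "name plus labels" identity concrete, here is a minimal sketch in Python. The `store` dictionary and the `append` helper are hypothetical illustrations, not Prometheus internals; the point is only that label order is irrelevant while any change in a label value creates a new series.

```python
from typing import Dict, FrozenSet, List, Tuple

# A series is identified by its metric name and its complete label set.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Build an order-independent identity for a time series."""
    return (name, frozenset(labels.items()))


# Hypothetical in-memory store: series identity -> list of (timestamp_ms, value)
store: Dict[SeriesKey, List[Tuple[int, float]]] = {}


def append(name: str, labels: Dict[str, str], timestamp_ms: int, value: float) -> None:
    store.setdefault(series_key(name, labels), []).append((timestamp_ms, value))


# Same labels in a different order: lands in the same series.
append("http_requests_total",
       {"service": "order-service", "method": "POST", "status": "500"},
       1717900800000, 47.0)
append("http_requests_total",
       {"status": "500", "method": "POST", "service": "order-service"},
       1717900815000, 49.0)
# A different label value (status="200"): a separate series.
append("http_requests_total",
       {"service": "order-service", "method": "POST", "status": "200"},
       1717900800000, 12049.0)

print(len(store))  # number of distinct series
```

Because every distinct label combination is its own series, labels with unbounded values (user IDs, raw URLs) multiply the number of series a server has to track.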
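The Four Prometheus Metric Types
| Type | What it records |
|---|---|
| Counter | A cumulative value that only increases (and resets to zero on restart), e.g. http_requests_total |
| Gauge | A value that can go up or down, e.g. active_db_connections |
| Histogram | Observations counted into configurable buckets, plus a running sum and count, e.g. http_request_duration_seconds |
| Summary | Quantiles calculated inside the client, plus a running sum and count |

Key Design Insight: Prefer a Histogram over a Summary whenever you need quantiles across more than one instance. Histogram buckets can be summed across instances and then passed to histogram_quantile(). Summary quantiles are computed per client and cannot be correctly aggregated, a critical distinction in distributed systems.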
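The aggregation point can be illustrated with a small sketch. The function below is a simplified re-implementation of bucket interpolation in the spirit of histogram_quantile(), written for illustration only; the two instances and their bucket counts are made-up numbers.

```python
from typing import Dict

INF = float("inf")


def histogram_quantile(q: float, buckets: Dict[float, float]) -> float:
    """Estimate quantile q from cumulative bucket counts keyed by upper bound (le)."""
    bounds = sorted(buckets)
    rank = q * buckets[bounds[-1]]          # position of the requested observation
    prev_bound, prev_count = 0.0, 0.0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank:
            if bound == INF:
                return prev_bound           # cannot interpolate into the +Inf bucket
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


def merge(*histograms: Dict[float, float]) -> Dict[float, float]:
    """Sum cumulative bucket counts across instances (valid because buckets are counters)."""
    merged: Dict[float, float] = {}
    for histogram in histograms:
        for bound, count in histogram.items():
            merged[bound] = merged.get(bound, 0) + count
    return merged


# Hypothetical instances: one fast and busy, one slow and lightly loaded.
fast = {0.1: 990, 0.5: 1000, 1.0: 1000, INF: 1000}
slow = {0.1: 10, 0.5: 50, 1.0: 100, INF: 100}

per_instance_average = (histogram_quantile(0.99, fast) + histogram_quantile(0.99, slow)) / 2
merged_p99 = histogram_quantile(0.99, merge(fast, slow))

print(f"average of per-instance P99s: {per_instance_average:.3f}s")  # about 0.545s
print(f"P99 of merged buckets:        {merged_p99:.3f}s")            # about 0.890s
```

Averaging the two per-instance values treats both instances as equally important regardless of traffic, which is the same mistake made when averaging Summary quantiles. Merging buckets first weights every request equally.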
Full System Architecture
The following diagram shows the complete production architecture of a Prometheus + Grafana stack deployed in Kubernetes:
Core Components Deep Dive
1. Prometheus Server: Pull-Based Architecture
Prometheus uses a pull model — it actively scrapes /metrics HTTP endpoints on a configured interval, rather than waiting for services to push data to it. This is a deliberate architectural choice with major implications:
Why pull over push?
| Dimension | Pull (Prometheus) | Push (StatsD, InfluxDB) |
|---|---|---|
| Health visibility | A missing scrape = target is down (instant alerting) | Silent failure if agent stops pushing |
| Configuration | Centralized in Prometheus config | Distributed — each service must know the address |
| Back-pressure | Prometheus controls the rate; no thundering herd | Services can overwhelm the collector |
| Service discovery | First-class: K8s, Consul, EC2, DNS | Must be baked into the push agent |
| Short-lived jobs | Can miss jobs that finish between scrapes; batch jobs need the Pushgateway | Jobs push their results before exiting |
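The health-visibility row can be sketched as follows. This is a toy scrape loop, not Prometheus code: the `fetch` callables stand in for an HTTP GET on /metrics, and the target names are hypothetical. The idea it illustrates is that the scraper itself records a synthetic `up` value for every target on every scrape.

```python
from typing import Callable, Dict


def scrape_once(targets: Dict[str, Callable[[], str]]) -> Dict[str, int]:
    """Pull every target once; record up=1 on success and up=0 on any failure."""
    up: Dict[str, int] = {}
    for instance, fetch in targets.items():
        try:
            fetch()            # a real scraper would GET /metrics with a timeout
            up[instance] = 1
        except Exception:
            up[instance] = 0   # the failed scrape is itself a data point
    return up


def healthy_target() -> str:
    return "http_requests_total 42\n"


def crashed_target() -> str:
    raise ConnectionError("connection refused")


result = scrape_once({
    "order-service:8080": healthy_target,
    "payment-service:8080": crashed_target,
})
print(result)  # {'order-service:8080': 1, 'payment-service:8080': 0}
```

An alert on `up == 0` then covers every target without any cooperation from the monitored service.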
2. The /metrics Exposition Format
Every Prometheus-compatible service exposes a human-readable text endpoint:
# HELP http_requests_total Total number of HTTP requests processed
# TYPE http_requests_total counter
http_requests_total{service="order-service",method="GET",status="200"} 58432
http_requests_total{service="order-service",method="POST",status="201"} 12049
http_requests_total{service="order-service",method="POST",status="500"} 47
# HELP http_request_duration_seconds HTTP request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.005"} 8432
http_request_duration_seconds_bucket{le="0.01"} 15029
http_request_duration_seconds_bucket{le="0.025"} 28471
http_request_duration_seconds_bucket{le="0.05"} 41203
http_request_duration_seconds_bucket{le="0.1"} 52190
http_request_duration_seconds_bucket{le="0.25"} 57438
http_request_duration_seconds_bucket{le="0.5"} 58301
http_request_duration_seconds_bucket{le="1.0"} 58420
http_request_duration_seconds_bucket{le="+Inf"} 58432
http_request_duration_seconds_sum 4821.3
http_request_duration_seconds_count 58432
# HELP active_db_connections Current number of open database connections
# TYPE active_db_connections gauge
active_db_connections{pool="primary"} 23
active_db_connections{pool="replica"} 8

This format is simple enough to be generated by a shell script, and it is inexpensive for Prometheus to parse.
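As a rough illustration of how little is needed to produce this format, the sketch below renders a histogram from raw observations. The function name and sample observations are hypothetical; in practice a client library does this for you. It also makes one detail of the listing above explicit: `le` buckets are cumulative, so each bucket includes everything counted in the smaller ones.

```python
from typing import List


def render_histogram(name: str, bounds: List[float], observations: List[float]) -> str:
    """Render observations as Prometheus text-format histogram lines."""
    lines = [f"# TYPE {name} histogram"]
    for bound in bounds:
        # Cumulative: count every observation less than or equal to the bound
        count = sum(1 for value in observations if value <= bound)
        lines.append(f'{name}_bucket{{le="{bound}"}} {count}')
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(observations)}')
    lines.append(f"{name}_sum {sum(observations)}")
    lines.append(f"{name}_count {len(observations)}")
    return "\n".join(lines) + "\n"


text = render_histogram(
    "http_request_duration_seconds",
    bounds=[0.1, 0.5, 1.0],
    observations=[0.0625, 0.25, 0.25, 0.75, 4.0],
)
print(text)
```

The 4.0s observation only appears in the `+Inf` bucket, which is why the `+Inf` count always equals `_count`.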
3. Prometheus TSDB: Time-Series Database Internals
Prometheus stores data in its own embedded TSDB (Time-Series Database), optimized specifically for write-heavy, time-ordered data:
Storage efficiency: Prometheus compresses samples with Gorilla-style encoding: delta-of-delta for timestamps and XOR for float values. This averages approximately 1–2 bytes per sample. At a 15s scrape interval a series produces 5,760 samples per day, so one time series needs roughly 6–12 KB per day.
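A short sketch shows why regularly spaced timestamps compress so well and how the per-series storage estimate is derived. The timestamps are made up, and the code only computes the delta-of-delta values; it does not implement the bit-level encoding.

```python
from typing import List


def delta_of_delta(timestamps: List[int]) -> List[int]:
    """Second-order differences: what remains once the regular interval is factored out."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


# Six scrapes at a nominal 15s interval; the fourth arrives 2 ms late.
timestamps = [1717900800000 + 15000 * i for i in range(6)]
timestamps[3] += 2

dod = delta_of_delta(timestamps)
print(dod)  # [0, 2, -4, 2]: tiny numbers instead of 13-digit timestamps

# Back-of-the-envelope storage for one series
scrape_interval_s = 15
samples_per_day = 86400 // scrape_interval_s
for bytes_per_sample in (1, 2):
    print(f"{bytes_per_sample} B/sample -> {samples_per_day * bytes_per_sample} bytes/day")
```

Values near zero can be written in very few bits, which is where the low average bytes-per-sample figure comes from.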
4. PromQL: The Query Language
PromQL (Prometheus Query Language) is a functional expression language purpose-built for time-series data. It is the query engine that powers every Grafana panel and every alerting rule.
Instant Vector vs. Range Vector
# Instant vector: current value of all http_requests_total series
http_requests_total
# Range vector: last 5 minutes of samples for that metric
http_requests_total[5m]

The Golden Signal Queries
Rate (Requests per second):
# Per-second rate of requests over the last 5 minutes
sum(rate(http_requests_total[5m])) by (service)

Error Rate (%):
# Percentage of 5xx errors over all requests
100 * (
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
  sum(rate(http_requests_total[5m])) by (service)
)

Latency Percentile (P99):
# 99th percentile request duration over the last 5 minutes
histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)

Saturation (CPU utilization):
# CPU utilization per node (instance), as a percentage
100 - (
  avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100
)

Recording Rules: Pre-computing Expensive Queries
For frequently-used, expensive PromQL expressions (like P99 across a large cluster), use recording rules to pre-compute and store results as a new metric:
# prometheus-rules.yaml
groups:
  - name: slo_recording_rules
    interval: 60s
    rules:
      # Pre-compute P99 latency for every service every 60s
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(
            0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )
      # Pre-compute error rate for every service every 60s
      - record: job:http_error_rate:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

Grafana then queries the pre-computed metric job:http_request_duration_p99:rate5m instead of recalculating it on every dashboard load, a massive performance win for large clusters.
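Every query and recording rule above is built on rate(). The sketch below is a deliberately simplified model of it, assuming samples are given as (seconds, value) pairs: it handles counter resets but skips the extrapolation to window boundaries that Prometheus performs, so real results will differ slightly.

```python
from typing import List, Optional, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second increase of a counter over the given samples (oldest first)."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (process restart): the counter started again from zero
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical 60s window, scraped every 15s, with a restart before the 45s sample
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
print(simple_rate(window))  # 100 requests over 60s, about 1.67 req/s
```

This is why rate() should be applied to counters only, and why the range in brackets should span several scrape intervals: with fewer than two samples in the window there is nothing to compute.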
5. Alertmanager: Alert Routing Pipeline
Prometheus fires alerts but does not send notifications itself. That job belongs to Alertmanager, which provides deduplication, grouping, inhibition, and routing:
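Two of these stages, grouping and inhibition, can be sketched in a few lines. This is an illustrative model with made-up alerts, not Alertmanager's implementation; `group_alerts` and `inhibit` are hypothetical helper names, and the prefix match is a simplification of regular-expression matching.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, Dict[str, str]]


def inhibit(alerts: List[Alert], source_name: str, target_prefix: str,
            equal: List[str]) -> List[Alert]:
    """Drop target alerts while a source alert with the same `equal` labels is firing."""
    sources = [a for a in alerts if a["labels"].get("alertname") == source_name]
    kept = []
    for alert in alerts:
        labels = alert["labels"]
        is_target = labels.get("alertname", "").startswith(target_prefix)
        muted = is_target and any(
            all(labels.get(k) == s["labels"].get(k) for k in equal) for s in sources
        )
        if not muted:
            kept.append(alert)
    return kept


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Batch alerts that share the same values for the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


firing = [
    {"labels": {"alertname": "ClusterDown", "cluster": "prod"}},
    {"labels": {"alertname": "PodCrashLooping", "cluster": "prod", "service": "order"}},
    {"labels": {"alertname": "PodCrashLooping", "cluster": "staging", "service": "order"}},
    {"labels": {"alertname": "HighErrorRate", "cluster": "prod", "service": "order", "pod": "a"}},
    {"labels": {"alertname": "HighErrorRate", "cluster": "prod", "service": "order", "pod": "b"}},
]

remaining = inhibit(firing, source_name="ClusterDown", target_prefix="Pod", equal=["cluster"])
notifications = group_alerts(remaining, group_by=["alertname", "service"])
print(len(firing), "alerts ->", len(notifications), "notifications")
```

The prod pod alert is suppressed because ClusterDown is firing for the same cluster, and the two HighErrorRate alerts collapse into a single notification.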
Alertmanager Configuration Example
# alertmanager.yaml
global:
  slack_api_url: "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX"
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"
route:
  receiver: "slack-default"
  group_by: ["alertname", "service", "namespace"]
  group_wait: 30s       # Wait to batch related alerts before the first notification
  group_interval: 5m    # Wait before notifying about new alerts added to an existing group
  repeat_interval: 4h   # Resend if still firing after 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      continue: true    # Keep evaluating the sibling routes below instead of stopping here
    - match_re:
        team: "dba|infra"
      receiver: email-team
receivers:
  - name: "slack-default"
    slack_configs:
      - channel: "#alerts"
        title: "{{ .GroupLabels.alertname }} | {{ .GroupLabels.service }}"
        text: "{{ range .Alerts }}• {{ .Annotations.summary }}\n{{ end }}"
  - name: "pagerduty-oncall"
    pagerduty_configs:
      # Placeholder. Alertmanager templates have no env function, so inject the
      # key at deploy time or read it from a file with routing_key_file.
      - routing_key: "<pagerduty-routing-key>"
        severity: "{{ .CommonLabels.severity }}"
        description: "{{ .CommonAnnotations.description }}"
  # The email-team receiver referenced above must also be defined here (omitted in this example).
inhibit_rules:
  # If the whole cluster is down, suppress individual pod alerts
  - source_match:
      alertname: ClusterDown
    target_match_re:
      alertname: "Pod.*"
    equal: ["cluster"]

Prometheus Alerting Rules Example
# prometheus-alerts.yaml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          description: "Service {{ $labels.service }} has {{ $value | humanizePercentage }} error rate"
      - alert: HighP99Latency
        expr: job:http_request_duration_p99:rate5m > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 500ms for {{ $labels.service }}"
          description: "P99 = {{ $value | humanizeDuration }} for {{ $labels.service }}"
      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[15m]) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"

Grafana: Visualization Architecture
Grafana is a multi-data-source visualization platform. It does not store metrics itself — it queries Prometheus (or Thanos, Mimir, Loki, etc.) and renders the results as panels on dashboards.
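What "queries Prometheus" means in practice is an HTTP call to the Prometheus query API. The sketch below builds such a request for the range-query endpoint (/api/v1/query_range with query, start, end and step parameters). The base URL is an assumed in-cluster address and the helper name is hypothetical; no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def query_range_url(base_url: str, expr: str, start: int, end: int, step: str) -> str:
    """Build a range-query URL similar to what a dashboard panel would request."""
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


expr = "sum(rate(http_requests_total[5m])) by (service)"
url = query_range_url(
    "http://prometheus:9090",   # assumed address of the Prometheus server
    expr,
    start=1717900800,           # Unix seconds
    end=1717904400,             # one hour later
    step="30s",                 # one data point every 30s
)
print(url)
```

A one-hour panel at a 30s step asks Prometheus to evaluate the expression about 120 times per refresh, which is the cost that recording rules are meant to reduce.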
Grafana Architecture
Service Instrumentation: Code Examples
Go (prometheus/client_golang)
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: monotonically increasing request count
	requestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests processed",
		},
		[]string{"method", "path", "status"},
	)
	// Histogram: request duration with SLO-aligned buckets
	requestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "http_request_duration_seconds",
			Help: "HTTP request duration in seconds",
			// Buckets aligned to SLO thresholds (50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s)
			Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0},
		},
		[]string{"method", "path"},
	)
	// Gauge: current snapshot value
	activeConnections = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "active_db_connections",
			Help: "Current number of active database connections",
		},
		[]string{"pool"},
	)
)

// responseWriter wraps http.ResponseWriter to capture the status code.
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func newResponseWriter(w http.ResponseWriter) *responseWriter {
	// Default to 200: handlers that never call WriteHeader implicitly return OK.
	return &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

// Middleware that wraps any http.Handler with instrumentation
func instrumentHandler(path string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Wrap ResponseWriter to capture status code
		rw := newResponseWriter(w)
		next.ServeHTTP(rw, r)
		// Record metrics after handler completes
		duration := time.Since(start).Seconds()
		// Use the numeric code ("200", "500") so that status=~"5.." matches in PromQL
		status := strconv.Itoa(rw.statusCode)
		requestsTotal.WithLabelValues(r.Method, path, status).Inc()
		requestDuration.WithLabelValues(r.Method, path).Observe(duration)
	})
}

// handleOrders is a placeholder handler; the real business logic is not shown.
func handleOrders() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/orders", instrumentHandler("/orders", handleOrders()))
	// Prometheus scrape endpoint: expose on a separate port for security
	metricsMux := http.NewServeMux()
	metricsMux.Handle("/metrics", promhttp.Handler())
	go func() {
		log.Fatal(http.ListenAndServe(":9090", metricsMux))
	}()
	log.Fatal(http.ListenAndServe(":8080", mux))
}

Java (Micrometer + Spring Boot Actuator)
// build.gradle
// implementation 'io.micrometer:micrometer-registry-prometheus'
// implementation 'org.springframework.boot:spring-boot-starter-actuator'

# application.yaml
management:
  endpoints:
    web:
      exposure:
        include: prometheus, health, info
  metrics:
    tags:
      application: order-service   # Common label on ALL metrics
      environment: production
    distribution:
      percentiles-histogram:
        http.server.requests: true   # Enable histogram buckets for latency
      slo:
        http.server.requests: 50ms, 100ms, 250ms, 500ms, 1s

// Custom business metric in a Spring Service
import java.time.Duration;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderService {
    private final Counter orderCounter;
    private final Timer orderTimer;

    public OrderService(MeterRegistry registry) {
        this.orderCounter = Counter.builder("orders.processed.total")
            .description("Total orders processed")
            .tag("region", "us-east-1")
            .register(registry);
        this.orderTimer = Timer.builder("orders.processing.duration")
            .description("Time to fully process an order")
            .publishPercentileHistogram() // enables histogram_quantile in PromQL
            // serviceLevelObjectives replaces the deprecated sla(...) builder method
            .serviceLevelObjectives(Duration.ofMillis(200), Duration.ofMillis(500))
            .register(registry);
    }

    // Order, OrderRequest and validateAndPersist are application code defined elsewhere
    public Order processOrder(OrderRequest req) {
        return orderTimer.record(() -> {
            Order order = validateAndPersist(req);
            orderCounter.increment();
            return order;
        });
    }
}

Python (prometheus_client + FastAPI)
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app

# Define metrics at module level (registered globally)
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status_code"],
)
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
IN_PROGRESS = Gauge(
    "http_requests_in_progress",
    "Number of HTTP requests currently being processed",
)

app = FastAPI()


@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    IN_PROGRESS.inc()
    start = time.perf_counter()
    try:
        response = await call_next(request)
    finally:
        # Always decrement, even if the handler raises, so the gauge cannot drift upward
        IN_PROGRESS.dec()
    duration = time.perf_counter() - start
    # Note: request.url.path is the raw path. Paths that embed IDs create one
    # series per ID, so prefer the route template for the endpoint label.
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status_code=response.status_code,
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.url.path,
    ).observe(duration)
    return response


# Mount /metrics endpoint on the same app
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

Exporters: Instrumenting Third-Party Systems
Not all software can be modified to expose /metrics. Exporters are sidecar processes that translate a third-party system's native API into the Prometheus exposition format:
Commonly used exporters (a mix of official Prometheus projects and community-maintained ones):
| Exporter | Target System | Key Metrics Exposed |
|---|---|---|
| node_exporter | Linux host | CPU, memory, disk I/O, network, filesystem |
| kube-state-metrics | Kubernetes API | Pod status, deployment replicas, PVC status, node taints |
| postgres_exporter | PostgreSQL | Connections, cache hit rate, table bloat, lock waits |
| redis_exporter | Redis | Hit rate, evictions, connected clients, memory usage |
| kafka_exporter | Apache Kafka | Consumer lag, topic offset, partition count |
| blackbox_exporter | External HTTP/TCP/DNS | Up/down status, SSL cert expiry, DNS resolution time |
| jmx_exporter | JVM applications | GC pause time, heap usage, thread count |
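The translation step an exporter performs can be sketched as below. The input is a made-up "key:value" status dump loosely modelled on what some servers print, and the `demo_` prefix and metric names are hypothetical; real exporters use their own naming and also emit HELP and TYPE lines.

```python
def translate(native_stats: str, prefix: str) -> str:
    """Turn 'key:value' status lines into Prometheus text-format sample lines."""
    lines = []
    for raw in native_stats.splitlines():
        raw = raw.strip()
        if not raw or raw.startswith("#") or ":" not in raw:
            continue  # skip blanks, section headers and malformed lines
        key, value = raw.split(":", 1)
        try:
            number = float(value)
        except ValueError:
            continue  # metrics must be numeric; skip fields such as version strings
        lines.append(f"{prefix}_{key} {number}")
    return "\n".join(lines) + "\n"


NATIVE_STATUS = """# Clients
connected_clients:12
server_version:7.2.4
used_memory:1048576
"""

exposition = translate(NATIVE_STATUS, prefix="demo")
print(exposition)
```

A real exporter wraps this kind of translation in an HTTP handler, so that each Prometheus scrape triggers a fresh read of the target system.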
Long-Term Storage: Thanos Architecture
Prometheus's local TSDB is limited by disk space. For long-term retention, global query views, and HA, the standard solution is Thanos (or Grafana Mimir/Cortex):
Grafana Dashboard Design Patterns
Dashboard as Code (Grafonnet / JSON)
Production teams store dashboards in Git, not manually in the UI. The standard approach is Grafonnet (Jsonnet library) or Grafana's JSON model:
{
  "title": "Order Service - SLO Dashboard",
  "uid": "order-slo-v2",
  "tags": ["production", "order-service", "slo"],
  "time": { "from": "now-1h", "to": "now" },
  "refresh": "30s",
  "panels": [
    {
      "id": 1,
      "title": "Request Rate (req/s)",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{service='order-service'}[5m])) by (method)",
          "legendFormat": "{{ method }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 1000 }
            ]
          }
        }
      }
    },
    {
      "id": 2,
      "title": "P99 Latency (ms)",
      "type": "timeseries",
      "datasource": "Prometheus",
      "targets": [
        {
          "expr": "job:http_request_duration_p99:rate5m{service='order-service'} * 1000",
          "legendFormat": "P99"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "orange", "value": 250 },
              { "color": "red", "value": 500 }
            ]
          }
        }
      }
    }
  ]
}

The RED Method Dashboard Layout
Every service should have a standard RED dashboard (Rate, Errors, Duration):
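One way to keep that layout identical across services is to generate the panels rather than hand-edit them. The sketch below is an assumption-laden illustration: the `red_panels` helper is hypothetical, it emits only a minimal subset of Grafana's panel JSON, and it assumes the `service` label and metric names used elsewhere in this article.

```python
import json
from typing import Dict, List


def red_panels(service: str) -> List[Dict]:
    """Build Rate, Errors and Duration panels for one service."""
    selector = f'service="{service}"'
    queries = [
        ("Rate (req/s)",
         f"sum(rate(http_requests_total{{{selector}}}[5m]))",
         "reqps"),
        ("Errors (%)",
         f'100 * sum(rate(http_requests_total{{{selector},status=~"5.."}}[5m]))'
         f" / sum(rate(http_requests_total{{{selector}}}[5m]))",
         "percent"),
        ("Duration P99 (s)",
         "histogram_quantile(0.99, "
         f"sum(rate(http_request_duration_seconds_bucket{{{selector}}}[5m])) by (le))",
         "s"),
    ]
    return [
        {
            "id": panel_id,
            "title": title,
            "type": "timeseries",
            "datasource": "Prometheus",
            "targets": [{"expr": expr}],
            "fieldConfig": {"defaults": {"unit": unit}},
        }
        for panel_id, (title, expr, unit) in enumerate(queries, start=1)
    ]


dashboard = {"title": "order-service RED", "panels": red_panels("order-service")}
print(json.dumps(dashboard, indent=2))
```

Generating the JSON from one function means a change to the error-rate query lands on every service dashboard at once, which is the main argument for treating dashboards as code.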
Production Kubernetes Deployment
Prometheus Operator (kube-prometheus-stack)
The standard deployment uses the Prometheus Operator, which extends Kubernetes with CRDs for managing Prometheus instances declaratively:
# ServiceMonitor CRD: tells Prometheus to scrape order-service pods
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # Must match Prometheus' serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: order-service
  namespaceSelector:
    matchNames: ["production"]
  endpoints:
    - port: metrics   # Named port in the Service spec
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
      relabelings:
        # Add a 'cluster' label to every scraped metric
        - targetLabel: cluster
          replacement: prod-us-east-1
---
# PrometheusRule CRD: defines alerting and recording rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: order-service-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: order-service.slo
      interval: 60s
      rules:
        - record: job:http_request_duration_p99:rate5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="order-service"}[5m]))
              by (le, service)
            )
        - alert: OrderServiceHighErrorRate
          # The recording rule aggregates by (service), so the job label no longer
          # exists on this metric; select on service instead.
          expr: job:http_error_rate:rate5m{service="order-service"} > 0.05
          for: 5m
          labels:
            severity: critical
            team: backend
          annotations:
            summary: "Order service error rate > 5%"

Helm Values for kube-prometheus-stack
# values.yaml (trimmed for key settings)
prometheus:
  prometheusSpec:
    retention: 15d
    retentionSize: "50GB"
    replicas: 2   # HA pair
    shards: 1
    resources:
      requests: { cpu: "1", memory: "4Gi" }
      limits: { cpu: "4", memory: "8Gi" }
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: premium-ssd
          resources:
            requests:
              storage: 200Gi
    remoteWrite:
      - url: "http://thanos-receive:10908/api/v1/receive"
grafana:
  replicas: 2
  persistence:
    enabled: true
    size: 10Gi
  grafana.ini:
    server:
      domain: grafana.company.com
    auth.google:
      enabled: true
      client_id: "${GOOGLE_CLIENT_ID}"
      client_secret: "${GOOGLE_CLIENT_SECRET}"
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: "git-dashboards"
          type: file
          options:
            path: /var/lib/grafana/dashboards
  dashboardsConfigMaps:
    order-service: "order-service-dashboards"   # Mounted from ConfigMap
alertmanager:
  alertmanagerSpec:
    replicas: 3
    resources:
      requests: { memory: "256Mi" }

SLO Monitoring with Prometheus: Burn Rate Alerting
Modern SRE practices define Service Level Objectives (SLOs) and alert on burn rate — how fast you are consuming your error budget — rather than raw thresholds. This dramatically reduces alert fatigue.
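The arithmetic behind burn rate is short enough to work through. The sketch below assumes a 99.9% target measured over a 30-day window; the window length is an assumption for illustration, and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Time until the whole error budget is spent if the burn rate holds."""
    return window_days * 24 / rate


def budget_consumed(rate: float, alert_window_hours: float, window_days: int = 30) -> float:
    """Fraction of the total budget spent during one alert window at this burn rate."""
    return rate * alert_window_hours / (window_days * 24)


SLO_TARGET = 0.999          # 99.9% of requests succeed
observed_error_ratio = 0.014  # 1.4% of requests currently failing

rate = burn_rate(observed_error_ratio, SLO_TARGET)
print(f"burn rate: {rate:.1f}x")                                  # about 14x
print(f"budget gone in: {hours_to_exhaustion(14):.1f} h")         # about 51.4 h
print(f"budget spent in 1h at 14x: {budget_consumed(14, 1):.2%}")  # about 1.94%
print(f"budget spent in 6h at 6x:  {budget_consumed(6, 6):.2%}")   # 5.00%
```

This is why a fast burn over a short window pages someone while a slower burn over a longer window only warns: both thresholds correspond to a fixed, meaningful share of the budget rather than an arbitrary error percentage.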
Burn Rate Alert Rules (PromQL):
# 99.9% SLO = 0.001 error budget
# Burn rate = current error rate / (1 - SLO target)
- alert: SLOBudgetBurnCritical
  expr: |
    (
      sum(rate(http_requests_total{status=~"5..",service="order-service"}[1h]))
      /
      sum(rate(http_requests_total{service="order-service"}[1h]))
    ) > (14 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO burn rate critical: 14x budget burn over 1h window"
- alert: SLOBudgetBurnWarning
  expr: |
    (
      sum(rate(http_requests_total{status=~"5..",service="order-service"}[6h]))
      /
      sum(rate(http_requests_total{service="order-service"}[6h]))
    ) > (6 * 0.001)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "SLO burn rate elevated: 6x budget burn over 6h window"

Prometheus + Grafana vs Elastic APM: Complementary, Not Competing
These two stacks are designed to work together, not replace each other. They answer different questions:
| Question | Tool |
|---|---|
| Is our API error rate above 5%? | Prometheus + Grafana |
| Which specific endpoint is causing errors? | Prometheus + Grafana |
| Which line of code threw the exception? | Elastic APM |
| How long did the DB query in request #X take? | Elastic APM |
| What is the P99 latency across all 50 pods? | Prometheus + Grafana |
| What is the full trace of that 3s slow request? | Elastic APM |
| Is our Kafka consumer lag growing? | Prometheus + Grafana |
| Did this deploy increase error rates? | Prometheus + Grafana |
Summary: Prometheus + Grafana Architecture at a Glance
Decision Checklist: When to Use Prometheus + Grafana
- ✅ You need infrastructure and application metrics monitoring (not traces)
- ✅ You want 100% open-source, self-hosted, no vendor lock-in
- ✅ You operate Kubernetes and need native cluster observability
- ✅ You need SLO / burn rate alerting with multi-window rules
- ✅ You have third-party systems (DBs, queues) needing exporters
- ✅ Long-term metric retention (years) via Thanos/Mimir + object storage
- ⚠️ For code-level request tracing → add Elastic APM or Jaeger
- ⚠️ For log aggregation alongside metrics → add Grafana Loki
