Monitoring & Observability — Core Concepts
Interview Relevance: High. "How do you know if your system is healthy?" comes up in most deep-dive rounds. Know the three pillars and how they work together.
Observability vs. Monitoring
The key insight: Monitoring tells you that something is broken. Observability tells you why. You need both.
The Three Pillars
Pillar 1 — Metrics
Metrics are numerical measurements over time — counts, gauges, histograms.
The Four Golden Signals (Google SRE)
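As a rough illustration, the four signals (latency, traffic, errors, saturation) can all be derived from one window of request records plus a resource reading. The sketch below is not tied to any monitoring library. The `Request` record, the `golden_signals` helper, and the use of a connection pool as the saturation source are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_s, pool_in_use, pool_size):
    """Summarise one observation window as the four golden signals."""
    latencies = sorted(r.latency_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    if latencies:
        # Nearest-rank P99 using integer arithmetic: ceil(0.99 * n).
        rank = (99 * len(latencies) + 99) // 100
        p99 = latencies[rank - 1]
    else:
        p99 = None
    return {
        "latency_p99_ms": p99,                                      # latency
        "traffic_rps": len(requests) / window_s,                    # traffic
        "error_rate": errors / len(requests) if requests else 0.0,  # errors
        "saturation": pool_in_use / pool_size,                      # saturation
    }


# Hypothetical 10-second window: 98 fast requests, one slow 5xx, one slow success.
window = [Request(10, 200)] * 98 + [Request(900, 500), Request(1200, 200)]
signals = golden_signals(window, window_s=10, pool_in_use=45, pool_size=50)
print(signals)
```

In a real system these values would normally come from the metrics backend instead of raw request lists. The point here is only that each signal answers a different question about the same traffic.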
Why P99 > Average
Scenario: 100 requests/second
Average latency = 50ms ← looks great!
But:
P50 (median): 10ms ← half of all requests finish within 10ms
P95: 120ms ← 5% of requests (5/sec) take 120ms or longer
P99: 800ms ← 1% of requests (1/sec) take 800ms or longer
P99.9: 5000ms ← 0.1% (1 request every 10 seconds) takes 5s or longer
If you serve 1M requests/day:
P99 = 10,000 requests/day taking 800ms or longer
→ You would never see this in averages!
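The gap between the average and the tail is easy to reproduce. The sketch below uses a synthetic sample of 1,000 latencies, chosen so the percentiles match the scenario above. The sample itself is an assumption, so its mean does not equal the 50ms in the scenario. `percentile` is a small nearest-rank helper written for this example, not a library function.

```python
import math
from fractions import Fraction
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    # Exact rational arithmetic keeps float rounding from nudging the rank up by one.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic sample of 1,000 request latencies in milliseconds (illustrative only).
latencies_ms = [10] * 900 + [120] * 89 + [800] * 9 + [5000] * 2

summary = {"mean": mean(latencies_ms)}
for p in (50, 95, 99, 99.9):
    summary[f"p{p}"] = percentile(latencies_ms, p)

for name, value in summary.items():
    print(f"{name:>6}: {value} ms")
```

For this sample the mean comes out under 40ms while P99 is 800ms and P99.9 is 5000ms, which is the effect the scenario describes.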
Always alert on percentiles, never on averages alone.
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Only ever increases; resets to zero when the process restarts | Total HTTP requests, errors |
| Gauge | Goes up and down | Current CPU %, active connections |
| Histogram | Observations counted into buckets; quantiles estimated at query time | Request duration (P50/P99) |
| Summary | Quantiles pre-calculated on the client | Like a histogram, but its quantiles cannot be aggregated across instances |
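To make the difference between the first three types concrete, here is a minimal standard-library sketch. It is not the Prometheus client API. The class names mirror the table, while the bucket bounds and the `quantile_upper_bound` method are assumptions for illustration. Real systems usually interpolate within a bucket, whereas this sketch returns the bucket's upper bound.

```python
import bisect


class Counter:
    """Monotonic count; only ever goes up."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that can move in either direction."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into 'less than or equal' buckets."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # final slot is the +Inf bucket
        self.total = 0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += 1

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-quantile (coarse estimate)."""
        target = q * self.total
        seen = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            seen += count
            if seen >= target:
                return bound


requests_total = Counter()
active_connections = Gauge()
duration_ms = Histogram([50, 100, 500, 1000])

for latency in [10] * 90 + [120] * 5 + [800] * 4 + [5000]:
    requests_total.inc()
    duration_ms.observe(latency)
active_connections.set(37)

print(requests_total.value, active_connections.value, duration_ms.counts)
```

The histogram stores only bucket counts, so its quantile estimates are only as precise as the bucket boundaries. That is the trade-off against a summary, which computes exact quantiles per process but cannot be merged across processes.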
Prometheus Architecture
Pillar 2 — Logs
Logs are immutable, timestamped records of discrete events — what happened, when, and to whom.
Structured vs. Unstructured Logs
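An unstructured line such as `short link created for user 42` has to be parsed with regular expressions, while a structured line carries named fields that a log store can index. The sketch below shows one possible way to emit JSON logs with Python's standard `logging` module. The field names (`service`, `trace_id`, `user_id`), the logger name, and the `build_logger` helper are assumptions for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            # Fields passed via `extra=` become attributes on the record.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("url-shortener")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "short link created",
    extra={"service": "shortener", "trace_id": "abc123", "user_id": 42},
)
print(buffer.getvalue().strip())
```

Because every field is a key, a query like "all ERROR lines for trace_id abc123" becomes a filter, not a text search.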
The Log Aggregation Pipeline
Log Levels — What to Log
TRACE: Extremely detailed (DB queries, function entry/exit) — dev only
DEBUG: Debugging info — dev/staging only
INFO: Normal operations (request received, order created) — production ✅
WARN: Unexpected but handled (retry succeeded, cache miss) — production ✅
ERROR: Something failed (payment declined, DB timeout) — production ✅ ALERT
FATAL: Service cannot continue — production ✅ PAGE ON-CALL
Pillar 3 — Distributed Tracing
Tracing tracks a single request as it flows across multiple services, showing exactly where time was spent.
The Problem Without Tracing
Trace + Span Model
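A trace can be thought of as a tree of spans that share one trace ID, where each span records its parent and its start and end times. The following is a minimal sketch of that model, not the data format of any particular tracer. The `Span` fields, the `self_times` helper, and the service names and timings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Time spent in each span excluding its direct children.

    Assumes the children of a span do not overlap in time.
    """
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0) + span.duration_ms
            )
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


# Hypothetical trace for one redirect request.
trace = [
    Span("t1", "a", None, "gateway", 0, 250),
    Span("t1", "b", "a", "auth", 5, 25),
    Span("t1", "c", "a", "shortener", 30, 240),
    Span("t1", "d", "c", "db-query", 40, 230),
]

times = self_times(trace)
bottleneck = max(times, key=times.get)
print(times, "slowest:", bottleneck)
```

Subtracting child time from each span is what turns a waterfall view into an answer to "where was the time actually spent".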
How Tracing Works (Context Propagation)
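Context propagation means the caller writes the trace ID and its own span ID into a request header, and the callee reads them to continue the same trace. The sketch below assumes the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`). The helper names are hypothetical, and a real service would normally let a tracing SDK do this instead of hand-rolling it.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_trace():
    """Create a new trace ID and root span ID."""
    return secrets.token_hex(16), secrets.token_hex(8)


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: write the current context into the outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: read (trace_id, parent_span_id), or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    return match.group(1), match.group(2)


def handle_downstream(headers):
    """Continue the caller's trace with a new span, or start a fresh trace."""
    context = extract(headers)
    if context is None:
        trace_id, parent_id = secrets.token_hex(16), None
    else:
        trace_id, parent_id = context
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,
    }


trace_id, root_span_id = start_trace()
outgoing = inject({}, trace_id, root_span_id)
child = handle_downstream(outgoing)
print(outgoing["traceparent"], child)
```

The child span keeps the caller's trace ID and records the caller's span ID as its parent, which is what lets the backend stitch spans from different services into one tree.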
Tracing tools:
| Tool | Type | Notes |
|---|---|---|
| Jaeger | Open source | CNCF project, Uber-originated |
| Zipkin | Open source | Twitter-originated, simple |
| AWS X-Ray | Managed | Native AWS integration |
| Datadog APM | Commercial | Full-stack observability |
| OpenTelemetry | Standard | Vendor-neutral SDK (collect once, send anywhere) |
Alerting Strategy
The SLO / SLA / SLI Framework
Error Budget
SLO: 99.9% availability over 30 days
Total minutes in 30 days: 43,200
Error budget = (1 - 0.999) × 43,200 = 43.2 minutes of downtime allowed
If you've used 40 of 43.2 minutes:
→ Freeze new deployments (protect remaining budget)
→ Focus on reliability work
If you've used only 10 of 43.2 minutes:
→ You can afford to take risks (deploy new features)
→ Error budget = permission to innovate
Effective Alert Design
Alert design principles:
- Alert on symptoms (user experience), not causes (CPU%)
- Use sustained conditions (5-min window), not instantaneous spikes
- Aim for < 5 actionable pages/week per on-call engineer
- Every alert must have a runbook (what to do when it fires)
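The 43.2-minute figure from the Error Budget section and the "sustained condition" principle can be combined into a burn-rate check. This is a minimal sketch under stated assumptions. The function names, the 2× threshold, and the 60-minute window are illustrative. Averaging per-minute error rates also ignores differences in traffic volume between minutes.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_rate, slo):
    """How fast the budget is being spent; 1.0 uses it up exactly at the end of the window."""
    return error_rate / (1 - slo)


def should_page(error_rates_per_min, slo, threshold=2.0, window=60):
    """Page only when the average burn rate over the last `window` minutes exceeds the threshold."""
    recent = error_rates_per_min[-window:]
    if len(recent) < window:
        return False  # not enough data for a sustained condition
    average_error_rate = sum(recent) / len(recent)
    return burn_rate(average_error_rate, slo) > threshold


SLO = 0.999
print(round(error_budget_minutes(SLO), 1), "minutes of budget per 30 days")

sustained = [0.003] * 60          # 0.3% errors for a full hour
one_spike = [0.0] * 59 + [0.05]   # a single bad minute
print(should_page(sustained, SLO), should_page(one_spike, SLO))
```

The sustained case pages because it burns budget at roughly three times the sustainable rate. The single spike does not, which keeps the alert focused on conditions that are still happening when the on-call engineer looks.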
Worked Example: Observability for the URL Shortener
Dashboard structure:
| Dashboard | Key Panels | Alert Threshold |
|---|---|---|
| Service Health | RPS, Error Rate, P99 latency | Error rate > 0.1% for 5min |
| Cache Performance | Hit rate, evictions, memory % | Hit rate < 90% |
| Database | Query duration P95, connection pool | P95 > 50ms |
| Infrastructure | CPU, memory, disk I/O per node | CPU > 80% sustained |
| SLO Dashboard | Error budget remaining (30-day) | Burn rate > 2× for 1hr |
Interview Cheat Sheet
One-Line Summaries
Metrics: Numerical time-series — cheap, aggregated (Prometheus/Datadog)
Logs: Structured event records — detailed, queryable (ELK/CloudWatch)
Traces: Request journey across services — reveals bottlenecks (Jaeger/X-Ray)
Four Golden Signals: Latency, Traffic, Errors, Saturation (Google SRE)
P99 vs Average: Average hides tail latency — always alert on percentiles
SLI: The metric (e.g., P95 latency)
SLO: Internal target (e.g., P95 latency < 200ms over a rolling 30-day window)
SLA: External contract with financial penalty
Error Budget: (1 - SLO) × time period — permission to take risk
OpenTelemetry: Vendor-neutral SDK — instrument once, export anywhere
The Interview Phrase
"I'd instrument the system with the four golden signals: latency
(P50/P95/P99), traffic (req/sec), error rate (5xx %), and saturation
(CPU/memory/connection pool). Metrics go to Prometheus + Grafana.
All services emit structured JSON logs with a trace_id field,
shipped via Fluentd to Elasticsearch. Distributed traces use
OpenTelemetry — context propagated via HTTP headers — collected by
Jaeger so I can see the end-to-end request waterfall across services.
Alerts fire on SLO burn rate — if we're burning the error budget
2× faster than the sustainable rate for an hour, the on-call is paged
with a runbook link."
Red Flags vs. Green Flags
| 🔴 Red Flag | 🟢 Green Flag |
|---|---|
| "Just add logging" | Describe the three pillars: metrics + logs + traces |
| Alert on average latency | Alert on P99 — averages hide tail latency |
| Unstructured plain-text logs | Structured JSON with trace_id, user_id, service name |
| No tracing between services | OpenTelemetry + Jaeger for cross-service visibility |
| Alert on every error | Alert on error rate over a time window |
| No mention of SLOs | Define SLI → SLO → error budget |
| Alert fatigue (too many alerts) | < 5 actionable alerts/week; symptom-based alerting |
IMPORTANT
Always include a trace_id in every log line and every API response header (X-Trace-ID). This single field ties a request's logs to its trace (and to metrics as well, where the metrics backend supports exemplars), so a slow request can be debugged in seconds.
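One way to avoid passing the trace ID to every log call by hand is to keep it in a context variable and let a logging filter stamp it on each record. The sketch below uses only the standard library. `handle_request`, `build_logger`, and the logger name are hypothetical. The `X-Trace-ID` header name comes from the note above.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID for whichever request is currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("trace-demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def handle_request(request_headers, logger):
    """Hypothetical handler: reuse the caller's trace ID or mint a new one."""
    trace_id = request_headers.get("X-Trace-ID") or uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request received")
        logger.info("redirect served")
        # Echo the ID back so a client can quote it in a bug report.
        return {"status": 302, "headers": {"X-Trace-ID": trace_id}}
    finally:
        trace_id_var.reset(token)


buffer = io.StringIO()
response = handle_request({"X-Trace-ID": "abc123"}, build_logger(buffer))
print(buffer.getvalue(), response)
```

Resetting the context variable in `finally` keeps one request's ID from leaking into log lines written after the handler returns.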
TIP
Mentioning OpenTelemetry as the instrumentation standard (vendor-neutral, collects once, exports to any backend) is a strong senior signal. It shows you've thought about avoiding vendor lock-in in your observability stack.
