Monitoring & Observability — Core Concepts

Interview Relevance: High — "How do you know if your system is healthy?" is asked in every deep-dive round. Know the three pillars and how they work together.

Observability vs. Monitoring

The key insight: Monitoring tells you that something is broken. Observability tells you why. You need both.

The Three Pillars

Pillar 1 — Metrics

Metrics are numerical measurements over time: counters, gauges, histograms.

The Four Golden Signals (Google SRE)

Why P99 > Average

Scenario: 100 requests/second

Average latency = 50ms  ← looks great!

But:
  P50 (median):  10ms  ← half of all requests finish within 10ms
  P95:          120ms  ← 5% of requests (5/sec) take 120ms or longer
  P99:          800ms  ← 1% of requests (1/sec) take 800ms or longer
  P99.9:       5000ms  ← 0.1% (1 request every 10 seconds) takes 5s or longer

If you serve 1M requests/day:
  P99 = 800ms means 10,000 requests/day take 800ms or longer
  → You would never see this in the average!

Always alert on percentiles, never on averages alone.
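To make the gap between the average and the tail concrete, here is a minimal sketch that computes both for a made-up latency sample. The sample values and the nearest-rank `percentile` helper are illustrative assumptions, not data from a real system.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample of 1000 request latencies in milliseconds:
# most requests are fast, a small tail is very slow.
latencies_ms = [10] * 900 + [120] * 80 + [800] * 15 + [5000] * 5

mean_ms = statistics.mean(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

In this sample the mean sits well under 100ms while P99 is 800ms, so an alert on the mean alone would stay quiet while 1 request in 100 is slow.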

Metric Types

Type	Description	Example
Counter	Only increases; resets to zero on restart	Total HTTP requests, errors
Gauge	Goes up and down	Current CPU %, active connections
Histogram	Bucket counts for distributions; quantiles estimated server-side	Request duration (P50/P99)
Summary	Quantiles pre-calculated client-side; cannot be aggregated across instances	Per-instance request duration quantiles
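The first three metric types can be sketched in a few lines of plain Python. This is a teaching sketch only: the class and method names below are hypothetical and are not the API of any particular metrics client library.

```python
import bisect


class Counter:
    """Monotonically increasing value; starts again from 0 when the process restarts."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """A value that can go up or down, such as in-flight requests."""

    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket; quantiles are estimated from the buckets later."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One slot per upper bound, plus a final overflow slot for larger values.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total += value
        self.count += 1


requests_total = Counter()
in_flight = Gauge()
duration_seconds = Histogram([0.05, 0.1, 0.5])

for duration in (0.01, 0.07, 0.1, 0.3, 2.0):
    requests_total.inc()
    duration_seconds.observe(duration)
in_flight.set(3)
```

The histogram stores only bucket counts, not raw samples, which is why its percentiles are estimates whose accuracy depends on how the bucket boundaries are chosen.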

Prometheus Architecture

Pillar 2 — Logs

Logs are immutable, timestamped records of discrete events — what happened, when, and to whom.

Structured vs. Unstructured Logs

The Log Aggregation Pipeline

Log Levels — What to Log

TRACE:   Extremely detailed (DB queries, function entry/exit) — dev only
DEBUG:   Debugging info — dev/staging only
INFO:    Normal operations (request received, order created) — production ✅
WARN:    Unexpected but handled (retry succeeded, cache miss) — production ✅
ERROR:   Something failed (payment declined, DB timeout) — production ✅ ALERT
FATAL:   Service cannot continue — production ✅ PAGE ON-CALL
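As a rough illustration of level filtering combined with structured output, the sketch below uses Python's standard `logging` module with a small JSON formatter. The field names (`ts`, `service`, `trace_id`) and the `build_logger` helper are assumptions chosen for the example. Note that Python's standard library has no TRACE level, and `FATAL` is an alias for `CRITICAL`.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # trace_id is passed by the caller through `extra`; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name, stream, level=logging.INFO):
    """Hypothetical helper: a logger that writes JSON lines at `level` and above."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


stream = io.StringIO()
log = build_logger("url-shortener", stream)

log.debug("cache lookup key=abc")  # dropped: DEBUG is below the INFO threshold
log.info("request received", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
log.error("db timeout", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})

records = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Because every line is a JSON object, an aggregation backend can filter on `level` or `trace_id` directly instead of parsing free text.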

Pillar 3 — Distributed Tracing

Tracing tracks a single request as it flows across multiple services, showing exactly where time was spent.

The Problem Without Tracing

Trace + Span Model

How Tracing Works (Context Propagation)
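One common way to propagate trace context over HTTP is the W3C Trace Context `traceparent` header, which carries a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and flags. The sketch below shows the idea under simplifying assumptions: the helper names are hypothetical, header validation is omitted, and header lookup is case-sensitive here although real HTTP headers are not.

```python
import secrets


def new_traceparent():
    """Start a new trace: fresh trace ID and span ID, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(header):
    """Split a traceparent value into its four dash-separated fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "flags": flags,
    }


def outgoing_headers(incoming_headers):
    """Continue the caller's trace if one is present, otherwise start a new one."""
    header = incoming_headers.get("traceparent")
    if header is None:
        return {"traceparent": new_traceparent()}
    context = parse_traceparent(header)
    child_span_id = secrets.token_hex(8)  # this service's own span
    return {
        "traceparent": f"00-{context['trace_id']}-{child_span_id}-{context['flags']}"
    }


# The edge service has no incoming context, so it starts the trace.
edge_headers = outgoing_headers({})
# A downstream service receives those headers and continues the same trace.
downstream_headers = outgoing_headers(edge_headers)

edge = parse_traceparent(edge_headers["traceparent"])
downstream = parse_traceparent(downstream_headers["traceparent"])
```

The trace ID stays constant across hops while each service contributes its own span ID, which is what lets a tracing backend stitch the spans into one request waterfall.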

Tracing tools:

Tool	Type	Notes
Jaeger	Open source	CNCF project, Uber-originated
Zipkin	Open source	Twitter-originated, simple
AWS X-Ray	Managed	Native AWS integration
Datadog APM	Commercial	Full-stack observability
OpenTelemetry	Standard	Vendor-neutral SDK (collect once, send anywhere)

Alerting Strategy

The SLO / SLA / SLI Framework

Error Budget

SLO: 99.9% availability over 30 days
Total minutes in 30 days: 43,200

Error budget = (1 - 0.999) × 43,200 = 43.2 minutes of downtime allowed

If you've used 40 of 43.2 minutes:
  → Freeze new deployments (protect remaining budget)
  → Focus on reliability work

If you've used only 10 of 43.2 minutes:
  → You can afford to take risks (deploy new features)
  → Error budget = permission to innovate
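The arithmetic above can be sketched as a couple of small functions. The 10% freeze threshold is an assumption picked for illustration; teams set their own policy.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def budget_remaining_fraction(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def should_freeze_deploys(slo, downtime_minutes, min_remaining=0.10):
    """Hypothetical policy: freeze when less than `min_remaining` of the budget is left."""
    return budget_remaining_fraction(slo, downtime_minutes) < min_remaining


budget = error_budget_minutes(0.999)             # about 43.2 minutes
freeze_at_40 = should_freeze_deploys(0.999, 40)  # little budget left
freeze_at_10 = should_freeze_deploys(0.999, 10)  # most of the budget left
```

With 40 of roughly 43.2 minutes spent, under 10% of the budget remains and the sketch recommends a freeze; with 10 minutes spent it does not.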

Effective Alert Design

Alert design principles:

Alert on symptoms (user experience), not causes (CPU%)
Use sustained conditions (5-min window), not instantaneous spikes
Aim for < 5 actionable pages/week per on-call engineer
Every alert must have a runbook (what to do when it fires)
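The "sustained condition" principle can be sketched as a check that fires only when every sample in the most recent window breaches the threshold. The class name, the one-sample-per-minute cadence, and the threshold value are assumptions for the example.

```python
from collections import deque


class SustainedAlert:
    """Fires only when every sample in the most recent window exceeds the threshold."""

    def __init__(self, threshold, window_size):
        self.threshold = threshold
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window_size)

    def observe(self, value):
        """Record one sample and return True if the alert should fire now."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)


# Error-rate samples taken once per minute; 5 samples approximate a 5-minute window.
alert = SustainedAlert(threshold=0.001, window_size=5)

# A single one-minute spike never fills the window with breaches.
spike = [alert.observe(v) for v in (0.0, 0.05, 0.0, 0.0, 0.0)]

# Five consecutive breaching minutes fire on the fifth sample.
sustained = [alert.observe(v) for v in (0.002,) * 5]
```

The momentary spike to 5% is ignored, while a smaller but persistent 0.2% error rate fires, which matches the goal of paging on lasting user impact rather than noise.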

Worked Example: Observability for the URL Shortener

Dashboard structure:

Dashboard	Key Panels	Alert Threshold
Service Health	RPS, Error Rate, P99 latency	Error rate > 0.1% for 5min
Cache Performance	Hit rate, evictions, memory %	Hit rate < 90%
Database	Query duration P95, connection pool	P95 > 50ms
Infrastructure	CPU, memory, disk I/O per node	CPU > 80% sustained
SLO Dashboard	Error budget remaining (30-day)	Burn rate > 2× for 1hr
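The burn-rate threshold in the last row can be computed from request counts. Burn rate is the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The function names below are hypothetical, and the caller is assumed to pass counts gathered over the alert window (for example, the last hour).

```python
def burn_rate(errors, total, slo):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    return (errors / total) / (1 - slo)


def days_until_budget_exhausted(rate, window_days=30):
    """At a constant burn rate, how long the full error budget would last."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


def should_page(errors, total, slo=0.999, max_burn_rate=2.0):
    """Page when the budget is being spent faster than `max_burn_rate` times sustainable."""
    return burn_rate(errors, total, slo) > max_burn_rate


# Counts over the last hour (illustrative numbers).
healthy = should_page(errors=5, total=10_000)    # 0.05% errors, burn rate about 0.5
burning = should_page(errors=30, total=10_000)   # 0.30% errors, burn rate about 3
```

At a burn rate of about 3, a 30-day budget would be gone in roughly 10 days, which is why a sustained high burn rate is a better paging signal than a raw error count.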

Interview Cheat Sheet

One-Line Summaries

Metrics:          Numerical time-series — cheap, aggregated (Prometheus/Datadog)
Logs:             Structured event records — detailed, queryable (ELK/CloudWatch)
Traces:           Request journey across services — reveals bottlenecks (Jaeger/X-Ray)
Four Golden Signals: Latency, Traffic, Errors, Saturation (Google SRE)
P99 vs Average:   Average hides tail latency — always alert on percentiles
SLI:              The metric (e.g., P95 latency)
SLO:              Internal target (e.g., P95 latency < 200ms over a rolling 30-day window)
SLA:              External contract with financial penalty
Error Budget:     (1 - SLO) × time period — permission to take risk
OpenTelemetry:    Vendor-neutral SDK — instrument once, export anywhere

The Interview Phrase

"I'd instrument the system with the four golden signals: latency
 (P50/P95/P99), traffic (req/sec), error rate (5xx %), and saturation
 (CPU/memory/connection pool). Metrics go to Prometheus + Grafana.
 All services emit structured JSON logs with a trace_id field,
 shipped via Fluentd to Elasticsearch. Distributed traces use
 OpenTelemetry — context propagated via HTTP headers — collected by
 Jaeger so I can see the end-to-end request waterfall across services.
 Alerts fire on SLO burn rate: if we're consuming the error budget
 at 2× the sustainable rate for an hour, the on-call is paged with
 a runbook link."

Red Flags vs. Green Flags

🔴 Red Flag	🟢 Green Flag
"Just add logging"	Describe the three pillars: metrics + logs + traces
Alert on average latency	Alert on P99 — averages hide tail latency
Unstructured plain-text logs	Structured JSON with trace_id, user_id, service name
No tracing between services	OpenTelemetry + Jaeger for cross-service visibility
Alert on every error	Alert on error rate over a time window
No mention of SLOs	Define SLI → SLO → error budget
Alert fatigue (too many alerts)	< 5 actionable pages/week per on-call engineer; symptom-based alerting

IMPORTANT

Always include a trace_id in every log line and return it in an API response header (e.g., X-Trace-ID). This single field links a slow request's logs to its trace, and to metrics when the metrics backend supports exemplars, which makes debugging that request possible in seconds.

TIP

Mentioning OpenTelemetry as the instrumentation standard (vendor-neutral: instrument once, export to any backend) is a strong senior signal. It shows you've thought about avoiding vendor lock-in in your observability stack.
