
Monitoring & Observability — Core Concepts

Interview Relevance: High. "How do you know if your system is healthy?" comes up in most deep-dive rounds. Know the three pillars and how they work together.


Observability vs. Monitoring

The key insight: Monitoring tells you that something is broken. Observability tells you why. You need both.


The Three Pillars


Pillar 1 — Metrics

Metrics are numerical measurements over time — counts, gauges, histograms.

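The Four Golden Signals (Google SRE)

Latency (P50/P95/P99), traffic (requests/sec), errors (5xx rate), and saturation (CPU, memory, connection pool).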

Why P99 > Average

Scenario: 100 requests/second

Average latency = 50ms  ← looks great!

But:
  P50 (median):  10ms  ← half of all requests finish within 10ms
  P95:          120ms  ← 5% of requests = 5 req/sec take longer than 120ms
  P99:          800ms  ← 1% of requests = 1 req/sec takes longer than 800ms
  P99.9:       5000ms  ← 0.1% = 1 request every 10 seconds takes longer than 5s

If you have 1M users/day (one request each):
  The 1% beyond P99 = 10,000 users/day waiting 800ms or more
  → You would never see this in averages!

Always alert on percentiles, never on averages alone.
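
The gap between the average and the tail can be reproduced with a few lines of Python. This is a minimal sketch: the latency sample is made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank method rather than any particular monitoring library's algorithm.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]

# Hypothetical sample of 10,000 request latencies in milliseconds (not real data).
latencies_ms = [10] * 9000 + [120] * 800 + [800] * 180 + [5000] * 20

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "p99.9": percentile(latencies_ms, 99.9),
}

for name, value in summary.items():
    print(f"{name:>6}: {value} ms")
```

In this sample the mean is 43 ms while P99 is 800 ms, so an alert on the mean alone would stay quiet while 1 in 100 requests is slow.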

Metric Types

| Type | Description | Example |
|------|-------------|---------|
| Counter | Always increases; reset on restart | Total HTTP requests, errors |
| Gauge | Goes up and down | Current CPU %, active connections |
| Histogram | Bucket counts for distributions | Request duration (P50/P99) |
| Summary | Pre-calculated quantiles | Like histogram but computed client-side |
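
To make the difference between the types concrete, here is a rough in-memory sketch of a counter, a gauge, and a histogram. The class names and methods are illustrative assumptions, not the API of any real client library. The histogram here stores per-bucket counts for simplicity, whereas Prometheus exposes cumulative buckets.

```python
from bisect import bisect_left

class Counter:
    """Monotonic total: it only goes up (a process restart starts it again from zero)."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount

class Gauge:
    """Point-in-time value that can move in either direction."""
    def __init__(self):
        self.value = 0

    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations per upper bound; the last slot is the overflow (+Inf) bucket."""
    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left returns the index of the first bound that is >= value.
        self.bucket_counts[bisect_left(self.upper_bounds, value)] += 1
        self.total += value
        self.count += 1

# Hypothetical metrics for one service.
requests_total = Counter()
active_connections = Gauge()
request_duration_ms = Histogram([50, 200, 1000])

for duration in (12, 48, 50, 130, 950, 4000):
    requests_total.inc()
    request_duration_ms.observe(duration)
active_connections.set(7)
```

Percentiles such as P99 are then estimated from the bucket counts on the server side, which is why bucket boundaries should bracket the latency targets you care about.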

Prometheus Architecture
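
Prometheus uses a pull model: the server periodically scrapes an HTTP endpoint (conventionally `/metrics`) on each target and stores the samples as time series. The sketch below renders one metric family in the text exposition format that such an endpoint returns. `render_metric` is a hypothetical helper written for illustration, it skips label-value escaping, and a real service would normally use an official client library instead.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family as Prometheus text exposition lines.

    samples is a list of (labels_dict, value) pairs.
    Simplification: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical counter with two label combinations.
body = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "GET", "status": "500"}, 3),
    ],
)
print(body)
```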


Pillar 2 — Logs

Logs are immutable, timestamped records of discrete events — what happened, when, and to whom.

Structured vs. Unstructured Logs
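
An unstructured line such as `order created for user 42` has to be parsed with regular expressions before it can be queried, whereas a structured log is a machine-readable record with named fields. Below is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the service name, and the field names are assumptions chosen for illustration, and the output goes to an in-memory stream only so the example is self-contained.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(payload)

stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.setLevel(logging.INFO)  # production-style level: DEBUG is dropped
logger.addHandler(handler)
logger.propagate = False

logger.info("order created", extra={"trace_id": "abc123", "user_id": 42})
logger.debug("cart contents dumped")  # filtered out at INFO level

log_lines = stream.getvalue().splitlines()
entry = json.loads(log_lines[0])
print(entry)
```

Because every line is valid JSON, the aggregation backend can index `trace_id` and `user_id` directly and answer queries like "all ERROR lines for this trace" without custom parsing.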

The Log Aggregation Pipeline

Log Levels — What to Log

TRACE:   Extremely detailed (DB queries, function entry/exit) — dev only
DEBUG:   Debugging info — dev/staging only
INFO:    Normal operations (request received, order created) — production ✅
WARN:    Unexpected but handled (retry succeeded, cache miss) — production ✅
ERROR:   Something failed (payment declined, DB timeout) — production ✅ ALERT
FATAL:   Service cannot continue — production ✅ PAGE ON-CALL

Pillar 3 — Distributed Tracing

Tracing tracks a single request as it flows across multiple services, showing exactly where time was spent.

The Problem Without Tracing

Trace + Span Model
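
A trace is a tree of spans that share one trace ID; each span records a single operation, its start and end time, and the ID of its parent span. The sketch below models this with a dataclass and prints a simple waterfall. The field names, services, and timings are hypothetical and do not follow any specific tracing backend's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms

def render_waterfall(spans):
    """Return indented lines with children under their parents, ordered by start time."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            indent = "  " * depth
            lines.append(f"{indent}{span.service}:{span.operation} {span.duration_ms}ms")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines

# Hypothetical trace for one redirect request.
spans = [
    Span("t1", "a", None, "gateway", "GET /r/abc", 0, 120),
    Span("t1", "b", "a", "redirect-service", "resolve", 5, 115),
    Span("t1", "d", "b", "database", "select", 15, 110),
    Span("t1", "c", "b", "cache", "get", 10, 12),
]
waterfall = render_waterfall(spans)
print("\n".join(waterfall))
```

In this made-up trace the database span accounts for 95 of the 120 ms, which is the kind of answer the waterfall view is meant to give at a glance.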

How Tracing Works (Context Propagation)
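
Context propagation means each service forwards the trace ID and its own span ID to the next service, usually in an HTTP header. The sketch below uses the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`). The function names and the dict-based context are assumptions for illustration; a real system would normally rely on an OpenTelemetry SDK, and this simplified parser skips parts of the specification such as rejecting all-zero IDs and case-insensitive header lookup.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes = 16 hex characters

def start_trace():
    """Create the root context at the edge of the system."""
    return {"trace_id": secrets.token_hex(16), "span_id": new_span_id(), "parent_id": None}

def inject(context, headers):
    """Write the current context into outgoing request headers."""
    headers["traceparent"] = f"00-{context['trace_id']}-{context['span_id']}-01"
    return headers

def extract(headers):
    """Continue the caller's trace, or start a new one if the header is missing or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return start_trace()
    trace_id, caller_span_id, _flags = match.groups()
    return {"trace_id": trace_id, "span_id": new_span_id(), "parent_id": caller_span_id}

# Service A starts a trace and calls service B.
context_a = start_trace()
outgoing_headers = inject(context_a, {})
context_b = extract(outgoing_headers)
```

Service B keeps the same trace ID but gets a fresh span ID whose parent is A's span, which is what lets the backend stitch the spans into one tree.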

Tracing tools:

| Tool | Type | Notes |
|------|------|-------|
| Jaeger | Open source | CNCF project, Uber-originated |
| Zipkin | Open source | Twitter-originated, simple |
| AWS X-Ray | Managed | Native AWS integration |
| Datadog APM | Commercial | Full-stack observability |
| OpenTelemetry | Standard | Vendor-neutral SDK (collect once, send anywhere) |

Alerting Strategy

The SLO / SLA / SLI Framework
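
The three terms build on each other: the SLI is the measured metric, the SLO is the internal target for that metric, and the SLA is the external contract. As a rough sketch, the code below computes two SLIs from a batch of request records and checks them against targets. The record layout `(status_code, latency_ms)`, the sample data, and the target values are assumptions made for illustration.

```python
def availability_sli(requests):
    """SLI: fraction of requests that did not fail with a 5xx status."""
    good = sum(1 for status, _latency_ms in requests if status < 500)
    return good / len(requests)

def latency_sli(requests, threshold_ms):
    """SLI: fraction of requests served within the latency threshold."""
    fast = sum(1 for _status, latency_ms in requests if latency_ms <= threshold_ms)
    return fast / len(requests)

# Hypothetical internal targets (SLOs).
AVAILABILITY_SLO = 0.999
LATENCY_SLO = 0.995        # share of requests that must finish within 200 ms
LATENCY_THRESHOLD_MS = 200

# Hypothetical batch of 10,000 requests: (status_code, latency_ms).
requests = [(200, 35)] * 9985 + [(200, 450)] * 10 + [(503, 30)] * 5

availability = availability_sli(requests)
latency = latency_sli(requests, LATENCY_THRESHOLD_MS)

print(f"availability SLI={availability:.4f} meets SLO: {availability >= AVAILABILITY_SLO}")
print(f"latency SLI={latency:.4f} meets SLO: {latency >= LATENCY_SLO}")
```

An SLA would typically be set looser than these internal targets, so the team breaches its own SLO (and reacts) before the contract is at risk.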

Error Budget

SLO: 99.9% availability over 30 days
Total minutes in 30 days: 43,200

Error budget = (1 - 0.999) × 43,200 = 43.2 minutes of downtime allowed

If you've used 40 of 43.2 minutes:
  → Freeze new deployments (protect remaining budget)
  → Focus on reliability work

If you've used only 10 of 43.2 minutes:
  → You can afford to take risks (deploy new features)
  → Error budget = permission to innovate
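
The arithmetic above is easy to automate. This is a minimal sketch; the function names and the 10% freeze threshold are assumptions for illustration, and real policies vary by team.

```python
def error_budget_minutes(slo, window_days):
    """Allowed downtime in minutes: (1 - SLO) x total minutes in the window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

def budget_status(slo, window_days, downtime_used_minutes, freeze_below=0.10):
    """Return (remaining minutes, suggested action).

    freeze_below is a hypothetical policy knob: freeze deployments once less
    than this fraction of the budget remains.
    """
    budget = error_budget_minutes(slo, window_days)
    remaining = budget - downtime_used_minutes
    action = "freeze deployments" if remaining / budget < freeze_below else "ship features"
    return round(remaining, 1), action

print(round(error_budget_minutes(0.999, 30), 1))  # budget for 99.9% over 30 days
print(budget_status(0.999, 30, downtime_used_minutes=40))
print(budget_status(0.999, 30, downtime_used_minutes=10))
```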

Effective Alert Design

Alert design principles:

  • Alert on symptoms (user experience), not causes (CPU%)
  • Use sustained conditions (5-min window), not instantaneous spikes
  • Aim for < 5 actionable pages/week per on-call engineer
  • Every alert must have a runbook (what to do when it fires)
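
The "sustained condition" principle can be sketched as a sliding window that fires only when every recent sample breaches the threshold. The class name, the one-sample-per-minute assumption, and the error-rate values below are hypothetical; production alerting rules are usually expressed in the monitoring system's own rule language instead.

```python
from collections import deque

class SustainedAlert:
    """Fires only when every sample in a full window exceeds the threshold."""
    def __init__(self, threshold, window_size):
        self.threshold = threshold
        self.samples = deque(maxlen=window_size)  # oldest sample drops off automatically

    def observe(self, value):
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)

# Error rate in percent, one sample per minute (made-up values).
# Alert when the rate stays above 0.1% for 5 consecutive minutes.
alert = SustainedAlert(threshold=0.1, window_size=5)
error_rates = [0.05, 0.9, 0.04, 0.2, 0.3, 0.25, 0.4, 0.5, 0.02]
fired = [alert.observe(rate) for rate in error_rates]
print(fired)
```

The one-minute spike to 0.9% never fires the alert; only the fifth consecutive minute above the threshold does, which keeps pages tied to problems that persist.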

Worked Example: Observability for the URL Shortener

Dashboard structure:

| Dashboard | Key Panels | Alert Threshold |
|-----------|------------|-----------------|
| Service Health | RPS, Error Rate, P99 latency | Error rate > 0.1% for 5min |
| Cache Performance | Hit rate, evictions, memory % | Hit rate < 90% |
| Database | Query duration P95, connection pool | P95 > 50ms |
| Infrastructure | CPU, memory, disk I/O per node | CPU > 80% sustained |
| SLO Dashboard | Error budget remaining (30-day) | Burn rate > 2× for 1hr |
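
The burn-rate threshold in the last row can be sketched as follows. Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window and a burn rate of 2 spends it in half that time. The function names and the request counts are assumptions for illustration.

```python
def burn_rate(errors, total, slo):
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    return (errors / total) / (1 - slo)

def should_page(errors, total, slo, max_burn_rate=2.0):
    """Page when the measurement window burns budget faster than max_burn_rate."""
    return burn_rate(errors, total, slo) > max_burn_rate

SLO = 0.999  # 99.9% of requests must succeed

# Hypothetical one-hour windows: (failed requests, total requests).
bad_hour = (3_000, 1_200_000)   # 0.25% errors
calm_hour = (1_000, 1_000_000)  # 0.10% errors, right at the allowed rate

print(round(burn_rate(*bad_hour, SLO), 2), should_page(*bad_hour, SLO))
print(round(burn_rate(*calm_hour, SLO), 2), should_page(*calm_hour, SLO))
```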

Interview Cheat Sheet

One-Line Summaries

Metrics:          Numerical time-series — cheap, aggregated (Prometheus/Datadog)
Logs:             Structured event records — detailed, queryable (ELK/CloudWatch)
Traces:           Request journey across services — reveals bottlenecks (Jaeger/X-Ray)
Four Golden Signals: Latency, Traffic, Errors, Saturation (Google SRE)
P99 vs Average:   Average hides tail latency — always alert on percentiles
SLI:              The metric (e.g., P95 latency)
SLO:              Internal target (e.g., P95 latency < 200ms over a 30-day window)
SLA:              External contract with financial penalty
Error Budget:     (1 - SLO) × time period — permission to take risk
OpenTelemetry:    Vendor-neutral SDK — instrument once, export anywhere

The Interview Phrase

"I'd instrument the system with the four golden signals: latency
 (P50/P95/P99), traffic (req/sec), error rate (5xx %), and saturation
 (CPU/memory/connection pool). Metrics go to Prometheus + Grafana.
 All services emit structured JSON logs with a trace_id field,
 shipped via Fluentd to Elasticsearch. Distributed traces use
 OpenTelemetry — context propagated via HTTP headers — collected by
 Jaeger so I can see the end-to-end request waterfall across services.
 Alerts fire on SLO burn rate: if we're consuming the error budget
 2× faster than the SLO allows for an hour, the on-call is paged with a
 runbook link."

Red Flags vs. Green Flags

| 🔴 Red Flag | 🟢 Green Flag |
|-------------|---------------|
| "Just add logging" | Describe the three pillars: metrics + logs + traces |
| Alert on average latency | Alert on P99; averages hide tail latency |
| Unstructured plain-text logs | Structured JSON with trace_id, user_id, service name |
| No tracing between services | OpenTelemetry + Jaeger for cross-service visibility |
| Alert on every error | Alert on error rate over a time window |
| No mention of SLOs | Define SLI → SLO → error budget |
| Alert fatigue (too many alerts) | < 5 actionable alerts/week; symptom-based alerting |

IMPORTANT

Always include a trace_id in every log line and every API response header (X-Trace-ID). This single field ties logs and traces together (and metrics too, where the backend supports exemplars), so a slow request can be debugged in seconds.

TIP

Mentioning OpenTelemetry as the instrumentation standard (vendor-neutral: instrument once, export to any backend) is a strong senior signal. It shows you've thought about avoiding vendor lock-in in your observability stack.

Released under the ISC License.