Application Performance Monitoring (APM): Elastic APM
Modern distributed systems are composed of dozens of microservices, queues, and databases. When a user reports slowness, pinpointing the root cause across that complexity requires Application Performance Monitoring (APM). Elastic APM is an open-source APM solution that plugs natively into the ELK Stack (Elasticsearch, Logstash, Kibana), giving engineering teams end-to-end transaction tracing, service dependency mapping, and anomaly detection — all in one platform.
What is APM?
APM is the practice of automatically instrumenting your application to capture telemetry data at the code level, not just the infrastructure level. This includes:
| Signal | What it captures | Example |
|---|---|---|
| Traces | Full lifecycle of a single request across all services | User → API → Auth → DB → Cache → Response |
| Spans | Individual units of work within a trace | A single SQL query taking 45 ms |
| Transactions | Top-level operations (HTTP request, message consumer job) | GET /api/orders/{id} |
| Errors | Captured exceptions with full stack trace | NullPointerException in OrderService |
| Metrics | JVM heap, CPU, GC pauses, HTTP request rates at agent level | system.cpu.total.norm.pct = 0.82 |
Key insight: Infrastructure monitoring (Prometheus/Grafana) tells you that a server is at 95% CPU. APM tells you which transaction, span, and code path (down to the stack trace) is responsible.
The ELK Stack + APM: A Unified Observability Platform
Before diving into architecture, it's important to understand how Elastic APM extends the classic ELK stack:
The APM Server acts as a protocol translator: it receives events from language agents over HTTP as newline-delimited JSON (optionally compressed), validates and enriches them, and converts them into structured Elasticsearch documents. This keeps agents lightweight and makes Elasticsearch the single source of truth.
Full System Architecture
The following diagram shows the complete end-to-end production architecture of Elastic APM deployed in a Kubernetes environment:
Core Components Deep Dive
1. APM Agents
APM Agents are language-native libraries that auto-instrument your application code. They use techniques like bytecode manipulation (Java), monkey-patching (Python/Node.js), and middleware hooks (Go) to transparently capture traces without requiring you to change your business logic.
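To make the monkey-patching idea concrete, here is a minimal sketch of how a library function can be wrapped at startup so that every call is timed without touching the calling code. It is not the Elastic agent's implementation; `instrument`, `recorded_spans`, and `FakeDbClient` are hypothetical names used only for illustration.

```python
import functools
import time

# Hypothetical in-memory sink; a real agent would queue spans for the APM Server.
recorded_spans = []


def instrument(owner, attr_name, span_type):
    """Replace owner.attr_name with a wrapper that records a timed span."""
    original = getattr(owner, attr_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Runs on success and on exceptions, so failed calls are timed too.
            recorded_spans.append({
                "name": f"{owner.__name__}.{attr_name}",
                "type": span_type,
                "duration_ms": (time.perf_counter() - start) * 1000.0,
            })

    setattr(owner, attr_name, wrapper)


class FakeDbClient:
    """Stand-in for a third-party database driver."""

    def query(self, sql):
        return [("row", sql)]


# Patch once at startup; business code keeps calling query() unchanged.
instrument(FakeDbClient, "query", span_type="db")
rows = FakeDbClient().query("SELECT 1")
```

The caller still sees the original return value, while `recorded_spans` now holds one entry describing the call.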
Key design decisions in agents:
| Decision | Rationale |
|---|---|
| Async batch flushing | Agents never block the application's request thread |
| Head-based sampling | Sampling decision made at trace start to ensure consistent trace capture |
| W3C TraceContext | Standard HTTP header (traceparent) enables cross-service correlation |
| Circuit breaker | Agent stops sending and drops queued events if APM Server is unreachable, so the application keeps serving traffic |
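The first and last rows of the table can be sketched together: a bounded queue that the request thread writes to without waiting, and a background worker that sends batches and swallows backend failures. This is an illustrative sketch, not the agents' actual code; `BatchingReporter`, its parameters, and the `send` callable are assumptions.

```python
import queue
import threading


class BatchingReporter:
    """Queue events without blocking the caller; a worker thread sends batches."""

    def __init__(self, send, max_queue=1000, batch_size=50, flush_interval=1.0):
        self._send = send  # callable that takes a list of events
        self._queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._stop = threading.Event()
        self.dropped = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, event):
        """Called on the request thread: never waits, drops when the queue is full."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1

    def _drain(self):
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _flush_all(self):
        while True:
            batch = self._drain()
            if not batch:
                return
            try:
                self._send(batch)
            except Exception:
                # A failing backend must not crash the app; the batch is lost.
                self.dropped += len(batch)

    def _run(self):
        # Wake up every flush_interval seconds until close() is called.
        while not self._stop.wait(self._flush_interval):
            self._flush_all()
        self._flush_all()  # final flush on shutdown

    def close(self):
        self._stop.set()
        self._worker.join()
```

The trade-off is deliberate: under overload or an outage the reporter loses telemetry rather than adding latency to user requests.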
2. APM Server
The APM Server is a stateless Go binary that sits between agents and Elasticsearch. Its responsibilities:
Tail-based sampling is a critical APM Server feature: instead of deciding at the start of a trace whether to record it (head-based), the server can wait until the full trace is complete, then keep it if it contains an error or exceeds a latency threshold. This maximizes the value of every stored trace.
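The decision logic described above can be sketched as follows. This is a simplified model, not the APM Server's implementation (which is configured through sampling policies); `TailSampler`, the `outcome` field values, and the default thresholds are assumptions for illustration.

```python
import random
from collections import defaultdict


class TailSampler:
    """Buffer events per trace and decide whether to keep the trace once it ends."""

    def __init__(self, latency_threshold_ms=500.0, baseline_rate=0.1, rng=None):
        self.latency_threshold_ms = latency_threshold_ms
        self.baseline_rate = baseline_rate  # share of unremarkable traces to keep
        self._rng = rng or random.Random()
        self._buffer = defaultdict(list)  # trace_id -> buffered events

    def add_event(self, trace_id, event):
        self._buffer[trace_id].append(event)

    def finish_trace(self, trace_id, root_duration_ms):
        """Return the events to store, or an empty list if the trace is discarded."""
        events = self._buffer.pop(trace_id, [])
        has_error = any(e.get("outcome") == "failure" for e in events)
        is_slow = root_duration_ms >= self.latency_threshold_ms
        if has_error or is_slow or self._rng.random() < self.baseline_rate:
            return events
        return []
```

Errors and slow traces are always kept, while ordinary traces are kept only at the baseline rate, which is what keeps storage volume low without losing the interesting cases.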
3. Elasticsearch Data Streams
Elastic APM uses Elasticsearch Data Streams — a time-series optimized index management strategy. Each signal type gets its own stream:
| Data Stream Pattern | Content |
|---|---|
| traces-apm-{namespace} | Distributed trace spans & transactions |
| metrics-apm.app.{service}-{namespace} | Service-specific application metrics reported by agents |
| logs-apm.error-{namespace} | Captured exceptions with stack traces |
| metrics-apm.internal-{namespace} | Built-in agent and system metrics (CPU, heap, GC) |
ILM (Index Lifecycle Management) automatically rolls over and deletes old indices, ensuring storage costs stay predictable:
Hot (SSD, 7 days) → Warm (HDD, 30 days) → Cold (Snapshot, 90 days) → Delete
4. Kibana APM UI
Kibana is the operational frontend. Its APM UI surfaces the data in four key views:
Distributed Tracing: End-to-End Example
This is the heart of APM. Consider a user placing an order. Let's trace the entire journey:
Scenario: POST /orders with a slow payment provider
The waterfall immediately reveals the bottleneck: the Stripe API call takes 540ms, which is 64% of the total user-perceived latency. Without distributed tracing, an engineer would see "the orders endpoint is slow" but have no idea which downstream dependency is responsible.
Trace Propagation via W3C traceparent Header
When Order Service calls Payment Service, it injects a traceparent header:
```http
POST /payments HTTP/1.1
Host: payment-service.internal
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
            ver trace-id (same across ALL hops)  parent-span-id   flags
Content-Type: application/json

{ "orderId": "ord_9a2b", "amount": 99.99 }
```
Every service reads this header and creates child spans under the same trace-id, which is how Kibana stitches together the full waterfall from data written independently by four different services.
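As a rough sketch of what each service does with the header, the following parses a `traceparent` value and builds the header for the next outgoing call. The four-field layout follows the W3C Trace Context format shown above; `TraceContext`, `parse_traceparent`, and `child_header` are hypothetical helper names, and the validation is intentionally minimal.

```python
import re
import secrets
from typing import NamedTuple, Optional

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # Bit 0 of the flags byte carries the head-based sampling decision.
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> TraceContext:
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    ctx = TraceContext(*match.groups())
    if ctx.trace_id == "0" * 32 or ctx.parent_id == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is not valid")
    return ctx


def child_header(ctx: TraceContext, span_id: Optional[str] = None) -> str:
    """Header for an outgoing call: same trace-id, the caller's span id as parent."""
    span_id = span_id or secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    return f"{ctx.version}-{ctx.trace_id}-{span_id}-{ctx.flags}"
```

The trace-id and flags pass through unchanged; only the parent field is replaced at each hop, which is what lets the backend rebuild the parent/child tree.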
Real User Monitoring (RUM)
Elastic APM includes a JavaScript RUM agent for browser-side performance monitoring. It captures:
- Page load performance (using Navigation Timing API)
- Long tasks (JS that blocks the main thread > 50ms)
- User interactions (clicks, route changes in SPAs)
- AJAX/Fetch calls (correlating browser XHR back to backend traces)
Service Map: Auto-Generated Dependency Graph
One of Elastic APM's most powerful features is the Service Map — automatically constructed from trace data, no manual configuration required.
The service map is a continuously updated topology of your system, derived from recent trace data. Latency degradation in a service such as Payment Service is surfaced visually on its node in the map, so the affected dependency is visible at a glance.
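The core idea of deriving a dependency graph from trace data can be sketched in a few lines: whenever a span's parent belongs to a different service, that is an edge in the map. This is a simplified illustration, not how Kibana computes the map; the span dictionaries and the `service_edges` function are assumptions.

```python
from collections import Counter


def service_edges(spans):
    """Count caller -> callee edges from parent/child links between spans."""
    service_by_span = {s["id"]: s["service"] for s in spans}
    edges = Counter()
    for s in spans:
        parent_service = service_by_span.get(s.get("parent_id"))
        # Only cross-service links are dependencies; in-process spans are skipped.
        if parent_service is not None and parent_service != s["service"]:
            edges[(parent_service, s["service"])] += 1
    return edges


# Hypothetical spans from one trace.
example_spans = [
    {"id": "a", "parent_id": None, "service": "api-gateway"},
    {"id": "b", "parent_id": "a", "service": "order-service"},
    {"id": "c", "parent_id": "b", "service": "order-service"},
    {"id": "d", "parent_id": "c", "service": "payment-service"},
]
example_edges = service_edges(example_spans)
```

Aggregating these edges over many traces yields the graph, and attaching latency or error statistics to each edge gives the health overlay.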
Alerting Architecture
Elastic APM integrates with Kibana's alerting engine to trigger alerts based on APM-specific rules:
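Conceptually, an error-rate rule evaluates a sliding window on every check and compares the result against a threshold, for example 5% over the last 5 minutes. The sketch below models that logic only; it is not Kibana's rule engine, and the transaction dictionaries, field names, and function names are assumptions.

```python
from datetime import datetime, timedelta


def error_rate_percent(transactions, now, window=timedelta(minutes=5)):
    """Percentage of failed transactions whose timestamp falls inside the window."""
    recent = [t for t in transactions if now - window <= t["timestamp"] <= now]
    if not recent:
        return 0.0
    failed = sum(1 for t in recent if t["outcome"] == "failure")
    return 100.0 * failed / len(recent)


def should_alert(transactions, now, threshold_percent=5.0):
    """Fire only when the rate is strictly above the threshold."""
    return error_rate_percent(transactions, now) > threshold_percent


# Hypothetical data: 20 recent transactions, 2 of them failed.
now = datetime(2024, 1, 1, 12, 0, 0)
example = [
    {"timestamp": now - timedelta(seconds=10 * i),
     "outcome": "failure" if i < 2 else "success"}
    for i in range(20)
]
# A failure from 10 minutes ago lies outside the window and is ignored.
example.append({"timestamp": now - timedelta(minutes=10), "outcome": "failure"})
```

Evaluating a rate rather than a raw error count keeps the rule meaningful across traffic peaks and quiet periods.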
Alert Rule Example: Transaction Error Rate
```yaml
# Kibana Alerting Rule (conceptual YAML representation)
rule:
  name: "Payment Service Error Rate > 5%"
  type: "apm.transaction_error_rate"
  schedule: "every 1 minute"
  params:
    windowSize: 5
    windowUnit: "minutes"
    threshold: 5 # 5% error rate
    serviceName: "payment-service"
    transactionType: "request"
  actions:
    - connector: "pagerduty"
      params:
        severity: "critical"
        summary: "Payment service error rate exceeded 5% threshold"
    - connector: "slack"
      params:
        channel: "#incidents"
        message: "🚨 payment-service error rate: {{context.triggerValue}}%"
```
Production Deployment: Kubernetes Helm Architecture
In production, all Elastic Stack components are deployed via the Elastic Cloud on Kubernetes (ECK) operator:
ECK Elasticsearch Custom Resource
```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: production-apm
  namespace: elastic-system
spec:
  version: 8.13.0
  nodeSets:
    - name: masters
      count: 3
      config:
        node.roles: ["master"]
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests: { memory: "4Gi", cpu: "1" }
                limits: { memory: "4Gi", cpu: "2" }
    - name: data-hot
      count: 3
      config:
        node.roles: ["data_hot", "data_content", "ingest"]
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests: { memory: "16Gi", cpu: "4" }
                limits: { memory: "16Gi", cpu: "8" }
      volumeClaimTemplates:
        - metadata: { name: elasticsearch-data }
          spec:
            accessModes: ["ReadWriteOnce"] # required on any PersistentVolumeClaim
            storageClassName: "premium-ssd"
            resources:
              requests: { storage: "500Gi" }
```
Agent Integration: Code Examples
Java (Spring Boot)
The Elastic Java agent requires zero code changes — attach it as a JVM argument:
```bash
# In your Kubernetes Deployment's container args:
java \
  -javaagent:/opt/elastic-apm/elastic-apm-agent-1.49.0.jar \
  -Delastic.apm.service_name=order-service \
  -Delastic.apm.server_url=http://apm-server:8200 \
  -Delastic.apm.secret_token=${APM_SECRET_TOKEN} \
  -Delastic.apm.environment=production \
  -Delastic.apm.transaction_sample_rate=0.1 \
  -jar app.jar
```
To create custom spans inside a method:
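```java
import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Scope;
import co.elastic.apm.api.Span;
import org.springframework.stereotype.Service;

@Service
public class OrderService {
    // Assumed to be an application-defined repository injected by Spring.
    private final UserRepository userRepository;

    public OrderService(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    public Order processOrder(OrderRequest req) {
        // Custom span wraps a critical business operation
        Span span = ElasticApm.currentTransaction()
            .startSpan("db", "postgresql", "query");
        span.setName("fetch-user-credit-limit");
        // activate() makes the span current on this thread; closing the Scope only deactivates it
        try (Scope scope = span.activate()) {
            return userRepository.findCreditLimit(req.getUserId());
        } catch (Exception e) {
            span.captureException(e);
            throw e;
        } finally {
            span.end(); // a span is only reported once it has been ended
        }
    }
}
```
Node.js (Express)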
```javascript
// MUST be the very first require, before express, http, pg, etc.
require("elastic-apm-node").start({
  serviceName: "payment-service",
  serverUrl: "http://apm-server:8200",
  secretToken: process.env.APM_SECRET_TOKEN,
  environment: process.env.NODE_ENV,
  transactionSampleRate: 0.2,
});

const express = require("express");
const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated

app.post("/payments", async (req, res) => {
  // The APM agent auto-instruments this HTTP handler as a Transaction.
  // All DB queries and outgoing HTTP calls inside become child Spans.
  // chargeCard is the application's own payment function (not shown).
  const result = await chargeCard(req.body.token, req.body.amount);
  res.json(result);
});
```
Python (FastAPI)
```python
import os

import elasticapm
from elasticapm.contrib.starlette import make_apm_client, ElasticAPM
from fastapi import FastAPI

apm = make_apm_client({
    'SERVICE_NAME': 'inventory-service',
    'SERVER_URL': 'http://apm-server:8200',
    'SECRET_TOKEN': os.environ['APM_SECRET_TOKEN'],
    'ENVIRONMENT': 'production',
})

app = FastAPI()
app.add_middleware(ElasticAPM, client=apm)  # Auto-instruments all routes


@app.get("/inventory/{product_id}")
async def get_inventory(product_id: str):
    # Manually capture a custom span for the Redis lookup.
    # `redis` is assumed to be an async Redis client created elsewhere.
    async with elasticapm.async_capture_span('redis.get', span_type='cache'):
        stock = await redis.get(f"stock:{product_id}")
    return {"product_id": product_id, "stock": stock}
```
Sampling Strategy
Capturing 100% of traces at high traffic volumes is prohibitively expensive. Elastic APM provides two complementary strategies:
| Strategy | Pros | Cons |
|---|---|---|
| Head-based (agent) | Zero server overhead, low network traffic | May miss rare slow traces if sampled out early |
| Tail-based (server) | Always captures errors and slow traces | APM Server must buffer each trace's events locally until the sampling decision is made |
| Combined (recommended) | Low volume + high-value traces guaranteed | Slightly more complex configuration |
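A minimal sketch of the head-based half of this table: the root service rolls the dice once, and every downstream service reuses that decision instead of re-rolling, so a trace is either captured in full or not at all. The function names and the volume estimate are illustrative assumptions, not agent APIs.

```python
import random


def start_trace(sample_rate, rng=random):
    """Root service: decide once whether this trace is recorded."""
    return {"sampled": rng.random() < sample_rate}


def continue_trace(incoming):
    """Downstream service: reuse the caller's decision, never re-roll."""
    return {"sampled": incoming["sampled"]}


def stored_traces_per_second(requests_per_second, head_rate, tail_keep_ratio=1.0):
    """Rough volume estimate when head-based and tail-based sampling are combined."""
    return requests_per_second * head_rate * tail_keep_ratio
```

For example, at a hypothetical 1,000 requests per second, a head rate of 0.1 combined with a tail policy that keeps a quarter of what arrives stores about 25 traces per second.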
Key Metrics & Golden Signals (APM Perspective)
Elastic APM surfaces all four Golden Signals (Google SRE Book) natively:
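As a rough illustration of how the four signals (latency, traffic, errors, saturation) fall out of APM data, the sketch below derives them from a list of transactions plus CPU samples such as `system.cpu.total.norm.pct`. The input shapes, field names, and the `golden_signals` function are assumptions; Kibana computes these from aggregated metrics instead.

```python
def golden_signals(transactions, cpu_samples, window_seconds):
    """Compute latency (p95), traffic, errors, and saturation for one time window."""
    durations = sorted(t["duration_ms"] for t in transactions)
    count = len(durations)
    saturation = sum(cpu_samples) / len(cpu_samples) if cpu_samples else None
    if count == 0:
        return {
            "latency_p95_ms": None,
            "throughput_per_min": 0.0,
            "error_rate": 0.0,
            "cpu_saturation": saturation,
        }
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = (95 * count + 99) // 100  # integer ceil(0.95 * count)
    failures = sum(1 for t in transactions if t["outcome"] == "failure")
    return {
        "latency_p95_ms": durations[rank - 1],
        "throughput_per_min": count * 60.0 / window_seconds,
        "error_rate": failures / count,
        "cpu_saturation": saturation,
    }
```

Latency, traffic, and errors come from transaction data, while saturation relies on the agent-collected system metrics.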
Elastic APM vs Competitors
| Feature | Elastic APM | Datadog APM | New Relic APM | Jaeger (OSS) |
|---|---|---|---|---|
| Open source | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| Self-hosted option | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| ELK integration | ✅ Native | ⚠️ Connector only | ⚠️ Connector only | ❌ No |
| Service Map | ✅ Auto-generated | ✅ Auto-generated | ✅ Auto-generated | ⚠️ Limited |
| Machine learning anomaly | ✅ Built-in ML | ✅ Built-in | ✅ Built-in | ❌ No |
| RUM (Browser tracing) | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| Tail-based sampling | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Pricing | Free (self-hosted) | Per-host/GB | Per-host | Free |
| Log correlation | ✅ Native (ECS) | ✅ Yes | ✅ Yes | ❌ No |
Log Correlation: The Killer Feature of ELK + APM
Because Elastic APM and Elasticsearch share the same data platform, a single click in a trace waterfall can jump directly to the correlated application logs for that exact transaction:
Log correlation requires your application logs to include trace.id and span.id. The Elastic Common Schema (ECS) logging libraries (e.g., ecs-logging-java, ecs-logging-nodejs) inject these automatically.
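To show what "injecting trace.id and span.id" amounts to, here is a standard-library-only sketch that stamps both fields onto every JSON log line. It stands in for what the ECS logging libraries do and is not their implementation; the context variables, class names, and the sample IDs are assumptions, and a real agent would supply the active IDs.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Hypothetical holders for the active trace context; a real agent supplies these.
current_trace_id = contextvars.ContextVar("trace_id", default=None)
current_span_id = contextvars.ContextVar("span_id", default=None)


class TraceContextFilter(logging.Filter):
    """Attach the active trace and span ids to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line using ECS-style field names."""

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "@timestamp": timestamp.isoformat(),
            "log.level": record.levelname.lower(),
            "message": record.getMessage(),
            "trace.id": record.trace_id,
            "span.id": record.span_id,
        })


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
current_span_id.set("00f067aa0ba902b7")
logger.info("payment authorized")
```

Because the log line and the trace share the same `trace.id`, a query on that field returns both the waterfall and the logs written while it ran.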
Summary: Elastic APM Architecture at a Glance
Decision Checklist: When to Use Elastic APM
- ✅ You already run or plan to run the ELK stack
- ✅ You need self-hosted, open-source APM (data sovereignty, cost control)
- ✅ You want automatic log-trace-metric correlation in one platform
- ✅ You operate a polyglot microservices architecture (Java + Node + Python + Go)
- ✅ You need browser-side Real User Monitoring correlated with backend traces
- ⚠️ For managed cloud APM with minimal ops overhead → consider Elastic Cloud (paid)
- ⚠️ For pure tracing without the ELK stack → consider Jaeger + Prometheus
