
Application Performance Monitoring (APM): Elastic APM

Modern distributed systems are composed of dozens of microservices, queues, and databases. When a user reports slowness, pinpointing the root cause across that complexity requires Application Performance Monitoring (APM). Elastic APM is an open-source APM solution that plugs natively into the ELK Stack (Elasticsearch, Logstash, Kibana), giving engineering teams end-to-end transaction tracing, service dependency mapping, and anomaly detection — all in one platform.


What is APM?

APM is the practice of automatically instrumenting your application to capture telemetry data at the code level, not just the infrastructure level. This includes:

| Signal | What it captures | Example |
| --- | --- | --- |
| Traces | Full lifecycle of a single request across all services | User → API → Auth → DB → Cache → Response |
| Spans | Individual units of work within a trace | A single SQL query taking 45 ms |
| Transactions | Top-level operations (HTTP request, message consumer job) | GET /api/orders/{id} |
| Errors | Captured exceptions with full stack trace | NullPointerException in OrderService |
| Metrics | JVM heap, CPU, GC pauses, HTTP request rates at agent level | system.cpu.total.norm.pct = 0.82 |
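
To make the relationship between traces, transactions, and spans concrete, here is a minimal Python sketch. The class and field names are illustrative assumptions, not the Elastic APM document schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One unit of work inside a transaction, e.g. a single SQL query."""
    name: str
    duration_ms: float


@dataclass
class Transaction:
    """Top-level operation handled by one service, e.g. an HTTP request."""
    name: str
    service: str
    duration_ms: float
    spans: List[Span] = field(default_factory=list)


@dataclass
class Trace:
    """All transactions that share one trace id across services."""
    trace_id: str
    transactions: List[Transaction] = field(default_factory=list)

    def slowest_span(self) -> Optional[Span]:
        # Flatten every span in the trace and pick the longest one.
        spans = [s for t in self.transactions for s in t.spans]
        return max(spans, key=lambda s: s.duration_ms, default=None)


# Hypothetical data mirroring the examples in the table above.
trace = Trace(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    transactions=[
        Transaction(
            name="GET /api/orders/{id}",
            service="order-service",
            duration_ms=120.0,
            spans=[
                Span(name="SELECT FROM orders", duration_ms=45.0),
                Span(name="GET cache:order", duration_ms=3.0),
            ],
        )
    ],
)
slowest = trace.slowest_span()
```

Here `slowest` is the 45 ms SQL span, which is the kind of answer a trace waterfall gives at a glance.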

Key insight: Infrastructure monitoring (Prometheus/Grafana) tells you that a server is at 95% CPU. APM narrows that down to the transaction and span responsible, often with a stack trace that points at the code involved.


The ELK Stack + APM: A Unified Observability Platform

Before diving into architecture, it's important to understand how Elastic APM extends the classic ELK stack:

The APM Server acts as a protocol translator: it receives events from language agents as newline-delimited JSON over HTTP (optionally compressed), validates and enriches them, and writes them to Elasticsearch as structured documents. This keeps agents lightweight and makes Elasticsearch the single source of truth.


Full System Architecture

The following diagram shows the complete end-to-end production architecture of Elastic APM deployed in a Kubernetes environment:


Core Components Deep Dive

1. APM Agents

APM Agents are language-native libraries that auto-instrument your application code. They use techniques like bytecode manipulation (Java), monkey-patching (Python/Node.js), and middleware hooks (Go) to transparently capture traces without requiring you to change your business logic.

Key design decisions in agents:

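| Decision | Rationale |
| --- | --- |
| Async batch flushing | Agents queue events and send them in the background, so the application's request thread is not blocked |
| Head-based sampling | The sampling decision is made once at trace start and propagated downstream, so every service keeps or drops the same trace |
| W3C TraceContext | Standard HTTP header (traceparent) enables cross-service correlation |
| Circuit breaker | If the APM Server is unreachable, the agent stops sending, drops events once its queue is full, and retries later, keeping the impact on production traffic minimal |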
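
The async batch flushing row can be pictured as a bounded queue that request threads write to without blocking, while a background sender drains it in batches. The sketch below is a simplified illustration under that assumption; `EventBuffer` and its methods are hypothetical names, not part of any Elastic agent.

```python
import queue
from typing import Any, Dict, List


class EventBuffer:
    """Bounded in-memory buffer between request threads and a sender thread."""

    def __init__(self, max_events: int = 1000) -> None:
        self._queue: "queue.Queue[Dict[str, Any]]" = queue.Queue(maxsize=max_events)
        self.dropped = 0

    def enqueue(self, event: Dict[str, Any]) -> bool:
        """Called on the request path: never blocks, drops the event when full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain_batch(self, max_batch: int = 100) -> List[Dict[str, Any]]:
        """Called by a background sender: collect up to max_batch events."""
        batch: List[Dict[str, Any]] = []
        while len(batch) < max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


# A tiny buffer makes the drop behaviour easy to see.
buffer = EventBuffer(max_events=3)
accepted = [buffer.enqueue({"id": i}) for i in range(5)]
first_batch = buffer.drain_batch(max_batch=2)
```

With a capacity of three, the last two events are dropped rather than making the caller wait. That is the trade-off this design makes: losing telemetry is preferred over slowing down requests.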

2. APM Server

The APM Server is a stateless Go binary that sits between agents and Elasticsearch. Its responsibilities:

Tail-based sampling is a critical APM Server feature: instead of deciding at the start of a trace whether to record it (head-based), the server can wait until the full trace is complete, then keep it if it contains an error or exceeds a latency threshold. This maximizes the value of every stored trace.
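
The keep-or-drop decision described above can be sketched in a few lines of Python. This is an illustration of the idea only: the threshold, the baseline rate, and the `FinishedTrace` type are assumptions, and the real APM Server expresses this through configurable sampling policies rather than a function like this.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class FinishedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: FinishedTrace,
    latency_threshold_ms: float = 500.0,
    baseline_rate: float = 0.1,
    rng: Optional[random.Random] = None,
) -> bool:
    """Decide whether to store a trace after it has completed."""
    if trace.has_error:
        return True  # traces with an error are always kept
    if trace.duration_ms >= latency_threshold_ms:
        return True  # slow traces are always kept
    # Everything else is kept only at a low baseline rate.
    rng = rng or random.Random()
    return rng.random() < baseline_rate


failed = FinishedTrace("a1", duration_ms=80.0, has_error=True)
slow = FinishedTrace("b2", duration_ms=844.0, has_error=False)
routine = FinishedTrace("c3", duration_ms=80.0, has_error=False)
```

Because the decision runs after the trace is complete, `failed` and `slow` are kept regardless of the baseline rate, while `routine` is kept only occasionally.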


3. Elasticsearch Data Streams

Elastic APM uses Elasticsearch Data Streams — a time-series optimized index management strategy. Each signal type gets its own stream:

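| Data stream | Content |
| --- | --- |
| traces-apm-{namespace} | Distributed trace spans and transactions |
| metrics-apm.app.{service}-{namespace} | Agent-collected CPU, heap, GC metrics |
| logs-apm.error-{namespace} | Captured exceptions with stack traces |
| metrics-apm.internal-{namespace} | Internal metrics that power Kibana's APM charts |

The backing indices behind each data stream carry a .ds- prefix plus a date and generation suffix.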

ILM (Index Lifecycle Management) automatically rolls over and deletes old indices, ensuring storage costs stay predictable:

Hot (SSD, 7 days) → Warm (HDD, 30 days) → Cold (Snapshot, 90 days) → Delete
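
As a rough worked example of that lifecycle, the helper below maps an index age to its phase. It assumes each number is the time spent in that phase (so the boundaries fall at 7, 37, and 127 days); if the numbers are meant as cumulative ages, the boundaries change. In real ILM the age is normally counted from rollover, which this sketch ignores.

```python
from typing import List, Tuple

# (phase name, days spent in that phase), taken from the lifecycle line above.
# Assumption: durations are per phase, not cumulative ages.
PHASES: List[Tuple[str, int]] = [("hot", 7), ("warm", 30), ("cold", 90)]


def phase_for_age(age_days: float) -> str:
    """Return the lifecycle phase for an index of the given age in days."""
    boundary = 0
    for name, days in PHASES:
        boundary += days
        if age_days < boundary:
            return name
    return "delete"


ages = [3, 7, 36, 37, 126, 127]
phases = [phase_for_age(a) for a in ages]
```

Under the stated assumption, data is removed after 127 days in total, which is the figure to use when estimating storage.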

4. Kibana APM UI

Kibana is the operational frontend. Its APM UI surfaces the data in four key views:


Distributed Tracing: End-to-End Example

This is the heart of APM. Consider a user placing an order. Let's trace the entire journey:

Scenario: POST /orders with a slow payment provider

The waterfall immediately reveals the bottleneck: the Stripe API call takes 540ms, which is 64% of the total user-perceived latency. Without distributed tracing, an engineer would see "the orders endpoint is slow" but have no idea which downstream dependency is responsible.

Trace Propagation via W3C traceparent Header

When Order Service calls Payment Service, it injects a traceparent header:

http
POST /payments HTTP/1.1
Host: payment-service.internal
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^  ^^
             ver  trace-id (same across ALL hops)  parent-span-id    flags
Content-Type: application/json

{ "orderId": "ord_9a2b", "amount": 99.99 }

Every service reads this header and creates child spans under the same trace-id, which is how Kibana stitches together the full waterfall from data written independently by four different services.
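
The header format is simple enough to parse by hand. The sketch below handles only version 00 of the W3C format and uses hypothetical helper names (`parse_traceparent`, `child_header`); in practice the APM agents do this for you.

```python
import re
from typing import NamedTuple, Optional

# version 00: 2 hex chars, 32 hex trace id, 16 hex parent id, 2 hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # The specification forbids all-zero trace and parent ids.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest flag bit means "sampled"
    return TraceParent(trace_id, parent_id, sampled)


def child_header(parent: TraceParent, new_span_id: str) -> str:
    """Build the header for an outgoing call: same trace id, new parent id."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
# The span id below is a made-up value for the next hop.
outgoing = child_header(incoming, "b7ad6b7169203331")
```

The trace id is carried over unchanged while the parent id is replaced by the calling span's id, which is what lets the backend rebuild the parent/child tree.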


Real User Monitoring (RUM)

Elastic APM includes a JavaScript RUM agent for browser-side performance monitoring. It captures:

  • Page load performance (using Navigation Timing API)
  • Long tasks (JS that blocks the main thread > 50ms)
  • User interactions (clicks, route changes in SPAs)
  • AJAX/Fetch calls (correlating browser XHR back to backend traces)

Service Map: Auto-Generated Dependency Graph

One of Elastic APM's most powerful features is the Service Map — automatically constructed from trace data, no manual configuration required.

The service map is a live topology of your system, rebuilt continuously from incoming trace data. When latency degrades in Payment Service, the degradation is visible on the map itself, so nobody has to consult a runbook to work out where to look.
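
Conceptually, a dependency graph like this can be derived by grouping outgoing-call spans by caller and destination. The sketch below shows that aggregation under simplified assumptions; the dictionary keys (`service`, `destination`, `duration_ms`) are illustrative, not the field names Elastic APM stores.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]  # (calling service, destination)


def build_service_map(exit_spans: Iterable[dict]) -> Dict[Edge, dict]:
    """Aggregate outgoing-call spans into edges with call count and mean latency."""
    totals = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for span in exit_spans:
        edge = (span["service"], span["destination"])
        totals[edge]["calls"] += 1
        totals[edge]["total_ms"] += span["duration_ms"]
    return {
        edge: {"calls": t["calls"], "avg_ms": t["total_ms"] / t["calls"]}
        for edge, t in totals.items()
    }


# Hypothetical spans for the order scenario.
spans = [
    {"service": "order-service", "destination": "payment-service", "duration_ms": 40.0},
    {"service": "order-service", "destination": "payment-service", "duration_ms": 60.0},
    {"service": "order-service", "destination": "postgresql", "duration_ms": 45.0},
    {"service": "payment-service", "destination": "stripe", "duration_ms": 540.0},
]
service_map = build_service_map(spans)
```

Each key in `service_map` is one arrow in the graph, and the average latency per edge is what makes a slow dependency stand out.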


Alerting Architecture

Elastic APM integrates with Kibana's alerting engine to trigger alerts based on APM-specific rules:

Alert Rule Example: Transaction Error Rate

yaml
# Kibana Alerting Rule (conceptual YAML representation)
rule:
  name: "Payment Service Error Rate > 5%"
  type: "apm.transaction_error_rate"
  schedule: "every 1 minute"
  params:
    windowSize: 5
    windowUnit: "minutes"
    threshold: 5 # 5% error rate
    serviceName: "payment-service"
    transactionType: "request"
  actions:
    - connector: "pagerduty"
      params:
        severity: "critical"
        summary: "Payment service error rate exceeded 5% threshold"
    - connector: "slack"
      params:
        channel: "#incidents"
        message: "🚨 payment-service error rate: {{context.triggerValue}}%"
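
What a rule like this evaluates on each run can be sketched as follows: take the transactions in the look-back window, compute the failure percentage, and compare it with the threshold. This is a conceptual Python illustration with hypothetical names, not how Kibana implements the rule.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TransactionEvent:
    timestamp: float  # seconds since epoch
    failed: bool


def error_rate_percent(
    events: Iterable[TransactionEvent], now: float, window_seconds: int = 300
) -> float:
    """Percentage of failed transactions within the look-back window."""
    recent = [e for e in events if now - window_seconds <= e.timestamp <= now]
    if not recent:
        return 0.0
    failures = sum(1 for e in recent if e.failed)
    return 100.0 * failures / len(recent)


def should_alert(
    events: Iterable[TransactionEvent],
    now: float,
    window_seconds: int = 300,
    threshold_percent: float = 5.0,
) -> bool:
    return error_rate_percent(events, now, window_seconds) > threshold_percent


now = 10_000.0
# 100 transactions inside the 5-minute window, 6 of them failed.
events = [TransactionEvent(now - i, failed=(i < 6)) for i in range(100)]
# Old failures outside the window must not influence the result.
events += [TransactionEvent(now - 1_000, failed=True) for _ in range(50)]
rate = error_rate_percent(events, now)
```

With 6 failures out of 100 recent transactions the rate is 6%, which is above the 5% threshold, so the rule would fire its actions.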

Production Deployment: Kubernetes Helm Architecture

In production, all Elastic Stack components are deployed via the Elastic Cloud on Kubernetes (ECK) operator:

ECK Elasticsearch Custom Resource

yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: production-apm
  namespace: elastic-system
spec:
  version: 8.13.0
  nodeSets:
    - name: masters
      count: 3
      config:
        node.roles: ["master"]
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests: { memory: "4Gi", cpu: "1" }
                limits: { memory: "4Gi", cpu: "2" }
    - name: data-hot
      count: 3
      config:
        node.roles: ["data_hot", "data_content", "ingest"]
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests: { memory: "16Gi", cpu: "4" }
                limits: { memory: "16Gi", cpu: "8" }
      volumeClaimTemplates:
        - metadata: { name: elasticsearch-data }
          spec:
            storageClassName: "premium-ssd"
            resources:
              requests: { storage: "500Gi" }

Agent Integration: Code Examples

Java (Spring Boot)

The Elastic Java agent requires zero code changes — attach it as a JVM argument:

bash
# In your Kubernetes Deployment's container args:
java \
  -javaagent:/opt/elastic-apm/elastic-apm-agent-1.49.0.jar \
  -Delastic.apm.service_name=order-service \
  -Delastic.apm.server_url=http://apm-server:8200 \
  -Delastic.apm.secret_token=${APM_SECRET_TOKEN} \
  -Delastic.apm.environment=production \
  -Delastic.apm.transaction_sample_rate=0.1 \
  -jar app.jar

To create custom spans inside a method:

java
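import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Scope;
import co.elastic.apm.api.Span;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    // UserRepository, OrderRequest and CreditLimit are application types (placeholders here).
    private final UserRepository userRepository;

    public OrderService(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    public CreditLimit fetchCreditLimit(OrderRequest req) {
        // Custom span wraps a critical business operation
        Span span = ElasticApm.currentTransaction()
            .startSpan("db", "postgresql", "query");
        span.setName("fetch-user-credit-limit");

        // activate() makes the span current; the Scope, not the Span, is AutoCloseable
        try (Scope scope = span.activate()) {
            return userRepository.findCreditLimit(req.getUserId());
        } catch (Exception e) {
            span.captureException(e);
            throw e;
        } finally {
            span.end(); // a span is only reported once it has been ended
        }
    }
}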

Node.js (Express)

javascript
// MUST run before any other module is loaded (express, http, pg, etc.)
require("elastic-apm-node").start({
  serviceName: "payment-service",
  serverUrl: "http://apm-server:8200",
  secretToken: process.env.APM_SECRET_TOKEN,
  environment: process.env.NODE_ENV,
  transactionSampleRate: 0.2,
});

const express = require("express");
const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated

app.post("/payments", async (req, res, next) => {
  // The APM agent auto-instruments this HTTP handler as a Transaction.
  // All DB queries and outgoing HTTP calls inside become child Spans.
  try {
    // chargeCard is the application's own payment function (not shown).
    const result = await chargeCard(req.body.token, req.body.amount);
    res.json(result);
  } catch (err) {
    next(err); // Express 4 does not catch rejected promises on its own
  }
});

Python (FastAPI)

python
import os

import elasticapm
from elasticapm.contrib.starlette import make_apm_client, ElasticAPM
from fastapi import FastAPI
from redis.asyncio import Redis

apm = make_apm_client({
    'SERVICE_NAME': 'inventory-service',
    'SERVER_URL':   'http://apm-server:8200',
    'SECRET_TOKEN': os.environ['APM_SECRET_TOKEN'],
    'ENVIRONMENT':  'production',
})

# Redis connection details are placeholders; adjust for your environment.
redis = Redis(host="redis", port=6379, decode_responses=True)

app = FastAPI()
app.add_middleware(ElasticAPM, client=apm)  # Auto-instruments all routes

@app.get("/inventory/{product_id}")
async def get_inventory(product_id: str):
    # Manually capture a custom span for Redis lookup (async variant of capture_span)
    async with elasticapm.async_capture_span('redis.get', span_type='cache'):
        stock = await redis.get(f"stock:{product_id}")
    return {"product_id": product_id, "stock": stock}

Sampling Strategy

Capturing 100% of traces at high traffic volumes is prohibitively expensive. Elastic APM provides two complementary strategies:

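| Strategy | Pros | Cons |
| --- | --- | --- |
| Head-based (agent) | No extra work on the APM Server, low network traffic | Can drop rare slow or failing traces, because the decision is made before the outcome is known |
| Tail-based (server) | Keeps errors and slow traces that match the sampling policy | APM Server must buffer each trace in local storage until it completes |
| Combined (recommended) | Low stored volume while retaining high-value traces | Slightly more complex configuration; tail-based rules only see traces that passed head-based sampling |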
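
A back-of-the-envelope calculation shows how the two stages combine. The function and all the rates below are hypothetical; the point is only that tail-based rules can keep nothing that head-based sampling already dropped, so an aggressive head rate also thins out the errors and slow traces.

```python
def stored_traces_per_minute(
    requests_per_minute: float,
    head_rate: float,
    interesting_fraction: float,
    tail_baseline_rate: float,
) -> float:
    """Estimate stored traces when head-based sampling feeds tail-based sampling.

    interesting_fraction: share of traces that are errors or slow (always kept).
    tail_baseline_rate:   share of the remaining routine traces that is kept.
    """
    reaching_server = requests_per_minute * head_rate
    interesting = reaching_server * interesting_fraction
    routine = reaching_server - interesting
    return interesting + routine * tail_baseline_rate


# 60,000 requests/min, agents keep 50%, 2% of traces are errors or slow,
# and the server keeps 10% of the routine remainder.
stored = stored_traces_per_minute(60_000, 0.5, 0.02, 0.1)
```

With these example numbers roughly 3,540 traces per minute are stored out of 60,000 requests, of which about 600 are the high-value ones.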

Key Metrics & Golden Signals (APM Perspective)

Elastic APM surfaces all four Golden Signals from the Google SRE book (latency, traffic, errors, and saturation) natively:
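
As a rough sketch of where each signal comes from, the function below derives latency, traffic, and error rate from transaction data and takes saturation from an agent metric such as CPU utilisation. The mapping and the function names are assumptions made for illustration; Kibana computes these for you.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(
    durations_ms: Sequence[float],
    failures: int,
    window_seconds: float,
    cpu_utilisation: float,
) -> Dict[str, float]:
    """Summarise one service over one time window."""
    return {
        "latency_p95_ms": percentile(durations_ms, 95),      # latency
        "traffic_rps": len(durations_ms) / window_seconds,   # traffic
        "error_rate": failures / len(durations_ms),          # errors
        "saturation": cpu_utilisation,                       # saturation
    }


# 100 transactions in one minute with durations 1..100 ms, 5 of them failed.
signals = golden_signals(
    durations_ms=[float(i) for i in range(1, 101)],
    failures=5,
    window_seconds=60,
    cpu_utilisation=0.82,
)
```

Three of the four signals come straight from transaction documents, while saturation relies on the metrics the agent collects alongside them.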


Elastic APM vs Competitors

| Feature | Elastic APM | Datadog APM | New Relic APM | Jaeger (OSS) |
| --- | --- | --- | --- | --- |
| Open source | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| Self-hosted option | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
| ELK integration | ✅ Native | ⚠️ Connector only | ⚠️ Connector only | ❌ No |
| Service Map | ✅ Auto-generated | ✅ Auto-generated | ✅ Auto-generated | ⚠️ Limited |
| Machine learning anomaly | ✅ Built-in ML | ✅ Built-in | ✅ Built-in | ❌ No |
| RUM (Browser tracing) | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| Tail-based sampling | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Pricing | Free (self-hosted) | Per-host/GB | Per-host | Free |
| Log correlation | ✅ Native (ECS) | ✅ Yes | ✅ Yes | ❌ No |

Log Correlation: The Killer Feature of ELK + APM

Because APM data and application logs are stored in the same Elasticsearch cluster, a single click in a trace waterfall can jump directly to the correlated application logs for that exact transaction:

Log correlation requires your application logs to include trace.id and span.id (and, where available, transaction.id). The Elastic Common Schema (ECS) logging libraries (e.g., ecs-logging-java, ecs-logging-nodejs), working together with the APM agent, add these fields automatically.
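
To show what such a correlated log line looks like, here is a minimal sketch using only Python's standard logging module. The filter, the formatter, and the way the ids are obtained are all hypothetical stand-ins; with a real agent and an ECS logging library you would not write this yourself.

```python
import io
import json
import logging


class TraceContextFilter(logging.Filter):
    """Attach the active trace and span ids to every log record."""

    def __init__(self, get_ids):
        super().__init__()
        self._get_ids = get_ids  # callable returning (trace_id, span_id)

    def filter(self, record):
        record.trace_id, record.span_id = self._get_ids()
        return True  # never suppress the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line using ECS-style field names."""

    def format(self, record):
        return json.dumps({
            "message": record.getMessage(),
            "log.level": record.levelname.lower(),
            "trace.id": record.trace_id,
            "span.id": record.span_id,
        })


# Stand-in for asking the APM agent which trace is currently active.
def current_ids():
    return "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter(current_ids))
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("order-service-example")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("order placed")
entry = json.loads(stream.getvalue())
```

Since the log document carries the same trace.id as the trace itself, a lookup by that id is all the UI needs to jump from a span to its log lines.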


Summary: Elastic APM Architecture at a Glance

Decision Checklist: When to Use Elastic APM

  • ✅ You already run or plan to run the ELK stack
  • ✅ You need self-hosted, open-source APM (data sovereignty, cost control)
  • ✅ You want automatic log-trace-metric correlation in one platform
  • ✅ You operate a polyglot microservices architecture (Java + Node + Python + Go)
  • ✅ You need browser-side Real User Monitoring correlated with backend traces
  • ⚠️ For managed cloud APM with minimal ops overhead → consider Elastic Cloud (paid)
  • ⚠️ For pure tracing without the ELK stack → consider Jaeger + Prometheus


Released under the ISC License.