Application Performance Monitoring (APM): Elastic APM

Modern distributed systems are composed of dozens of microservices, queues, and databases. When a user reports slowness, pinpointing the root cause across that complexity requires Application Performance Monitoring (APM). Elastic APM is an open-source APM solution that plugs natively into the ELK Stack (Elasticsearch, Logstash, Kibana), giving engineering teams end-to-end transaction tracing, service dependency mapping, and anomaly detection — all in one platform.

What is APM?

APM is the practice of automatically instrumenting your application to capture telemetry data at the code level, not just the infrastructure level. This includes:

Signal	What it captures	Example
Traces	Full lifecycle of a single request across all services	User → API → Auth → DB → Cache → Response
Spans	Individual units of work within a trace	A single SQL query taking 45 ms
Transactions	Top-level operations (HTTP request, message consumer job)	`GET /api/orders/{id}`
Errors	Captured exceptions with full stack trace	`NullPointerException` in `OrderService`
Metrics	JVM heap, CPU, GC pauses, HTTP request rates at agent level	`system.cpu.total.norm.pct = 0.82`

Key insight: Infrastructure monitoring (Prometheus/Grafana) tells you that a server is at 95% CPU. APM narrows the cause down to the transaction and span responsible, and often to the stack frame.
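To make the relationship between the signals in the table concrete, here is a minimal sketch of how traces, transactions, and spans nest. The class and field names are illustrative assumptions, not the Elastic APM document schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One unit of work inside a transaction, e.g. a single SQL query."""
    name: str
    duration_ms: float


@dataclass
class Transaction:
    """Top-level operation handled by one service, e.g. an HTTP request."""
    name: str
    service: str
    duration_ms: float
    spans: List[Span] = field(default_factory=list)
    error: Optional[str] = None  # exception type, if one was captured


@dataclass
class Trace:
    """All transactions that share one trace id across services."""
    trace_id: str
    transactions: List[Transaction] = field(default_factory=list)

    def span_count(self) -> int:
        return sum(len(t.spans) for t in self.transactions)


# Hypothetical example values, reusing the examples from the table above.
trace = Trace(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    transactions=[
        Transaction("GET /api/orders/{id}", "order-service", 120.0,
                    spans=[Span("SELECT FROM orders", 45.0)]),
    ],
)
```

A trace groups transactions, and each transaction owns the spans recorded while it ran.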

The ELK Stack + APM: A Unified Observability Platform

Before diving into architecture, it's important to understand how Elastic APM extends the classic ELK stack:

The APM Server acts as a protocol translator: it receives compressed, newline-delimited JSON (NDJSON) payloads over HTTP from language agents and converts them into structured Elasticsearch documents, keeping agents lightweight and Elasticsearch as the single source of truth.
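The translation step can be pictured with a short sketch. It assumes agents send newline-delimited JSON with one event per line, where each line has a single top-level key naming the event type. The function name and the flattened output shape are assumptions for illustration and are much simpler than what APM Server actually writes.

```python
import json
from typing import Dict, Iterator


def ndjson_to_documents(payload: str) -> Iterator[Dict]:
    """Yield one flat document per event line in an NDJSON payload."""
    service_name = None
    for line in payload.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        # Each line is assumed to hold exactly one key: the event type.
        (event_type, body), = event.items()
        if event_type == "metadata":
            # Metadata applies to every event that follows it in the payload.
            service_name = body["service"]["name"]
            continue
        yield {"event_type": event_type, "service_name": service_name, **body}


# Hypothetical payload: one metadata line followed by two events.
payload = "\n".join([
    json.dumps({"metadata": {"service": {"name": "order-service"}}}),
    json.dumps({"transaction": {"name": "POST /orders", "duration": 120.0}}),
    json.dumps({"span": {"name": "SELECT FROM orders", "duration": 45.0}}),
])
documents = list(ndjson_to_documents(payload))
```

Each resulting document carries the service name from the metadata line, which is what lets downstream queries filter by service.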

Full System Architecture

The following diagram shows the complete end-to-end production architecture of Elastic APM deployed in a Kubernetes environment:

Core Components Deep Dive

1. APM Agents

APM Agents are language-native libraries that auto-instrument your application code. They use techniques like bytecode manipulation (Java), monkey-patching (Python/Node.js), and middleware hooks (Go) to transparently capture traces without requiring you to change your business logic.

Key design decisions in agents:

Decision	Rationale
Async batch flushing	Agents never block the application's request thread
Head-based sampling	Sampling decision made at trace start to ensure consistent trace capture
W3C TraceContext	Standard HTTP header (`traceparent`) enables cross-service correlation
Circuit breaker	Agent backs off and drops buffered events if APM Server is unreachable, so the application keeps serving requests
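The async batch flushing row can be sketched with a bounded queue and a background thread. This is an illustrative stand-in, not the agents' real implementation: `BatchingReporter`, its parameters, and the drop-when-full policy are assumptions.

```python
import queue
import threading
from typing import Callable, List


class BatchingReporter:
    """Buffers events in memory and flushes them from a background thread."""

    def __init__(self, send: Callable[[List[dict]], None],
                 batch_size: int = 3, max_queue: int = 100):
        self._send = send
        self._batch_size = batch_size
        self._queue: "queue.Queue" = queue.Queue(maxsize=max_queue)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, event: dict) -> bool:
        """Called on the request thread: never blocks, drops the event if full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False

    def close(self) -> None:
        self._queue.put(None)  # sentinel: flush what is left, then stop
        self._worker.join()

    def _run(self) -> None:
        batch: List[dict] = []
        while True:
            event = self._queue.get()
            if event is None:
                break
            batch.append(event)
            if len(batch) >= self._batch_size:
                self._send(batch)
                batch = []
        if batch:
            self._send(batch)


sent_batches: List[List[dict]] = []
reporter = BatchingReporter(send=sent_batches.append)
for i in range(7):
    reporter.report({"span": i})
reporter.close()
```

The request thread only pays for a non-blocking queue insert; network I/O happens on the worker thread, and a full queue sheds events rather than slowing the application.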

2. APM Server

The APM Server is a stateless Go binary that sits between agents and Elasticsearch. Its responsibilities:

Tail-based sampling is a critical APM Server feature: instead of deciding at the start of a trace whether to record it (head-based), the server can wait until the full trace is complete, then keep it if it contains an error or exceeds a latency threshold. This maximizes the value of every stored trace.
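The decision described above can be sketched as a small function. It follows the rule stated in the text (keep on error or high latency) plus an assumed baseline rate for ordinary traces. The event fields and parameter names are hypothetical, and APM Server's real policy configuration is richer than this.

```python
import random
from typing import Dict, List


def keep_trace(events: List[Dict], latency_threshold_ms: float,
               baseline_rate: float, rng: random.Random) -> bool:
    """Decide, once the trace is complete, whether to store it."""
    has_error = any(e.get("outcome") == "failure" for e in events)
    # The root event has no parent; its duration is the end-to-end latency.
    root = next(e for e in events if e.get("parent_id") is None)
    if has_error or root["duration_ms"] >= latency_threshold_ms:
        return True
    # Otherwise keep only a small random share of "boring" traces.
    return rng.random() < baseline_rate


rng = random.Random(7)
fast_ok = [{"parent_id": None, "duration_ms": 80, "outcome": "success"}]
slow_ok = [{"parent_id": None, "duration_ms": 900, "outcome": "success"}]
fast_failed = [
    {"parent_id": None, "duration_ms": 80, "outcome": "success"},
    {"parent_id": "a1", "duration_ms": 20, "outcome": "failure"},
]
```

Because the function needs every event of the trace, the caller has to buffer events until the trace finishes, which is the memory cost of tail-based sampling.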

3. Elasticsearch Data Streams

Elastic APM uses Elasticsearch Data Streams — a time-series optimized index management strategy. Each signal type gets its own stream:

Data Stream Pattern	Content
`traces-apm-{namespace}`	Distributed trace spans & transactions
`metrics-apm.app.{service}-{namespace}`	Agent-collected CPU, heap, GC metrics
`logs-apm.error-{namespace}`	Captured exceptions with stack traces
`metrics-apm.internal-{namespace}`	APM Server's own health metrics

ILM (Index Lifecycle Management) automatically rolls over and deletes old indices, ensuring storage costs stay predictable:

Hot (SSD, 7 days) → Warm (HDD, 30 days) → Cold (Snapshot, 90 days) → Delete
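One way to read the lifecycle above is as an ILM policy body with a `min_age` per phase. The sketch below assumes the day counts are cumulative thresholds measured from rollover (warm at 7 days, cold at 30, delete at 90). The rollover setting and the helper function are illustrative assumptions.

```python
# Policy body in the shape accepted by the ILM put-policy API.
# The phase timings mirror the lifecycle line above; the 1-day rollover is an assumption.
ILM_POLICY = {
    "policy": {
        "phases": {
            "hot": {"min_age": "0d", "actions": {"rollover": {"max_age": "1d"}}},
            "warm": {"min_age": "7d", "actions": {}},
            "cold": {"min_age": "30d", "actions": {}},
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}


def phase_for_age(age_days: int, policy: dict = ILM_POLICY) -> str:
    """Return the last phase whose min_age the index has reached."""
    current = "hot"
    for name, phase in policy["policy"]["phases"].items():
        if age_days >= int(phase["min_age"].rstrip("d")):
            current = name
    return current
```

For example, an index that rolled over 45 days ago would sit in the cold phase under these assumptions.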

4. Kibana APM UI

Kibana is the operational frontend. Its APM UI surfaces the data in four key views:

Distributed Tracing: End-to-End Example

This is the heart of APM. Consider a user placing an order. Let's trace the entire journey:

Scenario: `POST /orders` with a slow payment provider

The waterfall immediately reveals the bottleneck: the Stripe API call takes 540ms, which is 64% of the total user-perceived latency. Without distributed tracing, an engineer would see "the orders endpoint is slow" but have no idea which downstream dependency is responsible.

Trace Propagation via W3C `traceparent` Header

When Order Service calls Payment Service, it injects a traceparent header:

http

POST /payments HTTP/1.1
Host: payment-service.internal
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^  ^^
             ver  trace-id (same across ALL hops)  parent-span-id    flags
Content-Type: application/json

{ "orderId": "ord_9a2b", "amount": 99.99 }

Every service reads this header and creates child spans under the same trace-id, which is how Kibana stitches together the full waterfall from data written independently by four different services.
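The header handling can be sketched in a few lines: parse the four fields, then build the header for the next hop with the same trace-id and a fresh span id. The helper names are illustrative, and real agents do this for you.

```python
import re
import secrets
from typing import NamedTuple

# version - trace-id (32 hex) - parent-id (16 hex) - flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # Bit 0 of the flags byte is the "sampled" flag.
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> TraceParent:
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    return TraceParent(*match.groups())


def child_traceparent(parent: TraceParent) -> str:
    """Header for an outgoing call: same trace-id, new span id as parent-id."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"{parent.version}-{parent.trace_id}-{new_span_id}-{parent.flags}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
outgoing = parse_traceparent(child_traceparent(incoming))
```

Only the parent-span-id changes between hops, which is why every service's data can be joined on trace-id later.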

Real User Monitoring (RUM)

Elastic APM includes a JavaScript RUM agent for browser-side performance monitoring. It captures:

Page load performance (using Navigation Timing API)
Long tasks (JS that blocks the main thread > 50ms)
User interactions (clicks, route changes in SPAs)
AJAX/Fetch calls (correlating browser XHR back to backend traces)

Service Map: Auto-Generated Dependency Graph

One of Elastic APM's most powerful features is the Service Map — automatically constructed from trace data, no manual configuration required.

The service map is a living, real-time topology of your system. Latency degradation in Payment Service is immediately visually surfaced — no runbook required.
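A rough sketch of how a dependency graph can fall out of trace data: whenever a span's parent belongs to a different service, that is an edge. The span fields and the function name are assumptions, and the real Service Map uses more signals than parent links alone.

```python
from collections import Counter
from typing import Dict, Iterable


def service_edges(spans: Iterable[Dict]) -> Counter:
    """Count caller -> callee pairs where a span's parent is in another service."""
    spans = list(spans)
    service_of = {s["id"]: s["service"] for s in spans}
    edges: Counter = Counter()
    for s in spans:
        parent_service = service_of.get(s.get("parent_id"))
        if parent_service and parent_service != s["service"]:
            edges[(parent_service, s["service"])] += 1
    return edges


# Hypothetical spans from one trace.
spans = [
    {"id": "a", "parent_id": None, "service": "api-gateway"},
    {"id": "b", "parent_id": "a", "service": "order-service"},
    {"id": "c", "parent_id": "b", "service": "order-service"},
    {"id": "d", "parent_id": "c", "service": "payment-service"},
]
edges = service_edges(spans)
```

Spans within the same service (such as `b` to `c`) produce no edge, so the graph only shows cross-service calls.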

Alerting Architecture

Elastic APM integrates with Kibana's alerting engine to trigger alerts based on APM-specific rules:

Alert Rule Example: Transaction Error Rate

yaml

# Kibana Alerting Rule (conceptual YAML representation)
rule:
  name: "Payment Service Error Rate > 5%"
  type: "apm.transaction_error_rate"
  schedule: "every 1 minute"
  params:
    windowSize: 5
    windowUnit: "minutes"
    threshold: 5 # 5% error rate
    serviceName: "payment-service"
    transactionType: "request"
  actions:
    - connector: "pagerduty"
      params:
        severity: "critical"
        summary: "Payment service error rate exceeded 5% threshold"
    - connector: "slack"
      params:
        channel: "#incidents"
        message: "🚨 payment-service error rate: {{context.errorRate}}%"
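The check behind such a rule amounts to a windowed ratio compared against a threshold. The sketch below mirrors the 5-minute window and 5% threshold from the conceptual rule. The function names and the `(timestamp, outcome)` event shape are assumptions, not Kibana's implementation.

```python
from typing import Iterable, Tuple


def error_rate_percent(events: Iterable[Tuple[float, str]],
                       now: float, window_seconds: float) -> float:
    """Share of failed transactions within the trailing window, in percent."""
    recent = [outcome for ts, outcome in events
              if now - window_seconds <= ts <= now]
    if not recent:
        return 0.0
    failures = sum(1 for outcome in recent if outcome == "failure")
    return 100.0 * failures / len(recent)


def should_alert(rate_percent: float, threshold_percent: float = 5.0) -> bool:
    return rate_percent > threshold_percent


now = 1_000.0
# 100 transactions, one per second; every tenth one fails.
events = [(now - i, "failure" if i % 10 == 0 else "success") for i in range(100)]
events.append((now - 3_600, "failure"))  # an hour old: outside the window
rate = error_rate_percent(events, now, window_seconds=5 * 60)
```

Here 10 of the 100 in-window transactions failed, so the rate is 10% and the rule would fire.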

Production Deployment: Kubernetes Helm Architecture

In production, all Elastic Stack components are deployed via the Elastic Cloud on Kubernetes (ECK) operator:

ECK Elasticsearch Custom Resource

yaml

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: production-apm
  namespace: elastic-system
spec:
  version: 8.13.0
  nodeSets:
    - name: masters
      count: 3
      config:
        node.roles: ["master"]
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests: { memory: "4Gi", cpu: "1" }
                limits: { memory: "4Gi", cpu: "2" }
    - name: data-hot
      count: 3
      config:
        node.roles: ["data_hot", "data_content", "ingest"]
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              resources:
                requests: { memory: "16Gi", cpu: "4" }
                limits: { memory: "16Gi", cpu: "8" }
      volumeClaimTemplates:
        - metadata: { name: elasticsearch-data }
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: "premium-ssd"
            resources:
              requests: { storage: "500Gi" }

Agent Integration: Code Examples

Java (Spring Boot)

The Elastic Java agent requires zero code changes — attach it as a JVM argument:

bash

# In your Kubernetes Deployment's container args:
java \
  -javaagent:/opt/elastic-apm/elastic-apm-agent-1.49.0.jar \
  -Delastic.apm.service_name=order-service \
  -Delastic.apm.server_url=http://apm-server:8200 \
  -Delastic.apm.secret_token=${APM_SECRET_TOKEN} \
  -Delastic.apm.environment=production \
  -Delastic.apm.transaction_sample_rate=0.1 \
  -jar app.jar

To create custom spans inside a method:

java

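import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Scope;
import co.elastic.apm.api.Span;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    public Order processOrder(OrderRequest req) {
        // Custom span wraps a critical business operation
        Span span = ElasticApm.currentTransaction()
            .startSpan("db", "postgresql", "query");
        span.setName("fetch-user-credit-limit");

        // Span is not AutoCloseable: activate it on this thread, then end it explicitly
        try (Scope scope = span.activate()) {
            return userRepository.findCreditLimit(req.getUserId());
        } catch (Exception e) {
            span.captureException(e);
            throw e;
        } finally {
            span.end(); // a span that is never ended is never reported
        }
    }
}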

Node.js (Express)

javascript

// MUST be the very first import — before express, http, pg, etc.
require("elastic-apm-node").start({
  serviceName: "payment-service",
  serverUrl: "http://apm-server:8200",
  secretToken: process.env.APM_SECRET_TOKEN,
  environment: process.env.NODE_ENV,
  transactionSampleRate: 0.2,
});

const express = require("express");
const app = express();

app.post("/payments", async (req, res) => {
  // The APM agent auto-instruments this HTTP handler as a Transaction.
  // All DB queries and outgoing HTTP calls inside become child Spans.
  const result = await chargeCard(req.body.token, req.body.amount);
  res.json(result);
});

Python (FastAPI)

python

import os

import elasticapm
from elasticapm.contrib.starlette import make_apm_client, ElasticAPM
from fastapi import FastAPI

apm = make_apm_client({
    'SERVICE_NAME': 'inventory-service',
    'SERVER_URL':   'http://apm-server:8200',
    'SECRET_TOKEN': os.environ['APM_SECRET_TOKEN'],
    'ENVIRONMENT':  'production',
})

app = FastAPI()
app.add_middleware(ElasticAPM, client=apm)  # Auto-instruments all routes

@app.get("/inventory/{product_id}")
async def get_inventory(product_id: str):
    # Manually capture a custom span for Redis lookup
    with elasticapm.capture_span('redis.get', span_type='cache'):
        stock = await redis.get(f"stock:{product_id}")
    return {"product_id": product_id, "stock": stock}

Sampling Strategy

Capturing 100% of traces at high traffic volumes is prohibitively expensive. Elastic APM provides two complementary strategies:

Strategy	Pros	Cons
Head-based (agent)	Zero server overhead, low network traffic	May miss rare slow traces if sampled out early
Tail-based (server)	Always captures errors and slow traces	APM Server must buffer full traces in memory first
Combined (recommended)	Low volume + high-value traces guaranteed	Slightly more complex configuration
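The head-based half of the table can be sketched as follows: the first service rolls the dice once, and every downstream service inherits that decision from the propagated flags instead of deciding again. The function names and the dict shape are illustrative assumptions.

```python
import random


def start_trace(sample_rate: float, rng: random.Random) -> dict:
    """Root service: make the sampling decision once, at trace start."""
    sampled = rng.random() < sample_rate
    return {"sampled": sampled, "flags": "01" if sampled else "00"}


def continue_trace(flags: str) -> dict:
    """Downstream service: inherit the decision from the traceparent flags."""
    return {"sampled": flags == "01", "flags": flags}


rng = random.Random(42)
decisions = [start_trace(0.1, rng)["sampled"] for _ in range(10_000)]
sampled_fraction = sum(decisions) / len(decisions)
```

With a rate of 0.1, roughly one trace in ten is recorded, and a recorded trace is recorded by every service it touches, which is the consistency property the table refers to. The cost is that the decision is made before anyone knows whether the trace will turn out slow or failed.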

Key Metrics & Golden Signals (APM Perspective)

Elastic APM surfaces all four Golden Signals (Google SRE Book) natively:
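The four Golden Signals are latency, traffic, errors, and saturation. As a rough illustration of how they could be derived from transaction events and an agent metric, here is a minimal sketch. The field names, the nearest-rank percentile, and the use of normalized CPU as the saturation signal are assumptions.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(transactions: List[Dict], window_seconds: float,
                   cpu_norm: float) -> Dict[str, float]:
    durations = [t["duration_ms"] for t in transactions]
    failures = sum(1 for t in transactions if t["outcome"] == "failure")
    return {
        "latency_p95_ms": percentile(durations, 95),                       # latency
        "traffic_rpm": len(transactions) * 60 / window_seconds,            # traffic
        "error_rate_pct": 100 * failures / len(transactions),              # errors
        "saturation_cpu_norm": cpu_norm,                                   # saturation
    }


# Hypothetical window: 100 transactions in 60 s, durations 1..100 ms, 5 failures.
transactions = [
    {"duration_ms": float(i), "outcome": "failure" if i % 20 == 0 else "success"}
    for i in range(1, 101)
]
signals = golden_signals(transactions, window_seconds=60, cpu_norm=0.82)
```

The first three signals come from transaction data alone; saturation needs a resource metric such as the `system.cpu.total.norm.pct` value agents report.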

Elastic APM vs Competitors

Feature	Elastic APM	Datadog APM	New Relic APM	Jaeger (OSS)
Open source	✅ Yes	❌ No	❌ No	✅ Yes
Self-hosted option	✅ Yes	❌ No	❌ No	✅ Yes
ELK integration	✅ Native	⚠️ Connector only	⚠️ Connector only	❌ No
Service Map	✅ Auto-generated	✅ Auto-generated	✅ Auto-generated	⚠️ Limited
Machine learning anomaly	✅ Built-in ML	✅ Built-in	✅ Built-in	❌ No
RUM (Browser tracing)	✅ Yes	✅ Yes	✅ Yes	❌ No
Tail-based sampling	✅ Yes	✅ Yes	✅ Yes	✅ Yes
Pricing	Free Basic tier (self-hosted); ML features need a paid subscription	Per-host/GB	Per-host	Free
Log correlation	✅ Native (ECS)	✅ Yes	✅ Yes	❌ No

Log Correlation: The Killer Feature of ELK + APM

Because Elastic APM and Elasticsearch share the same data platform, a single click in a trace waterfall can jump directly to the correlated application logs for that exact transaction:

Log correlation requires your application logs to include trace.id and span.id. The Elastic Common Schema (ECS) logging libraries (e.g., ecs-logging-java, ecs-logging-nodejs) inject these automatically.
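What "inject these automatically" amounts to can be sketched with the standard library alone. The formatter below is a hand-rolled stand-in for an ECS logging library. The context variables that hold the active ids are hypothetical, since a real agent exposes the current trace context itself.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Hypothetical holders for the active trace context.
current_trace_id = contextvars.ContextVar("trace_id", default=None)
current_span_id = contextvars.ContextVar("span_id", default=None)


class TraceContextFormatter(logging.Formatter):
    """Emit one JSON object per log line, with trace.id and span.id when set."""

    def format(self, record: logging.LogRecord) -> str:
        doc = {
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "log.level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        if current_trace_id.get():
            doc["trace.id"] = current_trace_id.get()
        if current_span_id.get():
            doc["span.id"] = current_span_id.get()
        return json.dumps(doc)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(TraceContextFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
current_span_id.set("00f067aa0ba902b7")
logger.info("charging card")
log_line = json.loads(stream.getvalue())
```

Once every log line carries `trace.id`, jumping from a trace to its logs is a plain filter on that field.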

Summary: Elastic APM Architecture at a Glance

Decision Checklist: When to Use Elastic APM

✅ You already run or plan to run the ELK stack
✅ You need self-hosted, open-source APM (data sovereignty, cost control)
✅ You want automatic log-trace-metric correlation in one platform
✅ You operate a polyglot microservices architecture (Java + Node + Python + Go)
✅ You need browser-side Real User Monitoring correlated with backend traces
⚠️ For managed cloud APM with minimal ops overhead → consider Elastic Cloud (paid)
⚠️ For pure tracing without the ELK stack → consider Jaeger + Prometheus

