Skip to content

Transaction Success Rate

In system design, while individual Service Error Rates tell you if a specific API or microservice is failing, they do not tell the full story from the customer's perspective. The Transaction Success Rate (TSR) shifts the focus from "Is the API healthy?" to "Is the user able to complete their overall goal?"

Why Basic Error Rates Are Not Enough

A single business transaction (e.g., "User check out") usually relies on multiple sequential API calls across different microservices.

Imagine an e-commerce checkout flow containing three steps:

  1. POST /cart/validate (Inventory Service)
  2. POST /payment/charge (Payment Service)
  3. POST /order/finalize (Order Service)

If the Payment Service goes completely offline, your checkout transaction success rate will drop to 0%, because no user can complete a purchase. However, if you look at your global API Error Rates across all three microservices, it might only show a 33% error rate (since the Cart validation might still be succeeding). This discrepancy hides catastrophic business failures behind seemingly "moderate" technical metrics.


Architectural Visualization

A distributed transaction often spans multiple services. Monitoring tools track a unique transaction_id or trace_id to determine if the entire business flow reached a successful terminal state.

Code Example: Tracking Business Transactions

Below is an architectural example of tracking the complete lifecycle of a business transaction in Node.js. Instead of relying purely on HTTP-level metrics, we create specific business-level counters.

javascript
const express = require("express");
const promClient = require("prom-client");

const app = express();
const registry = new promClient.Registry();

// 1. Define Business-Level Metrics (Not HTTP-Level)
const businessTransactionsTotal = new promClient.Counter({
  name: "business_transactions_total",
  help: "Total number of business transactions processed",
  labelNames: ["transaction_type", "state", "failure_reason"],
});

registry.registerMetric(businessTransactionsTotal);

// 2. Mock Microservice Calls
const validateInventory = async () => true;
const processPayment = async () => {
  // Simulate a payment integration randomly failing
  if (Math.random() > 0.7) throw new Error("Payment Gateway Timeout");
  return { transactionId: "tok_123" };
};
const createOrder = async () => true;

// 3. The Business Transaction Route
app.post("/api/checkout", async (req, res) => {
  // A. Indicate the transaction started
  businessTransactionsTotal.inc({
    transaction_type: "checkout",
    state: "attempted",
  });

  try {
    // B. Execute the multi-step transaction
    await validateInventory(); // Step 1
    await processPayment(); // Step 2
    await createOrder(); // Step 3

    // C. Record a SUCCESSFUL business transaction
    businessTransactionsTotal.inc({
      transaction_type: "checkout",
      state: "successful",
    });

    res.status(200).send({ message: "Checkout completed successfully!" });
  } catch (error) {
    // D. Record a FAILED business transaction and categorize the reason
    businessTransactionsTotal.inc({
      transaction_type: "checkout",
      state: "failed",
      failure_reason: error.message,
    });

    res.status(500).send({ error: "Checkout failed to complete." });
  }
});

// 4. Expose the metrics
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", registry.contentType);
  res.end(await registry.metrics());
});

app.listen(3000, () => {
  console.log("Server listening on port 3000 with Transaction Monitoring");
});

Calculating the TSR in PromQL

To alert when the Transaction Success Rate drops too low over a 5-minute rolling window, you would query Prometheus:

txt
# Calculate: (Successful Checkouts / Attempted Checkouts) * 100

(
  sum(rate(business_transactions_total{transaction_type="checkout", state="successful"}[5m]))
  /
  sum(rate(business_transactions_total{transaction_type="checkout", state="attempted"}[5m]))
) * 100

If this value drops below 99%, it usually points to a critical business outage that needs immediate attention, even if standard API error rates look fine.

Best Practices

  1. Align with Business Criticality: Don't track every single flow as a transaction. Focus on critical paths: Checkouts, User Registrations, Password Resets, Data Exports.
  2. Use Distributed Tracing: Once the metrics alert you that the success rate has dropped, use OpenTelemetry tracing with a trace_id injected into the header to inspect where the transaction is breaking.
  3. Handle Asynchronous Transactions: If your transaction spans hours (e.g., waiting for manual KYC verification), track the intermediate transition states (e.g., state="pending") rather than keeping a single synchronous HTTP request open.

Released under the ISC License.