Performance & Load Testing

Analyzing & Reporting Results

Collecting load-test data is the easy part. The hard part is extracting the three answers that every stakeholder actually needs: Where does the system break? How close to that limit are we today? How much runway is left before we need to add capacity or fix the bottleneck? Those three questions map directly to the concepts of saturation points, percentile curves, and capacity headroom. Understanding how to read and communicate these correctly is the difference between a performance report that drives decisions and one that gets filed and forgotten.

Reading Percentile Curves

A percentile latency curve (a "latency-vs-load" curve) plots a pXX latency metric against increasing offered load (RPS or concurrency). It should not be confused with a latency CDF, which shows the distribution of latencies at a single fixed load level. The shape of the latency-vs-load curve tells you far more than a single number ever can.

The canonical shape at low load is flat: p50, p95, and p99 are all close together and stable. As you approach the service's natural throughput ceiling, the curves begin to diverge. The p99 climbs first and fastest because high percentiles are the first to absorb queueing delay. This divergence is the earliest warning of saturation, and it appears well before error rates increase. By the time you see 5xx errors climbing, you are already well past the safe operating zone.

[Figure: p50, p95, and p99 latency (ms) plotted against offered load (RPS). Current load is marked at ~400 RPS and the saturation point at ~630 RPS; the headroom zone lies below the saturation point and the saturation zone above it.]
Percentile latency curves: p50 stays flat, p99 diverges first at the saturation point. The green zone represents safe operating headroom.

Key observations from the shape of these curves:

  • Flat region (headroom zone): All percentiles are stable and close together. The system has spare capacity; adding load does not meaningfully increase latency. This is where you want to operate in steady state.
  • Knee / inflection point: The point where p99 begins to diverge sharply from p50 and p95. This is the onset of queueing — the system is approaching its processing limit and requests are waiting. This is the saturation point.
  • Divergence magnitude: If p99 is 3–5x p50, that is usually expected variance from garbage collection pauses and OS scheduling. If it is 20x or more at moderate load, you likely have a structural problem: lock contention, database connection pool exhaustion, or a single-threaded downstream service.

Tail latency tells you about your worst-affected customers, not average customers. A p50 of 30 ms means half your requests complete in 30 ms or less. A p99 of 2 seconds means 1 in 100 requests takes 2 seconds or longer, and for a service doing 1000 RPS, that is 10 slow requests every second. The p99 is typically the number your SLO is written against, and it is the number you must optimize.
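
The divergence check and the "slow requests per second" arithmetic above can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the function names, the nearest-rank percentile method, and the "investigate" label for ratios between 5x and 20x are all assumptions added here for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_summary(latencies_ms, rps):
    """Summarize the tail at one load level."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "divergence": p99 / p50,            # how far the tail sits above the median
        "slow_requests_per_sec": rps / 100,  # 1% of requests are at or beyond p99
    }


def classify_divergence(ratio):
    """Thresholds follow the text; the middle band is a hypothetical label."""
    if ratio <= 5:
        return "expected"
    if ratio >= 20:
        return "structural"
    return "investigate"


if __name__ == "__main__":
    # Hypothetical sample: most requests take 30 ms, a few take 2 s.
    samples = [30.0] * 197 + [2000.0] * 3
    summary = tail_summary(samples, rps=1000)
    print(summary, classify_divergence(summary["divergence"]))
```

Nearest-rank is used because it always returns an observed latency; interpolating methods give slightly different values on small samples.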

Identifying Saturation Points

The saturation point is the offered load at which the system transitions from a flat regime (latency roughly constant as load rises) to a super-linear one (latency grows faster than load). Correctly identifying it requires looking at multiple signals simultaneously, not just latency.

In Grafana or a k6 dashboard, open four panels side by side during a ramp-up test: p99 latency, active CPU percentage, connection wait times (for databases: pool wait; for web servers: accept queue depth), and error rate. The saturation point is the load value where two or more of these metrics begin their upward inflection. Single-metric anomalies are often noise; correlated inflection is structural saturation.
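
The "two or more correlated inflections" rule can also be applied to exported per-step data instead of being read off the panels. The sketch below assumes each signal is summarized as one value per load step, and it defines "inflection" as reaching a multiple of the value at the lowest load step (with an absolute floor to avoid false positives on near-zero baselines). Those definitions, the function names, and the sample numbers are assumptions for illustration. The rule suits signals that stay flat before saturation (latency, pool wait, error rate); CPU rises with load even on a healthy system and would need a different rule, such as an absolute utilization threshold.

```python
def first_inflection(loads, values, multiplier=2.0, floor=0.0):
    """First load step where the signal reaches max(multiplier * baseline, floor).

    The baseline is the value at the lowest load step. Returns None if the
    signal never reaches the threshold.
    """
    threshold = max(values[0] * multiplier, floor)
    for load, value in zip(loads[1:], values[1:]):
        if value >= threshold:
            return load
    return None


def correlated_saturation(loads, signals, floors=None, multiplier=2.0, min_signals=2):
    """Lowest load at which at least min_signals signals have inflected."""
    floors = floors or {}
    inflections = []
    for name, values in signals.items():
        load = first_inflection(loads, values, multiplier, floors.get(name, 0.0))
        if load is not None:
            inflections.append(load)
    if len(inflections) < min_signals:
        return None  # a single-signal anomaly is treated as noise
    return sorted(inflections)[min_signals - 1]


if __name__ == "__main__":
    # Hypothetical per-step readings from a ramp-up run.
    loads = [100, 300, 500, 700, 900]
    signals = {
        "p99_ms": [40, 42, 45, 120, 600],
        "pool_wait_ms": [1, 1, 2, 9, 50],
        "error_rate": [0, 0, 0, 0, 0.03],
    }
    floors = {"pool_wait_ms": 5, "error_rate": 0.01}
    print(correlated_saturation(loads, signals, floors))
```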

    // k6 ramp-up script: an exploratory run used to locate the saturation point.
    // Run this, then inspect the CSV output to find where p99 crosses 2x its baseline value.
    // Note: stage targets are virtual users (VUs), not RPS. With sleep(0.1), each VU
    // issues at most about 10 requests per second.
    import http from 'k6/http';
    import { check, sleep } from 'k6';

    export const options = {
      stages: [
        { duration: '2m', target: 100 }, // warm up
        { duration: '3m', target: 300 },
        { duration: '3m', target: 500 },
        { duration: '3m', target: 700 }, // expect saturation here for a typical 4-vCPU node
        { duration: '3m', target: 900 },
        { duration: '2m', target: 0 },   // cool down; watch recovery time too
      ],
      thresholds: {}, // no gates; this is an exploratory run
    };

    export default function () {
      const res = http.get('http://api-svc:8080/v1/healthz');
      check(res, { 'status 200': (r) => r.status === 200 });
      sleep(0.1);
    }

After the run, extract the saturation inflection point programmatically rather than eyeballing it:

    #!/usr/bin/env python3
    # find_saturation.py: reads k6 CSV output and finds the first 30 s window where
    # p99 exceeds 2x the baseline p99 (taken from the first window of the run).
    # Usage: k6 run --out csv=results.csv ramp.js && python3 find_saturation.py results.csv
    import csv
    import statistics
    import sys

    WINDOW_SECONDS = 30
    SATURATION_MULTIPLIER = 2.0


    def p99(values):
        # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile
        return statistics.quantiles(values, n=100)[98]


    buckets = {}  # {window start (epoch seconds): [durations in ms]}
    with open(sys.argv[1], newline="") as f:
        for row in csv.DictReader(f):
            if row.get("metric_name") != "http_req_duration":
                continue
            t = int(float(row["timestamp"])) // WINDOW_SECONDS * WINDOW_SECONDS
            buckets.setdefault(t, []).append(float(row["metric_value"]))

    # statistics.quantiles needs at least two samples, so skip near-empty windows
    windows = sorted(t for t, values in buckets.items() if len(values) >= 2)
    if len(windows) < 2:
        print("Not enough data")
        sys.exit(1)

    baseline_p99 = p99(buckets[windows[0]])
    print(f"Baseline p99: {baseline_p99:.1f} ms")

    for t in windows[1:]:
        window_p99 = p99(buckets[t])
        rps = len(buckets[t]) / WINDOW_SECONDS
        if window_p99 >= baseline_p99 * SATURATION_MULTIPLIER:
            elapsed = t - windows[0]  # seconds since the first window of the run
            print(f"Saturation detected at t={elapsed}s, ~{rps:.0f} RPS, p99={window_p99:.1f} ms")
            break
    else:
        # for/else: runs only when the loop finished without hitting break
        print("No saturation detected; system handled full load range")

A common practice at large-scale operators is to track the saturation point per release as a single number: "this service saturates at N RPS per instance." The ratio of current peak traffic to that number is the per-instance utilization. If utilization exceeds 60–70%, capacity planning is triggered immediately, not after an incident. Build this number into your runbooks.

Measuring and Communicating Capacity Headroom

Capacity headroom is the gap between your current peak observed load and your saturation point, expressed as a percentage of the saturation point. A headroom of 40% means current peak uses 60% of saturation capacity, so traffic can grow by about 67% (0.40 / 0.60) from today's peak before performance degrades. A headroom of 5% means one unexpected traffic spike (a viral tweet, a downstream retry storm, a cache flush) lands you in an incident.

The formula is simple: headroom = (saturation_rps - current_peak_rps) / saturation_rps × 100. The art is in agreeing what "current peak" means: use the p99 of your hourly peak RPS over the last 30 days, not the average, and not a synthetic estimate.

When reporting headroom to engineering leadership or SRE reviews, anchor the number to a time-to-exhaustion estimate. If traffic is growing at 15% month-over-month and you have 35% headroom today, current peak is 65% of saturation, and compounding growth closes that gap in ln(1/0.65) / ln(1.15) ≈ 3.1 months. Because the fix has to land before that point, you have roughly 2–3 months to act. This framing converts an abstract percentage into a concrete deadline that drives prioritization.
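
The headroom formula and the time-to-exhaustion estimate can be combined into a small helper. This is a sketch under one stated assumption: traffic grows at a constant compounding monthly rate. The function names are illustrative, and the sample values are the ~630 RPS saturation and ~400 RPS current load from the figure plus the 35% / 15% example above.

```python
import math


def headroom_pct(saturation_rps, current_peak_rps):
    """Headroom as a percentage of the saturation point."""
    return (saturation_rps - current_peak_rps) / saturation_rps * 100


def months_to_exhaustion(headroom_percent, monthly_growth):
    """Months until peak traffic reaches saturation, assuming compounding growth.

    monthly_growth is a fraction, e.g. 0.15 for 15% month-over-month.
    """
    if headroom_percent <= 0:
        return 0.0        # already at or past saturation
    if monthly_growth <= 0:
        return math.inf   # flat or shrinking traffic never exhausts headroom
    utilization = 1 - headroom_percent / 100
    # Solve utilization * (1 + g)^m = 1 for m.
    return math.log(1 / utilization) / math.log(1 + monthly_growth)


if __name__ == "__main__":
    print(f"headroom: {headroom_pct(630, 400):.1f}%")
    print(f"months to exhaustion: {months_to_exhaustion(35, 0.15):.1f}")
```

Real traffic growth is rarely this smooth, so the result is best reported as a rough deadline rather than a forecast.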

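    # PromQL: compute capacity headroom (%) against the measured saturation RPS.
    # Assumes a recording rule that keeps the job label:
    #   job:http_requests:rate1m = sum by (job) (rate(http_requests_total[1m]))
    # and a static config value exposed as a gauge, one series per job:
    #   service_saturation_rps{job="checkout-api"} = 630
    # on (job) matches the series even if the gauge carries extra labels.
    # max_over_time takes the absolute 30-day peak, which is more conservative
    # than a p99-of-peaks definition of "current peak".
    (
      service_saturation_rps{job="checkout-api"}
      - on (job)
      max_over_time(job:http_requests:rate1m{job="checkout-api"}[30d])
    )
    / on (job) service_saturation_rps{job="checkout-api"}
    * 100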

Headroom calculated from a load test run in a staging environment is not the same as production headroom. Production has real user sessions, cache warm-up state, background jobs, and connection multiplexing that staging rarely replicates. Always calibrate: run the same ramp-up test against a replay of production traffic (using tools like Gatling's recorder or k6 scripts generated from captured HAR files) and compare saturation points between environments. A 20% discrepancy is common; a 60% discrepancy means your staging environment is not representative and all budget decisions based on it are suspect.

Writing Performance Reports That Drive Action

A performance report that does not lead to a decision is wasted work. Structure every report around three sections: Findings (what the data shows, with concrete numbers and annotated graphs), Risk Assessment (headroom percentage, time-to-exhaustion at current growth rate, which percentile is closest to SLO breach), and Recommended Actions (ranked by impact-to-effort ratio, with an owner and a deadline). Avoid adjectives — "latency is high" is meaningless; "p99 at 600 RPS is 340 ms, 36% above the 250 ms budget" is actionable.
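
One way to keep findings free of adjectives is to generate the sentence from the numbers. The helper below is a hypothetical sketch (its name and wording template are not from any reporting tool); it reproduces the example phrasing above from the measured value and the budget.

```python
def describe_finding(percentile_name, load_rps, observed_ms, budget_ms):
    """Render a latency finding as a concrete, budget-relative statement."""
    delta_pct = (observed_ms - budget_ms) / budget_ms * 100
    if delta_pct > 0:
        relation = f"{delta_pct:.0f}% above"
    elif delta_pct < 0:
        relation = f"{-delta_pct:.0f}% below"
    else:
        relation = "exactly at"
    return (
        f"{percentile_name} at {load_rps} RPS is {observed_ms} ms, "
        f"{relation} the {budget_ms} ms budget"
    )


if __name__ == "__main__":
    print(describe_finding("p99", 600, 340, 250))
```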

For recurring reports (weekly CI gates, quarterly capacity reviews), track the saturation RPS and p99 at peak load as a time-series chart. A saturation RPS that steadily decreases release-over-release means you are accumulating performance debt. A p99 that is slowly climbing toward the SLO threshold is a pre-incident signal that should trigger a code review before it becomes a production fire.
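
The release-over-release check can be automated once saturation RPS is recorded per release. The sketch below assumes a simple list of (release, saturation RPS) pairs, oldest first, and flags debt when saturation fell in each of the last three releases. The data shape, the three-release window, and the release names are assumptions chosen for illustration.

```python
def saturation_changes(history):
    """Percentage change in saturation RPS for each release versus the previous one.

    history is a list of (release, saturation_rps) pairs, oldest first.
    """
    return [
        (release, (current - previous) / previous * 100)
        for (_, previous), (release, current) in zip(history, history[1:])
    ]


def is_accumulating_debt(history, min_consecutive=3):
    """True if saturation RPS fell in each of the last min_consecutive releases."""
    recent = saturation_changes(history)[-min_consecutive:]
    return len(recent) == min_consecutive and all(pct < 0 for _, pct in recent)


if __name__ == "__main__":
    # Hypothetical per-release saturation measurements.
    history = [("v1", 630), ("v2", 610), ("v3", 590), ("v4", 560)]
    for release, pct in saturation_changes(history):
        print(f"{release}: {pct:+.1f}%")
    print("accumulating performance debt:", is_accumulating_debt(history))
```

Requiring several consecutive drops filters out run-to-run measurement noise, at the cost of reacting one or two releases later.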

The most impactful performance improvements at large companies consistently come from fixing the top-1 bottleneck, not from general optimization passes. Use the percentile curve and CPU/memory correlation data to identify which specific operation occupies the knee of the curve, then fix that one thing. After fixing it, re-run the ramp-up and find the new knee. This is the universal loop of performance engineering.