Performance & Load Testing

Project: Load Test a Service

18 min Lesson 10 of 28


The previous nine lessons gave you vocabulary, tools, and techniques. This lesson closes the loop: you will plan, script, execute, and report a complete load test against a realistic HTTP API. The steps follow the shape of a production readiness review as practiced by SRE and performance engineering teams at companies such as Google and Netflix, from writing a test plan that stakeholders can sign off on to filing a ticket that names the root cause of a regression.

Step 1 — Write a Test Plan Before Touching the Terminal

A load test without a plan is an experiment without a hypothesis. Your test plan answers five questions before a single packet is sent:

  1. Service Under Test (SUT): Which endpoints, which environment (staging or prod-mirror), and which data-set?
  2. Success Criteria (SLOs): What does "pass" look like? Express as P99 latency, error rate, and throughput targets — not vague adjectives.
  3. Load Profile: Steady-state RPS, ramp duration, soak duration, spike shape. Reference your production traffic percentiles from Grafana/Datadog so the numbers are grounded.
  4. Scope of Observability: Which dashboards, logs, and profiler samplers will be active during the run?
  5. Rollback and Blast-Radius Limits: If the SUT degrades below 20 % of capacity, who calls the halt? What circuit breakers or rate-limit headers prevent accidental DoS on shared dependencies?

A one-page test plan shared with the backend, DBA, and on-call SRE before the test run prevents the all-too-common post-mortem question: "Why did you hammer the production DB replica at 2 pm on a Tuesday?"
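
One way to keep the plan honest is to store it as data and refuse to start a run while any of the five questions is unanswered. The following is a minimal sketch, not a standard format: the field names (`sut`, `slos`, `loadProfile`, `observability`, `haltCriteria`) and every value in the example plan are hypothetical.

```javascript
// Hypothetical test-plan shape: one field per question in the list above.
const REQUIRED_FIELDS = ['sut', 'slos', 'loadProfile', 'observability', 'haltCriteria'];

// Return the names of required fields that are absent or empty.
function missingPlanFields(plan) {
  return REQUIRED_FIELDS.filter((field) => {
    const value = plan[field];
    if (value === undefined || value === null) return true;
    // Empty objects and empty arrays count as "not answered".
    if (typeof value === 'object') return Object.keys(value).length === 0;
    return String(value).trim() === '';
  });
}

// Illustrative plan; all names and numbers are assumptions.
const plan = {
  sut: { endpoints: ['/v1/products', '/v1/orders'], environment: 'staging' },
  slos: { p99Ms: 500, maxErrorRate: 0.01 },
  loadProfile: { steadyVus: 200, ramp: '3m', hold: '20m' },
  observability: ['api-dashboard', 'traces', 'profiler'],
  haltCriteria: {}, // left empty on purpose to show the check firing
};

console.log(missingPlanFields(plan)); // [ 'haltCriteria' ]
```

A check like this could run as the first step of the test job, so a plan with no halt criteria never reaches the load generator.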

Step 2 — Prepare the Environment

Use a staging environment that mirrors production at realistic scale: same instance types, same DB size (or a sanitized prod snapshot), same feature flags, and the same CDN behavior (if the CDN is disabled for the test, record that in the plan). A test against an under-provisioned staging cluster tells you little about production capacity.

Run the k6 load generator on a machine that is not on the same host as the SUT, or use Grafana Cloud k6 or the open-source k6 Operator for distributed runs. Network RTT between load generator and SUT should be realistic: same AWS region and VPC, or otherwise comparable to real client geography. Export baseline metrics from Prometheus/Datadog to a snapshot so you can diff before and after.

    # Verify staging mirrors prod instance types
    kubectl get nodes -o wide --context=staging
    kubectl get nodes -o wide --context=prod

    # Print the image tag deployed in staging
    # (repeat with --context=prod and compare the two values)
    kubectl get deployment api-server -n production --context=staging \
      -o jsonpath='{.spec.template.spec.containers[0].image}'

    # Snapshot the current per-handler P99 from Prometheus as a baseline
    curl -s http://prometheus.internal:9090/api/v1/query \
      --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler))' \
      | jq '.data.result' > baseline_p99.json
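
To diff before and after, the baseline snapshot can be compared with a second snapshot taken during the run. The sketch below assumes both inputs have the shape of the `.data.result` array returned by a Prometheus instant query, with a `handler` label as in the query above; the function names and the sample numbers are illustrative.

```javascript
// Each entry looks like:
// { metric: { handler: "/v1/orders" }, value: [<unix timestamp>, "<seconds as a string>"] }
function p99ByHandler(result) {
  const out = new Map();
  for (const series of result) {
    out.set(series.metric.handler, Number(series.value[1]));
  }
  return out;
}

// Compare two snapshots and return per-handler change, worst regression first.
function diffP99(baseline, current) {
  const before = p99ByHandler(baseline);
  const rows = [];
  for (const [handler, after] of p99ByHandler(current)) {
    const base = before.get(handler);
    // Skip handlers with no usable baseline (missing, zero, or "NaN" when there was no traffic).
    if (!Number.isFinite(base) || base === 0 || !Number.isFinite(after)) continue;
    rows.push({
      handler,
      beforeMs: Math.round(base * 1000),
      afterMs: Math.round(after * 1000),
      changePct: Math.round(((after - base) / base) * 1000) / 10,
    });
  }
  return rows.sort((a, b) => b.changePct - a.changePct);
}

// Illustrative data only.
const baseline = [
  { metric: { handler: '/v1/products' }, value: [1718200000, '0.1'] },
  { metric: { handler: '/v1/orders' }, value: [1718200000, '0.2'] },
];
const duringTest = [
  { metric: { handler: '/v1/products' }, value: [1718203600, '0.11'] },
  { metric: { handler: '/v1/orders' }, value: [1718203600, '0.4'] },
];
console.log(diffP99(baseline, duringTest));
```

Sorting by percentage change puts the handler that regressed most at the top, which is usually where the investigation should start.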

Step 3 — Author the k6 Script

Your script should model real user journeys, not arbitrary endpoint hammering. For an e-commerce checkout API the canonical journeys are: browse catalog, view product detail, add to cart, checkout. Weight them by your analytics data; browse traffic is typically many times checkout volume, and a 10:1 ratio is a reasonable starting assumption until you have real numbers.
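
Weighting can be done with a small helper that picks a journey in proportion to its share of traffic. This is a sketch under assumptions: the 10:1 weights are placeholders for your own analytics figures, and `pickJourney` is a hypothetical helper, not part of the k6 API.

```javascript
// Hypothetical weights: ten browse sessions for every checkout.
const JOURNEYS = [
  { name: 'browse', weight: 10 },
  { name: 'checkout', weight: 1 },
];

// Pick a journey with probability proportional to its weight.
// `rand` is a number in [0, 1); it is a parameter so the choice can be tested.
function pickJourney(journeys, rand = Math.random()) {
  const total = journeys.reduce((sum, j) => sum + j.weight, 0);
  let remaining = rand * total;
  for (const j of journeys) {
    if (remaining < j.weight) return j.name;
    remaining -= j.weight;
  }
  // Guard against floating-point rounding at the upper edge.
  return journeys[journeys.length - 1].name;
}

console.log(pickJourney(JOURNEYS, 0.5));  // browse
console.log(pickJourney(JOURNEYS, 0.95)); // checkout
```

Inside a k6 default function, each iteration would call `pickJourney(JOURNEYS)` and run only the matching group, so the request mix converges on the weights over many iterations.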

Use k6 scenarios to encode multiple executor shapes in a single file, so a CI gate, a soak test, and a spike test share the same script without duplication.
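
The VU targets in each scenario should follow from the load profile rather than guesswork. Little's Law (concurrency = throughput × time in system) gives a first estimate. The sketch below assumes a closed model in which each VU runs iterations back to back; the request count and timings in the example are assumptions, not measurements.

```javascript
// VUs needed to sustain a target request rate.
// iterationSeconds = think time + total response time for one pass through the script.
function vusForTargetRps(targetRps, requestsPerIteration, iterationSeconds) {
  return Math.ceil((targetRps * iterationSeconds) / requestsPerIteration);
}

// The inverse: request rate a given number of VUs can be expected to produce.
function expectedRps(vus, requestsPerIteration, iterationSeconds) {
  return (vus * requestsPerIteration) / iterationSeconds;
}

// Assumed example: 3 requests per iteration, about 2.5 s of think time
// plus about 0.5 s of response time, so roughly 3 s per iteration.
console.log(vusForTargetRps(200, 3, 3)); // 200
console.log(expectedRps(200, 3, 3));     // 200
```

If latency rises under load, iterations take longer and the same VU count produces fewer requests per second. That is why a VU-based executor can under-deliver load exactly when the SUT is struggling.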

    // load-test.js: production-grade k6 script
    import http from 'k6/http';
    import { check, sleep, group } from 'k6';
    import { Rate, Trend } from 'k6/metrics';
    import { SharedArray } from 'k6/data';

    const errorRate = new Rate('errors');
    const checkoutDur = new Trend('checkout_duration', true); // true = values are durations in ms

    export const options = {
      scenarios: {
        // 1. Steady-state: ramp to prod peak, hold 20 min
        steady_state: {
          executor: 'ramping-vus',
          startVUs: 0,
          stages: [
            { duration: '3m', target: 200 },  // ramp
            { duration: '20m', target: 200 }, // hold at prod peak
            { duration: '2m', target: 0 },    // drain
          ],
          gracefulRampDown: '30s',
        },
        // 2. Spike: 5x traffic for 60 s then back
        spike: {
          executor: 'ramping-vus',
          startTime: '25m', // starts once the 25-minute steady-state scenario has finished
          startVUs: 0,
          stages: [
            { duration: '10s', target: 1000 },
            { duration: '60s', target: 1000 },
            { duration: '10s', target: 0 },
          ],
        },
      },
      // SLO thresholds (fail the test if breached)
      thresholds: {
        http_req_duration: ['p(99)<500', 'p(95)<250'],
        http_req_failed: ['rate<0.01'], // <1 % error rate
        checkout_duration: ['p(99)<800'],
        errors: ['rate<0.005'],
      },
    };

    const BASE = __ENV.BASE_URL || 'https://api-staging.example.com';

    // Token pool: pre-generated test user JWTs loaded from a JSON file via SharedArray
    const users = new SharedArray('users', function () {
      return JSON.parse(open('./test-users.json'));
    });

    export default function () {
      const user = users[__VU % users.length];
      const headers = {
        'Authorization': `Bearer ${user.token}`,
        'Content-Type': 'application/json',
      };

      group('browse', function () {
        const res = http.get(`${BASE}/v1/products?page=1&limit=20`, { headers });
        const ok = check(res, {
          'browse 200': (r) => r.status === 200,
          'browse latency ok': (r) => r.timings.duration < 300,
        });
        // Record passes as well as failures; adding only on failure
        // would make the 'errors' rate 100 % after the first error.
        errorRate.add(!ok);
      });

      sleep(Math.random() * 2 + 0.5); // think time: 0.5-2.5 s

      group('checkout', function () {
        const start = Date.now();

        // Add to cart
        let res = http.post(`${BASE}/v1/cart/items`,
          JSON.stringify({ product_id: 'sku-42', qty: 1 }), { headers });
        errorRate.add(!check(res, { 'cart 201': (r) => r.status === 201 }));

        // Place order
        res = http.post(`${BASE}/v1/orders`,
          JSON.stringify({ payment_method: 'test_card' }), { headers });
        errorRate.add(!check(res, { 'order 201': (r) => r.status === 201 }));

        checkoutDur.add(Date.now() - start);
      });

      sleep(1);
    }

Pre-populate test data (user accounts, product catalog, existing carts) in the DB before the test run. Generating data inside the script skews latency measurements and produces unrealistic DB access patterns. In large organizations, data setup is often a separate offline pipeline that runs hours before the test window.
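
As a rough illustration of the offline step, the sketch below builds a user pool in the shape the script reads from `test-users.json` (an array of objects with a `token` field). It is meant to run under Node.js before the test window, not inside k6. The `id` format and the token values are placeholders; a real pipeline would create the accounts in the staging DB and request signed JWTs from the staging auth service.

```javascript
// Build `count` test users. Token values are placeholders, not real JWTs.
function makeTestUsers(count) {
  return Array.from({ length: count }, (_, i) => ({
    id: `loadtest-user-${String(i + 1).padStart(4, '0')}`,
    token: `PLACEHOLDER_TOKEN_${i + 1}`,
  }));
}

// Serialize the pool as the JSON array the k6 script expects.
function serializeUsers(users) {
  return JSON.stringify(users, null, 2);
}

const pool = makeTestUsers(500);
console.log(pool.length, pool[0].id); // 500 loadtest-user-0001
```

Writing the output of `serializeUsers(pool)` to `./test-users.json` next to the script completes the step. Size the pool at or above the peak VU count so that `users[__VU % users.length]` does not hand the same token to several concurrent VUs.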

Step 4 — Run With Full Observability Active

Never run a load test in the dark. Before executing, confirm that your Prometheus scrape intervals are at 15 s or tighter, your distributed traces are sampling at 100 % (or at least 10 % with tail-based sampling on slow traces), and your application profiler (pprof, async-profiler, py-spy) is ready to be triggered on demand.

    # Export results to InfluxDB for Grafana dashboarding
    mkdir -p results
    k6 run \
      --out influxdb=http://influxdb.internal:8086/k6 \
      --out json=results/run-$(date +%Y%m%d-%H%M).json \
      -e BASE_URL=https://api-staging.example.com \
      load-test.js

    # In a second terminal: watch the error rate over the last minute
    # (mean of the 0/1 samples; last() would show only the most recent request)
    watch -n5 'curl -sG "http://influxdb.internal:8086/query" \
      --data-urlencode "db=k6" \
      --data-urlencode "q=SELECT mean(\"value\") FROM \"http_req_failed\" WHERE time > now() - 1m" \
      | jq ".results[0].series[0].values"'

While the test runs, watch four signals simultaneously: SUT CPU and memory (are we CPU-bound or heading for an OOM kill?), DB connection pool utilization (pool exhaustion is a frequent cause of latency cliffs as request rates climb), P99 latency trend (is it stable or creeping?), and GC pause frequency for JVM/Go services.

[Figure: End-to-End Load Test Pipeline. Load Generator (k6 / JMeter, 200 VUs) → HTTP API Service (SUT; K8s Deployment, 3 replicas) → PostgreSQL (Primary + Replica, pool: 100 conns) → Observability Stack (Prometheus, Jaeger / Tempo, InfluxDB + Grafana, pprof / py-spy).]
The complete load test pipeline: k6 drives load, the SUT talks to its dependencies, and the full observability stack captures metrics, traces, and profiler samples simultaneously.

Step 5 — Analyze Results: From Numbers to Root Cause

Raw output from k6 or InfluxDB is data, not insight. Your job is to move from "P99 spiked to 1.2 s at 180 VUs" to "connection pool exhaustion on the read replica at ~175 concurrent queries." That chain of reasoning requires correlating three layers:

  1. Client-side (k6): When exactly did latency degrade? What RPS and VU count correlated? Did error rate jump simultaneously or lag the latency climb?
  2. Service-side (APM / traces): Which span accounted for the added latency — the application code, the DB query, or network? Use Jaeger or Tempo to find the slowest traces during the degradation window.
  3. Infrastructure (Prometheus): Were any resource limits hit? Classic signals: CPU throttling (container_cpu_cfs_throttled_seconds_total), OOM events, DB connection pool wait time (pgbouncer_client_wait_seconds), GC pause (jvm_gc_pause_seconds).
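
    # PromQL: find DB pool saturation during the test window
    # (adjust time range to match your test run)
    sum(pgbouncer_client_wait_seconds) by (database)[30m:15s]

    # PromQL: CPU throttling per container
    sum(rate(container_cpu_cfs_throttled_seconds_total[1m])) by (container)

    # k6 end-of-test summary: extract the P99 and error rate.
    # Assumes the run also passed --summary-export=results/summary.json and
    # --summary-trend-stats="avg,p(95),p(99)". The --out json file is a stream
    # of individual metric points and has no .metrics object. Field names
    # follow the summary-export format; verify them against your k6 version.
    jq '.metrics | {
      p99_ms: .http_req_duration."p(99)",
      p95_ms: .http_req_duration."p(95)",
      error_rate: .http_req_failed.value,
      rps: .http_reqs.rate
    }' results/summary.json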
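
The first client-side question, when exactly latency degraded, can be answered mechanically once P99 is bucketed over time. The following sketch assumes you have already reduced the k6 output to an array of `{ t, vus, p99Ms }` samples; that shape, the function name, and the sample values are all hypothetical.

```javascript
// samples: [{ t: <seconds since test start>, vus: <active VUs>, p99Ms: <latency> }, ...]
// Returns the first sample of the first run of `minConsecutive` samples above
// the SLO, or null if latency never stays above it that long.
function findDegradationOnset(samples, sloMs, minConsecutive = 3) {
  let runStart = -1;
  for (let i = 0; i < samples.length; i++) {
    if (samples[i].p99Ms > sloMs) {
      if (runStart === -1) runStart = i;
      if (i - runStart + 1 >= minConsecutive) return samples[runStart];
    } else {
      runStart = -1; // a single good sample resets the run
    }
  }
  return null;
}

// Illustrative series: one isolated blip at t=30, sustained degradation from t=60.
const series = [
  { t: 0, vus: 100, p99Ms: 210 },
  { t: 15, vus: 120, p99Ms: 230 },
  { t: 30, vus: 140, p99Ms: 620 },
  { t: 45, vus: 160, p99Ms: 240 },
  { t: 60, vus: 170, p99Ms: 540 },
  { t: 75, vus: 180, p99Ms: 780 },
  { t: 90, vus: 190, p99Ms: 1200 },
];
console.log(findDegradationOnset(series, 500)); // { t: 60, vus: 170, p99Ms: 540 }
```

Requiring several consecutive breaches filters out isolated spikes. The returned timestamp and VU count are the window to look up in traces and in the infrastructure dashboards.
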

The most dangerous load test result is a passing one that hides a silent failure mode. Always check that your check() pass rate is 100 %: a 0.5 % error rate at 500 RPS is 2.5 errors per second, which is about 216,000 failed requests per day in production. Thresholds that only gate on P99 latency will miss this.
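
The arithmetic behind that warning is worth keeping at hand when someone argues that a small error rate is acceptable. A minimal sketch, with a hypothetical helper name:

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Convert a request rate and an error rate (as a fraction) into failed requests.
function failedRequests(rps, errorRate) {
  const perSecond = rps * errorRate;
  return { perSecond, perDay: perSecond * SECONDS_PER_DAY };
}

console.log(failedRequests(500, 0.005)); // { perSecond: 2.5, perDay: 216000 }
```

This assumes the rate holds around the clock, so treat the daily figure as an upper bound for services with a quiet overnight period. In k6, check results are also exposed as the built-in `checks` rate metric, so a threshold such as `checks: ['rate>0.99']` can gate on them directly.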

Step 6 — Write the Performance Report

A performance report is a contract between engineering and the business. It should contain: (1) a one-paragraph executive summary with a clear pass/fail verdict against each SLO; (2) a time-series chart showing P50/P95/P99 latency and RPS over the test duration; (3) the identified bottlenecks, ranked by impact; (4) specific, actionable recommendations (not "optimize DB queries" — rather "add a composite index on (user_id, created_at DESC) on the orders table, estimated query time drop from 180 ms to 12 ms based on EXPLAIN ANALYZE"); and (5) a regression risk section noting which changes are safe to ship and which require re-testing.

At companies like LinkedIn and Stripe, performance reports become living documents tracked in the same project management system as engineering tickets. A regression detected in CI references the original baseline report, making the delta undeniable and the fix accountable to a specific PR.

Automate report generation. The k6 --out json flag plus a short Python or Go script can produce a Markdown report, commit it as a CI artifact, and post a Slack summary with pass/fail status in under 30 seconds. That is how you make performance a first-class gate rather than a periodic ritual.
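
The report step can be sketched in a few lines. The example is in JavaScript for consistency with the k6 script, although the same logic ports directly to Python or Go. The input shape is a simplification: `run` is a flat object you would fill from the k6 end-of-test summary, and `renderReport` is a hypothetical helper. The SLO limits match the thresholds in the script; the measured values are invented for illustration.

```javascript
// Compare measured values against SLO limits and render a Markdown summary.
function renderReport(run, slos) {
  const rows = [
    { name: 'P99 latency (ms)', actual: run.p99Ms, limit: slos.p99Ms },
    { name: 'P95 latency (ms)', actual: run.p95Ms, limit: slos.p95Ms },
    { name: 'Error rate', actual: run.errorRate, limit: slos.errorRate },
  ].map((r) => ({ ...r, pass: r.actual < r.limit }));

  const verdict = rows.every((r) => r.pass) ? 'PASS' : 'FAIL';
  const lines = [
    `# Load test report: ${verdict}`,
    '',
    '| SLO | Limit | Actual | Result |',
    '| --- | --- | --- | --- |',
    ...rows.map((r) => `| ${r.name} | < ${r.limit} | ${r.actual} | ${r.pass ? 'pass' : 'fail'} |`),
  ];
  return { verdict, markdown: lines.join('\n') };
}

const slos = { p99Ms: 500, p95Ms: 250, errorRate: 0.01 };
const run = { p99Ms: 612, p95Ms: 231, errorRate: 0.004 }; // invented values
const report = renderReport(run, slos);
console.log(report.markdown);
```

The `verdict` field is what a CI step would use to set its exit status, while the Markdown goes into the build artifact and the chat notification.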

Putting It All Together: The Production-Ready Checklist

  • Test plan reviewed by backend, DBA, and SRE before execution
  • Data pre-populated; no data generation inside the k6 VU loop
  • Staging environment verified to match production instance types and replica count
  • Observability active: Prometheus, tracing, profiler all running
  • Thresholds encode SLOs, not just latency percentiles — include error rate
  • Spike and soak scenarios run in addition to steady-state
  • Root cause confirmed at infrastructure layer before declaring a bottleneck
  • Report includes actionable tickets, not generic advice
  • CI integration gate runs the steady-state scenario on every PR against the critical path

This end-to-end discipline is what separates a performance engineer who ships confidence from one who ships uncertainty. Every tutorial in this course — from Little's Law through profiling CPU flamegraphs — feeds into this single workflow. The output is not a number; it is a signed-off engineering decision about whether a service is ready to carry production load.