Project: Load Test a Service
The previous nine lessons gave you vocabulary, tools, and techniques. This lesson closes the loop: you will plan, script, execute, and report a complete load test against a realistic HTTP API. Every step mirrors how Google SREs and Netflix performance engineers actually run production readiness reviews — from writing a test plan that stakeholders can sign off on, to opening a Jira ticket with a regression root cause.
Step 1 — Write a Test Plan Before Touching the Terminal
A load test without a plan is an experiment without a hypothesis. Your test plan answers five questions before a single packet is sent:
- Service Under Test (SUT): Which endpoints, which environment (staging or prod-mirror), and which data-set?
- Success Criteria (SLOs): What does "pass" look like? Express as P99 latency, error rate, and throughput targets — not vague adjectives.
- Load Profile: Steady-state RPS, ramp duration, soak duration, spike shape. Reference your production traffic percentiles from Grafana/Datadog so the numbers are grounded.
- Scope of Observability: Which dashboards, logs, and profiler samplers will be active during the run?
- Rollback and Blast-Radius Limits: If the SUT degrades below 20 % of capacity, who calls the halt? What circuit breakers or rate-limit headers prevent accidental DoS on shared dependencies?
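One way to keep the five questions honest is to write the plan as data and check it for gaps before review. The sketch below is a minimal illustration, not a standard format: the class names, field names, and validation rules are all assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    p99_latency_ms: float      # latency target at the 99th percentile
    max_error_rate: float      # fraction of failed requests, e.g. 0.01 for 1 %
    min_throughput_rps: float  # requests per second the service must sustain


@dataclass(frozen=True)
class LoadProfile:
    steady_rps: float
    ramp_seconds: int
    soak_seconds: int
    spike_rps: float


@dataclass(frozen=True)
class LoadTestPlan:
    endpoints: tuple[str, ...]
    environment: str
    slo: Slo
    profile: LoadProfile
    dashboards: tuple[str, ...]
    halt_owner: str  # who calls the halt if the SUT degrades


def validate_plan(plan: LoadTestPlan) -> list[str]:
    """Return human-readable gaps; an empty list means nothing obvious is missing."""
    problems = []
    if not plan.endpoints:
        problems.append("no endpoints listed for the SUT")
    if plan.environment not in ("staging", "prod-mirror"):
        problems.append(f"unexpected environment: {plan.environment!r}")
    if plan.slo.p99_latency_ms <= 0:
        problems.append("P99 latency target must be positive")
    if not 0 <= plan.slo.max_error_rate < 1:
        problems.append("error-rate target must be a fraction in [0, 1)")
    if plan.profile.steady_rps < plan.slo.min_throughput_rps:
        problems.append("steady-state load is below the throughput target")
    if plan.profile.spike_rps <= plan.profile.steady_rps:
        problems.append("spike load should exceed steady-state load")
    if not plan.dashboards:
        problems.append("no dashboards named for observability")
    if not plan.halt_owner.strip():
        problems.append("nobody is named to call the halt")
    return problems
```

A plan that returns an empty list is not automatically a good plan, but a non-empty list is a cheap signal that the review is premature.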
Step 2 — Prepare the Environment
Use a staging environment that mirrors production at realistic scale: the same instance types, the same DB size (or a sanitized prod snapshot), the same CDN rules (or the CDN deliberately bypassed so requests reach the origin), and the same feature flags. A test against an under-provisioned staging cluster tells you nothing useful about production capacity.
Run your k6 load generator on a machine that is not on the same host as the SUT (or use k6 Cloud, or distribute the run with the open-source Grafana k6 operator). Network RTT between load generator and SUT should be realistic: same AWS region, same VPC, and comparable to real client geography. Export baseline metrics from Prometheus/Datadog to a snapshot so you can diff before/after.
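The before/after diff can be as simple as comparing two metric snapshots. The following is a rough sketch that assumes each snapshot has already been exported to a flat name-to-value mapping and that a higher value is worse for every metric in it; the metric names and the 10 % tolerance are hypothetical.

```python
def diff_metrics(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.10,
) -> dict[str, float]:
    """Return metrics whose value grew by more than `tolerance` relative to the baseline.

    The result maps each regressed metric name to its relative change
    (0.30 means 30 % higher than the baseline).
    """
    regressions = {}
    for name, before in baseline.items():
        after = current.get(name)
        if after is None or before <= 0:
            # Metric missing from the run, or baseline unusable for a ratio.
            continue
        change = (after - before) / before
        if change > tolerance:
            regressions[name] = change
    return regressions


if __name__ == "__main__":
    baseline = {"p99_ms": 200.0, "cpu_util": 0.40}
    current = {"p99_ms": 260.0, "cpu_util": 0.42}
    for name, change in diff_metrics(baseline, current).items():
        print(f"{name}: {change:+.0%} vs baseline")
```

Metrics with a zero baseline (an error rate of 0, for example) are skipped here because a relative change is undefined; those are better checked against an absolute threshold.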
Step 3 — Author the k6 Script
Your script should model real user journeys, not arbitrary endpoint hammering. For an e-commerce checkout API the canonical journeys are: browse catalog, view product detail, add to cart, checkout. Weight them by your analytics data; browse volume is often around 10x checkout volume, but use your own numbers rather than a rule of thumb.
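To turn journey weights into per-journey request rates, divide the total target rate in proportion to each weight. This is a small sketch; the journey names and the weights in the usage example are placeholders, not measured data.

```python
def split_rate(total_rps: float, weights: dict[str, float]) -> dict[str, float]:
    """Split a total request rate across journeys in proportion to their weights."""
    if total_rps <= 0:
        raise ValueError("total_rps must be positive")
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: total_rps * weight / total_weight for name, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical weights: browse is 10x checkout.
    weights = {"browse": 10, "product_detail": 5, "add_to_cart": 2, "checkout": 1}
    for journey, rps in split_rate(180, weights).items():
        print(f"{journey}: {rps:.1f} RPS")
```

The resulting rates are what each journey's arrival-rate setting in the load script should be driven by, so the mix stays correct when the total target changes.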
Use k6 scenarios to encode multiple executor shapes in a single file, so a CI gate, a soak test, and a spike test share the same script without duplication.
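k6 scripts themselves are JavaScript. To keep this lesson's examples in one language, the sketch below builds the scenario definitions in Python and serializes them to JSON that a k6 script could load into its options. The executor names and field names follow the k6 scenarios options as commonly documented, but verify them against the k6 version you run. The durations, VU counts, and the checkoutJourney function name are assumptions. Only one scenario is selected per run, because scenarios defined together in k6 start concurrently unless told otherwise.

```python
import json


def build_scenarios(steady_rps: int, spike_rps: int) -> dict:
    """Build three load shapes that share one script: CI gate, soak, and spike."""
    common = {
        "timeUnit": "1s",           # rates below are per second
        "preAllocatedVUs": 50,      # assumed starting pool of VUs
        "maxVUs": 500,              # assumed upper bound on VUs
        "exec": "checkoutJourney",  # hypothetical exported function in the k6 script
    }
    return {
        "ci_gate": {
            "executor": "constant-arrival-rate",
            "rate": steady_rps,
            "duration": "2m",
            **common,
        },
        "soak": {
            "executor": "constant-arrival-rate",
            "rate": steady_rps,
            "duration": "2h",
            **common,
        },
        "spike": {
            "executor": "ramping-arrival-rate",
            "startRate": steady_rps,
            "stages": [
                {"target": spike_rps, "duration": "30s"},   # ramp up to the spike
                {"target": spike_rps, "duration": "2m"},    # hold the spike
                {"target": steady_rps, "duration": "30s"},  # recover
            ],
            **common,
        },
    }


def select_scenario(name: str, steady_rps: int, spike_rps: int) -> str:
    """Return JSON containing only the requested scenario."""
    scenarios = build_scenarios(steady_rps, spike_rps)
    if name not in scenarios:
        raise KeyError(f"unknown scenario: {name}")
    return json.dumps({name: scenarios[name]}, indent=2)


if __name__ == "__main__":
    print(select_scenario("spike", steady_rps=100, spike_rps=300))
```

Keeping the shapes in one place means the CI gate and the soak test cannot silently drift apart in rate or journey mix.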
Step 4 — Run With Full Observability Active
Never run a load test in the dark. Before executing, confirm that your Prometheus scrape intervals are at 15 s or tighter, your distributed traces are sampling at 100 % (or at least 10 % with tail-based sampling on slow traces), and your application profiler (pprof, async-profiler, py-spy) is ready to be triggered on demand.
While the test runs, watch four signals simultaneously: SUT CPU and memory (are we CPU-bound or approaching OOM?), DB connection pool utilization (pool exhaustion is a frequent cause of latency cliffs as request rates climb), P99 latency trend (is it stable or creeping?), and GC pause frequency for JVM/Go services.
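Little's Law gives a quick estimate of how close the connection pool is to saturation: the average number of connections in use equals the query arrival rate multiplied by the mean time each query holds a connection. The sketch below applies that relation; the request rate, queries per request, query time, and pool size are made-up numbers for illustration.

```python
import math


def connections_in_use(rps: float, queries_per_request: float, mean_query_seconds: float) -> float:
    """Average busy connections by Little's Law: L = arrival rate x time in system."""
    return rps * queries_per_request * mean_query_seconds


def pool_utilization(
    rps: float,
    queries_per_request: float,
    mean_query_seconds: float,
    pool_size: int,
) -> float:
    """Fraction of the pool busy on average; values near or above 1.0 mean queuing."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return connections_in_use(rps, queries_per_request, mean_query_seconds) / pool_size


if __name__ == "__main__":
    # Hypothetical service: 500 RPS, 2 queries per request, pool of 20 connections.
    for query_ms in (15, 25):
        busy = connections_in_use(500, 2, query_ms / 1000)
        util = pool_utilization(500, 2, query_ms / 1000, pool_size=20)
        print(f"{query_ms} ms queries: ~{math.ceil(busy)} busy connections, {util:.0%} of pool")
```

This is an average, so bursts need headroom beyond it. The useful observation is that a modest rise in query time can push the same request rate past the pool size, which is when requests start waiting for a connection and latency climbs sharply.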
Step 5 — Analyze Results: From Numbers to Root Cause
Raw output from k6 or InfluxDB is data, not insight. Your job is to move from "P99 spiked to 1.2 s at 180 VUs" to "connection pool exhaustion on the read replica at ~175 concurrent queries." That chain of reasoning requires correlating three layers:
- Client-side (k6): When exactly did latency degrade? What RPS and VU count correlated? Did error rate jump simultaneously or lag the latency climb?
- Service-side (APM / traces): Which span accounted for the added latency — the application code, the DB query, or network? Use Jaeger or Tempo to find the slowest traces during the degradation window.
- Infrastructure (Prometheus): Were any resource limits hit? Classic signals: CPU throttling (container_cpu_cfs_throttled_seconds_total), OOM events, DB connection pool wait time (pgbouncer_client_wait_seconds), GC pause (jvm_gc_pause_seconds).
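Correlating the three layers starts with pinning down when the degradation began on the client side, then looking at what the infrastructure metrics did around that moment. The sketch below assumes the P99 series and the infrastructure series have already been exported as lists of (timestamp in seconds, value) pairs; the threshold, run length, and window size are illustrative choices.

```python
from typing import Optional


def degradation_onset(
    samples: list[tuple[float, float]],
    threshold_ms: float,
    consecutive: int = 3,
) -> Optional[float]:
    """Return the timestamp where P99 first stays above the threshold.

    `samples` is a time-ordered list of (timestamp_seconds, p99_ms). Requiring
    several consecutive breaches filters out isolated spikes.
    """
    run_start = None
    run_length = 0
    for timestamp, p99_ms in samples:
        if p99_ms > threshold_ms:
            if run_length == 0:
                run_start = timestamp
            run_length += 1
            if run_length >= consecutive:
                return run_start
        else:
            run_length = 0
    return None


def peak_around(
    onset: float,
    infra_series: dict[str, list[tuple[float, float]]],
    window_seconds: float = 30.0,
) -> dict[str, float]:
    """For each infrastructure metric, return its peak value within the window around onset."""
    peaks = {}
    for name, series in infra_series.items():
        in_window = [value for ts, value in series if abs(ts - onset) <= window_seconds]
        if in_window:
            peaks[name] = max(in_window)
    return peaks
```

A metric that peaks inside the window is a lead to follow in the traces, not a confirmed cause; confirm it by checking that the slow spans actually wait on that resource.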
Step 6 — Write the Performance Report
A performance report is a contract between engineering and the business. It should contain: (1) a one-paragraph executive summary with a clear pass/fail verdict against each SLO; (2) a time-series chart showing P50/P95/P99 latency and RPS over the test duration; (3) the identified bottlenecks, ranked by impact; (4) specific, actionable recommendations (not "optimize DB queries" — rather "add a composite index on (user_id, created_at DESC) on the orders table, estimated query time drop from 180 ms to 12 ms based on EXPLAIN ANALYZE"); and (5) a regression risk section noting which changes are safe to ship and which require re-testing.
At companies like LinkedIn and Stripe, performance reports become living documents tracked in the same project management system as engineering tickets. A regression detected in CI references the original baseline report, making the delta undeniable and the fix accountable to a specific PR.
The k6 --out json flag plus a short Python or Go script can produce a Markdown report, commit it as a CI artifact, and post a Slack summary with pass/fail status in under 30 seconds. That is how you make performance a first-class gate rather than a periodic ritual.
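A report generator of that kind can be quite small. The sketch below assumes the k6 JSON output is newline-delimited, with one object per line and sample values stored under data.value in records of type "Point"; it reads the built-in http_req_duration and http_req_failed metrics. The SLO targets and the report layout are assumptions, and posting to Slack or committing the artifact is left out.

```python
import json
import math


def load_points(lines) -> dict[str, list[float]]:
    """Collect sample values from k6 JSON output, keyed by metric name."""
    values: dict[str, list[float]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") != "Point":
            continue  # skip metric declarations and anything else
        values.setdefault(record["metric"], []).append(record["data"]["value"])
    return values


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def render_report(values: dict[str, list[float]], p99_target_ms: float, max_error_rate: float):
    """Return (markdown_text, passed) for the latency and error-rate SLOs."""
    durations = values.get("http_req_duration", [])
    failures = values.get("http_req_failed", [])
    if not durations:
        raise ValueError("no http_req_duration points found")
    p99 = percentile(durations, 99)
    error_rate = sum(failures) / len(failures) if failures else 0.0
    rows = [
        ("P99 latency (ms)", f"{p99:.1f}", f"<= {p99_target_ms:.1f}", p99 <= p99_target_ms),
        ("Error rate", f"{error_rate:.2%}", f"<= {max_error_rate:.2%}", error_rate <= max_error_rate),
    ]
    passed = all(ok for _, _, _, ok in rows)
    out = [
        f"Verdict: {'PASS' if passed else 'FAIL'}",
        "",
        "| SLO | Observed | Target | Result |",
        "| --- | --- | --- | --- |",
    ]
    for name, observed, target, ok in rows:
        out.append(f"| {name} | {observed} | {target} | {'pass' if ok else 'fail'} |")
    return "\n".join(out), passed
```

In CI, the script would read the output file line by line, write the Markdown to an artifact, and exit non-zero when the verdict is a failure so the pipeline stops.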
Putting It All Together: The Production-Ready Checklist
- Test plan reviewed by backend, DBA, and SRE before execution
- Data pre-populated; no data generation inside the k6 VU loop
- Staging environment verified to match production instance types and replica count
- Observability active: Prometheus, tracing, profiler all running
- Thresholds encode SLOs, not just latency percentiles — include error rate
- Spike and soak scenarios run in addition to steady-state
- Root cause confirmed at infrastructure layer before declaring a bottleneck
- Report includes actionable tickets, not generic advice
- CI integration gate runs the steady-state scenario on every PR against the critical path
This end-to-end discipline is what separates a performance engineer who ships confidence from one who ships uncertainty. Every tutorial in this course — from Little's Law through profiling CPU flamegraphs — feeds into this single workflow. The output is not a number; it is a signed-off engineering decision about whether a service is ready to carry production load.