Performance & Load Testing

Performance in CI

Running a load test on a laptop before shipping is better than not testing at all, but it is a ritual, not a performance engineering practice. What separates mature engineering organizations from the rest is treating performance as a first-class citizen of the delivery pipeline: codified budgets that define what "acceptable" means, automated checks that block regressions before they land in main, and a historical baseline that makes trends visible over weeks and quarters, not just per release. At companies operating at the scale of Google, Meta, and Netflix, services typically carry an agreed latency SLO and a throughput floor, and a CI gate enforces both. In such a pipeline, a PR that degrades p99 by 20% does not merge, not because someone remembered to check, but because the pipeline refuses it.

Performance Budgets: Making "Good Enough" Explicit

A performance budget is a set of quantified thresholds that a service must stay within. Without explicit budgets, "performance regression" is subjective — a 15% p99 increase might be acceptable noise or a sign that a new N+1 query was introduced, and the outcome depends on who happens to look. With explicit budgets, the question is binary.

Budget dimensions to define for every service:

Latency: p50, p95, p99, and p999 under the expected peak load profile. p999 is critical for tail-latency-sensitive services (payment processing, auth tokens). A common starting point: p99 < 200 ms under 500 RPS for an API gateway; tighten after baseline data exists.
Throughput: minimum acceptable RPS or TPS at a given concurrency. This is your service's capacity floor — if it cannot sustain it, capacity planning is broken.
Error rate: maximum acceptable 5xx rate under load. 0.1% is common for non-critical services; 0.01% for payment or identity flows.
Resource ceilings: CPU and memory per instance at peak. Budgeting these prevents a change from silently increasing provisioned capacity by 40%, which is a cost regression even if latency is unchanged.

Encode budgets as version-controlled configuration files alongside the service code. When a budget changes, the change is reviewed and the rationale is preserved in git history.

File k6/budgets.json, the performance budget for the checkout service (JSON has no comment syntax, so this label stays outside the file):
{
  "service": "checkout-api",
  "thresholds": {
    "p99_latency_ms":  250,
    "p95_latency_ms":  120,
    "p50_latency_ms":   40,
    "error_rate_max": 0.001,
    "min_rps":         400
  },
  "resource_ceilings": {
    "cpu_p99_percent": 70,
    "memory_mb_p99":  512
  }
}

In k6, budgets translate directly to thresholds in the script options block, causing k6 to exit with a non-zero code when any threshold is breached — which CI interprets as a build failure.

// k6/checkout-load-test.js
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    steady_load: {
      executor: 'constant-arrival-rate',
      rate: 400,           // 400 RPS — the throughput budget floor
      timeUnit: '1s',
      duration: '3m',
      preAllocatedVUs: 50,
      maxVUs: 100,
    },
  },
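  // Report p(50) and p(99) in the end-of-test summary; the default trend stats stop at p(95)
  summaryTrendStats: ['avg', 'min', 'med', 'max', 'p(50)', 'p(95)', 'p(99)'],
  thresholds: {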
    // Fail the CI job if any threshold is breached
    'http_req_duration': [
      { threshold: 'p(99)<250', abortOnFail: true, delayAbortEval: '30s' },
      { threshold: 'p(95)<120', abortOnFail: false },
      { threshold: 'p(50)<40',  abortOnFail: false },
    ],
    http_req_failed: [
      { threshold: 'rate<0.001', abortOnFail: true },
    ],
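    http_reqs: [
      // Averaged over the whole run, so it can land marginally under the configured
      // 400/s arrival rate; drive slightly above the floor if this check flaps
      { threshold: 'rate>=400' },
    ],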
  },
};

export default function () {
  const res = http.post(
    'http://checkout-api:8080/v1/checkout',
    JSON.stringify({ cart_id: 'bench-cart-001' }),
    { headers: { 'Content-Type': 'application/json' } }
  );
  check(res, {
    'status 200':            (r) => r.status === 200,
    'checkout_id present':   (r) => r.json('checkout_id') !== undefined,
  });
}
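To keep the budget file and the script thresholds from drifting apart, the threshold expressions can be generated from budgets.json instead of being typed twice. The following is a minimal sketch under stated assumptions: the function name budget_to_k6_thresholds is hypothetical, and the input is assumed to follow the budgets.json layout shown earlier. It relies on k6 accepting thresholds as plain expression strings.

```python
import json


def budget_to_k6_thresholds(budget: dict) -> dict:
    """Translate a budgets.json document into k6 threshold expressions."""
    t = budget["thresholds"]
    return {
        "http_req_duration": [
            f"p(99)<{t['p99_latency_ms']}",
            f"p(95)<{t['p95_latency_ms']}",
            f"p(50)<{t['p50_latency_ms']}",
        ],
        "http_req_failed": [f"rate<{t['error_rate_max']}"],
        "http_reqs": [f"rate>={t['min_rps']}"],
    }


# Example input mirroring the checkout-api budget above.
example_budget = {
    "service": "checkout-api",
    "thresholds": {
        "p99_latency_ms": 250,
        "p95_latency_ms": 120,
        "p50_latency_ms": 40,
        "error_rate_max": 0.001,
        "min_rps": 400,
    },
}

print(json.dumps(budget_to_k6_thresholds(example_budget), indent=2))
```

One way to use the result would be to write it into a JSON options file that the load test reads at startup; how that file is wired into k6 is left open here.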

Key idea: budgets are contracts, not aspirations. A budget threshold you never enforce trains your team to ignore it. Enforce every threshold in CI. If a threshold is so tight that the gate fires on every otherwise-green commit, the threshold was miscalibrated: recalibrate it against baseline data, version-control the change, and move on. Never silence the gate.

Automated Regression Detection

An absolute threshold (p99 < 250 ms) fires only when a run crosses that line. It will not catch slow drift: a 5% p99 increase per sprint does not cross the threshold in any single change, yet it compounds into a degradation of 50% or more within six months. Regression detection compares the current run against a rolling baseline, flags statistically significant deviations, and blocks the build when the deviation exceeds a configured tolerance.
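The drift problem is easy to see with a little arithmetic. The sketch below assumes a hypothetical starting p99 of 150 ms (not a figure from this lesson) against the 250 ms budget and counts how many sprints of compounding 5% growth pass before the absolute threshold fires at all.

```python
def sprints_until_breach(start_ms: float, budget_ms: float, growth_per_sprint: float):
    """Count sprints of compounding growth before p99 first exceeds the budget."""
    p99 = start_ms
    sprints = 0
    while p99 <= budget_ms:
        p99 *= 1 + growth_per_sprint
        sprints += 1
    return sprints, p99


# Hypothetical starting point: 150 ms p99, 250 ms budget, 5% drift per sprint.
sprints, final_p99 = sprints_until_breach(150.0, 250.0, 0.05)
degradation = final_p99 / 150.0 - 1
print(f"threshold first fires after {sprints} sprints "
      f"(p99 = {final_p99:.1f} ms, {degradation:.0%} slower than the start)")
```

Under these assumptions the absolute gate stays green for ten consecutive sprints while latency degrades by more than 60%, which is the gap a baseline comparison is meant to close.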

The standard pattern in CI:

Run the load test on every PR or on every merge to main (choose based on test duration).
Export k6 results as JSON or push metrics to a time-series store (Prometheus, InfluxDB, or k6 Cloud).
A regression-detection step reads the current run and the last N baseline runs, computes the percentage change for each budget metric, and fails the job if any metric exceeds the allowed drift percentage.
Post a structured summary to the PR comment so engineers see exactly which metric regressed and by how much, without having to read raw metrics.

CI performance gate flow: every load test run is stored, compared against a rolling baseline, and blocks the merge when a regression is detected.

# .github/workflows/performance.yml
# Runs on every push to main; adapt to run on PRs for shorter tests.

name: Performance Gate

on:
  push:
    branches: [main]

jobs:
  load-test:
    runs-on: ubuntu-latest
    services:
      checkout-api:
        image: ghcr.io/myorg/checkout-api:${{ github.sha }}
        ports: ["8080:8080"]
        options: --health-cmd "curl -sf http://localhost:8080/healthz" --health-interval 5s

    steps:
      - uses: actions/checkout@v4

      - name: Run k6 load test
        uses: grafana/k6-action@v0.3.1
        with:
          filename: k6/checkout-load-test.js
          # Write the end-of-test summary; `--out json` would stream raw metric points instead
          flags: --summary-export=k6-results.json

      - name: Upload results artifact
        uses: actions/upload-artifact@v4
        with:
          name: k6-results-${{ github.sha }}
          path: k6-results.json
          retention-days: 90

      - name: Download baseline results
        # Fetch the last 10 main-branch results from S3 for regression comparison
        run: |
          aws s3 sync s3://myorg-perf-baselines/checkout-api/latest-10/ ./baselines/
        env:
          AWS_ACCESS_KEY_ID:     ${{ secrets.PERF_BASELINE_AWS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.PERF_BASELINE_AWS_SECRET }}

      - name: Detect regressions
        id: regression
        run: |
          python3 scripts/detect_regression.py \
            --current    k6-results.json \
            --baselines  baselines/ \
            --budget     k6/budgets.json \
            --tolerance  0.10 \
            --output     regression-report.md
        # exits non-zero if any metric exceeds budget * (1 + tolerance) or baseline * (1 + tolerance)

      - name: Post regression report to PR
        if: always()
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: regression-report.md

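      - name: Store result as new baseline
        if: success()
        run: |
          aws s3 cp k6-results.json \
            s3://myorg-perf-baselines/checkout-api/latest-10/$(date +%Y%m%dT%H%M%S)-${{ github.sha }}.json
        env:
          # Step-level env does not carry over from the download step
          AWS_ACCESS_KEY_ID:     ${{ secrets.PERF_BASELINE_AWS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.PERF_BASELINE_AWS_SECRET }}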

Pro practice — the tolerance band. A 0% tolerance on regression detection produces constant false positives from natural run-to-run variance (1-3% latency jitter is normal in shared CI infrastructure). A 10% tolerance on p99 and a 5% tolerance on p95 are practical starting points. Tighten the band for critical services (payment, auth) and widen it for internal tooling. Measure your baseline variance over 20 runs and set the tolerance at 2x the observed coefficient of variation — that way the gate fires on real regressions, not noise.
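The 2x coefficient-of-variation rule can be computed directly from stored runs. This is a minimal sketch: the function name tolerance_from_runs and the sample values are illustrative assumptions, and it expects a list of p99 values from repeated runs of the same commit.

```python
import statistics


def tolerance_from_runs(p99_samples, multiplier=2.0):
    """Return multiplier * coefficient of variation (stdev / mean) of the samples."""
    if len(p99_samples) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(p99_samples)
    stdev = statistics.stdev(p99_samples)  # sample standard deviation
    return multiplier * (stdev / mean)


# Illustrative p99 values (ms) from repeated runs of an unchanged build.
runs = [100.0, 102.0, 98.0, 101.0, 99.0]
print(f"suggested tolerance: {tolerance_from_runs(runs):.1%}")
```

In practice the input would be the 20 baseline runs mentioned above; five values are used here only to keep the example short.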

Regression Detection Script: The Core Logic

The detection script is simple enough to own in your repo, with no third-party service required. The key steps: parse the k6 end-of-test summary (a single JSON document, not the line-delimited stream that --out json produces), compute the rolling baseline average from the stored result files, calculate the percentage delta, and fail when the current value exceeds budget * (1 + tolerance) or baseline * (1 + tolerance).

#!/usr/bin/env python3
# scripts/detect_regression.py
# Usage: python3 detect_regression.py --current k6-results.json
#          --baselines ./baselines/ --budget k6/budgets.json
#          --tolerance 0.10 --output regression-report.md
#
# Expects k6 end-of-test summaries (handleSummary data or --summary-export)
# with p(50) and p(99) enabled via summaryTrendStats. The line-delimited
# stream written by `--out json` is not a summary and is rejected.

import argparse
import glob
import json
import statistics
import sys
from pathlib import Path

def load_k6_summary(path):
    """Extract p50/p95/p99 and error_rate from a k6 end-of-test summary."""
    try:
        data = json.loads(Path(path).read_text())
    except json.JSONDecodeError as exc:
        sys.exit(f'{path}: not a single JSON summary document ({exc})')
    metrics = data.get('metrics', {})
    # handleSummary nests stats under "values"; --summary-export keeps them flat
    dur = metrics.get('http_req_duration', {})
    dur = dur.get('values', dur)
    fail = metrics.get('http_req_failed', {})
    fail = fail.get('values', fail)
    try:
        return {
            'p50':        dur['p(50)'],
            'p95':        dur['p(95)'],
            'p99':        dur['p(99)'],
            # keyed "rate" in handleSummary data, "value" in --summary-export
            'error_rate': fail['rate'] if 'rate' in fail else fail['value'],
        }
    except KeyError as exc:
        # Fail loudly instead of treating a missing percentile as 0 ms
        sys.exit(f'{path}: summary is missing {exc}; check summaryTrendStats')

def rolling_baseline(baselines_dir):
    """Average each metric over the 10 most recent files (names sort by timestamp)."""
    files = sorted(glob.glob(str(Path(baselines_dir) / '*.json')))[-10:]
    if not files:
        return None
    runs = [load_k6_summary(f) for f in files]
    return {k: statistics.mean(r[k] for r in runs) for k in runs[0]}

parser = argparse.ArgumentParser()
parser.add_argument('--current', required=True)
parser.add_argument('--baselines', required=True)
parser.add_argument('--budget', required=True)
parser.add_argument('--tolerance', type=float, default=0.10)
parser.add_argument('--output', required=True)
args = parser.parse_args()

current  = load_k6_summary(args.current)
baseline = rolling_baseline(args.baselines)
budget   = json.loads(Path(args.budget).read_text())['thresholds']
tol      = args.tolerance

checks = [
    ('p99', 'p99_latency_ms', 'P99 latency (ms)'),
    ('p95', 'p95_latency_ms', 'P95 latency (ms)'),
    ('p50', 'p50_latency_ms', 'P50 latency (ms)'),
    ('error_rate', 'error_rate_max', 'Error rate'),
]
lines = ['## Performance Regression Report\n',
         '| Metric | Current | Baseline | Delta | Budget | Status |',
         '|--------|---------|----------|-------|--------|--------|']

failed = False
for key, bkey, label in checks:
    cur_val  = current[key]
    base_val = baseline[key] if baseline else None
    bud_val  = budget.get(bkey)
    # Fail if current exceeds budget*(1+tol) OR exceeds baseline*(1+tol).
    # A zero baseline (e.g. error rate) has no meaningful relative delta,
    # so that case is left to the budget check.
    over_budget   = bud_val is not None and cur_val > bud_val * (1 + tol)
    over_baseline = bool(base_val) and cur_val > base_val * (1 + tol)
    status = 'FAIL' if (over_budget or over_baseline) else 'PASS'
    failed = failed or status == 'FAIL'
    base_str  = f'{base_val:.4g}' if base_val is not None else 'N/A'
    delta_str = f'{(cur_val - base_val) / base_val:+.1%}' if base_val else 'N/A'
    lines.append(f'| {label} | {cur_val:.4g} | {base_str} | {delta_str} | {bud_val} | {status} |')

Path(args.output).write_text('\n'.join(lines) + '\n')
sys.exit(1 if failed else 0)
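The script above compares the current run against the baseline mean with a fixed percentage band. If the goal is to flag only deviations that stand out from normal run-to-run variance, one alternative is a simple sigma rule over the stored runs. This is a swapped-in technique, not what the script does; the function name and the defaults of 3 sigmas and 5 minimum runs are assumptions chosen for illustration.

```python
import statistics


def is_significant_regression(current, baseline_runs, sigmas=3.0, min_runs=5):
    """Flag `current` when it sits more than `sigmas` standard deviations
    above the mean of the baseline runs (higher is worse, e.g. latency)."""
    if len(baseline_runs) < min_runs:
        return False  # too little history to judge; rely on the budget check
    mean = statistics.mean(baseline_runs)
    stdev = statistics.stdev(baseline_runs)
    if stdev == 0:
        return current > mean
    return (current - mean) / stdev > sigmas


# Illustrative p99 history (ms) and two candidate runs.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
print(is_significant_regression(104.0, history))  # within 3 sigmas of the mean
print(is_significant_regression(106.0, history))  # beyond 3 sigmas
```

A sigma rule adapts to each service's own noise level, at the cost of needing enough stored runs for the standard deviation to be meaningful.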

Production pitfall — the flaky performance test. A load test running against a real staging environment that shares compute with other CI jobs produces highly variable results. A 30% latency swing between runs has nothing to do with your code change — it reflects noisy neighbours. Isolate the test environment: either run the service under test on a dedicated ephemeral VM with pinned CPU/memory (e.g. a GitHub Actions runner type with known specs, or a dedicated k8s namespace with resource requests & limits), or run the test inside a single Docker Compose stack on the runner itself. Isolation is the difference between a useful gate and a random number generator.

Where to Run Performance Tests in the Pipeline

Not every test runs on every event. Match test intensity to the cost it gates:

On every PR (fast smoke load): a 60-90 second ramp to peak load that verifies no threshold is violated. Target: under 3 minutes of total CI time. Purpose: catch obvious regressions (a new O(n²) loop, a missing index) before they are merged.
On every merge to main (full regression run): 3-5 minute sustained load at the defined budget RPS, full statistical comparison against baseline. Purpose: update the baseline and detect subtle drift.
Nightly or weekly (soak test): 30-60 minutes at moderate load. Purpose: detect memory leaks, connection pool exhaustion, and GC pressure that only manifest over time.

Gate merges only on the fast PR smoke test and the main-branch regression run. Soak tests are informational — they alert on failure but do not block shipping, because a block on a 45-minute nightly test is operationally impractical. Notify on-call instead and track as a P2 investigation item.
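The tiering and the blocking rule can be captured in a small lookup that a CI wrapper script consults. This is only a sketch: the PROFILES table, its keys (modelled on GitHub Actions event names), and the concrete durations are assumptions picked from the ranges above.

```python
# Hypothetical mapping from CI trigger to load-test profile.
PROFILES = {
    "pull_request": {"name": "smoke", "duration": "90s", "blocks_merge": True},
    "push": {"name": "regression", "duration": "5m", "blocks_merge": True},
    "schedule": {"name": "soak", "duration": "45m", "blocks_merge": False},
}


def select_profile(event_name: str) -> dict:
    """Return the load-test profile for a CI event, or raise for unknown events."""
    try:
        return PROFILES[event_name]
    except KeyError:
        raise ValueError(f"no load-test profile for event {event_name!r}") from None


def exit_code(event_name: str, thresholds_breached: bool) -> int:
    """Non-zero only when a blocking profile breached its thresholds."""
    profile = select_profile(event_name)
    return 1 if thresholds_breached and profile["blocks_merge"] else 0


print(exit_code("pull_request", True))  # blocking tier: fails the job
print(exit_code("schedule", True))      # soak tier: alert elsewhere, do not block
```

Keeping the soak tier non-blocking in code makes the policy explicit; the alerting path for a failed soak run would live outside this function.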

Key idea — baselines live in git or a versioned store, not in someone's head. When a team says "performance has been gradually getting worse for six months," it is almost always because they had no automated baseline. The first investment is the historical record. Even storing k6 JSON summaries as GitHub Actions artifacts gives you the raw data to plot trends. A proper setup pushes summary metrics to Grafana or Datadog with a git.sha tag, making it trivial to correlate a latency spike to a specific commit.
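Tagging each summary with the commit that produced it is mostly a formatting exercise. The sketch below renders one run as an InfluxDB line-protocol record; the measurement name k6_summary, the tag key git_sha, and the function name are assumptions, and tag values are assumed to contain no spaces or commas (those would need escaping).

```python
def to_line_protocol(service: str, git_sha: str, metrics: dict, timestamp_ns: int) -> str:
    """Render summary metrics as one line: measurement,tags fields timestamp."""
    tags = f"service={service},git_sha={git_sha}"
    # Sort field keys so the output is stable across runs.
    fields = ",".join(f"{key}={float(value)}" for key, value in sorted(metrics.items()))
    return f"k6_summary,{tags} {fields} {timestamp_ns}"


line = to_line_protocol(
    service="checkout-api",
    git_sha="abc1234",  # placeholder commit id
    metrics={"p99": 231.5, "p95": 110},
    timestamp_ns=1_700_000_000_000_000_000,
)
print(line)
```

With the commit id stored as a tag, a dashboard can group or annotate latency by commit, which is what makes a spike traceable to a specific change.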