Capacity Planning & Autoscaling

Capacity Reviews & Forecasting Practice

Autoscaling handles the minute-to-minute elasticity of a live system, but it cannot tell you whether your infrastructure will survive next quarter's growth, a product launch that triples your user base, or an expansion into a new region. That responsibility belongs to the capacity review — a structured engineering process that connects business intent to infrastructure commitments. This lesson covers how senior engineers at top-tier companies run launch reviews, build growth models, and reason about multi-region capacity in a way that holds up under cross-functional scrutiny.

Launch Reviews: Gatekeeping Production Capacity

A launch review (sometimes called a production readiness review, or PRR) is a pre-launch checkpoint where the team owning a new feature or service demonstrates that it will not cause an outage when real traffic hits it. At companies like Google, Meta, and Amazon, completing a launch review is a hard prerequisite for a significant traffic ramp. The review is not bureaucracy — it surfaces capacity blind spots before they become incidents.

A well-structured launch review covers four areas:

  1. Traffic shape and peak estimates. What are the projected p50 and p99 request rates at launch? Is traffic bursty (a flash sale, a "top of hour" cron fan-out) or smooth? What absorbs excess traffic before it reaches the service: is there a CDN layer or a queue, or does load land directly on the origin?
  2. Resource sizing verification. Run load tests at 150% of peak forecast and confirm CPU and memory headroom on both the service and its dependencies (databases, caches, message brokers). Verify that the Horizontal Pod Autoscaler (HPA) will fire and new pods will be serving before latency breaches the SLO.
  3. Dependency capacity contracts. Every upstream and downstream service owner must confirm that the service has headroom to absorb the launch. A single downstream dependency with no runway will fail under the added load, and that failure cascades back into the new service regardless of how well the new service is sized.
  4. Rollback and load-shed plan. Document the exact commands that revert the rollout, kill the feature flag, or activate load shedding if the launch goes sideways. This should be rehearsed, not written for the first time during an incident.
The golden signal checklist for launch review: For each service involved, verify that dashboards exist for all four golden signals (latency, traffic rate, error rate, saturation) at launch-day granularity (1-minute resolution, not 5-minute). Alerts must be created and routed to on-call before the ramp begins.
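The dependency check in item 3 reduces to simple arithmetic once each owner states a tested capacity. The sketch below is illustrative only: the `Dependency` fields, the service names, and the request rates are hypothetical, and it assumes the 150% load-test factor is applied to the load the launch adds rather than to the dependency's existing peak.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    capacity_rps: float      # tested maximum the owning team has committed to
    current_peak_rps: float  # observed peak before the launch
    launch_added_rps: float  # extra load the launch is forecast to add at peak


def headroom_report(deps, test_factor=1.5):
    """Return (name, spare_rps, ok) for each dependency.

    A dependency passes when its capacity covers its current peak plus the
    launch load scaled by test_factor (1.5 mirrors the 150% load-test target).
    """
    report = []
    for d in deps:
        required = d.current_peak_rps + d.launch_added_rps * test_factor
        spare = d.capacity_rps - required
        report.append((d.name, spare, spare >= 0))
    return report


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    deps = [
        Dependency("feed-db", 5000, 3000, 1000),
        Dependency("session-cache", 20000, 12000, 2000),
        Dependency("notify-queue", 1500, 1000, 400),
    ]
    for name, spare, ok in headroom_report(deps):
        status = "OK" if ok else "BLOCKS LAUNCH"
        print(f"{name:15s} spare={spare:8.0f} rps  {status}")
```

A negative `spare` value marks the dependency that needs a capacity commitment before the ramp can proceed.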

Load-test automation is the foundation. The k6 script below models a realistic launch ramp — not a flat wall of load, but a staged increase that mirrors how a phased rollout or a marketing campaign drives user acquisition:

```javascript
// launch-ramp.js -- k6 load test for launch review
// Run: k6 run --out influxdb=http://influx:8086/k6 launch-ramp.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 200 },   // warm-up
    { duration: '10m', target: 500 },  // normal launch traffic
    { duration: '5m', target: 1000 },  // peak burst (150% of forecast)
    { duration: '5m', target: 1000 },  // sustain peak
    { duration: '5m', target: 0 },     // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<250', 'p(99)<500'],  // SLO gates
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/v1/feed');
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}
```

Integrate this script into your CI pipeline so every feature branch can prove it meets SLOs before the review meeting even happens. The launch review then becomes a presentation of evidence, not a discovery exercise.

Growth Modeling: Translating Business Plans into Resource Numbers

Growth modeling converts a product roadmap and a business forecast into a set of resource projections that engineering can act on. The output is not a single number — it is a range with confidence intervals, updated on a regular cadence (typically monthly).

The simplest effective model uses three inputs:

  • Current baseline. Measured resource consumption per unit of business activity (requests per active user per day, database row writes per order, GB egress per video view). Extract this from your observability stack — Prometheus metrics correlated with business analytics events.
  • Growth rate. User growth, transaction volume growth, or data volume growth, whichever drives your dominant cost. Use the product team's committed forecast for planning, and a P90 upside scenario for headroom.
  • Efficiency improvement. Every quarter, caching improvements, query optimizations, and protocol upgrades reduce the resource cost per unit. Model a conservative 10–15% per-year efficiency improvement so you are not over-provisioning against a cost-per-unit that will shrink.
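One way to combine the three inputs is compound growth on business volume multiplied by a slowly shrinking cost per unit. The following is a minimal sketch under that assumption, not a definitive model; the function name, the 0.002 cores-per-user baseline, and both growth rates are made-up illustrative values.

```python
def project_demand(baseline_per_unit, units_now, monthly_growth,
                   annual_efficiency, months):
    """Projected resource demand after `months`.

    baseline_per_unit : measured resource cost per unit of business activity
    units_now         : current business volume (users, orders, views)
    monthly_growth    : compound growth per month, e.g. 0.04 for 4%
    annual_efficiency : yearly reduction in cost per unit, e.g. 0.10 for 10%
    """
    units = units_now * (1 + monthly_growth) ** months
    cost_per_unit = baseline_per_unit * (1 - annual_efficiency) ** (months / 12)
    return units * cost_per_unit


if __name__ == "__main__":
    # Hypothetical inputs: 0.002 cores per active user, 100,000 users today,
    # 4%/month committed forecast, 7%/month upside, 10%/year efficiency gain.
    for months in (3, 6, 12):
        committed = project_demand(0.002, 100_000, 0.04, 0.10, months)
        upside = project_demand(0.002, 100_000, 0.07, 0.10, months)
        print(f"+{months:2d} mo  committed={committed:6.1f} cores  "
              f"upside={upside:6.1f} cores")
```

Reporting the committed and upside figures side by side gives the review a range rather than a single number.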

The following script pulls the last 90 days of Prometheus data and fits a linear trend to help anchor the model:

```python
#!/usr/bin/env python3
# capacity_forecast.py: fit a trend line to Prometheus CPU usage and project forward
# pip install requests numpy pandas
from datetime import datetime, timedelta, timezone

import numpy as np
import pandas as pd
import requests

PROM = "http://prometheus.monitoring.svc:9090"
QUERY = 'sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m]))'

end = datetime.now(timezone.utc)
start = end - timedelta(days=90)

# query_range accepts Unix timestamps as well as RFC 3339 strings
resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "1h",
    },
    timeout=60,
)
resp.raise_for_status()

# Each sample is [unix_timestamp, "value-as-string"]
values = resp.json()["data"]["result"][0]["values"]
df = pd.DataFrame(values, columns=["ts", "cpu"]).astype(float)
df["day"] = (df["ts"] - df["ts"].min()) / 86400

# Linear fit: slope is the number of cores gained per day
slope, intercept = np.polyfit(df["day"], df["cpu"], 1)
print(f"Daily CPU growth rate: {slope:.3f} cores/day")

# Project 30, 60, and 90 days out. "P90" here is the trend line plus a flat
# 20% upside buffer, not a statistically derived percentile.
for days_out in [30, 60, 90]:
    p50 = intercept + slope * (df["day"].max() + days_out)
    p90 = p50 * 1.20
    print(f"  +{days_out}d  P50={p50:.1f} cores  P90={p90:.1f} cores")
```
Model the tail, not the mean. Production capacity must absorb your P90 or P95 traffic scenario — not the median. If your growth model is built on average daily active users, add a multiplier for peak-to-average ratio (typically 3–5x for consumer products with morning/evening peaks) before you translate it into vCPU and memory requirements.
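The peak-to-average adjustment can be made explicit with a few lines of arithmetic. This is a rough sketch: `rps_per_vcpu` is an assumed throughput figure that would come from your own load tests, and the user counts below are invented for illustration.

```python
import math


def peak_vcpus(daily_active_users, requests_per_user_per_day,
               peak_to_average, rps_per_vcpu):
    """vCPUs needed to serve peak traffic derived from average daily activity."""
    avg_rps = daily_active_users * requests_per_user_per_day / 86_400
    peak_rps = avg_rps * peak_to_average
    return math.ceil(peak_rps / rps_per_vcpu)


if __name__ == "__main__":
    # Hypothetical: 864,000 DAU, 100 requests per user per day,
    # and 50 requests/second sustained per vCPU.
    sized_for_mean = peak_vcpus(864_000, 100, 1, 50)
    sized_for_peak = peak_vcpus(864_000, 100, 4, 50)
    print(f"sized for mean: {sized_for_mean} vCPU")
    print(f"sized for 4x peak: {sized_for_peak} vCPU")
```

Sizing for the mean in this example would leave the fleet at a quarter of what the daily peak requires.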

Regional Capacity Planning

Expanding into a new region — or maintaining N+1 regional redundancy — requires a separate capacity exercise because regional traffic is never a simple fraction of global traffic. Regional capacity planning accounts for three factors that global models miss:

  • Latency-sensitive affinity. Users do not distribute uniformly across regions. A new APAC region may capture 25% of global signups but generate 40% of API calls because the lower latency drives higher engagement. Measure existing latency buckets by geography to build region-specific request-rate multipliers.
  • Data residency requirements. GDPR, data sovereignty laws, and enterprise customer contracts often mandate that specific data stay within a region. This forces local database primaries and local object storage, which have a higher fixed cost floor than a pure read-replica deployment.
  • Regional failure isolation budget. If you are targeting N+1 redundancy, each region must be sized to absorb 100% of traffic from the failed region during a failover. Many teams under-provision the standby region with "we will scale it up if we need it" — a plan that fails in practice when failover coincides with a traffic spike.
[Figure: N+1 Regional Capacity Model. A global load balancer fronts Region A (primary, active) and Region B (standby). Region A runs the API tier and worker tier at 100% of traffic with the DB primary and cache cluster; normal load is 70%, leaving headroom for failover. Region B runs API and worker tiers sized for 100%, a DB replica that is promoted on failover, and a warm cache; normal load is ~20%, and it absorbs 100% on failover. Region A replicates asynchronously to Region B.]
N+1 regional model: each region must be sized to carry 100% of traffic; the active region runs at ~70% utilization to leave headroom for failover.

The critical sizing rule for N+1 redundancy: run each serving region at no more than 60–70% utilization during normal operation. This preserves enough headroom to absorb failover traffic plus the load that builds up during autoscaling lag when a traffic spike and a regional failure coincide, which is the worst-case scenario your capacity plan must survive.
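How much headroom a utilization ceiling really leaves depends on topology. In an active/standby pair the standby carries little traffic, so it can take over a primary running at 70%. If instead all regions serve traffic, the ceiling follows from the region count. The sketch below assumes identically sized regions sharing load evenly (an active-active layout, which is an assumption beyond the model in the figure); the function names and the `lag_margin` parameter are hypothetical.

```python
def max_safe_utilization(regions, lag_margin=0.0):
    """Highest steady-state utilization per region that survives losing one.

    Assumes identically sized regions sharing traffic evenly. `lag_margin`
    reserves extra capacity for load that arrives before autoscaling catches up.
    """
    if regions < 2:
        raise ValueError("N+1 redundancy needs at least two regions")
    return (regions - 1) / regions - lag_margin


def post_failover_utilization(regions, utilization):
    """Utilization of each surviving region after one region fails."""
    return utilization * regions / (regions - 1)


if __name__ == "__main__":
    for n in (2, 3, 4):
        ceiling = max_safe_utilization(n, lag_margin=0.05)
        print(f"{n} regions: run each at or below {ceiling:.0%}")
    print(f"2 regions at 70%: survivor at {post_failover_utilization(2, 0.70):.0%}")
    print(f"3 regions at 60%: survivors at {post_failover_utilization(3, 0.60):.0%}")
```

Under these assumptions a 60–70% target is survivable with three or more evenly loaded regions, while a two-region active-active pair would need to stay below 50%.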

Running the Quarterly Capacity Review Meeting

A capacity review meeting is most effective when it follows a consistent agenda, which keeps it from drifting into free-form discussion. A proven structure:

  1. Current state (10 min). Show a 90-day utilization trend for each tier: CPU, memory, disk I/O, network egress, database connections. Call out any metric that crossed 70% of capacity in the last quarter.
  2. Forecast vs. actuals (10 min). Compare the projections from the previous quarterly review against reality. A model that consistently over-predicts wastes money; one that under-predicts causes incidents. Tune the model's growth multipliers based on variance.
  3. Next-quarter projections (15 min). Walk through the growth model for the next 90 days, including upcoming launches, marketing campaigns, and seasonality. Identify the resource that will hit 80% utilization first — this is the critical path for the quarter.
  4. Action items (5 min). Every at-risk resource needs an owner and a target resolution date: vertical scaling, horizontal scaling approval, code optimization, or a quota increase request with the cloud provider.
Avoid the "headroom theater" anti-pattern: Teams sometimes present charts showing 40% average utilization and declare the system healthy — ignoring that peak utilization routinely hits 85% and that autoscaling lag during a burst can push the system into saturation for several minutes. Always present peak and P99 utilization alongside the average in capacity reviews.
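Finding the quarter's critical path (agenda item 3) can be sketched as a time-to-threshold calculation over peak utilization, not average utilization. The example assumes a linear trend per resource; the function names, resource names, and growth figures are hypothetical.

```python
def days_until_threshold(peak_util, daily_growth, threshold=0.80):
    """Days until peak utilization reaches `threshold` on a linear trend.

    Returns 0.0 if already at or past the threshold, and None if the
    trend is flat or shrinking (the threshold is never reached).
    """
    if peak_util >= threshold:
        return 0.0
    if daily_growth <= 0:
        return None
    return (threshold - peak_util) / daily_growth


def critical_path(resources, threshold=0.80):
    """Name of the resource that reaches the threshold first, or None."""
    eta = {
        name: days_until_threshold(util, growth, threshold)
        for name, (util, growth) in resources.items()
    }
    reachable = {name: days for name, days in eta.items() if days is not None}
    return min(reachable, key=reachable.get) if reachable else None


if __name__ == "__main__":
    # Hypothetical (peak utilization, growth per day as a fraction of capacity)
    resources = {
        "cpu": (0.60, 0.002),
        "db_connections": (0.70, 0.004),
        "disk_io": (0.50, 0.0),
    }
    for name, (util, growth) in resources.items():
        print(name, days_until_threshold(util, growth))
    print("critical path:", critical_path(resources))
```

Feeding the calculation with peak or P99 utilization keeps the result consistent with the warning above about charts built on averages.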

A runbook that captures the quarterly review cadence as code makes the process reproducible. The following shell snippet exports the key Prometheus metrics into a CSV that serves as the starting point for the review deck:

```bash
#!/bin/bash
# export_capacity_snapshot.sh
# Exports 90-day P95 usage per namespace into capacity_snapshot_<date>.csv
# Requires: curl, jq, and direct access to the Prometheus HTTP API
set -euo pipefail  # stop on the first failed command or unset variable

PROM="http://prometheus.monitoring.svc:9090"
NAMESPACES=("production" "staging" "infra")
OUT="capacity_snapshot_$(date +%Y-%m-%d).csv"

echo "namespace,metric,p95_90d,unit" > "$OUT"

for NS in "${NAMESPACES[@]}"; do
  # CPU p95 over last 90 days (vCPU cores)
  CPU=$(curl -sG "$PROM/api/v1/query" \
    --data-urlencode "query=quantile_over_time(0.95, sum(rate(container_cpu_usage_seconds_total{namespace=\"$NS\"}[5m]))[90d:1h])" \
    | jq -r '.data.result[0].value[1] // "0"')

  # Memory p95 over last 90 days (GiB)
  MEM=$(curl -sG "$PROM/api/v1/query" \
    --data-urlencode "query=quantile_over_time(0.95, sum(container_memory_working_set_bytes{namespace=\"$NS\"})[90d:1h]) / 1073741824" \
    | jq -r '.data.result[0].value[1] // "0"')

  echo "$NS,cpu,$CPU,cores" >> "$OUT"
  echo "$NS,mem,$MEM,GiB" >> "$OUT"
done

echo "Snapshot written to $OUT"
```
Use a shared capacity dashboard as a living document. The most effective teams maintain a Grafana dashboard titled "Capacity Review" that is the single source of truth for all quarterly reviews. It is always current, uses the same query definitions as the review script, and eliminates the manual data-gathering step that causes teams to skip reviews or run them with stale data.

Capacity planning closes the loop between the reactive elasticity of autoscaling and the proactive resource governance that keeps platforms stable as businesses grow. The engineers who master it are the ones who prevent the midnight incidents — not the ones who respond to them.