Capacity Planning & Autoscaling

Capacity Planning Fundamentals


Capacity planning is the practice of ensuring your infrastructure can serve expected demand — with enough headroom to absorb spikes — without over-provisioning to the point of wasting money. At hyperscale companies, capacity planning is a formal engineering discipline with dedicated teams, quarterly forecasting cycles, and automated procurement pipelines. For the rest of us, getting the fundamentals right prevents two failure modes that kill reliability: running out of capacity at the worst possible moment, and burning the company's cloud budget on idle instances.

This lesson focuses on the three pillars that sit beneath every autoscaling strategy you will configure in subsequent lessons: demand forecasting, headroom policy, and lead times. Get these wrong and no amount of HPA tuning or Karpenter configuration will save you during a traffic event.

Demand Forecasting

Forecasting answers the question: how much capacity will I need at time T in the future? There are three models, used in combination at mature organizations.

Trend-based forecasting: fit a curve (linear or exponential) to historical utilization data. Useful for organic growth but blind to business events.
Event-driven forecasting: overlay known business-calendar events such as product launches, marketing campaigns, Black Friday, and fiscal quarter-end spikes in B2B SaaS. Treat these as mandatory inputs: post-mortems that open with "we ran out of capacity" very often trace back to a known event that never made it into the forecast.
Workload-decomposition forecasting: break total demand into its constituent signals (active users × requests/user/s × avg payload). This lets you reason about which services will saturate first and model growth independently per tier.
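Workload decomposition can be sketched in a few lines. The example below is a minimal illustration, not a production model: the Tier fields, the per-tier request rates, payload sizes, and capacities are all hypothetical numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """One service tier; every figure used below is a made-up illustration."""
    name: str
    requests_per_user_per_s: float  # average requests each active user sends to this tier
    avg_payload_kb: float           # average payload size per request
    capacity_rps: float             # request rate the tier can currently sustain


def tier_demand(tier: Tier, active_users: int) -> dict:
    """Decompose demand for one tier: users x requests/user/s x payload."""
    rps = active_users * tier.requests_per_user_per_s
    return {
        "tier": tier.name,
        "rps": rps,
        "bandwidth_mb_s": rps * tier.avg_payload_kb / 1024,
        "utilization": rps / tier.capacity_rps,
    }


def first_to_saturate(tiers: list[Tier], active_users: int) -> str:
    """Return the name of the tier with the highest projected utilization."""
    demands = (tier_demand(t, active_users) for t in tiers)
    return max(demands, key=lambda d: d["utilization"])["tier"]


# Hypothetical tiers and user count.
tiers = [
    Tier("api", requests_per_user_per_s=0.5, avg_payload_kb=4.0, capacity_rps=80_000),
    Tier("search", requests_per_user_per_s=0.25, avg_payload_kb=16.0, capacity_rps=30_000),
]
print(first_to_saturate(tiers, active_users=100_000))
```

With these inputs the search tier runs hotter than the API tier even though it receives half the requests, which is the kind of per-tier insight an aggregate trend line hides.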

In practice, export 90 days of CPU/memory/RPS from Prometheus, apply a simple linear regression in Python or in your observability platform's forecast function, then add the known event multipliers on top. The goal is a P95 demand curve, not a mean — size for the tail, not the average.

# Quick demand-trend query in Prometheus (last 90 days, hourly resolution)
# predict_linear is a PromQL function: it fits a least-squares line to the
# subquery samples and extrapolates it the given number of seconds ahead.
# Requires at least 90 days of retention (the Prometheus default is 15 days).

predict_linear(
  sum(rate(http_requests_total[5m]))[90d:1h],
  86400 * 30          # forecast 30 days ahead
)

# Grafana alternative using the transform "Regression analysis"
# Source query: sum(rate(container_cpu_usage_seconds_total[5m]))
#   (this metric is a counter, so use rate() to get CPU cores in use)
# Model: linear | Period: 30d
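If you export the series and run the regression yourself, the trend-plus-event approach might look like the sketch below. It assumes you already have one P95 requests-per-second value per day; the history, the function names, and the 2.5x launch multiplier are hypothetical placeholders, and a real forecast would also need to handle seasonality.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = slope * x + intercept over x = 0..n-1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def forecast_peak(daily_p95_rps, days_ahead, event_multiplier=1.0):
    """Extrapolate the daily P95 trend, then scale by a known event multiplier."""
    slope, intercept = fit_line(daily_p95_rps)
    target_day = len(daily_p95_rps) - 1 + days_ahead
    return (slope * target_day + intercept) * event_multiplier


# Hypothetical history: 90 daily P95 values growing by 10 RPS per day from 1000.
history = [1000 + 10 * day for day in range(90)]

baseline = forecast_peak(history, days_ahead=30)
launch_day = forecast_peak(history, days_ahead=30, event_multiplier=2.5)
print(round(baseline), round(launch_day))
```

Fitting the daily P95 rather than the daily mean keeps the forecast sized for the tail, as the lesson recommends.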

Google SRE practice: Google calls this "demand signal collection" and requires every service to maintain a capacity forecast document updated before each quarterly planning cycle. The forecast must include a P50 (expected) and P95 (conservative) scenario, and explicitly list assumptions about growth drivers.

Headroom Policy

Headroom is the gap you intentionally leave between provisioned capacity and expected peak demand. It serves three purposes: absorbing unexpected spikes before autoscaling responds, giving autoscaling time to act (a new node typically takes 2–6 minutes to join a Kubernetes cluster), and keeping CPU/memory saturation from degrading latency before you can scale out.

The right headroom number depends on your scaling speed and your SLO aggressiveness:

20 % headroom: the minimum viable for services with fast horizontal scaling (<90 s to add a pod on a warm node that already has room for it). Acceptable for stateless microservices backed by HPA.
30–40 % headroom: appropriate when node provisioning is in the path (cluster autoscaling, ~3–5 min). This is a common default for core serving tiers.
50 %+ headroom: required for services with long warm-up times (JVM, ML model loading), stateful systems (databases, Kafka brokers), or single-region deployments where losing one AZ shifts its share of traffic onto the survivors at once: 1.5x load per surviving AZ with three AZs, 2x with two.
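How much load lands on the surviving zones after an AZ failure depends on how many zones you run in. A small sketch of that arithmetic, assuming traffic is evenly balanced across zones before the failure (the function names are illustrative):

```python
def survivor_load_factor(zones: int) -> float:
    """Load multiplier on each surviving zone after one of `zones` equal zones fails."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    return zones / (zones - 1)


def headroom_for_zone_loss(zones: int) -> float:
    """Headroom, as a fraction of peak demand, needed to absorb one zone failure."""
    return survivor_load_factor(zones) - 1


for zones in (2, 3, 4):
    print(zones, f"{headroom_for_zone_loss(zones):.0%}")
```

Two zones need 100 % headroom to ride out a zone loss without scaling, three zones need 50 %, and four need about 33 %, which is why spreading across more zones lowers the headroom bill.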

Headroom is not free: 30 % headroom means you are permanently paying for 1.3x your expected peak demand, and more than that relative to average load. The counter-argument, and it is correct, is that the cost of a 30-minute outage during a traffic spike almost always exceeds months of headroom spend. Encode your policy in a runbook so on-call engineers do not under-provision to save money.
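The trade-off can be put into numbers with a rough sketch like the one below. The monthly cost and outage cost are hypothetical figures for illustration only; substitute your own billing and incident data.

```python
def provisioned_capacity(p95_demand: float, headroom: float) -> float:
    """Capacity to provision for a given P95 demand and headroom fraction."""
    return p95_demand * (1 + headroom)


def target_utilization(headroom: float) -> float:
    """Utilization at P95 demand that corresponds to a headroom fraction."""
    return 1 / (1 + headroom)


def breakeven_months(monthly_baseline_cost: float, headroom: float, outage_cost: float) -> float:
    """Months of headroom spend that add up to the cost of one outage."""
    return outage_cost / (monthly_baseline_cost * headroom)


# Hypothetical figures: 2,000 vCPU at P95, $40k/month baseline, $60k per outage.
print(provisioned_capacity(2_000, 0.30))            # vCPU to provision
print(f"{target_utilization(0.30):.0%}")            # utilization at P95 demand
print(breakeven_months(40_000, 0.30, 60_000))       # months of headroom per outage
```

Note that 30 % headroom corresponds to running at roughly 77 % utilization at P95 demand, not 70 %; the two framings are easy to confuse when setting autoscaler targets.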

Provisioned capacity must stay above P95 demand with deliberate headroom — not just above the mean.

Lead Times

Lead time is how long it takes to get additional capacity into production. It governs how far ahead you must forecast and how much headroom you must maintain. Ignoring lead times is the single most common capacity planning mistake.

Lead times exist at every layer of the stack:

Pod scheduling (warm node): 5–30 seconds. Kubernetes scheduler + container pull if not cached. This is the HPA regime — reactive, fast.
Node provisioning (cluster autoscaler / Karpenter): 2–6 minutes for standard instance types; 10–20 minutes for GPU instances or large bare-metal nodes. This is the layer where the "30 % headroom" rule comes from — you need enough buffer to survive while new nodes join.
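Reserved capacity procurement: on-demand instances launch within minutes, subject to account quota and regional availability, but discounted reserved capacity (AWS Reserved Instances, GCP committed use discounts) is bought in advance on a 1- or 3-year term. Set the reserved baseline too high and you pay for capacity you never use; set it too low and you cover the gap at on-demand prices and risk running into on-demand quota.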
Hardware procurement (on-prem / colocation): 8–26 weeks for standard servers; 6–12 months for specialized hardware (GPUs, high-memory nodes, custom ASICs). At this scale, capacity planning is a capital expenditure process with finance and procurement stakeholders.
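The link between lead time and headroom can be made concrete. The sketch below assumes demand ramps linearly during a spike and that new capacity is requested the moment the ramp starts; the 10,000 RPS baseline and the 10 RPS-per-second ramp are hypothetical.

```python
def demand_after_lead_time(current_rps: float, ramp_rps_per_s: float, lead_time_s: float) -> float:
    """Demand reached by the time capacity requested now becomes usable."""
    return current_rps + ramp_rps_per_s * lead_time_s


def min_headroom(current_rps: float, ramp_rps_per_s: float, lead_time_s: float) -> float:
    """Headroom fraction needed to keep serving until the new capacity lands."""
    return demand_after_lead_time(current_rps, ramp_rps_per_s, lead_time_s) / current_rps - 1


# Hypothetical spike: 10,000 RPS baseline, growing by 10 RPS every second.
for layer, lead_time_s in [("pod on warm node", 30), ("new node", 300)]:
    print(layer, f"{min_headroom(10_000, 10, lead_time_s):.0%}")
```

Under these assumptions a 30-second pod start needs about 3 % headroom, while a 5-minute node provision needs about 30 %, which matches the rule of thumb for the node-provisioning layer.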

Production pattern, "N+2" node buffer: Always keep at least 2 node-equivalents of unallocated capacity in your cluster so that a sudden spike (e.g., a single viral event) can be absorbed without waiting out the node-provisioning lead time. On cluster-autoscaler, set --scale-down-unneeded-time=10m; on Karpenter, the corresponding setting is consolidateAfter: 10m on the NodePool. Set a minReplicas floor on your HPAs so the workload, and the nodes under it, does not shrink below this buffer during off-peak hours.
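A quick way to check whether a cluster currently meets an N+2 buffer is to total the unrequested CPU and express it in node-equivalents. This is a simplified sketch: the (allocatable, requested) pairs are hypothetical values you would pull from your cluster's node metrics, and it ignores fragmentation, so free CPU scattered across busy nodes may not fit a large pod.

```python
def spare_node_equivalents(nodes, node_cpu):
    """Sum unrequested CPU across nodes and express it in whole-node units.

    nodes: iterable of (allocatable_cpu, requested_cpu) pairs, one per node.
    node_cpu: allocatable CPU of the reference node size.
    """
    free_cpu = sum(max(0.0, allocatable - requested) for allocatable, requested in nodes)
    return free_cpu / node_cpu


def meets_buffer(nodes, node_cpu, min_spare_nodes=2):
    """True when spare capacity is at least `min_spare_nodes` node-equivalents."""
    return spare_node_equivalents(nodes, node_cpu) >= min_spare_nodes


# Hypothetical 4-node cluster of 8-vCPU nodes.
cluster = [(8, 6), (8, 5), (8, 2), (8, 0)]
print(spare_node_equivalents(cluster, node_cpu=8), meets_buffer(cluster, node_cpu=8))
```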

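# Karpenter NodePool: bounding scale-out with limits and pacing scale-in with disruption settings
# limits is a hard ceiling on what this NodePool may provision, so size it to P95 demand
# plus headroom (for example, 1.3 x a forecast need of ~490 vCPU gives ~640). It does not by
# itself keep slack on warm nodes; consolidateAfter and the budget only slow down scale-in.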

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # v1 name; reclaim empty or underutilized nodes
    consolidateAfter: 10m                    # but wait 10 min before consolidating
    budgets:
      - nodes: "20%"                         # never evict >20% of nodes at once
  limits:
    cpu: "640"                               # hard ceiling for this NodePool
    memory: "2560Gi"
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.2xlarge", "m6i.4xlarge", "m7i.2xlarge", "m7i.4xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

Putting It Together: The Capacity Planning Cycle

Mature organizations run capacity planning as a quarterly cycle, not a reactive fire-drill. The workflow:

Collect signals — pull 90-day utilization trends from Prometheus/Datadog; add business-calendar events for the next quarter.
Forecast P50 and P95 demand — per service, per resource type (CPU, memory, network, storage IOPS).
Apply headroom policy — multiply P95 by your headroom factor (1.3x for most services, higher for stateful tiers).
Account for lead times — if hardware procurement is in the path, submit requests 12+ weeks before the need date.
Review autoscaling configurations — validate that HPA targets, VPA recommendations, and Karpenter limits align with the new forecast. Adjust minReplicas floors before the busy season, not during it.
Document assumptions — so the next on-call engineer understands why you provisioned 140 % of current load.
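Steps 3 and 4 of the cycle above reduce to a small calculation: scale the P95 forecast by the headroom factor, compare it with what you already have, and count back from the need date by the lead time. The sketch below is illustrative only; the function name, the capacity figures, and the dates are hypothetical.

```python
from datetime import date, timedelta


def capacity_plan(p95_forecast, headroom, current_capacity, need_date, lead_time_weeks):
    """Return required capacity, the shortfall to procure, and the latest order date."""
    required = p95_forecast * (1 + headroom)
    shortfall = max(0.0, required - current_capacity)
    order_by = need_date - timedelta(weeks=lead_time_weeks)
    return {"required": required, "shortfall": shortfall, "order_by": order_by}


# Hypothetical quarter: 5,000 vCPU P95 forecast, 30 % headroom, 5,600 vCPU on hand,
# capacity needed by 2030-11-01 with a 12-week procurement lead time.
plan = capacity_plan(
    p95_forecast=5_000,
    headroom=0.30,
    current_capacity=5_600,
    need_date=date(2030, 11, 1),
    lead_time_weeks=12,
)
print(plan)
```

Recording the inputs alongside the result covers the final step as well: the plan itself documents the assumptions behind the provisioned number.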

Common failure mode — "we will scale dynamically": Autoscaling is not a substitute for capacity planning. Autoscaling reacts to demand that has already arrived; if your node pool is exhausted and node provisioning takes 5 minutes, you will drop traffic during those 5 minutes regardless of how well your HPA is tuned. The lessons that follow teach you how to configure autoscaling correctly — but they all assume you have done the capacity planning work to ensure there is room to scale into.
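A toy simulation makes the point about reactive scaling. It assumes per-second demand samples, a scale-out that is triggered the instant demand first exceeds capacity, and new capacity that arrives all at once after the lead time; every number is hypothetical.

```python
def dropped_requests(demand_rps, capacity_rps, extra_capacity_rps, lead_time_s):
    """Count requests dropped while waiting for extra capacity to arrive.

    demand_rps: one demand sample per second.
    The scale-out is triggered when demand first exceeds capacity_rps, and
    extra_capacity_rps becomes usable lead_time_s seconds after that.
    """
    dropped = 0
    triggered_at = None
    for t, demand in enumerate(demand_rps):
        scaled = triggered_at is not None and t - triggered_at >= lead_time_s
        capacity = capacity_rps + (extra_capacity_rps if scaled else 0)
        if triggered_at is None and demand > capacity_rps:
            triggered_at = t
        dropped += max(0, demand - capacity)
    return dropped


# Hypothetical spike: 1,000 RPS for a minute, then 1,500 RPS for ten minutes.
demand = [1_000] * 60 + [1_500] * 600

print(dropped_requests(demand, capacity_rps=1_200, extra_capacity_rps=600, lead_time_s=300))
print(dropped_requests(demand, capacity_rps=1_200, extra_capacity_rps=600, lead_time_s=30))
print(dropped_requests(demand, capacity_rps=1_600, extra_capacity_rps=600, lead_time_s=300))
```

With a 5-minute lead time the first case drops 90,000 requests; cutting the lead time to 30 seconds reduces that to 9,000, and provisioning enough headroom up front (the third case) drops none.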