Capacity Planning & Autoscaling

Kubernetes HPA in Depth

The Horizontal Pod Autoscaler (HPA) is the workhorse of reactive scaling in Kubernetes. Most engineers know the one-liner that hooks it to CPU utilization; far fewer understand how the control loop actually works, how to tune stabilization behavior to avoid oscillation in production, or how to drive it from application-level metrics that CPU simply cannot capture. This lesson closes that gap.

How the HPA Control Loop Works

The HPA controller runs inside kube-controller-manager and polls the metrics API on a configurable interval (default 15 s, tunable via --horizontal-pod-autoscaler-sync-period). Each cycle it computes a desired replica count using the ratio formula:

desiredReplicas = ceil( currentReplicas * (currentMetricValue / desiredMetricValue) )

# Example: 4 pods, CPU utilization 80%, target 50%
desiredReplicas = ceil( 4 * (80 / 50) ) = ceil(6.4) = 7

The raw number is then clamped to [minReplicas, maxReplicas], and two stabilization windows prevent thrashing. Scale-up is gated by an upscale stabilization window (default 0 s — scale up immediately) while scale-down uses a longer window (default 300 s) that forces the controller to track the maximum recommended replicas over that window before actually removing pods. This asymmetry is deliberate: respond to load spikes instantly, but be conservative when removing capacity.
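The ratio formula and the clamp can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only, not the controller's code; the function name and the example bounds are illustrative assumptions, and tolerance and stabilization are ignored here.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas):
    """Apply the ratio formula, then clamp to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * (current_value / target_value))
    return max(min_replicas, min(max_replicas, raw))

# The example from the text: 4 pods at 80% CPU against a 50% target.
print(desired_replicas(4, 80, 50, min_replicas=1, max_replicas=10))  # 7

# Same load, but maxReplicas caps the result.
print(desired_replicas(4, 80, 50, min_replicas=1, max_replicas=5))   # 5

# Very low load: the raw answer is 1, but minReplicas holds the floor.
print(desired_replicas(4, 10, 50, min_replicas=2, max_replicas=10))  # 2
```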

Key concept: stabilization window semantics. The stabilization window does not simply "wait." On every sync the HPA records its recommendation, then acts on the most conservative value seen during the window (the maximum for scale-down, the minimum for scale-up). With a five-minute scale-down window, pods are removed only after the recommendation has stayed lower for the full five minutes; a single high reading inside the window keeps the replica count up.
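The "record every recommendation, act on the most conservative one" rule can be sketched as a small simulation. The Stabilizer class is a hypothetical helper written for this lesson; it simplifies the real controller (for example, in how it treats recommendations exactly at the window boundary) and assumes one recommendation every 15 seconds.

```python
from collections import deque

class Stabilizer:
    """Keeps timestamped recommendations and returns the most conservative one."""

    def __init__(self, window_seconds, pick):
        self.window = window_seconds
        self.pick = pick          # max for scale-down, min for scale-up
        self.history = deque()    # (timestamp, recommendation) pairs

    def recommend(self, now, raw):
        self.history.append((now, raw))
        # Forget recommendations that have aged out of the window.
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()
        return self.pick(r for _, r in self.history)

# Scale-down with a 300 s window: load drops right after t=0.
down = Stabilizer(300, max)
timeline = [(0, 10)] + [(t, 4) for t in range(15, 331, 15)]
results = {t: down.recommend(t, raw) for t, raw in timeline}

# The old recommendation of 10 keeps winning until it leaves the window.
print(results[15], results[300], results[315])  # 10 10 4
```

In this sketch the raw recommendation falls to 4 at t=15, but the stabilized value stays at 10 until the t=0 entry ages out.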

A Production-Grade HPA Manifest

The autoscaling/v2 API (stable since Kubernetes 1.23) supersedes the earlier v2beta versions and, unlike autoscaling/v1, supports multiple metrics, custom and external metrics, and behavior tuning. Below is a realistic manifest for a high-throughput API service:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 6
  maxReplicas: 120

  metrics:
  # Primary signal: CPU — guards against CPU-bound regressions
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

  # Secondary signal: custom metric from Prometheus via adapter
  - type: Pods
    pods:
      metric:
        name: http_requests_in_flight
      target:
        type: AverageValue
        averageValue: "500"

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # react immediately to spikes
      policies:
      - type: Percent
        value: 100                        # allow doubling replica count…
        periodSeconds: 15
      - type: Pods
        value: 10                         # …but never add more than 10 pods per 15 s
        periodSeconds: 15
      selectPolicy: Min                   # take the more conservative of the two policies

    scaleDown:
      stabilizationWindowSeconds: 300    # require 5 min of consistent low load
      policies:
      - type: Percent
        value: 20                         # remove at most 20% per minute
        periodSeconds: 60
      selectPolicy: Min                   # Min = smallest allowed change, the conservative choice (Max is the default)

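Keep minReplicas high enough to absorb a spike while new pods start. Standard HPA does not accept a minimum of 0 unless the alpha HPAScaleToZero feature gate is enabled, so the real decision is how far above one to go. An SRE-style rule of thumb: keep at least N replicas, where N can carry the onset of a spike for your worst-case cold-start time plus startup probe delay. For a service with a 30-second startup, a floor of one pod means a traffic spike lands on a single saturated pod for at least 30 seconds, and latency falls off a cliff before the new pod is ready.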

Behavior Policies in Detail

The behavior block gives you fine-grained rate limiting on scaling actions. Each direction accepts a list of policies, and selectPolicy resolves conflicts:

Min: choose the policy that permits the smallest change. Use it for conservative behavior in either direction: slower growth on scale-up, fewest pods removed on scale-down.
Max: choose the policy that permits the largest change. This is the default. On scale-down it removes the most pods, so it is the aggressive choice, not the conservative one.
Disabled: block scaling in that direction entirely. Useful during a deployment freeze.

The Percent and Pods policy types compose well. A common pattern is to allow doubling (Percent: 100) but hard-cap growth at 10 pods per window with selectPolicy: Min, so a large deployment never requests 50 new pods (and the nodes to run them) in 15 seconds and blows your cloud quota.
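To see how the two scale-up policies interact, the sketch below computes the highest replica count each policy would allow in one period and then applies selectPolicy. The function name is hypothetical, and the sketch ignores the controller's bookkeeping of changes already made earlier in the period.

```python
import math

def scale_up_ceiling(current, policies, select_policy="Max"):
    """Highest replica count allowed after one period under the given policies."""
    ceilings = []
    for kind, value in policies:
        if kind == "Percent":
            ceilings.append(current + math.ceil(current * value / 100))
        elif kind == "Pods":
            ceilings.append(current + value)
        else:
            raise ValueError(f"unknown policy type: {kind}")
    return min(ceilings) if select_policy == "Min" else max(ceilings)

# The two scale-up policies from the manifest: double, or add 10 pods.
policies = [("Percent", 100), ("Pods", 10)]
for current in (6, 10, 40):
    print(current,
          scale_up_ceiling(current, policies, "Min"),
          scale_up_ceiling(current, policies, "Max"))
# 6 12 16
# 10 20 20
# 40 50 80
```

With Min, small deployments are limited by the doubling rule and large ones by the 10-pod cap; with Max, the 40-replica deployment could jump to 80 in a single period.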

Custom Metrics via the Prometheus Adapter

Raw CPU tells you when your pods are saturated, not when your business service is saturated. Queue depth, active WebSocket connections, requests-in-flight, and GPU memory pressure are all better signals for many workloads. Kubernetes exposes these through the custom.metrics.k8s.io API, which the Prometheus Adapter implements.

A minimal adapter rule that exposes http_requests_in_flight from a Prometheus gauge:

# prometheus-adapter ConfigMap — rules section
rules:
- seriesQuery: 'http_requests_in_flight{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod:       {resource: "pod"}
  name:
    matches: "^(.*)$"
    as: "${1}"
  # Aggregate by <<.GroupBy>> so the adapter gets exactly one value per pod
  metricsQuery: 'sum(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

Verify the metric is visible to the HPA controller:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_in_flight" \
  | jq '.'

# Expected output (truncated):
{
  "kind": "MetricValueList",
  "items": [
    { "describedObject": { "kind": "Pod", "name": "api-gateway-7d9f5-xkz4n" },
      "metricName": "http_requests_in_flight",
      "value": "423" }
  ]
}

External Metrics for Cloud-Native Signals

The external.metrics.k8s.io API handles metrics that are not associated with any Kubernetes object — SQS queue depth, a Pub/Sub subscription backlog, Kafka consumer-group lag. The adapter exposes these under type: External in the HPA spec:

metrics:
- type: External
  external:
    metric:
      name: sqs_approx_number_of_messages_visible
      selector:
        matchLabels:
          queue: "order-processing"
    target:
      type: AverageValue
      averageValue: "50"   # target 50 messages per pod

External metrics and the "per-pod" contract: with an AverageValue target, the HPA divides the raw metric by the current replica count and compares the result to the target, so the desired count works out to ceil(metric / target). A queue depth of 1000 with a target of 50 drives the HPA to 20 replicas. That number is only useful if 50 messages per pod is a backlog each consumer can actually keep up with. Calibrate targets against your actual consumer throughput in load tests before enabling this in production.
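The queue-depth arithmetic is easy to sanity-check before picking a target. A minimal sketch, assuming an AverageValue target and the bounds from the earlier manifest (6 and 120); the function name is illustrative, and tolerance is ignored.

```python
import math

def replicas_for_backlog(total_metric, target_per_pod, min_replicas, max_replicas):
    """Replica count that brings the per-pod average down to the target."""
    raw = math.ceil(total_metric / target_per_pod)
    return max(min_replicas, min(max_replicas, raw))

# Queue depth against a target of 50 visible messages per pod.
for depth in (100, 1000, 1001, 10000):
    print(depth, replicas_for_backlog(depth, 50, min_replicas=6, max_replicas=120))
# 100 6
# 1000 20
# 1001 21
# 10000 120
```

The last row shows why the target needs calibrating: a backlog of 10,000 asks for 200 pods, maxReplicas stops at 120, and the queue keeps growing unless each pod drains faster than the target assumes.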

The Tolerance Parameter and Metric Fluctuation

The HPA ignores small deviations from the target to avoid constant micro-scaling. The default tolerance is 10% (configurable globally via --horizontal-pod-autoscaler-tolerance). This means scaling is suppressed when the ratio currentMetric / desiredMetric is in the range [0.9, 1.1]. If your metric fluctuates ±15% naturally (e.g., Prometheus scrape jitter on a short window), you will see continuous scaling noise. The fix is to use a longer averaging window in the adapter rule (e.g., avg_over_time(...[5m]) instead of the raw instant value) or widen the stabilization window.
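The effect of tolerance on a noisy metric can be illustrated with a short check. The sample values below are made up for illustration, and averaging them in Python stands in for what a longer avg_over_time window would do in the adapter.

```python
def within_tolerance(current_value, target_value, tolerance=0.1):
    """True when the HPA would skip scaling for this reading."""
    ratio = current_value / target_value
    return abs(ratio - 1.0) <= tolerance

target = 50
noisy_samples = [43, 57, 44, 58]   # roughly +/-15% around the target

# Every raw sample falls outside the tolerance band and would trigger scaling.
print([within_tolerance(s, target) for s in noisy_samples])
# [False, False, False, False]

# The averaged value sits inside the band, so no scaling happens.
smoothed = sum(noisy_samples) / len(noisy_samples)
print(smoothed, within_tolerance(smoothed, target))
# 50.5 True
```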

HPA with KEDA for Advanced Patterns

Kubernetes Event-Driven Autoscaling (KEDA) extends HPA rather than replacing it. KEDA installs a ScaledObject CRD, creates and manages an HPA for each ScaledObject, and serves its scaler values to that HPA as an external metrics provider. The practical advantages over raw HPA are scale-to-zero support and 50+ built-in scalers (Kafka lag, Redis list length, Datadog metrics, cron schedules). For Kafka-driven microservices, KEDA is a common choice.

Do not run HPA and a manual kubectl scale on the same deployment simultaneously. The HPA will immediately override your manual change on the next sync cycle. If you need to pin a replica count temporarily (deploy freeze, incident), either remove the HPA or set minReplicas == maxReplicas to the desired count.

The HPA control loop: metrics from four sources feed the controller, which applies ratio math, stabilization windows, and behavior policies before patching the Deployment replica count.

Debugging HPA in Production

When scaling is not happening as expected, these commands give you the full picture:

# Check current HPA state, conditions, and last scale time
kubectl describe hpa api-gateway -n production

# Watch the HPA object live (shows metrics values in real time)
kubectl get hpa api-gateway -n production -w

# Check that the metrics API is reachable
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq '.resources[].name'

# Events on the HPA (scaling decisions and errors are logged here)
kubectl get events -n production --field-selector involvedObject.kind=HorizontalPodAutoscaler,involvedObject.name=api-gateway --sort-by='.lastTimestamp'

# Relevant kube-controller-manager flags (if you manage the control plane)
# --horizontal-pod-autoscaler-sync-period=15s
# --horizontal-pod-autoscaler-downscale-stabilization=5m0s
# --horizontal-pod-autoscaler-tolerance=0.1
# --horizontal-pod-autoscaler-cpu-initialization-period=5m0s
# --horizontal-pod-autoscaler-initial-readiness-delay=30s

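The --horizontal-pod-autoscaler-initial-readiness-delay and --horizontal-pod-autoscaler-cpu-initialization-period flags are particularly important because they control how CPU samples from newly started pods are treated. Both are measured from pod start, not from the moment the pod becomes ready. A pod that is still unready within the initial-readiness delay is treated as "not yet ready," and its CPU usage is set aside when the average is computed. During the longer CPU-initialization period, samples taken while the pod was unready may also be skipped. This keeps the atypical CPU profile of a freshly scaled batch of cold pods from skewing the average and triggering a spurious scaling decision before the pods warm up.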

Multi-metric HPA behavior: when you specify more than one metric, the HPA computes a desired replica count for each metric independently and takes the maximum. This means any single metric that demands more capacity will drive the scale-up — a sensible conservative default. You cannot change this aggregation strategy in standard HPA; for AND semantics (scale only when all metrics are high), KEDA ScaledObjects give you more control.
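The "compute per metric, take the maximum" rule looks like this in a minimal sketch. The metric readings are invented for illustration, the function name is hypothetical, and tolerance, stabilization, and min/max clamping are left out.

```python
import math

def desired_across_metrics(current_replicas, metrics):
    """metrics maps name -> (current_value, target_value); the largest proposal wins."""
    proposals = {
        name: math.ceil(current_replicas * value / target)
        for name, (value, target) in metrics.items()
    }
    winner = max(proposals, key=proposals.get)
    return proposals[winner], winner, proposals

# 10 replicas: CPU is below its 60% target, in-flight requests are above 500 per pod.
replicas, driver, proposals = desired_across_metrics(
    10,
    {"cpu": (45, 60), "http_requests_in_flight": (650, 500)},
)
print(proposals)          # {'cpu': 8, 'http_requests_in_flight': 13}
print(replicas, driver)   # 13 http_requests_in_flight
```

CPU alone would shrink the deployment to 8, but the in-flight metric asks for 13, so the HPA scales up.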