Kubernetes HPA in Depth
The Horizontal Pod Autoscaler (HPA) is the workhorse of reactive scaling in Kubernetes. Most engineers know the one-liner that hooks it to CPU utilization; far fewer understand how the control loop actually works, how to tune stabilization behavior to avoid oscillation in production, or how to drive it from application-level metrics that CPU simply cannot capture. This lesson closes that gap.
How the HPA Control Loop Works
The HPA controller runs inside kube-controller-manager and queries the metrics APIs on a configurable interval (default 15 s, tunable via --horizontal-pod-autoscaler-sync-period). Each cycle it computes a desired replica count using the ratio formula:
desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)
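As a worked example of the ratio calculation, here is a minimal Python sketch. The function name and the sample numbers are illustrative assumptions, and the sketch deliberately leaves out the clamping, tolerance, and readiness handling that the real controller layers on top.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Raw HPA recommendation: ceil(currentReplicas * currentMetric / targetMetric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


# 4 pods averaging 90% CPU against a 60% target -> 6 pods
print(desired_replicas(4, 90.0, 60.0))
# 6 pods averaging 30% CPU against a 60% target -> 3 pods
print(desired_replicas(6, 30.0, 60.0))
```

Because the result is rounded up, even a small overshoot of the target can add a whole pod once it clears the tolerance band.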
The raw number is then clamped to [minReplicas, maxReplicas], and two stabilization windows prevent thrashing. Scale-up is gated by an upscale stabilization window (default 0 s — scale up immediately) while scale-down uses a longer window (default 300 s) that forces the controller to track the maximum recommended replicas over that window before actually removing pods. This asymmetry is deliberate: respond to load spikes instantly, but be conservative when removing capacity.
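The scale-down window can be pictured as a rolling maximum over recent recommendations. The sketch below is a simplified model of that idea, not the controller's actual code: the class name, the 15-second tick, and the replica numbers are assumptions chosen for illustration, and scale-up is modeled with a 0 s window so it passes straight through.

```python
from collections import deque


def clamp(replicas: int, min_replicas: int, max_replicas: int) -> int:
    """Keep a recommendation inside [minReplicas, maxReplicas]."""
    return max(min_replicas, min(max_replicas, replicas))


class ScaleDownStabilizer:
    """Holds scale-down to the highest recommendation seen within the window."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self._history = deque()  # (timestamp_s, recommended_replicas)

    def apply(self, now_s: float, recommended: int, current: int) -> int:
        self._history.append((now_s, recommended))
        # Drop recommendations that have aged out of the window.
        while self._history[0][0] < now_s - self.window_s:
            self._history.popleft()
        if recommended >= current:
            return recommended  # scale-up (or no change) is not delayed
        highest = max(r for _, r in self._history)
        return min(current, highest)


stabilizer = ScaleDownStabilizer(window_s=300)
replicas = stabilizer.apply(now_s=0, recommended=10, current=10)
for t in range(15, 331, 15):  # load drops: every later cycle recommends 4
    replicas = stabilizer.apply(now_s=t, recommended=4, current=replicas)
    if t in (300, 315):
        print(t, replicas)  # prints "300 10", then "315 4"
```

In this model the deployment stays at 10 replicas until the last high recommendation leaves the 300 s window, and only then drops to 4.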
A Production-Grade HPA Manifest
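The autoscaling/v2 API (stable since Kubernetes 1.23) supersedes the v2beta2 API and goes well beyond autoscaling/v1, which supports only a CPU utilization target: it allows multiple metrics and per-direction behavior tuning. Tolerance, by contrast, has historically been a controller-wide flag rather than a per-HPA field. Below is a realistic manifest for a high-throughput API service: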
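One way to sketch such a manifest is as a Python dict serialized to JSON, which kubectl accepts as well as YAML. Every name and number here (api-hpa, prod, the replica bounds, the 60% CPU target, the policy values) is a hypothetical placeholder rather than a value from this lesson; only the field names follow the autoscaling/v2 schema.

```python
import json

# Hypothetical names and numbers throughout; tune them against your own load tests.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "api-hpa", "namespace": "prod"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "api"},
        "minReplicas": 3,
        "maxReplicas": 50,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 60},
                },
            }
        ],
        "behavior": {
            "scaleUp": {
                "stabilizationWindowSeconds": 0,  # react to spikes immediately
                "selectPolicy": "Min",            # the stricter policy wins
                "policies": [
                    {"type": "Percent", "value": 100, "periodSeconds": 15},
                    {"type": "Pods", "value": 10, "periodSeconds": 15},
                ],
            },
            "scaleDown": {
                "stabilizationWindowSeconds": 300,  # hold capacity for 5 minutes
                "policies": [{"type": "Percent", "value": 10, "periodSeconds": 60}],
            },
        },
    },
}

manifest_json = json.dumps(hpa, indent=2)
print(manifest_json)
```

Redirecting the output to a file such as hpa.json and running kubectl apply -f hpa.json would create the object, assuming a Deployment named api exists in that namespace.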
Keep minReplicas above zero, always. Google SRE guidance: keep at least N replicas, where N is sized to cover your worst-case cold-start time plus startup probe delay. For a service with a 30-second startup, a single pod that saturates under a traffic spike causes a latency cliff before a new pod is ready.
Behavior Policies in Detail
The behavior block gives you fine-grained rate limiting on scaling actions. Each direction accepts a list of policies, and selectPolicy resolves conflicts:
Min: choose the policy that allows the smallest change. Use it for scale-up when you want conservative growth, or for scale-down when you want to remove the fewest pods.
Max (the default): choose the policy that allows the largest change. For scale-up this adds capacity fastest; for scale-down it removes the most pods, so it is the less conservative choice.
Disabled: block scaling in that direction entirely. Useful during a deployment freeze.
The Percent and Pods policy types compose well. A common pattern is to allow doubling (Percent: 100) but hard-cap growth at 10 pods per period, so you never accidentally request 50 new nodes in 15 seconds and blow your cloud quota. The cap only binds with selectPolicy: Min; under the default Max, the more permissive of the two policies wins.
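To see how the two policy types interact for scale-up, the following sketch computes the replica ceiling reachable within one policy period. It is a simplified model under stated assumptions: all policies share the same periodSeconds, start_replicas is the count at the start of that period, and the function name is invented for illustration.

```python
import math


def scale_up_limit(start_replicas: int, policies: list, select_policy: str = "Max") -> int:
    """Highest replica count a scale-up may reach within one policy period."""
    if select_policy == "Disabled":
        return start_replicas  # scaling up is blocked entirely
    limits = []
    for policy in policies:
        if policy["type"] == "Pods":
            limits.append(start_replicas + policy["value"])
        elif policy["type"] == "Percent":
            limits.append(math.ceil(start_replicas * (100 + policy["value"]) / 100))
        else:
            raise ValueError(f"unknown policy type: {policy['type']}")
    # Min picks the smallest allowed change, Max the largest.
    return min(limits) if select_policy == "Min" else max(limits)


policies = [
    {"type": "Percent", "value": 100},  # may double
    {"type": "Pods", "value": 10},      # may add 10 pods
]

print(scale_up_limit(4, policies, "Min"))   # 8: doubling is the smaller change
print(scale_up_limit(40, policies, "Min"))  # 50: the 10-pod cap now binds
print(scale_up_limit(40, policies, "Max"))  # 80: the cap is ignored
```

At small replica counts the percentage policy is the tighter bound, and at large counts the pod cap takes over, which is why the pair needs selectPolicy Min to act as a hard limit.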
Custom Metrics via the Prometheus Adapter
Raw CPU tells you when your pods are saturated, not when your business service is saturated. Queue depth, active WebSocket connections, requests-in-flight, and GPU memory pressure are all better signals for many workloads. Kubernetes exposes these through the custom.metrics.k8s.io API, which the Prometheus Adapter implements.
A minimal adapter rule that exposes http_requests_in_flight from a Prometheus gauge:
Verify the metric is visible to the HPA controller:
External Metrics for Cloud-Native Signals
The external.metrics.k8s.io API handles metrics that are not associated with any Kubernetes object — SQS queue depth, a Pub/Sub subscription backlog, Kafka consumer-group lag. The adapter exposes these under type: External in the HPA spec:
With an AverageValue target on an external metric, the HPA divides the raw metric by the current replica count and compares that per-pod figure with the target. A queue depth of 1000 with a target of 50 drives the HPA to 20 replicas, but that math only holds if each pod can process 50 messages concurrently. Calibrate targets against your actual consumer throughput in load tests before enabling this in production.
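The queue-depth arithmetic can be checked with a few lines of Python. This is a sketch with hypothetical function names; it shows that dividing the total by the target gives the same answer as feeding the per-pod average into the ratio formula, whatever the current replica count is.

```python
import math


def desired_for_average_value(total_metric: float, target_average: float) -> int:
    """AverageValue target: enough replicas that total / replicas <= target."""
    return math.ceil(total_metric / target_average)


def desired_via_ratio(current_replicas: int, total_metric: float, target_average: float) -> int:
    """Same result through the ratio formula, using the per-pod average."""
    per_pod = total_metric / current_replicas
    return math.ceil(current_replicas * per_pod / target_average)


print(desired_for_average_value(1000, 50))  # 20
print(desired_via_ratio(5, 1000, 50))       # 20, starting from 5 replicas
print(desired_for_average_value(1001, 50))  # 21: one extra message rounds up
```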
The Tolerance Parameter and Metric Fluctuation
The HPA ignores small deviations from the target to avoid constant micro-scaling. The default tolerance is 10% (configurable globally via --horizontal-pod-autoscaler-tolerance). This means scaling is suppressed when the ratio currentMetric / desiredMetric is in the range [0.9, 1.1]. If your metric fluctuates ±15% naturally (e.g., Prometheus scrape jitter on a short window), you will see continuous scaling noise. The fix is to use a longer averaging window in the adapter rule (e.g., avg_over_time(...[5m]) instead of the raw instant value) or widen the stabilization window.
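The effect of the tolerance band, and of smoothing a noisy metric before the HPA sees it, can be illustrated as follows. The moving average is only a stand-in for a longer averaging window in the adapter query, and the alternating ±15% series is a synthetic assumption.

```python
def within_tolerance(current_metric: float, target_metric: float, tolerance: float = 0.1) -> bool:
    """True when the usage ratio is close enough to 1.0 that scaling is skipped."""
    return abs(current_metric / target_metric - 1.0) <= tolerance


def moving_average(samples: list, window: int) -> list:
    """Trailing mean over up to `window` samples."""
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


target = 100.0
raw = [115.0, 85.0] * 6  # metric swinging +/-15% around the target
smoothed = moving_average(raw, window=4)

raw_triggers = sum(not within_tolerance(v, target) for v in raw)
# Skip the first three points, whose windows are not yet full.
smooth_triggers = sum(not within_tolerance(v, target) for v in smoothed[3:])
print(raw_triggers, smooth_triggers)  # prints "12 0"
```

Every raw sample falls outside the 10% band, while every fully windowed average lands on the target, so the smoothed series would not trigger any scaling at all.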
HPA with KEDA for Advanced Patterns
Kubernetes Event-Driven Autoscaling (KEDA) extends HPA rather than replacing it. KEDA installs a ScaledObject CRD, registers itself as an external metrics provider, and creates and manages an HPA for each ScaledObject. The practical advantages over a hand-written HPA are scale-to-zero support and 50+ built-in scalers (Kafka lag, Redis list length, Datadog metrics, cron schedules). For Kafka-driven microservices at scale, KEDA is a widely used choice.
Do not run kubectl scale against a deployment that an HPA is managing. The HPA overrides your manual change on its next sync cycle. If you need to pin a replica count temporarily (deploy freeze, incident), either remove the HPA or set minReplicas and maxReplicas both to the desired count.
Debugging HPA in Production
When scaling is not happening as expected, these commands give you the full picture:
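The --horizontal-pod-autoscaler-initial-readiness-delay flag (default 30 s) is particularly important: a pod that is still within this window after starting and has not yet become ready is treated as not yet ready, and it is set aside from the CPU average calculation. It works alongside --horizontal-pod-autoscaler-cpu-initialization-period (default 5 minutes), which governs how long CPU samples from newly started pods may be skipped. Together these keep a freshly scaled batch of cold pods from skewing the observed CPU and triggering a premature scaling decision before the pods warm up.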