Horizontal Pod Autoscaling

Traffic is never flat. A payment service sees 10x its baseline load during a flash sale. A batch-analytics API gets hammered every morning when nightly reports are generated. Provisioning enough replicas to handle peak load at all times wastes money and CPU during off-peak hours. Provisioning for the average leaves you with a cascading failure at the worst possible moment.

Horizontal Pod Autoscaling (HPA) is Kubernetes' answer to this problem. Rather than fixing the replica count, you define a target for a metric, such as "keep average CPU utilisation at 60%", and the HPA controller continuously adjusts the Deployment's replicas field to meet that target. The result is a self-regulating system that scales out under load and scales back in when traffic drops, without operator intervention. Large platforms commonly run stateless workloads under an autoscaler, and managing replica counts by hand is widely considered an operational anti-pattern.

How the HPA Control Loop Works

HPA is a Kubernetes controller that runs a reconciliation loop, by default every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period on kube-controller-manager). Each iteration follows the same sequence:

1. Collect metrics: query the resource metrics API served by the Metrics Server (CPU/memory) or a custom/external metrics adapter.
2. Compute desired replicas: apply the scaling formula desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). If the ratio currentMetricValue / desiredMetricValue is within the tolerance (10% by default), the replica count is left unchanged.
3. Clamp to bounds: never go below minReplicas or above maxReplicas.
4. Apply stabilisation window: honour the configured cooldown to prevent thrashing.
5. Patch the target: write the new replica count to the Deployment (or StatefulSet, etc.).

The HPA reconciliation loop: metrics are polled every 15 s, desired replicas are computed and clamped, the stabilisation window prevents thrashing, then the Deployment is patched.
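
The compute and clamp steps can be sketched in a few lines of Python. This is a simplified model for a single metric, not the controller's actual code: the function name and parameters are illustrative, and the 10% tolerance band reflects the controller manager's default (--horizontal-pod-autoscaler-tolerance).

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified single-metric model of the HPA compute and clamp steps."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        desired = current_replicas
    else:
        # desiredReplicas = ceil(currentReplicas * current / target)
        desired = math.ceil(current_replicas * current_value / target_value)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60, 2, 20))    # 6: scale out
print(desired_replicas(4, 63, 60, 2, 20))    # 4: within tolerance, no change
print(desired_replicas(10, 150, 60, 2, 20))  # 20: formula says 25, capped at max
print(desired_replicas(6, 20, 60, 2, 20))    # 2: scale in to the minimum
```

With 4 replicas at 90% CPU against a 60% target, the formula gives ceil(4 × 90 / 60) = 6. The stabilisation step is left out here because it depends on the history of earlier recommendations.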

Prerequisites: Metrics Server

HPA for CPU and memory relies on the Metrics Server, a cluster add-on that scrapes resource usage from kubelets and exposes it via the metrics.k8s.io aggregated API. Some managed offerings (GKE and AKS, for example) ship it by default; on others, including many EKS clusters, and on bare-metal or kubeadm clusters you must install it yourself.

# Install Metrics Server (bare-metal / kubeadm clusters)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify it is running and ready
kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods -n production

# If Metrics Server is not ready, HPA shows: "unable to get metrics"
# For kind/minikube clusters with self-signed certs, add --kubelet-insecure-tls:
kubectl patch deployment metrics-server -n kube-system \
  --type='json' \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
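
To see what the HPA actually consumes, it can help to look at the raw metrics.k8s.io response, for example via kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/production/pods". The sketch below parses such a response in Python. The sample payload, pod name, and helper names are hypothetical and only mirror the general shape of a PodMetricsList.

```python
import json

# Hypothetical, trimmed PodMetricsList payload (values are made up).
SAMPLE = """
{
  "kind": "PodMetricsList",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "items": [
    {
      "metadata": {"name": "my-api-abc12", "namespace": "production"},
      "containers": [
        {"name": "api", "usage": {"cpu": "250000000n", "memory": "180Mi"}},
        {"name": "sidecar", "usage": {"cpu": "50m", "memory": "32Mi"}}
      ]
    }
  ]
}
"""


def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity string to millicores."""
    if quantity.endswith("n"):   # nanocores
        return int(quantity[:-1]) / 1_000_000
    if quantity.endswith("u"):   # microcores
        return int(quantity[:-1]) / 1_000
    if quantity.endswith("m"):   # millicores
        return float(quantity[:-1])
    return float(quantity) * 1000  # whole cores


def pod_cpu_millicores(payload):
    """Sum CPU usage over all containers of each pod."""
    doc = json.loads(payload)
    return {
        item["metadata"]["name"]: sum(
            cpu_to_millicores(c["usage"]["cpu"]) for c in item["containers"]
        )
        for item in doc["items"]
    }


print(pod_cpu_millicores(SAMPLE))  # {'my-api-abc12': 300.0}
```

If this API returns no items for your pods, the HPA has nothing to work with, which is why checking kubectl top pods comes before debugging the HPA itself.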

Basic HPA on CPU

The simplest HPA targets average CPU utilisation across all Pods in the Deployment. The utilisation percentage is relative to the container's CPU request, not the node's total CPU — so every container in the Deployment must declare a CPU request for HPA to work correctly.

# --- Declarative HPA manifest (preferred — commit to Git) ---
# hpa-api.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # target 60% of each Pod's CPU request

---
# The Deployment MUST have CPU requests set or HPA cannot compute utilisation
# excerpt from the Deployment spec:
# resources:
#   requests:
#     cpu: "500m"
#     memory: "256Mi"
#   limits:
#     cpu: "2"
#     memory: "1Gi"

# Apply and verify
kubectl apply -f hpa-api.yaml
kubectl get hpa api-hpa -n production
kubectl describe hpa api-hpa -n production

# Quick imperative form (useful for testing, not for production)
kubectl autoscale deployment my-api \
  --cpu-percent=60 \
  --min=2 \
  --max=20 \
  -n production

HPA and requests are inseparable: HPA computes utilisation as actual CPU used / CPU request. If a container has no CPU request, the HPA controller cannot compute a percentage, and the HPA events show "missing request for cpu". Always set a CPU request on every container in a Deployment managed by HPA; limits are optional as far as HPA is concerned. A LimitRange with a defaultRequest can fill in requests for containers that omit them.
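
A small worked example makes the "relative to the request" point concrete. The Python sketch below is a simplified model that sums usage and requests across pods and divides; the function name and the millicore inputs are illustrative.

```python
def cpu_utilisation_percent(usage_millicores, request_millicores):
    """Utilisation of a set of pods relative to their CPU requests."""
    if any(r is None or r <= 0 for r in request_millicores):
        # Mirrors the situation where the HPA reports a missing request.
        raise ValueError("missing request for cpu")
    return (sum(usage_millicores) * 100) // sum(request_millicores)


# Three pods, each requesting 500m, using 450m, 300m and 150m.
print(cpu_utilisation_percent([450, 300, 150], [500, 500, 500]))  # 60

# Usage may exceed the request when the limit is higher (2 CPUs here),
# so utilisation can go above 100%.
print(cpu_utilisation_percent([750], [500]))  # 150
```

Because the denominator is the request, halving the request doubles the reported utilisation for the same load. Changing requests therefore changes when the HPA scales.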

Scaling on Memory

Memory scaling behaves differently from CPU scaling in one critical respect: memory is not compressible. When a Pod approaches its memory limit it cannot be throttled the way CPU can; it is OOM-killed, so the only safe response to rising memory is to scale out. However, memory-based HPA must be tuned carefully because memory often does not drop after a scale-out event (the old pods still hold their heap allocations), which can trigger further scale-out loops. A common pattern is to scale on CPU or request rate (a custom metric), and to use memory requests primarily as a scheduling hint backed by Vertical Pod Autoscaler recommendations.

# HPA targeting both CPU and memory (HPA v2 takes the higher desired replica count)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 512Mi     # absolute value, not percentage
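
With several metrics, each one yields its own proposal and the largest wins. The Python sketch below illustrates this for the manifest above under simplified assumptions (no tolerance band, all pods ready). The function names and the sample readings are hypothetical.

```python
import math


def replicas_for_utilisation(current_replicas, current_pct, target_pct):
    """Proposal for a Utilization target (percentage of the request)."""
    return math.ceil(current_replicas * current_pct / target_pct)


def replicas_for_average_value(total_value, target_per_pod):
    """Proposal for an AverageValue target (absolute amount per pod)."""
    return math.ceil(total_value / target_per_pod)


# 4 pods at 45% CPU against a 60% target: CPU alone would scale in to 3.
cpu_proposal = replicas_for_utilisation(4, 45, 60)

# The same 4 pods hold 2560Mi in total against 512Mi per pod: memory wants 5.
memory_proposal = replicas_for_average_value(2560, 512)

print(cpu_proposal, memory_proposal, max(cpu_proposal, memory_proposal))  # 3 5 5
```

Here memory drives a scale-out even though CPU suggests a scale-in, which is exactly the situation where memory that never drops can keep the replica count high.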

Custom and External Metrics

CPU and memory are trailing indicators — they rise after your users are already experiencing latency. Production systems at scale autoscale on leading indicators: requests per second (RPS), queue depth, p99 latency, or even a scheduled load forecast. This requires the custom metrics API (custom.metrics.k8s.io) or external metrics API (external.metrics.k8s.io). Common adapters include KEDA (Kubernetes Event-Driven Autoscaling), Prometheus Adapter, and cloud-native adapters (Datadog, Stackdriver).

# HPA on custom metric: Prometheus "http_requests_per_second" (via Prometheus Adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # registered in Prometheus Adapter config
      target:
        type: AverageValue
        averageValue: "500"              # 500 RPS per Pod
  - type: External
    external:
      metric:
        name: sqs_queue_depth            # AWS SQS metric via external adapter
        selector:
          matchLabels:
            queue: orders
      target:
        type: AverageValue
        averageValue: "100"              # scale to keep <100 messages per Pod

KEDA for event-driven scaling: KEDA (keda.sh) is a widely used way to autoscale on external event sources such as SQS, Kafka, RabbitMQ, Azure Service Bus, Pub/Sub, and 60+ others. It installs an operator plus CRDs (such as ScaledObject) and creates and manages HPAs under the hood. It also supports scale-to-zero, which HPA does not do by default (minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled). If your workload is a queue consumer, KEDA scales on queue depth and idles to zero when the queue is empty, which can be a significant cost saving.
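
The queue-consumer case reduces to simple arithmetic. The Python sketch below models the outcome only, not KEDA's implementation: KEDA handles the 0 to 1 activation itself and hands 1 to N scaling to an HPA. The function name and numbers are illustrative.

```python
import math


def queue_consumer_replicas(queue_depth, target_per_pod, max_replicas):
    """Replica count for a queue consumer that may idle at zero."""
    if queue_depth == 0:
        return 0  # empty queue: scale to zero
    wanted = math.ceil(queue_depth / target_per_pod)
    # Any backlog needs at least one consumer; never exceed the cap.
    return min(max_replicas, max(1, wanted))


for depth in (0, 1, 3100, 9000):
    print(depth, queue_consumer_replicas(depth, target_per_pod=100, max_replicas=50))
# 0 0
# 1 1
# 3100 31
# 9000 50
```

A backlog of 9,000 messages would call for 90 consumers at 100 messages per pod, but the cap of 50 holds the count down, so the queue drains more slowly rather than overloading the cluster.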

Behaviour Tuning: Preventing Thrash

The default HPA behaviour works for most cases, but it may not suit latency-sensitive services. Without enough damping, a sudden spike causes a scale-out and, once the metric drops back, a scale-in: a thrashing loop that adds pod-startup latency with each oscillation. The behavior block (autoscaling/v2) gives you precise control over how fast the HPA scales in each direction.

# Production-grade HPA with aggressive scale-out and conservative scale-in
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # scale up immediately (no cooldown)
      policies:
      - type: Percent
        value: 100                        # double replicas at most per period
        periodSeconds: 15
      - type: Pods
        value: 4                          # or add at most 4 Pods per period
        periodSeconds: 15
      selectPolicy: Max                   # pick the policy that adds the most Pods
    scaleDown:
      stabilizationWindowSeconds: 300    # wait 5 min of stability before scaling in
      policies:
      - type: Percent
        value: 10                         # remove at most 10% of replicas per period
        periodSeconds: 60
      selectPolicy: Min                   # pick the policy that removes the fewest Pods
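
To get a feel for the scaleUp policies above, the Python sketch below computes how far the replica count may grow per 15-second period when both policies apply and selectPolicy is Max. It is a simplified model that treats periods as discrete steps; the function names are illustrative.

```python
import math


def scale_up_limit(period_start_replicas, percent=100, pods=4):
    """Highest replica count allowed after one period (selectPolicy: Max)."""
    by_percent = math.ceil(period_start_replicas * (100 + percent) / 100)
    by_pods = period_start_replicas + pods
    return max(by_percent, by_pods)


def ramp(start, desired):
    """Replica count after each period until the desired count is reached."""
    steps, current = [], start
    while current < desired:
        current = min(desired, scale_up_limit(current))
        steps.append(current)
    return steps


print(ramp(3, 30))  # [7, 14, 28, 30]
```

Starting from 3 replicas, the Pods policy wins the first period (3 + 4 = 7 beats doubling to 6), after which doubling takes over. Reaching 30 replicas takes four periods, roughly one minute.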

The default scale-down window is 5 minutes (300 seconds); do not reduce it carelessly. During scale-down the controller acts on the highest replica recommendation computed within the window, so a scale-in only happens once load has stayed low for the whole window. Reducing it to 0 means any brief traffic lull triggers a scale-in, and the next spike then has to create new pods, each of which adds startup latency (image pull, JVM warmup, etc.). A conservative starting point for web services is a window of 5 to 10 minutes; shorten it only after measuring how quickly your pods become ready.
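
The effect of the scale-down window can be sketched as "act on the highest recommendation seen within the window". The Python model below is a simplification with illustrative names; it ignores scale-up windows and rate policies.

```python
from collections import deque


class ScaleDownStabiliser:
    """Keeps recent recommendations and returns the highest one in the window."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def recommend(self, now, desired):
        self.history.append((now, desired))
        # Drop recommendations that have fallen out of the window.
        while self.history[0][0] < now - self.window:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)


# Load drops right after t=0: raw recommendations sampled once per minute.
raw = [10, 4, 4, 4, 4, 4, 4]

slow = ScaleDownStabiliser(window_seconds=300)
print([slow.recommend(t * 60, r) for t, r in enumerate(raw)])
# [10, 10, 10, 10, 10, 10, 4]

fast = ScaleDownStabiliser(window_seconds=0)
print([fast.recommend(t * 60, r) for t, r in enumerate(raw)])
# [10, 4, 4, 4, 4, 4, 4]
```

With a 300-second window the replica count holds at 10 until the high reading has aged out, while a window of 0 follows every dip immediately.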

Observing and Debugging HPA

When HPA is not behaving as expected, kubectl describe hpa is your first tool. The Events section and Conditions table tell you exactly what the controller saw and why it made (or did not make) a scaling decision.

# Real-time HPA status — shows current / desired / min / max and metric values
kubectl get hpa -n production -w

# Full detail: conditions, events, last scaling decision
kubectl describe hpa api-hpa -n production

# Simulate load to test scale-out (run in a separate terminal)
kubectl run -i --tty load-gen \
  --image=busybox:1.36 \
  --rm \
  --restart=Never \
  -n production \
  -- /bin/sh -c "while true; do wget -q -O- http://my-api-svc/; done"

# Check if Metrics Server is returning pod-level metrics
kubectl top pods -n production --sort-by=cpu

# Check HPA events: successful rescales first, then common failure reasons
kubectl get events -n production --field-selector reason=SuccessfulRescale
kubectl get events -n production --field-selector reason=FailedGetScale
kubectl get events -n production --field-selector reason=FailedGetResourceMetric

# Inspect the autoscaler status in detail
kubectl get hpa api-hpa -n production -o yaml | grep -A 20 status:
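
The Conditions table can also be checked programmatically, for example from the output of kubectl get hpa api-hpa -n production -o json. The Python sketch below flags conditions that deviate from the usual healthy state. The sample object is hypothetical and trimmed, and the helper name is illustrative.

```python
# Usual healthy values for the three HPA condition types.
HEALTHY = {"AbleToScale": "True", "ScalingActive": "True", "ScalingLimited": "False"}

# Hypothetical, trimmed HPA object as returned with -o json.
SAMPLE_HPA = {
    "status": {
        "conditions": [
            {"type": "AbleToScale", "status": "True",
             "reason": "SucceededGetScale",
             "message": "the HPA controller was able to get the target's current scale"},
            {"type": "ScalingActive", "status": "False",
             "reason": "FailedGetResourceMetric",
             "message": "missing request for cpu"},
        ]
    }
}


def conditions_to_check(hpa):
    """Return a line for every condition that is not in its healthy state."""
    findings = []
    for cond in hpa.get("status", {}).get("conditions", []):
        expected = HEALTHY.get(cond["type"])
        if expected is not None and cond["status"] != expected:
            findings.append(
                f'{cond["type"]}={cond["status"]} ({cond["reason"]}): {cond["message"]}'
            )
    return findings


for line in conditions_to_check(SAMPLE_HPA):
    print(line)
# ScalingActive=False (FailedGetResourceMetric): missing request for cpu
```

ScalingLimited=True is not an error in itself. It means the computed replica count was capped by minReplicas or maxReplicas, which is worth a look if the service is struggling at its maximum.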

HPA does not replace Cluster Autoscaler. HPA adds Pods to existing nodes. If all nodes are full, the new Pods will be Pending until a node is available. On cloud clusters, Cluster Autoscaler (or Karpenter on EKS) watches for Pending pods and provisions new nodes. These two autoscalers are designed to work together: HPA handles pod-level scaling, Cluster Autoscaler handles node-level scaling. Both must be tuned in concert for a fully self-healing, cost-optimal cluster.
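
The hand-off between the two autoscalers can be estimated with back-of-the-envelope arithmetic. The Python sketch below is a rough capacity model with illustrative names: it considers CPU requests only and ignores memory, DaemonSets, system reservations, and other workloads on the nodes.

```python
import math


def pending_and_nodes_needed(desired_pods, nodes, node_cpu_m, pod_request_m):
    """Return (pods left Pending, extra nodes needed), based on CPU requests."""
    pods_per_node = node_cpu_m // pod_request_m
    if pods_per_node == 0:
        raise ValueError("pod request exceeds node capacity")
    capacity = nodes * pods_per_node
    pending = max(0, desired_pods - capacity)
    extra_nodes = math.ceil(pending / pods_per_node)
    return pending, extra_nodes


# HPA wants 30 pods of 500m each on 3 nodes with 4000m allocatable CPU.
print(pending_and_nodes_needed(30, nodes=3, node_cpu_m=4000, pod_request_m=500))
# (6, 1)
```

In this example 24 pods fit, 6 stay Pending, and one additional node is enough. Until that node is ready the HPA's decision has no effect on serving capacity, so node provisioning time is part of the effective scale-out latency.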