Advanced Kubernetes Operations

Cluster Autoscaling & Karpenter



Horizontal Pod Autoscaling (HPA) scales your workloads; Cluster Autoscaling scales your infrastructure. When every node is full and a new Pod cannot be scheduled, you need capacity to appear — fast, cheaply, and with the right shape. Getting this right is the difference between a platform that absorbs traffic spikes gracefully and one that pages you at 3 AM because Pods have been Pending for 20 minutes.

This lesson covers two generations of the same idea: the Cluster Autoscaler (CA), the long-standing Kubernetes-project tool, and Karpenter, the newer alternative that originated at AWS, is now developed under Kubernetes SIG Autoscaling, and addresses most of CA's structural limitations.

Why Nodes Run Out: The Bin-Packing Problem

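Scheduling is a bin-packing problem: Pods of varying sizes must fit into nodes of fixed capacity. Each Pod declares requests (the CPU and memory reserved for it) and limits (a hard cap). The scheduler sums the requests of the Pods already on a node and will not place a new Pod there if the total would exceed the node's allocatable capacity. Note that the default scheduler does not pack nodes as tightly as possible; among the nodes that fit, its default scoring favors the less-utilized ones.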
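The fit rule can be sketched in a few lines of Python. This is an illustrative model, not the scheduler's actual code: the `Pod` and `Node` classes, the `schedule` helper, and all numbers are hypothetical, and only CPU and memory requests are considered (the real scheduler also evaluates taints, affinity, and other constraints).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pod:
    name: str
    cpu_request_m: int    # CPU request in millicores
    mem_request_mi: int   # memory request in MiB


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int
    allocatable_mem_mi: int
    pods: List[Pod] = field(default_factory=list)

    def fits(self, pod: Pod) -> bool:
        """True if summed requests, including this Pod, stay within allocatable."""
        cpu = sum(p.cpu_request_m for p in self.pods) + pod.cpu_request_m
        mem = sum(p.mem_request_mi for p in self.pods) + pod.mem_request_mi
        return cpu <= self.allocatable_cpu_m and mem <= self.allocatable_mem_mi


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Place the Pod on the first node with room; None means it stays Pending."""
    for node in nodes:
        if node.fits(pod):
            node.pods.append(pod)
            return node.name
    return None
```

A return value of `None` corresponds to the Pending state that triggers a cluster autoscaler. Limits play no part in the check, which is why only requests appear in the model.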

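In practice, a node's full capacity is never available to your workloads. The kubelet reserves resources for itself, the OS, and an eviction margin, and DaemonSet Pods (kube-proxy, log agents, CNI plugins) claim their own requests on every node. Together these typically take 5–15% of each node before workloads even start. A 4-vCPU node on a managed service such as EKS or GKE reports roughly 3.9 vCPU as allocatable, and DaemonSet requests can bring the share left for your Pods closer to 3.5 vCPU. This gap means you always need slightly more nodes than raw math suggests.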

Allocatable vs Capacity: Run kubectl describe node <name> and look at the Allocatable block — that is the number the scheduler uses. The delta between Capacity and Allocatable is reserved for the OS and Kubernetes system components.
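As a rough illustration of that delta, the kubelet derives Allocatable by subtracting its reservations from Capacity. The sketch below applies that subtraction; the function name and the reservation values are hypothetical, since the actual reservations depend on the provider and on kubelet configuration.

```python
def allocatable(capacity: int, kube_reserved: int, system_reserved: int,
                eviction_threshold: int = 0) -> int:
    """Allocatable = capacity - kube-reserved - system-reserved - eviction threshold."""
    value = capacity - kube_reserved - system_reserved - eviction_threshold
    return max(value, 0)


# Hypothetical 4-vCPU / 16 GiB node; reservation values are made up for illustration.
cpu_m = allocatable(4000, kube_reserved=80, system_reserved=0)
mem_mi = allocatable(16384, kube_reserved=600, system_reserved=200,
                     eviction_threshold=100)
print(f"allocatable CPU: {cpu_m}m, allocatable memory: {mem_mi}Mi")
```

Comparing these figures with the output of kubectl describe node on a real node is a quick way to see how much your provider reserves.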

Cluster Autoscaler (CA) — The Classic Approach

CA watches for Pending Pods and checks whether adding a node from any configured Node Group (AWS Auto Scaling Groups, GKE Managed Instance Groups, Azure VMSS) would make the Pod schedulable. If yes, it increments the group's desired count. It also scans for underutilized nodes and, after a configurable idle window (default 10 minutes), drains and terminates them.
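The scale-up decision described above can be modeled roughly as follows. This is a simplified sketch, not CA's implementation: the `NodeGroup` class and `scale_up` function are hypothetical, and taking the first matching group stands in for CA's configurable expander strategies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeGroup:
    name: str
    node_cpu_m: int    # allocatable CPU of one node in this group (millicores)
    node_mem_mi: int   # allocatable memory of one node in this group (MiB)
    desired: int       # current desired count
    max_size: int      # upper bound configured on the group


def scale_up(pod_cpu_m: int, pod_mem_mi: int,
             groups: List[NodeGroup]) -> Optional[str]:
    """Increment the first group whose node shape fits the Pod and has headroom."""
    for group in groups:
        fits = pod_cpu_m <= group.node_cpu_m and pod_mem_mi <= group.node_mem_mi
        if fits and group.desired < group.max_size:
            group.desired += 1
            return group.name
    return None  # no group helps; the Pod stays Pending
```

Note that the decision is made per group, using that group's fixed node shape. The Pod's size only selects which group grows, never the size of the node that arrives.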

Key CA limitations:

  • ASG-coupled thinking: CA operates on pre-defined node groups and assumes every node in a group has the same CPU and memory shape. Covering differently sized instance types therefore requires multiple groups (in practice, one ASG per instance size) and careful priority configuration.
  • Slow scale-up: CA polls every 10 seconds, then waits for the cloud provider to provision a node (1–3 minutes for EC2), then kubelet bootstraps (~30–60 s). Total: often 3–5 minutes from Pending to running Pod.
  • Scale-down conservatism: CA will not remove a node if any non-mirrored, non-DaemonSet Pod on it lacks a controller, or if a PodDisruptionBudget would be violated. This is safe but leaves idle nodes running longer than necessary.
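The scale-down rule in the last bullet can be expressed as a small check. This is a minimal sketch covering only the conditions named above; the `PodInfo` class and `scale_down_blockers` function are hypothetical, and the real CA evaluates additional conditions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodInfo:
    name: str
    has_controller: bool             # owned by a ReplicaSet, StatefulSet, Job, ...
    is_daemonset: bool = False
    is_mirror: bool = False          # static Pod mirrored by the kubelet
    pdb_allows_eviction: bool = True


def scale_down_blockers(pods: List[PodInfo]) -> List[str]:
    """Return the reasons a node cannot be removed; an empty list means removable."""
    reasons = []
    for pod in pods:
        if pod.is_daemonset or pod.is_mirror:
            continue  # these Pods never block node removal
        if not pod.has_controller:
            reasons.append(f"{pod.name}: no controller would recreate it")
        if not pod.pdb_allows_eviction:
            reasons.append(f"{pod.name}: eviction would violate its PodDisruptionBudget")
    return reasons
```

A single bare Pod, such as a forgotten debug shell, is enough to pin an otherwise idle node, which is a common cause of unexpectedly high node counts.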
    # Install Cluster Autoscaler via Helm (EKS example)
    helm repo add autoscaler https://kubernetes.github.io/autoscaler
    helm repo update
    helm install cluster-autoscaler autoscaler/cluster-autoscaler \
      --namespace kube-system \
      --set autoDiscovery.clusterName=my-cluster \
      --set awsRegion=us-east-1 \
      --set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::123456789012:role/ClusterAutoscalerRole \
      --set extraArgs.balance-similar-node-groups=true \
      --set extraArgs.skip-nodes-with-system-pods=false \
      --set extraArgs.scale-down-delay-after-add=5m \
      --set extraArgs.scale-down-unneeded-time=10m

    # Tag the ASG-backed node groups so CA discovers them
    # (done on the ASG in the AWS Console or via Terraform):
    #   k8s.io/cluster-autoscaler/enabled    = "true"
    #   k8s.io/cluster-autoscaler/my-cluster = "owned"

Karpenter — A Fundamentally Different Model

Karpenter (originally built by AWS, now developed as an open-source project under Kubernetes SIG Autoscaling) abandons the Node Group abstraction entirely. Instead of managing ASGs, Karpenter calls cloud APIs directly to launch exactly the instance type that fits the pending workload. This eliminates the ASG indirection layer and makes provisioning both faster and more cost-efficient.

Karpenter introduces two CRDs:

  • NodePool — replaces the old Provisioner (pre-v0.32). Defines which instance families, zones, capacity types (on-demand or spot), taints/labels, and disruption budgets are allowed.
  • EC2NodeClass (AWS-specific) — describes the underlying EC2 configuration: AMI family, subnet selectors, security group selectors, instance profile, userData.
Figure: Karpenter provisioning flow vs Cluster Autoscaler. Cluster Autoscaler path: Pending Pod (scheduler marks it unschedulable) → CA polls every 10 s → scale ASG +1 (fixed instance type) → new node joins, ~3–5 min total. Karpenter path: Pending Pod (scheduler marks it unschedulable) → Karpenter watches (event-driven, no poll) → direct EC2 API call (best-fit instance picked) → new node joins, ~60–90 s total.
Karpenter's event-driven, direct-EC2 approach provisions capacity 2–4x faster than the classic Cluster Autoscaler.
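The "best-fit instance" step can be illustrated with a toy selection function. This is a sketch of the idea only, not Karpenter's algorithm: the catalog, its type names, and its prices are made up, and the model assumes all pending Pods go onto a single new node while ignoring per-node overhead.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_m: int
    mem_mi: int
    hourly_price: float   # made-up price, for illustration only


# Hypothetical catalog; real instance types and prices come from the cloud provider.
CATALOG = [
    InstanceType("type-a", 2000, 4096, 0.05),
    InstanceType("type-b", 4000, 8192, 0.10),
    InstanceType("type-c", 4000, 32768, 0.14),
    InstanceType("type-d", 8000, 16384, 0.20),
]


def cheapest_fit(pending: List[Tuple[int, int]],
                 catalog: List[InstanceType]) -> Optional[InstanceType]:
    """Pick the cheapest type whose size covers the summed (cpu_m, mem_mi) requests."""
    need_cpu = sum(cpu for cpu, _ in pending)
    need_mem = sum(mem for _, mem in pending)
    candidates = [t for t in catalog
                  if t.cpu_m >= need_cpu and t.mem_mi >= need_mem]
    return min(candidates, key=lambda t: t.hourly_price, default=None)
```

The contrast with a node-group model is that the pending Pods drive the choice of node shape, so three small Pods and one memory-heavy Pod can result in different instance types.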
    # Install Karpenter on EKS (v1.x, using Helm)
    export KARPENTER_VERSION=1.0.6
    export CLUSTER_NAME=my-cluster
    export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
    export AWS_REGION=us-east-1

    helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
      --version "${KARPENTER_VERSION}" \
      --namespace kube-system \
      --set "settings.clusterName=${CLUSTER_NAME}" \
      --set "settings.interruptionQueue=${CLUSTER_NAME}" \
      --set controller.resources.requests.cpu=1 \
      --set controller.resources.requests.memory=1Gi \
      --set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole"

    # NodePool: defines allowed capacity
    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: default
    spec:
      template:
        metadata:
          labels:
            billing-team: platform
        spec:
          nodeClassRef:
            group: karpenter.k8s.aws
            kind: EC2NodeClass
            name: default
          requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot", "on-demand"]
            - key: kubernetes.io/arch
              operator: In
              values: ["amd64", "arm64"]
            - key: karpenter.k8s.aws/instance-category
              operator: In
              values: ["c", "m", "r"]
            - key: karpenter.k8s.aws/instance-generation
              operator: Gt
              values: ["4"]
          expireAfter: 720h   # recycle nodes after 30 days
      limits:
        cpu: "1000"
        memory: 4000Gi
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
        consolidateAfter: 1m
    ---
    # EC2NodeClass: AWS-specific config
    apiVersion: karpenter.k8s.aws/v1
    kind: EC2NodeClass
    metadata:
      name: default
    spec:
      amiFamily: AL2023
      amiSelectorTerms:
        - alias: al2023@latest   # the v1 API expects amiSelectorTerms; pin a version in production
      subnetSelectorTerms:
        - tags:
            karpenter.sh/discovery: my-cluster
      securityGroupSelectorTerms:
        - tags:
            karpenter.sh/discovery: my-cluster
      instanceProfile: KarpenterNodeInstanceProfile
      blockDeviceMappings:
        - deviceName: /dev/xvda
          ebs:
            volumeSize: 50Gi
            volumeType: gp3
            encrypted: true

Spot Capacity: Saving 60–90% on Compute

Spot Instances (AWS), Spot VMs (GCP, formerly Preemptible VMs), and Spot VMs (Azure) offer deep discounts, typically 60–90% below on-demand prices, in exchange for the provider's right to reclaim the capacity at short notice: 2 minutes on AWS, about 30 seconds on GCP and Azure. Used correctly, spot is transformative for batch workloads, stateless services, and even some stateful workloads with proper disruption handling.

Production rules for spot-safe workloads:

  1. Always define a PodDisruptionBudget (PDB) so Karpenter/CA cannot drain too many replicas simultaneously.
  2. Spread Pods across multiple instance families and Availability Zones — spot pools are per-instance-type per-AZ; diversification dramatically reduces simultaneous interruption risk.
  3. Handle SIGTERM gracefully. Your application must finish in-flight requests within terminationGracePeriodSeconds (recommend 60–120 s).
  4. Never run singleton critical components (etcd, CA itself, admission webhooks) on spot.
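Rule 3 is the one that lives in application code. The sketch below shows one way a Python worker might react to SIGTERM: finish the job in progress, then stop picking up new work. The `ShutdownFlag` class and `run` function are hypothetical names, and the handler must be installed from the main thread.

```python
import signal
from typing import Callable, Iterable, List


class ShutdownFlag:
    """Records that SIGTERM was received so the main loop can exit cleanly."""

    def __init__(self) -> None:
        self.requested = False

    def _handle(self, signum, frame) -> None:
        # Keep the handler tiny: set a flag and let the main loop react.
        self.requested = True

    def install(self) -> "ShutdownFlag":
        signal.signal(signal.SIGTERM, self._handle)
        return self


def run(flag: ShutdownFlag, jobs: Iterable[Callable[[], int]]) -> List[int]:
    """Run jobs in order; after SIGTERM, finish the current job and take no more."""
    results = []
    for job in jobs:
        if flag.requested:
            break
        results.append(job())
    return results
```

For this to be spot-safe, the longest single job has to fit inside terminationGracePeriodSeconds; otherwise the kubelet sends SIGKILL before the job completes.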

Karpenter handles spot interruptions through an SQS interruption queue: when EC2 issues a spot interruption notice, Karpenter reads it from SQS, immediately cordons and begins draining the node, and starts provisioning replacement capacity, all before the 2-minute window expires. Recovery is much faster than waiting for the node to disappear and the Pods to be rescheduled afterwards, and it removes the need to run a separate tool such as AWS Node Termination Handler.
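To make the flow concrete, the sketch below parses an interruption message and lists the steps taken in response. It is not Karpenter's code: the function names and the action strings are hypothetical, and the message shape assumes the EventBridge "EC2 Spot Instance Interruption Warning" event delivered as the SQS message body.

```python
import json
from typing import Dict, List, Optional

SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


def interrupted_instance(message_body: str) -> Optional[str]:
    """Return the instance ID if the message is a spot interruption warning."""
    event = json.loads(message_body)
    if event.get("source") != "aws.ec2" or event.get("detail-type") != SPOT_WARNING:
        return None
    return event.get("detail", {}).get("instance-id")


def plan_response(message_body: str, nodes_by_instance: Dict[str, str]) -> List[str]:
    """List the steps for the affected node; empty if the message is irrelevant."""
    instance_id = interrupted_instance(message_body)
    node = nodes_by_instance.get(instance_id) if instance_id else None
    if node is None:
        return []
    return [
        f"cordon {node}",
        f"provision replacement capacity for {node}",
        f"drain {node}",
    ]
```

Messages about instances that are not part of the cluster, or about other event types, produce an empty plan and can simply be discarded.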

Spot diversification rule of thumb: allow at least 5–8 instance types per NodePool. Requiring karpenter.k8s.aws/instance-category in [c, m, r] together with instance-generation greater than 4 (generation 5 and newer) typically leaves dozens of eligible instance types per zone, which gives Karpenter a wide set of spot pools to draw from and improves capacity availability.

Consolidation: Karpenter's Superpower

Consolidation is Karpenter's ability to continuously right-size your node fleet. When nodes are underutilized, Karpenter simulates whether all their Pods could fit on fewer (or smaller) nodes, then executes the bin-packing: it launches a cheaper replacement node, drains the inefficient nodes, and lets the Pods reschedule. This happens autonomously, within your PDB constraints, without any manual intervention.
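The simulation step can be approximated with a first-fit-decreasing check. This is a deliberately small model, not Karpenter's implementation: the `Node` class and `can_remove` function are hypothetical, only CPU requests are tracked, and replacing a node with a cheaper one is not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int
    pod_cpu_m: List[int] = field(default_factory=list)  # CPU requests of its Pods

    def free_cpu_m(self) -> int:
        return self.allocatable_cpu_m - sum(self.pod_cpu_m)


def can_remove(candidate: Node, others: List[Node]) -> bool:
    """True if every Pod on `candidate` fits into spare room on `others`."""
    free = [node.free_cpu_m() for node in others]
    # Place the largest Pods first; they are the hardest to fit.
    for cpu in sorted(candidate.pod_cpu_m, reverse=True):
        for i, room in enumerate(free):
            if cpu <= room:
                free[i] -= cpu
                break
        else:
            return False  # this Pod fits nowhere, so the node must stay
    return True
```

A real consolidation pass would also have to respect PodDisruptionBudgets, memory, and scheduling constraints before draining anything.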

Set consolidationPolicy: WhenEmptyOrUnderutilized for full consolidation. Use WhenEmpty only for sensitive production tiers where you want to avoid any voluntary disruption of running Pods.

Do not set consolidateAfter: 0s in production. Aggressive consolidation triggers constant node churn, which disrupts Pods unnecessarily and can cause cascading failures if your workload has slow startup times. A value of 1m to 5m gives the scheduler time to stabilize after a scale event before another consolidation round fires.

Observing the Autoscaler in Action

    # Watch Karpenter controller logs in real time
    kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f --container controller

    # See pending pods and why they are unschedulable
    kubectl get pods --all-namespaces --field-selector=status.phase=Pending
    kubectl describe pod <pending-pod> -n <namespace> | grep -A 10 Events

    # List Karpenter-managed nodes and their NodePool
    kubectl get nodes -l karpenter.sh/nodepool --show-labels

    # Inspect a NodeClaim (Karpenter's internal node request object)
    kubectl get nodeclaims
    kubectl describe nodeclaim <name>

    # Cost: read the node labels for instance type, capacity type, and zone
    kubectl get node <name> -o json | jq \
      '{instance: .metadata.labels["node.kubernetes.io/instance-type"],
        capacity: .metadata.labels["karpenter.sh/capacity-type"],
        zone: .metadata.labels["topology.kubernetes.io/zone"]}'

Pair Karpenter with the Kubernetes Metrics Server and HPA for a complete autoscaling stack. HPA adds replicas in response to CPU, memory, or custom metrics; when the existing nodes are full, the new replicas go Pending; Karpenter sees the Pending Pods and launches capacity for them. The two controllers do not conflict because they act on different resources (Pods vs Nodes).
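The hand-off between the two controllers comes down to simple arithmetic. The sketch below uses the replica formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and then counts the replicas that will not fit on existing nodes. The function names are hypothetical, and HPA's tolerance and stabilization settings are left out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    """HPA core formula: scale replicas in proportion to metric / target."""
    return math.ceil(current_replicas * current_metric / target_metric)


def pending_after_scale(desired: int, running: int, free_slots: int) -> int:
    """Replicas that cannot be placed on existing nodes and will sit Pending."""
    return max(desired - running - free_slots, 0)


# Hypothetical example: 10 replicas at 75% CPU against a 50% target,
# with room for only 2 more replicas on the current nodes.
wanted = desired_replicas(10, 75, 50)
stuck = pending_after_scale(wanted, running=10, free_slots=2)
print(f"HPA wants {wanted} replicas; {stuck} will be Pending until nodes arrive")
```

The Pending count is the signal the node autoscaler acts on, which is why accurate resource requests matter for both layers.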

CA or Karpenter? On EKS, Karpenter is now the recommended path for new clusters. CA remains a valid choice on GKE and AKS, where the providers also offer their own node auto-provisioning (GKE Autopilot fills a role similar to Karpenter's). If you are migrating an existing CA-managed cluster to Karpenter, run both in parallel with non-overlapping node group tags during the transition, then drain and remove the CA-managed ASGs last.