Advanced Kubernetes Operations

Cluster Upgrades & Node Management

Keeping a Kubernetes cluster on a supported, patched version is not optional — it is a fundamental reliability and security obligation. Yet a botched upgrade is one of the fastest ways to cause a widespread outage. This lesson covers how senior SREs execute zero-downtime upgrades in production: the control-plane-first strategy, PodDisruptionBudgets, safe node draining, and the modern node-pool patterns used at big-tech scale.

The Kubernetes Release Cycle and Skew Policy

Kubernetes releases three minor versions per year (roughly every four months). Each minor version receives patch updates and critical CVE backports for about 14 months after its release date. The upstream skew policy is strict:

kubelet must never be newer than kube-apiserver, and may be up to three minor versions older (two for kubelets older than v1.25). A v1.28 apiserver supports v1.25 through v1.28 kubelets, but not a v1.29 kubelet.
Control-plane components (kube-scheduler, kube-controller-manager) can be one minor version behind the apiserver — this is the window you exploit to roll them safely.
kubectl can be ±1 minor version of the apiserver.
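The kubelet side of the skew policy can be checked mechanically before an upgrade. The following is a minimal sketch in POSIX shell, not an official tool: the helper names minor_of and kubelet_skew_ok are hypothetical, and the permitted lag is a parameter that defaults to 3 on the assumption that the cluster follows the current upstream policy. Pass a smaller value if your distribution is stricter.

```shell
# Hypothetical helpers; not part of kubectl or kubeadm.

# minor_of VERSION: print the minor number, e.g. v1.30.2 -> 30
minor_of() (
  v=${1#v}          # strip an optional leading "v"
  rest=${v#*.}      # drop the major version and the first dot
  echo "${rest%%.*}"
)

# kubelet_skew_ok APISERVER_VERSION KUBELET_VERSION [MAX_SKEW]
# Succeeds when the kubelet is not newer than the apiserver and trails it
# by at most MAX_SKEW minor versions (default 3, an assumption to adjust).
kubelet_skew_ok() (
  api=$(minor_of "$1")
  kubelet=$(minor_of "$2")
  max_skew=${3:-3}
  [ "$kubelet" -le "$api" ] && [ $((api - kubelet)) -le "$max_skew" ]
)

# Possible usage against a live cluster (v1.30.2 is the intended target):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' |
#   while read -r name ver; do
#     kubelet_skew_ok v1.30.2 "$ver" || echo "$name: kubelet $ver is out of skew"
#   done
```

Running the check against the target apiserver version, rather than the current one, shows which nodes would fall out of policy once the control plane is upgraded.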

Plan upgrades incrementally. You cannot jump from v1.26 to v1.29. Each minor version must be traversed in order: 1.26 → 1.27 → 1.28 → 1.29. Budget at least one maintenance window per version hop, and verify that your workload APIs are not deprecated in the target version before starting.
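To make the one-hop-at-a-time rule concrete, a small helper can list the hops between two minor versions so that each one can be given its own maintenance window. This is an illustrative sketch; upgrade_path is a hypothetical name and the function only does arithmetic on minor numbers.

```shell
# upgrade_path FROM_MINOR TO_MINOR: print each required hop, one per line.
upgrade_path() (
  from=$1
  to=$2
  while [ "$from" -lt "$to" ]; do
    next=$((from + 1))
    echo "1.$from -> 1.$next"
    from=$next
  done
)

upgrade_path 26 29
# prints:
# 1.26 -> 1.27
# 1.27 -> 1.28
# 1.28 -> 1.29
```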

The Upgrade Sequence: Control Plane First

The golden rule: always upgrade the control plane before the data plane. The apiserver must always be the most recent component in the cluster. The recommended order is:

Back up etcd.
Upgrade kube-apiserver.
Upgrade kube-controller-manager and kube-scheduler.
Upgrade each node pool, one batch at a time.
Upgrade add-ons (CoreDNS, kube-proxy, CNI plugin, metrics-server).

On managed clusters (EKS, GKE, AKS) steps 1–3 are handled by the cloud provider. You invoke them via console or CLI, wait for the control plane to stabilize, then drain and replace node groups yourself, or use managed node groups that automate it. On self-managed clusters built with kubeadm, you run kubeadm upgrade on each node in turn.
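For a self-managed cluster, the etcd backup in step 1 might look like the sketch below. It assumes a kubeadm-style stacked etcd whose certificates live under /etc/kubernetes/pki/etcd and which listens on https://127.0.0.1:2379; both are overridable through the hypothetical ETCD_PKI and ETCD_ENDPOINT variables, and snapshot_path and backup_etcd are illustrative names.

```shell
# Assumed kubeadm defaults; adjust for your cluster.
ETCD_PKI=${ETCD_PKI:-/etc/kubernetes/pki/etcd}
ETCD_ENDPOINT=${ETCD_ENDPOINT:-https://127.0.0.1:2379}

# snapshot_path DIR: timestamped file name for the snapshot.
snapshot_path() {
  echo "$1/etcd-$(date -u +%Y%m%dT%H%M%SZ).db"
}

# backup_etcd DIR: save a snapshot into DIR and print its path.
# Fails if etcdctl fails or the resulting file is empty.
backup_etcd() (
  out=$(snapshot_path "$1")
  ETCDCTL_API=3 etcdctl \
    --endpoints="$ETCD_ENDPOINT" \
    --cacert="$ETCD_PKI/ca.crt" \
    --cert="$ETCD_PKI/server.crt" \
    --key="$ETCD_PKI/server.key" \
    snapshot save "$out" >&2 || exit 1
  [ -s "$out" ] && echo "$out"
)

# Usage on a control-plane node, as root:
# backup_etcd /var/backups
```

Copy the snapshot off the node before starting the upgrade; a backup that lives only on the machine being upgraded is of limited use.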

# --- Self-managed clusters: kubeadm upgrade workflow ---
# Run the apt, kubeadm, and systemctl commands as root on the node being
# upgraded, and the kubectl commands from a machine with cluster access.
# The apt repository must already point at the v1.30 package stream.

# Step 1: upgrade kubeadm on the first control-plane node
apt-mark unhold kubeadm && \
  apt-get update && apt-get install -y kubeadm=1.30.2-1.1 && \
  apt-mark hold kubeadm

# Step 2: preview the upgrade (target versions and any manual steps required)
kubeadm upgrade plan v1.30.2

# Step 3: apply (upgrades apiserver, controller-manager, scheduler, etcd config)
kubeadm upgrade apply v1.30.2

# Step 4: upgrade kubelet + kubectl on this node
apt-mark unhold kubelet kubectl && \
  apt-get install -y kubelet=1.30.2-1.1 kubectl=1.30.2-1.1 && \
  apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet

# Step 5: on each additional control-plane node, upgrade kubeadm as in step 1,
# run 'upgrade node' (not 'apply'), then repeat step 4
kubeadm upgrade node

# Step 6: for each worker node, upgrade kubeadm as in step 1 and refresh the
# local kubelet configuration, then drain, upgrade kubelet, and uncordon
kubeadm upgrade node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
apt-mark unhold kubelet kubectl && \
  apt-get install -y kubelet=1.30.2-1.1 kubectl=1.30.2-1.1 && \
  apt-mark hold kubelet kubectl
systemctl daemon-reload && systemctl restart kubelet
kubectl uncordon <node-name>

PodDisruptionBudgets: Guaranteeing Availability During Maintenance

A PodDisruptionBudget (PDB) is the contract between operations and workloads. It tells the Kubernetes eviction API the minimum number (or percentage) of pods that must remain available during voluntary disruptions — node drains, cluster upgrades, or manual evictions. Without PDBs, a drain can terminate all replicas of a critical service simultaneously.

PDBs apply to voluntary disruptions only. Hardware failure is an involuntary disruption and PDBs cannot prevent it — that is what replicas and topology spread constraints handle.

# PDB: at least 2 pods of the api-server deployment must be available at all times
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: 2          # absolute count; alternatively use maxUnavailable
  selector:
    matchLabels:
      app: api-server

---
# PDB using percentage: no more than 25% of pods may be unavailable
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-pdb
  namespace: production
spec:
  maxUnavailable: "25%"
  selector:
    matchLabels:
      app: background-worker

# Check the current PDB status from a shell. ALLOWED DISRUPTIONS is the number
# of pods that may be evicted right now; 0 means a drain would block.
kubectl get pdb -n production
# NAME              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# api-server-pdb    2               N/A               1                     3d
# worker-pdb        N/A             25%               2                     3d

PDB deadlock: a real production trap. If minAvailable equals the total replica count, or maxUnavailable is 0, the eviction API refuses every eviction and kubectl drain keeps retrying until its timeout, which by default means forever. Always set PDBs so that at least one disruption is allowed. A common safe default is minAvailable: 1 for small deployments (which must then run at least two replicas) and maxUnavailable: "25%" for large ones.
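The deadlock condition is simple arithmetic, so it can be checked before a PDB is ever applied. The sketch below is a simplified model of how allowed disruptions are derived, assuming all replicas are healthy unless stated otherwise and that percentages round up as the upstream documentation describes. The function names are hypothetical, and the real controller also accounts for pods that are already being disrupted.

```shell
# ceil_percent PERCENT TOTAL: PERCENT% of TOTAL, rounded up.
ceil_percent() {
  echo $(( ($1 * $2 + 99) / 100 ))
}

# allowed_by_min_available HEALTHY MIN_AVAILABLE
# Evictions permitted right now under an absolute minAvailable.
allowed_by_min_available() (
  n=$(( $1 - $2 ))
  [ "$n" -lt 0 ] && n=0
  echo "$n"
)

# allowed_by_max_unavailable_pct REPLICAS HEALTHY PERCENT
# Evictions permitted right now under a percentage maxUnavailable.
allowed_by_max_unavailable_pct() (
  max_unavail=$(ceil_percent "$3" "$1")
  n=$(( $2 - ($1 - max_unavail) ))
  [ "$n" -lt 0 ] && n=0
  echo "$n"
)

allowed_by_min_available 3 2            # prints 1: one pod may be evicted
allowed_by_min_available 2 2            # prints 0: a drain would block
allowed_by_max_unavailable_pct 8 8 25   # prints 2
allowed_by_max_unavailable_pct 8 8 0    # prints 0: a drain would block
```

Any result of 0 with every replica healthy means the PDB can never be satisfied during a drain and should be loosened, or the workload scaled up, before maintenance begins.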

Draining Nodes Safely

kubectl drain does two things: it cordons the node (marks it SchedulingDisabled so no new pods land on it) and then evicts all evictable pods, respecting PDBs. Understanding each flag is essential:

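--ignore-daemonsets — without this flag, drain refuses to proceed when the node runs DaemonSet-managed pods, because the DaemonSet controller would immediately re-create them on the same node. The flag tells drain to leave those pods in place.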
--delete-emptydir-data — evicts pods with emptyDir volumes. Required on most real clusters. Data in emptyDir is lost on eviction — make sure the workload is designed for it.
--pod-selector — drain only pods matching a label, leaving others in place (useful for incremental maintenance).
--timeout — how long to wait for pods to terminate. Default is infinite; set a realistic value (e.g., 300s) to avoid hangs on misbehaving pods.
--force — evict pods not managed by a controller (bare pods). Use with caution; those pods are gone permanently.

# Safe production drain sequence

# 1. Cordon first (stops new scheduling, lets you verify before evicting)
kubectl cordon ip-10-0-1-45.ec2.internal

# 2. Check what would be affected: list the pods, then preview the evictions
kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-0-1-45.ec2.internal
kubectl drain ip-10-0-1-45.ec2.internal --ignore-daemonsets --delete-emptydir-data --dry-run=client

# 3. Drain with PDB awareness and a timeout guard
kubectl drain ip-10-0-1-45.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=300s

# 4. Verify node is empty (DaemonSet pods remain — that is correct)
kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-0-1-45.ec2.internal

# 5. Perform maintenance (OS patch, kubelet upgrade, instance replacement, etc.)

# 6. Uncordon to return the node to the scheduler pool
kubectl uncordon ip-10-0-1-45.ec2.internal
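When a drain times out because a PDB temporarily allows no disruptions, retrying after a pause is often enough. The wrapper below is one possible sketch of that idea, using the same drain flags as the sequence above. drain_with_retry and the DRAIN_RETRY_DELAY variable are hypothetical names, and the attempt count and delay are arbitrary defaults to tune.

```shell
# drain_with_retry NODE [ATTEMPTS]
# Retries kubectl drain up to ATTEMPTS times (default 3), sleeping
# DRAIN_RETRY_DELAY seconds (default 30) between attempts.
drain_with_retry() (
  node=$1
  attempts=${2:-3}
  delay=${DRAIN_RETRY_DELAY:-30}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if kubectl drain "$node" \
        --ignore-daemonsets \
        --delete-emptydir-data \
        --timeout=300s; then
      echo "drained $node on attempt $i"
      exit 0
    fi
    echo "drain of $node failed (attempt $i/$attempts)" >&2
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  exit 1
)

# Usage:
# drain_with_retry ip-10-0-1-45.ec2.internal 5 || echo "needs manual attention"
```

A failed drain leaves the node cordoned, which is usually the right state while someone investigates the blocking PDB or the pod that refuses to terminate.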

Node Pools and Blue/Green Node Upgrades

At scale, draining individual nodes is too slow and error-prone. The industry-standard approach is blue/green node pools: provision a new node group on the target Kubernetes version, migrate workloads to it, then terminate the old group. This gives you a fast rollback path for as long as the old group still exists: if the new nodes behave badly, uncordon the old group, then cordon and drain the new one.

Blue/green node pool upgrade: new pool (v1.30) receives workloads while the old pool (v1.29) is cordoned and drained, then terminated.
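The cordon and uncordon halves of a blue/green switch can be expressed as one helper that flips a whole pool by label selector. This is a sketch under the assumption that every node in a pool carries a distinguishing label; pool_schedulable and rollback_pool are hypothetical names.

```shell
# pool_schedulable SELECTOR on|off
# Uncordons (on) or cordons (off) every node matching the label selector.
pool_schedulable() (
  selector=$1
  case "$2" in
    on)  action=uncordon ;;
    off) action=cordon ;;
    *)   echo "usage: pool_schedulable SELECTOR on|off" >&2; exit 2 ;;
  esac
  kubectl get nodes -l "$selector" -o name |
    while read -r node; do
      # </dev/null keeps kubectl from consuming the node list on stdin
      kubectl "$action" "$node" </dev/null
    done
)

# rollback_pool NEW_SELECTOR OLD_SELECTOR
# Reopens the old pool first so capacity exists, then closes the new one.
rollback_pool() {
  pool_schedulable "$2" on && pool_schedulable "$1" off
}

# Usage:
# rollback_pool eks.amazonaws.com/nodegroup=workers-v130 eks.amazonaws.com/nodegroup=workers
```

Cordoning only stops new pods from landing on the new pool. Pods already running there stay put until the nodes are drained, so a full rollback still needs a drain pass afterwards.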

On EKS the pattern uses managed node groups or self-managed Auto Scaling Groups. The workflow with the AWS CLI and eksctl is:

# EKS: upgrade the control plane first via AWS CLI
aws eks update-cluster-version \
  --name prod-cluster \
  --kubernetes-version 1.30 \
  --region us-east-1

# Wait for the control plane to be ACTIVE
aws eks wait cluster-active --name prod-cluster --region us-east-1

# Option A: Managed node group in-place rolling update (AWS drains nodes for you)
aws eks update-nodegroup-version \
  --cluster-name prod-cluster \
  --nodegroup-name workers \
  --kubernetes-version 1.30 \
  --region us-east-1

# Option B: Blue/green via eksctl (create a new node group, then delete the old one)
eksctl create nodegroup \
  --cluster prod-cluster \
  --name workers-v130 \
  --version 1.30 \
  --node-type m6i.2xlarge \
  --nodes 6 --nodes-min 3 --nodes-max 12 \
  --region us-east-1

# Cordon every old node first, so evicted pods cannot land on another old node
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=workers -o name); do
  kubectl cordon "$node"
done

# Then drain the old nodes one at a time
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=workers -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
done

# Verify all pods are running on new nodes, then delete old node group
eksctl delete nodegroup --cluster prod-cluster --name workers --region us-east-1

Taints, Tolerations, and Node Selectors in Maintenance Workflows

During rolling upgrades, you often need to ensure workloads land only on upgraded nodes. Cordoning a node already causes Kubernetes to add the node.kubernetes.io/unschedulable:NoSchedule taint. Applying your own NoSchedule taint to old nodes is a belt-and-suspenders measure on top of that, and it also keeps out pods that tolerate the built-in taint. For specialized node pools (GPU, high-memory, spot instances), taints and tolerations are the permanent mechanism that prevents ordinary workloads from consuming expensive resources.
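Tainting a whole pool takes one kubectl taint call with a label selector. The sketch below wraps that in two helpers; the taint key node.company.io/upgrade-pending and the function names are hypothetical, chosen only for illustration.

```shell
# Hypothetical taint key; pick one that fits your own naming scheme.
UPGRADE_TAINT=${UPGRADE_TAINT:-node.company.io/upgrade-pending}

# taint_pool SELECTOR: add a NoSchedule taint to every matching node.
taint_pool() {
  kubectl taint nodes -l "$1" "$UPGRADE_TAINT=true:NoSchedule" --overwrite
}

# untaint_pool SELECTOR: remove that taint again (note the trailing "-").
untaint_pool() {
  kubectl taint nodes -l "$1" "$UPGRADE_TAINT:NoSchedule-"
}

# Usage:
# taint_pool node.company.io/generation=v1
# untaint_pool node.company.io/generation=v1
```

NoSchedule affects only future scheduling decisions. Pods already running on the tainted nodes are left alone, which is what you want while a drain works through them at its own pace.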

Label node pools consistently from day one. Use labels like node.company.io/pool=workers, node.company.io/lifecycle=spot, and node.company.io/generation=v2 on every node group. These labels make it trivial to target drain operations, write nodeSelector and nodeAffinity rules, and query nodes during incidents, without having to remember instance IDs or IP addresses.

Upgrade Checklist for Production

Detect deprecated and removed APIs with kubent before upgrading, and migrate affected manifests with kubectl convert. Objects that still use an API version removed in the target release will fail to apply after the upgrade.
Back up etcd with etcdctl snapshot save or a cloud-provider snapshot.
Validate PDBs: confirm ALLOWED DISRUPTIONS is greater than 0 for every critical workload.
Upgrade control plane; wait for all components to report Ready.
Upgrade nodes in batches of 25% or use blue/green pool replacement.
Monitor error rates and latency on your observability stack throughout.
Upgrade add-ons last (CoreDNS, CNI, metrics-server) to their versions compatible with the new minor release.
Run conformance smoke tests or your integration test suite to confirm cluster health.
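For the batched node upgrade in the checklist, the batch boundaries can be computed up front so that each batch is drained, upgraded, and verified before the next begins. This is a minimal sketch; batch_size and print_batches are hypothetical helpers, and rounding up is an assumption that guarantees at least one node per batch.

```shell
# batch_size TOTAL PERCENT: nodes per batch, rounded up, never less than 1.
batch_size() (
  n=$(( ($1 * $2 + 99) / 100 ))
  [ "$n" -lt 1 ] && n=1
  echo "$n"
)

# print_batches PERCENT: read node names on stdin, one per line,
# and print "batch N: NAME" for each.
print_batches() (
  nodes=$(cat)
  total=$(printf '%s\n' "$nodes" | grep -c .)
  [ "$total" -gt 0 ] || exit 0
  size=$(batch_size "$total" "$1")
  i=0
  printf '%s\n' "$nodes" | while read -r node; do
    [ -n "$node" ] || continue
    echo "batch $(( i / size + 1 )): $node"
    i=$((i + 1))
  done
)

# Usage:
# kubectl get nodes -l node.company.io/pool=workers -o name | print_batches 25
```

With five nodes and 25%, the batch size rounds up to two, giving batches of two, two, and one.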