Cluster Upgrades & Node Management
Keeping a Kubernetes cluster on a supported, patched version is not optional — it is a fundamental reliability and security obligation. Yet a botched upgrade is one of the fastest ways to cause a widespread outage. This lesson covers how senior SREs execute zero-downtime upgrades in production: the control-plane-first strategy, PodDisruptionBudgets, safe node draining, and the modern node-pool patterns used at big-tech scale.
The Kubernetes Release Cycle and Skew Policy
Kubernetes releases three minor versions per year (roughly every four months). Each minor version receives patch updates and critical CVE backports for about 14 months after its release date. The upstream skew policy is strict:
- kubelet must never be newer than kube-apiserver, and may be up to three minor versions older (two for kubelets older than v1.25). A v1.28 apiserver with a v1.25 kubelet is supported; a v1.27 apiserver with a v1.28 kubelet is not.
- Control-plane components (kube-scheduler, kube-controller-manager) must not be newer than the apiserver and can be one minor version behind it. This is the window you exploit to roll them safely.
- kubectl can be ±1 minor version of the apiserver.
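The kubelet rule can be checked mechanically before an upgrade. The following is a minimal shell sketch, assuming versions in the usual vMAJOR.MINOR.PATCH form and a default window of three minor versions (the upstream limit for current releases). The function names minor_of and kubelet_skew_ok and the MAX_SKEW variable are hypothetical, not part of any Kubernetes tool.

```shell
#!/usr/bin/env bash
# Extract the minor number from a version string such as v1.28.3.
minor_of() {
  v="${1#v}"        # strip the leading "v"
  v="${v#*.}"       # drop the major number and the first dot
  echo "${v%%.*}"   # keep everything up to the next dot
}

# Succeeds when the kubelet is not newer than the apiserver and is at most
# MAX_SKEW minor versions older (assumed default: 3).
kubelet_skew_ok() {
  api="$(minor_of "$1")"
  kubelet="$(minor_of "$2")"
  max_skew="${MAX_SKEW:-3}"
  [ "$kubelet" -le "$api" ] && [ $((api - kubelet)) -le "$max_skew" ]
}
```

A call such as kubelet_skew_ok v1.28.3 v1.26.9 returns success, while passing a kubelet newer than the apiserver returns failure. The kubelet version of each node is reported in the node's .status.nodeInfo.kubeletVersion field.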
The Upgrade Sequence: Control Plane First
The golden rule: always upgrade the control plane before the data plane. The apiserver must always be the most recent component in the cluster. The recommended order is:
- Back up etcd.
- Upgrade kube-apiserver.
- Upgrade kube-controller-manager and kube-scheduler.
- Upgrade each node pool, one batch at a time.
- Upgrade add-ons (CoreDNS, kube-proxy, CNI plugin, metrics-server).
On managed clusters (EKS, GKE, AKS) steps 1–3 are handled by the cloud provider. You invoke them via console or CLI, wait for the control plane to stabilize, then drain and replace node groups yourself — or use managed node groups that automate it. On self-managed clusters you run kubeadm upgrade.
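For a self-managed cluster, the first steps of the sequence might look like the sketch below. It assumes a kubeadm-built control plane with stacked etcd and the default kubeadm certificate paths, and that the kubeadm binary has already been upgraded to the target version. TARGET, SNAPSHOT, and the function names are placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# Placeholders: adjust the target version and snapshot path for your cluster.
TARGET="${TARGET:-v1.29.4}"
SNAPSHOT="${SNAPSHOT:-/var/backups/etcd-pre-upgrade.db}"

# Take an etcd snapshot before touching the control plane.
# Certificate paths are the kubeadm defaults (an assumption about your setup).
backup_etcd() {
  ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}

# Run on the first control-plane node only.
upgrade_first_control_plane_node() {
  kubeadm upgrade plan              # lists available versions and runs pre-flight checks
  kubeadm upgrade apply "$TARGET"   # upgrades the static-pod control-plane components
}

# Run on each remaining control-plane node.
upgrade_other_control_plane_node() {
  kubeadm upgrade node
}
```

After each control-plane node is upgraded, its kubelet package still has to be upgraded and restarted separately; that step depends on the host's package manager and is left out here.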
PodDisruptionBudgets: Guaranteeing Availability During Maintenance
A PodDisruptionBudget (PDB) is the contract between operations and workloads. It tells the Kubernetes eviction API the minimum number (or percentage) of pods that must remain available during voluntary disruptions — node drains, cluster upgrades, or manual evictions. Without PDBs, a drain can terminate all replicas of a critical service simultaneously.
PDBs apply to voluntary disruptions only. Hardware failure is an involuntary disruption and PDBs cannot prevent it — that is what replicas and topology spread constraints handle.
If minAvailable equals the total replica count, or maxUnavailable is 0, a drain will block indefinitely. The eviction API will refuse to evict any pod, and kubectl drain will keep retrying until its timeout expires (by default, never). Always set PDBs so that at least one disruption is allowed. A common safe default is minAvailable: 1 for small deployments that run at least two replicas, and maxUnavailable: "25%" for large ones.
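As an illustration, a PDB that always leaves room for one eviction could be applied as in the sketch below. The workload (a Deployment labelled app=checkout in a namespace called shop) and the function names are hypothetical; only the policy/v1 PodDisruptionBudget fields are standard.

```shell
#!/usr/bin/env bash
# Apply a PDB for a hypothetical Deployment labelled app=checkout in namespace shop.
apply_checkout_pdb() {
  kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
  namespace: shop
spec:
  maxUnavailable: 1        # at most one pod may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app: checkout
EOF
}

# List every PDB so the allowed-disruptions column can be checked before a drain.
show_pdbs() {
  kubectl get pdb --all-namespaces
}
```

Using maxUnavailable rather than minAvailable keeps the budget valid when the replica count changes, since it does not need to be adjusted as the Deployment scales.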
Draining Nodes Safely
kubectl drain does two things: it cordons the node (marks it SchedulingDisabled so no new pods land on it) and then evicts all evictable pods, respecting PDBs. Understanding each flag is essential:
- --ignore-daemonsets: DaemonSet pods cannot be evicted (the DaemonSet controller will re-create them); this flag skips them safely.
- --delete-emptydir-data: evicts pods with emptyDir volumes. Required on most real clusters. Data in emptyDir is lost on eviction, so make sure the workload is designed for it.
- --pod-selector: drain only pods matching a label, leaving others in place (useful for incremental maintenance).
- --timeout: how long to wait for pods to terminate. Default is infinite; set a realistic value (e.g., 300s) to avoid hangs on misbehaving pods.
- --force: evict pods not managed by a controller (bare pods). Use with caution; those pods are gone permanently.
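Put together, a conservative drain of a single node might look like this sketch. The function names are hypothetical and the 300s timeout is only an example value; --force is deliberately left out so that bare pods make the drain stop instead of being deleted.

```shell
#!/usr/bin/env bash
# Cordon and drain one node, respecting PDBs.
# The timeout is an example; size it to your slowest graceful shutdown.
drain_node() {
  kubectl drain "$1" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --timeout=300s
}

# Make the node schedulable again once maintenance is finished.
return_node() {
  kubectl uncordon "$1"
}
```

A typical use would be drain_node worker-1, then the maintenance itself, then return_node worker-1. If the drain times out, the node stays cordoned, which is usually the state you want while investigating.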
Node Pools and Blue/Green Node Upgrades
At scale, draining individual nodes is too slow and error-prone. The industry-standard approach is blue/green node pools: provision a new node group on the target Kubernetes version, migrate workloads to it, then terminate the old group. This gives you an instant rollback path — if the new nodes behave badly, just cordon the new group and uncordon the old one.
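On EKS the pattern uses managed node groups or self-managed Auto Scaling Groups. With eksctl or Terraform the workflow is: create the new node group on the target version, wait for its nodes to become Ready, cordon and drain the old group, and delete the old group only once the workloads are healthy on the new one.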
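A blue/green swap on EKS could be scripted roughly as follows. This is a sketch under stated assumptions: the cluster and node group names are placeholders, the node groups are managed (so nodes carry the eks.amazonaws.com/nodegroup label), and the eksctl invocations use only the basic --cluster, --name, and --managed flags. Check them against your eksctl version before relying on them.

```shell
#!/usr/bin/env bash
# Placeholders: cluster and node group names are illustrative.
CLUSTER="${CLUSTER:-prod}"
OLD_GROUP="${OLD_GROUP:-workers-blue}"
NEW_GROUP="${NEW_GROUP:-workers-green}"
LABEL="eks.amazonaws.com/nodegroup"   # label EKS sets on managed node group nodes

# Print the names of all nodes in one node group, space separated.
nodes_in_group() {
  kubectl get nodes -l "${LABEL}=$1" -o jsonpath='{.items[*].metadata.name}'
}

swap_node_groups() {
  # Run this after the control plane has been upgraded.
  eksctl create nodegroup --cluster "$CLUSTER" --name "$NEW_GROUP" --managed
  kubectl wait --for=condition=Ready nodes -l "${LABEL}=${NEW_GROUP}" --timeout=10m

  # Cordon every old node first so evicted pods cannot land on another old node.
  for node in $(nodes_in_group "$OLD_GROUP"); do
    kubectl cordon "$node"
  done
  for node in $(nodes_in_group "$OLD_GROUP"); do
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  done
}

# Only call this once the workloads are verified healthy on the new group;
# until then the old group is the rollback path.
retire_old_group() {
  eksctl delete nodegroup --cluster "$CLUSTER" --name "$OLD_GROUP"
}
```

Keeping the deletion in a separate function is a deliberate choice: the old group stays available for rollback until someone explicitly retires it.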
Taints, Tolerations, and Node Selectors in Maintenance Workflows
During rolling upgrades, you often need to ensure workloads land only on upgraded nodes. Apply a NoSchedule taint to old nodes as a belt-and-suspenders measure alongside cordoning. Keep in mind that cordoning itself works through the node.kubernetes.io/unschedulable:NoSchedule taint, which DaemonSet pods tolerate automatically. For specialized node pools (GPU, high-memory, spot instances), taints and tolerations are the permanent mechanism that prevents ordinary workloads from consuming expensive resources.
Use consistent node labels such as node.company.io/pool=workers, node.company.io/lifecycle=spot, and node.company.io/generation=v2 on every node group. These labels make it trivial to target drain operations, write node affinity rules, and query nodes during incidents, without having to remember instance IDs or IP addresses.
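The labelling and tainting described above can be sketched with plain kubectl. The node.company.io/* keys are the illustrative labels from the text; the node.company.io/retiring and dedicated taint keys, and the function names, are hypothetical examples.

```shell
#!/usr/bin/env bash
# usage: label_node <node> <pool> <lifecycle> <generation>
label_node() {
  kubectl label node "$1" --overwrite \
    "node.company.io/pool=$2" \
    "node.company.io/lifecycle=$3" \
    "node.company.io/generation=$4"
}

# Keep new pods off an old node generation during an upgrade.
# The taint key is a hypothetical example.
taint_generation() {
  kubectl taint nodes -l "node.company.io/generation=$1" \
    "node.company.io/retiring=true:NoSchedule" --overwrite
}

# Permanent taint that reserves a specialized pool for workloads that tolerate it.
reserve_pool() {
  kubectl taint nodes -l "node.company.io/pool=$1" "dedicated=$1:NoSchedule" --overwrite
}
```

With labels in place, a query such as kubectl get nodes -l node.company.io/generation=v1 lists exactly the nodes that still need to be replaced.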
Upgrade Checklist for Production
- Check for deprecated APIs with kubectl convert or kubent before upgrading. Resources that still use API versions removed in the target release will break after the upgrade.
- Back up etcd with etcdctl snapshot save or a cloud-provider snapshot.
- Validate PDBs: confirm that ALLOWED DISRUPTIONS is greater than 0 for every critical workload.
- Upgrade the control plane; wait for all components to report Ready.
Ready. - Upgrade nodes in batches of 25% or use blue/green pool replacement.
- Monitor error rates and latency on your observability stack throughout.
- Upgrade add-ons last (CoreDNS, CNI, metrics-server) to their versions compatible with the new minor release.
- Run conformance smoke tests or your integration test suite to confirm cluster health.
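The PDB validation step lends itself to automation. The sketch below lists every PDB whose status currently allows zero disruptions, since those are the ones a drain would block on. The function names are hypothetical; the check relies on the standard .status.disruptionsAllowed field.

```shell
#!/usr/bin/env bash
# Print namespace/name for every PDB that currently allows zero disruptions.
blocked_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.disruptionsAllowed}{"\n"}{end}' \
    | awk '$2 == 0 { print $1 }'
}

# Fail the pre-upgrade check if any PDB would block a drain.
preflight() {
  blocked="$(blocked_pdbs)"
  if [ -n "$blocked" ]; then
    echo "PDBs with zero allowed disruptions:" >&2
    echo "$blocked" >&2
    return 1
  fi
  echo "all PDBs allow at least one disruption"
}
```

Running preflight in the upgrade pipeline before the first drain turns a silent hang into an explicit failure that names the offending budgets.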