Troubleshooting Clusters
Kubernetes abstracts away a great deal of infrastructure complexity, but when something breaks, the abstraction itself becomes an obstacle. A pod that never schedules, a node that flips to NotReady, a service that drops every third request: each failure sits somewhere inside a layered system of controllers, kubelets, network plugins, and kernel primitives. Senior SREs at big-tech companies do not guess; they follow a disciplined triage tree that systematically narrows the search space, minimising mean time to resolution (MTTR) under the pressure of an active incident.
This lesson walks through that tree: node problems, scheduling failures, and networking breaks — the three categories that account for the vast majority of production Kubernetes incidents.
The Triage Mindset: Scope, Layer, Signal
Before running a single command, orient yourself on three axes:
- Scope: Is the impact cluster-wide, zone-specific, or isolated to one namespace/workload? Cluster-wide points to the control plane or a shared network component. Namespace-isolated points to RBAC, quotas, or a bad admission webhook.
- Layer: Kubernetes sits on top of the OS, the container runtime, the CNI plugin, and the cloud provider's infrastructure. Failures propagate upward, so always consider the layer below the symptom first.
- Signal: Kubernetes emits rich signal: Events, component logs, kubectl describe, metrics, and audit logs. The fastest path to root cause is usually the most recent Warning event in the relevant object's event stream.
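As a minimal sketch of the Signal step, the helper below wraps kubectl get events so that only Warning events are shown, sorted by last occurrence. The function name recent_warnings is hypothetical, and it assumes kubectl is already configured for the affected cluster.

```shell
# recent_warnings [NAMESPACE]
# Lists Warning events oldest-first, so the most recent one is the last line.
# With no argument it searches every namespace.
recent_warnings() {
  if [ -n "${1:-}" ]; then
    kubectl get events -n "$1" --field-selector type=Warning --sort-by=.lastTimestamp
  else
    kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp
  fi
}

# Example (hypothetical namespace name):
#   recent_warnings payments | tail -n 5
```

Piping the result through tail keeps only the newest warnings, which is usually where the triage should start.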
Node Issues
A node in the NotReady state stops receiving newly scheduled pods, and its existing pods are evicted once their node.kubernetes.io/not-ready toleration expires (default 5 minutes). Time matters, so start here.
DiskPressure can be caused by inode exhaustion as well as by a full disk. df -i shows inode usage; this check is almost always missing from junior runbooks.
Common root causes for NotReady: kubelet certificate rotation failure (check /var/lib/kubelet/pki/), containerd socket permissions, an OOM-killed kubelet process, or clock drift from broken NTP sync on a cloud instance causing token validation failures.
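The first pass over node health can be sketched as two small helpers: one that lists nodes that are not Ready, and one that prints a node's conditions. Both function names are hypothetical, and the sketch assumes the default column layout of kubectl get nodes (NAME, then STATUS).

```shell
# not_ready_nodes
# Prints the name of every node whose STATUS column does not start with "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are treated as Ready here.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}

# node_conditions NODE
# Prints each condition as TYPE=STATUS (Ready, MemoryPressure, DiskPressure, ...).
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Example (hypothetical node name):
#   not_ready_nodes
#   node_conditions worker-3
```

Once a node is identified, the remaining checks run on the node itself: the kubelet logs (journalctl -u kubelet, assuming a systemd host), then df -h and df -i for disk and inode usage.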
Scheduling Failures
A pod stuck in Pending has been accepted by the API server but the scheduler cannot place it. The scheduler's decision is recorded in the pod's Event stream — this is always the first place to look.
The most frequent scheduling failure messages and their causes:
- "0/5 nodes are available: insufficient memory": requests exceed allocatable capacity. Either right-size requests or add nodes.
- "0/5 nodes are available: node(s) had untolerated taint": the pod is missing a tolerations entry for a taint such as dedicated=gpu:NoSchedule.
- "0/5 nodes are available: node(s) didn't match Pod's node affinity/selector": the nodeSelector or requiredDuringSchedulingIgnoredDuringExecution rules don't match any node's labels.
- "0/5 nodes are available: pod has unbound immediate PersistentVolumeClaims": the StorageClass or zone topology constraints prevent PVC binding.
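The mapping from message to cause in the list above can be captured in a small lookup. This is an illustrative sketch only: classify_sched_msg is a hypothetical helper, and the patterns match just the four message fragments quoted above, not every message the scheduler can emit.

```shell
# classify_sched_msg MESSAGE
# Maps a FailedScheduling message to the next thing to inspect.
classify_sched_msg() {
  case "$1" in
    *[Ii]nsufficient*)
      echo "resources: compare requests with node allocatable" ;;
    *"untolerated taint"*)
      echo "taints: add a matching toleration" ;;
    *"node affinity/selector"*)
      echo "affinity: compare nodeSelector/affinity with node labels" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "storage: check StorageClass and zone topology" ;;
    *)
      echo "unknown: read the full event text" ;;
  esac
}

# Example: paste the message from the Events section of `kubectl describe pod`:
#   classify_sched_msg "0/5 nodes are available: node(s) had untolerated taint"
```

A real message often lists several reasons at once; this sketch reports only the first pattern that matches, so the full event text is still worth reading.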
Prefer kubectl describe over kubectl get -o yaml for events. The YAML of a pod does not contain its events at all, because Events are separate API objects, so the scheduler's human-readable reason strings never appear there. describe gathers the related events and the pod's conditions into a single readable output, which is far faster to scan during an incident.
Networking Breaks
Network failures in Kubernetes are the most complex category because they span multiple layers: the CNI plugin (Calico, Cilium, AWS VPC CNI), kube-proxy (or eBPF-based replacements), CoreDNS, and cloud provider routing tables. A pod that runs but cannot be reached — or cannot reach a dependency — is usually a networking issue.
A common first cause to rule out is a Service whose selector does not match the labels on any running pod. This happens after label typos, namespace mismatches, or a rollout that used a different label schema. Always verify with kubectl get endpoints <svc> before diving into iptables or CNI diagnostics.
For intermittent failures, where some requests succeed and some fail, the problem is often one unhealthy pod behind a Service that has multiple replicas. Check each pod's readiness and recent restarts with kubectl get pods -l <selector> -o wide.
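A minimal sketch of the endpoints check: the helper below reads the ready addresses from the Endpoints object that shares the Service's name and reports whether any exist. The name service_has_endpoints is hypothetical, and the namespace defaults to "default" as an assumption of this sketch.

```shell
# service_has_endpoints SERVICE [NAMESPACE]
# Succeeds and prints the ready pod IPs if the Service has any;
# otherwise prints a hint and returns 1.
service_has_endpoints() {
  svc_ips=$(kubectl get endpoints "$1" -n "${2:-default}" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -n "$svc_ips" ]; then
    echo "ready endpoints: $svc_ips"
  else
    echo "no ready endpoints: compare the Service selector with pod labels"
    return 1
  fi
}

# Example (hypothetical Service and namespace):
#   service_has_endpoints checkout payments
```

If the list is empty, comparing kubectl get svc <svc> -o jsonpath='{.spec.selector}' with kubectl get pods --show-labels usually shows the mismatch.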
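One way to sketch that check is to filter the pod list down to replicas that are not fully ready or have restarted. The helper name suspect_pods is hypothetical, and it assumes the default kubectl get pods column order (NAME, READY, STATUS, RESTARTS).

```shell
# suspect_pods LABEL_SELECTOR [NAMESPACE]
# Prints pods whose READY count is below the container total
# or whose restart count is above zero.
suspect_pods() {
  kubectl get pods -n "${2:-default}" -l "$1" --no-headers |
    awk '{
      split($2, r, "/")                  # READY column looks like "1/2"
      if (r[1] != r[2] || $4 > 0)
        print $1, "ready=" $2, "restarts=" $4
    }'
}

# Example (hypothetical label and namespace):
#   suspect_pods app=checkout payments
```

Any pod this prints is a candidate for the failing share of requests; kubectl describe pod on it then shows the failing probe or the last termination reason.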
Control Plane and etcd Signals
If API server latency is high, resources take tens of seconds to appear after kubectl apply, or controllers are falling behind, the problem is above the node layer. Check the control plane components directly:
- kubectl get --raw /healthz: API server liveness.
- kubectl get --raw /readyz: API server readiness (includes etcd check).
- On managed clusters (EKS, GKE, AKS), control plane metrics are surfaced in the cloud console; on self-managed clusters, use etcdctl endpoint health --cluster to check quorum.
- API server audit logs capture every mutation, which is invaluable for "who deleted that resource" post-mortems.
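The two API server probes above can be combined into one quick check. This is a sketch under the assumption that kubectl can still reach the API server at all; api_probe and control_plane_check are hypothetical names.

```shell
# api_probe PATH
# Prints "PATH ok" if the endpoint answers successfully,
# otherwise prints "PATH FAILED" and returns 1.
api_probe() {
  if kubectl get --raw "$1" >/dev/null 2>&1; then
    echo "$1 ok"
  else
    echo "$1 FAILED"
    return 1
  fi
}

# control_plane_check
# Probes liveness first, then readiness; returns non-zero if either failed.
control_plane_check() {
  rc=0
  for endpoint in /healthz /readyz; do
    api_probe "$endpoint" || rc=1
  done
  return "$rc"
}
```

A passing /healthz with a failing /readyz suggests the API server process is up but one of its dependencies, such as etcd, is not.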
Keep a netshoot or busybox debug pod template in your runbook. Ephemeral debug containers (kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>) let you attach a full network toolset to a running pod without modifying the deployment, which is essential when you need to debug inside the same network namespace as the broken container.
Systematic Resolution Checklist
- Determine scope: one pod, one node, one namespace, or cluster-wide.
- Read the most recent Warning events (kubectl get events -A --sort-by='.lastTimestamp').
- For node problems: kubelet logs → runtime health → disk/memory/inode pressure → network routes.
- For scheduling problems: describe pod Events → resource quotas → taints/tolerations → affinity rules → PVC topology.
- For networking problems: Endpoints selector match → CoreDNS → kube-proxy rules → CNI plugin → cloud routing.
- Mitigate first (reschedule, add capacity, restart component) to restore service, then root-cause.
- Write an incident document: timeline, hypotheses tested, fix applied, follow-up action items. This is non-negotiable at big-tech scale because the same class of failure will recur.