Advanced Kubernetes Operations

Troubleshooting Clusters

18 min Lesson 8 of 30

Kubernetes abstracts away a great deal of infrastructure complexity, but when something breaks the abstraction itself becomes an obstacle. A pod that never schedules, a node that flips NotReady, a service that drops every third request — each failure sits somewhere inside a layered system of controllers, kubelets, network plugins, and kernel primitives. Senior SREs at big-tech companies do not guess; they follow a disciplined triage tree that narrows the blast radius systematically, minimising mean time to resolution (MTTR) under the pressure of an active incident.

This lesson walks through that tree: node problems, scheduling failures, and networking breaks — the three categories that account for the vast majority of production Kubernetes incidents.

The Triage Mindset: Scope, Layer, Signal

Before running a single command, orient yourself on three axes:

Scope: Is the impact cluster-wide, zone-specific, or isolated to one namespace/workload? Cluster-wide points to the control plane or a shared network component. Namespace-isolated points to RBAC, quotas, or a bad admission webhook.
Layer: Kubernetes sits on top of the OS, the container runtime, the CNI plugin, and the cloud provider's infrastructure. Failures propagate upward, so always consider the layer below the symptom first.
Signal: Kubernetes emits rich signal — Events, component logs, kubectl describe, metrics, and audit logs. The fastest path to root cause is usually the most recent Warning event in the relevant object's event stream.

Kubernetes triage tree: scope the impact first, then dive into node, scheduler, or network signals.

Node Issues

A NotReady node receives no newly scheduled pods, and its existing pods are evicted once their node.kubernetes.io/not-ready toleration expires (300 seconds by default). Time matters, so start here.

# 1. Spot NotReady nodes immediately
kubectl get nodes -o wide

# 2. Inspect the node's conditions and recent events
kubectl describe node <node-name>
# Look for: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable

# 3. SSH to the node and check kubelet health
systemctl status kubelet
journalctl -u kubelet --since "10 minutes ago" --no-pager | tail -80

# 4. Check container runtime (containerd is the default)
systemctl status containerd
crictl info            # runtime health
crictl ps -a           # all containers (including crashed ones)

# 5. Resource exhaustion checks
df -h                  # disk capacity
df -i                  # inode usage; inode exhaustion is a silent killer
free -m                # memory
cat /proc/sys/kernel/pid_max   # kernel-wide PID limit
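
On a large cluster, scanning the node list by eye is slow. The following is a minimal sketch, assuming the default column layout of kubectl get nodes (NAME first, STATUS second); the function name not_ready_nodes is hypothetical.

```shell
# Print the names of nodes whose STATUS does not include "Ready".
# Reads `kubectl get nodes` output (with its header line) on stdin.
not_ready_nodes() {
  awk 'NR > 1 {
    # STATUS can be a comma-separated list, e.g. "Ready,SchedulingDisabled".
    n = split($2, parts, ",")
    ready = 0
    for (i = 1; i <= n; i++) if (parts[i] == "Ready") ready = 1
    if (!ready) print $1
  }'
}
```

Typical use would be kubectl get nodes | not_ready_nodes, then kubectl describe node on each name it prints.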

Inode exhaustion: A node can run out of inodes long before its disk capacity is full. When this happens, the kubelet cannot create emptyDir or log files for pods, and the node reports the DiskPressure condition. df -i shows inode usage; this check is almost always missing from junior runbooks.
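
The inode check can be turned into a quick filter. This is a sketch only, assuming GNU-style df -iP output where the fifth column is IUse% and the sixth is the mount point; the function name inode_pressure and the default threshold of 90 are arbitrary choices, not kubelet settings.

```shell
# Print mount points whose inode usage is at or above a threshold (percent).
# Reads `df -iP` output (with its header line) on stdin.
inode_pressure() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)    # "97%" becomes "97"; "-" (no inode data) stays non-numeric
    if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6 " " pct "%"
  }'
}
```

On the node, df -iP | inode_pressure 85 would list only the filesystems worth investigating.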

Common root causes for NotReady: kubelet certificate rotation failure (check /var/lib/kubelet/pki/), containerd socket permissions, an OOM-killed kubelet process, or clock drift on the node (for example, failed NTP sync) causing certificate and token validation failures.

Scheduling Failures

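A pod stuck in Pending has been accepted by the API server, but either the scheduler cannot place it or, once placed, its containers have not started yet (for example, while images are still pulling). The scheduler's decision is recorded in the pod's Event stream, so that is always the first place to look.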

# Most useful single command for a pending pod
kubectl describe pod <pod-name> -n <ns>
# Read the Events section at the bottom carefully

# Cluster-wide recent events (sorts by timestamp)
kubectl get events --sort-by='.lastTimestamp' -A | grep -i warning

# Inspect scheduler decisions for a specific pod
kubectl get events -n <ns> --field-selector involvedObject.name=<pod-name>

# Check resource quotas and current usage in a namespace
kubectl describe resourcequota -n <ns>
kubectl describe limitrange -n <ns>

# List each node's allocatable resources to see which can satisfy the pod's requests
# (no kubectl-neat here: it strips the status section that holds allocatable)
kubectl get nodes -o json | \
  jq '.items[] | {name:.metadata.name, allocatable:.status.allocatable}'

# Taint inspection — why is a pod not tolerated?
kubectl describe node <node> | grep -A5 Taints

The most frequent scheduling failure messages and their causes:

0/5 nodes are available: insufficient memory — requests exceed allocatable capacity. Either right-size requests or add nodes.
0/5 nodes are available: node(s) had untolerated taint — the pod is missing a tolerations entry for a taint such as dedicated=gpu:NoSchedule.
0/5 nodes are available: node(s) didn't match Pod's node affinity/selector — the nodeSelector or requiredDuringSchedulingIgnoredDuringExecution rules don't match any node's labels.
0/5 nodes are available: pod has unbound immediate PersistentVolumeClaims — the StorageClass or zone topology constraints prevent PVC binding.
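
The mapping from message to next check can be captured in a small helper for a runbook. This is an illustrative sketch: the function name scheduling_hint and the hint wording are hypothetical, and the patterns cover only the four messages listed above.

```shell
# Map a FailedScheduling message to the next thing to check.
scheduling_hint() {
  case "$1" in
    *[Ii]nsufficient*)
      echo "resources: compare pod requests with node allocatable" ;;
    *"untolerated taint"*)
      echo "taints: add a matching toleration or target untainted nodes" ;;
    *"node affinity/selector"*)
      echo "affinity: compare nodeSelector and affinity rules with node labels" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "storage: check the PVC, StorageClass, and zone topology" ;;
    *)
      echo "unknown: read the full event message" ;;
  esac
}
```

For example, scheduling_hint "0/5 nodes are available: node(s) had untolerated taint" prints the taints hint.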

Use kubectl describe rather than kubectl get -o yaml when you need events. Events are separate API objects, so the pod's YAML does not contain them at all. describe fetches the related events and shows them alongside the pod's conditions in a single readable output, which is far faster to scan during an incident.

Networking Breaks

Network failures in Kubernetes are the most complex category because they span multiple layers: the CNI plugin (Calico, Cilium, AWS VPC CNI), kube-proxy (or eBPF-based replacements), CoreDNS, and cloud provider routing tables. A pod that runs but cannot be reached — or cannot reach a dependency — is usually a networking issue.

# Step 1: verify DNS resolution from inside a pod
kubectl run netdebug --image=nicolaka/netshoot --rm -it --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local

# Step 2: test direct ClusterIP connectivity
kubectl run netdebug --image=nicolaka/netshoot --rm -it --restart=Never -- \
  curl -sv http://<ClusterIP>:<port>/healthz

# Step 3: verify kube-proxy iptables rules exist for the Service
# (run on the node itself, e.g. over SSH; applies only when kube-proxy runs in iptables mode)
iptables-save | grep <service-name>
# Should show DNAT rules for each Endpoint

# Step 4: check Endpoints backing the Service
kubectl get endpoints <service-name> -n <ns>
# Empty Endpoints = selector matches no pods, or no matching pod is Ready

# Step 5: inspect CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Step 6: CNI plugin health (example: Calico)
kubectl get pods -n calico-system
calicoctl node status   # BGP peer states on each node

Empty Endpoints is the most common "service not reachable" root cause. It means the Service's selector does not match the labels on any Ready pod: either the labels differ (typos, or a rollout that used a different label schema), the pods live in another namespace, or the matching pods are failing their readiness probes. Always run kubectl get endpoints <svc> before diving into iptables or CNI diagnostics.
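
A selector mismatch is easy to miss by eye when label sets are long. The sketch below assumes equality-based selectors written as comma-separated key=value pairs, which is how kubectl get svc -o wide (SELECTOR column) and kubectl get pods --show-labels (LABELS column) print them; the function name selector_matches is hypothetical.

```shell
# Return 0 if every key=value pair in the selector appears in the pod's labels.
# $1: Service selector, $2: pod labels, both as comma-separated key=value lists.
selector_matches() {
  local selector="$1" labels=",$2,"
  local pair
  local IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;      # pair present, keep checking
      *) return 1 ;;       # pair missing: the Service would not select this pod
    esac
  done
  return 0
}
```

For example, selector_matches "app=web,tier=frontend" "app=web,tier=frontend,pod-template-hash=abc12" succeeds, while a pod labelled app=webb fails the check.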

For intermittent failures, where some requests succeed and some fail, the problem is usually one unhealthy replica behind a Service with multiple replicas, often a pod that still passes its readiness probe (or has none) while failing real requests. Check each pod's readiness and recent restarts:

# Show restart counts and readiness across a deployment
kubectl get pods -n <ns> -l app=<label> -o wide

# Tail logs from all replicas simultaneously
kubectl logs -n <ns> -l app=<label> --prefix --tail=100

# Describe a crashing pod (OOMKilled, CrashLoopBackOff)
kubectl describe pod <pod> -n <ns>
# Previous container logs (after a crash)
kubectl logs <pod> -n <ns> --previous
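
To narrow a long replica list down to the likely culprits, the pod listing can be filtered. This is a rough sketch, assuming the default single-namespace column order of kubectl get pods (NAME, READY, STATUS, RESTARTS); the function name suspect_pods is hypothetical, and completed Job pods will also be flagged because they report 0/1 ready.

```shell
# Print pods that are not fully ready or have restarted at least once.
# Reads `kubectl get pods` output (with its header line) on stdin.
suspect_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")    # READY column, e.g. "1/2"
    restarts = $4 + 0        # RESTARTS column; "3 (5m ago)" still starts with the count
    if (ready[1] != ready[2] || restarts > 0) print $1 " ready=" $2 " restarts=" restarts " status=" $3
  }'
}
```

Typical use would be kubectl get pods -n <ns> -l app=<label> | suspect_pods, followed by kubectl describe pod on each result.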

Control Plane and etcd Signals

If API server latency is high, resources take tens of seconds to appear after kubectl apply, or controllers are falling behind, the problem is above the node layer. Check the control plane components directly:

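kubectl get --raw /livez: API server liveness (/healthz still responds but has been deprecated since Kubernetes v1.16).
kubectl get --raw '/readyz?verbose': API server readiness, listing each individual check (including etcd).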
On managed clusters (EKS, GKE, AKS), control plane metrics are surfaced in the cloud console; on self-managed clusters, use etcdctl endpoint health --cluster to check quorum.
API server audit logs capture every mutation — invaluable for "who deleted that resource" post-mortems.
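
The verbose health output marks passing checks with [+] and failing ones with [-]. The sketch below pulls out just the failing check names; it assumes the verbose body has already been captured as plain text (kubectl may wrap the body in an error message when the endpoint returns a non-200 status), and the function name failed_checks is hypothetical.

```shell
# Print the names of failing checks from verbose /readyz or /livez output.
# Reads the captured response body on stdin.
failed_checks() {
  sed -n 's/^\[-\]\([^ ]*\).*/\1/p'
}
```

An empty result means every listed check passed; any name it prints is the component to investigate first.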

Production habit: keep a netshoot or busybox debug pod template in your runbook. Ephemeral debug containers (kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>) let you attach a full network toolset to a running pod without modifying the deployment — essential when you need to debug inside the same network namespace as the broken container.

Systematic Resolution Checklist

Determine scope: one pod, one node, one namespace, or cluster-wide.
Read the most recent Warning events (kubectl get events -A --sort-by='.lastTimestamp').
For node problems: kubelet logs → runtime health → disk/memory/inode pressure → network routes.
For scheduling problems: describe pod Events → resource quotas → taints/tolerations → affinity rules → PVC topology.
For networking problems: Endpoints selector match → CoreDNS → kube-proxy rules → CNI plugin → cloud routing.
Mitigate first (reschedule, add capacity, restart component) to restore service, then root-cause.
Write an incident document: timeline, hypotheses tested, fix applied, follow-up action items. This is non-negotiable at big-tech scale because the same class of failure will recur.