A runbook is the single most important document an SRE team produces. It is the authoritative, executable guide that tells an on-call engineer exactly what to do at 3 AM when a production alert fires. Unlike architectural decision records or design documents, a runbook must be opinionated, step-by-step, and tested against reality. This lesson walks you through writing a production-grade operations runbook covering the three highest-impact operational categories for any Kubernetes cluster: cluster upgrades, scaling events, and incident response.
What Makes a Runbook Production-Grade
At top-tier companies (Google, Stripe, Cloudflare), runbooks are reviewed quarterly, linked directly from alert definitions, and tested in game days: planned drills where teams execute the runbook against a staging cluster mirroring production traffic. Every runbook section answers four questions: What is the trigger? What is the blast radius? What is the procedure? What is the rollback?
Store runbooks in Git alongside your cluster manifests and enforce review as part of any infrastructure pull request that changes operational procedures. Version them with the cluster configuration they describe.
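One way to enforce the four questions during review is a small lint step in CI. The sketch below is an assumption, not part of any standard tool: the function name `check_runbook_headers` is hypothetical, and it only looks for the `# Trigger:`, `# Blast radius:`, and `# Rollback:` comment headers used by the scripts in this lesson (the procedure is the script body itself).

```shell
# Hypothetical lint: report which header comments a runbook script is missing.
# Returns 0 when all three headers are present, 1 otherwise.
check_runbook_headers() {
  local file="$1" missing=0 header
  for header in "Trigger" "Blast radius" "Rollback"; do
    # Match a comment line such as "# Trigger: ..." (case-insensitive).
    if ! grep -qiE "^#[[:space:]]*${header}:" "$file"; then
      echo "MISSING ${header}: ${file}"
      missing=1
    fi
  done
  return "$missing"
}
```

A pipeline could call `check_runbook_headers` on each changed runbook file and fail the build on a non-zero exit.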
A runbook is not a tutorial. Omit background theory. Write only the exact, ordered commands an on-call engineer needs to execute under pressure with no time to think.
Section 1 — Cluster Upgrade Procedure
Kubernetes releases a minor version roughly every four months. Each minor version is supported for approximately fourteen months, so a cluster running three or more releases behind is out of support: no CVE patches, no backports. The upgrade procedure below targets a managed cluster (EKS) with managed node groups (the kind that eksctl upgrade nodegroup operates on), a common production topology.
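The "three or more releases behind" rule is simple arithmetic on the minor version. The helper names below (`minor_of`, `minor_skew`, `support_status`) are hypothetical, and the sketch assumes the major version is the same on both sides. It reflects the upstream support window only; a cloud provider may publish a different calendar.

```shell
# minor_of VERSION: extract the minor number, e.g. "1.27.3" -> "27".
minor_of() {
  local rest="${1#*.}"   # drop the major part: "1.27.3" -> "27.3"
  echo "${rest%%.*}"     # drop any patch part: "27.3"   -> "27"
}

# minor_skew CURRENT LATEST: how many minor releases CURRENT is behind LATEST.
minor_skew() {
  echo $(( $(minor_of "$2") - $(minor_of "$1") ))
}

# support_status CURRENT LATEST: out of support at three or more releases behind.
support_status() {
  if [ "$(minor_skew "$1" "$2")" -ge 3 ]; then
    echo "out-of-support"
  else
    echo "supported"
  fi
}
```

For example, `support_status 1.27 1.30` reports a cluster that is three releases behind.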
Pre-upgrade checklist: Confirm the target version is available from your cloud provider, review the Kubernetes changelog for API deprecations and scan your manifests for them with pluto detect-files -d ./k8s, verify all add-ons (CoreDNS, kube-proxy, CNI, metrics-server) have compatible versions, and ensure PodDisruptionBudgets are in place for all critical workloads before touching any nodes.
#!/usr/bin/env bash
# RUNBOOK: Cluster Upgrade (EKS)
# Trigger: Target version available; current version within 2 releases of end-of-support
# Blast radius: API server restarts briefly; node drain causes pod rescheduling
# Rollback: Control-plane rollback NOT supported on EKS; rollback = restore prior node-group launch template
set -euo pipefail  # stop at the first failed step instead of upgrading past an error
CLUSTER="prod-us-east-1"
REGION="us-east-1"
TARGET_VERSION="1.30"
# Step 1: Detect deprecated API usage BEFORE upgrading
# (pluto's documented form is a full version such as v1.30.0, hence the trailing .0)
pluto detect-files -d ./k8s --target-versions "k8s=v${TARGET_VERSION}.0"
# Step 2: Upgrade the control plane (10-20 min; API server rolls one replica at a time)
aws eks update-cluster-version \
  --name "${CLUSTER}" \
  --kubernetes-version "${TARGET_VERSION}" \
  --region "${REGION}"
aws eks wait cluster-active --name "${CLUSTER}" --region "${REGION}"
# Step 3: Update core add-ons BEFORE touching data-plane nodes
eksctl update addon --cluster "${CLUSTER}" --name vpc-cni --version latest --region "${REGION}"
eksctl update addon --cluster "${CLUSTER}" --name coredns --version latest --region "${REGION}"
eksctl update addon --cluster "${CLUSTER}" --name kube-proxy --version latest --region "${REGION}"
# Step 4: Upgrade node groups one at a time
# (managed node groups launch replacement nodes before draining the old ones)
for NG in $(eksctl get nodegroup --cluster "${CLUSTER}" --region "${REGION}" \
    -o json | jq -r '.[].Name'); do
  echo "=== Upgrading node group: ${NG} ==="
  eksctl upgrade nodegroup \
    --cluster "${CLUSTER}" \
    --name "${NG}" \
    --kubernetes-version "${TARGET_VERSION}" \
    --region "${REGION}"
done
# Step 5: Verify that all nodes are on the new version and no pods are stuck
kubectl get nodes -o wide
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
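Always upgrade one minor version at a time. Kubernetes does not support skipping minor versions when upgrading the control plane (e.g., 1.27 → 1.30 directly). Managed services such as EKS reject a request that skips a version, and on self-managed clusters a skipped upgrade is an untested path that can break the cluster.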
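When a cluster is several releases behind, the one-minor-at-a-time rule turns into a sequence of upgrades. The sketch below lists that sequence; `upgrade_path` is a hypothetical helper name, and it assumes both versions share the same major number.

```shell
# upgrade_path CURRENT TARGET: print each intermediate version, one per line,
# so every hop is exactly one minor release.
upgrade_path() {
  local major="${1%%.*}" from to
  from="${1#*.}"; from="${from%%.*}"   # minor of CURRENT
  to="${2#*.}";   to="${to%%.*}"       # minor of TARGET
  if [ "$to" -le "$from" ]; then
    echo "target $2 is not newer than $1" >&2
    return 1
  fi
  while [ "$from" -lt "$to" ]; do
    from=$(( from + 1 ))
    echo "${major}.${from}"
  done
}
```

Each printed version would become one full run of the upgrade procedure, with TARGET_VERSION set to that value.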
Section 2 — Scaling Event Procedure
Scaling in production is not just kubectl scale. It involves understanding whether the bottleneck is CPU, memory, I/O, or external-service latency; verifying that cluster autoscaler (or Karpenter) has room to provision new nodes; and ensuring the application handles mid-flight requests correctly during scale-down. The runbook below covers both reactive scaling (an alert fired, traffic is spiking now) and planned scaling (pre-scaling ahead of a known event such as a marketing campaign).
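When the metrics pipeline is down and a replica count has to be chosen by hand, the Horizontal Pod Autoscaler's documented rule (desired = ceil(currentReplicas × currentMetric / targetMetric)) gives a defensible starting number. The sketch below applies that rule with integer inputs such as CPU utilization percentages. The function name `desired_replicas` is hypothetical, and the sketch deliberately ignores the HPA's tolerance band and its min/max clamping.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Integer ceiling of CURRENT_REPLICAS * CURRENT_METRIC / TARGET_METRIC.
desired_replicas() {
  local replicas="$1" current="$2" target="$3"
  if [ "$target" -le 0 ]; then
    echo "target metric must be positive" >&2
    return 1
  fi
  # Adding (target - 1) before dividing rounds the quotient up.
  echo $(( (replicas * current + target - 1) / target ))
}
```

For instance, 5 replicas running at 70% CPU against a 60% target works out to 6 replicas.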
Figure: Decision flow for a reactive scaling event triggered by a high-CPU or high-latency alert.
# RUNBOOK: Reactive Scaling
# Trigger: P99 latency > SLO threshold OR HPA firing max-replicas alert
# Blast radius: Brief pod churn; nodes provisioning takes 60-120 s (Karpenter) or 3-5 min (CA)
# 1. Identify the bottlenecked deployment
kubectl top pods -n production --sort-by=cpu | head -20
kubectl get hpa -n production
# 2. Check if HPA is already at max
kubectl describe hpa <name> -n production
# Look for: "Warning FailedGetResourceMetric" or "unable to fetch metrics"
# 3. If metric pipeline is broken, manually scale as a stop-gap
kubectl scale deployment <name> --replicas=20 -n production
# 4. Check node headroom
kubectl describe nodes | grep -A5 "Allocated resources"
# Or with Karpenter:
kubectl get nodeclaims
# 5. For planned events — pre-scale the night before
kubectl patch hpa <name> -n production \
--type=merge -p '{"spec":{"minReplicas":10,"maxReplicas":50}}'
# 6. After event — restore original values
kubectl patch hpa <name> -n production \
--type=merge -p '{"spec":{"minReplicas":3,"maxReplicas":20}}'
# 7. Scale-down safety: ensure terminationGracePeriodSeconds > longest request
kubectl get deployment <name> -n production -o jsonpath=\
'{.spec.template.spec.terminationGracePeriodSeconds}'
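Hand-typed JSON patches are easy to get wrong under pressure, for example by swapping the minimum and maximum. A possible guard is to generate the patch body from validated numbers. The helper names `is_positive_int` and `hpa_patch` are hypothetical, and the sketch assumes minReplicas of at least 1.

```shell
# is_positive_int VALUE: succeed only for integers >= 1 without leading zeros.
is_positive_int() {
  case "$1" in
    ''|0*|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# hpa_patch MIN MAX: print a merge-patch body for an HPA, or fail on bad input.
hpa_patch() {
  local min="$1" max="$2"
  if ! is_positive_int "$min" || ! is_positive_int "$max"; then
    echo "minReplicas and maxReplicas must be positive integers" >&2
    return 1
  fi
  if [ "$min" -gt "$max" ]; then
    echo "minReplicas ($min) must not exceed maxReplicas ($max)" >&2
    return 1
  fi
  printf '{"spec":{"minReplicas":%d,"maxReplicas":%d}}\n' "$min" "$max"
}
```

The output can be passed straight to the patch command from the runbook: `kubectl patch hpa <name> -n production --type=merge -p "$(hpa_patch 10 50)"`.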
Section 3 — Incident Response Steps
Incident response in Kubernetes follows a universal pattern regardless of the failure mode: Detect → Contain → Diagnose → Remediate → Restore → Review. The runbook below covers three of the most common cluster-level incidents: CrashLoopBackOff storms, node NotReady events, and OOMKilled workloads. Persistent volume issues and network partition symptoms round out the five most common incident types and follow the same pattern.
# RUNBOOK: Incident Response — Kubernetes Production Cluster
# =============================================================
# ---- INCIDENT TYPE 1: CrashLoopBackOff Storm ----
# Trigger: Multiple pods in CrashLoopBackOff; alert: pod_restart_count > 10 in 5 min
# Blast radius: Service degraded / unavailable depending on replica count
# Identify affected pods (CrashLoopBackOff shows in the STATUS column; the pod phase is still Running)
kubectl get pods -A | grep CrashLoopBackOff
kubectl get events -n <ns> --sort-by=.lastTimestamp | tail -30
# Pull logs from the PREVIOUS container (the one that crashed)
kubectl logs <pod> -n <ns> --previous
# Common causes checklist:
# - Broken config (ConfigMap / Secret updated with bad value)
# - OOMKilled (check kubectl describe pod — look for "OOMKilled" reason)
# - Bad image (check image tag drift, registry credentials)
# - Liveness probe too aggressive (check thresholds)
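# Emergency stop: pause the rollout so a bad revision stops replacing healthy pods
# (pausing does not stop pods that are already crash-looping)
kubectl rollout pause deployment/<name> -n <ns>
# Rollback if a recent deploy caused it (a paused deployment must be resumed before undo)
kubectl rollout resume deployment/<name> -n <ns>
kubectl rollout undo deployment/<name> -n <ns>
kubectl rollout status deployment/<name> -n <ns>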
# ---- INCIDENT TYPE 2: Node NotReady ----
# Trigger: Node transitions to NotReady; alert: kube_node_status_condition
# Check node status and conditions
kubectl get nodes
kubectl describe node <node-name> # Look for: DiskPressure, MemoryPressure, PIDPressure
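# SSH to the node (if accessible) or start a debug pod on it
kubectl debug node/<node-name> -it --image=ubuntu
# On the node, check kubelet logs (inside a debug pod the host filesystem
# is mounted at /host, so run `chroot /host` first)
journalctl -u kubelet --since "10 minutes ago" | tail -50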
# Cordon immediately to stop new pods scheduling there
kubectl cordon <node-name>
# Drain if hardware issue confirmed (respects PDBs)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60
# For managed node groups — terminate and let ASG replace
aws ec2 terminate-instances --instance-ids <id> --region us-east-1
# ---- INCIDENT TYPE 3: OOMKilled Workloads ----
# Check which container was OOMKilled
kubectl describe pod <pod> -n <ns> | grep -A10 "Last State"
# Increase memory limit as immediate mitigation (then file ticket to right-size properly)
kubectl set resources deployment/<name> -n <ns> \
--limits=memory=1Gi --requests=memory=512Mi
# View VPA recommendation if VPA is installed
kubectl describe vpa <name> -n <ns>
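During a CrashLoopBackOff storm, the first question is usually how many pods are affected and in which namespaces. The sketch below filters the text output of `kubectl get pods -A --no-headers`, assuming its usual column order (NAMESPACE, NAME, READY, STATUS, ...). Both function names are hypothetical.

```shell
# crashloop_pods: read `kubectl get pods -A --no-headers` output on stdin and
# print namespace/name for every pod whose STATUS contains CrashLoopBackOff
# (this also catches init-container states such as Init:CrashLoopBackOff).
crashloop_pods() {
  awk '$4 ~ /CrashLoopBackOff/ { print $1 "/" $2 }'
}

# crashloop_count_by_namespace: same input; prints "<count> <namespace>",
# most affected namespace first.
crashloop_count_by_namespace() {
  awk '$4 ~ /CrashLoopBackOff/ { print $1 }' \
    | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

Typical use would be `kubectl get pods -A --no-headers | crashloop_count_by_namespace`, which shows at a glance whether the storm is confined to one namespace.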
Never drain a node under active load with --grace-period=0, and never terminate it without draining first. Skipping the grace period kills in-flight requests instantly, turning a node failure into a user-visible outage.
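One way to make the grace-period rule hard to violate at 3 AM is to build the drain command through a wrapper that refuses a zero or malformed grace period. The sketch below only prints the command so the operator can review it before running it. The function name `drain_cmd` and the 60-second default are assumptions taken from the drain line in the runbook above.

```shell
# drain_cmd NODE [GRACE_SECONDS]: print a kubectl drain command, refusing any
# grace period that is not a positive integer (so 0 is rejected).
drain_cmd() {
  local node="${1:-}" grace="${2:-60}"
  if [ -z "$node" ]; then
    echo "usage: drain_cmd NODE [GRACE_SECONDS]" >&2
    return 2
  fi
  case "$grace" in
    ''|0*|*[!0-9]*)
      echo "refusing grace period '$grace': must be a positive integer" >&2
      return 1
      ;;
  esac
  echo "kubectl drain ${node} --ignore-daemonsets --delete-emptydir-data --grace-period=${grace}"
}
```

Printing instead of executing keeps the destructive step explicit: the engineer reads the command, confirms the node name, and then runs it.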
Postmortem Template
Every P0 or P1 incident must produce a postmortem document within 48 hours. The blameless postmortem is the primary mechanism by which SRE organizations improve reliability. The document must include: incident timeline with UTC timestamps, contributing factors (technical AND process), impact quantification (error rate, users affected, SLO budget burned), root cause, and action items with owners and due dates. Action items must be tracked in the team issue tracker — a postmortem without tracked follow-through is just a report.
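The "SLO budget burned" figure in the impact section can be computed rather than estimated. The sketch below assumes a time-based availability SLO in which every second of the outage counts as fully unavailable. The function name `budget_burn_percent` is hypothetical, and the SLO is passed in basis points (99.9% is 9990) to stay within integer arithmetic. The result is rounded down.

```shell
# budget_burn_percent DOWNTIME_SECONDS SLO_BASIS_POINTS WINDOW_DAYS
# Prints the share of the error budget consumed, as a whole percent.
budget_burn_percent() {
  local downtime="$1" slo_bp="$2" window_days="$3"
  local window_seconds budget_seconds
  window_seconds=$(( window_days * 86400 ))
  # Error budget = window * (1 - SLO), e.g. 30 days at 99.9% -> 2592 seconds.
  budget_seconds=$(( window_seconds * (10000 - slo_bp) / 10000 ))
  if [ "$budget_seconds" -le 0 ]; then
    echo "this SLO leaves no error budget" >&2
    return 1
  fi
  echo $(( downtime * 100 / budget_seconds ))
}
```

A 20-minute outage (1200 seconds) against a 99.9% SLO over 30 days consumes a little under half of the 2592-second budget.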
Link every alert runbook directly to its postmortem log. When an alert fires again, the on-call engineer can immediately see all prior occurrences, what fixed them, and which action items are still open. PagerDuty, Opsgenie, and Grafana Incident all support runbook URL fields on alert policies.
Keeping the Runbook Alive
A runbook written once and never updated is worse than no runbook — it gives false confidence and leads engineers down wrong paths. Establish three practices: First, runbook reviews in your quarterly infrastructure review — assign ownership, verify commands still work against the current cluster version. Second, game days every six months — deliberately inject a node failure or simulate a traffic spike in a non-production environment and execute the runbook verbatim. Third, runbook updates as a merge-request requirement — any PR that changes cluster configuration, HPA settings, or add-on versions must include a corresponding runbook diff if the operational procedure changes.
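The third practice can be partly automated with a merge-request check. The sketch below is a blunt heuristic, not a substitute for reviewer judgment: it fails whenever files under `k8s/` change without any change under `runbooks/`. Both directory names and the function name `check_runbook_diff` are assumptions about the repository layout.

```shell
# check_runbook_diff: read changed file paths on stdin, one per line.
# Returns 1 when cluster configuration changed but no runbook did.
check_runbook_diff() {
  local config_changed=0 runbook_changed=0 path
  while IFS= read -r path; do
    case "$path" in
      k8s/*)      config_changed=1 ;;
      runbooks/*) runbook_changed=1 ;;
    esac
  done
  if [ "$config_changed" -eq 1 ] && [ "$runbook_changed" -eq 0 ]; then
    echo "cluster config changed but no file under runbooks/ was updated" >&2
    return 1
  fi
  return 0
}
```

In CI this could be fed from `git diff --name-only origin/main...HEAD | check_runbook_diff`, with a documented override for changes that genuinely do not affect any operational procedure.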
This project lesson ties together every concept from the tutorial: RBAC restricts who can execute the remediation commands; admission webhooks can require the PodDisruptionBudgets that make drains safe; custom operators automate the repetitive parts; autoscaling provides the elasticity; and etcd integrity is the foundation that makes any recovery possible. The runbook is the human-readable layer that orchestrates all of them.