Kubernetes Fundamentals

The Reconciliation Loop

Every Kubernetes controller — the Deployment controller, the ReplicaSet controller, the Node controller, the StatefulSet controller — is built on the same fundamental idea: a watch loop that continuously compares desired state with observed state and acts to eliminate the difference. This pattern is not incidental to Kubernetes; it is the entire architecture. Understanding it deeply changes how you reason about why the system behaves the way it does, why eventually-consistent thinking matters, and how to design workloads that co-operate with — rather than fight — the control plane.

Desired State vs. Observed State

When you run kubectl apply -f deployment.yaml, you are submitting a declaration to the API server, which validates the object and persists it in etcd, the cluster's distributed key-value store. You are not issuing a command to "go run six Pods right now." You are saying: "I want six Pods that look like this. Please make it so, and keep it that way until I say otherwise." From that moment on, the stored object is the only source of truth: the controllers keep working toward it across reboots, network partitions, and node failures.

Observed state is what the controllers discover by querying the cluster: how many Pods are actually running, which ones are healthy, which nodes are reachable. The gap between desired and observed is called drift, and eliminating drift is the sole purpose of every controller.
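
For a Deployment, the replica part of that gap can be read straight from the object. The following is a minimal sketch, assuming kubectl is pointed at a cluster where the Deployment exists; the helper name replica_drift is hypothetical and not part of kubectl.

```shell
# replica_drift NAME: print desired replicas minus ready replicas for a Deployment.
# Hypothetical helper for illustration; 0 means the replica count has converged.
replica_drift() {
  local desired ready
  desired=$(kubectl get deployment "$1" -o jsonpath='{.spec.replicas}')
  ready=$(kubectl get deployment "$1" -o jsonpath='{.status.readyReplicas}')
  # status.readyReplicas is omitted while no Pod is ready, so default to 0.
  echo $(( ${desired:-0} - ${ready:-0} ))
}
```

Running replica_drift demo right after a Pod is deleted should print a positive number for a moment, then fall back to 0 as the controller closes the gap.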

Key mental model: Kubernetes controllers are declarative thermostats. You set the temperature (desired state); the thermostat continuously measures the room (observed state) and fires the heater or AC (reconciliation actions) until the room matches the setting. You never tell the thermostat "turn on the heater for exactly 4 minutes" — you just declare what you want.

The Watch Loop: How Controllers Work

Each built-in controller runs as a set of goroutines inside the kube-controller-manager process; custom controllers usually run as separate operator processes. Every one of them repeats the same three steps:

List & Watch: On startup, the controller lists all relevant objects from the API server and then opens a long-lived Watch stream. The API server pushes events (ADDED, MODIFIED, DELETED) whenever an object changes. This is far more efficient than polling, and controllers typically react very shortly after a change is persisted.
Compute diff: Compare the spec (desired) with what actually exists in the cluster (observed). For the ReplicaSet controller: spec.replicas vs. the number of active Pods matching the selector. The result of the observation is what gets summarized in status.
Act: Issue API calls to close the gap: create Pods, delete Pods, update status fields. The controller then waits for the next event (or a periodic resync) and runs the same comparison again. The loop never terminates.

The Kubernetes reconciliation loop: controllers watch the API server, compare desired vs. observed state, and act continuously until they match.
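
The compute-diff step above can be sketched as a pure function. This is a toy model under stated assumptions, not how the ReplicaSet controller is implemented: the name plan_replicas is hypothetical, and a real controller issues API calls instead of printing a plan.

```shell
# plan_replicas DESIRED OBSERVED: print the action that would close the gap.
# Toy model of "compute diff"; it only decides, it does not act.
plan_replicas() {
  if [ "$2" -lt "$1" ]; then
    echo "create $(( $1 - $2 ))"
  elif [ "$2" -gt "$1" ]; then
    echo "delete $(( $2 - $1 ))"
  else
    echo "noop"
  fi
}
```

For example, plan_replicas 3 2 prints create 1. To see the raw ADDED, MODIFIED, and DELETED events that feed this step, kubectl get pods -l app=demo --watch --output-watch-events streams them for the demo Deployment used later in this lesson.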

Level-Triggered vs. Edge-Triggered

Most traditional automation is edge-triggered: it reacts to an event ("a server crashed — run the recovery script"). If the event is missed — because your listener was down, or the network dropped the notification — the system is permanently broken until a human intervenes. Kubernetes is deliberately level-triggered: controllers do not care what happened; they only care about the current state. Even if a controller crashed and missed a hundred Pod deletions, when it restarts it lists all Pods, sees the current count is wrong, and reconciles. The system is inherently self-healing without any special retry logic.
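
The difference is easy to see in a toy model. In the sketch below (both function names are hypothetical), edge_count keeps a cached count and adjusts it from the events it happens to receive, while level_count ignores history and counts what exists right now.

```shell
# edge_count START EVENT...: update a cached Pod count from the events received.
edge_count() {
  local count event
  count=$1
  shift
  for event in "$@"; do
    case "$event" in
      ADDED)   count=$(( count + 1 )) ;;
      DELETED) count=$(( count - 1 )) ;;
    esac
  done
  echo "$count"
}

# level_count POD...: ignore history and count the Pods that exist right now.
level_count() {
  echo "$#"
}
```

Suppose three Pods existed, two were deleted, and only one DELETED event was delivered. edge_count 3 DELETED prints 2, which is stale, whereas level_count pod-c prints 1, the true count, because it re-lists the current state.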

Production implication: You can safely restart kube-controller-manager at any time. The controllers re-list everything and reconcile from the current state. Running multiple control plane replicas (HA mode) avoids split-brain for a separate reason: the kube-controller-manager instances use leader election via a Lease object, and only the elected leader runs the controllers. The other instances stand by and take over when the Lease stops being renewed.
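
To check which instance currently holds a Lease, you can read its spec.holderIdentity field. This is a small sketch: the helper name leader_host is hypothetical, and the assumption that identities look like host_id is taken from the sample output later in this lesson and may not hold for every component.

```shell
# leader_host LEASE: print the host part of the current holder of a kube-system Lease.
leader_host() {
  local holder
  holder=$(kubectl -n kube-system get lease "$1" -o jsonpath='{.spec.holderIdentity}')
  # Assumes identities shaped like "<host>_<id>"; keep the part before the first "_".
  echo "${holder%%_*}"
}
```

Calling leader_host kube-controller-manager before and after stopping the active instance should show the holder changing once a standby acquires the Lease.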

Inspecting the Control Loop in Real Time

The best way to see the reconciliation loop in action is to deliberately create drift and watch the system correct it.

# Step 1: create a Deployment with 3 replicas
kubectl create deployment demo --image=nginx:1.25 --replicas=3

# Step 2: list the Pods
kubectl get pods -l app=demo
# NAME                    READY   STATUS    RESTARTS   AGE
# demo-7d9f4b6c9d-2xkrp   1/1     Running   0          30s
# demo-7d9f4b6c9d-h7qnl   1/1     Running   0          30s
# demo-7d9f4b6c9d-pj8tm   1/1     Running   0          30s

# Step 3: manually delete one Pod — introduce drift
kubectl delete pod demo-7d9f4b6c9d-2xkrp

# Step 4: immediately list again — the controller already reacted
kubectl get pods -l app=demo
# NAME                    READY   STATUS              RESTARTS   AGE
# demo-7d9f4b6c9d-h7qnl   1/1     Running             0          45s
# demo-7d9f4b6c9d-pj8tm   1/1     Running             0          45s
# demo-7d9f4b6c9d-r9mxq   0/1     ContainerCreating   0          1s   <-- new pod

# Step 5: watch controller events — these are the reconciliation actions
kubectl describe replicaset -l app=demo | grep -A 10 "Events:"
# Events:
#   Normal  SuccessfulCreate  2m    replicaset-controller  Created pod: demo-7d9f4b6c9d-r9mxq

The gap between deleting the Pod and seeing a new one in ContainerCreating is typically under a second in a healthy cluster. This responsiveness comes from the Watch stream — the ReplicaSet controller receives a DELETED event for the Pod almost immediately and enqueues a reconciliation work item. No polling interval delays the reaction.

The Role of status vs. spec

Most Kubernetes objects have two top-level sections that encode the dual nature of the system. The spec is the contract you write; the status is what the controller writes back after observing reality. You normally never write status yourself; it is managed by the control plane. When you run kubectl get deployment, the READY column shows status.readyReplicas against spec.replicas (for example 3/3), so a value such as 2/3 is drift you can see directly.

# Inspect raw spec and status with kubectl get -o yaml
kubectl get deployment demo -o yaml | grep -A 12 "^status:"
# status:
#   availableReplicas: 3
#   conditions:
#   - lastTransitionTime: "2025-03-15T10:22:31Z"
#     message: Deployment has minimum availability.
#     reason: MinimumReplicasAvailable
#     status: "True"
#     type: Available
#   observedGeneration: 2
#   readyReplicas: 3
#   replicas: 3
#   updatedReplicas: 3

# Comparing metadata.generation with status.observedGeneration shows whether the controller has seen the latest spec
# If metadata.generation=3 and status.observedGeneration=2, the controller has not yet processed generation 3
kubectl get deployment demo \
  -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'
# 2 2   <-- controller has observed the latest spec; "3 2" means it has not caught up yet
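
The same comparison can be wrapped so that a script can branch on it. This is a sketch with a hypothetical helper name, spec_observed. Matching generations only mean the controller has seen the latest spec; a rollout can still be in progress.

```shell
# spec_observed NAME: succeed only when the Deployment controller has seen the latest spec.
spec_observed() {
  local gens generation observed
  gens=$(kubectl get deployment "$1" -o jsonpath='{.metadata.generation} {.status.observedGeneration}')
  generation=${gens%% *}   # text before the space
  observed=${gens#* }      # text after the space (empty if status is not set yet)
  [ -n "$observed" ] && [ "$generation" = "$observed" ]
}
```

A typical use is: if spec_observed demo; then echo "latest spec observed"; fi. To wait for the rollout itself to finish, kubectl rollout status deployment/demo is the more direct tool.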

Production Failure Modes Caused by Misunderstanding the Loop

Most Kubernetes production incidents stem from engineers thinking imperatively when the system is declarative. Here are the most common traps at scale:

Manually patching Pods expecting the change to persist. If you run kubectl exec into a Pod and install a package, or use kubectl edit pod to change the container image, the change disappears the moment Kubernetes replaces that Pod (on node drain, eviction, or node failure). The source of truth is the Pod template in the Deployment spec, so always update the manifest, not the live Pod.
Fighting the controller with scripts. Some teams write scripts that scale a Deployment down temporarily ("set replicas to 0 for maintenance"). Writing spec.replicas is fine on its own: the controller observes the new value and removes the Pods. The trouble starts when something else (an HPA, an operator) also writes that field and sets it back, at which point the controller obediently scales up again. The only durable way to change what the controller does is to change the desired state at its source, such as the manifest, the HPA bounds, or the owning custom resource. Note that kubectl rollout pause only pauses rollouts triggered by Pod template changes; it does not stop scaling.
Ignoring controller backoff. When part of the system fails repeatedly (image pull errors, quota exceeded, scheduler unable to place Pods), Kubernetes retries with exponential backoff; container restarts and image pulls, for example, back off to a maximum of 5 minutes between attempts by default. The cluster is not stuck; it is backing off. Check the events shown by kubectl describe on the Deployment, its ReplicaSet, and the Pods to diagnose, not kubectl get pods alone.
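
Before a maintenance script touches spec.replicas, it can check whether an autoscaler is writing the same field. The sketch below uses a hypothetical helper name, hpa_targets_deployment, and looks only at HorizontalPodAutoscalers in the current namespace; operators or other writers would need their own checks.

```shell
# hpa_targets_deployment NAME: succeed if an HPA in the current namespace targets the Deployment.
hpa_targets_deployment() {
  kubectl get hpa \
    -o jsonpath='{range .items[*]}{.spec.scaleTargetRef.kind}/{.spec.scaleTargetRef.name}{"\n"}{end}' \
    | grep -qxF "Deployment/$1"
}
```

For example: if hpa_targets_deployment demo; then echo "an HPA manages demo; change the HPA instead"; fi.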

Production pitfall: misreading CrashLoopBackOff. CrashLoopBackOff is not a Kubernetes error; it is the system working as designed. The container exits (usually a bug or misconfiguration on your side), the kubelet restarts it because the desired state says it should run, sees it crash again, and backs off before the next attempt. The fix is almost always in your application or its configuration, not in Kubernetes settings. You also cannot suppress the symptom by setting restartPolicy: Never on a Deployment: a Deployment's Pod template only accepts restartPolicy: Always, and the API server rejects anything else. Fix the root cause.
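
The backoff schedule explains why a crashing Pod seems to "hang" between attempts. The sketch below reproduces the documented kubelet default of a delay that starts at 10 seconds, doubles after each crash, and is capped at 300 seconds; the function name restart_delays is hypothetical, and clusters with non-default settings may behave differently.

```shell
# restart_delays N: print the first N restart delays in seconds,
# doubling from 10 and capped at 300 (the documented kubelet default).
restart_delays() {
  local delay out i
  delay=10
  out=""
  i=0
  while [ "$i" -lt "$1" ]; do
    out="$out${out:+ }$delay"          # append with a space separator
    delay=$(( delay * 2 ))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$(( i + 1 ))
  done
  echo "$out"
}
```

restart_delays 7 prints 10 20 40 80 160 300 300. While the kubelet is waiting, kubectl logs with the --previous flag shows the output of the container instance that crashed, which is usually where the root cause is visible.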

Custom Controllers and the Operator Pattern

The reconciliation loop is not limited to built-in Kubernetes resources. The Operator pattern extends the control plane by writing custom controllers for your own CustomResourceDefinitions (CRDs). An operator for a database (like the Postgres Operator or Cassandra Operator) watches for objects of a custom kind, for example PostgresCluster, then reconciles by creating StatefulSets, Services, ConfigMaps, and Secrets to match the declared spec. The same level-triggered, self-healing behavior applies. Many organizations run production infrastructure such as databases, queues, caches, and ML training jobs under operators built on this exact pattern.
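
At its core, an operator's reconcile step is the same diff as before, applied to child objects instead of Pods. The toy sketch below is not taken from any real operator; the function name plan_children and the child names are hypothetical. It compares the children a custom resource should own with the ones that exist and prints what is missing.

```shell
# plan_children "WANTED..." "EXISTING...": print one "create NAME" line per missing child object.
plan_children() {
  local name
  for name in $1; do
    case " $2 " in
      *" $name "*) ;;                 # already exists, nothing to do
      *) echo "create $name" ;;
    esac
  done
}
```

With plan_children "statefulset service secret" "service", the output is create statefulset followed by create secret. Because the plan is computed from current state on every pass, deleting a child object by hand simply makes it reappear in the next plan.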

# See which instance currently holds each control-plane leader-election Lease
kubectl -n kube-system get lease
# NAME                               HOLDER                                                                    AGE
# kube-controller-manager            controlplane-node-1_abc123...                                             10d
# kube-scheduler                     controlplane-node-1_abc123...                                             10d

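# Watch the controller manager logs to see reconciliation in action
# (the component label matches kubeadm-style clusters; managed control planes often do not expose these Pods)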
kubectl -n kube-system logs -l component=kube-controller-manager --tail=50 -f

# Check for conditions on a Deployment that indicate reconciliation problems
kubectl get deployment demo \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
# Available      True    Deployment has minimum availability.
# Progressing    True    ReplicaSet "demo-abc123" has successfully progressed.