Cluster Architecture
A Kubernetes cluster is split into two planes that serve completely different roles: the control plane, which is the brain, and the data plane (worker nodes), which is the muscle. Every production outage, every performance problem, and every scaling decision ultimately traces back to one of these components. Understanding what each piece does — and why it was designed that way — is what separates engineers who operate Kubernetes from engineers who merely use it.
The Control Plane
The control plane is a set of processes that implement the Kubernetes API and maintain the desired state of the cluster. In managed offerings (GKE, EKS, AKS) you never see these VMs — the cloud vendor runs them for you and gives you an SLA. In self-managed clusters (kubeadm, Talos, k3s) you own every one of them.
API Server (kube-apiserver)
The API server is the only component that reads from and writes to etcd. Everything else (kubelets, controllers, your kubectl) talks to the API server over HTTPS. It authenticates and authorizes every incoming request, runs admission control (built-in plugins such as LimitRanger and ResourceQuota, plus webhooks such as OPA/Gatekeeper), validates the object against its schema, and then persists it.
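The request path can be pictured as a chain of stages, each of which may reject the request before anything reaches storage. The following is a minimal Python sketch of that idea, not the real kube-apiserver code: the names `handle_request` and `Rejected`, the in-memory `store` dict standing in for etcd, and the sample token and policy data are all hypothetical. Authentication and authorization are included for completeness.

```python
class Rejected(Exception):
    """Raised by any stage that refuses the request."""


# Hypothetical stand-ins for real authentication and RBAC data.
KNOWN_TOKENS = {"token-alice": "alice"}
ALLOWED = {("alice", "create", "pods")}


def authenticate(request):
    user = KNOWN_TOKENS.get(request.get("token"))
    if user is None:
        raise Rejected("401 Unauthorized")
    return user


def authorize(user, request):
    if (user, request["verb"], request["resource"]) not in ALLOWED:
        raise Rejected("403 Forbidden")


def admit(obj):
    # Stand-in for an admission policy: require an "owner" label.
    if "owner" not in obj.get("labels", {}):
        raise Rejected("admission denied: missing owner label")


def validate(obj):
    # Stand-in for schema validation: a name is mandatory.
    if not obj.get("name"):
        raise Rejected("422 Invalid: name is required")


def handle_request(request, store):
    """Run every stage in order; only a fully accepted object is persisted."""
    user = authenticate(request)
    authorize(user, request)
    obj = request["object"]
    admit(obj)
    validate(obj)
    store[(request["resource"], obj["name"])] = obj  # the only write to storage
    return "201 Created"
```

Because every stage runs before the single write at the end, a rejected request leaves the store untouched, which is the property the single-entry-point design is meant to guarantee.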
This single-entry-point design makes the API server your cluster's most critical component. In high-availability control planes you run three or five replicas behind a load balancer. Very large clusters go further and split storage by resource type, for example by moving Events into a dedicated etcd cluster with --etcd-servers-overrides, but you are unlikely to need that configuration until you operate clusters with many thousands of nodes.
etcd
etcd is a strongly consistent, distributed key-value store based on the Raft consensus algorithm. It holds the entire cluster state: every Pod spec, every Service definition, every Secret, every ConfigMap. A quorum of etcd members must agree before any write is acknowledged, which is why you always run an odd number of members (3 or 5) in production — a cluster of 3 can tolerate 1 failure; a cluster of 5 can tolerate 2.
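The quorum arithmetic above is easy to check. The sketch below is illustrative only; the function names `quorum` and `fault_tolerance` are made up for this example and are not part of any etcd API.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 8):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

Running it shows why even-sized clusters are avoided: four members need a quorum of three and so tolerate only one failure, the same as three members, while adding one more machine that can fail.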
etcd is not a database for your application data. It is sized for metadata, not large blobs: requests larger than 1.5 MiB (etcd's default --max-request-bytes limit) are rejected. Operators routinely compact and defragment etcd on a schedule to keep its database file from growing without bound.
etcd is highly sensitive to disk latency because it must fsync its write-ahead log after every write. On a cloud VM with a slow disk, slow fsyncs can trigger etcd leader elections, which in turn cause API server timeouts and failures that cascade across the cluster. Always run etcd on a volume with low write latency (AWS io2, GCP pd-ssd, or a dedicated NVMe local disk). Monitor the etcd_disk_wal_fsync_duration_seconds Prometheus metric; a p99 above 10 ms is a red flag.
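etcd_disk_wal_fsync_duration_seconds is a histogram, so the p99 has to be estimated from cumulative bucket counts. The sketch below shows one way to do that with linear interpolation inside the bucket that contains the target rank, similar in spirit to PromQL's histogram_quantile(). The bucket boundaries and counts are invented sample data, and the helper name is an assumption for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound_seconds, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            span = count - prev_count
            if math.isinf(upper) or span == 0:
                return lower  # cannot interpolate; fall back to the lower bound
            return lower + (upper - lower) * (rank - prev_count) / span
        lower, prev_count = upper, count
    return lower


# Hypothetical cumulative bucket counts scraped from one etcd member.
SAMPLE_BUCKETS = [
    (0.001, 50), (0.002, 80), (0.004, 95),
    (0.008, 99), (0.016, 100), (math.inf, 100),
]

p99 = histogram_quantile(0.99, SAMPLE_BUCKETS)
print(f"p99 fsync = {p99 * 1000:.1f} ms, within 10 ms budget = {p99 <= 0.010}")
```

In practice you would let Prometheus compute this with a histogram_quantile query over a rate window; the Python version only makes the arithmetic behind the alert threshold visible.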
Scheduler (kube-scheduler)
The scheduler watches the API server for Pods that have no nodeName assigned (these show up as Pending until they are bound and started) and selects the best node for each one. The selection is a two-phase process:
- Filtering — eliminates nodes that cannot satisfy the Pod's hard constraints: CPU/memory requests, node selector, taints, affinity/anti-affinity rules, topology spread.
- Scoring — ranks the remaining nodes using a weighted set of functions (LeastAllocated, BalancedResourceAllocation, ImageLocality, etc.) and binds the Pod to the highest-scoring node.
The scheduler writes the node name back to the API server as a Binding object; it never directly communicates with the kubelet.
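The two phases can be sketched in a few lines. This is a simplified model under stated assumptions, not the kube-scheduler implementation: only CPU requests, node selectors, and taints are filtered, the only scoring function is a LeastAllocated-style CPU score, and the `Node`, `Pod`, and `schedule` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_capacity: int       # millicores
    cpu_requested: int = 0  # millicores already requested by Pods on the node
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_request: int        # millicores
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)


def feasible(pod, node):
    """Filtering: every hard constraint must hold."""
    fits = node.cpu_requested + pod.cpu_request <= node.cpu_capacity
    selected = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    tolerated = node.taints <= pod.tolerations  # every taint must be tolerated
    return fits and selected and tolerated


def least_allocated_score(pod, node):
    """Scoring: 0-100, higher when more CPU would remain free after placement."""
    free = node.cpu_capacity - node.cpu_requested - pod.cpu_request
    return 100 * free // node.cpu_capacity


def schedule(pod, nodes):
    """Return the best feasible node name, or None if the Pod must stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: least_allocated_score(pod, n)).name


nodes = [
    Node("node-a", 4000, 3000),
    Node("node-b", 4000, 1000),
    Node("node-c", 8000, 0, taints={"gpu"}),
]
print(schedule(Pod("web", 500), nodes))
```

Note that node-c is the emptiest node but is removed during filtering because the Pod does not tolerate its taint; scoring only ever sees nodes that survived the first phase.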
Controller Manager (kube-controller-manager)
This is a single binary that runs dozens of control loops (controllers) in goroutines. Each controller watches one or more resource types and reconciles actual state toward desired state:
- ReplicaSet controller: ensures the right number of Pod replicas exist.
- Deployment controller: manages rolling updates and rollbacks via ReplicaSets.
- Node controller: marks nodes as NotReady when their heartbeats have been missing for a configurable grace period (--node-monitor-grace-period, default 40 s). Pods on such a node are evicted once their tolerations for the not-ready and unreachable taints expire (300 s by default).
- Job controller, CronJob controller, Namespace controller, ServiceAccount controller, and many more.
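Every controller in the list above follows the same pattern: compare desired state with observed state and take the smallest action that closes the gap. Here is a minimal sketch of one reconciliation pass in the style of the ReplicaSet controller. The function name and the Pod naming scheme are hypothetical, and real controllers act through the API server rather than on a local list.

```python
import itertools

_pod_ids = itertools.count(1)  # hypothetical source of unique Pod names


def reconcile_replicaset(desired_replicas, pods):
    """One pass of a ReplicaSet-style control loop.

    `pods` is the list of Pod names that currently exist. The function
    returns the list after creating or deleting Pods to match the target.
    """
    pods = list(pods)
    diff = desired_replicas - len(pods)
    if diff > 0:
        # Too few replicas: create the missing Pods.
        pods.extend(f"pod-{next(_pod_ids)}" for _ in range(diff))
    elif diff < 0:
        # Too many replicas: delete the surplus.
        del pods[desired_replicas:]
    return pods


pods = reconcile_replicaset(3, [])        # scale up from nothing
pods = reconcile_replicaset(3, pods[1:])  # one Pod vanished; the loop replaces it
print(pods)
```

The pass is idempotent: calling it again when the counts already match changes nothing, which is why a controller can crash, restart, and simply run the loop again.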
Cloud Controller Manager
Cloud-specific logic was extracted from kube-controller-manager into its own binary so cloud providers can ship their own implementation independently of the Kubernetes release cycle. It handles provisioning cloud load balancers for Services of type LoadBalancer, configuring cloud network routes, and syncing node metadata (instance type, zone, region labels). Attaching cloud volumes for PersistentVolumeClaims is handled separately, by CSI drivers.
Worker Nodes
Every worker node runs three components that together form the execution environment for Pods.
kubelet
The kubelet is a long-running agent that registers the node with the API server and reconciles the set of Pods that the scheduler has assigned to it. For each assigned Pod, the kubelet instructs the container runtime to pull images and start containers, then monitors health via liveness and readiness probes. It reports node capacity, allocatable resources, and Pod status back to the API server roughly every 10 seconds (configurable via --node-status-update-frequency).
The kubelet is also the only component that talks directly to the container runtime, via the Container Runtime Interface (CRI), a gRPC API. This abstraction lets you swap containerd for another CRI-compliant runtime such as CRI-O, or add a sandboxed runtime such as Kata Containers, without changing anything else.
When a node misbehaves, start with the kubelet logs: journalctl -u kubelet -f. Image pull failures, runtime errors, and cgroup issues surface here before they propagate to kubectl describe pod.
kube-proxy
kube-proxy watches Services and Endpoints (EndpointSlices in current releases) and programs the node's networking rules to implement virtual IP routing. In most clusters it uses iptables mode, where each Service IP gets a chain of DNAT rules that load-balance across healthy Pod IPs. In high-throughput clusters, an eBPF data plane (for example Cilium's kube-proxy replacement) bypasses iptables entirely for lower latency and better scalability.
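A chain of DNAT rules is evaluated top to bottom, so spreading traffic evenly takes a small trick: in iptables mode each rule matches with a random probability of one over the number of endpoints still remaining, and the last rule always matches. The sketch below only verifies that arithmetic with exact fractions. It does not generate or inspect real iptables rules, and both function names are made up for illustration.

```python
from fractions import Fraction


def rule_probabilities(endpoint_count):
    """Per-rule match probability for a chain of `endpoint_count` DNAT rules.

    Rule i is only reached if every earlier rule did not match, so it
    matches with probability 1 / (endpoints remaining, including itself).
    """
    return [Fraction(1, endpoint_count - i) for i in range(endpoint_count)]


def overall_share(probabilities):
    """Overall fraction of connections that end up at each endpoint."""
    shares, not_yet_matched = [], Fraction(1)
    for p in probabilities:
        shares.append(not_yet_matched * p)
        not_yet_matched *= 1 - p
    return shares


probs = rule_probabilities(3)
print([str(p) for p in probs])                 # per-rule probabilities
print([str(s) for s in overall_share(probs)])  # resulting share per endpoint
```

The per-rule probabilities differ, yet every endpoint receives the same overall share. The chain is still walked linearly for each new connection, which is one reason iptables mode scales poorly as the number of Services and endpoints grows.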
Container Runtime
The container runtime is responsible for image management and the actual lifecycle of container processes. The de facto standard is containerd (a CNCF graduated project, extracted from Docker). It pulls OCI-compliant images from registries, manages the on-disk layer cache, and calls the lower-level runc (or a sandboxed runtime like gVisor) to create isolated namespaced processes.
How the Components Talk at Startup
When you submit kubectl apply -f deployment.yaml, here is the exact sequence:
- kubectl serializes your manifest and sends a POST /apis/apps/v1/namespaces/.../deployments to the API server.
- API server authenticates (mTLS/OIDC), authorizes (RBAC), runs admission webhooks, validates the schema, and writes the Deployment object to etcd.
- Deployment controller (inside controller-manager) watches for new Deployments, creates a ReplicaSet object, and writes it to the API server.
- ReplicaSet controller creates the required number of Pod objects (with no nodeName) and writes them to the API server.
- Scheduler detects the unbound Pods, runs filtering and scoring, and writes the node binding.
- kubelet on the chosen node detects the Pod is assigned to it, then instructs containerd to pull the image and start containers.
- kubelet reports Pod status back; the API server updates the object in etcd. kubectl get pods now shows Running.
Every step is an asynchronous watch loop — there are no direct RPC calls between components except through the API server. This is the architecture that lets Kubernetes self-heal: any component can crash and restart, and the reconciliation loops pick up exactly where they left off.
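The whole sequence can be simulated with a shared state dictionary standing in for the API server and etcd, and one small function per component. This is a toy model under heavy assumptions: a single hypothetical node named node-1, no real watches (each component is simply polled), and invented object shapes. It is meant only to show that the components never call each other.

```python
import itertools


def deployment_controller(state):
    for dep in state["deployments"].values():
        rs_name = dep["name"] + "-rs"
        if rs_name not in state["replicasets"]:
            state["replicasets"][rs_name] = {"name": rs_name, "replicas": dep["replicas"]}
            return True
    return False


def replicaset_controller(state):
    for rs in state["replicasets"].values():
        owned = [p for p in state["pods"].values() if p["owner"] == rs["name"]]
        if len(owned) < rs["replicas"]:
            # Pick the first unused Pod name for this ReplicaSet.
            name = next(f"{rs['name']}-{i}" for i in itertools.count()
                        if f"{rs['name']}-{i}" not in state["pods"])
            state["pods"][name] = {"name": name, "owner": rs["name"],
                                   "nodeName": None, "phase": "Pending"}
            return True
    return False


def scheduler(state):
    for pod in state["pods"].values():
        if pod["nodeName"] is None:
            pod["nodeName"] = "node-1"  # filtering and scoring omitted
            return True
    return False


def kubelet(state, node="node-1"):
    for pod in state["pods"].values():
        if pod["nodeName"] == node and pod["phase"] == "Pending":
            pod["phase"] = "Running"  # stands in for pulling images and starting containers
            return True
    return False


def run_until_settled(state, components):
    """Poll every component until a full round changes nothing."""
    rounds = 0
    while any([component(state) for component in components]):
        rounds += 1
    return rounds


state = {"deployments": {"web": {"name": "web", "replicas": 2}},
         "replicasets": {}, "pods": {}}
# The order is deliberately "wrong": the result is the same regardless.
run_until_settled(state, [kubelet, scheduler, replicaset_controller, deployment_controller])
print({name: pod["phase"] for name, pod in state["pods"].items()})
```

Deleting a Pod from `state["pods"]` and calling `run_until_settled` again brings the count back to two Running Pods, which is the self-healing behavior in miniature.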