The Kubernetes scheduler decides which node every Pod runs on. By default it considers any node with enough free CPU and memory for the Pod's resource requests, and for many workloads that is exactly right. But production clusters at scale demand more nuance: GPU workloads must land on GPU nodes, frontend Pods should be spread across availability zones, a noisy data-pipeline job should not share a node with a latency-sensitive API, and a tainted spot-instance pool should only accept workloads that explicitly tolerate interruptions. This lesson covers the four primitives that give you precise, intentional control over where Pods run.
Node Affinity: Targeting Node Characteristics
Node affinity is the evolution of the older nodeSelector. Where nodeSelector only supports exact label matches, node affinity supports operators (In, NotIn, Exists, DoesNotExist, Gt, Lt) and distinguishes between required rules (hard constraints — the Pod will not schedule if unmet) and preferred rules (soft constraints — the scheduler tries to honor them but places the Pod anyway if it cannot).
# node-affinity.yaml — GPU workload that MUST land on a GPU node,
# and PREFERS nodes labeled zone=us-east-1a
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
  namespace: ml
spec:
  replicas: 4
  selector:
    matchLabels:
      app: model-inference
  template:
    metadata:
      labels:
        app: model-inference
    spec:
      affinity:
        nodeAffinity:
          # Hard rule: Pod stays Pending if no matching node exists
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: accelerator
                    operator: In
                    values: [nvidia-a100, nvidia-h100]
          # Soft rule: prefer zone us-east-1a, weight 80 out of 100
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: [us-east-1a]
            - weight: 20
              preference:
                matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values: [p4d.24xlarge]
      containers:
        - name: inference
          image: company/model-server:v3.2.1
          resources:
            limits:
              nvidia.com/gpu: "1"
# Label a node so it matches the hard rule:
kubectl label node gpu-node-01 accelerator=nvidia-a100
IgnoredDuringExecution: Both rule types carry the suffix IgnoredDuringExecution, meaning that node label changes after scheduling do NOT evict running Pods. A requiredDuringSchedulingRequiredDuringExecution variant that would add eviction has been discussed upstream, but it is not available today. If you remove a label from a node, Pods that were placed by that rule keep running undisturbed.
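The operators listed above combine in a specific way that is easy to get wrong: expressions inside one matchExpressions list are ANDed, while separate nodeSelectorTerms are ORed. The following is a minimal sketch of that behavior, not a recommended production manifest. The Pod name and the example.com/gpu-memory-gb label are hypothetical and exist only for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-operators-demo          # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          # Term 1: BOTH expressions must match (AND)
          - matchExpressions:
              - key: accelerator
                operator: Exists         # any value; no "values" field allowed
              - key: example.com/gpu-memory-gb   # hypothetical label
                operator: Gt
                values: ["40"]           # Gt/Lt take exactly one integer, written as a string
          # Term 2: evaluated independently; the node qualifies if
          # Term 1 OR Term 2 matches
          - matchExpressions:
              - key: accelerator
                operator: In
                values: [nvidia-h100]
  containers:
    - name: demo
      image: company/model-server:v3.2.1
```

Under these assumptions, a node labeled accelerator=nvidia-h100 qualifies through Term 2 even if it has no memory label, while an A100 node qualifies only if its hypothetical memory label is greater than 40.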
Pod Affinity & Anti-Affinity: Co-location and Separation
Pod affinity and anti-affinity schedule a Pod near (or away from) other Pods that match a label selector, measured within a topology domain, typically a node, zone, or rack. This is how you implement two critical production patterns: co-location (put the API on the same node as its cache to minimize latency) and separation (spread replicas across zones so a single AZ failure does not take them all down).
# pod-affinity-antiaffinity.yaml
# Scenario: API Pods that MUST run on the same node as a payments-cache Pod (co-location),
# and API replicas that MUST NOT share a zone (anti-affinity for HA)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      affinity:
        podAffinity:
          # REQUIRED: schedule near a pod labeled app=payments-cache on the SAME NODE
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: payments-cache
              topologyKey: kubernetes.io/hostname # "same node"
        podAntiAffinity:
          # REQUIRED: no two payments-api pods in the same ZONE
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: payments-api
              topologyKey: topology.kubernetes.io/zone # "same AZ"
          # SOFT: prefer different nodes too (already implied while the hard zone rule holds; matters only if that rule is relaxed)
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: payments-api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: api
          image: company/payments-api:v8.1.0
Production pitfall — hard pod anti-affinity with too few zones: If you set requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity scoped to topology.kubernetes.io/zone and your cluster has 3 zones but you request 4 replicas, the fourth Pod will be permanently Pending — there is no fourth zone to place it on. This is a common cause of mysterious scale-out failures. Either use soft anti-affinity, or cap replicas to the number of zones you have. Always verify with kubectl get events --field-selector reason=FailedScheduling.
Figure: Pod anti-affinity with a zone topology key. Replicas spread across availability zones, but a 4th replica stays Pending when only 3 zones exist.
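One way out of the pitfall above is to downgrade the zone rule from required to preferred. The fragment below is a sketch of that change for the payments-api Pod template; it replaces only the podAntiAffinity stanza and assumes the rest of the manifest stays as written.

```yaml
# Fragment of a Pod template spec (spec.template.spec in a Deployment)
affinity:
  podAntiAffinity:
    # Soft rule: prefer one payments-api Pod per zone, but allow a zone
    # to hold two replicas when there are more replicas than zones
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: payments-api
          topologyKey: topology.kubernetes.io/zone
```

The trade-off is that the scheduler may now co-locate replicas in one zone even when other zones have room, for example when those zones are short on resources. The rule is a scoring hint rather than a guarantee.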
Taints and Tolerations: Repelling Pods from Nodes
Taints work in the opposite direction from affinity. A taint is placed on a node and repels all Pods that do not carry a matching toleration. This is how you create dedicated node pools: spot-instance pools, GPU pools, high-memory nodes, or compliance-boundary nodes that should only run specific workloads.
Every taint has a key=value:effect triple. There are three effects:
NoSchedule: New Pods without a matching toleration will not be scheduled on this node. Existing Pods are not evicted.
PreferNoSchedule: The scheduler tries to avoid placing Pods here but will do so if no other option exists. A soft form of NoSchedule.
NoExecute: New Pods are not scheduled AND existing Pods without a matching toleration are evicted immediately. Pods whose toleration sets tolerationSeconds stay bound for that long after the taint is added and are then evicted.
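To make the key, value, and effect fields concrete, the fragments below sketch how a taint is stored on the Node object and how a toleration matches it. The node and key names reuse the spot-pool example from this lesson; the two fragments belong to different objects and are shown together only for comparison.

```yaml
# Fragment of a Node object, as printed by: kubectl get node spot-node-01 -o yaml
spec:
  taints:
    - key: node-role
      value: spot
      effect: NoSchedule
---
# Fragment of a Pod spec: two tolerations that both match the taint above
tolerations:
  # Equal: key, value, and effect must all match
  - key: node-role
    operator: Equal
    value: spot
    effect: NoSchedule
  # Exists: matches any value of the key; the "value" field must be omitted
  - key: node-role
    operator: Exists
    effect: NoSchedule
```

In practice a Pod needs only one of the two tolerations. Exists is the looser choice because it keeps matching if the taint value changes later.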
# --- Managing taints on nodes ---
# Add a taint: only spot-tolerant workloads may run here
kubectl taint node spot-node-01 node-role=spot:NoSchedule
# Kubernetes adds this taint automatically when a node is cordoned or drained;
# shown here only to illustrate the key:effect form (no value)
kubectl taint node worker-02 node.kubernetes.io/unschedulable:NoSchedule
# Remove a taint (note the trailing dash)
kubectl taint node spot-node-01 node-role=spot:NoSchedule-
# View existing taints
kubectl describe node spot-node-01 | grep -A5 Taints
# --- Pod manifest with tolerations ---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
  namespace: data
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      # Tolerate the spot taint: this pod CAN run on spot nodes
      tolerations:
        - key: node-role
          operator: Equal
          value: spot
          effect: NoSchedule
        # Tolerate transient node-not-ready conditions for up to 60s
        # before being evicted (overrides the default 300s)
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 60
      # Combined with node affinity to REQUIRE spot nodes (not just tolerate them)
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role
                    operator: In
                    values: [spot]
      containers:
        - name: processor
          image: company/batch:v2.0.0
Tolerations do not attract; they permit. A toleration alone does not guarantee placement on the tainted node, because the Pod can also land on any untainted node. To force placement on a specific tainted pool, combine the toleration with a nodeAffinity rule or nodeSelector. This combination of taint, toleration, and affinity is the standard pattern for dedicated node pools, and it is widely used on managed Kubernetes offerings such as EKS, GKE, and AKS.
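For a single exact-match label, nodeSelector is a shorter way to express the "require" half of the pattern. The fragment below is a sketch that assumes the spot nodes carry the label node-role=spot in addition to the taint. Taints and labels are stored separately, so tainting a node does not label it.

```yaml
# Fragment of a Pod template spec (spec.template.spec in a Deployment)
tolerations:
  # Permit: allows the Pod onto nodes tainted node-role=spot:NoSchedule
  - key: node-role
    operator: Equal
    value: spot
    effect: NoSchedule
nodeSelector:
  # Require: only nodes LABELED node-role=spot are candidates.
  # Assumes the label was applied separately, for example with:
  #   kubectl label node spot-node-01 node-role=spot
  node-role: spot
```

If the label is missing, the Pod stays Pending even though the toleration matches, which is a quick way to tell the two mechanisms apart when debugging.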
Topology Spread Constraints: Even Distribution
topologySpreadConstraints (stable since Kubernetes 1.19) is the most flexible and composable of the four primitives. It lets you define a maximum allowed skew: the difference in matching Pod count between the most-loaded and least-loaded topology domain. Where anti-affinity expresses a binary "not together" rule, topology spread expresses a continuous distribution goal.
# topology-spread.yaml
# Spread 12 replicas as evenly as possible across zones AND nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: production
spec:
  replicas: 12
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      topologySpreadConstraints:
        # Rule 1: Max 1-pod skew across zones (hard)
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule # hard: block if the skew cannot be satisfied
          labelSelector:
            matchLabels:
              app: frontend
        # Rule 2: Max 1-pod skew across nodes (soft)
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway # soft: prefer low skew, but schedule regardless
          labelSelector:
            matchLabels:
              app: frontend
      containers:
        - name: web
          image: company/frontend:v5.3.0
# Count frontend Pods per node to check the resulting spread
kubectl get pods -n production -l app=frontend -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c
whenUnsatisfiable values: DoNotSchedule (hard: the Pod stays Pending if placing it would exceed maxSkew) and ScheduleAnyway (soft: the scheduler still places the Pod, giving priority to nodes that minimize skew). In production, zone-level spread should typically be hard (DoNotSchedule) to guarantee HA, while node-level spread can be soft (ScheduleAnyway) to avoid unnecessary Pending.
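The skew check is simple arithmetic, and a worked case makes the hard rule easier to reason about. The fragment below repeats the zone constraint and walks through one scheduling decision in comments. The per-zone Pod counts and the zone names are hypothetical.

```yaml
# Fragment of a Pod template spec (spec.template.spec in a Deployment)
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: frontend
# Worked example (hypothetical counts of app=frontend Pods per zone):
#   zone-a: 4   zone-b: 4   zone-c: 3      global minimum = 3
# Candidate placements for the next Pod:
#   zone-a -> 5 Pods, skew = 5 - 3 = 2 > maxSkew   => rejected
#   zone-b -> 5 Pods, skew = 5 - 3 = 2 > maxSkew   => rejected
#   zone-c -> 4 Pods, skew = 4 - 3 = 1 <= maxSkew  => allowed
```

With ScheduleAnyway in place of DoNotSchedule, zone-a and zone-b would not be rejected. They would only score lower than zone-c.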
Putting It Together: A Production-Grade Node Pool Pattern
Real clusters compose all four primitives. A typical large production cluster has multiple node pools: a general pool (no taints), a GPU pool (tainted accelerator=true:NoSchedule), a spot pool (tainted node-role=spot:NoSchedule), and a system pool (tainted CriticalAddonsOnly=true:NoSchedule for cluster-critical daemonsets). Each workload declares exactly which pools it tolerates, which it requires via affinity, and how it wants to spread. This separation reduces noisy-neighbour interference, optimizes cost (batch on spot, APIs on on-demand), and keeps user workloads off the nodes that run critical infrastructure.
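As a sketch of how the pieces fit together for one workload, the manifest below targets the GPU pool described above: it tolerates the pool's taint, requires GPU-labeled nodes, and softly spreads across zones. The Deployment name is hypothetical, and the manifest assumes GPU nodes carry both the accelerator=true taint and an accelerator label naming the GPU model, as in the earlier node affinity example.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-pool-demo                    # hypothetical name
  namespace: ml
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-pool-demo
  template:
    metadata:
      labels:
        app: gpu-pool-demo
    spec:
      # 1. Permit: tolerate the GPU pool taint accelerator=true:NoSchedule
      tolerations:
        - key: accelerator
          operator: Equal
          value: "true"                  # quoted: taint values are strings
          effect: NoSchedule
      # 2. Require: only nodes LABELED with a known GPU model
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: accelerator
                    operator: In
                    values: [nvidia-a100, nvidia-h100]
      # 3. Spread: prefer an even split across zones, without blocking
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: gpu-pool-demo
      containers:
        - name: inference
          image: company/model-server:v3.2.1
          resources:
            limits:
              nvidia.com/gpu: "1"
```

The spread rule is deliberately soft here. GPU capacity is usually the scarcest resource, so a hard zone constraint on top of a hard node requirement is the kind of stacking that tends to leave Pods Pending.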
# Verify scheduling decisions and diagnose Pending pods
kubectl get events -n production --field-selector reason=FailedScheduling --sort-by='.lastTimestamp'
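# Validate the manifest against the API server without persisting it.
# Note: a server-side dry run does NOT run the scheduler, so it cannot predict placement.
kubectl apply -f deployment.yaml --dry-run=server
# Show the events for a single Pod, including FailedScheduling messages
# (kubectl 1.26+; older clients exposed this as "kubectl alpha events")
kubectl events --for pod/my-pod-abc123 -n production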
# Show node labels (used to write affinity rules)
kubectl get nodes --show-labels
# Cordon a node (adds Unschedulable taint, prevents new Pods)
kubectl cordon worker-03
# Drain a node for maintenance (evicts Pods respecting PodDisruptionBudgets)
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data --grace-period=60
# Uncordon after maintenance
kubectl uncordon worker-03
Do not over-constrain your cluster. Every hard scheduling rule (requiredDuringScheduling, NoSchedule taints, DoNotSchedule topology constraints) narrows the set of valid placements. Stacking multiple hard rules without enough nodes to satisfy all of them simultaneously causes Pods to stay Pending indefinitely — often discovered only during a traffic spike or a node failure event. Prefer soft rules wherever possible, and always test your constraint combination by temporarily scaling up the Deployment in a staging cluster and observing which Pods reach Running state.