Kubernetes Workloads & Configuration

DaemonSets & Node-Level Workloads

Most Kubernetes workloads are fungible: the scheduler picks whichever nodes have spare capacity, and you do not care which node a given replica lands on. But some workloads must run on every node, exactly once — a log forwarder that ships container logs to a central aggregator, a monitoring agent that exposes per-node CPU/memory/disk metrics, a network plugin (CNI) that wires up Pod networking, or a security scanner that watches every container on the host. These are node-level infrastructure agents, and Kubernetes provides the DaemonSet controller to manage them.

What Is a DaemonSet?

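A DaemonSet ensures that one copy of a Pod runs on every node in the cluster, or on a selected subset of nodes. When a node joins the cluster, the DaemonSet controller creates a Pod for it, and the default scheduler binds that Pod to the node. When a node is removed from the cluster, its Pod is garbage-collected. Draining or cordoning a node does not remove DaemonSet Pods: they tolerate the unschedulable taint, which is why kubectl drain needs --ignore-daemonsets. You never set a replicas field on a DaemonSet; the Pod count is determined entirely by the number of nodes that match the Pod template's node selector, affinity, and tolerations.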

Key distinction from Deployments: A Deployment schedules N replicas wherever capacity exists. A DaemonSet schedules exactly 1 Pod per matching node, driven by cluster topology — not resource availability.

Canonical Use Cases at Big-Tech Scale

  • Log shipping: Fluentd, Fluent Bit, or Vector reading /var/log/containers/*.log from the host filesystem and forwarding to Elasticsearch, Loki, or Splunk.
  • Node monitoring: Prometheus node_exporter exposing CPU, memory, disk, and network metrics for each node; Datadog Agent, New Relic Infrastructure.
  • Network plugins (CNI): Calico, Cilium, Weave — these are DaemonSets that configure iptables or eBPF rules on every node so Pods can communicate across the cluster.
  • Storage plugins (CSI node drivers): Agents that attach and mount volumes on the local node.
  • Security agents: Falco, Sysdig, or Aqua runtime security scanning every syscall on every node.

Writing a Real DaemonSet Manifest

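Below is a Fluent Bit DaemonSet that ships container logs to an Elasticsearch cluster. Key details: it mounts /var/log and /var/lib/docker/containers from the host read-only, declares tolerations so it also runs on control-plane nodes, and sets conservative resource requests and limits so it cannot starve application Pods. The manifest does not request privileged mode; read-only hostPath mounts are typically enough for tailing log files. The /var/lib/docker/containers mount matters only on nodes that use the Docker runtime, where the files under /var/log/containers are symlinks into that directory.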

```yaml
# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  updateStrategy:
    type: RollingUpdate   # default for DaemonSets in apps/v1
    rollingUpdate:
      maxUnavailable: 1   # replace Pods on one node at a time
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        # Run on control-plane nodes too (they carry this taint by default)
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        # Stay on nodes that are not-ready or unreachable (critical for log capture).
        # No tolerationSeconds here: a timeout would evict the agent after that many
        # seconds. The DaemonSet controller also adds these two tolerations itself.
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: dockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc
      terminationGracePeriodSeconds: 30
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: dockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: fluent-bit-config
```

Always set resource requests and limits on DaemonSet Pods. The scheduler uses requests for bin-packing, and the memory limit caps what the agent can consume. Without a limit, a single misbehaving log forwarder can consume node memory until the kubelet starts evicting application Pods, and you will only discover this during an incident.

Tolerations: Scheduling Onto Tainted Nodes

Kubernetes uses taints on nodes to repel Pods. A taint says "do not schedule here unless you explicitly tolerate this." Control-plane nodes carry node-role.kubernetes.io/control-plane:NoSchedule by default; GPU nodes often carry nvidia.com/gpu=present:NoSchedule; cordoned nodes, including nodes being drained, carry node.kubernetes.io/unschedulable:NoSchedule.

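Infrastructure DaemonSets almost always need to run on every node, including tainted ones, so they must declare tolerations that match those taints. A toleration has up to five fields: key, operator (Equal or Exists), value (used only with Equal), effect (NoSchedule, PreferNoSchedule, or NoExecute), and an optional tolerationSeconds for NoExecute. Using operator: Exists with a key and no value matches any taint with that key regardless of value. Using operator: Exists with no key at all matches every taint, which is the blanket toleration many infrastructure agents use. The DaemonSet controller also adds tolerations for the built-in node-condition taints (not-ready, unreachable, disk-pressure, memory-pressure, pid-pressure, unschedulable) automatically.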
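
The toleration forms can be sketched side by side as a Pod template fragment. This is an illustration rather than a recommended manifest: the taint keys are the ones mentioned in this lesson, and in a real DaemonSet the final blanket entry would make the first two redundant.

```yaml
# Fragment of spec.template.spec in a DaemonSet (illustrative only).
tolerations:
  # Equal: tolerates only the exact taint nvidia.com/gpu=present:NoSchedule.
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
  # Exists with a key: tolerates that key with any value.
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  # Exists with no key and no effect: tolerates every taint on every node.
  - operator: Exists
```

The blanket form is convenient for agents that truly must run everywhere, but it also means the Pod lands on nodes that someone tainted deliberately to keep workloads off, so listing taints explicitly is the more conservative choice.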

[Figure: DaemonSet Controller: One Pod Per Node. The DaemonSet controller watches the node list. Node 1 and Node 2 (workers, no taint) each run one fluent-bit Pod alongside app Pods. Node 3 (control-plane, NoSchedule taint) runs a fluent-bit Pod because the DaemonSet tolerates the taint, while app Pods are blocked.]
DaemonSet schedules one fluent-bit Pod per node automatically, including tainted control-plane nodes that block ordinary app Pods.

Targeting a Node Subset with nodeSelector and nodeAffinity

Sometimes you want a DaemonSet to run only on nodes with specific hardware or roles — GPU nodes for a CUDA metrics exporter, or SSD-backed nodes for a high-throughput log forwarder. Use nodeSelector (simple label match) or nodeAffinity (richer expressions) in the Pod template:

```yaml
# Target only nodes labeled with hardware=gpu
spec:
  template:
    spec:
      nodeSelector:
        hardware: gpu
      # OR, preferred for complex rules:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: hardware
                    operator: In
                    values: [gpu, gpu-ampere]
```

```shell
# Label a node for targeting:
kubectl label node ip-10-0-1-42.ec2.internal hardware=gpu
kubectl get nodes --show-labels | grep hardware
```

Operational Commands

```shell
# List all DaemonSet Pods and which nodes they are on
kubectl get pods -n logging -o wide -l app=fluent-bit

# Check DaemonSet rollout status
kubectl rollout status daemonset/fluent-bit -n logging

# See how many nodes are scheduled vs desired
kubectl get daemonset fluent-bit -n logging
# Output columns: DESIRED | CURRENT | READY | UP-TO-DATE | AVAILABLE | NODE SELECTOR | AGE

# Trigger a rolling restart (e.g. after a ConfigMap change)
kubectl rollout restart daemonset/fluent-bit -n logging

# Watch the rolling restart progress node by node
kubectl get pods -n logging -l app=fluent-bit -w

# Describe a specific DaemonSet Pod to debug scheduling failures
kubectl describe pod fluent-bit-xk92p -n logging
# Look for the "Events" section: FailedScheduling usually points to insufficient
# CPU or memory on the node. A missing toleration shows up differently: no Pod is
# created for that node at all, and DESIRED is lower than the node count.

# Force delete a stuck DaemonSet Pod (it will be recreated immediately)
kubectl delete pod fluent-bit-xk92p -n logging --grace-period=0 --force
```

Production Failure Modes

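The most common DaemonSet incident at scale: a new node joins the cluster but never gets a DaemonSet Pod. Root cause is almost always a missing toleration. The new node has a custom taint (e.g. a cloud provider spot-instance taint like kubernetes.azure.com/scalesetpriority=spot:NoSchedule) that the DaemonSet manifest does not tolerate. In that case the controller does not create a Pod for the node at all, so there is no Pending Pod to alert on; the node is simply missing from the DESIRED count. A Pod that does sit in Pending usually means the node lacks the CPU or memory the Pod requests. Always audit the taints on every node class in your cluster and ensure your infrastructure DaemonSets tolerate all of them.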
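
As a minimal sketch of the fix for the spot-instance example, the fragment below adds a toleration for that specific taint to the DaemonSet's Pod template. The key, value, and effect are copied from the taint named above; check the taints on your own nodes before relying on them.

```yaml
# Fragment of a DaemonSet manifest: tolerate the spot-instance taint
# kubernetes.azure.com/scalesetpriority=spot:NoSchedule
spec:
  template:
    spec:
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
```

Changing the Pod template triggers a rollout of the DaemonSet under its update strategy, so on a large cluster the new toleration reaches existing nodes gradually.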

A second common failure: a DaemonSet log agent consumes unbounded memory during a log burst, is OOM-killed, restarts, and enters CrashLoopBackOff on every node simultaneously, breaking observability right when you need it most. Always set memory limits and configure the agent's internal buffer and backpressure settings so it degrades gracefully under load instead of crashing.
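
One way to bound the agent's memory is Fluent Bit's Mem_Buf_Limit setting on the tail input, sketched below as the fluent-bit-config ConfigMap that the DaemonSet mounts at /fluent-bit/etc. This is an assumption-laden example, not a tuned configuration: the Elasticsearch host name is hypothetical, and the 50MB cap is only chosen to sit well below the 256Mi container limit.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Tag               kube.*
        multiline.parser  docker, cri
        # Pause this input once buffered data reaches the cap,
        # instead of growing until the container is OOM-killed.
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [OUTPUT]
        Name         es
        Match        kube.*
        # Hypothetical in-cluster Elasticsearch service name.
        Host         elasticsearch.logging.svc
        Port         9200
        Retry_Limit  5
```

When the input is paused, the tail plugin stops reading and can usually catch up from the log files once the output drains, though lines can be lost if the files rotate away in the meantime.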

Do NOT use DaemonSets for application workloads. Running your API server as a DaemonSet so "it gets one replica per node" is a common anti-pattern. It couples your application deployment to cluster topology, ties the replica count to the node count so you cannot scale the application independently, and wastes resources on small nodes. Use Deployments with topology spread constraints instead.
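
A Deployment with a topology spread constraint might look like the sketch below. The name, labels, replica count, and image are placeholders invented for illustration; the point is that replicas are spread across nodes while the count stays independent of cluster size.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server                 # hypothetical application name
spec:
  replicas: 6                      # scales independently of node count
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                            # per-node replica counts differ by at most 1
          topologyKey: kubernetes.io/hostname   # spread across individual nodes
          whenUnsatisfiable: ScheduleAnyway     # prefer spreading, but never block scheduling
          labelSelector:
            matchLabels:
              app: api-server
      containers:
        - name: api
          image: registry.example.com/api-server:1.0.0   # placeholder image
```

Switching whenUnsatisfiable to DoNotSchedule turns the preference into a hard rule, at the cost of Pending Pods when the spread cannot be met.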

Update Strategy Considerations

DaemonSets support two update strategies. RollingUpdate (available since Kubernetes 1.6 and the default in apps/v1) replaces Pods node by node, taking down at most maxUnavailable Pods at a time. The default value of 1 is a sensible production setting, because you never lose log coverage on more than one node simultaneously. OnDelete only replaces a Pod when you manually delete it, which is useful for critical CNI plugins where an in-place restart would break Pod networking on that node and you prefer to drain the node first.
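
The two strategies can be sketched as alternative spec fragments. The minReadySeconds value in the first option is an illustrative assumption: it makes the controller wait until a new Pod has been ready for that long before counting it as available and moving to the next node.

```yaml
# Option A: rolling update, one node at a time, gated on readiness
spec:
  minReadySeconds: 30        # assumed value; tune to how quickly a bad agent fails
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
---
# Option B: Pods are replaced only when an operator deletes them
spec:
  updateStrategy:
    type: OnDelete
```

With Option B, applying a new template changes nothing until each Pod is deleted, which lets you drain a node first and then delete its Pod to pick up the new version.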

Always test DaemonSet updates on a staging cluster with an identical node configuration. A bad Fluent Bit config that crashes the agent can propagate to every node in the cluster within minutes of a rolling update, particularly when no readiness probe or minReadySeconds holds the rollout back. There is no concept of a "canary DaemonSet Pod" out of the box.
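
One common workaround, sketched here under assumptions, is to approximate a canary with a second DaemonSet restricted to a few labeled nodes. The logging-canary label and the fluent-bit-canary name are hypothetical, and the fragments omit the tolerations, volumes, and mounts shown in the main manifest.

```yaml
# Canary DaemonSet: runs the candidate image or config only on labeled nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit-canary
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit-canary       # must not overlap with the main DaemonSet's selector
  template:
    metadata:
      labels:
        app: fluent-bit-canary
    spec:
      nodeSelector:
        logging-canary: "true"     # hypothetical label applied to a few nodes
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1   # replace with the candidate version
---
# Main DaemonSet fragment: skip canary nodes so each node runs exactly one agent.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: logging-canary
                    operator: DoesNotExist
```

Adding the affinity rule changes the main Pod template and therefore triggers its own rollout, so it is easier to put that rule in place before the first canary is needed.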