Kubernetes Workloads & Configuration

DaemonSets & Node-Level Workloads

Most Kubernetes workloads are fungible: the scheduler picks whichever nodes have spare capacity, and you do not care which node a given replica lands on. But some workloads must run on every node, exactly once — a log forwarder that ships container logs to a central aggregator, a monitoring agent that exposes per-node CPU/memory/disk metrics, a network plugin (CNI) that wires up Pod networking, or a security scanner that watches every container on the host. These are node-level infrastructure agents, and Kubernetes provides the DaemonSet controller to manage them.

What Is a DaemonSet?

A DaemonSet ensures that one copy of a Pod runs on every node in the cluster, or on a selected subset of nodes. When a node joins the cluster, the DaemonSet controller automatically creates a Pod for it. When a node is removed from the cluster, its Pod is garbage-collected. Draining a node does not remove the DaemonSet Pod by itself: kubectl drain refuses to proceed past DaemonSet-managed Pods unless you pass --ignore-daemonsets, and then leaves them running. A DaemonSet has no replicas field; the Pod count is determined entirely by the number of nodes that match the selector.

Key distinction from Deployments: A Deployment schedules N replicas wherever capacity exists. A DaemonSet schedules exactly 1 Pod per matching node, driven by cluster topology — not resource availability.

Canonical Use Cases at Big-Tech Scale

Log shipping: Fluentd, Fluent Bit, or Vector reading /var/log/containers/*.log from the host filesystem and forwarding to Elasticsearch, Loki, or Splunk.
Node monitoring: Prometheus node_exporter exposing CPU, memory, disk, and network metrics for each node; Datadog Agent, New Relic Infrastructure.
Network plugins (CNI): Calico, Cilium, Weave — these are DaemonSets that configure iptables or eBPF rules on every node so Pods can communicate across the cluster.
Storage plugins (CSI node drivers): Agents that attach and mount volumes on the local node.
Security agents: Falco, Sysdig, or Aqua runtime security scanning every syscall on every node.

Writing a Real DaemonSet Manifest

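Below is a production-style Fluent Bit DaemonSet that ships container logs to an Elasticsearch cluster. Key details: it mounts /var/log and /var/lib/docker/containers from the host read-only, tolerates the control-plane and node-condition taints so it runs on every node, and sets conservative resource limits so it cannot starve application Pods. The container is not privileged: the manifest sets no securityContext, and reading host log files through read-only hostPath mounts does not require one. The Fluent Bit configuration itself comes from the fluent-bit-config ConfigMap, which is not shown here.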

# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  updateStrategy:
    type: RollingUpdate          # roll one node at a time; default for DaemonSets
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        # Run on master/control-plane nodes too (they have this taint by default)
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        # Stay on nodes that become not-ready or unreachable, which is critical for log capture.
        # No tolerationSeconds here: a timeout would evict the Pod after that many seconds.
        # (The DaemonSet controller also adds these two tolerations automatically.)
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: dockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc
      terminationGracePeriodSeconds: 30
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: dockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: fluent-bit-config

Always set resource requests and limits on DaemonSet Pods. The scheduler uses requests for bin-packing, and limits cap what the container may consume. Without a memory limit, a single misbehaving log forwarder can consume node memory until the kubelet starts evicting application Pods, and you will only discover this during an incident.

Tolerations: Scheduling Onto Tainted Nodes

Kubernetes uses taints on nodes to repel Pods. A taint says "do not schedule here unless you explicitly tolerate this." Control-plane nodes typically carry node-role.kubernetes.io/control-plane:NoSchedule; GPU nodes often carry nvidia.com/gpu=present:NoSchedule; cordoned nodes, including nodes being drained, carry node.kubernetes.io/unschedulable:NoSchedule.

Infrastructure DaemonSets almost always need to run on every node, including tainted ones, so they must declare tolerations that match those taints. A toleration is matched on key, operator (Equal or Exists), value (only with Equal), and effect (NoSchedule, PreferNoSchedule, or NoExecute); a NoExecute toleration may also set tolerationSeconds to bound how long the Pod stays after the taint appears. operator: Exists with a key matches any taint with that key regardless of its value. For a blanket toleration of every taint, omit the key and effect as well and specify only operator: Exists.
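The matching rules above can be sketched as a Pod template fragment with three toleration shapes side by side. The dedicated=logging taint is a hypothetical example introduced only for illustration; the other key comes from this lesson.

```yaml
# Fragment of spec.template.spec; not a complete manifest.
tolerations:
  # Equal: matches only the taint dedicated=logging:NoSchedule
  # ("dedicated=logging" is a hypothetical taint, not from this lesson).
  - key: dedicated
    operator: Equal
    value: logging
    effect: NoSchedule
  # Exists with a key: matches any nvidia.com/gpu NoSchedule taint, whatever its value.
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  # Exists with no key and no effect: matches every taint on every node.
  - operator: Exists
```

The three entries are shown together only for comparison. In a real manifest the last entry makes the first two redundant, so you would normally choose either targeted tolerations or the blanket form.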

Figure: A DaemonSet schedules one fluent-bit Pod per node automatically, including tainted control-plane nodes that block ordinary application Pods.

Targeting a Node Subset with nodeSelector and nodeAffinity

Sometimes you want a DaemonSet to run only on nodes with specific hardware or roles — GPU nodes for a CUDA metrics exporter, or SSD-backed nodes for a high-throughput log forwarder. Use nodeSelector (simple label match) or nodeAffinity (richer expressions) in the Pod template:

# Target only nodes labeled with hardware=gpu
spec:
  template:
    spec:
      nodeSelector:
        hardware: gpu
      # OR, preferred for complex rules (use one of the two; if both are set, a node must match both):
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: hardware
                    operator: In
                    values: [gpu, gpu-ampere]

# Label a node for targeting:
kubectl label node ip-10-0-1-42.ec2.internal hardware=gpu
kubectl get nodes --show-labels | grep hardware

Operational Commands

# List all DaemonSet Pods and which nodes they are on
kubectl get pods -n logging -o wide -l app=fluent-bit

# Check DaemonSet rollout status
kubectl rollout status daemonset/fluent-bit -n logging

# See how many nodes are scheduled vs desired
kubectl get daemonset fluent-bit -n logging
# Output columns: DESIRED | CURRENT | READY | UP-TO-DATE | AVAILABLE | NODE SELECTOR | AGE

# Trigger a rolling restart (e.g. after a ConfigMap change)
kubectl rollout restart daemonset/fluent-bit -n logging

# Watch the rolling restart progress node by node
kubectl get pods -n logging -l app=fluent-bit -w

# Describe a specific DaemonSet Pod to debug scheduling failures
kubectl describe pod fluent-bit-xk92p -n logging
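# Look for the "Events" section: FailedScheduling on a DaemonSet Pod usually means the node lacks the requested CPU or memory
# (for an untolerated taint, the controller normally creates no Pod for that node at all)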

# Force delete a stuck DaemonSet Pod (it will be recreated immediately)
kubectl delete pod fluent-bit-xk92p -n logging --grace-period=0 --force

Production Failure Modes

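The most common DaemonSet incident at scale: a new node joins the cluster but never gets a DaemonSet Pod. Root cause is almost always a missing toleration. The new node has a custom taint (e.g. a cloud provider spot-instance taint like kubernetes.azure.com/scalesetpriority=spot:NoSchedule) that the DaemonSet manifest does not tolerate. The DaemonSet controller skips nodes whose NoSchedule or NoExecute taints the Pod template does not tolerate, so the usual symptom is a missing Pod and a DESIRED count lower than the node count rather than a Pending Pod; a Pod that is created but stays Pending more often points to insufficient CPU or memory on the node. Always audit the taints on every node class in your cluster and ensure your infrastructure DaemonSets tolerate all of them.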
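For the spot-instance taint mentioned above, a minimal sketch of the matching toleration looks like this. It is a fragment to merge into the existing tolerations list, not a complete manifest, and it assumes the taint is applied exactly as written in the example.

```yaml
# Fragment to merge into spec.template.spec.tolerations of the DaemonSet.
tolerations:
  - key: kubernetes.azure.com/scalesetpriority
    operator: Equal
    value: spot
    effect: NoSchedule
```

Using Equal with an explicit value keeps the toleration narrow. Switching to operator: Exists and dropping the value would tolerate any taint with that key.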

A second common failure: a DaemonSet log agent consumes unbounded memory during a log burst, gets OOM-killed, restarts, and enters CrashLoopBackOff on every node simultaneously, breaking observability right when you need it most. Always set memory limits and configure the agent's internal buffer and backpressure settings so it degrades gracefully under load instead of crashing.
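As a rough sketch of what bounded buffering can look like, here is one possible shape for the fluent-bit-config ConfigMap referenced by the manifest earlier. The Elasticsearch host name, the port, and every numeric value are illustrative assumptions, not recommendations. Mem_Buf_Limit caps how much the tail input buffers in memory; once the cap is reached the input pauses reading until data is flushed. Retry_Limit bounds how often a failed chunk is retried.

```yaml
# Sketch only: host, port, and all sizes below are assumed values.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Tag               kube.*
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [OUTPUT]
        Name         es
        Match        kube.*
        Host         elasticsearch.logging.svc
        Port         9200
        Retry_Limit  5
```

Keep the per-input buffer cap well below the container memory limit (256Mi in the manifest above) so the agent pauses input before the kernel kills it.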

Do NOT use DaemonSets for application workloads. Running your API server as a DaemonSet so "it gets one replica per node" is a common anti-pattern. It couples your application deployment to cluster topology, makes horizontal scaling impossible, and wastes resources on small nodes. Use Deployments with topology spread constraints instead.
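A minimal sketch of the recommended alternative follows: a Deployment that spreads replicas across nodes with a topology spread constraint. The name api, the image reference, and the replica count are hypothetical placeholders.

```yaml
# Sketch: spread an application across nodes without tying it to node count.
# "api", the image, and replicas: 6 are hypothetical values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      topologySpreadConstraints:
        # Keep the per-node replica count within 1 of every other node.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: registry.example.com/api:1.0.0
```

Here the replica count stays an independent knob, so the workload can scale horizontally while still being distributed evenly across nodes.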

Update Strategy Considerations

DaemonSets support two update strategies. RollingUpdate (available since Kubernetes 1.6 and the default in apps/v1) replaces Pods node by node, taking down at most maxUnavailable Pods at a time. maxUnavailable defaults to 1; keep it there in production if you never want to lose log coverage on more than one node simultaneously. OnDelete only replaces a Pod when you manually delete it, which is useful for critical CNI plugins where an in-place restart would break Pod networking on that node and you prefer to drain the node first.
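The two strategies can be sketched as alternative fragments of the DaemonSet spec. The 10% figure is an illustrative assumption for a large cluster where a one-node-at-a-time rollout would take too long; it is not a value from this lesson.

```yaml
# Option A: rolling update with a percentage instead of a fixed count.
# At most 10% of the DaemonSet's Pods are unavailable at any moment.
updateStrategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 10%
---
# Option B: Pods are replaced only when you delete them yourself,
# for example after draining the node.
updateStrategy:
  type: OnDelete
```

With OnDelete, applying a new Pod template changes nothing until each old Pod is deleted, so the rollout pace is entirely under operator control.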

Always test DaemonSet updates on a staging cluster with an identical node configuration. A bad Fluent Bit config that crashes the agent will propagate to every node in the cluster within minutes of a rolling update — there is no concept of a "canary DaemonSet Pod" out of the box.
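One way to approximate a canary by hand, shown here only as a sketch, is a second DaemonSet restricted to a few labeled nodes. The node label rollout-ring=canary and the name fluent-bit-canary are hypothetical, and the fragment omits the volumes, tolerations, and resources that a real copy would carry over from the main manifest.

```yaml
# Sketch of a manual canary: a second DaemonSet limited to labeled nodes.
# The label rollout-ring=canary and the name fluent-bit-canary are hypothetical.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit-canary
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluent-bit-canary
  template:
    metadata:
      labels:
        app: fluent-bit-canary
    spec:
      nodeSelector:
        rollout-ring: canary
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1   # replace with the candidate image or config
# The main DaemonSet needs the inverse rule so both agents never share a node,
# e.g. a required nodeAffinity matchExpression:
#   key: rollout-ring, operator: NotIn, values: [canary]
```

The canary uses its own app label so the two DaemonSets' selectors do not overlap. Once the candidate has run cleanly on the canary nodes, the change is promoted to the main DaemonSet.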