Kubernetes Logging Patterns
Kubernetes does not have a built-in mechanism for persisting or forwarding pod logs. The platform intentionally leaves logging as an operator concern, which means every team must make deliberate architectural choices about how logs are collected, what format they are emitted in, and how multi-line events are reconstructed before they reach the storage backend. This lesson covers the three production-grade patterns — node-level agents, stdout discipline, and multi-line handling — that together form the foundation of every serious Kubernetes logging implementation, from a 10-node startup cluster to the multi-thousand-node fleets run by top-tier cloud providers.
How the Container Runtime Handles Logs
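When a container writes to stdout or stderr, the container runtime captures those bytes. In virtually all production clusters today that runtime is containerd or CRI-O, and the kubelet drives it through the Container Runtime Interface (CRI). The runtime writes each line to a log file under /var/log/pods/<namespace>_<pod-name>_<uid>/<container-name>/0.log, where the number is the container's restart count. Each line is wrapped in the CRI log format: an RFC 3339 timestamp, the stream name (stdout or stderr), a tag (F for a full line, P for a partial one), and then the original log text.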
This CRI-wrapped file is what log shippers actually tail. /var/log/containers/ contains symlinks pointing to these files, named <pod>_<namespace>_<container>-<container-id>.log — the symlink naming convention is what most shipper configurations reference. Understanding this indirection is critical: when you configure your DaemonSet to watch /var/log/containers/*.log, you are following symlinks to the real CRI log files, and your shipper must know how to strip the CRI wrapper before parsing the log body.
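The stripping step can be made concrete with a minimal Python sketch that parses one CRI-wrapped line into its fields. It assumes the layout <timestamp> <stream> <tag> <message>, separated by single spaces, where stream is stdout or stderr and the first ":"-separated tag is F (full line) or P (partial line). The sample line and the CriRecord/parse_cri_line names are illustrative and are not taken from any shipper's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriRecord:
    time: str      # RFC 3339 timestamp written by the runtime
    stream: str    # "stdout" or "stderr"
    partial: bool  # True for a P (partial) fragment, False for F (full)
    message: str   # the original log text, unmodified


def parse_cri_line(line: str) -> CriRecord:
    """Strip the CRI wrapper: '<time> <stream> <tag> <message>'."""
    parts = line.rstrip("\n").split(" ", 3)  # message may itself contain spaces
    if len(parts) < 3 or parts[1] not in ("stdout", "stderr"):
        raise ValueError(f"not a CRI log line: {line!r}")
    time, stream, tags = parts[0], parts[1], parts[2]
    message = parts[3] if len(parts) == 4 else ""
    # The tag field may hold ':'-separated tags; the first one is P or F.
    return CriRecord(time, stream, tags.split(":")[0] == "P", message)


line = '2024-01-15T10:23:45.123456789Z stdout F {"level":"info","msg":"server started"}'
print(parse_cri_line(line))
```

Everything after the third space is handed on untouched, which is why a JSON body can be parsed in a second stage once the wrapper is gone.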
Node-Level Agent Pattern (DaemonSet)
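The standard Kubernetes logging architecture places a lightweight log-shipping agent on every node as a DaemonSet. The DaemonSet controller keeps one agent pod running on every eligible node, with eligibility governed by node selectors, affinity, and tolerations. When a node joins the cluster it automatically gets an agent, and when a node is removed its agent pod is garbage-collected. kubectl drain does not evict DaemonSet pods (it refuses to proceed while they are present unless --ignore-daemonsets is set), so the agent keeps shipping logs while a node is being drained. This pattern is preferred over per-pod sidecars at scale because a single agent can multiplex the logs of dozens of pods running on the same node, amortizing CPU and memory costs.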
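The placement rule a DaemonSet enforces can be sketched as a small reconciliation function. This is a toy model under simplifying assumptions. Eligibility is decided here by an exact-match node selector only, whereas the real controller also evaluates node affinity, taints and tolerations, and rolling-update settings. The Node class and the reconcile function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)


def reconcile(nodes, agent_pod_by_node, node_selector):
    """Toy model of DaemonSet placement.

    agent_pod_by_node maps node name -> name of the agent pod on that node.
    Returns (eligible nodes still missing an agent, agent pods to delete).
    """
    eligible = {
        n.name
        for n in nodes
        if all(n.labels.get(k) == v for k, v in node_selector.items())
    }
    needs_agent = sorted(eligible - set(agent_pod_by_node))
    # Agents whose node is gone or no longer eligible are orphans.
    orphaned = sorted(
        pod for node, pod in agent_pod_by_node.items() if node not in eligible
    )
    return needs_agent, orphaned


nodes = [
    Node("node-a", {"kubernetes.io/os": "linux"}),
    Node("node-b", {"kubernetes.io/os": "linux"}),  # newly joined, no agent yet
    Node("node-c", {"kubernetes.io/os": "windows"}),
]
print(reconcile(nodes, {"node-a": "agent-1", "node-gone": "agent-2"},
                {"kubernetes.io/os": "linux"}))
```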
The DaemonSet agent accesses log files via hostPath volumes. The required mounts are:
- /var/log: CRI log files and pod log symlinks
- /var/lib/docker/containers (legacy Docker runtime) or /run/containerd (containerd)
- /run/log/journal: systemd journal for node-level daemon logs (kubelet, containerd itself)
- /var/lib/fluent-bit: the agent state directory (offset registry); must be a hostPath so it survives agent pod restarts
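As a rough illustration, these mounts translate into hostPath volumes in the DaemonSet pod template. The sketch below builds that fragment as Python dictionaries that follow the Kubernetes pod spec field names, then prints it as JSON. It assumes a containerd node, so the Docker path is omitted, and the volume names are hypothetical. Only the state directory is writable. It uses the hostPath type DirectoryOrCreate so that the directory exists on first start.

```python
import json

# (volume name, host path, read-only?) -- names are illustrative
HOST_MOUNTS = [
    ("varlog", "/var/log", True),
    ("journal", "/run/log/journal", True),
    ("state", "/var/lib/fluent-bit", False),
]


def build_volumes(mounts):
    """Return (pod volumes, container volumeMounts) for the given host paths."""
    volumes, volume_mounts = [], []
    for name, path, read_only in mounts:
        host_path = {"path": path}
        if not read_only:
            # Create the state directory on first run so offsets can be persisted.
            host_path["type"] = "DirectoryOrCreate"
        volumes.append({"name": name, "hostPath": host_path})
        volume_mounts.append({"name": name, "mountPath": path, "readOnly": read_only})
    return volumes, volume_mounts


volumes, volume_mounts = build_volumes(HOST_MOUNTS)
print(json.dumps({"volumes": volumes, "volumeMounts": volume_mounts}, indent=2))
```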
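If the agent uses an emptyDir for its state directory, the offset registry is lost every time the agent pod is replaced, for example by a DaemonSet rollout, an eviction, or a pod deletion. An emptyDir survives container crashes such as OOM kills, but it does not survive pod replacement. What happens next depends on whether the agent reads newly discovered files from the head or from the tail. From the head, it replays every log file on the node from the beginning, flooding your storage backend with duplicates and potentially triggering index capacity alerts. From the tail, it silently skips everything written while it was down. Mount /var/lib/fluent-bit as a hostPath so the registry persists across pod restarts.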
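This failure mode is easy to reproduce with a toy offset registry. The sketch assumes a JSON file of byte offsets; Fluent Bit itself keeps offsets in a SQLite database configured on its tail input. It also assumes an agent that reads from the head when it has no saved offset. Deleting the registry file stands in for an emptyDir being wiped when the agent pod is replaced. ship_new_lines and demo are hypothetical names.

```python
import json
import os
import tempfile


def ship_new_lines(log_path, registry_path, sink):
    """Ship complete lines appended since the saved offset, then persist the new offset."""
    offsets = {}
    if os.path.exists(registry_path):
        with open(registry_path) as f:
            offsets = json.load(f)
    pos = offsets.get(log_path, 0)  # no saved offset: start from the head of the file
    with open(log_path, "rb") as log:
        log.seek(pos)
        for raw in log:
            if not raw.endswith(b"\n"):
                break  # incomplete line: leave it for the next poll
            sink.append(raw.decode().rstrip("\n"))
            pos += len(raw)
    offsets[log_path] = pos
    with open(registry_path, "w") as f:
        json.dump(offsets, f)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "0.log")
        registry_path = os.path.join(tmp, "offsets.json")
        sink = []
        with open(log_path, "wb") as f:
            f.write(b"line 1\nline 2\n")
        ship_new_lines(log_path, registry_path, sink)  # ships lines 1-2
        with open(log_path, "ab") as f:
            f.write(b"line 3\n")
        ship_new_lines(log_path, registry_path, sink)  # registry intact: ships line 3 only
        os.remove(registry_path)  # simulate the emptyDir being wiped on pod replacement
        ship_new_lines(log_path, registry_path, sink)  # replays lines 1-3 as duplicates
        return sink


print(demo())
```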
Stdout Discipline: Why It Matters and How to Enforce It
The entire node-level agent pattern depends on a fundamental contract: all application logs must go to stdout/stderr, never to files inside the container filesystem. This contract exists because the container filesystem is ephemeral. When a pod is deleted or rescheduled, its writable layer disappears, taking any file-based logs with it. The kubelet-managed log files under /var/log/pods/ survive container restarts (up to a configurable rotation limit) precisely because the container runtime writes them on the node, outside the container.
In practice, stdout discipline means three things:
- Log to stdout/stderr only. No file appenders, no logging.FileHandler, no /app/logs/*.log. Configure your frameworks accordingly. In Spring Boot, leave logging.file.name unset; it logs to the console by default. In Python, use logging.StreamHandler. In Go's log/slog, use slog.NewJSONHandler(os.Stdout, nil) or the os.Stderr equivalent.
- Emit structured JSON on a single line per event. One log entry = one line. The container runtime and log shippers treat newlines as event boundaries, so a multi-line JSON dump to stdout breaks this contract and requires expensive reassembly (discussed below).
- Never log to both stdout and a file. Dual logging creates duplicate events in your backend and inflates costs. Worse, ops teams learn to check one or the other, not both — critical context ends up in the wrong place during an incident.
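The first two rules can be sketched in Python as a formatter that renders each record as one JSON object on one line, attached to a StreamHandler on stdout. JsonFormatter, configure_stdout_logging, and the field names are illustrative assumptions; libraries such as python-json-logger provide a production-ready equivalent. Because json.dumps escapes embedded newlines, even a message that contains a line break stays on one physical line.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        event = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(event)  # newlines inside values are escaped as \n


def configure_stdout_logging(level=logging.INFO):
    handler = logging.StreamHandler(sys.stdout)  # stdout only: no FileHandler
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]  # drop any pre-existing handlers, e.g. file appenders
    root.setLevel(level)
    return handler


configure_stdout_logging()
logging.getLogger("checkout").info("order %s placed", "A-1001")
```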
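Enforce this contract at the platform level by flagging container builds that create /app/logs directories or install log rotation daemons. Pair this with an admission policy that denies emptyDir volumes named logs. The built-in Pod Security Admission controller cannot express a rule like this, so it is typically implemented with a policy engine such as OPA Gatekeeper or Kyverno, or with a custom validating admission webhook. At Google and Meta, these controls are enforced by the platform team, not left to individual application developers.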
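The emptyDir rule reduces to a small check over the pod spec. The sketch below shows only that decision logic, applied to a pod spec given as a dictionary. The violations function is hypothetical. In a real cluster the check would live in a policy engine or a validating admission webhook, not in standalone Python.

```python
def violations(pod_spec):
    """Hypothetical admission rule: deny emptyDir volumes named 'logs'."""
    problems = []
    for volume in pod_spec.get("volumes", []):
        if volume.get("name") == "logs" and "emptyDir" in volume:
            problems.append("volume 'logs' is an emptyDir; write logs to stdout instead")
    return problems


pod_spec = {
    "containers": [{"name": "app", "image": "example/app:1.0"}],
    "volumes": [{"name": "logs", "emptyDir": {}}],
}
print(violations(pod_spec))
```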
The kubelet enforces log rotation on the CRI-managed files: by default, logs are rotated at 10 MB with 5 rotations kept (--container-log-max-size, --container-log-max-files kubelet flags). Your node-level agent must be configured to follow rotated files (inode tracking, not filename tracking) or you will miss the tail of each rotation. Fluent Bit does this correctly by default via its inotify-based tail implementation.
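Here is a minimal sketch of inode-based following. It assumes a POSIX filesystem, where a rotated file keeps its inode after rename, and a writer that appends whole lines. RotationAwareTail and the rotated-file suffix are illustrative. The tailer keeps its handle on the old file and drains it once the path points at a new inode. Only then does it open the new file, so the last line written before rotation ("c" in the demo) is not lost. A tailer that reopened the path by name would jump straight to the new file and skip that line.

```python
import os
import tempfile


class RotationAwareTail:
    """Follow a log path across rename-based rotation by comparing inodes."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "rb")

    def _drain(self):
        # readline() returns b"" at EOF; later calls pick up newly appended data.
        lines = []
        while True:
            raw = self.handle.readline()
            if not raw:
                return lines
            lines.append(raw.decode().rstrip("\n"))

    def poll(self):
        try:
            rotated = os.stat(self.path).st_ino != os.fstat(self.handle.fileno()).st_ino
        except FileNotFoundError:
            rotated = False  # renamed but no new file yet: keep reading the old one
        lines = self._drain()  # finish whatever is left in the file we hold open
        if rotated:
            self.handle.close()
            self.handle = open(self.path, "rb")  # the new file now at the same path
            lines += self._drain()
        return lines

    def close(self):
        self.handle.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "0.log")
        with open(path, "wb") as f:
            f.write(b"a\nb\n")
        tail = RotationAwareTail(path)
        seen = tail.poll()
        with open(path, "ab") as f:
            f.write(b"c\n")  # appended just before rotation
        os.rename(path, path + ".20240115-102345")  # rotated file keeps its inode
        with open(path, "wb") as f:
            f.write(b"d\n")  # new file, new inode
        seen += tail.poll()
        tail.close()
        return seen


print(demo())
```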
Multi-Line Log Handling
Multi-line logs are one of the most common sources of silent data corruption in Kubernetes logging pipelines. A Java stack trace, a Python traceback, a Go panic dump, or a pretty-printed JSON blob all span multiple lines of stdout. The container runtime writes each of those lines as a separate log entry tagged F (full). The P (partial) tag is reserved for fragments of a single line that exceeded the runtime's line-size limit. If your shipper does not reassemble these lines into a single logical event, your backend receives dozens of disconnected one-liners instead of one coherent stack trace. Alerting rules that search for Exception in the log body then find the first line but not the context on lines 2–20.
There are two distinct reassembly problems and they must both be solved:
- CRI partial-line reassembly (P/F flags). The container runtime splits very long lines (longer than 16 KB in containerd by default) across multiple log file entries. Every fragment except the last carries the P flag, and the final fragment carries F. Fluent Bit's cri multiline parser handles this automatically. Promtail handles it via its docker/cri pipeline stages. This layer is about raw byte reassembly, not semantic understanding.
- Application-level multi-line reassembly (stack traces, panics). The runtime correctly wrote each line as a separate entry (each with F), but together they represent a single logical error event. Your shipper must detect the pattern and merge the lines before forwarding. The typical rule is that a line starting with a timestamp begins a new event, and a line that does not start with a timestamp is a continuation.
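The first problem is mechanical, so it fits in a few lines. This Python generator assumes that records have already been stripped of the CRI wrapper into (stream, tag, message) tuples. It buffers P fragments separately per stream, because stdout and stderr fragments can interleave. When the closing F entry arrives, it joins the fragments with no separator. The function name is hypothetical.

```python
from collections import defaultdict


def reassemble_partials(records):
    """Join CRI P fragments with the F entry that closes them, per stream."""
    pending = defaultdict(list)
    for stream, tag, message in records:
        pending[stream].append(message)
        if tag == "F":
            # The runtime split one line mid-content, so join with no separator.
            yield stream, "".join(pending.pop(stream))


records = [
    ("stdout", "P", "aaaa"),
    ("stderr", "F", "warning: disk 91% full"),
    ("stdout", "P", "bbbb"),
    ("stdout", "F", "cc"),
]
print(list(reassemble_partials(records)))
```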
Fluent Bit's multiline parsers expose a flush_timeout parameter, and it is critical in production. If the application crashes mid-stack-trace, Fluent Bit waits for this duration before emitting the incomplete group rather than holding it indefinitely. Set it to 2–5 seconds; the value is given in milliseconds, so 2000–5000. Too short and you split events during GC pauses; too long and you delay alert firing during an outage.
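The following minimal sketch covers the second problem by combining the timestamp-starts-a-new-event rule with a flush timeout. The timestamp regex, the timeout in seconds, and the MultilineGrouper class are assumptions made for illustration. Fluent Bit's multiline parsers are configured with their own start-state and continuation rules. Time is passed in explicitly so that the behavior is deterministic.

```python
import re

# Assumption: a new event starts with "YYYY-MM-DD HH:MM:SS" (or a "T" separator).
NEW_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")


class MultilineGrouper:
    def __init__(self, flush_timeout_s=3.0):
        self.flush_timeout_s = flush_timeout_s
        self.buffer = []
        self.last_seen = None

    def feed(self, line, now):
        """Add a line received at time `now`; return any events completed by it."""
        done = self.flush_if_stale(now)
        if NEW_EVENT.match(line) and self.buffer:
            done.append("\n".join(self.buffer))  # a timestamp closes the previous event
            self.buffer = []
        self.buffer.append(line)  # continuation lines simply accumulate
        self.last_seen = now
        return done

    def flush_if_stale(self, now):
        """Emit the buffered group if nothing arrived for flush_timeout_s seconds."""
        if self.buffer and now - self.last_seen >= self.flush_timeout_s:
            event = "\n".join(self.buffer)
            self.buffer = []
            return [event]
        return []


g = MultilineGrouper(flush_timeout_s=3.0)
out = []
out += g.feed("2024-01-15 10:00:00 ERROR boom", now=0.0)
out += g.feed("Traceback (most recent call last):", now=0.1)
out += g.feed('  File "app.py", line 3, in <module>', now=0.2)
out += g.feed("2024-01-15 10:00:01 INFO next", now=1.0)
out += g.flush_if_stale(now=5.0)  # no new lines for 4 s: the last event is flushed
print(out)
```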
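The most robust fix is to avoid multi-line output at the source: configure your logging framework to serialize the stack trace as an escaped string inside the single-line JSON event (for example, an exception.stack_trace field). Log4j2 JSON layout, Logback's logstash-logback-encoder, Python's python-json-logger, and Go's log/slog with a JSON handler all do this natively. When the stack trace is a JSON string value rather than literal newlines, the runtime writes exactly one log file entry per event, and multi-line reassembly becomes unnecessary. This is the approach used by Netflix, Uber, and Shopify.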
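The sketch below shows why this removes the problem by serializing a Python traceback into an exception.stack_trace string field. The event layout is an assumption modeled on that field name. The key point is that json.dumps escapes the traceback's newlines, so the whole event occupies exactly one line of stdout.

```python
import json
import traceback


def error_event(message, exc):
    """Build a single-line JSON log event with the stack trace as a string field."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return json.dumps({
        "level": "ERROR",
        "message": message,
        "exception": {"type": type(exc).__name__, "stack_trace": stack},
    })


def demo():
    try:
        {}["missing"]
    except KeyError as exc:
        return error_event("lookup failed", exc)


print(demo())  # one physical line, however deep the traceback
```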
Kubernetes Metadata Enrichment
Raw CRI logs contain only the log body: no pod name, no namespace, no deployment, no container image tag. The shipper must join this metadata from Kubernetes at collection time. Fluent Bit's built-in kubernetes filter queries the Kubernetes API server by default. When Use_Kubelet is enabled, it queries the local kubelet's pod endpoint (https://<NODE_IP>:10250/pods) instead. Promtail discovers pods through kubernetes_sd_configs, which watches the API server. Both can attach standard metadata such as the namespace, pod name, container name, container image, pod labels, and annotations.
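Before any lookup, the agent needs the pod's identity, which it can recover from the /var/log/containers symlink name. The sketch below parses that name and merges in labels and annotations from a pod_lookup dictionary, which stands in for the API server or kubelet query. The regex and field names are loosely modeled on the kubernetes filter's output and should be treated as assumptions. The parse relies on pod and namespace names never containing underscores. That holds because Kubernetes requires them to be DNS-style names.

```python
import os
import re

# <pod>_<namespace>_<container>-<64 hex container id>.log
CONTAINER_LOG_NAME = re.compile(
    r"^(?P<pod_name>[^_]+)_(?P<namespace_name>[^_]+)_"
    r"(?P<container_name>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def enrich(log_file, record, pod_lookup):
    """Attach pod identity and metadata to a record read from log_file."""
    match = CONTAINER_LOG_NAME.match(os.path.basename(log_file))
    if match is None:
        return record  # not a container log symlink: leave the record untouched
    ids = match.groupdict()
    pod = pod_lookup.get((ids["namespace_name"], ids["pod_name"]), {})
    return {
        **record,
        "kubernetes": {
            **ids,
            "labels": pod.get("labels", {}),
            "annotations": pod.get("annotations", {}),
        },
    }


cid = "3f" * 32
path = f"/var/log/containers/checkout-7d9f8b6c5-x2k4p_shop_api-server-{cid}.log"
lookup = {("shop", "checkout-7d9f8b6c5-x2k4p"): {"labels": {"app": "checkout"}}}
print(enrich(path, {"log": "order placed"}, lookup)["kubernetes"])
```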
Fluent Bit's K8S-Logging.Exclude option is a powerful escape hatch. When it is enabled, pods that generate high-volume, low-value logs (health-check aggregators, metrics scrapers) can opt out of collection by setting the annotation fluentbit.io/exclude: "true" in their pod metadata. Dropping these records at the node is much cheaper than shipping them and filtering them out downstream.
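The opt-out check itself is simple, but it can only run after enrichment, because the annotation is known only once pod metadata has been attached. In this sketch honor_exclude stands in for enabling K8S-Logging.Exclude, and the record layout (a kubernetes.annotations map) is an assumption.

```python
EXCLUDE_ANNOTATION = "fluentbit.io/exclude"


def keep_record(record, honor_exclude=True):
    """Return False for records from pods annotated fluentbit.io/exclude: "true"."""
    if not honor_exclude:
        return True
    annotations = record.get("kubernetes", {}).get("annotations", {})
    return annotations.get(EXCLUDE_ANNOTATION) != "true"


records = [
    {"log": "GET /healthz 200", "kubernetes": {"annotations": {EXCLUDE_ANNOTATION: "true"}}},
    {"log": "order placed", "kubernetes": {"annotations": {}}},
]
print([r["log"] for r in records if keep_record(r)])  # ['order placed']
```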
The kubernetes filter's service account needs a ClusterRole that grants get, list, and watch on pods and namespaces cluster-wide. In clusters with strict RBAC, the most common DaemonSet failure mode is a silent metadata enrichment failure. Fluent Bit logs kube-filter: API call failed at warn level but continues shipping. Logs then arrive at the backend with no namespace or pod-name labels, which makes them unfilterable. Always verify that the ClusterRoleBinding is present after deploying the DaemonSet.
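One way to catch the missing-permission case before deployment is to check the ClusterRole you are about to apply. The sketch below expresses a minimal role as the manifest's data structure (the role name is hypothetical) and reports which of the required core-group permissions it lacks. It does not check the ClusterRoleBinding. In a live cluster, kubectl auth can-i list pods --all-namespaces --as=system:serviceaccount:<namespace>:<service-account> tests the effective permissions of the agent's service account.

```python
READ_VERBS = ("get", "list", "watch")

CLUSTER_ROLE = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "log-agent-read"},  # hypothetical name
    "rules": [
        {"apiGroups": [""], "resources": ["pods", "namespaces"], "verbs": list(READ_VERBS)},
    ],
}


def missing_permissions(role, needed=(("pods", READ_VERBS), ("namespaces", READ_VERBS))):
    """List core-group permissions the role lacks, as 'verb resource' strings."""
    granted = set()
    for rule in role.get("rules", []):
        groups = rule.get("apiGroups", [])
        if "" not in groups and "*" not in groups:
            continue  # pods and namespaces live in the core ("") API group
        for resource in rule.get("resources", []):
            for verb in rule.get("verbs", []):
                granted.add((resource, verb))
    missing = []
    for resource, verbs in needed:
        for verb in verbs:
            candidates = {(resource, verb), ("*", verb), (resource, "*"), ("*", "*")}
            if not candidates & granted:
                missing.append(f"{verb} {resource}")
    return missing


print(missing_permissions(CLUSTER_ROLE))  # [] when the role is complete
```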
When to Use Sidecar Containers Instead
The DaemonSet pattern handles the vast majority of Kubernetes logging needs. The main exception is an application that cannot be modified to write to stdout. Examples include legacy JVM apps that write to rolling files, databases that write binary WAL segments to a data volume, and processes that mix application logs with audit logs that must be shipped to a different backend. In these cases, a sidecar log shipper co-located in the same pod can tail the shared volume and forward to the appropriate destination. The tradeoff is resource overhead: each pod now carries its own Fluent Bit or Vector process, increasing per-pod CPU and memory requirements. At scale (thousands of pods), this cost is significant. Prefer fixing the application to write to stdout over deploying sidecar shippers.
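The resource tradeoff comes down to simple arithmetic. The sketch below compares total agent memory for the two topologies. Every input is an assumption you supply, and the example figures are invented for illustration. The DaemonSet agent gets a larger per-instance budget because it multiplexes many pods.

```python
def agent_overhead(nodes, pods_per_node, daemonset_agent_mib, sidecar_agent_mib):
    """Total agent memory in MiB: (one agent per node, one sidecar per pod)."""
    daemonset = nodes * daemonset_agent_mib
    sidecars = nodes * pods_per_node * sidecar_agent_mib
    return daemonset, sidecars


# Hypothetical cluster: 100 nodes x 30 pods, 200 MiB per node agent, 30 MiB per sidecar.
print(agent_overhead(nodes=100, pods_per_node=30, daemonset_agent_mib=200, sidecar_agent_mib=30))
# (20000, 90000)
```

The sidecar total grows with pod count, whereas the DaemonSet total grows only with node count. This is why the gap widens as pod density rises.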