Log Shippers
A log shipper is the agent that bridges raw log output — files on disk, container stdout, systemd journals, syslog sockets — and your centralized logging backend. Getting this layer right is non-negotiable: a misconfigured shipper silently drops events under load, introduces multi-second latency spikes, or consumes so much CPU it becomes the dominant resource user on a production node. This lesson compares the three tools that dominate real engineering teams today: Filebeat, Fluent Bit, and Vector.
How Tailing Works
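All three agents tail files by watching for appended bytes, either through a kernel notification primitive such as inotify (Linux) or by periodically polling file metadata. Each keeps a registry (Vector calls it a checkpoint): a small state file that records the inode and byte offset of every file it tracks. On restart, the agent seeks directly to the last committed offset, so lines already committed are not reprocessed and unread lines are not skipped. This registry is the first thing you should protect. Losing it (e.g., an ephemeral container wiping /var/lib/filebeat) leaves the agent with no memory of what it has shipped: depending on configuration, it either restarts from the file's current tail, so events that arrived during the gap are gone, or re-reads each file from the beginning and ships duplicates.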
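The registry idea can be sketched in a few lines of Python. This is a minimal illustration, not how any of the three agents actually stores its state: the Registry class, the JSON layout, and read_new_lines are hypothetical names chosen for the example.

```python
import json
import os
import tempfile


class Registry:
    """Tiny state file mapping a log path to its inode and last committed byte offset."""

    def __init__(self, state_path):
        self.state_path = state_path
        self.entries = {}
        if os.path.exists(state_path):
            with open(state_path, "r", encoding="utf-8") as fh:
                self.entries = json.load(fh)

    def offset_for(self, log_path):
        """Return the saved offset, or 0 if the file is unknown or its inode changed."""
        entry = self.entries.get(log_path)
        if entry is None or entry["inode"] != os.stat(log_path).st_ino:
            return 0
        return entry["offset"]

    def commit(self, log_path, offset):
        """Persist the new offset atomically: write a temp file, then rename over the old one."""
        self.entries[log_path] = {"inode": os.stat(log_path).st_ino, "offset": offset}
        directory = os.path.dirname(os.path.abspath(self.state_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(self.entries, fh)
        os.replace(tmp_path, self.state_path)


def read_new_lines(log_path, registry):
    """Read complete lines appended since the last commit, then commit the new offset."""
    lines = []
    with open(log_path, "rb") as fh:
        fh.seek(registry.offset_for(log_path))
        while True:
            line = fh.readline()
            if not line.endswith(b"\n"):
                break  # EOF, or a partial line that is still being written
            lines.append(line[:-1].decode("utf-8", errors="replace"))
        # Do not count the partial line, so it is re-read in full next time.
        offset = fh.tell() - len(line)
    registry.commit(log_path, offset)
    return lines
```

Because the offset is committed only after the lines are read, a crash between reading and committing causes a re-read rather than a gap, which is the at-least-once trade-off most shippers make.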
When logrotate rotates a file without copytruncate, the old file is renamed and a new file with a new inode is created under the original name. If your shipper only watches by filename, it switches to the new (empty) file immediately and misses the final bytes written to the old one. Always configure your shipper to follow the old inode until EOF before switching. In Filebeat, close_inactive controls how long a harvester keeps the old file open; in Fluent Bit, Rotate_Wait sets how long a rotated file is still monitored.
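The "finish the old inode first" rule can be sketched as follows. This is an illustrative Python model for rename-style rotation on Linux, not the implementation of any shipper; RotationAwareTailer is a hypothetical name, and the sketch ignores the small race between checking the path and reopening it.

```python
import os


class RotationAwareTailer:
    """Follows one log path across rename-style rotation without losing the old file's tail."""

    def __init__(self, path):
        self.path = path
        self.fh = open(path, "rb")
        self.inode = os.fstat(self.fh.fileno()).st_ino

    def _drain(self):
        """Return every complete line currently readable from the open handle."""
        lines = []
        while True:
            pos = self.fh.tell()
            line = self.fh.readline()
            if not line.endswith(b"\n"):
                self.fh.seek(pos)  # leave a partial line for the next poll
                return lines
            lines.append(line[:-1].decode("utf-8", errors="replace"))

    def poll(self):
        """Read new lines; if the path now points at a new inode, finish the old one first."""
        lines = self._drain()
        try:
            current_inode = os.stat(self.path).st_ino
        except FileNotFoundError:
            return lines  # rotated away, replacement file not created yet
        if current_inode != self.inode:
            lines += self._drain()  # final bytes written to the old inode
            self.fh.close()
            self.fh = open(self.path, "rb")
            self.inode = os.fstat(self.fh.fileno()).st_ino
            lines += self._drain()
        return lines
```

The open file handle keeps pointing at the old inode after the rename, which is what lets the tailer read the last lines before it switches to the new file.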
Filebeat
Filebeat is the Elastic-native shipper, written in Go. Its main job is reliable delivery to Elasticsearch or Logstash. It shines in environments already running the ELK stack, where its autodiscover hints, module ecosystem (nginx, PostgreSQL, AWS, etc.), and built-in Ingest Node pipeline support reduce boilerplate. CPU footprint is low; memory footprint is moderate because it buffers an internal queue of events before acknowledging them.
Fluent Bit
Fluent Bit is the lightweight sibling of Fluentd, written in C. At roughly 650 KB binary and under 10 MB RSS at rest, it is the default DaemonSet shipper in almost every managed Kubernetes distribution (EKS, GKE, AKS all ship it). Its pipeline model (Input → Parser → Filter → Buffer → Output) is explicit and composable. The tail input has a dedicated multi-line parser that handles stack traces correctly, which matters enormously for Java/Python services.
When the data buffered for an input reaches its Mem_Buf_Limit, the tail input pauses reading. This protects the agent from OOM kills on log bursts. Pair it with filesystem-backed storage (storage.path) so chunks that exceed the memory cap spill to disk rather than being dropped.
Vector
Vector, by Datadog (open source), is the newest of the three and the most ambitious. It models everything — agents, aggregators, and transformers — as a single binary with a unified topology graph. Its remap language (VRL — Vector Remap Language) is a purpose-built, sandboxed expression language for log transformation that is significantly more powerful than Filebeat processors or Fluent Bit Lua filters. Vector also publishes internal observability metrics about itself, so you can alert on shipper throughput and error rates.
Parsing and Enrichment
Raw log lines are useless for querying at scale — you need structured fields. The key parsing patterns are:
- JSON passthrough: if your service already emits JSON (it should; see Lesson 2), the shipper only needs to parse each line as JSON. Use Filebeat's json.* keys, Fluent Bit's json parser, or Vector's parse_json!().
- Grok/regex: for legacy plaintext logs (nginx access logs, Apache). Fluent Bit handles these with regex parsers, and Vector's VRL provides parse_regex and parse_grok; Filebeat typically hands grok parsing off to an Elasticsearch ingest pipeline or Logstash. Pattern matching on every line is slow, so if you own the service, switch to structured logging instead.
- Multiline stitching: stack traces span many lines. All three agents support regex-based multiline aggregation. Size the timeout conservatively (2–5 s): too short and you split traces, too long and you add latency to every event.
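Multiline stitching is the least obvious of these patterns, so here is a minimal Python sketch of the logic: a line matching a "start of event" regex closes the previous event, and an idle timeout flushes whatever is still buffered. The class name, the start pattern, and the injectable clock are assumptions made for the example; each agent has its own configuration syntax for the same idea.

```python
import re
import time


class MultilineStitcher:
    """Groups continuation lines (e.g. stack frames) with the line that started the event."""

    def __init__(self, start_pattern, timeout_s=3.0, clock=time.monotonic):
        self.start = re.compile(start_pattern)
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.buffer = []
        self.last_seen = None

    def feed(self, line):
        """Add one raw line; return the previous event if this line starts a new one."""
        completed = None
        if self.start.match(line) and self.buffer:
            completed = "\n".join(self.buffer)
            self.buffer = []
        self.buffer.append(line)
        self.last_seen = self.clock()
        return completed

    def flush_if_idle(self):
        """Return the buffered event if no line arrived within the timeout, else None."""
        if self.buffer and self.clock() - self.last_seen >= self.timeout_s:
            completed = "\n".join(self.buffer)
            self.buffer = []
            return completed
        return None
```

The timeout only matters for the last event before a quiet period: with a start pattern such as ^\d{4}-\d{2}-\d{2}, every other event is released as soon as the next timestamped line arrives.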
Enrichment adds metadata that wasn't in the original log: hostname, Kubernetes pod labels, cloud region, deploy environment. Do this at the shipper layer, not the indexer. You pay for storage on every enriched field either way, and it is far cheaper to enrich once at the edge than to add a pipeline step in a hot Elasticsearch or Loki ingest path.
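As a rough illustration of JSON passthrough followed by edge enrichment, the Python sketch below parses a line, falls back to a plain message field when the line is not a JSON object, and merges in static metadata without overwriting anything the application set. The function name and the field names (message, parse_error, host, environment) are hypothetical.

```python
import json


def parse_and_enrich(raw_line, static_fields):
    """Parse one log line as JSON and add edge metadata the application did not provide."""
    try:
        event = json.loads(raw_line)
        if not isinstance(event, dict):
            # Valid JSON but not an object (e.g. a bare number): keep the raw text.
            event = {"message": raw_line}
    except json.JSONDecodeError:
        # Not JSON at all: ship it as plain text and flag it for later inspection.
        event = {"message": raw_line, "parse_error": True}
    for key, value in static_fields.items():
        event.setdefault(key, value)  # never overwrite a field set by the application
    return event


if __name__ == "__main__":
    metadata = {"host": "node-1", "environment": "staging"}  # hypothetical values
    print(parse_and_enrich('{"level": "error", "msg": "boom"}', metadata))
```

Using setdefault rather than plain assignment is a design choice: if the application already emits its own host or environment, the shipper's value does not silently replace it.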
Backpressure and Buffering
Backpressure is what keeps your shipper from becoming an unbounded memory leak when the downstream (Elasticsearch, Loki, Kafka) is slow or unavailable. Each tool handles it differently:
- Filebeat: uses an in-memory queue (configurable size) with an optional disk spool. When the queue is full, the harvester (file reader) pauses. With the default memory queue, a Filebeat crash loses buffered-but-unacked events. For high-reliability deployments, switch to the disk queue (queue.disk in recent Filebeat versions).
- Fluent Bit: filesystem-backed chunks (storage.type filesystem) survive restarts. The Mem_Buf_Limit cap on each input triggers pause-on-full, and storage.max_chunks_up controls how many chunks can be in memory at once. Without filesystem storage, a node reboot loses everything buffered.
- Vector: explicit per-sink buffers, either memory (fast, lossy on crash) or disk (durable, I/O cost). The when_full policy lets you choose between block (apply backpressure upstream, risking slowdown) and drop_newest (accept loss, protect throughput). Choose block for error logs; drop_newest for high-volume debug streams.
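The difference between the two full-buffer policies is easiest to see in code. The Python sketch below models a bounded in-memory buffer with a block or drop_newest policy; it borrows Vector's policy names for readability, but SinkBuffer and its methods are hypothetical and say nothing about how any agent implements its buffers.

```python
import queue


class SinkBuffer:
    """Bounded in-memory buffer with a configurable policy for when it is full."""

    def __init__(self, max_events, when_full="block"):
        if when_full not in ("block", "drop_newest"):
            raise ValueError("when_full must be 'block' or 'drop_newest'")
        self.events = queue.Queue(maxsize=max_events)
        self.when_full = when_full
        self.dropped = 0  # expose this as a metric so drops are never silent

    def push(self, event, timeout=None):
        """Return True if the event was buffered, False if it was dropped."""
        if self.when_full == "block":
            # The producer waits here: this is backpressure reaching the reader.
            # With a timeout, queue.Full is raised if no space frees up in time.
            self.events.put(event, timeout=timeout)
            return True
        try:
            self.events.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1  # the incoming (newest) event is discarded
            return False

    def pop(self):
        """Remove and return the oldest buffered event (called by the output)."""
        return self.events.get_nowait()
```

With block, a slow output eventually stalls the file reader; with drop_newest, the reader keeps its pace and the dropped counter is the only evidence that events were lost.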
Tool Selection Cheatsheet
Common Production Failure Modes
- Registry loss or corruption: if the registry file is deleted or truncated (e.g., it lives on an emptyDir volume in Kubernetes), a shipper configured to read files from the beginning re-sends everything from the file start. Duplicate events flood Elasticsearch and trigger alerting storms. Mitigation: use a hostPath volume for the registry, not an ephemeral pod volume.
- Harvester leak: Filebeat opens a file handle per harvested file. In environments with thousands of rotating log files, this exhausts the process's open-file limit (ulimit -n). Set close_inactive aggressively and raise the OS limit, for example with LimitNOFILE in a systemd unit or through the container runtime's ulimit settings for a DaemonSet.
- Output timeout cascades: if the backend is slow, the output blocks, the internal queue fills, and the harvester pauses. Meanwhile the application keeps writing to the log file. If the file is rotated and then deleted or compressed during this pause, the unread data is lost. Use a persistent disk buffer so the shipper can keep reading and drain to the backend asynchronously.
- Clock skew and out-of-order events: container clocks and host clocks diverge. Always enrich with a server-side timestamp in the shipper and preserve the application timestamp as a separate field (app_timestamp). Never use ingestion time as your only timestamp.
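The timestamp advice in the last bullet can be sketched as a small transformation. In this illustrative Python version, the incoming field name timestamp and the added field shipper_timestamp are assumptions; only app_timestamp comes from the text above.

```python
from datetime import datetime, timezone


def stamp_event(event, now=None):
    """Keep the application's timestamp under app_timestamp and add the shipper's own."""
    stamped = dict(event)  # do not mutate the caller's event
    if "timestamp" in stamped:
        stamped["app_timestamp"] = stamped.pop("timestamp")
    observed = now if now is not None else datetime.now(timezone.utc)
    stamped["shipper_timestamp"] = observed.isoformat()
    return stamped


if __name__ == "__main__":
    print(stamp_event({"timestamp": "2024-01-01T11:59:58Z", "msg": "hello"}))
```

Keeping both values lets you order events by the application's clock while still measuring pipeline delay and spotting hosts whose clocks have drifted.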
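Monitor each shipper's own metrics endpoint (Fluent Bit: /api/v1/metrics on port 2020; Vector: Prometheus scrape on port 9598) and alert on output_events_dropped_total > 0; the exact counter name varies by tool and version. Silent drops are the hardest logging failure to detect after the fact.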
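In practice this alert belongs in your monitoring system, but the check itself is simple enough to sketch. The Python function below scans Prometheus text-format output for counters whose name ends in dropped_total and reports any with a value above zero. The function name and the suffix match are assumptions for the example; adjust them to the metric names your shipper actually exposes.

```python
def dropped_event_counters(exposition_text, suffix="dropped_total"):
    """Return {series: value} for counters ending in `suffix` whose value is above zero."""
    offenders = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line.split("{", 1)[0]
            series, _, tail = line.rpartition("}")
            series += "}"
        else:
            name, _, tail = line.partition(" ")
            series = name
        fields = tail.split()
        if not fields or not name.endswith(suffix):
            continue
        value = float(fields[0])  # an optional trailing timestamp is ignored
        if value > 0:
            offenders[series] = value
    return offenders
```

Any non-empty result means at least one output has discarded events since the process started, which is the condition worth paging on.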