Logging at Scale: ELK & Loki

Log Shippers

A log shipper is the agent that bridges raw log output (files on disk, container stdout, systemd journals, syslog sockets) and your centralized logging backend. Getting this layer right is non-negotiable: a misconfigured shipper can silently drop events under load, add seconds of delivery latency, or consume enough CPU to become the dominant resource user on a production node. This lesson compares three widely deployed shippers: Filebeat, Fluent Bit, and Vector.

How Tailing Works

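All three agents follow files with the same basic loop: discover files that match a glob, open them, and read whatever has been appended since the last pass. They differ in how they notice new data: Fluent Bit uses inotify on Linux by default, while Filebeat and Vector poll on a configurable interval. Each holds a registry, a small state file that records a file identity (the inode, or a content fingerprint in Vector's default configuration) and the byte offset for every file it tracks. On restart, the agent seeks to the last committed offset instead of starting over; delivery is at-least-once at best, so a crash can still produce a few duplicates. This registry is the first thing you should protect. Losing it (e.g., an ephemeral container wiping /var/lib/filebeat) leaves the agent with no memory of what it has shipped: depending on configuration, it either re-reads every file from the beginning and floods the backend with duplicates, or starts from each file's current tail and loses the events that arrived during the gap.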
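The registry idea can be sketched in a few lines of Python. This is an illustration only, not how any of the three agents is implemented: the Registry class, the JSON layout, and read_new_lines are hypothetical names, and the offset is committed right after reading, whereas a real shipper would wait for the backend to acknowledge the batch.

```python
import json
import os
import tempfile


class Registry:
    """Persist an (inode, offset) pair per log path in a small JSON file."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.state = json.load(fh)

    def get_offset(self, log_path, inode):
        entry = self.state.get(log_path)
        # Trust the saved offset only if it belongs to the same inode.
        if entry and entry["inode"] == inode:
            return entry["offset"]
        return 0

    def commit(self, log_path, inode, offset):
        self.state[log_path] = {"inode": inode, "offset": offset}
        # Write to a temp file and rename it into place, so a crash
        # never leaves a half-written registry behind.
        directory = os.path.dirname(self.path) or "."
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(self.state, fh)
        os.replace(tmp, self.path)


def read_new_lines(log_path, registry):
    """Return complete lines appended since the last committed offset."""
    inode = os.stat(log_path).st_ino
    offset = registry.get_offset(log_path, inode)
    lines = []
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        while True:
            line = fh.readline()
            if not line.endswith(b"\n"):
                break  # EOF, or a partial line that is still being written
            lines.append(line.rstrip(b"\n").decode("utf-8", "replace"))
            offset = fh.tell()
    registry.commit(log_path, inode, offset)
    return lines
```

Calling read_new_lines repeatedly returns only what was appended since the previous call, and a fresh Registry loaded from the same file resumes at the same offset. Deleting the JSON file makes the next call start again from offset 0, which is the duplicate-flood case described above.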

Log rotation and inodes: when logrotate rotates a file without copytruncate, it renames the old file (which keeps its inode) and creates a new file with a new inode under the original name. If your shipper tracks only the filename, it switches to the new (empty) file immediately and misses the final bytes written to the old one. Configure your shipper to keep reading the old inode until EOF before it lets go: in Filebeat the harvester stays on the renamed file until close_inactive elapses (as long as close_renamed is left off), and Fluent Bit keeps monitoring a rotated file for Rotate_Wait seconds.
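A minimal sketch of the "drain the old inode first" rule, assuming rename-and-create rotation on a POSIX filesystem. RotationAwareTail is a hypothetical class written for this lesson; real shippers also keep the old handle open for a grace period (which is what Rotate_Wait provides), and this sketch omits that.

```python
import os


class RotationAwareTail:
    """Follow a path across rename-and-create rotation."""

    def __init__(self, path):
        self.path = path
        self.fh = open(path, "rb")
        self.inode = os.fstat(self.fh.fileno()).st_ino

    def _drain(self):
        """Read every complete line currently available on the open handle."""
        lines = []
        while True:
            pos = self.fh.tell()
            line = self.fh.readline()
            if not line.endswith(b"\n"):
                self.fh.seek(pos)  # leave a partial line for the next poll
                break
            lines.append(line.rstrip(b"\n").decode("utf-8", "replace"))
        return lines

    def poll(self):
        # Read the handle we already hold first. After a rename it still
        # points at the old inode, so the old file's last lines are kept.
        lines = self._drain()
        try:
            current_inode = os.stat(self.path).st_ino
        except FileNotFoundError:
            return lines  # rotated, but the new file does not exist yet
        if current_inode != self.inode:
            # The name now refers to a different file: switch to it.
            self.fh.close()
            self.fh = open(self.path, "rb")
            self.inode = os.fstat(self.fh.fileno()).st_ino
            lines.extend(self._drain())
        return lines
```

The important detail is the order inside poll: the inode comparison happens only after the existing handle has been read to EOF. Checking the filename first and reopening immediately is exactly the behavior that loses the tail of the rotated file.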

Filebeat

Filebeat is the Elastic-native shipper, written in Go. Its main job is reliable delivery to Elasticsearch or Logstash. It shines in environments already running the ELK stack, where its autodiscover hints, module ecosystem (nginx, PostgreSQL, AWS, etc.), and built-in Ingest Node pipeline support reduce boilerplate. CPU footprint is low; memory footprint is moderate because it buffers an internal queue of events before acknowledging them.

    # filebeat.yml: tail a JSON app log, enrich, ship to Elasticsearch
    filebeat.inputs:
      - type: log
        enabled: true
        paths:
          - /var/log/myapp/*.log
        json.keys_under_root: true
        json.add_error_key: true
        fields:
          env: production
          service: payment-api
        fields_under_root: true
        close_inactive: 5m     # release file handle after 5 min quiet
        ignore_older: 24h      # skip files not modified in 24 h

    processors:
      - drop_fields:
          fields: ["agent.ephemeral_id", "ecs.version"]
      - rename:
          fields:
            - from: "log.level"
              to: "level"

    output.elasticsearch:
      hosts: ["https://es-cluster:9200"]
      username: "${ES_USER}"
      password: "${ES_PASS}"
      # fields_under_root puts env at the top level, so reference [env], not [fields.env]
      index: "myapp-%{[env]}-%{+yyyy.MM.dd}"
      bulk_max_size: 2048
      worker: 4

    # A custom index name also needs a matching template name and pattern
    setup.template.name: "myapp"
    setup.template.pattern: "myapp-*"

    queue.mem:
      events: 8192             # in-memory buffer depth
      flush.min_events: 512
      flush.timeout: 5s

Fluent Bit

Fluent Bit is the lightweight sibling of Fluentd, written in C. At roughly 650 KB binary and under 10 MB RSS at rest, it is one of the most common DaemonSet shippers on Kubernetes, and the logging integrations for EKS, GKE, and AKS all build on it. Its pipeline model (Input → Parser → Filter → Buffer → Output) is explicit and composable. The tail input has dedicated multiline parsers that keep stack traces together as single events, which matters enormously for Java/Python services.

    # fluent-bit.conf: Kubernetes pod log tailing with enrichment
    [SERVICE]
        Flush                  5
        Daemon                 Off
        Log_Level              warn
        storage.path           /var/log/flb-storage/
        storage.sync           normal
        storage.checksum       off
        storage.max_chunks_up  128

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        multiline.parser  cri, docker
        Tag               kube.*
        Refresh_Interval  10
        Rotate_Wait       30
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        # Spill chunks to storage.path instead of keeping them only in memory
        storage.type      filesystem
        # Offset registry (comments must sit on their own line in this format)
        DB                /var/log/flb_kube.db

    [FILTER]
        Name                 kubernetes
        Match                kube.*
        Kube_URL             https://kubernetes.default.svc:443
        Kube_CA_File         /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File      /var/run/secrets/kubernetes.io/serviceaccount/token
        Merge_Log            On
        Keep_Log             Off
        K8S-Logging.Parser   On
        K8S-Logging.Exclude  On
        Labels               On
        Annotations          Off

    [FILTER]
        Name    record_modifier
        Match   kube.*
        Record  cluster prod-us-east-1

    [OUTPUT]
        Name                    loki
        Match                   kube.*
        Host                    loki.monitoring.svc.cluster.local
        Port                    3100
        Labels                  job=fluentbit,namespace=$kubernetes['namespace_name'],pod=$kubernetes['pod_name']
        label_keys              $level,$service
        auto_kubernetes_labels  On
        # Batch_Size and Batch_Wait are options of Grafana's separate grafana-loki
        # plugin, not of the built-in loki output, so they are omitted here.

Mem_Buf_Limit is your first line of backpressure defence in Fluent Bit. When an input's in-memory chunks reach this limit, the tail input pauses reading, which protects the agent from OOM kills during log bursts. Pair it with filesystem-backed storage (storage.path in [SERVICE] plus storage.type filesystem on the input) so that chunks beyond the memory cap spill to disk rather than pausing the input or being lost.

Vector

Vector, an open-source project maintained by Datadog and written in Rust, is the newest of the three and the most ambitious. It models everything (agents, aggregators, and transformers) as a single binary with a unified topology graph. Its transformation language, VRL (Vector Remap Language), is a purpose-built, sandboxed expression language for log transformation; it is more expressive than Filebeat's processors and, unlike a Fluent Bit Lua filter, is type-checked when the configuration loads. Vector also publishes internal observability metrics about itself, so you can alert on shipper throughput and error rates.

    # vector.yaml: tail files, parse, enrich, route by log level
    sources:
      app_logs:
        type: file
        include:
          - /var/log/myapp/*.log
        read_from: beginning          # use "end" in production daemonsets
        ignore_older_secs: 86400
        fingerprint:
          strategy: checksum          # identify files by a checksum of their first bytes, not by inode
        multiline:
          start_pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
          condition_pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
          mode: halt_before
          timeout_ms: 2000

    transforms:
      parse_json:
        type: remap
        inputs: [app_logs]
        source: |
          . = parse_json!(string!(.message))
          .env = get_env_var!("DEPLOY_ENV")
          .host = get_hostname!()
          if exists(.duration_ms) {
            .duration_ms = to_float!(.duration_ms)
          }
      route_by_level:
        type: route
        inputs: [parse_json]
        route:
          errors: '.level == "error" || .level == "fatal"'
          debug: '.level == "debug"'

    sinks:
      loki_errors:
        type: loki
        inputs: [route_by_level.errors]
        endpoint: http://loki:3100
        encoding:
          codec: json
        labels:
          level: "{{ level }}"
          service: "{{ service }}"
        buffer:
          type: disk
          max_size: 268435488         # about 256 MiB, the smallest disk buffer Vector accepts
      loki_default:
        type: loki
        # Everything that is not an error: debug events plus events matching no route.
        # Listing parse_json here as well would deliver every event twice.
        inputs: [route_by_level.debug, route_by_level._unmatched]
        endpoint: http://loki:3100
        encoding:
          codec: json
        labels:
          level: "{{ level }}"
        buffer:
          type: memory
          max_events: 10000
          when_full: drop_newest      # drop debug logs rather than OOM

Parsing and Enrichment

Raw log lines are useless for querying at scale — you need structured fields. The key parsing patterns are:

  • JSON passthrough: if your service already emits JSON (it should — see Lesson 2), the shipper just needs to parse the outer string. Use Filebeat's json.* keys, Fluent Bit's json parser, or Vector's parse_json!().
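  • Grok/regex: for legacy plaintext logs (nginx access logs, Apache). Filebeat itself offers only the lighter dissect processor and leaves grok to Elasticsearch ingest pipelines or Logstash; Fluent Bit uses named regex parsers; Vector's VRL provides parse_regex and parse_grok. Regex-heavy parsing is CPU-expensive at high volume, so if you own the service, switch to structured logging instead.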
  • Multiline stitching: stack traces span many lines. All three agents support regex-based multiline aggregation. Size the timeout conservatively (2–5 s) — too short and you split traces, too long and you add latency to every event.
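Multiline stitching is easier to reason about with a small example. The sketch below groups lines under the most recent line that matches a start pattern. The date-prefix pattern and the stitch function are assumptions made for illustration, and the timeout-based flush that real shippers perform is reduced to a final flush at the end of input.

```python
import re

# Assumption: every new event starts with an ISO-style date such as 2024-05-01.
START = re.compile(r"^\d{4}-\d{2}-\d{2}")


def stitch(lines, start_pattern=START):
    """Attach continuation lines to the most recent line matching start_pattern."""
    events, current = [], []
    for line in lines:
        # A new start line closes the event collected so far.
        if start_pattern.match(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        # A real shipper emits this last event when its multiline timeout expires.
        events.append("\n".join(current))
    return events


raw = [
    "2024-05-01 12:00:00 ERROR request failed",
    "Traceback (most recent call last):",
    '  File "app.py", line 10, in handler',
    "ValueError: bad input",
    "2024-05-01 12:00:01 INFO recovered",
]
events = stitch(raw)
```

Here events holds two entries: the error with its three traceback lines attached, and the INFO line on its own. The final flush is where the timeout trade-off appears: the shipper cannot know the last event is complete until either a new start line arrives or the timer fires.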

Enrichment adds metadata that wasn't in the original log: hostname, Kubernetes pod labels, cloud region, deploy environment. Do this at the shipper layer, not the indexer: the shipper already has the local context in hand, and enriching once at the edge is far cheaper than adding a pipeline step to a hot Elasticsearch or Loki ingest path. You pay for storage on every enriched field either way, so add only the fields you will actually query on.
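As a rough illustration of JSON passthrough plus edge enrichment, the following sketch parses one line and attaches shipper-side fields. The enrich function, the timestamp key name, and the parse_error flag are assumptions chosen for the example; the app_timestamp field mirrors the convention recommended later in this lesson.

```python
import json
import socket
from datetime import datetime, timezone


def enrich(raw_line, static_fields, now=None):
    """Parse one JSON log line and attach shipper-side metadata."""
    now = now or datetime.now(timezone.utc)
    try:
        event = json.loads(raw_line)
        if not isinstance(event, dict):
            raise ValueError("log line is not a JSON object")
    except ValueError:
        # Keep unparseable lines instead of dropping them, and flag the failure.
        event = {"message": raw_line, "parse_error": True}
    # Preserve the application's own timestamp under a separate key.
    if "timestamp" in event:
        event["app_timestamp"] = event.pop("timestamp")
    event["timestamp"] = now.isoformat()
    # Do not overwrite a host the application already reported.
    event.setdefault("host", socket.gethostname())
    event.update(static_fields)
    return event
```

A call such as enrich(line, {"env": "production", "service": "payment-api"}) returns a dict with the parsed fields, both timestamps, the host, and the static fields. Lines that fail to parse are kept as a message with parse_error set, so a malformed line shows up in the backend instead of vanishing.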

Backpressure and Buffering

Backpressure is what keeps your shipper from becoming an unbounded memory leak when the downstream (Elasticsearch, Loki, Kafka) is slow or unavailable. Each tool handles it differently:

  • Filebeat: uses an in-memory queue (configurable size) by default, with an optional disk queue. When the queue is full, the harvester (file reader) pauses. With the default memory queue, a Filebeat crash loses buffered-but-unacked events from memory, although those whose source files still exist are re-read from the registry offset. For high-reliability deployments, switch to the disk queue (queue.disk in recent Filebeat versions).
  • Fluent Bit: filesystem-backed chunks (storage.type filesystem) survive restarts. The Mem_Buf_Limit cap on each input triggers pause-on-full. The storage.max_chunks_up controls how many chunks can be in memory at once. Without filesystem storage, a node reboot loses everything buffered.
  • Vector: explicit per-sink buffers — either memory (fast, lossy on crash) or disk (durable, I/O cost). The when_full policy lets you choose between block (apply backpressure upstream, risking slowdown) and drop_newest (accept loss, protect throughput). Choose block for error logs; drop_newest for high-volume debug streams.
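The difference between blocking and dropping when a buffer fills can be shown with a toy bounded buffer. BoundedBuffer is a hypothetical class written for this lesson; its policy names follow Vector's when_full values, but the code does not reproduce any shipper's internals.

```python
from collections import deque


class BoundedBuffer:
    """Fixed-capacity event buffer with a configurable full-buffer policy."""

    def __init__(self, max_events, when_full="block"):
        if when_full not in ("block", "drop_newest"):
            raise ValueError("when_full must be 'block' or 'drop_newest'")
        self.max_events = max_events
        self.when_full = when_full
        self.events = deque()
        self.dropped = 0

    def offer(self, event):
        """Return True if the reader may continue, False if it must pause."""
        if len(self.events) < self.max_events:
            self.events.append(event)
            return True
        if self.when_full == "drop_newest":
            self.dropped += 1  # count drops so they can be alerted on
            return True
        # block: the event is NOT accepted; the reader keeps it and retries.
        return False

    def drain(self, n):
        """Remove up to n events, as a sink does after a successful send."""
        batch = []
        while self.events and len(batch) < n:
            batch.append(self.events.popleft())
        return batch
```

With the block policy a full buffer returns False and the file reader stops advancing its offset, so nothing is lost as long as the file is still on disk. With drop_newest the reader never stalls, but the dropped counter climbs, which is the signal to alert on.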
Production sizing rule of thumb: allocate a disk buffer equal to roughly 30 minutes of peak ingest volume per shipper node. If your service generates 50 MB/min of logs at peak, provision ~1.5 GB of disk buffer per node. This gives you 30 minutes of downstream outage before you start losing data — long enough to page an engineer and remediate most incidents.
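The rule of thumb is plain arithmetic, sketched here so it can be reused with your own numbers. The disk_buffer_bytes helper and its headroom parameter are hypothetical, and the calculation uses decimal megabytes (1 MB = 1,000,000 bytes).

```python
def disk_buffer_bytes(peak_mb_per_min, outage_minutes=30, headroom=1.0):
    """Bytes of disk buffer needed to absorb an outage at peak ingest rate."""
    return int(peak_mb_per_min * outage_minutes * headroom * 1_000_000)


# The worked example from the text: 50 MB/min for 30 minutes.
needed = disk_buffer_bytes(50)  # 1_500_000_000 bytes, i.e. about 1.5 GB
```

Raising headroom above 1.0 is one way to leave room for bursts above the measured peak.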

Tool Selection Cheatsheet

Log Shipper Selection Guide

Tool       | Memory footprint | Transform power           | Best fit
Filebeat   | ~60–120 MB       | Moderate (processors)     | ELK-native; rich modules + disk queue option; ingest pipelines offload work
Fluent Bit | <10 MB at rest   | Basic–moderate (Lua/WASM) | Kubernetes DaemonSets; smallest binary (~650 KB); great k8s metadata filter
Vector     | ~30–80 MB        | High (VRL language)       | Complex routing / multi-sink; also acts as aggregator; built-in internal metrics

Typical flow: Log file → Shipper (tail + parse + enrich) → Buffer → Backend
Log shipper pipeline: events flow left to right; backpressure flows right to left when the backend is slow.

Common Production Failure Modes

  • Registry corruption: if the registry file is deleted or truncated (e.g., it lives on an emptyDir volume in Kubernetes and the pod is rescheduled), a shipper configured to read from the beginning re-sends every file from the start. Duplicate events flood Elasticsearch and trigger alerting storms. Mitigation: use a hostPath volume for the registry, not an ephemeral pod volume.
  • Harvester leak: Filebeat opens a file handle per harvested file. In environments with thousands of rotating log files, this exhausts the process's open-file limit (ulimit -n). Set close_inactive aggressively and raise the limit where the process is actually started: LimitNOFILE in the systemd unit, or the container runtime's ulimit settings for a DaemonSet (the Kubernetes securityContext has no ulimit field).
  • Output timeout cascades: if the backend is slow, the output blocks, the internal queue fills, and the harvester pauses. Meanwhile the log file keeps writing. If the file rotates during this pause, unread data is lost after rotation. Use a persistent disk buffer so the shipper can drain asynchronously.
  • Clock skew and out-of-order events: container clocks and host clocks diverge. Always enrich with a server-side timestamp in the shipper and preserve the application timestamp as a separate field (app_timestamp). Never use ingestion time as your only timestamp.
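Operational best practice: expose each shipper's internal metrics (Filebeat: HTTP monitoring endpoint on port 5066; Fluent Bit: /api/v1/metrics on port 2020; Vector: Prometheus scrape on port 9598) and alert on any increase in the dropped-event counters. None of these endpoints is on by default, and the counter names differ per tool and version: look for libbeat.output.events.dropped in Filebeat, fluentbit_output_dropped_records_total in Fluent Bit, and vector_component_discarded_events_total in Vector. Silent drops are the hardest logging failure to detect after the fact.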
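A drop check usually lives in your alerting system, but the logic is simple enough to sketch. The function below sums one counter from Prometheus text exposition output; sum_counter is a hypothetical helper, the sample text is made up for the example, and the counter name used should be checked against what your shipper version actually exports.

```python
def sum_counter(metrics_text, metric_name):
    """Sum every sample of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name != metric_name:
            continue
        # The value is the first token after the label set, if there is one.
        tail = line.rsplit("}", 1)[-1] if "{" in line else line[len(name):]
        total += float(tail.split()[0])
    return total


# Made-up sample in the shape a Prometheus-format endpoint returns.
sample = """\
# HELP fluentbit_output_dropped_records_total Dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="loki.0"} 3
fluentbit_output_dropped_records_total{name="es.1"} 0
fluentbit_output_retries_total{name="loki.0"} 7
"""
dropped = sum_counter(sample, "fluentbit_output_dropped_records_total")
```

For this sample, dropped is 3.0. Counters only ever increase, so in practice you would alert on the change between two scrapes and not on the absolute value.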