The ELK Stack
The ELK Stack (Elasticsearch, Logstash, and Kibana) is one of the most widely deployed open-source centralized logging solutions in the industry. Large organisations such as Netflix, LinkedIn, Uber, and Goldman Sachs run ELK at petabyte scale. Understanding its architecture is the foundation for operational log work at that scale.
The Three-Component Architecture
Each component has one job, and the separation is intentional. Breaking that boundary is one of the most common causes of production ELK incidents.
- Elasticsearch: a distributed, inverted-index search and analytics engine. It stores logs, indexes every field, and executes queries across terabytes in milliseconds. Logs are written to time-stamped indices (logs-app-2025.06.11). Each index is sharded across nodes and replicated for fault-tolerance.
- Logstash (or Elastic Ingest Nodes): the ingestion and transformation layer. It receives raw log streams, parses them into structured JSON (using Grok, Dissect, CSV, JSON filters), enriches fields (GeoIP, user-agent, DNS), and forwards to Elasticsearch. It also acts as a buffer under backpressure.
- Kibana: the visualisation and query interface. Engineers use the Discover tab to run KQL (Kibana Query Language) searches, build dashboards with aggregation charts, and set alerting rules. In production it is also the entry point for APM and Fleet management.
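The time-stamped index naming convention used by the storage layer can be sketched in a few lines of Python. The daily_index_name helper and the logs-app prefix are illustrative assumptions, not part of any Elastic client library.

```python
from datetime import datetime, timezone


def daily_index_name(prefix: str, ts: datetime) -> str:
    """Return a time-stamped index name such as logs-app-2025.06.11.

    Hypothetical helper: it mirrors the naming pattern described in the
    text and normalises to UTC so that all writers agree on the day.
    """
    if ts.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m.%d}"


example = daily_index_name(
    "logs-app", datetime(2025, 6, 11, 14, 30, tzinfo=timezone.utc)
)
print(example)
```

Deriving the name from UTC rather than local time is a design choice in this sketch: it keeps events from producers in different time zones in the same daily index.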
Elasticsearch: What Operators Must Know
Elasticsearch uses an inverted index: every token in every field is indexed at write time. This is why full-text search across billions of log lines can be sub-second, but it also makes writes expensive, and on-disk size can reach 3–5× the raw log volume depending on mappings and replica count. Each index is divided into primary shards (write targets) and replica shards (read scaling and HA). Poor shard sizing, whether too many small shards or too few oversized ones, is the most common cause of Elasticsearch cluster degradation in production.
Set number_of_shards explicitly in your index template.
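As a rough illustration of an explicit shard setting, the Python sketch below builds a request body in the shape accepted by the composable index template API. The sizing heuristic (about 30 GB per primary shard), the function names, and the logs-app-* pattern are assumptions made for this example; the right numbers depend on your hardware and query load.

```python
import json
import math


def primary_shard_count(daily_gb: float, target_shard_gb: float = 30.0) -> int:
    """Estimate primary shards for one daily index.

    Assumed heuristic: aim for roughly target_shard_gb per primary shard.
    """
    return max(1, math.ceil(daily_gb / target_shard_gb))


def build_index_template(pattern: str, daily_gb: float, replicas: int = 1) -> dict:
    """Build an index template body with explicit shard settings."""
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "number_of_shards": primary_shard_count(daily_gb),
                "number_of_replicas": replicas,
            }
        },
    }


# Hypothetical workload: about 120 GB of indexed data per day.
template = build_index_template("logs-app-*", daily_gb=120)
print(json.dumps(template, indent=2))
```

A body like this could be sent with a PUT to _index_template/<name>; the template name and the transport (curl, an HTTP client, or an Elasticsearch client library) are left open here.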
Logstash: Ingestion Pipeline
A Logstash pipeline has three sections: input (where logs come from), filter (how they are parsed and enriched), and output (where they go). The most important filter is Grok, which matches free-text log lines against named regex patterns.
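To show what a Grok-style filter does, here is a minimal Python sketch that uses named regex groups to turn a free-text line into structured fields. It is not Grok itself: the regex is a simplified stand-in for a Grok expression such as %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}, and the log format and field names are assumed for illustration.

```python
import re
from typing import Optional

# Simplified stand-in for the Grok patterns named in the lead-in.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of a log line, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


event = parse_line("2025-06-11T14:30:05Z ERROR payment gateway timed out")
print(event)

# Lines that do not match return None; Logstash's grok filter instead tags
# such events (by default with _grokparsefailure) so they can be inspected.
print(parse_line("not a structured line"))
```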
Kibana: Querying and Dashboards
Kibana connects to Elasticsearch via its REST API and provides the Discover view for ad-hoc investigation and the Dashboard editor for persistent visualisations. The query language is KQL (Kibana Query Language), a simplified syntax over Elasticsearch DSL.
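To make the relationship between KQL and the Elasticsearch Query DSL concrete, the sketch below hand-translates one trivial KQL form (field:value terms joined by "and") into a bool query. The field names are hypothetical, the translator handles nothing beyond this one form, and the query Kibana actually generates may be structured differently.

```python
import json


def kql_and_terms_to_dsl(expression: str) -> dict:
    """Translate 'field:value and field:value' into a bool/filter query.

    Illustrative only: no support for 'or', 'not', ranges, wildcards,
    quoting, or nested fields.
    """
    clauses = []
    for term in expression.split(" and "):
        field, value = term.split(":", 1)
        clauses.append({"match": {field.strip(): value.strip()}})
    return {"query": {"bool": {"filter": clauses}}}


kql = "level:error and service:payments"
dsl = kql_and_terms_to_dsl(kql)
print(json.dumps(dsl, indent=2))
```

The one-line KQL expression expands into a nested JSON body, which is the main reason engineers prefer KQL for ad-hoc investigation in Discover.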
Production Failure Modes
Three failure modes account for the majority of ELK production incidents:
- Heap pressure and GC pauses. Elasticsearch nodes are JVM processes. When heap usage stays above roughly 75%, garbage collection can cause multi-second pauses that slow ingestion and query response. Size the heap at no more than 50% of available RAM and keep it below about 31 GB so that the JVM can still use compressed OOPs. Monitor jvm.mem.heap_used_percent as a critical Prometheus metric.
- Split-brain. In Elasticsearch 7+ this is largely solved by the quorum-based cluster coordination layer, but cluster formation can still go wrong if discovery.seed_hosts is misconfigured or network partitions occur. Always run an odd number of master-eligible nodes (3 or 5) and set cluster.initial_master_nodes correctly, on first bootstrap only.
- Logstash pipeline stall. If Elasticsearch is backpressured (full disk, circuit breaker open), Logstash's queue fills up and the pipeline stops accepting new events. Without the persistent queue enabled (queue.type: persisted), events held in memory are lost if Logstash restarts, and inputs that cannot apply backpressure to their senders may drop log lines silently. Always enable the persistent queue in production and size it for at least 30 minutes of peak ingestion volume.
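The sizing rules in the list above reduce to simple arithmetic. The following Python sketch works through them under stated assumptions: the function names are hypothetical, the workload figures (20,000 events per second, 500 bytes per event) are invented for the example, and the queue estimate ignores on-disk serialisation overhead.

```python
def heap_size_gb(ram_gb: float, cap_gb: float = 31.0) -> float:
    """Heap size following the rule: half of RAM, capped near 31 GB."""
    return min(ram_gb / 2, cap_gb)


def master_quorum(master_eligible: int) -> int:
    """Votes needed for a majority of master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_master_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can fail while a majority remains."""
    return master_eligible - master_quorum(master_eligible)


def queue_bytes(events_per_sec: int, avg_event_bytes: int, minutes: int = 30) -> int:
    """Raw bytes needed to buffer `minutes` of peak ingestion."""
    return events_per_sec * avg_event_bytes * minutes * 60


print(heap_size_gb(16))   # small node: half of RAM
print(heap_size_gb(64))   # large node: capped at 31 GB
# A fourth master-eligible node adds no extra failure tolerance over three.
print(tolerated_master_failures(3), tolerated_master_failures(4))
# Assumed peak: 20,000 events/s at 500 bytes each, buffered for 30 minutes.
print(queue_bytes(20_000, 500) / 1e9, "GB")
```

With these assumed figures the queue needs about 18 GB, which would translate into a queue.max_bytes setting of at least that size plus headroom.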
Security Baseline
Never run ELK without security enabled. Older Elasticsearch versions allowed open access by default, which has led to thousands of publicly exposed databases. Since Elasticsearch 8.0, security (TLS + basic auth) is enabled by default. For existing deployments:
- Enable TLS on all inter-node and client-to-node communication (xpack.security.enabled: true, xpack.security.transport.ssl.enabled: true, xpack.security.http.ssl.enabled: true).
- Use dedicated service accounts with minimal privileges (logstash_writer role: only write to specific index patterns; kibana_read_only for dashboard consumers).
- Never expose the Elasticsearch HTTP port (9200) to the internet; place it behind a VPC security group or firewall, accessible only from the ingest tier and Kibana.
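As an illustration of the least-privilege roles described in the list above, the Python sketch below builds two role definitions in the shape used by the Elasticsearch security role API. The exact privilege sets are assumptions chosen to be minimal; a real Logstash deployment may need more (for example, privileges to manage index templates or ILM), so treat this as a starting point rather than a recommended configuration.

```python
import json
from typing import Dict, List


def build_writer_role(index_patterns: List[str]) -> Dict:
    """Role for the ingest tier: may create indices and append documents only."""
    return {
        "cluster": ["monitor"],
        "indices": [
            {
                "names": index_patterns,
                # Assumed minimal set: no read, update, or delete rights.
                "privileges": ["create_index", "create_doc"],
            }
        ],
    }


def build_read_only_role(index_patterns: List[str]) -> Dict:
    """Role for dashboard consumers: may search but never write."""
    return {
        "indices": [
            {
                "names": index_patterns,
                "privileges": ["read", "view_index_metadata"],
            }
        ],
    }


roles = {
    "logstash_writer": build_writer_role(["logs-app-*"]),
    "kibana_read_only": build_read_only_role(["logs-app-*"]),
}
print(json.dumps(roles, indent=2))
```

Each body could be sent with a PUT to _security/role/<role name>. Scoping the names list to specific index patterns is what keeps a leaked ingest credential from reading or deleting unrelated data.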