Elasticsearch for Operators
Running Elasticsearch in production is not the same as running it for a demo. At scale, the difference between a cluster that holds up under a 200 GB/day ingest spike and one that falls over comes down to four things you must own as an operator: index design, shard strategy, mapping discipline, and Index Lifecycle Management (ILM). This lesson is your operator's field guide to all four, plus the cluster-health signals that tell you when something is about to go wrong.
Indices and What They Actually Are
An index in Elasticsearch is the logical namespace for a collection of related JSON documents. Under the hood it is a set of one or more Apache Lucene indices. Every write goes to a primary shard, which then replicates to its replica shard(s). Reads are served from either — this is how Elasticsearch delivers both durability and read scalability from the same cluster.
For logs, you almost never create a single monolithic index. Instead you use a data stream, which is a sequence of time-stamped backing indices managed automatically. A write to the data stream always lands on the current write index (called the "write backing index"). Older backing indices stop receiving new documents and are aged through later phases or deleted by ILM policy.
@timestamp sorting guarantee that log queries rely on.
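As a minimal sketch of what a write to a data stream looks like, the snippet below renders a bulk request body. It assumes the body is sent to POST /logs-app-default/_bulk, where logs-app-default is a hypothetical data stream name; the helper name bulk_body is also made up for illustration.

```python
import json
from datetime import datetime, timezone


def bulk_body(events):
    """Render events as NDJSON for a bulk request against a data stream."""
    lines = []
    for event in events:
        doc = dict(event)
        # Every document written to a data stream needs an @timestamp field.
        doc.setdefault("@timestamp", datetime.now(timezone.utc).isoformat())
        # Data streams are append-only, so the bulk action is "create".
        lines.append(json.dumps({"create": {}}))
        lines.append(json.dumps(doc))
    # The bulk API expects a newline after the last line as well.
    return "\n".join(lines) + "\n"


body = bulk_body([{"@timestamp": "2024-01-01T00:00:00Z", "message": "started"}])
print(body)
```

Elasticsearch routes the documents to the current write index on its own; the client never names a backing index.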
Shards: the Unit of Scale and the Source of Most Production Pain
A shard is a single Lucene index. Every index has a fixed number of primary shards set at creation time. You cannot change it in place: changing it means reindexing, or using the shrink or split APIs, both of which write a new index. The number of replicas per primary, by contrast, can be changed at any time. The rules of thumb used at large Elasticsearch deployments are:
- Target 20–50 GB per shard. Shards outside this range are problematic: too small and the overhead of coordinating thousands of shards degrades query latency; too large and segment merges stall, recovery after a node failure takes too long, and heap pressure spikes.
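- One replica minimum in production. With zero replicas, losing any node that holds a primary makes those shards unavailable (red cluster state), and the data is lost for good if that node's disk cannot be recovered.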
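- Total shard count drives heap usage. Elasticsearch keeps index and shard metadata in the cluster state, which every node holds in heap, and each shard adds further overhead on the node that hosts it. The per-shard cost is small, but it adds up: a cluster with 100,000 shards is likely to run out of heap or destabilize its master even on 64 GB nodes.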
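The sizing rules above reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 40 GB target is just a point inside the 20–50 GB range, and it assumes on-disk index size is roughly equal to ingest volume, which will not hold exactly for your data.

```python
import math


def primary_shards(daily_ingest_gb, rollover_days=1, target_shard_gb=40):
    """Primaries needed so each shard lands near target_shard_gb at rollover."""
    index_size_gb = daily_ingest_gb * rollover_days
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def total_shards(primaries, replicas, retained_indices):
    """Each retained index contributes its primaries plus every replica copy."""
    return primaries * (1 + replicas) * retained_indices


p = primary_shards(200)  # 200 GB/day with daily rollover
print(p, total_shards(p, replicas=1, retained_indices=30))  # 5 300
```

Running the same numbers with a 1 GB shard target shows how quickly the total shard count grows when shards are undersized.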
To inspect shard allocation right now:
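One way to do this from a script is a small wrapper around the cat shards API (GET /_cat/shards). This is a sketch under assumptions: the base URL is a placeholder for an unsecured local node, and the helper names and sample rows are invented for illustration.

```python
import json
import urllib.request

# format=json and bytes=b make the output machine-readable, with sizes in bytes.
CAT_SHARDS = "/_cat/shards?format=json&bytes=b&h=index,shard,prirep,state,store,node"
GB = 1024 ** 3


def fetch_shards(base_url="http://localhost:9200"):
    """Call the cat shards API; base_url is an assumption about your cluster."""
    with urllib.request.urlopen(base_url + CAT_SHARDS) as resp:
        return json.load(resp)


def out_of_range(rows, low_gb=20, high_gb=50):
    """Return (index, shard, size_gb) for started primaries outside the target."""
    flagged = []
    for row in rows:
        # Unassigned shards have no store size, so skip anything not STARTED.
        if row["prirep"] != "p" or row["state"] != "STARTED":
            continue
        size_gb = int(row["store"]) / GB
        if not low_gb <= size_gb <= high_gb:
            flagged.append((row["index"], row["shard"], round(size_gb, 1)))
    return flagged


# Made-up rows in the shape the cat API returns (all values are strings).
sample = [
    {"index": "logs-a", "shard": "0", "prirep": "p", "state": "STARTED",
     "store": str(30 * GB), "node": "n1"},
    {"index": "logs-b", "shard": "0", "prirep": "p", "state": "STARTED",
     "store": str(2 * GB), "node": "n2"},
    {"index": "logs-b", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
     "store": None, "node": None},
]
print(out_of_range(sample))
```

The current write index will usually show up as undersized simply because it is still filling, so treat hits on it as expected.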
Mappings: Schema-on-Write for Search
Elasticsearch is often described as "schema-free," but that is misleading. What it actually does is dynamic mapping: the first document to contain a field fixes that field's type for every subsequent document in the index. This is dangerous for log pipelines. If one service logs a field as a number and another logs the same field as a free-form string, the document that arrives second can hit a mapping conflict. The conflicting document is rejected with a 400 and, depending on your ingest configuration, may be silently dropped.
The production pattern is to define an explicit index template with component templates that lock down the fields you know about and configure sensible defaults for the rest. Use dynamic: strict on objects where you want hard enforcement, or dynamic: true with a dynamic_templates block that catches unexpected fields and maps them as keyword instead of allowing Elasticsearch to auto-detect a numeric or date type that might conflict later.
Set norms: false on the message field. Norms store per-document field-length normalization data for relevance scoring, which is useless for log queries where you care about exact matches, not BM25 ranking. Disabling them saves about 1 byte per document per field, which compounds to gigabytes in high-volume clusters.
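The template pattern described above could look roughly like the request bodies below. This is a sketch, not a recommended schema: the names logs-app-mappings and logs-app-*, the http.status_code field, and the ignore_above value are all assumptions made for illustration.

```python
import json


def keyword_template(json_type):
    """Dynamic template that maps unexpected fields of json_type to keyword."""
    return {
        f"{json_type}_as_keyword": {
            "match_mapping_type": json_type,
            "mapping": {"type": "keyword", "ignore_above": 1024},
        }
    }


# Body for PUT /_component_template/logs-app-mappings (hypothetical name).
component_template = {
    "template": {
        "mappings": {
            "dynamic": True,
            # Stop date-looking strings from being auto-detected as date fields.
            "date_detection": False,
            "dynamic_templates": [
                keyword_template(t) for t in ("string", "long", "double")
            ],
            "properties": {
                "@timestamp": {"type": "date"},
                # Norms only matter for relevance scoring, so turn them off.
                "message": {"type": "text", "norms": False},
                # Hard enforcement: unknown fields under "http" are rejected.
                "http": {
                    "dynamic": "strict",
                    "properties": {"status_code": {"type": "short"}},
                },
            },
        }
    }
}

# Body for PUT /_index_template/logs-app (hypothetical name).
index_template = {
    "index_patterns": ["logs-app-*"],
    "data_stream": {},
    "priority": 200,
    "composed_of": ["logs-app-mappings"],
}

print(json.dumps(component_template, indent=2))
```

Keeping mappings in a component template lets several index templates share the same field definitions.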
Index Lifecycle Management (ILM)
ILM is the policy engine that moves indices through defined phases — hot → warm → cold → frozen → delete — as they age. For a logging cluster this is non-negotiable: without ILM, disks fill up and you are deleting indices manually at 3 AM.
The key transition triggers are rollover conditions on the hot phase: when an index reaches a maximum age, a maximum size, or a maximum document count, ILM creates a new write index and the old one stops receiving writes and starts moving toward warm. Typical production thresholds at a mid-size company:
- Hot: roll over at max_age: 1d or max_size: 40gb, whichever is reached first.
- Warm: after 2 days, force merge to 1 segment and shrink the shard count.
- Cold: after 7 days, mount as a searchable snapshot backed by object storage.
- Delete: after 30–90 days, depending on the regulatory window.
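Those thresholds translate into an ILM policy body along the following lines. Treat it as a sketch: the function name make_policy, the repository name logs-snapshots, and the shrink target of one shard are assumptions, and the body would be sent to PUT /_ilm/policy/ followed by a policy name of your choosing.

```python
import json


def make_policy(retention_days=30, snapshot_repo="logs-snapshots"):
    """Build an ILM policy body; snapshot_repo is a hypothetical repository."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Roll over on whichever condition is met first.
                        "rollover": {"max_age": "1d", "max_size": "40gb"}
                    }
                },
                "warm": {
                    # min_age is measured from rollover, not index creation.
                    "min_age": "2d",
                    "actions": {
                        "forcemerge": {"max_num_segments": 1},
                        "shrink": {"number_of_shards": 1},
                    },
                },
                "cold": {
                    "min_age": "7d",
                    "actions": {
                        "searchable_snapshot": {
                            "snapshot_repository": snapshot_repo
                        }
                    },
                },
                "delete": {
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


print(json.dumps(make_policy(retention_days=90), indent=2))
```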
Reading Cluster Health
Elasticsearch exposes cluster health as green / yellow / red. Red means at least one primary shard is unassigned — some data is unavailable and writes to that shard are rejected. Yellow means all primaries are assigned but at least one replica shard is unassigned — the cluster is fully functional but has no redundancy for those shards. Green means everything is assigned and healthy.
In a production logging cluster, a sustained yellow state during a rolling restart is expected and acceptable. Red is always an incident. The fastest triage path is:
- Check GET /_cluster/health?pretty and look at unassigned_shards and active_shards_percent_as_number.
- Run GET /_cluster/allocation/explain. Elasticsearch will tell you exactly why a shard is unassigned (disk watermark, no eligible nodes, etc.).
- If disk watermarks are the cause, the default thresholds are 85% (low: stop allocating new shards to the node), 90% (high: move shards away from the node), and 95% (flood stage: every index with a shard on that node becomes read-only). Check with GET /_cat/nodes?v&h=name,disk.used,disk.avail,disk.total,disk.used_percent.
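The watermark check in the last step can be scripted. The sketch below classifies nodes against the default thresholds quoted above; it assumes the rows come from GET /_cat/nodes?format=json&h=name,disk.used_percent, and the function names and sample values are invented for illustration.

```python
def disk_stage(used_percent, low=85, high=90, flood=95):
    """Map a disk usage percentage to the watermark it has reached."""
    if used_percent >= flood:
        return "flood_stage"
    if used_percent >= high:
        return "high"
    if used_percent >= low:
        return "low"
    return "ok"


def triage(rows):
    """rows: parsed JSON from the cat nodes API (values arrive as strings)."""
    return {row["name"]: disk_stage(float(row["disk.used_percent"])) for row in rows}


# Made-up sample in the shape the cat API returns.
sample = [
    {"name": "es-hot-1", "disk.used_percent": "62.40"},
    {"name": "es-hot-2", "disk.used_percent": "91.07"},
    {"name": "es-warm-1", "disk.used_percent": "96.50"},
]
print(triage(sample))
```

If your cluster overrides the watermark settings, pass the configured values instead of relying on the defaults.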
Monitor jvm.mem.heap_used_percent on every node. Above 85% heap usage, the JVM garbage collector can start causing multi-second stop-the-world pauses. This is the second-most-common cause of production Elasticsearch incidents after shard proliferation. Alert at 75% heap; page at 85%.
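A minimal sketch of that alerting rule, assuming the input is the parsed response of GET /_nodes/stats/jvm. The function name and the sample node data are hypothetical.

```python
def heap_alerts(stats, warn=75, page=85):
    """Return {node_name: severity} for nodes at or above the thresholds."""
    alerts = {}
    for node in stats["nodes"].values():
        pct = node["jvm"]["mem"]["heap_used_percent"]
        if pct >= page:
            alerts[node["name"]] = "page"
        elif pct >= warn:
            alerts[node["name"]] = "alert"
    return alerts


# Made-up sample trimmed to the fields the function reads.
sample = {
    "nodes": {
        "a1": {"name": "es-hot-1", "jvm": {"mem": {"heap_used_percent": 62}}},
        "b2": {"name": "es-hot-2", "jvm": {"mem": {"heap_used_percent": 78}}},
        "c3": {"name": "es-hot-3", "jvm": {"mem": {"heap_used_percent": 91}}},
    }
}
print(heap_alerts(sample))
```

A single sample can catch a node just before a garbage collection, so in practice you would alert on a sustained reading rather than one poll.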
Key Operator Habits
- Always test a new ILM policy against a non-production data stream before applying it to production indices. The shrink action is destructive and cannot be undone.
- Pin your index template priority (use values like 200 for your templates) so that Elastic's built-in templates at priority 100 do not accidentally override yours.
- Use GET /_data_stream/logs-app-* to see which backing indices exist and which is the current write index. This is the first thing to check when logs stop arriving.
- Avoid _forcemerge on hot indices. It is a heavily I/O-bound operation that competes with active indexing and can destabilize a node under load.
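For the data stream check in the list above, the write index is the last entry in each stream's indices array. The sketch below pulls it out of a parsed GET /_data_stream response; the function name and the sample backing index names are made up for illustration.

```python
def write_indices(response):
    """Map each data stream name to its current write index."""
    return {
        ds["name"]: ds["indices"][-1]["index_name"]
        for ds in response["data_streams"]
    }


# Made-up sample trimmed to the fields the function reads.
sample = {
    "data_streams": [
        {
            "name": "logs-app-default",
            "generation": 2,
            "indices": [
                {"index_name": ".ds-logs-app-default-2024.01.01-000001"},
                {"index_name": ".ds-logs-app-default-2024.01.02-000002"},
            ],
        }
    ]
}
print(write_indices(sample))
```

If the write index is older than your rollover max_age, rollover has probably stalled and ILM is the next place to look.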