Structured Logging Practices
Logs are the oldest observability signal. They are also the most abused. A poorly designed logging strategy produces terabytes of noise, $50k/month Datadog bills, and engineers who grep through walls of unformatted text at 3 am while a production incident burns. A well-designed strategy produces a searchable, correlated, cost-effective record of everything your system did — and exactly why.
This lesson covers the three practices that separate production-grade logging from chaos: structured (JSON) logs, correlation IDs, and disciplined log levels. These are common defaults at large engineering organizations for a reason.
Why JSON Logs Beat Plain Text
The old world of logging looks like this: 2024-03-15 10:23:41 INFO User 4291 placed order 8820. That line is readable by a human in isolation, but it is nearly useless at scale. Parsing it requires custom regex per service. Adding a new field means breaking existing dashboards. Querying across 200 microservices that each invented their own format is an operational nightmare.
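Structured logging means emitting every log entry as a machine-readable document, almost always JSON, where every field is a typed key-value pair. The same event becomes a single JSON object with fields such as level, service, user_id, order_id, amount_cents, and duration_ms.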
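As a minimal sketch, the plain-text line above could be emitted as one JSON object per line. The helper name formatLogLine is hypothetical, and the amount, currency, and duration values are made up for illustration; only the user and order numbers come from the plain-text example.

```javascript
// Build one structured log entry and serialize it as a single JSON line.
// Field names and the amount/currency/duration values are illustrative.
function formatLogLine(level, message, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(), // UTC, ISO 8601 with milliseconds
    level,
    service: "order-svc",
    message, // static event name; dynamic values go in their own fields
    ...fields,
  };
  return JSON.stringify(entry);
}

const line = formatLogLine("info", "order.placed", {
  user_id: 4291,
  order_id: 8820,
  amount_cents: 12999,
  currency: "USD",
  duration_ms: 42,
});

console.log(line);
```

Because every value sits under its own key, a backend can filter on order_id or aggregate duration_ms without parsing the message text.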
Now any log aggregation platform — Elasticsearch, Loki, Splunk, CloudWatch — can index every field. You can filter level:error AND service:order-svc AND amount_cents:>10000 in milliseconds without writing a single regex. Dashboards can aggregate avg(duration_ms) grouped by service. Alerting rules can fire on event:order.failed AND currency:USD. This is not a nice-to-have at scale; it is a prerequisite for observability.
Required Fields: What Every Log Line Must Carry
Agree on a canonical schema across all services, enforced at the logging library level — not by trusting individual developers to remember. At minimum, every log line should carry:
- timestamp — UTC, ISO 8601 with milliseconds. Never local time.
- level — lowercase string: debug, info, warn, error, fatal.
- service — the logical service name, not the hostname.
- version — the deployed artifact version/commit SHA. Essential for correlating a bug to a deploy.
- trace_id / correlation_id — covered in depth below.
- message — a short, static string. Never build the message via string interpolation with dynamic values; put those in dedicated fields.
- env — production, staging, dev.
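One way to enforce the schema in the library rather than at each call site is a small logger factory that refuses to start without the shared fields. This is a sketch under assumptions: createLogger, REQUIRED_BASE_FIELDS, and the write callback are hypothetical names, not part of any particular library.

```javascript
// Hypothetical logger factory: the shared library stamps the required fields
// so individual call sites cannot forget them.
const REQUIRED_BASE_FIELDS = ["service", "version", "env"];

function createLogger(base, write = (line) => process.stdout.write(line + "\n")) {
  for (const key of REQUIRED_BASE_FIELDS) {
    if (!base[key]) throw new Error(`logger base field "${key}" is required`);
  }
  const emit = (level, message, fields = {}) => {
    write(JSON.stringify({
      ...fields, // dynamic data first, so it cannot overwrite canonical fields
      timestamp: new Date().toISOString(),
      level,
      ...base,
      message,
    }));
  };
  return {
    info: (message, fields) => emit("info", message, fields),
    error: (message, fields) => emit("error", message, fields),
  };
}

const lines = [];
const log = createLogger(
  { service: "order-svc", version: "a1b2c3d", env: "staging" },
  (line) => lines.push(line), // captured here; the default writes to stdout
);
log.info("order.placed", { order_id: 8820, trace_id: "4bf92f3577b34da6a3ce929d0e0e4736" });
console.log(lines[0]);
```

Spreading the caller's fields first is a deliberate choice: a call site that passes its own level or service key cannot override the canonical values.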
"message": "order.placed" and put dynamic data in "order_id": 8820. If the message string varies per invocation, log aggregators cannot group occurrences of the same event — you get 10,000 unique log lines instead of one event fired 10,000 times.Correlation IDs: Tracing a Request Across Services
A single user request in a microservices system touches 5–20 services. When that request fails, you need to follow the chain from the API gateway through the auth service, product service, payment service, and notification service. Without a shared identifier, you are matching timestamps and praying.
A correlation ID (also called a trace ID or request ID) is a unique identifier generated at the system's entry point — the API gateway or load balancer — and propagated as an HTTP header through every downstream call. Every service reads the header and attaches the ID to every log line it emits for that request.
The standard transport mechanism is HTTP headers. W3C Trace Context (traceparent / tracestate) is the modern standard and the default propagation format in OpenTelemetry; Zipkin and Jaeger historically used their own headers (B3 and uber-trace-id) and can interoperate with it through OpenTelemetry instrumentation. If you are not using a full tracing stack yet, even a simple custom header like X-Request-ID is a massive improvement over nothing.
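The propagation rule can be sketched in a few lines: read the ID from the inbound headers, mint one if the request has none, and attach it to every outbound call. The function names resolveTraceId and outboundHeaders are hypothetical, and the sketch always sets the sampled flag to 01 for simplicity; a real tracing SDK handles flags and tracestate properly.

```javascript
const { randomBytes } = require("node:crypto");

// W3C traceparent: version "-" 32-hex trace-id "-" 16-hex parent-id "-" 2-hex flags
const TRACEPARENT = /^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/;

// Resolve the correlation ID for an inbound request. Header names are
// lowercase, which is how Node's http module exposes them.
function resolveTraceId(headers) {
  const match = TRACEPARENT.exec(headers["traceparent"] || "");
  if (match && !/^0+$/.test(match[1])) return match[1]; // all-zero trace-id is invalid
  if (headers["x-request-id"]) return headers["x-request-id"];
  return randomBytes(16).toString("hex"); // entry point: mint a new ID
}

// Headers to attach to every downstream call made while handling the request.
function outboundHeaders(traceId) {
  const headers = { "x-request-id": traceId };
  if (/^[0-9a-f]{32}$/.test(traceId)) {
    const spanId = randomBytes(8).toString("hex"); // new parent-id for this hop
    headers["traceparent"] = `00-${traceId}-${spanId}-01`; // flags fixed to 01 here
  }
  return headers;
}

const inbound = { traceparent: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" };
const traceId = resolveTraceId(inbound);
console.log(JSON.stringify({ level: "info", message: "request.received", trace_id: traceId }));
console.log(outboundHeaders(traceId));
```

Every log line for the request then carries the same trace_id, so a single query on that field returns the whole chain across services.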
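Every service must forward the traceparent header on its outbound calls. In Kubernetes a service mesh (Istio / Linkerd) can generate trace headers at the proxy, but the application usually still has to copy them from the inbound request onto its outbound calls, so never assume: verify with a test request and confirm the header arrives downstream.
Log Levels Done Right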
Log levels exist to let you tune verbosity at runtime without redeploying. Used correctly they let you run quiet in production, turn up the volume during an incident, and never pay for logs you do not need. Used incorrectly, every line is INFO, your bill is $40k/month, and signal is buried in noise.
Here are the conventional semantics for each level, as used in large-scale production systems:
- DEBUG — detailed internal state, variable values, intermediate steps. Should be off by default in production. Enable dynamically for a specific service during an active investigation. Cost: high volume, high cost.
- INFO — significant business events: a request completed successfully, a job started, a user logged in. Should be meaningful to a non-developer reading the logs. Avoid logging every function call as INFO.
- WARN — something unexpected happened but the system recovered. A retry succeeded. A deprecated API endpoint was called. A circuit breaker opened and fell back. Worth tracking as a leading indicator of future errors.
- ERROR — an operation failed and the system could not recover from it for this request. Always include the exception/stack trace. Must page the on-call engineer if sustained. Never use ERROR for expected failure conditions like a user entering a wrong password.
- FATAL — the process is about to exit due to an unrecoverable condition (OOM, corrupt config, lost DB connection at startup). Rare and always alarming.
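Runtime-tunable verbosity comes down to comparing each line's severity against a threshold that can change while the process runs. The sketch below assumes numeric severities of its own choosing and a hypothetical createLevelFilter helper; real libraries ship their own level tables.

```javascript
// Numeric severities let a single threshold decide what gets emitted.
// The numbers are arbitrary; only their order matters.
const LEVELS = new Map([
  ["debug", 10], ["info", 20], ["warn", 30], ["error", 40], ["fatal", 50],
]);

function createLevelFilter(initial = process.env.LOG_LEVEL || "info") {
  // Unknown level names fall back to info.
  let threshold = LEVELS.get(initial) ?? LEVELS.get("info");
  return {
    isEnabled: (level) => (LEVELS.get(level) ?? 0) >= threshold,
    // Could be called from an admin endpoint or signal handler to change
    // verbosity while the process is running.
    setLevel: (level) => {
      if (!LEVELS.has(level)) throw new Error(`unknown level: ${level}`);
      threshold = LEVELS.get(level);
    },
  };
}

const filter = createLevelFilter("info");
console.log(filter.isEnabled("debug")); // false
filter.setLevel("debug"); // turn up the volume during an incident
console.log(filter.isEnabled("debug")); // true
```

Checking isEnabled before building an expensive debug payload also avoids paying the serialization cost for lines that would be dropped anyway.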
Practical Configuration: Structured Logging with Pino (Node.js)
Most modern logging libraries support structured output natively. Pino (Node.js) and Zap (Go) are popular choices built for high throughput. A production-ready Pino setup can cover all of the above with a handful of options.
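A minimal sketch of such a setup, assuming the pino package is installed (plus pino-pretty for local development). The environment variable names SERVICE_NAME and GIT_SHA, the default values, and the redact paths are placeholders to adapt, not a prescribed configuration.

```javascript
// Assumes `npm install pino` (and `pino-pretty` as a dev dependency).
const pino = require("pino");

const isProduction = process.env.NODE_ENV === "production";

const logger = pino({
  level: process.env.LOG_LEVEL || "info", // verbosity set from the environment
  messageKey: "message",
  // Fields stamped on every line; the env var names are placeholders.
  base: {
    service: process.env.SERVICE_NAME || "order-svc",
    version: process.env.GIT_SHA || "unknown",
    env: process.env.NODE_ENV || "dev",
  },
  // Emit an ISO 8601 UTC "timestamp" key instead of Pino's default epoch "time".
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  // Emit the level as a lowercase label ("info") instead of a number (30).
  formatters: { level: (label) => ({ level: label }) },
  // Censor secrets so they never reach the log stream; the paths are examples.
  redact: {
    paths: ["req.headers.authorization", "req.headers.cookie", "password", "card.number"],
    censor: "[REDACTED]",
  },
  // Pretty-print only outside production; production writes raw JSON to stdout.
  ...(isProduction
    ? {}
    : { transport: { target: "pino-pretty", options: { colorize: true, messageKey: "message" } } }),
});

// A child logger carries the correlation ID on every line for one request.
const requestLog = logger.child({ trace_id: "4bf92f3577b34da6a3ce929d0e0e4736" });
requestLog.info({ order_id: 8820, amount_cents: 12999 }, "order.placed");
```

Creating the child logger once per request, for example in HTTP middleware, keeps trace_id out of individual call sites.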
Key patterns to note: LOG_LEVEL is an environment variable, so ops can change verbosity without a code change or new build. The redact option strips secrets before they ever reach the log stream. Treat this as mandatory: PCI-DSS prohibits writing unprotected cardholder data to logs, and SOC 2 audits typically expect equivalent controls over sensitive data. The transport switches to pretty-print locally while shipping raw JSON in production, so container log drivers and log forwarders receive clean, parseable output.
Sampling and Cost Control
At high traffic volumes, logging every request at INFO produces enormous data volumes. A service handling 100,000 requests per second, each emitting 5 log lines, produces 500,000 lines/second. At a few hundred bytes per JSON line that is on the order of 100 MB/s, or hundreds of gigabytes per hour per service. At that volume, log ingestion cost at Datadog, Splunk, or New Relic can exceed the compute cost of the service itself.
Production strategies to control cost without losing signal:
- Head-based sampling: log 1% of successful requests at full verbosity; always log 100% of errors and warnings. Apply at the gateway level with an X-Sample-Rate header so all services honour the same sampling decision.
- Tail-based sampling: buffer logs in memory, and only flush to the backend if the request took more than 500 ms or resulted in an error. Requires a buffering component in the pipeline; the OpenTelemetry Collector's tail sampling processor applies this pattern to traces.
- Index selective fields: most backends let you store the full log payload but only index specific fields. Full-text search on unindexed fields is slower but storage is cheap. Index level, service, trace_id, and key event fields; store the rest as cold attributes.
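The head-based strategy can be sketched as a pure function. One assumption beyond the text: the decision is derived from the trace ID instead of a random draw, so every service that sees the same ID and rate reaches the same verdict. This works when trace IDs are random hex strings, as W3C trace IDs normally are. The names shouldEmit and parseSampleRate are hypothetical.

```javascript
// Decide whether to emit a log line under head-based sampling.
function shouldEmit(level, traceId, sampleRate) {
  // Warnings and errors are never sampled away.
  if (level === "warn" || level === "error" || level === "fatal") return true;
  // Map the first 32 bits of the trace ID onto [0, 1).
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  if (Number.isNaN(bucket)) return true; // non-hex ID: fail open and keep the line
  return bucket < sampleRate;
}

// The rate would normally arrive from the gateway, e.g. in an X-Sample-Rate header.
// Missing or malformed values fall back to logging everything.
function parseSampleRate(headerValue, fallback = 1) {
  const rate = headerValue ? Number(headerValue) : NaN;
  return Number.isFinite(rate) && rate >= 0 && rate <= 1 ? rate : fallback;
}

const rate = parseSampleRate("0.01");
const id = "4bf92f3577b34da6a3ce929d0e0e4736";
console.log(shouldEmit("info", id, rate)); // false: this ID falls outside the 1% bucket
console.log(shouldEmit("error", id, rate)); // true: errors are always kept
```

Failing open on a missing header or an unparseable ID trades some extra volume for the guarantee that a misconfigured gateway never silences a service.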
Structured logging is the foundation on which dashboards, distributed tracing, and alerting are built. Get it right from day one and every other observability investment compounds on top of it. Retrofit it into an existing system and you pay the cost twice.