Structured Logging Standards
Before you can ship logs to Elasticsearch, query them in Loki, or alert on them in Grafana, you need to decide what a log line actually is. That decision, made separately and inconsistently across fifty services, is the reason most logging pipelines become graveyards of unqueryable noise. This lesson establishes the production standards that companies like Stripe, Cloudflare, and GitHub use to make their logs a reliable source of truth at hundreds of thousands of events per second.
Why Unstructured Logs Fail at Scale
A classic unstructured log line looks like this: 2024-03-12 14:23:01 ERROR checkout failed for user 42 after 3201ms. A human can read it. A machine cannot reliably parse it. To extract the user ID, the duration, or the error type from a million such lines, you must write fragile regex — and the moment one team changes their log format, the regex breaks and your dashboard silently returns zeros.
Structured logging replaces free-text messages with machine-readable key/value documents — almost always JSON. Every field is a named, typed attribute. The log above becomes a JSON object where user_id, duration_ms, and level are first-class fields you can index, filter, and aggregate without any parsing logic at query time. This is not just a developer convenience — it is the architectural prerequisite for centralized logging to work.
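As a minimal sketch of the idea above, the unstructured example line can be expressed as a JSON document using only the Python standard library. The field names user_id, duration_ms, and level come from the text; the message key and the assumption that the original timestamp was in UTC are illustrative choices, not part of any standard.

```python
import json

# Values are copied from the unstructured example line.
# Assumption: the original timestamp was UTC, so a Z suffix is added.
event = {
    "timestamp": "2024-03-12T14:23:01Z",
    "level": "error",
    "message": "checkout failed",
    "user_id": 42,
    "duration_ms": 3201,
}

# One compact JSON object per line is what most log shippers expect.
line = json.dumps(event, separators=(",", ":"))
print(line)

# A consumer can filter on typed fields without any regex.
parsed = json.loads(line)
is_slow_error = parsed["level"] == "error" and parsed["duration_ms"] > 3000
```

Because duration_ms is a number rather than text embedded in a sentence, the comparison in the last line needs no parsing logic at all.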
The Canonical JSON Log Schema
There is no single official standard, but the industry has converged on a small set of mandatory base fields that every log line must carry, regardless of which service or language emits it. These fields are what allow your log aggregator to merge streams from a Go API, a Python worker, and a Node.js service into a coherent, jointly-queryable corpus.
Let us walk through the most important field groups and the reasoning behind each one.
Field Conventions: The Why Behind Every Key
Timestamp. Always ISO 8601 in UTC with millisecond precision: 2024-03-12T14:23:01.847Z. Never use Unix epoch integers in your application logs — they are unreadable without a converter, and every log UI will re-parse the ISO string anyway. The timezone suffix Z is mandatory; logs without it create ambiguity when services span regions.
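The timestamp rule can be sketched in Python as follows. The helper name utc_timestamp is hypothetical; the datetime calls are standard library.

```python
from datetime import datetime, timezone

def utc_timestamp(now=None):
    """Return an ISO 8601 UTC timestamp with millisecond precision and a Z suffix."""
    if now is None:
        now = datetime.now(timezone.utc)
    # astimezone converts an aware datetime to UTC before formatting.
    # A naive datetime would be interpreted as local time, so pass aware values.
    text = now.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    # isoformat renders UTC as "+00:00"; the convention in the text is "Z".
    return text.replace("+00:00", "Z")

example = utc_timestamp(datetime(2024, 3, 12, 14, 23, 1, 847000, tzinfo=timezone.utc))
print(example)
```

Passing a fixed datetime, as in the last lines, keeps the helper easy to test; production code would call it with no argument.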
Level. Use exactly these five, lowercase: debug, info, warn, error, fatal. Do not invent critical, severe, or INFORMATION. Mixed-case and synonym proliferation breaks log-level filters across your aggregator. Production services should emit info and above by default; debug is toggled dynamically via a feature flag or environment variable, never compiled in permanently.
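One way to enforce the five level names in Python is to map the standard library's numeric levels onto them at formatting time. This is a sketch under the assumption that WARNING maps to warn and CRITICAL maps to fatal; the names LEVEL_NAMES and canonical_level are hypothetical.

```python
import logging

# Standard library levels mapped onto the five canonical names from the text.
LEVEL_NAMES = {
    logging.DEBUG: "debug",
    logging.INFO: "info",
    logging.WARNING: "warn",
    logging.ERROR: "error",
    logging.CRITICAL: "fatal",
}

def canonical_level(levelno):
    """Map a numeric level to the highest canonical level it reaches."""
    for threshold in sorted(LEVEL_NAMES, reverse=True):
        if levelno >= threshold:
            return LEVEL_NAMES[threshold]
    # Anything below DEBUG (for example a custom TRACE level) is reported as debug.
    return "debug"
```

Custom numeric levels then collapse into the canonical set instead of leaking new names into the aggregator.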
Service, version, env. These three together identify what emitted the log. service should match your Kubernetes Service name exactly — that allows automatic correlation between metrics and logs without a mapping table. version should be the full semver tag or git SHA, so you can correlate a degradation with a specific deployment. env must be one of development, staging, production — it prevents staging noise from polluting production dashboards.
Namespaced field keys. Use dot notation for domain grouping: http.method, http.status, db.query_ms, error.type, error.stack. This follows the dot-namespaced style of the OpenTelemetry Semantic Conventions, which makes it much easier to line your logs up with traces when you add distributed tracing. Using flat, non-namespaced keys for everything (method, queryMs) is a common mistake that leads to collisions between teams and makes cross-service aggregation painful.
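If application code builds fields as nested dictionaries, a small helper can produce the dotted keys described above. This is an illustrative sketch; the function name flatten is hypothetical, and some pipelines prefer to keep nested objects and let the backend handle them.

```python
def flatten(fields, prefix=""):
    """Flatten nested dicts into dot-notation keys."""
    flat = {}
    for key, value in fields.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so {"http": {"method": "GET"}} becomes {"http.method": "GET"}.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

fields = flatten({
    "http": {"method": "GET", "status": 200},
    "db": {"query_ms": 12},
})
```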
Typical namespaces: http.*, db.*, messaging.*, rpc.*, error.*.
Correlation IDs and Request IDs
The highest-leverage practice in structured logging is propagating a unique identifier through every log line that belongs to the same request. Without this, when you see an error in service C, you have no way to find the upstream logs in services A and B that preceded it. With a request_id, you type one value into your log query UI and instantly see the complete narrative of that request across all services.
There are two distinct ID types in production systems, and confusing them causes operational pain:
- request_id: generated at the API gateway or load balancer for every inbound HTTP request. Scoped to a single synchronous call chain. Use this to reconstruct what happened during one HTTP transaction. A good format is a ULID (lexicographically sortable, time-prefixed): req_01HXYZ9ABCDE.
- trace_id: a 128-bit W3C TraceContext ID that spans asynchronous boundaries, queue hops, and multiple services. This is your distributed tracing ID. It is generated once per user-visible operation and propagated via HTTP headers (traceparent) and message queue metadata. Use this to correlate logs with traces in Tempo, Jaeger, or X-Ray.
Both IDs must be injected into your logging context at the framework layer so that every log line emitted during that request carries them automatically, without requiring each developer to remember to pass them manually. In Go, this means using context.Context and a middleware that derives a request-scoped logger, for example with log/slog's logger.With("trace_id", traceID). In Python it is a logging filter; in Node.js it is AsyncLocalStorage. The mechanism differs, but the principle is universal: IDs travel with the request context, not as function arguments.
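The Python variant of this pattern can be sketched with contextvars and a logging filter, both from the standard library. The class and function names (ContextIdFilter, JsonFormatter, build_logger, handle_request), the logger name, and the sample ID values are assumptions for illustration. In a real service, the two set calls at the bottom would live in request middleware.

```python
import contextvars
import io
import json
import logging

# Request-scoped IDs; middleware would set these once per inbound request.
request_id_var = contextvars.ContextVar("request_id", default=None)
trace_id_var = contextvars.ContextVar("trace_id", default=None)

class ContextIdFilter(logging.Filter):
    """Copy the current request's IDs onto every record that passes through."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        record.trace_id = trace_id_var.get()
        return True

class JsonFormatter(logging.Formatter):
    """Render the message and the injected IDs as one JSON object."""
    def format(self, record):
        return json.dumps({
            "message": record.getMessage(),
            "request_id": record.request_id,
            "trace_id": record.trace_id,
        })

def build_logger(stream):
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(ContextIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

def handle_request(logger):
    # Business code logs normally; no ID is passed as an argument.
    logger.info("payment authorized")

stream = io.StringIO()
logger = build_logger(stream)
request_id_var.set("req_01HXYZ9ABCDE")                 # sample value from the text
trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")   # sample 32-hex-digit trace ID
handle_request(logger)
```

Context variables are isolated per asyncio task, and each thread starts with its own context, so concurrent requests do not overwrite each other's IDs.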
What to Log — and What Not To
Every log line costs money and latency. At Google-scale, a single extra field added to every log line can cost hundreds of thousands of dollars annually in storage. The discipline is knowing what to include.
Log at every service boundary: incoming request (method, path, caller identity), outgoing call (destination, method, params summary), and the result (status, duration, error if any). This is the minimum for reconstructing a request's journey. Do not log every internal function call — that belongs in traces, not logs.
Never log secrets or PII — passwords, tokens, credit card numbers, SSNs, full email addresses. Use field-level redaction in your log library or a preprocessing step in your shipper. In your application code: "card_number": "[REDACTED]". Many compliance frameworks (PCI-DSS, GDPR, HIPAA) make this a hard requirement; violations are catastrophic and hard to remediate because log data cannot be unwritten once shipped to a cold storage tier.
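Field-level redaction can be sketched as a function applied to the field dictionary before serialization. The key list and the name redact are assumptions for illustration; a real deployment would maintain its own list. Matching on keys does not catch secrets embedded in free-text messages, so those need separate handling.

```python
# Hypothetical deny-list; extend it to match your own field names.
SENSITIVE_KEYS = {"password", "token", "card_number", "ssn", "email"}

def redact(fields):
    """Return a copy of fields with sensitive values replaced by "[REDACTED]"."""
    clean = {}
    for key, value in fields.items():
        # Match on the last dotted segment so "user.email" is caught too.
        leaf = key.rsplit(".", 1)[-1].lower()
        if leaf in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean

safe = redact({
    "card_number": "4111111111111111",
    "user.email": "someone@example.com",
    "http.status": 402,
    "auth": {"token": "abc123", "scheme": "bearer"},
})
```

Returning a copy rather than mutating the input keeps the original values available to the business logic while guaranteeing that only the redacted version reaches the logger.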
Severity Discipline in Production
In practice, most teams have two problems: info logs that are so verbose they drown out real signals, and error logs that are so overloaded they lose meaning. Establishing a clear contract for each level — enforced in code review — is essential.
- debug: Internal state useful only when actively debugging. Disabled in production by default. Toggled per-service via a dynamic flag.
- info: Normal, meaningful business events: request received, job started, payment authorized. One per top-level operation. Never emit info in a tight loop.
- warn: Recoverable anomalies that require monitoring but not immediate action: retry succeeded after failure, deprecated API used, config value missing with default applied.
- error: An operation failed and the system could not recover on its own. A human may need to act. Every error log should include error.type and error.stack.
- fatal: The process cannot continue. The application exits immediately after emitting this line. Use sparingly; it is not interchangeable with error.
The operational test: if someone wakes up at 3 AM because an error alert fired, every error log must describe something worth waking up for. If it does not, downgrade it to warn. Alert fatigue starts with misclassified log severity.
debug level in production took down their entire logging stack for 20 minutes before a log-volume alert caught it.