Logging at Scale: ELK & Loki

Structured Logging Standards

Before you can ship logs to Elasticsearch, query them in Loki, or alert on them in Grafana, you need to decide what a log line actually is. That decision, made ad hoc and inconsistently across fifty services, is the reason most logging pipelines become graveyards of unqueryable noise. This lesson establishes the production standards that companies like Stripe, Cloudflare, and GitHub use to make their logs a reliable source of truth at hundreds of thousands of events per second.

Why Unstructured Logs Fail at Scale

A classic unstructured log line looks like this: 2024-03-12 14:23:01 ERROR checkout failed for user 42 after 3201ms. A human can read it. A machine cannot reliably parse it. To extract the user ID, the duration, or the error type from a million such lines, you must write fragile regex — and the moment one team changes their log format, the regex breaks and your dashboard silently returns zeros.

Structured logging replaces free-text messages with machine-readable key/value documents — almost always JSON. Every field is a named, typed attribute. The log above becomes a JSON object where user_id, duration_ms, and level are first-class fields you can index, filter, and aggregate without any parsing logic at query time. This is not just a developer convenience — it is the architectural prerequisite for centralized logging to work.

Key idea: Structure is not about JSON for its own sake. It is about making every meaningful dimension of a log event a queryable field at write time, so you never need to parse text at query time. Parsing at query time is expensive, brittle, and impossible to do retroactively on historical data.
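To make the contrast concrete, here is a minimal sketch that emits the checkout failure as a structured event and reads fields back without any regex. It assumes Go's standard log/slog package (Go 1.21+), which the lesson does not prescribe; the function names emitCheckoutFailure and decodeFields are illustrative.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// emitCheckoutFailure writes one structured event and returns the raw JSON line.
func emitCheckoutFailure() string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	logger.Error("checkout failed",
		slog.String("user_id", "usr_42"),
		slog.Int("duration_ms", 3201),
	)
	return buf.String()
}

// decodeFields turns a JSON log line into named fields; no regex is involved.
func decodeFields(line string) (map[string]any, error) {
	var fields map[string]any
	err := json.Unmarshal([]byte(line), &fields)
	return fields, err
}

func main() {
	fields, err := decodeFields(emitCheckoutFailure())
	if err != nil {
		panic(err)
	}
	// JSON numbers decode as float64, so duration_ms can be compared or summed directly.
	fmt.Println(fields["user_id"], fields["duration_ms"])
}
```

Note that slog's default keys are time, level, and msg, with uppercase level names; aligning them with a shared schema is a separate configuration step.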

The Canonical JSON Log Schema

There is no single official standard, but the industry has converged on a small set of mandatory base fields that every log line must carry, regardless of which service or language emits it. These fields are what allow your log aggregator to merge streams from a Go API, a Python worker, and a Node.js service into a coherent, jointly-queryable corpus.

{
  "timestamp":   "2024-03-12T14:23:01.847Z",
  "level":       "error",
  "service":     "checkout-api",
  "version":     "v2.14.3",
  "env":         "production",
  "host":        "pod-checkout-api-6f4d9b-xk7p2",
  "trace_id":    "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id":     "00f067aa0ba902b7",
  "request_id":  "req_01HXYZ9ABCDE",
  "user_id":     "usr_42",
  "message":     "payment gateway timeout",
  "duration_ms": 3201,
  "http.method": "POST",
  "http.path":   "/v1/checkout",
  "http.status": 504,
  "error.type":  "GatewayTimeout",
  "error.stack": "checkout.go:214 ..."
}

Let us walk through the most important field groups and the reasoning behind each one.

Field Conventions: The Why Behind Every Key

Timestamp. Always ISO 8601 in UTC with millisecond precision: 2024-03-12T14:23:01.847Z. Never use Unix epoch integers in your application logs — they are unreadable without a converter, and every log UI will re-parse the ISO string anyway. The timezone suffix Z is mandatory; logs without it create ambiguity when services span regions.
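As a small illustration of the timestamp rule, the sketch below converts a time to UTC before formatting so the suffix is always Z. The layout constant and the helper names formatTimestamp and sampleLocalTime are illustrative, not part of any library.

```go
package main

import (
	"fmt"
	"time"
)

// iso8601Millis is a Go time layout for ISO 8601 with millisecond precision.
const iso8601Millis = "2006-01-02T15:04:05.000Z07:00"

// formatTimestamp converts to UTC first, so the zone suffix is always "Z".
func formatTimestamp(t time.Time) string {
	return t.UTC().Format(iso8601Millis)
}

// sampleLocalTime returns a wall-clock time recorded in a UTC+1 zone.
func sampleLocalTime() time.Time {
	zone := time.FixedZone("UTC+1", 3600)
	return time.Date(2024, 3, 12, 15, 23, 1, 847_000_000, zone)
}

func main() {
	// The local 15:23 reading is serialized as 14:23 UTC.
	fmt.Println(formatTimestamp(sampleLocalTime()))
}
```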

Level. Use exactly these five, lowercase: debug, info, warn, error, fatal. Do not invent critical, severe, or INFORMATION. Mixed case and synonym proliferation break log-level filters across your aggregator. Production services should emit info and above by default; debug is toggled dynamically via a feature flag or environment variable, never compiled in permanently.
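One way to enforce the closed set of levels is a small validator at the edge of your logging wrapper. This is a sketch under the assumption that levels arrive as strings (for example from configuration); normalizeLevel and allowedLevels are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedLevels is the closed set of severities named in the lesson.
var allowedLevels = map[string]bool{
	"debug": true, "info": true, "warn": true, "error": true, "fatal": true,
}

// normalizeLevel trims and lowercases a level name and rejects anything outside the set.
func normalizeLevel(raw string) (string, error) {
	level := strings.ToLower(strings.TrimSpace(raw))
	if !allowedLevels[level] {
		return "", fmt.Errorf("unsupported log level %q", raw)
	}
	return level, nil
}

func main() {
	for _, raw := range []string{"WARN", "Error", "critical"} {
		level, err := normalizeLevel(raw)
		fmt.Println(raw, "->", level, err)
	}
}
```

Rejecting unknown names, rather than silently mapping them, keeps synonyms such as critical from creeping back in.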

Service, version, env. These three together identify what emitted the log. service should match your Kubernetes Service name exactly — that allows automatic correlation between metrics and logs without a mapping table. version should be the full semver tag or git SHA, so you can correlate a degradation with a specific deployment. env must be one of development, staging, production — it prevents staging noise from polluting production dashboards.
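The identity fields are easiest to keep consistent when they are stamped once on a base logger. The following is a sketch using log/slog (an assumption; any structured logger with a similar hook would do). The helper names newBaseLogger, renameBuiltins, and sampleLine are illustrative.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"strings"
)

// renameBuiltins maps slog's default keys (time, level, msg) onto the lesson's schema.
func renameBuiltins(groups []string, a slog.Attr) slog.Attr {
	if len(groups) > 0 {
		return a // only touch top-level attributes
	}
	switch {
	case a.Key == slog.TimeKey && a.Value.Kind() == slog.KindTime:
		ts := a.Value.Time().UTC().Format("2006-01-02T15:04:05.000Z07:00")
		return slog.String("timestamp", ts)
	case a.Key == slog.LevelKey:
		if lvl, ok := a.Value.Any().(slog.Level); ok {
			return slog.String("level", strings.ToLower(lvl.String()))
		}
	case a.Key == slog.MessageKey:
		a.Key = "message"
	}
	return a
}

// newBaseLogger returns a JSON logger that stamps service, version, and env on every line.
func newBaseLogger(w io.Writer, service, version, env string) (*slog.Logger, error) {
	switch env {
	case "development", "staging", "production":
	default:
		return nil, fmt.Errorf("invalid env %q", env)
	}
	h := slog.NewJSONHandler(w, &slog.HandlerOptions{ReplaceAttr: renameBuiltins})
	return slog.New(h).With(
		slog.String("service", service),
		slog.String("version", version),
		slog.String("env", env),
	), nil
}

// sampleLine emits one event through the base logger and decodes it for inspection.
func sampleLine() (map[string]any, error) {
	var buf bytes.Buffer
	logger, err := newBaseLogger(&buf, "checkout-api", "v2.14.3", "production")
	if err != nil {
		return nil, err
	}
	logger.Error("payment gateway timeout", slog.Int("duration_ms", 3201))
	var fields map[string]any
	err = json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := sampleLine()
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["level"], fields["service"], fields["message"])
}
```

In a real service the three identity values would more likely come from build metadata or environment variables than from literals.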

Namespaced field keys. Use dot notation for domain grouping: http.method, http.status, db.query_ms, error.type, error.stack. This follows the namespacing style of the OpenTelemetry Semantic Conventions, so your log fields stay close to span attribute names when you add distributed tracing. Check the current semconv release for exact names, which are often more specific (for example http.request.method and http.response.status_code). Non-namespaced flat keys for everything (method, queryMs) are a common mistake that leads to collisions between teams and makes cross-service aggregation painful.

Pro practice: Adopt the OpenTelemetry Semantic Conventions (semconv) for your field names, even if you are not yet using OTel for tracing. They are the emerging industry lingua franca. When you later add traces, your log fields map onto span attributes with little or no migration cost. The key namespaces to internalize: http.*, db.*, messaging.*, rpc.*, error.*.

Correlation IDs and Request IDs

The highest-leverage practice in structured logging is propagating a unique identifier through every log line that belongs to the same request. Without this, when you see an error in service C, you have no way to find the upstream logs in services A and B that preceded it. With a request_id, you type one value into your log query UI and instantly see the complete narrative of that request across all services.

There are two distinct ID types in production systems, and confusing them causes operational pain:

request_id — generated at the API gateway or load balancer for every inbound HTTP request. Scoped to a single synchronous call chain. Use this to reconstruct what happened during one HTTP transaction. A good format is a ULID (lexicographically sortable, time-prefixed): req_01HXYZ9ABCDE.
trace_id — a 128-bit W3C TraceContext ID that spans asynchronous boundaries, queue hops, and multiple services. This is your distributed tracing ID. It is generated once per user-visible operation and propagated via HTTP headers (traceparent) and message queue metadata. Use this to correlate logs with traces in Tempo, Jaeger, or X-Ray.
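For services that are not yet instrumented with a tracing SDK, the trace_id can still be read from the incoming traceparent header. The sketch below handles only version 00 of the W3C Trace Context format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags); parseTraceparent and isLowerHex are illustrative helpers, and a tracing library would normally do this for you.

```go
package main

import (
	"fmt"
	"strings"
)

// isLowerHex reports whether s has exactly n characters, all lowercase hex digits.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')) {
			return false
		}
	}
	return true
}

// parseTraceparent extracts trace_id and span_id from a version-00 traceparent header.
func parseTraceparent(header string) (traceID, spanID string, ok bool) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", "", false
	}
	if !isLowerHex(parts[1], 32) || !isLowerHex(parts[2], 16) || !isLowerHex(parts[3], 2) {
		return "", "", false
	}
	// All-zero IDs are defined as invalid by the specification.
	if parts[1] == strings.Repeat("0", 32) || parts[2] == strings.Repeat("0", 16) {
		return "", "", false
	}
	return parts[1], parts[2], true
}

func main() {
	traceID, spanID, ok := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(traceID, spanID, ok)
}
```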

Both IDs must be injected into your logging context at the framework layer so that every log line emitted during that request carries them automatically, without requiring each developer to remember to pass them manually. In Go, this means using context.Context and a middleware that attaches the IDs to a request-scoped logger and stores that logger in the context (with zerolog: build it with log.With().Str("trace_id", traceID).Logger(), then call logger.WithContext(ctx)). In Python it is typically a logging filter backed by contextvars; in Node.js it is AsyncLocalStorage. The mechanism differs, but the principle is universal: IDs travel with the request context, not as function arguments.

A single trace_id and request_id propagated across the API Gateway, Order Service, and Payment Service lets you reconstruct the full request narrative in your log backend with one query.
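The principle that IDs travel with the context rather than as function arguments can be sketched from the point of view of downstream code. This example uses only the standard library (log/slog and context), which is an assumption rather than the lesson's chosen stack; withLogger, loggerFrom, chargeCard, and simulateRequest are hypothetical names.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// loggerKey is an unexported type so no other package can collide with this context key.
type loggerKey struct{}

// withLogger stores a request-scoped logger in the context.
func withLogger(ctx context.Context, l *slog.Logger) context.Context {
	return context.WithValue(ctx, loggerKey{}, l)
}

// loggerFrom returns the request-scoped logger, or the default logger if none was set.
func loggerFrom(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}

// chargeCard stands in for business logic deep in the call stack.
// It receives only ctx and never handles the IDs itself.
func chargeCard(ctx context.Context) {
	loggerFrom(ctx).Info("payment authorized")
}

// simulateRequest plays the role of the middleware: it binds the IDs once,
// runs the handler logic, and returns the decoded log line.
func simulateRequest(requestID, traceID string) (map[string]any, error) {
	var buf bytes.Buffer
	base := slog.New(slog.NewJSONHandler(&buf, nil))
	scoped := base.With(
		slog.String("request_id", requestID),
		slog.String("trace_id", traceID),
	)
	chargeCard(withLogger(context.Background(), scoped))

	var fields map[string]any
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := simulateRequest("req_01HXYZ9ABCDE", "4bf92f3577b34da6a3ce929d0e0e4736")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["msg"], fields["request_id"], fields["trace_id"])
}
```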

What to Log — and What Not To

Every log line costs money and latency. At Google-scale, a single extra field added to every log line can cost hundreds of thousands of dollars annually in storage. The discipline is knowing what to include.

Log at every service boundary: incoming request (method, path, caller identity), outgoing call (destination, method, params summary), and the result (status, duration, error if any). This is the minimum for reconstructing a request's journey. Do not log every internal function call — that belongs in traces, not logs.
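For the outgoing side of a service boundary, one possible approach in Go is to wrap the HTTP client's transport so every call produces a single log line with destination, method, status, and duration. This is a sketch, not a prescribed pattern: loggingTransport and demoOutgoingCall are illustrative, the destination key is a hypothetical field name, and the demo talks to a local test server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"time"
)

// loggingTransport wraps an http.RoundTripper and logs one line per outgoing call.
type loggingTransport struct {
	next   http.RoundTripper
	logger *slog.Logger
}

func (t loggingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)
	logger := t.logger.With(
		slog.String("http.method", req.Method),
		slog.String("http.path", req.URL.Path),
		slog.String("destination", req.URL.Host), // hypothetical field name
		slog.Int64("duration_ms", time.Since(start).Milliseconds()),
	)
	if err != nil {
		logger.Error("outgoing call failed", slog.String("error.type", fmt.Sprintf("%T", err)))
		return nil, err
	}
	logger.Info("outgoing call completed", slog.Int("http.status", resp.StatusCode))
	return resp, nil
}

// demoOutgoingCall issues one request to a local test server and returns the decoded log line.
func demoOutgoingCall() (map[string]any, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	var buf bytes.Buffer
	client := &http.Client{Transport: loggingTransport{
		next:   http.DefaultTransport,
		logger: slog.New(slog.NewJSONHandler(&buf, nil)),
	}}
	resp, err := client.Get(srv.URL + "/v1/charge")
	if err != nil {
		return nil, err
	}
	resp.Body.Close()

	var fields map[string]any
	err = json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := demoOutgoingCall()
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["http.method"], fields["http.path"], fields["http.status"])
}
```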

Never log secrets or PII: passwords, tokens, credit card numbers, SSNs, full email addresses. Use field-level redaction in your log library or a preprocessing step in your shipper, so that a sensitive field reaches the output only as "card_number": "[REDACTED]". Many compliance frameworks (PCI-DSS, GDPR, HIPAA) make this a hard requirement; violations are catastrophic and hard to remediate because log data cannot easily be unwritten once shipped to a cold storage tier.
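Field-level redaction in the log library might look like the following sketch, which uses the ReplaceAttr hook of log/slog (an assumption about the library in use). The key list in sensitiveKeys is illustrative and would need to match your own field names; logPayment is a hypothetical caller.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// sensitiveKeys lists field names whose values must never leave the process (illustrative).
var sensitiveKeys = map[string]bool{
	"password": true, "token": true, "card_number": true, "ssn": true, "email": true,
}

// redact is a slog ReplaceAttr hook that masks the value of any sensitive key.
func redact(_ []string, a slog.Attr) slog.Attr {
	if sensitiveKeys[a.Key] {
		return slog.String(a.Key, "[REDACTED]")
	}
	return a
}

// logPayment emits one event that includes a card number and returns the decoded line.
func logPayment(cardNumber string) (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, &slog.HandlerOptions{ReplaceAttr: redact}))
	logger.Info("payment submitted",
		slog.String("card_number", cardNumber),
		slog.String("user_id", "usr_42"),
	)
	var fields map[string]any
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := logPayment("4242424242424242")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["card_number"], fields["user_id"])
}
```

Key-based masking only protects named fields. A secret interpolated into the message string would pass through untouched, which is one more reason to keep variable data out of message.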

Production pitfall: A common mistake is logging entire HTTP request bodies for debugging. At low traffic this seems harmless. At production scale it: (1) leaks PII and secrets, (2) balloons your log volume 10-100x and your storage costs with it, (3) saturates your log shipper under load, causing log loss at precisely the worst moment — during an incident. Log request metadata (path, headers subset, content-length), never the body. If you need bodies for debugging, use distributed tracing with sampling, not logs.
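Logging metadata instead of bodies can be sketched as an allow-list: only named headers are ever copied, and the body is never read. The header list, the http.content_length and http.header.* key names, and the helper names below are assumptions for illustration.

```go
package main

import (
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"strings"
)

// loggedHeaders is an allow-list; any header not named here is never logged.
var loggedHeaders = []string{"User-Agent", "Content-Type", "X-Request-Id"}

// requestMetadata returns safe-to-log attributes. It never reads r.Body.
func requestMetadata(r *http.Request) []slog.Attr {
	attrs := []slog.Attr{
		slog.String("http.method", r.Method),
		slog.String("http.path", r.URL.Path),
		slog.Int64("http.content_length", r.ContentLength), // hypothetical key name
	}
	for _, name := range loggedHeaders {
		if v := r.Header.Get(name); v != "" {
			attrs = append(attrs, slog.String("http.header."+strings.ToLower(name), v))
		}
	}
	return attrs
}

// sampleMetadata builds a request that carries secrets in its body and headers,
// then returns the attributes that would actually be logged.
func sampleMetadata() map[string]string {
	body := strings.NewReader(`{"card_number":"4242424242424242"}`)
	r := httptest.NewRequest(http.MethodPost, "/v1/checkout", body)
	r.Header.Set("Content-Type", "application/json")
	r.Header.Set("Authorization", "Bearer secret-token")

	out := map[string]string{}
	for _, a := range requestMetadata(r) {
		out[a.Key] = a.Value.String()
	}
	return out
}

func main() {
	for k, v := range sampleMetadata() {
		fmt.Println(k, "=", v)
	}
}
```

An allow-list fails safe: a new sensitive header added by another team stays out of the logs until someone deliberately opts it in.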

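// Go: structured logging with zerolog + correlation IDs from context.
// The middleware stores a logger carrying trace_id, span_id, and request_id in the request context.

package middleware

import (
    "net/http"
    "time"

    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
    "go.opentelemetry.io/otel/trace"
)

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
    http.ResponseWriter
    status int
}

// WriteHeader records the status before passing it to the real ResponseWriter.
func (rec *statusRecorder) WriteHeader(code int) {
    rec.status = code
    rec.ResponseWriter.WriteHeader(code)
}

func RequestLogger(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Assumes tracing middleware earlier in the chain has put a span in the context.
        span := trace.SpanFromContext(r.Context())
        traceID := span.SpanContext().TraceID().String()
        spanID := span.SpanContext().SpanID().String()

        // Attach identity and correlation fields once; every log call that
        // retrieves this logger via zerolog.Ctx(ctx) inherits them.
        // service, version, and env are hardcoded here for brevity; in practice
        // they would come from build metadata or configuration.
        logger := log.With().
            Str("service", "checkout-api").
            Str("version", "v2.14.3").
            Str("env", "production").
            Str("trace_id", traceID).
            Str("span_id", spanID).
            Str("request_id", r.Header.Get("X-Request-Id")).
            Str("http.method", r.Method).
            Str("http.path", r.URL.Path).
            Logger()

        ctx := logger.WithContext(r.Context())
        start := time.Now()

        // Default to 200: handlers that never call WriteHeader send 200 implicitly.
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rec, r.WithContext(ctx))

        zerolog.Ctx(ctx).Info().
            Int("http.status", rec.status).
            Int64("duration_ms", time.Since(start).Milliseconds()).
            Msg("request completed")
    })
}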

Severity Discipline in Production

In practice, most teams have two problems: info logs that are so verbose they drown out real signals, and error logs that are so overloaded they lose meaning. Establishing a clear contract for each level — enforced in code review — is essential.

debug — Internal state useful only when actively debugging. Disabled in production by default. Toggled per-service via a dynamic flag.
info — Normal, meaningful business events: request received, job started, payment authorized. One per top-level operation. Never emit info in a tight loop.
warn — Recoverable anomalies that require monitoring but not immediate action: retry succeeded after failure, deprecated API used, config value missing with default applied.
error — An operation failed and the system could not recover on its own. A human may need to act. Every error log should include error.type and error.stack.
fatal — The process cannot continue. The application exits immediately after emitting this line. Use sparingly; it is not interchangeable with error.
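The "disabled by default, toggled dynamically" contract for debug can be sketched with slog.LevelVar, whose level can be changed at runtime without rebuilding the logger. How the toggle is triggered (feature flag, signal, admin endpoint) is left out; demoToggle is an illustrative name.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// demoToggle emits one debug and one info line before and after enabling debug,
// and returns how many lines were actually written each time.
func demoToggle() (before, after int) {
	var buf bytes.Buffer
	var level slog.LevelVar // the zero value corresponds to info
	logger := slog.New(slog.NewJSONHandler(&buf, &slog.HandlerOptions{Level: &level}))

	logger.Debug("cache miss", slog.String("key", "cart:42"))
	logger.Info("job started")
	before = bytes.Count(buf.Bytes(), []byte("\n"))

	// A flag or admin endpoint would call Set in a real service.
	level.Set(slog.LevelDebug)
	buf.Reset()
	logger.Debug("cache miss", slog.String("key", "cart:42"))
	logger.Info("job started")
	after = bytes.Count(buf.Bytes(), []byte("\n"))
	return before, after
}

func main() {
	before, after := demoToggle()
	fmt.Println("lines at info:", before, "lines at debug:", after)
}
```

Note that log/slog defines only debug, info, warn, and error; a fatal level that exits the process has to be added by the application.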

The operational test: if someone wakes up at 3 AM because an error alert fired, every error log must describe something worth waking up for. If it does not, downgrade it to warn. Alert fatigue starts with misclassified log severity.

Pro practice: Implement a "log budget" per endpoint in your observability platform. In Datadog and Grafana Cloud you can set log-based alerts that fire when a single service emits more than N lines per minute — an uncontrolled logging loop is a denial-of-service attack on your own pipeline. At Shopify, a single merchant-triggered job that logged at debug level in production took down their entire logging stack for 20 minutes before a log-volume alert caught it.
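The platform-side alert is the primary control, but the budget idea can also be approximated inside the process. The sketch below is a deliberately simple fixed-window counter, not how Datadog or Grafana Cloud implement their alerts; budget, record, and simulateBurst are hypothetical names, and the type is not safe for concurrent use without a mutex.

```go
package main

import (
	"fmt"
	"time"
)

// budget counts log lines in fixed one-minute windows.
type budget struct {
	limit       int
	windowStart time.Time
	count       int
}

// record registers one log line at time now and reports whether the
// current window is still within budget.
func (b *budget) record(now time.Time) bool {
	if now.Sub(b.windowStart) >= time.Minute {
		// Start a new window; this also handles the very first call.
		b.windowStart = now
		b.count = 0
	}
	b.count++
	return b.count <= b.limit
}

// simulateBurst feeds n lines, one millisecond apart, into a budget
// and returns how many of them exceeded the limit.
func simulateBurst(limit, n int) int {
	b := &budget{limit: limit}
	start := time.Date(2024, 3, 12, 14, 23, 0, 0, time.UTC)
	over := 0
	for i := 0; i < n; i++ {
		if !b.record(start.Add(time.Duration(i) * time.Millisecond)) {
			over++
		}
	}
	return over
}

func main() {
	fmt.Println("lines over budget:", simulateBurst(100, 250))
}
```

What to do with over-budget lines (drop, sample, or raise a single warn) is a policy decision the sketch leaves open.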