Logging at Scale: ELK & Loki

Retention, Cost & Compliance

At small scale, logging feels free. At production scale — millions of containers, hundreds of services, thousands of requests per second — logging becomes one of the largest line items in your infrastructure budget. A busy microservices platform can generate 10 TB of raw logs per day. Storing all of it forever in hot Elasticsearch is financial suicide. Keeping none of it invites regulatory fines and makes post-mortems impossible. The discipline of retention engineering is knowing exactly what to keep, for how long, at what fidelity, and at what cost.

Tiered Retention: The Three-Zone Model

Large production logging platforms commonly implement a three-tier architecture, modeled after storage tiering. Each zone differs in cost, query speed, and retention duration.

Three-tier log retention: hot (fast, expensive) to warm (balanced) to cold (cheap, slow). Move logs automatically based on age and access patterns.

In Elasticsearch, tier movement is automated via Index Lifecycle Management (ILM). In Loki, the closest equivalent is retention enforced by the compactor (retention_period under limits_config) combined with object storage lifecycle policies. The key insight is that you almost never need to query logs older than 30 days at sub-second latency: those queries are for audits and forensics, where waiting 30 seconds is acceptable.

# Elasticsearch ILM policy: hot -> warm -> cold -> delete

PUT _ilm/policy/logs-standard
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "s3-logs-archive"
          },
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}

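The searchable_snapshot action in the cold phase is key: Elasticsearch mounts the index from the snapshot in S3 instead of keeping primary and replica copies on cluster disks. In the cold phase the index is fully mounted, so cold nodes still hold one local copy, but replicas are no longer needed because the snapshot acts as the backup. Placing the same action in a frozen phase produces a partially mounted index, where only a local cache lives on disk and most data is read from S3 on demand. That is where you pay S3 prices (~$0.023/GB/mo) instead of EBS prices (~$0.10/GB/mo) for the bulk of the data; queries still work, they are just slower. This is how teams cut their Elasticsearch storage bill by as much as 60-80% without losing queryability. Note that searchable snapshots require an Enterprise-level Elastic subscription.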
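To make the economics concrete, here is a rough Python sketch of the phase boundaries in the logs-standard policy and the storage bill they imply. The daily volume of 1,000 GB and all function names are hypothetical. The per-GB prices are the approximate figures quoted above, and the model assumes the oldest tier is billed at object-storage rates while ignoring replicas, compression, request fees, and node costs. ILM measures min_age from rollover, so real boundaries shift by up to one rollover period.

```python
# Phase start ages (in days) taken from the logs-standard policy.
PHASE_START_DAYS = [("hot", 0), ("warm", 7), ("cold", 30), ("delete", 365)]


def phase_for_age(age_days: float) -> str:
    """Return the ILM phase an index of the given age falls into."""
    current = "hot"
    for name, start in PHASE_START_DAYS:
        if age_days >= start:
            current = name
    return current


def steady_state_gb(daily_gb: float) -> dict:
    """GB held in each tier once the pipeline has run for a full retention cycle."""
    sizes = {}
    for (name, start), (_, end) in zip(PHASE_START_DAYS, PHASE_START_DAYS[1:]):
        sizes[name] = daily_gb * (end - start)
    return sizes


def monthly_cost(sizes: dict, price_per_gb: dict) -> float:
    """Sum of tier size times the per-GB monthly price assumed for that tier."""
    return sum(sizes[tier] * price_per_gb[tier] for tier in sizes)


sizes = steady_state_gb(daily_gb=1000)  # hypothetical ingest volume
all_block = monthly_cost(sizes, {"hot": 0.10, "warm": 0.10, "cold": 0.10})
tiered = monthly_cost(sizes, {"hot": 0.10, "warm": 0.10, "cold": 0.023})
savings = 1 - tiered / all_block
print(f"all block storage: ${all_block:,.0f}/mo, tiered: ${tiered:,.0f}/mo, savings: {savings:.0%}")
```

Under these assumptions the cold tier holds 335 of the 365 retained days, which is why repricing that one tier brings the storage bill down by roughly 71%.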

Sampling Noisy Logs

Not every log line has equal value. A healthy payment API that logs every successful transaction at INFO level produces enormous volume with near-zero diagnostic value. Log sampling is the practice of keeping only a statistical fraction of repetitive, low-value log lines while retaining 100% of high-signal logs (errors, warnings, slow requests, security events).

There are two mainstream sampling strategies:

Head-based sampling: The decision is made at the start of the request, for example keep 1 in 100 requests regardless of outcome. Simple to implement in the shipper or SDK, but you may drop the 1-in-a-million request that caused a production bug.
Tail-based sampling: Buffer the request's telemetry, wait for the outcome, then decide: always keep errors and slow requests, sample the rest. It costs more memory and CPU at the collector, because every in-flight request is buffered until the decision is made, but it is far more selective. This approach is widely used in large-scale tracing pipelines.
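The difference between the two strategies can be sketched in a few lines of Python. This is an illustration only: the Request type, the field names, and the hash-bucket approach are assumptions, and the head-based variant shown here hashes the request ID (rather than drawing a random number) so that every service reaches the same decision for the same request.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    status: int         # HTTP status code
    duration_ms: float


def _bucket(request_id: str, buckets: int = 10_000) -> int:
    """Map a request ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def head_sample(request_id: str, rate: float = 0.01) -> bool:
    """Head-based: decide from the ID alone, before the outcome is known."""
    return _bucket(request_id) < rate * 10_000


def tail_sample(req: Request, rate: float = 0.01, slow_ms: float = 2000) -> bool:
    """Tail-based: always keep errors and slow requests, sample the rest."""
    if req.status >= 500 or req.duration_ms >= slow_ms:
        return True
    return head_sample(req.request_id, rate)


failed = Request(request_id="req-123", status=503, duration_ms=40)
print(tail_sample(failed))  # errors are kept regardless of the sampling rate
```

The tail-based function needs the status and duration, which is exactly why the collector has to buffer until the request finishes.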

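# OpenTelemetry Collector: tail-based sampling config (otel-collector-config.yaml)

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # tail_sampling works on traces only: attach it to a traces pipeline
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: always-keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: always-keep-slow
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample-healthy-traffic
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

  # For pure log-level sampling without tracing context:
  # first mark everything except INFO-level nginx logs as always-keep
  transform/mark-keep:
    log_statements:
      - context: log
        statements:
          - set(attributes["sampling.priority"], 100) where not (attributes["log.level"] == "INFO" and resource.attributes["service.name"] == "nginx")

  # then keep roughly 1% of the unmarked records (priority >= 100 is always kept)
  probabilistic_sampler/noisy-info:
    sampling_percentage: 1
    attribute_source: record
    # Assumption: every record carries a unique request_id attribute;
    # check how your collector version treats records without it
    from_attribute: request_id
    sampling_priority: sampling.priority

exporters:
  # Loki 3.x ingests OTLP natively; the dedicated loki exporter is deprecated
  # in recent collector-contrib releases
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [transform/mark-keep, probabilistic_sampler/noisy-info]
      exporters: [otlphttp/loki]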

Always maintain a sampling metadata label on your log streams (e.g., sampling_rate="0.01"). When you query sampled data and want to extrapolate true counts, divide by the sampling rate. Without this label, your dashboards will silently undercount and nobody will know why.
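As a small worked example of that extrapolation, the sketch below scales observed counts back up using the sampling_rate label. The function names and the shape of the input (a list of label/count pairs) are assumptions for illustration, not a Loki API.

```python
def estimate_true_count(observed: int, sampling_rate: float) -> float:
    """Scale an observed count back to an estimate of the unsampled count."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return observed / sampling_rate


def estimate_total(streams) -> float:
    """Sum estimates across streams that were sampled at different rates."""
    return sum(
        estimate_true_count(count, float(labels["sampling_rate"]))
        for labels, count in streams
    )


streams = [
    ({"service": "nginx", "sampling_rate": "0.01"}, 420),  # 1% sampled INFO logs
    ({"service": "nginx", "sampling_rate": "1"}, 37),      # errors, kept in full
]
print(estimate_total(streams))
```

Mixing the two streams without the label would report 457 requests instead of an estimated 42,037, which is the silent undercount described above.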

PII Redaction: Protecting User Data in Logs

Logs are a privacy minefield. Engineers frequently log HTTP request bodies, headers, or database query results that contain personally identifiable information (PII) — email addresses, phone numbers, credit card numbers, session tokens, passwords. Under GDPR, PCI-DSS, and HIPAA, storing unredacted PII in your logging platform is a compliance violation that can result in regulatory fines. The correct approach is to redact at the point of collection, before the data ever reaches your storage layer.

PII redaction should be enforced in the collector pipeline rather than left to the application alone, because you cannot trust every developer to sanitize every log call correctly. A defense-in-depth model treats the collector as the last line of defense.

# OpenTelemetry Collector: PII redaction transforms

processors:
  transform/redact-pii:
    log_statements:
      - context: log
        statements:
          # Redact email addresses
          - replace_pattern(attributes["message"], "([a-zA-Z0-9._%+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,})", "[EMAIL REDACTED]")
          # Redact credit card numbers (13-19 digits with optional separators)
          # Caution: this also matches any other 13-19 digit run, such as epoch-millisecond timestamps
          - replace_pattern(attributes["message"], "\\b(?:\\d[ \\-]?){13,19}\\b", "[CARD REDACTED]")
          # Redact Bearer tokens from Authorization headers
          - replace_pattern(attributes["http.request.header.authorization"], "Bearer\\s+[A-Za-z0-9\\-._~+\\/]+=*", "Bearer [TOKEN REDACTED]")
          # Drop high-risk key names entirely (defense-in-depth)
          - delete_matching_keys(attributes, "password|secret|token|ssn|dob")

# Promtail equivalent (pipeline_stages section); Grafana Alloy expresses the same stages as stage.replace blocks inside loki.process:
pipeline_stages:
  - replace:
      expression: '([a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,})'
      replace: '[EMAIL REDACTED]'
  - replace:
      expression: '(\b\d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{4}\b)'
      replace: '[CARD REDACTED]'
  - replace:
      # Only the captured group is replaced, so capture the token itself
      expression: 'Authorization: Bearer (\S+)'
      replace: '[TOKEN REDACTED]'

Regex-based redaction is not foolproof. It will miss obfuscated or encoded PII (base64-encoded emails, URL-encoded phone numbers). For high-assurance compliance requirements (HIPAA covered entities, PCI Level 1), combine regex redaction with a dedicated DLP (Data Loss Prevention) service such as Google Cloud DLP, Amazon Macie, or the open-source Microsoft Presidio, as a second pass before data lands in cold storage. Regex alone is a good first layer; it is not a compliance guarantee.
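The limitation is easy to demonstrate. The following Python sketch applies the same three patterns used in the collector config to a plain log line and to a line carrying a base64-encoded email. The sample log lines and the redact function are made up for illustration.

```python
import base64
import re

# Same patterns as the collector config, written as Python raw strings.
EMAIL = re.compile(r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}")
CARD = re.compile(r"\b(?:\d[ \-]?){13,19}\b")
BEARER = re.compile(r"Bearer\s+[A-Za-z0-9\-._~+/]+=*")


def redact(line: str) -> str:
    """Apply each redaction pattern in turn and return the sanitized line."""
    line = EMAIL.sub("[EMAIL REDACTED]", line)
    line = CARD.sub("[CARD REDACTED]", line)
    line = BEARER.sub("Bearer [TOKEN REDACTED]", line)
    return line


plain = "login ok user=alice@example.com card=4111 1111 1111 1111"
encoded = "payload=" + base64.b64encode(b"alice@example.com").decode()

print(redact(plain))    # email and card number are replaced
print(redact(encoded))  # unchanged: the encoded email contains no '@' to match
```

The second line passes through untouched, which is the gap a DLP second pass is meant to close.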

Compliance Frameworks and What They Actually Require

Different regulations have different log retention mandates. As a DevOps engineer operating in regulated environments, you need to know these numbers without looking them up:

PCI-DSS v4.0 (Requirement 10.5): Audit logs must be retained for at least 12 months, with the most recent 3 months immediately available. Logs must be write-protected against modification.
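HIPAA: The Security Rule's six-year documentation retention requirement is commonly applied to audit logs of PHI access, so plan for 6 years. Encryption at rest is an addressable safeguard: implement it, or document why an equivalent measure is in place. Access logs must record who accessed what and when.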
SOC 2 Type II: No mandated duration, but auditors typically expect 12 months of logs covering the audit period with tamper evidence to prove log integrity.
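GDPR: No minimum retention requirement, but the storage limitation principle sets a maximum: you cannot keep logs containing personal data longer than the purpose requires, and the right to erasure means individual records may have to be deleted on request. Implement a deletion workflow for your log stores, not just your databases.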
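One way to keep these numbers honest is to check a retention policy against them in code. The sketch below is a minimal, hypothetical checker: the Requirement type and function names are invented for illustration, the day counts approximate the durations listed above, and what counts as "immediately available" is ultimately an assessor's judgment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Requirement:
    min_total_days: Optional[int]      # None means no mandated minimum
    min_immediate_days: Optional[int]  # days that must be immediately queryable


# Approximations: 12 months ~ 365 days, 3 months ~ 90 days, 6 years ~ 2190 days.
REQUIREMENTS = {
    "PCI-DSS": Requirement(min_total_days=365, min_immediate_days=90),
    "HIPAA": Requirement(min_total_days=6 * 365, min_immediate_days=None),
    "SOC2": Requirement(min_total_days=365, min_immediate_days=None),  # auditor expectation
    "GDPR": Requirement(min_total_days=None, min_immediate_days=None),
}


def retention_gaps(total_days: int, immediate_days: int, frameworks: List[str]) -> List[str]:
    """Return one finding for every minimum the policy falls short of."""
    findings = []
    for name in frameworks:
        req = REQUIREMENTS[name]
        if req.min_total_days is not None and total_days < req.min_total_days:
            findings.append(f"{name}: needs {req.min_total_days}d total, policy keeps {total_days}d")
        if req.min_immediate_days is not None and immediate_days < req.min_immediate_days:
            findings.append(
                f"{name}: needs {req.min_immediate_days}d immediately available, "
                f"policy has {immediate_days}d"
            )
    return findings


# logs-standard keeps 365 days; conservatively count only hot + warm (30 days) as immediate.
for finding in retention_gaps(total_days=365, immediate_days=30, frameworks=["PCI-DSS", "HIPAA"]):
    print(finding)
```

Run against the logs-standard policy with that conservative reading, the check flags the PCI-DSS three-month availability window and the HIPAA six-year total.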

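Implementing S3 Object Lock in compliance mode, a WORM (Write Once Read Many) control, or the GCS equivalent (a locked bucket retention policy) helps satisfy the tamper-evidence requirement for PCI and SOC 2. Once a log object is written and locked in compliance mode, not even the account's root user can delete or overwrite it before the retention period expires. Governance mode is weaker: users granted a bypass permission can still remove the lock.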

Cost Optimization Levers in Practice

When your logging bill is too high, work through these levers in order — ranked by effort-to-impact ratio:

Raise log level thresholds in production. Switching from DEBUG to INFO in production typically reduces volume by 50-80% with near-zero engineering effort.
Drop high-volume, low-value log classes at the shipper. Health-check endpoints, Kubernetes liveness probes, and static asset requests are the top offenders — filter them before they reach your backend.
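Compress aggressively. Loki compresses chunks by default (gzip unless you change chunk_encoding; Snappy is a common choice that trades compression ratio for speed). Elasticsearch's best_compression codec (DEFLATE) typically saves 20-40% versus the default LZ4 codec. Enable it on all warm and cold indices, for example through the index_codec option of the ILM forcemerge action.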
Right-size your hot tier. Instrument how often on-call engineers query logs older than 7 days during incidents. If the answer is rarely, shrink the hot window from 30 days to 7 days.
Disable dynamic field mapping in Elasticsearch. Every new field you log becomes a mapped field by default. Disable dynamic mapping and explicitly map only fields you query on — unmapped fields are still stored in _source but do not consume high-cardinality term memory.
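The second lever, dropping low-value log classes at the shipper, comes down to a predicate like the one sketched below. The record field names, the list of health-check paths, and the static-asset suffixes are assumptions to adapt to your own services; the kube-probe/ prefix is the User-Agent that Kubernetes sends on HTTP probes.

```python
from urllib.parse import urlsplit

HEALTH_PATHS = {"/healthz", "/livez", "/readyz", "/health"}  # hypothetical list
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".svg", ".ico", ".woff2")


def should_drop(record: dict) -> bool:
    """Return True for access-log records that carry little diagnostic value."""
    if record.get("status", 0) >= 400:
        return False  # never drop client or server errors
    if record.get("user_agent", "").startswith("kube-probe/"):
        return True   # Kubernetes liveness/readiness probes
    path = urlsplit(record.get("path", "")).path
    if path in HEALTH_PATHS:
        return True
    return path.lower().endswith(STATIC_SUFFIXES)


records = [
    {"path": "/healthz", "status": 200, "user_agent": "kube-probe/1.29"},
    {"path": "/static/app.js?v=3", "status": 200, "user_agent": "Mozilla/5.0"},
    {"path": "/api/orders", "status": 200, "user_agent": "Mozilla/5.0"},
    {"path": "/healthz", "status": 503, "user_agent": "kube-probe/1.29"},
]
kept = [r for r in records if not should_drop(r)]
print(len(kept))
```

Checking the status first is a deliberate choice: a failing health check is exactly the record you want to keep.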

Build a Log Retention Policy document and version-control it alongside your IaC. It should specify: per-service log tier assignments, retention durations by tier, PII classification for each service, and the audit cadence for verifying that redaction is working. Auditors will ask for this document — having it ready is the difference between a clean audit and a finding.

Retention engineering is not a one-time setup task. Build a monthly review cadence: check your top 10 highest-volume log sources, verify redaction coverage, and reconcile actual retention costs against your targets. Logging platforms drift toward expensive over time as teams add new services and nobody removes old noisy sources.
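The first item of that review, finding the highest-volume sources, can start as simply as the sketch below. It assumes you can export (service, bytes) pairs from your shipper or backend metrics; the function name and sample data are hypothetical.

```python
from collections import Counter


def top_sources(records, n=10):
    """Return the n services with the most ingested bytes, largest first."""
    totals = Counter()
    for service, size_bytes in records:
        totals[service] += size_bytes
    return totals.most_common(n)


sample = [("nginx", 500), ("checkout", 120), ("nginx", 700), ("auth", 300)]
for service, total in top_sources(sample, n=3):
    print(f"{service}: {total} bytes")
```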