Capstone: A Big-Tech Production Platform

Observability Stack

When a production incident fires at 03:00, the difference between a 5-minute resolution and a 2-hour outage is almost always the quality of your observability stack. Metrics tell you something is wrong; traces tell you where; logs tell you why. At big-tech scale these three signals must be architected as a unified platform — not bolted-on tools — with retention policies, cardinality budgets, and SLO-driven alerting wired together before a single service ships to production.

The Three Pillars at Scale

Every observability system is constrained by the same three axes: ingestion throughput, query latency, and retention cost. The architectural choices below are driven by those constraints, not by vendor preferences.

Metrics: Prometheus + Thanos (or Mimir). A single Prometheus instance starts to strain at roughly 1 M active time series on commodity hardware; the exact ceiling depends on memory, series churn, and query load. Beyond that you need either Thanos (sidecar model, stores to object storage, global query layer) or Grafana Mimir (microservices, horizontally scalable). Google itself uses Monarch; at mid-big-tech scale (10k–50k pods) Mimir with an S3 backend and 13-month retention is the current canonical choice. Scrape interval 15 s for infrastructure metrics, 30 s for application metrics; never go lower than 10 s, because shorter intervals multiply sample volume, ingestion load, and storage cost without real signal improvement. (Cardinality is a separate problem: it depends on label combinations, not on scrape frequency.)
Logs: OpenTelemetry Collector → Loki (or OpenSearch). Structured JSON logs only. Every log line carries trace_id, service.name, env, and severity. At >500 GB/day, Loki's chunk store on S3 with a 30-day hot tier and 1-year cold tier costs roughly 70 % less than an Elasticsearch cluster of equivalent query performance. Log sampling is legitimate at >10k req/s per service; sample DEBUG at 1 %, INFO at 10 %, WARN/ERROR at 100 %, and always propagate trace context so sampled logs stay correlated.
Traces: OpenTelemetry SDK → Tempo (or Jaeger). Tail-based sampling is mandatory at scale. Head-based sampling decides at ingress, before the outcome of the request is known, so it discards most slow and errored traces, which are exactly the ones you need. Tail sampling in the OpenTelemetry Collector in front of Tempo 2.x keeps 100 % of error traces, 100 % of traces above a latency threshold (for example the P99), and a configurable percentage of normal traffic. Typical ratio: 1 % baseline sample + 100 % error/latency capture.
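
The severity-based log sampling rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the sampling decision is made in code you control; the names keep_log and KEEP_PER_10K are hypothetical and not part of any OpenTelemetry or Loki API. The decision is derived from a hash of trace_id, so every line of a given trace gets the same answer at a given severity.

```python
import zlib

# Keep-rates from the text, expressed per 10,000 to avoid float rounding:
# DEBUG 1 %, INFO 10 %, WARN/ERROR 100 %.
KEEP_PER_10K = {"DEBUG": 100, "INFO": 1_000, "WARN": 10_000, "ERROR": 10_000}


def keep_log(trace_id: str, severity: str) -> bool:
    """Return True if a log line with this trace_id and severity is kept."""
    # Unknown severities are kept rather than silently dropped.
    threshold = KEEP_PER_10K.get(severity.upper(), 10_000)
    # The same trace_id always hashes to the same bucket in [0, 10000),
    # so the decision is stable across services and processes.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < threshold
```

Because the DEBUG threshold is lower than the INFO threshold, any trace whose DEBUG lines are kept also keeps its INFO, WARN, and ERROR lines, which is what keeps sampled logs correlated.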

Architecture: Signal Flow Diagram

Signal flow: pods emit OTLP traces and logs to the OTel Collector; Prometheus scrapes metrics and remote-writes to Mimir; all three backends persist cold data to object storage; Grafana provides the unified query and alerting surface.

SLO Design and the Error Budget

An SLO without an error budget is just a number. The budget is the operational lever: when it is healthy you ship features; when it is burning you freeze the release pipeline and focus engineering on reliability. The alert hierarchy follows the multi-window burn-rate model from the Google SRE Workbook (chapter "Alerting on SLOs"), which you should treat as a read-only specification at this point:

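Page (P0) alert: burn rate > 14.4× on both the 5-minute and the 1-hour window, held for 1 minute. At this rate 2 % of the 30-day error budget is consumed every hour and the entire budget in about 50 hours (720 h / 14.4). Wake the on-call immediately.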
Ticket (P1) alert: burn rate > 6× for 5 minutes. Budget gone in 5 days. Fix during business hours today.
Burn-rate warning: burn rate > 1× for 1 hour. Budget is shrinking; create a task, no pager needed.
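
The exhaustion times for each tier follow from dividing the SLO window by the burn rate. The short Python sketch below works through that arithmetic for a 30-day window; the function names are illustrative only.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window, as used in this lesson


def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole error budget is gone at a constant burn rate."""
    return SLO_WINDOW_HOURS / burn_rate


def budget_consumed(burn_rate: float, hours: float) -> float:
    """Fraction of the error budget consumed after `hours` at this burn rate."""
    return burn_rate * hours / SLO_WINDOW_HOURS


def error_ratio_threshold(burn_rate: float, slo_target: float) -> float:
    """Error ratio that corresponds to a burn rate for a given SLO target."""
    return burn_rate * (1.0 - slo_target)


if __name__ == "__main__":
    for rate in (14.4, 6.0, 1.0):
        print(f"burn {rate:>4}x: budget gone in {hours_to_exhaustion(rate):.0f} h, "
              f"alert threshold {error_ratio_threshold(rate, 0.999):.4%} errors")
```

For a 99.9 % SLO this gives about 50 hours at 14.4×, 120 hours (5 days) at 6×, and 720 hours (the full window) at 1×. The 14.4 factor itself is the burn rate that spends 2 % of the budget in one hour.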

Cardinality is the silent killer. Every unique label combination in Prometheus creates a new time series. A single label with high cardinality — user_id, request_id, trace_id — can explode a 100k-series Prometheus into 50 M series overnight. Enforce label value cardinality budgets with the Mimir cardinality API and reject metrics at the collector level using the filter processor. At Uber, a single high-cardinality metric from an SDK change caused a $250k/month infrastructure overspend before it was caught by a cardinality alarm.
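
A rough way to see the explosion: the worst-case series count for a metric is the product of the distinct values of each label. The sketch below uses made-up label counts (they are assumptions chosen to reproduce the 100k to 50 M jump, not measurements), and the helper names are hypothetical.

```python
from math import prod


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def labels_over_budget(label_cardinalities: dict[str, int], budget: int) -> list[str]:
    """Labels whose distinct-value count exceeds the per-label budget."""
    return sorted(name for name, n in label_cardinalities.items() if n > budget)


if __name__ == "__main__":
    # Hypothetical label counts for a single HTTP metric.
    before = {"method": 5, "status": 10, "pod": 2_000}
    after = {**before, "user_id": 500}  # one careless label added by an SDK change
    print(worst_case_series(before))   # 100,000 series
    print(worst_case_series(after))    # 50,000,000 series
    print(labels_over_budget(after, budget=100))
```

The product is an upper bound, since not every label combination occurs in practice, but it is the right number to check a proposed label against before it ships.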

Production Prometheus + Alertmanager Config

# prometheus.yml — scrape + remote_write to Mimir
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: prod-us-east-1
    env: production

rule_files:
  - /etc/prometheus/rules/*.yml

remote_write:
  - url: http://mimir-distributor.monitoring.svc:9009/api/v1/push
    queue_config:
      max_samples_per_send: 10000
      capacity: 100000
      max_shards: 30
    metadata_config:
      send: true

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

# rules/slo-payment-api.yml — multi-window burn-rate alerts
groups:
  - name: payment-api-slo
    rules:
      # Availability SLO: 99.9% (error budget: 43.2 min per 30 days)
      - record: job:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="payment-api"}[5m]))

      - alert: PaymentAPIErrorBudgetBurn_Critical
        expr: |
          (
            job:http_errors:rate5m > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{job="payment-api",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="payment-api"}[1h]))
            > (14.4 * 0.001)
          )
        for: 1m
        labels:
          severity: critical
          slo: payment-api-availability
        annotations:
          summary: "Payment API burning error budget at >14.4x rate"
          runbook_url: "https://runbooks.internal/payment-api-5xx"
          description: "5m error ratio {{ $value | humanizePercentage }}; above 14.4x burn, 30-day budget gone in about 50h if sustained"

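      - alert: PaymentAPIErrorBudgetBurn_High
        expr: |
          job:http_errors:rate5m > (6 * 0.001)
          and
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[6h]))
            / sum(rate(http_requests_total{job="payment-api"}[6h]))
            > (6 * 0.001)
        for: 5m
        labels:
          severity: high
          slo: payment-api-availability
        annotations:
          summary: "Payment API burning error budget at >6x rate"
          runbook_url: "https://runbooks.internal/payment-api-5xx"
          description: "5m error ratio {{ $value | humanizePercentage }}; above 6x burn, budget gone in about 5 days"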
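
The multi-window condition in these rules can be restated outside PromQL. The Python sketch below is an illustration only, with hypothetical names (WindowCounts, should_page): an alert fires only when both a short and a long window exceed the burn-rate factor, so a brief spike alone does not page and an old incident stops paging once the short window recovers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts observed over one lookback window."""
    errors: float
    total: float

    @property
    def error_ratio(self) -> float:
        # With no traffic there is nothing to alert on; treat the ratio as 0.
        return self.errors / self.total if self.total else 0.0


def burn_rate(window: WindowCounts, slo_target: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return window.error_ratio / (1.0 - slo_target)


def should_page(short: WindowCounts, long: WindowCounts,
                slo_target: float = 0.999, factor: float = 14.4) -> bool:
    """Fire only if both the short and the long window exceed the factor."""
    return (burn_rate(short, slo_target) > factor
            and burn_rate(long, slo_target) > factor)
```

For example, a 5 % error ratio in the short window with only 0.5 % over the long window does not page at 14.4×, while 5 % short and 2 % long does.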

OpenTelemetry Collector: Tail Sampling Config

# otel-collector-config.yaml — tail-based sampling pipeline
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 200000
    expected_new_traces_per_sec: 5000
    policies:
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 1}

  batch:
    send_batch_size: 10000
    timeout: 5s
    send_batch_max_size: 20000

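  # Drops metric datapoints that carry a user_id attribute. Comparing against
  # nil matters: a missing attribute is nil, and nil != "" would match (and drop)
  # every datapoint. This processor only takes effect once it is listed in a
  # metrics pipeline under service.pipelines.
  filter/drop-high-cardinality:
    metrics:
      datapoint:
        - 'attributes["user_id"] != nil'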

exporters:
  otlp/tempo:
    endpoint: http://tempo.monitoring.svc:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki-gateway.monitoring.svc/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
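
The three tail-sampling policies above combine as a logical OR: a trace is kept if any policy says keep. The Python sketch below restates that decision under simplifying assumptions. TraceSummary and keep_trace are hypothetical names, the baseline uses a random draw for brevity, and exact boundary behaviour of the collector's latency policy is not modelled.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TraceSummary:
    """What is known about a trace once all of its spans have arrived."""
    has_error: bool
    duration_ms: float


def keep_trace(trace: TraceSummary,
               latency_threshold_ms: float = 500.0,
               baseline_pct: float = 1.0,
               rng: Callable[[], float] = random.random) -> bool:
    """Keep a trace if any of the three policies matches."""
    if trace.has_error:                           # error-traces policy
        return True
    if trace.duration_ms > latency_threshold_ms:  # slow-traces policy
        return True
    return rng() * 100.0 < baseline_pct           # baseline-sample policy
```

The decision needs the complete trace, which is why the collector buffers spans for decision_wait before evaluating, and why all spans of a trace must reach the same collector instance.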

SLO Architecture Diagram

SLO hierarchy: SLIs are measured metrics; they define the SLO target and the error budget; burn-rate calculations across multiple windows drive the three-tier alert escalation policy.

Alertmanager Routing and Notification Strategy

Raw Prometheus alerts routed directly to Slack or PagerDuty without grouping create alert fatigue within weeks. The correct pattern is: group by SLO and cluster, inhibit lower-severity alerts when a critical fires on the same service, and deduplicate within a 5-minute group window. The Alertmanager config below encodes that pattern for the capstone platform.

# alertmanager.yml
global:
  resolve_timeout: 5m
  pagerduty_url: https://events.pagerduty.com/v2/enqueue

route:
  group_by: [alertname, cluster, slo]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-warnings
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical
      continue: false
    - matchers:
        - severity="high"
      receiver: slack-oncall
      group_interval: 2m
      repeat_interval: 1h

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="high"
    equal: [cluster, slo]

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <PD_INTEGRATION_KEY>
        description: '{{ .GroupLabels.alertname }} on {{ .GroupLabels.cluster }}'
        details:
          runbook: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          firing: '{{ .Alerts.Firing | len }}'

  - name: slack-oncall
    slack_configs:
      - api_url: <SLACK_WEBHOOK_URL>
        channel: '#oncall-alerts'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'

  - name: slack-warnings
    slack_configs:
      - api_url: <SLACK_WEBHOOK_URL>
        channel: '#platform-noise'
        send_resolved: true
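
The inhibition rule in this config is easy to misread, so here is a small Python restatement of what it is meant to do. It is a sketch of the logic only, not Alertmanager code, and is_inhibited is a hypothetical helper: a high-severity alert is muted while a critical alert with the same cluster and slo labels is firing.

```python
def is_inhibited(target: dict[str, str],
                 firing: list[dict[str, str]],
                 equal: tuple[str, ...] = ("cluster", "slo")) -> bool:
    """Return True if `target` should be muted by a firing critical alert."""
    if target.get("severity") != "high":
        return False  # the rule only targets high-severity alerts
    for source in firing:
        if source.get("severity") != "critical":
            continue  # only critical alerts act as inhibition sources
        # Every label listed in `equal` must match; a label missing on both
        # sides counts as equal.
        if all(source.get(label) == target.get(label) for label in equal):
            return True
    return False
```

A critical alert for a different slo or a different cluster leaves the high-severity alert untouched, which is why the external cluster label and the slo label on every rule matter.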

Runbook-first alerting. Every production alert must have a runbook_url annotation pointing to a live, maintained runbook before it can fire in production. Alerts without runbooks are disabled. This is the single highest-leverage reliability practice: the engineer who gets paged at 03:00 needs the first three diagnostic commands, the expected failure modes, and the rollback procedure — not the source code. Codify this as a CI check in your alerting rule repository.
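
A CI check for this policy can be very small. The sketch below assumes the rule file has already been parsed into a Python dict by whatever YAML loader your pipeline uses; alerts_missing_runbook is a hypothetical helper, and it only checks that a well-formed http(s) URL is present, not that the runbook is live or maintained.

```python
from urllib.parse import urlparse


def alerts_missing_runbook(rule_doc: dict) -> list[str]:
    """Names of alerting rules without a usable runbook_url annotation."""
    missing = []
    for group in rule_doc.get("groups") or []:
        for rule in group.get("rules") or []:
            if "alert" not in rule:
                continue  # recording rules carry no annotations
            annotations = rule.get("annotations") or {}
            parsed = urlparse(str(annotations.get("runbook_url", "")))
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                missing.append(rule["alert"])
    return missing
```

In CI, fail the build whenever the returned list is non-empty and print the alert names so the author knows which runbooks to write.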

Grafana OnCall vs. PagerDuty: If you run Grafana OnCall on-cluster, a cluster-wide outage silences your own on-call system. Always route P0 (critical) alerts through an out-of-band channel — PagerDuty, OpsGenie, or a separate region's Alertmanager. Never let a single blast radius take out both the failing system and its incident notification path. This exact failure mode took down a major e-commerce platform for 4 hours during a Black Friday incident in 2023.

Retention, Cost, and Operational Hygiene

Observability infrastructure typically runs at 8–15 % of total cloud spend for companies that instrument thoroughly. Keeping that figure sustainable requires deliberate cost engineering:

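Metrics: 13-month retention in Mimir (covers year-over-year capacity comparisons). Raw resolution for 7 days; 5-minute downsampling for 30 days; 1-hour downsampling beyond that. These resolution tiers match what the Thanos compactor produces; Mimir stores a single resolution, so on Mimir the coarser tiers have to come from recording rules. Downsampling reduces storage for old data by roughly 40×, provided the finer-resolution data is actually deleted when its retention ends.
Logs: 30-day hot tier in Loki (frequent access); 1-year cold tier in S3 Glacier with a 24h restore SLA. Enforce retention per namespace through Loki's compactor-based retention (retention_period, with per-stream retention_stream overrides in limits_config): debug logs from a batch job should not cost the same as payment service error logs.
Traces: Tempo with S3 backend and a 14-day default retention for full traces, adjusted by trace class: error traces 90 days, slow traces 30 days, normal baseline 7 days. This asymmetry reflects how investigations actually work: nobody needs a normal 200 ms trace from 6 weeks ago.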
Dashboards: Standardize on RED dashboards (Rate, Errors, Duration) for every service. The first Grafana view any on-call engineer opens should answer: is this service healthy right now? Sprawling 40-panel dashboards with no clear hierarchy slow incident response.
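
To reason about the metric resolution tiers, it helps to count points per series. The Python sketch below does only that arithmetic, assuming a 15 s raw scrape interval; the function names are illustrative. Actual storage savings are smaller than the point-count ratio, because a downsampled point may store several aggregates and compression behaves differently at each resolution.

```python
SECONDS_PER_DAY = 86_400


def points_per_series_per_day(resolution_seconds: int) -> int:
    """Number of stored points per series per day at a given resolution."""
    return SECONDS_PER_DAY // resolution_seconds


def point_reduction(raw_seconds: int, coarse_seconds: int) -> float:
    """How many raw points collapse into one point at the coarser resolution."""
    return coarse_seconds / raw_seconds


if __name__ == "__main__":
    for label, res in (("raw 15s", 15), ("5m", 300), ("1h", 3_600)):
        print(f"{label:>8}: {points_per_series_per_day(res)} points/series/day")
```

At 15 s a series produces 5,760 points per day, at 5 minutes 288, and at 1 hour 24, so the coarser tiers hold 20× and 240× fewer points respectively.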

By the time this observability stack is fully operational, every service in the capstone platform is instrumented, every SLO has a burn-rate alert wired to PagerDuty, and the platform team can answer the three incident questions — what broke, where it broke, and why — within a 5-minute MTTD target.