Capstone: A Big-Tech Production Platform

Observability Stack

18 min Lesson 6 of 30

When a production incident fires at 03:00, the difference between a 5-minute resolution and a 2-hour outage is almost always the quality of your observability stack. Metrics tell you something is wrong; traces tell you where; logs tell you why. At big-tech scale these three signals must be architected as a unified platform — not bolted-on tools — with retention policies, cardinality budgets, and SLO-driven alerting wired together before a single service ships to production.

The Three Pillars at Scale

Every observability system is constrained by the same three axes: ingestion throughput, query latency, and retention cost. The architectural choices below are driven by those constraints, not by vendor preferences.

  • Metrics: Prometheus + Thanos (or Mimir). A single Prometheus instance becomes a bottleneck at roughly 1 M active time series on commodity hardware, with memory usually the first limit. Beyond that you need either Thanos (sidecar model, blocks shipped to object storage, global query layer) or Grafana Mimir (microservices, horizontally scalable). Google runs Monarch internally; at mid-big-tech scale (10k–50k pods), Mimir with an S3 backend and 13-month retention is a common choice. Scrape every 15 s for infrastructure metrics and every 30 s for application metrics; never go below 10 s, because shorter intervals multiply sample volume and ingestion cost without real signal improvement.
  • Logs: OpenTelemetry Collector → Loki (or OpenSearch). Structured JSON logs only. Every log line carries trace_id, service.name, env, and severity. At >500 GB/day, Loki's chunk store on S3 with a 30-day hot tier and 1-year cold tier costs roughly 70 % less than an Elasticsearch cluster of equivalent query performance. Log sampling is legitimate above 10k req/s per service: sample DEBUG at 1 %, INFO at 10 %, and WARN/ERROR at 100 %, and always propagate trace context so sampled logs stay correlated.
  • Traces: OpenTelemetry SDK → Tempo (or Jaeger). Tail-based sampling is mandatory at scale. Head-based sampling decides at ingress, before the outcome of the request is known, so it discards most slow and errored requests along with the normal ones, and those are exactly the traces you need. Tail sampling in the OTel Collector, placed in front of Tempo 2.x, keeps 100 % of error traces, 100 % of traces above a latency threshold (set near the service's P99), and a configurable baseline of normal traffic. Typical ratio: 1 % baseline + 100 % error/latency capture.
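The per-severity log sampling rates above only keep logs correlated if the keep/drop decision is made per trace rather than per line. A minimal sketch of that idea follows, assuming a hash of trace_id is used as the sampling key; the names SAMPLE_RATES, trace_bucket, and keep_log are illustrative and do not come from any library.

```python
import hashlib

# Sampling rates from the text: DEBUG 1 %, INFO 10 %, WARN/ERROR 100 %.
SAMPLE_RATES = {"DEBUG": 0.01, "INFO": 0.10, "WARN": 1.0, "ERROR": 1.0}


def trace_bucket(trace_id: str) -> float:
    """Map a trace_id to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


def keep_log(severity: str, trace_id: str) -> bool:
    """Keep or drop a log line; unknown severities are always kept."""
    rate = SAMPLE_RATES.get(severity.upper(), 1.0)
    return trace_bucket(trace_id) < rate


if __name__ == "__main__":
    for sev in ("DEBUG", "INFO", "ERROR"):
        print(sev, keep_log(sev, "4bf92f3577b34da6a3ce929d0e0e4736"))
```

Because every line of a trace shares one bucket value, any trace whose DEBUG lines survive also keeps its INFO lines, so a sampled request never loses part of its story.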

Architecture: Signal Flow Diagram

[Diagram: Observability stack signal flow. App Pods (OTel SDK) → OTel Collector (traces, metrics, and logs pipelines; tail sampling processor; batch / retry) → Tempo (traces), Mimir (metrics), Loki (logs) → S3 / GCS long-term store, with Grafana dashboards and Alertmanager on top. Edge labels: OTLP/gRPC, Prometheus scrape /metrics, remote_write, HTTP.]
Signal flow: pods emit OTLP traces and logs to the OTel Collector; Prometheus scrapes metrics and remote-writes to Mimir; all three backends persist cold data to object storage; Grafana provides the unified query and alerting surface.

SLO Design and the Error Budget

An SLO without an error budget is just a number. The budget is the operational lever: when it is healthy you ship features; when it is burning you freeze the release pipeline and focus engineering on reliability. The alert hierarchy follows the burn-rate model from the Google SRE Workbook (the chapter on alerting on SLOs), which you should treat as a fixed specification at this point:

  • Page (P0) alert: burn rate > 14.4× for 1 minute. At this rate 2 % of the 30-day error budget is consumed every hour and the whole budget is gone in about 50 hours. Wake the on-call immediately.
  • Ticket (P1) alert: burn rate > 6× for 5 minutes. Budget gone in 5 days. Fix during business hours today.
  • Burn-rate warning: burn rate > 1× for 1 hour. Budget is shrinking; create a task, no pager needed.
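The arithmetic behind these thresholds is short enough to check by hand. The sketch below assumes a 30-day SLO window and the three tiers listed above; the function names are illustrative.

```python
from typing import Optional

WINDOW_DAYS = 30  # SLO window assumed throughout this lesson


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = WINDOW_DAYS) -> float:
    """Hours until the whole budget is spent at a constant burn rate."""
    return window_days * 24 / rate


def budget_consumed(rate: float, elapsed_hours: float,
                    window_days: int = WINDOW_DAYS) -> float:
    """Fraction of the budget spent after elapsed_hours at a constant rate."""
    return rate * elapsed_hours / (window_days * 24)


def alert_tier(rate: float) -> Optional[str]:
    """Map a burn rate to the tiers listed above."""
    if rate > 14.4:
        return "page"
    if rate > 6:
        return "ticket"
    if rate > 1:
        return "warning"
    return None


if __name__ == "__main__":
    for r in (14.4, 6.0, 1.0):
        print(f"{r:>5}x -> {hours_to_exhaustion(r):.1f} h to exhaustion")
```

A burn rate of 1× spends the budget in exactly the 30-day window, 6× in 5 days, and 14.4× in roughly 50 hours, which is why the highest tier pages immediately.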
Cardinality is the silent killer. Every unique label combination in Prometheus creates a new time series. A single label with high cardinality (user_id, request_id, trace_id) can explode a 100k-series Prometheus into 50 M series overnight. Track label-value cardinality with the Mimir cardinality API, cap it with per-tenant series limits, and drop offending metrics at the collector level using the filter processor. At Uber, a single high-cardinality metric from an SDK change caused a $250k/month infrastructure overspend before it was caught by a cardinality alarm.
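The growth is multiplicative, which is why one label does so much damage. A rough sketch of the worst-case estimate follows, assuming every label combination actually occurs; the label names and the per-label budget of 100 values are illustrative assumptions.

```python
from math import prod
from typing import Dict, List


def estimated_series(label_cardinalities: Dict[str, int]) -> int:
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def labels_over_budget(label_cardinalities: Dict[str, int],
                       max_values_per_label: int = 100) -> List[str]:
    """Labels whose distinct-value count exceeds the (assumed) budget."""
    return sorted(name for name, count in label_cardinalities.items()
                  if count > max_values_per_label)


if __name__ == "__main__":
    base = {"method": 5, "status": 10, "pod": 2000}
    print(estimated_series(base))                       # 100000
    print(estimated_series({**base, "user_id": 500}))   # 50000000
    print(labels_over_budget({**base, "user_id": 500}))
```

With only 500 distinct users, the same metric goes from 100k to 50 M potential series; real counts are lower when combinations are sparse, but the budget check flags the label before that matters.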

Production Prometheus + Alertmanager Config

# prometheus.yml: scrape + remote_write to Mimir
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: prod-us-east-1
    env: production

rule_files:
  - /etc/prometheus/rules/*.yml

remote_write:
  - url: http://mimir-distributor.monitoring.svc:9009/api/v1/push
    queue_config:
      max_samples_per_send: 10000
      capacity: 100000
      max_shards: 30
    metadata_config:
      send: true

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor a custom metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

# rules/slo-payment-api.yml: multi-window burn-rate alerts
groups:
  - name: payment-api-slo
    rules:
      # Availability SLO: 99.9% (error budget: 43.8 min/month)
      - record: job:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="payment-api"}[5m]))

      # Fires only when both the 5m and the 1h error ratio exceed 14.4x the budget rate.
      - alert: PaymentAPIErrorBudgetBurn_Critical
        expr: |
          (
            job:http_errors:rate5m > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{job="payment-api",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="payment-api"}[1h]))
            > (14.4 * 0.001)
          )
        for: 1m
        labels:
          severity: critical
          slo: payment-api-availability
        annotations:
          summary: "Payment API burning error budget at >14.4x rate"
          runbook_url: "https://runbooks.internal/payment-api-5xx"
          # $value is the 5m error ratio (left-hand side of the "and"), not the burn rate.
          description: "5m error ratio {{ $value | humanizePercentage }}; at 14.4x the 30-day budget lasts about 50h"

      - alert: PaymentAPIErrorBudgetBurn_High
        expr: |
          job:http_errors:rate5m > (6 * 0.001)
        for: 5m
        labels:
          severity: high
          slo: payment-api-availability
        annotations:
          summary: "Payment API burning error budget at >6x rate"
          runbook_url: "https://runbooks.internal/payment-api-5xx"

OpenTelemetry Collector: Tail Sampling Config

# otel-collector-config.yaml: tail-based sampling pipeline
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  tail_sampling:
    decision_wait: 10s                 # buffer spans this long before deciding
    num_traces: 200000                 # traces held in memory while waiting
    expected_new_traces_per_sec: 5000
    policies:                          # a trace is kept if any policy matches
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
  batch:
    send_batch_size: 10000
    timeout: 5s
    send_batch_max_size: 20000
  # Drops datapoints that carry a user_id attribute. A missing attribute is nil,
  # and nil != "" evaluates to true, so comparing against "" would drop everything.
  # Not referenced by any pipeline below; attach it to a metrics pipeline if
  # metrics pass through the Collector.
  filter/drop-high-cardinality:
    metrics:
      datapoint:
        - 'attributes["user_id"] != nil'

exporters:
  otlp/tempo:
    endpoint: http://tempo.monitoring.svc:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki-gateway.monitoring.svc/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
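To make the policy evaluation concrete, here is a small model of the decision the three policies encode once all spans of a trace have been buffered. It is a sketch of the logic only, not the Collector's implementation; the Span type, the hash-based baseline, and the function names are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str
    start_ms: float
    end_ms: float
    is_error: bool = False


def _bucket(trace_id: str) -> float:
    """Stable value in [0, 1) derived from the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def sampling_decision(spans: List[Span],
                      latency_threshold_ms: float = 500.0,
                      baseline_pct: float = 1.0) -> Optional[str]:
    """Return the name of the first matching policy, or None to drop the trace."""
    if not spans:
        return None
    if any(s.is_error for s in spans):
        return "error-traces"
    # Trace duration: earliest span start to latest span end.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > latency_threshold_ms:
        return "slow-traces"
    if _bucket(spans[0].trace_id) * 100.0 < baseline_pct:
        return "baseline-sample"
    return None


if __name__ == "__main__":
    slow = [Span("t1", 0, 120), Span("t1", 100, 900)]
    print(sampling_decision(slow))  # slow-traces
```

The decision needs the whole trace, which is why the processor waits for decision_wait before evaluating and why all spans of a trace must reach the same Collector instance.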

SLO Architecture Diagram

[Diagram: SLO and alerting hierarchy. SLI (availability, latency p99, error rate) → SLO (availability ≥ 99.9 %, p99 ≤ 200 ms, error ≤ 0.1 %, 30-day window) → Error Budget (43.8 min / month, remaining %) → Burn Rate (multiwindow calculation over 1m + 5m + 60m) → Alert Tiers (P0: rate > 14.4×, P1: rate > 6×, Warn: rate > 1×). Edge labels: measured, defines, drives.]
SLO hierarchy: SLIs are measured metrics; they define the SLO target and the error budget; burn-rate calculations across multiple windows drive the three-tier alert escalation policy.

Alertmanager Routing and Notification Strategy

Raw Prometheus alerts routed directly to Slack or PagerDuty without grouping create alert fatigue within weeks. The correct pattern is: group by SLO and cluster, inhibit lower-severity alerts when a critical fires on the same service, and deduplicate within a 5-minute group window. The Alertmanager config below encodes that pattern for the capstone platform.

# alertmanager.yml
global:
  resolve_timeout: 5m
  pagerduty_url: https://events.pagerduty.com/v2/enqueue

route:
  group_by: [alertname, cluster, slo]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-warnings
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: false
    - match:
        severity: high
      receiver: slack-oncall
      group_interval: 2m
      repeat_interval: 1h

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: [cluster, slo]

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <PD_INTEGRATION_KEY>
        description: '{{ .GroupLabels.alertname }} on {{ .GroupLabels.cluster }}'
        details:
          runbook: '{{ (index .Alerts 0).Annotations.runbook_url }}'
          firing: '{{ .Alerts.Firing | len }}'
  - name: slack-oncall
    slack_configs:
      - api_url: <SLACK_WEBHOOK_URL>
        channel: '#oncall-alerts'
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: slack-warnings
    slack_configs:
      - api_url: <SLACK_WEBHOOK_URL>
        channel: '#platform-noise'
        send_resolved: true
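The grouping and inhibition behavior can be reasoned about with a small model. The sketch below mimics the effect of the group_by and inhibit_rules settings on a list of alert label sets; it is an illustration of the semantics under simplifying assumptions (no timing, no routing), and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Alert = Dict[str, str]


def inhibit(alerts: List[Alert], source: str = "critical", target: str = "high",
            equal: Sequence[str] = ("cluster", "slo")) -> List[Alert]:
    """Drop target-severity alerts that share the 'equal' labels with a firing source alert."""
    def key(alert: Alert) -> Tuple[str, ...]:
        return tuple(alert.get(label, "") for label in equal)

    firing_sources = {key(a) for a in alerts if a.get("severity") == source}
    return [a for a in alerts
            if not (a.get("severity") == target and key(a) in firing_sources)]


def group(alerts: List[Alert],
          group_by: Sequence[str] = ("alertname", "cluster", "slo")
          ) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts into notification groups by the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(label, "") for label in group_by)].append(alert)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"alertname": "Burn_Critical", "severity": "critical",
         "cluster": "prod-us-east-1", "slo": "payment-api-availability"},
        {"alertname": "Burn_High", "severity": "high",
         "cluster": "prod-us-east-1", "slo": "payment-api-availability"},
    ]
    print([a["alertname"] for a in inhibit(alerts)])  # ['Burn_Critical']
```

When the critical alert fires for the same cluster and SLO, the high-severity alert is suppressed, so the on-call gets one page instead of a page plus a Slack thread describing the same incident.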
Runbook-first alerting. Every production alert must have a runbook_url annotation pointing to a live, maintained runbook before it can fire in production. Alerts without runbooks are disabled. This is the single highest-leverage reliability practice: the engineer who gets paged at 03:00 needs the first three diagnostic commands, the expected failure modes, and the rollback procedure — not the source code. Codify this as a CI check in your alerting rule repository.
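One possible shape for that CI check is sketched below. It assumes the rule files have already been parsed into Python dictionaries (for example with a YAML parser, which is not shown) and only verifies that each alerting rule has an http(s) runbook_url; checking that the link is live would be a separate step. The function name is hypothetical.

```python
from typing import Any, Dict, List


def alerts_missing_runbook(rule_file: Dict[str, Any]) -> List[str]:
    """Return 'group/alert' names of alerting rules without a usable runbook_url."""
    missing: List[str] = []
    for rule_group in rule_file.get("groups", []):
        for rule in rule_group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules do not need runbooks
            url = (rule.get("annotations") or {}).get("runbook_url", "")
            if not str(url).startswith(("http://", "https://")):
                missing.append(f"{rule_group.get('name', '?')}/{rule['alert']}")
    return missing


if __name__ == "__main__":
    parsed = {
        "groups": [{
            "name": "payment-api-slo",
            "rules": [
                {"record": "job:http_errors:rate5m", "expr": "..."},
                {"alert": "WithRunbook",
                 "annotations": {"runbook_url": "https://runbooks.internal/payment-api-5xx"}},
                {"alert": "WithoutRunbook", "annotations": {"summary": "no link"}},
            ],
        }]
    }
    problems = alerts_missing_runbook(parsed)
    print(problems)  # ['payment-api-slo/WithoutRunbook']
    # In CI, a non-empty list would fail the build, e.g. raise SystemExit(1).
```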
Grafana OnCall vs. PagerDuty: If you run Grafana OnCall on-cluster, a cluster-wide outage silences your own on-call system. Always route P0 (critical) alerts through an out-of-band channel — PagerDuty, OpsGenie, or a separate region's Alertmanager. Never let a single blast radius take out both the failing system and its incident notification path. This exact failure mode took down a major e-commerce platform for 4 hours during a Black Friday incident in 2023.

Retention, Cost, and Operational Hygiene

Observability infrastructure typically runs at 8–15 % of total cloud spend for companies that instrument thoroughly. Keeping that figure sustainable requires deliberate cost engineering:

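  • Metrics: 13-month retention in Mimir (covers year-over-year capacity comparisons). Where the backend supports downsampling (the Thanos compactor produces 5-minute and 1-hour resolutions; check whether your Mimir version does before planning around it), keep raw resolution for 7 days, 5-minute resolution for 30 days, and 1-hour resolution beyond that. Relative to a 15 s scrape, 5-minute resolution holds 20× fewer points per series and 1-hour resolution 240× fewer.
  • Logs: 30-day hot tier in Loki (frequent access); 1-year cold tier in S3 Glacier with a 24h restore SLA. Enforce retention per namespace through the Loki compactor (retention_period plus per-stream retention_stream overrides in limits_config); debug logs from a batch job should not cost the same as payment service error logs.
  • Traces: Tempo with S3 backend, 14-day default retention, overridden by class: error traces 90 days, slow traces 30 days, normal baseline 7 days. Tempo applies retention per tenant, so one way to get this split is to route each class to its own tenant. This asymmetry reflects how investigations actually work: nobody needs a normal 200ms trace from 6 weeks ago.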
  • Dashboards: Standardize on RED dashboards (Rate, Error, Duration) for every service. The first Grafana view any on-call engineer opens should answer: is this service healthy right now? Sprawling 40-panel dashboards with no clear hierarchy slow incident response.
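The effect of resolution on stored volume is simple arithmetic. The sketch below counts points per series for a given retention and resolution, assuming a 15 s raw scrape interval and treating 13 months as roughly 395 days; it counts samples only and ignores compression and any per-point aggregates a backend may store.

```python
SECONDS_PER_DAY = 86_400


def samples_per_series(days: int, resolution_s: int) -> int:
    """Number of points one series holds over 'days' at the given resolution."""
    return days * SECONDS_PER_DAY // resolution_s


def reduction_factor(raw_resolution_s: int, downsampled_resolution_s: int) -> float:
    """How many times fewer points the downsampled resolution holds."""
    return downsampled_resolution_s / raw_resolution_s


if __name__ == "__main__":
    print(samples_per_series(7, 15))      # raw tier: 40320
    print(samples_per_series(30, 300))    # 5-minute tier: 8640
    print(samples_per_series(395, 3600))  # 1-hour tier: 9480
    print(reduction_factor(15, 300), reduction_factor(15, 3600))  # 20.0 240.0
```

Seven days of raw data hold more points per series than thirteen months at 1-hour resolution, which is why the raw tier dominates metrics storage cost.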

By the time this observability stack is fully operational, every service in the capstone platform is instrumented, every SLO has a burn-rate alert wired to PagerDuty, and the platform team can answer the three incident questions — what broke, where it broke, and why — within a 5-minute MTTD target.