Observability Cost & Data Management
At Google scale, observability infrastructure costs more than many companies' entire engineering budgets. Even at mid-size companies on managed observability platforms — Datadog, New Relic, Honeycomb — teams routinely receive bills of $500K–$2M per year and discover the costs only after they explode. Understanding why telemetry is expensive and what levers you can pull to control costs without sacrificing signal quality is a first-class engineering discipline, not a finance concern.
This lesson covers the three mechanisms that dominate observability cost: cardinality (the number of unique metric time-series), sampling (retaining a statistically useful fraction of traces and logs), and retention tiers (storing data at different cost/fidelity trade-offs over time). Teams that master these three can often cut bills by 60–80% while keeping every alert, dashboard, and incident investigation working.
Cardinality Explosions
Every Prometheus metric is identified by its name plus a set of label key-value pairs. The total number of distinct label combinations is the metric's cardinality. Prometheus stores each unique combination as a separate time-series, each requiring in-memory head chunks and on-disk blocks. Cardinality is the single biggest cost driver in metrics systems.
A seemingly harmless label decision compounds fast. A metric with labels region (5), http_method (6), status_code (12), and route (200 normalised paths) has 5 × 6 × 12 × 200 = 72,000 series. Add one more label with 50 values and you're at 3.6 million series. Add user_id with 10 million values and Prometheus runs out of RAM and crashes — taking your alerting system down exactly when you need it most.
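The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `series_count` and the label name `extra` are illustrative, and the result is a worst-case upper bound, since not every label combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time-series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Label sizes taken from the example in the text.
labels = {"region": 5, "http_method": 6, "status_code": 12, "route": 200}

base = series_count(labels)
# "extra" is a placeholder name for the additional 50-value label.
with_extra = series_count({**labels, "extra": 50})

print(base)        # 72000
print(with_extra)  # 3600000
```

Running the same calculation on a proposed label set before merging a change is a cheap way to spot a multiplication that will hurt later.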
Never use unbounded identifiers as metric labels: user_id, order_id, request_id, email, IP address, or any UUID. Each new entity creates a new time-series, so after a traffic spike or signup campaign Prometheus can exhaust its memory and be OOM-killed within minutes. The fix is cultural and tooling-enforced: label values must come from a bounded, known-size set. High-cardinality values belong in trace span attributes and structured log fields, never in metric labels.
Detecting cardinality problems before they hit production requires active monitoring of your metrics pipeline itself. Prometheus exposes prometheus_tsdb_head_series (current active series count) and prometheus_tsdb_head_series_created_total (creation rate). Alert when these grow faster than your service count warrants.
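One way to read "faster than your service count warrants" is to compare the growth ratio of active series with the growth ratio of deployed services over the same window. The sketch below assumes the two series counts come from prometheus_tsdb_head_series samples taken at the start and end of the window; the function name and the `tolerance` factor are hypothetical choices, not a Prometheus feature.

```python
def series_growth_alert(
    prev_series: int,
    cur_series: int,
    prev_services: int,
    cur_services: int,
    tolerance: float = 1.5,  # hypothetical headroom factor; tune for your environment
) -> bool:
    """Return True when series grew noticeably faster than the service count."""
    if prev_series <= 0 or prev_services <= 0:
        raise ValueError("previous counts must be positive")
    series_ratio = cur_series / prev_series
    service_ratio = cur_services / prev_services
    return series_ratio > service_ratio * tolerance


# Series quadrupled while services grew 10%: worth an alert.
print(series_growth_alert(100_000, 400_000, 20, 22))  # True
# Series and services both grew 10%: expected growth.
print(series_growth_alert(100_000, 110_000, 20, 22))  # False
```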
The fix for existing cardinality problems is to relabel at the scrape layer before data enters TSDB, using Prometheus metric_relabel_configs. You can drop entire metrics, drop specific labels, or replace label values with normalised buckets — all without touching application code.
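The three relabelling actions (drop a label, bucket a value, normalise a path) are easier to reason about when written as a plain function. The following is a Python illustration of the transformation only, not Prometheus configuration; the label names in `DROP_LABELS` and the `:id` placeholder are assumptions chosen for the example.

```python
import re

# Hypothetical unbounded labels that should never reach the TSDB.
DROP_LABELS = {"user_id", "request_id"}

# Matches a path segment that is all digits or looks like a UUID.
_ID_SEGMENT = re.compile(r"/(?:\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})(?=/|$)")


def normalise_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a bounded-cardinality copy of a label set."""
    out = {k: v for k, v in labels.items() if k not in DROP_LABELS}
    if out.get("status_code"):
        # Bucket 200, 201, 204 ... into "2xx", and so on.
        out["status_code"] = out["status_code"][0] + "xx"
    if "route" in out:
        # Replace numeric and UUID path segments with a fixed placeholder.
        out["route"] = _ID_SEGMENT.sub("/:id", out["route"])
    return out


print(normalise_labels(
    {"route": "/orders/12345/items", "status_code": "503", "user_id": "u-1"}
))
# {'route': '/orders/:id/items', 'status_code': '5xx'}
```

In a real deployment the same rules would be expressed as relabelling rules in the scrape configuration, so that the application code stays untouched.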
Sampling: Keeping What Matters
Distributed traces and logs are verbose by nature. A service handling 1,000 requests per second serves 86.4 million requests per day, and each request produces at least one trace span, usually several. Storing all of them costs an enormous amount and delivers diminishing returns: the 999 successful GET /healthz calls tell you nothing that the first one did not. Sampling is the practice of keeping a statistically representative subset while discarding the redundant majority.
There are three sampling strategies, each appropriate at a different point in the pipeline:
- Head-based sampling — the decision is made at the start of a request, before any spans are created. Simple, zero overhead, but blind: you do not know yet whether this trace will be interesting (error, slow, unusual). Appropriate for very high-volume, low-value traffic (health checks, metrics scrapes).
- Tail-based sampling — the decision is deferred until the entire trace is complete. The collector buffers spans and evaluates rules: keep if any span has an error, keep if end-to-end latency exceeds 2s, keep 1% of the rest. This is the correct strategy for production services because it guarantees you keep 100% of errors and slow traces. The cost is a buffer (typically 30–60 seconds of spans in memory).
- Adaptive (dynamic) sampling — the sample rate adjusts automatically based on observed error rates, latency distributions, and traffic volume per route. Honeycomb's dynamic sampling and Grafana Tempo's probabilistic sampling fall here. This is the state of the art for large services but requires more configuration.
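Head-based sampling is usually made deterministic by hashing the trace ID, so every service that sees the same trace reaches the same decision without coordination. The sketch below assumes a string trace ID and uses SHA-256 purely for illustration; real tracing SDKs ship their own ratio-based samplers, and `head_sample` is a hypothetical name.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at request start, using only the trace ID and a fixed rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    # Integer comparison avoids float rounding at the edges (rate 0.0 and 1.0).
    return bucket < int(rate * 2**64)


# The decision is stable for a given trace ID.
decision = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.1)
assert decision == head_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.1)

# Roughly 10% of traces are kept at rate 0.1.
kept = sum(head_sample(f"trace-{i}", 0.1) for i in range(10_000))
print(kept)
```

Note what the function cannot see: nothing about errors or latency is available at this point, which is exactly the blindness described in the first bullet.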
Treat any span with error=true or an HTTP 5xx status as non-negotiable: keep that trace at a 100% rate, regardless of the global sampling ratio.
In the OTel Collector, tail-based sampling is implemented with the tail_sampling processor, which ships in the Collector contrib distribution. A typical configuration combines three policies: keep all errors, keep all slow traces, and keep 5% of normal traffic.
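The policy of keeping all errors, all slow traces, and 5% of the rest can be sketched in Python as a single decision function over a completed trace. This is an illustration of the logic, not the Collector's implementation: the `Span` class, the `keep_trace` name, and the injectable `rng` parameter are assumptions made for the example, and the 2-second threshold comes from the latency rule mentioned earlier.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Span:
    start_ms: float
    end_ms: float
    error: bool = False
    http_status: int = 200


def keep_trace(
    spans: Sequence[Span],
    slow_ms: float = 2000.0,       # keep traces slower than 2 s end to end
    baseline_rate: float = 0.05,   # keep 5% of unremarkable traces
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to keep a trace once all of its spans have arrived."""
    if not spans:
        return False
    # Policy 1: any error or 5xx span keeps the whole trace.
    if any(s.error or s.http_status >= 500 for s in spans):
        return True
    # Policy 2: end-to-end latency above the threshold keeps the trace.
    latency_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if latency_ms > slow_ms:
        return True
    # Policy 3: probabilistic baseline for everything else.
    return rng() < baseline_rate


print(keep_trace([Span(0, 40), Span(5, 30, http_status=503)]))  # True
```

Passing `rng` explicitly keeps the function testable; a production sampler would more likely derive the baseline decision from the trace ID so that it is reproducible.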
Retention Tiers: The Cost-Fidelity Curve
Not all telemetry data has the same value over time. A spike in error rate is actionable for the first 72 hours while an incident is live. The same data is useful for trend analysis for 30 days. After 90 days, you rarely need raw data — you need aggregated summaries: daily p99 latency, weekly error rate by service. After a year, only compliance-mandated data survives.
Big-tech observability teams implement tiered retention that trades fidelity for cost at each stage:
- Hot tier (0–72 hours) — full resolution, full cardinality, fast queries. Stored in Prometheus local TSDB or Grafana Mimir ingesters. Cost: highest. Purpose: real-time dashboards and on-call investigations.
- Warm tier (72 hours – 30 days) — full resolution metrics compacted into object storage (S3/GCS) blocks via Thanos or Mimir. Query speed is 2–5× slower but cost drops 10×. Purpose: post-incident reviews and SLO burn-rate calculations.
- Cold tier (30 days – 1 year) — downsampled data only. Thanos Compactor produces 5m and 1h downsampled blocks; once the raw-resolution blocks are expired by retention settings, storage drops 100–1000× vs raw data. Purpose: capacity planning, QBR metrics, compliance evidence.
- Archive / glacier (1 year+) — only pre-computed aggregates survive. Raw spans and logs are deleted; summary tables in a data warehouse (BigQuery, Redshift) persist for legal hold periods.
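The cold-tier idea of downsampling can be shown with a small Python sketch that collapses raw samples into fixed windows and keeps a few aggregates per window. The aggregate set (count, sum, min, max) and the function name are assumptions for illustration; the on-disk format used by any particular system will differ.

```python
from collections import defaultdict


def downsample(
    samples: list[tuple[int, float]],  # (unix_timestamp_seconds, value)
    window_s: int = 300,               # 5-minute windows
) -> dict[int, dict[str, float]]:
    """Collapse raw samples into one aggregate record per window."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_s].append(value)
    return {
        start: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Ten minutes of 15-second samples: 40 raw points become 2 records.
raw = [(t, float(t)) for t in range(0, 600, 15)]
result = downsample(raw)
print(len(raw), len(result))  # 40 2
```

Keeping min, max, sum, and count means averages and extremes can still be answered from the cold tier, while exact per-scrape values cannot.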
For logs, the tiering pattern maps naturally to Loki's storage backends or to a pipeline that routes logs through the OTel Collector:
- All logs go to Loki for 7–14 days (hot/warm).
- After 14 days, only level=error and level=warn logs are retained in Loki; level=info and level=debug are expired.
- Error logs are archived to S3 with a 1-year lifecycle policy for compliance.
- High-value structured fields are extracted and written to a columnar store (ClickHouse, BigQuery) for long-term analytics.
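The first three routing rules in the list above can be written as a lookup from a log record's level and age to the stores that should still hold it. This is a sketch under stated assumptions: the text gives no upper bound for how long error and warn logs stay in Loki, so `LOKI_EXTENDED_DAYS` is a made-up value, and the store names are labels for the example rather than real endpoints.

```python
HOT_DAYS = 14             # upper end of the 7–14 day window in the text
LOKI_EXTENDED_DAYS = 30   # assumption: not specified in the text
ARCHIVE_DAYS = 365        # 1-year S3 lifecycle policy

EXTENDED_LEVELS = {"error", "warn"}


def log_locations(level: str, age_days: int) -> set[str]:
    """Return the stores expected to hold a log line of this level and age."""
    locations: set[str] = set()
    in_hot_window = age_days <= HOT_DAYS
    in_extended_window = level in EXTENDED_LEVELS and age_days <= LOKI_EXTENDED_DAYS
    if in_hot_window or in_extended_window:
        locations.add("loki")
    if level == "error" and age_days <= ARCHIVE_DAYS:
        locations.add("s3_archive")
    return locations


print(log_locations("info", 20))            # set()
print(sorted(log_locations("error", 20)))   # ['loki', 's3_archive']
```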
Use recording rules for expensive queries. Instead of evaluating sum(rate(http_requests_total[5m])) by (service) across millions of series at dashboard load time, run it once every 30 seconds via a recording rule and query the cheap pre-computed series. This can cut query cost by around 100× on busy Prometheus clusters and is the right pattern for every SLO and dashboard query that runs more than once per minute.
Cost Attribution and Governance
At organisations with multiple teams, telemetry cost attribution prevents runaway spending. Teams should own their own cost centres. In Datadog, per-team cost is visible via the Usage Attribution feature. In self-hosted Prometheus/Thanos, cost can be proxied by series count per team: label metrics with team="payments" and query count by (team) ({__name__=~".+"}) to produce a series-count bill per team. Set per-team series budgets and alert in CI when a PR would cause the deploying service to exceed its budget.
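A per-team series budget check of the kind described here reduces to comparing two dictionaries. The sketch below assumes the per-team series counts have already been obtained (for example from the count by (team) query) and that budgets live in a simple mapping; the function name, team names, and numbers are all hypothetical, and teams without a configured budget are skipped.

```python
def over_budget(
    series_by_team: dict[str, int],
    budgets: dict[str, int],
) -> dict[str, int]:
    """Return each team that exceeds its budget, with the size of the overage."""
    return {
        team: count - budgets[team]
        for team, count in series_by_team.items()
        if team in budgets and count > budgets[team]
    }


current = {"payments": 120_000, "search": 40_000, "growth": 9_000}
budget = {"payments": 100_000, "search": 50_000}

violations = over_budget(current, budget)
print(violations)  # {'payments': 20000}
```

A CI job could run this against the projected series count for the deploying service and fail the build when the returned mapping is non-empty.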
Observability data management is ultimately an engineering economics problem. The goal is not zero telemetry — it is the minimum telemetry required to meet your SLO investigation and alerting requirements. An observability plan that costs $2M/year but catches every incident in under 5 minutes may be worth every dollar. One that costs $500K/year but leaves your team blind during incidents is worthless. Calibrate the levers — cardinality, sampling rates, and retention windows — to the actual risk and recovery time objectives of your system, not to arbitrary cost targets.