Prometheus & Grafana

The Prometheus Model

You have spent previous tutorials building observable foundations — distributed tracing with Jaeger, structured logging with Loki, and the three-pillars mental model. Now it is time to go deep on the most widely adopted metrics system in the DevOps world. Prometheus is not merely a metrics database: it is a complete model for how to think about measurement in cloud-native systems. Understanding that model — its pull-based architecture, its time-series storage engine, and the ecosystem that surrounds it — is the prerequisite for everything else in this tutorial.

Prometheus was inspired by Borgmon, the internal monitoring system Google built alongside its Borg cluster manager, and it carries over Borgmon's core design philosophy. The Prometheus project, started at SoundCloud in 2012 and accepted into the CNCF in 2016, made that philosophy available to everyone. Today it is the de facto standard for metrics in Kubernetes, and the major cloud providers offer Prometheus-compatible monitoring for their managed clusters.

Pull-Based Scraping: The Fundamental Design Choice

Most metrics systems you may have encountered (StatsD, Graphite, many APM agents) are push-based: applications send metrics to a central collector. Prometheus inverts this. It is pull-based: Prometheus itself reaches out over HTTP to each target and scrapes a /metrics endpoint that the target exposes. This is not an implementation detail; it is a deliberate architectural decision with deep operational consequences.

The /metrics endpoint serves the Prometheus exposition format: a plain-text, line-oriented format that any HTTP client can read without special tooling. Each target is responsible for maintaining a current view of its own counters, gauges, histograms, and summaries, and for serving them on demand. Prometheus pulls this snapshot on a configurable interval — the scrape_interval — typically 15 or 30 seconds in production.

Key idea: In a pull model, Prometheus always knows whether a target is reachable. If the scrape fails, Prometheus has a metric for that: up == 0. In a push model, the absence of data is ambiguous — is the service dead, or just not sending? This difference makes alerting on service availability dramatically simpler and more reliable in a pull architecture.

Why pull instead of push? Several engineering reasons converge:

Health as a first-class signal: A failed scrape immediately sets up to 0 for that target. You do not need a separate health-check system.
Configuration lives with Prometheus, not targets: You control scrape frequency, timeouts, and relabeling centrally. Targets need only serve an endpoint — they do not need to know where Prometheus lives.
Local debugging: You can curl http://my-service:8080/metrics at any time and see exactly what Prometheus sees. There is no invisible agent, no buffering, no retry queue to reason about.
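No counter loss when scrapes are interrupted: If Prometheus is temporarily unable to reach a target, the target keeps accumulating state in memory. When scraping resumes, the next scrape reflects the current state. The samples that would have been collected during the gap are missing, but counter continuity is preserved because counters are cumulative: the increase across the gap is still the difference between the values scraped before and after it.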
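
The health and counter-continuity points above can be traced in a few lines. This is a minimal sketch, not how Prometheus is implemented: FakeTarget, scrape, and the request numbers are hypothetical names and values invented for the illustration.

```python
class FakeTarget:
    """Hypothetical stand-in for a service that exposes a cumulative counter."""

    def __init__(self):
        self.requests_total = 0
        self.reachable = True

    def handle_requests(self, n):
        # The target keeps counting whether or not anyone is scraping it.
        self.requests_total += n

    def metrics(self):
        if not self.reachable:
            raise ConnectionError("scrape failed")
        return {"http_requests_total": self.requests_total}


def scrape(target):
    """Return the samples one scrape would record, including the synthetic `up`."""
    try:
        samples = dict(target.metrics())
        samples["up"] = 1
    except ConnectionError:
        samples = {"up": 0}  # no application samples, but the failure itself is recorded
    return samples


target = FakeTarget()
history = []

target.handle_requests(100)
history.append(scrape(target))   # successful scrape

target.reachable = False
target.handle_requests(50)
history.append(scrape(target))   # failed scrape: up == 0

target.reachable = True
target.handle_requests(25)
history.append(scrape(target))   # scraping resumes

# The increase between the two successful scrapes covers the whole gap.
increase = history[2]["http_requests_total"] - history[0]["http_requests_total"]
print([s["up"] for s in history], increase)
```

The middle scrape carries no application data, yet the 50 requests handled during the outage still show up in the increase, because the counter on the target never stopped counting.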

Pro practice: In large-scale deployments (thousands of targets), teams run multiple Prometheus instances in a federated or sharded topology. Each instance scrapes a subset of targets. Never assume a single Prometheus instance will scale indefinitely. As a rough rule of thumb, a single instance with 64 GB RAM begins to struggle at around 10 million active series, though the exact limit depends on series churn, scrape interval, and query load. Design for horizontal scaling from the start.
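
One way to picture a sharded topology is hash-based assignment: each instance keeps only the targets whose hash falls into its shard. The sketch below is written in the spirit of Prometheus's hashmod relabel action but makes no claim to produce the same hash values; shard_for, targets_for_shard, and the addresses are hypothetical.

```python
import hashlib


def shard_for(target_address, shard_count):
    """Map a target to a shard index by hashing its address."""
    digest = hashlib.md5(target_address.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % shard_count


def targets_for_shard(targets, shard_index, shard_count):
    """Return the subset of targets that one Prometheus instance would scrape."""
    return [t for t in targets if shard_for(t, shard_count) == shard_index]


targets = [f"10.0.0.{i}:8080" for i in range(1, 21)]
shards = [targets_for_shard(targets, i, 3) for i in range(3)]

for index, members in enumerate(shards):
    print(index, len(members))
```

Because the assignment depends only on the target address and the shard count, every instance can compute its own subset independently, and each target lands on exactly one instance.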

The Exposition Format and the Client Libraries

A /metrics response looks like this — a plain text file where each line is a metric observation:

# HELP http_requests_total Total HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200",handler="/api/users"} 1827364
http_requests_total{method="GET",status="500",handler="/api/users"} 42
http_requests_total{method="POST",status="201",handler="/api/orders"} 93241

# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.34217728e+08

# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.005"} 183921
http_request_duration_seconds_bucket{le="0.01"} 248034
http_request_duration_seconds_bucket{le="0.025"} 291047
http_request_duration_seconds_bucket{le="0.05"} 298201
http_request_duration_seconds_bucket{le="+Inf"} 298391
http_request_duration_seconds_sum 4872.3
http_request_duration_seconds_count 298391
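
Because the format is line-oriented, a sample line can be split into name, labels, and value with very little code. The parser below is a simplified sketch for illustration only: it assumes no timestamps and no escaped quotes inside label values, and parse_sample is a hypothetical helper, not part of any Prometheus library.

```python
import re

# Simplified patterns: no trailing timestamp, no escaped quotes in label values.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (name, labels, value) for a sample line, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


print(parse_sample('http_requests_total{method="GET",status="500",handler="/api/users"} 42'))
print(parse_sample("process_resident_memory_bytes 1.34217728e+08"))
print(parse_sample("# TYPE http_requests_total counter"))
```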

The Prometheus project maintains official client libraries for Go, Java/JVM, Python, and Ruby. Community-supported libraries exist for virtually every language. Instrumenting a Go HTTP server is a matter of importing prometheus/client_golang and registering metrics — the library handles thread-safe accumulation and HTTP exposition. In Kubernetes environments, the kube-state-metrics exporter exposes cluster state, and node_exporter exposes OS-level metrics from every node — you scrape these exactly like application endpoints.
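
To make the client library's job concrete, here is a toy counter that accumulates values under a lock and renders them in the exposition format. It is a standard-library sketch of the idea, not the API of prometheus_client or client_golang; the Counter class and its methods are hypothetical.

```python
import threading


class Counter:
    """Toy labelled counter; a sketch, not a real client library API."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, amount=1, **labels):
        key = tuple(labels[n] for n in self.label_names)
        with self._lock:  # safe to call from concurrent request handlers
            self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Serialize the current state as exposition-format text."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            items = sorted(self._values.items())
        for key, value in items:
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


requests = Counter("http_requests_total", "Total HTTP requests handled", ["method", "status"])
requests.inc(method="GET", status="200")
requests.inc(method="GET", status="200")
requests.inc(method="GET", status="500")
body = requests.render()
print(body)
```

A real service would return this text from its /metrics handler; the official libraries add the other metric types, label escaping, and registration on top of the same idea.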

The Time-Series Database (TSDB)

Prometheus stores all scraped samples in its embedded TSDB — a purpose-built time-series database optimized for write-heavy, append-only workloads with high-cardinality label sets. Understanding its storage model helps you operate it correctly and avoid the most common production failures.

Each time series is identified by a unique combination of a metric name and a set of labels — key-value pairs that provide dimensions. The series http_requests_total{method="GET", status="200", service="orders"} is a completely separate time series from http_requests_total{method="POST", status="201", service="orders"}. Every distinct label combination creates a new series. This is both Prometheus's greatest power and its most common footgun.
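
The identity rule and its multiplicative effect are easy to check by hand. In this sketch series_identity and worst_case_series are hypothetical helpers, and the per-label value counts are invented for the example.

```python
from math import prod


def series_identity(name, labels):
    """A series is the metric name plus the full label set; label order is irrelevant."""
    return (name, frozenset(labels.items()))


def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


a = series_identity("http_requests_total", {"method": "GET", "status": "200", "service": "orders"})
b = series_identity("http_requests_total", {"method": "POST", "status": "201", "service": "orders"})
c = series_identity("http_requests_total", {"service": "orders", "status": "200", "method": "GET"})

print(a == b, a == c)
print(worst_case_series({"method": 5, "status": 10, "service": 20}))
print(worst_case_series({"method": 5, "status": 10, "service": 20, "user_id": 100_000}))
```

Three bounded labels stay at a thousand series in the worst case; adding a single unbounded label such as a user ID multiplies that bound by the number of users.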

The TSDB organizes data into two layers:

In-memory Head block: The most recent two to three hours of data live in memory, and every incoming sample is also appended to an on-disk write-ahead log (WAL). Scrapes write here first, which is fast, sequential I/O. On restart, the WAL replays to rebuild the head.
Persistent blocks: Every two hours, the head block is compacted and written to a persistent block on disk. Blocks are immutable. A background compactor merges smaller blocks into larger ones (covering up to 10% of the configured retention window, or 31 days, whichever is smaller) to reduce query overhead across long time ranges.
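
The write-then-replay behavior of the head can be illustrated with a toy log. This is a conceptual sketch only: TinyHead, its JSON-lines file, and the sample values are hypothetical and bear no relation to Prometheus's actual WAL encoding.

```python
import json
import os
import tempfile


class TinyHead:
    """Toy in-memory head backed by an append-only log file."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.series = {}  # series key -> list of (timestamp, value)
        self._replay()

    def _replay(self):
        # On startup, rebuild the in-memory state from whatever the log contains.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                key, ts, value = json.loads(line)
                self.series.setdefault(key, []).append((ts, value))

    def append(self, key, ts, value):
        # Log first, then update memory, so a restart can rebuild the same state.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps([key, ts, value]) + "\n")
        self.series.setdefault(key, []).append((ts, value))


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "wal.log")
    head = TinyHead(path)
    head.append('up{job="api"}', 1000, 1.0)
    head.append('up{job="api"}', 1015, 1.0)

    restarted = TinyHead(path)  # simulate a process restart
    recovered = restarted.series == head.series

print(recovered)
```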

Default retention is 15 days of local storage. For long-term retention, Prometheus supports a remote_write interface that streams samples to an external backend such as Thanos, Cortex, Mimir, or VictoriaMetrics, which handles multi-year retention at scale, typically on object storage.

Production pitfall (cardinality explosion): The single most common cause of Prometheus OOM crashes is high-cardinality labels. If you label metrics with user IDs, request IDs, or any unbounded string, you create millions of distinct time series. Each series consumes roughly 3–4 KB of memory in the head block, so one million series costs roughly 3–4 GB of RAM just for the head. Always design label sets with bounded cardinality. Prometheus exposes prometheus_tsdb_head_series; alert when it exceeds 80% of your capacity budget.
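
The memory arithmetic and the 80% threshold can be written down directly. This sketch uses the 3–4 KB per-series figure quoted above as its only input; head_memory_range_gib, near_capacity, and the one-million-series budget are hypothetical names and numbers.

```python
def head_memory_range_gib(active_series, low_kib=3, high_kib=4):
    """Estimate head-block memory in GiB from the per-series cost range."""
    kib_per_gib = 1024 ** 2
    return active_series * low_kib / kib_per_gib, active_series * high_kib / kib_per_gib


def near_capacity(active_series, capacity_series, threshold=0.8):
    """True when the series count exceeds the given fraction of the budget."""
    return active_series > threshold * capacity_series


low, high = head_memory_range_gib(1_000_000)
print(round(low, 2), round(high, 2))
print(near_capacity(850_000, 1_000_000), near_capacity(700_000, 1_000_000))
```

In practice the series count would come from prometheus_tsdb_head_series, and the capacity figure from whatever memory you have budgeted for the instance.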

The Prometheus Ecosystem: Architecture Diagram

Prometheus does not operate in isolation. The production architecture involves a set of well-defined components, each with a specific role. Understanding what each component does — and what it does not do — prevents architectural mistakes that are expensive to undo.

The Prometheus ecosystem: Prometheus pulls metrics from targets discovered via service discovery, evaluates rules, fires alerts to Alertmanager, and exposes PromQL for Grafana. Remote write ships samples to long-term storage.

Service Discovery: How Prometheus Knows What to Scrape

In static environments you could list scrape targets manually. In Kubernetes, where pods are born and die constantly, static configuration is impractical. Prometheus has built-in service discovery integrations for Kubernetes, Consul, AWS EC2, GCE, Azure, and DNS-SD. In a Kubernetes cluster, the typical setup uses the kubernetes_sd_configs mechanism with relabeling rules to filter and transform the discovered targets.

# prometheus.yml — minimal production-style config for Kubernetes
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: prod-us-east-1
    region: us-east-1

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  # Scrape Prometheus itself
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Kubernetes pods annotated for scraping
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the scrape address to use the port from annotation prometheus.io/port if set
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Carry namespace and pod name as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

remote_write:
  - url: "http://thanos-receive:10908/api/v1/receive"
    queue_config:
      max_samples_per_send: 10000
      max_shards: 30
      min_backoff: 1s
      max_backoff: 30s

The external_labels block is critical in multi-cluster setups. When remote-writing to a central store like Thanos, these labels attach to every series so you can distinguish cluster="prod-us-east-1" from cluster="prod-eu-west-1" in global queries.
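
Relabeling is easier to reason about once you trace it by hand. The sketch below mimics the keep and replace actions under simplified assumptions: source label values are joined with ";", the regex must match the whole joined string, and Python's \1 syntax stands in for Prometheus's $1. The relabel function and the pod values are hypothetical, and the address rule uses the common two-source-label form that combines the discovered address with the annotated port.

```python
import re


def relabel(labels, rules):
    """Apply keep/replace rules; return the new labels, or None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        source = ";".join(labels.get(name, "") for name in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), source)
        action = rule.get("action", "replace")
        if action == "keep":
            if match is None:
                return None  # target is not scraped at all
        elif action == "replace" and match is not None:
            labels[rule["target_label"]] = match.expand(rule.get("replacement", r"\1"))
    return labels


rules = [
    {"source_labels": ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
     "action": "keep", "regex": "true"},
    {"source_labels": ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
     "action": "replace", "target_label": "__address__",
     "regex": r"([^:]+)(?::\d+)?;(\d+)", "replacement": r"\1:\2"},
    {"source_labels": ["__meta_kubernetes_namespace"], "target_label": "namespace"},
]

annotated_pod = {
    "__address__": "10.0.0.7:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
    "__meta_kubernetes_namespace": "orders",
}
plain_pod = {"__address__": "10.0.0.8:8080", "__meta_kubernetes_namespace": "orders"}

result = relabel(annotated_pod, rules)
print(result["__address__"], result["namespace"])
print(relabel(plain_pod, rules))
```

The annotated pod ends up scraped on its declared port with a namespace label attached, while the pod without the scrape annotation is dropped by the keep rule before any other rule runs.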

What Prometheus Is Not

Understanding Prometheus's intentional limitations prevents architectural mistakes. Prometheus is designed for numeric time-series metrics with bounded cardinality. It is explicitly not designed for:

Log storage: Use Loki, Elasticsearch, or Splunk for logs. Do not encode log content into metric labels.
Event tracking or billing: Prometheus may lose up to one scrape interval of data (15–30 seconds) on crash. It offers no durability guarantees for individual events. Use Kafka or a transactional store for billing-critical counts.
Long-term retention out of the box: Default 15-day local storage. For years of history, remote_write to Thanos or Mimir from day one.
High-cardinality dimensions: No user IDs, request IDs, or session tokens as labels. These belong in traces (Jaeger/Tempo) or logs.

The right mental model: Prometheus answers "how is my system behaving in aggregate right now and over the past few weeks?" Traces answer "what happened during this specific request?" Logs answer "what did the process say at this exact moment?" Build all three; let each do its job.

In the next lesson you will go deep on the four metric types — counter, gauge, histogram, and summary — and the exposition format. With the pull model and TSDB architecture firmly in mind, those details will snap into place immediately.