Prometheus & Grafana

Project: Monitor a Service End-to-End

18 min Lesson 10 of 32

Every concept from the previous nine lessons — the Prometheus data model, metric types, PromQL, exporters, service discovery, recording and alerting rules, Alertmanager, and Grafana — exists to answer one operational question: is my service healthy, and will I know before my users do? This capstone project walks you through building a complete observability stack for a real HTTP service: instrument it, scrape it, write production-grade PromQL, build a Grafana dashboard that on-call engineers can act from, and wire up alerts that fire at the right time for the right reasons.

Architecture Overview

The stack you will build consists of a Go HTTP service instrumented with the prometheus/client_golang library, a Prometheus server that scrapes it, Alertmanager for notification routing, and Grafana for dashboards. In a production Kubernetes environment you would add a ServiceMonitor and rely on the Prometheus Operator; the patterns here map directly to that setup.

End-to-end observability: the HTTP service exposes metrics, Prometheus scrapes and evaluates rules, Alertmanager routes pages, and Grafana visualises everything.

Step 1: Instrument the Service

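Add the four golden signals as Prometheus metrics at the application boundary: a counter labelled by status covers traffic and errors, a histogram covers latency, and an in-flight gauge serves as a simple saturation signal. Define your metrics once at package level using promauto so registration is automatic and goroutine-safe.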

// main.go  — instrument an HTTP service with four golden signals
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests by method, path, and status.",
    }, []string{"method", "path", "status"})

    httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency.",
        Buckets: prometheus.DefBuckets, // .005 .01 .025 .05 .1 .25 .5 1 2.5 5 10
    }, []string{"method", "path"})

    httpRequestsInFlight = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "http_requests_in_flight",
        Help: "Current number of requests being processed.",
    })
)

func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        httpRequestsInFlight.Inc()
        defer httpRequestsInFlight.Dec()

        rw := &responseWriter{ResponseWriter: w, status: 200}
        start := time.Now()
        next(rw, r)
        dur := time.Since(start).Seconds()

        httpRequestsTotal.WithLabelValues(r.Method, path, strconv.Itoa(rw.status)).Inc()
        httpRequestDuration.WithLabelValues(r.Method, path).Observe(dur)
    }
}

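// responseWriter wraps http.ResponseWriter to capture the status code
// so it can be used as a label value after the handler returns.
type responseWriter struct {
    http.ResponseWriter
    status int
}

// WriteHeader records the status before delegating to the wrapped writer.
func (rw *responseWriter) WriteHeader(code int) {
    rw.status = code
    rw.ResponseWriter.WriteHeader(code)
}

// ordersHandler is a placeholder; replace it with the real business logic.
func ordersHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte(`{"orders":[]}`))
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/api/orders", instrument("/api/orders", ordersHandler))
    // ListenAndServe only returns on failure, so surface the error.
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}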

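Bucket design matters. The default buckets (prometheus.DefBuckets) are tailored to the response times of a typical network service, from 5 ms to 10 s. For an internal RPC service with an SLO such as 10 ms at p99, most observations land in the lowest two or three buckets, so use finer custom buckets such as []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1}. histogram_quantile interpolates linearly inside a bucket, so its estimate can be off by up to the width of that bucket; an SLO threshold is measured exactly only when it coincides with a bucket boundary.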
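To see why bucket edges matter, here is a minimal sketch of linear interpolation inside a bucket, which is how histogram_quantile estimates a quantile. It is simplified (no +Inf bucket, no NaN handling), and the bucket type, the estimateQuantile helper, the finer bucket layout, and the sample counts are illustrative assumptions rather than anything taken from Prometheus itself.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: count holds the number of
// observations less than or equal to le (seconds).
type bucket struct {
    le    float64
    count float64
}

// estimateQuantile interpolates linearly inside the first bucket whose
// cumulative count reaches the target rank. Buckets must be sorted by le.
func estimateQuantile(q float64, buckets []bucket) float64 {
    if len(buckets) == 0 {
        return 0
    }
    total := buckets[len(buckets)-1].count
    rank := q * total
    lower, below := 0.0, 0.0 // lower bound and cumulative count of the previous bucket
    for _, b := range buckets {
        if b.count >= rank {
            inBucket := b.count - below
            if inBucket == 0 {
                return b.le
            }
            return lower + (b.le-lower)*(rank-below)/inBucket
        }
        lower, below = b.le, b.count
    }
    return buckets[len(buckets)-1].le
}

func main() {
    // Hypothetical data: 1000 requests that all took 12 ms.
    coarse := []bucket{{0.01, 0}, {0.025, 1000}}
    finer := []bucket{{0.01, 0}, {0.0125, 1000}, {0.015, 1000}, {0.025, 1000}}

    fmt.Printf("coarse p99: %.3f ms\n", estimateQuantile(0.99, coarse)*1000)
    fmt.Printf("finer  p99: %.3f ms\n", estimateQuantile(0.99, finer)*1000)
}
```

With every request at 12 ms, the coarse layout reports a p99 near 24.9 ms while the finer layout reports about 12.5 ms. The estimate reflects where the bucket edges sit, not where the observations actually fell.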

Step 2: Scrape Configuration

Add a job for your service in prometheus.yml. In Kubernetes you would use a ServiceMonitor; locally, a static config suffices for this project. Keep scrape_interval at 15s: scraping more often rarely adds value and increases TSDB write pressure.

# prometheus.yml
global:
  scrape_interval:     15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: orders-api
    static_configs:
      - targets: ["orders-api:8080"]
    relabel_configs:
      - source_labels: [__address__]
        target_label:  instance
        regex:         '([^:]+)(?::\d+)?'
        replacement:   '$1'
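The relabel rule above rewrites the instance label to the host part of the target address. Go's regexp package uses the same RE2 syntax as Prometheus, so the effect can be sketched locally. The stripPort helper is a hypothetical name, and the explicit ^(?:...)$ wrapper stands in for the anchoring that Prometheus applies to relabel regexes.

```go
package main

import (
    "fmt"
    "regexp"
)

// portPattern is the relabel regex from prometheus.yml, anchored at both
// ends to mirror how Prometheus evaluates relabel rules.
var portPattern = regexp.MustCompile(`^(?:([^:]+)(?::\d+)?)$`)

// stripPort returns the value the rule would write to the instance label.
// When the regex does not match, the replace action does nothing, which
// this sketch models by returning the address unchanged.
func stripPort(address string) string {
    if !portPattern.MatchString(address) {
        return address
    }
    return portPattern.ReplaceAllString(address, "$1")
}

func main() {
    fmt.Println(stripPort("orders-api:8080")) // orders-api
    fmt.Println(stripPort("10.0.0.7:9100"))   // 10.0.0.7
}
```

One consequence of dropping the port: two targets on the same host but different ports would end up with the same instance value, so this rule suits setups with one target per host.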

Step 3: Recording Rules and PromQL

Write recording rules for the expressions that feed alerts and dashboards. Precomputing the results keeps dashboard queries fast and alert evaluation cheap even at scale.

# /etc/prometheus/rules/orders-api.yml
groups:
  - name: orders_api_records
    interval: 15s
    rules:
      # Request rate per status code
      - record: job:http_requests:rate5m
        expr: |
          sum by (job, status) (
            rate(http_requests_total[5m])
          )

      # Error ratio — feeds SLO and alert
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # p99 latency from histogram
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: orders_api_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_ratio:rate5m{job="orders-api"} > 0.01
        for: 5m
        labels:
          severity: page
          team:     backend
        annotations:
          summary:     "Error rate {{ $value | humanizePercentage }} on {{ $labels.job }}"
          description: "More than 1% of requests are returning 5xx. Check logs and downstream dependencies."
          runbook_url:  "https://runbooks.internal/orders-api/high-error-rate"

      - alert: HighLatency
        expr: job:http_request_duration_p99:rate5m{job="orders-api"} > 0.5
        for: 10m
        labels:
          severity: warning
          team:     backend
        annotations:
          summary: "p99 latency {{ $value | humanizeDuration }} on {{ $labels.job }}"

Treat the for clause as mandatory. Without it, an alert fires on the first rule evaluation that crosses the threshold, so a single bad scrape, a network blip, or a rolling deploy causes a false page. Five minutes for a page-severity alert and ten minutes for a warning, as in the rules above, are sensible starting points. Shorter durations increase noise; longer durations delay detection.
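The effect of a for duration can be sketched as a small state machine: an alert becomes pending on the first evaluation where the expression is true, and fires only once the condition has held for the whole duration. The evaluate helper below is a hypothetical illustration, not Prometheus code, and it assumes a one-minute evaluation interval to keep the trace short.

```go
package main

import "fmt"

// evaluate replays one boolean per rule evaluation (true means the
// threshold was crossed) and returns the alert state after each one.
// intervalSec is the evaluation interval and forSec the for duration.
func evaluate(results []bool, intervalSec, forSec int) []string {
    states := make([]string, 0, len(results))
    active := false // whether the condition held at the previous evaluation
    held := 0       // seconds the condition has held without interruption
    for _, breached := range results {
        if !breached {
            // Any evaluation below the threshold resets the timer.
            active = false
            states = append(states, "inactive")
            continue
        }
        if !active {
            active, held = true, 0
        } else {
            held += intervalSec
        }
        if held >= forSec {
            states = append(states, "firing")
        } else {
            states = append(states, "pending")
        }
    }
    return states
}

func main() {
    // Two bad evaluations, a recovery, then a sustained breach.
    trace := []bool{true, true, false, true, true, true, true, true, true}
    fmt.Println(evaluate(trace, 60, 300))
}
```

In this trace the early two-evaluation blip never leaves the pending state, and the alert fires only at the last evaluation, five minutes after the sustained breach began.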

Step 4: Build the Grafana Dashboard

A production dashboard is structured for fast triage: a status row with the headline numbers at the top, a detail row with time-series below, and a resource row at the bottom. Engineers arriving at 3 AM should reach a hypothesis within 60 seconds.

Dashboard layout: three rows — status stats at top, latency time-series in the middle, saturation signals at the bottom — optimised for 60-second triage.

Provision the dashboard as a JSON file checked into git so it is reproducible. The key panels and their PromQL expressions:

# Panel: Request Rate (stat, last value)
sum(job:http_requests:rate5m{job="orders-api"})

# Panel: Error Ratio (stat, thresholds green=0 yellow=0.005 red=0.01)
job:http_error_ratio:rate5m{job="orders-api"}

# Panel: p99 Latency (stat, unit=seconds, thresholds 0.1/0.5)
job:http_request_duration_p99:rate5m{job="orders-api"}

# Panel: Request Rate by Status (time-series, stacked)
sum by (status) (job:http_requests:rate5m{job="orders-api"})

# Panel: Latency Heatmap (heatmap, uses the raw histogram)
sum by (le) (
  rate(http_request_duration_seconds_bucket{job="orders-api"}[$__rate_interval])
)

# Panel: In-Flight Requests (time-series, alert annotation overlay)
http_requests_in_flight{job="orders-api"}

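Use $__rate_interval instead of a hardcoded window such as [5m] in any Grafana panel that calls rate() on raw series, as the heatmap above does. Grafana computes it as the larger of four scrape intervals and the panel step plus one scrape interval, so rate() always has enough samples at any zoom level, provided the scrape interval is set correctly in the data source settings. With a hardcoded window, zooming out until the step exceeds the window makes rate() skip the samples between steps, so short spikes disappear from the graph; a window shorter than two scrape intervals leaves gaps instead. Panels built on the recording rules keep their fixed 5m window by design.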
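The calculation behind $__rate_interval is small enough to sketch. The rateIntervalSeconds helper is a hypothetical name; the formula is the one Grafana documents, max($__interval + scrape interval, 4 × scrape interval), expressed here in whole seconds.

```go
package main

import "fmt"

// rateIntervalSeconds returns the window Grafana would substitute for
// $__rate_interval, given the panel step ($__interval) and the scrape
// interval configured on the data source, both in seconds.
func rateIntervalSeconds(panelStepSec, scrapeSec int) int {
    stepBased := panelStepSec + scrapeSec
    floor := 4 * scrapeSec
    if stepBased > floor {
        return stepBased
    }
    return floor
}

func main() {
    // Zoomed in: the step equals the 15s scrape interval.
    fmt.Println(rateIntervalSeconds(15, 15)) // 60
    // Zoomed out: a 5m step widens the window to cover every sample.
    fmt.Println(rateIntervalSeconds(300, 15)) // 315
}
```

The floor of four scrape intervals keeps rate() supplied with samples even if a scrape is missed, and the step-based term ensures that no sample between two plotted points is ignored.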

Step 5: Alertmanager Routing and Runbooks

The alert fires from Prometheus — Alertmanager decides who gets paged and when. Use inhibition rules to suppress warning pages when a page-severity alert is already active on the same job, so on-call engineers receive one actionable notification rather than a storm.

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: https://hooks.slack.com/services/T.../B.../xxx

route:
  receiver: default-receiver
  group_by: [alertname, job]
  group_wait:      30s
  group_interval:  5m
  repeat_interval: 4h
  routes:
    - match:
        severity: page
        team: backend
      receiver: backend-pagerduty
    - match:
        severity: warning
      receiver: backend-slack

receivers:
  - name: default-receiver
    slack_configs:
      - channel: "#alerts-dev"
        title: '{{ .GroupLabels.alertname }}'
        text:  '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: backend-pagerduty
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_KEY"
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'

  - name: backend-slack
    slack_configs:
      - channel: "#alerts-warning"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\nRunbook: {{ .Annotations.runbook_url }}\n{{ end }}"

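inhibit_rules:
  # Mute warnings for a job while any page-severity alert is firing on it.
  # alertname is deliberately not in equal: HighErrorRate and HighLatency
  # have different names and would never match each other.
  - source_match:
      severity: page
    target_match:
      severity: warning
    equal: [job]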
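An inhibition rule mutes a target alert only when a firing source alert carries the same value for every label listed in equal. The equal list is the part that is easiest to get wrong, so the sketch below evaluates two candidate lists against the two alerts defined earlier. The labels type and the inhibited helper are hypothetical and simplified; they are not Alertmanager's implementation.

```go
package main

import "fmt"

// labels is a flat label set as attached to an alert.
type labels map[string]string

// matchesAll reports whether alert carries every key/value pair in want.
func matchesAll(alert, want labels) bool {
    for k, v := range want {
        if alert[k] != v {
            return false
        }
    }
    return true
}

// inhibited reports whether target is muted by any firing source alert.
// A label missing on both sides counts as equal (both read as "").
func inhibited(target labels, firing []labels, sourceMatch, targetMatch labels, equal []string) bool {
    if !matchesAll(target, targetMatch) {
        return false
    }
    for _, src := range firing {
        if !matchesAll(src, sourceMatch) {
            continue
        }
        same := true
        for _, name := range equal {
            if src[name] != target[name] {
                same = false
                break
            }
        }
        if same {
            return true
        }
    }
    return false
}

func main() {
    page := labels{"alertname": "HighErrorRate", "severity": "page", "job": "orders-api"}
    warn := labels{"alertname": "HighLatency", "severity": "warning", "job": "orders-api"}
    src, tgt := labels{"severity": "page"}, labels{"severity": "warning"}

    fmt.Println(inhibited(warn, []labels{page}, src, tgt, []string{"job"}))              // true
    fmt.Println(inhibited(warn, []labels{page}, src, tgt, []string{"alertname", "job"})) // false
}
```

Because the page and the warning have different alertname values, listing alertname in equal means the warning is never muted; matching on job alone gives the one-notification behaviour described above.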

Validating the Full Stack

Before declaring the observability stack production-ready, verify each layer end-to-end. Use promtool check rules in CI to catch PromQL syntax errors before they reach production. Confirm Alertmanager routing with amtool config routes test — do not discover misconfigured routes during an incident.

The most common production failure mode is silent data loss. A scrape timeout, a full TSDB disk, or a broken relabelling rule stops metrics flowing, and threshold alerts then go quiet instead of firing, because an expression over missing series returns no result. Implement a dead-man's switch: a Watchdog alert that fires continuously and is routed to an external heartbeat receiver, which pages when the heartbeat stops arriving. This pattern catches failures of the entire monitoring pipeline that target-level alerting cannot.
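The receiving side of a dead-man's switch can be sketched in a few lines: record when the Watchdog notification last arrived and page once the silence exceeds a tolerance. The deadMansSwitch type and its two-minute tolerance are assumptions for illustration, and timestamps are plain Unix seconds to keep the sketch free of dependencies; in practice this logic usually lives in a hosted heartbeat service outside the monitored cluster.

```go
package main

import "fmt"

// deadMansSwitch tracks heartbeats from the always-firing Watchdog alert.
type deadMansSwitch struct {
    lastBeat int64 // Unix seconds of the most recent heartbeat
    maxGap   int64 // seconds of silence tolerated before paging
}

// beat records a Watchdog notification received at time now.
func (d *deadMansSwitch) beat(now int64) {
    d.lastBeat = now
}

// expired reports whether the monitoring pipeline has been silent for
// longer than maxGap and someone should be paged.
func (d *deadMansSwitch) expired(now int64) bool {
    return now-d.lastBeat > d.maxGap
}

func main() {
    sw := &deadMansSwitch{maxGap: 120}
    sw.beat(1000)

    fmt.Println(sw.expired(1060)) // false: heartbeat seen 60s ago
    fmt.Println(sw.expired(1121)) // true: silent for more than 120s
}
```

The inversion is the point of the design: every other alert pages when a signal appears, whereas this one pages when a signal disappears, so it still works when Prometheus or Alertmanager itself is down.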

Run a chaos test: kill the upstream database connection and confirm the error ratio alert fires within five minutes, Alertmanager routes to PagerDuty, the Grafana dashboard shows the error rate spike with the alert annotation overlaid on the time axis, and the alert auto-resolves cleanly when the database recovers. If all four conditions pass, your observability stack is production-grade.
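The first condition of the chaos test can be checked programmatically against the Prometheus HTTP API, whose /api/v1/alerts endpoint lists active alerts with their labels and state. The sketch below only decodes a response body; the isFiring helper and the sample payload in main are illustrative assumptions, and fetching the body from your own Prometheus URL is left out.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// alertsResponse models the fields of /api/v1/alerts that this check needs.
type alertsResponse struct {
    Status string `json:"status"`
    Data   struct {
        Alerts []struct {
            Labels map[string]string `json:"labels"`
            State  string            `json:"state"`
        } `json:"alerts"`
    } `json:"data"`
}

// isFiring reports whether the payload contains a firing alert with the
// given alertname and job labels.
func isFiring(payload []byte, alertname, job string) (bool, error) {
    var resp alertsResponse
    if err := json.Unmarshal(payload, &resp); err != nil {
        return false, fmt.Errorf("decode alerts response: %w", err)
    }
    for _, a := range resp.Data.Alerts {
        if a.State == "firing" && a.Labels["alertname"] == alertname && a.Labels["job"] == job {
            return true, nil
        }
    }
    return false, nil
}

func main() {
    // Made-up response body for illustration only.
    sample := []byte(`{"status":"success","data":{"alerts":[
        {"labels":{"alertname":"HighErrorRate","job":"orders-api","severity":"page"},"state":"firing"}
    ]}}`)

    firing, err := isFiring(sample, "HighErrorRate", "orders-api")
    fmt.Println(firing, err)
}
```

Polling this check in a loop during the chaos test, with a deadline slightly above the five-minute for duration, turns the first of the four conditions into an automated assertion.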