Prometheus & Grafana

Project: Monitor a Service End-to-End

Every concept from the previous nine lessons — the Prometheus data model, metric types, PromQL, exporters, service discovery, recording and alerting rules, Alertmanager, and Grafana — exists to answer one operational question: is my service healthy, and will I know before my users do? This capstone project walks you through building a complete observability stack for a real HTTP service: instrument it, scrape it, write production-grade PromQL, build a Grafana dashboard that on-call engineers can act from, and wire up alerts that fire at the right time for the right reasons.

Architecture Overview

The stack you will build consists of a Go HTTP service instrumented with the prometheus/client_golang library, a Prometheus server that scrapes it, Alertmanager for notification routing, and Grafana for dashboards. In a production Kubernetes environment you would add a ServiceMonitor and rely on the Prometheus Operator; the patterns here map directly to that setup.

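[Figure: End-to-end monitoring architecture. The HTTP Service (:8080) exposes /metrics; Prometheus (:9090) scrapes :8080/metrics, evaluates rules, and stores samples in its TSDB; Prometheus fires alerts to Alertmanager (:9093), which routes and silences them and notifies PagerDuty / Slack; Grafana (:3000) queries Prometheus with PromQL for dashboards, Explore, and alerts.]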
End-to-end observability: the HTTP service exposes metrics, Prometheus scrapes and evaluates rules, Alertmanager routes pages, and Grafana visualises everything.

Step 1: Instrument the Service

Add the four golden signals as Prometheus metrics at the application boundary. Define your metrics once at package level using promauto so registration is automatic and goroutine-safe.

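```go
// main.go: instrument an HTTP service with the four golden signals
package main

import (
	"log"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Traffic and errors: one counter partitioned by method, path, and status.
	httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests by method, path, and status.",
	}, []string{"method", "path", "status"})

	// Latency: a histogram so quantiles can be computed at query time.
	httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency.",
		Buckets: prometheus.DefBuckets, // .005 .01 .025 .05 .1 .25 .5 1 2.5 5 10
	}, []string{"method", "path"})

	// Saturation: requests currently being processed.
	httpRequestsInFlight = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "http_requests_in_flight",
		Help: "Current number of requests being processed.",
	})
)

// responseWriter wraps http.ResponseWriter to record the status code
// that the wrapped handler sends.
type responseWriter struct {
	http.ResponseWriter
	status int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.status = code
	rw.ResponseWriter.WriteHeader(code)
}

// instrument records traffic, errors, latency, and saturation for one route.
// The path label is passed in explicitly to keep label cardinality bounded.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		httpRequestsInFlight.Inc()
		defer httpRequestsInFlight.Dec()

		// Default to 200: handlers that never call WriteHeader send 200 implicitly.
		rw := &responseWriter{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next(rw, r)
		dur := time.Since(start).Seconds()

		httpRequestsTotal.WithLabelValues(r.Method, path, strconv.Itoa(rw.status)).Inc()
		httpRequestDuration.WithLabelValues(r.Method, path).Observe(dur)
	}
}

// ordersHandler is a placeholder; the real business logic goes here.
func ordersHandler(w http.ResponseWriter, r *http.Request) {
	w.Write([]byte("ok\n"))
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/api/orders", instrument("/api/orders", ordersHandler))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```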
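Bucket design matters. The default Prometheus buckets (DefBuckets) are calibrated for typical network latencies, from 5ms to 10s. For internal RPC services your SLO may be 10ms p99; in that case use custom buckets such as []float64{.001, .005, .01, .025, .05, .1, .25, .5, 1}. Histograms give poor SLO measurements when the SLO threshold does not coincide with a bucket boundary: histogram_quantile interpolates linearly inside a bucket, and the share of requests under a threshold can only be read exactly at a boundary.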
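To see why bucket boundaries matter, the sketch below approximates the linear interpolation that histogram_quantile performs inside a bucket. It is a simplified illustration, not the Prometheus implementation: the bucket type and estimateQuantile function are hypothetical names, the buckets are assumed to be sorted with finite upper bounds, and the lowest bucket is assumed to start at zero.

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors one cumulative histogram bucket:
// count is the number of observations <= upperBound.
type bucket struct {
	upperBound float64 // value of the "le" label, in seconds
	count      float64 // cumulative count
}

// estimateQuantile finds the bucket containing the target rank and
// interpolates linearly between that bucket's lower and upper bound.
func estimateQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			inBucket := b.count - lowerCount
			if inBucket == 0 {
				return b.upperBound
			}
			return lowerBound + (b.upperBound-lowerBound)*(rank-lowerCount)/inBucket
		}
		lowerBound, lowerCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// Assumed workload: 1000 requests that all took about 12ms.
	coarse := []bucket{{0.01, 0}, {0.025, 1000}} // neighbouring default buckets
	fine := []bucket{{0.01, 0}, {0.0125, 1000}}  // a tighter custom bucket

	fmt.Printf("coarse p99 estimate: %.2f ms\n", estimateQuantile(0.99, coarse)*1000)
	fmt.Printf("fine   p99 estimate: %.2f ms\n", estimateQuantile(0.99, fine)*1000)
}
```

With the coarse buckets the estimate lands near the 25ms upper bound even though every request took about 12ms; the tighter bucket keeps the estimate close to the true value.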

Step 2: Scrape Configuration

Add a job for your service in prometheus.yml. In Kubernetes you would use a ServiceMonitor; locally, a static config suffices for this project. Keep scrape_interval at 15s: scraping more often rarely adds value and increases TSDB write pressure.

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: orders-api
    static_configs:
      - targets: ["orders-api:8080"]
    relabel_configs:
      # Strip the port so the instance label holds only the host name.
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(?::\d+)?'
        replacement: '$1'
```
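The relabel rule in this job can be checked outside Prometheus. The sketch below applies the same regular expression in Go, assuming (as Prometheus does for relabel regexes) that the pattern is anchored at both ends and that a non-matching address leaves the label untouched. The relabelInstance function is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostOnly is the relabel regex from prometheus.yml, wrapped in ^(?:...)$
// to mimic the full anchoring that Prometheus applies.
var hostOnly = regexp.MustCompile(`^(?:([^:]+)(?::\d+)?)$`)

// relabelInstance returns the value the instance label would take for a
// given __address__: the first capture group on a match, or the address
// unchanged when the regex does not match.
func relabelInstance(address string) string {
	m := hostOnly.FindStringSubmatch(address)
	if m == nil {
		return address
	}
	return m[1]
}

func main() {
	for _, addr := range []string{"orders-api:8080", "orders-api", "[::1]:8080"} {
		fmt.Printf("%-18s -> %s\n", addr, relabelInstance(addr))
	}
}
```

The bracketed IPv6 address does not match the pattern, so its label is left as-is. That is worth knowing before relying on the rule in a mixed-address environment.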

Step 3: Recording Rules and PromQL

Write recording rules for every expression that feeds an alert or a dashboard. This precomputes results so dashboards load instantly and alert evaluation is cheap even at scale.

```yaml
# /etc/prometheus/rules/orders-api.yml
groups:
  - name: orders_api_records
    interval: 15s
    rules:
      # Request rate per status class
      - record: job:http_requests:rate5m
        expr: |
          sum by (job, status) (
            rate(http_requests_total[5m])
          )

      # Error ratio: feeds SLO and alert
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # p99 latency from histogram
      - record: job:http_request_duration_p99:rate5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: orders_api_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_error_ratio:rate5m{job="orders-api"} > 0.01
        for: 5m
        labels:
          severity: page
          team: backend
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.job }}"
          description: "More than 1% of requests are returning 5xx. Check logs and downstream dependencies."
          runbook_url: "https://runbooks.internal/orders-api/high-error-rate"

      - alert: HighLatency
        expr: job:http_request_duration_p99:rate5m{job="orders-api"} > 0.5
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "p99 latency {{ $value | humanizeDuration }} on {{ $labels.job }}"
```
Treat the for clause as mandatory. Without it, an alert fires on the first rule evaluation that crosses the threshold, so a single bad scrape, a network blip, or a rolling deploy causes false pages. Five minutes for a page-severity alert and ten minutes for a warning, as in the rules above, are sensible starting points. Shorter durations increase noise; longer durations delay detection.
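The effect of the for clause can be sketched as a small state machine. The example below is a simplified model, not Prometheus code: alertStates is a hypothetical function, evaluations are assumed to happen at a fixed interval, and the alert is assumed to move from pending to firing once the condition has held for at least the for duration.

```go
package main

import "fmt"

// alertStates walks through rule evaluations taken every intervalSeconds.
// exceeded[i] says whether the alert expression returned a result at
// evaluation i. The alert is "pending" while the condition holds but the
// for duration has not elapsed, "firing" once it has, and "inactive" as
// soon as one evaluation comes back clean.
func alertStates(exceeded []bool, intervalSeconds, forSeconds int) []string {
	states := make([]string, 0, len(exceeded))
	activeSince := -1 // evaluation time at which the condition first held
	for i, over := range exceeded {
		now := i * intervalSeconds
		if !over {
			activeSince = -1
			states = append(states, "inactive")
			continue
		}
		if activeSince < 0 {
			activeSince = now
		}
		if now-activeSince >= forSeconds {
			states = append(states, "firing")
		} else {
			states = append(states, "pending")
		}
	}
	return states
}

func main() {
	// One blip at t=0, then a sustained breach starting at t=30s.
	samples := []bool{true, false, true, true, true, true, true}

	fmt.Println("for: 60s ->", alertStates(samples, 15, 60))
	fmt.Println("for: 0s  ->", alertStates(samples, 15, 0))
}
```

With a 60-second hold the isolated blip never gets past pending, while with no hold at all it fires immediately. This is the false page the for clause is there to prevent.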

Step 4: Build the Grafana Dashboard

A production dashboard is structured for fast triage: a status row with the headline numbers at the top, a detail row with time-series below, and a resource row at the bottom. Engineers arriving at 3 AM should reach a hypothesis within 60 seconds.

[Figure: Grafana dashboard panel layout for the orders-api service. Status row (stat panels, last value): Request Rate, Error Ratio, p99 Latency, In-Flight Requests. Latency Distribution row (time-series): Request Rate by Status (stacked) and Latency Heatmap (histogram_quantile over bucket series). Saturation row: http_requests_in_flight, go_goroutines / process_open_fds, CPU throttle, queue depth.]
Dashboard layout: three rows — status stats at top, latency time-series in the middle, saturation signals at the bottom — optimised for 60-second triage.

Provision the dashboard as a JSON file checked into git so it is reproducible. The key panels and their PromQL expressions:

```promql
# Panel: Request Rate (stat, last value)
sum(job:http_requests:rate5m{job="orders-api"})

# Panel: Error Ratio (stat, thresholds green=0 yellow=0.005 red=0.01)
job:http_error_ratio:rate5m{job="orders-api"}

# Panel: p99 Latency (stat, unit=seconds, thresholds 0.1/0.5)
job:http_request_duration_p99:rate5m{job="orders-api"}

# Panel: Request Rate by Status (time-series, stacked)
sum by (status) (job:http_requests:rate5m{job="orders-api"})

# Panel: Latency Heatmap (heatmap, uses the raw histogram)
sum by (le) (
  rate(http_request_duration_seconds_bucket{job="orders-api"}[$__rate_interval])
)

# Panel: In-Flight Requests (time-series, alert annotation overlay)
http_requests_in_flight{job="orders-api"}
```
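In Grafana panels that call rate() directly, prefer $__rate_interval to a hardcoded window like [5m]. Grafana sets this variable to at least four times the scrape interval and widens it as the panel's step grows, so rate() always has enough samples however the user zooms the time range. A hardcoded window returns no data when it spans fewer than two scrapes, and it skips the samples between points once the step grows larger than the window. Panels backed by recording rules, such as the stat panels above, inherit the 5m window baked into the rule.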
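As a rough illustration of how the variable behaves, the sketch below follows the formula given in Grafana's Prometheus data source documentation: the larger of the panel interval plus one scrape interval, and four scrape intervals. The rateInterval function and the scrapeInterval constant are hypothetical names for this example, and the real calculation may differ in details across Grafana versions.

```go
package main

import (
	"fmt"
	"time"
)

// scrapeInterval matches the 15s scrape_interval used in this project.
const scrapeInterval = 15 * time.Second

// rateInterval approximates $__rate_interval:
// max(panel interval + scrape interval, 4 * scrape interval).
func rateInterval(panelInterval, scrape time.Duration) time.Duration {
	widened := panelInterval + scrape
	floor := 4 * scrape
	if widened > floor {
		return widened
	}
	return floor
}

func main() {
	// Zoomed in: the panel step equals the scrape interval.
	fmt.Println("zoomed in: ", rateInterval(scrapeInterval, scrapeInterval))
	// Zoomed out: the panel step is twenty scrape intervals (5 minutes).
	fmt.Println("zoomed out:", rateInterval(20*scrapeInterval, scrapeInterval))
}
```

Zoomed in, the floor of four scrape intervals applies. Zoomed out, the window grows with the step, so no samples between plotted points are skipped.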

Step 5: Alertmanager Routing and Runbooks

The alert fires from Prometheus — Alertmanager decides who gets paged and when. Use inhibition rules to suppress warning pages when a page-severity alert is already active on the same job, so on-call engineers receive one actionable notification rather than a storm.

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: https://hooks.slack.com/services/T.../B.../xxx

route:
  receiver: default-receiver
  group_by: [alertname, job]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # match / source_match / target_match still work but are deprecated
    # since Alertmanager 0.22 in favour of matchers / source_matchers / target_matchers.
    - match:
        severity: page
        team: backend
      receiver: backend-pagerduty
    - match:
        severity: warning
      receiver: backend-slack

receivers:
  - name: default-receiver
    slack_configs:
      - channel: "#alerts-dev"
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: backend-pagerduty
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_KEY"
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'

  - name: backend-slack
    slack_configs:
      - channel: "#alerts-warning"
        # Double quotes so YAML turns \n into a real line break.
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\nRunbook: {{ .Annotations.runbook_url }}\n{{ end }}"

inhibit_rules:
  # A page-severity alert mutes warnings on the same job.
  # alertname is deliberately not in `equal`: with it, HighErrorRate
  # could never inhibit HighLatency because their names differ.
  - source_match:
      severity: page
    target_match:
      severity: warning
    equal: [job]
```
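The choice of labels in an inhibition rule's equal list decides which alerts can mute which. The sketch below models that matching logic under simplifying assumptions: exact-match label selectors only, and hypothetical types (labels, inhibitRule) that are not Alertmanager's own. It shows whether a firing HighErrorRate page can mute a HighLatency warning on the same job.

```go
package main

import "fmt"

// labels is an alert's label set.
type labels map[string]string

// inhibitRule is a simplified model of an inhibition rule.
type inhibitRule struct {
	sourceMatch labels   // labels the muting (source) alert must carry
	targetMatch labels   // labels the muted (target) alert must carry
	equal       []string // labels whose values must agree on both alerts
}

// hasAll reports whether l contains every name/value pair in want.
func hasAll(l, want labels) bool {
	for name, value := range want {
		if l[name] != value {
			return false
		}
	}
	return true
}

// inhibits reports whether an active source alert mutes the target alert.
func (r inhibitRule) inhibits(source, target labels) bool {
	if !hasAll(source, r.sourceMatch) || !hasAll(target, r.targetMatch) {
		return false
	}
	for _, name := range r.equal {
		if source[name] != target[name] {
			return false
		}
	}
	return true
}

func main() {
	page := labels{"alertname": "HighErrorRate", "severity": "page", "job": "orders-api"}
	warn := labels{"alertname": "HighLatency", "severity": "warning", "job": "orders-api"}

	byNameAndJob := inhibitRule{
		sourceMatch: labels{"severity": "page"},
		targetMatch: labels{"severity": "warning"},
		equal:       []string{"alertname", "job"},
	}
	byJob := byNameAndJob
	byJob.equal = []string{"job"}

	fmt.Println("equal [alertname, job] mutes HighLatency:", byNameAndJob.inhibits(page, warn))
	fmt.Println("equal [job] mutes HighLatency:           ", byJob.inhibits(page, warn))
}
```

Requiring alertname to be equal means a page can only mute a warning with the same alert name. Matching on job alone is what suppresses the HighLatency warning while the page is active.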

Validating the Full Stack

Before declaring the observability stack production-ready, verify each layer end-to-end. Use promtool check rules in CI to catch PromQL syntax errors before they reach production. Confirm Alertmanager routing with amtool config routes test — do not discover misconfigured routes during an incident.

The most common production failure mode is silent data loss. A scrape timeout, a full TSDB disk, or a broken relabelling rule stops metrics flowing, and threshold alerts then go quiet instead of firing, because an expression over missing data returns nothing. Implement a dead-man's switch: a Watchdog alert that fires continuously and is routed to an external receiver, which pages when the notifications stop arriving. This pattern catches failures of the entire monitoring pipeline, which target-level alerting cannot.
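The receiving end of a dead-man's switch can be very small. The sketch below is a hypothetical heartbeat monitor, not part of Prometheus or Alertmanager: it assumes that each Watchdog notification calls beat, and that something outside the monitoring stack periodically calls shouldPage. The names and the five-minute timeout are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatTimeout is an assumed tolerance: how long the Watchdog may
// stay silent before someone is paged.
const heartbeatTimeout = 5 * time.Minute

// start is a fixed reference time so the example is deterministic.
var start = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// heartbeatMonitor remembers when the last Watchdog notification arrived.
type heartbeatMonitor struct {
	timeout  time.Duration
	lastSeen time.Time
}

// beat records a Watchdog notification received at the given time.
func (m *heartbeatMonitor) beat(at time.Time) {
	m.lastSeen = at
}

// shouldPage reports whether the Watchdog has been silent for longer
// than the timeout, which means the monitoring pipeline itself is broken.
func (m *heartbeatMonitor) shouldPage(now time.Time) bool {
	return now.Sub(m.lastSeen) > m.timeout
}

func main() {
	m := &heartbeatMonitor{timeout: heartbeatTimeout}
	m.beat(start)

	fmt.Println("1 minute after last beat:  ", m.shouldPage(start.Add(heartbeatTimeout/5)))
	fmt.Println("10 minutes after last beat:", m.shouldPage(start.Add(2*heartbeatTimeout)))
}
```

The check has to run somewhere that does not share a failure domain with Prometheus and Alertmanager. Otherwise the same outage that silences the Watchdog also silences the monitor.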

Run a chaos test: kill the upstream database connection and confirm the error ratio alert fires within five minutes, Alertmanager routes to PagerDuty, the Grafana dashboard shows the error rate spike with the alert annotation overlaid on the time axis, and the alert auto-resolves cleanly when the database recovers. If all four conditions pass, your observability stack is production-grade.