Platform Engineering & Developer Experience

Developer Experience Metrics


A platform team that cannot measure developer experience cannot improve it. Metrics are the feedback loop that tells you whether golden paths are reducing friction or just adding ceremony, whether your build infrastructure is fast enough to stay out of engineers' way, and whether a new hire is productive within days or weeks. The two dominant frameworks — DORA and SPACE — complement each other: DORA answers "how well is software being delivered?" while SPACE answers "how does the human experience of engineering feel?" Production-grade platform teams instrument both.

DORA Metrics: The Delivery Heartbeat

DORA (DevOps Research and Assessment) identified four metrics that statistically separate elite software organisations from low performers. After years of State of DevOps research, these remain the tightest proxy for delivery health that generalises across company sizes and stacks.

Deployment Frequency (DF) — How often does a team deploy to production? Elite performers deploy on-demand (multiple times per day per team). A golden path that bundles CI/CD should make daily deployments the default, not the exception. Measure per team and per service, not org-wide — aggregates hide laggard teams.
Lead Time for Changes (LTC) — Time from code commit to that code running in production. It includes PR review latency, CI duration, and deployment pipeline time. Elite: under one hour. High: one day to one week. Anything over a week signals process or infra debt. Long LTC is often caused by flaky tests that block merges or by slow artifact promotion workflows; root-cause it with histogram percentiles, not averages.
Change Failure Rate (CFR) — Percentage of deployments that cause a production incident requiring a hotfix or rollback. Elite: under 5%. High: 16–30%. These bands have shifted between State of DevOps report editions, so record which edition you benchmark against. CFR above 10% on a team's golden path suggests that your scaffolded tests are insufficient or that canary promotion thresholds are too loose.
Mean Time to Restore (MTTR) — Time from incident detection to service restoration. Elite: under one hour. Directly driven by your observability stack (covered in earlier tutorials), runbook quality, and whether on-call engineers can deploy a fix without a full release pipeline.

DORA categorises teams as Elite, High, Medium, or Low based on these four metrics together. A team that is Elite on DF but Low on MTTR is not an elite team — all four gates must be passed simultaneously. In practice, LTC and MTTR are the hardest to improve and often require platform-level investment rather than individual team changes.
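The "all four gates" rule can be sketched as taking the weakest per-metric tier. This is a minimal illustration of the rule as stated above, not DORA's own classification method; the `Tier` enum, the metric key names, and the `overall_tier` function are hypothetical names chosen for this example.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Ordered so that a lower value means weaker performance."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    ELITE = 3


# Hypothetical metric keys; rename to match your own data model.
REQUIRED_METRICS = (
    "deployment_frequency",
    "lead_time",
    "change_failure_rate",
    "time_to_restore",
)


def overall_tier(per_metric: dict[str, Tier]) -> Tier:
    """Return the weakest of the four per-metric tiers."""
    missing = [name for name in REQUIRED_METRICS if name not in per_metric]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    return min(per_metric[name] for name in REQUIRED_METRICS)


# A team that deploys on demand but restores service slowly.
team = {
    "deployment_frequency": Tier.ELITE,
    "lead_time": Tier.HIGH,
    "change_failure_rate": Tier.HIGH,
    "time_to_restore": Tier.LOW,
}
print(overall_tier(team).name)
```

Reporting the weakest tier next to the per-metric tiers keeps the limiting metric visible instead of averaging it away.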

Collecting DORA data requires instrumenting your deployment pipeline. The simplest production-ready approach is to emit deployment events from your CD system and query them in your metrics store. Here is a minimal setup in the style of Four Keys, Google's open-source DORA project, which posts events to a webhook and queries them in BigQuery. The endpoint, token, and table names below are placeholders:

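# Emit a deployment event from your CD pipeline (GitHub Actions example)
# .github/workflows/deploy.yaml  (excerpt)
# Assumes an earlier step with id "deploy" that sets a "timestamp" output,
# and repository secrets that hold the endpoint URL and token.

- name: Emit DORA deployment event
  if: success()
  env:
    FOUR_KEYS_ENDPOINT: ${{ secrets.FOUR_KEYS_ENDPOINT }}
    FOUR_KEYS_TOKEN: ${{ secrets.FOUR_KEYS_TOKEN }}
  run: |
    # --fail makes the step fail on an HTTP error instead of passing silently
    curl -sS --fail -X POST "$FOUR_KEYS_ENDPOINT/event" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $FOUR_KEYS_TOKEN" \
      -d '{
        "event_type": "deployment",
        "id": "${{ github.run_id }}",
        "metadata": {
          "service": "${{ env.SERVICE_NAME }}",
          "environment": "production",
          "commit_sha": "${{ github.sha }}",
          "deployed_at": "${{ steps.deploy.outputs.timestamp }}"
        }
      }'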

# Query Lead Time for Changes over the last 30 days (BigQuery)
# bq query --use_legacy_sql=false <<'SQL'
SELECT
  service,
  APPROX_QUANTILES(lead_time_seconds, 100)[OFFSET(50)] / 3600 AS p50_hours,
  APPROX_QUANTILES(lead_time_seconds, 100)[OFFSET(95)] / 3600 AS p95_hours,
  COUNT(*) AS deployments
FROM four_keys.deployments
WHERE DATE(deployed_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY service
ORDER BY p95_hours DESC;
# SQL
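The query above covers lead time only. Change Failure Rate and time to restore can be derived from the same kind of event data. The following is a minimal sketch under stated assumptions: the `Deployment` and `Incident` record shapes and the `caused_incident` flag are hypothetical, and the restore figure is reported as a median rather than a mean so that one long outage does not dominate it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    service: str
    deployed_at: datetime
    caused_incident: bool = False  # hypothetical flag set by your incident tool


@dataclass
class Incident:
    service: str
    detected_at: datetime
    restored_at: datetime


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that led to a production incident."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_incident)
    return failed / len(deployments)


def median_time_to_restore(incidents: list[Incident]) -> timedelta:
    """Median time from detection to restoration."""
    if not incidents:
        raise ValueError("no incidents to summarise")
    seconds = [(i.restored_at - i.detected_at).total_seconds() for i in incidents]
    return timedelta(seconds=median(seconds))


day = datetime(2024, 1, 8)
deployments = [
    Deployment("checkout", day.replace(hour=9)),
    Deployment("checkout", day.replace(hour=11), caused_incident=True),
    Deployment("checkout", day.replace(hour=14)),
    Deployment("checkout", day.replace(hour=16)),
]
incidents = [
    Incident("checkout", day.replace(hour=11, minute=10), day.replace(hour=11, minute=40)),
    Incident("checkout", day.replace(hour=17), day.replace(hour=18, minute=30)),
]
print(change_failure_rate(deployments))
print(median_time_to_restore(incidents))
```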

SPACE Framework: Beyond Delivery Speed

DORA is delivery-centric — it does not capture whether engineers are burned out, whether code review is adversarial, or whether the development environment is so slow that engineers context-switch into Slack instead of staying in flow. The SPACE framework (Nicole Forsgren et al., 2021) adds five complementary dimensions:

Satisfaction & Well-being — Developer Net Promoter Score (DevNPS), burnout signals from quarterly surveys. A platform team's goal is to increase satisfaction by reducing toil, not just by shipping features.
Performance — Outcome quality: reliability of delivered software, code review thoroughness. Not velocity. A team shipping buggy code fast scores badly on Performance despite good DF.
Activity — Observable counts: PRs merged, incidents resolved, on-call pages. Useful as context, dangerous as incentives — optimising Activity metrics produces Goodhart's Law failures.
Communication & Collaboration — PR review latency by author, cross-team dependency resolution time, architectural decision record (ADR) production rate. Service catalog coverage on Backstage is a proxy here.
Efficiency & Flow — Interruption rate (unplanned work ratio), focus time (deep work blocks per week), context switches per day. This is what build and deploy friction directly attacks.
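For the Satisfaction dimension, a DevNPS figure can be computed with the standard Net Promoter arithmetic: the percentage of promoters (scores 9 to 10 on a 0 to 10 scale) minus the percentage of detractors (scores 0 to 6). The sketch below assumes survey answers arrive as plain integers; the function name is illustrative.

```python
def dev_nps(scores: list[int]) -> float:
    """Net Promoter Score on a 0-10 scale: % promoters minus % detractors."""
    if not scores:
        raise ValueError("no survey responses")
    if any(s < 0 or s > 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Four promoters, two passives (7-8), two detractors out of eight responses.
responses = [10, 9, 8, 7, 6, 3, 9, 10]
print(dev_nps(responses))
```

The score ranges from -100 to 100, so a small team can swing widely between surveys; report the response count next to it.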

At Google, the Engineering Productivity Research team runs quarterly developer surveys measuring SPACE dimensions alongside automated DORA collection. The two data sources are cross-correlated: a team that scores well on DORA metrics but poorly on developer satisfaction almost always has hidden toil (manual release steps, broken local dev environments) not captured by delivery speed alone.
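The cross-correlation idea can be sketched as a simple filter: list the teams whose delivery numbers look healthy while their satisfaction score is poor. The record shape and both thresholds below are assumptions made for illustration, not values from any published framework.

```python
from dataclasses import dataclass


@dataclass
class TeamSnapshot:
    team: str
    p95_lead_time_hours: float  # from automated DORA collection
    dev_nps: float              # from the developer survey


def hidden_toil_candidates(
    snapshots: list[TeamSnapshot],
    max_lead_time_hours: float = 24.0,  # assumed cut-off for "good delivery"
    min_dev_nps: float = 0.0,           # assumed cut-off for "poor satisfaction"
) -> list[str]:
    """Teams that deliver quickly but report low satisfaction."""
    return [
        s.team
        for s in snapshots
        if s.p95_lead_time_hours <= max_lead_time_hours and s.dev_nps < min_dev_nps
    ]


snapshots = [
    TeamSnapshot("payments", p95_lead_time_hours=6.0, dev_nps=-20.0),
    TeamSnapshot("search", p95_lead_time_hours=4.0, dev_nps=35.0),
    TeamSnapshot("billing", p95_lead_time_hours=90.0, dev_nps=-10.0),
]
print(hidden_toil_candidates(snapshots))
```

Teams on this list are candidates for a follow-up interview about manual release steps or local environment problems, not a verdict.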

Onboarding Time as a First-Class Platform Metric

Time-to-first-commit (TTFC) and time-to-first-deploy (TTFD) are among the most actionable platform metrics. If a new hire takes three weeks to get a local dev environment running and make their first production contribution, your platform has failed — regardless of how fast your existing teams ship. Target for elite platforms: TTFC under four hours for a senior engineer joining an existing team; TTFD (first real change in production) under three days.

Instrument onboarding time by creating a provisioning event at account creation and a commit event at first merge. The delta is your TTFC. Track it per team, per tech stack, and after every major platform change. A golden-path scaffolder that provisions a fully working local dev environment (devcontainer or nix flake), pre-seeded with correct secrets and service stubs, is the single highest-leverage investment in TTFC at scale.

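#!/usr/bin/env bash
# Measure time-to-first-commit (TTFC) via the GitHub API (run daily in CI).
# Requires an authenticated gh CLI, jq, and GNU date.
set -euo pipefail

ORG="your-org"
LOOKBACK_DAYS=90

# Assumption: join dates come from your provisioning system, exported as
# "login<TAB>ISO-8601 timestamp" lines. The GitHub membership endpoint does
# not document a join timestamp, so it cannot supply this value.
JOINS_FILE="new-members.tsv"   # hypothetical file name

CUTOFF=$(date -d "$LOOKBACK_DAYS days ago" +%s)

while IFS=$'\t' read -r LOGIN JOIN_DATE; do
  JOIN_TS=$(date -d "$JOIN_DATE" +%s)

  # Keep only members who joined within the lookback window
  if [ "$JOIN_TS" -lt "$CUTOFF" ]; then
    continue
  fi

  # Earliest commit authored by this user in the org; empty if none is found
  FIRST_COMMIT=$(gh api "search/commits" \
    -X GET \
    -f "q=org:$ORG author:$LOGIN" \
    -f "sort=author-date" \
    -f "order=asc" \
    -f "per_page=1" \
    --jq '.items[0].commit.author.date // empty' 2>/dev/null </dev/null || true)

  if [ -n "$FIRST_COMMIT" ]; then
    DELTA_HOURS=$(( ( $(date -d "$FIRST_COMMIT" +%s) - JOIN_TS ) / 3600 ))
    # Build each record with jq so that values are escaped correctly
    jq -n --arg user "$LOGIN" --arg join "$JOIN_DATE" \
      --arg first "$FIRST_COMMIT" --argjson delta "$DELTA_HOURS" \
      '{user: $user, join: $join, first_commit: $first, delta_hours: $delta}'
  fi

  sleep 2   # the search API is rate limited far more tightly than the core API
done < "$JOINS_FILE" | jq -s '.' > onboarding-ttfc.json
# Upload onboarding-ttfc.json to your metrics store (BigQuery, Datadog, etc.)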
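Once the records exist, they can be summarised against the four-hour TTFC target. The sketch below assumes records shaped like the entries in `onboarding-ttfc.json` (a `delta_hours` field per user) and drops negative deltas, which appear when someone authored commits before their recorded join date. The function name and sample data are illustrative.

```python
import json
from statistics import median


def summarise_ttfc(records: list[dict], target_hours: float = 4.0):
    """Median TTFC and share of new members who met the target."""
    deltas = [r["delta_hours"] for r in records if r["delta_hours"] >= 0]
    if not deltas:
        return None
    within = sum(1 for d in deltas if d <= target_hours)
    return {
        "count": len(deltas),
        "median_hours": median(deltas),
        "share_within_target": within / len(deltas),
    }


# Inline sample standing in for the contents of onboarding-ttfc.json.
sample = json.loads(
    '[{"user": "a", "delta_hours": 3},'
    ' {"user": "b", "delta_hours": 30},'
    ' {"user": "c", "delta_hours": -12},'
    ' {"user": "d", "delta_hours": 2}]'
)
summary = summarise_ttfc(sample)
print(summary)
```

Running the same summary per team or per tech stack shows which golden paths still leave new hires waiting.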

Build and Deploy Friction

Build and deploy friction is the cumulative tax developers pay every time they want to validate or ship code. It compounds: a 12-minute CI pipeline that runs 40 times per day across a 10-engineer team costs roughly 40 engineering-hours of waiting per week (12 minutes × 40 runs × 5 working days), assuming each run blocks one engineer. The platform team must own build time as an SLO, not a nice-to-have.
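As a back-of-the-envelope check, the weekly cost is pipeline minutes × runs per day × working days, divided by 60. The sketch assumes that every run blocks one engineer for its full duration and that the week has five working days; both are simplifications, and the function name is illustrative.

```python
def weekly_ci_wait_hours(
    pipeline_minutes: float,
    runs_per_day: int,
    workdays_per_week: int = 5,
) -> float:
    """Engineer-hours spent waiting on CI per week, one blocked engineer per run."""
    return pipeline_minutes * runs_per_day * workdays_per_week / 60


# 12-minute pipeline, 40 runs per day across the team.
print(weekly_ci_wait_hours(12, 40))  # 40.0
```

Cutting the same pipeline to 6 minutes halves the figure, which gives a direct way to price a build-speed investment.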

Key friction signals to measure continuously:

CI p50/p95 duration — by pipeline, by stage (lint, unit test, integration test, build, push). P95 matters more than mean — a pipeline that is usually 4 minutes but occasionally 20 minutes breaks flow more than a consistently 8-minute pipeline.
Flaky test rate — percentage of CI failures that are not reproducible on retry. Above 2% and engineers stop trusting CI red and start overriding it. Track per test file and quarantine aggressively.
Deployment pipeline wait time — time a successfully built artifact spends waiting for a deploy slot, approval, or environment availability. Often invisible but frequently the dominant contribution to LTC.
Local dev feedback loop — time from code save to seeing the change reflected in a running local service. Instrument with developer surveys because this is hard to capture automatically. Hot-reload setups (Skaffold, Tilt) should keep this under 5 seconds.
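Two of the signals above can be sketched directly: nearest-rank percentiles for CI duration, and a per-file flaky rate defined as the share of failures that pass on retry. The record shape, the function names, and the sample numbers are assumptions for illustration; the 2% threshold comes from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


@dataclass
class CiFailure:
    test_file: str
    passed_on_retry: bool


def flaky_rate_by_file(failures: list[CiFailure]) -> dict[str, float]:
    """Share of each file's failures that were not reproducible on retry."""
    totals: dict[str, int] = defaultdict(int)
    flaky: dict[str, int] = defaultdict(int)
    for f in failures:
        totals[f.test_file] += 1
        if f.passed_on_retry:
            flaky[f.test_file] += 1
    return {name: flaky[name] / totals[name] for name in totals}


def quarantine_candidates(failures: list[CiFailure], threshold: float = 0.02) -> list[str]:
    """Files whose flaky rate exceeds the threshold."""
    rates = flaky_rate_by_file(failures)
    return sorted(name for name, rate in rates.items() if rate > threshold)


# Pipeline A is usually 4 minutes but occasionally 20; pipeline B is always 8.
pipeline_a = [4.0] * 18 + [20.0] * 2
pipeline_b = [8.0] * 20
print(mean(pipeline_a), percentile(pipeline_a, 95))
print(mean(pipeline_b), percentile(pipeline_b, 95))

failures = [
    CiFailure("test_cart.py", passed_on_retry=True),
    CiFailure("test_cart.py", passed_on_retry=False),
    CiFailure("test_auth.py", passed_on_retry=False),
]
print(quarantine_candidates(failures))
```

Pipeline A has the lower mean but the higher p95, which is the case the mean hides.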

Figure: DORA metrics flow from the delivery pipeline, while SPACE signals come from surveys, onboarding events, and build systems; all converge in the platform engineering weekly dashboard.

Implementing a Lightweight Metrics Dashboard

You do not need a dedicated Four Keys deployment on day one. A Grafana dashboard querying GitHub, your CI provider, and your incident tool is sufficient for most teams. The critical discipline is consistent definition: "deployment" means exactly one thing (the CD pipeline completes a production rollout), "incident" means an alert that pages on-call, and "lead time" starts at the commit timestamp, not the PR merge timestamp. Inconsistent definitions produce metrics that look good on slides but mislead engineering decisions.

# Prometheus recording rules and Grafana panel queries for DORA metrics
# (assumes Prometheus metrics emitted by Argo CD or Flux via a deployment webhook receiver)

# prometheus recording rule (rules/dora.yaml)
groups:
  - name: dora
    interval: 5m
    rules:
      - record: dora:deployment_frequency:rate7d
        expr: |
          sum by (service, environment) (
            increase(cd_deployment_total{environment="production"}[7d])
          ) / 7
        labels:
          window: "7d"

      - record: dora:lead_time_p95:seconds
        expr: |
          histogram_quantile(0.95,
            sum by (service, le) (
              rate(cd_lead_time_seconds_bucket{environment="production"}[7d])
            )
          )

# Grafana panel query (Prometheus data source)
# Panel: Deployment Frequency (deployments/day, last 7d)
# PromQL: dora:deployment_frequency:rate7d{environment="production"}
# Visualization: stat panel, threshold green >= 1 (daily), yellow >= 0.14 (weekly), red < 0.14

# Panel: Lead Time p95 (hours)
# PromQL: dora:lead_time_p95:seconds / 3600
# Threshold: green <= 1h (elite), yellow <= 24h (high), red > 24h

Never use deployment frequency as a KPI that teams are incentivised to maximise directly. Teams will split large changes into trivial deployments to inflate the number. Use DORA metrics as a diagnostic tool for the platform team, not as a performance evaluation for product teams. Goodhart's Law applies the moment a measure becomes a target.

Closing the Loop: Metrics to Platform Improvements

Raw metrics data is only useful if it drives a structured improvement cycle. The platform team should run a weekly metrics review: look at the bottom quartile of teams on each DORA dimension, identify the systemic root cause (usually: slow CI, broken golden path template, missing runbooks, or lack of feature-flag infrastructure), and add a platform improvement to the backlog. Treat each metric regression as a platform bug. A mature platform engineering team publishes a quarterly "State of DevEx" report to the engineering org — analogous to the public State of DevOps Report — with trend lines, team benchmarks (anonymised), and a roadmap of friction-reducing investments planned for the next quarter.
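The bottom-quartile step of the weekly review can be sketched as a small ranking helper. The team names and numbers are made up for illustration, and the function name and `higher_is_better` flag are hypothetical.

```python
def bottom_quartile(metric_by_team: dict[str, float], higher_is_better: bool) -> list[str]:
    """Return the worst-performing quarter of teams, worst first (at least one team)."""
    ranked = sorted(
        metric_by_team,
        key=lambda team: metric_by_team[team],
        reverse=not higher_is_better,  # lower-is-better metrics: largest value is worst
    )
    count = max(1, len(ranked) // 4)
    return ranked[:count]


# p95 lead time in hours per team; lower is better.
p95_lead_time_hours = {
    "payments": 30.0,
    "search": 2.0,
    "checkout": 96.0,
    "identity": 5.0,
    "catalog": 12.0,
    "billing": 48.0,
    "ads": 1.0,
    "infra": 8.0,
}
print(bottom_quartile(p95_lead_time_hours, higher_is_better=False))
```

Running it once per DORA dimension gives the shortlist of teams to look at for a shared root cause.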