Project: Design an IDP
This capstone lesson walks through the end-to-end design of a production-grade Internal Developer Platform (IDP) for a sample organisation — Acme Corp — a 400-engineer fintech running 120 microservices on Kubernetes across two AWS regions. By the end you will have a complete golden-path specification, a self-service surface design, the Backstage scaffolder template that powers it, the Crossplane Composition for on-demand PostgreSQL, and the Kyverno policies that enforce guardrails — every artefact ready to drop into a real monorepo.
Step 1 — Characterise the Organisation
Before writing a single line of YAML, collect four inputs:
- Service taxonomy. Acme has three archetypes: API service (Java/Go, REST/gRPC), async worker (Kafka consumer, Python/Go), and ML inference (Python, GPU-optional). Every golden path maps to one archetype.
- Cognitive tax survey. A 10-question dev survey reveals the top four pain points: (1) writing Kubernetes YAML from scratch, (2) provisioning databases, (3) wiring Datadog APM, (4) configuring RBAC. These become the IDP's first four self-service actions.
- Compliance requirements. PCI-DSS scope for the payments cluster means: immutable container images (no :latest), no root containers, mandatory network policy, secrets from Vault (not env vars). These become Kyverno policies enforced at admission.
- Existing assets inventory. Acme already has: Terraform modules for VPC/EKS, a Vault PKI mount, a Datadog agent DaemonSet, and 40 Helm charts of varying quality. The IDP wraps these; it does not replace them.
Step 2 — Define the Golden Path for an API Service
A golden path is a fully-specified, opinionated delivery path from git init to production. For Acme's API service archetype it covers six dimensions:
- Scaffold: a Backstage Software Template generates the repo skeleton (Dockerfile, Makefile, .github/workflows/ci.yaml, k8s/ manifests, catalog-info.yaml, docs/ TechDocs stub) in under 60 seconds.
- CI pipeline: GitHub Actions runs lint → unit test → docker build --provenance=true --sbom=true → Trivy scan (block on CRITICAL) → push to ECR with immutable tag sha-<commit> → update the GitOps repo via a PR to k8s/overlays/staging/<service>/image.yaml.
- GitOps delivery: one ArgoCD Application CR per environment (staging, production). Staging auto-syncs; production requires a manual sync gate or a JIRA approval webhook.
- Runtime: a pre-tested Helm chart with sensible defaults: resources.requests.cpu: 200m, resources.limits.memory: 512Mi, HPA on CPU+RPS, PodDisruptionBudget minAvailable: 1, Istio sidecar enabled, Datadog APM via admission controller.
- Observability: an automatic Datadog dashboard (provisioned by a Backstage action calling the Datadog API) and a default SLO (99.5% success rate, 300 ms p99 latency) wired to a PagerDuty service.
- Security: Vault AppRole injected by the Vault Agent sidecar; mTLS via Istio; NetworkPolicy denying all ingress except the mesh gateway and the monitoring namespace.
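The GitOps delivery dimension can be sketched as a pair of ArgoCD Application resources. This is a minimal illustration, not Acme's actual configuration: the service name payments-api, the repository URL, the cluster names, and the production overlay path are all assumptions made for the example. Only the staging overlay path pattern comes from the CI step above.

```yaml
# Staging: the automated sync policy applies every change merged to the GitOps repo.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-staging          # hypothetical service name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/platform-gitops.git   # hypothetical repo URL
    targetRevision: main
    path: k8s/overlays/staging/payments-api
  destination:
    name: staging                     # hypothetical cluster name registered in ArgoCD
    namespace: payments-api
  syncPolicy:
    automated:
      prune: true                     # remove resources that were deleted from git
      selfHeal: true                  # revert manual drift in the cluster
---
# Production: no automated sync policy, so a sync has to be triggered explicitly.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/platform-gitops.git   # hypothetical repo URL
    targetRevision: main
    path: k8s/overlays/production/payments-api   # assumed to mirror the staging layout
  destination:
    name: production                  # hypothetical cluster name registered in ArgoCD
    namespace: payments-api
```

Leaving out syncPolicy.automated is the simplest way to express a manual gate; an approval webhook would sit in front of whatever triggers the production sync and is not shown here.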
Step 3 — Design the Self-Service Surface
The self-service surface is what developers actually click or type. Acme ships three surfaces:
- Backstage portal: the primary UI. Software Templates for scaffolding; a software catalog for discovery; TechDocs for documentation; Scorecards showing golden-path compliance per service.
- Platform CLI (acme): wraps Backstage scaffolder API calls for engineers who live in the terminal. Commands include acme new service, acme provision db, and acme open-dash <service>. The CLI is a thin wrapper; all logic lives in the platform API.
- GitOps self-service: for infrastructure primitives (databases, buckets, queues), the self-service action opens a PR to the platform GitOps repo. A human reviewer (or a policy bot) approves; Crossplane reconciles the resource. This keeps an audit trail in git, which is non-negotiable for PCI-DSS.
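As a rough illustration of what a GitOps self-service PR might add to the platform repo, here is a Crossplane claim for a PostgreSQL database. The API group, the fields under spec.parameters, the label key, and all names are assumptions: in practice they are defined by the platform team's own CompositeResourceDefinition, which this lesson does not show.

```yaml
apiVersion: database.acme.example/v1alpha1   # hypothetical group, defined by the platform's XRD
kind: PostgreSQLInstance
metadata:
  name: orders-db                            # hypothetical database name
  namespace: team-orders                     # hypothetical team namespace
  labels:
    acme.example/jira-project: ORD           # hypothetical label carrying the JIRA project key
spec:
  parameters:                                # schema is whatever the XRD declares; these are assumed
    storageGB: 20
    version: "15"
  compositionSelector:
    matchLabels:
      provider: aws                          # selects a Composition labelled provider=aws
```

Once the PR is merged and the file is synced to the cluster, Crossplane matches the claim to a Composition and reconciles the underlying cloud resources.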
The Backstage self-service action for on-demand PostgreSQL is the highest-traffic one (about 40 runs per month at Acme). It calls a custom scaffolder backend action that creates a Crossplane PostgreSQLInstance claim, commits it to the GitOps repo, and opens a PR tagged with the requesting team's JIRA project key.
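A Backstage Software Template for this flow might look roughly like the sketch below. It is an approximation under several assumptions: the action ID acme:crossplane:claim and its inputs are hypothetical stand-ins for the custom backend action, the repository location is invented, and the PR step is delegated to the built-in publish:github:pull-request action rather than being handled inside the custom action as the text describes. The JIRA project key is carried in the PR title here, which is only one possible way of tagging the PR.

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: provision-postgresql
  title: Provision a PostgreSQL database
  description: Opens a GitOps PR containing a Crossplane PostgreSQLInstance claim.
spec:
  owner: platform-team                  # hypothetical owning group
  type: resource
  parameters:
    - title: Database details
      required:
        - name
        - jiraProject
      properties:
        name:
          title: Database name
          type: string
        storageGB:
          title: Storage (GB)
          type: integer
          default: 20
        jiraProject:
          title: JIRA project key
          type: string
  steps:
    - id: render-claim
      name: Render Crossplane claim
      action: acme:crossplane:claim     # hypothetical custom scaffolder action
      input:                            # input names are assumptions for this sketch
        name: ${{ parameters.name }}
        storageGB: ${{ parameters.storageGB }}
        targetPath: claims/${{ parameters.name }}.yaml
    - id: open-pr
      name: Open pull request
      action: publish:github:pull-request
      input:
        repoUrl: github.com?owner=acme&repo=platform-gitops   # hypothetical repo
        branchName: provision-db-${{ parameters.name }}
        title: "[${{ parameters.jiraProject }}] Provision PostgreSQL ${{ parameters.name }}"
        description: Requested through the Backstage self-service template.
  output:
    links:
      - title: Pull request
        url: ${{ steps['open-pr'].output.remoteUrl }}
```

Keeping the PR step separate from the claim-rendering step means the custom action only has to write a file into the scaffolder workspace, at the cost of diverging slightly from the single-action design described above.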
Step 4 — The IDP Architecture Diagram
The diagram below shows how Acme's four platform planes interact at runtime: the developer plane, the control plane (Backstage + platform API), the delivery plane (GitOps), and the infrastructure plane (Crossplane + cloud APIs).
Step 5 — Enforcing Guardrails with Kyverno
Guardrails are the policies that make the golden path trustworthy. Acme ships four baseline Kyverno ClusterPolicies applied to every namespace except kube-system.
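Two of the compliance requirements from Step 1 (no :latest images, no root containers) could be expressed along the following lines. These are simplified sketches loosely modelled on the public Kyverno policy library, not Acme's actual policies; the policy and rule names are invented, and the placement of the failure-action field differs between Kyverno versions, so check it against the version in use.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag             # hypothetical policy name
spec:
  validationFailureAction: Enforce      # reject non-compliant Pods at admission
  background: true
  rules:
    - name: block-latest-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
      validate:
        message: "Images tagged :latest are not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"      # does not catch untagged images; a second rule would be needed
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-non-root         # hypothetical policy name
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: containers-must-not-run-as-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
      validate:
        message: "Containers must set securityContext.runAsNonRoot to true."
        pattern:
          spec:
            containers:
              - securityContext:
                  runAsNonRoot: "true"  # simplified: ignores pod-level securityContext and init containers
```

Running new policies in audit mode before enforcing them is a common way to find existing workloads that would be blocked.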
Step 6 — Measuring IDP Success
An IDP without metrics is a platform team acting on faith. Acme tracks four DORA-aligned platform KPIs at week-over-week granularity:
- Onboarding time to first deployment: target < 2 hours from acme new service to first staging deployment. Measured by diffing the scaffold timestamp against the first ArgoCD sync timestamp stored in the platform telemetry DB.
- Self-service success rate: percentage of database/queue provisioning requests that complete without a platform team ticket. Target: > 90%. Measured from the Crossplane claim reconciliation events.
- Golden path adoption: percentage of services with a Backstage scorecard score > 80/100. Target: > 75% of services. Surfaced on the engineering all-hands dashboard.
- Platform p99 API latency: the Backstage backend and platform API must respond in < 200 ms p99. Breaching this SLO pages the platform team, because an IDP that is slower than filing a ticket will not be used.
Putting It All Together
Acme's IDP ships as a monorepo with three top-level directories: platform-gitops/ holds the desired state (ArgoCD ApplicationSets, Crossplane Compositions, Kyverno policies, Vault policies); platform-backstage/ holds the portal; acme-cli/ holds the CLI. Each component is versioned independently, deployed via its own GitOps pipeline, and has an SLO. The platform team treats this stack as a product: it has a roadmap, a changelog, a deprecation policy, and an on-call rotation. That product mindset, rather than any particular tool choice, is what separates platforms that scale from platforms that become legacy monoliths.