Self-Service Infrastructure
In 2010 a developer who needed a database waited for a ticket. The DBA team provisioned it in two weeks. In 2025 the same developer opens their internal developer portal, fills in a form — engine, version, storage class, backup schedule — clicks Create, and has a running, policy-compliant database in four minutes. The infrastructure still gets provisioned by the same cloud APIs. The difference is who drives them, and whether guardrails prevent bad outcomes at the moment of action rather than in a post-incident review.
Self-service infrastructure is the operational heart of a mature internal developer platform (IDP). This lesson covers the three-layer model that makes it work: platform APIs that abstract complexity, a control plane that reconciles desired state, and guardrails enforced before resources are ever created. Crossplane sits at the centre of the most influential open-source implementation of this pattern today.
Why Platform APIs, Not Raw Cloud APIs
Every major cloud exposes thousands of resource types across hundreds of API surfaces. A single RDS instance requires decisions about subnet groups, parameter groups, IAM roles, KMS keys, security group rules, backup windows, deletion protection, and multi-AZ topology. A developer should make none of those decisions — they are organisational decisions, made once, encoded in a platform API.
A platform API is an opinionated, organisation-scoped abstraction. Instead of aws_db_instance with fifty attributes, a developer sees PostgresDatabase with five: name, size (small/medium/large), environment, backup policy, and owner team. The platform API maps those five inputs to the forty-five infrastructure inputs that reflect your organisation's standards. This is the golden path from the previous lesson made machine-enforceable.
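The mapping from a handful of developer inputs to many infrastructure settings can be sketched in a few lines. This is a minimal illustration, not how Crossplane itself implements Compositions. The size catalogue, the backup policy names (none, daily, extended), the instance classes, and the output field names are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical size catalogue: the platform team decides these values once.
SIZE_CATALOGUE = {
    "small": {"instance_class": "db.t3.medium", "storage_gb": 20},
    "medium": {"instance_class": "db.m5.large", "storage_gb": 100},
    "large": {"instance_class": "db.m5.2xlarge", "storage_gb": 500},
}

# Hypothetical backup policies mapped to retention in days.
BACKUP_RETENTION_DAYS = {"none": 0, "daily": 7, "extended": 35}
MIN_PROD_RETENTION_DAYS = 7


@dataclass(frozen=True)
class PostgresDatabase:
    """The five inputs a developer supplies."""
    name: str
    size: str
    environment: str
    backup_policy: str
    owner_team: str


def expand(claim: PostgresDatabase) -> dict:
    """Map the developer-facing inputs to low-level settings."""
    if claim.size not in SIZE_CATALOGUE:
        raise ValueError(f"unknown size: {claim.size!r}")
    if claim.backup_policy not in BACKUP_RETENTION_DAYS:
        raise ValueError(f"unknown backup policy: {claim.backup_policy!r}")

    is_prod = claim.environment == "prod"
    retention = BACKUP_RETENTION_DAYS[claim.backup_policy]
    if is_prod:
        # Organisational opinion: production never drops below the minimum.
        retention = max(retention, MIN_PROD_RETENTION_DAYS)

    return {
        **SIZE_CATALOGUE[claim.size],
        "identifier": f"{claim.owner_team}-{claim.name}-{claim.environment}",
        "storage_encrypted": True,        # always on, not a developer choice
        "multi_az": is_prod,
        "deletion_protection": is_prod,
        "backup_retention_days": retention,
        "tags": {"team": claim.owner_team, "environment": claim.environment},
    }


if __name__ == "__main__":
    claim = PostgresDatabase("orders", "small", "prod", "daily", "payments")
    for key, value in expand(claim).items():
        print(f"{key}: {value}")
```

The developer-facing type stays small while every other setting is derived, so changing an organisational standard means changing one function rather than every team's configuration.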
The contract a platform API must satisfy:
- Idempotent: calling it twice with the same input produces the same result; safe to re-apply on drift detection.
- Observable: callers can poll or watch the status of their request; the system emits events on state transitions.
- Self-documenting: the schema is machine-readable (OpenAPI or a Kubernetes CRD) so the portal, the CLI, and the policy engine all derive from a single source of truth.
- Auditable: every mutation carries actor identity, timestamp, and reason — written into an immutable audit log before the cloud API call is ever made.
Exposing the platform API as a Kubernetes CRD gives you kubectl for CLI access and the watch API for real-time status. Your developers already know how to read a YAML manifest; you do not need to build a new client SDK.
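Three of the contract properties (idempotent, observable, auditable) can be made concrete with a toy in-memory model. The class and method names below are hypothetical and stand in for whatever sits behind your platform API. The sketch only shows the ordering that matters: compare against recorded state first, write the audit entry next, and call the cloud last.

```python
import hashlib
import json
from datetime import datetime, timezone


class PlatformAPI:
    """In-memory stand-in for a platform API; not a real client."""

    def __init__(self):
        self.audit_log = []   # append-only list standing in for an immutable log
        self.resources = {}   # name -> {"spec_hash": ..., "status": ...}
        self.cloud_calls = 0  # counts simulated cloud API calls

    def apply(self, name, spec, actor, reason):
        """Apply a desired spec; re-applying the same spec is a no-op."""
        spec_hash = hashlib.sha256(
            json.dumps(spec, sort_keys=True).encode()
        ).hexdigest()

        current = self.resources.get(name)
        if current is not None and current["spec_hash"] == spec_hash:
            return current["status"]  # idempotent: nothing changes

        # Auditable: record who, when, and why before touching the cloud.
        self.audit_log.append({
            "actor": actor,
            "reason": reason,
            "resource": name,
            "spec_hash": spec_hash,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

        self._provision(name, spec)
        self.resources[name] = {"spec_hash": spec_hash, "status": "Ready"}
        return "Ready"

    def status(self, name):
        """Observable: callers can poll the state of their request."""
        resource = self.resources.get(name)
        return resource["status"] if resource else "NotFound"

    def _provision(self, name, spec):
        # Placeholder for the real cloud API call.
        self.cloud_calls += 1


if __name__ == "__main__":
    api = PlatformAPI()
    spec = {"size": "small", "environment": "dev"}
    api.apply("orders-db", spec, actor="alice", reason="new service")
    api.apply("orders-db", spec, actor="alice", reason="re-apply")
    print(api.cloud_calls, len(api.audit_log), api.status("orders-db"))
```

Hashing the canonical JSON form of the spec is one simple way to detect "same input"; a real control plane would compare desired and observed state field by field instead.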
Crossplane: Control Plane as Platform Foundation
Crossplane extends Kubernetes into a universal control plane for any infrastructure. It adds three primitives on top of standard Kubernetes:
- Provider: a controller pod that translates Crossplane resources into real cloud API calls (AWS, GCP, Azure, Vault, Helm, Terraform — there are 200+ official providers). Each provider ships its own CRDs: one CRD per cloud resource type.
- CompositeResourceDefinition (XRD): defines your custom platform API type, for example PostgresDatabase, as a CRD schema. This is the resource type developers and portal forms target.
- Composition: a mapping from your high-level PostgresDatabase claim to the underlying low-level provider resources (VPC, subnet group, IAM role, RDS instance, Route 53 record) with all organisational opinions baked in.
The reconciliation loop is pure Kubernetes: a developer applies a PostgresDatabase manifest; the Crossplane composite controller reads the Composition and materialises the constituent provider resources; each provider controller reconciles those resources against the real cloud API and writes status back. The developer watches the status field on their claim; the platform team never touches a ticket.
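The loop described above is an instance of level-triggered reconciliation: compare desired state with observed state and act only on the difference. The function below is a generic sketch of that idea, not Crossplane's controller code. Both dictionaries are hypothetical stand-ins, with observed playing the role of the real cloud.

```python
def reconcile(desired: dict, observed: dict) -> list:
    """Converge `observed` toward `desired` and return the actions taken.

    Both arguments map a resource name to its spec. `observed` is mutated
    in place, standing in for calls to a real cloud API.
    """
    actions = []

    for name, spec in desired.items():
        if name not in observed:
            observed[name] = dict(spec)
            actions.append(("create", name))
        elif observed[name] != spec:
            observed[name] = dict(spec)
            actions.append(("update", name))

    # Anything observed but no longer desired is removed.
    for name in list(observed):
        if name not in desired:
            del observed[name]
            actions.append(("delete", name))

    return actions


if __name__ == "__main__":
    desired = {"orders-db": {"size": "small"}}
    cloud = {}
    print(reconcile(desired, cloud))   # first pass creates the resource
    print(reconcile(desired, cloud))   # second pass finds nothing to do
    cloud["orders-db"]["size"] = "large"   # simulate out-of-band drift
    print(reconcile(desired, cloud))   # drift is corrected
```

Because each pass starts from the current state rather than from a queue of events, running it again after a crash or a missed notification is harmless, which is why re-applying on drift detection is safe.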
Once the XRD is registered, a developer can claim a database in any namespace they have permission to write to. They never see RDS parameter groups, subnet IDs, or KMS key ARNs; those live in the Composition, owned by the platform team.
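A claim of that kind might look like the following sketch, built as a Python dictionary and printed as JSON. The API group platform.example.org/v1alpha1 and the spec field names size, environment, and ownerTeam are assumptions for illustration; backupPolicy is the field name this lesson uses later. Your XRD defines the real schema.

```python
import json


def postgres_claim(name, namespace, size, environment, backup_policy, owner_team):
    """Build a PostgresDatabase claim manifest as a plain dictionary."""
    return {
        "apiVersion": "platform.example.org/v1alpha1",  # hypothetical group/version
        "kind": "PostgresDatabase",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Only developer-facing inputs appear here. Subnet IDs,
            # parameter groups, and KMS keys belong to the Composition.
            "size": size,
            "environment": environment,
            "backupPolicy": backup_policy,
            "ownerTeam": owner_team,
        },
    }


if __name__ == "__main__":
    manifest = postgres_claim(
        name="orders-db",
        namespace="team-payments",
        size="small",
        environment="dev",
        backup_policy="daily",
        owner_team="payments",
    )
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed output could be saved to a file and applied with kubectl apply -f, or committed to a GitOps repository.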
Roll out each Composition change as a new revision, and keep compositeDeletePolicy: Foreground. With a manual update policy, existing claims stay on the old revision; new claims pick up the new one. This gives you a safe migration path without a flag day that breaks all existing infrastructure. Google's internal platform tooling uses the same revision model: never mutate infrastructure under running workloads without a transition window.
Abstraction Layers: The Control Plane Stack
Crossplane is rarely used alone. Production platforms layer multiple control planes and abstraction levels. The standard stack at big-tech scale looks like this, from highest to lowest abstraction:
- Developer Portal / GitOps manifest: The developer's entry point. A Backstage software template renders a form, commits a PostgresDatabase YAML to the team's GitOps repository, and a Flux or Argo CD application syncs it to the platform cluster.
- Platform API (Crossplane XRD / Claim): The organisational contract. Validates inputs, enforces naming conventions via an admission webhook, stamps default labels (cost-center, team, environment), and routes to the correct Composition based on environment.
- Composition: The organisational opinion. Maps the claim to a set of managed resources with all hardened defaults: encryption at rest, multi-AZ for prod, backup retention, deletion protection, security group allowing only the app's pod CIDR.
- Provider Managed Resources: The cloud API translation layer. One Crossplane provider CRD per cloud resource type (e.g., RDSInstance, SubnetGroup, DBParameterGroup). The provider controller calls the AWS SDK and reconciles continuously.
- Cloud Control Plane (AWS/GCP/Azure): The actual infrastructure. The provider authenticates via IRSA/Workload Identity, making least-privilege API calls scoped to a single account or project per environment.
Guardrails: Policy Before Provisioning
Self-service without guardrails is just unsupervised cloud spending. Guardrails must fire before a resource is ever created, not after a cost spike or a security scan surfaces a violation. Three layers of guardrail work together in a mature platform:
- Schema validation (XRD openAPIV3Schema): Rejects malformed claims at admission. A developer cannot request size: xlarge if the enum only allows small/medium/large. This is a synchronous gate with negligible latency.
- OPA/Kyverno admission policies: Business logic the XRD schema cannot express. Examples: prod resources require a cost-center label; no developer namespace may provision more than three databases; the RDS major version must be on the approved list. Kyverno policies are CRDs themselves, so they are version-controlled alongside the platform code.
- Composition-enforced defaults: Even if a developer submits a valid claim, the Composition applies immutable platform-wide settings they cannot override: deletionPolicy: Delete only for dev, Orphan for prod; encryption enforced regardless of what the claim says; backup retention minimum 7 days in production regardless of the backupPolicy field value.
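The three layers run in a fixed order, which a small sketch makes visible. In a real platform these checks live in the XRD schema, in OPA or Kyverno policies, and in the Composition. The functions below only mirror the examples from the list, and the field names backupRetentionDays and storageEncrypted are assumptions for illustration.

```python
ALLOWED_SIZES = ("small", "medium", "large")  # enum from the schema example
MAX_DATABASES_PER_NAMESPACE = 3               # quota from the policy example
MIN_PROD_BACKUP_DAYS = 7                      # production minimum


def schema_errors(spec):
    """Layer 1: reject values outside the schema's enum."""
    errors = []
    if spec.get("size") not in ALLOWED_SIZES:
        errors.append(f"size must be one of {ALLOWED_SIZES}")
    return errors


def policy_errors(claim, databases_in_namespace):
    """Layer 2: business rules the schema cannot express."""
    errors = []
    labels = claim.get("labels", {})
    if claim["spec"].get("environment") == "prod" and "cost-center" not in labels:
        errors.append("prod resources require a cost-center label")
    if databases_in_namespace >= MAX_DATABASES_PER_NAMESPACE:
        errors.append("namespace already has the maximum number of databases")
    return errors


def enforced_settings(spec):
    """Layer 3: defaults applied regardless of what the claim requests."""
    is_prod = spec.get("environment") == "prod"
    requested = spec.get("backupRetentionDays", 0)
    retention = max(requested, MIN_PROD_BACKUP_DAYS) if is_prod else requested
    return {
        "storageEncrypted": True,
        "deletionPolicy": "Orphan" if is_prod else "Delete",
        "backupRetentionDays": retention,
    }


def admit(claim, databases_in_namespace):
    """Run the layers in order; nothing is provisioned if any layer fails."""
    errors = schema_errors(claim["spec"])
    if not errors:
        errors = policy_errors(claim, databases_in_namespace)
    if errors:
        return {"allowed": False, "errors": errors}
    return {"allowed": True, "settings": enforced_settings(claim["spec"])}


if __name__ == "__main__":
    claim = {
        "labels": {"cost-center": "cc-1234"},
        "spec": {"size": "small", "environment": "prod", "backupRetentionDays": 1},
    }
    print(admit(claim, databases_in_namespace=1))
```

The first two layers reject the claim, while the third silently corrects it. That split is a design choice: refuse requests that are wrong, and override settings that are never the developer's to decide.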
Test every Composition change in CI: render it against sample claims (crossplane beta render), and publish the diff of resulting managed resources before merging. At Spotify, every Composition PR includes a rendered-resource diff as a mandatory review artifact.
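A rendered-resource diff needs nothing more than the standard library once the rendered output has been parsed. The sketch below assumes you already have the managed resources from the main branch and from the pull request as lists of dictionaries; how you obtain and parse them is outside its scope. The resource fields shown are illustrative only.

```python
import difflib
import json


def rendered_diff(before, after):
    """Return a unified diff between two sets of rendered managed resources."""
    # Canonical JSON (sorted keys, fixed indent) keeps the diff free of
    # noise from key ordering.
    old = json.dumps(before, indent=2, sort_keys=True).splitlines()
    new = json.dumps(after, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(
            old, new, fromfile="main", tofile="pull-request", lineterm=""
        )
    )


if __name__ == "__main__":
    # Illustrative resources, not real provider schemas.
    main_branch = [
        {"kind": "RDSInstance", "spec": {"backupRetentionPeriod": 7, "multiAZ": True}}
    ]
    pull_request = [
        {"kind": "RDSInstance", "spec": {"backupRetentionPeriod": 14, "multiAZ": True}}
    ]
    print("\n".join(rendered_diff(main_branch, pull_request)))
```

Posting this output as a PR comment lets reviewers see the effect of a Composition change on real resources, not just the change to the template.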
Production Failure Modes
Self-service infrastructure surfaces its own failure patterns that differ from manually provisioned infrastructure:
- Provider throttling cascades: A Composition that creates ten managed resources simultaneously will make ten concurrent cloud API calls. At scale (50 teams creating databases simultaneously during a fleet rotation) you will hit AWS API rate limits. Crossplane providers expose --max-reconcile-rate and --poll-interval flags. Tune them: a max-reconcile-rate of 10 and a 10-minute poll interval is a sane starting point for RDS at 500-resource scale.
- Orphaned resources from failed compositions: If a Composition creates resources A, B, and C and the creation of C fails permanently, resources A and B may remain: billed, unsecured, unmonitored. Implement a finalizer-based cleanup controller and alert on XPostgresDatabase resources stuck in Synced: False for more than 30 minutes.
- IRSA role scope too broad: Many teams create one cross-account IAM role for the entire Crossplane provider. A compromised provider pod then has write access to all cloud resources in the account. Use a separate ProviderConfig per environment and scope each IAM role to a single service's actions, narrowed further with condition keys (rds:* only, not *:*).
Self-service infrastructure done well is invisible to developers and auditable to everyone else. The developer gets a database in four minutes; the security team sees a complete audit trail; the finance team has a cost-center tag on every resource; the platform team spends zero time on tickets. That four-minute provisioning time — with full policy compliance baked in — is the concrete product metric that justifies building the platform in the first place.