Cost & Platform Experience
At Google, Airbnb, and Stripe, the platform team owns two responsibilities that are easy to undervalue until they become crises: keeping the cloud bill predictable, and making the platform something engineers actually want to use. FinOps guardrails prevent a single misconfigured autoscaler from generating a $400 k surprise invoice. Developer self-service — golden paths, service catalogs, platform CLIs — is what separates a platform that scales to 1,000 engineers from one that drowns in Slack support tickets at 50. This lesson covers both, at production depth.
FinOps Guardrails
Namespace resource quotas are the first line of defence. Every team namespace must carry a ResourceQuota that caps both requests and limits. Without it, a single runaway job can saturate an entire node pool and trigger autoscaler cascades that take 20 minutes to drain. The quota is enforced at admission time: if creating a pod would push the namespace's total requests or limits above the quota, the API server rejects the pod immediately, before any scheduling happens. A quota that tracks CPU or memory also rejects pods that do not declare those requests or limits, so pair it with a LimitRange that supplies defaults.
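The admission check described above can be sketched in a few lines of Python. This is a simplified model for intuition, not the API server's implementation: the Pod class, the admit function, and the use of millicores and bytes as plain integers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Resource totals in base units: CPU in millicores, memory in bytes.
Resources = Dict[str, int]


@dataclass
class Pod:
    name: str
    requests: Optional[Resources]  # None models a pod that omits requests
    limits: Optional[Resources]    # None models a pod that omits limits


def admit(pod: Pod, hard: Dict[str, Resources], used: Dict[str, Resources]) -> Tuple[bool, str]:
    """Return (admitted, reason) for a pod checked against a namespace quota.

    `hard` and `used` are keyed by "requests" and "limits"; each maps a
    resource name to a namespace-wide total.
    """
    for kind in ("requests", "limits"):
        pod_values = getattr(pod, kind)
        for resource, ceiling in hard.get(kind, {}).items():
            # A quota that tracks a resource requires every pod to declare it.
            if pod_values is None or resource not in pod_values:
                return False, f"must specify {kind}.{resource}"
            total = used.get(kind, {}).get(resource, 0) + pod_values[resource]
            if total > ceiling:
                return False, f"exceeded quota: {kind}.{resource} {total} > {ceiling}"
    return True, "ok"
```

With a quota of 8000 millicores of CPU requests and 7000 already in use, a pod requesting 500 is admitted and one requesting 2000 is rejected before the scheduler ever sees it.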
Karpenter consolidation and Spot budgets. You already have Karpenter running from Lesson 4. The key FinOps levers are: set consolidationPolicy: WhenUnderutilized on all general NodePools (already shown; newer Karpenter releases name this value WhenEmptyOrUnderutilized), configure a disruption.budgets block so consolidation never disrupts more than 20% of a NodePool's nodes at once, and aggressively shift non-critical workloads (batch pipelines, CI runners, async workers) to Spot. A well-tuned shop runs 60-70% of total cluster compute on Spot, which can save $1.2–1.8 M/year at mid-scale (200-node cluster, us-east-1 m6i.4xlarge mix), depending on the instance mix and the Spot discount actually realised.
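To see what a 20% disruption budget means in node counts, the following Python sketch computes how many nodes may still be voluntarily disrupted. It assumes the calculation works the way the Karpenter documentation describes it (percentage rounded up, minus nodes already deleting or NotReady, with the most restrictive budget winning); the function names are hypothetical and the behaviour should be checked against the Karpenter version you run.

```python
from typing import Iterable


def allowed_disruptions(total_nodes: int, budget: str, deleting: int = 0, not_ready: int = 0) -> int:
    """Nodes that may still be voluntarily disrupted under one budget.

    `budget` is a percentage such as "20%" or an absolute count such as "5".
    Percentages are rounded up (an assumption taken from the Karpenter docs),
    then nodes already deleting or NotReady are subtracted. Never negative.
    """
    if budget.endswith("%"):
        pct = int(budget[:-1])
        ceiling = (total_nodes * pct + 99) // 100  # integer ceiling division
    else:
        ceiling = int(budget)
    return max(0, ceiling - deleting - not_ready)


def most_restrictive(total_nodes: int, budgets: Iterable[str], deleting: int = 0, not_ready: int = 0) -> int:
    """When several budgets apply, the smallest allowance wins."""
    return min(allowed_disruptions(total_nodes, b, deleting, not_ready) for b in budgets)
```

For a 200-node NodePool, a "20%" budget allows 40 concurrent disruptions; adding a second budget of "5" drops that to 5.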
Cloud cost allocation with tags enforced at IaC. Every AWS resource created by Terraform must carry at minimum: team, env, service, and cost-centre. Enforce this with the AWS provider's default_tags block, which applies a baseline tag set to every taggable resource, and an OPA policy checked in CI: a merge that introduces a resource without the full tag set fails the check on the plan. In Kubernetes, label or annotate namespaces with team and cost-centre and use Kubecost (or OpenCost) to surface per-namespace spend daily.
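The CI tag check can be illustrated with a short Python stand-in for the OPA policy (a real setup would express the same rule in Rego). It assumes the input is the JSON produced by terraform show -json on a saved plan, which lists planned changes under resource_changes; the function name missing_tags is hypothetical.

```python
from typing import Any, Dict, List

REQUIRED_TAGS = ("team", "env", "service", "cost-centre")


def missing_tags(plan: Dict[str, Any]) -> List[str]:
    """Return one violation string per created or updated resource lacking a required tag."""
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = change.get("actions", [])
        if "create" not in actions and "update" not in actions:
            continue  # deletes and no-ops cannot introduce untagged resources
        after = change.get("after") or {}
        # Resource types that do not support tags expose neither attribute.
        if "tags" not in after and "tags_all" not in after:
            continue
        # tags_all includes provider-level default tags when they are known at plan time.
        tags = {**(after.get("tags") or {}), **(after.get("tags_all") or {})}
        absent = [t for t in REQUIRED_TAGS if not tags.get(t)]
        if absent:
            violations.append(f"{rc.get('address', '<unknown>')}: missing {', '.join(absent)}")
    return violations
```

A CI step would load the plan JSON, call missing_tags, print the violations, and exit non-zero if the list is not empty. Tag values that are only known after apply will not appear in the plan, so a check like this can report false positives for computed tags.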
Budget alerts and hard limits. In AWS, use Cost Anomaly Detection (aws ce create-anomaly-subscription) with an SNS topic that pages the platform on-call when spend for any service or cost-allocation tag jumps well above its baseline; a workable starting rule is 24-hour spend above 150% of the trailing 7-day average. Cost Anomaly Detection derives its own expected-spend baseline, so the threshold you configure is an impact over that baseline rather than a literal 7-day average. Pair this with an AWS Budgets action, which can apply a restrictive IAM policy, attach an SCP, or stop targeted EC2 or RDS instances when the daily budget threshold is breached, or route the budget's SNS notification to a Lambda that scales in an Auto Scaling group. For GCP, Cloud Billing Budget notifications feed into a Pub/Sub → Cloud Function path. This circuit-breaker pattern caught a $60 k/day DynamoDB scan regression at a fintech two hours after it was introduced, before it ran overnight.
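The 150%-of-trailing-average rule is simple enough to state as code. The sketch below is only the arithmetic of that rule, written as a hypothetical helper a Lambda or Cloud Function might call; it is not how any managed anomaly detector computes its baseline.

```python
from statistics import mean
from typing import Sequence


def breaches_threshold(trailing_daily_spend: Sequence[float], last_24h_spend: float, ratio: float = 1.5) -> bool:
    """True when the last 24 hours cost more than `ratio` times the trailing 7-day average.

    `trailing_daily_spend` holds daily totals, oldest first, excluding the
    24-hour window being tested. At least 7 days are required.
    """
    if len(trailing_daily_spend) < 7:
        raise ValueError("need at least 7 days of history")
    baseline = mean(trailing_daily_spend[-7:])
    return last_24h_spend > ratio * baseline
```

With a steady $100/day baseline, a $151 day trips the alert and a $150 day does not, because the comparison is strict.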
The Developer Self-Service Surface
A platform that requires filing a Jira ticket to get a new namespace is not a platform — it is a bureaucracy with YAML. The self-service surface is the set of interfaces through which engineers provision, deploy, observe, and troubleshoot their services without paging the platform team. At big-tech scale this has three canonical layers: a service catalog, a platform CLI, and golden-path templates.
Service catalog (Backstage or equivalent). Backstage, open-sourced by Spotify, is the de facto standard internal developer portal at companies above ~500 engineers. Every service registers itself via a catalog-info.yaml in its repo root. Backstage ingests this file and links the service to its Grafana dashboards, PagerDuty escalation policies, runbooks in Confluence, and Kubernetes workload status. Engineers can get up to speed on an unfamiliar service (its dependencies, SLOs, on-call rotation, and deploy history) in minutes instead of hours. The platform team's job is to keep the scaffolding plugins and software templates up to date; the product teams maintain their own catalog-info.yaml.
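Because product teams own their catalog-info.yaml, a lightweight pre-merge check helps keep the catalog clean. The Python sketch below validates an already-parsed Component entity against the core fields of the Backstage descriptor format (apiVersion, kind, metadata.name, and spec.type, spec.lifecycle, spec.owner). The function name and the idea of running it in CI are assumptions for illustration; YAML parsing is left out to keep the example standard-library only.

```python
from typing import Any, Dict, List

REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def validate_component(entity: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed Component entity (empty if none)."""
    problems = []
    if entity.get("apiVersion") != "backstage.io/v1alpha1":
        problems.append("apiVersion should be backstage.io/v1alpha1")
    if entity.get("kind") != "Component":
        problems.append("kind should be Component")
    if not (entity.get("metadata") or {}).get("name"):
        problems.append("metadata.name is required")
    spec = entity.get("spec") or {}
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"spec.{field} is required")
    return problems


# Example entity; the owner and lifecycle values are illustrative.
EXAMPLE = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "payments-api"},
    "spec": {"type": "service", "lifecycle": "production", "owner": "team-payments"},
}
```

Calling validate_component(EXAMPLE) returns an empty list; removing spec.owner yields a single "spec.owner is required" problem.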
Platform CLI — plat. Large platform teams ship an internal CLI (commonly named plat, kit, or dx) that wraps kubectl, helm, argocd, and AWS/GCP CLIs behind opinionated subcommands. The goal is to eliminate the 14-step "bootstrap a new service" runbook and replace it with one command. Common subcommands:
plat new service payments-api --template go-grpc
    Scaffolds the repo, registers it in Backstage, creates the ArgoCD app, and provisions a namespace with default quota and NetworkPolicy.
plat promote payments-api --from staging --to production
    Pins the image digest, creates a GitOps PR to the production overlay, and links to the change management ticket.
plat logs payments-api --tail 200 --env production
    Streams from the Loki/CloudWatch aggregator for the service's pods without requiring knowledge of label selectors.
plat status payments-api
    Shows rollout status, SLO burn rate, and the last 5 deploys in one view.
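The subcommand surface above maps naturally onto a nested argument parser. The following Python sketch shows one possible skeleton using argparse; it only parses arguments and returns the list of steps each command would perform, with no calls to kubectl, ArgoCD, or Backstage. The default values for --tail and --env and the wording of the steps are assumptions, not a description of any real plat implementation.

```python
import argparse
from typing import List, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="plat")
    sub = parser.add_subparsers(dest="command", required=True)

    new = sub.add_parser("new")
    new_sub = new.add_subparsers(dest="resource", required=True)
    svc = new_sub.add_parser("service")
    svc.add_argument("name")
    svc.add_argument("--template", required=True)

    promote = sub.add_parser("promote")
    promote.add_argument("name")
    # "from" is a Python keyword, so store the value under a different attribute.
    promote.add_argument("--from", dest="source", required=True)
    promote.add_argument("--to", dest="target", required=True)

    logs = sub.add_parser("logs")
    logs.add_argument("name")
    logs.add_argument("--tail", type=int, default=100)  # assumed default
    logs.add_argument("--env", default="staging")       # assumed default

    status = sub.add_parser("status")
    status.add_argument("name")
    return parser


def plan(argv: Sequence[str]) -> List[str]:
    """Parse argv and return the steps the command would run (dry run only)."""
    args = build_parser().parse_args(argv)
    if args.command == "new":
        return [
            f"scaffold repo {args.name} from template {args.template}",
            f"register {args.name} in Backstage",
            f"create ArgoCD app for {args.name}",
            f"provision namespace {args.name} with default quota and NetworkPolicy",
        ]
    if args.command == "promote":
        return [
            f"pin image digest of {args.name} from {args.source}",
            f"open GitOps PR against the {args.target} overlay",
            "link the change management ticket",
        ]
    if args.command == "logs":
        return [f"stream last {args.tail} lines for {args.name} in {args.env}"]
    return [f"show rollout status, SLO burn rate, and last 5 deploys for {args.name}"]
```

Returning a plan instead of executing it keeps the skeleton testable and gives a natural place to hang a --dry-run flag later.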
Golden-path templates. A golden path is an opinionated, pre-approved way to build and deploy a particular type of service — Go gRPC microservice, React SPA, Python ML worker, etc. — that bakes in all the platform defaults: resource requests, NetworkPolicy, Istio annotations, OTel sidecar injection, Kyverno-approved image registries, and a functioning CI/CD pipeline. Teams that follow the golden path spend zero time on platform configuration and focus entirely on product code. The platform team maintains the templates in a platform-templates repo; Backstage software templates trigger cookiecutter or copier rendering and open a PR against a new repo in the engineering GitHub org.
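The core mechanic of a golden-path template, platform defaults merged with a small set of permitted overrides and rendered into a manifest, can be shown with Python's string.Template as a stand-in for cookiecutter or copier. Everything named here is hypothetical: the default values, the registry hostname, and the manifest shape are placeholders chosen for the example.

```python
from string import Template

# Platform-owned defaults; teams may override these keys and nothing else.
PLATFORM_DEFAULTS = {
    "cpu_request": "250m",
    "memory_request": "256Mi",
    "registry": "registry.example.internal",  # placeholder hostname
}

MANIFEST_TEMPLATE = Template(
    "service: $service\n"
    "image: $registry/$service:$tag\n"
    "resources:\n"
    "  requests:\n"
    "    cpu: $cpu_request\n"
    "    memory: $memory_request\n"
)


def render(service: str, tag: str, **overrides: str) -> str:
    """Render the manifest, rejecting overrides the platform does not expose."""
    unknown = set(overrides) - set(PLATFORM_DEFAULTS)
    if unknown:
        raise KeyError(f"unsupported overrides: {sorted(unknown)}")
    values = {**PLATFORM_DEFAULTS, **overrides, "service": service, "tag": tag}
    return MANIFEST_TEMPLATE.substitute(values)
```

Calling render("payments-api", "1.4.2") produces a manifest with the default requests, while render("payments-api", "1.4.2", cpu_request="500m") changes only the CPU request. Rejecting unknown keys is the design choice that keeps teams on the path: anything outside the exposed knobs has to go through the deviation process rather than a silent override.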
Paved roads vs. escape hatches. At mature big-tech platforms (Netflix Paved Roads, Spotify System Model, Airbnb OneTouch), there is a deliberate escape hatch: teams that have a legitimate reason to deviate from the golden path can do so via an RFC process, but they own the operational burden of the deviation. This creates a healthy incentive: the more painful it is to maintain a deviation, the more teams converge on the paved road. The platform team reviews deviations quarterly and absorbs the patterns that have proven themselves into new golden paths.
By the end of this lesson the capstone platform has all of its guardrails in place: FinOps controls that prevent cost surprises, a service catalog that gives every engineer a single pane of glass, and a CLI plus template library that makes doing the right thing the easy thing. Lesson 10 assembles the complete picture and discusses where this career trajectory leads next.