Capstone: A Big-Tech Production Platform

Cost & Platform Experience

18 min Lesson 9 of 30

At Google, Airbnb, and Stripe, the platform team owns two responsibilities that are easy to undervalue until they become crises: keeping the cloud bill predictable, and making the platform something engineers actually want to use. FinOps guardrails prevent a single misconfigured autoscaler from generating a $400 k surprise invoice. Developer self-service — golden paths, service catalogs, platform CLIs — is what separates a platform that scales to 1,000 engineers from one that drowns in Slack support tickets at 50. This lesson covers both, at production depth.

FinOps Guardrails

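Namespace resource quotas are the first line of defence. Every team namespace must carry a ResourceQuota that caps both requests and limits. Without it, a single runaway job can saturate an entire node pool and trigger autoscaler cascades that take 20 minutes to drain. The quota is enforced at admission time: if creating a pod would push the namespace's aggregate requests or limits past the ceiling, the API server rejects the pod immediately, before any scheduling happens. Once a quota covers CPU or memory, pods that omit those requests or limits are rejected as well, which is why the quota is paired with a LimitRange that supplies defaults.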

# namespace-quota.yaml — applied per team namespace via GitOps
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments         # substitute per team
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "200"
    services: "20"
    persistentvolumeclaims: "10"
    count/jobs.batch: "50"
---
# LimitRange sets per-container defaults so pods without explicit requests
# still land in quota accounting and Karpenter can bin-pack correctly.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: payments
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: 512Mi
      defaultRequest:
        cpu: "100m"
        memory: 128Mi
      max:
        cpu: "8"
        memory: 16Gi
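
The admission arithmetic behind the quota can be sketched in a few lines. This is an illustration of the check, not the API server's implementation: the function names (parse_cpu, parse_memory, fits_quota) are hypothetical, and only the CPU and memory keys from the manifest above are handled.

```python
from decimal import Decimal

# Binary suffixes used by Kubernetes memory quantities.
_MEMORY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "500m" or "40" to cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "80Gi" to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: plain bytes


def fits_quota(hard: dict, used: dict, pod: dict) -> bool:
    """Return True if current usage plus the new pod stays within every ceiling.

    Keys look like "requests.cpu" or "limits.memory", as in the ResourceQuota.
    """
    parsers = {"cpu": parse_cpu, "memory": parse_memory}
    for key, ceiling in hard.items():
        parse = parsers[key.split(".")[-1]]
        total = parse(used.get(key, "0")) + parse(pod.get(key, "0"))
        if total > parse(ceiling):
            return False
    return True
```

With a 40-core requests.cpu ceiling and 39.5 cores already in use, a pod requesting 500m is admitted (the total equals the ceiling) while one requesting 600m is rejected.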

Karpenter consolidation and Spot budgets. You already have Karpenter running from Lesson 4. The key FinOps levers are: set consolidationPolicy: WhenUnderutilized on all general NodePools (already shown; the karpenter.sh/v1 API names this value WhenEmptyOrUnderutilized), configure a disruption.budgets block so consolidation never disrupts more than 20% of a NodePool's nodes at once (budgets count nodes, not pods), and aggressively shift non-critical workloads (batch pipelines, CI runners, async workers) to Spot. A well-tuned shop runs 60-70% of total cluster compute on Spot, saving $1.2–1.8 M/year at mid-scale (200-node cluster, us-east-1 m6i.4xlarge mix).
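
To see what a 20% disruption budget means in practice, the following sketch computes how many nodes may be disrupted at once. It assumes the rule described in Karpenter's documentation (a percentage is rounded up against the NodePool's node count, then nodes that are already deleting or not ready are subtracted); the function name allowed_disruptions is hypothetical.

```python
def allowed_disruptions(total_nodes: int, budget: str,
                        deleting: int = 0, not_ready: int = 0) -> int:
    """Nodes that may be voluntarily disrupted right now under one budget.

    budget is either a percentage string such as "20%" or an absolute
    node count such as "5", mirroring the `nodes` field of a budget entry.
    """
    if budget.endswith("%"):
        percent = int(budget[:-1])
        # Integer ceiling of total_nodes * percent / 100.
        ceiling = (total_nodes * percent + 99) // 100
    else:
        ceiling = int(budget)
    return max(0, ceiling - deleting - not_ready)
```

On a 200-node NodePool a "20%" budget allows 40 concurrent disruptions; on a 7-node pool it allows 2, because the percentage rounds up.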

Cloud cost allocation with tags enforced at IaC. Every AWS resource created by Terraform must carry at minimum: team, env, service, and cost-centre. Apply the common tags through the AWS provider's default_tags block, and enforce the full set with an OPA policy evaluated against the plan JSON in CI, so that a pull request introducing a taggable resource without the tag set fails the check before merge. In Kubernetes, annotate namespaces with team and cost-centre and use Kubecost (or OpenCost) to surface per-namespace spend daily.
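
The tag check itself is simple enough to sketch. OPA policies are written in Rego; the Python below is a stand-in that illustrates the same logic against the output of terraform show -json, not the policy the platform actually ships. It assumes taggable resources expose tags or tags_all in the planned values, and the function name untagged_resources is hypothetical.

```python
REQUIRED_TAGS = {"team", "env", "service", "cost-centre"}


def untagged_resources(plan: dict) -> dict:
    """Map resource address -> sorted list of missing tag keys.

    `plan` is the parsed JSON from `terraform show -json <planfile>`.
    Only resources being created or updated are checked.
    """
    violations = {}
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if not set(change.get("actions", [])) & {"create", "update"}:
            continue
        after = change.get("after") or {}
        if "tags" not in after and "tags_all" not in after:
            continue  # assumption: this resource type has no tag support
        tags = after.get("tags_all") or after.get("tags") or {}
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            violations[rc["address"]] = sorted(missing)
    return violations
```

A CI step would load the plan with json.load, call the function, and exit non-zero when the result is non-empty. Tag values that are only known at apply time will not appear in the planned values, so a production policy needs to decide how to treat them.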

# Kubecost allocation query — top 10 namespaces by 7-day cost (REST API)
curl -sG "http://kubecost.platform.internal/model/allocation" \
  --data-urlencode "window=7d" \
  --data-urlencode "aggregate=namespace" \
  --data-urlencode "accumulate=true" \
  | jq '.data[0]
        | to_entries
        | sort_by(-.value.totalCost)
        | .[0:10]
        | map({namespace:.key, cost:.value.totalCost, cpuCost:.value.cpuCost, memCost:.value.ramCost})'

# Expected output shape:
# [
#   {"namespace":"ml-training","cost":4812.34,"cpuCost":2100.10,"memCost":1340.22},
#   {"namespace":"payments",   "cost":3201.87,"cpuCost":1890.44,"memCost": 980.11},
#   ...
# ]

# Weekly cost report pushed to Slack via scheduled k8s CronJob
# (see platform-runbooks/finops/weekly-cost-report for the full job manifest)

Showback before chargeback. Big-tech teams start with showback — making cost visible per team without moving money. Only after teams have had two quarters of data and tuning opportunity do they flip to chargeback (actual internal billing). Forcing chargeback on day one triggers political resistance that kills the FinOps programme. Let data build trust first.

Budget alerts and hard limits. In AWS, use Cost Anomaly Detection (aws ce create-anomaly-monitor plus aws ce create-anomaly-subscription) with an SNS topic that pages the platform on-call. Cost Anomaly Detection learns its own baseline and alerts on anomaly impact, so if you want the explicit rule used here (24-hour spend for any service tag exceeding 150% of the trailing 7-day average), compute it from Cost Explorer data in a scheduled job. Pair the alert with an AWS Budgets action, which can attach a restrictive IAM policy or SCP or stop targeted EC2 and RDS instances, or with an SNS-triggered Lambda that scales in an Auto Scaling group when the daily budget threshold is breached. For GCP, Cloud Billing budget notifications feed into a Pub/Sub → Cloud Function path. This circuit-breaker pattern caught a $60 k/day DynamoDB scan regression at a fintech two hours after it was introduced, before it ran overnight.
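
The 150%-of-trailing-average rule can be written down directly. This is a minimal sketch of the threshold logic only; fetching the daily spend figures (for example from a billing export) is out of scope, and the function name spend_is_anomalous is hypothetical.

```python
from statistics import mean


def spend_is_anomalous(last_24h: float, trailing_daily: list,
                       ratio: float = 1.5) -> bool:
    """True when the last 24 hours of spend exceed `ratio` times the
    average of the previous seven daily totals."""
    if len(trailing_daily) != 7:
        raise ValueError("expected exactly 7 trailing daily totals")
    baseline = mean(trailing_daily)
    return last_24h > ratio * baseline
```

A service that normally costs $1,000 a day trips the rule at anything above $1,500. Running the check every few hours rather than once a day is what lets a regression be caught the same afternoon instead of the next morning.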

The Developer Self-Service Surface

A platform that requires filing a Jira ticket to get a new namespace is not a platform — it is a bureaucracy with YAML. The self-service surface is the set of interfaces through which engineers provision, deploy, observe, and troubleshoot their services without paging the platform team. At big-tech scale this has three canonical layers: a service catalog, a platform CLI, and golden-path templates.

Service catalog (Backstage or equivalent). Backstage, open-sourced by Spotify, is the de facto standard internal developer portal at companies above ~500 engineers. Every service registers itself via a catalog-info.yaml in its repo root. Backstage ingests this file and links the service to its Grafana dashboards, PagerDuty escalation policies, runbooks in Confluence, and Kubernetes workload status. Engineers onboarding to a new service can understand its dependencies, SLOs, on-call rotation, and deploy history in minutes instead of hours. The platform team's job is to keep the scaffolding plugins and software templates up to date; the product teams maintain their own catalog-info.yaml.

# catalog-info.yaml — lives in every service repo root
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Core payment processing microservice
  annotations:
    github.com/project-slug: acme/payments-api
    pagerduty.com/service-id: PABC123
    grafana/dashboard-url: https://grafana.internal/d/payments
    runbook-url: https://wiki.internal/runbooks/payments-api
  tags:
    - payments
    - critical-path
spec:
  type: service
  lifecycle: production
  owner: team-payments
  system: financial-platform
  dependsOn:
    - component:ledger-service
    - resource:payments-db
  providesApis:
    - payments-rest-api
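
Because product teams maintain their own catalog-info.yaml, a small CI lint can catch incomplete entries before Backstage ingests them. The sketch below assumes the file has already been parsed into a dict (for example with a YAML library) and checks only the fields Backstage requires for a Component: metadata.name, spec.type, spec.lifecycle, and spec.owner. The function name catalog_errors is hypothetical.

```python
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def catalog_errors(entity: dict) -> list:
    """Return human-readable problems for a Component entity.

    Entities of other kinds are skipped and yield an empty list.
    """
    if entity.get("kind") != "Component":
        return []
    errors = []
    if not (entity.get("metadata") or {}).get("name"):
        errors.append("metadata.name is required")
    spec = entity.get("spec") or {}
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            errors.append(f"spec.{field} is required")
    return errors
```

Failing the build on a non-empty result keeps ownerless services out of the catalog, which matters because owner is what ties a service to its on-call rotation.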

Platform CLI — plat. Large platform teams ship an internal CLI (commonly named plat, kit, or dx) that wraps kubectl, helm, argocd, and AWS/GCP CLIs behind opinionated subcommands. The goal is to eliminate the 14-step "bootstrap a new service" runbook and replace it with one command. Common subcommands:

plat new service payments-api --template go-grpc — scaffolds repo, registers in Backstage, creates ArgoCD app, provisions namespace with default quota and NetworkPolicy
plat promote payments-api --from staging --to production — pins the image digest, creates a GitOps PR to the production overlay, links to the change management ticket
plat logs payments-api --tail 200 --env production — streams from the Loki/CloudWatch aggregator for the service's pods without requiring knowledge of label selectors
plat status payments-api — shows rollout status, SLO burn rate, and last 5 deploys in one view
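
The command surface above maps naturally onto nested subcommands. The following is a hypothetical skeleton of the argument parsing only, using Python's argparse; the real work of each subcommand (calling kubectl, helm, argocd, or the cloud CLIs) is omitted, and the defaults for --tail and --env are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for a hypothetical `plat` platform CLI."""
    parser = argparse.ArgumentParser(prog="plat")
    sub = parser.add_subparsers(dest="command", required=True)

    # plat new service <name> --template <template>
    new = sub.add_parser("new", help="scaffold from a golden-path template")
    new.add_argument("kind", choices=["service"])
    new.add_argument("name")
    new.add_argument("--template", required=True)

    # plat promote <name> --from <env> --to <env>
    promote = sub.add_parser("promote", help="open a GitOps promotion PR")
    promote.add_argument("name")
    # "from" is a Python keyword, so store the value under another name.
    promote.add_argument("--from", dest="source", required=True)
    promote.add_argument("--to", dest="target", required=True)

    # plat logs <name> --tail <n> --env <env>
    logs = sub.add_parser("logs", help="stream logs for a service")
    logs.add_argument("name")
    logs.add_argument("--tail", type=int, default=100)  # assumed default
    logs.add_argument("--env", default="staging")       # assumed default

    # plat status <name>
    status = sub.add_parser("status", help="rollout status and recent deploys")
    status.add_argument("name")
    return parser
```

Calling build_parser().parse_args() in the entry point yields a namespace whose command attribute selects the handler to dispatch to. Keeping parsing separate from the handlers makes the CLI easy to unit-test without a cluster.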

Measure time-to-first-deploy. The canonical developer-experience metric for the self-service surface is Time To First Deploy (TTFD): how long from "engineer has repo access" to "service is running in staging." At Stripe this is under 15 minutes for standard microservices. Instrument TTFD with timestamps in your CI pipeline (repo creation event → first successful ArgoCD sync) and track it as a platform SLO. When it degrades, the platform team investigates, not the product engineers.
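
Computing TTFD from the two pipeline timestamps is a one-line subtraction, shown here as a sketch. It assumes both events are recorded as ISO 8601 strings with an explicit UTC offset; the 15-minute target is used as an illustrative SLO, and the names time_to_first_deploy and TTFD_SLO are hypothetical.

```python
from datetime import datetime, timedelta

TTFD_SLO = timedelta(minutes=15)  # illustrative target


def time_to_first_deploy(repo_created: str, first_sync: str) -> timedelta:
    """Elapsed time from repo creation to the first successful sync.

    Both arguments are ISO 8601 timestamps, e.g. "2024-05-01T10:00:00+00:00".
    """
    start = datetime.fromisoformat(repo_created)
    end = datetime.fromisoformat(first_sync)
    if end < start:
        raise ValueError("first sync precedes repo creation")
    return end - start
```

A service whose repo was created at 10:00:00 and first synced at 10:12:30 has a TTFD of 12 minutes 30 seconds, inside the target. Tracking the distribution across new services, rather than a single average, shows whether a particular template is dragging the number up.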

Figure: FinOps guardrails (left) and developer self-service surface (right) as the two pillars of platform maturity.

Golden-path templates. A golden path is an opinionated, pre-approved way to build and deploy a particular type of service — Go gRPC microservice, React SPA, Python ML worker, etc. — that bakes in all the platform defaults: resource requests, NetworkPolicy, Istio annotations, OTel sidecar injection, Kyverno-approved image registries, and a functioning CI/CD pipeline. Teams that follow the golden path spend zero time on platform configuration and focus entirely on product code. The platform team maintains the templates in a platform-templates repo; Backstage software templates trigger cookiecutter or copier rendering and open a PR against a new repo in the engineering GitHub org.

The "just give us cluster access" trap. Teams sometimes ask the platform to give them direct write access to the cluster to "move faster." This is a false economy. Every team that bypasses GitOps adds a surface that the platform cannot audit, cannot roll back, and cannot observe in cost allocation. The self-service surface must be the only path, and it must be fast enough that teams prefer it. If engineers are bypassing the platform, the correct response is to make the platform faster, not to grant raw cluster access.

Paved roads vs. escape hatches. At mature big-tech platforms (Netflix Paved Roads, Spotify System Model, Airbnb OneTouch), there is a deliberate escape hatch: teams that have a legitimate reason to deviate from the golden path can do so via an RFC process, but they own the operational burden of the deviation. This creates a self-correcting incentive: the more painful it is to maintain a deviation, the more teams converge on the paved road. The platform team reviews deviations quarterly and absorbs the patterns that have proven themselves into new golden paths.

By the end of this lesson the capstone platform has all of its guardrails in place: FinOps controls that prevent cost surprises, a service catalog that gives every engineer a single pane of glass, and a CLI plus template library that makes doing the right thing the easy thing. Lesson 10 assembles the complete picture and discusses where this career trajectory leads next.