Platform Engineering & Developer Experience

Building vs Buying Platform Components

18 min Lesson 9 of 28

Building vs Buying Platform Components

Every platform team eventually faces the same crossroads: should we integrate an open-source CNCF project, purchase a commercial SaaS product, or build the capability ourselves? At scale, this decision repeats dozens of times — for the service catalog, for secrets management, for progressive delivery, for developer portals, for cost attribution. The engineers who answer it badly burn millions of dollars, years of engineering effort, or both. The engineers who answer it well build platforms that compound in value over time.

This lesson gives you a structured framework for navigating the CNCF landscape — which has 170+ projects across 18 categories as of 2025 — without drowning in choice paralysis. The framework applies equally to CNCF Graduated projects, commercial tooling (Harness, Humanitec, Port, Cortex), and full in-house builds.

The CNCF Landscape Is Not a Shopping Catalogue

The canonical mistake is treating the CNCF landscape as a feature matrix to compare column by column. The landscape is a taxonomy of solved problems, not a product recommendation. A Graduated project like Argo CD is production-proven at thousands of companies; a Sandbox project is experimental research. The maturity levels matter enormously: Graduated (rigorous TOC review, production adoption evidence, security audits) → Incubating (growing adoption, vendor-neutral governance) → Sandbox (early-stage, high churn risk).

The CNCF Annual Survey 2024 found that 84% of respondents run Kubernetes in production, but the median organization uses only 7–10 CNCF projects actively. That gap between 170+ available projects and ~10 in use is where good platform judgment lives. Selection pressure should be severe.

A Decision Framework: Four Axes

Evaluate every platform component on these four axes before deciding:

Differentiation: Does this capability differentiate your business? If yes, build. No company gains competitive advantage from off-the-shelf CI. But a bespoke developer portal deeply integrated with your internal toolchain is a real differentiator.
Operational burden: What does day-2 look like? Running Kafka, Vault, or a distributed tracing backend in-house requires dedicated SRE time. A SaaS alternative (Confluent, HCP Vault, Honeycomb) shifts that burden to the vendor. Calculate honestly: 2 FTEs at $200k each to operate a tool versus $240k/yr SaaS is not a saving.
Maturity and lock-in: CNCF Graduated projects with vendor-neutral APIs (OTel, Prometheus, Flux) carry low lock-in risk. A proprietary SaaS with no export API is a trap. OpenTelemetry is the exemplary case: instrument once, send to any backend. Optimize for open standards at the instrumentation layer even when you buy at the storage/visualization layer.
Integration complexity: Can you wire this into your existing golden paths cheaply? A tool requiring a custom operator, a dedicated namespace per tenant, and a custom admission webhook is a heavy integration cost. Weight this against build.

Decision matrix for platform component selection: differentiation and operational burden drive the build-vs-buy axis; lock-in and TCO apply universally.

Evaluating the CNCF Landscape Without Drowning

With 170+ projects, a structured filter is essential. Apply these layers in order:

Problem first, project second. Define the problem in one sentence ("we need policy enforcement at admission time for Kubernetes workloads") before opening the landscape. This prevents the gravitational pull of interesting-but-irrelevant projects.
Maturity gate. Start from Graduated. If nothing fits, look at Incubating. Sandbox projects should only enter your evaluation if you have the capacity to contribute upstream and accept significant API churn.
Adoption signal. Check GitHub stars trajectory (not peak), CNCF end-user survey mentions, and conference talk density in the last 12 months. A Graduated project with declining PR velocity is a warning signal.
Threat model fit. Some tools have known production failure modes at your scale. Jaeger works beautifully up to ~100k traces/sec; beyond that, Tempo or a commercial backend scales better. Validate against your p95 load.
Governance and vendor diversity. A project controlled by a single vendor carries acquisition or abandonment risk. Check the TOC member distribution and the ratio of single-employer commits in the last 6 months.

Use the CNCF Technology Radar (published twice yearly) as a tiebreaker when two projects solve the same problem. It reflects the operational experience of dozens of large-scale end users — it is not perfect but is a strong prior. The 2024 radar moved Argo CD and Crossplane to "Adopt", Tetragon to "Trial", and several Sandbox scheduling projects to "Assess".

Practical Evaluation Script: Scoring a Candidate Project

When a candidate CNCF project reaches the shortlist, run this evaluation before committing engineering effort to a proof of concept:

#!/usr/bin/env bash
# cncf-eval.sh — quick due-diligence script for a CNCF candidate project
# Usage: ./cncf-eval.sh <github-org> <repo>
# Requires: gh CLI, jq, curl

ORG=$1
REPO=$2
BASE="https://api.github.com/repos/${ORG}/${REPO}"

echo "=== CNCF Project Due Diligence: ${ORG}/${REPO} ==="

# 1. Maturity indicators
STARS=$(curl -sf "${BASE}" | jq '.stargazers_count')
FORKS=$(curl -sf "${BASE}" | jq '.forks_count')
OPEN_ISSUES=$(curl -sf "${BASE}" | jq '.open_issues_count')
echo "Stars: ${STARS} | Forks: ${FORKS} | Open Issues: ${OPEN_ISSUES}"

# 2. Release cadence (last 5 releases)
echo "--- Recent releases ---"
gh release list --repo "${ORG}/${REPO}" --limit 5

# 3. Contributor diversity (unique orgs in last 90 days)
echo "--- Top contributor orgs (proxy for vendor diversity) ---"
gh api "repos/${ORG}/${REPO}/contributors?per_page=30" \
  | jq -r '.[].login' \
  | xargs -I{} gh api users/{} \
  | jq -r '.company // "unknown"' \
  | sort | uniq -c | sort -rn | head -10

# 4. Security posture
echo "--- Open CVEs / security advisories ---"
gh api "repos/${ORG}/${REPO}/security-advisories" \
  | jq '[.[] | select(.state=="published")] | length'

# 5. Helm chart availability (prefer official)
helm search hub "${REPO}" --output json | jq '.[0:3] | .[] | {name,version,description}'

echo "=== Evaluation complete. Score manually on: maturity, ops burden, lock-in, integration cost ==="

After the script, do a 2-hour spike: stand it up locally against a representative workload, inject one realistic failure (kill the leader pod, exhaust a resource limit), and observe recovery behavior. Skip this and you will discover the failure mode in production at 2 AM.

Total Cost of Ownership: The Honest Calculation

The biggest mistake in build-vs-buy is counting only licensing cost against the build estimate. A realistic TCO for a self-operated CNCF tool includes:

Initial integration: operator installation, RBAC, golden-path wiring, observability, documentation. Typical range: 1–4 engineer-weeks.
Ongoing operations: upgrades (Kubernetes operators lag by 1–3 major versions regularly), incident response, capacity planning. Budget 0.1–0.3 FTE per non-trivial component.
Upgrade tax: a Kubernetes cluster upgrade often requires coordinated upgrades of 10–15 platform components. At 3 cluster upgrades per year and 2 engineer-days per component, this compounds fast.
Opportunity cost: every engineer-week spent operating platform plumbing is one not spent building product-facing capabilities.

A common anti-pattern at growing startups is operating a full self-hosted observability stack (Prometheus, Loki, Tempo, Grafana) before the engineering org has the SRE maturity to maintain it. When the Loki ingester OOM-kills at 3 AM and nobody knows how to recover compactor state, the cost is not the AWS bill — it is developer trust. If your observability platform goes down during an incident, you are blind exactly when sight matters most. Until you have a dedicated platform SRE, buy your observability backend.

Component-by-Component Guidance (2025 Defaults)

For common platform capability areas, here is current big-tech consensus on the build-vs-buy axis, informed by CNCF adoption surveys and large-scale production experience:

GitOps / CD: Adopt CNCF. Argo CD (Graduated) or Flux (Graduated). No commercial tool competes meaningfully at this layer. Self-operate with confidence.
Secrets management: Buy or adopt CNCF Graduated. HashiCorp Vault (open core, now BSL licensed) is widely deployed; HCP Vault reduces ops burden by ~70%. At under 50 engineers, HCP Vault is almost always the right call.
Service catalog / developer portal: Start with Backstage (CNCF Incubating), consider commercial if ops capacity is limited. Port, Cortex, and Roadie (hosted Backstage) reduce the Backstage plugin-maintenance burden significantly. Backstage requires ongoing TypeScript engineering to stay current.
Progressive delivery: Adopt CNCF. Argo Rollouts or Flagger (Graduated/Incubating) are production-proven at hyperscale. Flagger + Linkerd is a particularly lean combination.
Policy enforcement: Adopt CNCF. OPA/Gatekeeper or Kyverno (both Graduated). Kyverno has lower authoring friction for platform teams new to Rego.
Observability backend: Buy below 500 engineer scale; self-operate above it. Honeycomb, Grafana Cloud, and Datadog are worth the cost at startup/scale-up size. Above 500 engineers, TCO favors self-operating Grafana stack + VictoriaMetrics or Thanos.
Platform orchestration (IDP backbone): Avoid full custom build before 200 engineers. The engineering cost of a production-grade IDP from scratch is 3–5 senior engineers for 12+ months. Start with Backstage + Crossplane and only invest in bespoke tooling where those genuinely fail your requirements.

# Crossplane — install and configure an AWS provider composition (self-service DB pattern)
# Shows the "adopt CNCF" path for infrastructure provisioning

# Install Crossplane into the platform cluster
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace \
  --set args='{"--enable-composition-revisions"}' \
  --version 1.16.0

# Install the AWS provider
kubectl apply -f - <<EOF
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws-rds
spec:
  package: xpkg.upbound.io/upbound/provider-aws-rds:v1.14.0
EOF

# Wait for provider to become healthy
kubectl wait provider/provider-aws-rds --for=condition=Healthy --timeout=120s

# Verify composition readiness
kubectl get crossplane
# NAME                              INSTALLED   HEALTHY   PACKAGE
# provider-aws-rds                  True        True      ...

# Now apply a PostgresDatabase custom resource (as seen in Lesson 2)
# kubectl apply -f orders-db.yaml
# kubectl get postgresdb orders-db -w

The Governance Trap: When Open Source Licenses Change

The HashiCorp BSL relicensing in 2023 (Terraform, Vault, Consul moving from MPL 2.0 to Business Source License) was a watershed event for platform engineering. It proved that even mature, widely-adopted open-source projects can change their licensing terms in ways that affect commercial use. OpenTofu (the CNCF Sandbox Terraform fork) emerged as the community response. Lesson for platform teams: audit the license of every non-CNCF component in your stack annually. Prefer Apache 2.0 or MIT for foundational components. When a BSL or SSPL license appears, evaluate the fork ecosystem immediately — do not wait until renewal.

Maintain a Platform Bill of Materials (PBOM): a versioned list of every component in your IDP, its license, its CNCF maturity level, the engineer responsible for upgrades, and the last security audit date. Store it in your service catalog (a Backstage TechDocs page or a platform-bom.yaml in your GitOps repo). Review it in your quarterly platform retrospective. This is the single artifact that separates mature platform teams from teams that discover a critical dependency is unmaintained during an incident.