Capstone: A Big-Tech Production Platform

The Kubernetes Platform

22 min Lesson 4 of 30

Once accounts and networking are in place, the Kubernetes layer is where the platform team spends the most engineering capital. At companies such as Google, Netflix, Airbnb, and Shopify, cluster architecture decisions made in the first month are often still running, and still being worked around, three years later. This lesson covers the decisions that matter most at big-tech scale: how to architect the control plane, how to design the node fleet, how to lay out multiple environments across clusters, how to isolate workloads within a cluster, and how to autoscale them.

Control Plane Architecture

In a managed offering—EKS, GKE, AKS—the control plane is cloud-managed but not zero-configuration. The decisions that remain yours are consequential:

Private endpoint only. Production clusters at big-tech use private API endpoints exclusively. The API server is reachable only from inside the VPC. CI/CD runners and engineers tunnel via a bastion or AWS Systems Manager Session Manager. A public endpoint is an attack surface with no upside once you have a functioning VPN or SSM policy in place.
Cluster version cadence. Upstream Kubernetes ships a new minor version roughly every four months, and EKS gives each version 14 months of standard support, so you need a tested upgrade path about three times a year to stay current. Use Blue/Green cluster upgrades (provision a cluster on the new version, migrate traffic via weighted DNS, drain the old) rather than in-place upgrades. In-place upgrades on large clusters surface incompatible admission webhooks and removed APIs mid-upgrade, and a managed control plane cannot be rolled back to the previous version, so the failure mode is a partially upgraded cluster with no quick way back.
Add-on management via IaC. Every critical add-on—CoreDNS, kube-proxy, VPC CNI, cluster autoscaler, cert-manager, external-dns—must be managed through your IaC layer (Terraform EKS blueprints or Crossplane), not kubectl apply one-shots. Unmanaged add-ons drift within weeks in a multi-engineer environment.
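
The private-endpoint and managed add-on decisions above can be written down declaratively. The sketch below uses an eksctl ClusterConfig only because it is compact YAML; it is a swapped-in tool, not the Terraform or Crossplane layer this lesson recommends, where equivalent settings exist. The cluster name and region are illustrative assumptions.

```yaml
# Minimal sketch, assuming eksctl as the declarative tool (not the lesson's IaC choice).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: workloads-prod-use1   # hypothetical name following the lesson's naming scheme
  region: us-east-1           # assumption
vpc:
  clusterEndpoints:
    publicAccess: false       # API server not reachable from the internet
    privateAccess: true       # reachable only from inside the VPC
# Core add-ons declared in the same file so they are versioned with the cluster.
addons:
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy
```

With the public endpoint disabled, whatever applies this file has to run from a host that can reach the VPC, which is the same bastion or SSM path engineers use.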

etcd is the cluster. On self-managed control planes, etcd must run on dedicated nodes with SSD-backed storage, an odd number of members (three is the minimum that survives one member failure; five survives two), and automated snapshots to durable object storage every 30 minutes. On EKS/GKE this is abstracted, but you still need to know that large clusters (>500 nodes, >10k pods) can hit etcd database size limits. Events expire on their own (the API server's --event-ttl defaults to one hour), so treat kubectl delete events --all -A as an emergency measure rather than routine hygiene, and audit CRD storage bloat regularly. Uncontrolled CRD proliferation from third-party operators has caused etcd compaction stalls that delayed pod scheduling by 45+ seconds in production.

Node Strategy: Fleet Composition

Treat node selection as a cost-performance optimisation problem, not a capacity provisioning one. The goal is to minimise total vCPU-hours consumed for a given workload throughput. Production fleet design layers multiple node tiers, each with a distinct purpose:

General purpose (m6i, n2-standard): default landing zone for mixed microservices. The 4 GiB of memory per vCPU fits most application pods. Use these as the on-demand base that Karpenter provisions first.
Compute-optimised (c6i, c2-standard): CPU-bound API gateways, TLS termination, compression-heavy batch jobs. Avoids paying for memory that will never be used.
Memory-optimised (r6i, m1-ultramem): JVM-heavy services, on-cluster Kafka brokers, in-memory analytics caches.
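Spot / Preemptible: stateless workloads and CI runner pools, typically at a 60-80% discount to on-demand pricing. Pair spot nodes with PodDisruptionBudget and topologySpreadConstraints so that a spot reclamation wave cannot simultaneously evict all replicas of a service. A PodDisruptionBudget only governs evictions made through the eviction API, which is how a node is drained after an interruption notice; spreading replicas across zones and nodes is what limits the loss if nodes disappear before the drain completes.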
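
As a minimal sketch of the PodDisruptionBudget half of that pairing, the manifest below caps voluntary evictions for a service at one pod at a time. The name, namespace, and app label are assumptions chosen to match the payment-api Deployment used later in this lesson.

```yaml
# Sketch: limit eviction-API disruptions (node drains, consolidation) to one pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api        # assumed name
  namespace: prod          # assumed namespace
spec:
  maxUnavailable: 1        # a drain may take down at most one replica at once
  selector:
    matchLabels:
      app: payment-api     # must match the pod template labels of the workload
```

maxUnavailable is used rather than minAvailable so the budget keeps working unchanged as the replica count is scaled up or down.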

Managed node groups vs. Karpenter. Managed node groups are predictable but slow: 2-3 minutes to provision a new node. Karpenter provisions nodes in roughly 30 seconds by calling the EC2 API directly. Its consolidation pass also right-sizes underutilised nodes, saving 15-25% on compute cost without manual tuning. Migration is a two-sprint project: deploy Karpenter, create NodePools mirroring your existing node group labels, then cordon-and-drain the old groups.

# Karpenter NodePool — production general-purpose tier
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        node-class: general
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws   # karpenter.sh/v1 uses group, not apiVersion
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.xlarge","m6i.2xlarge","m6i.4xlarge",
                   "m6a.xlarge","m6a.2xlarge","c6i.2xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a","us-east-1b","us-east-1c"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # v1 name; WhenUnderutilized was the v1beta1 value
    consolidateAfter: 30s
  limits:
    cpu: "800"
    memory: 3200Gi
---
# GPU pool — tainted; only tolerating pods land here
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-batch
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws   # karpenter.sh/v1 uses group, not apiVersion
        kind: EC2NodeClass
        name: default
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.2xlarge","g5.4xlarge","g5.8xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    cpu: "192"
    memory: 768Gi
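
Both NodePools reference an EC2NodeClass named default, which is not shown above. A minimal sketch of what it might look like follows; the IAM role name and the discovery tag value are hypothetical, and the AMI alias is an assumption that should be pinned to a specific version in production.

```yaml
# Sketch of the EC2NodeClass the NodePools above point at (all values illustrative).
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-workloads-prod-use1   # hypothetical IAM role for nodes
  amiSelectorTerms:
    - alias: al2023@latest                      # assumption; pin a version for production
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: workloads-prod-use1   # hypothetical tag on private subnets
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: workloads-prod-use1   # hypothetical tag on node security groups
```

Selecting subnets and security groups by tag keeps the node class free of hard-coded IDs, so the same manifest can be reused when a Blue/Green upgrade creates a replacement cluster.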

Multi-Environment Topology

Big-tech platform teams converge on one of two multi-environment patterns. Choosing between them is a judgement call that weighs isolation against operational overhead:

Pattern A: Cluster per environment. Separate clusters for dev, staging, and prod. This is the dominant pattern at companies with strong compliance requirements (SOC 2, PCI-DSS, HIPAA). Blast radius is maximally isolated: a misconfigured RBAC policy or a rogue operator in dev cannot touch prod. The cost is operational overhead: three cluster upgrade cycles, three sets of add-ons to maintain, three kubeconfig contexts to manage.

Pattern B: Namespace isolation within a cluster. Multiple environments share a cluster, separated by namespaces and NetworkPolicy. This is viable for development and staging, but production workloads handling customer PII or payment data should live in a dedicated cluster. Namespaces are a soft boundary (the API server, the nodes, and all cluster-scoped resources are shared), so a compromised service account has a much larger blast radius, and a shared cluster pulls every tenant into compliance audit scope.
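
To make the NetworkPolicy side of Pattern B concrete, here is a minimal sketch of a default-deny baseline for one team namespace, with same-namespace traffic and DNS lookups allowed back in. The namespace name is hypothetical, and the policies only take effect if the cluster's CNI enforces NetworkPolicy.

```yaml
# Sketch: deny everything in the namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a-staging        # hypothetical namespace
spec:
  podSelector: {}                  # selects every pod in the namespace
  policyTypes: [Ingress, Egress]
---
# Sketch: re-allow traffic between pods in the same namespace, plus DNS.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-and-dns
  namespace: team-a-staging
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: {}          # any pod in this namespace
  egress:
    - to:
        - podSelector: {}          # any pod in this namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:                       # DNS only
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

NetworkPolicies are additive, so the second policy widens what the first one closed; anything not listed, including traffic to another team's namespace, stays blocked.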

The production-standard topology for a platform team of 10-50 engineers is a hybrid: a shared services cluster (internal tooling, CI runners, observability stack, ArgoCD), a non-prod cluster (dev + staging namespaces with namespace-per-team, tight NetworkPolicy), and one or more production clusters (one per region, or one per product domain at large scale). This keeps upgrade burden manageable while enforcing hard isolation where it matters.

Name clusters after their purpose, not a version number. Use platform-shared-use1, workloads-nonprod-use1, workloads-prod-use1, workloads-prod-euw1. Version-named clusters (eks-v127-prod) create confusion during Blue/Green upgrades when both versions exist simultaneously in your kubeconfig.

Three-cluster topology: shared services hub deploys via GitOps to non-prod and production clusters.

Workload Isolation: Taints, Tolerations, and Topology Spread

Node pools alone are not enough. Within a cluster, you must control which pods land on which nodes and how they spread across failure domains. Three primitives work together:

Taints + Tolerations: keep workloads off nodes that are not sized for them (the GPU pool example above). Always taint specialty nodes and require an explicit toleration in the workload spec.
Node Affinity: soft (preferredDuringSchedulingIgnoredDuringExecution) or hard (requiredDuring...) rules to steer pods toward node classes. Use soft affinity for workloads that can tolerate the general pool if the preferred pool is full.
TopologySpreadConstraints: the single most underused primitive. A maxSkew: 1 constraint across topology.kubernetes.io/zone ensures replicas spread evenly across AZs. Without an explicit constraint, the scheduler's built-in spreading is only a soft preference, so you can still end up with all replicas in one AZ, making the service unavailable on a single AZ failure.
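
The taint on the gpu-batch pool only helps if GPU workloads carry the matching toleration. A minimal sketch of such a workload follows; the Job name and image are hypothetical, and the nvidia.com/gpu resource assumes the NVIDIA device plugin is installed on those nodes.

```yaml
# Sketch: a batch Job that opts in to the tainted GPU pool.
apiVersion: batch/v1
kind: Job
metadata:
  name: embedding-batch                  # hypothetical workload
spec:
  template:
    spec:
      restartPolicy: Never
      tolerations:
        - key: nvidia.com/gpu            # matches the taint on the gpu-batch NodePool
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        karpenter.sh/nodepool: gpu-batch # label Karpenter puts on nodes it launches for that pool
      containers:
        - name: worker
          image: registry.example.com/embedding-batch:1.0.0   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1          # assumes the NVIDIA device plugin exposes this resource
```

The toleration alone only permits scheduling onto GPU nodes; the nodeSelector is what actually steers the pod there instead of onto the general pool.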

# Production Deployment — spread constraints + node affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: prod
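spec:
  replicas: 6
  selector:                 # required by apps/v1; must match the template labels
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api    # also what the spread constraints below select on
    spec: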
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-api
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-api
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: node-class
                    operator: In
                    values: [general]
      containers:
        - name: payment-api
          image: registry.example.com/payment-api:1.0.0   # placeholder image (assumption)
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              memory: "1Gi"   # no CPU limit — avoids CPU throttling

Never set CPU limits in production. A CPU limit is enforced as a CFS quota at the cgroup level, so the container is throttled even when the node has spare capacity. A container with limits.cpu: 500m that tries to use 600m is throttled for part of every scheduling period, regardless of node load, and this manifests as latency spikes that are very hard to diagnose. Set CPU requests for scheduling purposes (they also set the container's share of CPU under contention) and let the pod burst freely. Set only memory limits, because memory is not compressible: a container that exceeds its memory limit is OOMKilled, which is the correct outcome.
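
One way to make this the namespace default, sketched here under assumed values, is a LimitRange that fills in a CPU request, a memory request, and a memory limit for containers that omit them, while deliberately leaving the CPU limit unset. The numbers are illustrative, not recommendations.

```yaml
# Sketch: namespace defaults that follow the "requests for CPU, limits for memory" rule.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults   # hypothetical name
  namespace: prod
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:               # applied when a container sets no limits
        memory: 512Mi        # no cpu key here, so no default CPU limit is injected
```

Containers that declare their own requests and limits are unaffected; the LimitRange only fills gaps at admission time.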

Autoscaling: VPA, HPA, and KEDA

Big-tech clusters use all three autoscalers for different purposes. They are not alternatives—they complement each other:

HPA (Horizontal Pod Autoscaler): scales replicas based on CPU, memory, or custom metrics (via metrics-server or prometheus-adapter). The standard for stateless services. Target ~65-70% CPU utilisation; higher targets leave too little headroom to absorb bursty traffic while new replicas start.
VPA (Vertical Pod Autoscaler): run it in recommendation mode only (updateMode: "Off") in production. Do not enable automatic updates: VPA evicts pods to resize them, causing unnecessary disruption. Use VPA recommendations to inform your static resource requests during the next deployment cycle.
KEDA (Kubernetes Event-Driven Autoscaler): scales on external event sources such as SQS queue depth, Kafka consumer lag, Prometheus query results, or Redis list length. Indispensable for async workloads where CPU and memory tell you nothing about actual load.
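
A minimal sketch of the three autoscalers side by side follows. The HPA and VPA target the payment-api Deployment from earlier; the order-worker Deployment, the SQS queue URL, and the keda-aws-auth TriggerAuthentication are hypothetical, and all replica counts and thresholds are illustrative.

```yaml
# HPA: scale replicas to hold average CPU utilisation near 65% of requests.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 6
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
---
# VPA: recommendation mode only; it computes requests but never evicts pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api
  namespace: prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  updatePolicy:
    updateMode: "Off"
---
# KEDA: scale a queue consumer on SQS backlog instead of CPU.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-worker               # hypothetical async worker
  namespace: prod
spec:
  scaleTargetRef:
    name: order-worker             # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-aws-auth        # hypothetical TriggerAuthentication, not shown
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/orders   # placeholder
        queueLength: "20"          # target messages per replica
        awsRegion: us-east-1
```

The VPA's recommendations can be read with kubectl describe on the VerticalPodAutoscaler object and folded into the Deployment's requests at the next release.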

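Bootstrap your HPA correctly. Two kube-controller-manager flags govern how the HPA treats newly started pods: --horizontal-pod-autoscaler-initial-readiness-delay (default 30s) and --horizontal-pod-autoscaler-cpu-initialization-period (default 5m). On self-managed control planes, confirm they have not been lowered and raise them if your services take longer to warm up. On managed control planes such as EKS these flags are not configurable, so rely on accurate readiness and startup probes and on per-HPA scaling behaviour instead. If the HPA folds pods that have not yet warmed their JVM or connection pools into its utilisation average, the skewed metrics can drive a scale-down followed immediately by a scale-up, an oscillation loop visible as periodic 503 spikes.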
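
Where the controller-manager flags are out of reach, the behavior field of an autoscaling/v2 HPA offers a per-workload way to damp oscillation. The sketch below is for a hypothetical slow-warming service called catalog-api; the windows and step sizes are assumptions to tune against real traffic.

```yaml
# Sketch: scale up quickly, scale down slowly, to avoid flapping during warm-up.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: catalog-api                      # hypothetical JVM service
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: catalog-api
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # react to load immediately
      policies:
        - type: Percent
          value: 100                     # at most double the replica count per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300    # act on the highest recommendation of the last 5 minutes
      policies:
        - type: Pods
          value: 2                       # remove at most 2 pods per minute
          periodSeconds: 60
```

The asymmetric windows mean a brief dip in utilisation while new pods warm up does not trigger a scale-down, which is the oscillation described above.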