The Kubernetes Platform
Once accounts and network are correct, the Kubernetes layer is where the platform team spends the most engineering capital. At Google, Netflix, Airbnb, and Shopify, the cluster architecture decisions made in the first month are still running, and still being worked around, three years later. This lesson covers the three decisions that matter most at big-tech scale: how to architect the control plane, how to design the node fleet, and how to map multiple environments onto clusters.
Control Plane Architecture
In a managed offering—EKS, GKE, AKS—the control plane is cloud-managed but not zero-configuration. The decisions that remain yours are consequential:
- Private endpoint only. Production clusters at big-tech use private API endpoints exclusively. The API server is reachable only from inside the VPC. CI/CD runners and engineers tunnel via a bastion or AWS Systems Manager Session Manager. A public endpoint is an attack surface with no upside once you have a functioning VPN or SSM policy in place.
- Cluster version cadence. EKS supports n-2 minor versions; you need a tested upgrade path every 14 weeks to stay current. Use Blue/Green cluster upgrades—provision a new cluster version, migrate traffic via weighted DNS, drain the old—rather than in-place upgrades. In-place upgrades on large clusters surface incompatible admission webhooks and deprecated APIs; the failure mode is a silent partial-upgrade that corrupts workload scheduling.
- Add-on management via IaC. Every critical add-on (CoreDNS, kube-proxy, VPC CNI, cluster autoscaler, cert-manager, external-dns) must be managed through your IaC layer (Terraform EKS blueprints or Crossplane), not through kubectl apply one-shots. Unmanaged add-ons drift within weeks in a multi-engineer environment.
Run kubectl delete events --all -A periodically and audit CRD storage bloat. Uncontrolled CRD proliferation from third-party operators has caused etcd compaction stalls that delayed pod scheduling by 45+ seconds in production.
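The weighted-DNS step of a Blue/Green cluster upgrade can be planned as a sequence of weight pairs before touching any DNS record. The following is a minimal sketch, assuming a linear ramp; the function name and step count are illustrative and not tied to any DNS provider's API.

```python
def weight_schedule(steps: int, total: int = 100) -> list[tuple[int, int]]:
    """Return (old_cluster_weight, new_cluster_weight) pairs for a linear ramp.

    Hypothetical helper: `steps` is the number of shifts after the initial
    state, and `total` is the sum of the two weights at every stage.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    schedule = []
    for i in range(steps + 1):
        new = round(total * i / steps)
        schedule.append((total - new, new))
    return schedule


if __name__ == "__main__":
    # Each pair would be applied to the weighted DNS records in turn,
    # with a soak period and health checks between shifts.
    for old, new in weight_schedule(4):
        print(f"old cluster weight={old:3d}  new cluster weight={new:3d}")
```

In practice each shift would be gated on error-rate and latency checks, and the old cluster drained only after the final pair has soaked.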
Node Strategy: Fleet Composition
Treat node selection as a cost-performance optimisation problem, not a capacity provisioning one. The goal is to minimise total vCPU-hours consumed for a given workload throughput. Production fleet design layers multiple node tiers, each with a distinct purpose:
- General purpose (m6i, n2-standard): default landing zone for mixed microservices. The 4:1 memory-to-CPU ratio fits most application pods. Use these as the on-demand base that Karpenter provisions first.
- Compute-optimised (c6i, c2-standard): CPU-bound API gateways, TLS termination, compression-heavy batch jobs. Avoids paying for memory that will never be used.
- Memory-optimised (r6i, m1-ultramem): JVM-heavy services, on-cluster Kafka brokers, in-memory analytics caches.
- Spot / Preemptible: stateless workloads and CI runner pools, with a 60-80% cost reduction. Pair spot nodes with PodDisruptionBudget and topologySpreadConstraints so a spot reclamation wave cannot simultaneously evict all replicas of a service.
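The pairing of PodDisruptionBudget and topologySpreadConstraints for spot-hosted services might look like the sketch below, which builds both objects as plain dictionaries and prints them as JSON (which kubectl also accepts). The app name "checkout", the namespace "payments", and the 25% budget are hypothetical values chosen for illustration.

```python
import json


def pdb_manifest(app: str, namespace: str, max_unavailable: str = "25%") -> dict:
    """PodDisruptionBudget capping how many replicas a drain may evict at once."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb", "namespace": namespace},
        "spec": {
            "maxUnavailable": max_unavailable,  # assumed budget, tune per service
            "selector": {"matchLabels": {"app": app}},
        },
    }


def zone_spread_constraint(app: str, max_skew: int = 1) -> dict:
    """One entry for the topologySpreadConstraints list in a pod spec."""
    return {
        "maxSkew": max_skew,
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": app}},
    }


if __name__ == "__main__":
    print(json.dumps(pdb_manifest("checkout", "payments"), indent=2))
    print(json.dumps(zone_spread_constraint("checkout"), indent=2))
```

The budget limits evictions during a drain, while the spread constraint keeps replicas from concentrating in the zone that a reclamation wave hits.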
Managed node groups vs. Karpenter. Managed node groups are predictable but slow: 2-3 minutes to provision a new node. Karpenter provisions nodes in roughly 30 seconds by calling the EC2 API directly. Its consolidation pass also right-sizes underutilised nodes, saving 15-25% on compute cost without manual tuning. Migration is a two-sprint project: deploy Karpenter, create NodePools mirroring your existing node group labels, then cordon-and-drain the old groups.
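The "create NodePools mirroring your existing node group labels" step can be sketched as a small generator. This is an illustration under assumptions: the field names follow the karpenter.sh/v1 NodePool schema and should be checked against the Karpenter version in use, and the node class name "default", the pool name, and the labels are hypothetical.

```python
import json


def nodepool_from_node_group(name, labels, instance_families, node_class="default"):
    """Build a Karpenter NodePool dict that carries over a node group's labels.

    Assumes the karpenter.sh/v1 API and an existing EC2NodeClass called
    `node_class`; both are assumptions, not taken from the lesson.
    """
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                # Same labels as the old node group, so existing
                # nodeSelectors and affinities keep matching.
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        {
                            "key": "karpenter.k8s.aws/instance-family",
                            "operator": "In",
                            "values": sorted(instance_families),
                        },
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["on-demand"],
                        },
                    ],
                },
            },
            # Lets the consolidation pass replace underutilised nodes.
            "disruption": {"consolidationPolicy": "WhenEmptyOrUnderutilized"},
        },
    }


if __name__ == "__main__":
    pool = nodepool_from_node_group("general", {"pool": "general"}, ["m6i"])
    print(json.dumps(pool, indent=2))
```

Once pods schedule onto the new pool, the old managed node groups can be cordoned and drained one at a time.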
Multi-Environment Topology
Big-tech platform teams converge on one of two multi-environment patterns. Choosing between them is a senior-level judgement call that depends on understanding the trade-offs:
Pattern A — Cluster per environment. Separate clusters for dev, staging, and prod. This is the dominant pattern at companies with strong compliance requirements (SOC2, PCI-DSS, HIPAA). Blast radius is maximally isolated: a misconfigured RBAC policy or a rogue operator in dev cannot touch prod. The cost is operational overhead—three cluster upgrade cycles, three sets of add-ons to maintain, three kubeconfig contexts to manage.
Pattern B — Namespace isolation within a cluster. Multiple environments share a cluster, separated by namespaces and NetworkPolicy. This is viable for development and staging, but production workloads handling customer PII or payment data must live in a dedicated cluster due to regulatory requirements and the blast radius of a compromised service account.
The production-standard topology for a platform team of 10-50 engineers is a hybrid: a shared services cluster (internal tooling, CI runners, observability stack, ArgoCD), a non-prod cluster (dev + staging namespaces with namespace-per-team, tight NetworkPolicy), and one or more production clusters (one per region, or one per product domain at large scale). This keeps upgrade burden manageable while enforcing hard isolation where it matters.
Name clusters by purpose, environment, and region: platform-shared-use1, workloads-nonprod-use1, workloads-prod-use1, workloads-prod-euw1. Version-named clusters (eks-v127-prod) create confusion during Blue/Green upgrades when both versions exist simultaneously in your kubeconfig.
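A naming scheme like this is easy to enforce in CI. The sketch below assumes the names decompose as purpose, environment, and a short region code; the regular expression and the allowed environment values are inferred from the four example names and are an assumption, not a stated rule.

```python
import re
from typing import Optional

# Assumed pattern: <purpose>-<env>-<region code>, e.g. workloads-prod-euw1.
CLUSTER_NAME = re.compile(
    r"^(?P<purpose>[a-z]+)-(?P<env>shared|nonprod|prod)-(?P<region>[a-z]{2,4}\d)$"
)


def parse_cluster_name(name: str) -> Optional[dict]:
    """Split a cluster name into its parts, or return None if it does not conform."""
    match = CLUSTER_NAME.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for candidate in ("platform-shared-use1", "workloads-prod-euw1", "eks-v127-prod"):
        print(candidate, "->", parse_cluster_name(candidate))
```

A version-named cluster such as eks-v127-prod fails the check because its second segment is not an environment.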
Workload Isolation: Taints, Tolerations, and Topology Spread
Node pools alone are not enough. Within a cluster, you must control which pods land on which nodes and how they spread across failure domains. Three primitives work together:
- Taints + Tolerations: keep workloads off nodes that are not sized for them (the GPU pool example above). Always taint specialty nodes and require an explicit toleration in the workload spec.
- Node Affinity: soft (preferredDuringSchedulingIgnoredDuringExecution) or hard (requiredDuring...) rules to steer pods toward node classes. Use soft affinity for workloads that can tolerate the general pool if the preferred pool is full.
- TopologySpreadConstraints: the single most underused primitive. A maxSkew: 1 constraint across topology.kubernetes.io/zone ensures replicas spread across AZs. Without this, the default scheduler packs pods onto available nodes and you can end up with all replicas in one AZ, making the service unavailable on a single AZ failure.
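To see what maxSkew: 1 permits, the skew check can be reproduced in a few lines. This is a simplified model of the rule as documented for hard (DoNotSchedule) constraints: a zone is eligible if its pod count after placement, minus the lowest count in any zone, does not exceed maxSkew. The zone names and counts are made up for illustration, and node capacity, taints, and affinity are ignored.

```python
def allowed_zones(pods_per_zone: dict[str, int], max_skew: int = 1) -> list[str]:
    """Zones where one more matching pod would keep skew within max_skew.

    `pods_per_zone` must list every eligible zone, including zones that
    currently hold zero matching pods.
    """
    global_min = min(pods_per_zone.values())
    return sorted(
        zone
        for zone, count in pods_per_zone.items()
        if count + 1 - global_min <= max_skew
    )


if __name__ == "__main__":
    current = {"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}
    print(allowed_zones(current))
```

With two replicas already in one zone and one in each of the others, the next replica can only land in one of the two less-loaded zones.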
A pod with limits.cpu: 500m will be throttled when it spikes to 600m, regardless of node load. This manifests as latency spikes that are very hard to diagnose. Set CPU requests for scheduling purposes and let the pod burst freely. Set only memory limits, as memory is not compressible: a pod that exceeds its memory limit is OOMKilled, which is the correct outcome.
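The throttling follows from how a CPU limit becomes a CFS quota: the limit is converted into a slice of CPU time per scheduling period, and work beyond that slice waits for the next period. The arithmetic below is a simplified sketch, assuming the default 100 ms CFS period and a steady demand; the function names are illustrative.

```python
CFS_PERIOD_US = 100_000  # default CFS bandwidth period: 100 ms


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (microseconds) a container may consume per period."""
    return limit_millicores * period_us // 1000


def deferred_cpu_us(demand_millicores: int, limit_millicores: int,
                    period_us: int = CFS_PERIOD_US) -> int:
    """CPU time per period that has to wait because the quota ran out."""
    wanted = demand_millicores * period_us // 1000
    return max(0, wanted - cfs_quota_us(limit_millicores, period_us))


if __name__ == "__main__":
    print("quota per period (us):", cfs_quota_us(500))
    print("deferred per period (us):", deferred_cpu_us(600, 500))
```

A 500m limit yields 50 ms of CPU time per 100 ms period, so a spike to 600m leaves 10 ms of work per period waiting, even on an otherwise idle node.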
Autoscaling: VPA, HPA, and KEDA
Big-tech clusters use all three autoscalers for different purposes. They are not alternatives—they complement each other:
- HPA (Horizontal Pod Autoscaler): scale replicas based on CPU, memory, or custom metrics (via metrics-server or prometheus-adapter). The standard for stateless services. Target ~65-70% CPU utilisation; too-high targets cause oscillation under bursty traffic.
- VPA (Vertical Pod Autoscaler): run in recommendation mode only in production. Do not enable auto-apply mode: it evicts pods to resize them, causing unnecessary disruption. Use VPA recommendations to inform your static resource requests during the next deployment cycle.
- KEDA (Kubernetes Event-driven Autoscaling): scale on external event sources such as SQS queue depth, Kafka consumer lag, Prometheus query results, or Redis list length. Indispensable for async workloads where CPU and memory tell you nothing about actual load.
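The replica counts these autoscalers arrive at come from simple ratios. The sketch below reproduces the documented HPA formula, desired = ceil(current replicas × current metric / target metric) with the default 10% tolerance, plus a simplified queue-depth calculation of the kind an event-driven scaler feeds into it. The second function, its parameter names, and its clamping are an illustrative assumption rather than KEDA's exact behaviour.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, tolerance: float = 0.1) -> int:
    """HPA core formula; no change when the ratio is within the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def queue_desired_replicas(queue_depth: int, target_per_replica: int,
                           min_replicas: int, max_replicas: int) -> int:
    """Simplified event-driven sizing: one replica per `target_per_replica` messages."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 10 replicas at 90% CPU against a 65% target.
    print(hpa_desired_replicas(10, 90, 65))
    # 250 queued messages, 100 per replica, bounded to 1..10 replicas.
    print(queue_desired_replicas(250, 100, 1, 10))
```

The tolerance band is why small deviations from the target do not trigger scaling, and the ceiling is why a lower CPU target leaves more headroom per replica.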
Set --horizontal-pod-autoscaler-initial-readiness-delay=30s and --horizontal-pod-autoscaler-cpu-initialization-period=5m in the kube-controller-manager flags (or the equivalent setting, where a managed control plane such as EKS exposes one). Without these, HPA counts newly started pods, which have not yet warmed their JVM or connection pools, as underloaded and scales down prematurely, then immediately scales up again. The result is a scale-oscillation loop visible as periodic 503 spikes.