Capstone: A Big-Tech Production Platform

The Capstone Brief

You have spent the previous 49 tutorials learning every layer of a modern production platform — Linux internals, network fundamentals, containers, Kubernetes, Terraform, GitOps, full-stack observability, SRE practice, security, compliance, and cost engineering. Each tutorial was purposefully scoped to a single domain. This final tutorial has a different structure: it is an integration project, and its scope is deliberately wide.

Over the nine lessons of this capstone (this brief plus eight build lessons) you will design and build a complete big-tech-grade production platform for a realistic scaling company. Every decision will involve trade-offs across cost, reliability, security, and engineering velocity, the same trade-offs a Staff or Principal engineer navigates daily. There are no toy examples. The commands are real, the YAML is production-ready, and the failure modes are the ones that have caused real incidents at real companies.

The Scenario: Arctiq Commerce

Arctiq Commerce is a B2C e-commerce company currently running a monolithic Laravel application on two bare-metal servers behind a hardware load balancer. They have grown to 2 million registered users and are projecting 10x growth over the next 18 months driven by a new mobile app launch. Their current architecture cannot handle the load, their release cycle is two weeks long, a single deploy takes the site down for 4 minutes, and they have no observability beyond server-level CPU graphs in a legacy monitoring tool.

The board has approved a platform re-architecture. You are the platform engineering lead. Your mandate: build a platform that is cloud-native, secure, observable, and self-service so that product engineering teams can ship features daily without needing to know how the infrastructure works.

Requirements and Constraints

Good platform engineering starts with explicit requirements. Vague mandates like "make it scalable" lead to over-engineered platforms that nobody uses. Before writing a single line of Terraform, a senior engineer writes down the requirements with numbers attached.

Functional Requirements

Multi-region active-active: Two AWS regions, both serving live traffic (us-east-1 holds the primary database writer, eu-west-1 the secondary), with automatic failover. Regional latency target: < 80 ms p99 for EU users, < 60 ms p99 for US users.
Kubernetes workloads: 12 product engineering teams, each owning 2–8 microservices. Peak pod count: ~4,000. Cluster must scale to 8,000 pods without architectural changes.
Zero-downtime deploys: Every service deployment must use a rolling or blue-green strategy. Maximum acceptable deployment time: 10 minutes from merge to production.
Data tier: PostgreSQL (RDS Aurora) for transactional data, Redis Cluster for session and caching, Kafka for async event streaming between services, S3 for object storage with lifecycle rules.
Developer self-service: A team should be able to provision a new service namespace, CI pipeline, and observability dashboards without opening a ticket to the platform team.
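To make the self-service requirement concrete, here is a minimal Python sketch of what a golden-path generator might emit when a team requests a new service namespace. The function name, the label keys, and the default pod quota are hypothetical choices for illustration, not part of the Arctiq brief; only the Namespace and ResourceQuota object shapes come from the Kubernetes API.

```python
import json


def namespace_manifests(team: str, service: str, max_pods: int = 50) -> list:
    """Build a Namespace and a ResourceQuota for one service.

    The naming scheme and the default quota are illustrative assumptions.
    """
    name = f"{team}-{service}"
    labels = {"team": team, "service": service}  # hypothetical label keys
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": labels},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{name}-quota", "namespace": name},
        # Quantities in a ResourceQuota are expressed as strings.
        "spec": {"hard": {"pods": str(max_pods)}},
    }
    return [namespace, quota]


# Example request from a hypothetical "checkout" team.
manifests = namespace_manifests("checkout", "cart-api", max_pods=20)
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

In a real platform the output would more likely be committed to Git and applied by the GitOps controller than printed, so that the request leaves an audit trail and needs no ticket.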

Non-Functional Requirements (the ones that decide architecture)

Availability SLO: 99.95% uptime per quarter (allows roughly 65 minutes of downtime per quarter, or about 22 minutes per month). This rules out single-AZ deployments and single-region databases.
RTO / RPO: Recovery Time Objective 15 minutes, Recovery Point Objective 5 minutes. This rules out backup-and-restore as the recovery path. It calls for continuous, low-lag replication (Aurora Global Database replicates asynchronously at the storage layer, with cross-region lag typically under a second, well inside the 5-minute RPO) and a pre-warmed DR cluster, not cold standby.
Security posture: PCI-DSS Level 2 (card data flows through the platform). This forces network segmentation, encryption in transit and at rest everywhere, audit logging on all admin actions, quarterly vulnerability scans, and penetration testing at least annually and after significant changes.
Compliance: GDPR for EU customers. All data processing must be documented. Customer PII must be deletable within 30 days of a deletion request, which means event streams containing PII need a compaction or tombstone strategy.
Cost ceiling: AWS spend must not exceed $85,000/month at 2x current traffic, with a hard alert at 90% of budget. Cost must be tracked per team per sprint via tag-based allocation.
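The cost requirement is simple enough to express as code. The sketch below, in Python, groups spend by a team tag and checks the 90% alert threshold against the $85,000 ceiling. The line-item shape, the tag key "team", and the sample figures are assumptions made for illustration; a real implementation would read from a billing export rather than a hard-coded list.

```python
from collections import defaultdict

MONTHLY_CEILING_USD = 85_000   # from the cost requirement
ALERT_FRACTION = 0.90          # hard alert at 90% of budget


def spend_by_team(line_items):
    """Sum cost per value of the 'team' tag; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get("team", "untagged")
        totals[team] += item["cost_usd"]
    return dict(totals)


def budget_alert(total_usd):
    """True once spend reaches the alert threshold."""
    return total_usd >= MONTHLY_CEILING_USD * ALERT_FRACTION


# Hypothetical month-to-date line items.
line_items = [
    {"cost_usd": 31_000.0, "tags": {"team": "checkout"}},
    {"cost_usd": 27_500.0, "tags": {"team": "search"}},
    {"cost_usd": 19_000.0, "tags": {}},  # resource missing its team tag
]

totals = spend_by_team(line_items)
total = sum(totals.values())
print(totals)
print("alert:", budget_alert(total))
```

Surfacing the "untagged" bucket explicitly is a deliberate choice here: tag-based allocation only works if untagged spend is visible and treated as a defect.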

Operational Constraints

The platform team has 4 engineers. Every operational process that cannot be automated will not be done reliably — design for zero-touch operations wherever possible.
The existing application is a Laravel monolith. The migration to microservices will be incremental over 12 months using the strangler-fig pattern. The platform must support both the monolith and new services simultaneously.
All infrastructure must be defined in Terraform and stored in Git. No ClickOps. An IAM audit will flag any resource not created by the CI/CD pipeline service account.
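The strangler-fig constraint can be pictured as a routing rule at the edge: paths that have been extracted go to the new services, and everything else still reaches the monolith. The Python sketch below shows that rule in isolation. The upstream names and path prefixes are hypothetical, and in practice the same logic would live in an ingress or service-mesh route table rather than application code.

```python
MONOLITH = "laravel-monolith"  # hypothetical upstream name

# Hypothetical path prefixes that have already been extracted into services.
MIGRATED_PREFIXES = {
    "/api/cart": "cart-service",
    "/api/search": "search-service",
}


def route(path: str) -> str:
    """Return the upstream for a request path; the longest matching prefix wins.

    A prefix matches only on a path-segment boundary, so "/api/cartoon"
    does not match "/api/cart".
    """
    matches = [
        prefix
        for prefix in MIGRATED_PREFIXES
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        return MONOLITH
    return MIGRATED_PREFIXES[max(matches, key=len)]


for path in ["/api/cart/items", "/api/cartoon", "/checkout"]:
    print(path, "->", route(path))
```

Migrating another slice of the monolith then amounts to adding one entry to the prefix table, and rolling back amounts to removing it.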

Key idea: The RTO, RPO, and availability SLO are not marketing numbers — they are architectural drivers. A 99.95% SLO with a 15-minute RTO forces specific decisions: active-active multi-AZ, Aurora Global Database with read replicas, automated failover via Route 53 health checks, and runbooks that have been tested under fire drill conditions. Every architecture decision in the next eight lessons traces back to one of these numbers.
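A quick way to see why these numbers drive architecture is to compute the error budget and compare it with the RTO. The Python sketch below assumes a quarter of 91.25 days (365 / 4) and a 30-day month; those window lengths are assumptions, and the function name is illustrative.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Minutes of allowed downtime for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


SLO = 0.9995
RTO_MINUTES = 15

quarter = error_budget_minutes(SLO, 91.25)  # assumed quarter length
month = error_budget_minutes(SLO, 30)       # assumed month length

# How many full RTO-length outages fit into one quarter's budget.
full_failovers = int(quarter // RTO_MINUTES)

print(f"quarterly budget: {quarter:.1f} min")   # 65.7 min
print(f"monthly budget:   {month:.1f} min")     # 21.6 min
print(f"RTO-length outages per quarter: {full_failovers}")  # 4
```

Under these assumptions a single failover that takes the full 15 minutes spends almost a quarter of the quarterly budget, which is why failover has to be automated and rehearsed rather than improvised.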

The Platform Architecture — 30,000-Foot View

Before diving into any single layer, a senior engineer draws the full picture so every subsequent decision is made in context. Below is the target-state platform architecture. It is not final — it will evolve as you work through each layer — but it establishes the reference frame.

Arctiq Commerce target-state platform: two active AWS regions, EKS compute, Aurora Global Database, MSK Kafka with MirrorMaker2 cross-region replication, and a shared observability and security layer.

How the Next Eight Lessons Build This Platform

Each lesson corresponds to one architectural layer. They are ordered by dependency — you cannot deploy Kubernetes without networking, you cannot enforce policy without the security layer in place, and you cannot do SRE practice without observability. The order also mirrors how a real platform team delivers: infrastructure first, compute second, delivery third, observability fourth, security fifth, reliability sixth, and then the experience layer that product teams actually interact with.

Foundation — Accounts, Network & IAM: AWS Organizations, VPC architecture (3 AZs, transit gateway, private endpoints), IAM least-privilege roles, SCPs.
Infrastructure as Code: Terraform module structure, remote state with S3 + DynamoDB locking, Atlantis for GitOps-driven plan/apply, environment promotion.
The Kubernetes Platform: EKS cluster design, Karpenter, Istio, cluster add-ons, namespace-level RBAC and resource quotas.
CI/CD & GitOps Delivery: GitHub Actions pipelines, ArgoCD ApplicationSets, progressive delivery with Argo Rollouts, release gates.
Observability Stack: kube-prometheus-stack, OpenTelemetry Collector, distributed tracing, SLO-based alerting, Grafana dashboards as code.
Security & Compliance: HashiCorp Vault for secrets, OPA/Gatekeeper for admission control, Falco runtime security, PCI network segmentation.
Reliability — SRE Practice & DR: SLO budget tracking, GameDay drills, Aurora failover testing, chaos experiments, DR runbooks.
Cost & Platform Experience: Karpenter spot strategy, Kubecost tag-based allocation, Internal Developer Portal with Backstage, golden-path templates.
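The claim that the lessons are ordered by dependency can be checked mechanically. The Python sketch below encodes one plausible dependency graph between the eight layers and verifies that the lesson order respects it. The edges are an assumption inferred from the description above (for example, that reliability work depends on observability), not an authoritative map, and the short layer names are shorthand for the lesson titles.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Lesson order as listed above, using short names.
LESSON_ORDER = [
    "foundation", "iac", "kubernetes", "delivery",
    "observability", "security", "reliability", "experience",
]

# layer -> layers it depends on (assumed edges, for illustration only)
DEPENDS_ON = {
    "foundation": set(),
    "iac": {"foundation"},
    "kubernetes": {"iac"},
    "delivery": {"kubernetes"},
    "observability": {"kubernetes"},
    "security": {"kubernetes"},
    "reliability": {"observability", "security"},
    "experience": {"delivery", "observability", "security"},
}


def respects_dependencies(order, depends_on):
    """True if every layer appears after all the layers it depends on."""
    position = {layer: i for i, layer in enumerate(order)}
    return all(
        position[dep] < position[layer]
        for layer, deps in depends_on.items()
        for dep in deps
    )


# One valid build order derived from the graph (there can be several).
derived = list(TopologicalSorter(DEPENDS_ON).static_order())

print(respects_dependencies(LESSON_ORDER, DEPENDS_ON))
print(derived)
```

The same check is useful in a real roadmap: if a proposed delivery order fails it, some team will be blocked on a layer that does not exist yet.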

Senior engineer habit: Before every architecture session, write down the three constraints that would make you choose a fundamentally different approach if they changed. For Arctiq Commerce those are: (1) PCI-DSS scope — if card data moved out-of-scope, network segmentation could be simplified significantly; (2) the $85k/month cost ceiling — if removed, active-active would use larger on-demand instances instead of Spot with Karpenter; (3) the 4-engineer platform team — if the team were 12 engineers, more bespoke tooling would be justified. Knowing your key constraints makes every design decision faster and more defensible.

How to Read the Remaining Lessons

Each of the next eight lessons will contain a reference Terraform module or Kubernetes manifest that is production-ready — not a tutorial skeleton, but something you could fork into a real environment with minor variable changes. Commands will be shown as they would run in a CI/CD pipeline, not manually from a laptop. Trade-off discussions will name specific numbers: latency budgets, failure rates, dollar costs, and team-hours of operational burden.

The measure of success for this capstone is not whether you can repeat back what each component does — you already know that. The measure is whether you can answer the harder question: given these constraints, why did we make this choice instead of the alternatives, and what would have to change to make a different choice correct?

Production pitfall: The single most common mistake when building a new platform is scope creep during the foundation phase. Teams spend six months perfecting IAM policies and VPC flow log pipelines and never ship the Kubernetes cluster that product teams actually need. Use the requirements above as a forcing function: if a decision does not trace back to a stated requirement, defer it. A platform that product teams can use in three months beats a perfect platform that ships in twelve.