Infrastructure as Code Layer
The IaC layer is the skeleton of the entire platform. Everything else — Kubernetes, CI/CD pipelines, observability agents, security tooling — depends on what Terraform provisions and how it is structured. Done poorly, IaC becomes the slowest, most dangerous part of the delivery chain: state lock contention blocks teams, blast-radius incidents take down prod, and drift between environments causes impossible-to-reproduce bugs. Done well, it is invisible — engineers submit PRs, a pipeline applies them safely, and the platform self-heals.
This lesson covers three things every big-tech platform team owns outright: the repository layout that scales past 50 engineers, a state strategy that makes blast radius a first-class design constraint, and the internal module catalog that keeps every team from re-inventing the same EKS node group.
Repository Layout: The Monorepo Approach
At Netflix, Airbnb, and Stripe, infrastructure lives in a single monorepo with deep CODEOWNERS enforcement. A polyrepo approach offers sharper security boundaries (product teams cannot even read foundation HCL) but loses the ability to trace cross-layer dependencies and review atomic changes. For this capstone we use a monorepo. The directory tree below is the canonical layout — every path reflects a deliberate decision:
The numbered layer directories (01-foundation, 02-platform, 03-workloads) encode dependency order. A CI rule rejects any PR that tries to reference outputs upward — workloads can read platform outputs, but platform must never depend on a workload state file.
The modules/ directory and all 01-foundation/ paths must require approval from the platform team. A product engineer should never be able to merge a change to the VPC CIDR or the EKS control-plane version without review. GitHub CODEOWNERS plus branch protection with "Require review from Code Owners" enabled enforces this automatically.
State Strategy: Isolation as a First-Class Constraint
State is the blast-radius boundary. A single terraform apply can only destroy what lives in its state file — nothing more. The three-layer split in the directory tree maps directly to three separate state files per environment. Layer 1 (foundation) is touched once per quarter; Layer 3 (workloads) is touched dozens of times per day. They must never share a lock.
Each layer's backend.tf follows an identical pattern. The S3 bucket itself is provisioned by the foundation layer in the security account, with versioning, server-side encryption (SSE-KMS), and MFA-delete enabled. DynamoDB provides the state lock table. This is configured once per environment:
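A minimal sketch of what such a backend.tf might look like for the platform layer in one environment. The bucket name, lock table name, KMS alias, region, and key path are all hypothetical placeholders, not values from this lesson; only the key changes between layers, which is what gives each layer its own state file and its own lock.

```hcl
# 02-platform/backend.tf (illustrative; all names below are assumed)
terraform {
  backend "s3" {
    bucket = "example-tfstate-prod"          # hypothetical state bucket, one per environment
    key    = "02-platform/terraform.tfstate" # one key per layer => one state file per layer
    region = "us-east-1"                     # assumed region

    encrypt    = true                    # server-side encryption of the state object
    kms_key_id = "alias/example-tfstate" # hypothetical KMS alias for SSE-KMS

    dynamodb_table = "example-tfstate-lock" # hypothetical lock table
  }
}
```

Because the lock is taken per state key, a long-running foundation apply does not block a workload apply that uses a different key. Newer Terraform releases also offer S3-native locking as an alternative to the DynamoDB table; the sketch follows the DynamoDB approach described above.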
Avoid terraform_remote_state across trust boundaries. If product team A can read the foundation state file directly, they can read every sensitive output: RDS master credentials, private subnet IDs, KMS key ARNs. The production-hardened pattern is to publish non-sensitive cross-layer values to SSM Parameter Store as part of the lower layer's apply, and have upper layers read them with aws_ssm_parameter data sources. The platform team controls what gets published; product teams read only what they need.
For cross-layer output sharing the pattern looks like this — the platform layer writes, and workload layers read:
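The write/read split might be sketched as follows. The parameter path and the aws_eks_cluster.main reference are assumptions made for illustration; the two halves live in different root modules and are shown together only for brevity.

```hcl
# --- 02-platform (writer): publish a non-sensitive value during apply ---
resource "aws_ssm_parameter" "eks_cluster_name" {
  name  = "/platform/prod/eks/cluster_name" # hypothetical naming scheme
  type  = "String"
  value = aws_eks_cluster.main.name # assumes the platform layer defines this cluster
}

# --- 03-workloads (reader): look the value up by its well-known path ---
data "aws_ssm_parameter" "eks_cluster_name" {
  name = "/platform/prod/eks/cluster_name"
}

locals {
  # The provider marks the data source's value as sensitive; nonsensitive()
  # is one way to unwrap a value that is known to be safe to display.
  eks_cluster_name = nonsensitive(data.aws_ssm_parameter.eks_cluster_name.value)
}
```

With this shape, the workload layer needs IAM read access only to the parameter paths it consumes, not to the platform state object.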
The Internal Module Catalog
At any company with more than three platform engineers, ad-hoc module writing produces a proliferation of subtly incompatible EKS node groups, RDS instances without final snapshots, and S3 buckets with public access silently enabled. The module catalog solves this by making the right thing the easy thing: a product engineer declares module "payments_db" and gets encryption, parameter groups, deletion protection, and CloudWatch alarms for free.
A production-grade internal module has three properties that distinguish it from a quick wrapper:
- Opinionated defaults that encode policy. Encryption is on by default and cannot be turned off via a variable. Deletion protection on RDS defaults to true and requires an explicit deletion_protection = false to change. Security is opt-out, not opt-in, and even then policy/Conftest gates block insecure configurations in CI before a plan is approved.
- Version-pinned and changelog-tracked. Modules live under modules/ in the monorepo. Each module has a CHANGELOG.md. Consumers reference a git tag: source = "git::https://github.com/company/infra//modules/rds-postgres?ref=v3.2.1". Major version bumps are announced in the internal platform newsletter; teams migrate on their own schedule, but the platform team deprecates old versions with a hard sunset date.
- Outputs cover 100% of what consumers need. A module that doesn't output its own ARN forces consumers to use data sources and creates implicit coupling between apply order and state. Every resource provisioned by the module should have a corresponding output.
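From the consumer side, a pinned module call could look roughly like this. The source string is the one quoted in the list above; the input names and values are hypothetical and depend on the module's real interface.

```hcl
# 03-workloads/payments/main.tf (illustrative consumer; inputs are assumed)
module "payments_db" {
  # Pinned to an immutable git tag, so upgrades are explicit and reviewable.
  source = "git::https://github.com/company/infra//modules/rds-postgres?ref=v3.2.1"

  identifier        = "payments-prod"         # hypothetical input
  subnet_group_name = "example-private-db"    # hypothetical input
  kms_key_arn       = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE" # placeholder ARN
}
```

Moving to a new version is then a one-line diff on ref, which makes the upgrade visible in the PR and in the plan.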
Here is a stripped-down but production-representative internal rds-postgres module signature. The full module is around 180 lines of HCL; the interface is what matters:
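As a rough sketch only, and not the actual module, an interface with these properties might look like the following. Every variable name, default, and the admin username are assumptions; the point is the shape: a small set of inputs, encryption that is hard-coded rather than exposed, and outputs for what consumers need.

```hcl
# modules/rds-postgres (hypothetical interface sketch)

variable "identifier" {
  type        = string
  description = "Name of the DB instance."
}

variable "instance_class" {
  type    = string
  default = "db.t4g.medium" # assumed default
}

variable "allocated_storage_gb" {
  type    = number
  default = 50 # assumed default
}

variable "subnet_group_name" { type = string }
variable "kms_key_arn" { type = string }

variable "deletion_protection" {
  type        = bool
  default     = true # safe by default; turning it off must be explicit
  description = "Set to false only when the instance is being retired."
}

resource "aws_db_instance" "this" {
  identifier           = var.identifier
  engine               = "postgres"
  instance_class       = var.instance_class
  allocated_storage    = var.allocated_storage_gb
  db_subnet_group_name = var.subnet_group_name

  storage_encrypted = true # hard-coded: no variable can disable encryption
  kms_key_id        = var.kms_key_arn

  deletion_protection       = var.deletion_protection
  skip_final_snapshot       = false
  final_snapshot_identifier = "${var.identifier}-final"

  username                    = "postgres_admin" # hypothetical admin user
  manage_master_user_password = true             # password kept in Secrets Manager, not in state
}

# Outputs let consumers avoid data-source lookups against this instance.
output "arn" { value = aws_db_instance.this.arn }
output "endpoint" { value = aws_db_instance.this.endpoint }
output "id" { value = aws_db_instance.this.id }
```

Parameter groups and CloudWatch alarms are omitted here for brevity; a full module would add them along with matching outputs.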
A patch bump (v3.2.1 → v3.2.2) must be backward-compatible; adding an optional variable with a default is a patch. A minor version adds new resources. A major version breaks the interface or removes resources. When you merge a major bump, open a tracking issue listing every environment still on the old version, and set a 60-day sunset. Teams that miss the deadline get their apply blocked by a Conftest rule that rejects deprecated module versions.
The state strategy and module catalog together create a virtuous cycle: because each root module is small (one layer, one environment), plans are fast (under 30 seconds for a workload module), and Conftest policies run against the plan JSON before any human approves. A misconfigured S3 bucket or a missing required tag is caught by conftest test --policy policy/ plan.json in the PR pipeline, not at apply time, and certainly not in production.
With this IaC foundation stable, the next lesson provisions the Kubernetes platform on top of it — EKS cluster configuration, add-on management via Helm and Terraform, and the Karpenter node autoscaler that makes the compute layer elastic at hyperscaler scale.