Advanced Terraform & IaC Patterns

Structuring Terraform at Scale

At a startup you can get away with a single main.tf and a shared state file. At 50 engineers across 10 product teams managing six AWS accounts, that approach collapses within weeks: state lock contention, blast-radius failures, and on-call incidents caused by the wrong team touching the wrong resource. This lesson teaches the repo layout, environment separation, and layered state strategy that top-tier engineering organizations use to keep infrastructure changes safe, reviewable, and autonomous across teams.

The Core Constraint: Blast Radius

Every structural decision in large-scale Terraform flows from one question: if this terraform apply goes wrong, what is the maximum damage? A monolithic root module that manages networking, IAM, RDS, and Kubernetes in one state file can take down production with a single misplaced count. The remedy is state isolation — splitting infrastructure into layers where each layer is a separate Terraform root module with its own backend state.

State = blast radius boundary. Anything inside one state file can be accidentally destroyed by a single terraform apply. Design your layers so that losing any single state file affects only one logical scope (e.g., one app's compute, not the entire VPC).

The Three-Layer Model

A widely used pattern divides infrastructure into three layers, applied from the bottom up. Each layer may reference outputs only from layers below it, never sideways or upward.

Figure: Three-Layer Terraform State Model (each upper layer reads outputs from the layers below it)

  • Layer 1, Foundation (accounts, VPCs, DNS, IAM org roles): managed by Platform/Infra team · changed rarely · blast radius: entire AWS account
  • Layer 2, Platform (EKS/RDS clusters, secrets manager, shared ALB): managed by Platform team · changed monthly · blast radius: all apps in cluster
  • Layer 3, Application (per-service compute, queues, app DBs): managed by product teams · changed daily · blast radius: one service

Three-layer Terraform state model: Foundation → Platform → Application, each with its own state and blast radius.

Lower layers expose outputs — VPC IDs, subnet IDs, cluster endpoints — via terraform_remote_state or (preferred at scale) a parameter store / SSM pattern, so app teams never need read access to the foundation state file.
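
As a rough illustration of the terraform_remote_state option, a platform-layer root module might read foundation outputs as sketched here. The bucket name, state key, region, and the output names vpc_id and private_subnet_ids are placeholders assumed for this example; the foundation root module would have to declare outputs with those names.

```hcl
# Sketch only: bucket, key, region, and output names are assumed placeholders.
data "terraform_remote_state" "foundation" {
  backend = "s3"

  config = {
    bucket = "acme-terraform-state-prod"
    key    = "layer1/production/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Only root-level outputs of the foundation module are visible here.
  vpc_id             = data.terraform_remote_state.foundation.outputs.vpc_id
  private_subnet_ids = data.terraform_remote_state.foundation.outputs.private_subnet_ids
}
```

Note that whoever runs this needs read access to the foundation state object itself, which is the coupling the parameter store pattern avoids.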

Monorepo vs. Polyrepo

Both models work. The decision is organizational, not technical.

  • Monorepo — all Terraform in one repo, directories per layer/environment. Great for discoverability and atomic cross-layer PRs. Requires strong CODEOWNERS rules and per-path CI triggers so a change to layer1-foundation/ does not trigger plan for every app module.
  • Polyrepo — each layer (or team) owns its own repo. Natural security boundary: product teams literally cannot see foundation HCL. Harder to trace cross-layer dependencies. Common at large enterprises with separate security-compliance ownership of foundation infra.
Start with a monorepo. It is much easier to split later (git subtree / repo extraction) than to merge. Many growth-stage companies run a monorepo for infra and add CODEOWNERS automation to enforce team boundaries.

Canonical Monorepo Layout

The layout below is production-hardened across hundreds of AWS environments. Every path is deliberate:

    infra/
    ├── modules/                    # Reusable internal modules (NOT root modules)
    │   ├── vpc/
    │   │   ├── main.tf
    │   │   ├── variables.tf
    │   │   ├── outputs.tf
    │   │   └── README.md
    │   ├── eks-cluster/
    │   └── rds-postgres/
    │
    ├── layer1-foundation/          # Root module: one per AWS account
    │   ├── production/
    │   │   ├── main.tf             # calls modules/vpc, modules/org-iam
    │   │   ├── backend.tf          # S3 key: "layer1/production/terraform.tfstate"
    │   │   ├── terraform.tfvars
    │   │   └── outputs.tf
    │   └── staging/
    │       ├── main.tf
    │       ├── backend.tf          # S3 key: "layer1/staging/terraform.tfstate"
    │       └── terraform.tfvars
    │
    ├── layer2-platform/            # Root module: per environment
    │   ├── production/
    │   │   ├── eks.tf
    │   │   ├── rds.tf
    │   │   ├── backend.tf          # S3 key: "layer2/production/terraform.tfstate"
    │   │   └── data.tf             # data "terraform_remote_state" "foundation" { ... }
    │   └── staging/
    │
    └── layer3-apps/                # Root module: per service per environment
        ├── payments-api/
        │   ├── production/
        │   │   ├── main.tf
        │   │   ├── backend.tf      # S3 key: "layer3/payments-api/production/terraform.tfstate"
        │   │   └── variables.tf
        │   └── staging/
        └── notifications-svc/
            ├── production/
            └── staging/
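
To make the split between root modules and reusable modules concrete, here is a minimal sketch of what layer1-foundation/production/main.tf and outputs.tf could contain. The module inputs (name, cidr_block) and outputs (vpc_id, private_subnet_ids) are hypothetical; the actual interface of modules/vpc is not shown in this lesson.

```hcl
# layer1-foundation/production/main.tf (sketch; module inputs are assumed)
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the production VPC, set in terraform.tfvars"
}

module "vpc" {
  # Relative path from layer1-foundation/production/ up to infra/modules/vpc
  source = "../../modules/vpc"

  name       = "production" # hypothetical input
  cidr_block = var.vpc_cidr # hypothetical input
}

# layer1-foundation/production/outputs.tf (sketch)
# Re-export module outputs so higher layers can consume them.
output "vpc_id" {
  value = module.vpc.vpc_id # assumes modules/vpc declares this output
}

output "private_subnet_ids" {
  value = module.vpc.private_subnet_ids # assumes modules/vpc declares this output
}
```

The root module stays thin: it wires environment-specific values into shared modules and re-exports what other layers need, while the resource definitions live under modules/.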

Environment Separation Strategies

There are three approaches to separating environments in Terraform. Understanding the trade-offs prevents costly migrations later.

  1. Directory-per-environment (shown above) — each environment is a separate root module directory with its own backend.tf and .tfvars. This is the safest and most explicit approach. You cannot accidentally apply staging config to production. The cost: some HCL duplication, mitigated by shared modules.
  2. Workspaces — one root module, multiple named workspaces, one state file per workspace. Works for small, truly-identical environments (dev/test). Breaks down when environments diverge: different instance sizes, different subnets, different DNS zones. Avoid for production/staging at scale — the temptation to add terraform.workspace == "production" ? ... : ... conditionals metastasizes into unmaintainable code.
  3. Separate accounts — the multi-account approach recommended in AWS Well-Architected guidance and common in regulated industries. Production, staging, sandbox, and security-tooling each live in separate AWS accounts linked under AWS Organizations. Each account has its own layer1 root module. This is the gold standard for SaaS companies with SOC 2 or PCI DSS requirements.
Never share state files between environments. With a shared state file, a terraform destroy aimed at staging can also delete production resources such as RDS instances, because Terraform treats everything in that state as in scope. State isolation is non-negotiable. One backend key = one environment.
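
The difference between the workspace style and the directory-per-environment style can be sketched as follows. The variable name instance_type and the instance sizes are illustrative assumptions, not values from this lesson.

```hcl
# Workspace style (the pattern warned against): environment logic lives in HCL.
# instance_type = terraform.workspace == "production" ? "m5.2xlarge" : "t3.medium"

# Directory-per-environment style: the root module only declares the variable.
variable "instance_type" {
  type        = string
  description = "Instance type for this environment (hypothetical variable)"
}

# Each environment directory then sets its own value in terraform.tfvars:
#
#   production/terraform.tfvars
#   instance_type = "m5.2xlarge"
#
#   staging/terraform.tfvars
#   instance_type = "t3.medium"
```

With this arrangement the environment is determined by the directory you run Terraform in and its backend key, rather than by a workspace selected in the shell.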

Backend Configuration at Scale

At scale, every team configures their S3 backend with the same three non-negotiables: versioning, encryption, and DynamoDB locking. The backend key must encode the layer, service name, and environment so state files are self-describing:

    # layer3-apps/payments-api/production/backend.tf
    terraform {
      backend "s3" {
        bucket         = "acme-terraform-state-prod"
        key            = "layer3/payments-api/production/terraform.tfstate"
        region         = "us-east-1"
        encrypt        = true
        dynamodb_table = "acme-terraform-locks"
        kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/mrk-abc123"
      }
    }

    # Read foundation outputs without giving this team access to foundation state
    data "aws_ssm_parameter" "vpc_id" {
      name = "/infra/production/layer1/vpc_id"
    }

    data "aws_ssm_parameter" "private_subnets" {
      name = "/infra/production/layer1/private_subnet_ids"
    }

The SSM parameter pattern is superior to terraform_remote_state for cross-team consumption because it decouples state file access from value consumption. The foundation team writes outputs to SSM; app teams read from SSM. No IAM permissions to the foundation S3 bucket are needed for app teams, and the foundation can refactor its internals without changing the SSM key contracts.
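
The writing side of this contract might look roughly like the sketch below, placed in the foundation root module. The file name and the module.vpc output names are assumptions; the parameter names match the ones the app team reads above. Storing the subnet list as a comma-separated StringList is one possible encoding, not the only one.

```hcl
# layer1-foundation/production/ssm.tf (hypothetical file name)
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infra/production/layer1/vpc_id"
  type  = "String"
  value = module.vpc.vpc_id # assumes the vpc module exposes this output
}

resource "aws_ssm_parameter" "private_subnet_ids" {
  name  = "/infra/production/layer1/private_subnet_ids"
  type  = "StringList"
  value = join(",", module.vpc.private_subnet_ids) # assumed list output
}

# Consumer side, in the app root module next to the data sources shown above:
locals {
  vpc_id = data.aws_ssm_parameter.vpc_id.value

  # A StringList comes back as one comma-separated string, so split it.
  private_subnet_ids = split(",", data.aws_ssm_parameter.private_subnets.value)
}
```

The AWS provider marks the data source's value attribute as sensitive, so these locals are redacted in plan output unless explicitly unwrapped.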

CODEOWNERS and Per-Path CI

The directory layout only enforces team boundaries if your CI/CD system enforces them too. A production-grade setup combines:

  • .github/CODEOWNERS: layer1-foundation/ requires approval from @infra-platform; layer3-apps/payments-api/ requires @team-payments.
  • Per-path CI triggers — GitHub Actions on.push.paths or Atlantis per-directory plans so only the affected root modules run terraform plan on each PR.
  • Protected branch rules — no direct pushes to main; plan output posted as PR comment; terraform apply only runs after merge on the CI runner, never from a developer laptop in production.
Never run terraform apply from a developer laptop against production. CI runners should hold the production credentials; developers hold only read-only roles that allow terraform plan. This single policy prevents the most common class of human-error production incidents.
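
One way the read-only developer role could be expressed is sketched below. The role name and both variables are hypothetical, and a real setup would likely need more, for example decrypt permission on the KMS key that encrypts the state. The lock-table statement is there because terraform plan takes a state lock by default; the alternative is running plan with -lock=false.

```hcl
variable "developer_principal_arns" {
  type        = list(string)
  description = "IAM principals allowed to assume the plan-only role (hypothetical)"
}

variable "lock_table_arn" {
  type        = string
  description = "ARN of the DynamoDB state lock table (hypothetical)"
}

# Trust policy: who may assume the role.
data "aws_iam_policy_document" "developer_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = var.developer_principal_arns
    }
  }
}

resource "aws_iam_role" "terraform_plan" {
  name               = "terraform-plan-readonly"
  assume_role_policy = data.aws_iam_policy_document.developer_trust.json
}

# AWS managed read-only policy: enough to refresh and diff, not to change resources.
resource "aws_iam_role_policy_attachment" "plan_readonly" {
  role       = aws_iam_role.terraform_plan.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# plan acquires and releases a state lock, which writes to the lock table.
data "aws_iam_policy_document" "state_lock" {
  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = [var.lock_table_arn]
  }
}

resource "aws_iam_role_policy" "state_lock" {
  name   = "terraform-state-lock"
  role   = aws_iam_role.terraform_plan.id
  policy = data.aws_iam_policy_document.state_lock.json
}
```

The CI runner would use a separate role with write permissions, so that apply is only possible from the pipeline.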

Common Failure Modes

At this stage teams routinely make three structural mistakes:

  • God module — one module that creates everything. The module.app call takes 80 inputs and manages 400 resources. Refactoring it mid-flight is a multi-week state surgery project. Decompose early.
  • Hardcoded account IDs in shared modules — shared modules should never reference specific account IDs or region strings. Pass them as variables. A module with a hardcoded 123456789012 is impossible to reuse across accounts.
  • Missing state locking — two engineers run terraform apply simultaneously, the second overwrites the first's state, and you lose track of which resources Terraform knows about. Always configure a DynamoDB lock table (recent Terraform releases also offer S3-native locking through the backend's use_lockfile option). It costs pennies and prevents catastrophic state corruption.
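
The last two failure modes have small, concrete fixes, sketched here under a few assumptions. The variable names, the KMS alias, and the choice of on-demand billing are illustrative; the table name matches the dynamodb_table value used in the backend configuration earlier, and the LockID string hash key is what the S3 backend expects.

```hcl
# Inside a shared module: take account and region as inputs, never as literals.
variable "account_id" {
  type        = string
  description = "AWS account ID the module deploys into"

  validation {
    condition     = can(regex("^[0-9]{12}$", var.account_id))
    error_message = "The account_id value must be a 12-digit AWS account ID."
  }
}

variable "region" {
  type        = string
  description = "AWS region the module deploys into"
}

locals {
  # Hypothetical ARN assembled from inputs rather than a hardcoded 123456789012.
  app_key_alias_arn = "arn:aws:kms:${var.region}:${var.account_id}:alias/app"
}

# In the calling root module, the account ID can come from the active credentials:
#   data "aws_caller_identity" "current" {}
#   account_id = data.aws_caller_identity.current.account_id

# State lock table, created once per state bucket (typically in a bootstrap stack).
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "acme-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```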

The layout and discipline established here are the foundation on which the rest of this tutorial builds: workspaces, advanced modules, testing, and Terragrunt all assume you already have clean layer separation and isolated state per environment.