Advanced Terraform & IaC Patterns

Project: A Multi-Env Terraform Platform

This capstone lesson assembles every pattern from the tutorial into a single, production-grade reference platform. You will design a repository layout, wire up remote state with cross-environment promotion guards, write a reusable module library, and attach a CI pipeline that runs fmt, validate, plan, and gated apply — automatically for dev, manually approved for staging and production. By the end you will have a blueprint you can fork and run at any company.

The Repository Layout

All infrastructure lives in one monorepo. Tooling (Terragrunt, a thin Makefile, and a shared module library) keeps it DRY. Under live/, the split is environment first, layer second. This is the inverse of a naive modules-first layout, which tends to leak cross-environment state references.

infra/
├── modules/                  # Reusable, versioned building blocks
│   ├── vpc/
│   ├── eks/
│   ├── rds/
│   └── iam-role/
├── live/                     # Environment roots (Terragrunt)
│   ├── terragrunt.hcl        # Root config: remote state bucket, provider defaults
│   ├── dev/
│   │   ├── env.hcl           # env = "dev", aws_account_id = "111122223333"
│   │   ├── network/
│   │   │   └── terragrunt.hcl
│   │   ├── eks/
│   │   │   └── terragrunt.hcl
│   │   └── rds/
│   │       └── terragrunt.hcl
│   ├── staging/
│   │   └── ... (same shape)
│   └── prod/
│       └── ... (same shape)
└── .github/
    └── workflows/
        ├── plan.yml          # PR: plan all changed stacks
        └── apply.yml         # Merge to main: apply dev; gate staging+prod

The root terragrunt.hcl centralises the S3 backend and DynamoDB lock table so no stack ever specifies a bucket name directly. Environments map to separate AWS accounts — never to workspaces — because account-level IAM boundaries are the only blast-radius guarantee that survives a leaked credential.

Root Terragrunt Config

The root file reads env.hcl from each environment directory, injects the account ID, and generates a unique state key per stack. Every child terragrunt.hcl inherits this through an include block whose path is find_in_parent_folders().

# live/terragrunt.hcl
locals {
  env_vars   = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  env        = local.env_vars.locals.env
  account_id = local.env_vars.locals.aws_account_id
  region     = "us-east-1"
}

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket = "acme-terraform-state-${local.account_id}"
    # path_relative_to_include() already begins with the environment directory
    # (e.g. "dev/eks"), so this key resolves to dev/dev/eks/terraform.tfstate.
    key            = "${local.env}/${path_relative_to_include()}/terraform.tfstate"
    region         = local.region
    encrypt        = true
    dynamodb_table = "acme-terraform-locks"
    # Keep all public access blocked on this bucket: state files contain secrets
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.region}"

  assume_role {
    role_arn = "arn:aws:iam::${local.account_id}:role/TerraformCIRole"
  }

  default_tags {
    tags = {
      Environment = "${local.env}"
      ManagedBy   = "terraform"
      Repo        = "github.com/acme/infra"
    }
  }
}
EOF
}
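
The root config reads env.hcl through read_terragrunt_config and expects two locals, env and aws_account_id. A minimal sketch of the dev file, using the values given in the repository layout comment (the staging and prod files would presumably have the same shape with their own account IDs):

```hcl
# live/dev/env.hcl
# Consumed by the root terragrunt.hcl as local.env_vars.locals.<name>.
locals {
  env            = "dev"
  aws_account_id = "111122223333"
}
```

Because every stack under live/dev/ resolves the same env.hcl through find_in_parent_folders("env.hcl"), moving a stack directory to another environment changes its account and state bucket without editing the stack itself.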

Never use Terraform workspaces for environment separation in production. All workspaces of a configuration share one backend (with S3, the same bucket, with non-default workspaces stored under the env:/ prefix) and one provider configuration, so a misconfigured terraform.workspace reference can silently apply dev code to prod state. Separate AWS accounts with separate IAM roles are the only audit-safe model.

The Multi-Env Architecture

[Figure: Multi-Environment Terraform Platform Architecture. The GitHub repo (infra/ monorepo, plan.yml / apply.yml) drives GitHub Actions (plan on PR, apply on merge), which deploys the network, eks and rds stacks into three AWS accounts: Dev (auto-apply on merge), Staging (manual approval gate) and Production (2-person approval gate). Each account holds its own S3 state bucket.]
Three-account promotion pipeline: dev auto-applies; staging and production require human approval gates in GitHub Actions.

The CI/CD Pipeline

The plan.yml workflow triggers on every pull request and runs terragrunt run-all plan scoped to only the stacks whose files changed, using git diff path filtering. The apply.yml workflow triggers on merge to main. Dev applies immediately; staging and prod each have a GitHub Actions environment with required reviewers configured in the repo settings, so the apply job for those accounts does not start until a named human has approved it.

# .github/workflows/apply.yml
name: Terraform Apply

on:
  push:
    branches: [main]
    paths: ['live/**']

env:
  TG_VERSION: "0.55.1"
  TF_VERSION: "1.8.5"

jobs:
  apply-dev:
    name: Apply - Dev (auto)
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC: no long-lived AWS keys in secrets
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111122223333:role/TerraformCIRole
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "${{ env.TF_VERSION }}"
          terraform_wrapper: false   # the wrapper script can garble output that Terragrunt parses
      - name: Install Terragrunt
        run: |
          # sudo: /usr/local/bin may not be writable by the runner user
          sudo curl -fsSL https://github.com/gruntwork-io/terragrunt/releases/download/v${{ env.TG_VERSION }}/terragrunt_linux_amd64 \
            -o /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt
      - name: Apply dev
        working-directory: live/dev
        run: terragrunt run-all apply --terragrunt-non-interactive -auto-approve

  apply-staging:
    name: Apply - Staging (gated)
    needs: apply-dev
    runs-on: ubuntu-latest
    environment: staging   # <-- requires reviewer in repo Settings > Environments
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::444455556666:role/TerraformCIRole
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "${{ env.TF_VERSION }}"
          terraform_wrapper: false
      - name: Install Terragrunt
        run: |
          sudo curl -fsSL https://github.com/gruntwork-io/terragrunt/releases/download/v${{ env.TG_VERSION }}/terragrunt_linux_amd64 \
            -o /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt
      - name: Apply staging
        working-directory: live/staging
        run: terragrunt run-all apply --terragrunt-non-interactive -auto-approve

  apply-prod:
    name: Apply - Production (2-person gated)
    needs: apply-staging
    runs-on: ubuntu-latest
    environment: production   # requires 2 reviewers + protection rules
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::777788889999:role/TerraformCIRole
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "${{ env.TF_VERSION }}"
          terraform_wrapper: false
      - name: Install Terragrunt
        run: |
          sudo curl -fsSL https://github.com/gruntwork-io/terragrunt/releases/download/v${{ env.TG_VERSION }}/terragrunt_linux_amd64 \
            -o /usr/local/bin/terragrunt && sudo chmod +x /usr/local/bin/terragrunt
      - name: Apply production
        working-directory: live/prod
        run: terragrunt run-all apply --terragrunt-non-interactive -auto-approve

Use OIDC, not long-lived access keys. The id-token: write permission plus configure-aws-credentials with role-to-assume means GitHub Actions assumes an IAM role using short-lived tokens, so there are no AWS access keys in Secrets to rotate or leak. Every major cloud provider supports OIDC federation with GitHub Actions as of 2024.
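
The role named in role-to-assume needs a trust policy that accepts GitHub's OIDC tokens and limits which workflows may use them. The following is a sketch rather than the course's own bootstrap code: the file location, the variable names and the allowed_subjects default are assumptions, and it presumes the GitHub OIDC identity provider has already been created in the account.

```hcl
# Hypothetical location: bootstrap/ci-role/main.tf (not part of the layout above)
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

variable "github_repo" {
  description = "owner/name of the repository allowed to assume the role (assumed value)"
  type        = string
  default     = "acme/infra"
}

variable "allowed_subjects" {
  description = "Suffixes of the token's sub claim that may assume the role"
  type        = list(string)
  # A push to main without a GitHub environment presents ref:refs/heads/main.
  default = ["ref:refs/heads/main"]
}

data "aws_caller_identity" "current" {}

locals {
  oidc_host         = "token.actions.githubusercontent.com"
  oidc_provider_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/${local.oidc_host}"
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [local.oidc_provider_arn]
    }

    # Only tokens minted for AWS STS are accepted.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_host}:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only tokens from this repository and the listed subjects are accepted.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_host}:sub"
      values   = [for s in var.allowed_subjects : "repo:${var.github_repo}:${s}"]
    }
  }
}

resource "aws_iam_role" "terraform_ci" {
  name               = "TerraformCIRole"
  assume_role_policy = data.aws_iam_policy_document.github_trust.json
}
```

In the staging and production accounts, allowed_subjects would instead be ["environment:staging"] or ["environment:production"], because a job that declares a GitHub environment presents that environment in the sub claim rather than the branch ref. Permissions policies for the role are omitted here.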

Child Stack: EKS Layer (dev example)

Each child terragrunt.hcl is tiny — it only specifies the module source, version pin, and environment-specific inputs. Cross-layer data flows through dependency blocks that call terraform output on the network stack's remote state, keeping layers decoupled without hard-coded resource IDs.

# live/dev/eks/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

locals {
  env_vars = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  env      = local.env_vars.locals.env
}

dependency "network" {
  config_path = "../network"

  mock_outputs = {   # lets plan run without network already applied
    vpc_id          = "vpc-mock"
    private_subnets = ["subnet-mock-a", "subnet-mock-b"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

terraform {
  source = "git::https://github.com/acme/infra.git//modules/eks?ref=v1.4.2"
}

inputs = {
  env                = local.env
  cluster_name       = "acme-${local.env}"
  vpc_id             = dependency.network.outputs.vpc_id
  private_subnets    = dependency.network.outputs.private_subnets
  node_instance_type = local.env == "prod" ? "m6i.2xlarge" : "t3.medium"
  desired_nodes      = local.env == "prod" ? 6 : 2
}
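
The dependency block only works if the network stack's module exports outputs named vpc_id and private_subnets. The lesson does not show modules/vpc, so the following is a hedged sketch of what its relevant parts might look like; the resource names, variables and CIDR defaults are assumptions chosen for illustration.

```hcl
# modules/vpc/main.tf (sketch; everything except the two output names is assumed)
variable "cidr_block" {
  type    = string
  default = "10.0.0.0/16"
}

variable "private_subnet_cidrs" {
  type    = list(string)
  default = ["10.0.1.0/24", "10.0.2.0/24"]
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "this" {
  cidr_block = var.cidr_block
}

# One private subnet per CIDR, spread across the available zones in order.
resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.this.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

# These names must match what the eks stack reads through
# dependency.network.outputs.<name>.
output "vpc_id" {
  value = aws_vpc.this.id
}

output "private_subnets" {
  value = aws_subnet.private[*].id
}
```

The output names are the contract between layers: renaming one in the module is a breaking change for every stack that declares the network stack as a dependency, which is one reason to ship such changes under a new module tag.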

Module Versioning and Promotion

Modules are pinned by Git tag (?ref=v1.4.2). The promotion workflow is: open a PR that updates the tag in dev → review the plan → merge, which applies dev automatically → open a PR that updates the staging tag → approve and apply staging → open a PR that updates the prod tag → 2-person sign-off → apply prod. This means dev always runs the newest module version, staging follows within hours, and production follows within days, with a human verifying the plan diff at each gate.

Never pin a module source to a branch such as ?ref=main. If the branch advances while a plan is waiting for approval, the apply executes different code than the plan showed. Always pin to a release tag that is never moved, or to a commit SHA. Enforce this with a Conftest OPA policy (Lesson 8) that rejects any source not matching ?ref=v* (extend the pattern if you also want to allow commit SHAs).
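
To make the pinning rule concrete, here is the same check expressed as a Terraform variable validation. This is not the Conftest policy from Lesson 8, only an illustration of the pattern in HCL; the variable name and the exact regular expression (a semantic-version tag or a 40-character commit SHA) are assumptions.

```hcl
# Hypothetical standalone file, e.g. checks/pinning.tf; evaluate with `terraform plan`.
variable "module_sources" {
  description = "Module source strings to check for immutable pins"
  type        = list(string)
  default = [
    "git::https://github.com/acme/infra.git//modules/eks?ref=v1.4.2",
  ]

  validation {
    # Accept only ?ref=vMAJOR.MINOR.PATCH or ?ref=<40 hex characters>.
    condition = alltrue([
      for s in var.module_sources :
      can(regex("\\?ref=(v[0-9]+\\.[0-9]+\\.[0-9]+|[0-9a-f]{40})$", s))
    ])
    error_message = "Every module source must end in ?ref=<version tag> or ?ref=<full commit SHA>."
  }
}
```

Adding a source ending in ?ref=main to the list makes the validation fail with the error message above, which is the behaviour the policy gate is meant to provide before a plan is ever produced.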

Drift Detection as a Scheduled Job

Manual or console changes silently diverge production from the declared state. Add a nightly drift-detection workflow that runs terragrunt run-all plan --detailed-exitcode across all prod stacks and posts a Slack alert when exit code 2 (changes detected) is returned. This is the operational glue that makes GitOps for infrastructure actually enforceable.

# .github/workflows/drift.yml (excerpt)
on:
  schedule:
    - cron: '0 6 * * *'   # 06:00 UTC daily

jobs:
  drift-check:
    runs-on: ubuntu-latest
    # Uses the prod OIDC role; plan only, never apply.
    # Note: required reviewers on this environment also gate scheduled runs.
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Drift check (prod)
        id: plan
        working-directory: live/prod
        run: |
          terragrunt run-all plan --terragrunt-non-interactive \
            --detailed-exitcode 2>&1 | tee plan.out
          # $? would report tee's status; PIPESTATUS[0] is terragrunt's exit code
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
        continue-on-error: true
      - name: Alert on drift
        if: steps.plan.outputs.exit_code == '2'
        run: |
          curl -s -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H "Content-Type: application/json" \
            -d '{"text":"*Drift detected* in production infra. Review the plan output in GitHub Actions."}'

What You Have Built

Putting this all together, you have a platform where: every infrastructure change is a PR; the blast radius of any single apply is bounded by layer isolation; environments map to separate AWS accounts with separate IAM roles; OIDC eliminates static credentials from CI; module versions are immutable and explicitly promoted; policy-as-code rejects bad patterns before they reach plan; and drift surfaces automatically rather than silently. This is the operational baseline that mature engineering organizations expect from a senior DevOps engineer on day one.

Practical bootstrap order for a new platform: (1) Create three AWS accounts under an AWS Organization. (2) Create the S3 state bucket and DynamoDB lock table in each account with a bootstrap script. (3) Set up OIDC identity providers in each account. (4) Push the repo layout and configure GitHub Environments with reviewers. (5) Apply the network layer in dev — it is the foundation everything else reads from. (6) Layer eks and rds on top. (7) Promote to staging, then prod. The whole bootstrap takes an experienced engineer about two days; incremental changes after that are safe and auditable indefinitely.
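
Step (2) of the bootstrap order can itself be written as a small Terraform configuration applied once per account with local state. The sketch below is one possible shape, not the course's bootstrap script: the bucket and table names follow the root terragrunt.hcl, while the file location and the choice of KMS encryption and on-demand billing are assumptions.

```hcl
# Hypothetical location: bootstrap/state/main.tf, applied once in each account
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

data "aws_caller_identity" "current" {}

# Bucket name matches the pattern in live/terragrunt.hcl.
resource "aws_s3_bucket" "state" {
  bucket = "acme-terraform-state-${data.aws_caller_identity.current.account_id}"
}

# Versioning allows recovery of an earlier state file after a bad apply.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State files contain secrets, so block every form of public access.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend's DynamoDB locking requires a string hash key named LockID.
resource "aws_dynamodb_table" "locks" {
  name         = "acme-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping this configuration outside live/ avoids a circular dependency, since the stacks under live/ need the bucket and lock table to exist before their first init.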