Advanced Terraform & IaC Patterns

Project: A Multi-Env Terraform Platform


This capstone lesson assembles every pattern from the tutorial into a single, production-grade reference platform. You will design a repository layout, wire up remote state with cross-environment promotion guards, write a reusable module library, and attach a CI pipeline that runs fmt, validate, plan, and gated apply — automatically for dev, manually approved for staging and production. By the end you will have a blueprint you can fork and run at any company.

The Repository Layout

All infrastructure lives in one monorepo. Tooling (Terragrunt, a thin Makefile, and a shared module library) keeps it DRY. Under live/, the split is environment first, layer second. That is the inverse of a naive modules-first layout, which tends to leak cross-environment state references.

infra/
├── modules/                  # Reusable, versioned building blocks
│   ├── vpc/
│   ├── eks/
│   ├── rds/
│   └── iam-role/
├── live/                     # Environment roots (Terragrunt)
│   ├── terragrunt.hcl        # Root config: remote state bucket, provider defaults
│   ├── dev/
│   │   ├── env.hcl           # env = "dev", aws_account_id = "111122223333"
│   │   ├── network/
│   │   │   └── terragrunt.hcl
│   │   ├── eks/
│   │   │   └── terragrunt.hcl
│   │   └── rds/
│   │       └── terragrunt.hcl
│   ├── staging/
│   │   └── ... (same shape)
│   └── prod/
│       └── ... (same shape)
└── .github/
    └── workflows/
        ├── plan.yml          # PR: plan all changed stacks
        └── apply.yml         # Merge to main: apply dev; gate staging+prod

The root terragrunt.hcl centralises the S3 backend and DynamoDB lock table configuration, so no stack ever specifies a bucket name directly. Environments map to separate AWS accounts, never to workspaces, because an account boundary limits the blast radius of a leaked credential far more reliably than any convention inside a single account.

Root Terragrunt Config

The root file reads env.hcl from each environment directory, injects the account ID, and generates a unique state key per stack. Every child terragrunt.hcl inherits this via find_in_parent_folders().

# live/terragrunt.hcl
locals {
  env_vars   = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  env        = local.env_vars.locals.env
  account_id = local.env_vars.locals.aws_account_id
  region     = "us-east-1"
}

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "acme-terraform-state-${local.account_id}"
    # path_relative_to_include() already starts with the environment directory (e.g. "dev/eks")
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.region
    encrypt        = true
    dynamodb_table = "acme-terraform-locks"
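    # State files can contain secrets. Terragrunt blocks public access on buckets it
    # creates unless skip_bucket_public_access_blocking is set; keep that default.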
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.region}"
  assume_role {
    role_arn = "arn:aws:iam::${local.account_id}:role/TerraformCIRole"
  }
  default_tags {
    tags = {
      Environment = "${local.env}"
      ManagedBy   = "terraform"
      Repo        = "github.com/acme/infra"
    }
  }
}
EOF
}

Never use Terraform workspaces for environment separation in production. Workspaces share one backend configuration and one provider configuration, so a misconfigured terraform.workspace reference, or simply the wrong selected workspace, can apply dev code to prod state without raising an error. Separate AWS accounts with separate IAM roles are a far more audit-friendly model.

The Multi-Env Architecture

Figure: three-account promotion pipeline. Dev auto-applies; staging and production require human approval gates in GitHub Actions.

The CI/CD Pipeline

The plan.yml workflow triggers on every pull request and runs terragrunt run-all plan scoped to only the stacks whose files changed, using git diff path filtering. The apply.yml workflow triggers on merge to main. Dev applies immediately. Staging and prod each have a GitHub Actions environment with required reviewers configured in the repo settings, so the jobs that touch those accounts cannot start until a named human approves the run.
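
The changed-stack filtering that plan.yml relies on is not shown in this lesson. One way to sketch it, assuming the live/<env>/<layer>/ layout above, is a small shell helper that maps changed file paths to stack directories. The function names changed_stacks and plan_changed_stacks are hypothetical. The sketch deliberately ignores edits to shared files such as env.hcl or the root terragrunt.hcl, which would call for a wider plan.

```shell
#!/bin/sh
# changed_stacks (hypothetical helper): reads changed file paths on stdin and
# prints the unique live/<env>/<layer> stack directories that contain them.
# Assumption: stacks sit exactly two levels below live/, as in the layout above.
changed_stacks() {
  awk -F/ '$1 == "live" && NF >= 4 { print $1 "/" $2 "/" $3 }' | sort -u
}

# plan_changed_stacks (hypothetical helper): plans each changed stack relative
# to a base ref (default origin/main). Defined only; call it from CI.
plan_changed_stacks() {
  base="${1:-origin/main}"
  git diff --name-only "$base"...HEAD | changed_stacks | while read -r stack; do
    # A stack deleted in the PR has no directory left to plan.
    [ -d "$stack" ] || continue
    (cd "$stack" && terragrunt plan --terragrunt-non-interactive) || exit 1
  done
}
```

In plan.yml this could run as plan_changed_stacks origin/main from the repository root, provided the checkout step fetched enough history for the diff against the base branch to resolve.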

# .github/workflows/apply.yml
name: Terraform Apply

on:
  push:
    branches: [main]
    paths: ['live/**']

env:
  TG_VERSION: "0.55.1"
  TF_VERSION: "1.8.5"

jobs:
  apply-dev:
    name: Apply — Dev (auto)
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC — no long-lived AWS keys in secrets
      contents: read
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111122223333:role/TerraformCIRole
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
          terraform_wrapper: false   # the wrapper script can garble output that Terragrunt parses

      - name: Install Terragrunt
        run: |
          curl -sL https://github.com/gruntwork-io/terragrunt/releases/download/v${{ env.TG_VERSION }}/terragrunt_linux_amd64 \
            -o /usr/local/bin/terragrunt && chmod +x /usr/local/bin/terragrunt

      - name: Apply dev
        working-directory: live/dev
        run: terragrunt run-all apply --terragrunt-non-interactive -auto-approve

  apply-staging:
    name: Apply — Staging (gated)
    needs: apply-dev
    runs-on: ubuntu-latest
    environment: staging          # <-- requires reviewer in repo Settings > Environments
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::444455556666:role/TerraformCIRole
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
          terraform_wrapper: false   # the wrapper script can garble output that Terragrunt parses
      - name: Install Terragrunt
        run: |
          curl -sL https://github.com/gruntwork-io/terragrunt/releases/download/v${{ env.TG_VERSION }}/terragrunt_linux_amd64 \
            -o /usr/local/bin/terragrunt && chmod +x /usr/local/bin/terragrunt
      - name: Apply staging
        working-directory: live/staging
        run: terragrunt run-all apply --terragrunt-non-interactive -auto-approve

  apply-prod:
    name: Apply — Production (2-person gated)
    needs: apply-staging
    runs-on: ubuntu-latest
    environment: production        # requires 2 reviewers + protection rules
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::777788889999:role/TerraformCIRole
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
          terraform_wrapper: false   # the wrapper script can garble output that Terragrunt parses
      - name: Install Terragrunt
        run: |
          curl -sL https://github.com/gruntwork-io/terragrunt/releases/download/v${{ env.TG_VERSION }}/terragrunt_linux_amd64 \
            -o /usr/local/bin/terragrunt && chmod +x /usr/local/bin/terragrunt
      - name: Apply production
        working-directory: live/prod
        run: terragrunt run-all apply --terragrunt-non-interactive -auto-approve

Use OIDC, not long-lived access keys. The id-token: write permission plus configure-aws-credentials with role-to-assume means GitHub Actions assumes an IAM role via short-lived tokens, so there are no AWS keys in Secrets to rotate or leak. AWS, Azure, and Google Cloud all support OIDC federation with GitHub Actions.
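
For the role assumption to work, each TerraformCIRole needs a trust policy that accepts GitHub's OIDC tokens. The lesson does not show that policy, so the following is a minimal sketch of one, assuming the GitHub OIDC identity provider already exists in the account. The helper name oidc_trust_policy is hypothetical. The sub condition shown matches jobs that declare a GitHub Environment, as the staging and production jobs do.

```shell
#!/bin/sh
# oidc_trust_policy (hypothetical helper): prints an IAM trust policy that lets
# GitHub Actions jobs from one repository and one GitHub Environment assume a role.
# Arguments: AWS account ID, "owner/repo", GitHub Environment name.
oidc_trust_policy() {
  account_id="$1"; repo="$2"; gh_env="$3"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account_id}:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:${repo}:environment:${gh_env}"
        }
      }
    }
  ]
}
EOF
}
```

For example, oidc_trust_policy 444455556666 acme/infra staging prints a policy for the staging role. The dev job above declares no environment, so its token carries a branch-based subject (repo:acme/infra:ref:refs/heads/main) and the dev role's condition would need to match that form instead.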

Child Stack: EKS Layer (dev example)

Each child terragrunt.hcl is tiny — it only specifies the module source, version pin, and environment-specific inputs. Cross-layer data flows through dependency blocks that call terraform output on the network stack's remote state, keeping layers decoupled without hard-coded resource IDs.

# live/dev/eks/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

locals {
  env_vars = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  env      = local.env_vars.locals.env
}

dependency "network" {
  config_path = "../network"
  mock_outputs = {                         # lets plan run without network already applied
    vpc_id          = "vpc-mock"
    private_subnets = ["subnet-mock-a", "subnet-mock-b"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

terraform {
  source = "git::https://github.com/acme/infra.git//modules/eks?ref=v1.4.2"
}

inputs = {
  env                = local.env
  cluster_name       = "acme-${local.env}"
  vpc_id             = dependency.network.outputs.vpc_id
  private_subnets    = dependency.network.outputs.private_subnets
  node_instance_type = local.env == "prod" ? "m6i.2xlarge" : "t3.medium"
  desired_nodes      = local.env == "prod" ? 6 : 2
}

Module Versioning and Promotion

Modules are pinned by Git tag (?ref=v1.4.2). The promotion workflow is: update the tag in dev → run plan → apply dev → PR review → update staging tag → apply staging → 2-person sign-off → update prod tag → apply prod. This means dev always runs the newest module version, staging follows within hours, and production follows within days — with a human verifying the plan diff at each gate.
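
Each "update the tag" step in that workflow is a one-line change to a stack's terragrunt.hcl. As an illustration, the edit could be scripted roughly as follows. The helper name bump_module_ref is hypothetical, and the sketch assumes tags of the form vMAJOR.MINOR.PATCH.

```shell
#!/bin/sh
# bump_module_ref (hypothetical helper): rewrites the ?ref=vX.Y.Z pin in one
# terragrunt.hcl to a new tag, leaving every other line untouched.
# Arguments: path to the file, new tag (must look like v1.5.0).
bump_module_ref() {
  file="$1"; new_tag="$2"
  # Refuse anything that is not a plain version tag, e.g. a branch name.
  printf '%s\n' "$new_tag" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$' || return 1
  tmp="$(mktemp)" || return 1
  sed -E 's|([?]ref=)v[0-9]+[.][0-9]+[.][0-9]+|\1'"$new_tag"'|' "$file" > "$tmp" \
    || { rm -f "$tmp"; return 1; }
  # Write back through cat so the original file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Running bump_module_ref live/staging/eks/terragrunt.hcl v1.5.0 and committing the result produces the small, reviewable diff that the staging promotion PR is meant to contain.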

Never pin to a branch or ?ref=main in a module source. If the module branch advances while a plan is waiting for approval, the apply executes different code than the plan showed. Always pin to an immutable tag or commit SHA. Enforce this with a Conftest OPA policy (Lesson 8) that rejects any source not matching ?ref=v*.
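
The Conftest policy itself belongs to Lesson 8. As a lighter-weight complement (a plain shell check, not OPA), the same rule can be approximated with grep before the policy stage runs. The helper names below are hypothetical, and the sketch accepts either a vMAJOR.MINOR.PATCH tag or a full 40-character commit SHA.

```shell
#!/bin/sh
# is_pinned_source (hypothetical helper): succeeds only if a module source ends
# in ?ref=vMAJOR.MINOR.PATCH or ?ref=<40-character commit SHA>.
is_pinned_source() {
  printf '%s\n' "$1" | grep -Eq '[?]ref=(v[0-9]+\.[0-9]+\.[0-9]+|[0-9a-f]{40})$'
}

# check_stack_pins (hypothetical helper): scans the terragrunt.hcl files passed
# as arguments and fails if any git:: source line is not pinned.
check_stack_pins() {
  status=0
  for file in "$@"; do
    # Extract the quoted value of every `source = "git::..."` line.
    sources="$(sed -n -E 's/^[[:space:]]*source[[:space:]]*=[[:space:]]*"(git::[^"]*)".*$/\1/p' "$file")"
    while read -r src; do
      [ -n "$src" ] || continue
      if ! is_pinned_source "$src"; then
        echo "unpinned module source in $file: $src" >&2
        status=1
      fi
    done <<EOF
$sources
EOF
  done
  return "$status"
}
```

A CI step could call check_stack_pins live/*/*/terragrunt.hcl and fail the pull request on a non-zero exit status.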

Drift Detection as a Scheduled Job

Manual or console changes silently diverge production from the declared state. Add a nightly drift-detection workflow that runs terragrunt run-all plan --detailed-exitcode across all prod stacks and posts a Slack alert when exit code 2 (changes detected) is returned. This is the operational glue that makes GitOps for infrastructure actually enforceable.

# .github/workflows/drift.yml (excerpt)
on:
  schedule:
    - cron: '0 6 * * *'   # 06:00 UTC daily

jobs:
  drift-check:
    runs-on: ubuntu-latest
    environment: production   # prod OIDC role, plan only; any required reviewers on this environment gate this job too
    steps:
      - uses: actions/checkout@v4
      - name: Drift check — prod
        id: plan
        working-directory: live/prod
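        run: |
          # $? after a pipeline reports tee's status, so read terragrunt's from PIPESTATUS.
          set +e
          terragrunt run-all plan --terragrunt-non-interactive \
            --detailed-exitcode 2>&1 | tee plan.out
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
        continue-on-error: true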

      - name: Alert on drift
        if: steps.plan.outputs.exit_code == '2'
        run: |
          curl -s -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H "Content-Type: application/json" \
            -d '{"text":"*Drift detected* in production infra. Review the plan output in GitHub Actions."}'
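
The workflow above alerts only on exit code 2. With -detailed-exitcode, terraform plan returns 0 for no changes, 1 for an error, and 2 for a successful plan with changes, so a failed drift check would otherwise pass silently. A minimal sketch of handling all three cases follows; the helper names classify_plan_exit and alert_text are hypothetical.

```shell
#!/bin/sh
# classify_plan_exit (hypothetical helper): maps the exit status of
# `terraform plan -detailed-exitcode` to a label.
#   0 = succeeded, no changes; 1 = error; 2 = succeeded, changes present.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# alert_text (hypothetical helper): prints the message to post for a given exit
# status, or nothing when the plan was clean.
alert_text() {
  case "$(classify_plan_exit "$1")" in
    drift) echo "Drift detected in production infra." ;;
    error) echo "Drift check failed to run; production state is unverified." ;;
    clean) : ;;
  esac
}
```

The alert step could then post whenever alert_text returns a non-empty string, rather than comparing the exit code to 2 only.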

What You Have Built

Putting this all together, you have a platform where: every infrastructure change is a PR; the blast radius of any single apply is bounded by layer isolation; environments map to separate AWS accounts with separate IAM roles; OIDC eliminates static credentials from CI; module versions are immutable and explicitly promoted; policy-as-code rejects bad patterns before they reach plan; and drift surfaces automatically rather than silently. This is the operational baseline that mature engineering organizations expect from a senior DevOps engineer on day one.

Practical bootstrap order for a new platform: (1) Create three AWS accounts under an AWS Organization. (2) Create the S3 state bucket and DynamoDB lock table in each account with a bootstrap script. (3) Set up OIDC identity providers in each account. (4) Push the repo layout and configure GitHub Environments with reviewers. (5) Apply the network layer in dev — it is the foundation everything else reads from. (6) Layer eks and rds on top. (7) Promote to staging, then prod. The whole bootstrap takes an experienced engineer about two days; incremental changes after that are safe and auditable indefinitely.
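
Step (2) is the only step that must happen outside Terraform. The bootstrap script is not shown in this lesson, so the following is a sketch of what it might contain, using the bucket and table names from the root terragrunt.hcl. The helper name state_bootstrap_commands is hypothetical, the region is assumed to be us-east-1, and enabling bucket versioning is an added assumption rather than something the lesson prescribes. The function only prints the AWS CLI commands so they can be reviewed before anything runs.

```shell
#!/bin/sh
# state_bootstrap_commands (hypothetical helper): prints, without running them,
# the AWS CLI commands that create the state bucket and lock table for one account.
# Argument: the 12-digit AWS account ID.
state_bootstrap_commands() {
  account_id="$1"
  bucket="acme-terraform-state-${account_id}"
  table="acme-terraform-locks"
  # LockID is the hash key name the Terraform S3 backend expects on the lock table.
  cat <<EOF
aws s3api create-bucket --bucket ${bucket} --region us-east-1
aws s3api put-bucket-versioning --bucket ${bucket} --versioning-configuration Status=Enabled
aws s3api put-public-access-block --bucket ${bucket} --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
aws dynamodb create-table --table-name ${table} --attribute-definitions AttributeName=LockID,AttributeType=S --key-schema AttributeName=LockID,KeyType=HASH --billing-mode PAY_PER_REQUEST
EOF
}
```

With credentials for the target account loaded, state_bootstrap_commands 111122223333 | sh would execute the printed commands once they have been reviewed.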