This capstone lesson assembles the patterns from the tutorial into a single, production-grade reference platform. You will design a repository layout, wire up remote state with cross-environment promotion guards, write a reusable module library, and attach a CI pipeline that runs fmt, validate, plan, and a gated apply: automatic for dev, manually approved for staging and production. By the end you will have a blueprint you can fork and adapt to your own organization.
The Repository Layout
All infrastructure lives in one monorepo. Tooling (Terragrunt, a thin Makefile, and a shared module library) keeps it DRY. The top-level split is environment first, layer second (live/prod/network, live/prod/eks, and so on). This is the inverse of a naive modules-first layout, which tends to leak cross-environment state references.
The root terragrunt.hcl centralises the S3 backend and DynamoDB lock table so no stack ever specifies a bucket name directly. Environments map to separate AWS accounts, never to workspaces, because the account boundary is the strongest blast-radius guarantee available: a credential leaked from one account cannot touch another account unless a cross-account role explicitly allows it.
Root Terragrunt Config
The root file reads env.hcl from each environment directory, injects the account ID, and generates a unique state key per stack. Every child terragrunt.hcl inherits this through an include block whose path is find_in_parent_folders().
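A root config along these lines could look like the sketch below. It is illustrative rather than the tutorial's exact file: the bucket naming scheme, the lock table name, and the three local names read from env.hcl are assumptions, and the root file is assumed to sit at live/terragrunt.hcl.

```hcl
# live/terragrunt.hcl (root). Sketch only: bucket and table names are hypothetical.

locals {
  # Evaluated in the context of the including stack: walks up from
  # e.g. live/dev/eks to the nearest env.hcl (live/dev/env.hcl).
  env = read_terragrunt_config(find_in_parent_folders("env.hcl"))

  account_id = local.env.locals.account_id
  aws_region = local.env.locals.aws_region
}

remote_state {
  backend = "s3"

  # Writes backend.tf into each stack, so no stack declares a backend itself.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "tfstate-${local.account_id}-${local.aws_region}" # hypothetical naming scheme
    key            = "${path_relative_to_include()}/terraform.tfstate" # e.g. dev/eks/terraform.tfstate
    region         = local.aws_region
    encrypt        = true
    dynamodb_table = "terraform-locks" # hypothetical table name
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "aws" {
      region              = "${local.aws_region}"
      # The provider refuses to run if the active credentials belong to another account.
      allowed_account_ids = ["${local.account_id}"]
    }
  EOF
}
```

The matching per-environment file stays small. The account ID here is a placeholder:

```hcl
# live/dev/env.hcl
locals {
  environment = "dev"
  account_id  = "111111111111"
  aws_region  = "eu-west-1"
}
```

The allowed_account_ids argument is one inexpensive promotion guard: a plan for dev that accidentally runs with prod credentials fails before it reads any state.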
Never use Terraform workspaces for environment separation in production. All workspaces of a configuration share one backend and one provider configuration, so a misconfigured terraform.workspace reference can silently apply dev code to prod state. Using separate AWS accounts with separate IAM roles is the only audit-safe model.
The Multi-Env Architecture
Three-account promotion pipeline: dev auto-applies; staging and production require human approval gates in GitHub Actions.
The CI/CD Pipeline
The plan.yml workflow triggers on every pull request and runs terragrunt run-all plan scoped to only the stacks whose files changed, using git diff path filtering. The apply.yml workflow triggers on merge to main. Dev applies immediately; staging and prod each have a GitHub Actions environment with required reviewers configured in the repo settings — Terraform never touches those accounts without a named human approving the plan output.
Use OIDC, not long-lived access keys. With the id-token: write permission and configure-aws-credentials given a role-to-assume, GitHub Actions exchanges a short-lived OIDC token for temporary credentials on an IAM role. There are no AWS keys in Secrets to rotate or leak. As of 2024, AWS, Azure, and Google Cloud all support OIDC federation with GitHub Actions.
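The AWS side of that trust relationship can itself be declared in Terraform. The following is a minimal sketch for one account, assuming a hypothetical repository named example-org/infra-live and a hypothetical role name. The sub condition is what ties the role to a specific GitHub environment.

```hcl
# Sketch of the IAM side of GitHub OIDC for the prod account.
# "example-org/infra-live" and the role name are hypothetical.

resource "aws_iam_openid_connect_provider" "github" {
  url            = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]
  # Depending on your AWS provider version, thumbprint_list may also be required.
}

data "aws_iam_policy_document" "github_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    # The token must have been issued for AWS STS.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only jobs running in the "production" GitHub environment of this repo may assume the role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infra-live:environment:production"]
    }
  }
}

resource "aws_iam_role" "github_actions" {
  name                 = "github-actions-terraform" # hypothetical
  assume_role_policy   = data.aws_iam_policy_document.github_trust.json
  max_session_duration = 3600
}
```

The ARN of this role is what the workflow passes as role-to-assume. Because the subject claim names the environment, a job that has not passed that environment's approval gate cannot obtain prod credentials even if the workflow file is edited.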
Child Stack: EKS Layer (dev example)
Each child terragrunt.hcl is tiny — it only specifies the module source, version pin, and environment-specific inputs. Cross-layer data flows through dependency blocks that call terraform output on the network stack's remote state, keeping layers decoupled without hard-coded resource IDs.
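As an illustration of how small such a child file can be, here is a sketch for the dev EKS stack. The module repository URL, the output names on the network stack, and the input names are assumptions, not the tutorial's actual module interface.

```hcl
# live/dev/eks/terragrunt.hcl. Sketch: repo URL, outputs, and inputs are hypothetical.

include "root" {
  path = find_in_parent_folders()
}

terraform {
  # Version pin: promoted per environment by changing the tag.
  source = "git::https://github.com/example-org/terraform-modules.git//eks?ref=v1.4.2"
}

dependency "network" {
  config_path = "../network"

  # Placeholder values so validate and plan work before the network stack exists.
  mock_outputs = {
    vpc_id             = "vpc-00000000"
    private_subnet_ids = ["subnet-00000000", "subnet-11111111"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  cluster_name = "dev-eks"
  vpc_id       = dependency.network.outputs.vpc_id
  subnet_ids   = dependency.network.outputs.private_subnet_ids

  # Environment-specific sizing
  node_instance_types = ["t3.large"]
  node_desired_size   = 2
}
```

Restricting the mocks to validate and plan means an apply always uses the network stack's real outputs and fails loudly if they are missing.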
Modules are pinned by Git tag (?ref=v1.4.2). The promotion workflow is: update the tag in dev → run plan → apply dev → PR review → update staging tag → apply staging → 2-person sign-off → update prod tag → apply prod. This means dev always runs the newest module version, staging follows within hours, and production follows within days — with a human verifying the plan diff at each gate.
Never pin to a branch or ?ref=main in a module source. If the module branch advances while a plan is waiting for approval, the apply executes different code than the plan showed. Always pin to a tag that is never moved once published, or to a commit SHA. Enforce this with a Conftest OPA policy (Lesson 8) that rejects any source whose ref does not match v*; extend the pattern if you also pin by commit SHA.
Drift Detection as a Scheduled Job
Manual changes made through the console or CLI cause production to diverge silently from the declared state. Add a nightly drift-detection workflow that runs terragrunt run-all plan --detailed-exitcode across all prod stacks and posts a Slack alert when the plan returns exit code 2 (changes detected). This is the operational glue that makes GitOps for infrastructure actually enforceable.
# .github/workflows/drift.yml (excerpt)
on:
  schedule:
    - cron: '0 6 * * *' # 06:00 UTC daily
jobs:
  drift-check:
    runs-on: ubuntu-latest
    # Uses the prod OIDC role; read-only plan only.
    # Note: required reviewers on this environment also gate this scheduled job.
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Drift check (prod)
        id: plan
        working-directory: live/prod
        run: |
          # Record terragrunt's exit code, not tee's: 0 = no changes, 1 = error, 2 = drift.
          set +e
          terragrunt run-all plan --terragrunt-non-interactive \
            --detailed-exitcode 2>&1 | tee plan.out
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
        continue-on-error: true
      - name: Alert on drift
        if: steps.plan.outputs.exit_code == '2'
        run: |
          curl -s -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H "Content-Type: application/json" \
            -d '{"text":"*Drift detected* in production infra. Review the plan output in GitHub Actions."}'
What You Have Built
Putting this all together, you have a platform where: every infrastructure change is a PR; the blast radius of any single apply is bounded by layer isolation; environments map to separate AWS accounts with separate IAM roles; OIDC eliminates static credentials from CI; module versions are immutable and explicitly promoted; policy-as-code rejects bad patterns before they reach plan; and drift surfaces automatically rather than silently. This is the operational baseline that mature engineering organizations expect from a senior DevOps engineer on day one.
Practical bootstrap order for a new platform: (1) Create three AWS accounts under an AWS Organization. (2) Create the S3 state bucket and DynamoDB lock table in each account with a bootstrap script. (3) Set up OIDC identity providers in each account. (4) Push the repo layout and configure GitHub Environments with reviewers. (5) Apply the network layer in dev, since it is the foundation everything else reads from. (6) Layer eks and rds on top. (7) Promote to staging, then prod. The whole bootstrap takes an experienced engineer about two days; every incremental change after that goes through the same reviewed, auditable pipeline.
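Step (2) can be a shell script, but one common alternative is a tiny Terraform configuration applied once per account with local state. The sketch below takes that route. The file path, variable names, and the default table name are assumptions, and the names must match whatever the root Terragrunt config expects.

```hcl
# bootstrap/main.tf. Sketch: apply once per account with local state; names are hypothetical.

variable "state_bucket_name" {
  type = string
}

variable "lock_table_name" {
  type    = string
  default = "terraform-locks"
}

resource "aws_s3_bucket" "state" {
  bucket = var.state_bucket_name
}

# Keep every previous state version so a bad apply can be rolled back.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "locks" {
  name         = var.lock_table_name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # attribute name the S3 backend uses for its lock entries

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping the bootstrap outside the Terragrunt tree avoids the chicken-and-egg problem of storing the state bucket's own state in a bucket that does not exist yet.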