This capstone lesson combines every concept from the tutorial (HCL syntax, providers, variables, state, remote backends, data sources, meta-arguments, and modules) into a single end-to-end, production-grade project. You will provision a three-tier web stack on AWS: a custom VPC with public and private subnets across multiple availability zones, an Auto Scaling Group of EC2 instances behind an Application Load Balancer, and remote state stored in S3 with DynamoDB locking. Variations of this layout are widely used by platform-engineering teams for foundational cloud workloads.
Project Directory Structure
Organize the project as a root module that calls two reusable child modules: modules/network for the VPC and subnets, and modules/compute for the load balancer, Auto Scaling Group, and security groups. Remote state is bootstrapped separately: the main stack never manages the S3 bucket and DynamoDB table that hold its own state file.
Bootstrap vs. managed state: The backend-bootstrap/ directory is a small, separate Terraform configuration that uses local state and is applied once per environment. It creates the S3 bucket (with versioning and server-side encryption) and the DynamoDB lock table. Because those resources hold the main stack's state, the main stack must never manage them: a terraform destroy of the main stack would also delete the state that records it, and recovery would mean re-importing resources by hand.
Step 1 — Bootstrap Remote State
Before the main stack can use a remote backend, the backend infrastructure must exist. This is a one-time operation per environment. In larger organizations this step often runs in a separate "bootstrap" pipeline that requires explicit approval (for example, from an SRE) before it can run.
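The bootstrap configuration described above might look roughly like the following. This is a minimal sketch assuming AWS provider 5.x, not the tutorial's own file. The bucket name and region are taken from the terraform init command shown later in this lesson; the lock table name acme-terraform-locks is a hypothetical placeholder.

```hcl
# backend-bootstrap/main.tf (sketch; uses local state, applied once per environment)
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "state" {
  bucket = "acme-terraform-state-prod"

  # Guard against accidental deletion of the bucket that holds the state.
  lifecycle {
    prevent_destroy = true
  }
}

# Versioning lets you recover an earlier state file after a bad write.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State files can contain secrets, so block every form of public access.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend's DynamoDB locking expects a string hash key named "LockID".
resource "aws_dynamodb_table" "locks" {
  name         = "acme-terraform-locks" # hypothetical name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Run terraform init and terraform apply inside backend-bootstrap/ once, and keep its small local state file somewhere safe, for example committed to a restricted repository.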
Step 2 — Root Module: Versions, Backend, and Locals
The versions.tf file pins every provider to a minor version range using the pessimistic constraint operator (~>). Unpinned providers are one of the most common sources of surprise infrastructure drift in shared repositories — a terraform init -upgrade on a colleague's machine can pull a provider with a breaking change and corrupt real infrastructure if the plan is auto-applied.
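As an illustration, versions.tf and the locals referenced by the root module could be sketched as follows. The version numbers, the project and environment variables, the lock table name, and the rule that derives az_count are assumptions made for this example; only the names local.name_prefix, local.az_count, and local.common_tags come from the root module shown later.

```hcl
# versions.tf (sketch)
terraform {
  required_version = ">= 1.5.0, < 2.0.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # "~> 5.40" accepts 5.40 and later 5.x releases, but never 6.0.
      # Write "~> 5.40.0" to accept patch releases of 5.40 only.
      version = "~> 5.40"
    }
  }

  # Partial backend configuration: bucket, key, and region are supplied
  # at init time through -backend-config flags.
  backend "s3" {
    encrypt        = true
    dynamodb_table = "acme-terraform-locks" # hypothetical name
  }
}

# Hypothetical inputs used only to build the locals below.
variable "project" {
  type    = string
  default = "web-stack"
}

variable "environment" {
  type    = string
  default = "production"
}

locals {
  name_prefix = "${var.project}-${var.environment}"

  # Assumed rule: three AZs in production, two everywhere else.
  az_count = var.environment == "production" ? 3 : 2

  common_tags = {
    Project     = var.project
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```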
Step 3 — The Network Module
The network module builds a two-tier VPC: public subnets host the ALB and NAT Gateways, private subnets host the EC2 instances. Each AZ gets one public and one private subnet. Using for_each over a slice of the AZ list makes the module AZ-count-agnostic: it works for a dev stack with two AZs and a production stack with three, driven entirely by a variable.
Three-AZ VPC layout: ALB nodes and NAT Gateways live in public subnets; EC2 instances (ASG) live in private subnets and reach the internet only via NAT.
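The for_each-over-a-slice pattern can be sketched as below. The input and output names match the root module's call to modules/network; the CIDR arithmetic assumes vpc_cidr is a /16, and the internet gateway, NAT Gateways, and route tables are left out to keep the example short.

```hcl
# modules/network/main.tf (sketch)
variable "name_prefix" { type = string }
variable "vpc_cidr" { type = string }
variable "az_count" { type = number }

variable "tags" {
  type    = map(string)
  default = {}
}

data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  # Map of AZ name => index for the first az_count zones,
  # for example { "us-east-1a" = 0, "us-east-1b" = 1 }.
  azs = {
    for i, az in slice(data.aws_availability_zones.available.names, 0, var.az_count) :
    az => i
  }
}

resource "aws_vpc" "this" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true
  tags                 = merge(var.tags, { Name = "${var.name_prefix}-vpc" })
}

# One public subnet per AZ: /24 blocks numbered 0, 1, 2, ...
resource "aws_subnet" "public" {
  for_each = local.azs

  vpc_id                  = aws_vpc.this.id
  availability_zone       = each.key
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, each.value)
  map_public_ip_on_launch = true
  tags                    = merge(var.tags, { Name = "${var.name_prefix}-public-${each.key}" })
}

# One private subnet per AZ: /24 blocks numbered 100, 101, 102, ...
resource "aws_subnet" "private" {
  for_each = local.azs

  vpc_id            = aws_vpc.this.id
  availability_zone = each.key
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, each.value + 100)
  tags              = merge(var.tags, { Name = "${var.name_prefix}-private-${each.key}" })
}

output "vpc_id" {
  value = aws_vpc.this.id
}

output "public_subnet_ids" {
  value = values(aws_subnet.public)[*].id
}

output "private_subnet_ids" {
  value = values(aws_subnet.private)[*].id
}
```

Keying the subnets by AZ name rather than by list position means that changing az_count from 2 to 3 adds subnets for the new zone without touching the existing ones.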
Step 4 — The Compute Module (ALB + ASG)
The compute module wires together the Application Load Balancer in the public subnets, a launch template referencing the latest Amazon Linux 2023 AMI via a data source, and an Auto Scaling Group that spans the private subnets. The security group model is explicit and minimal: the ALB accepts port 443 from the internet, and EC2 instances accept port 443 only from the ALB's security group — never from 0.0.0.0/0.
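The security group model can be sketched with standalone rule resources, assuming AWS provider 5.x. The resource names are illustrative. Separate rule resources are used here, rather than inline ingress and egress blocks, because the two groups reference each other and inline rules would create a dependency cycle.

```hcl
# modules/compute/security.tf (sketch)
variable "name_prefix" { type = string }
variable "vpc_id" { type = string }

resource "aws_security_group" "alb" {
  name_prefix = "${var.name_prefix}-alb-"
  description = "Public HTTPS ingress to the ALB"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "app" {
  name_prefix = "${var.name_prefix}-app-"
  description = "Instances reachable only from the ALB"
  vpc_id      = var.vpc_id
}

# Internet -> ALB on 443.
resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 443
  to_port           = 443
  ip_protocol       = "tcp"
}

# ALB -> instances on 443, and nowhere else.
resource "aws_vpc_security_group_egress_rule" "alb_to_app" {
  security_group_id            = aws_security_group.alb.id
  referenced_security_group_id = aws_security_group.app.id
  from_port                    = 443
  to_port                      = 443
  ip_protocol                  = "tcp"
}

# Instances accept 443 only when the source is the ALB's security group.
resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.alb.id
  from_port                    = 443
  to_port                      = 443
  ip_protocol                  = "tcp"
}

# Instances may reach the internet (through NAT) for packages and APIs.
resource "aws_vpc_security_group_egress_rule" "app_outbound" {
  security_group_id = aws_security_group.app.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "-1"
}
```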
Production pitfall (IMDSv1 on EC2): Omitting http_tokens = "required" from the launch template's metadata_options block leaves IMDSv1 enabled. IMDSv1 answers plain, unauthenticated GET requests from any process on the instance, so a server-side request forgery (SSRF) vulnerability in application code can be used to read the instance role's IAM credentials. The Capital One breach (2019) used SSRF against the instance metadata service to extract IAM credentials. Always enforce IMDSv2 in every launch template. Amazon Linux 2023 AMIs default to IMDSv2 and AWS offers an account-level default setting, but older AMIs and accounts that have not changed the default may still allow IMDSv1, so set it explicitly.
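A launch template that enforces IMDSv2 might look like the sketch below. The AMI name filter is an assumption about Amazon's current naming for Amazon Linux 2023 images, and instance_security_group_ids is a hypothetical input added so the fragment stands alone; inside the real module it would reference the instance security group directly.

```hcl
# modules/compute/launch_template.tf (sketch)
variable "name_prefix" { type = string }
variable "instance_type" { type = string }
variable "instance_profile_name" { type = string }

# Hypothetical input for this standalone example.
variable "instance_security_group_ids" {
  type = list(string)
}

# Latest Amazon Linux 2023 x86_64 image published by Amazon.
data "aws_ami" "al2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-2023.*-x86_64"] # assumed naming pattern
  }
}

resource "aws_launch_template" "app" {
  name_prefix            = "${var.name_prefix}-lt-"
  image_id               = data.aws_ami.al2023.id
  instance_type          = var.instance_type
  vpc_security_group_ids = var.instance_security_group_ids

  iam_instance_profile {
    name = var.instance_profile_name
  }

  metadata_options {
    http_endpoint = "enabled"
    # "required" turns off IMDSv1: every metadata request needs a session token.
    http_tokens = "required"
    # A hop limit of 1 keeps the token response from travelling past the instance.
    http_put_response_hop_limit = 1
  }
}
```

Note that a hop limit of 1 can block containers that reach the metadata service through a bridge network; raising it to 2 is a common adjustment in that case.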
Step 5 — Root Module and Deployment Workflow
The root module wires the two child modules together, passing the network module's outputs into the compute module's inputs. It also emits the outputs consumed by later CI pipeline steps: the ALB DNS name (used for the smoke test and as the target of a DNS record), the VPC ID, and the Auto Scaling Group name.
# main.tf (root module)
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = local.common_tags
  }
}

module "network" {
  source = "./modules/network"

  name_prefix = local.name_prefix
  vpc_cidr    = var.vpc_cidr
  az_count    = local.az_count
  tags        = local.common_tags
}

module "compute" {
  source = "./modules/compute"

  name_prefix           = local.name_prefix
  vpc_id                = module.network.vpc_id
  public_subnet_ids     = module.network.public_subnet_ids
  private_subnet_ids    = module.network.private_subnet_ids
  instance_type         = var.instance_type
  acm_certificate_arn   = var.acm_certificate_arn
  instance_profile_name = var.instance_profile_name
  access_log_bucket     = var.access_log_bucket
  asg_min               = var.asg_min
  asg_max               = var.asg_max
  asg_desired           = var.asg_desired
  tags                  = local.common_tags
}

# outputs.tf
output "alb_dns_name" {
  description = "DNS name of the Application Load Balancer."
  value       = module.compute.alb_dns_name
}

output "vpc_id" {
  description = "VPC ID."
  value       = module.network.vpc_id
}

output "asg_name" {
  description = "Name of the Auto Scaling Group."
  value       = module.compute.asg_name
}

# ---
# Deployment commands (run by CI after plan is approved):

# Init (downloads providers, configures S3 backend):
terraform init \
  -backend-config="bucket=acme-terraform-state-prod" \
  -backend-config="key=web-stack/production/terraform.tfstate" \
  -backend-config="region=us-east-1"

# Plan (output saved as artifact for review gate):
terraform plan -var-file=envs/production.tfvars -out=tfplan

# Apply (uses the saved plan, so no re-plan surprises):
terraform apply tfplan

# Post-apply smoke test.
# -k skips certificate verification: the ACM certificate covers the service's
# own domain, not the ALB's *.elb.amazonaws.com name used here.
# The rollback repeats -var-file because destroy needs the same input variables.
ALB=$(terraform output -raw alb_dns_name)
curl -sfk --retry 5 --retry-delay 10 "https://${ALB}/health" \
  || { echo "Smoke test failed; rolling back"; terraform destroy -auto-approve -var-file=envs/production.tfvars -target=module.compute; exit 1; }

Always apply a saved plan file in CI. Running terraform plan without -out and then a bare terraform apply means Terraform creates a fresh plan at apply time. Between human review and apply, another pipeline or a manual change could alter the state or the infrastructure, producing an apply that does not match what was reviewed. Saving the plan with -out=tfplan and applying that exact artifact guarantees that only the reviewed changes run; if the state has changed in the meantime, Terraform rejects the saved plan as stale instead of applying it. HCP Terraform (formerly Terraform Cloud) follows the same model: the apply phase of a run executes the plan produced and reviewed in that run.
Production Failure Modes to Know
After running dozens of Terraform-managed rollouts you will encounter predictable failure patterns. Knowing them in advance turns a midnight incident into a ten-minute fix:
State lock not released after an interrupted apply: Run terraform force-unlock <LOCK_ID>; the lock ID is shown in the error message. Before unlocking, confirm that the process holding the lock has really exited (check the CI job and the lock's Who and Created fields in the error). If another apply is genuinely running, never force-unlock: two concurrent writers can corrupt the state.
Desired capacity drift in ASG: If an operator or a scaling policy changes desired_capacity outside Terraform, the next terraform plan shows a diff and the next apply resets it to the configured value. Add ignore_changes = [desired_capacity] to the ASG's lifecycle block if desired capacity is managed by auto-scaling policies rather than by Terraform.
NAT Gateway EIP limit: AWS default is 5 EIPs per region. A three-AZ stack needs 3 EIPs for NAT Gateways. Across multiple environments in one region you hit the limit quickly — request a quota increase as part of the initial infrastructure setup, before the first apply.
AMI deregistration: If the AMI used by the launch template is deregistered, new ASG instances fail to launch but existing instances are unaffected. The fix is to update the launch template's AMI reference and trigger an instance refresh. Always pin launch templates to AMIs managed via AWS Image Builder or Packer pipelines, not to public AMIs that can be removed.
Provider version mismatch across workspaces: A colleague runs terraform init -upgrade and commits an updated .terraform.lock.hcl that pins a new provider version. Your CI picks it up on the next run. The new provider may have a breaking schema change for a resource you use. Solution: review lock file diffs in PRs with the same scrutiny as application code changes.
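Two of the fixes in the list above, ignoring desired-capacity drift and rolling instances onto a new AMI, are both settings on the Auto Scaling Group. The sketch below shows one possible shape. launch_template_id and launch_template_version are hypothetical inputs added so the fragment stands alone, and the ALB target group attachment is omitted.

```hcl
# modules/compute/asg.tf (sketch)
variable "name_prefix" { type = string }
variable "asg_min" { type = number }
variable "asg_max" { type = number }
variable "asg_desired" { type = number }

variable "private_subnet_ids" {
  type = list(string)
}

# Hypothetical inputs for this standalone example.
variable "launch_template_id" { type = string }
variable "launch_template_version" { type = string }

resource "aws_autoscaling_group" "app" {
  name_prefix         = "${var.name_prefix}-asg-"
  min_size            = var.asg_min
  max_size            = var.asg_max
  desired_capacity    = var.asg_desired
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id = var.launch_template_id
    # Passing an explicit version number (not "$Latest") makes a new
    # template version visible to Terraform as a change to this resource.
    version = var.launch_template_version
  }

  # When the launch template version changes (for example, a new AMI),
  # replace instances gradually instead of waiting for natural turnover.
  instance_refresh {
    strategy = "Rolling"

    preferences {
      min_healthy_percentage = 90
    }
  }

  lifecycle {
    # Scaling policies own desired_capacity after creation;
    # Terraform sets the initial value and then ignores drift.
    ignore_changes = [desired_capacity]
  }
}
```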
Module versioning in team environments: In solo or small-team projects it is acceptable to reference modules via relative paths (./modules/network). In large organizations, modules are published to a private Terraform registry or to a Git repository with tagged releases, and callers pin to a semantic version: source = "git::https://github.com/acme/terraform-modules.git//network?ref=v2.3.0". This ensures that a module change in one team's branch does not silently break another team's infrastructure on their next init.
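A pinned module call could look like the sketch below. The Git URL is the one quoted above and the inputs mirror the root module's call to the network module. The registry address in the second block is a hypothetical placeholder; the version argument is only accepted for registry sources.

```hcl
# Git source: the double slash selects the network/ subdirectory of the
# repository, and ?ref pins the call to the v2.3.0 tag.
module "network" {
  source = "git::https://github.com/acme/terraform-modules.git//network?ref=v2.3.0"

  name_prefix = local.name_prefix
  vpc_cidr    = var.vpc_cidr
  az_count    = local.az_count
  tags        = local.common_tags
}

# Private registry source: pinned with a version constraint instead of a ref.
module "network_from_registry" {
  source  = "app.terraform.io/acme/network/aws" # hypothetical registry address
  version = "~> 2.3"

  name_prefix = local.name_prefix
  vpc_cidr    = var.vpc_cidr
  az_count    = local.az_count
  tags        = local.common_tags
}
```

Upgrading then becomes an explicit, reviewable change: bump the ref or version in a pull request and run terraform init -upgrade.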