Every production workload at a serious company lives inside a carefully architected VPC. This capstone lesson walks you through designing, building, and validating a multi-AZ VPC that you would actually deploy at a FAANG-scale organisation — with layered network segmentation, least-privilege IAM, and the operational runbooks that keep it running.
Architecture Overview
The target design is a three-tier, dual-AZ VPC: one public tier for load balancers, one private-app tier for your compute, and one private-data tier for databases and caches. Traffic flows inward through a single, auditable path; nothing reaches the data tier without passing the app tier first.
Figure: Three-tier, dual-AZ production VPC with a public load-balancer tier, private app tier, isolated data tier, and VPC Endpoints for AWS service access without traversing the internet.
Step 1 — CIDR Planning and Subnet Sizing
Choose a /16 block from RFC 1918 space (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) that does not overlap any on-premises or peer VPC range you will ever need to connect. Carve it so you can expand without re-IP-ing:
Public tier (/24 per AZ): small — only load balancers and NAT Gateways land here. Two AZs means two /24s (10.0.1.0/24, 10.0.2.0/24).
Private-app tier (/24 per AZ): sized for your autoscaling groups and ECS tasks. Plan for 3–5× current pod count.
Private-data tier (/24 per AZ): an RDS DB subnet group needs subnets in at least two AZs, which Multi-AZ deployments rely on. ElastiCache cluster mode also needs per-AZ placement.
Spare blocks: leave at least 10.0.100.0/22 unused for future tiers (internal tools, Kubernetes node pools, transit attachments).
AWS reserves 5 IPs in every subnet: the first four addresses (network, VPC router, DNS, future use) and the last one (broadcast). A /27 therefore gives you only 27 usable addresses (32 − 5), too few for Kubernetes. Use /24 or larger (251 usable addresses) for compute subnets.
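The carving described above can be sketched with Terraform's built-in cidrsubnet function. This is a minimal sketch, not a definitive plan: the 10.0.0.0/16 base, the AZ names, and the app and data tier offsets (10.0.11.x and 10.0.21.x) are assumptions chosen for illustration. Only the public /24s and the spare /22 come from the text.

```hcl
# cidr-plan.tf (hypothetical file name)
locals {
  vpc_cidr = "10.0.0.0/16"                # assumed base block
  azs      = ["us-east-1a", "us-east-1b"] # assumed AZ names

  # cidrsubnet(prefix, newbits, netnum): 8 new bits on a /16 yields /24s.
  public_cidrs = [for i, az in local.azs : cidrsubnet(local.vpc_cidr, 8, 1 + i)]  # 10.0.1.0/24, 10.0.2.0/24
  app_cidrs    = [for i, az in local.azs : cidrsubnet(local.vpc_cidr, 8, 11 + i)] # 10.0.11.0/24, 10.0.12.0/24 (assumed offsets)
  data_cidrs   = [for i, az in local.azs : cidrsubnet(local.vpc_cidr, 8, 21 + i)] # 10.0.21.0/24, 10.0.22.0/24 (assumed offsets)

  # 6 new bits on a /16 yields /22s; block number 25 starts at 10.0.100.0.
  spare_block = cidrsubnet(local.vpc_cidr, 6, 25) # 10.0.100.0/22, left unallocated
}

output "subnet_plan" {
  value = {
    public = local.public_cidrs
    app    = local.app_cidrs
    data   = local.data_cidrs
    spare  = local.spare_block
  }
}
```

Running terraform plan on this file alone prints the plan as an output, which makes the address layout reviewable before any resource exists.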
Step 2 — Terraform Skeleton
Express the entire design as code, so that the VPC, subnets, and routing are created in an idempotent, reviewable way.
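One way such a skeleton might look is sketched below. It is written as a standalone module under stated assumptions: the variable names, AZ names, resource names, and the app and data tier offsets are hypothetical. Routing for the private-app tier is left out here because it depends on the NAT layout.

```hcl
# vpc.tf (hypothetical file name)
variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "azs" {
  type    = list(string)
  default = ["us-east-1a", "us-east-1b"]
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true # required for private DNS on interface endpoints
  tags                 = { Name = "prod-vpc" }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# One subnet per tier per AZ; /24s carved with cidrsubnet.
resource "aws_subnet" "public" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  availability_zone = var.azs[count.index]
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, 1 + count.index)
  tags              = { Name = "prod-public-${var.azs[count.index]}", Tier = "public" }
}

resource "aws_subnet" "app" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  availability_zone = var.azs[count.index]
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, 11 + count.index) # assumed offset
  tags              = { Name = "prod-app-${var.azs[count.index]}", Tier = "app" }
}

resource "aws_subnet" "data" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.main.id
  availability_zone = var.azs[count.index]
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, 21 + count.index) # assumed offset
  tags              = { Name = "prod-data-${var.azs[count.index]}", Tier = "data" }
}

# Public tier: default route to the internet gateway.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

# Data tier: no default route, only the implicit local VPC route.
resource "aws_route_table" "data" {
  vpc_id = aws_vpc.main.id
}

resource "aws_route_table_association" "data" {
  count          = length(var.azs)
  subnet_id      = aws_subnet.data[count.index].id
  route_table_id = aws_route_table.data.id
}
```

Giving the data tier its own route table with no default route is what makes the "no internet path" property checkable later, rather than relying on Security Groups alone.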
Never share a single NAT Gateway across AZs. If the AZ hosting it goes down, all private-tier egress dies. One NAT Gateway per AZ roughly doubles the fixed hourly NAT cost in a two-AZ design, but it avoids cross-AZ data transfer charges and removes a single point of failure. Sharing one NAT Gateway is a common cost-cutting shortcut that causes cascading failures.
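The per-AZ NAT layout could be expressed roughly as follows. This is a sketch written as a standalone module: the input variables (vpc_id and the two AZ-keyed subnet maps) are hypothetical, and the domain argument on aws_eip assumes AWS provider v5 or later.

```hcl
# nat.tf (hypothetical file name)
variable "vpc_id" {
  type = string
}

variable "public_subnet_ids" {
  type        = map(string)
  description = "AZ name => public subnet ID"
}

variable "app_subnet_ids" {
  type        = map(string)
  description = "AZ name => private-app subnet ID (same AZ keys as above)"
}

# One Elastic IP and one NAT Gateway per AZ.
resource "aws_eip" "nat" {
  for_each = var.public_subnet_ids
  domain   = "vpc"
}

resource "aws_nat_gateway" "per_az" {
  for_each      = var.public_subnet_ids
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = each.value
}

# One route table per AZ so each app subnet egresses through the NAT in its own AZ.
resource "aws_route_table" "app" {
  for_each = var.app_subnet_ids
  vpc_id   = var.vpc_id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.per_az[each.key].id
  }
}

resource "aws_route_table_association" "app" {
  for_each       = var.app_subnet_ids
  subnet_id      = each.value
  route_table_id = aws_route_table.app[each.key].id
}
```

Keying every resource by AZ name keeps the NAT, route table, and subnet for one AZ tied together, so losing an AZ affects only that AZ's egress.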
Step 3 — VPC Endpoints (Cut the Internet Path)
Private-app and private-data nodes need to reach S3, ECR, SSM, and Secrets Manager without touching the public internet. VPC Endpoints keep this traffic on the AWS network and take it off the NAT Gateway, which matters for high-throughput services (ECR image pulls are surprisingly large). Gateway endpoints for S3 and DynamoDB carry no hourly or per-GB charge, so traffic to those two services also stops incurring NAT Gateway data-processing fees.
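A possible shape for the endpoint layer is sketched below as a standalone module. The input variables, the security group name, and the exact list of interface services are assumptions: the list covers the services named above plus ssmmessages, ec2messages, and logs, which Session Manager and log shipping commonly need.

```hcl
# endpoints.tf (hypothetical file name)
variable "vpc_id" {
  type = string
}

variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "region" {
  type    = string
  default = "us-east-1"
}

variable "private_route_table_ids" {
  type = list(string)
}

variable "app_subnet_ids" {
  type = list(string)
}

# Gateway endpoint: S3 traffic is routed via the private route tables.
# ECR image layers are served from S3, so image pulls use this path too.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# Interface endpoints accept HTTPS only from inside the VPC.
resource "aws_security_group" "endpoints" {
  name_prefix = "prod-vpce-"
  vpc_id      = var.vpc_id

  ingress {
    description = "HTTPS from the VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [var.vpc_cidr]
  }
}

resource "aws_vpc_endpoint" "interface" {
  for_each = toset([
    "ecr.api", "ecr.dkr", "ssm", "ssmmessages",
    "ec2messages", "secretsmanager", "logs",
  ])

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.app_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true # regional service hostnames resolve to private IPs
}
```

Interface endpoints are billed per hour and per GB, so it is worth trimming the service list to what the workload actually calls.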
Step 4 — Least-Privilege IAM
Every compute resource gets an IAM instance profile or task role, never hardcoded credentials. Design each role with the minimum permissions needed to do its job, then deny anything broader at the SCP level.
# iam.tf — App-tier task role (ECS example)

# Trust policy: only ECS tasks may assume this role
data "aws_iam_policy_document" "ecs_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "app_task" {
  name               = "prod-app-task-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
}

data "aws_iam_policy_document" "app_task_policy" {
  # Read secrets — only the secrets this service owns
  statement {
    actions   = ["secretsmanager:GetSecretValue"]
    resources = ["arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/myapp/*"]
  }

  # Write structured logs — no wildcard on log groups
  statement {
    actions   = ["logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["arn:aws:logs:us-east-1:123456789012:log-group:/prod/myapp:*"]
  }

  # S3 — read-only on the app's own prefix
  # (ListBucket applies to the bucket ARN, GetObject to the object ARNs)
  statement {
    actions = ["s3:GetObject", "s3:ListBucket"]
    resources = [
      "arn:aws:s3:::myorg-prod-assets",
      "arn:aws:s3:::myorg-prod-assets/myapp/*"
    ]
  }

  # Explicit deny of any data-destructive action (defense in depth)
  statement {
    effect = "Deny"
    actions = [
      "s3:DeleteObject", "s3:DeleteBucket",
      "rds:DeleteDBInstance", "rds:DeleteDBCluster"
    ]
    resources = ["*"]
  }
}

# Attach the policy inline to the task role
resource "aws_iam_role_policy" "app_task" {
  name   = "prod-app-task-inline"
  role   = aws_iam_role.app_task.id
  policy = data.aws_iam_policy_document.app_task_policy.json
}
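The SCP-level deny mentioned at the start of this step might look like the sketch below. The denied actions are illustrative assumptions, picked to back up the "never hardcoded credentials" rule by blocking long-lived IAM user keys. The policy name and the prod_ou_id variable are hypothetical, and the code has to run from the AWS Organizations management account.

```hcl
# scp.tf (hypothetical file name)
variable "prod_ou_id" {
  type        = string
  description = "ID of the organizational unit holding production accounts"
}

data "aws_iam_policy_document" "scp_guardrail" {
  # Illustrative guardrail: no IAM users or long-lived access keys in prod accounts.
  statement {
    sid       = "DenyLongLivedCredentials"
    effect    = "Deny"
    actions   = ["iam:CreateUser", "iam:CreateAccessKey"]
    resources = ["*"]
  }
}

resource "aws_organizations_policy" "guardrail" {
  name    = "prod-guardrails"
  type    = "SERVICE_CONTROL_POLICY"
  content = data.aws_iam_policy_document.scp_guardrail.json
}

resource "aws_organizations_policy_attachment" "prod" {
  policy_id = aws_organizations_policy.guardrail.id
  target_id = var.prod_ou_id
}
```

An SCP never grants permissions. It only caps what role policies in the attached accounts can allow, which is why it works as a backstop behind the role policy above.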
Run IAM Access Analyzer continuously (aws accessanalyzer create-analyzer --analyzer-name <name> --type ACCOUNT_UNUSED_ACCESS). An unused-access analyzer surfaces roles whose permissions were not exercised within its tracking period (90 days by default). Treat its findings as a quarterly tightening pass that pulls roles back toward least privilege.
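To keep the analyzer itself under version control, the same thing can probably be declared in Terraform. A minimal sketch, assuming a recent AWS provider that supports the unused_access configuration block. The analyzer name is hypothetical.

```hcl
# access-analyzer.tf (hypothetical file name)
resource "aws_accessanalyzer_analyzer" "unused_access" {
  analyzer_name = "prod-unused-access" # hypothetical name
  type          = "ACCOUNT_UNUSED_ACCESS"

  configuration {
    unused_access {
      unused_access_age = 90 # days without use before a finding is raised
    }
  }
}
```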
Step 5 — Validate the Design
Infrastructure is not done when terraform apply returns. Validate every security boundary:
Connectivity: launch an SSM-managed instance in each private-app subnet. Run dig +short api.ecr.us-east-1.amazonaws.com; with private DNS enabled on the ECR interface endpoint, it should return private IPs from your VPC range. Run curl https://ifconfig.me; you should see the NAT Gateway EIP, confirming NAT egress works.
Data tier isolation: from a private-data subnet instance, verify curl https://checkip.amazonaws.com times out — no internet route.
Security Group audit: aws ec2 describe-security-groups --filters "Name=vpc-id,Values=vpc-XXXX" | jq '.SecurityGroups[] | select(any(.IpPermissions[].IpRanges[]; .CidrIp == "0.0.0.0/0")) | {GroupId, GroupName}' lists each group with an ingress rule open to the world exactly once. The only SG with a public ingress rule should be the ALB SG on port 443.
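IAM simulation: run aws iam simulate-principal-policy --policy-source-arn arn:aws:iam::ACCOUNT:role/prod-app-task-role --action-names s3:DeleteObject --resource-arns arn:aws:s3:::myorg-prod-assets/myapp/test-object (the object key is a placeholder; s3:DeleteObject is evaluated against object ARNs, not the bucket ARN). With the explicit deny statement in place, the expected decision is explicitDeny. An implicitDeny still blocks the action, but it means the explicit deny is not attached.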
A production VPC review at a mature company includes a threat model: for each data path in the diagram, ask "what happens if this component is compromised?" The public subnet being breached should not give lateral access to the data tier — that is what Security Groups, NACLs, and IAM boundaries enforce independently.
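The Security Group part of that containment can be sketched as a chain in which each tier accepts traffic only from the tier in front of it. This is an illustrative standalone module, not the lesson's own code: the group names, the app port 8080, and the database port 5432 are assumptions, and the separate rule resources assume AWS provider v5 or later.

```hcl
# security-groups.tf (hypothetical file name)
variable "vpc_id" {
  type = string
}

variable "app_port" {
  type    = number
  default = 8080 # assumed application port
}

variable "db_port" {
  type    = number
  default = 5432 # assumed database port (PostgreSQL)
}

resource "aws_security_group" "alb" {
  name_prefix = "prod-alb-"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "app" {
  name_prefix = "prod-app-"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "data" {
  name_prefix = "prod-data-"
  vpc_id      = var.vpc_id
}

# Internet -> ALB on 443: the only rule open to 0.0.0.0/0.
resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}

# ALB -> app tier.
resource "aws_vpc_security_group_egress_rule" "alb_to_app" {
  security_group_id            = aws_security_group.alb.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = var.app_port
  to_port                      = var.app_port
}

resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.alb.id
  ip_protocol                  = "tcp"
  from_port                    = var.app_port
  to_port                      = var.app_port
}

# App -> data tier.
resource "aws_vpc_security_group_egress_rule" "app_to_data" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.data.id
  ip_protocol                  = "tcp"
  from_port                    = var.db_port
  to_port                      = var.db_port
}

# App -> HTTPS, for VPC endpoints and NAT egress.
resource "aws_vpc_security_group_egress_rule" "app_https_out" {
  security_group_id = aws_security_group.app.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}

# Data tier accepts the database port from the app tier only.
# No egress rule is declared for the data tier in this sketch.
resource "aws_vpc_security_group_ingress_rule" "data_from_app" {
  security_group_id            = aws_security_group.data.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = var.db_port
  to_port                      = var.db_port
}
```

Referencing security group IDs instead of CIDR ranges means a compromised host in the public subnet has no rule that reaches the data tier, even though it shares the VPC.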
Operational Runbook Essentials
Tag everything consistently — the tags you set in Terraform become your cost allocation, incident scoping, and access-control foundation. At minimum: Env, Service, Team, ManagedBy=terraform. Enable VPC Flow Logs to CloudWatch Logs or S3 from day one — you cannot retroactively reconstruct a network incident without them. Enable CloudTrail in all regions, with a dedicated S3 bucket that denies bucket policy deletion.
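These runbook basics can be declared alongside the network itself. The sketch below is a standalone module with assumed values: the tag values, trail name, and input variables are hypothetical, and both buckets are assumed to exist already with policies that let the respective services write to them.

```hcl
# observability.tf (hypothetical file name)
variable "vpc_id" {
  type = string
}

variable "flow_log_bucket_arn" {
  type = string
}

variable "cloudtrail_bucket_name" {
  type = string
}

# default_tags applies the baseline tags to every taggable resource this provider creates.
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Env       = "prod"
      Service   = "myapp"    # assumed value
      Team      = "platform" # assumed value
      ManagedBy = "terraform"
    }
  }
}

# VPC Flow Logs delivered to S3, capturing accepted and rejected traffic.
resource "aws_flow_log" "vpc" {
  vpc_id               = var.vpc_id
  traffic_type         = "ALL"
  log_destination_type = "s3"
  log_destination      = var.flow_log_bucket_arn
}

# One trail covering all regions, with log file integrity validation.
resource "aws_cloudtrail" "all_regions" {
  name                          = "prod-all-regions"
  s3_bucket_name                = var.cloudtrail_bucket_name
  is_multi_region_trail         = true
  include_global_service_events = true
  enable_log_file_validation    = true
}
```

Setting tags once at the provider level avoids the drift that appears when each resource block carries its own copy.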
This architecture is the foundation. From here you layer on: Transit Gateway for multi-account routing, AWS Network Firewall for deep packet inspection, PrivateLink for exposing services to other VPCs, and GuardDuty for runtime threat detection. The three-tier pattern scales from 5 engineers to 5,000.