This capstone lesson walks you through deploying a production-grade, highly available web tier on AWS from scratch using the AWS CLI. The architecture combines an Application Load Balancer (ALB), an Auto Scaling Group (ASG), and a Multi-AZ RDS instance: the canonical three-tier reference deployment behind many production AWS workloads. By the end you will have a stack that survives single-instance failure, single-AZ failure, and traffic spikes without manual intervention, and you will understand why each component is wired the way it is.
Target Architecture
The diagram below shows what you are building. Traffic enters through the ALB (public subnets, two AZs) and is distributed to EC2 instances in the ASG (private subnets, two AZs), which connect to a Multi-AZ RDS instance (data subnets, with the primary in one AZ and a synchronous standby in the other). No EC2 or RDS instance is directly reachable from the internet; only the ALB is public.
ALB + ASG + Multi-AZ RDS: the canonical three-tier HA architecture. All EC2 and RDS resources live in private/data subnets; only the ALB is internet-facing.
Step 1 — VPC and Subnets
Everything runs inside a dedicated VPC with three subnet tiers per AZ: public (ALB), private (EC2), and data (RDS). Separating subnet tiers gives you independent security group and NACL controls at each boundary.
Data subnets have no route to the internet — not even via NAT. RDS instances in data subnets can only be reached by resources inside the VPC. This is the correct security posture: your database is never one misconfigured security group rule away from the internet.
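The subnet layout described above could be created roughly as follows. This is a minimal sketch, not the lesson's original commands: the CIDR blocks, AZ names, resource names, and the `create_subnet` / `create_network` function names are all assumptions. The commands are wrapped in functions so that pasting the snippet only defines them; nothing is created until you call `create_network`.

```shell
#!/usr/bin/env bash
# Sketch only: CIDR blocks, AZ names, and function names are assumptions.
AZS=(us-east-1a us-east-1b)

create_subnet() {   # usage: create_subnet <cidr> <az> <name>; prints the subnet ID
  aws ec2 create-subnet --vpc-id "$VPC_ID" --cidr-block "$1" \
    --availability-zone "$2" \
    --tag-specifications "ResourceType=subnet,Tags=[{Key=Name,Value=$3}]" \
    --query 'Subnet.SubnetId' --output text
}

new_route_table() { # prints the ID of a new route table in the VPC
  aws ec2 create-route-table --vpc-id "$VPC_ID" \
    --query 'RouteTable.RouteTableId' --output text
}

create_network() {
  VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
    --query 'Vpc.VpcId' --output text)

  # Three tiers, one subnet per tier per AZ
  PUB_A=$(create_subnet 10.0.0.0/24 "${AZS[0]}" ha-public-a)
  PUB_B=$(create_subnet 10.0.1.0/24 "${AZS[1]}" ha-public-b)
  PRIV_A=$(create_subnet 10.0.10.0/24 "${AZS[0]}" ha-private-a)
  PRIV_B=$(create_subnet 10.0.11.0/24 "${AZS[1]}" ha-private-b)
  DATA_A=$(create_subnet 10.0.20.0/24 "${AZS[0]}" ha-data-a)
  DATA_B=$(create_subnet 10.0.21.0/24 "${AZS[1]}" ha-data-b)

  # Public tier: default route to an internet gateway
  IGW_ID=$(aws ec2 create-internet-gateway \
    --query 'InternetGateway.InternetGatewayId' --output text)
  aws ec2 attach-internet-gateway --internet-gateway-id "$IGW_ID" --vpc-id "$VPC_ID"
  PUB_RT=$(new_route_table)
  aws ec2 create-route --route-table-id "$PUB_RT" \
    --destination-cidr-block 0.0.0.0/0 --gateway-id "$IGW_ID"
  aws ec2 associate-route-table --route-table-id "$PUB_RT" --subnet-id "$PUB_A"
  aws ec2 associate-route-table --route-table-id "$PUB_RT" --subnet-id "$PUB_B"

  # Private tier: default route through a single NAT gateway in AZ-a
  EIP_ALLOC=$(aws ec2 allocate-address --domain vpc \
    --query 'AllocationId' --output text)
  NAT_ID=$(aws ec2 create-nat-gateway --subnet-id "$PUB_A" \
    --allocation-id "$EIP_ALLOC" \
    --query 'NatGateway.NatGatewayId' --output text)
  aws ec2 wait nat-gateway-available --nat-gateway-ids "$NAT_ID"
  PRIV_RT=$(new_route_table)
  aws ec2 create-route --route-table-id "$PRIV_RT" \
    --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$NAT_ID"
  aws ec2 associate-route-table --route-table-id "$PRIV_RT" --subnet-id "$PRIV_A"
  aws ec2 associate-route-table --route-table-id "$PRIV_RT" --subnet-id "$PRIV_B"

  # Data tier: only the implicit local route, so no path to IGW or NAT
  DATA_RT=$(new_route_table)
  aws ec2 associate-route-table --route-table-id "$DATA_RT" --subnet-id "$DATA_A"
  aws ec2 associate-route-table --route-table-id "$DATA_RT" --subnet-id "$DATA_B"
}
```

Giving the data tier its own route table, rather than leaving it on the VPC's main table, means a later change to the main table cannot accidentally hand the database subnets an internet route.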
Step 2 — Security Groups
Three security groups enforce the principle of least privilege at the network layer. Each group only opens the exact port and source needed — never 0.0.0.0/0 for anything other than the ALB listener.
# ALB SG: accepts HTTPS from the internet (and HTTP, for the redirect)
ALB_SG=$(aws ec2 create-security-group \
  --group-name ha-alb-sg --description "ALB inbound" \
  --vpc-id "$VPC_ID" --query 'GroupId' --output text)
# The shorthand values are quoted so the shell does not glob-expand the [ ] brackets
aws ec2 authorize-security-group-ingress --group-id "$ALB_SG" \
  --ip-permissions \
  'IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges=[{CidrIp=0.0.0.0/0}]' \
  'IpProtocol=tcp,FromPort=80,ToPort=80,IpRanges=[{CidrIp=0.0.0.0/0}]'
# EC2 SG: accepts only from the ALB SG on port 8080 (or 443 if TLS is terminated at EC2)
APP_SG=$(aws ec2 create-security-group \
  --group-name ha-app-sg --description "App tier inbound" \
  --vpc-id "$VPC_ID" --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id "$APP_SG" \
  --protocol tcp --port 8080 --source-group "$ALB_SG"
# RDS SG: accepts only from the EC2 SG on port 5432 (PostgreSQL)
DB_SG=$(aws ec2 create-security-group \
  --group-name ha-db-sg --description "DB tier inbound" \
  --vpc-id "$VPC_ID" --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id "$DB_SG" \
  --protocol tcp --port 5432 --source-group "$APP_SG"
Step 3 — RDS Multi-AZ
Create a DB subnet group spanning both data subnets, then launch a Multi-AZ PostgreSQL instance. The --multi-az flag provisions a synchronous standby in the second AZ. If the primary goes down, RDS fails over automatically, typically within 60–120 seconds, by repointing the endpoint's DNS record at the standby, so your application reconnects to the same hostname.
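A sketch of those two steps follows. The subnet group name, instance class, storage size, master username, and the `create_database` function name are assumptions; `DATA_A`, `DATA_B`, and `DB_SG` are assumed to hold the data-subnet IDs and the DB security group ID. The instance identifier `ha-web-db` matches the one used in the validation commands later in this lesson.

```shell
# Sketch only: names, instance class, and storage settings are assumptions.
create_database() {
  aws rds create-db-subnet-group \
    --db-subnet-group-name ha-web-db-subnets \
    --db-subnet-group-description "Data subnets for ha-web-db" \
    --subnet-ids "$DATA_A" "$DATA_B"

  # --manage-master-user-password keeps the password in Secrets Manager
  # instead of on the command line
  aws rds create-db-instance \
    --db-instance-identifier ha-web-db \
    --engine postgres \
    --db-instance-class db.t3.medium \
    --allocated-storage 50 --storage-type gp3 --storage-encrypted \
    --multi-az \
    --no-publicly-accessible \
    --db-subnet-group-name ha-web-db-subnets \
    --vpc-security-group-ids "$DB_SG" \
    --master-username appadmin \
    --manage-master-user-password \
    --backup-retention-period 7

  # Block until the instance is available, then capture its endpoint hostname
  aws rds wait db-instance-available --db-instance-identifier ha-web-db
  DB_ENDPOINT=$(aws rds describe-db-instances \
    --db-instance-identifier ha-web-db \
    --query 'DBInstances[0].Endpoint.Address' --output text)
}
```

Calling `create_database` leaves the hostname in `DB_ENDPOINT`. That hostname stays the same across a failover, which is why the application only needs to reconnect rather than be reconfigured.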
Step 4 — Launch Template
The Launch Template defines the exact specification for every instance the ASG creates. User data bootstraps the application. Pass the RDS endpoint via SSM Parameter Store, and never bake secrets into the Launch Template or an AMI.
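# Store the DB endpoint in SSM so user data can fetch it at boot
aws ssm put-parameter \
  --name /ha-web/db-endpoint \
  --value "$DB_ENDPOINT" \
  --type String --overwrite
# IAM instance profile allowing SSM access and CloudWatch agent
# (assumes IAM role ha-web-instance-role with AmazonSSMManagedInstanceCore
# + CloudWatchAgentServerPolicy already exists and is attached to the
# instance profile ha-web-instance-profile)
# User data script (base64-encoded for the CLI; -w0 is the GNU coreutils flag)
USER_DATA=$(base64 -w0 <<'USERDATA'
#!/bin/bash
set -euo pipefail
yum update -y
yum install -y amazon-ssm-agent amazon-cloudwatch-agent python3 python3-pip
# Fetch config from SSM
DB_HOST=$(aws ssm get-parameter --name /ha-web/db-endpoint --query 'Parameter.Value' --output text --region us-east-1)
# Install and start your app (example: a simple Python/Gunicorn service)
pip3 install flask gunicorn psycopg2-binary
mkdir -p /opt/app
cat > /opt/app/app.py <<'EOF'
from flask import Flask

app = Flask(__name__)

@app.route('/health')
def health():
    return {'status': 'ok'}, 200

@app.route('/')
def home():
    return {'message': 'Hello from HA Web Tier'}, 200
EOF
# Minimal systemd unit so the systemctl calls below have a service to manage.
# The unit contents and the gunicorn path are assumptions; adjust for your image.
cat > /etc/systemd/system/gunicorn-app.service <<UNIT
[Unit]
Description=Gunicorn app
After=network-online.target

[Service]
WorkingDirectory=/opt/app
Environment=DB_HOST=$DB_HOST
ExecStart=/usr/local/bin/gunicorn --bind 0.0.0.0:8080 app:app
Restart=always

[Install]
WantedBy=multi-user.target
UNIT
systemctl daemon-reload
systemctl enable gunicorn-app
systemctl start gunicorn-app
USERDATA
)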
# Create Launch Template
# AMI IDs are region-specific and get deprecated over time; confirm this one
# is current for your region before using it.
LT_ID=$(aws ec2 create-launch-template \
  --launch-template-name ha-web-lt \
  --version-description "v1" \
  --launch-template-data "{
    \"ImageId\": \"ami-0c02fb55956c7d316\",
    \"InstanceType\": \"t3.small\",
    \"SecurityGroupIds\": [\"$APP_SG\"],
    \"IamInstanceProfile\": {\"Name\": \"ha-web-instance-profile\"},
    \"UserData\": \"$USER_DATA\",
    \"MetadataOptions\": {\"HttpTokens\": \"required\", \"HttpEndpoint\": \"enabled\"},
    \"BlockDeviceMappings\": [{
      \"DeviceName\": \"/dev/xvda\",
      \"Ebs\": {\"VolumeSize\": 30, \"VolumeType\": \"gp3\", \"Encrypted\": true, \"DeleteOnTermination\": true}
    }],
    \"TagSpecifications\": [{
      \"ResourceType\": \"instance\",
      \"Tags\": [{\"Key\": \"Name\", \"Value\": \"ha-web-asg\"}, {\"Key\": \"Env\", \"Value\": \"prod\"}]
    }]
  }" \
  --query 'LaunchTemplate.LaunchTemplateId' --output text)
Set "HttpTokens": "required" in MetadataOptions on every Launch Template. This enforces IMDSv2, which requires a session token for every metadata request and so blocks most SSRF attempts to read instance credentials via the metadata endpoint, a well-known attack vector against EC2 workloads.
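To see why the token requirement matters, here is a minimal sketch of the IMDSv2 session flow as run on the instance itself. The `imds_get` helper name and the 300-second TTL are assumptions made for illustration. A typical SSRF bug lets an attacker trigger a plain GET but not a PUT with a custom header, so it cannot complete the first step.

```shell
# Sketch of the IMDSv2 session flow; run on the instance itself.
IMDS=http://169.254.169.254/latest

imds_get() {   # usage: imds_get <metadata-path>, e.g. imds_get meta-data/instance-id
  local token
  # Step 1: a PUT request obtains a short-lived session token
  token=$(curl -s -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
  # Step 2: every metadata read must present that token
  curl -s "$IMDS/$1" -H "X-aws-ec2-metadata-token: $token"
}
```

With `HttpTokens` set to `required`, a metadata request that omits the token header is rejected instead of falling back to the older token-less behavior.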
Step 5 — ALB and Target Group
Create an internet-facing ALB in both public subnets. Attach a Target Group with a health check path — the ASG registers instances against this Target Group, and the ALB only routes traffic to instances that pass the health check.
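The commands for this step might look like the sketch below. The resource names, health check thresholds, and the `create_load_balancer` function name are assumptions, and `CERT_ARN` is assumed to hold the ARN of an ACM certificate you have already issued for your domain. `PUB_A` and `PUB_B` are assumed to hold the public subnet IDs.

```shell
# Sketch only: names, thresholds, and CERT_ARN are assumptions.
create_load_balancer() {
  ALB_ARN=$(aws elbv2 create-load-balancer \
    --name ha-web-alb --type application --scheme internet-facing \
    --subnets "$PUB_A" "$PUB_B" --security-groups "$ALB_SG" \
    --query 'LoadBalancers[0].LoadBalancerArn' --output text)
  ALB_DNS=$(aws elbv2 describe-load-balancers \
    --load-balancer-arns "$ALB_ARN" \
    --query 'LoadBalancers[0].DNSName' --output text)

  # Targets are instances on port 8080, probed on /health
  TG_ARN=$(aws elbv2 create-target-group \
    --name ha-web-tg --protocol HTTP --port 8080 \
    --vpc-id "$VPC_ID" --target-type instance \
    --health-check-path /health --health-check-interval-seconds 15 \
    --healthy-threshold-count 2 --unhealthy-threshold-count 3 \
    --matcher HttpCode=200 \
    --query 'TargetGroups[0].TargetGroupArn' --output text)

  # HTTPS listener forwards to the target group
  aws elbv2 create-listener --load-balancer-arn "$ALB_ARN" \
    --protocol HTTPS --port 443 \
    --certificates CertificateArn="$CERT_ARN" \
    --default-actions Type=forward,TargetGroupArn="$TG_ARN"

  # HTTP listener only redirects to HTTPS
  aws elbv2 create-listener --load-balancer-arn "$ALB_ARN" \
    --protocol HTTP --port 80 \
    --default-actions 'Type=redirect,RedirectConfig={Protocol=HTTPS,Port=443,StatusCode=HTTP_301}'
}
```

After calling `create_load_balancer`, `TG_ARN` and `ALB_DNS` are available for the ASG and for the validation commands.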
Step 6 — Auto Scaling Group
The ASG references the Launch Template, spans both private subnets, and registers new instances into the Target Group automatically. Use a Target Tracking scaling policy on average CPU: AWS adjusts the fleet size to keep CPU near the target. Combine this with a minimum of 2 instances so that the ASG, which balances capacity across its AZs, keeps one instance in each AZ and the tier survives the loss of either one.
Set --health-check-type ELB, not the default EC2. With EC2 health checks, the ASG replaces an instance only when it fails its EC2 status checks or leaves the running state. With ELB health checks, if your application process crashes or stops answering the health check path with a success code, the ALB marks the target unhealthy, and the ASG terminates it and launches a replacement automatically. This is the critical difference between "server is alive" and "application is healthy".
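Putting those settings together, the ASG and its scaling policy could be created as in this sketch. The maximum size, the 50% CPU target, the 180-second grace period, the policy name, and the `create_asg` function name are assumptions. `PRIV_A` and `PRIV_B` are assumed to hold the private subnet IDs.

```shell
# Sketch only: sizes, CPU target, and grace period are assumptions.
create_asg() {
  aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name ha-web-asg \
    --launch-template "LaunchTemplateId=$LT_ID,Version=1" \
    --min-size 2 --max-size 6 --desired-capacity 2 \
    --vpc-zone-identifier "$PRIV_A,$PRIV_B" \
    --target-group-arns "$TG_ARN" \
    --health-check-type ELB \
    --health-check-grace-period 180

  # Target tracking: keep average CPU across the group near 50%
  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name ha-web-asg \
    --policy-name ha-web-cpu-target \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{
      "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
      "TargetValue": 50.0
    }'
}
```

The launch template is referenced by an explicit version number rather than `$Latest`, so new template versions take effect only when you deliberately update the ASG.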
Validating the Stack
Once both instances are InService in the ASG and report healthy in the Target Group, run these validation steps to confirm each HA property works:
# 1. Verify instances are healthy in the Target Group
aws elbv2 describe-target-health --target-group-arn "$TG_ARN" \
  --query 'TargetHealthDescriptions[*].{Instance:Target.Id,State:TargetHealth.State}'
# 2. Confirm the ALB returns 200 from your application
# (-k skips certificate name checks, because the certificate is issued for
# your own domain rather than for the ALB DNS name)
curl -skI "https://$ALB_DNS/health"
# 3. Simulate instance failure: terminate one instance and watch the ASG replace it
# (the query is scoped to ha-web-asg so it cannot pick an instance from another group)
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names ha-web-asg \
  --query 'AutoScalingGroups[0].Instances[0].InstanceId' --output text)
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
# Within ~2 minutes, the ASG launches a replacement in the same or other AZ
# 4. Watch ASG activity log
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name ha-web-asg \
  --max-items 5
# 5. Test RDS failover (forces a Multi-AZ failover; expect roughly 60-120 s of downtime)
aws rds reboot-db-instance \
  --db-instance-identifier ha-web-db \
  --force-failover
# Your app must handle reconnect: a connection pool with retry is required
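The retry requirement can be illustrated with a small shell helper. This is only a sketch of the retry-with-backoff pattern, with a hypothetical `retry` function name; in the application itself the equivalent logic would live in the database driver or connection pool configuration.

```shell
# retry <max_attempts> <initial_delay_seconds> <command...>
# Runs the command until it succeeds, doubling the delay after each failure.
retry() {
  local max=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example (assumes the PostgreSQL client tools are installed):
# wait for the database to accept connections again after the forced failover
# retry 8 1 pg_isready -h "$DB_ENDPOINT" -p 5432
```

Doubling the delay keeps a fleet of reconnecting clients from hammering the new primary in lockstep while DNS is still switching over.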
Production Failure Modes to Know
The architecture is resilient, but it is not immune to all failure classes. Know these edge cases before you take it to production:
Thundering herd on scale-out: When the ASG launches 6 new instances simultaneously during a traffic spike, all 6 open their database connections at once. Cap max_connections via a parameter group and put a connection pooler such as PgBouncer (in transaction pooling mode) or RDS Proxy in front of the database to absorb the burst.
Health check grace period too short: If --health-check-grace-period is less than your application boot time, the ASG will terminate the instance before it is ready and enter a launch-terminate loop. Set it to at least 1.5× your observed cold-start time.
Stale Launch Template: If the ASG references $Latest, every new Launch Template version takes effect on the next launch, so a scale-out event in the middle of a rollout can start instances from a version you have not finished testing. Pin the ASG to an explicit version in critical production environments and move it forward deliberately.
Single NAT Gateway SPOF: The setup above uses one NAT Gateway in AZ-a. If AZ-a has a partial outage, instances in AZ-b lose outbound internet access. For true HA, deploy one NAT Gateway per AZ and route each private subnet to its local NAT.
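Three of the mitigations above can be sketched with the CLI as follows. The function names are hypothetical, and `LT_ID`, `VPC_ID`, `PUB_B`, and `PRIV_B` are assumed to hold the Launch Template ID, the VPC ID, and the AZ-b public and private subnet IDs. As elsewhere, the snippet only defines functions; nothing changes until you call one.

```shell
# Sketch only: function names and the LT_ID / PUB_B / PRIV_B variables are assumptions.

# Grace period: 1.5x the observed cold-start time, rounded up to a whole second
grace_period_seconds() {   # usage: grace_period_seconds <cold_start_seconds>
  echo $(( ($1 * 3 + 1) / 2 ))
}

apply_grace_period() {     # usage: apply_grace_period <cold_start_seconds>
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name ha-web-asg \
    --health-check-grace-period "$(grace_period_seconds "$1")"
}

# Stale Launch Template: reference an explicit version instead of $Latest
pin_launch_template_version() {   # usage: pin_launch_template_version <version>
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name ha-web-asg \
    --launch-template "LaunchTemplateId=$LT_ID,Version=$1"
}

# Single NAT SPOF: give AZ-b its own NAT gateway and route table
add_nat_for_az_b() {
  local eip nat rt assoc
  eip=$(aws ec2 allocate-address --domain vpc \
    --query 'AllocationId' --output text)
  nat=$(aws ec2 create-nat-gateway --subnet-id "$PUB_B" \
    --allocation-id "$eip" \
    --query 'NatGateway.NatGatewayId' --output text)
  aws ec2 wait nat-gateway-available --nat-gateway-ids "$nat"
  rt=$(aws ec2 create-route-table --vpc-id "$VPC_ID" \
    --query 'RouteTable.RouteTableId' --output text)
  aws ec2 create-route --route-table-id "$rt" \
    --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$nat"
  # Move the AZ-b private subnet from the shared route table to the new one
  assoc=$(aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$PRIV_B" \
    --query "RouteTables[0].Associations[?SubnetId=='$PRIV_B'].RouteTableAssociationId | [0]" \
    --output text)
  aws ec2 replace-route-table-association \
    --association-id "$assoc" --route-table-id "$rt"
}
```

For example, with an observed cold start of 80 seconds, `apply_grace_period 80` sets the grace period to 120 seconds.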