Cloud Fundamentals: AWS Core Services

Project: A Highly-Available Web Tier

This capstone lesson walks you through deploying a production-grade, highly available web tier on AWS from scratch using the AWS CLI. The architecture combines an Application Load Balancer (ALB), an Auto Scaling Group (ASG), and a Multi-AZ RDS instance: the canonical three-tier reference deployment that underlies many production AWS workloads. By the end you will have a stack that survives single-instance failure, single-AZ failure, and traffic spikes without manual intervention, and you will understand why each component is wired the way it is.

Target Architecture

The diagram below shows what you are building. Traffic enters through the ALB (public subnets, two AZs) and is distributed to EC2 instances in the ASG (private subnets, two AZs), which connect to a Multi-AZ RDS instance (data subnets, two AZs). Neither the EC2 instances nor the database is reachable from the internet directly; only the ALB is public.

ALB + ASG + Multi-AZ RDS: the canonical three-tier HA architecture. All EC2 and RDS resources live in private/data subnets; only the ALB is internet-facing.

Step 1 — VPC and Subnets

Everything runs inside a dedicated VPC with three subnet tiers per AZ: public (ALB), private (EC2), and data (RDS). Separating subnet tiers gives you independent security group and NACL controls at each boundary.

# Create VPC
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=ha-web-vpc}]' \
  --query 'Vpc.VpcId' --output text)
aws ec2 modify-vpc-attribute --vpc-id $VPC_ID --enable-dns-hostnames

# Public subnets (ALB)
PUB_A=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.1.0/24 \
  --availability-zone us-east-1a \
  --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=pub-1a}]' \
  --query 'Subnet.SubnetId' --output text)

PUB_B=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.2.0/24 \
  --availability-zone us-east-1b \
  --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=pub-1b}]' \
  --query 'Subnet.SubnetId' --output text)

# Private subnets (EC2 / ASG)
PRIV_A=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.11.0/24 \
  --availability-zone us-east-1a \
  --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=priv-1a}]' \
  --query 'Subnet.SubnetId' --output text)

PRIV_B=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.12.0/24 \
  --availability-zone us-east-1b \
  --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=priv-1b}]' \
  --query 'Subnet.SubnetId' --output text)

# Data subnets (RDS)
DATA_A=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.21.0/24 \
  --availability-zone us-east-1a \
  --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=data-1a}]' \
  --query 'Subnet.SubnetId' --output text)

DATA_B=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.22.0/24 \
  --availability-zone us-east-1b \
  --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=data-1b}]' \
  --query 'Subnet.SubnetId' --output text)

# Internet Gateway (public subnets only)
IGW=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --internet-gateway-id $IGW --vpc-id $VPC_ID

# Route public subnets to IGW
PUB_RT=$(aws ec2 create-route-table --vpc-id $VPC_ID --query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id $PUB_RT --destination-cidr-block 0.0.0.0/0 --gateway-id $IGW
aws ec2 associate-route-table --route-table-id $PUB_RT --subnet-id $PUB_A
aws ec2 associate-route-table --route-table-id $PUB_RT --subnet-id $PUB_B

# NAT Gateway so private EC2 can reach the internet (package installs, SSM)
EIP=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)
NAT=$(aws ec2 create-nat-gateway --subnet-id $PUB_A --allocation-id $EIP \
  --query 'NatGateway.NatGatewayId' --output text)
# Wait for NAT to become available before routing
aws ec2 wait nat-gateway-available --nat-gateway-ids $NAT

PRIV_RT=$(aws ec2 create-route-table --vpc-id $VPC_ID --query 'RouteTable.RouteTableId' --output text)
aws ec2 create-route --route-table-id $PRIV_RT --destination-cidr-block 0.0.0.0/0 --nat-gateway-id $NAT
aws ec2 associate-route-table --route-table-id $PRIV_RT --subnet-id $PRIV_A
aws ec2 associate-route-table --route-table-id $PRIV_RT --subnet-id $PRIV_B

Data subnets have no route to the internet — not even via NAT. RDS instances in data subnets can only be reached by resources inside the VPC. This is the correct security posture: your database is never one misconfigured security group rule away from the internet.
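
The commands above never associate the data subnets with a route table, so they fall back to the VPC's main route table. One way to make the "no internet route" property explicit, rather than relying on the main table staying untouched, is to give the data tier its own route table that holds only the implicit local route. A minimal sketch, assuming the $VPC_ID, $DATA_A, and $DATA_B variables from above; the function name create_isolated_route_table is hypothetical:

```shell
# Hypothetical helper: create a route table with no added routes (only the
# implicit "local" VPC route) and associate it with the given subnets.
create_isolated_route_table() {
  local vpc_id="$1"; shift
  local rt_id subnet_id
  rt_id=$(aws ec2 create-route-table --vpc-id "$vpc_id" \
    --query 'RouteTable.RouteTableId' --output text) || return 1
  for subnet_id in "$@"; do
    aws ec2 associate-route-table --route-table-id "$rt_id" \
      --subnet-id "$subnet_id" >/dev/null || return 1
  done
  # Print the new route table ID so the caller can capture it.
  echo "$rt_id"
}

# Usage (uncomment to run against your account):
# DATA_RT=$(create_isolated_route_table "$VPC_ID" "$DATA_A" "$DATA_B")
```

With an explicit association in place, a later change to the main route table cannot silently give the database tier an outbound path.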

Step 2 — Security Groups

Three security groups enforce the principle of least privilege at the network layer. Each group only opens the exact port and source needed — never 0.0.0.0/0 for anything other than the ALB listener.

# ALB SG — accepts HTTPS from the internet (and HTTP for redirect)
ALB_SG=$(aws ec2 create-security-group \
  --group-name ha-alb-sg --description "ALB inbound" \
  --vpc-id $VPC_ID --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id $ALB_SG \
  --ip-permissions \
  'IpProtocol=tcp,FromPort=443,ToPort=443,IpRanges=[{CidrIp=0.0.0.0/0}]' \
  'IpProtocol=tcp,FromPort=80,ToPort=80,IpRanges=[{CidrIp=0.0.0.0/0}]'

# EC2 SG — accepts only from ALB SG on port 8080 (or 443 if TLS-terminated at EC2)
APP_SG=$(aws ec2 create-security-group \
  --group-name ha-app-sg --description "App tier inbound" \
  --vpc-id $VPC_ID --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id $APP_SG \
  --protocol tcp --port 8080 --source-group $ALB_SG

# RDS SG — accepts only from EC2 SG on port 5432 (PostgreSQL)
DB_SG=$(aws ec2 create-security-group \
  --group-name ha-db-sg --description "DB tier inbound" \
  --vpc-id $VPC_ID --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id $DB_SG \
  --protocol tcp --port 5432 --source-group $APP_SG
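
The least-privilege rule stated at the start of this step (only the ALB accepts traffic from 0.0.0.0/0) is easy to check after the fact. The sketch below is one possible audit, assuming the group names created above; the helper names list_world_open_groups and only_alb_is_public are hypothetical:

```shell
# Hypothetical helper: print the names of security groups in a VPC that have
# at least one inbound rule open to 0.0.0.0/0.
list_world_open_groups() {
  local vpc_id="$1"
  aws ec2 describe-security-groups \
    --filters "Name=vpc-id,Values=$vpc_id" \
              "Name=ip-permission.cidr,Values=0.0.0.0/0" \
    --query 'SecurityGroups[].GroupName' --output text
}

# Returns 0 only when ha-alb-sg is the sole world-open group in the VPC.
only_alb_is_public() {
  [ "$(list_world_open_groups "$1")" = "ha-alb-sg" ]
}

# Usage (uncomment to run against your account):
# only_alb_is_public "$VPC_ID" && echo "OK: only the ALB is internet-facing"
```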

Step 3 — RDS Multi-AZ

Create a DB subnet group spanning both data subnets, then launch a Multi-AZ PostgreSQL instance. The --multi-az flag provisions a synchronously replicated standby in a different AZ. If the primary fails, RDS promotes the standby and repoints the endpoint's DNS record, typically within 60–120 seconds, so your application reconnects to the same hostname.

# DB Subnet Group
aws rds create-db-subnet-group \
  --db-subnet-group-name ha-db-subnet-group \
  --db-subnet-group-description "HA project data subnets" \
  --subnet-ids $DATA_A $DATA_B

# Launch RDS PostgreSQL 16: Multi-AZ, encrypted, automated backups kept 7 days.
# --manage-master-user-password makes RDS generate the master password and
# store it in Secrets Manager, so it never appears in shell history.
# gp3 includes a 3,000 IOPS baseline at this size, so no --iops flag is needed.
aws rds create-db-instance \
  --db-instance-identifier ha-web-db \
  --db-instance-class db.t4g.medium \
  --engine postgres \
  --engine-version 16 \
  --master-username appuser \
  --manage-master-user-password \
  --db-name appdb \
  --db-subnet-group-name ha-db-subnet-group \
  --vpc-security-group-ids $DB_SG \
  --multi-az \
  --storage-type gp3 \
  --allocated-storage 100 \
  --storage-encrypted \
  --backup-retention-period 7 \
  --preferred-backup-window "02:00-03:00" \
  --deletion-protection \
  --no-publicly-accessible

# Poll until available (typically 5-10 minutes)
aws rds wait db-instance-available --db-instance-identifier ha-web-db
DB_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier ha-web-db \
  --query 'DBInstances[0].Endpoint.Address' --output text)
echo "RDS endpoint: $DB_ENDPOINT"
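
Before moving on, it can be worth confirming that the standby really landed in a different AZ from the primary. The following is a small sketch under the assumption that the AWS CLI's text output renders the MultiAZ boolean as True; the helper names are hypothetical:

```shell
# Hypothetical helper: print "<MultiAZ> <primary AZ> <standby AZ>" for an instance.
show_multi_az_placement() {
  aws rds describe-db-instances --db-instance-identifier "$1" \
    --query 'DBInstances[0].[MultiAZ,AvailabilityZone,SecondaryAvailabilityZone]' \
    --output text
}

# Returns 0 when the instance is Multi-AZ and the two AZs differ.
is_spread_across_azs() {
  local multi primary standby
  read -r multi primary standby <<<"$(show_multi_az_placement "$1")"
  [ "$multi" = "True" ] && [ -n "$standby" ] && [ "$primary" != "$standby" ]
}

# Usage (uncomment to run against your account):
# is_spread_across_azs ha-web-db && echo "standby is in a separate AZ"
```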

Step 4 — Launch Template for the ASG

The Launch Template defines the exact specification for every instance the ASG creates. User data bootstraps the application. Pass the RDS endpoint via SSM Parameter Store — never bake secrets into the Launch Template or an AMI.

# Store the DB endpoint in SSM so user data can fetch it at boot
aws ssm put-parameter \
  --name /ha-web/db-endpoint \
  --value "$DB_ENDPOINT" \
  --type String --overwrite

# IAM instance profile allowing SSM access and CloudWatch agent
# (assume IAM role ha-web-instance-role with AmazonSSMManagedInstanceCore
#  + CloudWatchAgentServerPolicy already created)

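# User data script (base64-encoded for the CLI; -w0 is the GNU coreutils flag that disables line wrapping)
USER_DATA=$(base64 -w0 <<'USERDATA'
#!/bin/bash
set -euo pipefail
yum update -y
yum install -y amazon-ssm-agent amazon-cloudwatch-agent python3 python3-pip

# Fetch config from SSM
DB_HOST=$(aws ssm get-parameter --name /ha-web/db-endpoint --query 'Parameter.Value' --output text --region us-east-1)

# Install and start your app (example: a simple Python/Gunicorn service)
pip3 install flask gunicorn psycopg2-binary
mkdir -p /opt/app
cat > /opt/app/app.py <<'EOF'
from flask import Flask
app = Flask(__name__)

@app.route('/health')
def health():
    return {'status': 'ok'}, 200

@app.route('/')
def home():
    return {'message': 'Hello from HA Web Tier'}, 200
EOF

# systemd unit that runs Gunicorn on port 8080 (the Target Group port) and
# exposes the database host to the app as DB_HOST.
# Assumes pip3 placed gunicorn in /usr/local/bin; adjust ExecStart if not.
cat > /etc/systemd/system/gunicorn-app.service <<EOF
[Unit]
Description=Gunicorn app server
After=network-online.target

[Service]
WorkingDirectory=/opt/app
Environment=DB_HOST=${DB_HOST}
ExecStart=/usr/local/bin/gunicorn --bind 0.0.0.0:8080 app:app
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable gunicorn-app
systemctl start gunicorn-app
USERDATA
)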

# Create Launch Template
LT_ID=$(aws ec2 create-launch-template \
  --launch-template-name ha-web-lt \
  --version-description "v1" \
  --launch-template-data "{
    \"ImageId\": \"ami-0c02fb55956c7d316\",
    \"InstanceType\": \"t3.small\",
    \"SecurityGroupIds\": [\"$APP_SG\"],
    \"IamInstanceProfile\": {\"Name\": \"ha-web-instance-profile\"},
    \"UserData\": \"$USER_DATA\",
    \"MetadataOptions\": {\"HttpTokens\": \"required\", \"HttpEndpoint\": \"enabled\"},
    \"BlockDeviceMappings\": [{
      \"DeviceName\": \"/dev/xvda\",
      \"Ebs\": {\"VolumeSize\": 30, \"VolumeType\": \"gp3\", \"Encrypted\": true, \"DeleteOnTermination\": true}
    }],
    \"TagSpecifications\": [{
      \"ResourceType\": \"instance\",
      \"Tags\": [{\"Key\": \"Name\", \"Value\": \"ha-web-asg\"}, {\"Key\": \"Env\", \"Value\": \"prod\"}]
    }]
  }" \
  --query 'LaunchTemplate.LaunchTemplateId' --output text)

Set "HttpTokens": "required" in MetadataOptions on every Launch Template. This enforces IMDSv2, which requires a session token on every metadata request. That blocks the simple SSRF pattern of tricking the application into fetching instance credentials from the metadata endpoint, a well-known attack vector against EC2 workloads.
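
To see what IMDSv2 changes for code running on the instance, here is a minimal sketch of the two-step request it requires: a PUT to obtain a session token, then a GET that presents the token. The imds_get function and the IMDS_BASE override are hypothetical names introduced for illustration; the endpoint and header names are the standard IMDSv2 ones.

```shell
# Sketch of an IMDSv2 lookup. IMDS_BASE is a hypothetical override that lets
# the function be pointed at a test endpoint; the default is the real IMDS.
IMDS_BASE="${IMDS_BASE:-http://169.254.169.254}"

imds_get() {
  local path="$1" token
  # Step 1: obtain a session token with a PUT request (TTL in seconds).
  token=$(curl -sf -X PUT "$IMDS_BASE/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300") || return 1
  # Step 2: present the token on the actual metadata request.
  curl -sf -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS_BASE/latest/meta-data/$path"
}

# Usage on an EC2 instance:
# imds_get instance-id
```

A plain GET without the token header is rejected once HttpTokens is set to required, which is why a typical SSRF bug that can only issue simple GET requests no longer reaches the credentials.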

Step 5 — ALB and Target Group

Create an internet-facing ALB in both public subnets. Attach a Target Group with a health check path — the ASG registers instances against this Target Group, and the ALB only routes traffic to instances that pass the health check.

# Target Group (HTTP on port 8080, health check at /health)
TG_ARN=$(aws elbv2 create-target-group \
  --name ha-web-tg \
  --protocol HTTP \
  --port 8080 \
  --vpc-id $VPC_ID \
  --target-type instance \
  --health-check-protocol HTTP \
  --health-check-path /health \
  --health-check-interval-seconds 20 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200 \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

# Internet-facing ALB spanning both public subnets
ALB_ARN=$(aws elbv2 create-load-balancer \
  --name ha-web-alb \
  --type application \
  --scheme internet-facing \
  --security-groups $ALB_SG \
  --subnets $PUB_A $PUB_B \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)

# HTTPS listener (assumes ACM certificate already exists)
CERT_ARN="arn:aws:acm:us-east-1:123456789012:certificate/your-cert-id"
aws elbv2 create-listener \
  --load-balancer-arn $ALB_ARN \
  --protocol HTTPS --port 443 \
  --certificates CertificateArn=$CERT_ARN \
  --default-actions Type=forward,TargetGroupArn=$TG_ARN

# HTTP → HTTPS redirect listener
aws elbv2 create-listener \
  --load-balancer-arn $ALB_ARN \
  --protocol HTTP --port 80 \
  --default-actions \
    Type=redirect,RedirectConfig="{Protocol=HTTPS,Port=443,StatusCode=HTTP_301}"

ALB_DNS=$(aws elbv2 describe-load-balancers \
  --load-balancer-arns $ALB_ARN \
  --query 'LoadBalancers[0].DNSName' --output text)
echo "ALB DNS: $ALB_DNS"
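
The health check settings on the Target Group determine how quickly the ALB reacts to a failing instance. As a rough estimate (ignoring the per-check timeout and where in the interval the failure happens), the time to change state is the interval multiplied by the threshold. A small worked calculation with the values used above; the function name is hypothetical:

```shell
# Approximate seconds for a target to change state:
# consecutive checks required x seconds between checks.
health_transition_seconds() {
  local interval="$1" threshold="$2"
  echo $(( interval * threshold ))
}

# Values from the Target Group above: 20s interval, 3 unhealthy / 2 healthy checks.
echo "Marked unhealthy after ~$(health_transition_seconds 20 3)s"
echo "Marked healthy after ~$(health_transition_seconds 20 2)s"
```

With these settings a broken instance keeps receiving traffic for roughly a minute, and a new instance needs about 40 seconds of passing checks before it serves requests. Lowering the interval shortens both windows at the cost of more health check traffic.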

Step 6 — Auto Scaling Group

The ASG references the Launch Template, spans both private subnets, and registers new instances into the Target Group automatically. Use a Target Tracking scaling policy on average CPU: AWS adjusts the fleet size to keep CPU near the target. Combine it with a minimum of 2 instances, which the ASG balances across the two AZs by default, so that losing one AZ still leaves an instance serving traffic.

# Create the ASG (min 2 / desired 2 / max 10, across both private subnets)
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name ha-web-asg \
  --launch-template "LaunchTemplateId=$LT_ID,Version=\$Latest" \
  --min-size 2 \
  --max-size 10 \
  --desired-capacity 2 \
  --vpc-zone-identifier "$PRIV_A,$PRIV_B" \
  --target-group-arns $TG_ARN \
  --health-check-type ELB \
  --health-check-grace-period 120 \
  --default-instance-warmup 60 \
  --tags "Key=Name,Value=ha-web-asg,PropagateAtLaunch=true"

# Target Tracking policy — scale to keep average CPU at 60%
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name ha-web-asg \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "TargetValue": 60.0,
    "DisableScaleIn": false
  }'

Set --health-check-type ELB, not the default EC2. With EC2 health checks, the ASG replaces an instance only when it fails EC2 status checks or leaves the running state. With ELB health checks, if your application process crashes or the /health endpoint stops returning 200, the ALB marks the instance unhealthy and the ASG terminates it and launches a replacement automatically. This is the critical difference between "server is alive" and "application is healthy".
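
To watch the ASG's own view of instance health, and to confirm that healthy capacity exists in each AZ, the group's instance list can be summarized per AZ. This is a sketch only; the healthy_per_az helper is a hypothetical name, and it expects one tab- or space-separated line per instance as produced by the query in the usage comment.

```shell
# Hypothetical helper: read "<instance-id> <az> <health> <state>" lines on
# stdin and print the number of Healthy + InService instances per AZ.
healthy_per_az() {
  awk '$3 == "Healthy" && $4 == "InService" { n[$2]++ }
       END { for (az in n) print az, n[az] }' | sort
}

# Usage against the live group:
# aws autoscaling describe-auto-scaling-groups \
#   --auto-scaling-group-names ha-web-asg \
#   --query 'AutoScalingGroups[0].Instances[].[InstanceId,AvailabilityZone,HealthStatus,LifecycleState]' \
#   --output text | healthy_per_az
```

An AZ that is missing from the output has no healthy, in-service instance at that moment.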

Validating the Stack

Once both instances are InService in the ASG and healthy in the Target Group, run these validation steps to confirm each HA property works:

# 1. Verify instances are healthy in the Target Group
aws elbv2 describe-target-health --target-group-arn $TG_ARN \
  --query 'TargetHealthDescriptions[*].{Instance:Target.Id,State:TargetHealth.State}'

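# 2. Confirm the ALB returns 200 from your application
#    (-k skips certificate verification: the ACM certificate covers your own
#    domain, not the ALB's amazonaws.com DNS name)
curl -skI https://$ALB_DNS/health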

# 3. Simulate instance failure: terminate one instance and watch the ASG replace it
#    (query this group by name so instances from other ASGs are never picked)
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names ha-web-asg \
  --query 'AutoScalingGroups[0].Instances[0].InstanceId' --output text)
aws ec2 terminate-instances --instance-ids $INSTANCE_ID
# Within ~2 minutes, the ASG launches a replacement to restore the desired capacity

# 4. Watch ASG activity log
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name ha-web-asg \
  --max-items 5

# 5. Test RDS failover (forces a Multi-AZ switchover, ~60s downtime)
aws rds reboot-db-instance \
  --db-instance-identifier ha-web-db \
  --force-failover
# Your app must handle reconnect — connection pool with retry is required
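
The reconnect requirement in step 5 comes down to retrying with a growing delay until the new primary accepts connections. In a real application this logic lives in the connection pool; the sketch below only illustrates the same pattern in shell, and the retry_with_backoff name is hypothetical.

```shell
# Run a command until it succeeds, doubling the delay after each failure.
# Arguments: max attempts, initial delay in seconds, then the command to run.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Usage during a forced failover (pg_isready ships with the PostgreSQL client
# tools and must run from a host that can reach the data subnets):
# retry_with_backoff 8 1 pg_isready -h "$DB_ENDPOINT" -p 5432
```

Eight attempts starting at 1 second wait up to 127 seconds in total, which covers the 60–120 second failover window mentioned in Step 3.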

Production Failure Modes to Know

The architecture is resilient, but it is not immune to every failure class. Plan for these edge cases:

Thundering herd on scale-out: When the ASG launches 6 new instances simultaneously during a traffic spike, all 6 open their database connection pools at once. Size max_connections via a DB parameter group and put a connection pooler in front of the database, such as PgBouncer in transaction mode or RDS Proxy, to absorb the burst.
Health check grace period too short: If --health-check-grace-period is less than your application boot time, the ASG will terminate the instance before it is ready and enter a launch-terminate loop. Set it to at least 1.5× your observed cold-start time.
Stale Launch Template: The ASG above references $Latest, so every new Launch Template version takes effect on the very next launch. A scale-out that fires while you are still validating a new version will launch it untested. In critical production environments, pin the ASG to an explicit version number and change it deliberately.
Single NAT Gateway SPOF: The setup above uses one NAT Gateway in AZ-a. If AZ-a has a partial outage, instances in AZ-b lose outbound internet access. For true HA, deploy one NAT Gateway per AZ and route each private subnet to its local NAT.
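
For the NAT Gateway item, a per-AZ layout could look like the sketch below. It is an alternative to the single shared PRIV_RT from Step 1, not an addition to it, and it reuses only CLI calls already shown in this lesson. The function name create_az_nat_route is hypothetical.

```shell
# Hypothetical helper: create a NAT Gateway in one public subnet and a route
# table that sends the matching private subnet's outbound traffic through it.
create_az_nat_route() {
  local vpc_id="$1" public_subnet="$2" private_subnet="$3"
  local eip nat rt
  eip=$(aws ec2 allocate-address --domain vpc \
    --query 'AllocationId' --output text) || return 1
  nat=$(aws ec2 create-nat-gateway --subnet-id "$public_subnet" \
    --allocation-id "$eip" \
    --query 'NatGateway.NatGatewayId' --output text) || return 1
  # The route cannot be created until the gateway is available.
  aws ec2 wait nat-gateway-available --nat-gateway-ids "$nat" || return 1
  rt=$(aws ec2 create-route-table --vpc-id "$vpc_id" \
    --query 'RouteTable.RouteTableId' --output text) || return 1
  aws ec2 create-route --route-table-id "$rt" \
    --destination-cidr-block 0.0.0.0/0 --nat-gateway-id "$nat" >/dev/null || return 1
  aws ec2 associate-route-table --route-table-id "$rt" \
    --subnet-id "$private_subnet" >/dev/null || return 1
  # Print both IDs so the caller can record them.
  echo "$nat $rt"
}

# Usage, one call per AZ (uncomment to run against your account):
# create_az_nat_route "$VPC_ID" "$PUB_A" "$PRIV_A"
# create_az_nat_route "$VPC_ID" "$PUB_B" "$PRIV_B"
```

Each NAT Gateway and Elastic IP is billed separately, so this trades a higher fixed cost for removing the cross-AZ dependency.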