Cloud Architecture & Landing Zones

The Well-Architected Framework

Many of the cloud failures you read about (a multi-hour outage that cascaded because one team disabled retries, a data breach caused by a wildcard IAM policy, a $500k monthly bill from forgotten resources) share a common root: the architecture was never evaluated against a consistent, principled framework. AWS's Well-Architected Framework (abbreviated WAF in this lesson, not to be confused with the AWS WAF firewall service) is such a framework, and it is the lens through which senior engineers and solutions architects at large technology companies commonly review infrastructure before it goes anywhere near production.

This lesson teaches you how the six pillars work, what the AWS Well-Architected Tool produces, and — critically — how to run a review that actually changes behavior rather than collecting dust as a PDF.

The Six Pillars

Each pillar is a domain of concern. They are not independent; trade-offs between them are the point of architectural decision-making.

Operational Excellence — Run and monitor systems to deliver business value and continuously improve supporting processes. Key practices: infrastructure as code, runbooks as code, event-driven operations, post-incident reviews without blame.
Security — Protect data, systems, and assets. Covers identity and access management, detection, infrastructure protection, data protection, and incident response. The first-principles rule: apply security at every layer, never rely on a single control.
Reliability — Recover from failures and meet demand dynamically. This pillar is where SLOs, chaos engineering, multi-AZ deployments, and backups live. The core insight: design for failure, not against it.
Performance Efficiency — Use computing resources efficiently and maintain that efficiency as demand changes. Covers right-sizing, selecting the right database engine, caching strategy, and benchmarking under load.
Cost Optimization — Avoid unnecessary cost while understanding where every dollar is spent. Covers Reserved Instances, Savings Plans, auto-scaling, architectural simplification, and waste elimination.
Sustainability (added 2021) — Minimize the environmental impact of running cloud workloads. Covers region selection by grid carbon intensity, rightsizing to avoid idle compute, and Graviton/ARM adoption for better perf-per-watt.

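How AWS scores a pillar: each pillar contains a set of best-practice questions; the count varies by pillar and framework version, from about five to about thirteen in recent versions of the standard lens. Each question lists best-practice choices, and the tool rates the question from the choices you select: missing practices that AWS treats as critical produce a High Risk Issue (HRI), less critical gaps produce a Medium Risk Issue (MRI), and questions you have not answered are tracked separately as unanswered. The output is a prioritized list of findings, not a pass/fail grade.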

The Well-Architected Tool

The AWS Well-Architected Tool is a free, first-party service inside the AWS Console. It stores workload definitions, tracks review history through saved milestones, and generates an improvement plan. For multi-account orgs the tool integrates with AWS Organizations: workloads and custom lenses can be shared with the whole organization or with specific organizational units, so a central team can see reviews across accounts once the owning teams share them.

You can also drive it entirely from the CLI — useful for GitOps-style review automation where review state lives in your repo alongside your Terraform.

# ── Create a new workload ──────────────────────────────────────────────────────
aws wellarchitected create-workload \
  --workload-name "payments-service-prod" \
  --description "Payment processing, PCI-DSS scope" \
  --review-owner "platform-eng@example.com" \
  --environment PRODUCTION \
  --aws-regions us-east-1 eu-west-1 \
  --lenses "wellarchitected" \
  --tags Team=platform,CostCenter=12345

# ── List all questions for the Security pillar ────────────────────────────────
# Placeholder ID: use the WorkloadId returned by create-workload
WORKLOAD_ID="abc12345678"
LENS_ALIAS="wellarchitected"

# list-answers returns every question in the pillar with its current risk rating
aws wellarchitected list-answers \
  --workload-id "$WORKLOAD_ID" \
  --lens-alias "$LENS_ALIAS" \
  --pillar-id security \
  --query 'AnswerSummaries[*].[QuestionId,QuestionTitle,Risk]' \
  --output table

# ── Pull the current risk summary after answering questions ───────────────────
# One row per pillar, with the HIGH and MEDIUM counts as separate columns
aws wellarchitected get-lens-review \
  --workload-id "$WORKLOAD_ID" \
  --lens-alias "$LENS_ALIAS" \
  --query 'LensReview.PillarReviewSummaries[*].{Pillar:PillarName,High:RiskCounts.HIGH,Medium:RiskCounts.MEDIUM}' \
  --output table

Running a Review That Actually Matters

The tool is only as useful as the review process around it. Rubber-stamping questions in a solo session produces a meaningless artifact. The process that works at scale:

Define the workload boundary clearly. A WAF workload is a single deployable unit with a defined owner, not "all of production." Scope it to a service or a bounded context.
Run the review with the people who built it. Include the senior engineer, a security champion, and a product lead. The tool forces conversations that siloed reviews miss.
Time-box to 90 minutes per pillar. Attempting all six pillars in one session produces fatigue and shallow answers. Spread across a sprint.
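Record every accepted risk. The tool lets you attach notes to each question and mark a question or an individual best practice as not applicable, with a reason. Use these to document why an HRI is being accepted and what mitigates it; the record gives compliance teams an auditable trail.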
Export the improvement plan and create tickets. The output is not a report you file; it is a backlog of engineering work. Paste findings into Jira/Linear and assign owners before the session ends.
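The last step above can be partly scripted. The following is a minimal sketch, assuming AWS CLI v2 with credentials for the account that owns the workload. The helper names `format_ticket` and `export_hri_tickets` and the ticket line format are hypothetical, and pagination and the actual push into Jira or Linear are left out.

```shell
#!/bin/sh
# Hypothetical helpers: turn high-risk improvement items into ticket-ready lines.

# Format one tab-separated record as a single ticket summary line.
format_ticket() {
  # $1 = workload name
  # $2 = "QuestionId<TAB>QuestionTitle<TAB>ImprovementPlanUrl"
  printf '%s\n' "$2" | awk -F '\t' -v wl="$1" \
    '{ printf "[WA][%s] HRI %s: %s (plan: %s)\n", wl, $1, $2, $3 }'
}

# List HIGH-risk improvement items for a workload, one ticket line per item.
# Requires AWS credentials; defined here but not called automatically.
export_hri_tickets() {
  # $1 = workload id, $2 = workload name, $3 = lens alias (optional)
  aws wellarchitected list-lens-review-improvements \
    --workload-id "$1" \
    --lens-alias "${3:-wellarchitected}" \
    --query "ImprovementSummaries[?Risk=='HIGH'].[QuestionId,QuestionTitle,ImprovementPlanUrl]" \
    --output text |
  while IFS= read -r record; do
    format_ticket "$2" "$record"
  done
}

# Example (needs credentials):
#   export_hri_tickets "$WORKLOAD_ID" "payments-service-prod"
```

Each printed line can then be fed to whatever ticket-creation command or API your tracker provides, so that every HRI has an owner before the session ends.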

Continuous review, not annual audits: schedule a lightweight pillar-level review at the start of each quarter. Save a milestone after each review and use the CLI to diff the current risk counts against the previous milestone; any new HRIs that appeared since then are a signal worth investigating immediately.
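A rough sketch of that diff follows, assuming AWS CLI v2 and that the previous review was saved as a milestone in the tool. The helper names `to_count`, `hri_delta`, and `fetch_hri_count` are hypothetical, and the milestone number in the example is a placeholder.

```shell
#!/bin/sh
# Hypothetical helpers for comparing HRI counts between review snapshots.

# Normalize a CLI text value: the CLI prints "None" when a key is absent.
to_count() {
  case "$1" in
    ''|None|null) echo 0 ;;
    *) echo "$1" ;;
  esac
}

# Print the change in HRI count; a positive number means new high risks.
hri_delta() {
  # $1 = previous count, $2 = current count
  echo $(( $(to_count "$2") - $(to_count "$1") ))
}

# Fetch the HIGH risk count for a workload, optionally at a saved milestone.
# Requires AWS credentials; defined here but not called automatically.
fetch_hri_count() {
  # $1 = workload id, $2 = milestone number (omit for the current state)
  if [ -n "${2:-}" ]; then
    aws wellarchitected get-lens-review --workload-id "$1" \
      --lens-alias wellarchitected --milestone-number "$2" \
      --query 'LensReview.RiskCounts.HIGH' --output text
  else
    aws wellarchitected get-lens-review --workload-id "$1" \
      --lens-alias wellarchitected \
      --query 'LensReview.RiskCounts.HIGH' --output text
  fi
}

# Example (needs credentials; milestone 3 is a placeholder):
#   prev=$(fetch_hri_count "$WORKLOAD_ID" 3)
#   curr=$(fetch_hri_count "$WORKLOAD_ID")
#   echo "HRI delta since milestone 3: $(hri_delta "$prev" "$curr")"
```

Milestone numbers for a workload can be looked up with `aws wellarchitected list-milestones`.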

Pillar Trade-offs in Practice

Real architecture decisions force you to trade one pillar against another. Three examples you will encounter in your first year of platform work:

Reliability vs. Cost: Multi-AZ RDS costs roughly 2× a single-AZ instance, because a standby replica runs alongside the primary. For an internal analytics dashboard the trade-off is clear (take the risk); for a payments API it is not negotiable. WAF makes you articulate this choice explicitly rather than letting it happen by default.
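Performance vs. Sustainability: Provisioned IOPS SSD (io2) can deliver higher and more consistent IOPS than gp3 for certain workloads, but IOPS you provision and do not use is idle capacity, which the Sustainability pillar treats as waste. ARM-based Graviton instances often beat x86 on both performance and energy use; check before assuming a trade-off exists.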
Security vs. Operational Excellence: Enforcing IMDSv2 (Instance Metadata Service v2) on all EC2 instances closes a credential-theft vector, but breaks poorly written scripts that still call the v1 endpoint. The right answer is to fix the scripts, not to leave the control off — but the WAF review is where you surface that technical debt.

# ── Enforce IMDSv2 on a new instance (Security pillar best practice) ───────────
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t4g.medium \
  --metadata-options "HttpTokens=required,HttpPutResponseHopLimit=1,HttpEndpoint=enabled" \
  --count 1

# ── Check existing fleet for IMDSv1 exposure (find the technical debt) ─────────
# Filter server-side for running instances that still accept IMDSv1 (HttpTokens=optional)
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
            "Name=metadata-options.http-tokens,Values=optional" \
  --query 'Reservations[].Instances[].{ID:InstanceId,HttpTokens:MetadataOptions.HttpTokens}' \
  --output table
# Every instance listed still accepts IMDSv1: raise it as an HRI in the Security pillar review
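Before flipping the control on an existing instance, it helps to know whether anything on it still calls the v1 endpoint. The sketch below assumes the EC2 CloudWatch metric `MetadataNoToken` (which counts token-less metadata calls) is available for the instance. The helper names and the one-day period are illustrative choices, not a prescribed procedure.

```shell
#!/bin/sh
# Hypothetical helpers for retiring IMDSv1 on an existing instance.

# Succeeds (exit 0) when the summed MetadataNoToken metric shows IMDSv1 calls.
imdsv1_in_use() {
  # $1 = metric sum as printed by the CLI, e.g. "0", "12.0", or "None"
  awk -v v="$1" 'BEGIN { exit ((v + 0 > 0) ? 0 : 1) }'
}

# Sum token-less (IMDSv1) metadata calls for one instance over a time window.
# Requires AWS credentials; defined here but not called automatically.
imdsv1_call_count() {
  # $1 = instance id, $2 = start time (ISO 8601), $3 = end time (ISO 8601)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name MetadataNoToken \
    --dimensions "Name=InstanceId,Value=$1" \
    --start-time "$2" --end-time "$3" \
    --period 86400 --statistics Sum \
    --query 'sum(Datapoints[].Sum)' --output text
}

# Require IMDSv2 only when no IMDSv1 calls were observed in the window.
enforce_imdsv2_if_safe() {
  # $1 = instance id, $2 = start time, $3 = end time
  calls=$(imdsv1_call_count "$1" "$2" "$3")
  if imdsv1_in_use "$calls"; then
    echo "$1 still makes IMDSv1 calls ($calls); fix the callers first"
    return 1
  fi
  aws ec2 modify-instance-metadata-options \
    --instance-id "$1" --http-tokens required --http-endpoint enabled
}
```

Instances that fail the check are the concrete list of scripts to fix, which is the technical debt the review is meant to surface.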

Custom Lenses

The built-in WAF lens covers general cloud best practices. For regulated industries or internal standards you can author custom lenses: JSON documents that define your own pillars, questions, choices, and the risk rules that map selected choices to a risk level. Big-tech platform teams publish internal lenses that layer on top of the standard one: a fintech might add PCI-DSS controls as WAF questions; an enterprise might encode their internal tagging standard as a lens pillar.
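To make the shape of a custom lens concrete, here is a minimal sketch of the tagging-standard example. Every id, title, and URL in it is hypothetical, and the field names and `schemaVersion` value reflect the custom lens JSON format as I understand it; verify them against the current lens format specification before importing.

```shell
#!/bin/sh
# Writes a minimal, hypothetical custom lens definition to a file.
write_tagging_lens() {
  # $1 = output path
  cat > "$1" <<'EOF'
{
  "schemaVersion": "2021-11-01",
  "name": "Internal Tagging Standard",
  "description": "Hypothetical lens encoding an internal tagging policy.",
  "pillars": [
    {
      "id": "tagging",
      "name": "Tagging",
      "questions": [
        {
          "id": "tag_ownership",
          "title": "How do you identify the owner of every resource?",
          "description": "Resources should carry Team and CostCenter tags.",
          "choices": [
            {
              "id": "tag_enforced",
              "title": "Required tags are enforced at creation time",
              "helpfulResource": {
                "displayText": "A tag policy or SCP rejects untagged resources.",
                "url": "https://example.com/tagging"
              },
              "improvementPlan": {
                "displayText": "Add a tag policy requiring Team and CostCenter.",
                "url": "https://example.com/tagging"
              }
            },
            {
              "id": "tag_audited",
              "title": "Tag compliance is audited on a schedule",
              "helpfulResource": {
                "displayText": "A scheduled job reports untagged resources.",
                "url": "https://example.com/tagging"
              },
              "improvementPlan": {
                "displayText": "Schedule a recurring tag compliance report.",
                "url": "https://example.com/tagging"
              }
            }
          ],
          "riskRules": [
            { "condition": "tag_enforced && tag_audited", "risk": "NO_RISK" },
            { "condition": "tag_enforced || tag_audited", "risk": "MEDIUM_RISK" },
            { "condition": "default", "risk": "HIGH_RISK" }
          ]
        }
      ]
    }
  ]
}
EOF
}

# Example (needs credentials):
#   write_tagging_lens tagging-lens.json
#   aws wellarchitected import-lens --json-string file://tagging-lens.json
#   aws wellarchitected create-lens-version --lens-alias "<lens ARN from import>" \
#     --lens-version 1.0.0 --is-major-version
```

The `riskRules` list is evaluated against the choices a reviewer selects, with `default` as the fallback; that is how a lens turns an internal standard into HRI and MRI findings.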

Custom lens versioning: AWS Well-Architected custom lenses are versioned, and publishing a new version does not automatically update workloads that use an older version. Build a Lambda that polls for stale lens versions and opens a ticket — otherwise teams stay on outdated standards for months without realizing it.
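The staleness check itself is small. The sketch below is a shell version that could run from a scheduled CI job, not the Lambda described above; it assumes AWS CLI v2, and the helper names are hypothetical. Opening the ticket is left to your tracker's own tooling.

```shell
#!/bin/sh
# Hypothetical helpers for spotting workloads reviewed against an old lens version.

# Succeeds when a lens review status means the workload is on an outdated lens.
lens_is_stale() {
  # $1 = LensStatus as returned by list-lens-reviews
  case "$1" in
    NOT_CURRENT|DEPRECATED) return 0 ;;
    *) return 1 ;;
  esac
}

# Print "<lens alias> <status>" for each stale lens applied to a workload.
# Requires AWS credentials; defined here but not called automatically.
report_stale_lenses() {
  # $1 = workload id
  aws wellarchitected list-lens-reviews --workload-id "$1" \
    --query 'LensReviewSummaries[].[LensAlias,LensStatus]' --output text |
  while read -r lens_alias lens_status; do
    if lens_is_stale "$lens_status"; then
      echo "$lens_alias $lens_status"
    fi
  done
}

# Example (needs credentials):
#   report_stale_lenses "$WORKLOAD_ID"
# After the owning team agrees, the review can be moved to the new version with:
#   aws wellarchitected upgrade-lens-review --workload-id "$WORKLOAD_ID" \
#     --lens-alias "<lens alias or ARN>" --milestone-name "pre-upgrade"
```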

Connecting WAF to Your IaC Pipeline

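The most mature teams integrate WAF into their Terraform workflow. The pattern: after every terraform apply to a production environment, a CI step calls the Well-Architected Tool API, fetches the current HRI count for the affected workload, and posts a summary comment to the pull request. If the HRI count increased, meaning the change introduced a new architectural risk, the PR is flagged for architectural review before merge. Keep in mind that the count reflects the answers recorded in the review, not a scan of deployed resources, so the gate is only meaningful when the review answers are updated as part of the change.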

This turns the Well-Architected Framework from a quarterly ceremony into a continuous, automated guardrail embedded in the delivery pipeline, which is the standard that large platform teams aim for.
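A minimal sketch of such a CI step follows. It assumes AWS CLI v2 and the GitHub CLI (`gh`) are available in the pipeline; the function name `check_hri_gate`, the baseline file `.wa/hri-baseline`, and the `PR_NUMBER` variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical CI gate: flag a change when the workload's HRI count rises.

# Prints a one-line summary; returns 1 when the count increased.
check_hri_gate() {
  # $1 = baseline HRI count, $2 = current HRI count ("None" means zero)
  base=$1
  cur=$2
  case "$cur" in ''|None) cur=0 ;; esac
  if [ "$cur" -gt "$base" ]; then
    echo "HRI count rose from $base to $cur: architectural review required"
    return 1
  fi
  echo "HRI count $cur (baseline $base): no new high risks"
  return 0
}

# Example CI usage (needs AWS and GitHub credentials):
#   current=$(aws wellarchitected get-lens-review --workload-id "$WORKLOAD_ID" \
#     --lens-alias wellarchitected \
#     --query 'LensReview.RiskCounts.HIGH' --output text)
#   baseline=$(cat .wa/hri-baseline)   # hypothetical file committed to the repo
#   status=0
#   summary=$(check_hri_gate "$baseline" "$current") || status=1
#   gh pr comment "$PR_NUMBER" --body "$summary"
#   exit "$status"
```

Keeping the baseline in the repository means a reviewer can see, in the same pull request, both the infrastructure change and any deliberate change to the accepted risk level.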