Cloud Architecture & Landing Zones

Project: An Enterprise Landing Zone Design

Everything in this tutorial — Well-Architected principles, multi-account strategy, Control Tower guardrails, identity federation, network architecture at org scale, hybrid connectivity, resilient patterns, and governance — converges in this lesson. You are the platform engineer at Acme Corp, a 1,200-person SaaS company entering a regulated market. Your task: design a production-grade AWS landing zone from scratch, justify every decision, and produce the Terraform + AWS CLI configuration to deploy it. This is what a senior DevOps engineer submits at a FAANG-tier company.

The Business Context

Acme Corp has three business units: Payments (PCI-DSS scope), Analytics (internal BI, no regulated data), and Platform (shared services — DNS, secrets, observability). They run 40 development teams. Security requires blast-radius isolation. Finance wants per-team cost visibility. Legal requires data residency in us-east-1 with a DR replica in us-west-2. Audit expects evidence of guardrails enforced without human intervention.

Account Structure Design

Big-tech standard: one AWS account per environment per workload team, governed by an AWS Organizations hierarchy. The root OU is never used for workloads. The structure below maps directly to a Control Tower deployment.

Acme Corp AWS Organizations hierarchy: 5 OUs, purpose-built accounts, workload sub-OUs per business unit.

The management account has one job: hosting AWS Organizations itself, which covers consolidated billing and the SCPs attached to the root and each OU. No workload resources ever run inside it. SCPs do not restrict the management account, and a compromised management account can close or remove every account in your org, so protect it with hardware MFA, no programmatic access keys, and a separate break-glass procedure documented in your runbook.
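The hierarchy described above could be sketched in Terraform roughly as follows. This is a minimal illustration, not Acme Corp's actual layout: only the Workloads OU and its Payments sub-OU are named elsewhere in this lesson, so the `Security` and `Infrastructure` OU names, and the `org` data source label, are assumptions.

```hcl
# Sketch only: top-level OU names other than "Workloads" are assumptions.
data "aws_organizations_organization" "org" {}

locals {
  # Workload accounts never sit directly under the root; it is only a parent for OUs.
  root_id = data.aws_organizations_organization.org.roots[0].id
}

resource "aws_organizations_organizational_unit" "security" {
  name      = "Security" # hypothetical: log archive and audit accounts
  parent_id = local.root_id
}

resource "aws_organizations_organizational_unit" "infrastructure" {
  name      = "Infrastructure" # hypothetical: network and shared-services accounts
  parent_id = local.root_id
}

resource "aws_organizations_organizational_unit" "workloads" {
  name      = "Workloads"
  parent_id = local.root_id
}

# One sub-OU per business unit, so OU-specific SCPs (such as PCI controls)
# can be attached to Payments without touching the others.
resource "aws_organizations_organizational_unit" "payments" {
  name      = "Payments"
  parent_id = aws_organizations_organizational_unit.workloads.id
}

resource "aws_organizations_organizational_unit" "analytics" {
  name      = "Analytics"
  parent_id = aws_organizations_organizational_unit.workloads.id
}
```

Nesting business units under a single Workloads OU keeps org-wide workload controls in one place while still giving each unit its own attachment point.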

Network Architecture

The network account owns the Transit Gateway (TGW) and acts as the central hub. Each workload account attaches its VPC as a spoke. The diagram below shows the full topology: shared services reachable by all spokes, Payments isolated from Analytics at the TGW route table level, and on-premises connectivity entering through a single VPN/Direct Connect attachment.

Hub-and-spoke Transit Gateway topology. Payments VPC is isolated from Analytics at the TGW route table level. Shared Services VPC is reachable from all spokes.
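One way to express the isolation described above is a separate TGW route table per spoke, with propagation used to decide who can see whom. The sketch below assumes the three VPC attachments already exist on this Transit Gateway (in practice each spoke account creates its own after the TGW is shared through AWS RAM); the variable names are hypothetical.

```hcl
# Hypothetical inputs: IDs of VPC attachments created by each spoke account.
variable "payments_attachment_id" { type = string }
variable "analytics_attachment_id" { type = string }
variable "shared_attachment_id" { type = string }

locals {
  spokes = {
    payments  = var.payments_attachment_id
    analytics = var.analytics_attachment_id
  }
}

resource "aws_ec2_transit_gateway" "hub" {
  description = "Org hub"
  # Disable the defaults so no attachment is routable until explicitly wired up.
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
}

# One route table per spoke, plus one for shared services.
resource "aws_ec2_transit_gateway_route_table" "spoke" {
  for_each           = local.spokes
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
}

resource "aws_ec2_transit_gateway_route_table" "shared" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
}

# Each spoke attachment uses its own route table for outbound lookups...
resource "aws_ec2_transit_gateway_route_table_association" "spoke" {
  for_each                       = local.spokes
  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.spoke[each.key].id
}

# ...and that table learns only the Shared Services CIDR, never the other spoke's.
resource "aws_ec2_transit_gateway_route_table_propagation" "shared_into_spoke" {
  for_each                       = local.spokes
  transit_gateway_attachment_id  = var.shared_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.spoke[each.key].id
}

resource "aws_ec2_transit_gateway_route_table_association" "shared" {
  transit_gateway_attachment_id  = var.shared_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared.id
}

# Shared Services learns every spoke CIDR so return traffic has a route.
resource "aws_ec2_transit_gateway_route_table_propagation" "spoke_into_shared" {
  for_each                       = local.spokes
  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.shared.id
}
```

Because the Payments route table never receives a route to the Analytics CIDR (and vice versa), traffic between the two has no path through the hub, independent of security groups.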

CIDR planning is effectively irreversible. A VPC's primary CIDR cannot be changed after creation (you can only add secondary CIDRs), and a TGW route table cannot route correctly between attachments with overlapping CIDRs. At Acme Corp we reserve a full /8 (10.0.0.0/8) on day one, subdivide by OU (10.10.0.0/16 = Payments, 10.20.0.0/16 = Analytics, 10.30.0.0/16 = Shared Services), and document the allocation in a CMDB. Teams that skip this step spend six months re-IPing VPCs in production, a painful and risky exercise.
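The allocation can be derived rather than typed by hand, which makes overlaps harder to introduce. The sketch below uses Terraform's built-in `cidrsubnet` function; the per-account /20 split and the `dev`/`prod` keys are assumptions added for illustration.

```hcl
locals {
  org_cidr = "10.0.0.0/8"

  # Adding 8 bits turns the /8 into /16 blocks; the third argument picks the second octet.
  ou_cidrs = {
    payments        = cidrsubnet(local.org_cidr, 8, 10) # 10.10.0.0/16
    analytics       = cidrsubnet(local.org_cidr, 8, 20) # 10.20.0.0/16
    shared_services = cidrsubnet(local.org_cidr, 8, 30) # 10.30.0.0/16
  }

  # Hypothetical per-account split: 4 more bits yields sixteen /20 VPC ranges per OU.
  payments_vpc_cidrs = {
    dev  = cidrsubnet(local.ou_cidrs.payments, 4, 0) # 10.10.0.0/20
    prod = cidrsubnet(local.ou_cidrs.payments, 4, 1) # 10.10.16.0/20
  }
}

output "ou_cidrs" {
  value = local.ou_cidrs
}
```

Keeping the plan in one `locals` block gives the CMDB entry a single source to be generated from.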

Guardrails: SCPs and AWS Config Rules

Guardrails are the governance layer that makes the landing zone trustworthy. At Acme Corp we implement two tiers: preventive controls via Service Control Policies (SCPs) that make violations impossible, and detective controls via AWS Config rules that alert on drift. The Payments OU gets additional PCI-specific SCPs that the other OUs do not inherit.

# ── Terraform: root-level SCP — deny leaving the org + deny disabling CloudTrail ──
resource "aws_organizations_policy" "root_baseline" {
  name        = "root-baseline-scp"
  description = "Non-negotiable controls applied to every account"
  type        = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyLeaveOrganization"
        Effect = "Deny"
        Action = ["organizations:LeaveOrganization"]
        Resource = "*"
      },
      {
        Sid    = "DenyDisableCloudTrail"
        Effect = "Deny"
        Action = [
          "cloudtrail:DeleteTrail",
          "cloudtrail:StopLogging",
          "cloudtrail:UpdateTrail"
        ]
        Resource = "*"
      },
      {
        Sid    = "DenyIMDSv1"
        Effect = "Deny"
        Action = ["ec2:RunInstances"]
        Resource = "arn:aws:ec2:*:*:instance/*"
        # Deny unless IMDSv2 is explicitly required; this also catches requests that omit the setting.
        Condition = {
          StringNotEquals = {
            "ec2:MetadataHttpTokens" = "required"
          }
        }
      },
      {
        Sid    = "RequireTagOnEC2"
        Effect = "Deny"
        Action = ["ec2:RunInstances"]
        Resource = "arn:aws:ec2:*:*:instance/*"
        Condition = {
          "Null" = {
            "aws:RequestTag/CostCenter" = "true"
          }
        }
      }
    ]
  })
}

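# Look up the organization so the policy can be attached at the root.
data "aws_organizations_organization" "this" {}

resource "aws_organizations_policy_attachment" "root" {
  policy_id = aws_organizations_policy.root_baseline.id
  target_id = data.aws_organizations_organization.this.roots[0].id
}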

# ── Payments OU additional SCP: data residency (us-east-1, plus us-west-2 for DR) ──
resource "aws_organizations_policy" "payments_residency" {
  name = "payments-data-residency"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyOutsideApprovedRegions"
      Effect = "Deny"
      # Global services are exempted from the Region check.
      NotAction = [
        "iam:*", "sts:*", "route53:*",
        "cloudfront:*", "waf:*", "support:*"
      ]
      Resource = "*"
      Condition = {
        StringNotEquals = {
          # us-west-2 is allowed so the DR replica can be created there.
          "aws:RequestedRegion" = ["us-east-1", "us-west-2"]
        }
      }
    }]
  })
}

resource "aws_organizations_policy_attachment" "payments_ou" {
  policy_id = aws_organizations_policy.payments_residency.id
  target_id = aws_organizations_organizational_unit.payments.id
}

SCPs do not grant permissions; they only cap what IAM policies can grant. Even under an SCP that allows everything except one action, a principal still needs an IAM policy in its own account that allows the call. A common production mistake is attaching a very restrictive SCP to an OU and then wondering why automated pipelines break. Always test SCP changes in a sandbox OU with a canary deployment before attaching them to production OUs. The IAM Policy Simulator and aws iam simulate-principal-policy take into account SCPs already attached to the account, so run them against a principal in the sandbox account as a first check. The simulator may not evaluate SCP conditions, so confirm with a real canary call as well.
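The detective tier can mirror the preventive SCPs so that drift is reported even where an SCP exception exists. The following is a minimal sketch using AWS Config managed rules deployed org-wide. It assumes it is applied from the management account (or a delegated administrator) and that Config recorders are already running in member accounts; the rule names are illustrative.

```hcl
# Detective counterpart to the DenyDisableCloudTrail SCP statement.
resource "aws_config_organization_managed_rule" "cloudtrail_enabled" {
  name            = "cloudtrail-enabled"
  rule_identifier = "CLOUD_TRAIL_ENABLED"
}

# Flags running instances that still accept IMDSv1 requests.
resource "aws_config_organization_managed_rule" "imdsv2_required" {
  name            = "ec2-imdsv2-check"
  rule_identifier = "EC2_IMDSV2_CHECK"
}

# Flags EC2 instances missing the CostCenter tag that Finance relies on.
resource "aws_config_organization_managed_rule" "costcenter_tag" {
  name                 = "required-tags-costcenter"
  rule_identifier      = "REQUIRED_TAGS"
  resource_types_scope = ["AWS::EC2::Instance"]
  input_parameters     = jsonencode({ tag1Key = "CostCenter" })
}
```

Pairing each preventive statement with a detective rule also catches resources created before the SCP was attached.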

Identity Federation: Wiring Okta to IAM Identity Center

Acme Corp uses Okta as the corporate IdP. IAM Identity Center (formerly AWS SSO) federates to Okta via SCIM (auto-provisions users/groups) and SAML 2.0 (authenticates sessions). The result: engineers log in once with their corporate credentials and receive short-lived role credentials scoped to their permission set — no long-lived IAM keys anywhere.

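# ── CLI: list all accounts reachable via SSO ───────────────────────────────────
# (run after `aws configure sso` and `aws sso login`)
# The cache directory can hold several JSON files, so pick the newest one that has a token.
TOKEN_FILE=$(grep -l '"accessToken"' ~/.aws/sso/cache/*.json | xargs ls -t | head -n 1)
ACCESS_TOKEN=$(python3 -c \
  'import json,sys; print(json.load(open(sys.argv[1]))["accessToken"])' "$TOKEN_FILE")
aws sso list-accounts --access-token "$ACCESS_TOKEN" \
  --query 'accountList[*].[accountId,accountName]' \
  --output table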

# ── Terraform: grant the "platform-eng" Okta group the NetworkAdministrator
#    job-function policy on the networking account via Identity Center ─────────
variable "networking_account_id" {
  description = "12-digit ID of the networking account"
  type        = string
}

data "aws_ssoadmin_instances" "this" {}

# The group is provisioned from Okta via SCIM, so look it up rather than create it.
data "aws_identitystore_group" "platform_eng" {
  identity_store_id = tolist(data.aws_ssoadmin_instances.this.identity_store_ids)[0]

  alternate_identifier {
    unique_attribute {
      attribute_path  = "DisplayName"
      attribute_value = "platform-eng"
    }
  }
}

resource "aws_ssoadmin_permission_set" "network_admin" {
  name             = "NetworkAdmin"
  instance_arn     = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  session_duration = "PT4H" # credentials expire after four hours

  tags = { ManagedBy = "terraform" }
}

resource "aws_ssoadmin_managed_policy_attachment" "network_admin" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.network_admin.arn
  managed_policy_arn = "arn:aws:iam::aws:policy/job-function/NetworkAdministrator"
}

resource "aws_ssoadmin_account_assignment" "platform_eng_networking" {
  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.network_admin.arn
  principal_id       = data.aws_identitystore_group.platform_eng.group_id
  principal_type     = "GROUP"
  target_id          = var.networking_account_id
  target_type        = "AWS_ACCOUNT"
}

Resilience: What Happens When a Region Goes Down

Acme Corp's legal requirement mandates a DR replica in us-west-2. The Payments service uses a pilot-light DR pattern: the database (Amazon Aurora Global Database) replicates continuously, but compute in us-west-2 is minimal (an ECS service scaled to zero tasks). On failure, Route 53 health checks detect the primary endpoint going unhealthy and automatically shift DNS to the secondary ALB. DNS failover alone does not promote the database: the secondary Aurora cluster must be promoted and the ECS service scaled up before us-west-2 can take writes, so both steps count against the RTO. RTO target: 15 minutes. RPO target: under 1 minute (Aurora Global Database replication lag is typically under 1 second).

The data-residency SCP on the Payments OU must permit us-west-2 as well as us-east-1 in its aws:RequestedRegion condition. NotAction only exempts services from the deny; it cannot exempt a Region. This detail must be in the initial SCP, or the DR replicas will fail to create.
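The DNS side of the failover can be sketched as a pair of Route 53 failover records. Everything named here is hypothetical: the hosted zone, the `payments.example.com` record name, the `/healthz` path, and the ALB inputs are placeholders for illustration.

```hcl
# Hypothetical inputs: the hosted zone and the two regional ALBs.
variable "zone_id" { type = string }
variable "primary_alb" { type = object({ dns_name = string, zone_id = string }) }
variable "secondary_alb" { type = object({ dns_name = string, zone_id = string }) }

# Probes the us-east-1 endpoint; three consecutive failures mark it unhealthy.
resource "aws_route53_health_check" "primary" {
  fqdn              = var.primary_alb.dns_name
  type              = "HTTPS"
  port              = 443
  resource_path     = "/healthz" # hypothetical health endpoint
  request_interval  = 30
  failure_threshold = 3
}

resource "aws_route53_record" "primary" {
  zone_id         = var.zone_id
  name            = "payments.example.com" # hypothetical record name
  type            = "A"
  set_identifier  = "primary-us-east-1"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = var.primary_alb.dns_name
    zone_id                = var.primary_alb.zone_id
    evaluate_target_health = true
  }
}

# Served only while the primary record is unhealthy.
resource "aws_route53_record" "secondary" {
  zone_id        = var.zone_id
  name           = "payments.example.com"
  type           = "A"
  set_identifier = "secondary-us-west-2"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.secondary_alb.dns_name
    zone_id                = var.secondary_alb.zone_id
    evaluate_target_health = true
  }
}
```

With a 30-second interval and a threshold of 3, detection takes roughly a minute and a half, which leaves most of the 15-minute RTO for database promotion and compute scale-up.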

Putting It Together: Control Tower Account Vending

AWS Control Tower's Account Factory (or Account Factory for Terraform, AFT) automates account creation. When a development team at Acme Corp needs a new environment, they open a pull request to the aft-account-requests repository with a single Terraform file specifying the OU, tags, and account customizations. AFT provisions the account into the target OU, where it inherits the OU's SCPs. The customization pipeline then creates the TGW attachment, registers the account with Security Hub and Config, and fires a welcome Slack notification, all without a human touching the AWS console.

# ── AFT account request file (teams submit PRs for new accounts) ───────────────
# File: aft-account-requests/payments-team-dev.tf

module "payments_dev_account" {
  source = "./modules/aft-account-request"

  control_tower_parameters = {
    AccountEmail              = "aws-payments-dev@acme-corp.com"
    AccountName               = "acme-payments-dev"
    ManagedOrganizationalUnit = "Workloads/Payments"
    SSOUserEmail              = "payments-lead@acme-corp.com"
    SSOUserFirstName          = "Payments"
    SSOUserLastName           = "Team"
  }

  account_tags = {
    Environment = "dev"
    BU          = "payments"
    CostCenter  = "CC-4421"
    Owner       = "payments-lead@acme-corp.com"
  }

  change_management_parameters = {
    change_requested_by = "platform-eng"
    change_reason       = "New dev environment for payments-v2 project"
  }

  account_customizations_name = "payments-baseline"
}

The landing zone is never "done." It is a living platform. Every new team onboarded, every new compliance requirement, and every AWS service GA announcement is a potential change. Big-tech platform teams maintain the landing zone as a product: it has an owner, a changelog, a test suite (with AWS Config conformance packs and SCPkit for SCP unit testing), and a quarterly Well-Architected Framework review. Treat it accordingly.

Production Failure Modes to Anticipate

After designing hundreds of landing zones, we see the same failure modes bite teams in their first year:

SCP blocks the CI/CD pipeline role: a new SCP requiring a tag on every EC2 launch silently breaks the pipeline's launch template. Always test SCPs in a staging OU before org-wide rollout, and monitor for sudden CloudTrail Deny events in the hours following an SCP change.
TGW route table propagation not enabled: the TGW attachment exists, but the route table does not propagate the VPC CIDR, so traffic is silently dropped. Verify with aws ec2 get-transit-gateway-route-table-propagations and aws ec2 search-transit-gateway-routes, and run a VPC Reachability Analyzer path after every attachment.
Log Archive account bucket policy too permissive: the whole point of a dedicated log archive account is tamper-resistance. If the S3 bucket allows s3:DeleteObject to the log-delivery role, a compromised workload account can destroy audit evidence. The bucket policy must explicitly deny all destructive actions except from the management account principal.
SCIM token expiry breaks SSO provisioning: the SCIM access token that Okta uses to provision into IAM Identity Center expires one year after it is generated. When this happens, new hires are not provisioned and existing users removed from Okta groups retain their AWS access. Rotate SCIM tokens before expiry and alert on provisioning errors.
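As an illustration of the log-archive failure mode above, a deny statement along these lines blocks destructive actions for every principal outside the management account. This is a sketch under stated assumptions: the two variables are hypothetical, and a real bucket policy would also need the allow statements that let CloudTrail and Config deliver logs, since `aws_s3_bucket_policy` replaces the whole policy.

```hcl
variable "log_bucket_name" { type = string }       # hypothetical bucket name input
variable "management_account_id" { type = string } # hypothetical account ID input

data "aws_iam_policy_document" "log_archive_deny_destructive" {
  statement {
    sid    = "DenyDestructiveActions"
    effect = "Deny"

    # Actions that delete evidence directly, or indirectly through lifecycle expiry.
    actions = [
      "s3:DeleteObject",
      "s3:DeleteObjectVersion",
      "s3:DeleteBucket",
      "s3:PutLifecycleConfiguration",
    ]

    resources = [
      "arn:aws:s3:::${var.log_bucket_name}",
      "arn:aws:s3:::${var.log_bucket_name}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    # The deny applies to every caller whose account is not the management account.
    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalAccount"
      values   = [var.management_account_id]
    }
  }
}

resource "aws_s3_bucket_policy" "log_archive" {
  bucket = var.log_bucket_name
  policy = data.aws_iam_policy_document.log_archive_deny_destructive.json
}
```

An explicit deny in the bucket policy wins over any allow in the caller's IAM policy, so a log-delivery role that was granted too much still cannot delete objects.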