Capstone: A Big-Tech Production Platform

Foundation: Accounts, Network & IAM

18 min Lesson 2 of 30

Before a single container runs or a pipeline fires, three things must be correct: account structure, network topology, and identity baseline. Platform teams at companies such as Amazon, Google, and Microsoft typically settle these decisions over weeks, before any workload is deployed. Get them wrong and you spend the next two years working around the consequences. Get them right and every subsequent layer (Kubernetes, CI/CD, observability, security) slots in cleanly.

The Landing Zone: Multi-Account Strategy

A landing zone is the opinionated, pre-configured multi-account environment that enforces guardrails before any team touches it. AWS Control Tower, Google Cloud's Cloud Foundation Fabric (FAST), and Azure Landing Zones each codify years of enterprise best practice into a reproducible baseline.

The core principle is blast-radius isolation through account separation. A credential leak, a misconfigured S3 bucket, or a runaway cost spike in one account cannot cascade to another. A sensible OU hierarchy for a mid-to-large organisation looks like this:

Root / Management account: AWS Organizations management account, consolidated billing, zero workloads. Service Control Policies (SCPs) attached at the organization root act as the permission ceiling for every member account below. They do not restrict the management account itself, which is one more reason to keep workloads out of it.
Security OU: Log Archive account (all CloudTrail, VPC Flow Logs, and Config snapshots from every account centralised here) and a Security Tooling account (GuardDuty delegated admin, Security Hub, SIEM ingestion).
Infrastructure OU: Shared Services account (Transit Gateway hub, Route 53 Resolver, ECR, shared EKS add-ons) and optionally a Network account for centralised ingress/egress through an inspection VPC.
Workloads OU: one account per team per environment (prod, staging, dev). A team owning three microservices still shares one account per environment; per-service accounts blow past account limits and fragment cost visibility.
Sandbox OU: individual developer accounts capped at $200/month and auto-nuked weekly via aws-nuke. An SCP cannot express a spend limit on its own, so enforce the cap with an AWS Budgets action that applies a restrictive SCP or IAM policy once the threshold is crossed.

SCPs are permission ceilings, not grants. A deny SCP beats any IAM Allow, including Admin. At a minimum, every production OU must deny: leaving AWS Organizations, disabling CloudTrail, creating long-lived IAM users, and launching resources outside approved regions. Codify these in Terraform and commit them to the landing-zone repo on day zero — retrofitting guardrails into a live estate is painful.
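The four minimum denies listed above could be codified roughly as follows. This is a sketch under stated assumptions rather than a complete guardrail set: the policy name, the approved region list, the exempted global services, and the `production_ou_id` variable are all placeholders chosen for illustration.

```hcl
# Placeholder input: the ID of the OU that holds production accounts.
variable "production_ou_id" {
  type        = string
  description = "Target OU for the production guardrails (hypothetical input)"
}

data "aws_iam_policy_document" "prod_guardrails" {
  statement {
    sid       = "DenyLeavingOrganization"
    effect    = "Deny"
    actions   = ["organizations:LeaveOrganization"]
    resources = ["*"]
  }

  statement {
    sid       = "ProtectCloudTrail"
    effect    = "Deny"
    actions   = ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail", "cloudtrail:UpdateTrail"]
    resources = ["*"]
  }

  statement {
    sid       = "DenyLongLivedIamUsers"
    effect    = "Deny"
    actions   = ["iam:CreateUser", "iam:CreateAccessKey"]
    resources = ["*"]
  }

  statement {
    sid    = "DenyUnapprovedRegions"
    effect = "Deny"
    # Global services are exempted so they keep working; adjust the list to your estate.
    not_actions = ["iam:*", "organizations:*", "route53:*", "cloudfront:*", "support:*", "sts:*"]
    resources   = ["*"]

    condition {
      test     = "StringNotEquals"
      variable = "aws:RequestedRegion"
      values   = ["us-east-1", "eu-west-1"] # assumed approved regions
    }
  }
}

resource "aws_organizations_policy" "prod_guardrails" {
  name    = "prod-guardrails"
  type    = "SERVICE_CONTROL_POLICY"
  content = data.aws_iam_policy_document.prod_guardrails.json
}

resource "aws_organizations_policy_attachment" "prod" {
  policy_id = aws_organizations_policy.prod_guardrails.id
  target_id = var.production_ou_id
}
```

Attaching at the OU rather than at individual accounts means every account vended into that OU inherits the ceiling without further action.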

AWS Control Tower automates account vending and applies mandatory guardrails (called Controls). The Account Factory for Terraform (AFT) wraps this in a GitOps pipeline so every new account is a pull request:

# aft-account-requests/accounts/payments-prod.tf
# Every new account = one PR here; AFT CodePipeline applies on merge

module "payments_prod" {
  source = "./modules/aft-account-request"

  control_tower_parameters = {
    AccountEmail              = "aws-payments-prod@company.com"
    AccountName               = "payments-prod"
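    # Assumption to verify: AFT generally expects an OU name, or "Name (ou-id)" for a nested OU, rather than a path
    ManagedOrganizationalUnit = "Workloads/Platform/Production"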
    SSOUserEmail              = "aws-payments-prod@company.com"
    SSOUserFirstName          = "Payments"
    SSOUserLastName           = "Production"
  }

  account_tags = {
    Team        = "payments"
    Environment = "production"
    CostCenter  = "CC-1042"
    DataClass   = "PCI"
  }

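  # Expected by the AFT account-request module: who asked for the account and why.
  # Both values below are placeholders.
  change_management_parameters = {
    change_requested_by = "payments-platform-team"
    change_reason       = "Vend production account for the payments team"
  }

  # Post-vend customisations: the named baseline installs Config rules and other account-level defaults
  account_customizations_name = "pci-production-baseline"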
}

VPC Design: The Network Foundation

Every account that runs workloads gets at least one custom VPC. The default VPC is deleted during account vending, and an SCP that denies ec2:CreateDefaultVpc keeps it from being recreated. A production-grade VPC for a capstone platform uses a three-tier subnet model across three Availability Zones. Two is not enough: losing one of two AZs removes half your capacity and leaves no redundancy, whereas losing one of three removes a third.

The tier structure and CIDR plan for the platform VPC in us-east-1 (10.0.0.0/16):

Public subnets (/24 each, one per AZ) — Internet-facing ALBs, NAT Gateways. Nothing else. No application servers, ever.
Private / Application subnets (/22 each) — EKS nodes, EC2 app servers, Lambda in VPC. Outbound internet via NAT Gateway only.
Data subnets (/24 each) — RDS, ElastiCache, MSK. No route to the internet whatsoever, not even NAT.

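Size application subnets for EKS node scaling, not today's node count. With the VPC CNI, each pod gets a real VPC IP, so a node consumes one primary IP plus one secondary IP per pod. Pod density is capped by the instance type's ENI limits: a c5.2xlarge, for example, supports 4 ENIs with 15 IPv4 addresses each, which gives the EKS default of 58 pods per node (4 × (15 - 1) + 2 = 58). A cluster that scales to 50 such nodes can consume roughly 2,900 IPs, or close to 1,000 per AZ when nodes are spread evenly across three AZs. A /24 (251 usable addresses) will exhaust almost immediately. Use /22 (1,019 usable) as the minimum for the application tier, and go larger if you expect bigger instance types (a c5.4xlarge can hold 234 pods) or more nodes. Undersized application subnets are one of the most common VPC mistakes in EKS deployments.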
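The sizing argument can be checked with plain Terraform arithmetic. The sketch below assumes an instance shape of 4 ENIs with 15 IPv4 addresses each (the c5.xlarge / c5.2xlarge limits, which yield 58 pods per node), the default VPC CNI behaviour without prefix delegation, and nodes spread evenly over three AZs. All local names are illustrative.

```hcl
locals {
  # Assumed instance shape: 4 ENIs x 15 IPv4 addresses per ENI.
  enis_per_node = 4
  ips_per_eni   = 15

  # EKS default max-pods formula: one address per ENI is the ENI's own primary IP,
  # and 2 host-network pods need no secondary IP.
  max_pods_per_node = local.enis_per_node * (local.ips_per_eni - 1) + 2 # 58

  node_count = 50
  az_count   = 3

  pod_ceiling = local.node_count * local.max_pods_per_node # 2900

  # Worst case: every address slot on every ENI of every node is allocated.
  ips_per_node = local.enis_per_node * local.ips_per_eni          # 60
  cluster_ips  = local.node_count * local.ips_per_node            # 3000
  ips_per_az   = ceil(local.cluster_ips / local.az_count)         # 1000

  # AWS reserves 5 addresses in every subnet.
  usable_in_24 = pow(2, 32 - 24) - 5 # 251
  usable_in_22 = pow(2, 32 - 22) - 5 # 1019
}

output "app_subnet_fits" {
  value = {
    slash_24 = local.usable_in_24 >= local.ips_per_az # false
    slash_22 = local.usable_in_22 >= local.ips_per_az # true
  }
}
```

Under these assumptions a /22 fits with only a handful of addresses to spare, which is why it is better treated as a floor than a target.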

Connectivity between accounts uses AWS Transit Gateway (TGW), not VPC Peering. Peering is a mesh — O(n²) connections as accounts grow. TGW is a hub; route tables on the TGW control which spoke VPCs can reach each other. Prod VPCs must never have a TGW route to dev VPCs.
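One way to express the "prod never routes to dev" rule is a separate TGW route table per segment, where each table only learns routes from the segments it is allowed to reach. The following is a minimal sketch: the three attachment ID variables are hypothetical inputs (in practice the attachments are created from the spoke accounts), and the segment names are illustrative.

```hcl
# Hypothetical inputs: TGW attachment IDs of the spoke VPCs.
variable "prod_attachment_id" { type = string }
variable "dev_attachment_id" { type = string }
variable "shared_attachment_id" { type = string }

resource "aws_ec2_transit_gateway" "hub" {
  description = "platform-hub"
  # Disable the default table so no attachment gets any-to-any routing by accident.
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
}

# One route table per segment.
resource "aws_ec2_transit_gateway_route_table" "segment" {
  for_each           = toset(["prod", "dev", "shared"])
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  tags               = { Name = "tgw-rt-${each.key}" }
}

locals {
  attachments = {
    prod   = var.prod_attachment_id
    dev    = var.dev_attachment_id
    shared = var.shared_attachment_id
  }

  # Which segments each route table learns routes from.
  # prod and dev each learn only shared; neither learns the other.
  learns_from = {
    prod   = ["shared"]
    dev    = ["shared"]
    shared = ["prod", "dev"]
  }

  propagations = flatten([
    for table, sources in local.learns_from : [
      for source in sources : { table = table, source = source }
    ]
  ])
}

# Each attachment is associated with exactly one table (its own segment).
resource "aws_ec2_transit_gateway_route_table_association" "segment" {
  for_each                       = local.attachments
  transit_gateway_attachment_id  = each.value
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.segment[each.key].id
}

# Routes are propagated only along the pairs allowed in learns_from.
resource "aws_ec2_transit_gateway_route_table_propagation" "segment" {
  for_each                       = { for p in local.propagations : "${p.table}-from-${p.source}" => p }
  transit_gateway_attachment_id  = local.attachments[each.value.source]
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.segment[each.value.table].id
}
```

Because the prod table never receives a propagation from the dev attachment, the TGW has no route to forward prod traffic to dev, regardless of what security groups allow.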

# terraform/network/vpc.tf — three-tier VPC with 3-AZ coverage

locals {
  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  public_cidrs    = ["10.0.0.0/24",  "10.0.1.0/24",  "10.0.2.0/24"]
  app_cidrs       = ["10.0.16.0/22", "10.0.20.0/22", "10.0.24.0/22"]
  data_cidrs      = ["10.0.48.0/24", "10.0.49.0/24", "10.0.50.0/24"]
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
  tags = { Name = "platform-prod-use1", Env = "production" }
}

resource "aws_subnet" "public" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = local.public_cidrs[count.index]
  availability_zone = local.azs[count.index]
  # Never set map_public_ip_on_launch = true here — assign EIPs explicitly
  tags = { Name = "public-${local.azs[count.index]}", Tier = "public" }
}

resource "aws_subnet" "app" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = local.app_cidrs[count.index]
  availability_zone = local.azs[count.index]
  tags = { Name = "app-${local.azs[count.index]}", Tier = "application" }
}

resource "aws_subnet" "data" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = local.data_cidrs[count.index]
  availability_zone = local.azs[count.index]
  tags = { Name = "data-${local.azs[count.index]}", Tier = "data" }
}

# NAT Gateway — one per AZ for HA (costly but mandatory for production)
resource "aws_eip" "nat" {
  count  = 3
  domain = "vpc"
}
resource "aws_nat_gateway" "main" {
  count         = 3
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  tags          = { Name = "ngw-${local.azs[count.index]}" }
}

# VPC Flow Logs to S3 — centralised Log Archive account bucket
resource "aws_flow_log" "main" {
  vpc_id       = aws_vpc.main.id
  traffic_type = "ALL"
  # No iam_role_arn here: a delivery role is only used for CloudWatch Logs destinations.
  # Delivery to S3 is authorised by the bucket policy in the Log Archive account.
  log_destination      = "arn:aws:s3:::company-log-archive-vpc-flowlogs"
  log_destination_type = "s3"
  tags                 = { Name = "platform-prod-flowlogs" }
}

Three-tier VPC (public / application / data) across three Availability Zones, with Transit Gateway providing cross-account connectivity.
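The tier rules (public via an internet gateway, application via NAT only, data with no internet route) are enforced by route tables, which the listing above does not define. A minimal sketch that extends the same vpc.tf and reuses its resource names could look like this. The tag values are illustrative.

```hcl
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
  tags   = { Name = "igw-platform-prod" }
}

# Public tier: default route to the internet gateway.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
  tags = { Tier = "public" }
}

# Application tier: one table per AZ so each AZ egresses through its own NAT Gateway.
resource "aws_route_table" "app" {
  count  = 3
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
  tags = { Tier = "application" }
}

# Data tier: no route blocks, so only the implicit local VPC route exists.
resource "aws_route_table" "data" {
  vpc_id = aws_vpc.main.id
  tags   = { Tier = "data" }
}

resource "aws_route_table_association" "public" {
  count          = 3
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "app" {
  count          = 3
  subnet_id      = aws_subnet.app[count.index].id
  route_table_id = aws_route_table.app[count.index].id
}

resource "aws_route_table_association" "data" {
  count          = 3
  subnet_id      = aws_subnet.data[count.index].id
  route_table_id = aws_route_table.data.id
}
```

Explicit associations matter here: a subnet without one falls back to the VPC's main route table, which may not match the tier's intent.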

IAM Identity Baseline

On day zero, the identity baseline has two goals: eliminate long-lived credentials and enforce least-privilege role assumption. Every human identity should authenticate through your IdP (Okta, Entra ID, Google Workspace) via AWS IAM Identity Center (SSO), not IAM users. No engineer should have an access key in ~/.aws/credentials in production.

The key patterns at big-tech scale:

Permission Sets in IAM Identity Center: these map to IAM roles in each account. Define four tiers: ReadOnly, Developer, PowerUser, Administrator. SCPs attach to OUs and accounts rather than to permission sets, but they still bound the resulting roles, so even the Administrator permission set cannot disable CloudTrail or modify the landing-zone baseline.
OIDC trust for CI/CD: GitHub Actions, GitLab CI, and ArgoCD assume IAM roles via OIDC federation. No static secrets in CI, ever. In GitHub Actions, the aws-actions/configure-aws-credentials action handles the token exchange automatically.
EC2/EKS workload identity: EC2 instance profiles and EKS IRSA (IAM Roles for Service Accounts) give workloads scoped credentials without any key management. Each service account gets its own role with exactly the S3 buckets and DynamoDB tables it needs.
AWS Organizations SCPs for guardrails: deny iam:CreateUser, deny iam:CreateAccessKey (except in a break-glass automation account), and deny s3:PutAccountPublicAccessBlock and s3:PutBucketPublicAccessBlock for everyone but the platform role, so that S3 Block Public Access cannot be switched off.
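As an illustration of the permission-set pattern, the sketch below defines the ReadOnly tier and assigns it to an IdP group in one account. The two variables (a group ID synced from the IdP and a target account ID) and the eight-hour session length are assumptions made for the example.

```hcl
# Hypothetical inputs: an Identity Center group ID (synced from the IdP)
# and the workload account the group should be able to read.
variable "engineers_group_id" { type = string }
variable "workload_account_id" { type = string }

data "aws_ssoadmin_instances" "this" {}

locals {
  sso_instance_arn = tolist(data.aws_ssoadmin_instances.this.arns)[0]
}

resource "aws_ssoadmin_permission_set" "read_only" {
  name             = "ReadOnly"
  instance_arn     = local.sso_instance_arn
  session_duration = "PT8H" # assumed session length
}

resource "aws_ssoadmin_managed_policy_attachment" "read_only" {
  instance_arn       = local.sso_instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.read_only.arn
  managed_policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# Assign the permission set to a group, never to individual users.
resource "aws_ssoadmin_account_assignment" "read_only" {
  instance_arn       = local.sso_instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.read_only.arn
  principal_id       = var.engineers_group_id
  principal_type     = "GROUP"
  target_id          = var.workload_account_id
  target_type        = "AWS_ACCOUNT"
}
```

The other three tiers would follow the same shape with different policies attached.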

# OIDC trust policy for GitHub Actions — no static secrets in CI
# Deploy this once per account; all pipelines reference the role by ARN

data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

resource "aws_iam_role" "github_deploy" {
  name = "github-actions-deploy"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = data.aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:company-org/platform:*"
        }
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

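# Scope: ECR push/pull only. Deploying to EKS additionally needs eks:DescribeCluster
# and a cluster access entry (or aws-auth mapping) for this role, granted separately.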
resource "aws_iam_role_policy_attachment" "github_ecr" {
  role       = aws_iam_role.github_deploy.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPowerUser"
}
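The EKS workload identity pattern (IRSA) uses the same federation mechanism, with the cluster's OIDC issuer in place of GitHub's. A minimal sketch follows. The namespace, service account name, bucket name, and both variables are hypothetical, and the issuer is assumed to be passed without the https:// prefix.

```hcl
# Hypothetical inputs describing the cluster's OIDC provider.
variable "eks_oidc_provider_arn" { type = string }
variable "eks_oidc_issuer" {
  type        = string
  description = "Issuer host and path without https://, e.g. oidc.eks.<region>.amazonaws.com/id/<id>"
}

data "aws_iam_policy_document" "irsa_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.eks_oidc_provider_arn]
    }

    # Only this one service account in this one namespace may assume the role.
    condition {
      test     = "StringEquals"
      variable = "${var.eks_oidc_issuer}:sub"
      values   = ["system:serviceaccount:payments:payments-api"]
    }

    condition {
      test     = "StringEquals"
      variable = "${var.eks_oidc_issuer}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "payments_api" {
  name               = "payments-api-irsa"
  assume_role_policy = data.aws_iam_policy_document.irsa_trust.json
}

# Exactly the access the workload needs: read objects from one bucket (placeholder name).
resource "aws_iam_role_policy" "payments_api" {
  name = "read-payments-config"
  role = aws_iam_role.payments_api.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = "arn:aws:s3:::company-payments-config/*"
    }]
  })
}
```

The Kubernetes service account then references the role through the eks.amazonaws.com/role-arn annotation.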

Bringing It Together: Foundation Sequence

The correct provisioning order matters because later layers depend on earlier ones:

1. Enable AWS Organizations and set root SCPs (region deny, CloudTrail protection, IAM user deny).
2. Bootstrap Control Tower with AFT; vend the Security OU accounts first (Log Archive, Security Tooling).
3. Set up IAM Identity Center, connect it to the IdP, and define permission sets before any human logs in to a workload account.
4. Establish a CIDR registry: a simple DynamoDB table or Terraform state in a central S3 bucket that records every VPC CIDR allocation. It must exist before the first workload VPC is created. Without it, two teams will eventually pick overlapping ranges and discover it only when they try to connect their VPCs.
5. Vend workload accounts via AFT; each customisation baseline runs Terraform that creates the VPC, subnets, TGW attachment, Flow Logs, Config rules, and the OIDC IAM role.
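The CIDR registry can be as small as a Terraform map kept in the central state. The sketch below is one possible shape, not a prescribed design: it assumes a 10.0.0.0/8 supernet carved into /16 VPC ranges, and the account names and slot numbers are illustrative. Each account gets a slot, and the CIDR is derived from the slot, so two VPCs can only overlap if two accounts share a slot.

```hcl
locals {
  supernet = "10.0.0.0/8" # assumed organisation-wide range

  # Registry: account name -> slot index. Append only, never reuse a slot.
  vpc_slots = {
    "platform-prod" = 0
    "payments-prod" = 1
    "payments-dev"  = 2
  }

  # 8 extra prefix bits turn the /8 into /16 ranges: slot 0 is 10.0.0.0/16, slot 1 is 10.1.0.0/16.
  vpc_cidrs = {
    for name, slot in local.vpc_slots : name => cidrsubnet(local.supernet, 8, slot)
  }
}

output "vpc_cidrs" {
  value = local.vpc_cidrs

  # Fail the plan if two accounts were given the same slot.
  precondition {
    condition     = length(distinct(values(local.vpc_slots))) == length(local.vpc_slots)
    error_message = "Two accounts share a slot; their VPC CIDRs would overlap."
  }
}
```

Account baselines can then read their range from this output instead of hard-coding it, and a new allocation is a one-line pull request against the map.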

The baseline Terraform module for a new account should be idempotent and under 300 lines. If the account baseline is longer than that, you are encoding too many assumptions. Keep the baseline minimal (VPC, IAM OIDC role, Config rules, Security Hub enrolment) and push workload-specific infra into team-owned repos. The platform team owns the landing zone; teams own everything above it.

With accounts isolated, networks segmented into tiers, and identity routed through the IdP with no long-lived credentials anywhere, you have removed several of the most common root causes of cloud security incidents. Lesson 3 builds the Kubernetes platform on top of this foundation.