Cloud Architecture & Landing Zones

Governance Guardrails

At scale, governance is not bureaucracy. It is the engineering discipline that keeps 200 teams from accidentally destroying each other's workloads, spending $800k on forgotten GPU instances, or opening port 22 to the public internet because someone skipped the security review. At Amazon, Google, and Microsoft the mechanism is the same: policy as code enforced at the platform layer, not audited after the fact by a human reading a spreadsheet.

This lesson covers four interlocking guardrail layers: Service Control Policies (SCPs) for hard permission boundaries, AWS Config rules for continuous compliance, tagging standards for resource accountability, and budgets for cost control. Each operates at a different point in the control plane — and together they form a defence-in-depth governance model that does not require trusting every developer to do the right thing.

Service Control Policies (SCPs)

SCPs are maximum permission boundaries attached to the AWS Organizations root, to OUs, or to individual accounts. They do not grant permissions; they lower the ceiling of what IAM policies can grant. An SCP can prevent any principal in a member account (including the account root user, with a small number of documented exceptions) from calling certain APIs, regardless of what their IAM policy says. Two limits matter in practice: SCPs do not apply to the management account, and they do not restrict service-linked roles.

The mental model: IAM policies define what an identity is allowed to do. SCPs define what the account is allowed to allow. The effective permission is the intersection. This means SCPs are the right layer for hard organisational rules — things that must never happen regardless of who asks.

Common production SCP patterns at big tech include: denying ec2:ModifyInstanceAttribute to prevent disabling termination protection on production boxes, denying organizations:LeaveOrganization to prevent a rogue admin ejecting an account from governance, denying all API calls outside approved regions (data residency), and denying iam:CreateUser in workload accounts to enforce federated access only.
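As a rough illustration, two of the patterns above could be expressed as a single SCP document in Terraform. This is a minimal sketch, not a complete baseline: the data source name, the Sids, and the output name are assumptions made for the example.

```hcl
# Sketch: two of the deny patterns above combined into one SCP document.
# Names (baseline_denies, the Sids) are illustrative, not from the lesson.
data "aws_iam_policy_document" "baseline_denies" {
  # Stop an account admin from removing the account from the organization
  statement {
    sid       = "DenyLeaveOrganization"
    effect    = "Deny"
    actions   = ["organizations:LeaveOrganization"]
    resources = ["*"]
  }

  # Force federated access by blocking long-lived IAM users
  statement {
    sid       = "DenyIamUserCreation"
    effect    = "Deny"
    actions   = ["iam:CreateUser"]
    resources = ["*"]
  }
}

# Rendered JSON, suitable as the content of an SCP
output "baseline_denies_json" {
  value = data.aws_iam_policy_document.baseline_denies.json
}
```

SCP statements carry no Principal element, which is why the document contains only actions and resources.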

# Deny any action outside approved regions (data residency SCP)
# Exceptions carved out for global services: IAM, STS, Route 53, CloudFront, WAF, Support, Trusted Advisor
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonApprovedRegions",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "sts:*",
        "route53:*",
        "cloudfront:*",
        "waf:*",
        "support:*",
        "trustedadvisor:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": [
            "eu-west-1",
            "eu-central-1"
          ]
        }
      }
    }
  ]
}

# Apply via AWS CLI to an OU
aws organizations attach-policy \
  --policy-id p-xxxxxxxxxxxx \
  --target-id ou-root-xxxxxxxx

SCPs apply to every principal in the account including the root user (for most service APIs). Test SCPs in a non-production OU first. A malformed SCP that denies sts:AssumeRole with no carve-outs can lock every human and CI system out of the account instantly — and recovery requires action from the management account.

SCPs should be managed as Terraform resources (aws_organizations_policy + aws_organizations_policy_attachment), committed to your landing zone repository, and applied through a CI pipeline that has a mandatory terraform plan review step. Never apply SCP changes manually.
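A minimal sketch of that Terraform layout follows. The policy file path, the resource names, and the var.workloads_ou_id input are hypothetical; a real landing zone repository would substitute its own.

```hcl
# Sketch only: file path, resource names, and the OU variable are assumptions.
variable "workloads_ou_id" {
  description = "ID of the OU that receives the SCP (hypothetical input)"
  type        = string
}

resource "aws_organizations_policy" "deny_non_approved_regions" {
  name        = "deny-non-approved-regions"
  description = "Deny API calls outside approved regions"
  type        = "SERVICE_CONTROL_POLICY"

  # The JSON policy document lives next to the module and is reviewed like code
  content = file("${path.module}/policies/deny-non-approved-regions.json")
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.deny_non_approved_regions.id
  target_id = var.workloads_ou_id
}
```

Keeping the attachment as its own resource means that moving a policy from a test OU to a production OU appears in terraform plan as an explicit, reviewable change.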

AWS Config Rules

If SCPs are the firewall, blocking actions before they happen, AWS Config rules are the continuous audit, detecting when the current state of resources violates your standards. Config records configuration changes to supported resource types (API-level activity is CloudTrail's job) and evaluates each resource against rules you define. Non-compliant resources trigger notifications (SNS), findings (Security Hub), or automated remediation (SSM Automation documents).

AWS provides several hundred managed rules. The ones every production org enables from day one include: restricted-ssh (no SG allows 0.0.0.0/0 on port 22), s3-bucket-public-read-prohibited, encrypted-volumes, rds-storage-encrypted, mfa-enabled-for-iam-console-access, and required-tags. Custom rules can be written as Lambda functions (or in the AWS CloudFormation Guard policy language) for business-specific checks.

# Deploy AWS Config conformance pack via Terraform
resource "aws_config_conformance_pack" "security_baseline" {
  name = "security-baseline"

  template_body = <<-EOT
    Parameters:
      MaxAccessKeyAge:
        Type: String
        Default: "90"
    Resources:
      RestrictedSSH:
        Type: AWS::Config::ConfigRule
        Properties:
          ConfigRuleName: restricted-ssh
          Source:
            Owner: AWS
            SourceIdentifier: INCOMING_SSH_DISABLED
      S3PublicReadProhibited:
        Type: AWS::Config::ConfigRule
        Properties:
          ConfigRuleName: s3-bucket-public-read-prohibited
          Source:
            Owner: AWS
            SourceIdentifier: S3_BUCKET_PUBLIC_READ_PROHIBITED
      AccessKeyRotation:
        Type: AWS::Config::ConfigRule
        Properties:
          ConfigRuleName: access-keys-rotated
          InputParameters:
            maxAccessKeyAge: !Ref MaxAccessKeyAge
          Source:
            Owner: AWS
            SourceIdentifier: ACCESS_KEYS_ROTATED
  EOT
}

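# Auto-remediation: enable versioning on non-compliant S3 buckets.
# Assumes a Config rule named "s3-bucket-versioning-enabled" already exists
# (it is not part of the conformance pack above) and that
# aws_iam_role.config_remediation (hypothetical) can be assumed by SSM Automation.
resource "aws_config_remediation_configuration" "s3_versioning" {
  config_rule_name = "s3-bucket-versioning-enabled"
  resource_type    = "AWS::S3::Bucket"
  target_type      = "SSM_DOCUMENT"
  target_id        = "AWS-ConfigureS3BucketVersioning"

  automatic                  = true
  maximum_automatic_attempts = 3
  retry_attempt_seconds      = 60

  # Automatic remediation needs a role for the Automation document to assume
  parameter {
    name         = "AutomationAssumeRole"
    static_value = aws_iam_role.config_remediation.arn
  }
  parameter {
    name           = "BucketName"
    resource_value = "RESOURCE_ID"
  }
  parameter {
    name         = "VersioningState"
    static_value = "Enabled"
  }
}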

Deploy Config rules to every account in scope at once, either through CloudFormation StackSets or through Terraform (aws_config_conformance_pack per account, or aws_config_organization_conformance_pack from the management or delegated administrator account). The AWS Security Hub "Foundational Security Best Practices" standard surfaces Config-backed findings; enable it in the Security tooling account and configure a cross-account aggregator so that results from all accounts land in a single pane.
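One way to set up the cross-account aggregator mentioned above is an organization-wide AWS Config aggregator. The sketch below assumes it is applied in the management account or a delegated administrator account; the role and aggregator names are illustrative.

```hcl
# Sketch: organization-wide Config aggregator. Names are assumptions.
data "aws_iam_policy_document" "config_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["config.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "config_aggregator" {
  name               = "config-org-aggregator"
  assume_role_policy = data.aws_iam_policy_document.config_assume.json
}

# AWS-managed policy that lets Config read the organization's account list
resource "aws_iam_role_policy_attachment" "config_aggregator" {
  role       = aws_iam_role.config_aggregator.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSConfigRoleForOrganizations"
}

resource "aws_config_configuration_aggregator" "org" {
  name = "org-compliance"

  organization_aggregation_source {
    all_regions = true
    role_arn    = aws_iam_role.config_aggregator.arn
  }

  depends_on = [aws_iam_role_policy_attachment.config_aggregator]
}
```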

Tagging Standards

Tags are the connective tissue of cloud governance. Without consistent tags you cannot answer: "Which team owns this EC2 instance?" "What is the monthly cost of the payments service?" "Which resources are subject to GDPR?" Every major cloud cost blowup I have seen traces back to untagged or inconsistently tagged resources.

A production tagging standard covers four categories:

Identity: Owner (team email DL), Project (Jira project key), CostCenter (GL code)
Lifecycle: Environment (prod / staging / dev), Terraform (true — to detect manual resources), CreatedBy (IAM principal from aws:PrincipalArn)
Compliance: DataClassification (public / internal / confidential / restricted), Regulation (GDPR / PCI / HIPAA — pipe-delimited)
Operations: BackupPolicy (daily / weekly / none), PatchGroup (SSM patch group name)

Enforcement happens at two points. First, the required-tags Config rule marks any resource missing mandatory tags as NON_COMPLIANT, but it does not block creation. For blocking, use an SCP or an IAM permission boundary that conditions ec2:RunInstances on the presence of required tag keys via aws:RequestTag conditions. Second, Terraform modules enforce tags at the module level using merge(var.extra_tags, local.mandatory_tags). Because the later argument to merge wins on key collisions, developers can neither omit nor override mandatory tags when using the blessed module library.

# Terraform: mandatory tags enforced in every resource module
locals {
  mandatory_tags = {
    Owner              = var.owner       # e.g. "payments-team@company.com"
    Project            = var.project     # e.g. "PAY"
    CostCenter         = var.cost_center # e.g. "CC-4210"
    Environment        = var.environment # prod | staging | dev
    DataClassification = var.data_class  # public | internal | confidential | restricted
    Terraform          = "true"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = var.instance_type
  # Merge caller-supplied tags; mandatory_tags win on collision
  tags = merge(var.extra_tags, local.mandatory_tags)
}

# SCP condition blocking EC2 launch without Owner tag
# (add to your deny-untagged SCP)
"Condition": {
  "Null": {
    "aws:RequestTag/Owner": "true"
  }
}
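The detective half of tag enforcement, the required-tags managed rule, might be declared as follows. This is a sketch that assumes a Config recorder is already running in the account; the choice of tag keys and resource types is illustrative.

```hcl
# Sketch: flag resources that are missing mandatory tags.
# Assumes a Config recorder is already enabled in this account.
resource "aws_config_config_rule" "required_tags" {
  name = "required-tags"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  # The managed rule accepts up to six keys (tag1Key ... tag6Key)
  input_parameters = jsonencode({
    tag1Key = "Owner"
    tag2Key = "Project"
    tag3Key = "CostCenter"
    tag4Key = "Environment"
  })

  # Limit evaluation to resource types you actually tag (illustrative subset)
  scope {
    compliance_resource_types = ["AWS::EC2::Instance", "AWS::S3::Bucket"]
  }
}
```

Since the managed rule takes at most six keys, a longer tagging standard would need a second rule instance or a custom rule.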

Four-layer governance model: SCPs prevent, Config rules detect, tagging enables accountability, budgets control cost.

Budgets & Cost Guardrails

A budget without an action is just a notification nobody reads. Production cost governance at scale uses AWS Budgets with automated actions to enforce spend limits without human intervention. There are three levels of response worth configuring on every account:

Alert at 80% — notify the team Slack channel (via SNS → Lambda → Slack webhook). No action taken; early warning.
Alert at 100% — notify engineering leadership and the FinOps team. Trigger an SNS topic that a Lambda function reads to post a P2 ticket in your incident tracker.
Action at 110% — AWS Budgets applies an IAM policy that prevents launching new EC2 instances, RDS instances, or NAT Gateways in that account until the budget resets. This is the hard stop that prevents a runaway autoscaling event from spending $50k overnight.

Supplement account-level budgets with cost anomaly detection. AWS Cost Anomaly Detection uses ML to identify spend patterns that deviate from historical baselines — it will catch a forgotten p3.16xlarge training job or an S3 bucket accidentally made public and being hammered by bots before end-of-month billing does.

# Terraform: account-level monthly budget with alert + hard-stop action
resource "aws_budgets_budget" "monthly_limit" {
  name         = "${var.account_name}-monthly"
  budget_type  = "COST"
  limit_amount = var.monthly_limit_usd
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }
}

# Budget action: apply deny policy at 110% (hard stop)
resource "aws_budgets_budget_action" "deny_new_compute" {
  budget_name        = aws_budgets_budget.monthly_limit.name
  action_type        = "APPLY_IAM_POLICY"
  approval_model     = "AUTOMATIC"
  notification_type  = "ACTUAL"
  execution_role_arn = aws_iam_role.budget_action_role.arn

  action_threshold {
    action_threshold_type  = "PERCENTAGE"
    action_threshold_value = 110
  }

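  definition {
    iam_action_definition {
      policy_arn = aws_iam_policy.deny_new_compute.arn
      # At least one of roles, groups, or users must be set, otherwise the
      # policy is attached to nothing. var.budget_enforced_role_names is a
      # hypothetical list of the role names used to deploy in this account.
      roles = var.budget_enforced_role_names
    }
  }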

  subscriber {
    address           = aws_sns_topic.budget_alerts.arn
    subscription_type = "SNS"
  }
}
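The budget action references aws_iam_policy.deny_new_compute without defining it. A minimal sketch of such a policy is shown here; the policy name and the exact action list are assumptions based on the resource types named in the hard-stop description.

```hcl
# Sketch of the policy the budget action attaches; the action list is an assumption.
resource "aws_iam_policy" "deny_new_compute" {
  name        = "budget-deny-new-compute"
  description = "Attached by AWS Budgets when spend crosses the hard-stop threshold"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyNewCostlyResources"
        Effect = "Deny"
        Action = [
          "ec2:RunInstances",
          "rds:CreateDBInstance",
          "ec2:CreateNatGateway",
        ]
        Resource = "*"
      }
    ]
  })
}
```

The policy only blocks the creation of new resources. Anything already running keeps running, which is usually the intent for a cost brake.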

# Cost anomaly detection: one monitor that tracks spend per AWS service
resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "service-level-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name             = "anomaly-alerts"
  frequency        = "IMMEDIATE"
  monitor_arn_list = [aws_ce_anomaly_monitor.service_monitor.arn]

  subscriber {
    address = aws_sns_topic.budget_alerts.arn
    type    = "SNS"
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"]
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}

At big-tech scale, budgets are set per account (not for the entire org) so a single team's spend cannot mask the signal from others. The FinOps platform team owns a "cost governance" account that aggregates cost data across the org from the Cost and Usage Report (CUR), which is delivered to an S3 bucket in the billing account, queried with Athena, and visualised in Grafana or QuickSight.

Pulling It Together: The Governance Pipeline

These four mechanisms are most powerful when they are managed as code in the same landing zone repository. The recommended workflow: SCPs and Config conformance packs live in Terraform, applied by a dedicated governance pipeline that requires two senior engineer approvals. Budget thresholds are stored as per-account variables in a budgets.tfvars map. Tagging standards are enforced at the Terraform module layer so compliance is automatic, not aspirational. Drift from any of these layers triggers a Security Hub finding that pages the Platform team — not just an email that goes unread.
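The per-account budget map could look like the sketch below. It assumes the budgets are created from the payer account and scoped with a LinkedAccount filter; the variable name and the account IDs in the comment are hypothetical.

```hcl
# budgets.tfvars (hypothetical contents):
#   account_budgets = {
#     "111111111111" = 25000
#     "222222222222" = 4000
#   }

variable "account_budgets" {
  description = "Monthly USD limit keyed by AWS account ID"
  type        = map(number)
}

resource "aws_budgets_budget" "per_account" {
  for_each = var.account_budgets

  name         = "account-${each.key}-monthly"
  budget_type  = "COST"
  limit_amount = tostring(each.value)
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Scope each budget to one member account's spend
  cost_filter {
    name   = "LinkedAccount"
    values = [each.key]
  }
}
```

With for_each, adding or resizing a budget is a one-line change to the tfvars file, which keeps threshold changes visible in code review.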

Add a guardrail health dashboard to your ops runbook: a weekly automated report showing SCP coverage percentage (accounts attached vs total), Config compliance scores per account, tagging compliance rate (from aws resourcegroupstaggingapi get-resources --tag-filters), and budget burn rates. Make it visible to engineering leadership — visibility creates accountability.