Cloud & Kubernetes Security Hardening

CSPM & Misconfiguration Management

The most common way cloud environments get breached is not through sophisticated zero-day exploits — it is through misconfiguration. An S3 bucket left public, an IAM role with *:* permissions, a Security Group open to 0.0.0.0/0 on port 22, a Kubernetes API server exposed to the internet with anonymous auth enabled. These are the real attack vectors. The Capital One breach (2019), the Twitch leak (2021), and dozens of smaller incidents share a common root cause: somebody configured something incorrectly, and nobody noticed until an attacker did.

Cloud Security Posture Management (CSPM) is the practice of continuously scanning your cloud environment — across accounts, regions, and services — to detect misconfigurations, benchmark your state against security standards, and either alert on or automatically remediate violations. At big-tech scale, this is not a quarterly audit; it is a continuous control plane running in parallel with your deployment pipeline.

The Misconfiguration Problem at Scale

A typical enterprise cloud environment has hundreds of AWS accounts (or GCP projects / Azure subscriptions) managed by dozens of teams. Each team provisions infrastructure via Terraform, CDK, or the console. Without a centralised posture layer, security findings accumulate faster than any human team can triage them. A single AWS account can have thousands of resources; a 200-account organization can have hundreds of thousands. Manual review cannot keep up at that volume, so automated, continuous posture scanning is the only approach that scales.

The attack surface expands in three dimensions simultaneously: breadth (new services and regions adopted), depth (more configuration options per service — S3 alone has object-level ACLs, bucket policies, public access blocks, encryption settings, Object Lock, Replication, and VPC endpoints), and velocity (infrastructure deployed via CI/CD in minutes). A CSPM platform must track all three dimensions continuously.

Key idea: Misconfiguration is consistently reported as a leading cause of cloud security incidents. CSPM closes the gap between "what we deployed" and "what security policy says we should have deployed", continuously rather than at audit time.

Benchmarks: The Foundation of Posture Scanning

CSPM checks are not arbitrary opinions. They are grounded in published security benchmarks that represent industry consensus on what a secure cloud configuration looks like:

CIS Benchmarks: The Center for Internet Security produces detailed benchmarks for AWS, Azure, GCP, and Kubernetes. Each check is numbered (e.g., control 1.14 in CIS AWS Foundations v1.4.0: "Ensure access keys are rotated every 90 days or less"), categorized by control area, and assigned a profile level (Level 1 = essential, Level 2 = defense-in-depth).
NIST SP 800-53: The US federal control framework, mapped to cloud configurations. Required for FedRAMP workloads.
SOC 2 / ISO 27001: Operational security controls that map to specific cloud configurations.
AWS Foundational Security Best Practices (FSBP): AWS Security Hub's native standard, curated and updated by AWS for its own service set.
PCI DSS: Payment card industry standard with specific cloud configuration requirements for cardholder data environments.

Most production CSPM deployments run multiple frameworks simultaneously and use the union of findings — a resource that fails CIS AWS Level 1 and NIST AC-2 is higher priority than one that fails only a single check.
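
The prioritisation idea above can be sketched in a few lines of Python. This is a minimal illustration, not any CSPM product's scoring model: the finding tuples, the framework labels, and the function name `rank_by_framework_overlap` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical findings: (resource_id, framework, control). Illustrative values only.
findings = [
    ("arn:aws:iam::111122223333:user/alice", "CIS", "access-key-rotation"),
    ("arn:aws:iam::111122223333:user/alice", "NIST-800-53", "AC-2"),
    ("arn:aws:s3:::example-logs", "CIS", "bucket-public-access"),
]

def rank_by_framework_overlap(findings):
    """Return (resource_id, frameworks) pairs, most frameworks failed first."""
    frameworks_by_resource = defaultdict(set)
    for resource_id, framework, _control in findings:
        frameworks_by_resource[resource_id].add(framework)
    return sorted(
        ((rid, sorted(fws)) for rid, fws in frameworks_by_resource.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )

ranked = rank_by_framework_overlap(findings)
for resource_id, frameworks in ranked:
    print(len(frameworks), resource_id, frameworks)
```

A real deployment would usually weight severity and asset criticality as well; counting frameworks is only the simplest way to express "fails more than one standard".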

CSPM Architecture: AWS Security Hub at Scale

AWS Security Hub is the native AWS CSPM platform. It aggregates findings from AWS-native services (GuardDuty, Inspector, Macie, IAM Access Analyzer, Firewall Manager) and third-party tools (Wiz, Prisma Cloud, Orca Security) into a single normalised finding format called ASFF (AWS Security Finding Format).

Figure: AWS Security Hub aggregating findings from member accounts and third-party tools into a single posture plane, then routing to remediation, dashboards, and ticketing.
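
To make the normalised format concrete, here is a minimal sketch of an ASFF-shaped finding built in Python. The required top-level field names follow the ASFF specification; the scanner name, finding `Id` scheme, and helper names (`build_finding`, `missing_fields`) are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Top-level attributes that ASFF requires on every finding.
REQUIRED_ASFF_FIELDS = {
    "SchemaVersion", "Id", "ProductArn", "GeneratorId", "AwsAccountId",
    "Types", "CreatedAt", "UpdatedAt", "Severity", "Title", "Description",
    "Resources",
}

def build_finding(account_id, region, bucket_name, now=None):
    """Build a minimal ASFF-shaped finding for a public S3 bucket (illustrative values)."""
    ts = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        # Hypothetical ID scheme for a custom scanner
        "Id": f"example-scanner/{account_id}/{bucket_name}/public-access",
        # Product ARN form used for custom (self-published) findings
        "ProductArn": f"arn:aws:securityhub:{region}:{account_id}:product/{account_id}/default",
        "GeneratorId": "example-scanner-s3-public-access",
        "AwsAccountId": account_id,
        "Types": ["Software and Configuration Checks/Industry and Regulatory Standards"],
        "CreatedAt": ts,
        "UpdatedAt": ts,
        "Severity": {"Label": "HIGH"},
        "Title": "S3 bucket allows public access",
        "Description": f"Bucket {bucket_name} does not block public access.",
        "Resources": [{"Type": "AwsS3Bucket", "Id": f"arn:aws:s3:::{bucket_name}"}],
    }

def missing_fields(finding):
    """Return the required ASFF fields that a finding dict lacks."""
    return sorted(REQUIRED_ASFF_FIELDS - finding.keys())

example = build_finding("111122223333", "us-east-1", "example-bucket")
print(missing_fields(example))
```

A dict like this is what a custom tool would pass to Security Hub's `BatchImportFindings` API; checking the required fields locally first avoids rejected imports.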

The standard big-tech deployment uses AWS Organizations with Security Hub delegated to a dedicated Security account. Every member account ships findings to the Security account automatically. Enabling this across a 200-account organization takes only a handful of API calls:

# Enable Security Hub across all member accounts via Organizations integration

# Step 1: run from the Organizations management account to designate the
# Security account as the Security Hub delegated administrator
aws securityhub enable-organization-admin-account \
  --admin-account-id 111122223333

# Step 2: run from the Security (delegated admin) account.
# Automatically enable Security Hub and the default standards for new member accounts
# (accounts that already exist are added with `aws securityhub create-members`)
aws securityhub update-organization-configuration \
  --auto-enable \
  --auto-enable-standards DEFAULT

# Enable a specific standard (CIS AWS Foundations 1.4.0) in the current account and Region
aws securityhub batch-enable-standards \
  --standards-subscription-requests \
    '[{"StandardsArn":"arn:aws:securityhub:us-east-1::standards/cis-aws-foundations-benchmark/v/1.4.0"}]'

Posture Scanning with AWS Config Rules and Conformance Packs

AWS Config is the underlying inventory and change-tracking layer. Every change to a recorded resource (creation, modification, deletion) emits a Configuration Item. Config Rules evaluate these items against policy. A Conformance Pack is a collection of Config Rules representing a compliance framework: you deploy one YAML template and get dozens of rules across your account in a single operation.

# Deploy the CIS AWS Foundations Benchmark conformance pack
aws configservice put-conformance-pack \
  --conformance-pack-name CIS-AWS-Foundations-1-4 \
  --template-s3-uri s3://your-bucket/CIS_AWS_Foundations_Benchmark_v1.4.yaml \
  --delivery-s3-bucket your-config-bucket

# Query current compliance status across all rules
aws configservice describe-conformance-pack-compliance \
  --conformance-pack-name CIS-AWS-Foundations-1-4 \
  --query 'ConformancePackRuleComplianceList[?ComplianceType==`NON_COMPLIANT`]' \
  --output table

# Deploy an AWS managed Config rule (AWS owns and runs the evaluation logic)
# Rule: all S3 buckets must have public access block enabled
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "s3-bucket-public-access-block-required",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "S3_BUCKET_LEVEL_PUBLIC_ACCESS_PROHIBITED"
  },
  "Scope": {
    "ComplianceResourceTypes": ["AWS::S3::Bucket"]
  }
}'
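
When no managed rule fits, the same evaluation can be written as a custom rule backed by a Lambda function. The sketch below shows only the evaluation logic, under stated assumptions: the configuration item is assumed to carry the bucket's public access block under `supplementaryConfiguration.PublicAccessBlockConfiguration` with camelCase keys, and the call that reports the result back to Config (`put_evaluations`) is left out so the example stays self-contained.

```python
import json

# Assumed key names inside the recorded public access block configuration
PAB_KEYS = ("blockPublicAcls", "ignorePublicAcls", "blockPublicPolicy", "restrictPublicBuckets")

def evaluate_bucket(configuration_item):
    """Return a Config compliance type for one configuration item."""
    if configuration_item.get("resourceType") != "AWS::S3::Bucket":
        return "NOT_APPLICABLE"
    supplementary = configuration_item.get("supplementaryConfiguration") or {}
    pab = supplementary.get("PublicAccessBlockConfiguration") or {}
    if isinstance(pab, str):
        # Defensive: tolerate the value arriving as a JSON-encoded string
        pab = json.loads(pab)
    all_blocked = all(pab.get(key) is True for key in PAB_KEYS)
    return "COMPLIANT" if all_blocked else "NON_COMPLIANT"

def handler(event, context=None):
    """Parse the invoking event the way a custom rule Lambda would, then evaluate it."""
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]
    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": evaluate_bucket(item),
    }
```

In a deployed rule, the handler would pass this result to the Config `put_evaluations` API together with the `resultToken` from the event.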

Pro practice: Use conformance packs rather than individual rules. A conformance pack deploys atomically — if your pipeline deploys 80 rules and the 60th fails validation, the whole pack rolls back cleanly. More importantly, it gives you a single ARN to query for "am I compliant with CIS Level 1?" rather than aggregating 80 individual rule statuses.

Auto-Remediation: Closing the Loop

Detection without remediation is noise. The real value of CSPM is the detect → remediate feedback loop. AWS Security Hub findings trigger EventBridge rules; EventBridge invokes Lambda functions that call the relevant AWS API to fix the misconfiguration. This is the auto-remediation pattern used at scale.

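The EventBridge rule and remediation Lambda for a public S3 bucket look like this (simplified):

# EventBridge rule: match failed Security Hub findings for the S3 public-access controls.
# Attach the remediation Lambda as a target of this rule with `aws events put-targets`.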
aws events put-rule \
  --name RemediatePublicS3Bucket \
  --event-pattern '{
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {
      "findings": {
        "ProductFields": {
          "ControlId": ["S3.1", "S3.2", "S3.8"]
        },
        "Compliance": {
          "Status": ["FAILED"]
        }
      }
    }
  }'

# Lambda remediation function (Python, boto3); simplified, without exception handling
import boto3

s3 = boto3.client('s3')
securityhub = boto3.client('securityhub')

def handler(event, context):
    for finding in event['detail']['findings']:
        resource = finding['Resources'][0]
        # S3.1 is an account-level control; only bucket resources can be fixed here
        if resource['Type'] != 'AwsS3Bucket':
            continue
        # Bucket ARNs have the form arn:aws:s3:::bucket-name
        bucket_name = resource['Id'].split(':::')[-1]
        s3.put_public_access_block(
            Bucket=bucket_name,
            PublicAccessBlockConfiguration={
                'BlockPublicAcls': True,
                'IgnorePublicAcls': True,
                'BlockPublicPolicy': True,
                'RestrictPublicBuckets': True,
            }
        )
        # Update finding workflow status to RESOLVED
        securityhub.batch_update_findings(
            FindingIdentifiers=[{'Id': finding['Id'], 'ProductArn': finding['ProductArn']}],
            Workflow={'Status': 'RESOLVED'}
        )

Production pitfall: Auto-remediation without a safeguard list will break things. A Lambda that blindly blocks all public S3 buckets will break static website hosting, CloudFront origin buckets with intentional public policies, and any bucket an application team deliberately made public for legitimate reasons. Always maintain a suppression list (DynamoDB table or SSM parameter) of approved exceptions, and check it before remediating. Equally important: write the finding workflow status to SUPPRESSED (not RESOLVED) for exceptions so the audit trail is clean.
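
The exception check described above can be kept as a small pure function that the Lambda calls before touching anything. This is a sketch under assumptions: the approved bucket names are hypothetical, and in practice the set would be loaded from the DynamoDB table or SSM parameter rather than hard-coded.

```python
def decide_action(bucket_name, approved_public_buckets):
    """Return (action, workflow_status) for a failed public-bucket finding."""
    if bucket_name in approved_public_buckets:
        # Approved exception: leave the bucket alone, mark the finding SUPPRESSED
        return "skip", "SUPPRESSED"
    # Not on the list: apply the public access block, then mark RESOLVED
    return "block_public_access", "RESOLVED"

# Hypothetical exception list; a real one would be read from DynamoDB or SSM
approved = {"marketing-static-site", "public-downloads"}

for name in ("marketing-static-site", "team-scratch-bucket"):
    print(name, decide_action(name, approved))
```

Keeping the decision separate from the AWS calls makes the exception logic easy to unit test, which matters because a mistake here either breaks production or silently leaves a bucket exposed.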

Open-Source CSPM: Prowler and Steampipe

Prowler is one of the most widely used open-source CSPM tools. It implements 300+ checks across AWS, Azure, and GCP and outputs findings in JSON, ASFF, CSV, or HTML. It integrates directly into CI/CD pipelines and Security Hub. Starting a CIS benchmark scan takes a single command:

# Install Prowler (Python 3.9+)
pip install prowler

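# Run CIS AWS Foundations Benchmark Level 1 against the current account
# (framework identifiers vary between Prowler releases; list the valid ones with `prowler aws --list-compliance`)
prowler aws --compliance cis_level1_aws --output-formats json,html --output-directory ./findings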

# Run specific check families (IAM, S3, Logging)
prowler aws --services iam s3 cloudtrail --severity high critical

# Integrate with Security Hub (sends ASFF findings directly)
prowler aws --security-hub --compliance cis_level2_aws

# Run in CI: Prowler exits with code 3 when any executed check fails, which makes it usable as a pipeline gate
prowler aws --compliance cis_level1_aws --severity critical high
if [ $? -eq 3 ]; then echo "CRITICAL/HIGH findings: blocking deploy"; exit 1; fi

Steampipe takes a different angle: it presents your cloud resources as SQL tables so you can query posture with standard SQL. This is extremely powerful for ad-hoc investigations and custom benchmarks:

-- Steampipe: find all S3 buckets with versioning disabled
SELECT name, region, versioning_enabled
FROM aws_s3_bucket
WHERE versioning_enabled = false
ORDER BY region, name;

-- Find IAM users with console access but no MFA
-- (login_profile is populated only when the user has a console password)
SELECT name, create_date, mfa_enabled, password_last_used
FROM aws_iam_user
WHERE mfa_enabled = false
  AND login_profile IS NOT NULL
ORDER BY password_last_used DESC NULLS LAST;

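-- Security groups with inbound 0.0.0.0/0 on a rule that covers SSH (22) or RDP (3389),
-- including port ranges and "all traffic" rules
SELECT group_id, group_name, vpc_id, region,
       perm ->> 'FromPort' AS from_port,
       perm ->> 'ToPort' AS to_port,
       perm ->> 'IpProtocol' AS protocol
FROM aws_vpc_security_group,
     jsonb_array_elements(ip_permissions) AS perm,
     jsonb_array_elements(perm -> 'IpRanges') AS cidr
WHERE cidr ->> 'CidrIp' = '0.0.0.0/0'
  AND (perm ->> 'IpProtocol' = '-1'
       OR ((perm ->> 'FromPort')::int <= 22 AND (perm ->> 'ToPort')::int >= 22)
       OR ((perm ->> 'FromPort')::int <= 3389 AND (perm ->> 'ToPort')::int >= 3389));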

Kubernetes Posture: kube-bench

For Kubernetes clusters, the CIS Kubernetes Benchmark is the standard. kube-bench runs the benchmark locally against any cluster and reports pass/fail per control:

# Run kube-bench against current cluster (auto-detects control-plane vs node)
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench

# Or run directly on a node
docker run --pid=host -v /etc:/etc:ro -v /var:/var:ro \
  -v $(which kubectl):/usr/local/mount-from-host/bin/kubectl:ro \
  -v ~/.kube:/.kube:ro \
  --env KUBECONFIG=/.kube/config \
  aquasec/kube-bench:latest \
  --benchmark cis-1.8

# Key CIS Kubernetes controls to verify (IDs shift between benchmark versions;
# match them against the IDs in your kube-bench output):
# 1.1.1 - kube-apiserver.yaml file permissions (should be 600 or more restrictive)
# 1.2.1 - API server anonymous-auth=false
# 1.2.x - API server audit-log-path configured
# 4.2.1 - Kubelet anonymous auth disabled
# 5.1.1 - cluster-admin role used only where required

Drift Detection and Pipeline-Gated Scanning

Posture scanning must happen in two places: continuously (every configuration change triggers a re-evaluation) and pre-deployment (Terraform plans and Kubernetes manifests are scanned before they reach production). The pre-deployment gate catches misconfigurations before they ever exist, not after.

Tools like Checkov (Terraform/Kubernetes/CloudFormation IaC scanner) and tfsec plug into your CI pipeline as a quality gate. A Security Group open to 0.0.0.0/0 in a Terraform plan fails the pipeline before terraform apply is ever run:

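# Run Checkov against a Terraform directory, limited to three checks:
# SSH open to 0.0.0.0/0, RDP open to 0.0.0.0/0, S3 public ACLs not blocked
checkov -d ./terraform --framework terraform \
  --check CKV_AWS_24,CKV_AWS_25,CKV_AWS_53 \
  --compact --output cli

# Or scan all frameworks at once with a severity threshold
# (severity-based thresholds require a platform API key, because severities come from the platform)
checkov -d . --soft-fail-on MEDIUM --hard-fail-on HIGH,CRITICAL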

# GitHub Actions snippet: gate PR on Checkov
# - name: Run Checkov IaC scan
#   uses: bridgecrewio/checkov-action@master
#   with:
#     directory: terraform/
#     framework: terraform
#     soft_fail: false
#     check: CKV_AWS_*
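
To show what such a gate does underneath, here is a rough Python sketch that inspects the JSON form of a Terraform plan (as produced by `terraform show -json`) for security groups exposing SSH or RDP to the internet. It is not how Checkov is implemented. It assumes inline `ingress` blocks on `aws_security_group` resources and ignores standalone rule resources; the function name and sample plan are hypothetical.

```python
ADMIN_PORTS = (22, 3389)  # SSH and RDP

def open_admin_ingress(plan):
    """Return addresses of planned security groups exposing SSH/RDP to 0.0.0.0/0."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group":
            continue
        # "after" is null for resources that are being destroyed
        after = (rc.get("change") or {}).get("after") or {}
        for rule in after.get("ingress") or []:
            if "0.0.0.0/0" not in (rule.get("cidr_blocks") or []):
                continue
            low, high = rule.get("from_port"), rule.get("to_port")
            all_traffic = rule.get("protocol") == "-1"
            in_range = (
                low is not None and high is not None
                and any(low <= port <= high for port in ADMIN_PORTS)
            )
            if all_traffic or in_range:
                offenders.append(rc["address"])
                break  # one offending rule is enough to flag the group
    return offenders

# Hypothetical, heavily trimmed plan document
sample_plan = {
    "resource_changes": [
        {"address": "aws_security_group.bastion", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 22, "to_port": 22, "protocol": "tcp", "cidr_blocks": ["0.0.0.0/0"]}]}}},
        {"address": "aws_security_group.web", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 443, "to_port": 443, "protocol": "tcp", "cidr_blocks": ["0.0.0.0/0"]}]}}},
    ]
}
print(open_admin_ingress(sample_plan))
```

A pipeline step would exit non-zero when the returned list is non-empty. Purpose-built scanners cover far more resource types and edge cases, so a hand-rolled check like this is best kept for organisation-specific rules.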

Big-tech practice: The mature pattern is "shift left + detect right." Run Checkov/tfsec in the PR pipeline (shift left) AND run Prowler/Config continuously in production (detect right). The PR gate prevents new misconfigs; the continuous scanner catches configuration drift from console changes, manual hotfixes, or service defaults that changed after your Terraform was last applied. Both layers are required — neither alone is sufficient.
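
Drift detection itself reduces to comparing what the IaC declares with what the cloud API reports. The following is a minimal sketch of that comparison, assuming both sides have already been flattened into dictionaries keyed by resource ID; the resource IDs and setting names are hypothetical.

```python
def find_drift(declared, live):
    """Compare declared (IaC) settings with live settings, per resource.

    Returns {resource_id: {setting: (declared_value, live_value)}} for every
    mismatch, plus markers for missing and unmanaged resources.
    """
    drift = {}
    for resource_id, want in declared.items():
        have = live.get(resource_id)
        if have is None:
            drift[resource_id] = {"_missing": (want, None)}
            continue
        changed = {k: (v, have.get(k)) for k, v in want.items() if have.get(k) != v}
        if changed:
            drift[resource_id] = changed
    # Resources that exist in the cloud but not in code (e.g. console-created)
    for resource_id in live.keys() - declared.keys():
        drift[resource_id] = {"_unmanaged": (None, live[resource_id])}
    return drift

declared_state = {
    "sg-bastion": {"ssh_open_to_world": False},
    "bucket-logs": {"public": False, "versioning": True},
}
live_state = {
    "sg-bastion": {"ssh_open_to_world": True},   # manual hotfix via console
    "bucket-logs": {"public": False, "versioning": True},
    "bucket-tmp": {"public": True},              # created outside Terraform
}
print(find_drift(declared_state, live_state))
```

The unmanaged-resource case is the one a PR gate can never catch, which is why the continuous scanner remains necessary even with strict pipeline checks.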

Compliance Scoring and SLA Tracking

CSPM is not just about finding individual bugs; it is about demonstrating posture improvement over time. Security Hub provides an account-level security score per standard (0 to 100%). Big-tech security teams define SLAs by severity: Critical findings remediated within 24 hours, High within 7 days, Medium within 30 days. These SLAs are tracked via Security Hub metrics exported to CloudWatch and displayed on a security operations dashboard.

The most important metric is not "number of findings" — it is mean time to remediate (MTTR) by severity. A team with 500 findings all remediated within SLA is more secure than a team with 50 findings where 10 Criticals have been open for 45 days.
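
As a rough sketch of how MTTR and SLA tracking can be computed from exported findings: the SLA windows below are the ones stated above, while the record layout (`id`, `severity`, `created_at`, `resolved_at`) and the sample data are assumptions for illustration, not a Security Hub export format.

```python
from datetime import datetime, timedelta
from statistics import mean

# SLA windows by severity, as defined in the text
SLA = {
    "CRITICAL": timedelta(hours=24),
    "HIGH": timedelta(days=7),
    "MEDIUM": timedelta(days=30),
}

def mttr_by_severity(findings):
    """Mean time to remediate, in hours, over findings that have been resolved."""
    durations = {}
    for f in findings:
        if f["resolved_at"] is not None:
            hours = (f["resolved_at"] - f["created_at"]).total_seconds() / 3600
            durations.setdefault(f["severity"], []).append(hours)
    return {severity: mean(values) for severity, values in durations.items()}

def sla_breaches(findings, now):
    """IDs of findings (open or closed) that exceeded their severity's SLA window."""
    breaches = []
    for f in findings:
        end = f["resolved_at"] or now
        limit = SLA.get(f["severity"])
        if limit is not None and end - f["created_at"] > limit:
            breaches.append(f["id"])
    return breaches

# Hypothetical sample data
sample = [
    {"id": "f-1", "severity": "CRITICAL",
     "created_at": datetime(2024, 1, 1, 0, 0), "resolved_at": datetime(2024, 1, 1, 12, 0)},
    {"id": "f-2", "severity": "CRITICAL",
     "created_at": datetime(2024, 1, 1, 0, 0), "resolved_at": datetime(2024, 1, 3, 0, 0)},
    {"id": "f-3", "severity": "HIGH",
     "created_at": datetime(2024, 1, 1, 0, 0), "resolved_at": None},
]
print(mttr_by_severity(sample))
print(sla_breaches(sample, now=datetime(2024, 1, 5, 0, 0)))
```

Open findings are excluded from MTTR but still count toward SLA breaches, so a team cannot improve its numbers simply by leaving hard findings unresolved.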