Cloud & Kubernetes Security Hardening

Project: Harden a Cloud + K8s Estate

Every lesson in this tutorial has introduced a control category in isolation — IAM, network, pod security, cluster hardening, runtime detection, zero trust, CSPM, and incident response. The real challenge is applying all of them coherently to a single production architecture, resolving conflicts, sequencing work so the live system is never disrupted, and proving to auditors and your own SRE team that the hardening holds over time.

This capstone project walks through a realistic reference architecture — an e-commerce platform running on AWS with a production EKS cluster — and applies every layer of the hardening checklist to it. Every step is a real command or config you would run, not a conceptual exercise. By the end you will have a repeatable runbook for any cloud plus Kubernetes estate.

Reference Architecture

The sample estate consists of:

An AWS account with three environments (dev, staging, prod) in separate VPCs.
An EKS 1.29 cluster in the prod VPC running ten microservices, a PostgreSQL RDS instance, and an ElastiCache Redis cluster.
An S3 bucket for user-uploaded assets, a CloudFront distribution, and an ALB in front of the cluster.
ECR as the container registry; GitHub Actions as the CI/CD system; Terraform Cloud managing infrastructure.
Prometheus, Grafana, and Loki for observability; Falco for runtime security.

Hardened estate: every layer from the internet edge to the data tier has explicit controls applied.

Phase 1 — AWS Account & IAM Baseline

Before touching the cluster, lock down the account itself. Enable Security Hub and run the following IAM checks to establish a baseline:

# Enable Security Hub with its default standards (these include the CIS AWS Foundations benchmark)
aws securityhub enable-security-hub \
  --enable-default-standards \
  --region us-east-1

# Verify root account has MFA and no active access keys
# (the counters live under the SummaryMap key of the response)
aws iam get-account-summary | jq '.SummaryMap | .AccountMFAEnabled, .AccountAccessKeysPresent'
# Expected: 1 (MFA on), 0 (no root access keys)

# List all IAM users with console access and no MFA
# (report generation is asynchronous; retry get-credential-report if it is not ready yet)
aws iam generate-credential-report
aws iam get-credential-report --query 'Content' --output text | base64 -d | \
  awk -F',' 'NR>1 && $4=="true" && $8=="false" {print $1, "NO MFA"}'

# Attach a permissions boundary to every human IAM role to cap blast radius
aws iam put-role-permissions-boundary \
  --role-name EngineerRole \
  --permissions-boundary arn:aws:iam::123456789012:policy/OrgMaxPermissions

Enable CloudTrail with log file integrity validation across all regions, and route events into an S3 bucket in a separate logging account that engineers cannot write to or delete from. This is the immutable audit log that incident response depends on.

aws cloudtrail create-trail \
  --name org-trail \
  --s3-bucket-name acme-audit-logs-immutable \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --include-global-service-events

aws cloudtrail start-logging --name org-trail

# S3 bucket policy on the logging account: deny deletion and policy changes for everyone
# (apply this in Terraform; shown here for clarity)
# "Effect": "Deny",
# "Principal": "*",
# "Action": ["s3:DeleteObject","s3:DeleteBucket","s3:PutBucketPolicy"],
# "Resource": ["arn:aws:s3:::acme-audit-logs-immutable",
#              "arn:aws:s3:::acme-audit-logs-immutable/*"]
# (bucket-level actions match the bucket ARN, object-level actions match the /* ARN)
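The commented fragment can be expanded into a complete policy document. The following is a minimal sketch, not the project's actual policy: the logging account ID (111122223333), the break-glass role name, and the extra s3:DeleteObjectVersion action are assumptions added for illustration. Because put-bucket-policy replaces the whole policy, this statement would need to be merged with the statements that let CloudTrail write to the bucket.

```shell
# Write a deny-tampering bucket policy for the audit bucket (sketch).
# Assumptions: account ID 111122223333 and the LogArchiveBreakGlass role
# are hypothetical placeholders, not values from this project.
cat > audit-bucket-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyLogTampering",
      "Effect": "Deny",
      "Principal": "*",
      "Action": [
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:DeleteBucket",
        "s3:PutBucketPolicy"
      ],
      "Resource": [
        "arn:aws:s3:::acme-audit-logs-immutable",
        "arn:aws:s3:::acme-audit-logs-immutable/*"
      ],
      "Condition": {
        "ArnNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::111122223333:role/LogArchiveBreakGlass"
        }
      }
    }
  ]
}
EOF

# Apply the policy from the logging account. Defined as a function so that
# nothing is changed until it is called explicitly.
apply_audit_policy() {
  aws s3api put-bucket-policy \
    --bucket acme-audit-logs-immutable \
    --policy file://audit-bucket-policy.json
}
```

The condition leaves one narrowly scoped role able to repair the policy; without some exemption, later policy updates from Terraform would also be denied.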

Phase 2 — Network Hardening

Review every security group in the prod VPC for 0.0.0.0/0 ingress on non-HTTP ports. The EKS node groups should only accept traffic from the ALB security group on the application port and from the cluster control plane security group on the Kubelet port (10250). RDS should only accept traffic from the EKS node security group on 5432 — nothing else.
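The RDS rule described above could be expressed with two AWS CLI calls. This is a sketch under the assumption that the caller supplies both security group IDs; no real group IDs from the estate are implied.

```shell
# Allow PostgreSQL only from the EKS node security group (sketch).
allow_db_from_nodes() {
  db_sg="$1"     # security group attached to the RDS instance
  node_sg="$2"   # security group attached to the EKS worker nodes
  aws ec2 authorize-security-group-ingress \
    --group-id "$db_sg" \
    --protocol tcp \
    --port 5432 \
    --source-group "$node_sg"
}

# Remove a leftover world-open rule on the same port, if one exists.
revoke_db_world_open() {
  db_sg="$1"
  aws ec2 revoke-security-group-ingress \
    --group-id "$db_sg" \
    --protocol tcp \
    --port 5432 \
    --cidr 0.0.0.0/0
}
```

Referencing the node security group rather than a CIDR range means the rule keeps working when node IP addresses change.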

# Find security groups with unrestricted ingress
aws ec2 describe-security-groups \
  --filters "Name=ip-permission.cidr,Values=0.0.0.0/0" \
  --query 'SecurityGroups[*].[GroupId,GroupName,IpPermissions[*].FromPort]' \
  --output table

# Apply Kubernetes NetworkPolicy: default-deny all ingress, then allow selectively
# File: k8s/netpol-default-deny.yaml
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

kubectl apply -f k8s/netpol-default-deny.yaml
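Applying the manifests does not prove they are enforced: NetworkPolicy objects only take effect when the cluster's CNI implements them. One way to check is to probe the api workload from a pod that lacks the app=frontend label. The sketch below assumes a Service named api listening on port 8080 and a namespace that still admits an unhardened busybox pod; both are assumptions, not details from the reference architecture.

```shell
# Probe the api Service from a pod that does NOT carry app=frontend.
probe_api_from_unlabelled_pod() {
  ns="${1:-production}"
  kubectl run netpol-probe \
    --namespace "$ns" \
    --image=busybox:1.36 \
    --restart=Never \
    --rm -i --quiet \
    -- wget -q -O- -T 3 http://api:8080/
}

# Interpret the probe: a failed connection suggests the policy is enforced.
check_default_deny() {
  if probe_api_from_unlabelled_pod "$1" >/dev/null 2>&1; then
    echo "NOT ENFORCED: unlabelled pod reached api:8080"
    return 1
  fi
  echo "OK: unlabelled pod was blocked"
}
```

A failure can also mean the probe pod itself was rejected or the Service name is wrong, so treat an OK result as a hint and confirm it against the pod events.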

Phase 3 — EKS Cluster Hardening Checklist

Work through this checklist against your cluster. Each item maps to a CIS EKS Benchmark control:

API server endpoint access: disable the public endpoint or restrict it to your corporate IP range, and enable the private endpoint. Use eksctl utils update-cluster-endpoints or aws eks update-cluster-config --resources-vpc-config.
Envelope encryption for Secrets: pass --encryption-config with a KMS key to aws eks create-cluster, or run aws eks associate-encryption-config on an existing cluster. Verify: aws eks describe-cluster --name prod --query 'cluster.encryptionConfig'.
RBAC audit: list all ClusterRoleBindings to cluster-admin. Expect only the built-in system bindings and your break-glass account.
Pod Security Standards: set the pod-security.kubernetes.io/enforce: restricted label on all production namespaces.
Node group IMDSv2: all launch templates must set HttpTokens: required and HttpPutResponseHopLimit: 1 to block pod-level SSRF attacks against the metadata service.
IRSA over node roles: every service account that needs AWS access must use IAM Roles for Service Accounts, not a blanket node IAM role policy.
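Three of the checklist items translate directly into CLI calls. The functions below are a sketch: the cluster name prod and the production namespace come from the text, while the instance ID, service account name, and policy ARN are supplied by the caller. The IRSA step assumes the cluster's IAM OIDC provider has already been associated.

```shell
# 1. Private-only API endpoint.
lock_down_endpoint() {
  aws eks update-cluster-config \
    --name prod \
    --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true
}

# 2. Enforce IMDSv2 with hop limit 1 on an already-running node.
enforce_imdsv2() {
  instance_id="$1"
  aws ec2 modify-instance-metadata-options \
    --instance-id "$instance_id" \
    --http-tokens required \
    --http-put-response-hop-limit 1 \
    --http-endpoint enabled
}

# 3. Bind an IAM policy to a single service account through IRSA.
create_irsa_binding() {
  sa_name="$1"
  policy_arn="$2"
  eksctl create iamserviceaccount \
    --cluster prod \
    --namespace production \
    --name "$sa_name" \
    --attach-policy-arn "$policy_arn" \
    --approve
}
```

Disabling the public endpoint cuts off any operator or CI runner without a network path into the VPC, so confirm that path exists before calling lock_down_endpoint.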

# Audit ClusterRoleBindings to cluster-admin — should be minimal
kubectl get clusterrolebindings \
  -o json | jq '.items[] | select(.roleRef.name=="cluster-admin") | .subjects'

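# Check namespace pod security labels (only the pod-security.kubernetes.io/* labels are shown)
kubectl get namespaces -o json | jq '.items[] | {name: .metadata.name, pss: ((.metadata.labels // {}) | with_entries(select(.key | startswith("pod-security.kubernetes.io/"))))}'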

# Apply restricted PSS to the production namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/enforce-version=v1.29 \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

# Verify EKS envelope encryption
aws eks describe-cluster --name prod \
  --query 'cluster.encryptionConfig[*].{Resources:resources,KMSKey:provider.keyArn}'
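For reference, a pod spec that satisfies the restricted profile needs a handful of explicit securityContext fields. The manifest below is a sketch for the api workload: the replica count and the runAsUser value 10001 are arbitrary assumptions, and the image must actually be able to run as a non-root user.

```shell
# Write a Deployment whose pod template meets the "restricted" profile.
cat > api-deployment-restricted.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001          # arbitrary non-root UID (assumption)
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: api
        image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:1.4.2
        ports:
        - containerPort: 8080
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
EOF

# Server-side dry run: nothing is persisted, but the API server returns
# Pod Security warnings if the namespace carries warn=restricted.
validate_restricted() {
  kubectl apply --dry-run=server -f api-deployment-restricted.yaml
}
```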

Phase 4 — Secrets Management

Replace all Kubernetes Secret objects that store plaintext credentials with Vault Agent Injector or the Secrets Store CSI Driver backed by AWS Secrets Manager. Rotate every long-lived credential found in the audit. Set a 90-day max TTL on any static secret in Vault.

# Install the Secrets Store CSI Driver + AWS provider
helm repo add secrets-store-csi-driver \
  https://kubernetes-sigs.github.io/secrets-store-csi-driver/charts
helm install csi-secrets-store \
  secrets-store-csi-driver/secrets-store-csi-driver \
  -n kube-system --set syncSecret.enabled=true

helm repo add aws-secrets-manager \
  https://aws.github.io/secrets-store-csi-driver-provider-aws
helm install -n kube-system awssmp aws-secrets-manager/secrets-store-csi-driver-provider-aws

# SecretProviderClass pointing to AWS Secrets Manager
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: db-credentials
  namespace: production
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prod/db/password"
        objectType: "secretsmanager"

Phase 5 — Supply Chain & Image Security

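Every image running in production must be signed with Cosign and verified at admission time. OPA Gatekeeper cannot check Cosign signatures on its own, so pair it with an external data provider such as Ratify, or use a dedicated verifier such as the Sigstore policy-controller. Your ECR lifecycle policy must delete untagged images after 14 days, and your GitHub Actions workflow must run Trivy on every PR and block merges on CRITICAL vulnerabilities.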

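# Sign the image after push (in CI); --yes skips the interactive confirmation in Cosign 2.x
# Prefer signing the image digest over a mutable tag in real pipelines
cosign sign --yes --key cosign.key \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:1.4.2

# Verify the signature by hand; the admission controller performs the equivalent check
cosign verify --key cosign.pub \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:1.4.2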

# Trivy scan in GitHub Actions: block on CRITICAL
# (pin the action to a release tag or commit SHA in real pipelines; @master is a moving target)
- name: Scan image
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:1.4.2
    exit-code: '1'
    severity: 'CRITICAL'
    ignore-unfixed: true
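The ECR lifecycle requirement mentioned above can be captured as a small policy document. A minimal sketch, assuming the repository is named myapp as in the image references:

```shell
# Lifecycle policy: expire untagged images 14 days after they were pushed.
cat > ecr-lifecycle-untagged.json <<'EOF'
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images 14 days after push",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}
EOF

# Attach the policy to a repository (defaults to the assumed name "myapp").
apply_ecr_lifecycle() {
  repo="${1:-myapp}"
  aws ecr put-lifecycle-policy \
    --repository-name "$repo" \
    --lifecycle-policy-text file://ecr-lifecycle-untagged.json
}
```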

Phase 6 — Runtime Detection & Continuous Compliance

Falco should already be running from the Runtime Security lesson. Validate that at least these rules are active: shell spawned in container, write to /etc in container, unexpected outbound connection, and privilege escalation via sudo. Route Falco alerts to PagerDuty via the Falcosidekick integration. Set up a weekly Security Hub report emailed to the engineering leads. Schedule a quarterly CIS benchmark scan using kube-bench as a CronJob.

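# Run kube-bench as a one-off Job to produce a CIS score
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
# Wait for the Job to finish before reading its logs
kubectl wait --for=condition=complete job/kube-bench --timeout=180s
kubectl logs job/kube-bench 2>&1 | grep -E "PASS|FAIL|WARN" | tail -40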

# Confirm Falco is catching the shell-in-container rule
# (keep -it: the default rule only fires for a shell with an attached terminal)
kubectl exec -it -n production deploy/api -- /bin/sh 2>/dev/null || true
# Falco's default "Terminal shell in container" rule should fire at Notice priority

# Check Falco rule coverage: -L lists the loaded rules with their descriptions
kubectl exec -n falco ds/falco -- falco -L | grep -i "shell"
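To turn the kube-bench log into something a scheduled job can act on, the result lines can be tallied and the exit code tied to failures. This helper is a sketch that assumes kube-bench's default text output, where each result line begins with [PASS], [FAIL], or [WARN]; the function name is made up for this example.

```shell
# Summarise kube-bench output read from stdin.
# Prints the three counters and exits non-zero when any check FAILs.
summarize_bench() {
  awk '
    /^\[PASS\]/ { pass++ }
    /^\[FAIL\]/ { fail++ }
    /^\[WARN\]/ { warn++ }
    END {
      printf "PASS=%d FAIL=%d WARN=%d\n", pass, fail, warn
      exit (fail > 0) ? 1 : 0
    }
  '
}
```

Typical use would be kubectl logs job/kube-bench | summarize_bench, with the exit status deciding whether the scheduled run raises an alert.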

The hardening checklist is a living document. CVEs land daily, Kubernetes releases change default behaviors, and your own infrastructure evolves. Commit the full checklist as a Markdown file to your infrastructure repo and run it as part of every quarterly security review. Version it, date it, and sign off on each run. Auditors at SOC 2 and ISO 27001 reviews will ask for exactly this artifact.

Proving It Holds: Continuous Control Validation

Hardening applied once decays. Engineers rotate, Terraform state drifts, and new workloads are deployed without going through the checklist. Use the following mechanisms to enforce that controls stay in place:

OPA Gatekeeper constraints for every pod security and image signing rule — rejected at admission, not discovered post-deployment.
AWS Config Rules for every account-level control — auto-remediation Lambda functions for low-risk rules (e.g., enable CloudTrail), human-review alerts for high-risk ones (e.g., public S3 bucket).
Trivy Operator running inside the cluster, continuously scanning running workloads and writing results to VulnerabilityReport CRDs. Alert in Grafana when a CRITICAL CVE appears in a running image.
Terraform Sentinel or Checkov in every PR — block any infrastructure change that opens a security group to 0.0.0.0/0, disables encryption, or removes an audit trail.
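Two of these mechanisms reduce to one-line commands that can be wrapped for CI and for spot checks. The sketch below assumes Terraform code lives in a directory named infra and that Trivy Operator is already installed; both are assumptions about the repository and cluster.

```shell
# CI gate: scan Terraform with Checkov. A non-zero exit blocks the PR.
scan_terraform() {
  tf_dir="${1:-infra}"   # assumed repo layout
  checkov --directory "$tf_dir" --framework terraform --compact --quiet
}

# Spot check: list Trivy Operator findings across all namespaces.
list_vuln_reports() {
  kubectl get vulnerabilityreports --all-namespaces -o wide
}
```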

Start with the highest-impact, lowest-effort controls. In practice, the items that deliver the most risk reduction per engineer-hour are: enabling GuardDuty and Security Hub (minutes of work, and they surface many common active threats), enforcing IMDSv2 on all EC2 and node groups (eliminates a whole class of SSRF-to-credential-theft attacks), and rotating or removing unused IAM access keys. Do these three before anything else.
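For the access key item, the credential report generated in Phase 1 can be filtered for keys that are active but have never been used. A minimal sketch, assuming the decoded CSV report arrives on stdin; the function name is made up for this example.

```shell
# Print IAM users holding an active access key that has never been used.
# Column positions follow the credential report layout:
#   9 / 11  = access_key_1_active / access_key_1_last_used_date
#   14 / 16 = access_key_2_active / access_key_2_last_used_date
never_used_keys() {
  awk -F',' '
    NR > 1 && $9  == "true" && $11 == "N/A" { print $1, "access_key_1" }
    NR > 1 && $14 == "true" && $16 == "N/A" { print $1, "access_key_2" }
  '
}
```

It can be chained onto the existing report command: aws iam get-credential-report --query 'Content' --output text | base64 -d | never_used_keys.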

Enforcing Pod Security Standards on an existing cluster can block new pods. Never set the enforce label on a namespace without first running it with pod-security.kubernetes.io/warn=restricted and pod-security.kubernetes.io/audit=restricted for at least one full deploy cycle, reading every warning returned to kubectl and CI and every audit annotation in the API server audit log, and fixing all non-compliant pod specs. Existing pods keep running after the switch, but if a live namespace still has non-compliant Deployments, the next rollout or reschedule cannot create pods, which means an outage during business hours.
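The staged rollout can be scripted so the enforce step is never applied by accident. The functions below are a sketch with made-up names; the server-side dry run asks the API server which existing pods would violate the profile without changing the namespace.

```shell
# Preview: which existing pods would violate "restricted"?
# Nothing is persisted; violations come back as warnings.
preview_enforce() {
  ns="${1:-production}"
  kubectl label --dry-run=server --overwrite namespace "$ns" \
    pod-security.kubernetes.io/enforce=restricted
}

# Stage 1: warn and audit only. Enforce is a separate, later step.
stage_warn_audit() {
  ns="${1:-production}"
  kubectl label --overwrite namespace "$ns" \
    pod-security.kubernetes.io/warn=restricted \
    pod-security.kubernetes.io/audit=restricted
}
```

Running preview_enforce again just before the real switch gives a final check that no workload regressed during the observation period.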

Handoff Artifacts

A hardening project is not complete until it produces durable artifacts that survive team turnover. Deliver: a hardening runbook (this checklist, with commands), a threat model document (STRIDE applied to the architecture diagram), an evidence bundle for your compliance team (Security Hub exports, kube-bench report in JSON or JUnit format, Trivy SBOM for each image), and a backlog of findings with owners and due dates tracked in your issue tracker. Schedule the next re-run to coincide with the quarterly security review.