Project: Harden a Cloud + K8s Estate
Every lesson in this tutorial has introduced a control category in isolation — IAM, network, pod security, cluster hardening, runtime detection, zero trust, CSPM, and incident response. The real challenge is applying all of them coherently to a single production architecture, resolving conflicts, sequencing work so the live system is never disrupted, and proving to auditors and your own SRE team that the hardening holds over time.
This capstone project walks through a realistic reference architecture — an e-commerce platform running on AWS with a production EKS cluster — and applies every layer of the hardening checklist to it. Every step is a real command or config you would run, not a conceptual exercise. By the end you will have a repeatable runbook for any cloud plus Kubernetes estate.
Reference Architecture
The sample estate consists of:
- An AWS account with three environments (dev, staging, prod) in separate VPCs.
- An EKS 1.29 cluster in the prod VPC running ten microservices, a PostgreSQL RDS instance, and an ElastiCache Redis cluster.
- An S3 bucket for user-uploaded assets, a CloudFront distribution, and an ALB in front of the cluster.
- ECR as the container registry; GitHub Actions as the CI/CD system; Terraform Cloud managing infrastructure.
- Prometheus, Grafana, and Loki for observability; Falco for runtime security.
Phase 1 — AWS Account & IAM Baseline
Before touching the cluster, lock down the account itself. Start by enabling AWS Security Hub and AWS Config and recording the resulting score as your baseline, so every later change can be measured against it.
Enable CloudTrail with log file integrity validation across all regions, and route events into an S3 bucket in a separate logging account that engineers cannot write to or delete from. This is the immutable audit log that incident response depends on.
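The CloudTrail requirements above can be checked mechanically. The following is a minimal sketch, assuming you have loaded the JSON output of aws cloudtrail describe-trails into a dict; the function name and the bucket name example-audit-logs are hypothetical. It cannot prove that the bucket lives in a separate account or that engineers lack write access, so those still need a manual review of the bucket policy.

```python
def cloudtrail_gaps(describe_trails: dict, logging_bucket: str) -> list:
    """Return findings for trails that miss the Phase 1 baseline."""
    trails = describe_trails.get("trailList", [])
    if not trails:
        return ["no CloudTrail trail is configured"]
    findings = []
    for trail in trails:
        name = trail.get("Name", "<unnamed>")
        if not trail.get("IsMultiRegionTrail", False):
            findings.append(f"{name}: not a multi-region trail")
        if not trail.get("LogFileValidationEnabled", False):
            findings.append(f"{name}: log file integrity validation is off")
        if trail.get("S3BucketName") != logging_bucket:
            findings.append(f"{name}: does not deliver to {logging_bucket}")
    return findings


# Hypothetical sample shaped like the describe-trails response.
sample = {
    "trailList": [
        {
            "Name": "org-trail",
            "IsMultiRegionTrail": True,
            "LogFileValidationEnabled": False,
            "S3BucketName": "example-audit-logs",
        }
    ]
}
for finding in cloudtrail_gaps(sample, "example-audit-logs"):
    print(finding)
```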
Phase 2 — Network Hardening
Review every security group in the prod VPC for 0.0.0.0/0 ingress on any port other than the public HTTP/HTTPS ports (80 and 443). The EKS node groups should accept traffic only from the ALB security group on the application port and from the cluster control plane security group on the kubelet port (10250). RDS should accept traffic only from the EKS node security group on 5432 and from nothing else.
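The security group review can be scripted rather than done by eye. This sketch assumes the input is the JSON returned by aws ec2 describe-security-groups, and it treats only ports 80 and 443 as acceptable world-open ports; the function name and the group IDs in the sample are hypothetical.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}
WEB_PORTS = {80, 443}  # assumption: the only ports allowed to be world-open


def open_non_web_ingress(describe_sgs: dict) -> list:
    """List (group_id, protocol, from_port, to_port) for world-open non-web ingress."""
    findings = []
    for group in describe_sgs.get("SecurityGroups", []):
        for perm in group.get("IpPermissions", []):
            cidrs = {r.get("CidrIp") for r in perm.get("IpRanges", [])}
            cidrs |= {r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])}
            if not cidrs & OPEN_CIDRS:
                continue
            # FromPort/ToPort are absent when IpProtocol is "-1" (all traffic).
            from_port = perm.get("FromPort")
            to_port = perm.get("ToPort")
            is_single_web_port = from_port == to_port and from_port in WEB_PORTS
            if not is_single_web_port:
                findings.append(
                    (group["GroupId"], perm.get("IpProtocol"), from_port, to_port)
                )
    return findings
```

Port ranges that merely include 80 or 443 are flagged on purpose, since a range such as 0-65535 also exposes everything else.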
Phase 3 — EKS Cluster Hardening Checklist
Work through this checklist against your cluster. Each item maps to a CIS EKS Benchmark control:
- API server endpoint access: disable the public endpoint or restrict it to your corporate IP range. Enable the private endpoint. Use eksctl utils update-cluster-endpoints.
- Envelope encryption for Secrets: ensure --encryption-config is set with a KMS key. Verify: aws eks describe-cluster --name prod --query 'cluster.encryptionConfig'.
- RBAC audit: list all ClusterRoleBindings to cluster-admin. There should be only the system default and your break-glass account, nothing else.
- Pod Security Standards: set the pod-security.kubernetes.io/enforce: restricted label on all production namespaces.
- Node group IMDSv2: all launch templates must set HttpTokens: required and HttpPutResponseHopLimit: 1 to block pod-level SSRF attacks against the metadata service.
- IRSA over node roles: every service account that needs AWS access must use IAM Roles for Service Accounts, not a blanket node IAM role policy.
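As one illustration of the checklist, the RBAC audit item can be automated. The sketch below assumes the input is the parsed output of kubectl get clusterrolebindings -o json. The allow list is an assumption: system:masters is the group bound by the default cluster-admin binding, and break-glass is a hypothetical user name standing in for your emergency account.

```python
def cluster_admin_subjects(bindings: dict) -> list:
    """Return (binding, subject_kind, subject_name) for every cluster-admin binding."""
    found = []
    for item in bindings.get("items", []):
        role_ref = item.get("roleRef", {})
        if role_ref.get("kind") != "ClusterRole" or role_ref.get("name") != "cluster-admin":
            continue
        # A binding may legally have no subjects at all.
        for subject in item.get("subjects") or []:
            found.append((item["metadata"]["name"], subject.get("kind"), subject.get("name")))
    return found


def unexpected_cluster_admins(bindings: dict, allowed: set) -> list:
    """Drop the subjects that have been explicitly approved."""
    return [s for s in cluster_admin_subjects(bindings) if (s[1], s[2]) not in allowed]


# Assumed allow list: the system default plus a hypothetical break-glass user.
ALLOWED = {("Group", "system:masters"), ("User", "break-glass")}
```

Anything returned by unexpected_cluster_admins(bindings, ALLOWED) is a candidate for removal or for a narrower Role.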
Phase 4 — Secrets Management
Replace all Kubernetes Secret objects that store plaintext credentials with Vault Agent Injector or the Secrets Store CSI Driver backed by AWS Secrets Manager. Rotate every long-lived credential found in the audit. Set a 90-day max TTL on any static secret in Vault.
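To build the list of Secrets to migrate and rotate, an inventory by age is a reasonable starting point. This is a rough sketch, assuming the input is the parsed output of kubectl get secrets -A -o json. It uses creationTimestamp as a proxy for credential age, which is an assumption: an in-place update does not change that field, so treat the result as a list of candidates rather than proof of staleness.

```python
from datetime import datetime, timezone

MAX_AGE_DAYS = 90  # mirrors the 90-day ceiling used for static secrets
SKIPPED_TYPES = {"kubernetes.io/service-account-token"}


def stale_secrets(secret_list: dict, now: datetime) -> list:
    """Return (namespace, name, age_days) for Secrets older than MAX_AGE_DAYS."""
    stale = []
    for item in secret_list.get("items", []):
        if item.get("type") in SKIPPED_TYPES:
            continue
        meta = item["metadata"]
        created = datetime.strptime(meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ")
        age_days = (now - created.replace(tzinfo=timezone.utc)).days
        if age_days > MAX_AGE_DAYS:
            stale.append((meta["namespace"], meta["name"], age_days))
    # Oldest first, so the riskiest candidates are at the top of the report.
    return sorted(stale, key=lambda row: row[2], reverse=True)
```

Call it with now=datetime.now(timezone.utc) in a real run; passing the clock in keeps the function easy to test.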
Phase 5 — Supply Chain & Image Security
Every image running in production must be signed with Cosign, and its signature must be verified at admission time by an OPA Gatekeeper policy. Gatekeeper cannot fetch and verify signatures by itself, so this requires an external data provider such as Ratify. Your ECR lifecycle policy must delete untagged images after 14 days, and your GitHub Actions workflow must run Trivy on every PR and block merges on CRITICAL vulnerabilities.
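The 14-day rule for untagged images translates directly into an ECR lifecycle policy document. The sketch below generates that JSON; the function name is hypothetical and the rule priority of 1 assumes the repository has no other lifecycle rules.

```python
import json


def untagged_expiry_policy(days: int = 14) -> str:
    """Build an ECR lifecycle policy document that expires untagged images."""
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Expire untagged images after {days} days",
                "selection": {
                    "tagStatus": "untagged",
                    "countType": "sinceImagePushed",
                    "countUnit": "days",
                    "countNumber": days,
                },
                "action": {"type": "expire"},
            }
        ]
    }
    return json.dumps(policy, indent=2)


print(untagged_expiry_policy())
```

Save the output to a file and apply it with aws ecr put-lifecycle-policy --repository-name <repo> --lifecycle-policy-text file://policy.json, or carry the same document in Terraform so the policy stays under version control.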
Phase 6 — Runtime Detection & Continuous Compliance
Falco should already be running from the Runtime Security lesson. Validate that at least these rules are active: shell spawned in container, write to /etc in container, unexpected outbound connection, and privilege escalation via sudo. Route Falco alerts to PagerDuty through Falcosidekick. Set up a weekly Security Hub report emailed to the engineering leads. Schedule a quarterly CIS benchmark scan by running kube-bench as a CronJob.
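Checking that the required rules are active can be reduced to a set difference. This sketch assumes you have already parsed a Falco rules file into a list of dicts (for example with a YAML loader), where rule entries carry a rule key and an optional enabled key. The rule names in the sample are illustrative only; exact names vary between Falco ruleset versions, so match them against what your deployment actually loads.

```python
def inactive_rules(loaded_rules: list, required: set) -> set:
    """Return required rule names that are missing or explicitly disabled."""
    active = {
        entry["rule"]
        for entry in loaded_rules
        # Macros and lists share the file but have no "rule" key.
        # A rule without an "enabled" key is treated as enabled.
        if "rule" in entry and entry.get("enabled", True)
    }
    return required - active


# Illustrative data only; substitute the names from your own ruleset.
loaded = [
    {"rule": "Terminal shell in container", "priority": "NOTICE"},
    {"macro": "container", "condition": "container.id != host"},
    {"rule": "Write below etc", "enabled": False},
]
required = {"Terminal shell in container", "Write below etc"}
print(sorted(inactive_rules(loaded, required)))
```

A non-empty result is worth failing a pipeline on, because a rule that was silently disabled during tuning is easy to miss. Note that this only inspects one parsed file and does not account for later override files.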
Proving It Holds: Continuous Control Validation
Hardening applied once decays. Engineers rotate, Terraform state drifts, and new workloads are deployed without going through the checklist. Use the following mechanisms to enforce that controls stay in place:
- OPA Gatekeeper constraints for every pod security and image signing rule — rejected at admission, not discovered post-deployment.
- AWS Config Rules for every account-level control — auto-remediation Lambda functions for low-risk rules (e.g., enable CloudTrail), human-review alerts for high-risk ones (e.g., public S3 bucket).
- Trivy Operator running inside the cluster, continuously scanning running workloads and writing results to VulnerabilityReport CRDs. Alert in Grafana when a CRITICAL CVE appears in a running image.
- Terraform Sentinel or Checkov in every PR: block any infrastructure change that opens a security group to 0.0.0.0/0, disables encryption, or removes an audit trail.
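To make the PR gate in the last item concrete, here is a minimal sketch of the kind of check that Sentinel or Checkov performs, written against the JSON produced by terraform show -json on a saved plan. It is not a replacement for those tools. It covers only the aws_security_group and aws_security_group_rule resource types, and the function name is hypothetical.

```python
WORLD = {"0.0.0.0/0", "::/0"}


def world_open_sg_changes(plan: dict) -> list:
    """Return addresses of planned resources that allow world-open ingress."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        # "after" is null for resources that the plan destroys.
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_security_group":
            rules = after.get("ingress") or []
        elif rc.get("type") == "aws_security_group_rule" and after.get("type") == "ingress":
            rules = [after]
        else:
            continue
        for rule in rules:
            cidrs = set(rule.get("cidr_blocks") or [])
            cidrs |= set(rule.get("ipv6_cidr_blocks") or [])
            if cidrs & WORLD:
                flagged.append(rc["address"])
                break  # one finding per resource is enough
    return flagged
```

In CI, a non-empty result would fail the job. An intentionally public resource such as the ALB security group would need an explicit, reviewed exception list.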
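Do not switch Pod Security Admission from warn to enforce without first running kubectl label namespace production pod-security.kubernetes.io/warn=restricted for at least one full deploy cycle, reading every warning it produces, and fixing all non-compliant pod specs. In warn mode the warnings are returned to the client (kubectl or CI output); also set pod-security.kubernetes.io/audit=restricted if you want violations recorded in the API server audit log. Flipping to enforce on a live namespace with non-compliant Deployments means the next rollout cannot create pods, which is an outage during business hours.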
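Pod specs can also be pre-checked offline before the restricted profile is enforced. The sketch below tests only four of the restricted profile's requirements and ignores ephemeral containers, so treat it as a quick filter and not as a substitute for the admission controller. The input is assumed to be the spec.template.spec section of a Deployment, and the function name is hypothetical.

```python
def restricted_violations(pod_spec: dict) -> list:
    """Check a pod spec against four restricted-profile requirements."""
    problems = []
    pod_ctx = pod_spec.get("securityContext") or {}
    containers = (pod_spec.get("containers") or []) + (pod_spec.get("initContainers") or [])
    for c in containers:
        ctx = c.get("securityContext") or {}
        name = c.get("name", "<unnamed>")
        if ctx.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in ((ctx.get("capabilities") or {}).get("drop") or []):
            problems.append(f"{name}: capabilities.drop must include ALL")
        # Container-level settings win; otherwise fall back to the pod level.
        non_root = ctx.get("runAsNonRoot")
        if non_root is None:
            non_root = pod_ctx.get("runAsNonRoot")
        if non_root is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        seccomp = (ctx.get("seccompProfile") or pod_ctx.get("seccompProfile") or {}).get("type")
        if seccomp not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
    return problems
```

For an authoritative answer on workloads that are already running, a server-side dry run such as kubectl label --dry-run=server --overwrite namespace production pod-security.kubernetes.io/enforce=restricted reports the pods that would violate the profile without changing anything.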
Handoff Artifacts
A hardening project is not complete until it produces durable artifacts that survive team turnover. Deliver: a hardening runbook (this checklist, with commands), a threat model document (STRIDE applied to the architecture diagram), an evidence bundle for your compliance team (Security Hub exports, kube-bench HTML report, Trivy SBOM for each image), and a backlog of findings with owners and due dates tracked in your issue tracker. Schedule a re-run for six months from now.