Zero Trust Architecture
The perimeter model is dead. For decades, security engineering assumed that anything inside the corporate network was safe and anything outside was hostile. That assumption failed the moment employees started carrying laptops to coffee shops, the moment SaaS apps began holding more crown jewels than the data centre, and the moment a single phished credential gave an attacker the keys to the entire internal network. Google learned this lesson firsthand in Operation Aurora, which it disclosed in January 2010: attackers who breached a single Windows workstation moved laterally across the corporate network to reach source repositories and user accounts. Once the perimeter was crossed, the internal network offered little resistance.
Google's response was BeyondCorp, a complete rearchitecture of how employees access internal resources, published openly starting in 2014 and deployed at Google scale by 2017. The core insight is simple: network location must never be treated as a proxy for trust. Every access decision must be based on identity, device state, and context, regardless of whether the request originates from a home office, an airport, or a data centre rack sitting three feet from the server it is calling.
This lesson covers the three pillars that make Zero Trust concrete: identity-aware access, mutual TLS everywhere, and the BeyondCorp access-proxy pattern. You will leave with working configs and a clear mental model of how top-tier engineering organisations implement these in production Kubernetes and cloud environments.
Pillar 1: Identity-Aware Access
In a Zero Trust world, identity is the control plane. Every principal — human user, service account, Kubernetes pod, CI runner — must carry a verified, short-lived credential. Access decisions happen at the resource boundary, not the network edge.
The practical implications for a production Kubernetes environment:
- Short-lived credentials everywhere. AWS IAM Roles for Service Accounts (IRSA), GKE Workload Identity, and Azure AD Workload Identity all allow pods to obtain time-limited cloud credentials via OIDC token exchange — replacing the disastrous practice of mounting long-lived access keys as Secrets.
- Per-workload identity at the pod level. Kubernetes ServiceAccounts are the unit of identity. Each ServiceAccount should be bound to exactly the permissions it needs, nothing more. On clusters with RBAC enabled, the default ServiceAccount in each namespace has no role bindings of its own. Keep it that way and create dedicated accounts per workload.
- Context-aware policy. An identity token is not enough on its own. At Google, BeyondCorp also evaluates device posture (is the device managed? is the OS up to date?), request time, geolocation, and risk signals. In cloud-native environments, Open Policy Agent (OPA) / Gatekeeper allows you to encode this logic as Rego policies evaluated on every admission or authorisation request.
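To make "verified, short-lived, per-workload credential" concrete, the sketch below reads the claims of a ServiceAccount token and reports which workload it names and how long it remains valid. It is an illustration under stated assumptions: the helper names are hypothetical, and the signature is deliberately not verified, so this is suitable for inspection and debugging only, never for authorisation.

```python
import base64
import json
import time
from typing import Optional, Tuple


def decode_claims(jwt: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (inspection only)."""
    payload_b64 = jwt.split(".")[1]
    # JWTs strip base64 padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def workload_identity(claims: dict) -> Tuple[str, str]:
    """Split a subject of the form system:serviceaccount:<namespace>:<name>."""
    parts = claims["sub"].split(":")
    if len(parts) != 4 or parts[:2] != ["system", "serviceaccount"]:
        raise ValueError("not a Kubernetes ServiceAccount subject: %r" % claims["sub"])
    return parts[2], parts[3]


def seconds_remaining(claims: dict, now: Optional[float] = None) -> float:
    """Time left before the exp claim; negative once the token has expired."""
    current = time.time() if now is None else now
    return claims["exp"] - current
```

Run against a projected token mounted in a pod, this gives a quick answer to "who does this pod claim to be, and for how much longer?" without calling any cloud API.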
Enabling IRSA on an EKS cluster requires annotating the ServiceAccount with the IAM role's ARN and creating an IAM role whose trust policy names the cluster's OIDC provider.
The StringEquals condition on the OIDC subject in that trust policy is critical: it scopes the role assumption to exactly one ServiceAccount in one namespace. A wildcard match here could allow any pod in the cluster to assume the role.
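A minimal sketch of the two pieces described above, built as plain Python dictionaries and printed as JSON. The account ID, OIDC provider ID, namespace, ServiceAccount name, and role name are placeholders, and the function names are hypothetical; only the document shapes (the sts:AssumeRoleWithWebIdentity action, the sub and aud condition keys, and the eks.amazonaws.com/role-arn annotation) follow the IRSA conventions.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy that lets exactly one ServiceAccount assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # StringEquals (not StringLike) pins one namespace and one name.
                "StringEquals": {
                    f"{oidc_provider}:sub":
                        f"system:serviceaccount:{namespace}:{service_account}",
                    f"{oidc_provider}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }


def service_account_manifest(name: str, namespace: str, role_arn: str) -> dict:
    """ServiceAccount annotated with the IAM role it is allowed to assume."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


if __name__ == "__main__":
    # Placeholder values: substitute your own account and provider ID.
    provider = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
    print(json.dumps(irsa_trust_policy("111122223333", provider, "payments", "api"), indent=2))
    print(json.dumps(service_account_manifest(
        "api", "payments", "arn:aws:iam::111122223333:role/payments-api"), indent=2))
```

Generating the policy from the namespace and name, rather than writing it by hand, makes it harder to ship a trust policy whose subject silently fails to match the ServiceAccount it was meant for.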
Pillar 2: Mutual TLS Everywhere
Even with identity tokens in place, network traffic between services is still exposed if it runs over plain TCP or one-way TLS. Plain TCP can be read or altered in transit, and one-way TLS authenticates only the server, so the server has no cryptographic proof of which workload is calling. Mutual TLS (mTLS) requires both sides of every connection to present a valid certificate issued by a trusted Certificate Authority. This means every service is cryptographically authenticated before a single byte of application data is exchanged.
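What "both sides present a certificate" means in code can be sketched with Python's standard ssl module. The function names and the certificate file paths are assumptions for illustration; the point is the single setting that turns one-way TLS into mutual TLS on the server side.

```python
import ssl


def require_client_certs(ctx: ssl.SSLContext) -> ssl.SSLContext:
    """Turn a one-way TLS server context into a mutual TLS one."""
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Without this line the server never asks the caller for a certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def mtls_server_context(cert_file: str, key_file: str, ca_file: str) -> ssl.SSLContext:
    """Server side: present our certificate and demand one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)  # who we are
    ctx.load_verify_locations(cafile=ca_file)                  # whom we trust
    return require_client_certs(ctx)


def mtls_client_context(cert_file: str, key_file: str, ca_file: str) -> ssl.SSLContext:
    """Client side: verify the server (the default) and present our own certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

Every service would need this wiring, plus issuance and rotation of the files it loads, which is the operational burden the next paragraph addresses.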
In a Kubernetes cluster, implementing mTLS manually is impractical — it requires every application team to manage certificates, rotate them, and handle expired certs correctly. The right answer is a service mesh: Istio, Linkerd, or Cilium's eBPF-based network policies. The mesh runs a sidecar proxy (or, in Cilium's case, a kernel-level eBPF program) alongside each pod. The proxy intercepts all inbound and outbound traffic and handles the mTLS handshake transparently.
Istio can enforce mTLS mesh-wide with a single PeerAuthentication resource in the root namespace set to STRICT mode. An AuthorizationPolicy then restricts which workload identities may call a given service.
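The two Istio resources can be sketched as follows, expressed as Python dictionaries that serialise to JSON accepted by kubectl apply -f. The namespaces, the app label, and the caller's ServiceAccount are hypothetical; istio-system is assumed to be the mesh root namespace, and security.istio.io/v1beta1 is used because it is served by a wide range of Istio releases.

```python
import json
from typing import Iterable


def mesh_wide_strict_mtls(root_namespace: str = "istio-system") -> dict:
    """PeerAuthentication named 'default' in the root namespace applies mesh-wide."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": root_namespace},
        "spec": {"mtls": {"mode": "STRICT"}},  # reject plaintext between sidecars
    }


def allow_only(namespace: str, app_label: str, caller_principals: Iterable[str]) -> dict:
    """AuthorizationPolicy admitting only the listed mTLS identities to one workload."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"{app_label}-allow-callers", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": app_label}},
            "action": "ALLOW",
            "rules": [{"from": [{"source": {"principals": list(caller_principals)}}]}],
        },
    }


if __name__ == "__main__":
    # Hypothetical workloads: 'frontend' may call 'payments-api', nothing else may.
    print(json.dumps(mesh_wide_strict_mtls(), indent=2))
    print(json.dumps(allow_only(
        "payments", "payments-api", ["cluster.local/ns/frontend/sa/frontend"]), indent=2))
```

The principal string is the caller's mesh identity, derived from its namespace and ServiceAccount, which is why the per-workload ServiceAccounts from Pillar 1 matter here.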
Pillar 3: The BeyondCorp Access-Proxy Pattern
BeyondCorp flips the traditional VPN model on its head. Instead of a VPN gateway that checks "is this device on the right network?" and then grants broad access, BeyondCorp deploys an access proxy in front of every internal application. The proxy makes an authorisation decision per request, based on three factors:
- Who are you? — verified identity from an IdP (Google Workspace, Okta, Azure AD). No anonymous access, ever.
- What device are you using? — device inventory service checks: is this a managed device? Does it have the latest OS patches? Is disk encryption enabled? Does the MDM show any recent security alerts?
- What are you asking for? — the specific resource and action, matched against a policy that defines which groups and device trust levels can access it.
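The three questions above can be sketched as a single per-request decision function. This is an illustrative model rather than any vendor's implementation: the trust tiers, the Device fields, and the policy shape are all assumptions chosen to mirror the checks listed above. Note that network location is not an input at all.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, FrozenSet, Set


class DeviceTrust(IntEnum):
    UNTRUSTED = 0
    BASIC = 1
    FULL = 2


@dataclass(frozen=True)
class Device:
    managed: bool
    os_patched: bool
    disk_encrypted: bool
    open_alerts: int = 0


@dataclass(frozen=True)
class Rule:
    groups: FrozenSet[str]    # who may access the resource
    actions: FrozenSet[str]   # what they may do
    min_trust: DeviceTrust    # from what kind of device


def device_trust(device: Device) -> DeviceTrust:
    """Map inventory signals to a trust tier (hypothetical tiering)."""
    if not device.managed or device.open_alerts > 0:
        return DeviceTrust.UNTRUSTED
    if device.os_patched and device.disk_encrypted:
        return DeviceTrust.FULL
    return DeviceTrust.BASIC


def authorize(user_groups: Set[str], device: Device, resource: str,
              action: str, policy: Dict[str, Rule]) -> bool:
    """Evaluate one request; anything without a matching rule is denied."""
    rule = policy.get(resource)
    if rule is None:
        return False  # default deny
    return (
        bool(rule.groups & user_groups)              # who are you?
        and device_trust(device) >= rule.min_trust   # what device are you using?
        and action in rule.actions                   # what are you asking for?
    )
```

With a policy such as {"payroll": Rule(frozenset({"finance"}), frozenset({"read"}), DeviceTrust.FULL)}, a finance user on a managed but unpatched laptop is refused, while the same user on a fully patched, encrypted device is allowed.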
Commercial implementations of this pattern include Google Cloud IAP (Identity-Aware Proxy), Cloudflare Access, and AWS Verified Access. All three terminate external TLS, validate identity via OIDC/SAML, can factor in device posture, and forward requests to your backend only if policy allows it. Each forwarded request carries a signed JWT header, so the backend can verify the proxy's assertion instead of reimplementing authentication.
Production Failure Modes
Zero Trust architectures introduce a new class of production incidents that on-premises teams have never encountered. Know them before they hit you on call:
- Certificate expiry cascade. If the mesh CA cert or a SPIRE root cert expires unnoticed, every service-to-service call in the cluster fails with TLS handshake errors simultaneously. Monitor cert expiry as a first-class SLI. Alert at 30 days, page at 7 days.
- OIDC issuer unreachable. If the EKS OIDC discovery endpoint is temporarily unreachable, pods trying to obtain IRSA credentials fail. Build retry logic with exponential backoff and use the eks.amazonaws.com/token-expiration annotation to extend token lifetimes to tolerable levels.
- AuthorizationPolicy deny-by-default locking out legitimate traffic. When no AuthorizationPolicy selects a workload, Istio allows every request to it. The moment any ALLOW policy selects that workload, the default flips: requests that match no ALLOW rule are denied. Teams that write a policy for one caller and forget the other legitimate ones (ingress gateways, metrics scrapers, health checks that pass through the sidecar) will see that traffic rejected on the next rollout, typically as 403 "RBAC: access denied" responses that cascade into failures upstream. Note that health probes are sent by the kubelet on each node, not by the Kubernetes API server.
- Clock skew breaking JWT validation. OIDC tokens carry iat and exp claims. A node whose clock drifts by more than the validator's tolerance (often around five minutes) will cause token validation to fail. Enforce NTP synchronisation (chrony or timesyncd) on every node as a non-negotiable cluster hygiene requirement.
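Two of these failure modes reduce to simple time comparisons, sketched below. The function names are hypothetical. The 30-day and 7-day thresholds come from the certificate-expiry guidance above, and the 300-second leeway is an assumption that mirrors the five-minute skew figure; real validators choose their own tolerance.

```python
from datetime import datetime, timedelta, timezone


def expiry_severity(not_after: datetime, now: datetime,
                    alert_days: int = 30, page_days: int = 7) -> str:
    """Classify a certificate by time left: 'ok', 'alert', or 'page'."""
    remaining = not_after - now
    if remaining <= timedelta(days=page_days):
        return "page"   # includes certificates that have already expired
    if remaining <= timedelta(days=alert_days):
        return "alert"
    return "ok"


def within_validity(iat: float, exp: float, now: float,
                    leeway_seconds: float = 300) -> bool:
    """Check iat/exp against the local clock, tolerating bounded skew."""
    return (iat - leeway_seconds) <= now <= (exp + leeway_seconds)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(expiry_severity(now + timedelta(days=20), now))
    # A clock running 10 minutes slow sees a fresh token as issued in the future.
    print(within_validity(iat=1_000_000, exp=1_003_600, now=1_000_000 - 600))
```

Feeding expiry_severity from the notAfter field of every CA and leaf certificate in the mesh turns the expiry cascade from a surprise outage into a routine ticket.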