Service Mesh: Istio & Linkerd

mTLS & Mesh Security

Zero-trust networking inside a Kubernetes cluster is not a luxury — it is a hard production requirement at any organization that has passed a SOC 2 or PCI audit, or that runs multi-tenant workloads on shared infrastructure. A service mesh enforces zero-trust transparently: every connection is mutually authenticated and encrypted, authorization is declared as policy, and none of this requires a single line of application code to change. This lesson unpacks how Istio implements that guarantee and where it breaks down under production pressure.

Why mTLS at the Mesh Layer?

Without a mesh, east-west traffic inside a cluster is typically plaintext. A compromised pod can sniff traffic it can reach, forge source IPs, and impersonate other workloads. Kubernetes NetworkPolicy can block Layer-3/4 flows, but it cannot verify workload identity. mTLS solves the identity problem: both sides present X.509 certificates, the connection is rejected if either side cannot prove its identity, and all bytes are encrypted with TLS (Istio sidecars require TLS 1.2 or later by default and normally negotiate TLS 1.3 with each other).

Istio encodes workload identity in the SPIFFE standard: each pod gets a certificate with a Subject Alternative Name (SAN) of the form spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>. Istiod acts as the SPIFFE-compliant CA, minting short-lived certificates (default 24-hour TTL, configurable down to minutes) that are automatically rotated by the sidecar before expiry.

Istiod issues SPIFFE certificates to each sidecar; all inter-pod traffic is encrypted and mutually authenticated automatically.
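
To make the identity format concrete, here is a minimal sketch of how a workload's ServiceAccount maps to its SPIFFE ID. The names (checkout, orders) mirror the example used later in this lesson, and the image reference is a placeholder; the sketch assumes the default trust domain cluster.local.

```yaml
# Hypothetical workload; only namespace and serviceAccountName feed the identity.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: checkout
  namespace: orders
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: orders
spec:
  replicas: 1
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      serviceAccountName: checkout   # the identity comes from this, not from pod labels
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0   # placeholder image
# Resulting SAN, assuming trust domain cluster.local:
#   spiffe://cluster.local/ns/orders/sa/checkout
```

Because identity is tied to the ServiceAccount, two Deployments that share a ServiceAccount are indistinguishable to authorization policy. Giving each service its own ServiceAccount keeps principals meaningful.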

PeerAuthentication: Enforcing mTLS

Istio ships in PERMISSIVE mode by default — it accepts both plaintext and mTLS traffic so you can roll out incrementally. For production, you must flip to STRICT. The PeerAuthentication resource controls this at mesh, namespace, or per-workload granularity.

# Mesh-wide strict mTLS — apply once in istio-system
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Namespace-scoped override — useful during incremental migration
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: legacy-ns
spec:
  mtls:
    mode: PERMISSIVE
---
# Per-workload: legacy batch job still on plaintext
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: legacy-batch
  namespace: jobs
spec:
  selector:
    matchLabels:
      app: batch-importer
  mtls:
    mode: PERMISSIVE
  portLevelMtls:
    "8080":
      mode: DISABLE   # specific port, e.g., health endpoint polled by an uninjected node agent

The PERMISSIVE trap: many teams enable STRICT at the mesh level but leave older namespace-level or workload-level PERMISSIVE overrides in place. Run kubectl get peerauthentication -A regularly, and inspect individual workloads with istioctl x describe pod <pod>. A namespace-scoped PERMISSIVE silently overrides the mesh default, and you will not notice until an audit or breach.

Authorization Policies: Layer-7 Access Control

AuthorizationPolicy is Istio's firewall at the application layer. Unlike NetworkPolicy (L3/L4), it can match on HTTP method, path, headers, JWT claims, and the SPIFFE principal of the caller. The evaluation order is: CUSTOM policies first (if any are defined), then DENY rules, then ALLOW rules. A request is denied if any DENY matches, or if no ALLOW matches when at least one ALLOW policy applies to the workload. If no ALLOW policy applies at all, the request is allowed.

# Deny all traffic into the payments namespace by default, then selectively open
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: payments
spec: {}    # empty spec = deny everything
---
# Allow the checkout service (in orders namespace) to POST /charge and /refund
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-checkout
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payment-service
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - "cluster.local/ns/orders/sa/checkout"
      to:
        - operation:
            methods: ["POST"]
            paths: ["/charge", "/refund"]
---
# Deny any call using a deprecated header — evaluated before ALLOW rules
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-legacy-clients
  namespace: payments
spec:
  action: DENY
  rules:
    - when:
        - key: request.headers[x-legacy-client]
          values: ["true"]

Least-privilege baseline: deploy a blanket deny-all policy in every namespace on day one, then add ALLOW policies incrementally as services prove they need communication. This mirrors the zero-trust model Google describes for its own infrastructure (BeyondProd). It forces developers to declare dependencies explicitly, which also improves your service dependency map.

JWT Authentication with RequestAuthentication

For north-south traffic (ingress), combine RequestAuthentication (validates the JWT signature) with AuthorizationPolicy (enforces which claims are allowed). RequestAuthentication does not reject requests without a token — it only rejects requests with an invalid token. The DENY or ALLOW logic in AuthorizationPolicy is what enforces presence.

apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: jwt-auth
  namespace: api-gateway
spec:
  selector:
    matchLabels:
      app: gateway
  jwtRules:
    - issuer: "https://auth.example.com"
      jwksUri: "https://auth.example.com/.well-known/jwks.json"
      audiences:
        - "api.example.com"
      forwardOriginalToken: true   # upstream services see the raw JWT
---
# Require a valid token AND restrict to 'reader' or 'admin' role
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-jwt
  namespace: api-gateway
spec:
  selector:
    matchLabels:
      app: gateway
  action: ALLOW
  rules:
    - from:
        - source:
            requestPrincipals: ["https://auth.example.com/*"]
      when:
        - key: request.auth.claims[role]
          values: ["reader", "admin"]
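
Because RequestAuthentication alone lets token-less requests through, some teams prefer to state the "token required" rule explicitly rather than rely on the ALLOW policy above. A minimal sketch of that alternative, following the pattern in Istio's JWT documentation (the policy name is hypothetical; the selector reuses the gateway labels from the example above):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-missing-jwt        # hypothetical name
  namespace: api-gateway
spec:
  selector:
    matchLabels:
      app: gateway
  action: DENY
  rules:
    - from:
        - source:
            # Matches requests that carry no validated JWT principal at all
            notRequestPrincipals: ["*"]
```

Since DENY is evaluated before ALLOW, this rejects unauthenticated requests even if a later, broader ALLOW policy would have admitted them.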

Certificate Rotation and CA Pluggability

Istiod's built-in self-signed CA is fine for single-cluster development, but production clusters at scale usually anchor trust in an external CA or corporate PKI. Options:

Intermediate CA signing: give Istiod a corporate-signed intermediate cert; it issues workload certs chained to your PKI. Provide it as the cacerts secret in istio-system (istio-ca-secret is the secret Istiod generates for its self-signed CA).
cert-manager integration: use the istio-csr agent (a cert-manager project) to forward workload CSRs from the Istio proxies to cert-manager, which can be backed by Vault, AWS Private CA, or another issuer that cert-manager supports.
SPIRE: for multi-cluster or multi-platform identity, replace Istiod's CA entirely with a SPIRE server federation. All trust domains are managed centrally; workloads on VMs, bare metal, and Kubernetes all get consistent SPIFFE IDs.
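
For the intermediate CA option, Istio's plug-in CA documentation describes a secret named cacerts with four specific keys. The sketch below shows its shape only; the PEM bodies are placeholders, and in practice the secret is usually created from files with kubectl create secret generic rather than committed as a manifest.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cacerts
  namespace: istio-system
type: Opaque
stringData:
  ca-cert.pem: |
    <PEM of the intermediate CA certificate>
  ca-key.pem: |
    <PEM of the intermediate CA private key>
  root-cert.pem: |
    <PEM of the root certificate>
  cert-chain.pem: |
    <PEM chain: intermediate certificate followed by the root>
```

Istiod reads this secret at startup, so it is typically created before installing Istio. Swapping the CA on a live mesh needs a planned root rotation.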

Key rotation math: with the default 24-hour TTL, a stolen certificate and private key remain usable for at most 24 hours, until the certificate expires. Teams running PCI-DSS workloads often drop this to 1 hour using meshConfig.defaultConfig.proxyMetadata.SECRET_TTL. Shorter TTLs increase Istiod load: at a 1-hour TTL with 1,000 pods, Istiod signs at least 1,000 / 3,600 ≈ 0.28 certificates per second (somewhat more in practice, because proxies renew before expiry), which is well within its capacity. At 100,000 pods the same arithmetic gives roughly 28 per second, so plan for Istiod HA with resource tuning.
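
As a sketch of where that setting lives, assuming Istio is installed through an IstioOperator manifest (Helm users would set the same meshConfig keys in their values file):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        # Certificate lifetime each proxy requests from the CA (Go duration string)
        SECRET_TTL: "1h"
```

The new TTL applies to proxies as they restart, so existing pods keep their current certificate lifetime until they are rolled.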

Production Failure Modes

The most common mTLS incidents in production:

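Mixed injected and uninjected pods: PeerAuthentication governs the receiving side, so an uninjected client calling a STRICT workload sends plaintext that the server sidecar rejects, usually seen as a connection reset. In the other direction, an injected client whose DestinationRule forces tls.mode: ISTIO_MUTUAL toward a target with no sidecar fails the TLS handshake (with auto mTLS and no such rule, the client falls back to plaintext). Fix: inject the missing sidecar, or add a per-workload PERMISSIVE override on the receiving workload, or use a DestinationRule with tls.mode: DISABLE for the specific host that has no sidecar.
Node-level health probes: the kubelet calls liveness/readiness probes directly, without a sidecar or a mesh certificate. Istio handles HTTP probes by rewriting them at injection time so they are answered through the sidecar agent rather than the mTLS-protected application port (the rewriteAppHTTPProbe injection setting, overridable per pod with the sidecar.istio.io/rewriteAppHTTPProbers annotation). Confirm it is enabled: kubectl get configmap istio-sidecar-injector -n istio-system -o yaml | grep rewriteAppHTTPProbe.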
Policy not applying: AuthorizationPolicy selectors use pod labels; a typo silently makes the policy match nothing. Always verify with istioctl x authz check <pod-name> -n <namespace>.
Clock skew breaking JWT validation: JWT nbf/exp checks require synchronized clocks. NTP drift > 60 seconds causes spurious 401s. Monitor node clock skew; Kubernetes already recommends NTP but does not enforce it.
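
For the first failure mode, the DestinationRule escape hatch might look like the sketch below. The rule name, namespace, and host are hypothetical and stand for a service whose pods have no sidecar.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: legacy-db-plaintext                 # hypothetical name
  namespace: orders                         # namespace of the calling workloads
spec:
  host: legacy-db.data.svc.cluster.local    # hypothetical uninjected service
  trafficPolicy:
    tls:
      mode: DISABLE   # client sidecars send plaintext to this host only
```

Scope the rule to a single host rather than a wildcard, and treat it as temporary: traffic to that host is unencrypted and unauthenticated for as long as the rule exists.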

STRICT mTLS and legacy monitoring agents: Prometheus, Datadog agents, and other node-level scrapers often run outside the mesh. When you flip a namespace to STRICT, their scrape calls will start failing with TLS errors. Exempt them with a targeted PERMISSIVE PeerAuthentication on scrape ports, or inject them into the mesh first.
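
A targeted scrape-port exemption could be sketched as follows. The policy name and port 9090 are assumptions; the selector reuses the payment-service labels from the earlier examples. Note that portLevelMtls only takes effect when a selector is set.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: metrics-port-permissive   # hypothetical name
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payment-service
  mtls:
    mode: STRICT                  # everything else on the workload stays strict
  portLevelMtls:
    9090:                         # assumed metrics port on the workload
      mode: PERMISSIVE
```

This keeps application ports under STRICT while letting an out-of-mesh scraper reach the metrics port in plaintext.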

Verifying the Security Posture

Trust but verify — the mesh generates the metadata to audit its own security posture:

# Check what mTLS mode is active for a pod
istioctl x describe pod <pod-name> -n <namespace>

# Show all authorization policies in effect for a workload
istioctl x authz check <pod-name> -n <namespace>

# Verify a connection is truly mTLS (needs access logging enabled with a format that includes %DOWNSTREAM_TLS_VERSION%; the default format omits it)
kubectl logs <pod-name> -c istio-proxy -n <namespace> | grep TLSv1

# Kiali security graph — shows mTLS lock icons per edge
kubectl port-forward svc/kiali 20001:20001 -n istio-system
# Open http://localhost:20001 → Graph → Display → Security

# List all PeerAuthentication policies across the cluster
kubectl get peerauthentication -A

# List all AuthorizationPolicies
kubectl get authorizationpolicy -A

A mature mesh security posture at production scale means: mesh-wide STRICT mTLS, deny-all as the default in every namespace, ALLOW policies version-controlled in Git alongside the application manifests, certificate rotation under four hours, and Kiali (or a custom Prometheus query on istio_requests_total{connection_security_policy="mutual_tls"}) confirming >99.9% of intra-cluster calls are mTLS-encrypted at all times.
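
One way to turn the 99.9% target into an alert is a Prometheus rule along these lines. This is a sketch in the plain Prometheus rule-file format: the group name, alert name, 5-minute window, and severity label are assumptions. It filters on reporter="destination" because that is the side where connection_security_policy is populated.

```yaml
groups:
  - name: mesh-mtls               # hypothetical group name
    rules:
      - alert: MeshPlaintextTraffic
        # Share of requests that arrived over mutual TLS, as seen by the server-side proxy
        expr: |
          sum(rate(istio_requests_total{reporter="destination",connection_security_policy="mutual_tls"}[5m]))
            /
          sum(rate(istio_requests_total{reporter="destination"}[5m]))
            < 0.999
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 99.9% of in-mesh requests are using mTLS"
```

Traffic that never passes through a server-side sidecar does not appear in this metric at all, so the rule complements rather than replaces the PeerAuthentication audit.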