Secrets Management & PKI

PKI Fundamentals

18 min Lesson 7 of 28


Every TLS connection you've ever made — to GitHub, to your cloud console, between microservices — relies on a trust infrastructure called Public Key Infrastructure (PKI). PKI is not just a certificate file sitting on a server; it is a chain of cryptographic proof linking your certificate all the way back to a root authority that the client already trusts. Senior DevOps engineers must understand this chain deeply, because misconfigurations at any link cause cascading production failures: browsers refuse connections, mutual-TLS (mTLS) authentication breaks, and automated rotation fails silently.

Certificate Authorities and the Chain of Trust

A Certificate Authority (CA) is an entity whose job is to sign certificates — cryptographically binding a public key to an identity. There are three tiers in a production PKI:

  • Root CA — the ultimate trust anchor. Its certificate is self-signed. OS and browser vendors ship a curated list of trusted root CA certificates (the trust store). The root CA private key is kept completely offline (often in an HSM in a physically secured facility). It signs only Intermediate CA certificates, nothing else.
  • Intermediate (Subordinate) CA — an online CA whose certificate was signed by the root. All day-to-day certificate issuance happens here. If an intermediate is compromised, it can be revoked without touching the root.
  • Leaf (End-Entity) Certificate — the certificate installed on a server, device, or user. Signed by the intermediate. This is what openssl s_client sees when it connects to your service.

When a client (browser, curl, gRPC client) validates a certificate, it walks this chain: leaf → intermediate → root. If it can build an unbroken chain to a root it already trusts, the handshake succeeds. This is called chain building or path validation. The client checks three things at each step: the signature is valid, the certificate is not expired, and the certificate has not been revoked.

[Figure: PKI Chain of Trust. The Root CA (self-signed, offline HSM) signs the Intermediate CA (online, signs leaf certs), which signs three leaf certificates: api.example.com (leaf cert, 90 days), svc-mesh.internal (mTLS leaf, 24 hours), and *.internal.corp (wildcard leaf, 1 year). The client trust store contains the Root CA cert.]
The three-tier PKI chain of trust: Root CA (offline) signs Intermediate CA (online), which signs leaf certificates used by services.
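
The leaf → intermediate → root walk can be made visible by printing the subject and issuer of every certificate in a PEM bundle. The following is a minimal sketch, assuming an openssl binary on the PATH; list_chain is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# list_chain BUNDLE: print the subject and issuer of each certificate
# in a PEM bundle (hypothetical helper for illustration).
# Wrapping the bundle in a PKCS#7 container lets openssl print every
# certificate in the file, not just the first one.
list_chain() {
    openssl crl2pkcs7 -nocrl -certfile "$1" \
        | openssl pkcs7 -print_certs -noout
}
```

Called as list_chain fullchain.pem (a placeholder file name), it prints one subject/issuer pair per certificate. In a well-formed bundle the issuer of each certificate equals the subject of the next one, and a self-signed root shows the same value for both.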

Subject Alternative Names (SANs)

Originally, the Common Name (CN) field was the only way to specify which hostname a certificate covered. Modern browsers ignore the CN for hostname matching, and RFC 6125 deprecates it in favor of Subject Alternative Names (SANs). A SAN extension can hold:

  • DNS: entries for exact hostnames (api.example.com) or wildcards (*.example.com)
  • IP: entries for inter-pod mTLS where pod IPs are the identity
  • URI: entries used by SPIFFE (Secure Production Identity Framework for Everyone), e.g. spiffe://cluster.local/ns/default/sa/payment-svc
  • Email: entries for S/MIME client certificates

SPIFFE URIs in SANs are how Istio and Envoy implement zero-trust mTLS. Every pod gets a certificate with a SPIFFE URI encoding its Kubernetes namespace and service account. Sidecars verify these URIs on every connection, so service identity does not depend on passwords or network-level ACLs.

To inspect SANs on any certificate from the command line:

    # Inspect SANs of a live cert (s_client diagnostics are discarded so only the PEM reaches the pipe)
    openssl s_client -connect api.example.com:443 -servername api.example.com < /dev/null 2>/dev/null \
      | openssl x509 -noout -text \
      | grep -A5 "Subject Alternative Name"

    # Inspect a local cert file
    openssl x509 -in server.crt -noout -text | grep -A5 "Subject Alternative Name"

    # Quick one-liner: show SANs, expiry, and issuer (-ext needs OpenSSL 1.1.1 or newer)
    openssl x509 -in server.crt -noout -subject -issuer -dates -ext subjectAltName
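
Reading the SAN list by eye is error-prone, especially with wildcards. As a rough sketch, openssl can also be asked directly whether a certificate covers a given hostname. cert_covers_host is a hypothetical helper name, and the sketch parses the printed verdict rather than relying on the exit status of openssl.

```shell
#!/bin/sh
# cert_covers_host CERT HOSTNAME: succeed if CERT is valid for HOSTNAME
# (hypothetical helper for illustration).
# `openssl x509 -checkhost` applies the library's hostname-matching rules
# and prints either "does match certificate" or "does NOT match certificate".
cert_covers_host() {
    openssl x509 -in "$1" -noout -checkhost "$2" 2>/dev/null \
        | grep -q "does match certificate"
}
```

A deployment script could then run something like cert_covers_host server.crt api-internal.example.com || echo "SAN missing" before rolling out a certificate.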

Certificate Lifecycle

Every certificate has a hard expiry date. The lifecycle phases are:

  1. Generation: create a private key and a Certificate Signing Request (CSR). The CSR contains your public key and identity claims (CN, SANs). The private key never leaves your system.
  2. Issuance: the CA validates the request (domain control via ACME DNS-01 or HTTP-01 challenges for public CAs, internal policy for private CAs), signs it, and returns the certificate. The validity period is set here.
  3. Deployment: the certificate and private key are loaded into the server (Nginx, Kubernetes Secret, Vault's PKI engine). The intermediate certificate(s) must be served alongside the leaf; the root is not sent, because clients already hold it in their trust store.
  4. Renewal: start renewal at roughly two-thirds of the validity period. For 90-day Let's Encrypt certs that means day 60. For 24-hour Vault-issued mTLS certs that means about hour 16, so rotation has to be fully automated.
  5. Revocation: if a private key is compromised, the certificate is revoked and the revocation is published via CRL (Certificate Revocation List) or OCSP (Online Certificate Status Protocol). Revocation checking is notoriously unreliable in browsers; short-lived certificates are a better answer.

The industry is moving toward short-lived certificates (24 hours or less) over revocation. If a cert expires in 24 hours and rotation is automated, a compromised key becomes useless within a day. This is the model Vault's PKI engine, Istio Citadel, and Google's internal Munger system use. It largely sidesteps the complexity and reliability issues of CRL/OCSP.
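
The two-thirds renewal rule is easy to automate. The sketch below is one possible way to express it, assuming openssl is on the PATH and that the caller knows the certificate's total lifetime; renewal_point and needs_renewal are hypothetical helper names.

```shell
#!/bin/sh
# renewal_point TOTAL: elapsed time (same unit as TOTAL) at which renewal
# should start under the two-thirds rule. Integer arithmetic, rounds down.
renewal_point() {
    echo $(( $1 * 2 / 3 ))
}

# needs_renewal CERT LIFETIME_SECONDS: succeed once no more than one third
# of the lifetime remains. `-checkend N` exits non-zero when the certificate
# expires within the next N seconds, so the result is negated.
# Note: an unreadable certificate file also reports "needs renewal".
needs_renewal() {
    threshold=$(( $2 / 3 ))
    ! openssl x509 -in "$1" -noout -checkend "$threshold" >/dev/null 2>&1
}
```

For example, renewal_point 90 prints 60 (days) and renewal_point 24 prints 16 (hours). A cron job or sidecar could call needs_renewal server.crt 7776000 for a 90-day certificate and trigger reissuance when it succeeds.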

Generating a CSR and Self-Signed Cert

For internal services or testing, you often need to generate your own CA and issue certificates against it. Here is the full workflow using openssl:

    # Requires bash: the -extfile <(...) arguments use process substitution.

    # 1. Generate a Root CA private key and self-signed cert (do this ONCE, keep key offline)
    openssl genrsa -aes256 -out root-ca.key 4096
    openssl req -new -x509 -days 3650 -key root-ca.key \
      -subj "/C=US/O=Acme Corp/CN=Acme Root CA" \
      -out root-ca.crt

    # 2. Generate an Intermediate CA key and CSR
    openssl genrsa -out intermediate-ca.key 4096
    openssl req -new -key intermediate-ca.key \
      -subj "/C=US/O=Acme Corp/CN=Acme Intermediate CA" \
      -out intermediate-ca.csr

    # 3. Sign the Intermediate CA with the Root CA
    openssl x509 -req -days 1825 -in intermediate-ca.csr \
      -CA root-ca.crt -CAkey root-ca.key -CAcreateserial \
      -extensions v3_ca \
      -extfile <(printf "[v3_ca]\nbasicConstraints=critical,CA:TRUE\nkeyUsage=critical,keyCertSign,cRLSign") \
      -out intermediate-ca.crt

    # 4. Generate a leaf key and CSR (the SANs are added at signing time in step 5)
    openssl genrsa -out server.key 2048
    openssl req -new -key server.key \
      -subj "/C=US/O=Acme Corp/CN=api.example.com" \
      -out server.csr

    # 5. Sign the leaf cert with the Intermediate CA, adding the SANs
    openssl x509 -req -days 90 -in server.csr \
      -CA intermediate-ca.crt -CAkey intermediate-ca.key -CAcreateserial \
      -extfile <(printf "[ext]\nsubjectAltName=DNS:api.example.com,DNS:www.example.com,IP:10.0.1.5") \
      -extensions ext \
      -out server.crt

    # 6. Build the bundle the server presents: leaf first, then intermediate.
    #    The root is not served; clients already hold it in their trust store.
    cat server.crt intermediate-ca.crt > fullchain.crt

    # 7. Verify the chain
    openssl verify -CAfile root-ca.crt -untrusted intermediate-ca.crt server.crt
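
With several keys and certificates on disk after this workflow, it is easy to pair the wrong ones at deployment time. One possible sanity check, sketched here under the assumption of an unencrypted PEM key such as server.key above, is to compare the public key derived from the private key with the one embedded in the certificate. key_matches_cert is a hypothetical helper name.

```shell
#!/bin/sh
# key_matches_cert KEY CERT: succeed if the private key and the certificate
# carry the same public key (hypothetical helper for illustration).
# Assumes an unencrypted PEM key; a passphrase-protected key would make
# `openssl pkey` prompt for the passphrase.
key_matches_cert() {
    key_pub=$(openssl pkey -in "$1" -pubout 2>/dev/null) || return 1
    cert_pub=$(openssl x509 -in "$2" -noout -pubkey 2>/dev/null) || return 1
    [ -n "$key_pub" ] && [ "$key_pub" = "$cert_pub" ]
}
```

Running key_matches_cert server.key server.crt before reloading the server turns a confusing handshake failure into an explicit pre-deployment error.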

Common Production Failure Modes

Understanding failure modes is what separates a senior engineer from a junior one. These are the real incidents you will encounter:

  • Incomplete chain served: the server sends only the leaf, not the intermediate. Desktop browsers cache intermediates and seem fine; mobile browsers, curl, and service-to-service clients fail with "unable to get local issuer certificate". Always concatenate leaf + intermediate into the bundle you serve, and use openssl s_client -connect host:443 -showcerts to verify that the full chain is sent.
  • SAN mismatch: a certificate issued for api.example.com is deployed behind a reverse proxy that connects to the service as api-internal.example.com, so TLS validation fails. Include every name clients will use (internal aliases, load balancer names, and pod DNS names) in the SAN list at issuance time.
  • Clock skew: a certificate issued at 14:00:00 UTC is rejected as "not yet valid" by any machine whose clock still reads 13:59:50 UTC. NTP synchronization is a PKI prerequisite.
  • Expiry surprise: a certificate expires at 03:00 AM and no one notices until traffic drops and on-call is paged at 03:05 AM. The fix: monitor probe_ssl_earliest_cert_expiry (Prometheus blackbox_exporter) and alert at 30 days and again at 7 days before expiry.
  • Root not in trust store: an internal private CA root was never distributed to all services and container base images, so new services fail mTLS. Manage trust store distribution via configuration management (Ansible, Chef) or bake the root into your base Docker image.

Never use a wildcard certificate for internal east-west traffic. Wildcards give broad coverage but make it impossible to identify individual services in logs and audit trails. For mTLS between microservices, issue per-service certificates with SPIFFE URIs so that your service mesh can enforce exact-match authorization policies.
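
The incomplete-chain failure in particular can be caught before deployment with a very small check. The sketch below simply counts the PEM certificates in the bundle a server is about to present; count_certs and bundle_has_intermediate are hypothetical helper names, and the check does not prove that the certificates actually chain to each other.

```shell
#!/bin/sh
# count_certs FILE: number of PEM certificates in FILE.
count_certs() {
    grep -c -e '-----BEGIN CERTIFICATE-----' "$1" 2>/dev/null
}

# bundle_has_intermediate FILE: succeed if FILE holds at least two
# certificates, i.e. the leaf plus one or more intermediates.
bundle_has_intermediate() {
    [ "$(count_certs "$1")" -ge 2 ] 2>/dev/null
}
```

A CI step could run bundle_has_intermediate fullchain.crt || exit 1 (the file name is a placeholder). The same count can be applied to the output of openssl s_client -showcerts to see what a live endpoint really sends.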

Monitoring Certificate Expiry in Production

    # Prometheus blackbox_exporter scrape config (prometheus.yml)
    scrape_configs:
      - job_name: 'tls_expiry'
        metrics_path: /probe
        params:
          # Assumes a module named tls_connect is defined in blackbox.yml with
          # prober: tcp and tcp.tls: true. A plain TCP module performs no TLS
          # handshake, so it never exports probe_ssl_earliest_cert_expiry.
          module: [tls_connect]
        static_configs:
          - targets:
              - api.example.com:443
              - internal-gateway.prod.svc:8443
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter:9115

    # Alert rule (alert.rules.yml)
    groups:
      - name: tls
        rules:
          - alert: CertExpiryWarning
            expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "TLS cert expiring soon on {{ $labels.instance }}"
          - alert: CertExpiryCritical
            expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
            for: 5m
            labels:
              severity: critical

PKI mastery unlocks everything in the next lessons: Vault's PKI secrets engine can replace your static openssl workflow entirely, issuing certificates programmatically with TTLs as short as one hour, and automatically revoking them when a service is decommissioned. The key concepts from this lesson — the three-tier hierarchy, SAN semantics, lifecycle phases, and chain building — are the foundation you need to configure that system correctly.