Kubernetes Networking & Storage

Services Deep Dive

In the previous lesson you learned that the Kubernetes network model gives every Pod a unique, routable IP. That IP is ephemeral: when a Pod is replaced — by a rolling update, a crash, or a node eviction — it gets a completely new address. The higher-level construct that gives your workloads a stable identity is the Service. But a Service is far more than a static IP: the mechanism that actually routes packets, the topology of the DNS entry it creates, and the session-affinity guarantees it provides are all choices that determine how your system behaves at production scale.

kube-proxy: The Dataplane Under Every Service

When you create a Service, kube-proxy is responsible for programming the local network so that traffic sent to the Service's ClusterIP (a virtual IP, or VIP) is load-balanced to one of the ready backing Pods. kube-proxy is a node-level component, not part of the control plane, and typically runs as a DaemonSet on every node. Three dataplane options are covered here: two kube-proxy modes (iptables and IPVS) and an eBPF dataplane that replaces kube-proxy. The choice matters enormously at scale.

Mode 1: iptables (Default in Most Clusters)

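kube-proxy watches the EndpointSlices API (the legacy Endpoints API in older releases) and translates each Service into iptables rules in the nat table. The KUBE-SERVICES chain holds one match rule per Service port; each rule jumps to a per-Service KUBE-SVC-* chain, which in turn selects a per-endpoint KUBE-SEP-* chain that performs the DNAT. A packet destined for the VIP hits the chain, an endpoint is selected probabilistically (an effective 1-in-3 chance for each of 3 endpoints), and the destination is rewritten to the chosen Pod IP before the packet is forwarded.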
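
The probabilistic selection can be sketched with a little arithmetic. Because the per-endpoint rules are tried in order, rule i is only reached when the earlier rules did not match, so it needs a match probability of 1/(N-i+1) for every endpoint to end up with an equal share. The helper name endpoint_probabilities is hypothetical; it only prints the numbers and does not touch iptables.

```shell
#!/bin/sh
# Print the match probability each sequential rule needs so that N endpoints
# receive an equal share of new connections.
# Rule i (1-based) is evaluated only if rules 1..i-1 did not match, so it must
# match with probability 1/(N-i+1); the last rule always matches.
endpoint_probabilities() {
  awk -v n="$1" 'BEGIN {
    for (i = 1; i <= n; i++)
      printf "rule %d: %.4f\n", i, 1 / (n - i + 1)
  }'
}

# Usage: endpoint_probabilities 3
```

For three endpoints this prints 0.3333, 0.5000, and 1.0000, which works out to the 1-in-3 chance per endpoint described above.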

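The critical operational characteristic of iptables mode is that rules within a chain are evaluated sequentially. In a cluster with 10,000 Services, the first packet of each new connection may be compared against on the order of 10,000 rules in KUBE-SERVICES before it reaches its per-Service chain; packets on established connections are handled by conntrack and skip this walk. The usual symptoms at scale are higher connection-setup latency, kernel CPU time spent in netfilter, and long kube-proxy sync times. In addition, kube-proxy has historically rewritten the entire rule set on every sync (recent releases update only the chains that changed), which is a real cost on high-churn clusters.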
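
To get a feel for how large the rule set is on a given node, one option is to count rules per chain family in the output of iptables-save. The sketch below assumes the KUBE-SERVICES, KUBE-SVC-*, and KUBE-SEP-* chain names that kube-proxy uses in iptables mode; the function name kube_nat_rule_counts is hypothetical, and it reads the dump from standard input so it can be tried on a saved file.

```shell
#!/bin/sh
# Count kube-proxy NAT rules by chain family.
# Input: the text produced by `iptables-save -t nat`, on standard input.
kube_nat_rule_counts() {
  awk '
    /^-A KUBE-SERVICES / { services++ }   # one match rule per Service port
    /^-A KUBE-SVC-/      { svc++ }        # per-Service endpoint selection
    /^-A KUBE-SEP-/      { sep++ }        # per-endpoint DNAT rules
    END { printf "KUBE-SERVICES=%d KUBE-SVC=%d KUBE-SEP=%d\n", services, svc, sep }
  '
}

# Usage on a node (requires root):
#   iptables-save -t nat | kube_nat_rule_counts
```

The KUBE-SERVICES figure is the one that approximates how many rules the first packet of a new connection may have to walk.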

Mode 2: IPVS (Production Standard for Large Clusters)

IPVS (IP Virtual Server) is a kernel-space load balancer purpose-built for this problem. Instead of a flat rule chain, IPVS uses a hash table, so Service VIP lookup is effectively constant-time regardless of how many Services exist. IPVS also supports richer load-balancing algorithms (round-robin, least-connection, source-hashing) that iptables cannot express. As a rough rule of thumb, consider IPVS mode once a cluster grows past about 500 Services or 5,000 endpoints.

# Check which mode kube-proxy is running on your cluster
kubectl -n kube-system get configmap kube-proxy -o yaml | grep mode

# Direct inspection on a node (requires node shell access)
kubectl debug node/<node-name> -it --image=nicolaka/netshoot -- bash
# inside the debug pod:
ipvsadm -ln | head -30   # lists all IPVS virtual services + real servers
iptables -t nat -L KUBE-SERVICES --line-numbers | head -20

# Switch an existing cluster (kubeadm-managed) from iptables to IPVS:
kubectl -n kube-system edit configmap kube-proxy
# change:  mode: ""  ->  mode: "ipvs"
# then restart kube-proxy pods:
kubectl -n kube-system rollout restart daemonset kube-proxy

Mode 3: eBPF (Cilium / Calico eBPF)

Next-generation CNI plugins like Cilium can replace kube-proxy entirely, programming the dataplane with eBPF programs attached to the kernel's XDP and TC hooks. Service translation happens early in the packet path, before most of the network stack, which lowers latency and removes the iptables/IPVS rule-processing overhead. Many large-scale operators are moving in this direction. For this tutorial we stay with IPVS since it is the current baseline; Cilium eBPF is covered in the NetworkPolicies lesson.

Headless Services: Removing the VIP

A standard Service provides a single stable VIP and does the load-balancing inside the kernel. But this is the wrong model for a stateful workload — a Kafka consumer that must connect to partition leader 2, a Redis Sentinel client that needs to discover the primary, or a StatefulSet where every Pod has its own identity. These clients need the actual Pod IPs, not an opaque VIP.

Setting clusterIP: None creates a headless Service. No VIP is allocated, so kube-proxy has nothing to program for it. Instead, the cluster DNS returns one A (or AAAA) record per ready endpoint. The client receives all Pod IPs and is responsible for selecting one, which enables client-side load-balancing, sticky connections, or topology-aware routing that the kernel dataplane cannot express.

[Figure: Standard Service VIP vs. headless Service DNS response. Standard Service (VIP): the client Pod's DNS lookup returns the VIP 10.96.4.1 from CoreDNS, and kube-proxy (IPVS / iptables) forwards traffic to Pod-0 (10.0.0.5), Pod-1 (10.0.1.8), or Pod-2 (10.0.2.3). Headless Service (clusterIP: None): the DNS lookup returns 10.0.0.5, 10.0.1.8, and 10.0.2.3 directly, and kube-proxy is not involved. With a StatefulSet, pod-0.svc, pod-1.svc, and pod-2.svc are stable per-Pod DNS names.]
Standard Service routes via a kernel VIP; headless Service returns all Pod IPs directly to the client.

# Headless Service manifest
apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
  namespace: messaging
spec:
  clusterIP: None   # <-- this makes it headless
  selector:
    app: kafka
  ports:
    - name: broker
      port: 9092
      targetPort: 9092
---
# StatefulSet that uses the headless Service for stable DNS
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: messaging
spec:
  serviceName: kafka-headless   # <-- must match headless svc name
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.6.0
          env:
            # Kafka's broker ID must be an integer, so expose the Pod name
            # and derive the ID from its ordinal suffix at startup.
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name   # "kafka-0", "kafka-1", etc.
          # (other Kafka settings omitted for brevity)

# DNS entries created automatically:
# kafka-0.kafka-headless.messaging.svc.cluster.local -> Pod-0 IP
# kafka-1.kafka-headless.messaging.svc.cluster.local -> Pod-1 IP
# kafka-2.kafka-headless.messaging.svc.cluster.local -> Pod-2 IP
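
The per-Pod DNS names follow a fixed pattern, so a client or bootstrap script can generate them rather than hard-code them. A minimal sketch, assuming the default cluster domain cluster.local (overridable as a fifth argument); the function name statefulset_dns_names is hypothetical.

```shell
#!/bin/sh
# Print the stable DNS name of every Pod in a StatefulSet that is governed by
# a headless Service: <statefulset>-<ordinal>.<service>.<namespace>.svc.<domain>
# Arguments: statefulset name, headless Service name, namespace, replica count,
#            and optionally the cluster domain (assumed default: cluster.local).
statefulset_dns_names() {
  sts_name=$1
  svc_name=$2
  namespace=$3
  replicas=$4
  domain=${5:-cluster.local}
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s-%d.%s.%s.svc.%s\n' "$sts_name" "$i" "$svc_name" "$namespace" "$domain"
    i=$((i + 1))
  done
}

# Usage: statefulset_dns_names kafka kafka-headless messaging 3
```

Called with the values from the manifest above, it prints the same three kafka-N.kafka-headless.messaging.svc.cluster.local names listed in the comments.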

Session Affinity: Pinning Clients to a Backend

By default, every new TCP connection from a client is independently load-balanced — there is no guarantee that a client will hit the same backend Pod twice. For stateless services this is desirable; for services that store per-session data in memory (shopping carts, WebSocket upgrade handshakes, ML inference servers that load a model per session), routing every request from the same client to the same Pod is critical.

Kubernetes Services support sessionAffinity: ClientIP, which makes kube-proxy remember which backend each client source IP was assigned (IPVS persistence in IPVS mode, the recent match module in iptables mode). Any connection from the same source IP will be routed to the same backend for the duration of the timeoutSeconds window (default 10800 seconds = 3 hours).

apiVersion: v1
kind: Service
metadata:
  name: inference-api
  namespace: ml
spec:
  selector:
    app: inference
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600   # 1-hour sticky window; max is 86400
  ports:
    - port: 8080
      targetPort: 8080

# Verify session affinity is active:
kubectl get svc inference-api -n ml -o jsonpath='{.spec.sessionAffinity}'
# Expected output: ClientIP

# In IPVS mode you can see the persistence table directly on the node:
# ipvsadm -ln --persistent-conn | grep -A 5 <VIP>

Production pitfall: session affinity behind a NAT or proxy. If your clients egress through a shared NAT gateway (common in corporate networks or AWS VPCs using a NAT instance), many clients will appear to have the same source IP. Session affinity will funnel all their traffic to a single backend Pod, creating a hotspot. In this topology, either use cookie-based sticky sessions at the Ingress layer (with ingress-nginx, the nginx.ingress.kubernetes.io/affinity: cookie annotation) or design the backend to be truly stateless and externalise session state to Redis.
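
One way to observe both affinity and the NAT hotspot is to record which backend answers each request and tally the results. The sketch below only does the tallying: it reads one backend identifier per line and prints a count per backend. The function name backend_distribution is hypothetical, and the /hostname path in the usage comment is an assumed endpoint that returns the serving Pod's name; the example inference-api Service is not known to expose one.

```shell
#!/bin/sh
# Tally how many requests each backend served.
# Input: one backend identifier per line (for example, a Pod hostname).
# Output: "<backend> <count>" per line, sorted by backend name.
backend_distribution() {
  sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Usage from a client Pod (the /hostname path is hypothetical):
#   for i in $(seq 20); do
#     curl -s http://inference-api.ml.svc.cluster.local:8080/hostname; echo
#   done | backend_distribution
```

With sessionAffinity: ClientIP active, a single client should see one backend take every request; seeing the same from many clients behind one NAT address is the hotspot described above.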

EndpointSlices: Scalability for Large Backends

Before EndpointSlices (beta in Kubernetes 1.17, stable since 1.21), a Service had a single Endpoints object containing all Pod IPs. On a Deployment with 1,000 replicas, every endpoint change (a single Pod restart) caused the entire 1,000-entry object to be re-written, re-sent to every node, and re-processed by kube-proxy: O(N) work for an O(1) change. EndpointSlices shard the endpoint list into chunks of up to 100 entries by default. Each chunk is independent, so a Pod restart updates one slice. Large-scale benchmarks at Google and Datadog reportedly showed this change reducing kube-proxy CPU by as much as 90%. EndpointSlices are now the default and should never be disabled.
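
The sharding arithmetic is simple enough to sketch. Assuming the default limit of 100 endpoints per slice, the minimum number of slices for a Service is the endpoint count divided by the limit, rounded up. The function name min_endpoint_slices is hypothetical, and the result is a lower bound: the controller does not always pack slices full, and a Service can have additional slices for other address families or port combinations.

```shell
#!/bin/sh
# Lower bound on the number of EndpointSlices needed for a Service.
# Arguments: endpoint count, and optionally the per-slice limit
#            (assumed default: 100).
min_endpoint_slices() {
  endpoints=$1
  per_slice=${2:-100}
  # Integer ceiling division: ceil(endpoints / per_slice)
  echo $(( (endpoints + per_slice - 1) / per_slice ))
}

# Usage: min_endpoint_slices 1000
```

For the 1,000-replica Deployment above this gives 10 slices, so a single Pod restart touches roughly a tenth of the data that the old Endpoints object would have re-sent.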

# Inspect the EndpointSlices for a Service
kubectl get endpointslices -n default -l kubernetes.io/service-name=my-service
kubectl describe endpointslice <slice-name>

# Check topology: which node each endpoint lives on
kubectl get endpointslices -n default -l kubernetes.io/service-name=my-service \
  -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\t"}{.nodeName}{"\n"}{end}'

Topology-aware routing: In Kubernetes 1.27+, setting the annotation service.kubernetes.io/topology-mode: Auto on a Service enables topology-aware hints. When the control plane publishes hints, kube-proxy prefers endpoints in the same availability zone as the client, reducing cross-zone data transfer costs (which are real and measurable on AWS/GCP at scale) and lowering latency. Keeping traffic on the same node is a separate setting (internalTrafficPolicy: Local). Enable topology-aware routing on high-throughput Services once traffic and endpoints are spread evenly across zones.
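
A minimal sketch of turning the annotation on and then checking whether hints were actually published, since the control plane may decline to add them (for example when endpoints are unevenly spread across zones). Both function names are hypothetical wrappers around kubectl; the Service name and namespace are passed as arguments.

```shell
#!/bin/sh
# Enable topology-aware routing on a Service by setting the annotation.
# Arguments: Service name, namespace.
enable_topology_mode() {
  kubectl annotate service "$1" -n "$2" \
    service.kubernetes.io/topology-mode=Auto --overwrite
}

# Print each endpoint's address, its zone, and the zones it is hinted for.
# An empty third column means no hints were published for that endpoint.
# Arguments: Service name, namespace.
zone_hints() {
  kubectl get endpointslices -n "$2" \
    -l "kubernetes.io/service-name=$1" \
    -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\t"}{.zone}{"\t"}{.hints.forZones[*].name}{"\n"}{end}'
}

# Usage:
#   enable_topology_mode my-service default
#   zone_hints my-service default
```

If the hints column stays empty after the annotation is set, traffic is still being balanced cluster-wide and the cross-zone savings are not being realised.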

Putting It Together: Choosing the Right Service Shape

As a rule of thumb: use a standard ClusterIP Service for stateless workloads. Use a headless Service for StatefulSets, peer-discovery protocols (Elasticsearch, Cassandra, Kafka), and any case where the client needs to address individual Pods. Enable sessionAffinity: ClientIP sparingly — only when statefulness genuinely lives in the Pod and cannot be moved to a shared store — and be aware of the NAT-hotspot failure mode. Run IPVS mode on any cluster with more than a few hundred Services. These choices, made correctly, are invisible to users; made incorrectly, they become the hardest class of production bugs to diagnose.