Kubernetes Networking & Storage

Services Deep Dive

18 min Lesson 2 of 31

In the previous lesson you learned that the Kubernetes network model gives every Pod a unique IP that is routable within the cluster. That IP is ephemeral: when a Pod is replaced (by a rolling update, a crash, or a node eviction), its replacement gets a completely new address. The higher-level construct that gives your workloads a stable identity is the Service. But a Service is far more than a static IP: the mechanism that actually routes packets, the shape of the DNS records it creates, and the session-affinity guarantees it provides are all choices that determine how your system behaves at production scale.

kube-proxy: The Dataplane Under Every Service

When you create a Service, kube-proxy programs the local network on each node so that traffic sent to the Service's ClusterIP (a virtual IP, or VIP) is load-balanced to one of the ready backing Pods. kube-proxy is a node component, not part of the control plane, and is typically deployed as a DaemonSet so that it runs on every node. This lesson covers three dataplane options: the iptables and IPVS modes of kube-proxy, and eBPF dataplanes that replace kube-proxy altogether. The choice matters enormously at scale.

Mode 1: iptables (Default in Most Clusters)

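kube-proxy watches the Service and EndpointSlice APIs (older releases watched Endpoints) and translates each Service into iptables rules in the nat table. A packet destined for the VIP first matches the Service's rule in the KUBE-SERVICES chain, which jumps to a per-Service KUBE-SVC-* chain. That chain picks one per-endpoint KUBE-SEP-* chain at random using the statistic match (for 3 endpoints the successive rules match with probability 1/3, then 1/2, then always, which gives each endpoint an equal 1-in-3 chance), and the KUBE-SEP-* chain rewrites the destination to the chosen Pod IP with DNAT before the packet is forwarded.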
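The arithmetic behind the probabilistic selection can be sketched in a few lines of shell. This is only an illustration of the math, not kube-proxy's code; the function name endpoint_rule_probabilities is made up for this example. Because the per-endpoint rules are tried in order, each rule must match with a probability that accounts for the rules before it having been skipped.

```shell
# Sketch: per-rule match probabilities for uniform selection among N endpoints.
# Rule k (1-based) is only reached if rules 1..k-1 did not match, so it must
# match with probability 1/(N-k+1) for every endpoint to end up at 1/N overall.
endpoint_rule_probabilities() {
  awk -v n="$1" 'BEGIN {
    for (k = 1; k <= n; k++)
      printf "rule %d: %.5f\n", k, 1 / (n - k + 1)
  }'
}

endpoint_rule_probabilities 3
```

For 3 endpoints this prints 0.33333, 0.50000 and 1.00000: the second rule fires half of the remaining two-thirds of the time, and the last rule catches everything left, so each endpoint receives one third of new connections.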

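The critical operational characteristic of iptables mode is that each chain is a flat list evaluated sequentially. In a cluster with 10,000 Services and 50,000 endpoints, the first packet of every new connection may be compared against up to roughly 10,000 rules in KUBE-SERVICES before it finds its Service; conntrack then remembers the translation, so later packets of that connection skip the walk. The cost therefore grows with connection rate and Service count, and shows up as higher connection-setup latency and kernel (softirq) CPU on the node. Rule updates are the second problem: kube-proxy applies changes with iptables-restore, and historically rewrote its entire rule set on every Service or endpoint change (newer releases write only the chains that changed), so high-churn clusters spend significant kube-proxy CPU and time on each sync.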
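To get a feel for how large the rule set is on a given node, you can count the rules kube-proxy has installed. The helper below is a rough sketch: count_kube_rules is a made-up name, it assumes the usual `-A <chain> ...` line format of iptables-save, and the sample lines fed to it are invented for the demo (chain suffixes and IPs are illustrative only).

```shell
# Reads `iptables-save -t nat` output on stdin and counts the rules appended to
# the top-level KUBE-SERVICES chain and to the per-endpoint KUBE-SEP-* chains.
count_kube_rules() {
  awk '
    $1 == "-A" && $2 == "KUBE-SERVICES" { svc++ }
    $1 == "-A" && $2 ~ /^KUBE-SEP-/     { sep++ }
    END { printf "KUBE-SERVICES rules: %d\nKUBE-SEP rules: %d\n", svc, sep }
  '
}

# Demo on made-up sample lines.
# On a real node, as root: iptables-save -t nat | count_kube_rules
count_kube_rules <<'EOF'
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m udp --dport 53 -j KUBE-SVC-EXAMPLE1
-A KUBE-SVC-EXAMPLE1 -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AAAA
-A KUBE-SVC-EXAMPLE1 -j KUBE-SEP-BBBB
-A KUBE-SEP-AAAA -p udp -m udp -j DNAT --to-destination 10.244.1.5:53
-A KUBE-SEP-BBBB -p udp -m udp -j DNAT --to-destination 10.244.2.7:53
EOF
```

The KUBE-SERVICES count approximates how many rules a new connection may have to be compared against; the KUBE-SEP count tracks the number of endpoints.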

Mode 2: IPVS (Production Standard for Large Clusters)

IPVS (IP Virtual Server) is a kernel-space load balancer purpose-built for this problem. Instead of a flat rule chain, IPVS keeps virtual services in a hash table, so Service VIP lookup is O(1) regardless of how many Services exist. IPVS also supports richer load-balancing algorithms (round-robin, least-connection, source-hashing), whereas iptables mode offers only random selection. As this lesson's rule of thumb, above roughly 500 Services or 5,000 endpoints IPVS mode is the sounder default.

# Check which mode kube-proxy is running on your cluster
# (an empty value means the platform default, which is iptables on Linux)
kubectl -n kube-system get configmap kube-proxy -o yaml | grep mode

# Direct inspection on a node (requires permission to create node debug Pods;
# --profile=sysadmin runs the debug Pod privileged and needs a recent kubectl)
kubectl debug node/<node-name> -it --profile=sysadmin --image=nicolaka/netshoot -- bash
  # inside the debug pod:
  ipvsadm -ln | head -30          # lists all IPVS virtual services + real servers
  iptables -t nat -n -L KUBE-SERVICES --line-numbers | head -20

# Switch an existing cluster (kubeadm-managed) from iptables to IPVS.
# Prerequisite: the ip_vs kernel modules must be available on every node.
kubectl -n kube-system edit configmap kube-proxy
# change:  mode: ""   ->   mode: "ipvs"
# then restart kube-proxy pods:
kubectl -n kube-system rollout restart daemonset kube-proxy

Mode 3: eBPF (Cilium / Calico eBPF)

Next-generation CNI plugins such as Cilium can replace kube-proxy entirely, implementing Service load-balancing in eBPF programs attached to the kernel's XDP and TC hooks. Packets are translated at or before the entry to the network stack, which avoids the iptables/IPVS rule processing and typically lowers latency. Several large managed platforms now offer an eBPF dataplane. For this tutorial we stay with IPVS since it is the current baseline; Cilium eBPF is covered in the NetworkPolicies lesson.

Headless Services: Removing the VIP

A standard Service provides a single stable VIP and does the load-balancing inside the kernel. That is the wrong model for many stateful workloads: a Kafka client that must connect to the specific broker leading a partition, a Redis Sentinel client that needs to discover the primary, or a StatefulSet where every Pod has its own identity. These clients need the actual Pod IPs, not an opaque VIP.

Setting clusterIP: None creates a headless Service. No VIP is allocated, so kube-proxy programs nothing for it. Instead, the cluster DNS answers a query for the Service name with one A (or AAAA) record per ready endpoint. The client receives the Pod IPs and is responsible for selecting one, which enables client-side load-balancing, sticky connections, or topology-aware routing that the kernel dataplane cannot express.

Standard Service routes via a kernel VIP; headless Service returns all Pod IPs directly to the client.

# Headless Service manifest
apiVersion: v1
kind: Service
metadata:
  name: kafka-headless
  namespace: messaging
spec:
  clusterIP: None          # <-- this makes it headless
  selector:
    app: kafka
  ports:
    - name: broker
      port: 9092
      targetPort: 9092
---
# StatefulSet that uses the headless Service for stable DNS
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: messaging
spec:
  serviceName: kafka-headless    # <-- must match headless svc name
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.6.0
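          env:
            # Broker IDs must be integers, so use the Pod's ordinal index
            # ("0", "1", ...) rather than metadata.name ("kafka-0").
            # The pod-index label requires Kubernetes 1.28+.
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']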

# DNS entries created automatically:
# kafka-0.kafka-headless.messaging.svc.cluster.local  ->  Pod-0 IP
# kafka-1.kafka-headless.messaging.svc.cluster.local  ->  Pod-1 IP
# kafka-2.kafka-headless.messaging.svc.cluster.local  ->  Pod-2 IP
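
Because the per-Pod DNS names follow a fixed pattern, a client's bootstrap list can be generated rather than hard-coded. The sketch below assumes the default cluster domain cluster.local and reuses the names and port from the manifests above; statefulset_pod_fqdns is a hypothetical helper name, not a kubectl feature.

```shell
# Print the stable DNS name of every Pod in a StatefulSet.
# Args: statefulset name, headless Service name, namespace, replica count,
# and optionally the cluster domain (assumed to be cluster.local).
statefulset_pod_fqdns() {
  sts="$1"; svc="$2"; ns="$3"; replicas="$4"; domain="${5:-cluster.local}"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s-%d.%s.%s.svc.%s\n' "$sts" "$i" "$svc" "$ns" "$domain"
    i=$((i + 1))
  done
}

# Build a comma-separated bootstrap list for the Kafka example (port 9092).
statefulset_pod_fqdns kafka kafka-headless messaging 3 | sed 's/$/:9092/' | paste -sd, -
```

The names stay valid across Pod restarts even though the IPs behind them change, which is the property stateful clients rely on.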

Session Affinity: Pinning Clients to a Backend

By default, every new TCP connection from a client is independently load-balanced, so there is no guarantee that a client will hit the same backend Pod twice. For stateless services this is desirable; for services that store per-session data in memory (shopping carts, WebSocket upgrade handshakes, ML inference servers that load a model per session), routing every connection from the same client to the same Pod is critical. Note that Services balance connections, not individual requests: requests sent over one kept-alive connection always reach the same Pod.

Kubernetes Services support sessionAffinity: ClientIP, which makes kube-proxy pin each client source IP to the backend chosen for its first connection (IPVS mode uses persistent connections; iptables mode uses the recent match). New connections from the same source IP are routed to the same backend as long as they arrive within timeoutSeconds of the previous one (default 10800 seconds = 3 hours).

apiVersion: v1
kind: Service
metadata:
  name: inference-api
  namespace: ml
spec:
  selector:
    app: inference
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600     # 1-hour sticky window; max is 86400
  ports:
    - port: 8080
      targetPort: 8080

# Verify session affinity is active:
kubectl get svc inference-api -n ml -o jsonpath='{.spec.sessionAffinity}'
# Expected output: ClientIP

# In IPVS mode you can see the persistence table directly on the node:
# ipvsadm -ln --persistent-conn | grep -A 5 <VIP>

Production pitfall: session affinity behind a NAT or proxy. If your clients egress through a shared NAT gateway (common in corporate networks or AWS VPCs using a NAT instance), many clients will appear to have the same source IP. Session affinity will funnel all their traffic to a single backend Pod, creating a hotspot. In this topology, either use cookie-based sticky sessions at the Ingress layer (with ingress-nginx, the annotation nginx.ingress.kubernetes.io/affinity: "cookie") or design the backend to be truly stateless and externalise session state to Redis.
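
The hotspot effect is easy to reproduce with a toy model. The sketch below is not how kube-proxy selects backends; it uses cksum purely as a convenient deterministic stand-in for "choose a backend from the source IP", and pick_backend is a made-up name. The IPs come from the documentation address ranges.

```shell
# Toy stand-in for source-IP affinity: map a client IP to one of N backends.
# cksum is used only as a convenient deterministic hash; kube-proxy does NOT
# use this function, but any source-IP-keyed scheme shares the same property.
pick_backend() {
  crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( crc % $2 ))
}

# Three clients with distinct source IPs.
for ip in 198.51.100.10 198.51.100.11 198.51.100.12; do
  echo "direct  $ip -> backend $(pick_backend "$ip" 3)"
done

# The same three clients behind one NAT gateway: the Service sees one source IP.
nat_ip=203.0.113.7
for client in a b c; do
  echo "via NAT client-$client ($nat_ip) -> backend $(pick_backend "$nat_ip" 3)"
done
```

Whatever backend the NAT address maps to, all three NATed clients land on it, because the only key the dataplane can see is identical for each of them.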

EndpointSlices: Scalability for Large Backends

Before EndpointSlices (beta in Kubernetes 1.17, generally available in 1.21), a Service had a single Endpoints object containing all Pod IPs. On a Deployment with 1,000 replicas, every endpoint change (a single Pod restart) caused the entire 1,000-entry object to be re-written, re-sent to every node, and re-processed by kube-proxy: O(N) work for an O(1) change. EndpointSlices shard the endpoint list into chunks of at most 100 endpoints by default. Each chunk is independent; a Pod restart updates one slice. This change reportedly reduced kube-proxy CPU by 90% in large-scale benchmarks at Google and Datadog. EndpointSlices are now the default and should not be disabled.
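
The sharding arithmetic is a ceiling division. The sketch below computes the minimum number of slices for a given endpoint count, assuming every slice is filled to the limit; the real controller does not always pack slices fully, so treat the result as a lower bound. min_endpoint_slices is a hypothetical helper name.

```shell
# Minimum number of EndpointSlices needed for a given endpoint count, assuming
# every slice is packed up to the per-slice limit (integer ceiling division).
min_endpoint_slices() {
  endpoints="$1"
  per_slice="${2:-100}"   # 100 is the default limit cited above
  echo $(( (endpoints + per_slice - 1) / per_slice ))
}

min_endpoint_slices 1000      # 10
min_endpoint_slices 1001      # 11
min_endpoint_slices 250 100   # 3
```

For the 1,000-replica Deployment in the text, a single Pod restart rewrites one slice of about 100 entries out of at least 10, rather than one 1,000-entry object.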

# Inspect the EndpointSlices for a Service
kubectl get endpointslices -n default -l kubernetes.io/service-name=my-service
kubectl describe endpointslice <slice-name>

# Check topology: which node each endpoint lives on
kubectl get endpointslices -n default -l kubernetes.io/service-name=my-service \
  -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\t"}{.nodeName}{"\n"}{end}'

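Topology-aware routing: In Kubernetes 1.27+, setting the annotation service.kubernetes.io/topology-mode: Auto on a Service enables topology-aware hints. The EndpointSlice controller annotates each endpoint with the zones it should serve, and kube-proxy then prefers endpoints in the same availability zone as the client's node, reducing cross-zone data transfer costs (which are real and measurable on AWS/GCP at scale) and lowering latency. The hints work at zone granularity, not per node, and are dropped when a zone has too few endpoints to carry its share of traffic. Enable it on high-throughput Services whose traffic and endpoints are spread evenly across zones.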
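
A minimal way to try this on one Service is sketched below. The two function names are hypothetical wrappers added for illustration; the kubectl invocations inside them are standard, and my-service in the default namespace is the same placeholder used in the commands above. The second helper assumes the zone hints appear under each endpoint's hints.forZones field in the EndpointSlice.

```shell
# Hypothetical helper names; the kubectl commands inside are standard.
# Turn on topology-aware routing for one Service via its annotation.
enable_topology_mode() {
  kubectl annotate service "$1" -n "$2" \
    service.kubernetes.io/topology-mode=Auto --overwrite
}

# List each endpoint with its zone and the zones it is hinted to serve.
# An empty third column means no hints have been assigned.
show_zone_hints() {
  kubectl get endpointslices -n "$2" -l "kubernetes.io/service-name=$1" \
    -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\t"}{.zone}{"\t"}{.hints.forZones[*].name}{"\n"}{end}'
}

# Usage against a live cluster:
#   enable_topology_mode my-service default
#   show_zone_hints my-service default
```

If the hints column stays empty after enabling the annotation, the controller has most likely decided the endpoints are too unevenly distributed across zones to apply hints safely.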

Putting It Together: Choosing the Right Service Shape

As a rule of thumb: use a standard ClusterIP Service for stateless workloads. Use a headless Service for StatefulSets, peer-discovery protocols (Elasticsearch, Cassandra, Kafka), and any case where the client needs to address individual Pods. Enable sessionAffinity: ClientIP sparingly — only when statefulness genuinely lives in the Pod and cannot be moved to a shared store — and be aware of the NAT-hotspot failure mode. Run IPVS mode on any cluster with more than a few hundred Services. These choices, made correctly, are invisible to users; made incorrectly, they become the hardest class of production bugs to diagnose.