Traffic Management
Traffic management is the core value proposition of a service mesh. Without a mesh, traffic routing lives inside your application code or in coarse-grained load-balancer rules — both are inflexible and operationally dangerous at scale. Istio externalises every routing decision into two Kubernetes custom resources: VirtualService and DestinationRule. Understanding the boundary between them, and how they compose, is the prerequisite to doing anything sophisticated in the data plane — canaries, blue/green, circuit breaking, fault injection, and header-based routing all depend on this pair.
The Two-Resource Model
It helps to think of these resources as two distinct layers of abstraction:
- VirtualService: the routing layer. Answers the question "where does this request go?" It matches on HTTP method, URI prefix, headers, source namespace, or query parameters, and forwards matched traffic to one or more destinations. A destination references a Kubernetes service plus an optional subset label.
- DestinationRule: the policy layer. Answers the question "how should traffic behave once it reaches this destination?" It defines subsets (pod label selectors that group versions of a workload), load-balancing algorithms, connection pool limits, and outlier ejection (circuit breaking). The subset names referenced by a VirtualService must be declared in the DestinationRule for that host.
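A VirtualService route that references a subset (for example, subset: v2) will silently drop traffic if no DestinationRule exists that declares that subset for the same host. Envoy simply has nothing to forward to, so requests on that route fail (typically surfacing as 503 responses) instead of reaching any pod. This is the single most common misconfiguration when teams first adopt Istio.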
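A minimal sketch of how the two resources pair up, assuming a service named checkout in a production namespace whose pods carry version: v1 or version: v2 labels (the resource names and the v1 label are illustrative assumptions):

```yaml
# DestinationRule: declares the subsets that routes are allowed to reference.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
  namespace: production
spec:
  host: checkout
  subsets:
    - name: v1
      labels:
        version: v1   # selects pods labelled version: v1
    - name: v2
      labels:
        version: v2   # selects pods labelled version: v2
---
# VirtualService: decides where requests for the checkout host go.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
  namespace: production
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
            subset: v2   # must match a subset name declared in the DestinationRule
```

If the v2 entry were removed from the DestinationRule, the route would still be accepted by the API server but would have nowhere to send traffic.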
VirtualService: Routing Rules in Depth
A VirtualService applies to traffic destined for a given host (which maps to a Kubernetes service name). The http array is evaluated top-to-bottom; the first matching rule wins. This ordering matters — put more specific matches (header-based canary) before weight-based catch-alls, or your specific rules will never fire.
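The ordering rule can be illustrated with a sketch like the following, which assumes a hypothetical x-canary request header and the v1/v2 subsets of a checkout service. The header rule comes first, so flagged requests never fall through to the weighted catch-all:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
  namespace: production
spec:
  hosts:
    - checkout
  http:
    # Rule 1: specific match, evaluated first.
    - match:
        - headers:
            x-canary:          # hypothetical header name
              exact: "true"
      route:
        - destination:
            host: checkout
            subset: v2
    # Rule 2: no match block, so it catches every remaining request.
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 90
        - destination:
            host: checkout
            subset: v2
          weight: 10
```

Swapping the two rules would make the catch-all match every request first, and the header rule would never be reached.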
Key fields to know in production:
- timeout: per-request timeout. Recent Istio releases leave the HTTP request timeout disabled by default (early releases defaulted to 15s). Set this explicitly; relying on the default leads to silent latency budget overruns when a downstream service degrades.
- retries.retryOn: comma-separated Envoy retry conditions. 5xx alone is dangerous if your POST endpoints are not idempotent; prefer connect-failure,reset,retriable-4xx for mutation paths.
- match.sourceLabels: route based on the caller's pod labels, not just the request headers. Useful when a batch job and the user-facing API share a service name but need different routing policies.
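The three fields above might be combined as in this sketch. The app: batch-reporter label and all timeout and retry values are illustrative assumptions, not recommendations:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
  namespace: production
spec:
  hosts:
    - checkout
  http:
    # Callers whose pods carry app: batch-reporter (hypothetical label)
    # get a longer latency budget and no retries.
    - match:
        - sourceLabels:
            app: batch-reporter
      route:
        - destination:
            host: checkout
            subset: v1
      timeout: 30s
    # All other callers: explicit timeout and conservative retry conditions.
    - route:
        - destination:
            host: checkout
            subset: v1
      timeout: 3s              # overall budget, including any retries
      retries:
        attempts: 2
        perTryTimeout: 1s
        retryOn: connect-failure,reset,retriable-4xx
```

Keeping attempts multiplied by perTryTimeout below the overall timeout leaves room for every retry to run before the request is abandoned.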
DestinationRule: Subsets and Load Balancing
The DestinationRule for the same checkout service declares the subsets the VirtualService referenced, and sets per-subset (or global) policies:
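A sketch of such a DestinationRule, assuming v1 and v2 subsets selected by a version pod label. The load-balancing choices, pool limits, and outlier detection thresholds are placeholder values chosen for illustration:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
  namespace: production
spec:
  host: checkout
  trafficPolicy:                 # global policy, applies to every subset
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:            # circuit breaking via outlier ejection
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50     # set explicitly rather than relying on the default
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
      trafficPolicy:             # per-subset override
        loadBalancer:
          simple: RANDOM
```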
Set maxEjectionPercent explicitly. The default is 10%, which is safe for large pools, but in a small subset (e.g. 3 pods) a single pod is already 33% of the pool, so depending on the Envoy version the default can allow no ejections at all. At Google scale, teams typically set 50% for critical services and reserve 100% for stateless read replicas where partial failure is acceptable.
Canary Deployments with Traffic Splitting
A canary releases a new version to a controlled percentage of traffic before a full rollout. Istio's weighted routing makes this trivially precise — unlike a Kubernetes Deployment rollout that can only approximate percentages via pod replica ratios (10% requires a 9:1 replica ratio, meaning 10 pods minimum).
Production canary workflow:
- Deploy v2 as a separate Deployment with version: v2 labels. Keep the replica count at 1–2 for the canary; Istio controls percentages, not pod counts.
- Update the DestinationRule to declare the v2 subset.
- Update the VirtualService to split traffic: start at 1–5%, monitor error rate and p99 latency, then advance in steps (10%, 25%, 50%, 100%).
- At 100%, delete the v1 Deployment and remove the weight split from the VirtualService.
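The traffic-splitting step in the workflow above could look like this sketch at the first stage of a rollout, assuming the v1 and v2 subsets are already declared in the DestinationRule for checkout:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
  namespace: production
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 95             # stable version keeps most traffic
        - destination:
            host: checkout
            subset: v2
          weight: 5              # canary share, independent of replica count
```

Advancing the canary then means editing only the two weight values (90/10, 75/25, and so on) while keeping them summing to 100.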
Running istioctl analyze -f virtualservice-checkout.yaml catches a route that references an undeclared subset, along with ~50 other common configuration errors, before they reach the cluster.
Header-Based Routing for Dark Launches
Weight-based canaries expose the new version to real users proportionally. Header-based routing is a complementary strategy: a specific request header (set by internal tooling, a feature flag SDK, or a cookie) routes the bearer to v2 while 100% of normal traffic stays on v1. This is called a dark launch — the version is technically in production but invisible to ordinary users.
At Netflix and Uber this pattern is used for multi-region validation: internal employees hitting production from a corporate network get a forwarded header that shadows them onto canary versions of hundreds of services simultaneously, running real production load against the new code without risk to end users.
The mirror field sends a fire-and-forget copy of matched requests to the shadow destination. Responses are discarded — users see only v1 responses. This lets you validate v2 against real production traffic shapes (bursty patterns, unusual payloads, edge-case headers) before shifting any real user to it.
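Both techniques can be sketched in one VirtualService. The x-dark-launch header name and the 10% mirror rate are assumptions made for illustration:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
  namespace: production
spec:
  hosts:
    - checkout
  http:
    # Dark launch: only requests carrying the internal header reach v2.
    - match:
        - headers:
            x-dark-launch:       # hypothetical header set by internal tooling
              exact: "v2"
      route:
        - destination:
            host: checkout
            subset: v2
    # Everyone else is served by v1; a copy of some requests is mirrored to v2.
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 100
      mirror:
        host: checkout
        subset: v2
      mirrorPercentage:
        value: 10.0              # mirror roughly 10% of matched requests
```

Mirrored requests still consume v2 capacity, so a partial mirrorPercentage is a reasonable starting point when the canary runs with only one or two replicas.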
Operational Validation
After applying any VirtualService or DestinationRule change, verify that the configuration has propagated to all Envoy sidecars before calling the rollout complete.
kubectl apply succeeding only means the API server accepted the resource. Istiod must then translate it to xDS and push it to every Envoy proxy. At high pod counts (1,000+) this propagation can take 5–30 seconds. istioctl proxy-status shows the sync state — wait for all proxies to show SYNCED before running smoke tests.
Production Failure Modes
The most common production incidents caused by VirtualService and DestinationRule misconfiguration, ranked by frequency in public post-mortems:
- Missing DestinationRule subset: VirtualService references subset: v2 but DestinationRule has no matching entry, so Envoy black-holes traffic. Use istioctl analyze in the CD pipeline before merging.
- Namespace scope mismatch: A VirtualService in namespace A targeting a service in namespace B requires the FQDN (checkout.production.svc.cluster.local), not the short name. Short names resolve within the VirtualService's own namespace only.
- Retry storms: Aggressive retries on a degraded downstream multiply QPS by the retry count. A service handling 10,000 RPS with 3 retries on 5xx can generate up to 40,000 RPS to the failing downstream (10,000 original requests plus 3 × 10,000 retries). Pair retries with circuit breaking and exponential backoff, and never set retryOn: 5xx on non-idempotent endpoints.
- Weight-based split during pod restarts: If the v2 Deployment has zero ready pods and traffic still has weight > 0 going to its subset, every request in that percentage hits a 503. Always ensure the target Deployment is healthy before incrementing its weight.
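The namespace-scope and retry-storm items above can be sketched together. This example assumes a hypothetical frontend namespace whose VirtualService targets the checkout service in production, with a deliberately small retry budget:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-from-frontend   # hypothetical resource name
  namespace: frontend            # hypothetical caller namespace
spec:
  hosts:
    # FQDN, because the short name "checkout" would resolve
    # inside the frontend namespace.
    - checkout.production.svc.cluster.local
  http:
    - route:
        - destination:
            host: checkout.production.svc.cluster.local
            subset: v1
      retries:
        attempts: 1              # worst case doubles the load instead of quadrupling it
        perTryTimeout: 2s
        retryOn: connect-failure,refused-stream   # conditions where the request was not processed upstream
```

With attempts set to 1, a 10,000 RPS caller sends at most 20,000 RPS to a failing destination, and the chosen retry conditions avoid replaying requests that the upstream may already have acted on.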