Queue-Based & Event-Driven Scaling
By default, Kubernetes HPA scales on CPU and memory, signals that measure work currently being processed. For event-driven workloads the signal that actually matters is pending work not yet processed: the consumer-group lag on a Kafka topic, the length of a RabbitMQ queue, the backlog of an SQS queue. CPU and memory are lagging indicators here; queue depth is the leading one. KEDA (Kubernetes Event-Driven Autoscaler) closes this gap by wiring arbitrary external metrics directly into the Kubernetes HorizontalPodAutoscaler machinery. It was donated to CNCF in 2020 and is now a CNCF Graduated project that runs on any conformant Kubernetes cluster, in the cloud or on-prem.
How KEDA Works: Architecture Under the Hood
KEDA installs two components into your cluster. The KEDA Operator watches ScaledObject and ScaledJob custom resources and translates them into Kubernetes-native HorizontalPodAutoscaler objects — it does not bypass HPA, it drives it. The Metrics Adapter is an implementation of the Kubernetes external.metrics.k8s.io API: it polls your external source (Kafka, Redis, SQS, Prometheus, Azure Service Bus, …) on a configurable interval and exposes the metric value so HPA can act on it like any other metric.
This design matters: kubectl get hpa still shows the KEDA-managed scaler with EXTERNAL metric type. Standard Kubernetes RBAC, PodDisruptionBudgets, and minReplicas/maxReplicas guards all apply. KEDA adds scale-to-zero (HPA cannot go below 1 natively) and scale-from-zero capabilities — when no messages exist, the Deployment drops to 0 replicas and KEDA's own polling loop wakes it when the first message arrives.
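The split of responsibilities described above (KEDA handles 0 to 1 and back, HPA handles 1 to N) can be pictured with a small Python model. This is a simplified sketch, not KEDA's actual code: the function name, the activation_threshold parameter, and the "keda"/"hpa" labels are illustrative assumptions, and cooldown periods and HPA stabilization windows are ignored.

```python
import math
from typing import Tuple


def next_replicas(current: int, metric: float, target_per_replica: float,
                  min_replicas: int, max_replicas: int,
                  activation_threshold: float = 0.0) -> Tuple[int, str]:
    """Return (replica count, which component made the decision).

    Simplified model: ignores cooldown and stabilization windows.
    """
    active = metric > activation_threshold
    if not active and min_replicas == 0:
        # No pending work and scale-to-zero allowed: KEDA removes all replicas.
        return 0, "keda"
    if current == 0 and active:
        # Work arrived while at zero: KEDA wakes the workload (0 -> 1).
        return max(1, min_replicas), "keda"
    # From 1 replica upwards, HPA-style proportional scaling applies.
    wanted = math.ceil(metric / target_per_replica)
    floor = max(1, min_replicas)
    return min(max(wanted, floor), max_replicas), "hpa"


print(next_replicas(3, 0, 1000, 0, 50))      # idle queue, scale-to-zero enabled
print(next_replicas(0, 1, 1000, 0, 50))      # first message arrives
print(next_replicas(1, 47300, 1000, 0, 50))  # backlog drives HPA scaling
```

The point of the model is that the zero boundary is the only place where KEDA acts directly on the workload; everything above it goes through the HPA that KEDA created.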
Installing KEDA
Helm is the standard installation path. KEDA runs in its own keda namespace and registers the metrics server extension API. Pin the version in production — chart version 2.x maps to KEDA operator 2.x (they track together).
Scaling on Kafka Consumer Lag
The canonical KEDA use case at scale: a Kafka consumer group is falling behind on its partitions. You want replicas proportional to the total lag across partitions, with a target lag per replica of, say, 1,000 messages, so at 50,000 unprocessed messages you expect about 50 replicas.
The scaling rule is desiredReplicas = ceil(totalLag / lagThreshold), capped at maxReplicaCount. With 47,300 messages and a threshold of 1,000 you get ceil(47.3) = 48. This is deterministic and easy to reason about during incident reviews.
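The arithmetic can be checked with a short Python helper. This is a sketch for reasoning about expected replica counts, not KEDA's implementation; the function and parameter names are hypothetical, and the optional partitions cap reflects the fact that consumers beyond the partition count would sit idle.

```python
import math
from typing import Optional


def desired_replicas(total_lag: int, lag_threshold: int,
                     max_replicas: int,
                     partitions: Optional[int] = None) -> int:
    """ceil(total_lag / lag_threshold), capped by max_replicas and partitions."""
    if lag_threshold <= 0:
        raise ValueError("lag_threshold must be positive")
    wanted = math.ceil(total_lag / lag_threshold)
    capped = min(wanted, max_replicas)
    if partitions is not None:
        # Extra consumers beyond the partition count would have nothing to read.
        capped = min(capped, partitions)
    return capped


print(desired_replicas(47_300, 1_000, max_replicas=100))                 # 48
print(desired_replicas(47_300, 1_000, max_replicas=100, partitions=10))  # 10
```

The second call shows why the partition count matters: the lag asks for 48 replicas, but a 10-partition topic can only keep 10 of them busy.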
Scaling on SQS Queue Depth
SQS is the most common trigger in AWS-native shops. KEDA uses the aws-sqs-queue scaler, which calls GetQueueAttributes to read ApproximateNumberOfMessages; by default it also counts in-flight messages (ApproximateNumberOfMessagesNotVisible) unless scaleOnInFlight is disabled. You need a TriggerAuthentication or an IRSA annotation. Prefer IRSA in EKS to avoid long-lived credentials in Secrets.
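To see how queue attributes translate into replicas, the following Python sketch works on the attribute map that GetQueueAttributes returns (string values keyed by attribute name). It makes no AWS calls; the function names, the include_in_flight flag, and the sample numbers are illustrative assumptions rather than KEDA internals.

```python
import math
from typing import Dict


def sqs_backlog(attributes: Dict[str, str], include_in_flight: bool = True) -> int:
    """Sum the visible (and optionally in-flight) message counts."""
    visible = int(attributes.get("ApproximateNumberOfMessages", "0"))
    in_flight = int(attributes.get("ApproximateNumberOfMessagesNotVisible", "0"))
    return visible + (in_flight if include_in_flight else 0)


def replicas_for_queue(attributes: Dict[str, str], target_per_replica: int,
                       max_replicas: int, include_in_flight: bool = True) -> int:
    """Replicas needed so each one handles about target_per_replica messages."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    backlog = sqs_backlog(attributes, include_in_flight)
    return min(math.ceil(backlog / target_per_replica), max_replicas)


# Hypothetical sample of what GetQueueAttributes might return.
attrs = {
    "ApproximateNumberOfMessages": "120",
    "ApproximateNumberOfMessagesNotVisible": "30",
}
print(replicas_for_queue(attrs, target_per_replica=5, max_replicas=100))
print(replicas_for_queue(attrs, 5, 100, include_in_flight=False))
```

Counting in-flight messages keeps replicas alive while they are still working on messages they have already received, at the cost of a slower scale-down.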
ScaledJob for Batch Workloads
For run-to-completion tasks (video transcoding, batch ML inference, report generation) use ScaledJob instead of ScaledObject. KEDA creates a fresh Kubernetes Job for each unit of work up to maxReplicaCount, then cleans up completed jobs. This avoids head-of-line blocking, where a single long-running consumer holds up the messages queued behind it.
Use ScaledObject for long-running consumer processes (your normal Kafka worker deployment). Use ScaledJob when each message maps to a discrete, bounded task, especially if processing time is highly variable. Jobs give you better fan-out parallelism and automatic cleanup.
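One way to reason about how many Jobs get created per polling cycle is the small Python model below. It assumes one message per Job and a simple "fill up to the cap" policy; KEDA's real scaling strategies are configurable and may differ, so treat the function and its parameters as hypothetical.

```python
def jobs_to_create(pending_messages: int, running_jobs: int,
                   max_replica_count: int) -> int:
    """New Jobs to start this cycle under a simple fill-to-cap policy."""
    # Never run more Jobs than there are messages or than the cap allows.
    wanted = min(pending_messages, max_replica_count)
    # Jobs already running count against the target.
    return max(0, wanted - running_jobs)


print(jobs_to_create(pending_messages=10, running_jobs=3, max_replica_count=20))
print(jobs_to_create(pending_messages=100, running_jobs=5, max_replica_count=20))
print(jobs_to_create(pending_messages=2, running_jobs=5, max_replica_count=20))
```

Because running Jobs are never terminated by the scaler in this model, a long transcode keeps going while new Jobs pick up the rest of the backlog.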
Production Failure Modes & Mitigations
KEDA at scale surfaces several non-obvious failure modes that you must account for before going to production:
- Metrics Adapter outage freezes scaling: If the KEDA metrics server becomes unavailable, HPA cannot fetch the external metric, so it stops making scaling decisions and holds the current replica count, which may no longer match the backlog. Design your consumer to be idempotent and your topic retention to cover the KEDA restart window (usually seconds rather than minutes with proper liveness probes).
- Scale-to-zero cold start latency: Going from 0 to 1 replica takes time: scheduling, image pull (if not cached), and your app startup. During that gap messages accumulate. For latency-sensitive paths keep minReplicaCount: 1 and accept the idle cost. For cost-sensitive batch jobs, set pollingInterval: 5 and pre-pull images with a DaemonSet or a node image cache (for example, a custom AMI with the image baked in, selected through your Karpenter node class).
- Partition count ceiling: Kafka cannot scale consumers in a group past the partition count. 10 partitions = 10 maximum parallel consumers regardless of lag. Scale partitions before you need them: a Kafka partition increase is non-reversible and requires a rebalance. A common production target is partitions = 3 × maxReplicas to leave headroom for rebalancing.
- ScaledObject deletion deletes the HPA: KEDA owns the HPA object. If you delete the ScaledObject in an incident (to stop autoscaling), the HPA is also deleted and nothing manages the replica count any more: by default the Deployment stays at whatever count it had at that moment, or returns to its original count if restoreToOriginalReplicaCount is enabled. To hold a fixed replica count, pause the ScaledObject with the autoscaling.keda.sh/paused-replicas annotation instead of deleting it; a manual kubectl scale deployment is overridden by the HPA while autoscaling is active.
- Trigger authentication Secret rotation: If the Secret referenced by TriggerAuthentication is rotated and the KEDA Operator has cached the old value, scaling can silently fail (metrics may read as 0 and replicas collapse). Monitor the keda_scaler_errors_total metric alongside the KEDA operator logs, and alert on any non-zero rate.
If KEDA requests more replicas than the cluster can schedule, the extra pods stay Pending and messages keep accumulating. Always align maxReplicaCount with the node budget you covered in Lesson 5, and set a Kubernetes ResourceQuota on the namespace as a hard ceiling.
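Two of the failure modes above are plain arithmetic and are worth estimating before go-live. The Python sketch below is a back-of-the-envelope aid under stated assumptions: a constant arrival rate, a worst case where the first message lands just after a poll, and the 3 × maxReplicas rule of thumb quoted above. All function names and sample numbers are hypothetical.

```python
def cold_start_backlog(arrival_rate_per_s: int, polling_interval_s: int,
                       startup_s: int) -> int:
    """Worst-case messages queued before the first replica starts consuming."""
    # Up to one full polling interval passes before KEDA notices the work,
    # then scheduling, image pull and app startup add startup_s on top.
    return arrival_rate_per_s * (polling_interval_s + startup_s)


def recommended_partitions(max_replicas: int, headroom_factor: int = 3) -> int:
    """Partition count that leaves headroom above the replica ceiling."""
    return headroom_factor * max_replicas


def effective_consumers(replicas: int, partitions: int) -> int:
    """Consumers that actually receive partitions in one consumer group."""
    return min(replicas, partitions)


print(cold_start_backlog(arrival_rate_per_s=200, polling_interval_s=5, startup_s=25))
print(recommended_partitions(max_replicas=50))
print(effective_consumers(replicas=48, partitions=10))
```

If the first number is larger than your latency budget allows, that is the signal to keep minReplicaCount at 1 for that workload.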
Observability for KEDA Scalers
KEDA exposes Prometheus metrics from the metrics adapter. Scrape the keda-operator-metrics-apiserver service on port 8080. Key metrics to alert on:
- keda_scaler_metrics_value: the current external metric value (queue depth, lag). Graph this alongside replica count to verify the scaler is responsive.
- keda_scaler_errors_total: any non-zero rate means the trigger cannot read its source. Alert immediately.
- keda_scaled_object_paused: set to 1 when a ScaledObject is paused (useful during maintenance windows; you can pause via annotation).
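"Alert on any non-zero rate" means comparing two samples of a counter over time, not looking at its absolute value. The Python sketch below shows that calculation for a counter such as keda_scaler_errors_total, including the usual handling of a counter reset after a restart. In practice your monitoring system computes this for you; the function names here are hypothetical.

```python
def counter_rate(prev: float, curr: float, seconds: float) -> float:
    """Per-second increase of a monotonically increasing counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # A lower current value means the process restarted and the counter
    # began again from zero, so the whole current value is new.
    increase = curr if curr < prev else curr - prev
    return increase / seconds


def should_alert(prev: float, curr: float, seconds: float) -> bool:
    """True when any scaler errors occurred between the two samples."""
    return counter_rate(prev, curr, seconds) > 0


print(should_alert(prev=10, curr=10, seconds=60))  # no new errors
print(should_alert(prev=10, curr=16, seconds=60))  # errors occurred
print(should_alert(prev=10, curr=2, seconds=60))   # reset, then new errors
```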
At large scale, a common practice is to deploy a Grafana dashboard per ScaledObject showing: queue depth trend, replica count over time, scale event annotations, and consumer throughput (messages/s processed). This dashboard is your primary tool during capacity reviews (Lesson 9) and incident response.