Cloud infrastructure bills at the major providers scale with the CPU and memory you reserve, not just with what your workloads actually consume. In a mature Kubernetes cluster, 30–60% of allocated resources frequently sit idle, so you pay for capacity that never processes a single byte of real traffic. This lesson is about closing that gap: understanding where waste originates, how to measure it precisely, and how to use Kubernetes scheduling mechanics to run more work on fewer nodes.
These techniques are not optional tuning. At big-tech scale (thousands of nodes, hundreds of teams) a 10% efficiency improvement translates directly to millions of dollars per year and meaningfully reduces the carbon footprint of your fleet.
The Root Cause: Requests vs. Limits vs. Actual Usage
Every Pod spec can declare two resource dimensions per container: requests (what the scheduler reserves) and limits (the ceiling the kernel enforces at runtime). The scheduler only sees requests; it places a Pod on a node when node.allocatable - sum(existing_pod_requests) >= pod_requests. Limits are irrelevant to placement.
This creates three distinct waste patterns:
Over-provisioned requests: A container requests 2 CPU / 4 Gi but typically consumes 0.3 CPU / 800 Mi. The scheduler reserves the full 2 CPU on a node, so no other Pod can claim that headroom even though roughly 85% of the CPU goes unused.
Missing requests (no-request pods): A container with neither requests nor limits is scheduled as if it needs zero CPU and memory, and its Pod lands in the BestEffort QoS class. This looks cheap but is risky: the scheduler can overcommit the node, and under node pressure the kubelet evicts BestEffort pods first, then Burstable pods that exceed their requests, causing unpredictable outages.
Limits set too high (or unlimited): With no CPU limit, a container can burst into all idle CPU on the node. Under contention the kernel shares CPU in proportion to requests, so neighbours with low or missing requests are squeezed (throttling only happens when a limit is set). With no memory limit, a leaking container can exhaust node memory, at which point the kernel OOM killer or kubelet eviction may terminate other pods on the node, lowest QoS class first.
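The placement rule above (allocatable minus existing requests must cover the new Pod's requests) can be sketched in a few lines of Python. This is a simplified illustration, not the scheduler's actual code: the `Resources` type, the node size, and the incoming Pod are hypothetical, and only CPU and memory are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_millicores: int
    memory_mib: int


def fits(allocatable: Resources, existing_requests: list[Resources], pod: Resources) -> bool:
    """Return True when unreserved capacity covers the pod's requests.

    Only requests are considered; limits and actual usage play no part.
    """
    free_cpu = allocatable.cpu_millicores - sum(r.cpu_millicores for r in existing_requests)
    free_mem = allocatable.memory_mib - sum(r.memory_mib for r in existing_requests)
    return free_cpu >= pod.cpu_millicores and free_mem >= pod.memory_mib


# Hypothetical 4-CPU / 16 GiB node hosting the container from the first waste pattern
node = Resources(cpu_millicores=4000, memory_mib=16384)
over_provisioned = Resources(cpu_millicores=2000, memory_mib=4096)  # really uses ~300m / 800Mi
right_sized = Resources(cpu_millicores=300, memory_mib=800)
incoming = Resources(cpu_millicores=2500, memory_mib=2048)

print(fits(node, [over_provisioned], incoming))  # False: only 2000m CPU left unreserved
print(fits(node, [right_sized], incoming))       # True: 3700m CPU left unreserved
```

The same incoming Pod is rejected or accepted purely because of what the existing Pod requested, regardless of how little CPU it actually burns.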
The scheduler acts on requests; limits are enforced at runtime through the cgroups the kubelet configures. A common right-sizing rule of thumb is to set requests near the P95 of observed usage and limits at 2-3x that value. Setting requests equal to limits (Guaranteed QoS) maximizes predictability but wastes capacity on bursty workloads, so reserve it for latency-critical pods.
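As a rough sketch of that rule of thumb, the snippet below derives a request from the 95th percentile of usage samples and a limit from a multiplier. The sample values, the nearest-rank percentile method, and the default factor of 2 are assumptions made for illustration; a real pipeline would pull samples from Prometheus over a window of days or weeks.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def right_size(samples: list[float], limit_factor: float = 2.0) -> tuple[float, float]:
    """Return (request, limit): request at P95 of usage, limit at limit_factor times the request."""
    request = percentile(samples, 95)
    return request, request * limit_factor


# Hypothetical CPU usage samples in millicores: mostly steady, one brief spike
cpu_samples_millicores = [250] * 18 + [320, 900]
request, limit = right_size(cpu_samples_millicores)
print(f"request={request}m limit={limit:.0f}m")  # request=320m limit=640m
```

Using P95 instead of the maximum means the single 900m spike does not inflate the reservation; the limit is what absorbs it.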
Measuring Idle Waste with Real Commands
You cannot right-size what you cannot measure. The first tool is kubectl top backed by the Metrics Server. For deeper historical analysis, query Prometheus directly using PromQL.
# Install Metrics Server (if not present)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Current CPU and memory usage per pod in a namespace
kubectl top pods -n production --sort-by=cpu
# Current usage per node, with CPU% and MEMORY% shown relative to allocatable
kubectl top nodes
# --- PromQL queries for efficiency analysis (run in Prometheus/Grafana) ---
# CPU request utilisation per namespace (ratio of actual to requested)
# Low values = over-provisioned; aim for 60-80%
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
/
sum(kube_pod_container_resource_requests{resource="cpu",container!=""}) by (namespace)
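# Memory waste: requested but unused, per pod, over last hour
# (both sides are aggregated to shared labels because the metrics come from different exporters)
(
  sum by (namespace, pod) (avg_over_time(kube_pod_container_resource_requests{resource="memory"}[1h]))
  -
  sum by (namespace, pod) (avg_over_time(container_memory_working_set_bytes{container!=""}[1h]))
) / (1024*1024) # result in MiB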
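# Containers with NO cpu request (dangerous scheduling); the request series is absent, not 0, when unset
kube_pod_container_info unless on (namespace, pod, container) kube_pod_container_resource_requests{resource="cpu"}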
# Unreserved CPU per node: allocatable minus the sum of pod requests (in cores)
sum(kube_node_status_allocatable{resource="cpu"}) by (node)
-
sum(kube_pod_container_resource_requests{resource="cpu"}) by (node)
A healthy cluster typically shows CPU utilisation at 50–70% of requests and memory at 60–75%. Values below 30% signal significant waste. Values above 85% signal the cluster is running hot: you are one burst away from CPU contention or memory-pressure evictions.
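Those thresholds are easy to encode when post-processing query results. The helper below is a minimal sketch: the function name and return labels are made up for illustration, and the 30% / 85% cut-offs are simply the ones quoted above.

```python
def classify_utilisation(used: float, requested: float) -> str:
    """Classify a usage/request ratio using the 30% and 85% thresholds from the text."""
    if requested <= 0:
        return "no-request"  # nothing reserved, so the ratio is undefined
    ratio = used / requested
    if ratio < 0.30:
        return "wasteful"
    if ratio > 0.85:
        return "hot"
    return "acceptable"


# Hypothetical namespaces: (CPU cores used, CPU cores requested)
for name, used, requested in [("checkout", 0.3, 2.0), ("search", 1.2, 2.0), ("ingest", 1.8, 2.0)]:
    print(name, classify_utilisation(used, requested))
```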
Bin-Packing: Scheduling More Pods Per Node
Bin-packing is the practice of fitting as many Pods as possible onto the fewest nodes. Since Kubernetes 1.22, the built-in NodeResourcesFit plugin supports a MostAllocated scoring strategy for this (the kubescheduler.config.k8s.io/v1 API version requires 1.25 or later). By default the scheduler uses LeastAllocated (spread), which is safer but more expensive. In cost-optimised clusters you flip this in the KubeSchedulerConfiguration file that kube-scheduler loads via its --config flag.
# KubeSchedulerConfiguration enabling bin-packing (MostAllocated)
# Save this as a file on the control-plane host, mount it into the kube-scheduler static pod,
# and point the scheduler at it with --config (static pods cannot reference ConfigMaps)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated # packs pods tightly instead of spreading
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
    plugins:
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation # stop favouring evenly balanced nodes
        enabled:
          - name: NodeResourcesFit
---
# On EKS/GKE/AKS the control plane is managed and the scheduler config is usually not editable;
# use the provider's own packing options or Karpenter consolidation instead.
# Karpenter's built-in consolidation achieves bin-packing automatically:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    # Removes or replaces nodes whose pods fit on other (or cheaper) capacity.
    # In v1beta1, consolidateAfter is only valid with WhenEmpty, so it is omitted here.
    consolidationPolicy: WhenUnderutilized
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
LeastAllocated spreads pods across 3 nodes, leaving large idle capacity; MostAllocated packs them onto 2 nodes, allowing the third to be terminated and saving its cost.
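The difference between the two strategies can be reproduced with a toy simulation. The sketch below is a deliberately simplified model, not the real scheduler: scores are the plain average of CPU and memory utilisation after placement (equal weights), ties go to the first node in the list (the real scheduler picks among ties differently), and the node and pod sizes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu: int  # allocatable millicores
    mem: int  # allocatable MiB
    used_cpu: int = 0
    used_mem: int = 0
    pods: list = field(default_factory=list)


def score(node: Node, cpu: int, mem: int, strategy: str) -> float:
    """Score 0-100 for placing the pod, based on utilisation after placement."""
    cpu_frac = (node.used_cpu + cpu) / node.cpu
    mem_frac = (node.used_mem + mem) / node.mem
    utilisation = (cpu_frac + mem_frac) / 2 * 100
    return utilisation if strategy == "MostAllocated" else 100 - utilisation


def schedule(pods, nodes, strategy):
    """Place each pod on the highest-scoring node that fits; return pods per node."""
    for pod_name, cpu, mem in pods:
        feasible = [n for n in nodes if n.used_cpu + cpu <= n.cpu and n.used_mem + mem <= n.mem]
        if not feasible:
            raise RuntimeError(f"{pod_name} is unschedulable")
        # max() returns the first best candidate, so ties go to the earliest node
        best = max(feasible, key=lambda n: score(n, cpu, mem, strategy))
        best.used_cpu += cpu
        best.used_mem += mem
        best.pods.append(pod_name)
    return {n.name: len(n.pods) for n in nodes}


def make_nodes():
    return [Node(f"node-{i}", cpu=4000, mem=8192) for i in range(1, 4)]


pods = [(f"pod-{i}", 1000, 2048) for i in range(1, 7)]
spread = schedule(pods, make_nodes(), "LeastAllocated")
packed = schedule(pods, make_nodes(), "MostAllocated")
print(spread)  # {'node-1': 2, 'node-2': 2, 'node-3': 2}
print(packed)  # {'node-1': 4, 'node-2': 2, 'node-3': 0}
```

With MostAllocated the third node never receives a Pod, which is what lets an autoscaler remove it.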
Vertical Pod Autoscaler (VPA): Automated Right-Sizing
The Vertical Pod Autoscaler watches actual resource consumption over time and recommends — or automatically applies — right-sized requests and limits. It solves the human problem: engineers almost never revisit manifests after initial deployment.
VPA has four update modes: Off, Initial, Recreate, and Auto. In production, start with Off (recommendation only) or Initial (applied at pod creation, no live restarts) before graduating to Auto. In Auto mode VPA evicts and recreates pods when their requests drift significantly from the recommendation, so have PodDisruptionBudgets in place before enabling it.
# Install VPA (requires metrics-server)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
# VPA object targeting a Deployment — recommendation mode (safe starting point)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off" # Recommend only; use "Auto" once PDBs are in place
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
# Check VPA recommendations
kubectl describe vpa api-server-vpa -n production
# Look for: Status > Recommendation > ContainerRecommendations
# Lower Bound = lowest recommended request | Target = recommended request | Upper Bound = highest recommended request (an estimate, not an observed peak)
# Apply recommended values manually (VPA Off mode workflow)
kubectl set resources deployment/api-server \
  --requests=cpu=320m,memory=420Mi \
  --limits=cpu=1200m,memory=1200Mi \
  -n production
VPA and HPA should not both act on CPU for the same Deployment: VPA changes the requests that HPA's utilisation target is computed against, so the two fight each other. If you use HPA for horizontal scaling based on CPU, configure VPA to control only memory (controlledResources: ["memory"] in the container policy). This gives you the best of both worlds: HPA scales out under load, VPA keeps memory requests honest.
Namespace Resource Quotas and LimitRanges
At the cluster level, cost governance starts with ResourceQuota (caps on total resource consumption per namespace) and LimitRange (per-container defaults and boundaries). These are your policy levers: teams cannot accidentally over-provision because the built-in admission controllers reject the offending Pod before it ever reaches a node.
# ResourceQuota: hard ceiling for the team-frontend namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-frontend-quota
  namespace: team-frontend
spec:
  hard:
    requests.cpu: "20" # max 20 CPU cores requested across all pods
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "60" # max 60 pods in this namespace
    persistentvolumeclaims: "10"
---
# LimitRange: default requests/limits injected when a pod omits them
apiVersion: v1
kind: LimitRange
metadata:
  name: team-frontend-limits
  namespace: team-frontend
spec:
  limits:
    - type: Container
      default: # applied when no limit is specified
        cpu: 500m
        memory: 512Mi
      defaultRequest: # applied when no request is specified
        cpu: 100m
        memory: 128Mi
      max: # hard ceiling per container
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 10m
        memory: 32Mi
ResourceQuota rejects any new Pod that would push the namespace over a hard limit. Without monitoring, engineers will receive a cryptic "exceeded quota" error during a production deploy at the worst possible moment. Alert on namespace quota utilisation above 80% using this PromQL (ignoring(type) is needed because the two series differ in their type label): kube_resourcequota{type="used"} / ignoring(type) kube_resourcequota{type="hard"} > 0.8. Page the platform team at 90%.
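The quota arithmetic and the two alert thresholds can be sketched as follows. This is an illustrative model only: the function names and the sample numbers are hypothetical, quantities are pre-converted to plain integers (CPU in millicores), and real quota admission also checks things this sketch ignores, such as requiring every Pod to declare the tracked resources.

```python
def admits(used: dict, hard: dict, pod: dict) -> bool:
    """A pod is admitted only if every tracked resource stays at or below its hard cap."""
    return all(used.get(key, 0) + pod.get(key, 0) <= cap for key, cap in hard.items())


def alert_level(used: dict, hard: dict) -> str:
    """Return 'page' at 90% or more, 'warn' above 80%, otherwise 'ok', for the fullest resource."""
    ratio = max((used.get(key, 0) / cap for key, cap in hard.items() if cap > 0), default=0.0)
    if ratio >= 0.9:
        return "page"
    if ratio > 0.8:
        return "warn"
    return "ok"


# Hypothetical state of the team-frontend namespace (CPU in millicores)
hard = {"requests.cpu": 20000, "pods": 60}
used = {"requests.cpu": 17000, "pods": 41}

print(alert_level(used, hard))                                  # warn (CPU at 85%)
print(admits(used, hard, {"requests.cpu": 2000, "pods": 1}))    # True  (19000 <= 20000)
print(admits(used, hard, {"requests.cpu": 4000, "pods": 1}))    # False (21000 > 20000)
```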
Identifying and Eliminating Idle Workloads
Beyond right-sizing running pods, significant waste comes from workloads that should not be running at all: staging environments left on over weekends, forgotten Deployments from deprecated features, CronJobs whose output nobody consumes. A standard audit:
# Find Deployments with 0 ready replicas (likely broken or forgotten)
kubectl get deployments -A -o json \
  | jq '.items[] | select(.status.readyReplicas == 0 or .status.readyReplicas == null)
        | {ns: .metadata.namespace, name: .metadata.name}'
# List Pending pods with their creation time; anything older than ~10 minutes is likely
# unschedulable (wasted quota). The command itself does not filter by age.
kubectl get pods -A --field-selector=status.phase=Pending \
  -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,AGE:.metadata.creationTimestamp'
# Most common request shapes: counts containers by (namespace, cpu, memory) combination.
# This does not sum quantities; true totals need unit-aware parsing of values like 500m or 1Gi.
kubectl get pods -A -o json | jq -r '
  .items[] | .metadata.namespace as $ns |
  .spec.containers[].resources.requests |
  [$ns, (.cpu // "0"), (.memory // "0")] | @csv' \
  | sort | uniq -c | sort -rn | head -20
# Scale down non-production namespaces outside business hours (use a CronJob)
kubectl scale deployment --all --replicas=0 -n staging
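Finding the biggest spender requires summing requests per namespace, which means converting Kubernetes quantities such as 500m or 1Gi to common units first. The Python sketch below does this for the JSON shape returned by kubectl get pods -A -o json. It is a hedged illustration: the parser handles only a common subset of quantity suffixes, the function names are made up, and the sample data is hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# Subset of Kubernetes quantity suffixes (binary and decimal); not exhaustive
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(quantity: str) -> Decimal:
    """Convert '250m', '2', '0.5', '512Mi' or '1G' to base units (cores or bytes)."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * factor
    return Decimal(quantity)


def totals_by_namespace(pod_list: dict) -> dict:
    """Sum container requests per namespace; returns CPU in millicores and memory in MiB."""
    cpu = defaultdict(Decimal)
    mem = defaultdict(Decimal)
    for pod in pod_list.get("items", []):
        ns = pod["metadata"]["namespace"]
        for container in pod["spec"]["containers"]:
            requests = (container.get("resources") or {}).get("requests") or {}
            cpu[ns] += parse_quantity(requests.get("cpu", "0"))
            mem[ns] += parse_quantity(requests.get("memory", "0"))
    return {
        ns: {"cpu_millicores": int(cpu[ns] * 1000), "memory_mib": int(mem[ns] / 2**20)}
        for ns in cpu
    }


# Hypothetical, heavily trimmed output of `kubectl get pods -A -o json`
sample = {"items": [
    {"metadata": {"namespace": "payments"}, "spec": {"containers": [
        {"resources": {"requests": {"cpu": "500m", "memory": "1Gi"}}},
        {"resources": {}},  # sidecar with no requests
    ]}},
    {"metadata": {"namespace": "payments"}, "spec": {"containers": [
        {"resources": {"requests": {"cpu": "2", "memory": "512Mi"}}},
    ]}},
    {"metadata": {"namespace": "staging"}, "spec": {"containers": [
        {"resources": {"requests": {"cpu": "0.25", "memory": "256Mi"}}},
    ]}},
]}

for ns, total in sorted(totals_by_namespace(sample).items(), key=lambda kv: -kv[1]["cpu_millicores"]):
    print(ns, total)
```

To run it against a live cluster, the sample dictionary could be replaced with json.load(sys.stdin) and the kubectl output piped in.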
Cost Allocation: Chargeback and Showback
Right-sizing fixes the technical problem; chargeback fixes the incentive problem. When teams do not see their cloud bill, they have no reason to care about efficiency. The standard approach at big-tech companies is to label every resource with team and environment labels, then aggregate cost by those labels using tools like OpenCost (CNCF project, free) or Kubecost.
The minimum viable labeling convention — enforce it via an admission webhook that rejects pods missing these labels:
team: payments — owning team slug
env: production — environment (production / staging / dev)
component: api — service component
cost-center: CC-1042 — finance code for chargeback
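The check such a webhook performs can be sketched as a small Python function. This shows only the validation logic, under the assumption that the four labels above are mandatory and that env is restricted to the three listed values; the function name and messages are hypothetical, and a real webhook would wrap this in an AdmissionReview handler.

```python
REQUIRED_LABELS = ("team", "env", "component", "cost-center")
ALLOWED_ENVS = {"production", "staging", "dev"}


def validate_labels(labels: dict) -> tuple[bool, str]:
    """Return (allowed, reason); reason is empty when the pod's labels pass."""
    missing = [key for key in REQUIRED_LABELS if not labels.get(key)]
    if missing:
        return False, "missing required labels: " + ", ".join(missing)
    if labels["env"] not in ALLOWED_ENVS:
        return False, "env must be one of: " + ", ".join(sorted(ALLOWED_ENVS))
    return True, ""


print(validate_labels({"team": "payments", "env": "production",
                       "component": "api", "cost-center": "CC-1042"}))
print(validate_labels({"team": "payments", "env": "prod", "component": "api",
                       "cost-center": "CC-1042"}))
print(validate_labels({"team": "payments"}))
```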
Spot/Preemptible instances are the single biggest lever. Stateless, fault-tolerant workloads (web servers, batch processors, async workers) can run 60–80% cheaper on Spot. Use Karpenter NodePools (or Cluster Autoscaler node groups) to schedule Spot-tolerant pods onto Spot instances automatically. To keep a baseline of replicas on on-demand capacity, use node affinity or topology spread constraints on the capacity-type node label (for example karpenter.sh/capacity-type); PodDisruptionBudgets only limit voluntary disruptions such as node drains and do not control where pods land. This single change often accounts for 40%+ of a cluster's cost reduction.
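A quick back-of-the-envelope calculation shows how a Spot discount in that range turns into a cluster-wide saving. The figures below (a 100,000 monthly bill, 60% of capacity moved to Spot, a 70% discount) are hypothetical inputs chosen for illustration, not measurements.

```python
def blended_cost(on_demand_cost: float, spot_fraction: float, spot_discount: float) -> float:
    """Cost after moving spot_fraction of capacity to Spot priced at (1 - spot_discount) of on-demand."""
    if not (0 <= spot_fraction <= 1 and 0 <= spot_discount <= 1):
        raise ValueError("fractions must be between 0 and 1")
    return on_demand_cost * ((1 - spot_fraction) + spot_fraction * (1 - spot_discount))


baseline = 100_000.0  # hypothetical all-on-demand monthly bill
new_cost = blended_cost(baseline, spot_fraction=0.6, spot_discount=0.7)
saving_pct = (baseline - new_cost) / baseline * 100
print(f"new cost: {new_cost:.0f}, saving: {saving_pct:.0f}%")  # new cost: 58000, saving: 42%
```

The saving scales with both the share of workloads that tolerate interruption and the discount actually obtained, so the stateless fraction of the fleet is usually the limiting factor.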
Cost efficiency in Kubernetes is not a one-time exercise. It is a continuous feedback loop: measure actual usage, right-size requests, enable bin-packing, enforce quotas, label everything, and review the chargeback report monthly. The clusters that run leanest are the ones where platform teams have made efficiency data visible and actionable for every engineering team that deploys to them.