Advanced Kubernetes Operations

Cost & Efficiency in Kubernetes

Cloud infrastructure bills at the major providers scale with the node capacity you provision, which is driven by the CPU and memory your Pods reserve, not by what your workloads actually consume. In a mature Kubernetes cluster, 30–60% of allocated resources frequently sit idle, so you pay for capacity that never serves real traffic. This lesson is about closing that gap: understanding where waste originates, how to measure it precisely, and how to use Kubernetes scheduling mechanics to run more work on fewer nodes.

These techniques are not optional tuning. At big-tech scale (thousands of nodes, hundreds of teams), a 10% efficiency improvement can translate into millions of dollars per year and meaningfully reduce the carbon footprint of your fleet.

The Root Cause: Requests vs. Limits vs. Actual Usage

Every Pod spec can declare two resource dimensions per container: requests (what the scheduler reserves) and limits (the ceiling the kernel enforces at runtime). The scheduler only sees requests; it places a Pod on a node when node.allocatable - sum(existing_pod_requests) >= pod_requests. Limits are irrelevant to placement.
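The placement rule above can be sketched in a few lines of Python. This is a simplified illustration, not the scheduler's actual code: the function name `fits` and the millicore/MiB dictionaries are assumptions made for the example, and the real scheduler also applies taints, affinity, and other filters.

```python
from typing import Dict, List

Resources = Dict[str, int]  # hypothetical units: {"cpu_m": millicores, "mem_mi": MiB}


def fits(allocatable: Resources, existing: List[Resources], pod: Resources) -> bool:
    """Return True if the pod's requests fit into the node's unreserved capacity.

    Only requests are considered; limits play no part in placement.
    """
    for resource, requested in pod.items():
        reserved = sum(p.get(resource, 0) for p in existing)
        if allocatable.get(resource, 0) - reserved < requested:
            return False
    return True


node = {"cpu_m": 4000, "mem_mi": 16384}
running = [{"cpu_m": 2000, "mem_mi": 4096}, {"cpu_m": 1500, "mem_mi": 2048}]

print(fits(node, running, {"cpu_m": 500, "mem_mi": 1024}))  # True: exactly 500m is unreserved
print(fits(node, running, {"cpu_m": 600, "mem_mi": 1024}))  # False: only 500m is unreserved
```

Note that the second pod is rejected even if the running pods are using almost no CPU at that moment; reserved capacity, not live usage, decides placement.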

This creates three distinct waste patterns:

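Over-provisioned requests: A container requests 2 CPU / 4 Gi but typically consumes 0.3 CPU / 800 Mi. The scheduler reserves the full 2 CPU on a node, and no other Pod can be scheduled into that headroom even though about 85% of the CPU (and about 80% of the memory) goes unused.
Missing requests (no-request pods): A container with no resource requests and no limits is scheduled as if it needs zero CPU and memory, and its Pod lands in the BestEffort QoS class. This looks cheap but is fragile: under node pressure the kubelet first evicts pods whose usage exceeds their requests, which puts BestEffort pods at the front of the queue and causes unpredictable outages. (If only a limit is set, the request defaults to the limit.)
Limits set too high (or unlimited): With no CPU limit, a container is never throttled and can absorb all idle CPU on the node; under contention, CPU time is divided in proportion to requests, so neighbors with low or missing requests are squeezed hardest. With no memory limit, a leaking container can drive the whole node out of memory, at which point the kernel OOM killer or kubelet eviction picks victims by QoS class and usage relative to requests, and the victim is not necessarily the container that leaked.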
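The QoS classes mentioned in this list are derived from each container's requests and limits. The sketch below follows the documented classification rules; the function name `qos_class` and the plain-dict container shape are assumptions for illustration, and quantities are compared as strings for brevity (so "1" and "1000m" would wrongly count as different).

```python
from typing import Dict, List

Container = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Container]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort from its containers."""
    any_set = False
    all_guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # When a limit is set without a request, the request defaults to the limit.
        requests = {**limits, **c.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{}]))                                               # BestEffort
print(qos_class([{"requests": {"cpu": "100m"}}]))                    # Burstable
print(qos_class([{"limits": {"cpu": "500m", "memory": "256Mi"}}]))   # Guaranteed
```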

The scheduler acts on requests; limits are enforced at runtime through cgroups that the kubelet configures. As a starting rule of thumb, right-sizing means setting requests near the P95 of actual usage and limits at 2-3x that value. Setting requests equal to limits for both CPU and memory on every container (Guaranteed QoS) maximizes scheduler predictability but wastes capacity on bursty workloads, so reserve it for latency-critical pods.
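As a rough illustration of the P95 rule of thumb, the sketch below derives a CPU request and limit from usage samples. The helper names (`percentile`, `suggest_cpu`), the nearest-rank percentile method, and the sample values are all assumptions for the example; in practice the samples would come from your metrics store.

```python
import math
from typing import List, Tuple


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_cpu(samples_m: List[float], limit_factor: float = 2.0) -> Tuple[int, int]:
    """Return (request, limit) in millicores: request = P95 of usage, limit = factor x request."""
    request = math.ceil(percentile(samples_m, 95))
    return request, math.ceil(request * limit_factor)


# Hypothetical CPU usage samples in millicores, including one 900m spike.
usage = [210, 250, 260, 280, 290, 300, 300, 310, 320, 330,
         340, 350, 360, 380, 400, 420, 450, 480, 520, 900]
print(suggest_cpu(usage))  # (520, 1040)
```

The single 900m spike does not inflate the request; it is absorbed by the headroom between request and limit.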

Measuring Idle Waste with Real Commands

You cannot right-size what you cannot measure. The first tool is kubectl top backed by the Metrics Server. For deeper historical analysis, query Prometheus directly using PromQL.

# Install Metrics Server (if not present)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Current CPU and memory usage per pod in a namespace
kubectl top pods -n production --sort-by=cpu

# Current usage per node, in absolute terms and as a percentage of allocatable
kubectl top nodes

# --- PromQL queries for efficiency analysis (run in Prometheus/Grafana) ---

# CPU request utilisation per namespace (ratio of actual to requested)
# Low values = over-provisioned; aim for roughly 50-70%
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
  /
sum(kube_pod_container_resource_requests{resource="cpu",container!=""}) by (namespace)

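# Memory waste: requested but unused, per pod, over last hour
# Both sides are aggregated to (namespace, pod) so their label sets match
(
  sum by (namespace, pod) (avg_over_time(kube_pod_container_resource_requests{resource="memory"}[1h]))
  -
  sum by (namespace, pod) (avg_over_time(container_memory_working_set_bytes{container!=""}[1h]))
) / (1024*1024)   # result in MiB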

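# Containers with NO cpu request (dangerous scheduling)
# kube-state-metrics emits no request series when the request is unset,
# so look for containers that lack one instead of comparing against 0
kube_pod_container_info
  unless on (namespace, pod, container)
kube_pod_container_resource_requests{resource="cpu"}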

# Unreserved CPU per node: allocatable minus the sum of pod requests (in cores)
sum(kube_node_status_allocatable{resource="cpu"}) by (node)
  -
sum(kube_pod_container_resource_requests{resource="cpu"}) by (node)

A healthy cluster typically shows CPU usage at 50–70% of requests and memory usage at 60–75% of requests. Values below 30% signal significant waste. Values above 85% signal the cluster is running hot: you are one burst away from CPU contention and throttling, or from OOM kills and evictions in the case of memory.
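The thresholds above are easy to encode when post-processing query results. A minimal sketch, assuming you already have usage and request figures in the same unit; the function name and bucket labels are hypothetical.

```python
def classify_utilisation(used: float, requested: float) -> str:
    """Bucket a usage/request ratio: below 30% is waste, above 85% is running hot."""
    if requested <= 0:
        return "no-request"  # nothing reserved, so a ratio is meaningless
    ratio = used / requested
    if ratio < 0.30:
        return "wasteful"
    if ratio > 0.85:
        return "hot"
    return "ok"


print(classify_utilisation(0.3, 2.0))  # wasteful (15% of request)
print(classify_utilisation(1.3, 2.0))  # ok (65% of request)
print(classify_utilisation(1.8, 2.0))  # hot (90% of request)
```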

Bin-Packing: Scheduling More Pods Per Node

Bin-packing is the practice of fitting as many Pods as possible onto the fewest nodes. Since Kubernetes 1.22, the built-in NodeResourcesFit plugin supports a bin-packing scoring strategy called MostAllocated. By default the scheduler uses LeastAllocated (spread), which is safer but more expensive. In cost-optimised clusters you flip this in the KubeSchedulerConfiguration file passed to kube-scheduler via its --config flag.

# KubeSchedulerConfiguration enabling bin-packing (MostAllocated); the v1 config API needs Kubernetes 1.25+
# Save this as a file on the control-plane node, mount it into the kube-scheduler static pod, and pass --config=<path>
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated          # packs pods tightly instead of spreading
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
    plugins:
      score:
        disabled:
          - name: NodeResourcesBalancedAllocation   # stop favouring balanced CPU/memory use, which works against packing
        enabled:
          - name: NodeResourcesFit

---
# On EKS/GKE/AKS the control plane is managed, so you usually cannot edit the kube-scheduler config directly
# Karpenter's consolidation gets a similar result at the node level:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    # Remove or replace nodes whose pods fit on other (or cheaper) nodes.
    # In v1beta1, consolidateAfter is only valid with WhenEmpty, so it is omitted here.
    consolidationPolicy: WhenUnderutilized
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]

LeastAllocated spreads pods across 3 nodes, leaving large idle capacity; MostAllocated packs them onto 2 nodes, allowing the third to be terminated and saving its cost.
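The difference between the two strategies comes down to how each candidate node is scored. The sketch below is a simplified model of that scoring (utilisation fraction scaled to 0-100, then weighted), not the scheduler's source; the function name, node names, and capacities are assumptions for illustration.

```python
from typing import Dict

MAX_SCORE = 100


def node_score(allocatable: Dict[str, int], requested: Dict[str, int],
               weights: Dict[str, int], strategy: str) -> float:
    """Weighted score for one node; `requested` already includes the incoming pod."""
    total, weight_sum = 0.0, 0
    for resource, weight in weights.items():
        fraction = requested[resource] / allocatable[resource]
        if strategy == "MostAllocated":
            score = fraction * MAX_SCORE          # fuller nodes score higher
        else:  # LeastAllocated
            score = (1 - fraction) * MAX_SCORE    # emptier nodes score higher
        total += score * weight
        weight_sum += weight
    return total / weight_sum


weights = {"cpu": 1, "memory": 1}
alloc = {"cpu": 4000, "memory": 8192}             # millicores, MiB
nodes = {
    "node-a": {"cpu": 3000, "memory": 6144},      # busy node (75% requested)
    "node-b": {"cpu": 1000, "memory": 2048},      # mostly idle node (25% requested)
}
for strategy in ("LeastAllocated", "MostAllocated"):
    best = max(nodes, key=lambda n: node_score(alloc, nodes[n], weights, strategy))
    print(strategy, "->", best)
# LeastAllocated -> node-b
# MostAllocated -> node-a
```

With MostAllocated the new pod joins the busy node, which leaves the idle node free to drain and be removed.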

Vertical Pod Autoscaler (VPA): Automated Right-Sizing

The Vertical Pod Autoscaler watches actual resource consumption over time and recommends — or automatically applies — right-sized requests and limits. It solves the human problem: engineers almost never revisit manifests after initial deployment.

VPA's update modes include Off, Initial, Recreate, and Auto. In production, start with Off (recommendation only) or Initial (requests set at pod creation, no live restarts) before graduating to Auto. In Auto mode, VPA evicts and recreates pods when their requests drift significantly from the recommendation, so put PodDisruptionBudgets in place before enabling it.

# Install VPA (requires metrics-server)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# VPA object targeting a Deployment — recommendation mode (safe starting point)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"         # Recommend only; use "Auto" once PDBs are in place
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]

# Check VPA recommendations
kubectl describe vpa api-server-vpa -n production
# Look for: Status > Recommendation > ContainerRecommendations
# Lower Bound = below this the pod is likely under-provisioned  |  Target = recommended request  |  Upper Bound = above this the request is likely wasteful

# Apply recommended values manually (VPA Off mode workflow)
kubectl set resources deployment/api-server \
  --requests=cpu=320m,memory=420Mi \
  --limits=cpu=1200m,memory=1200Mi \
  -n production

VPA and HPA should not both act on the same metric for the same Deployment: if both react to CPU, they work against each other. If you use HPA for horizontal scaling based on CPU, configure VPA to control only memory (controlledResources: ["memory"]). This gives you the best of both worlds: HPA scales out under load, VPA keeps memory requests honest.

Namespace Resource Quotas and LimitRanges

At the cluster level, cost governance starts with ResourceQuota (caps on total resource consumption per namespace) and LimitRange (enforces per-container defaults and boundaries). These are your policy levers: teams cannot accidentally over-provision because the built-in admission controllers reject an offending Pod before it ever reaches a node.

# ResourceQuota: hard ceiling for the team-frontend namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-frontend-quota
  namespace: team-frontend
spec:
  hard:
    requests.cpu: "20"           # max 20 CPU cores requested across all pods
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "60"                   # max 60 pods in this namespace
    persistentvolumeclaims: "10"

---
# LimitRange: default requests/limits injected when a pod omits them
apiVersion: v1
kind: LimitRange
metadata:
  name: team-frontend-limits
  namespace: team-frontend
spec:
  limits:
    - type: Container
      default:              # applied when no limit is specified
        cpu: 500m
        memory: 512Mi
      defaultRequest:       # applied when no request is specified
        cpu: 100m
        memory: 128Mi
      max:                  # hard ceiling per container
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 10m
        memory: 32Mi

Once a namespace hits its quota, ResourceQuota rejects every new Pod that would exceed it. For a Deployment, the "exceeded quota" error surfaces only as an event on the ReplicaSet, so without monitoring engineers tend to discover it during a production deploy at the worst possible moment. Alert on namespace quota utilisation above 80% using this PromQL: kube_resourcequota{type="used"} / ignoring(type) kube_resourcequota{type="hard"} > 0.8. Page the platform team at 90%.
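The two alert tiers can be expressed as a small helper, for example in a reporting script that reads quota status. This is a sketch under assumptions: the function name, the dict inputs, and the sample numbers are hypothetical, and the thresholds are the 80% and 90% figures from the text.

```python
from typing import Dict


def quota_alerts(used: Dict[str, float], hard: Dict[str, float],
                 warn: float = 0.80, page: float = 0.90) -> Dict[str, str]:
    """Map each quota resource to 'warn' (above 80%) or 'page' (90% and up)."""
    alerts = {}
    for resource, limit in hard.items():
        if limit <= 0:
            continue  # skip zero quotas to avoid division by zero
        ratio = used.get(resource, 0) / limit
        if ratio >= page:
            alerts[resource] = "page"
        elif ratio > warn:
            alerts[resource] = "warn"
    return alerts


hard = {"requests.cpu": 20, "requests.memory": 40, "pods": 60}
used = {"requests.cpu": 17, "requests.memory": 20, "pods": 57}
print(quota_alerts(used, hard))  # {'requests.cpu': 'warn', 'pods': 'page'}
```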

Identifying and Eliminating Idle Workloads

Beyond right-sizing running pods, significant waste comes from workloads that should not be running at all: staging environments left on over weekends, forgotten Deployments from deprecated features, CronJobs whose output nobody consumes. A standard audit:

# Find Deployments that want replicas but have none ready (likely broken or forgotten)
kubectl get deployments -A -o json \
  | jq '.items[] | select(.spec.replicas > 0 and (.status.readyReplicas // 0) == 0)
        | {ns: .metadata.namespace, name: .metadata.name}'

# List Pending pods, oldest first; anything Pending for more than ~10 minutes is probably unschedulable and still counts against quota
kubectl get pods -A --field-selector=status.phase=Pending \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,CREATED:.metadata.creationTimestamp'

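# Total CPU requested per namespace in millicores, biggest spender first
# (memory quantities need unit parsing and are omitted here)
kubectl get pods -A -o json | jq -r '
  def mcpu: if . == null then 0
            elif endswith("m") then (rtrimstr("m") | tonumber)
            else (tonumber * 1000) end;
  [.items[] | {ns: .metadata.namespace,
               cpu: ([.spec.containers[].resources.requests.cpu | mcpu] | add)}]
  | group_by(.ns) | map({ns: .[0].ns, cpu: (map(.cpu) | add)})
  | sort_by(.cpu) | reverse | .[:20][] | "\(.cpu)m\t\(.ns)"'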

# Scale down non-production namespaces outside business hours (use a CronJob)
kubectl scale deployment --all --replicas=0 -n staging

Cost Allocation: Chargeback and Showback

Right-sizing fixes the technical problem; chargeback fixes the incentive problem. When teams do not see their cloud bill, they have no reason to care about efficiency. The standard approach at big-tech companies is to label every resource with team and environment labels, then aggregate cost by those labels using tools like OpenCost (CNCF project, free) or Kubecost.

The minimum viable labeling convention — enforce it via an admission webhook that rejects pods missing these labels:

team: payments — owning team slug
env: production — environment (production / staging / dev)
component: api — service component
cost-center: CC-1042 — finance code for chargeback
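The check such a webhook performs is simple. Below is a minimal sketch of the validation logic only, without the webhook server or AdmissionReview plumbing; the function name and the rule that `env` must be one of the three listed values are assumptions based on the convention above.

```python
from typing import Dict, List

REQUIRED_LABELS = ("team", "env", "component", "cost-center")
ALLOWED_ENVS = {"production", "staging", "dev"}


def label_violations(labels: Dict[str, str]) -> List[str]:
    """Return the reasons a pod's labels fail the convention (empty list = compliant)."""
    problems = [f"missing label: {key}" for key in REQUIRED_LABELS if not labels.get(key)]
    env = labels.get("env")
    if env and env not in ALLOWED_ENVS:
        problems.append(f"invalid env: {env}")
    return problems


good = {"team": "payments", "env": "production", "component": "api", "cost-center": "CC-1042"}
bad = {"team": "payments", "env": "prod"}
print(label_violations(good))  # []
print(label_violations(bad))
# ['missing label: component', 'missing label: cost-center', 'invalid env: prod']
```

Returning the full list of problems, rather than failing on the first, lets the rejection message tell the engineer everything to fix in one pass.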

Spot/Preemptible instances are often the single biggest lever. Stateless, fault-tolerant workloads (web servers, batch processors, async workers) can often run 60–80% cheaper on Spot. Use Karpenter NodePools (or Cluster Autoscaler node groups) to schedule Spot-tolerant pods onto Spot instances automatically. Use PodDisruptionBudgets to bound how many replicas are drained at once, and node affinity or topology spread constraints on the capacity-type label to keep at least one replica on on-demand, since a PDB alone does not control where replicas run. This single change can account for 40% or more of a cluster's cost reduction.

Cost efficiency in Kubernetes is not a one-time exercise. It is a continuous feedback loop: measure actual usage, right-size requests, enable bin-packing, enforce quotas, label everything, and review the chargeback report monthly. The clusters that run leanest are the ones where platform teams have made efficiency data visible and actionable for every engineering team that deploys to them.