Kubernetes Cost Optimization: Where Bills Leak

Kubernetes does not waste money by accident. Every dollar of overspend traces back to an instruction someone gave it: a resource request copied from a tutorial, a minimum replica count set "to be safe," an autoscaler watching the wrong signal. The cluster is doing exactly what it was told. That is what makes Kubernetes cost optimization different from ordinary cloud cleanup — the waste is structural, encoded in YAML, and invisible on the invoice.

This guide maps the five places we consistently find money leaking when we audit production clusters: resource requests, node bin-packing, autoscaling behavior, capacity pricing, and the traffic and telemetry nobody graphs. For each leak: how it works mechanically, what fixing it actually costs, and the failure modes that punish a careless fix.

Why a Kubernetes bill is hard to read

The cloud provider bills you for nodes. Your teams operate pods. Nothing on the invoice connects the two, and that gap is where the money disappears.

There are three layers in play: what an application actually uses, what its pod requests, and what the underlying node provides. Money leaks at both boundaries. A pod that requests four times the CPU it uses reserves capacity nobody else can schedule onto — the bill reflects the reservation, not the usage. A node that is 60% allocated strands the remaining 40% if no pending pod fits the leftover shape. Your dashboards can show healthy, low CPU graphs while the cluster runs at poor real utilization, because cost follows allocation, not consumption.
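To make the three layers concrete, here is a minimal Python sketch using made-up numbers. The node size, hourly price, pod figures, and the function name are illustrative assumptions, not data from a real cluster.

```python
def allocation_report(node_cpu, node_hourly_cost, pods):
    """Compare what one node provides, what pods request, and what they use.

    pods is a list of (requested_cpu, used_cpu) tuples, in cores.
    """
    requested = sum(req for req, _ in pods)
    used = sum(use for _, use in pods)
    return {
        "allocated_fraction": requested / node_cpu,
        "utilized_fraction": used / node_cpu,
        # What each core-hour of real work costs once idle reservations are priced in
        "cost_per_used_core_hour": node_hourly_cost / used if used else float("inf"),
    }


# Hypothetical 16-core node at $0.80/hour; four pods each request 2.4 cores and use 0.6
report = allocation_report(16, 0.80, [(2.4, 0.6)] * 4)
print(report)
```

In this example the node is about 60% allocated but only about 15% utilized. Each core-hour of actual work costs roughly $0.33, against a list price of $0.05 per core-hour on the same node.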

This is why the first move in any serious effort is per-workload cost allocation — OpenCost, Kubecost, or the cloud provider's native cost allocation mapped to namespaces and labels. Until each team and service has a number attached to it, optimization is guessing, and nobody changes a resource request they cannot see the price of.

Leak 1: resource requests nobody revisits

The scheduler places pods based on requests, and requests are reservations: they hold capacity whether or not the process uses it. In most clusters we audit, requests were set once — often copied from a Helm chart default or an internal template — and never measured against reality. Engineers over-request out of rational fear: under-provisioning gets you paged, over-provisioning gets you nothing but a quieter on-call. The incentive structure produces waste by default.

The fix is measurement, not intuition. Watch real usage over a representative window — weeks, including your traffic peaks and batch cycles, not a quiet Tuesday — and set requests near observed peak usage with deliberate headroom. The Vertical Pod Autoscaler in recommendation mode does this math for you per container; you do not have to let it apply changes to benefit from its numbers.
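A rough sketch of the "observed peak plus headroom" rule, in Python. This is not the Vertical Pod Autoscaler's algorithm. The percentile, headroom value, and function name are assumptions chosen for illustration, and the samples are expected to come from whatever metrics store you already run.

```python
import math


def recommend_request(samples, percentile=0.99, headroom=0.15):
    """Suggest a request from usage samples: a high percentile plus headroom."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` of all samples at or below it
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] * (1 + headroom)


# Hypothetical CPU samples in millicores; percentile=1.0 means "observed peak"
cpu_millicores = [100, 400, 200, 300]
print(recommend_request(cpu_millicores, percentile=1.0, headroom=0.25))
```

A lower percentile with less headroom suits CPU, where the penalty is throttling. For memory, staying at or near the observed peak is the safer default.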

The trade-offs are asymmetric, and this is where careless right-sizing causes incidents:

CPU is compressible. A pod exceeding its CPU share gets throttled, not killed. You can be aggressive cutting CPU requests; the failure mode is latency, which you can observe and walk back.
Memory is not compressible. A container that exceeds its memory limit, or that is running when the node itself runs out of memory, gets OOM-killed. Be conservative with memory requests, and set the memory limit equal to the request so the scheduler's picture matches the kernel's enforcement.
CPU limits deserve suspicion. Hard CPU limits cause throttling that shows up as tail latency long before it shows up as an alert. For latency-sensitive services on nodes you control, running honest requests without CPU limits is often the better posture; keep limits where multi-tenancy makes noisy neighbors a real risk.
Runtimes lie to the scheduler. A JVM sized by flags ignores pod boundaries unless told otherwise; a single-threaded Node.js process cannot use the four cores you requested for it.

The recurring failure mode is the heroic one-off: a sprint of right-sizing, a satisfying graph, and then drift quietly returns as new services ship with template defaults. Make requests a reviewed artifact — admission policies can reject workloads without sane requests using the same machinery that enforces your security baseline, and a periodic VPA-versus-actual report keeps drift visible.
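As an illustration of what "a reviewed artifact" could mean in practice, the sketch below lints a single container spec against the rules in this section. It is a hypothetical CI-style check in Python, not an admission controller. The function name and messages are invented, and quantities are compared as raw strings, so "0.5Gi" and "512Mi" would be treated as different.

```python
def lint_resources(container):
    """Return findings for one container spec, given as a dict shaped like a pod manifest entry."""
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    findings = []
    for name in ("cpu", "memory"):
        if name not in requests:
            findings.append(f"missing {name} request")
    # Memory is not compressible: the limit should match the request
    if "memory" in requests and limits.get("memory") != requests["memory"]:
        findings.append("memory limit should equal memory request")
    # CPU limits are allowed, but should be a deliberate choice
    if "cpu" in limits:
        findings.append("cpu limit set: confirm throttling is intended")
    return findings


good = {"name": "api", "resources": {
    "requests": {"cpu": "250m", "memory": "512Mi"},
    "limits": {"memory": "512Mi"},
}}
print(lint_resources(good))            # no findings
print(lint_resources({"name": "x"}))   # both requests missing
```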

Leak 2: node shape and the bin-packing tax

Pods are Tetris pieces; nodes are the board. When pod shapes and node shapes do not match, capacity strands: a node with memory to spare but no schedulable CPU left is capacity you bought and cannot use. Across a large cluster, fragmentation alone can hold entire nodes' worth of stranded resources.

There is also a fixed tax per node that few teams price in: every node runs its DaemonSets — log shipper, CNI, monitoring agent, security tooling — and that overhead is multiplied by node count. Many small nodes maximize the tax; a few enormous nodes minimize it but concentrate blast radius and make scale-up increments expensive. Neither extreme is right, which is why static node groups age badly.
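The per-node tax is easy to estimate. The Python sketch below assumes a hypothetical workload of 96 cores and DaemonSets that reserve 0.5 cores on every node. Both figures and the function name are illustrative.

```python
import math


def daemonset_tax(workload_cpu, node_cpu, daemonset_cpu_per_node):
    """Return (node_count, total cores consumed by DaemonSets across those nodes)."""
    usable = node_cpu - daemonset_cpu_per_node
    if usable <= 0:
        raise ValueError("DaemonSets would consume the whole node")
    nodes = math.ceil(workload_cpu / usable)
    return nodes, nodes * daemonset_cpu_per_node


print(daemonset_tax(96, 4, 0.5))    # many small nodes
print(daemonset_tax(96, 32, 0.5))   # a few large nodes
```

With these assumptions, 4-core nodes need 28 machines and spend 14 cores on DaemonSets, while 32-core nodes need 4 machines and spend 2. The calculation ignores blast radius and scale-up increments, which is the other side of the trade-off.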

The structural fix is letting the cluster choose its own node shapes. Karpenter on AWS and node auto-provisioning on GKE provision instances sized to the pending pods rather than to a list someone wrote a year ago, and — just as important — consolidate: repacking pods onto fewer nodes and retiring the empties. Consolidation is where bin-packing savings actually materialize; provisioning alone only stops the bleeding for new capacity.
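The idea behind consolidation can be sketched with a textbook bin-packing heuristic. The Python below uses first-fit decreasing on CPU alone. It is not how Karpenter or GKE node auto-provisioning decide placement (real schedulers weigh memory, affinity, disruption budgets, and price), and the pod sizes are invented.

```python
def repack(pod_cpus, node_cpu):
    """First-fit decreasing: return one list of pod CPU requests per node."""
    nodes = []
    for cpu in sorted(pod_cpus, reverse=True):
        if cpu > node_cpu:
            raise ValueError(f"pod requesting {cpu} cores cannot fit any node")
        for node in nodes:
            if sum(node) + cpu <= node_cpu:
                node.append(cpu)
                break
        else:
            # No existing node has room, so open a new one
            nodes.append([cpu])
    return nodes


# Hypothetical pod requests in whole cores, repacked onto 8-core nodes
packed = repack([3, 1, 2, 2, 3, 1, 2, 2], node_cpu=8)
print(len(packed), packed)
```

Here 16 cores of requests fit on two 8-core nodes. If the same pods were spread across three or four nodes after months of churn, the difference is what consolidation recovers.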

Some pools should stay deliberately separate. GPU nodes are the most expensive capacity in the building: taint them so only the AI workloads that need them can land there, and let them scale to zero when idle — an idle GPU node is the single fastest way to torch a budget. The same isolation logic applies to compliance-bound workloads that need dedicated tenancy.

Leak 3: autoscaling that fights itself

The most common Horizontal Pod Autoscaler configuration, average CPU utilization across replicas, is the wrong signal for a large class of workloads. Queue consumers and IO-bound services can be drowning at modest CPU; scaling them on queue depth or consumer lag (KEDA does this cleanly) sizes the fleet to the work that exists. This matters most in event-driven integration workloads, where the honest signal lives in the message broker, not the kernel.
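Scaling on backlog reduces to a small calculation. The Python sketch below shows the core arithmetic under assumed values. The target of 100 messages per replica and the replica bounds are hypothetical, and real autoscalers add stabilization windows and tolerances that are omitted here.

```python
import math


def desired_replicas(queue_depth, target_per_replica, min_replicas=0, max_replicas=50):
    """Size the consumer fleet to the backlog, clamped to configured bounds."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(1200, 100))    # backlog of 1200 messages
print(desired_replicas(0, 100))       # empty queue can scale to zero
print(desired_replicas(10000, 100))   # capped by max_replicas
```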

The quieter and usually larger leak is scale-down that never happens. The cluster autoscaler refuses to drain nodes hosting pods it cannot safely evict: PodDisruptionBudgets that effectively demand 100% availability, naked pods with no controller, pods using local storage, system pods without eviction annotations. One such pod pins an entire node. Clusters in this state scale up beautifully during every spike and never come back down — utilization decays month over month while every individual graph looks fine.
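A simplified way to hunt for these pins is to scan pod manifests for the patterns listed above. The Python sketch below approximates those checks and is not the cluster autoscaler's full rule set. The function name and messages are invented, and disruptions_allowed is assumed to be read from the status of whichever PodDisruptionBudget covers the pod.

```python
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def eviction_blockers(pod, disruptions_allowed=None):
    """List reasons a pod may pin its node (a simplified approximation)."""
    metadata = pod.get("metadata", {})
    annotation = metadata.get("annotations", {}).get(SAFE_TO_EVICT)
    volumes = pod.get("spec", {}).get("volumes", [])
    blockers = []
    if annotation == "false":
        blockers.append("annotated as not safe to evict")
    if not metadata.get("ownerReferences"):
        blockers.append("naked pod with no controller")
    if annotation != "true" and any("emptyDir" in v or "hostPath" in v for v in volumes):
        blockers.append("local storage without safe-to-evict annotation")
    if disruptions_allowed == 0:
        blockers.append("PodDisruptionBudget allows zero disruptions")
    return blockers


naked = {"metadata": {"name": "debug"},
         "spec": {"volumes": [{"name": "scratch", "emptyDir": {}}]}}
print(eviction_blockers(naked))
```

Run against every pod on a node that never drains, a check like this usually names the culprit faster than reading autoscaler logs.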

Then there is the calendar. Staging and preview environments running at full replica counts through nights and weekends are pure leak: schedule non-production to zero outside working hours and give preview environments a TTL. Nothing about that requires cleverness — only ownership.
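Both calendar rules fit in a few lines. The Python sketch below assumes working hours of 08:00 to 19:00 on weekdays and a three-day preview TTL. Those values and the function names are placeholders, and time zones are ignored.

```python
from datetime import datetime, timedelta


def replicas_for(now, working_replicas, start_hour=8, end_hour=19):
    """Non-production replica count: full during working hours, zero otherwise."""
    is_weekday = now.weekday() < 5          # Monday is 0, Sunday is 6
    in_hours = start_hour <= now.hour < end_hour
    return working_replicas if (is_weekday and in_hours) else 0


def preview_expired(created_at, now, ttl=timedelta(days=3)):
    """True once a preview environment has outlived its TTL."""
    return now - created_at >= ttl


print(replicas_for(datetime(2024, 1, 3, 10), 3))   # Wednesday morning
print(replicas_for(datetime(2024, 1, 6, 12), 3))   # Saturday noon
print(preview_expired(datetime(2024, 1, 1), datetime(2024, 1, 5)))
```

Under this schedule a staging environment runs 55 of the 168 hours in a week, roughly a third of what always-on costs.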


Leak 4: paying retail for compute

The same node comes at three prices: on-demand, committed, and spot. Running everything on-demand means paying the maximum rate for capacity whose shape you could largely predict.

Spot (preemptible) capacity is typically priced 60–90% below on-demand, in exchange for reclamation on minutes' notice or less. That constraint is fatal for some workloads and irrelevant for others, so the decision reduces to a short list:

Good on spot: stateless replicas behind a load balancer, queue workers with idempotent jobs, CI runners, batch and ML training with checkpointing.
Keep on-demand or committed: databases and stateful sets, singleton control-plane services, anything whose recovery story you have not actually tested.
Prerequisites either way: graceful shutdown handling, PodDisruptionBudgets that reflect real tolerance, and spot pools diversified across many instance types — a single-type pool can be wiped out in one capacity crunch.

Commitments — Savings Plans, reserved instances, committed use discounts — belong on the floor of your usage: the baseline that survives after right-sizing and scale-down fixes. The ordering is the whole game. Committing to one to three years of your current, unoptimized usage locks the waste in at a discount; right-size first, then commit to what remains, and keep the commitment below the floor so optimization keeps paying you instead of stranding reservations.
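The ordering argument can be checked with simple arithmetic. In the Python sketch below, capacity is measured in node-equivalents at a normalized on-demand rate of 1.0 per hour. The 40% commitment discount, the 70% spot discount, and the demand figures are assumptions for illustration, not quoted prices.

```python
def hourly_cost(demand, committed, spot_share, rate=1.0,
                commit_discount=0.4, spot_discount=0.7):
    """Blend committed, spot, and on-demand capacity into one hourly cost.

    spot_share is the fraction of demand above the commitment that runs on spot.
    """
    overflow = max(0.0, demand - committed)
    # Commitments are paid for whether or not the capacity is used
    committed_cost = committed * rate * (1 - commit_discount)
    spot_cost = overflow * spot_share * rate * (1 - spot_discount)
    on_demand_cost = overflow * (1 - spot_share) * rate
    return committed_cost + spot_cost + on_demand_cost


print(hourly_cost(100, 0, 0.0))    # everything on-demand, unoptimized
print(hourly_cost(60, 100, 0.0))   # committed to 100, then right-sized to 60
print(hourly_cost(60, 50, 0.5))    # right-sized first, committed below the floor
```

With these numbers, committing to the unoptimized 100 units costs 60 per hour regardless of later right-sizing. Right-sizing to 60 first, committing to 50, and running half the remainder on spot costs 36.5.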

Leak 5: the traffic and telemetry nobody graphs

Compute gets the attention; the supporting charges grow in the dark.

Cross-zone traffic is billed in both directions on most clouds, and Kubernetes is happy to generate it: services chatting across availability zones, replicated data stores gossiping constantly, a service mesh adding its own chatter on top. Topology-aware routing keeps request paths zone-local where replication allows. The trade-off is real — never sacrifice multi-AZ resilience to save on traffic — but most cross-zone volume in the clusters we see is accidental, not architectural, and accidental traffic is free to eliminate.

NAT gateways charge per gigabyte processed, and a fleet of nodes pulling container images through one is a classic silent line item: route registry, object storage, and cloud API traffic over VPC endpoints, and use a regional pull-through image cache. Likewise, every Service of type LoadBalancer provisions a billable load balancer — consolidate behind an ingress or gateway instead of letting each team mint its own.

Finally, the observability tax. High-cardinality metrics (a label per user or request ID will do it), debug-level logs shipped at production volume, and unbounded retention produce telemetry pipelines that rival the compute they observe. Set cardinality budgets, sample where fidelity allows, and tier retention — full-fidelity recent data, aggregates for history.
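A cardinality budget starts with an upper bound on how many time series one metric can produce: the product of the distinct values of each label. The Python sketch below uses invented label counts, and real series counts are lower when not every combination occurs.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric, given distinct values per label."""
    return prod(label_cardinalities.values())


labels = {"service": 20, "endpoint": 30, "status": 5}
print(max_series(labels))

# Adding a hypothetical per-user label multiplies everything that came before
labels_with_user = {**labels, "user_id": 10_000}
print(max_series(labels_with_user))
```

Three modest labels cap this metric at 3,000 series. A single user_id label with 10,000 values raises the ceiling to 30 million.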

The order of operations that actually works

Sequence matters, because each step changes the data the next one depends on:

1. Allocate. Get cost per namespace, team, and workload. Unallocated spend never gets fixed.
2. Right-size requests. Usually the largest single win, and it resets the real demand baseline.
3. Unblock scale-down. Fix PDBs, naked pods, and eviction blockers so the cluster can shrink.
4. Repack nodes. Turn on consolidation and let node shapes follow pod shapes.
5. Reprice capacity. Spot for the tolerant, commitments sized to the new, smaller floor.
6. Hunt traffic and telemetry. Cross-zone paths, NAT routes, load balancer sprawl, log volume.

Be honest about what this costs. Steps two through four consume senior engineering weeks and carry regression risk — OOM kills after aggressive memory cuts, tail latency from throttling, churn from a spot pool that was less tolerant than assumed. And savings decay: without ownership, drift returns within a couple of quarters. This is precisely the engagement shape of DevOps consulting services done properly — an audit that produces the prioritized list, then engineers who implement it and operate the result, rather than a report that ages in a drive.

When tuning is the wrong answer

Sometimes the honest finding is that the cluster should not exist. A handful of low-traffic services does not need Kubernetes; managed runtimes like Cloud Run or Fargate are often cheaper both in compute and — the part that never makes the spreadsheet — in the engineering attention the cluster consumes. Sometimes the leak is upstream of infrastructure entirely: a constellation of chatty microservices that should have been modules is a web application architecture problem no amount of bin-packing will fix, and lift-and-shift workloads running as VMs-in-pods need legacy modernization more than they need tuning.

Our bias here comes from operating, not advising: the platforms behind the production systems we run sit on this exact stack, and every recommendation above is one we pay for ourselves when we get it wrong. If your bill has been drifting up while traffic has not, the five leaks in this guide are the first places to look, and an audit is the cheapest way to find out which ones are yours.