Autoscaling is sold as a cost reduction feature. Match capacity to demand, stop paying for headroom you rarely use, let the platform handle the peaks. The pitch is correct in principle. In practice, a misconfigured autoscaling policy produces cost patterns that are less predictable, and frequently more expensive, than the fixed provisioning it replaced. The teams that experience this rarely trace it back to autoscaling — they see a fluctuating bill and assume it is growth. Often it is not. It is a scale-in problem.
How Autoscaling Creates Cost Problems
There are three failure modes that reliably produce higher cost than fixed provisioning, and they are not edge cases — they appear in the default or near-default configuration of most workloads that have not been explicitly tuned.
The first is scale-out without scale-in. The workload grows to handle a peak event — a marketing campaign, a monthly batch run, a Monday morning traffic surge — and the autoscaler does exactly what it is configured to do: adds instances. The peak passes. The instances remain. This happens when scale-in rules are not defined, when scale-in thresholds are too close to scale-out thresholds, or when scale-in cooldown periods are so long that the next peak arrives before the previous scale-in completes. The result is a workload that grows to its maximum instance count over several weeks and stays there indefinitely. At that point you have fixed provisioning at peak capacity, billed continuously, with the overhead of the autoscaling configuration on top.
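To make the ratchet concrete, here is a minimal sketch, with entirely illustrative thresholds and load figures, of what a scale-out rule with no matching scale-in rule does over a few successive peaks:

```python
# Minimal sketch: an autoscaler with a scale-out rule but no scale-in rule.
# Thresholds and the load profile are illustrative, not from any real workload.

instances = 2
scale_out_threshold = 70  # % CPU per instance

# Four successive peak events, expressed as total load in "% CPU" to be shared.
weekly_peak_load = [180, 250, 320, 400]

for week, total_load in enumerate(weekly_peak_load, start=1):
    # Scale out while per-instance CPU is above the threshold.
    while total_load / instances > scale_out_threshold:
        instances += 1
    # No scale-in rule: after the peak passes, the count stays where it is.
    print(f"week {week}: peaked at {total_load}, now holding {instances} instances")

# The count only ever ratchets upward: fixed provisioning at peak capacity,
# billed continuously, exactly as described above.
```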
The second failure mode is scaling on the wrong metric. The most common example is scaling on CPU when the actual bottleneck is memory pressure, thread pool exhaustion, or database connection limits. A workload that is genuinely CPU-constrained responds well to CPU-based scaling — adding instances distributes the load and the metric returns to baseline. A workload that is memory-constrained will scale out under CPU pressure, discover that the new instances also become memory-constrained almost immediately, scale out again, and eventually hit the instance maximum without resolving the underlying problem. The cost implication is significant: you are paying for additional compute without recovering performance, which means you are also likely to pay for an incident investigation when the workload eventually degrades anyway.
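A rough simulation of that dynamic, assuming a memory-bound workload whose per-instance memory footprint does not shrink as instances are added, so the CPU overhead that pressure causes never drops below the threshold (all figures illustrative):

```python
# Sketch of scaling on the wrong metric. Each instance holds the same
# in-process dataset, so adding instances never relieves per-instance memory
# pressure, and the GC overhead it causes keeps CPU above the threshold.

max_instances = 20
scale_out_threshold = 70          # % CPU
request_cpu_total = 120           # % CPU of request work, shared across instances
gc_overhead_per_instance = 65     # % CPU burned under memory pressure -- constant

instances = 2
while instances < max_instances:
    per_instance_cpu = request_cpu_total / instances + gc_overhead_per_instance
    if per_instance_cpu <= scale_out_threshold:
        break
    instances += 1

print(f"stopped at {instances} instances, per-instance CPU still "
      f"~{request_cpu_total / instances + gc_overhead_per_instance:.0f}%")
# The autoscaler reaches the instance maximum without ever resolving the
# memory bottleneck: paying for compute without recovering performance.
```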
The third failure mode is the absence of a maximum instance cap, or a cap set so high it is functionally absent. A runaway scaling event — a retry storm, a misconfigured load test pointing at production, a dependency failure causing elevated latency and therefore elevated request queuing — can drive an uncapped autoscaling group to hundreds of instances in minutes. Azure will scale as aggressively as your limits and quotas allow. The billing consequence of a two-hour runaway event against a workload with no instance maximum can easily exceed a month of normal operating cost. The cap is not primarily a performance decision — it is a cost circuit breaker.
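Back-of-envelope arithmetic makes the circuit-breaker point clearly; the rate, baseline, and runaway figures below are assumptions to substitute with your own:

```python
# Exposure calculation for an uncapped scaling group. Every number here is an
# illustrative assumption -- substitute your own SKU rate, baseline count, and
# quota ceiling.

hourly_rate = 0.40        # assumed $/hour per instance
baseline_instances = 2
hours_per_month = 730

runaway_instances = 750   # what the quota allows, absent an explicit cap
runaway_hours = 2
capped_instances = 20     # a deliberate maximum acting as a cost circuit breaker

normal_month = baseline_instances * hourly_rate * hours_per_month
uncapped_event = runaway_instances * hourly_rate * runaway_hours
capped_event = capped_instances * hourly_rate * runaway_hours

print(f"normal month of operation: ${normal_month:.0f}")    # ~$584
print(f"2h runaway, no cap:        ${uncapped_event:.0f}")  # ~$600
print(f"2h runaway, cap of 20:     ${capped_event:.0f}")    # ~$16
```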
App Service Autoscaling — What to Watch
App Service autoscaling uses rule-based configuration: define a metric, a threshold, and a scale action. The mechanism is straightforward, which makes it easy to configure something that appears correct but behaves badly under real load.
The evaluation window is the most important and most consistently misunderstood parameter. The evaluation window determines how long Azure observes the metric before deciding whether to scale. A one-minute window is almost never correct for production workloads — it means the autoscaler is making decisions based on sixty seconds of data, which is enough to capture a transient spike but not enough to determine whether the load is sustained. The consequence is oscillation: the workload scales out on a brief spike, scales back in when the spike passes, and repeats the cycle continuously. Every scale event has a warm-up cost, a connection migration cost, and a configuration overhead cost. Continuous oscillation can cost more than running at the scaled-out level continuously. Use a five to ten minute evaluation window as a starting point and only move toward shorter windows if your traffic profile genuinely justifies it.
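A small sketch of the difference, running the same illustrative traffic trace through a one-minute and a ten-minute evaluation window:

```python
# The same per-minute CPU trace evaluated over two window lengths. Only the
# short window reacts to the transient spike. Numbers are illustrative.

cpu_per_minute = [40, 42, 41, 95, 43, 40, 41, 42, 40, 41]  # one brief spike
scale_out_threshold = 70

def would_scale_out(samples, window):
    """Average the most recent `window` minutes and compare to the threshold."""
    recent = samples[-window:]
    return sum(recent) / len(recent) > scale_out_threshold

for minute in range(1, len(cpu_per_minute) + 1):
    seen = cpu_per_minute[:minute]
    short = would_scale_out(seen, 1)
    long = would_scale_out(seen, 10)
    if short or long:
        print(f"minute {minute}: 1-min window -> {short}, 10-min window -> {long}")

# Only the 1-minute window fires (at the spike), triggering a scale-out that
# the 10-minute view correctly identifies as noise.
```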
Scale-in cooldown prevents scale events from happening faster than the workload can stabilise after a change. The typical mistake is setting this too short in the belief that faster scale-in means lower cost. On sustained load, a short cooldown causes the autoscaler to scale in, observe that load is still elevated, and scale back out — oscillation again. A cooldown of ten to fifteen minutes for scale-in gives the workload time to redistribute in-flight requests and stabilise before the next evaluation.
The threshold relationship between scale-out and scale-in requires deliberate asymmetry. If you scale out at 70% CPU and scale in at 50% CPU, you are operating within a twenty-point band that provides meaningful hysteresis. If you scale out at 70% and scale in at 60%, a workload running at 65% will oscillate continuously. Scale in at 30–35% when you are scaling out at 70% — the gap looks excessive until you consider what happens after the scale-in: the same load is spread across fewer instances, each of which shows higher utilisation than the scaled-out pool did, and if that lands back above the scale-out threshold the pool scales straight back out.
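The arithmetic is easier to see with numbers; per-instance utilisation here is simply total load divided by instance count, and the figures are illustrative:

```python
# Why the scale-in threshold needs to sit far below the scale-out threshold.

def per_instance(total_load, instances):
    return total_load / instances

# Scale out at 70%, scale in at 60%: load dips to 58% per instance across
# four instances and the autoscaler removes one...
after_scale_in = per_instance(4 * 58, 3)
print(f"scale-in at 60%: remaining instances at {after_scale_in:.0f}%")  # ~77% -> scales back out

# Scale out at 70%, scale in at 34%: the pool only contracts when the
# post-removal utilisation still sits well below the scale-out threshold.
after_scale_in = per_instance(4 * 34, 3)
print(f"scale-in at 34%: remaining instances at {after_scale_in:.0f}%")  # ~45% -> stable
```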
Minimum instance count is a genuine trade-off between cost and performance. A minimum of one instance saves money during off-peak periods but introduces cold-start latency when the first request arrives after the instance has been idle long enough for the runtime to wind down. For latency-sensitive APIs, the cold-start penalty is typically measured in seconds — long enough to breach an SLA and generate a support ticket. A minimum of two instances eliminates cold starts without a significant cost impact for most workloads, and means scale-out events add capacity alongside instances that are already warm rather than spinning up from zero. Two is usually the right minimum for production APIs. One is acceptable for background workers and internal tooling where latency is not a concern.
[DAN: Add specific App Service autoscaling config you’ve shipped — scale-out/in thresholds, cooldown periods, reasoning behind numbers.]
AKS Node Pool Autoscaling — Different Problem Set
Kubernetes autoscaling involves two separate systems operating at different layers: the Cluster Autoscaler adds and removes nodes based on whether pods are schedulable, and the Horizontal Pod Autoscaler adds and removes pods based on resource utilisation or custom metrics. Both must be tuned together. Tuning one in isolation while ignoring the other is the most common cause of AKS cost problems.
The most frequent cost failure is nodes that cannot scale in because the pods on them cannot be moved elsewhere. This happens when pod disruption budgets require a minimum number of replicas that equals the total replica count — every pod is protected, nothing can be evicted, so the Cluster Autoscaler cannot drain the node. It also happens when workloads have no replica headroom: if the workload is already at the minimum replica count required to serve traffic and its pods are spread across every node in the pool, there is nowhere to consolidate them, and scale-in is blocked indefinitely. The result is a node pool that grows during a demand spike and never contracts. Designing for scale-in means ensuring that pod disruption budgets allow at least one pod to be evicted at any time, and that the pool carries enough slack above the bare minimum that the scheduler can bin-pack pods onto fewer nodes.
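The arithmetic the autoscaler runs into is simple to sketch; the workload names below are hypothetical, and the calculation mirrors how a disruption budget with an absolute minAvailable behaves:

```python
# How a PodDisruptionBudget blocks node drain: disruptions allowed is
# replicas minus minAvailable. Workload names and counts are hypothetical.

workloads = [
    # (name, replicas, pdb_min_available)
    ("checkout-api", 3, 3),   # minAvailable == replicas: nothing can be evicted
    ("report-worker", 4, 3),  # one pod may be evicted at any time
]

for name, replicas, min_available in workloads:
    disruptions_allowed = replicas - min_available
    blocked = disruptions_allowed == 0
    print(f"{name}: disruptions allowed = {disruptions_allowed}"
          f"{' -- blocks node drain, scale-in never completes' if blocked else ''}")
```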
The Cluster Autoscaler’s default scale-in delay of ten minutes is intentional, not a bug. It exists because node startup time is measured in minutes, not seconds, and a node that is drained and terminated only to be needed again three minutes later produces unnecessary churn, startup cost, and scheduling disruption. Do not reduce this aggressively. The cost saving from faster scale-in is marginal compared to the stability cost of getting it wrong.
Node pool sizing has a direct relationship to scale-in efficiency that is frequently overlooked. A node pool built from large, expensive nodes scales in slowly because each node represents a large chunk of capacity — the autoscaler cannot remove it until every pod running on it can be rescheduled elsewhere, which requires significant headroom on the remaining nodes. Smaller nodes drain faster, produce finer-grained capacity increments, and scale in more efficiently. The trade-off is that smaller nodes carry more per-node system overhead and a larger network footprint. For most general workloads, nodes in the D4s or D8s range balance these concerns well. Oversized nodes — D32s or D64s — should be limited to workloads with specific single-pod memory or CPU requirements that genuinely cannot be served by smaller instances.
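A rough illustration of the granularity effect, assuming a 24 vCPU drop in demand and free consolidation of pods, with vCPU counts matching the D4s and D32s classes; everything else is an assumption:

```python
# Capacity granularity and scale-in: how much of a post-peak demand drop can
# actually be handed back as whole nodes.

demand_drop_vcpu = 24

for node_name, node_vcpu in [("D4s", 4), ("D32s", 32)]:
    whole_nodes_freed = demand_drop_vcpu // node_vcpu
    stranded = demand_drop_vcpu - whole_nodes_freed * node_vcpu
    print(f"{node_name}: can remove {whole_nodes_freed} node(s), "
          f"{stranded} vCPU of freed capacity stays provisioned and billed")

# D4s: six nodes can be drained and removed. D32s: none, because no single
# node can be fully emptied -- the freed capacity keeps being billed.
```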
Spot node pools reduce cost by 60–80% for workloads that tolerate interruption. The candidates are batch processing jobs, development and testing environments, machine learning training workloads, and any background worker that implements retry logic against a durable queue. The condition for spot pool suitability is not “can this workload survive an interruption” but “can this workload survive an interruption without operator involvement.” If a spot eviction requires a manual restart or a support ticket, the workload does not belong on spot nodes. If it queues its work durably and retries automatically, spot is appropriate and the cost reduction is substantial.
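In code terms, the suitability test looks roughly like this: acknowledge work only after it completes, so an eviction mid-task means redelivery rather than a ticket. The queue class below is an in-memory stand-in, not a real SDK; Azure Storage Queues and Service Bus both provide equivalent receive-then-complete semantics:

```python
from collections import deque

class DurableQueueStub:
    """In-memory stand-in for a durable queue with lease semantics."""
    def __init__(self, items):
        self._pending = deque(items)
        self._in_flight = set()

    def receive(self):
        if not self._pending:
            return None
        item = self._pending.popleft()
        self._in_flight.add(item)      # invisible to other workers while leased
        return item

    def complete(self, item):
        self._in_flight.discard(item)  # acknowledged: never redelivered

    def requeue_in_flight(self):
        # What the real service does when a lease expires after an eviction.
        self._pending.extend(self._in_flight)
        self._in_flight.clear()

queue = DurableQueueStub(["job-1", "job-2", "job-3"])

# Worker processes one message normally...
first = queue.receive()
print(f"processed and acknowledged {first}")
queue.complete(first)

# ...then the spot node is evicted mid-task, before complete() is called.
queue.receive()
queue.requeue_in_flight()  # lease expires; the message returns to the queue

while (msg := queue.receive()) is not None:
    print(f"picked up without operator involvement: {msg}")
    queue.complete(msg)
```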
[DAN: Add preferred AKS node pool strategy — system vs user pool separation, node sizing for efficient scale-in, spot pool conditions.]
The Scale-In Aggression Trade-Off
Aggressive scale-in and conservative scale-in are not one right and one wrong — they are different positions on a genuine trade-off, and the right position depends on the traffic characteristics of the workload.
Aggressive scale-in returns capacity to baseline quickly after a peak. The cost profile is lower on average, because fewer instance-hours are consumed in the period between a peak and the next demand spike. The risk is that a scale-in event coincides with the beginning of the next peak — the workload contracts just as demand is returning, triggers a scale-out, and the latency during the scale-out interval is visible to users. For workloads with bursty, unpredictable traffic, aggressive scale-in creates a saw-tooth performance profile that is difficult to predict and harder to explain to stakeholders than a slow billing trend.
Conservative scale-in keeps more capacity available for longer after a peak. The cost profile is higher because instances that are no longer fully utilised remain running. The benefit is a buffer against the next demand spike — if traffic returns before scale-in completes, the workload absorbs it without a scale-out event. For workloads with sustained load or sharp unpredictable spikes, conservative scale-in produces more stable performance at the cost of a higher baseline bill.
The decision rule is to match scale-in aggressiveness to traffic predictability. For workloads with predictable daily patterns — morning ramp, evening trough, minimal weekend load — lean aggressive on scale-in and use scheduled scaling to pre-scale ahead of the morning peak and scale in after the evening trough. Scheduled scaling does not wait for a metric threshold; it adjusts instance count on a schedule you define. The combination of aggressive scale-in with scheduled pre-scaling eliminates both the cost waste of conservative scale-in and the latency spike risk of aggressive scale-in against unpredictable traffic. For workloads with genuinely unpredictable traffic — event-driven architectures, consumer-facing applications with viral growth potential, anything dependent on external demand signals you do not control — lean conservative. The extra spend buys meaningful latency headroom that customers and SLAs will notice when you get it wrong.
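A sketch of what the combination looks like for a predictable weekday profile; the windows and counts are assumptions, and in Azure the schedule itself would live in autoscale recurrence profiles rather than in code you run yourself:

```python
from datetime import datetime, time

# Scheduled pre-scale plus aggressive metric-based scale-in: the schedule sets
# the floor, metric rules scale above it and back down to it. Illustrative only.

SCHEDULE = [
    # (start, end, min_instances) -- weekday profile
    (time(7, 0),  time(9, 0),   6),  # pre-scale ahead of the morning ramp
    (time(9, 0),  time(18, 0),  4),  # sustained daytime baseline
    (time(18, 0), time(23, 59), 2),  # evening trough
    (time(0, 0),  time(7, 0),   2),  # overnight
]

def minimum_for(now: datetime) -> int:
    """Return the scheduled floor for the current time of day."""
    current = now.time()
    for start, end, minimum in SCHEDULE:
        if start <= current < end:
            return minimum
    return 2

print(minimum_for(datetime(2024, 3, 4, 7, 30)))  # 6 -- warm before the peak arrives
print(minimum_for(datetime(2024, 3, 4, 21, 0)))  # 2 -- aggressive scale-in is safe here
```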
Autoscaling done well produces lower average cost than equivalent fixed provisioning. The discipline is in the tuning, and specifically in the scale-in configuration, which receives a fraction of the attention that scale-out does during initial setup. Scale-out is visible — it responds to incidents, it prevents degradation, it is the mechanism that keeps the service up. Scale-in is invisible until you look at the bill. Look at the scale-in configuration first. The scale-out is probably fine.