06 May 2026

Autoscaling Azure Workloads Without Creating Cost Spikes

Autoscaling should reduce cost. Misconfigured, it produces spikes worse than fixed provisioning. Here is the discipline that prevents both.

Daniel Inman Cloud Solution Architect

Practical architecture guidance grounded in delivery, trade-offs, and real platform constraints.


Autoscaling is sold as a cost reduction feature. Match capacity to demand, stop paying for headroom you rarely use, let the platform handle the peaks. The pitch is correct in principle. In practice, a misconfigured autoscaling policy produces cost patterns that are less predictable, and frequently more expensive, than the fixed provisioning it replaced. The teams that experience this rarely trace it back to autoscaling — they see a fluctuating bill and assume it is growth. Often it is not. It is a scale-in problem.

How Autoscaling Creates Cost Problems

Three failure modes reliably produce costs higher than fixed provisioning. They are not edge cases: they appear in the default or near-default configuration of most workloads that have not been explicitly tuned.

The first is scale-out without scale-in. The workload grows to handle a peak event — a marketing campaign, a monthly batch run, a Monday morning traffic surge — and the autoscaler does exactly what it is configured to do: adds instances. The peak passes. The instances remain. This happens when scale-in rules are not defined, when scale-in thresholds are too close to scale-out thresholds, or when scale-in cooldown periods are so long that the next peak arrives before the previous scale-in completes. The result is a workload that grows to its maximum instance count over several weeks and stays there indefinitely. At that point you have fixed provisioning at peak capacity, billed continuously, with the overhead of the autoscaling configuration on top.
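
One way to catch this before it reaches the bill is to lint the rule set itself. The sketch below is illustrative only: the `Rule` class, the `scale_in_gaps` function and the 30-point default gap are assumptions made for this example, not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified view of one autoscale rule."""
    metric: str        # e.g. "CpuPercentage"
    direction: str     # "out" or "in"
    threshold: float   # percent


def scale_in_gaps(rules: list[Rule], min_gap: float = 30.0) -> list[str]:
    """List scale-out rules whose scale-in side is missing or too close.

    min_gap is an assumed rule of thumb, not a platform requirement.
    """
    scale_in = {r.metric: r for r in rules if r.direction == "in"}
    problems = []
    for rule in rules:
        if rule.direction != "out":
            continue
        partner = scale_in.get(rule.metric)
        if partner is None:
            problems.append(f"{rule.metric}: no scale-in rule")
        elif rule.threshold - partner.threshold < min_gap:
            problems.append(
                f"{rule.metric}: scale-in threshold too close to scale-out"
            )
    return problems
```

A rule set with a scale-out rule at 70% and no scale-in partner returns one finding. Adding a scale-in rule at 30% returns an empty list. Cooldown length is not checked here, because judging it needs knowledge of how often peaks arrive.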

The second failure mode is scaling on the wrong metric. The most common example is scaling on CPU when the actual bottleneck is memory pressure, thread pool exhaustion, or database connection limits. A workload that is genuinely CPU-constrained responds well to CPU-based scaling — adding instances distributes the load and the metric returns to baseline. A workload that is memory-constrained will scale out under CPU pressure, discover that the new instances also become memory-constrained almost immediately, scale out again, and eventually hit the instance maximum without resolving the underlying problem. The cost implication is significant: you are paying for additional compute without recovering performance, which means you are also likely to pay for an incident investigation when the workload eventually degrades anyway.
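
A quick sanity check is to compare the metric your rule scales on with the resource that is actually closest to its limit. This is a minimal sketch under stated assumptions: the function names and the sample figures are hypothetical, and the saturation values are expected to be fractions of each resource's limit that you have gathered yourself.

```python
def binding_constraint(saturation: dict[str, float]) -> str:
    """Return the resource closest to its limit.

    Values are fractions of the limit, e.g. 0.88 means 88% of available memory.
    """
    if not saturation:
        raise ValueError("no measurements supplied")
    return max(saturation, key=lambda name: saturation[name])


def rule_targets_bottleneck(rule_metric: str, saturation: dict[str, float]) -> bool:
    """True when the autoscale rule watches the most saturated resource."""
    return rule_metric == binding_constraint(saturation)


# Hypothetical peak-time readings for one instance.
observed = {"cpu": 0.35, "memory": 0.88, "db_connections": 0.60}
print(binding_constraint(observed))              # memory
print(rule_targets_bottleneck("cpu", observed))  # False
```

If the check returns False, adding instances on the current rule is unlikely to move the metric that matters.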

The third failure mode is the absence of a maximum instance cap, or a cap set so high it is functionally absent. A runaway scaling event (a retry storm, a misconfigured load test pointing at production, a dependency failure causing elevated latency and therefore elevated request queuing) can drive an uncapped autoscale configuration to hundreds of instances in minutes. Azure will scale as aggressively as your limits and quotas allow. For a workload that normally runs on one or two instances, the billing consequence of a two-hour runaway event with no instance maximum can exceed a month of normal operating cost. The cap is not primarily a performance decision. It is a cost circuit breaker.
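
Treating the cap as a circuit breaker means deriving it from money rather than from performance. The sketch below works backwards from the most you are prepared to lose in a single runaway event. The function names, the hourly rate and the budget are all illustrative assumptions, not real Azure prices.

```python
import math


def worst_case_cost(max_instances: int, hourly_rate: float, hours: float) -> float:
    """Compute-only cost of sitting at the cap for the whole event."""
    return max_instances * hourly_rate * hours


def cap_for_budget(event_budget: float, hourly_rate: float, hours: float) -> int:
    """Largest instance cap whose worst case stays within event_budget.

    hours is how long a runaway could plausibly run before someone notices.
    Never returns less than 1, so a very small budget still leaves one instance.
    """
    if hourly_rate <= 0 or hours <= 0:
        raise ValueError("hourly_rate and hours must be positive")
    return max(1, math.floor(event_budget / (hourly_rate * hours)))
```

With an assumed rate of 0.25 per instance-hour, a two-hour detection window and an event budget of 100, `cap_for_budget(100, 0.25, 2)` gives a cap of 200 instances. The figure covers compute only; downstream charges that grow with instance count are outside this sketch.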

App Service Autoscaling — What to Watch

App Service autoscaling uses rule-based configuration: define a metric, a threshold, and a scale action. The mechanism is straightforward, which makes it easy to configure something that appears correct but behaves badly under real load.

The evaluation window is the most important and most consistently misunderstood parameter. It determines how long Azure observes the metric before deciding whether to scale. A one-minute window is almost never correct for production workloads: the autoscaler is making decisions on sixty seconds of data, which is enough to capture a transient spike but not enough to tell whether the load is sustained. The consequence is oscillation. The workload scales out on a brief spike, scales back in when the spike passes, and repeats the cycle continuously. Every scale event has a warm-up cost, a connection migration cost, and a configuration overhead cost, so continuous oscillation can cost more than simply running at the scaled-out level. Use a five- to ten-minute evaluation window as a starting point and move toward shorter windows only if your traffic profile genuinely justifies it.
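
The effect of the window is easy to see on synthetic data. The sketch below counts how often a trailing average crosses the scale-out threshold for a given window length. The `breaches` function and the sample series are assumptions made for illustration; the platform's own evaluation logic is more involved than a simple trailing mean.

```python
def breaches(samples: list[float], window: int, threshold: float) -> int:
    """Count evaluation points where the trailing-window average exceeds threshold.

    samples holds one CPU reading per minute; window is measured in minutes.
    """
    count = 0
    for end in range(window - 1, len(samples)):
        recent = samples[end - window + 1 : end + 1]
        if sum(recent) / window > threshold:
            count += 1
    return count


# Thirty minutes of 40% CPU with a two-minute spike to 95% every ten minutes.
spiky = ([40.0] * 8 + [95.0] * 2) * 3

print(breaches(spiky, window=1, threshold=70.0))   # 6
print(breaches(spiky, window=10, threshold=70.0))  # 0
```

With a one-minute window every spike minute looks like a reason to scale out. With a ten-minute window the average stays at 51% and nothing fires, while a genuinely sustained rise still crosses the threshold after a few minutes.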

Scale-in cooldown prevents scale events from happening faster than the workload can stabilise after a change. The typical mistake is setting this too short in the belief that faster scale-in means lower cost. On sustained load, a short cooldown causes the autoscaler to scale in, observe that load is still elevated, and scale back out — oscillation again. A cooldown of ten to fifteen minutes for scale-in gives the workload time to redistribute in-flight requests and stabilise before the next evaluation.
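
Conceptually, the cooldown is a gate in front of the scale-in decision. The sketch below models only that gate. The function name is hypothetical, the ten-minute default is taken from the range suggested above, and the real platform logic has more inputs than this.

```python
from datetime import datetime, timedelta


def scale_in_allowed(
    now: datetime,
    last_scale_event: datetime,
    cooldown: timedelta = timedelta(minutes=10),
) -> bool:
    """True once the cooldown since the last scale event has fully elapsed."""
    return now - last_scale_event >= cooldown


last_event = datetime(2026, 5, 6, 9, 0)
print(scale_in_allowed(datetime(2026, 5, 6, 9, 4), last_event))   # False
print(scale_in_allowed(datetime(2026, 5, 6, 9, 10), last_event))  # True
```

A metric that dips below the scale-in threshold four minutes after a scale-out is ignored. The same reading ten minutes later is acted on.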

The threshold relationship between scale-out and scale-in requires deliberate asymmetry. If you scale out at 70% and scale in at 60%, a workload running at 65% will oscillate continuously. Scaling out at 70% and in at 50% gives a twenty-point band, which is better but still not enough at low instance counts. Scale in at 30–35% when you are scaling out at 70%. The gap looks excessive until you consider what happens after the scale-in: the same load lands on a smaller pool, so each remaining instance shows higher utilisation than it did before. Three instances averaging 50% become two instances at roughly 75%, which is already back above the scale-out threshold.
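
The arithmetic behind the wide gap can be checked directly. The sketch below assumes total load stays constant, spreads evenly across instances, and that utilisation scales linearly with load. Those are simplifications, and the function names are illustrative.

```python
def utilisation_after_scale_in(avg_util: float, instances: int, remove: int = 1) -> float:
    """Projected per-instance utilisation once `remove` instances are gone."""
    remaining = instances - remove
    if remaining < 1:
        raise ValueError("at least one instance must remain")
    return avg_util * instances / remaining


def scale_in_is_stable(
    avg_util: float, instances: int, scale_out_threshold: float, remove: int = 1
) -> bool:
    """True if the projected utilisation stays below the scale-out threshold."""
    return utilisation_after_scale_in(avg_util, instances, remove) < scale_out_threshold


print(utilisation_after_scale_in(50.0, 3))  # 75.0
print(scale_in_is_stable(50.0, 3, 70.0))    # False
print(scale_in_is_stable(35.0, 3, 70.0))    # True
```

Scaling in from three instances at 50% lands at 75% and immediately re-triggers a 70% scale-out rule. At 35% the projected figure is 52.5%, which leaves room. The smaller the pool, the lower the scale-in threshold has to be for the projection to stay under the scale-out line.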

My Configuration Standard: My default starting point for most workloads is an 80/20 split (Scale-out at 80%, Scale-in at 20%). I prefer this because it forces the infrastructure to work efficiently before spinning up new billing units. However, this isn’t a “set and forget” number. For erratic workloads where spikes are sustained and aggressive, I’ll pull that scale-out down to 70% to give the platform more lead time. The 80% ceiling is also my hard limit for RAM; if I see a workload hitting 80% memory while the CPU is idling at 5%, I stop looking at scaling rules and start looking at the instance size itself. You can’t scale your way out of a memory leak or an undersized SKU.

Minimum instance count is a genuine trade-off between cost and performance. A minimum of one instance saves money during off-peak periods but introduces cold-start latency when the first request arrives after the instance has been idle long enough for the runtime to wind down. For latency-sensitive APIs, the cold-start penalty is typically measured in seconds — long enough to breach an SLA and generate a support ticket. A minimum of two instances eliminates cold starts without a significant cost impact for most workloads, and means scale-out events are distributing load that was already warm, rather than spinning up from zero. Two is usually the right minimum for production APIs. One is acceptable for background workers and internal tooling where latency is not a concern.

The Scale-In Aggression Trade-Off

Neither aggressive nor conservative scale-in is inherently right. They are different positions on a genuine trade-off, and the right position depends on the traffic characteristics of the workload.

Aggressive scale-in returns capacity to baseline quickly after a peak. The cost profile is lower on average, because fewer instance-hours are consumed in the period between a peak and the next demand spike. The risk is that a scale-in event coincides with the beginning of the next peak — the workload contracts just as demand is returning, triggers a scale-out, and the latency during the scale-out interval is visible to users. For workloads with bursty, unpredictable traffic, aggressive scale-in creates a saw-tooth performance profile that is difficult to predict and harder to explain to stakeholders than a slow billing trend.

Conservative scale-in keeps more capacity available for longer after a peak. The cost profile is higher because instances that are no longer fully utilised remain running. The benefit is a buffer against the next demand spike — if traffic returns before scale-in completes, the workload absorbs it without a scale-out event. For workloads with sustained load or sharp unpredictable spikes, conservative scale-in produces more stable performance at the cost of a higher baseline bill.

The decision rule is to match scale-in aggressiveness to traffic predictability. For workloads with predictable daily patterns (morning ramp, evening trough, minimal weekend load), lean aggressive on scale-in and use scheduled scaling to pre-scale ahead of the morning peak and scale in after the evening trough. Scheduled scaling does not wait for a metric threshold; it adjusts instance count on a schedule you define. Combining aggressive scale-in with scheduled pre-scaling avoids both the cost waste of conservative scale-in and the main risk of aggressive scale-in, which is contracting just as demand returns.

For workloads with genuinely unpredictable traffic (event-driven architectures, consumer-facing applications with viral growth potential, anything dependent on external demand signals you do not control), lean conservative. The extra spend buys latency headroom, and customers and SLAs will notice its absence when you get it wrong.
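
Scheduled pre-scaling for a predictable daily pattern can be modelled as a lookup from time of day to a minimum instance count. This is a conceptual sketch, not the platform's schedule format: `ScheduledWindow`, `scheduled_minimum`, the 07:30 to 19:00 window and the instance counts are all assumptions for illustration, and windows that cross midnight are not handled.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class ScheduledWindow:
    """Hypothetical schedule entry that raises the instance floor."""
    weekdays: frozenset[int]  # 0 = Monday ... 6 = Sunday
    start: time               # inclusive
    end: time                 # exclusive
    min_instances: int


def scheduled_minimum(
    now: datetime, windows: list[ScheduledWindow], default_min: int = 2
) -> int:
    """Return the instance floor for `now`, falling back to default_min."""
    for window in windows:
        in_day = now.weekday() in window.weekdays
        in_hours = window.start <= now.time() < window.end
        if in_day and in_hours:
            return window.min_instances
    return default_min


# Pre-scale to four instances ahead of the weekday morning ramp.
business_hours = ScheduledWindow(frozenset(range(5)), time(7, 30), time(19, 0), 4)

print(scheduled_minimum(datetime(2026, 5, 6, 8, 0), [business_hours]))  # 4
print(scheduled_minimum(datetime(2026, 5, 9, 8, 0), [business_hours]))  # 2
```

The first call falls on a Wednesday morning inside the window, so the floor is raised before any metric has a chance to breach. The second falls on a Saturday and drops back to the default. Metric-based rules can still scale above the floor in either case.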


Autoscaling done well produces lower average cost than equivalent fixed provisioning. The discipline is in the tuning, and specifically in the scale-in configuration, which receives a fraction of the attention that scale-out does during initial setup. Scale-out is visible — it responds to incidents, it prevents degradation, it is the mechanism that keeps the service up. Scale-in is invisible until you look at the bill. Look at the scale-in configuration first. The scale-out is probably fine.

About the Author

Daniel Inman

Cloud Solution Architect focused on Azure, platform design, and translating technical complexity into decisions that teams can actually execute.
