There is a specific moment that changed how I think about cloud cost. It was not a workshop, a certification, or a vendor briefing. It was a number on a screen during an architecture review. A team had built a well-designed, well-tested, well-monitored integration platform. It was elegant. It was also costing three times what it needed to, and nobody in the room could have told you that, because nobody had asked the question during the design. The cost was not a secret — it was simply not a design concern. It was something that would be dealt with later, by someone else, in a different meeting.
That meeting never happened. The cost ran for eighteen months before a finance review triggered a conversation that should have happened in the original architecture session.
Every line on your Azure bill is the financial consequence of a decision made in a design meeting. The VM tier, the service pattern, the region, the commitment model, the tagging schema — all of it is architecture. Most organisations are significantly overspending on Azure not because they chose the wrong cloud vendor, but because their architecture decision process does not treat cost as a first-class requirement alongside availability, security, and performance. The discipline of cost-optimised architecture is about changing that: making cost decisions deliberately, at design time, with the same rigour applied to any other non-functional requirement.
This guide is the complete framework for doing that. It covers the five decision categories where architecture choices compound into cost outcomes, the governance model that makes those decisions visible and accountable, and a prioritised starting point for architects inheriting an existing Azure environment.
The Architecture Cost Framework
Cost is not a single design decision. It is the accumulated result of hundreds of decisions made across five distinct categories. Understanding which category a cost problem belongs to determines how to fix it — because the interventions are entirely different.
1. Specification decisions — What resources are provisioned and at what tier. Over-provisioning, wrong SKU for the access pattern, no autoscaling configured. This is the most visible category in a cost report and the easiest to act on, but it is rarely the largest driver of avoidable spend.
2. Design pattern decisions — Synchronous vs asynchronous processing, caching vs direct reads, shared vs dedicated resources. These decisions produce costs that appear as entirely legitimate service charges. There is no cost report line that reads “synchronous overhead.” It reads “App Service Plan” — and it looks correct, because it is correct. The architecture is the problem, not the resource.
3. Geography decisions — Region selection, multi-region topology, DR posture. Azure pricing varies by up to 40% between regions on the same services. Most organisations chose their primary region once and never revisited whether that decision applies equally to all environments and workloads.
4. Commitment decisions — Pay-as-you-go vs reserved instances vs savings plans. The most consistently underutilised cost lever in Azure. The premium for pay-as-you-go flexibility is 40–72% on compute. Most organisations pay it indefinitely, for resources they have no intention of removing.
5. Governance decisions — Tagging architecture, policy enforcement, accountability structures. The enabler category. Without governance, you cannot see which resources belong to which workloads, which teams own which spend, or whether the decisions in the other four categories are working. Governance failures make every other cost problem invisible.
The key insight is that these five categories are not independent. A governance failure — bad tagging — makes specification decisions invisible because you cannot attribute spend to workloads accurately. A bad commitment strategy means every specification decision is inflated by 40–72% unnecessarily. A design pattern problem cannot be fixed by rightsizing resources — it requires redesign. The framework works as a whole, and the optimisation sequence matters.
This post covers all five. For practitioner depth on each, every section links to the dedicated post in this series.
Specification — Getting the Tier Right
Specification is the category finance teams focus on exclusively, and it is the right place to start — but with the right framing. The goal is not to minimise specification. It is to ensure specification matches actual workload demand, measured rather than assumed.
The over-provisioning pattern is consistent across every Azure environment I have reviewed. A team spins up a D8s_v3 during a proof of concept because they need headroom. The PoC becomes production. The VM stays a D8s_v3 because “it’s working.” Eighteen months later it is running at 9% CPU utilisation and nobody has touched it because the risk of changing a running system feels higher than the cost of leaving it. That risk calculus is wrong — but it is a rational response to an architecture process that never built a review gate into the promotion path.
The fix is structural, not technical. Azure Advisor will surface the obvious rightsizing candidates. It will not surface the services that are correctly sized for peak load but have no autoscaling configured, so they run at peak-sized cost around the clock. The architectural intervention is a review gate at every promotion boundary — PoC to dev, dev to staging, staging to production — that asks two questions before anything proceeds: does the specification match the workload’s measured demand, and is there an autoscaling policy in place if demand is variable?
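The two gate questions can be expressed as a simple check against utilisation metrics. This is a sketch only: the thresholds and the data shape are illustrative assumptions, and in practice the numbers would come from Azure Monitor rather than being passed in by hand.

```python
# Sketch of a promotion-gate check for specification decisions.
# Thresholds are illustrative assumptions, not Azure guidance; the
# metric values would come from Azure Monitor in a real pipeline.

def review_specification(avg_cpu_pct, peak_cpu_pct, has_autoscale, demand_is_variable):
    """Return review findings for one resource at a promotion boundary."""
    findings = []
    # Sustained low utilisation suggests the SKU does not match measured demand.
    if avg_cpu_pct < 20 and peak_cpu_pct < 50:
        findings.append("over-provisioned: consider a smaller SKU")
    # Variable demand with no autoscaling means paying peak cost around the clock.
    if demand_is_variable and not has_autoscale:
        findings.append("variable demand with no autoscaling policy")
    return findings

# The D8s_v3 from the example above: 9% average CPU, steady demand.
print(review_specification(avg_cpu_pct=9, peak_cpu_pct=35,
                           has_autoscale=False, demand_is_variable=False))
```

The point of encoding the gate is not automation for its own sake: a check that runs at every promotion makes the review impossible to skip quietly.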
Pricing model literacy matters here too. Azure SQL’s DTU vs vCore distinction, App Service plan tier boundaries, and the cost difference between storage access tiers are all areas where an incorrect specification produces sustained avoidable cost with no obvious signal in a cost report. For a detailed treatment of the pricing model decisions that most frequently catch architects out, see Azure Pricing Model Quirks That Catch Architects Off Guard.
The deeper treatment of specification decisions — including the specific patterns that compound most severely and the review process that catches them — is in The Architecture Decisions That Are Silently Destroying Your Azure Budget.
Design Patterns — The Invisible Cost Driver
Design pattern costs are the hardest to surface from a cost report, because they do not appear as waste. They appear as legitimate service charges for correctly functioning resources. The architecture is the problem. Rightsizing resources makes no difference when the design pattern is the driver.
Synchronous vs asynchronous is the most consequential pattern decision from a cost perspective. When services call each other synchronously — service A waits for service B — both services must be running simultaneously, at scale, all the time. In practice, this often means two App Service plans or two AKS node pools sized for simultaneous peak, even when actual throughput is a fraction of that. Async patterns — Service Bus, Event Grid, Azure Queue Storage — decouple the services. Service A drops a message and continues processing. Service B processes when it processes. Neither needs to be sized for simultaneous peak. The compute cost difference for the same logical workload can be 30–50%, and it will not appear anywhere in a cost report as “synchronous overhead.” See The Architecture Decisions That Are Silently Destroying Your Azure Budget for the full treatment of synchronous architecture cost patterns.
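The 30–50% figure falls out of straightforward sizing arithmetic. The instance counts and prices below are illustrative assumptions, not Azure list prices; the structure of the comparison is what matters.

```python
# Illustrative sizing arithmetic for the same logical workload.
# Instance counts and the per-instance monthly price are assumptions.

price_per_instance = 150.0  # assumed monthly cost of one App Service instance

# Synchronous: both services must be sized for simultaneous peak.
sync_instances_a, sync_instances_b = 6, 6
sync_cost = (sync_instances_a + sync_instances_b) * price_per_instance

# Async via a queue: each service sized for its own average throughput,
# plus a modest cost for the messaging layer in between.
async_instances_a, async_instances_b = 4, 3
messaging_cost = 50.0  # assumed Service Bus cost
async_cost = (async_instances_a + async_instances_b) * price_per_instance + messaging_cost

saving_pct = (sync_cost - async_cost) / sync_cost * 100
print(f"sync: {sync_cost:.0f}, async: {async_cost:.0f}, saving: {saving_pct:.0f}%")
```

Note where the saving comes from: not from cheaper resources, but from no longer needing both services sized for the same simultaneous peak.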
Caching strategy compounds similarly. A database sized for read peak is significantly more expensive than a database sized for average load with a caching layer — Azure Cache for Redis at the appropriate tier — in front of it. The architectural principle is that you should be paying database prices for database work, not for read traffic that could be served from memory. The caching layer costs less than the database tier difference in most production configurations.
Autoscaling deserves its own mention because it is both the most valuable cost lever and the most reliably misunderstood. Configured correctly, autoscaling allows specification to track actual demand — you provision for average load rather than peak, and scale out to handle peak events. Configured poorly, autoscaling produces cost spikes that are worse than a fixed high-tier configuration: scale-out events triggered too aggressively, cool-down periods too short, and minimum instance counts set to peak rather than average. The detailed configuration guidance for this is in Autoscaling Azure Workloads Without Creating Cost Spikes.
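The three misconfigurations named above can be caught with a review check. The dict shape and thresholds here are illustrative, they mirror the concepts in Azure Monitor autoscale settings rather than its actual schema.

```python
# Sketch of a sanity check on an autoscale profile. The field names
# mirror Azure Monitor autoscale concepts; the shape is illustrative.

def autoscale_findings(profile, avg_instances_needed):
    findings = []
    # Minimum set at or above peak defeats the point of scaling out.
    if profile["min_instances"] > avg_instances_needed:
        findings.append("minimum instance count exceeds average demand")
    # Very short cool-downs cause flapping: repeated scale-out spikes.
    if profile["cooldown_minutes"] < 5:
        findings.append("cool-down shorter than 5 minutes risks flapping")
    # A scale-out threshold at or below typical utilisation triggers constantly.
    if profile["scale_out_cpu_pct"] <= profile["typical_cpu_pct"]:
        findings.append("scale-out threshold at or below typical utilisation")
    return findings

bad_profile = {"min_instances": 8, "cooldown_minutes": 2,
               "scale_out_cpu_pct": 40, "typical_cpu_pct": 45}
print(autoscale_findings(bad_profile, avg_instances_needed=3))
```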
The key point for design pattern decisions: they cannot be fixed retrospectively by right-sizing resources. They require an architecture conversation — ideally before the system is built, but the conversation is worthwhile at any stage.
Geography — The Decision Nobody Revisits
Region strategy is a legitimate cost lever that most organisations ignore entirely after the initial infrastructure setup. Azure pricing varies by 20–40% on some services between regions. The UK South vs West Europe cost difference for equivalent services is not negligible at production scale.
Non-production environments are the most straightforward opportunity. Production environments may have genuine data residency requirements — compliance obligations, contractual constraints, data sovereignty rules — that dictate region choice. Development and test environments almost never have the same requirements. If production must be in UK South, there is typically no compliance reason for the dev environment to also be in UK South. Running non-production workloads in a paired region with lower pricing does not create compliance exposure; it creates cost reduction. The failure is an assumption, applied uniformly across all environments, that was only verified for production.
DR posture is the geography decision that produces the most sustained avoidable cost. Hot standby configurations — full production specification running continuously in a secondary region — are the default deployment pattern for many teams, because they are the safest thing to configure when you do not want to have a recovery conversation at 2am. But the right DR posture depends on actual RTO and RPO requirements, not on what was easiest to configure. A warm secondary — resources pre-provisioned at reduced scale, promoted during a failover event — meets the RTO/RPO requirements of many workloads at a fraction of the cost of a hot standby. A cold secondary, with infrastructure as code and a tested runbook, meets the requirements of workloads where hours of recovery time is acceptable. The most expensive Azure decision nobody questions is almost always the DR posture that was configured for convenience rather than requirement.
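The cost gap between the three postures is easiest to see side by side. The production figure and the scale fractions below are assumptions for illustration; the real numbers depend entirely on the workload.

```python
# Illustrative monthly secondary-region cost for three DR postures.
# The production cost and the scale fractions are assumptions.

production_monthly = 10_000.0  # assumed monthly cost of the primary region

postures = {
    # Full production specification running continuously in the secondary.
    "hot": production_monthly * 1.0,
    # Pre-provisioned at reduced scale, promoted during a failover event.
    "warm": production_monthly * 0.3,
    # Infrastructure as code plus replicated data; compute provisioned on demand.
    "cold": production_monthly * 0.05,
}

for posture, cost in postures.items():
    print(f"{posture}: {cost:.0f}/month secondary-region cost")
```

The decision input is the RTO/RPO requirement, not the cost table; the table just makes visible what the convenience default is costing.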
The data sovereignty assumption — that all environments must run under the same region constraints as production — is worth challenging explicitly in every architecture review. The assumption is frequently wrong, and the cost of not challenging it accumulates at every environment below production.
Commitment — Paying for Flexibility You Are Not Using
Reserved instances and savings plans represent the highest-leverage, fastest-payback cost intervention available on Azure. A 1-year reservation reduces compute cost by 30–40% compared to pay-as-you-go. A 3-year reservation reduces it by 50–72%. Most organisations’ reservation coverage is significantly below where it should be, because nobody made an active decision to leave workloads on pay-as-you-go — they simply never made a decision at all.
The framework for commitment decisions is a three-category workload classification. Stable baseline workloads — running continuously, specification unchanged for more than six months, no architectural change planned in the next 12 months — are reservation candidates. The question is not “should we commit?” It is “what is the right reservation term?” Variable but bounded workloads — those that scale but within a predictable range — are savings plan candidates, where the commitment is to a spend level rather than a specific resource configuration. Genuinely unpredictable workloads — new services, experimental architectures, workloads in active development — stay on pay-as-you-go, where flexibility has genuine value.
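The three-category classification above is mechanical enough to sketch as code. The thresholds mirror the text; the function and parameter names are illustrative, not part of any Azure tooling.

```python
# Sketch of the three-category commitment classification described above.
# Thresholds follow the text; the data shape is illustrative.

def commitment_category(months_unchanged, runs_continuously,
                        change_planned_within_12m, demand_range_predictable):
    # Stable baseline: running continuously, unchanged 6+ months, no change planned.
    if (runs_continuously and months_unchanged >= 6
            and not change_planned_within_12m):
        return "reservation"      # the question becomes which term, not whether
    # Variable but bounded: commit to a spend level, not a configuration.
    if demand_range_predictable:
        return "savings-plan"
    # Genuinely unpredictable: flexibility has real value here.
    return "pay-as-you-go"

print(commitment_category(months_unchanged=14, runs_continuously=True,
                          change_planned_within_12m=False,
                          demand_range_predictable=True))
```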
The psychological barrier to commitment is real and consistently underestimated. Organisations frame reservations as “locking in spend” rather than “eliminating a premium we are paying for optionality we are not using.” The reframe matters: you are not committing to new spend by purchasing a reservation. You are stopping payment for flexibility you already proved you are not using, because the workload has been running unchanged for a year. See The Hidden Cost of Cloud Flexibility for the full argument on this framing, including the distinction between the flexibility types you actually need and the ones you are paying for unnecessarily.
Reservation portfolio management is an ongoing discipline, not a one-time purchase. Reservations need to be reviewed quarterly: Are utilisation rates above 85%? Are any reservations covering resources that no longer exist? Are there new stable workloads that have emerged since the last review that are not yet covered? Azure Cost Management’s reservation recommendations are the right starting point for each review cycle. They are not a substitute for architectural judgement about what is genuinely stable vs what only appears stable in a short observation window.
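The quarterly review questions reduce to a small set of checks. The record shape below is illustrative, not the output format of Azure Cost Management, and the 85% threshold follows the text.

```python
# Sketch of the quarterly reservation portfolio review as checks.
# The reservation records are illustrative, not Cost Management output.

def review_reservations(reservations, live_resource_ids):
    actions = []
    for r in reservations:
        # Low utilisation means the commitment is not being consumed.
        if r["utilisation_pct"] < 85:
            actions.append(f"{r['id']}: utilisation below 85%, investigate")
        # A reservation covering only deleted resources is pure waste.
        if r["covers"] and not any(rid in live_resource_ids for rid in r["covers"]):
            actions.append(f"{r['id']}: covers no live resources, exchange or cancel")
    return actions

reservations = [
    {"id": "res-1", "utilisation_pct": 97, "covers": ["vm-a"]},
    {"id": "res-2", "utilisation_pct": 40, "covers": ["vm-gone"]},
]
print(review_reservations(reservations, live_resource_ids={"vm-a", "vm-b"}))
```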
Governance — The Enabler of Everything Else
Without governance, every other category is effectively operating blind. You cannot make good decisions about specification if you cannot attribute spend to workloads. You cannot manage a commitment portfolio if you cannot see which resources are running and who owns them. You cannot identify design pattern problems if cost data is not connected to the services that generated it.
Tagging is where governance starts and where it most commonly breaks down. Every Azure cost guide tells you to implement cost centre tags. Most teams have tags. Most tag implementations are incomplete, inconsistently applied, and not enforced in a way that produces reliable cost allocation data. The failure mode is not a missing tag policy — it is a tag policy in Audit mode, applied to the resources that were easy to tag, missing the expensive managed services and network components that were deployed by a pipeline that predated the tagging requirement. The governance value of a tag schema comes entirely from consistent enforcement. Azure Policy in Modify mode — not Audit mode — is the enforcement mechanism that produces reliable data. Modify appends required tags at resource creation time, which means every new resource lands with the correct tag rather than appearing as unallocated spend that someone has to attribute retrospectively. See Azure Policy for Cost Governance for the specific policy definitions and enforcement patterns that deliver the majority of governance value.
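A minimal sketch of the Modify-mode rule described above, in the shape of the built-in "inherit a tag from the resource group" pattern: when a resource is created without the tag, the policy appends it from the parent resource group. The `costCentre` tag name is an example, and the GUID is the built-in Contributor role, which the Modify remediation identity requires; this is the `policyRule` fragment only, not a complete policy definition.

```json
{
  "if": {
    "field": "tags['costCentre']",
    "exists": "false"
  },
  "then": {
    "effect": "modify",
    "details": {
      "roleDefinitionIds": [
        "/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c"
      ],
      "operations": [
        {
          "operation": "addOrReplace",
          "field": "tags['costCentre']",
          "value": "[resourceGroup().tags['costCentre']]"
        }
      ]
    }
  }
}
```

The design choice worth noting: inheriting from the resource group means the tagging decision is made once, at the resource group level, rather than relied upon in every pipeline and portal deployment.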
Cost accountability is the governance question that produces the most discomfort and the most improvement. In most organisations, the person who specifies the architecture and the person who receives the bill are in different teams with different incentives. This is not a people problem — it is a structural problem. Architects optimise for delivery and reliability. Cost is downstream. Finance optimises for budget control. Architecture is upstream and opaque to them. The structural fix is making architects accountable for the cost consequences of their decisions, not by adding a reporting burden, but by making cost a first-class output of the architecture review — a number that is estimated before the architecture is committed, reviewed at every promotion gate, and tracked quarterly. See How Cloud Cost Becomes Someone Else’s Problem for the ownership model that makes this work in practice.
The monthly cost review cadence connects governance to action. What it should look like: architects in the room, not just finance; workload-level cost data, not subscription aggregates; trend analysis, not just point-in-time comparison; and a defined action log with named owners for any cost that is not explained and justified. See The Azure Cost Review Every Board Should Have for the structure that produces decisions rather than observations.
Alert configuration is the operational layer of governance — catching spend anomalies between review cycles. The failure mode is default configuration: subscription-level budget alerts routed to subscription owners, triggering at arbitrary percentage thresholds, reaching inboxes that nobody monitors. The configuration that produces signal rather than noise routes workload-level alerts to workload owners, combines budget alerts with anomaly detection, and defines the expected response for each alert type before enabling it. See Azure Cost Anomaly Alerts That Actually Work for the configuration approach.
For the leadership framing — how to take this architecture argument into a board or CTO conversation — see Three Questions Your CTO Should Ask About Your Azure Workload, Your Azure Bill Is an Architecture Problem, and The Azure Cost Review Every Board Should Have.
For the cultural and structural arguments — why FinOps is a symptom, why architecture debt produces cost debt, and why the most expensive cloud decisions are the ones nobody questions — see FinOps Is What Happens When Architecture Decisions Aren’t Deliberate, Stop Calling It Cloud Waste — It’s Architecture Debt, and The Most Expensive Azure Decision Nobody Questions.
Where to Start
An architect inheriting an existing Azure environment with a mandate to improve cost performance has a consistent set of high-leverage starting points. The sequence matters as much as the actions.
1. Run Azure Advisor and categorise by framework area. Before optimising anything, understand the shape of the problem. Advisor recommendations that are predominantly VM rightsizing indicate a specification problem. A high proportion of idle resource recommendations indicates an abandonment and governance problem. Reservation recommendations indicate a commitment problem. The category distribution tells you where to focus first.
2. Review commitment coverage. This is the highest-leverage intervention with the fastest payback and the lowest disruption. Pull reservation utilisation and coverage data from Azure Cost Management. If coverage is below 60% for compute that has been running more than six months, you have a near-certain saving that does not require any architectural change, just a procurement decision. Commit the baseline before optimising anything else.
3. Audit the governance baseline. Check tag compliance in Azure Policy. If compliance is below 80%, your cost allocation data is unreliable — and every decision you make based on it is suspect. Establish reliable data before drawing conclusions from it. This includes checking that alert routing is going to people who can act on it, not distribution lists and service accounts.
4. Review the three highest-cost workloads for specification and design pattern issues. The top 20% of workloads typically account for 80% of the spend. A focused review of the highest-cost workloads — using the five-category framework — will surface the majority of avoidable cost faster than a broad audit of everything. For each workload, ask: is the specification justified by measured demand? Is there an autoscaling policy? Is the design pattern synchronous where async would serve the same requirement? Is the region selection applied uniformly to all environments without verifying the requirement?
5. Establish the review cadence before optimising anything. This is the step most teams skip, and it is the one that determines whether improvements persist. Optimisation without a review process reverts. Resources drift back to over-provisioned states. Reservations expire without renewal. Tag compliance decays. The monthly cost review — with architects, with workload-level data, with a defined action log — is the mechanism that makes every other improvement permanent.
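The coverage threshold in step 2 can be sketched as a single check. The figures would come from Azure Cost Management's reservation coverage report; the 60% threshold follows the text and the example numbers are illustrative.

```python
# Sketch of the step-2 commitment coverage check. Figures are
# illustrative; real data comes from Azure Cost Management.

def coverage_finding(covered_compute_cost, total_stable_compute_cost):
    coverage_pct = covered_compute_cost / total_stable_compute_cost * 100
    if coverage_pct < 60:
        return (f"coverage {coverage_pct:.0f}%: near-certain saving available "
                "on the uncovered stable baseline")
    return f"coverage {coverage_pct:.0f}%: review remaining uncovered workloads"

# Example: 25k of 100k monthly stable compute covered by reservations.
print(coverage_finding(25_000, 100_000))
```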
Cost as Architecture Discipline
Cost-optimised architecture is not about building cheaper systems. It is about building systems where every cost decision is deliberate — where the trade-off between cost, performance, and reliability is made consciously rather than by default. The VM tier is chosen because it matches measured demand, not because it was the default in a Terraform module. The design pattern is synchronous because the use case genuinely requires real-time response, not because async was harder to implement. The region was selected with full awareness of the pricing implications, and the decision was re-examined when the workload profile changed. The commitment level reflects the actual stability of the workload, not the path of least resistance.
The organisations that do this well have lower Azure bills not because they are more frugal, but because they understand what they are paying for and why. The bill is a readable document. Every line connects to a decision that was made deliberately, by someone who knew what it would cost. That understanding starts in the architecture process — not in the finance review, not in a FinOps programme, not in a retrospective after the bill has arrived.
Ready to make cost a first-class concern in your Azure architecture? Get in touch to talk through your current architecture and where the biggest opportunities are.
Explore the full Cost-Optimised Cloud series for practitioner-depth guides on each area covered here.