Azure Policy has hundreds of built-in definitions. Cost governance has its own dedicated policy categories within the portal. And most engineering teams have zero cost-related policies enforced in their subscriptions — they have audit policies that surface violations nobody acts on, or they have nothing at all. This post is about three policies that, enforced correctly, prevent the most common cost architecture failures before they become budget conversations.
Why Audit Mode Is Not Governance
Azure Policy has three effects worth understanding for cost governance (the portal calls the assignment-level switch "enforcement mode," but the behaviour lives in the effect): Audit surfaces violations in a compliance report. Deny blocks non-compliant resource creation. Modify remediates at resource creation time, adding or replacing properties on resources before they land.
The mistake most teams make when implementing cost policies is defaulting to Audit mode because Deny “might break something.” That logic is understandable in the early days of a governance programme. It becomes self-defeating if Audit becomes the permanent state.
Audit mode cost policies are a reporting tool. They tell you what is wrong after it has been created. The resources already exist. The cost is already accumulating. A Deny policy prevents the problem from being created in the first place — which is the only enforcement posture that actually prevents cost rather than documenting it after the fact.
The right approach is to treat Audit as a discovery phase, not a destination. Run the policy in Audit for two to four weeks to understand the blast radius: how many existing resources are non-compliant, which resource groups are affected, and which teams need to be informed. Then move to Deny or Modify once you have confidence in the exclusion scope. Every governance programme I have seen that left cost policies permanently in Audit mode ended up with compliance reports that nobody read and costs that nobody controlled.
[DAN: Add your experience with the Audit-to-Deny transition — specifically how you handled the exemption process for legitimate edge cases and what the most common exemption requests were. In practice, the exemption conversation surfaces the cases that genuinely need different treatment from those that are just resistance to change, and that distinction matters for how you scope the policy going forward. This is where most governance initiatives stall.]
Policy 1: Restrict Expensive VM SKUs
The policy: deny creation of VM SKUs above a defined cost threshold without an explicit approval tag on the resource.
The built-in Azure Policy definitions for this — “Not allowed resource types” and “Allowed virtual machine size SKUs” — are blunt instruments. An allowlist of permitted SKUs means you must maintain the list as new SKU families are released, and a single blocklist covers everything regardless of context. A tag-based approach is more operationally sustainable.
The implementation: deny creation of any VM in the Ev5, Mv2, or DC series without a tag approved-sku: true set on the resource. The policy uses a compound condition — SKU family match AND absence of the approval tag — so that the expensive SKU series are blocked by default but immediately unblocked by adding the tag. This creates a friction point rather than a hard block. Deploying an E96v5 requires someone to consciously add an approval tag, which creates an audit trail of deliberate decisions rather than accidental over-provisioning.
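A sketch of the policy rule, assuming illustrative SKU name patterns — the `like` patterns and the `approved-sku` tag name here match the series named above, but you should adjust them to the SKU families you actually want to gate. The alias `Microsoft.Compute/virtualMachines/sku.name` is the same one used by the built-in "Allowed virtual machine size SKUs" definition:

```json
{
  "mode": "Indexed",
  "policyRule": {
    "if": {
      "allOf": [
        { "field": "type", "equals": "Microsoft.Compute/virtualMachines" },
        {
          "anyOf": [
            { "field": "Microsoft.Compute/virtualMachines/sku.name", "like": "Standard_E*_v5" },
            { "field": "Microsoft.Compute/virtualMachines/sku.name", "like": "Standard_M*" },
            { "field": "Microsoft.Compute/virtualMachines/sku.name", "like": "Standard_DC*" }
          ]
        },
        { "field": "tags['approved-sku']", "notEquals": "true" }
      ]
    },
    "then": { "effect": "deny" }
  }
}
```

The compound condition is the important part: the deny only fires when an expensive SKU pattern matches and the approval tag is absent, so adding the tag unblocks the deployment without any policy change.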
Assign this at the subscription level with exemptions for the specific resource groups that legitimately house high-memory production workloads. Scoping exemptions to individual resource groups, rather than weakening the subscription-wide assignment, keeps the policy meaningful for the 90% of resource groups where expensive SKUs are genuinely unexpected.
The pushback you will get is predictable: “But we might need those SKUs.” The answer is simple — you can still use them. You just need to tag them as approved first. This is not a capability restriction; it is a decision-forcing function. The approval tag is the signal that someone with cost awareness signed off on the SKU choice, rather than it being a default selection in a Terraform module that nobody questioned.
Combine this policy with Azure Advisor’s VM right-sizing recommendations. The policy prevents new over-provisioned VMs from being created; Advisor surfaces the existing ones. Together they address the problem at both the creation point and the remediation point.
Policy 2: Enforce Auto-Shutdown on Non-Production Resources
The policy: require a shutdown-schedule tag on all VMs in non-production subscriptions or resource groups, and back it with an Azure Automation runbook that enforces the schedule.
The problem this solves is straightforward arithmetic. A VM running 168 hours per week costs three times as much as the same VM running 56 business hours. Non-production environments — dev, test, staging — are rarely accessed outside business hours. The VMs run overnight and through weekends because nobody set up a shutdown schedule, not because there is a legitimate need for 24/7 availability.
Implementation: tag shutdown-schedule: weekdays-1900 on all non-production VMs, backed by an Automation runbook that shuts down tagged VMs at the scheduled time and starts them on a morning schedule if your teams need that. The policy component denies creation of VMs in non-production resource groups without the shutdown-schedule tag present. This means every new VM created in a non-production environment must declare its shutdown schedule at creation time — it cannot be added as an afterthought, because the VM cannot be created without it.
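The policy component is a simple existence check, sketched below. Scoping to non-production happens at assignment time (assign to the non-production subscription or management group), not in the rule itself; the `shutdown-schedule` tag name follows the convention above:

```json
{
  "mode": "Indexed",
  "policyRule": {
    "if": {
      "allOf": [
        { "field": "type", "equals": "Microsoft.Compute/virtualMachines" },
        { "field": "tags['shutdown-schedule']", "exists": "false" }
      ]
    },
    "then": { "effect": "deny" }
  }
}
```

The runbook side reads the tag value at the scheduled time and stops any VM whose schedule has passed; the policy only guarantees that the tag exists for the runbook to act on.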
[DAN: Add what percentage of non-production compute cost you have recovered by implementing shutdown schedules in practice. Even a rough order of magnitude is useful here — if you recovered 40% of non-production compute spend, that number gives weight to what is otherwise a straightforward-sounding change. Also note any legitimate exceptions and how you handled them.]
The edge case to design for is build agents and CI/CD infrastructure. These legitimately run overnight for scheduled build jobs, extended test suites, or pipeline work that runs outside business hours. The clean solution is a separate resource group for CI/CD infrastructure, excluded from the policy scope, rather than building exceptions into the tag value scheme. A shutdown-schedule: ci-exempt tag value that the runbook ignores works, but it creates an escape hatch that teams will abuse. A dedicated resource group with a clear ownership and usage pattern is cleaner and easier to audit.
Policy 3: Require Cost-Centre Tag at Resource Creation
The policy: a Modify mode policy that appends a cost-centre tag at resource creation if one is not already present, defaulting to an unallocated value that is deliberately visible in cost reports.
The choice of Modify mode rather than Deny is intentional. A missing cost tag should not block resource creation — that creates friction that drives teams to work around governance rather than with it. But a missing cost tag should produce a visible signal in cost reports, and that signal needs to be unambiguous.
The unallocated default is the key design decision. Resources without a cost tag do not disappear into the billing data; they appear as a distinct line item in cost reports. Unallocated cost then becomes a metric that workload owners are accountable for reducing, rather than a data quality problem that finance tries to solve by reverse-engineering resource names. The accountability shifts: instead of asking “where did this cost come from,” the question becomes “why does your team have unallocated resources this month.”
Schema discipline matters here and is consistently underestimated. The tag value should come from a defined list of valid cost centres, held as a policy parameter and checked in the policy rule. Free-text cost-centre tags produce fifteen spellings of "Infrastructure," three versions of "Finance-UK," and a handful of values that nobody can trace to a real cost centre. Pairing the Modify policy with a companion Deny that checks the tag value against the parameterised list means the valid cost centres are encoded in the policy assignment, and any value outside that list is rejected at creation time.
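A sketch of the Modify policy that supplies the unallocated default. The role definition ID shown is Tag Contributor, which the remediation identity needs in order to write tags; `unallocated` and the `cost-centre` tag name follow the convention above:

```json
{
  "mode": "Indexed",
  "policyRule": {
    "if": { "field": "tags['cost-centre']", "exists": "false" },
    "then": {
      "effect": "modify",
      "details": {
        "roleDefinitionIds": [
          "/providers/microsoft.authorization/roleDefinitions/4a9ae827-6dc8-4573-8ac7-8239d42aa03f"
        ],
        "operations": [
          { "operation": "add", "field": "tags['cost-centre']", "value": "unallocated" }
        ]
      }
    }
  }
}
```

A companion Deny with `"field": "tags['cost-centre']", "notIn": "[parameters('allowedCostCentres')]"` in its condition handles the present-but-invalid case: the Modify fills gaps, the Deny rejects junk values.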
Pair this with a monthly report that makes the unallocated bucket concrete: “Unallocated resources this month: £X, distributed across these resource groups: [list].” Send that report to the owners of those resource groups. The accountability loop closes when the people who create resources can see the cost impact of not tagging them correctly.
Making Policies Stick
Policies enforced in Deny mode will cause friction when first introduced. That friction is the point: it means the policies are working. But policies without an exemption process get bypassed through workarounds rather than through legitimate exceptions, and that produces worse outcomes than no policy at all.
Define the exemption process before you enforce: who can request an exemption, what the approval criteria are, how long the exemption lasts, and how exemptions are reviewed. Time-limited exemptions — 90 days, renewable — are more maintainable than permanent exemptions that accumulate quietly over years. The exemption process also surfaces the legitimate edge cases that the policy definition did not account for, which feeds back into better policy scoping.
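Time-limited exemptions are a first-class Azure construct, not a process convention. A sketch of an exemption resource with an expiry, assuming a hypothetical assignment name and placeholder subscription ID (the `apiVersion` shown is one of the published preview versions; check the current one for your tooling):

```json
{
  "name": "ci-cd-vm-sku-waiver",
  "type": "Microsoft.Authorization/policyExemptions",
  "apiVersion": "2022-07-01-preview",
  "properties": {
    "policyAssignmentId": "/subscriptions/<subscription-id>/providers/Microsoft.Authorization/policyAssignments/restrict-vm-skus",
    "exemptionCategory": "Waiver",
    "expiresOn": "2026-03-31T00:00:00Z",
    "description": "High-memory build agents in the CI/CD resource group; review quarterly."
  }
}
```

Because `expiresOn` is enforced by the platform, a lapsed exemption re-exposes the resources to the policy automatically — the review conversation happens on renewal rather than never.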
Start enforcement in non-production environments first. The learning from non-prod — unexpected failures, cases where the policy is too broad, teams that need different handling — prevents a painful rollout in production. Non-production is also where teams build their muscle memory around the new constraints before those constraints affect production deployments.
The right signal that governance is working is a decreasing trend in both Deny violations and Audit findings over time. Deny violations should fall as teams adapt their deployment processes to comply. Audit findings should fall as teams proactively remediate existing non-compliant resources. If violations are stable or increasing after the first month of enforcement, the policy is not changing behaviour — and enforcement without education and tooling support rarely changes behaviour on its own.
[DAN: Add your experience with the policy rollout sequence — specifically whether you started with non-prod, how you communicated the policies to engineering teams, and what the initial reaction was. The change management side of policy governance is consistently harder than the technical implementation, and the specifics of how you framed the policies to engineering teams (as guardrails rather than restrictions) makes a material difference to adoption.]
Three policies, enforced correctly, prevent the majority of cost architecture failures before they happen. The challenge is not finding the right policies — it is committing to Deny mode rather than the comfortable Audit mode that produces reports without accountability. The governance value of Azure Policy comes entirely from enforcement.