Smart decisions under pressure are how most Azure environments start drifting. A policy exemption to unblock a release. A quick fix through the portal at 2 AM to stop an outage. Temporary elevated access so an engineer can resolve an incident faster.
Each one is the right call at the time. But once the pressure passes, these shortcuts rarely get a second look.
The exemption remains active with no expiry date. The quick fix lives outside your codebase, where the next deployment cannot see it. The temporary access quietly becomes permanent because the incident is closed and no one circled back.
They accumulate into a backlog of unresolved exceptions sitting beneath your governance posture, unnoticed until an audit, an outage, or a cost anomaly surfaces them.
In this edition, I’ll break down how this exception debt compounds across policy, configuration, and access, and what it takes to manage its lifecycle.
Stay updated with Simform’s weekly insights.
Your compliance dashboard is green. Your policies aren’t enforcing anything.
The problem
Azure’s compliance reporting has a blind spot that most teams discover too late. Exempted resources count toward your compliance score while bypassing actual evaluation.
So the dashboard shows green because those resources are counted alongside the compliant ones without ever being evaluated, not because they’re compliant.
Two mechanisms feed this. First, most policy exemptions lack an expiration date. They’re approved to unblock a release or a migration, the work ships, and the exemption stays active indefinitely because no one owns its closure.
Second, teams routinely set policies to audit-only during rollouts so violations get logged without blocking deployments. The intent is to switch to enforce mode once the environment stabilizes. That switch rarely happens.
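To make the exemption blind spot concrete, here is a minimal sketch of the arithmetic. It assumes the overall score is calculated as compliant plus exempt plus unknown resources divided by all resources, which is my reading of Microsoft's compliance documentation; check the current docs before relying on it. The function names and resource counts are hypothetical.

```python
def reported_percentage(compliant, exempt, non_compliant, unknown=0, conflicting=0):
    """Score as the dashboard would show it, ASSUMING exempt and unknown
    resources are counted in the numerator alongside compliant ones."""
    total = compliant + exempt + non_compliant + unknown + conflicting
    if total == 0:
        return 100.0
    return 100.0 * (compliant + exempt + unknown) / total


def worst_case_percentage(compliant, exempt, non_compliant, unknown=0, conflicting=0):
    """Score if every exempt or unknown resource turned out to be non-compliant."""
    total = compliant + exempt + non_compliant + unknown + conflicting
    if total == 0:
        return 100.0
    return 100.0 * compliant / total


# Hypothetical environment: 100 resources, 35 of them exempted.
print(reported_percentage(compliant=60, exempt=35, non_compliant=5))    # 95.0
print(worst_case_percentage(compliant=60, exempt=35, non_compliant=5))  # 60.0
```

The gap between the two numbers is the share of the environment that nobody is evaluating.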
The impact
A SaaS company with 14 Azure subscriptions ran a routine compliance review ahead of a SOC 2 renewal. The review surfaced 23 active policy exemptions across production environments.
Of those, 17 had no expiration date, and four traced back to a migration completed nine months earlier. The remediation sprint took three weeks and delayed the certification by a full audit cycle.
The shift
Treat every exemption as a tracked work item with an owner, a review date, and a remediation deadline. An exemption is closed only when the underlying condition is resolved and the policy is restored to enforcement.
Teams that manage this well flag any exemption older than 90 days for re-justification and treat the count of active exemptions as an operational metric.
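The review rule above can be sketched as a small check over exported exemption records. This is an illustration under assumptions, not a definitive implementation: the record fields (`name`, `owner`, `created_at`, `expires_on`) are hypothetical and would need to be mapped from whatever your exemption export actually contains.

```python
from datetime import datetime, timedelta, timezone

REVIEW_AFTER = timedelta(days=90)  # re-justification threshold from the text


def flag_exemptions(exemptions, now=None):
    """Return exemptions that need attention, with the reasons why."""
    now = now or datetime.now(timezone.utc)
    flagged = []
    for ex in exemptions:
        reasons = []
        if ex.get("expires_on") is None:
            reasons.append("no expiration date")
        if now - ex["created_at"] > REVIEW_AFTER:
            reasons.append("older than 90 days")
        if not ex.get("owner"):
            reasons.append("no owner")
        if reasons:
            flagged.append({"name": ex["name"], "reasons": reasons})
    return flagged


# Hypothetical records for illustration only.
SAMPLE_NOW = datetime(2025, 6, 1, tzinfo=timezone.utc)
SAMPLE_EXEMPTIONS = [
    {"name": "migration-waiver", "owner": None,
     "created_at": datetime(2024, 9, 1, tzinfo=timezone.utc), "expires_on": None},
    {"name": "release-unblock", "owner": "platform-team",
     "created_at": datetime(2025, 5, 10, tzinfo=timezone.utc),
     "expires_on": datetime(2025, 7, 1, tzinfo=timezone.utc)},
]

for item in flag_exemptions(SAMPLE_EXEMPTIONS, now=SAMPLE_NOW):
    print(item["name"], item["reasons"])
```

The length of the returned list is the number you would trend as the operational metric.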
Your templates say one thing. Your environment does another.
The problem
Engineers fix production incidents in the moment. Someone scales up a VM through the Azure portal to stop a performance issue. The incident is resolved, and the change stays exactly where the engineer left it: in the live environment, outside the infrastructure code that defines it.
The Terraform or Bicep template still describes the original configuration, while the live environment reflects the fix. The next automated deployment either overwrites the fix and reintroduces the problem or skips the resource entirely because someone flagged it to avoid that conflict.
Enough of these accumulate, and engineers start checking the portal before deploying because they know the code no longer reflects reality. That’s when manual changes stop being exceptions and become the default operating mode.
The impact
A VM that scales up during an incident and never scales back down is a line item nobody questions because it looks intentional. Multiply that by a dozen manual changes over a year, and you have spend that appears in no one’s plan and passed no one’s review. Disaster recovery suffers too, because it depends on templates that no longer match what’s actually running.
The shift
Reconcile every change made outside the deployment pipeline back into the codebase within a defined window, and track it the same way you’d track any unresolved production issue. Treat unreconciled drift as an open incident with an owner, a deadline, and visibility.
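As a rough sketch of that discipline, the snippet below compares a declared configuration with the live one and turns each difference into a tracked item with an owner and a deadline. Everything here is an assumption for illustration: the dictionary shapes, the five-day window, and the resource names are hypothetical, and in practice the comparison would come from your infrastructure-as-code tooling's own plan or what-if output.

```python
from datetime import datetime, timedelta, timezone

RECONCILE_WINDOW = timedelta(days=5)  # hypothetical reconciliation window


def find_drift(declared, live):
    """List every setting where the live environment differs from the code."""
    drift = []
    for resource in sorted(set(declared) | set(live)):
        want = declared.get(resource, {})
        have = live.get(resource, {})
        for setting in sorted(set(want) | set(have)):
            if want.get(setting) != have.get(setting):
                drift.append({
                    "resource": resource,
                    "setting": setting,
                    "declared": want.get(setting),
                    "live": have.get(setting),
                })
    return drift


def open_drift_items(drift, owner, detected_at):
    """Attach an owner and a reconciliation deadline to each drift entry."""
    return [dict(item, owner=owner, deadline=detected_at + RECONCILE_WINDOW)
            for item in drift]


# Hypothetical example: a VM resized in the portal during an incident.
DECLARED = {"vm-api-01": {"size": "Standard_D4s_v5", "disk_gb": 128}}
LIVE = {"vm-api-01": {"size": "Standard_D8s_v5", "disk_gb": 128}}
DETECTED = datetime(2025, 6, 1, tzinfo=timezone.utc)

for item in open_drift_items(find_drift(DECLARED, LIVE), "on-call-engineer", DETECTED):
    print(item["resource"], item["setting"], item["declared"], "->", item["live"],
          "due", item["deadline"].date())
```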
The access you granted during the last outage is still active
The problem
During an incident, engineers need elevated permissions fast. Someone gets Owner-level access to a subscription, or an app registration gets broad API permissions to enable a workaround.
The incident resolves, the post-mortem happens, and the access remains because no one created a ticket to revoke it.
Datadog’s study found that 46% of Microsoft Entra ID applications still had active credentials that were older than 1 year.
In mid-market teams where the same person handles deployments, incident response, and audit prep, reviewing who still has access to what loses to everything else on the list.
The impact
The Global Incident Response Report found that identity weaknesses played a material role in nearly 90% of investigations across more than 750 engagements.
Standing access that should have been temporary expands your attack surface incrementally. It looks normal in the directory because someone with the right authority granted it, and it stays undetected until a penetration test surfaces it or an attacker uses it.
The shift
Time-bound elevation should be the default. Azure supports this natively through Privileged Identity Management, where roles activate for a set duration and deactivate automatically.
Teams that do this well require a ticket reference for every elevation, auto-expire assignments that aren’t renewed, and investigate recurring elevation requests.
If the same engineer activates Owner access to the same subscription every sprint, that signals the baseline role design needs fixing.
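Those two checks, missing ticket references and recurring elevations, can be sketched over an exported activation log. This is a minimal illustration under assumptions: the record fields (`principal`, `role`, `scope`, `ticket`) and the threshold of three activations are hypothetical and do not reflect the actual format of a Privileged Identity Management audit export.

```python
from collections import Counter


def review_elevations(activations, recurring_threshold=3):
    """Return (activations with no ticket, recurring principal/role/scope combos)."""
    missing_ticket = [a for a in activations if not a.get("ticket")]
    counts = Counter((a["principal"], a["role"], a["scope"]) for a in activations)
    recurring = sorted(combo for combo, n in counts.items() if n >= recurring_threshold)
    return missing_ticket, recurring


# Hypothetical activation log for illustration only.
SAMPLE_ACTIVATIONS = [
    {"principal": "alice", "role": "Owner", "scope": "sub-prod", "ticket": "INC-101"},
    {"principal": "alice", "role": "Owner", "scope": "sub-prod", "ticket": "INC-117"},
    {"principal": "alice", "role": "Owner", "scope": "sub-prod", "ticket": None},
    {"principal": "bob", "role": "Contributor", "scope": "sub-dev", "ticket": "CHG-042"},
]

missing, recurring = review_elevations(SAMPLE_ACTIVATIONS)
print(len(missing), "activation(s) without a ticket")
print("recurring:", recurring)
```

A combination that shows up in the recurring list is a prompt to revisit the baseline role rather than a reason to penalize the engineer.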
Microsoft’s Cloud Adoption Framework now explicitly recommends treating quick fixes and exemptions as managed backlog items, with ownership, tracking, and closure. Exceptions are a normal part of running Azure at scale. The question is whether yours have a lifecycle or just a start date.
If your environment has accumulated exceptions that need a lifecycle, here’s how we structure that operating model.