Azure Cost Management and Service Level Agreements for AZ-900

1. Introduction: Why Cost and Availability Matter in Azure

AZ-900 isn’t just testing whether you know Azure features. It’s also checking whether you understand that cloud choices are business decisions just as much as technical ones. Cost, governance, and availability all pull on each other. Usually, the more resilience you build in, the more you’ll spend. And if you chase the lowest cost too aggressively, you can end up taking on more operational risk than you expected.

A core concept is CapEx versus OpEx. CapEx is upfront spending on assets such as servers and storage hardware. OpEx is ongoing spending over time. Cloud does tend to move organizations toward OpEx, but Azure isn’t just pay-for-what-you-use in the simplest sense. You’ll also run into provisioned-capacity, license-based, and commitment-based pricing, like reservations and savings plans. In real-world environments, that flexibility only works well when you’ve got governance in place to keep things from drifting out of control.

2. Azure Cost Drivers and Pricing Models

Azure cost comes down to a few basics: what you deploy, where you deploy it, how long it stays running, and how it’s licensed. Common pricing dimensions include:

Compute: VM family, size, operating system, runtime hours, and whether pricing is pay-as-you-go, reserved, savings-plan eligible, or Spot.
Storage: capacity used, redundancy choice, access tier, transaction volume, snapshots, and backup retention.
Databases: provisioned compute or service tier, storage, backup retention, and high-availability options.
Networking: outbound internet egress, inter-region transfer, VPN Gateway, ExpressRoute, load balancer, NAT Gateway, and sometimes cross-zone traffic depending on the service design.

Region matters because Azure prices vary by region. Region choice also affects latency, data residency, and service availability. For data transfer, think in terms of bandwidth pricing zones and transfer paths rather than the vague term “billing zone”: inbound data transfer is generally free, outbound internet egress is typically charged, and some inter-region transfers also add cost.

Licensing is another major factor. Azure Hybrid Benefit can reduce costs for eligible Windows Server and SQL Server workloads, but eligibility depends on the specific product and licensing terms. When planning a migration, verify the exact entitlement instead of assuming every existing license will automatically qualify.

Azure does not use a single pricing model. It uses consumption-based, provisioned-capacity, license-based, and commitment-based pricing. A serverless function may bill per execution and execution time, while a managed database may bill for provisioned compute even when demand is low.

A useful exam distinction is IaaS vs PaaS vs SaaS. IaaS often gives maximum control but more administrative overhead. PaaS may appear more expensive than a basic VM line item, but total cost of ownership can be lower after patching, backup, scaling, and operational effort are considered. SaaS shifts even more responsibility to the provider.

Example: a continuously running internal Windows VM with steady demand may be a good fit for reservation-based discounts and possibly Azure Hybrid Benefit. A bursty event-driven process may be better on a consumption model because paying for idle VM hours would be wasteful.
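The trade-off in that example can be sketched with a little arithmetic. The snippet below is only an illustration: the hourly VM rate and the per-execution rate are made-up placeholders, not Azure prices, and the function names are hypothetical.

```python
# Hypothetical rates for illustration only; real Azure prices vary by
# region, SKU, and agreement.
VM_RATE_PER_HOUR = 0.10        # assumed price of an always-on VM
COST_PER_EXECUTION = 0.00002   # assumed consumption price per run
HOURS_PER_MONTH = 730          # average month


def monthly_vm_cost(rate_per_hour=VM_RATE_PER_HOUR):
    """An always-on VM bills for every hour, whether busy or idle."""
    return rate_per_hour * HOURS_PER_MONTH


def monthly_consumption_cost(executions, cost_per_execution=COST_PER_EXECUTION):
    """A consumption model bills only for work actually performed."""
    return executions * cost_per_execution


def cheaper_option(executions):
    """Return which model costs less for a given monthly execution count."""
    vm = monthly_vm_cost()
    consumption = monthly_consumption_cost(executions)
    return "consumption" if consumption < vm else "always-on VM"


print(cheaper_option(100_000))      # bursty, low-volume workload
print(cheaper_option(10_000_000))   # sustained, high-volume workload
```

Under these assumed rates the low-volume workload is cheaper on consumption pricing and the sustained one is cheaper on the VM. The crossover point moves with the real rates, so the pattern matters more than the numbers.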

3. Pricing Models and Cost Optimization Options

The main AZ-900 pricing choices are pay-as-you-go, reservations, Azure savings plan for compute, and Azure Spot Virtual Machines.

Option	How It Works	Best Fit	Key Caveat
Pay-as-you-go	Pay for usage with no long-term commitment	Short-term or unpredictable demand	Most flexible, usually least discounted
Reservations	Discounts for specific eligible resource SKUs/quantities over 1- or 3-year terms	Stable, predictable workloads	More specific and less flexible
Azure savings plan for compute	Lower prices for eligible compute services in exchange for a fixed hourly spend commitment for 1 or 3 years	Predictable compute spend with changing instance usage	Applies only to eligible compute services
Azure Spot VMs	Deep discounts on unused capacity	Batch, test, rendering, noncritical jobs	Can be evicted; not suitable for guaranteed continuity

Reservations apply only to eligible services and scopes, not to every Azure resource. Savings plans are also limited to eligible compute services and work differently: the commitment is an hourly spend amount rather than a reservation for a specific SKU. Spot VMs suit only interruption-tolerant workloads; they can be evicted due to capacity pressure or pricing conditions and should not be treated as highly available compute.
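One way to reason about commitment versus flexibility is a break-even calculation. The sketch below is a simplification: it assumes a reservation is paid for the whole term whether or not the resource runs, and the 40% discount is a hypothetical figure, not a published Azure rate.

```python
def break_even_utilization(discount):
    """Fraction of the term a resource must run for a commitment to pay off.

    The commitment costs (1 - discount) of the full pay-as-you-go price for
    the entire term. Pay-as-you-go costs utilization * full price. The two
    are equal when utilization == 1 - discount.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in the range [0, 1)")
    return 1 - discount


def commitment_saves_money(utilization, discount):
    """True when expected utilization is above the break-even point."""
    return utilization > break_even_utilization(discount)


# Hypothetical 40% discount: break-even is about 60% utilization.
print(commitment_saves_money(utilization=1.0, discount=0.4))  # 24/7 workload
print(commitment_saves_money(utilization=0.3, discount=0.4))  # part-time workload
```

With these assumptions, a workload that runs around the clock benefits from the commitment, while one that runs 30% of the time is cheaper on pay-as-you-go.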

When I’m helping teams optimize cost, I usually recommend this order: right-size first, clean up waste second, and only then apply discounts to the steady workloads that are left. Discounting an oversized resource only creates cheaper waste.

4. Cost Estimation Tools and Billing Scopes

Three Azure tools are commonly confused:

Azure Pricing Calculator: estimates the cost of a planned Azure deployment before it exists.
Azure TCO Calculator: compares estimated on-premises costs with Azure for migration business cases.
Azure Cost Management + Billing: analyzes actual spend, budgets, exports, forecasting, and cost visibility after resources are running.

Practical Pricing Calculator workflow: choose a service such as a VM, select region, operating system, size, expected hours, managed disk type, storage amount, and estimated outbound bandwidth. Then compare a second region or a different VM size to see how the estimate changes.
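The same inputs can be mirrored in a few lines of code to see how an estimate is assembled. This is a rough sketch, not how the Pricing Calculator works internally: the class, the region names, and every rate are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VmEstimate:
    """Inputs for a simple monthly estimate (all rates are placeholders)."""
    region: str
    vm_rate_per_hour: float
    disk_per_month: float
    egress_rate_per_gb: float

    def monthly_cost(self, hours, egress_gb):
        """Sum compute hours, a flat managed disk charge, and outbound data."""
        compute = self.vm_rate_per_hour * hours
        egress = self.egress_rate_per_gb * egress_gb
        return compute + self.disk_per_month + egress


# Two hypothetical regions with different assumed rates.
region_a = VmEstimate("region-a", vm_rate_per_hour=0.10,
                      disk_per_month=20.0, egress_rate_per_gb=0.08)
region_b = VmEstimate("region-b", vm_rate_per_hour=0.12,
                      disk_per_month=22.0, egress_rate_per_gb=0.09)

for estimate in (region_a, region_b):
    total = estimate.monthly_cost(hours=730, egress_gb=100)
    print(estimate.region, round(total, 2))
```

Changing one input at a time, such as the region or the expected hours, shows which dimension drives most of the estimate.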

Practical TCO workflow: enter current server count, storage, network assumptions, power/cooling, and virtualization details to compare on-premises costs with Azure. This supports migration conversations, not live spend analysis.

Cost Management + Billing workflow: use Cost Analysis to filter by subscription, resource group, service name, location, or tag; group by resource or service; review trends; create budgets; and export data to storage or reporting tools.

A subscription is a key management, deployment, access-control, and cost scope, but billing can also be viewed at higher billing-account scopes depending on the Azure agreement. For governance, remember the hierarchy: management groups > subscriptions > resource groups > resources. Policy and RBAC commonly apply at these scopes, while billing visibility may also exist above subscription level.

5. Cost Governance in Practice

Cost governance combines visibility, ownership, and control. The most important fundamentals are tags, budgets, Azure Policy, RBAC, and regular review.

Budgets are spending thresholds or targets with alerts. They do not automatically stop resources or enforce a hard spending cap. If an organization wants action at 80% or 100% of budget, it must pair alerts with automation such as Logic Apps, Azure Automation, Functions, or an operational process.
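The alert-only behavior can be illustrated with a small sketch. The function name and the figures are hypothetical, and this is not the Cost Management API. It only shows that crossing a threshold produces a notification, not an action.

```python
def crossed_thresholds(actual_spend, budget, thresholds=(0.8, 1.0)):
    """Return the budget thresholds that current spend has reached.

    Nothing here stops or deallocates resources. Any shutdown would have
    to be wired up separately through automation or a manual process.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = actual_spend / budget
    return [t for t in thresholds if used >= t]


# Hypothetical monthly budget of 1000 currency units.
print(crossed_thresholds(850, 1000))   # 80% alert fires
print(crossed_thresholds(1200, 1000))  # both alerts fire, spending continues
```

Note that the second call reports spend above the budget. The budget did not prevent it, which is exactly the behavior the exam expects you to recognize.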

Tags support cost allocation and showback/chargeback. A practical taxonomy is:

Environment=Dev/Test/Prod
Owner=TeamA
CostCenter=FIN001
Application=PayrollAPI

Tagging works only when it is applied consistently. Missing tags reduce reporting quality, and a tag added today generally does not recategorize costs that were already recorded, which often surprises beginners. That’s why enforcing tags through policy is so important.
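A quick consistency check against the taxonomy above might look like the following. This is a sketch over a hypothetical in-memory inventory: the resource names are invented, and a real review would read tags from an export or from Azure itself.

```python
# Required tag keys, taken from the taxonomy listed above.
REQUIRED_TAGS = {"Environment", "Owner", "CostCenter", "Application"}


def missing_tags(resource_tags):
    """Return the required tag keys that a resource does not carry."""
    return REQUIRED_TAGS - set(resource_tags)


# Hypothetical inventory: resource name -> tags.
resources = {
    "vm-payroll-01": {
        "Environment": "Prod",
        "Owner": "TeamA",
        "CostCenter": "FIN001",
        "Application": "PayrollAPI",
    },
    "disk-temp-07": {"Environment": "Dev"},
}

# Keep only resources with gaps, listing the missing keys alphabetically.
untagged = {
    name: sorted(missing_tags(tags))
    for name, tags in resources.items()
    if missing_tags(tags)
}
print(untagged)
```

A report like this finds gaps after the fact. Preventing them at deployment time is the job of policy enforcement.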

Azure Policy is not a cost tool by itself, but it can indirectly control cost by auditing, denying, appending, or deploying required settings. Common examples include requiring tags, limiting deployments to approved regions, and restricting VM SKUs so you don’t end up with expensive or noncompliant deployments.

Azure Advisor provides recommendations such as rightsizing or identifying underused resources, but customer action is still required to implement savings.

6. Common Hidden Azure Cost Sources and How I’d Troubleshoot a Spend Spike

Many Azure bill surprises come from resources that are not the main application service. Common hidden costs include managed disks, snapshots, backup vault usage, outbound bandwidth, NAT Gateway, public IP addresses, load balancer SKUs, Log Analytics ingestion and retention, and forgotten test resources.

A common VM mistake is assuming “stopped” means “not billed.” If a VM is shut down from inside the guest OS, it may still be allocated. Compute charges typically stop only when the VM is deallocated. Even after you stop or deallocate something, you’re not always off the hook. Managed disks, snapshots, backups, and certain networking components can still keep generating charges.

Diagnostic workflow for an unexpected bill spike:

Open Cost Analysis and compare this period to the previous one.
Group by service name, then by resource, to find the largest increase.
Filter by subscription, resource group, region, and tags to identify ownership.
Check Activity Log for newly created or resized resources.
Review Azure Advisor for underutilized resources.
Check Azure Monitor metrics for bandwidth, CPU, transactions, or log ingestion spikes.
I’d also look for orphaned assets: things like unattached disks, old snapshots, idle public IPs, or dev and test workloads that nobody has remembered to clean up yet.
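The first two steps of that workflow, comparing periods and grouping by service, can be sketched against exported cost rows. The record layout (`service`, `cost`) and all figures are hypothetical and do not reflect the actual export schema.

```python
from collections import defaultdict


def cost_by_service(records):
    """Total cost per service name for one billing period."""
    totals = defaultdict(float)
    for record in records:
        totals[record["service"]] += record["cost"]
    return dict(totals)


def largest_increases(previous, current, top=3):
    """Rank services by how much their cost grew between two periods."""
    prev = cost_by_service(previous)
    curr = cost_by_service(current)
    deltas = {
        service: curr.get(service, 0.0) - prev.get(service, 0.0)
        for service in set(prev) | set(curr)
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical exported rows for two months.
last_month = [
    {"service": "Virtual Machines", "cost": 400.0},
    {"service": "Bandwidth", "cost": 20.0},
    {"service": "Log Analytics", "cost": 50.0},
]
this_month = [
    {"service": "Virtual Machines", "cost": 410.0},
    {"service": "Bandwidth", "cost": 180.0},
    {"service": "Log Analytics", "cost": 55.0},
]

for service, delta in largest_increases(last_month, this_month):
    print(service, delta)
```

In this made-up data the biggest jump is in bandwidth rather than compute, which is the kind of secondary charge that is easy to overlook.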

Security and observability also affect spend. Excessive diagnostic logging, long retention periods, Defender plans, public exposure, and DDoS-related traffic patterns can increase both cost and operational risk.

7. Practical AZ-900 Scenarios

Scenario 1: Which tool? A team wants to estimate a new web app with one VM, managed disk, storage account, and outbound traffic. Use Azure Pricing Calculator. If the question asks whether moving 40 on-prem servers to Azure saves money overall, use Azure TCO Calculator. If the question asks why last month’s Azure invoice increased, use Azure Cost Management + Billing.

Scenario 2: Reservation or pay-as-you-go? A production VM that runs 24/7 with steady demand is a strong candidate for Reservations and, if it is an eligible Windows workload, possibly Azure Hybrid Benefit as well. A temporary proof of concept with uncertain duration is better suited to pay-as-you-go.

Scenario 3: Spot suitability. A nightly batch rendering job can restart if interrupted. Azure Spot VMs are appropriate because eviction is acceptable. A customer-facing checkout service is not.

Scenario 4: Governance. A finance team wants monthly alerts at 50%, 80%, and 100% of expected spend for a test subscription. Create a budget with notifications. If the organization wants resources shut down automatically at 100%, that requires separate automation.

8. SLA, Downtime, Composite SLA, and Availability Design

An Azure Service Level Agreement (SLA) is a service-specific commitment defined in Microsoft’s SLA terms, usually focused on availability or connectivity under stated conditions. It is not a promise of zero downtime, and it is not the same as a support plan or disaster recovery design. If Microsoft does not meet the SLA conditions, the usual remedy is a service credit, subject to claim requirements; this does not compensate for business loss.

Important caveat: SLA applicability depends on service-specific terms and deployment configuration. Some services only provide the expected SLA when deployed in a recommended redundant design. A single-instance architecture may not have the same SLA posture as a multi-instance or zonal deployment.

Preview services and features typically have limited or no SLA and reduced support commitments. For exam questions, if you see Preview, do not assume full production guarantees.

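Downtime math is usually based on a period such as a month. Formula: allowed downtime = total time × (1 - SLA). A strict 30-day month gives slightly lower figures (43.2 minutes at 99.9%). With an average month of about 730 hours (43,800 minutes) as the baseline, the numbers work out like this: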

99.9% availability ≈ 43.8 minutes downtime per month
99.95% availability ≈ 21.9 minutes per month
99.99% availability ≈ 4.38 minutes per month

These are approximate values. AZ-900 usually tests the concept, not advanced math.
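The formula is easy to check in code. The sketch below assumes the period is given in hours. The function name is illustrative, and the default of 730 hours is an average month (365 days / 12), which is the baseline that produces figures of 43.8, 21.9, and 4.38 minutes.

```python
def allowed_downtime_minutes(sla_percent, period_hours=730):
    """Total time x (1 - SLA), expressed in minutes.

    period_hours defaults to an average month of 730 hours.
    Pass 720 for a strict 30-day month.
    """
    total_minutes = period_hours * 60
    return total_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99):
    average_month = round(allowed_downtime_minutes(sla), 2)
    thirty_days = round(allowed_downtime_minutes(sla, period_hours=720), 2)
    print(sla, average_month, thirty_days)
```

For 99.9%, this gives roughly 43.8 minutes on an average month and roughly 43.2 minutes on a strict 30-day month. The gap between the two baselines is small, which is why exam questions rarely depend on it.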

Composite SLA uses a simplified multiplication approach for serial dependencies in exam scenarios. If two required services each have 99.9% availability, the combined availability is 0.999 × 0.999 = 0.998001, or 99.8001%. Each additional serial dependency lowers end-to-end availability further. That simple multiplication assumes independent failures and a serial design. Once redundancy enters the picture, the math changes: a redundant tier fails only when all of its instances fail at the same time, so you multiply the failure probabilities instead of the availabilities.

Worked example: a web app depends on a frontend service at 99.95% and a database at 99.9%. Composite availability is approximately 99.85%. If you add redundancy to the architecture, effective availability can improve because the solution isn’t relying on just one instance of each component anymore.
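Both calculations can be written as short functions. This is a simplified model that assumes independent failures, as the text above does. The function names are illustrative, and the redundancy formula is the standard parallel-availability expression rather than anything specific to Azure SLA terms.

```python
from math import prod


def serial_availability(*components):
    """Every component is required, so availabilities multiply."""
    return prod(components)


def redundant_availability(instance, copies):
    """The tier is down only if all independent copies are down at once."""
    return 1 - (1 - instance) ** copies


# Worked example from the text: frontend at 99.95%, database at 99.9%.
composite = serial_availability(0.9995, 0.999)
print(round(composite * 100, 4))

# Two independent copies of a 99.9% component.
print(round(redundant_availability(0.999, copies=2) * 100, 4))
```

The serial result is about 99.85%, lower than either component alone, while the redundant pair comes out higher than a single instance. Real SLAs depend on service terms and deployment configuration, so treat this as a reasoning aid rather than a guarantee.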

Availability Sets distribute VMs across fault domains and update domains to reduce the impact of planned maintenance and hardware failure. Availability Zones are physically separate locations within a region with independent power, cooling, and networking. Compared with an Availability Set, Zones usually provide stronger protection against datacenter-level failures within a region.

Regions and region pairs support broader continuity planning. Region pairs can provide platform-level recovery and update-prioritization benefits, but they are not a substitute for customer-designed backup, failover, and disaster recovery architecture.

The exam distinction is straightforward:

High availability: keep services accessible with minimal downtime.
Fault tolerance: continue operating despite component failure.
Disaster recovery: restore service after a major outage.

In most cases, higher availability comes with a higher cost. Zone redundancy, premium tiers, geo-replication, and multi-region designs can all improve resilience, but they’ll also increase spending.

9. Support Plans, Azure Status, Service Health, and Resource Health

Support plans determine how you get help from Microsoft; they do not change a service SLA. If a question asks about uptime commitment, the answer is SLA. If it asks how to contact Microsoft for technical help, the answer is support.

For health visibility, remember this comparison:

Azure Status: public, broad view of Azure service status.
Azure Service Health: personalized view of incidents, planned maintenance, and advisories affecting your subscriptions and regions.
Resource Health: health state of an individual resource, such as a VM.
Azure Monitor: metrics, logs, alerts, and operational telemetry for your resources and applications.

Operational workflow: if there is a rumor of a broad Azure outage, check Azure Status. If you need to know whether your subscription is affected, check Service Health. If one VM is unavailable, check Resource Health and Azure Monitor.

10. AZ-900 Quick Review and Exam Guidance

Must know comparisons:

Pricing Calculator = estimate future Azure cost
TCO Calculator = compare on-premises with Azure
Cost Management + Billing = analyze actual spend, budgets, exports, and forecasting
SLA = uptime/service commitment
Support plan = help from Microsoft
Availability Set = fault/update domain protection for VMs
Availability Zone = datacenter-level separation within a region
Reservation = specific eligible resource commitment
Savings plan = hourly spend commitment for eligible compute
Spot VM = discounted but interruptible compute

Common traps:

“Estimate before deployment” is not Cost Management.
“Migration comparison” is not Pricing Calculator.
“Stopped VM” does not always mean deallocated VM.
“Budget” does not mean hard spending cap.
“High SLA” does not mean disaster recovery is solved.
“Preview” does not imply full SLA/support.

For exact prices and exact SLA conditions, use current Azure pricing information and current Azure SLA terms because both pricing and service terms change over time. For AZ-900, focus on choosing the right tool, understanding the tradeoff between flexibility and commitment, and recognizing that architecture strongly affects both cost and availability.