AZ-900 Azure Cost Management and SLAs: Pricing, Budgets, Optimization, and Availability Explained

1. Introduction: Why Cost and Availability Matter in Azure

In AZ-900, cost and availability are two of the biggest real-world topics because they’re where the tech meets the business. That’s the point where cloud choices stop being abstract and start hitting real budgets, real users, and real deadlines. A workload might be cheap to run but a little fragile, or it might be built to be much more resilient and cost more as a result. That trade-off comes up constantly in Azure projects: Azure gives you plenty of choices, but you can’t expect every benefit without paying for it somehow.

That’s the big idea to keep in mind: cost and uptime are closely linked. More redundancy, stronger recovery options, premium tiers, and designs that span zones or even regions usually come with a higher price tag. That’s pretty normal, because you’re paying for extra resilience. Simpler and cheaper designs can be totally fine for dev/test, internal tools, or lower-priority workloads. But for customer-facing systems, they might not be enough.

Azure runs on a consumption model, so your bill can move around from month to month. That’s normal, but it does catch people off guard the first time they see it. If a team leaves resources running, picks services that are bigger than they really need, stores a lot of data, or pushes a lot of network traffic, the bill can climb quickly. I’ve seen that happen more than once. On the availability side, Azure gives you service level agreements, or SLAs, but an SLA isn’t a promise of zero downtime, and it doesn’t magically make your app resilient. Architecture still matters.

2. Core Cost Concepts and Pricing Drivers

Azure largely shifts spending from CapEx to OpEx. Instead of buying hardware up front, you consume cloud services and pay for usage. That usage is metered differently by service. Many Azure compute services, including virtual machines, are commonly billed per second after a minimum period, while storage is billed by capacity and transactions, and networking often includes transfer-based charges.

A few key things drive Azure cost:

  • Resource type: VMs, App Service plans, SQL databases, storage accounts, and VPN gateways all use different pricing models.
  • Size and performance tier: A larger VM, premium SSD, or higher database tier costs more than a smaller or standard option.
  • Region: The same service can cost different amounts in different Azure regions.
  • Redundancy: Zone-redundant and geo-redundant options usually cost more than local redundancy.
  • Data transfer: Inbound data transfer to Azure is generally free, while outbound internet egress is typically charged. Inter-region traffic can also incur charges.
  • Licensing: Existing eligible licenses may reduce cost through programs such as Azure Hybrid Benefit for Windows Server and SQL Server.

Azure also offers a Free account and some free services, which are useful for learning, but they do not represent a full production pricing model.

Here’s a very exam-relevant gotcha: shutting down a VM from inside the guest operating system isn’t the same thing as stopping billing. If the VM remains allocated, compute charges can continue. To stop compute billing, the VM typically must be stopped/deallocated from Azure. And even then, you can still get charged for related resources like managed disks, snapshots, backups, and some public IP configurations.
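
To make the deallocation point concrete, here is a minimal sketch of how the monthly bill for one VM might break down. The rates are made-up placeholders, not real Azure prices, and the function name `monthly_vm_cost` is purely illustrative. The model assumes compute accrues only while the VM is allocated and that the managed disk is billed regardless.

```python
# Hypothetical rates for illustration only; real prices vary by region,
# VM size, and disk type.
HOURLY_COMPUTE_RATE = 0.10  # assumed cost per hour while the VM is allocated
MONTHLY_DISK_RATE = 5.00    # assumed cost per managed disk per month


def monthly_vm_cost(allocated_hours: float, disk_count: int) -> float:
    """Estimate one month's cost for a VM.

    Compute accrues only for hours the VM is allocated; managed disks
    are charged whether or not the VM is running.
    """
    compute = allocated_hours * HOURLY_COMPUTE_RATE
    disks = disk_count * MONTHLY_DISK_RATE
    return round(compute + disks, 2)


# Shut down inside the guest OS only: still allocated for all ~730 hours.
stopped_in_os = monthly_vm_cost(allocated_hours=730, disk_count=1)

# Stopped (deallocated) outside working hours: roughly 220 allocated hours.
deallocated_off_hours = monthly_vm_cost(allocated_hours=220, disk_count=1)

# Deallocated for the whole month: compute is zero, the disk charge remains.
deallocated_all_month = monthly_vm_cost(allocated_hours=0, disk_count=1)

print(stopped_in_os, deallocated_off_hours, deallocated_all_month)
```

Under these assumed rates, the VM that was only shut down in the OS costs the same as one running all month, while the fully deallocated VM still carries the disk charge.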

3. Azure Pricing Models and Discount Options

Azure does not have only one purchasing model. Knowing when to use each option is important for both the exam and real-world cost optimization.

Model | Best use | Flexibility | Cost profile
Pay-as-you-go | Unpredictable or short-term workloads | High | Highest flexibility, no commitment discount
Reservations | Steady long-running workloads | Lower | Discount for committing to specific resource families and scopes
Azure Savings Plan for Compute | Predictable compute spend with some variation | More flexible than reservations | Discount for committing to an hourly compute spend
Spot VMs | Interruptible workloads | Low continuity | Can be very cheap, but capacity can be reclaimed

Reservations are best when you know a workload will run consistently. Savings Plan for Compute is broader and can apply across eligible compute usage, which gives more flexibility. Spot VMs are useful for batch jobs, testing, or fault-tolerant processing, but not for workloads that require guaranteed continuity.
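
One way to reason about "steady versus unpredictable" is a break-even calculation. The sketch below is a simplification: it treats a reservation as a flat discount on the hourly rate that you pay for every hour of the month, used or not. The 40% discount and the 0.10 hourly rate are assumptions for illustration, not published Azure figures, and real reservations are purchased for longer terms.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 / 12)


def pay_as_you_go_cost(hourly_rate: float, hours_used: float) -> float:
    """Pay only for the hours actually used."""
    return hourly_rate * hours_used


def reservation_cost(hourly_rate: float, discount: float) -> float:
    """Pay a discounted rate for every hour in the month, used or not."""
    return hourly_rate * (1 - discount) * HOURS_PER_MONTH


def break_even_hours(discount: float) -> float:
    """Monthly usage above which the reservation becomes the cheaper choice."""
    return (1 - discount) * HOURS_PER_MONTH


def cheaper_option(hourly_rate: float, hours_used: float, discount: float) -> str:
    payg = pay_as_you_go_cost(hourly_rate, hours_used)
    reserved = reservation_cost(hourly_rate, discount)
    return "reservation" if reserved < payg else "pay-as-you-go"


# Assumed numbers: 0.10 per hour list price, 40% reservation discount.
print(break_even_hours(0.40))
print(cheaper_option(0.10, hours_used=730, discount=0.40))
print(cheaper_option(0.10, hours_used=200, discount=0.40))
```

With these assumptions, a workload that runs around the clock favors the reservation, while one that runs 200 hours a month is cheaper on pay-as-you-go.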

4. Pricing Calculator vs TCO Calculator

These tools answer different questions:

Tool | Purpose | When to use it | Exam keyword
Azure Pricing Calculator | Estimate Azure service cost | Before deployment | Estimate
Azure TCO Calculator | Compare on-premises cost with Azure | Before migration | Compare
Microsoft Cost Management and Billing | Analyze actual spend | After deployment | Actual
Azure Advisor | Recommend optimizations | After deployment | Recommendations

Using the Pricing Calculator is straightforward: choose a service, select region, SKU or size, expected hours or usage, storage type, redundancy, and estimated network transfer. For a simple example, you might estimate two VMs, managed disks, a storage account in the Hot tier, and a small amount of outbound internet traffic. It’s useful for planning, but it’s not a fixed quote. Actual billing can still land a bit differently because usage changes, taxes, exchange rates, support plans, and pricing updates all play a role.

Using the TCO Calculator is different. You enter on-premises inputs such as number of servers, storage, networking, power, cooling, facility cost, support contracts, and administration overhead. It helps build a migration business case. For example, if a company has six aging servers and a storage array due for refresh, the TCO Calculator helps compare replacement costs with Azure migration assumptions.

5. Microsoft Cost Management and Billing

Microsoft Cost Management and Billing is the tool for understanding what you actually spent. It supports cost analysis, budgets, forecasting, exports, and chargeback or showback reporting.

A useful distinction: management groups organize subscriptions for governance, while billing accounts, billing profiles, and invoice sections are the commercial billing constructs. Cost analysis can often be viewed at different scopes, including subscription, resource group, or management group, but governance scope and billing scope are not the same thing.

In cost analysis, you can filter and group by resource, resource group, service, location, or tag. That is how you answer questions like:

  • Which department increased spend this month?
  • Which resource group is generating the most storage cost?
  • Did outbound bandwidth spike after a new release?

Forecasting helps estimate whether current spending trends may exceed budget. Exports can send cost data to storage for reporting or FinOps analysis. Visibility can vary by account type, role, and assigned permissions, so RBAC matters here too.
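
The grouping idea behind cost analysis can be sketched in a few lines. The records below are invented sample data in a simplified shape. They are not the schema of a real Cost Management export, and `total_by` is a hypothetical helper.

```python
from collections import defaultdict

# Invented sample rows; a real export has many more columns.
cost_rows = [
    {"resource_group": "rg-crm", "service": "Virtual Machines",
     "tags": {"CostCenter": "FIN-01"}, "cost": 120.0},
    {"resource_group": "rg-crm", "service": "Storage",
     "tags": {"CostCenter": "FIN-01"}, "cost": 30.0},
    {"resource_group": "rg-portal", "service": "Bandwidth",
     "tags": {"CostCenter": "MKT-02"}, "cost": 45.5},
    {"resource_group": "rg-portal", "service": "Storage",
     "tags": {}, "cost": 12.0},
]


def total_by(rows, key_fn):
    """Sum the cost of each row under the group key returned by key_fn."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row["cost"]
    return dict(totals)


by_resource_group = total_by(cost_rows, lambda r: r["resource_group"])
by_service = total_by(cost_rows, lambda r: r["service"])
# Rows without the tag fall into an explicit "untagged" bucket.
by_cost_center = total_by(
    cost_rows, lambda r: r["tags"].get("CostCenter", "untagged")
)

print(by_resource_group)
print(by_service)
print(by_cost_center)
```

The "untagged" bucket shows why missing tags matter: that spend cannot be attributed to any department.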

6. Budgets, Tags, and Governance

Budgets let you define spending thresholds and trigger alerts. A pretty common setup is to alert at 50%, 80%, and 100% of a monthly budget. Budgets don’t automatically stop services, but they can send notifications and, if you wire them up, they can trigger automation through action groups, Logic Apps, Functions, or runbooks.
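
The threshold logic itself is simple. Here is a minimal sketch of the 50/80/100 pattern, using a hypothetical function `crossed_thresholds`. It only illustrates the comparison and does not call any Azure API.

```python
def crossed_thresholds(actual_spend, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the alert thresholds (as fractions of budget) already reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = actual_spend / budget
    return [t for t in thresholds if used >= t]


# Example: 850 spent against a monthly budget of 1,000.
alerts = crossed_thresholds(actual_spend=850, budget=1000)
print(alerts)
```

Note that nothing in this logic stops spending. It only reports which alerts would have fired, which mirrors how budgets behave unless you attach automation.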

Tags help allocate cost. A simple taxonomy might include:

  • Environment = Prod, Dev, Test
  • Owner = Team or individual
  • Application = CRM, ERP, Portal
  • CostCenter = Finance code

Tags improve reporting quality, but they do not reduce cost by themselves. Tags don’t always get applied automatically to every related resource, so inconsistent tagging can really weaken chargeback and showback reporting. Azure Policy can help enforce required tags by auditing, appending, or denying noncompliant deployments.
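
As a rough illustration of what a "required tags" audit checks, the sketch below flags resources that are missing any tag from the sample taxonomy above. It is a simplified stand-in with invented resource names, not how Azure Policy actually evaluates resources.

```python
REQUIRED_TAGS = ("Environment", "Owner", "Application", "CostCenter")


def missing_tags(resource_tags):
    """Return the required tag names that are absent or empty."""
    return [name for name in REQUIRED_TAGS if not resource_tags.get(name)]


# Invented resources for illustration.
resources = {
    "vm-crm-01": {"Environment": "Prod", "Owner": "crm-team",
                  "Application": "CRM", "CostCenter": "FIN-01"},
    "disk-temp": {"Environment": "Dev"},
}

# Keep only the resources that fail the audit.
noncompliant = {
    name: missing_tags(tags)
    for name, tags in resources.items()
    if missing_tags(tags)
}
print(noncompliant)
```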

For governance, I like to think in layers: management groups for organization, subscriptions for isolation and billing boundaries, resource groups for operational grouping, tags for allocation, RBAC for access control, and Policy for guardrails.

7. Cost Optimization in Practice

Cost optimization is really about matching the service to the workload. It’s not just about grabbing the cheapest option on the list and hoping it works out.

  • Rightsize overprovisioned VMs and databases.
  • Use autoscale where supported, such as App Service or VM scale sets, so capacity follows demand.
  • Schedule shutdown for non-production resources.
  • Clean up hidden costs such as unattached disks, old snapshots, backup storage, unused public IP addresses, idle NAT Gateways, and Log Analytics ingestion or retention.
  • Choose storage tiers carefully: Hot, Cool, and Archive have different access and cost profiles.
  • Choose the right redundancy for business need.

Storage redundancy is a classic cost-versus-availability decision:

Option | Meaning | Availability profile | Cost tendency
LRS | Locally redundant storage | Copies within one datacenter | Lower
ZRS | Zone-redundant storage | Copies across availability zones | Higher
GRS | Geo-redundant storage | Replication to a paired region | Higher
GZRS | Geo-zone-redundant storage | Zone plus paired-region replication | Highest of these options

A practical way to think about it is pretty simple: if the workload is predictable, look at reservations or a savings plan; if it’s non-production, schedule shutdowns; if it can be interrupted, Spot might make sense; if it needs elasticity, use autoscale; and if you’ve got eligible Microsoft licenses, take a hard look at Azure Hybrid Benefit.
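
That rule of thumb can be written down as a small lookup. The `Workload` class and `suggest_levers` function below are hypothetical names used only to restate the heuristic. They are not an official decision tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical workload traits, used only for this illustration."""
    predictable: bool = False
    non_production: bool = False
    interruptible: bool = False
    needs_elasticity: bool = False
    has_eligible_licenses: bool = False


def suggest_levers(workload: Workload) -> List[str]:
    """Map workload traits to the cost levers worth evaluating first."""
    levers = []
    if workload.predictable:
        levers.append("reservations or savings plan")
    if workload.non_production:
        levers.append("scheduled shutdown")
    if workload.interruptible:
        levers.append("Spot VMs")
    if workload.needs_elasticity:
        levers.append("autoscale")
    if workload.has_eligible_licenses:
        levers.append("Azure Hybrid Benefit")
    return levers


nightly_batch = Workload(non_production=True, interruptible=True)
print(suggest_levers(nightly_batch))
```

The levers are not mutually exclusive, which is why the function returns a list rather than a single answer.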

8. Understanding Azure SLAs

An SLA is Microsoft’s contractual uptime commitment for a service under documented conditions. It is service-specific, and eligibility can depend on the deployment model, tier, and whether minimum architectural requirements are met. For example, some SLAs require multiple instances rather than a single instance.

An SLA is not the same as application uptime. A service can meet its SLA while your application still fails because of poor design, dependency issues, code defects, or client-side problems. Service credits may be available if SLA conditions are not met, but they are governed by the SLA terms and are not automatically granted in every incident.

Also, not all maintenance causes downtime. Azure platform maintenance is often designed to minimize impact, and whether an event counts toward SLA measurement depends on the specific service terms.

9. SLA Percentages, Downtime, and Composite SLA

Small percentage changes matter. Approximate monthly downtime looks like this:

SLA | Approx monthly downtime | Approx yearly downtime
99% | 7.3 hours | 3.65 days
99.9% | 43.8 minutes | 8.76 hours
99.95% | 21.9 minutes | 4.38 hours
99.99% | 4.4 minutes | 52.6 minutes
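
The figures in this table follow from simple arithmetic, sketched below. It assumes a 365-day year and an average month of 730 hours. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24               # 8,760 hours
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730 hours on average


def allowed_downtime_minutes(sla_percent: float, period_hours: float) -> float:
    """Minutes of downtime an SLA percentage permits over a period."""
    return (1 - sla_percent / 100) * period_hours * 60


for sla in (99.0, 99.9, 99.95, 99.99):
    monthly = allowed_downtime_minutes(sla, HOURS_PER_MONTH)
    yearly = allowed_downtime_minutes(sla, HOURS_PER_YEAR)
    print(f"{sla}%: {monthly:.1f} min/month, {yearly / 60:.2f} h/year")
```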

A composite SLA applies when multiple required components are in series. The simple model multiplies the availability values. For example, if Service A and Service B both offer 99.9% availability and your solution depends on both of them, the composite SLA drops to about 99.8001%.

The arithmetic: 0.999 × 0.999 = 0.998001, which works out to 99.8001%.

If you add a third required dependency at 99.95%, the overall availability drops again, to about 99.75% (0.998001 × 0.9995 ≈ 0.997502). That’s why extra dependencies can sneak up on you. This multiplication model is a simplification that assumes independent dependencies and that all components are required for the solution to function.
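
The series model is easy to check in code. This sketch uses the same simplifying assumptions stated above: the dependencies are independent and every one of them is required. The function name `composite_sla` is illustrative.

```python
from math import prod


def composite_sla(sla_percents):
    """Multiply the availabilities of components that are all required."""
    if not sla_percents:
        raise ValueError("at least one SLA value is required")
    return prod(p / 100 for p in sla_percents) * 100


two_services = composite_sla([99.9, 99.9])
three_services = composite_sla([99.9, 99.9, 99.95])

print(f"{two_services:.4f}%")
print(f"{three_services:.4f}%")
```

Each extra required dependency can only lower the result, which is the practical reason to keep critical request paths short.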

10. High Availability versus Disaster Recovery

This is one of those areas where people often mix things up, so it’s worth slowing down for a second.

High availability focuses on surviving localized failures with minimal interruption. Disaster recovery focuses on recovering from larger failures, such as a zone or region outage.

Pattern | What it protects against | Relative cost
Single VM | Very limited protection | Lower
Availability Set | Localized hardware or update events in a datacenter or cluster scope | Higher
Availability Zones | Datacenter-level failure within a region | Higher
Multi-region DR | Region-level failure | Highest

Availability Sets apply to Azure virtual machines and distribute instances across fault domains and update domains to reduce the chance that one localized event affects every VM. Availability Zones provide physically separate datacenter-level locations within a region. Region pairs support recovery planning and update sequencing at the platform level, but they do not automatically fail over your application. You must architect replication, failover, and testing yourself.

Backups are important, but backup alone is not high availability. A backup helps restore data after failure; it does not keep a service continuously available during the failure.

11. Monitor, Service Health, Resource Health, and Advisor

Tool | Purpose | Exam clue
Azure Monitor | Metrics, logs, alerts, and operational telemetry for your resources | Observe workload behavior
Azure Service Health | Personalized view of Azure incidents and maintenance affecting your subscriptions | Tenant-specific platform issues
Azure Status | Public global service status information | Public status view
Resource Health | Health status of a specific Azure resource | Single resource health
Azure Advisor | Recommendations for cost, reliability, security, performance, and operational excellence | Improve and optimize

If an application is unavailable, a good diagnostic flow is: check Service Health for platform incidents, check Resource Health for the affected resource, review Azure Monitor metrics and logs for workload symptoms, and use Advisor for longer-term optimization recommendations. Advisor recommendations are based on telemetry and usage patterns, so they’re usually worth a close look before you make changes. I wouldn’t treat them as gospel, but I definitely wouldn’t ignore them either.

12. Troubleshooting Cost Spikes and Availability Issues

This is where the theory starts turning into real life, because this is usually when people first notice something’s off.

If you get an unexpected cost spike, I’d start with this checklist:

  • Review Cost Management by scope, service, and resource group.
  • Group by tag to identify owner or department.
  • Check for newly created or resized resources.
  • Look for VMs that were shut down in the guest OS but never deallocated, since they can still accrue compute charges.
  • Check unattached disks, snapshots, backups, public IPs, NAT Gateway, and Log Analytics usage.
  • Review outbound and inter-region traffic.

For an availability issue, ask:

  • Is there an Azure platform incident in Service Health?
  • Does Resource Health show a problem with the specific VM, database, or service?
  • Do Monitor metrics show CPU pressure, memory pressure, failed requests, or latency spikes?
  • Is the application single-instance when the SLA or business need expects redundancy?
  • Did a dependency such as database, storage, DNS, or network path fail?

13. AZ-900 Exam Traps and Rapid Review

These distinctions matter a lot on the exam, so keep them straight:

  • Pricing Calculator = estimate before deployment.
  • TCO Calculator = compare on-premises with Azure.
  • Cost Management and Billing = actual spend after deployment.
  • Advisor = recommendations, not billing data.
  • Monitor = metrics and logs for your resources.
  • Service Health = personalized Azure service incidents and maintenance.
  • Azure Status = public global service status information.
  • Budgets alert; they do not inherently stop spend.
  • Tags organize and allocate cost; they do not directly save money.
  • SLA is an uptime commitment, not zero downtime.
  • Composite SLA can be lower than each individual service SLA.
  • Availability Set is not the same as Availability Zone.
  • Region pairs do not automatically provide disaster recovery for your app.
  • Stopping a VM in the OS is not the same as deallocating it in Azure.
  • Outbound internet data transfer is commonly billed; inbound is generally free.

If you remember the problem each tool solves, most AZ-900 questions in this area become much easier. Estimate with the Pricing Calculator, compare with TCO, monitor actual spend with Cost Management, optimize with Advisor, observe workloads with Monitor, and check platform issues with Service Health. Then connect all of that back to the bigger theme: better availability usually requires better architecture, and better architecture often costs more.