AZ-900 Azure Cost Management and SLAs: Pricing, Budgets, Optimization, and Availability Explained
1. Introduction: Why Cost and Availability Matter in Azure
In AZ-900, cost and availability are two of the biggest real-world topics because they’re where the tech meets the business. This is the point where cloud choices stop being abstract and start hitting real budgets, real users, and real deadlines. A workload might be cheap to run but fragile, or it might be built to be much more resilient and cost more as a result. That trade-off comes up constantly in Azure projects: Azure gives you plenty of choices, but each one has a price, and you can’t expect every benefit without paying for it somehow.
That’s the big idea to keep in mind: cost and uptime are closely linked. More redundancy, stronger recovery options, premium tiers, and designs that span zones or even regions usually come with a higher price tag. That’s pretty normal, because you’re paying for extra resilience. Simpler and cheaper designs can be totally fine for dev/test, internal tools, or lower-priority workloads. But for customer-facing systems, they might not be enough.
Azure runs on a consumption model, so your bill can move around from month to month. That’s normal, but it does catch people off guard the first time they see it. If a team leaves resources running, picks services that are bigger than they really need, stores a lot of data, or pushes a lot of network traffic, the bill can climb quickly. I’ve seen that happen more than once. On the availability side, Azure gives you service level agreements, or SLAs, but an SLA isn’t a promise of zero downtime, and it doesn’t make your app resilient on its own. Architecture still matters.
2. Core Cost Concepts and Pricing Drivers
Azure largely shifts spending from CapEx to OpEx. Instead of buying hardware up front, you consume cloud services and pay for usage. That usage is metered differently by service. Many Azure compute services, including virtual machines, are commonly billed per second of use, while storage is billed by capacity and transactions, and networking often includes transfer-based charges.
A few key things drive Azure cost:
- Resource type: VMs, App Service plans, SQL databases, storage accounts, and VPN gateways all use different pricing models.
- Size and performance tier: A larger VM, premium SSD, or higher database tier costs more than a smaller or standard option.
- Region: The same service can cost different amounts in different Azure regions.
- Redundancy: Zone-redundant and geo-redundant options usually cost more than local redundancy.
- Data transfer: Inbound data transfer to Azure is generally free, while outbound internet egress is typically charged. Inter-region traffic can also incur charges.
- Licensing: Existing eligible licenses may reduce cost through programs such as Azure Hybrid Benefit for Windows Server and SQL Server.
Azure also offers a Free account and some free services, which are useful for learning, but they do not represent a full production pricing model.
Here’s a very exam-relevant gotcha: shutting down a VM from inside the guest operating system isn’t the same thing as stopping billing. If the VM remains allocated, compute charges can continue. To stop compute billing, the VM typically must be stopped/deallocated from Azure. And even then, you can still get charged for related resources like managed disks, snapshots, backups, and some public IP configurations.
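To make the stopped-versus-deallocated distinction concrete, here is a minimal sketch in Python. The three state labels and the `compute_is_billed` helper are assumptions made for illustration; they are a simplified model of the rule described above, not values read from any Azure API.

```python
# Simplified model of which VM states still accrue compute charges.
# The state labels below are illustrative, not values returned by an Azure API.
COMPUTE_BILLED = {
    "running": True,
    "stopped": True,       # shut down inside the guest OS, hardware still allocated
    "deallocated": False,  # stopped from Azure, compute capacity released
}


def compute_is_billed(power_state: str) -> bool:
    """Return True if compute charges continue in this (simplified) state."""
    return COMPUTE_BILLED[power_state.lower()]


for state in ("running", "stopped", "deallocated"):
    print(f"{state:12} compute billed: {compute_is_billed(state)}")
```

Even in the deallocated case, the sketch only covers compute. Disks, snapshots, backups, and similar resources are billed separately.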
3. Azure Pricing Models and Discount Options
Azure does not have only one purchasing model. Knowing when to use each option is important for both the exam and real-world cost optimization.
| Model | Best use | Flexibility | Cost profile |
|---|---|---|---|
| Pay-as-you-go | Unpredictable or short-term workloads | High | Highest flexibility, no commitment discount |
| Reservations | Steady long-running workloads | Lower | Discount for committing to specific resource families and scopes |
| Azure Savings Plan for Compute | Predictable compute spend with some variation | More flexible than reservations | Discount for committing to an hourly compute spend |
| Spot VMs | Interruptible workloads | Low continuity | Can be very cheap, but capacity can be reclaimed |
Reservations are best when you know a workload will run consistently, and they are typically purchased for a one-year or three-year term. Savings Plan for Compute is broader and can apply across eligible compute usage, which gives more flexibility. Spot VMs are useful for batch jobs, testing, or fault-tolerant processing, but not for workloads that require guaranteed continuity.
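The reasoning behind "reservations for steady workloads" can be sketched as a break-even comparison. Everything numeric here is hypothetical: the hourly rate and the discount are placeholders rather than Azure prices, and the model assumes a reservation is paid for every hour of the month whether or not the VM runs.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months


def monthly_costs(payg_hourly: float, discount: float, hours_used: float) -> dict:
    """Compare pay-as-you-go with a reservation for one month.

    payg_hourly and discount are hypothetical inputs, not real Azure prices.
    """
    pay_as_you_go = payg_hourly * hours_used
    # Assumption: the reservation is charged for the full month regardless of use.
    reserved = payg_hourly * (1 - discount) * HOURS_PER_MONTH
    return {"pay_as_you_go": pay_as_you_go, "reserved": reserved}


def cheaper_option(payg_hourly: float, discount: float, hours_used: float) -> str:
    """Return the name of the cheaper purchasing model for this usage level."""
    costs = monthly_costs(payg_hourly, discount, hours_used)
    return min(costs, key=costs.get)


print(cheaper_option(0.10, 0.40, hours_used=730))  # always-on workload
print(cheaper_option(0.10, 0.40, hours_used=200))  # occasional workload
```

Under this simplified model, the reservation wins once the VM runs for more than `1 - discount` of the month, which is why steady, long-running workloads are the usual fit.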
4. Pricing Calculator vs TCO Calculator
These tools answer different questions:
| Tool | Purpose | When to use it | Exam keyword |
|---|---|---|---|
| Azure Pricing Calculator | Estimate Azure service cost | Before deployment | Estimate |
| Azure TCO Calculator | Compare on-premises cost with Azure | Before migration | Compare |
| Microsoft Cost Management and Billing | Analyze actual spend | After deployment | Actual |
| Azure Advisor | Recommend optimizations | After deployment | Recommendations |
Using the Pricing Calculator is straightforward: choose a service, select region, SKU or size, expected hours or usage, storage type, redundancy, and estimated network transfer. For a simple example, you might estimate two VMs, managed disks, a storage account in the Hot tier, and a small amount of outbound internet traffic. It’s useful for planning, but it’s not a fixed quote. Actual billing can still land a bit differently because usage changes, taxes, exchange rates, support plans, and pricing updates all play a role.
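The kind of estimate the Pricing Calculator produces is essentially a sum of line items. The sketch below mirrors the example above with made-up unit prices; the `LineItem` class and every price in it are assumptions for illustration, not Azure rates.

```python
from dataclasses import dataclass


@dataclass
class LineItem:
    name: str
    quantity: float    # hours, GB, or number of units
    unit_price: float  # hypothetical price per unit, not a real Azure rate

    @property
    def cost(self) -> float:
        return self.quantity * self.unit_price


def estimate_monthly_total(items: list[LineItem]) -> float:
    """Sum the estimated monthly cost of all line items."""
    return sum(item.cost for item in items)


estimate = [
    LineItem("2 VMs, 730 hours each", 2 * 730, 0.10),
    LineItem("Managed disks", 2, 10.00),
    LineItem("Hot-tier storage (GB)", 500, 0.02),
    LineItem("Outbound internet traffic (GB)", 50, 0.08),
]

for item in estimate:
    print(f"{item.name:32} {item.cost:8.2f}")
print(f"{'Estimated monthly total':32} {estimate_monthly_total(estimate):8.2f}")
```

As with the real calculator, the result is a planning figure. Changing any quantity or region-specific price changes the total.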
Using the TCO Calculator is different. You enter on-premises inputs such as number of servers, storage, networking, power, cooling, facility cost, support contracts, and administration overhead. It helps build a migration business case. For example, if a company has six aging servers and a storage array due for refresh, the TCO Calculator helps compare replacement costs with Azure migration assumptions.
5. Microsoft Cost Management and Billing
Microsoft Cost Management and Billing is the tool for understanding what you actually spent. It supports cost analysis, budgets, forecasting, exports, and chargeback or showback reporting.
A useful distinction: management groups organize subscriptions for governance, while billing accounts, billing profiles, and invoice sections are the commercial billing constructs. Cost analysis can often be viewed at different scopes, including subscription, resource group, or management group, but governance scope and billing scope are not the same thing.
In cost analysis, you can filter and group by resource, resource group, service, location, or tag. That is how you answer questions like:
- Which department increased spend this month?
- Which resource group is generating the most storage cost?
- Did outbound bandwidth spike after a new release?
Forecasting helps estimate whether current spending trends may exceed budget. Exports can send cost data to storage for reporting or FinOps analysis. Visibility can vary by account type, role, and assigned permissions, so RBAC matters here too.
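Grouping by tag is the core of chargeback and showback reporting. As a rough sketch, assume cost data has been exported into a list of records with a `cost` field and a `tags` dictionary. That record shape is hypothetical and does not reflect the actual export schema.

```python
from collections import defaultdict


def cost_by_tag(records: list[dict], tag_key: str) -> dict[str, float]:
    """Total cost per value of one tag; untagged records get their own bucket."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        label = record.get("tags", {}).get(tag_key, "(untagged)")
        totals[label] += record["cost"]
    return dict(totals)


# Hypothetical exported records; field names are assumptions for this sketch.
records = [
    {"resource": "vm-crm-01", "cost": 120.0, "tags": {"CostCenter": "Finance"}},
    {"resource": "sql-portal", "cost": 80.0, "tags": {"CostCenter": "Marketing"}},
    {"resource": "st-crm-logs", "cost": 45.5, "tags": {"CostCenter": "Finance"}},
    {"resource": "disk-orphan", "cost": 30.0, "tags": {}},
]

print(cost_by_tag(records, "CostCenter"))
```

The "(untagged)" bucket is worth watching: a large total there usually means spend that no team has been asked to account for.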
6. Budgets, Tags, and Governance
Budgets let you define spending thresholds and trigger alerts. A pretty common setup is to alert at 50%, 80%, and 100% of a monthly budget. Budgets don’t automatically stop services, but they can send notifications and, if you wire them up, they can trigger automation through action groups, Logic Apps, Functions, or runbooks.
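The alerting logic behind a budget is simple to model. This sketch uses the 50%, 80%, and 100% thresholds mentioned above; the function name is an assumption, and it only reports which thresholds were crossed, mirroring the fact that a budget notifies rather than stops anything.

```python
def crossed_thresholds(
    spend: float,
    budget: float,
    thresholds: tuple[float, ...] = (0.5, 0.8, 1.0),
) -> list[float]:
    """Return the budget thresholds (as fractions) that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in sorted(thresholds) if spend >= budget * t]


for spend in (100, 850, 1000):
    print(f"spend {spend:>5}: alerts at {crossed_thresholds(spend, budget=1000)}")
```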
Tags help allocate cost. A simple taxonomy might include:
- Environment = Prod, Dev, Test
- Owner = Team or individual
- Application = CRM, ERP, Portal
- CostCenter = Finance code
Tags improve reporting quality, but they do not reduce cost by themselves. Tags don’t always get applied automatically to every related resource, so inconsistent tagging can really weaken chargeback and showback reporting. Azure Policy can help enforce required tags by auditing, appending, or denying noncompliant deployments.
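A required-tag audit comes down to checking each resource against the taxonomy. The sketch below uses the four tag names listed above; the `missing_tags` helper is an illustrative assumption and compares keys exactly, which is a simplification of how a real policy engine evaluates tags.

```python
REQUIRED_TAGS = ("Environment", "Owner", "Application", "CostCenter")


def missing_tags(resource_tags: dict[str, str], required=REQUIRED_TAGS) -> list[str]:
    """Return required tag names that are absent or empty on a resource."""
    return [key for key in required if not resource_tags.get(key)]


# Hypothetical resources used only to exercise the check.
print(missing_tags({"Environment": "Prod", "Owner": "team-a"}))
print(missing_tags({
    "Environment": "Dev",
    "Owner": "team-b",
    "Application": "Portal",
    "CostCenter": "FIN-001",
}))
```

An audit-style policy would report the first resource as noncompliant, while a deny-style policy would block its deployment.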
For governance, I like to think in layers: management groups for organization, subscriptions for isolation and billing boundaries, resource groups for operational grouping, tags for allocation, RBAC for access control, and Policy for guardrails.
7. Cost Optimization in Practice
Cost optimization is really about matching the service to the workload. It’s not just about grabbing the cheapest option on the list and hoping it works out.
- Rightsize overprovisioned VMs and databases.
- Use autoscale where supported, such as App Service or VM scale sets, so capacity follows demand.
- Schedule shutdown for non-production resources.
- Clean up hidden costs such as unattached disks, snapshots, backup storage, reserved public IPs, NAT Gateway, and Log Analytics ingestion or retention.
- Choose storage tiers carefully: Hot, Cool, and Archive have different access and cost profiles.
- Choose the right redundancy for business need.
Storage redundancy is a classic cost-versus-availability decision:
| Option | Meaning | Availability profile | Cost tendency |
|---|---|---|---|
| LRS | Locally redundant storage | Copies within one datacenter | Lower |
| ZRS | Zone-redundant storage | Copies across availability zones | Higher |
| GRS | Geo-redundant storage | Replication to a paired region | Higher |
| GZRS | Geo-zone-redundant storage | Zone plus paired-region replication | Highest of these options |
A practical way to think about it is pretty simple: if the workload is predictable, look at reservations or a savings plan; if it’s non-production, schedule shutdowns; if it can be interrupted, Spot might make sense; if it needs elasticity, use autoscale; and if you’ve got eligible Microsoft licenses, take a hard look at Azure Hybrid Benefit.
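That rule of thumb can be written down as a small lookup. The function and its flag names are hypothetical. It just restates the mapping from workload traits to options worth evaluating and is not a substitute for Advisor recommendations.

```python
def suggest_optimizations(
    *,
    predictable: bool = False,
    non_production: bool = False,
    interruptible: bool = False,
    needs_elasticity: bool = False,
    has_eligible_licenses: bool = False,
) -> list[str]:
    """Map workload traits to cost options worth evaluating (illustrative only)."""
    suggestions = []
    if predictable:
        suggestions.append("reservations or savings plan")
    if non_production:
        suggestions.append("scheduled shutdown")
    if interruptible:
        suggestions.append("Spot VMs")
    if needs_elasticity:
        suggestions.append("autoscale")
    if has_eligible_licenses:
        suggestions.append("Azure Hybrid Benefit")
    return suggestions


print(suggest_optimizations(non_production=True, interruptible=True))
```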
8. Understanding Azure SLAs
An SLA is Microsoft’s contractual uptime commitment for a service under documented conditions. It is service-specific, and eligibility can depend on the deployment model, tier, and whether minimum architectural requirements are met. For example, some SLAs require multiple instances rather than a single instance.
An SLA is not the same as application uptime. A service can meet its SLA while your application still fails because of poor design, dependency issues, code defects, or client-side problems. Service credits may be available if SLA conditions are not met, but they are governed by the SLA terms and are not automatically granted in every incident.
Also, not all maintenance causes downtime. Azure platform maintenance is often designed to minimize impact, and whether an event counts toward SLA measurement depends on the specific service terms.
9. SLA Percentages, Downtime, and Composite SLA
Small percentage changes matter. Approximate monthly downtime looks like this:
| SLA | Approx monthly downtime | Approx yearly downtime |
|---|---|---|
| 99% | 7.3 hours | 3.65 days |
| 99.9% | 43.8 minutes | 8.76 hours |
| 99.95% | 21.9 minutes | 4.38 hours |
| 99.99% | 4.4 minutes | 52.6 minutes |
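The figures in the table follow from one formula: allowed downtime is the unavailable fraction multiplied by the length of the period. This sketch assumes a 365-day year and an average month of 730 hours, which is the approximation the table appears to use.

```python
HOURS_PER_YEAR = 365 * 24              # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730, an average month


def allowed_downtime_minutes(sla_percent: float, period_hours: float) -> float:
    """Minutes of downtime permitted by an SLA percentage over a period."""
    return (100 - sla_percent) / 100 * period_hours * 60


for sla in (99.0, 99.9, 99.95, 99.99):
    monthly = allowed_downtime_minutes(sla, HOURS_PER_MONTH)
    yearly = allowed_downtime_minutes(sla, HOURS_PER_YEAR)
    print(f"{sla:6}%  monthly: {monthly:7.1f} min  yearly: {yearly / 60:6.2f} h")
```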
A composite SLA applies when multiple required components are in series. The simple model multiplies the availability values. For example, if Service A and Service B both offer 99.9% availability and your solution depends on both of them, the composite SLA drops to about 99.8001%.
0.999 × 0.999 = 0.998001, which works out to 99.8001%.
If you add a third required dependency at 99.95%, the overall availability drops again, to roughly 99.75% (0.998001 × 0.9995 ≈ 0.9975). That’s why extra dependencies can sneak up on you. This multiplication model is a simplification that assumes independent dependencies and that all components are required for the solution to function.
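The serial composite calculation is easy to check in code. This is a minimal sketch of the simple multiplication model only: it assumes every component is required and fails independently, and the `composite_sla` name is an illustration rather than anything from an Azure tool.

```python
from math import prod


def composite_sla(*sla_percents: float) -> float:
    """Composite availability (in percent) of required components in series."""
    return prod(p / 100 for p in sla_percents) * 100


two_services = composite_sla(99.9, 99.9)
three_services = composite_sla(99.9, 99.9, 99.95)

print(f"Two services at 99.9%:        {two_services:.4f}%")
print(f"Plus a third at 99.95%:       {three_services:.4f}%")
```

Because each factor is below 1, the composite figure is always lower than the weakest individual SLA in the chain.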
10. High Availability versus Disaster Recovery
People often mix these two up, so it’s worth slowing down for a second.
High availability focuses on surviving localized failures with minimal interruption. Disaster recovery focuses on recovering from larger failures, such as a region-wide outage.
| Pattern | What it protects against | Relative cost |
|---|---|---|
| Single VM | Very limited protection | Lower |
| Availability Set | Localized hardware or update events in a datacenter or cluster scope | Higher |
| Availability Zones | Datacenter-level failure within a region | Higher |
| Multi-region DR | Region-level failure | Highest |
Availability Sets apply to Azure virtual machines and distribute instances across fault domains and update domains to reduce the chance that one localized event affects every VM. Availability Zones provide physically separate datacenter-level locations within a region. Region pairs support recovery planning and update sequencing at the platform level, but they do not automatically fail over your application. You must architect replication, failover, and testing yourself.
Backups are important, but backup alone is not high availability. A backup helps restore data after failure; it does not keep a service continuously available during the failure.
11. Monitor, Service Health, Resource Health, and Advisor
| Tool | Purpose | Exam clue |
|---|---|---|
| Azure Monitor | Metrics, logs, alerts, and operational telemetry for your resources | Observe workload behavior |
| Azure Service Health | Personalized view of Azure incidents and maintenance affecting your subscriptions | Platform issues affecting your subscriptions |
| Azure Status | Public global service status information | Public status view |
| Resource Health | Health status of a specific Azure resource | Single resource health |
| Azure Advisor | Recommendations for cost, reliability, security, performance, and operations | Improve and optimize |
If an application is unavailable, a good diagnostic flow is: check Service Health for platform incidents, check Resource Health for the affected resource, review Azure Monitor metrics and logs for workload symptoms, and use Advisor for longer-term optimization recommendations. Advisor recommendations are based on telemetry and usage patterns, so they’re usually worth a close look before you make changes. I wouldn’t treat them as gospel, but I definitely wouldn’t ignore them either.
12. Troubleshooting Cost Spikes and Availability Issues
This is where the theory starts turning into real life, because troubleshooting is usually when people first notice something’s off.
If you get an unexpected cost spike, I’d start with this checklist:
- Review Cost Management by scope, service, and resource group.
- Group by tag to identify owner or department.
- Check for newly created or resized resources.
- Look for VMs that were stopped from the guest OS but never deallocated, since they still accrue compute charges.
- Check unattached disks, snapshots, backups, public IPs, NAT Gateway, and Log Analytics usage.
- Review outbound and inter-region traffic.
For an availability issue, ask:
- Is there an Azure platform incident in Service Health?
- Does Resource Health show a problem with the specific VM, database, or service?
- Do Monitor metrics show CPU pressure, memory pressure, failed requests, or latency spikes?
- Is the application single-instance when the SLA or business need expects redundancy?
- Did a dependency such as database, storage, DNS, or network path fail?
13. AZ-900 Exam Traps and Rapid Review
These distinctions matter a lot on the exam, so keep them straight:
- Pricing Calculator = estimate before deployment.
- TCO Calculator = compare on-premises with Azure.
- Cost Management and Billing = actual spend after deployment.
- Advisor = recommendations, not billing data.
- Monitor = metrics and logs for your resources.
- Service Health = personalized Azure service incidents and maintenance.
- Azure Status = public global service status information.
- Budgets alert; they do not inherently stop spend.
- Tags organize and allocate cost; they do not directly save money.
- SLA is an uptime commitment, not zero downtime.
- Composite SLA can be lower than each individual service SLA.
- Availability Set is not the same as Availability Zone.
- Region pairs do not automatically provide disaster recovery for your app.
- Stopping a VM in the OS is not the same as deallocating it in Azure.
- Outbound internet data transfer is commonly billed; inbound is generally free.
If you remember the problem each tool solves, most AZ-900 questions in this area become much easier. Estimate with the Pricing Calculator, compare with TCO, monitor actual spend with Cost Management, optimize with Advisor, observe workloads with Monitor, and check platform issues with Service Health. Then connect all of that back to the bigger theme: better availability usually requires better architecture, and better architecture often costs more.