Service-Level Agreements (SLA)
10 Jan 2023
- Formal documents to define the performance standards that apply to Azure.
- Specify also what happens if a service or product fails to perform to a governing SLAs specification.
- There are SLAs for individual Azure products and services.
- ❗ Azure does not provide SLAs for most services under the Free or Shared tiers
- Three key characteristics of SLAs for Azure products and services:
- Performance Targets
- Specific to each Azure product and service.
- E.g. uptime guarantees or connectivity rates
- Uptime and Connectivity Guarantees
- 📝 Monthly Uptime % =
(Maximum Available Minutes-Downtime) / Maximum Available Minutes X 100
- 📝 Range from 99.9% (“three nines”) to 99.999% (“five nines”) for any paid tier service.
- In other words minimum SLA for all non-free Azure services are 99.9%
- E.g. Azure Cosmos DB (Database) service SLA offers 99.999 percent uptime
- meaning it allows for about 5 minutes of total downtime per year.
- also includes low-latency commitments of less than 10 milliseconds on DB read + write operations.
- 📝 Service credits
- Given to paying Azure customers if uptime percentage is lower than given in SLA.
- Describe how Microsoft will respond if an Azure product or service fails to perform to its governing SLAs specification.
- E.g. customers may have a discount applied to their Azure bill, as compensation for an under-performing Azure product or service.
- Read more: SLA Summary for Azure Services
Composite SLA
- Result of combining SLAs across different service offerings.
- 📝 Calculating downtime
- E.g. web app (99.95% SLA from Azure) writes to SQL database (99.99% SLA from Azure)
- Composite SLA =
99.95 percent × 99.99 percent = 99.94 percent
- =
0.9995 * 0.9999 = 0.9994
- Means combined probability of failure is higher than the individual SLA values
- You can improve the composite SLA by creating independent fallback paths.
- E.g. if the SQL Database is unavailable, you can put transactions into a queue for processing at a later time.
- Web app (99.95%) writes to either SQL Database (99.99%) or queue (99.9%)
- Application is still available even if it can’t connect to the database.
- ❗But it fails if both the database and the queue fail simultaneously.
- If the expected percentage of time for a simultaneous failure is 0.0001 × 0.001
- the composite SLA for this combined path of a database or queue would be:
1.0 − (0.0001 × 0.001) = 99.99999 percent
- If we add the queue to our web app, the total composite SLA is:
99.95 percent × 99.99999 percent = ~99.95 percent
- Improves SLA but application logic gets more complicated
- You are paying more to add the queue support and there may be data-consistency issues you’ll have to deal with due to retry behavior.
Application SLA
- By creating your own SLAs, you can set performance targets to suit your specific Azure application.
- 💡 >= four 9’s (99.99%) SLA performance targets =>
- manual intervention from failures may not be enough (difficult to be quick enough)
- should have self-diagnosing & self-healing solutions.
Resiliency
- Resiliency is the ability of a system to recover from failures and continue to function.
- High availability and disaster recovery are two crucial components of resiliency
- 📝 Disaster recovery: When Godzilla destroys your data center, you do have alternative locations to keep providing your service and protocols/means for the other location to know how to keep delivering the service.
- Failure Mode Analysis (FMA)
- Goal:
- Identify possible points of failure.
- Define how the application will respond to those failures.
- Read more: Designing resilient applications for Azure
High availability
- 📝 Availability is often given as percentage uptime
- Refers to the time that a system is functional and working.
- Most providers prefer to maximize the availability of their Azure solutions by minimizing downtime.
- ❗ As you increase availability, you also increase the cost and complexity of your solution.
- As your solution grows in complexity, you will have more services depending on each other.
- You might overlook possible failure points in your solution if you have any interdependent services.
- 💡E.g. a workload that requires 99.99 percent uptime shouldn’t depend upon a service with a 99.9 percent SLA.
- Read more: Availability choices for Azure compute