In 2021 we saw several high-profile cloud platform failures, including across AWS and other related cloud services (such as Fastly). While the media is quick to decry these issues, no cloud service provider offers 100% uptime and most offer much lower availability rates than traditional infrastructure. So, why do we see such a disconnect between expectations and reality?
Uptime and availability are measurements of the time period (excluding scheduled outages) for which computer systems are available for use. Over the course of 70 years of commercial computing, we’ve seen a continual improvement in availability, with expectations of uptime trending towards 100%. We can calculate availability using the following formula:

U = ((M − D) / M) × 100

Where U is the uptime percentage, M represents the maximum number of minutes available per month (excluding scheduled outages), and D represents the number of minutes that services are not available.
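As a quick sketch, the calculation can be expressed in a few lines of Python, with variable names following the definitions above:

```python
def uptime_percentage(max_minutes: float, downtime_minutes: float) -> float:
    """U = ((M - D) / M) * 100, where M is the maximum number of minutes
    available in the month and D is the minutes of unplanned downtime."""
    return (max_minutes - downtime_minutes) / max_minutes * 100

# A 30-day month has 43,200 minutes; 43.2 minutes of downtime gives 99.9%.
print(round(uptime_percentage(43_200, 43.2), 4))  # → 99.9
```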
End users have become used to availability numbers defined in the high 90s, with ever-increasing proximity to 100% expressed as a series of “nines”. The following table shows typical downtime figures per year for a range of availability percentages.
| Availability % | Annual Downtime | Monthly Downtime | Weekly Downtime | Daily Downtime |
|---|---|---|---|---|
| 99% (“two nines”) | 3.65 days | 7.31 hours | 1.68 hours | 14.4 minutes |
| 99.9% (“three nines”) | 8.77 hours | 43.8 minutes | 10.1 minutes | 1.44 minutes |
| 99.99% (“four nines”) | 52.6 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% (“five nines”) | 5.25 minutes | 26.3 seconds | 6 seconds | 864 milliseconds |
| 99.9999% (“six nines”) | 31.56 seconds | 2.63 seconds | 605 milliseconds | 86.4 milliseconds |
| 99.99999% (“seven nines”) | 3.16 seconds | 263 milliseconds | 60.5 milliseconds | 8.64 milliseconds |
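The figures in the table follow directly from the availability formula; this short sketch reproduces them programmatically:

```python
# Downtime permitted per period at a given availability percentage.
PERIOD_SECONDS = {
    "annual": 365.25 * 24 * 3600,
    "monthly": 365.25 * 24 * 3600 / 12,
    "weekly": 7 * 24 * 3600,
    "daily": 24 * 3600,
}

def downtime_seconds(availability_pct: float, period_seconds: float) -> float:
    """Seconds of downtime allowed within a period at a given availability."""
    return period_seconds * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    annual = downtime_seconds(pct, PERIOD_SECONDS["annual"])
    print(f"{pct}%: {annual / 3600:.2f} hours/year")
```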
We’ve quickly reached the point of diminishing returns, with each additional nine representing a ten-fold reduction in permitted downtime (if it can be achieved).
When we look at the uptime guarantees from traditional on-premises vendors, the availability figures quoted typically run to five nines or greater. However, these numbers are statistical calculations across the entire installed base of a product. So, for example, a storage array with 99.999% availability should experience less than 30 seconds of downtime per month, on average. This doesn’t mean that every system will behave this way. Many customers will experience no downtime at all, while others will see figures well above the 30-second expectation. In practice, an individual outage is likely to last minutes or hours, so a figure like 30 seconds of downtime per month has little meaning other than as an average.
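This averaging effect is easy to demonstrate. In the sketch below, the fleet size, failure rate and outage duration are invented purely for illustration: most systems see zero downtime, a few absorb multi-hour outages, and the fleet-wide average still lands near five nines.

```python
import random

random.seed(42)
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# Hypothetical fleet: 1 array in 1,000 suffers a ~4-hour outage this
# month; the remainder experience no downtime at all.
fleet = [240.0 if random.random() < 0.001 else 0.0 for _ in range(10_000)]

mean_downtime = sum(fleet) / len(fleet)  # minutes, fleet-wide average
availability = (MONTH_MINUTES - mean_downtime) / MONTH_MINUTES * 100
print(f"Fleet average availability: {availability:.4f}%")
```

The individual customer with the four-hour outage sees roughly 99.4% availability that month, yet the published fleet-wide figure remains comfortably above 99.99%.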
In general, on-premises vendor hardware has set a high bar for availability, with some caveats. Achieving high levels of uptime requires the deployment of redundant infrastructure components. This design methodology ensures that any outage (or potential outage) is managed by the system itself, contributing to improved uptime.
RAID, for example, was initially designed to use inexpensive disks with low individual reliability, achieving much greater uptime by operating the disks as a set with redundant extra capacity. In the 2000s, clustering was the typical availability solution. Server virtualisation then provided the capability to implement high availability (HA) for virtual machines and so manage server failures. Today we’re building clusters of containerised applications that offer both resiliency and scale-out capabilities.
How do the public cloud providers manage uptime and availability guarantees? All of the major public cloud vendors offer similar uptime SLAs (service level agreements), usually based over the billing cycle of one month.
AWS divides services into region-level and instance-level SLAs for common platforms like EC2 and EBS. At the region level, the service level objective (SLO) is 99.99%; for instance-level services, it is 99.5%. GCP aims for 99.5% for a single instance and 99.99% across multiple zones or behind load balancing. Figures for Azure are similar, depending on whether services are deployed across availability zones and regions.
Should a service fail, all the cloud providers offer service credits based on the degree of the outage. Typically, this means a 10% credit when uptime falls below 99.99%, a 25% credit when it falls below 99%, and 100% service credits if the figure drops below 95%. To be clear, there’s no compensation for loss of business, only future credits for service failure.
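The tiered credit scheme described above can be sketched as a simple lookup; the thresholds follow the typical figures quoted here, though each provider publishes its own schedule:

```python
def service_credit(uptime_pct: float) -> int:
    """Credit percentage for a monthly uptime figure, using the typical
    tiers: 10% below 99.99%, 25% below 99%, 100% below 95%."""
    if uptime_pct < 95.0:
        return 100
    if uptime_pct < 99.0:
        return 25
    if uptime_pct < 99.99:
        return 10
    return 0

print(service_credit(99.5))  # → 10 (missed four nines, but above 99%)
```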
The 95% figure represents approximately 36.5 hours of downtime in a month, far longer than the outages we saw in 2021. Remember also that SLAs are calculated individually per service. So, for example, an application running on a virtual instance may be up and running while DNS resolution fails. Only the DNS portion of the bill would attract SLA credits, because the virtual instance itself is still accessible across the network.
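This per-service accounting has a compounding effect worth noting: an application that depends on several services in series is only as available as the product of their individual availabilities. A sketch, using hypothetical service figures:

```python
import math

def composite_availability(*availabilities_pct: float) -> float:
    """End-to-end availability of services composed in series:
    the product of the individual availability fractions."""
    return math.prod(a / 100 for a in availabilities_pct) * 100

# Hypothetical stack: instance (99.99%), DNS (99.99%), database (99.95%).
print(f"{composite_availability(99.99, 99.99, 99.95):.3f}%")  # → 99.930%
```

Three services each meeting their SLA can still leave the application below any one of those SLAs end-to-end.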
At first inspection, the public cloud would appear to offer much lower levels of availability than on-premises infrastructure. However, cloud infrastructure rarely (if ever) goes down for scheduled work, which implies the design of public cloud services includes the ability to do online and non-disruptive maintenance. This also includes ongoing upgrades and improvements.
Cloud solution providers also offer SLAs on services that aren’t directly infrastructure-based. For example, AWS RDS has a separate SLA, with specific exclusions for service “misuse”. Now the cloud provider has to combine the availability of hardware with the reliability of software.
One clear comparison point can be made with the emergence of on-premises consumption-based models for infrastructure. Dell APEX, for example, offers storage and compute services. The service offering description for the storage service (found here, PDF) includes a section on availability (appendix A) that indicates an uptime of 99.99% and service credits of 10% for failure to meet this SLA, with 25% and 100% credit for failing to meet 99.95% and 99.9% levels respectively. However, Dell APEX services also expect to have change and maintenance windows that sit outside these SLAs. The APEX SLA definition doesn’t indicate any duration or frequency for maintenance, which still has a direct impact on customer application uptime and end-user experience. So while APEX appears to offer similar SLAs to cloud, the end-user experience could be worse or require the deployment of additional equipment.
Incidentally, it’s worth noting that even where the hardware used to deliver “as-a-service” is the same as a directly purchased solution, the SLAs are different, presumably because the on-premises vendors are simply aligning with expected models in the public cloud.
The Architect’s View™
The uptime and availability figures for cloud services are well documented and transparent. Cloud service providers have never aimed to match the availability of on-premises infrastructure because the cloud vendors offer service guarantees and not just hardware reliability. This distinction is important because as we see on-premises vendors moving to “as-a-service” models, public and private cloud SLAs will align. This transition means that on-premises vendors with technology that allows non-disruptive upgrades, maintenance and replacement will have a more robust offering than those simply packaging existing products as “as-a-service” ready.
While vendors such as AWS offer guidelines and tools like AWS Well-Architected, we must remember that cloud offers generic SLAs across all applications. If uptime is critical to the business, then choices have to be made as to whether mission-critical applications should be treated differently – either in their architecture or by running them on-premises. In 2021 we saw much talk about multi-cloud and hybrid-cloud from an infrastructure perspective. In reality, the use of many cloud models is much more likely to be driven by business decisions than technical ones.
Copyright (c) 2007-2022 – Post #be4d – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.