Is the Public Cloud Becoming More Reliable?

When AWS announced the GA of EBS io2 Block Express volumes, the quoted durability rose to 99.999%, compared with 99.8% to 99.9% for previous SSD and HDD offerings. We wondered whether public cloud reliability has been improving consistently and, if so, why. Here are some initial discoveries.

Definitions

It’s worth spending a moment to consider durability and availability in a storage and workload context. Durability relates to permanent data loss, whereas availability covers uptime (and by extension, downtime). An application can be unavailable without losing data permanently, for example, if network issues occur or a forced VM or host reboot occurs. We should expect durability to be way better than availability.

The durability of individual storage media is quoted as MTBF (mean time between failures) or AFR (annual failure rate), with the latter becoming increasingly common. The two measures are inversely related: divide the number of hours in a year (8,760) by the MTBF to get the AFR. A two-million-hour MTBF is an AFR of 0.44%, whereas an MTBF of 1.5 million hours is an AFR of 0.58%. In a population of, for example, 10,000 unprotected drives, an AFR of 0.44% predicts around 44 drive failures per year.
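As a rough illustration, the conversion described above can be sketched in a few lines of Python. The function names are our own, and the simple division is a linear approximation that ignores leap years and assumes a constant failure rate over the year.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours; leap years ignored


def mtbf_to_afr(mtbf_hours: float) -> float:
    """Approximate annual failure rate (as a fraction) from an MTBF in hours."""
    return HOURS_PER_YEAR / mtbf_hours


def expected_failures(afr: float, population: int) -> float:
    """Expected number of failures per year in a population of unprotected drives."""
    return afr * population


for mtbf in (2_000_000, 1_500_000):
    afr = mtbf_to_afr(mtbf)
    failures = expected_failures(afr, 10_000)
    print(f"MTBF {mtbf:,} h: AFR {afr:.2%}, about {failures:.0f} failures per 10,000 drives")
```

For the small failure rates quoted by drive vendors, this linear estimate is very close to the exponential form, 1 - exp(-hours / MTBF), so the simpler calculation is usually good enough for comparisons.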

When AWS claims 99.9% durability for EBS volumes, this represents an AFR of 0.1%, or ten failures per year in a sample of 10,000 volumes. This figure is clearly better than an unprotected drive but likely only achieved by three-way mirroring (naively, 0.44% cubed if the three copies fail independently). Both AWS io2 options offer 99.999% durability (or 0.001% AFR, or one failure per year in 100,000 volumes).
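The arithmetic behind the 99.9% and 99.999% figures can be checked with a short sketch. The helper names are hypothetical, and the model simply treats the quoted durability percentage as the complement of an annual failure rate.

```python
def durability_to_afr(durability_pct: float) -> float:
    """Annual failure rate (as a fraction) implied by a durability percentage."""
    return (100.0 - durability_pct) / 100.0


def failures_per_year(durability_pct: float, volumes: int) -> float:
    """Expected volume failures per year at the quoted durability."""
    return durability_to_afr(durability_pct) * volumes


for durability, volumes in ((99.9, 10_000), (99.999, 100_000)):
    afr = durability_to_afr(durability)
    count = failures_per_year(durability, volumes)
    print(f"{durability}% durability: AFR {afr:.3%}, about {count:.0f} failure(s) per {volumes:,} volumes")
```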

This improvement is two orders of magnitude better than previous generations and must be delivered with some form of erasure coding (which might include RAID). As drive capacities increase, this is the only way we think that data loss can be prevented. AWS S3 gives us an idea of this, where durability is quoted as eleven 9’s (99.999999999%), and we can be pretty sure erasure coding is the standard for object storage, not just for protection but also efficiency.

Note: For the sake of brevity, we’ve not discussed MTTDL (mean time to data loss), which covers the case where data lost to a failure cannot be rebuilt or recovered before another catastrophic failure occurs. The simplest example is a RAID-1 pair in which the surviving mirror fails before the failed mirror has been rebuilt. This naturally affects our calculations, but this post isn’t a lesson in mathematics, just a discussion of the observed evolution.

Availability

Remember that availability and durability are not the same. Data storage solutions are implemented by storage servers and networking, each of which has a reliability factor that affects availability. In the enterprise, Infinidat, for example, implements three storage nodes and remote replication to reach a 100% guarantee (which we will refer to later). In recent years, enterprise vendors have pushed past the typical five 9’s to offer availability that is one or two orders of magnitude better. This has been achieved through increased product reliability, architectural changes, and the use of remote (synchronous) replication.
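To see how component reliability feeds into overall availability, here is a minimal sketch using the standard series and parallel models. The availability figures are hypothetical and not taken from any vendor, and the redundancy calculation assumes replicas fail independently, which real systems only approximate.

```python
from math import prod


def series_availability(components):
    """All components must be up, so their availabilities multiply."""
    return prod(components)


def redundant_availability(replicas):
    """At least one replica must be up, assuming independent failures."""
    return 1 - prod(1 - a for a in replicas)


# Hypothetical figures for illustration only.
node = 0.999      # a single storage node
network = 0.9999  # the network path to it

single_path = series_availability([node, network])
three_nodes = series_availability([redundant_availability([node] * 3), network])

print(f"one node plus network:    {single_path:.6f}")
print(f"three nodes plus network: {three_nodes:.6f}")
```

In this simple model, once the nodes are redundant, the non-redundant network path becomes the limiting factor, which is one reason architectural changes matter as much as component reliability.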

Instance Reliability

Has AWS instance reliability changed over time? Amazon currently offers a 99.99% service availability guarantee for EC2 in any individual region. In the last ten years, this guarantee has only changed once, from 99.95% to 99.99%, not a substantial amendment. However, the timing of the SLA change does align with the initial rollout of Nitro. Naturally, it takes time to fully deploy a new hardware solution across the entire cloud ecosystem, but it’s possible the gradual deployment of Nitro has enabled AWS to offer an improved SLA.
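To put the two SLA figures in context, an availability percentage can be converted into a permitted downtime budget. The sketch below assumes a 365-day year and a 30-day month as measurement periods; the period over which a given provider actually measures its SLA is not something we have verified here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def allowed_downtime_minutes(availability_pct: float, period_minutes: int) -> float:
    """Downtime budget, in minutes, for an availability target over a period."""
    return (1 - availability_pct / 100) * period_minutes


for sla in (99.95, 99.99):
    yearly = allowed_downtime_minutes(sla, MINUTES_PER_YEAR)
    monthly = allowed_downtime_minutes(sla, MINUTES_PER_30_DAY_MONTH)
    print(f"{sla}%: {yearly:.1f} min per year, {monthly:.2f} min per 30-day month")
```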

Issues of Scale

It’s tempting to think that the public cloud hyper-scalers are moving to increased levels of reliability to make their offerings look more “enterprise-like”. However, the original mantra of the cloud included the assumption that developers should be building reliable software on unreliable hardware and use software-based techniques to manage availability. We don’t think that position has changed. Instead, the issue at the heart of these improvements is simply dealing with scale.

With a constant failure rate, a 10-fold increase in the volume of business (deployed virtual instances and associated storage) translates into a 10-fold increase in failures. While software can manage those failures, any unexpected failure is sub-optimal, resulting in additional network traffic for rebuilds, reduced performance and increased cost. Then there’s the risk factor. Each unrecovered disk failure requires a rebuild from backup and could result in downtime. At hyper-scale, it makes sense to be as reliable as possible.
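The scaling argument can be made concrete with a small sketch. The base fleet size is hypothetical, and the model assumes failures scale linearly with the number of deployed volumes. It shows both the failures expected at a constant AFR and the AFR that would be needed to keep the absolute number of failures flat as the fleet grows.

```python
def expected_failures(afr: float, fleet_size: int) -> float:
    """Expected failures per year at a given annual failure rate."""
    return afr * fleet_size


def afr_to_hold_failures(failure_budget: float, fleet_size: int) -> float:
    """AFR required to stay within a fixed yearly failure budget."""
    return failure_budget / fleet_size


base_afr = 0.001        # 0.1% AFR, equivalent to 99.9% durability
base_fleet = 1_000_000  # hypothetical number of deployed volumes
budget = expected_failures(base_afr, base_fleet)

for growth in (1, 10, 100):
    fleet = base_fleet * growth
    at_constant_afr = expected_failures(base_afr, fleet)
    required_afr = afr_to_hold_failures(budget, fleet)
    print(f"{growth:>3}x fleet: {at_constant_afr:,.0f} failures at constant AFR; "
          f"AFR of {required_afr:.4%} needed to hold failures at {budget:,.0f}")
```

Under these assumptions, holding the failure count steady through a hundred-fold growth requires a hundred-fold better AFR, which is the kind of pressure the scale argument describes.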

Philosophy

Of course, there’s a philosophical argument that says hardware isn’t becoming more reliable but instead, software is reaching down to ever lower layers of integration with the components of disks and servers. Hyper-scalers want greater control over the hardware components, with the benefits delivered through infrastructure software. Ultimately, the aim is the same – make the entire end-to-end system much more reliable. It’s just that we see that improved reliability in terms of what’s offered at the IaaS level.

Cloud Ecosystem

How are the other public clouds doing? Google currently offers service level objectives of 99.99% for instances in multiple zones or 99.5% for a single instance. This SLO is arguably slightly worse than the terms quoted in 2015, when the SLO was 99.95% for any instance type, and definitely worse than in 2018. For storage, the picture is mixed; generally, regional devices are better than zonal (which should be expected), with the best devices offering 99.9999% durability, which is better than AWS. Curiously, zonal extreme persistent disk has the same durability as regional balanced and SSD persistent disk and is better than regional standard disk. This obviously reflects the underlying technology in play.

Azure offers anywhere from 95% to 99.9% SLA for single-instance virtual machines, depending on the disk type in use. Virtual machines in an availability set within the same zone have 99.95% SLA, and virtual machines across two or more availability zones have 99.99% SLA. Obviously, these guarantees only apply to the uptime of the instances and not to the application, which also needs to be fault tolerant. Storage SLAs in Azure are based on the storage accounts, offering 99.9% to 99.99% availability and at least eleven 9’s of durability.

The Architect’s View®

The service levels on offer are surprisingly variable between cloud vendors and even within their own product sets. For the storage component, we haven’t examined performance as another metric, which also must be taken into consideration.

However, the idea of the “unreliable cloud” compared to the enterprise is probably not justified. Storage products are highly available; however, the public cloud generally doesn’t offer the highest level of availability provided by solutions like VMware’s vSphere HA. In that respect, the public cloud isn’t the same as on-premises, but it was never promised to be. If the VMware HA experience is essential, then there’s always vSphere running on the cloud, with the associated price premium.

As ever, the detailed specifics of each cloud vendor’s implementation are still reflected in the parameters of the services offered, even though we might think that had all been abstracted away. So, perhaps the cloud is just someone else’s computer after all.