Public Cloud is Still Hardware Focused

Chris Evans | Cloud

Microsoft Azure has recently announced that the latest AMD EPYC processors will be available on a range of Azure compute instances.  The new instances are storage optimised, meaning they are aimed at high I/O workloads.  This isn’t surprising, because EPYC processors typically offer two to three times the bandwidth capability of their Intel Xeon counterparts.  With the focus now on implementing fixes for Meltdown/Spectre, which hit performance for high-intensity I/O, AMD may become a more attractive option in the future.

Looking more widely at the detail of Compute offerings in the public cloud, I was initially amazed that there’s still such a focus on hardware.  However, after drafting this post, the impact of Meltdown/Spectre made me think again.  I’ll still walk through my logic, but please read the conclusions, which are different post-Meltdown.

Hardware Focused

Have a look at the announcement I’ve referenced.  The new Lv2 Series of instances uses the EPYC 7551 processor and SSD storage.  These are physical hardware characteristics, rather than abstracted metrics like vCPU and memory.  Across the Azure Compute offerings you can see the same thing: references to specific processor models and speeds.  Azure only recently moved to using vCPU to indicate performance, whereas AWS used the ECU (EC2 Compute Unit) as an abstracted measure of performance from when EC2 was first released.

With a little digging, I’ve found a reference to the various processor models, compute instance families and the values Azure provides for its ECU equivalent, called the ACU or Azure Compute Unit.  You can find the list here.  Within the document are details of the ACU values for each family of VMs.  Naturally, the ACU values within a family vary depending on the use of Turbo Boost, so the figures are quoted as a range.
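
To compare families, those ranges can be normalised against the 100-ACU Standard_A1 baseline (explained in the next section).  Here’s a minimal Python sketch; the ACU figures below are illustrative placeholders rather than quotes from the Microsoft document:

```python
# Illustrative ACU ranges per Azure VM family. These are placeholder
# values for the sketch; the real figures live in Microsoft's ACU
# document linked above and change as new hardware generations appear.
ACU_RANGES = {
    "Standard_A1": (100, 100),  # the baseline: 100 ACU by definition
    "Dv2-Series": (210, 250),   # ranges reflect Intel Turbo Boost
    "F-Series": (210, 250),
    "H-Series": (290, 300),
}

BASELINE = 100  # Standard_A1 = 100 ACU

def relative_to_a1(family: str) -> tuple[float, float]:
    """Return a family's ACU range as multiples of the A1 baseline."""
    low, high = ACU_RANGES[family]
    return low / BASELINE, high / BASELINE

for family in ACU_RANGES:
    low, high = relative_to_a1(family)
    print(f"{family}: {low:.1f}x to {high:.1f}x Standard_A1 performance")
```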

AWS ECU

So how can we compare with AWS EC2 Compute Units?  An ACU is based on a Standard_A1 instance and given a relative value of 100.  The A-Series is based on the Intel E5-2670 processor, whereas the ECU was originally based on a 2007-era 1.2GHz Intel Xeon.  AWS no longer references this definition in its documentation of what constitutes an ECU, and some time ago dropped the use of ECU from the instance definitions in the EC2 documentation.  However, you can still find ECU figures for everything other than t2 instances when launching an instance in the EC2 GUI.  An EC2 m5.large, for example, is 2 vCPUs and 10 ECU, whereas an m4.large is 2 vCPUs and only 6.5 ECU.  m5 instances are based on the Intel Xeon Platinum 8175, whereas m4 instances use Xeon E5-2676 (v3 Haswell) or E5-2686 (v4 Broadwell) processors.
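
Given those figures, ECU per vCPU is a quick way to see the generational uplift between instance families.  A small sketch, using only the m4/m5 numbers quoted above:

```python
# vCPU and ECU figures for two EC2 generations, as shown in the EC2
# launch wizard (the same numbers quoted in the paragraph above).
INSTANCES = {
    "m4.large": {"vcpu": 2, "ecu": 6.5},   # Xeon E5-2676 v3 / E5-2686 v4
    "m5.large": {"vcpu": 2, "ecu": 10.0},  # Xeon Platinum 8175
}

for name, spec in INSTANCES.items():
    print(f"{name}: {spec['ecu'] / spec['vcpu']:.2f} ECU per vCPU")

# Generational uplift at the same vCPU count: 10 / 6.5 is roughly 1.54,
# i.e. m5.large delivers about 54% more rated compute than m4.large.
uplift = INSTANCES["m5.large"]["ecu"] / INSTANCES["m4.large"]["ecu"]
print(f"m5.large vs m4.large: {uplift:.2f}x")
```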

All Bets are Off

At this point, I was hoping to dive a bit deeper and provide some basis for a reasonable comparison between cloud providers.  However, Meltdown now confuses the issue.  The patches being developed have a variable impact on workloads.  This is sold to us on the “your mileage may vary” scale, which basically means no one has any real idea of the impact.  We do know that storage-intensive workloads are more likely to suffer.  This has resulted in an interesting stance from storage vendors, some of whom are claiming their products are closed systems and so not affected by Meltdown/Spectre.  What I think they really mean is that they are choosing not to apply patches, due to the negative performance impact and the backlash from customers.  Anyway, I digress.

The issue the Meltdown patches introduce is that pre- and post-patch performance figures will differ.  Differences will be introduced not just by the cloud hypervisors, but by the age of the CPUs and the versions of operating systems in use.  So, although it initially seemed reasonable to compare cloud providers, the truth is much more opaque.  In fact, I’d say we can only talk in generalities about relative performance, even within the same cloud provider.
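
One way to get a feel for the overhead on your own instances is a syscall microbenchmark, since the Linux Meltdown mitigation (KPTI) adds cost to every user/kernel transition.  Here’s a rough Python sketch, assuming a Linux instance; it’s no substitute for a proper tool such as fio, and because the interpreter adds its own fixed overhead, only the delta between pre- and post-patch runs is meaningful:

```python
import os
import time

def ns_per_syscall(iterations: int = 1_000_000) -> float:
    """Time 1-byte reads from /dev/zero (Linux). Each read(2) is a
    user/kernel transition, which the KPTI patch makes more expensive."""
    fd = os.open("/dev/zero", os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.read(fd, 1)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / iterations * 1e9  # nanoseconds per call

if __name__ == "__main__":
    # Run before and after patching (or across instance types) and
    # compare; the absolute number matters less than the delta.
    print(f"{ns_per_syscall():.0f} ns per read(2) round trip")
```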

The Architect’s View™

It becomes obvious why, even in the public cloud and with application abstraction, we can’t avoid talking about the hardware.  It’s reasonable to expect that different generations of processor families will have different performance characteristics.  However, I would hope cloud providers would at least benchmark their offerings as they are released, which is what we do see.  Will Azure, AWS, GCP and others redo their benchmarks post-Meltdown?  I don’t know.  It may be that the cloud service providers will claim little or no impact and so argue the work isn’t justified.

What alternatives do we have to manage performance?  Well, as with the last 40-50 years of computing, we’ll have to go back and review the use of capacity planning tools.  There were plenty of vendors at AWS re:Invent selling this kind of solution, so that’s another area of research and investigation to cover.  In the meantime, we need to continue using the approximations the vendors provide.  If I had a storage-intensive application today, I’d be seriously investigating the AMD options.

One last thought… As instances are essentially virtual machines, can we expect that the only true abstracted way to run an application is Serverless?  Are there any Serverless benchmarks yet?

Copyright (c) 2007-2022 – Post #9ED2 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.