The Impact of Meltdown/Spectre for Storage & HCI

Chris Evans | Cloud, Cloud Storage, Storage

Edit: Chris Mellor at The Register has additional comments from vendors, including Scale Computing (link).  Further Register emails (link) point to significant slowdowns in workloads.  And here’s another post regarding the need (or not) to patch storage appliances.

Unless you’ve been 100% disconnected from technology (and the Internet) over the last few days, it’s been impossible to avoid the discussion about new vulnerabilities discovered in Intel and other processors.  Two new exposures, dubbed Spectre and Meltdown, exploit speculative execution of code to leak or give access to sensitive user data.  The Spectre exploits expose data through a number of branch prediction issues, whereas Meltdown provides unauthorised access to kernel memory from user space.

So far, the industry has responded quickly (although the threats were identified in the middle of last year).  Patches are available for popular operating systems and the major cloud providers have already started patching the machines supporting their cloud infrastructure.  However, with Meltdown in particular, there is a workload-dependent performance impact that could be anywhere between 5% and 50%, especially for storage-intensive workloads.

KAISER

Looking a bit deeper into the Meltdown vulnerability, the ability to access kernel-mode memory is being mitigated using a patch called KAISER.  This implements stronger isolation between kernel and user space memory address spaces and has been shown to stop the Meltdown attack.  KAISER was already in development for other reasons, which I guess is why we have seen such a quick rollout of fixes for Linux, Windows and macOS.  Patching against Meltdown has resulted in performance degradation and increased resource usage, as reported for public cloud-based workloads.
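
If you want to verify that a Linux host has actually picked up the mitigation, recent patched kernels expose their Meltdown status through sysfs and advertise a “pti” CPU flag.  Here’s a minimal sketch (Linux only; unpatched or older kernels may not have the sysfs files at all):

```python
#!/usr/bin/env python3
"""Check whether a Linux host reports the Meltdown (KPTI/KAISER) mitigation.

A minimal sketch: it relies on the sysfs vulnerability files that patched
kernels expose; older, unpatched kernels may not have them at all.
"""
from pathlib import Path

SYSFS_MELTDOWN = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")
CPUINFO = Path("/proc/cpuinfo")


def meltdown_status() -> str:
    """Return the kernel's own description of its Meltdown status, if available."""
    if SYSFS_MELTDOWN.exists():
        return SYSFS_MELTDOWN.read_text().strip()
    return "unknown (kernel predates the sysfs vulnerabilities interface)"


def kpti_flag_present() -> bool:
    """Check /proc/cpuinfo for the 'pti' flag that patched x86 kernels advertise."""
    try:
        text = CPUINFO.read_text()
    except OSError:
        return False
    return any("pti" in line.split() for line in text.splitlines()
               if line.startswith("flags"))


if __name__ == "__main__":
    print("Meltdown status :", meltdown_status())
    print("KPTI flag set   :", kpti_flag_present())
```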

Storage

Presumably, the overhead for I/O is due to the context switching that occurs when reading and writing data to and from an external device.  I/O gets processed by the O/S kernel and the extra work involved in isolating kernel memory introduces an extra burden on each I/O.  I expect both traditional (SAS/SATA) and NVMe drives to be affected, because all of these protocols are managed by the kernel.  However, I wonder (pure speculation) whether there’s a difference between SAS/SATA and NVMe, simply because NVMe is more efficient.
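
One way to get a feel for the per-I/O cost is to time a large number of small reads against a cached file: most of what you then measure is the cost of crossing the user/kernel boundary, which is exactly where the patch adds work.  The sketch below is illustrative only; the test file path and call count are arbitrary assumptions, and the idea is to run it before and after patching and compare:

```python
#!/usr/bin/env python3
"""Rough per-syscall read cost probe.

A sketch, not a benchmark suite: it times many small pread() calls against a
file that will mostly be served from page cache, so the dominant cost is
entering and leaving the kernel on each call.  The test file path and call
count are arbitrary assumptions for illustration.
"""
import os
import time

TEST_FILE = "/tmp/io-probe.dat"   # hypothetical test file
CALLS = 200_000
READ_SIZE = 512                   # small reads keep the syscall cost dominant


def per_call_read_cost(path: str, calls: int, size: int) -> float:
    """Return the mean wall-clock time per pread() call, in microseconds."""
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(calls):
            os.pread(fd, size, 0)          # repeated read of the same offset
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / calls * 1e6


if __name__ == "__main__":
    # Create a small file to read from if it does not already exist.
    if not os.path.exists(TEST_FILE):
        with open(TEST_FILE, "wb") as f:
            f.write(os.urandom(READ_SIZE))
    print(f"~{per_call_read_cost(TEST_FILE, CALLS, READ_SIZE):.2f} µs per pread()")
```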

The additional work being performed with the KAISER patch appears to be introducing extra CPU load, based on the feedback reported so far.  This means it must also affect latency.  Bearing in mind almost the entire storage industry uses x86 these days, what will be the impact for the (hundreds of) thousands of storage arrays deployed in the field, plus software-defined solutions?

Traditional Arrays

The impact to traditional storage is two-fold.  First, there’s extra system load; second, potentially higher latency for application I/O.  Customers implementing this patch need to know whether the increased array CPU levels will have an impact on their systems.  A very busy array could have serious problems.  The second issue of latency is more concerning.  That’s because, like most performance-related problems, quantifying the impact is really hard.  The mixed workload profiles that exist on today’s shared arrays make the effect of a code change difficult to predict.  Hopefully, storage vendors are going to be up-front here and provide customers with some benchmark figures before they apply any patches.
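
In the meantime, customers can at least capture their own before/after figures.  The sketch below assumes you have exported per-I/O latency samples (one millisecond value per line; the file names are hypothetical) from the array’s performance tooling or a benchmark run, and it simply reports how the p50/p95/p99 figures move:

```python
#!/usr/bin/env python3
"""Compare pre- and post-patch latency samples.

A sketch under the assumption that per-I/O latency samples (in milliseconds,
one value per line) have been captured before and after patching, e.g. from
the array's performance tooling or a benchmark run.  The file names and
format are hypothetical; adjust to whatever your tooling produces.
"""
import statistics
from pathlib import Path


def load_samples(path: str) -> list:
    """Read one latency value (ms) per whitespace-separated token."""
    return [float(token) for token in Path(path).read_text().split()]


def percentile_report(before: list, after: list) -> None:
    """Print p50/p95/p99 for both runs and the relative change."""
    b_cuts = statistics.quantiles(before, n=100)
    a_cuts = statistics.quantiles(after, n=100)
    for label, p in (("p50", 50), ("p95", 95), ("p99", 99)):
        b, a = b_cuts[p - 1], a_cuts[p - 1]
        print(f"{label}: {b:.2f} ms -> {a:.2f} ms ({(a - b) / b * 100:+.1f}%)")


if __name__ == "__main__":
    percentile_report(load_samples("latency_prepatch.txt"),
                      load_samples("latency_postpatch.txt"))
```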

SDS

Then there’s the issue of how Meltdown affects SDS-based implementations.  The same obvious questions of latency and performance exist.  However, there’s another concern around solutions that use containers to deliver storage resources.  Meltdown has been shown specifically to impact container security, enabling one container to read the contents of another.  If storage is being delivered with containers, how is the data being protected in this instance?  What protection is there to ensure a rogue container doesn’t get access to all of the data containers on a host?

Hyper-converged

Extending the discussion further, there seem to be some specific issues for hyper-converged solutions and storage.  Hyper-convergence distributes the storage workload across all hosts in a scale-out architecture.  Implementing patches for Meltdown could increase the storage component overhead by up to 50%.  If storage uses 25% of the processor of each host, then the impact (for example) could be an increase of 12.5 percentage points in CPU utilisation.  This could put some deployments under stress and will certainly affect future capacity planning.
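
The arithmetic behind that worked example is easy to sketch.  The figures below mirror the text (storage at 25% of host CPU, a worst-case 50% overhead) plus an assumed 60% overall host utilisation; they are illustrative, not measurements:

```python
#!/usr/bin/env python3
"""Back-of-the-envelope HCI CPU headroom check.

A sketch of the worked example in the text: if the storage stack consumes a
given share of each host's CPU and the Meltdown patch inflates that share by
some factor, how much total utilisation results?  All figures here are
illustrative assumptions, not measurements.
"""


def projected_host_util(other_util: float, storage_util: float, overhead: float) -> float:
    """Non-storage utilisation stays flat; storage utilisation grows by `overhead`."""
    return other_util + storage_util * (1 + overhead)


if __name__ == "__main__":
    # Worked example from the text: storage at 25% of host CPU, worst-case +50% overhead.
    storage = 0.25
    new_storage = storage * 1.5
    print(f"Storage share: {storage:.1%} -> {new_storage:.1%} "
          f"(+{(new_storage - storage) * 100:.1f} CPU percentage points)")

    # The same host, assumed to be 60% busy overall before patching.
    print(f"Host total   : 60.0% -> {projected_host_util(0.35, 0.25, 0.5):.1%}")
```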

Vendor Responses

A quick check across vendor websites shows few statements on the impact to storage products from either Meltdown or Spectre.  The only communication I’ve received has been from StorPool, which indicates the company is still investigating the impact of the bugs and the recommended patching.  Of course, storage vendors may be writing to their customers directly, in which case I wouldn’t see it.  However, a public statement would be good to see.  For hyper-converged, I’ve found feedback from Nutanix (via this link), but there’s no mention of the impact on performance.  Here’s what I’ve located so far.

The Architect’s View

Meltdown and Spectre could be seen as “once in a generation” flaws that are actually very hard to exploit.  However, I would say that as we see more transparency in the hybrid cloud age, it’s unlikely we’ve seen the end of big issues like this.  It would be good for the storage vendors to put a stake in the ground and say what they are doing to mitigate the impact of Meltdown/Spectre.  The public cloud providers have been quick to do so, although their focus has been more about getting patched than about the impact on application performance.

Further Reading

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2009-2018 – Post #64B0 – Chris M Evans, first published on https://blog.architecting.it, do not reproduce without permission.