As the slowest part of the computing architecture, persistent storage is in a constant battle to improve performance. That goal has moved from a focus on throughput a few decades ago to one of reducing latency as far as possible. To that end, NVM Express and NVMe over Fabrics have become areas of focus for storage start-ups. At the same time, we see another strand of storage DNA that is developing storage-class memory (SCM) or persistent memory (PM) products. These make storage byte-addressable and can also put storage on the memory bus. Having storage closer to the processor reduces latency but, as we will discuss, causes other challenges.
The choices for using SCM are pretty straightforward. We can put SCM into the host. We can add it to a storage appliance or we could use it in a hybrid platform like hyper-converged infrastructure. Wherever we place it, the goal is simple – increase the throughput and reduce the latency of persistent I/O. What are the implications of each choice?
Just over two years ago, HPE demonstrated persistent memory in the form of Intel Optane as an acceleration feature for HPE 3PAR. This feature is now generally available and will also be added to the Nimble storage appliances (currently in tech preview). With SCM, HPE claims to reduce I/O latency to below 200µs, with 99% of I/Os below 300µs. While this is good, it is nowhere near the capabilities of SCM itself. Intel Optane DC P4800X NVMe SSDs can achieve read/write performance of 10µs, with Intel Optane NVDIMMs claiming as low as 0.3µs (although these will need the next generation processor architecture).
Putting SCM into the host is the next logical step to fully exploiting the performance benefits on offer, as it removes the latency of the network. This could be in a direct-attached storage (DAS) or hyper-converged model. The exact implementation in HCI really depends on the design of the storage layer. That’s the subject for a different post. As a DAS solution, the SCM would be more directly available to the operating system but would come with all the challenges of a DAS implementation that storage area networks were meant to avoid. SANs provide consolidation, simplified management and both greater availability and resiliency compared to DAS. If an application server fails, then the SAN has the data. This model is well-known and deployed heavily across the industry.
One solution on offer in the market is MAX Data from NetApp. Memory Accelerated Data or MAX Data is a software solution that came via the acquisition of Plexistor by NetApp in 2017. We’ve talked about this technology before and also recently recorded a Storage Unpacked podcast episode from NetApp Insight that provides a high-level view of the technology (embed below).
Exactly how the software works is probably easier to explain in written rather than spoken form. To see why this is important, we have to start by looking at the complexity of the I/O stack in modern operating systems. Figure 1 gives an example of just how complex this is.
Plexistor originally developed a technology called Software-Defined Memory (SDM) that uses a combination of SCM and flash storage to expose either a large memory pool or POSIX-compliant file system as a single namespace directly to an application running under Linux. Presenting a file system to the application means that little or no application changes are needed to use SDM. The file system implementation cut through many of the layers shown in figure 1, providing a much more efficient implementation. In fact, two years ago Plexistor claimed around 10% processor utilisation improvement using SDM.
SDM uses the benefits of byte-addressability and persistence of storage-class memory. Without the ability to read and write at the byte level, any solution that tries to use (for example) NAND flash would have a problem delivering the same levels of performance. This is because traditional storage uses a read/modify/write process at the block level, even if an individual byte is being changed.
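To make the read/modify/write point concrete, here is a minimal sketch in Python. It is a toy model, not a description of how any real device or MAX Data itself works: the 4,096-byte block size and the class names are assumptions for illustration, and real persistent memory typically moves data in cache-line-sized units rather than single bytes.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes (hypothetical)


class BlockDevice:
    """Toy block device: every transfer is a whole block."""

    def __init__(self, num_blocks):
        self.blocks = [bytearray(BLOCK_SIZE) for _ in range(num_blocks)]
        self.bytes_transferred = 0

    def read_block(self, index):
        self.bytes_transferred += BLOCK_SIZE
        return bytearray(self.blocks[index])

    def write_block(self, index, data):
        self.bytes_transferred += BLOCK_SIZE
        self.blocks[index] = bytearray(data)

    def update_byte(self, offset, value):
        index, within = divmod(offset, BLOCK_SIZE)
        block = self.read_block(index)   # read the whole block
        block[within] = value            # modify one byte
        self.write_block(index, block)   # write the whole block back


class ByteAddressableMemory:
    """Toy byte-addressable store: a single byte can be changed in place."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.bytes_transferred = 0

    def update_byte(self, offset, value):
        self.data[offset] = value
        self.bytes_transferred += 1


block_dev = BlockDevice(num_blocks=4)
scm = ByteAddressableMemory(size=4 * BLOCK_SIZE)

block_dev.update_byte(5000, 0xFF)
scm.update_byte(5000, 0xFF)

print(block_dev.bytes_transferred)  # 8192 (one block read plus one block written)
print(scm.bytes_transferred)        # 1
```

In this model, changing a single byte on the block device moves 8,192 bytes, while the byte-addressable store moves one.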
However, we can’t fill our servers with only byte-addressable storage. The technology is expensive and limited in capacity. That’s why the original Plexistor SDM solution was capable of tiering data from high-performance but limited SCM to cheaper block-storage media like flash. This is how MAX Data works today.
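As a rough illustration of tiering, the sketch below demotes the least recently accessed files from SCM to flash until the SCM tier fits within its capacity. This is not NetApp's actual policy: the FileEntry type, the demote_cold_files function, the file names and the capacity figures are all hypothetical, and "temperature" is reduced to a last-access timestamp.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    name: str
    size_gb: int
    last_access: float  # larger value = accessed more recently
    tier: str = "scm"   # each file lives in exactly one tier


def demote_cold_files(files, scm_capacity_gb):
    """Move the coldest files to flash until SCM usage fits the capacity."""
    in_scm = sorted(
        (f for f in files if f.tier == "scm"),
        key=lambda f: f.last_access,
    )
    used = sum(f.size_gb for f in in_scm)
    demoted = []
    for f in in_scm:  # coldest first
        if used <= scm_capacity_gb:
            break
        f.tier = "flash"  # the file moves; no copy is left behind in SCM
        used -= f.size_gb
        demoted.append(f.name)
    return demoted


files = [
    FileEntry("redo.log", 40, last_access=300.0),
    FileEntry("archive.dat", 200, last_access=10.0),
    FileEntry("index.db", 120, last_access=250.0),
    FileEntry("report.csv", 80, last_access=50.0),
]

demoted = demote_cold_files(files, scm_capacity_gb=200)
print(demoted)  # ['archive.dat', 'report.csv']
```

The two coldest files move down, leaving 160GB of hot data in a 200GB SCM tier.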
It’s Not A Cache
Note that we’ve said tiering here. This is a tiered solution where data exists only in one place: either locally on the host or in a storage appliance. In the case of MAX Data, that’s an AFF A800 array. The distinction between a tier and a cache is extremely important. Caching is great for heavy read-biased activities where the entire data set can be read into memory. When the active data set is larger than the cache, thrashing occurs and cache objects have to be expired or dropped from the cache to make way for the next piece of active data. The result is a cliff-edge drop in I/O performance.
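The cliff edge is easy to reproduce with a small simulation. The sketch below assumes an LRU eviction policy and a workload that repeatedly scans the working set in order, which is the worst case for LRU. Real caches and real workloads will behave less starkly, and the lru_hit_rate function and the sizes used are illustrative only.

```python
from collections import OrderedDict


def lru_hit_rate(cache_size, working_set, passes=10):
    """Hit rate of an LRU cache under repeated sequential scans."""
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for key in range(working_set):
            accesses += 1
            if key in cache:
                hits += 1
                cache.move_to_end(key)         # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return hits / accesses


for working_set in (50, 100, 101, 200):
    print(working_set, round(lru_hit_rate(100, working_set), 2))
# 50 0.9
# 100 0.9
# 101 0.0
# 200 0.0
```

With a 100-entry cache, the hit rate stays at 90% (everything after the first warm-up pass) until the working set grows by a single item beyond the cache size, at which point every access misses.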
Write I/O with caching requires either a write-through approach (writing to the cache’s backing store before acknowledging) or some way to replicate the cache (when using a write-back algorithm). Write-through ties write latency to the speed of the backing store, whereas write-back requires additional hardware.
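The trade-off can be expressed as a very simple latency model. This is a sketch under stated assumptions: the function names and all of the latency figures are hypothetical, and it assumes the cache and the second copy are written in parallel, so the slower of the two determines when the write is acknowledged.

```python
def write_through_latency_us(cache_us, backing_store_us):
    """Write is acknowledged only once the backing store has the data."""
    return max(cache_us, backing_store_us)


def write_back_latency_us(cache_us, mirror_us):
    """Write is acknowledged once the cache and its mirror hold the data."""
    return max(cache_us, mirror_us)


# Hypothetical figures: 10µs cache, 200µs backing store, 30µs mirrored cache.
print(write_through_latency_us(cache_us=10, backing_store_us=200))  # 200
print(write_back_latency_us(cache_us=10, mirror_us=30))             # 30
```

Write-back is faster in this model only because a second, protected copy of the cache exists, which is the additional hardware referred to above.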
With tiering, read and write I/O can operate at the speed of the storage-class memory, as long as sufficient capacity exists in the server. Data is tiered based on either capacity or data “temperature” (cooler, less active files are moved down). Note that tiering to the AFF layer is at the file, not block level. As MAX Data is POSIX compliant, applications can provide “intent” on file usage through calls like fadvise, which provides a mechanism to pin data into the SCM layer.
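For reference, this is roughly what issuing an fadvise hint looks like from an application, using Python's wrapper around the POSIX call. Which advice values MAX Data acts on, and how it maps them to pinning, isn't covered here, so the use of POSIX_FADV_WILLNEED is an assumption, as are the advise_hot helper and the temporary file used to exercise it.

```python
import os
import tempfile


def advise_hot(path):
    """Hint that the whole file will be needed soon.

    Returns True if the hint was issued, False if the platform or
    file system does not support it.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # posix_fadvise is not available on every platform
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and length=0 cover the file from start to end.
        # POSIX_FADV_WILLNEED is an assumed choice of advice value.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


# Exercise the helper against a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hot data" * 1024)
    path = tmp.name

try:
    issued = advise_hot(path)
finally:
    os.remove(path)

print("hint issued:", issued)
```

The hint is advisory. On a standard Linux file system it typically triggers read-ahead into the page cache, and the file system is free to ignore it.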
There’s still the issue of resiliency to mitigate. If a server running MAX Data fails, the data is either trapped in the server or could be lost if the server had some kind of issue that damaged the SCM (like fire). In this instance, data can be synchronously replicated to another node across a high-speed network like 100GbE or InfiniBand. Naturally, there might be a small latency penalty to pay here, depending on the networking used.
The integration with NetApp AFF means data can be protected through snapshots (called MAX snap in this instance) and from there archived or accessed however the customer chooses using existing ONTAP data services.
Quid Pro Quo
In the interests of balance, what are the disadvantages of using MAX Data? Possibly the most obvious is in increased complexity. The application is now using pretty specific (and more expensive) hardware. Implementing resiliency requires another server to which the data is mirrored. Recovery means bringing data back from that server to the original server. There’s currently no HA option that allows the application to continue running on the mirror hardware.
It’s also not 100% clear what the O/S requirements are. Rob McDonald indicated (on the podcast) that the MAX Data code ran in user space, whereas the original SDM implementation from Plexistor was kernel based. NetApp may have worked hard to mitigate this but I suspect there are still some kernel components.
Today NetApp states that the MAXFS file system has been submitted to the Linux kernel as ZUFS (Zero-copy User-mode File System). This would operate like FUSE does today and have some supported kernel code, with most of the file system functionality delivered from user space.
It’s worth remembering that there is no easy way to fix the yin and yang of I/O performance versus latency. Compromises have to be made somewhere. But MAX Data does at least minimise the compromises that have to be made. So for those applications that truly must run with sub-10 microsecond latency, MAX Data is a practical solution.
A few thoughts spring to mind on what might be planned for MAX Data in the future. The first is the ability to address the MAX Data namespace as memory. The memory API existed in Plexistor SDM; however, NetApp hasn’t spoken about it much (although it does appear on the image borrowed for figure 2). How exactly could this be used? It would be interesting to see how this feature compares with, for example, Western Digital’s DRAM caching solution.
Then there’s the public cloud. MAX Data is software. So, the solution could be deployed there. In the Plexistor days, CEO Sharon Azulai provided some performance figures at a Tech Field Day presentation using AWS instances. So why not use Plexistor as a super-fast file system that is backed by either NetApp Cloud Volumes or local NetApp Private Storage? This would seem a logical extension.
Finally, there’s the question of using MAX Data with NetApp AI, the reference architecture that uses AFF and NVIDIA DGX servers. Multiple vendors have demonstrated solutions in this area. How would NetApp’s figures change when storage latency reduces from 200µs to only 3µs?
The Architect’s View
I’d like to see some more performance figures for MAX Data. At this stage, I think NetApp is keeping its powder dry and will release some interesting data in the future (note: I have no actual insight here, I’m guessing), once the implications can be easily digested. The cloud and AI integrations look the most interesting, but for businesses, the initial implementations may be more mundane and tackle traditional applications (like Oracle) where performance and licence cost savings can be made. I would class MAX Data as a “must watch” technology for 2019 that could really exploit the potential of SCM and technologies like Intel Optane.
There’s much more information than could be practically fitted into this post. So, I recommend checking out this video (it’s about an hour long) from a Tech Field Day event last year. It provides a much deeper dive into the technology.
Disclaimer: I was invited to attend NetApp Insight in Barcelona in December 2018, with access to marketing and technical people to help formulate some of the thoughts in this blog post.
Copyright (c) 2007-2019 – Post #2281 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.