Avoiding All-Flash Missing IOPS

We are living in a golden age of storage media, where new products come to market offering ever-greater performance with lower latency than previous generations. Today the hard drive and even early flash drives look positively archaic compared to modern NVMe drives built from 3D-XPoint, capable of delivering hundreds of thousands of IOPS and gigabytes of throughput.

With such a rich set of resources from which to build systems, why are we not seeing this performance directly available in shared storage systems? Where are all of those missing IOPS going?

Media Performance

It’s worth taking a moment to look at the scale of the problem from a media perspective. The hard drive market has bifurcated into capacity and performance. Ignoring capacity for a moment, typical top-end hard drives have an I/O latency of 2ms at best and around 2-10ms on average. Look at any workload performance test with mixed I/O patterns, and hard drives will usually deliver fewer than 500 IOPS. The mechanical nature of the device and the need to move read/write heads across the internal platters directly affect performance.

Solid-state disks fare much better. Typical SAS/SATA drives will deliver around 2GB/s read/write (depending on block size) with anything up to 400,000 read IOPS and 150,000 write IOPS (typically 250,000 with a 70/30 read/write split). Larger capacity drives do better because they have more internal channels and chips across which to write data in parallel.

Top-end NVMe flash drives can deliver around 8GB/s read and 5GB/s write throughput, at just under 1 million read IOPS and 130K write IOPS. NVMe Optane drives currently offer similar performance levels but can hit latency figures as low as 10µs (compared to about 25µs for NAND), some 500-fold better than the HDDs we started with.

Parallelism

The move to NVMe media brings massive parallelism compared to SAS/SATA devices. An NVMe drive is capable of managing up to 64K queues of I/O, each holding up to 64K requests. SAS and SATA drives, by contrast, each manage a single queue, holding 32 requests on SATA and around 254 on SAS. Doing work in parallel fits well with the concept of shared storage, as we will discuss later.
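To put the queue figures above side by side, here is a minimal sketch of the arithmetic. It assumes "64K" means 65,536 and treats the SAS figure as exactly 254; the function and constant names are purely illustrative.

```python
# Queue limits as quoted in the text. "64K" is taken to mean 65,536 here,
# which is an assumption about the rounding used.
NVME_QUEUES = 64 * 1024
NVME_QUEUE_DEPTH = 64 * 1024
SATA_QUEUE_DEPTH = 32   # one queue per drive
SAS_QUEUE_DEPTH = 254   # one queue per drive, approximate


def max_outstanding(queues: int, depth: int) -> int:
    """Upper bound on the number of requests a drive can have queued."""
    return queues * depth


nvme = max_outstanding(NVME_QUEUES, NVME_QUEUE_DEPTH)
sas = max_outstanding(1, SAS_QUEUE_DEPTH)
sata = max_outstanding(1, SATA_QUEUE_DEPTH)

print(f"NVMe: {nvme:,} requests")
print(f"SAS:  {sas:,} requests")
print(f"SATA: {sata:,} requests")
```

These are protocol ceilings rather than depths a real workload is likely to reach, but the gap of several orders of magnitude shows how much more concurrency the protocol allows.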

Shared Storage

It seems reasonable to assume that aggregating lots of persistent media would produce systems that have many millions of IOPS at very low latency levels. If we build a system from (for example) 48 or 96 flash drives, we should expect tens of millions of IOPS and tens of gigabytes per second of throughput at very low latency. But this isn’t borne out in reality, and for flash systems it never has been. Real-world performance figures have lagged significantly behind what should be expected for all-flash storage arrays, based on an aggregate of the underlying performance of the media. So what’s going on?
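As a rough illustration of that expectation, the sketch below simply multiplies the per-drive read figures quoted earlier by the drive count. It assumes perfect linear scaling with no controller, network or software overhead, which is exactly the assumption real arrays fail to meet. The NVMe figure is rounded up to 1 million IOPS, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveSpec:
    name: str
    read_iops: int
    read_gb_per_s: float


# Per-drive read figures taken from the media discussion above.
SAS_SATA_SSD = DriveSpec("SAS/SATA SSD", 400_000, 2.0)
NVME_FLASH = DriveSpec("NVMe flash", 1_000_000, 8.0)  # "just under 1 million", rounded


def ideal_aggregate(spec: DriveSpec, drive_count: int) -> tuple:
    """Naive media-only total: per-drive figure multiplied by drive count."""
    return spec.read_iops * drive_count, spec.read_gb_per_s * drive_count


for spec in (SAS_SATA_SSD, NVME_FLASH):
    for count in (48, 96):
        iops, gb_per_s = ideal_aggregate(spec, count)
        print(f"{count} x {spec.name}: {iops:,} IOPS, {gb_per_s:,.0f} GB/s")
```

Even the slower SAS/SATA figures give 19.2 million IOPS from 48 drives on paper, which is the yardstick against which shipping arrays fall short.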

Shared Architectures

First of all, we need to look at the reason storage was aggregated into shared systems in the first place. In the pre-2000 period, storage was distributed across many physical servers. Resources were wasted and orphaned on servers that didn’t need all of the installed capacity. Drives failed, resulting in high levels of maintenance and downtime. Deploying many spindles (individual drives) was the only way to improve performance.

Implementing shared storage across a storage network resolved many of these issues. Resources were now pooled and more efficient. A smaller set of arrays provided better management and could afford to hold spare devices and do automated rebuilds when drives failed. Systems could use cache to improve performance and mitigate the issues of slow hard drives.

Features

As shared storage acquired new features, the code path grew longer. Storage administrators love the abstraction of physical media, systems that offer block and file on the same platform and, of course, data protection features like RAID, snapshots and replication. All of these features are great but introduce latency into the I/O path. RAID-5, for example, introduces a write penalty – each logical write I/O requires four I/O requests to media. Basic RAID-6 requires six I/O requests. All of these extra features need to be tracked and managed in metadata, which naturally creates an overhead.
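The effect of the write penalty on usable IOPS can be sketched with a simple model. It assumes every logical read costs one media I/O and every logical write costs the full penalty (four for RAID-5, six for RAID-6, as above), ignoring cache and full-stripe writes. The 1 million back-end IOPS figure and the function name are illustrative.

```python
def host_iops(backend_iops: float, read_fraction: float, write_penalty: int) -> float:
    """Logical IOPS available to hosts, given raw media IOPS and a RAID write penalty.

    Each logical read is assumed to cost 1 media I/O and each logical
    write `write_penalty` media I/Os.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    write_fraction = 1.0 - read_fraction
    media_ios_per_host_io = read_fraction + write_fraction * write_penalty
    return backend_iops / media_ios_per_host_io


BACKEND_IOPS = 1_000_000  # illustrative raw media capability

for level, penalty in (("RAID-5", 4), ("RAID-6", 6)):
    usable = host_iops(BACKEND_IOPS, read_fraction=0.7, write_penalty=penalty)
    print(f"{level}: about {usable:,.0f} host IOPS at 70/30 read/write")
```

Under these assumptions, a 70/30 workload sees roughly half of the raw IOPS on RAID-5 and 40% on RAID-6, before any other feature is counted.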

Media Management

In the days of hard drives, a few hundred extra microseconds didn’t really matter. However, as we move to latency levels of around 10-50µs for our media, inefficient software becomes a problem. As drive capacities increase, the only way to get the best use out of their capabilities will be to run I/O in parallel. Shared storage is all about mixed workloads, so a modern storage architecture needs to manage multiple concurrent host/input streams while writing to multiple concurrent media streams – per media device.
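To see why the same software overhead matters so much more on fast media, the sketch below works out what share of total I/O latency is spent in software. The 200µs overhead and the 5ms hard drive figure are assumptions chosen from the ranges mentioned in the text; the NAND and Optane figures are the ones quoted earlier.

```python
def software_share(media_us: float, software_us: float) -> float:
    """Fraction of total I/O latency that is spent in the software stack."""
    return software_us / (media_us + software_us)


SOFTWARE_US = 200.0  # assumed storage software overhead per I/O

# Media latency in microseconds. The HDD value is an assumed mid-range figure.
MEDIA_US = {
    "HDD": 5000.0,
    "NAND NVMe": 25.0,
    "Optane NVMe": 10.0,
}

for name, media_us in MEDIA_US.items():
    share = software_share(media_us, SOFTWARE_US)
    print(f"{name:12s} {share:6.1%} of latency is software")
```

With these numbers, software accounts for under 4% of the latency on a hard drive but around 90% or more on NVMe media, so the code path becomes the dominant cost.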

New media is also challenging to manage compared to hard drives. NAND flash is read and written in pages but can only be erased in much larger blocks. As a result, even byte-level changes will eventually require the erasure and re-writing of entire blocks of data. When there are insufficient free blocks, an SSD will go into ‘garbage collection’ mode, directly impacting I/O performance. Storage array vendors have to work around these problems in order to maintain consistent and low latency.

Controller Architecture

Probably the biggest issue in exploiting storage media well is the architecture of shared storage. The traditional architecture, where all data has to pass through a set of shared controllers, creates an inherent bottleneck on I/O throughput. Overall performance is no longer dictated by the media (as it was with spinning disks), but by the bandwidth of the processor(s), front-end port connectivity, PCI Express lanes and storage I/O adaptors. Where possible, it makes sense to keep the control and data planes separate and eliminate the bottlenecks introduced by a dual-controller architecture.
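A simple way to reason about this is to treat the data path as a chain of stages, where the slowest stage sets the deliverable throughput. The sketch below uses hypothetical figures for a dual-controller array: the controller and front-end port numbers are invented for illustration, and only the 8GB/s per-drive figure comes from the text.

```python
from typing import Dict, Tuple


def find_bottleneck(stages_gb_per_s: Dict[str, float]) -> Tuple[str, float]:
    """Return the name and bandwidth of the slowest stage in the data path."""
    name = min(stages_gb_per_s, key=stages_gb_per_s.get)
    return name, stages_gb_per_s[name]


# Hypothetical dual-controller array, all values in GB/s.
STAGES = {
    "media": 48 * 8.0,          # 48 NVMe drives at 8GB/s each
    "controllers": 2 * 25.0,    # assumed 25GB/s through each controller
    "front-end ports": 8 * 12.5,  # assumed eight ports at 12.5GB/s each
}

name, limit = find_bottleneck(STAGES)
media_used = limit / STAGES["media"]
print(f"Bottleneck: {name} at {limit:.0f} GB/s")
print(f"Share of media bandwidth actually usable: {media_used:.0%}")
```

In this made-up configuration the controllers cap the system at 50GB/s, leaving most of the media bandwidth unreachable however fast the drives are.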

Fundamental Redesign

Just as storage architectures were specifically designed for NAND, new storage platforms will need to be written specifically for NVMe devices. As research has shown (LINK), NVMe performance means persistent storage is no longer the bottleneck in servers and storage. Fundamentally, to make the best use of NVMe media, we will need to redesign storage platforms, eliminate new bottlenecks and design to meet the needs of the media.

Why Care?

It’s an obvious question – why care about lost IOPS if my vendor gives me the performance I need at the right price? There are a number of good reasons to use NVMe flash efficiently.

Cost – NVMe, flash and SCM devices are expensive, and in absolute $/GB terms much more expensive than spinning media. Depending on your supplier and device choice, there’s probably still a 10x differential. NAND flash is definitely becoming cheaper with the introduction of TLC and now QLC (links) but costs are still high. However, do the same comparison on $/IOPS and flash/SCM delivers so much more. If another vendor uses those resources more efficiently, you could get a better deal elsewhere.
TCO – Platform cost isn’t just about the media. It also encompasses space, power, cooling and the management overhead of administering many devices. The expense of managing shared storage can easily outweigh the acquisition cost.
Working Set – Efficiently using an array that delivers high I/O density means modern application demands such as analytics can be accommodated on a single platform. Processing ML/AI is all about low latency and high throughput, which can’t easily be achieved by tiered or scale-out disk-based architectures.
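The $/GB versus $/IOPS comparison in the Cost point can be sketched as follows. The prices and capacities are hypothetical, chosen only to reflect the 10x $/GB differential mentioned above; the IOPS figures are the mixed-workload numbers quoted in the media section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    capacity_gb: float
    price_per_gb: float  # hypothetical price, in dollars
    iops: float          # mixed-workload IOPS from the media discussion

    @property
    def price(self) -> float:
        return self.capacity_gb * self.price_per_gb

    @property
    def price_per_iops(self) -> float:
        return self.price / self.iops


# Hypothetical devices of equal capacity with a 10x $/GB differential.
HDD = Device("HDD", capacity_gb=4000, price_per_gb=0.03, iops=500)
SSD = Device("SSD", capacity_gb=4000, price_per_gb=0.30, iops=250_000)

for dev in (HDD, SSD):
    print(f"{dev.name}: ${dev.price_per_gb:.2f}/GB, ${dev.price_per_iops:.4f}/IOPS")

advantage = HDD.price_per_iops / SSD.price_per_iops
print(f"Flash is about {advantage:.0f}x cheaper per IOPS in this example")
```

The advantage only holds if the array lets hosts reach those IOPS, which is why lost IOPS translate directly into lost value.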

What does Good look like?

How can good, efficient solutions be identified? There are a few obvious measures.

Latency – With NVMe drives, latency needs to be at worst around 200µs to be competitive. Today, the fastest platforms are down to the sub-100µs level, depending on the architecture and features. However, latency numbers can be “best case” rather than averages, which is misleading in terms of consistency. Ask your vendor if numbers are real-world or based on the best scenario.

Bandwidth – Hero numbers are being quoted that run into hundreds of gigabytes per second. This is good, but if 300 NVMe drives are required to deliver it, the average per drive may be less than 1GB/s or even in the multiple hundreds of MB/s. It’s worth asking your vendor exactly what configuration delivers the performance numbers quoted, and trying to work out how much bandwidth per drive is being used.

Throughput – Quoted IOPS numbers can be misleading. Vendors will typically quote figures for read and write separately, at small block sizes. We saw these obfuscation tricks with the first all-flash arrays. Again, it’s worth asking under what conditions the I/O figures were generated.
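For the Bandwidth check above, the per-drive arithmetic is simple enough to run against any vendor datasheet. The sketch below assumes a hypothetical 200GB/s hero number delivered by 300 drives, and compares the result with the 8GB/s read figure quoted earlier for a top-end NVMe drive.

```python
def per_drive_bandwidth(system_gb_per_s: float, drive_count: int) -> float:
    """Average bandwidth each drive contributes to the quoted system figure."""
    if drive_count <= 0:
        raise ValueError("drive_count must be positive")
    return system_gb_per_s / drive_count


def media_utilisation(system_gb_per_s: float, drive_count: int,
                      drive_gb_per_s: float) -> float:
    """Fraction of the drives' rated bandwidth the system actually delivers."""
    return per_drive_bandwidth(system_gb_per_s, drive_count) / drive_gb_per_s


HERO_GB_PER_S = 200.0  # hypothetical vendor hero number
DRIVES = 300
DRIVE_GB_PER_S = 8.0   # top-end NVMe read figure from the text

each = per_drive_bandwidth(HERO_GB_PER_S, DRIVES)
used = media_utilisation(HERO_GB_PER_S, DRIVES, DRIVE_GB_PER_S)
print(f"Per drive: {each * 1000:.0f} MB/s, or {used:.0%} of rated bandwidth")
```

In this example each drive contributes under 700MB/s, less than a tenth of what it could deliver on its own.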

Most importantly, ask how systems perform across a variety of workload mixes. 70% read, 30% write is typical and can be very different from the 100% read or 100% write hero numbers.
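One rough way to sanity-check a mixed-workload claim against separate read and write hero numbers is a weighted harmonic mean, sketched below. This is a simplification: it assumes reads and writes share the same resources and do not interfere with each other, which real devices and arrays will not honour exactly. The per-drive figures are the top-end NVMe numbers quoted earlier, rounded.

```python
def blended_iops(read_iops: float, write_iops: float, read_fraction: float) -> float:
    """Estimate mixed-workload IOPS from pure-read and pure-write figures.

    Uses a weighted harmonic mean: the average time per I/O is the
    read/write-weighted average of the individual service times.
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    time_per_io = read_fraction / read_iops + (1.0 - read_fraction) / write_iops
    return 1.0 / time_per_io


READ_IOPS = 1_000_000  # "just under 1 million", rounded
WRITE_IOPS = 130_000

mixed = blended_iops(READ_IOPS, WRITE_IOPS, read_fraction=0.7)
print(f"Estimated 70/30 mix: about {mixed:,.0f} IOPS")
```

Under this model, the 70/30 estimate comes out at roughly a third of the pure-read hero number, which shows how far apart the two kinds of figure can be.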

When looking at the architecture, ask questions about scalability. Does the solution scale linearly for performance and concurrent users? How many controllers are needed per set of drives? Ask whether performance is consistent with all data services turned on. Some vendors quote numbers with no services enabled and the difference can be eye-watering.

The Architect’s View

We’re entering a period in which the challenges seen in the first generation of real all-flash solutions are being played out again with NVMe. In some respects, the stakes are higher, because we’ve already become accustomed to high-performance solutions and expectations are themselves high. Platform architecture is going to be even more important than it was in the first move to solid-state media because NVMe will expose any weaknesses in design. Despite the claims of software eating the world, good hardware design is about to become important all over again.
