The Race towards End-to-End NVMe in the Data Centre

Chris Evans – All-Flash Storage, NVMe

The next greatest thing in storage hardware is the move to NVMe.  This is occurring in two ways – conversion of internal designs within shared storage platforms to use NVMe at the “back-end”, and front-end (or host connectivity) using NVMe over Fabrics in some form.  Vendors are starting to use the term “end-to-end” NVMe as a distinguishing factor in the next generation of products.  By this, they mean that host to media is NVMe all the way.  How does this all play out and what should potential customers look for?

Architecture Primer

Let’s do a quick recap of how shared storage architectures have evolved.  In shared storage arrays, controllers talk to hard drives or SSDs at the “back-end”.  Physical connectivity is established through SAS (Serial Attached SCSI) adapters, SAS expanders and cabling.  SAS was preceded by FCAL (Fibre Channel Arbitrated Loop), which was rather messy; the move to SAS simplified cabling and significantly improved reliability.  SAS is a point-to-point protocol, although bandwidth is shared across the connected adapters and drives.

At the front end, Fibre Channel has dominated the enterprise for nearly 20 years, with iSCSI fitting more SMB/SME type applications.  FCoE (Fibre Channel over Ethernet) never really took off, which is an interesting precedent for the rip-and-replace discussion we’ll have later on.

NVMe Back End

NVMe as a back-end protocol means replacing the existing SAS enclosures, drives and associated hardware with NVMe-capable components.  NVMe drives use a form factor called U.2 (formerly SFF-8639), which looks like a 2.5″ SSD, albeit with a slightly different connector to SAS (although it was designed to be pin compatible).  Drives are generally dual-ported and hot-swappable, which is essential for enterprise operations.  NVMe SSDs plug into a PCIe backplane or switch, which connects directly to the PCIe root complex.

SAS 3.0 supports up to 12Gb/s (although a lot of systems still use 6Gb/s), whereas NVMe drives typically use four lanes of PCIe 3.0 each, giving just shy of 4GB/s in total.  Note the difference in units here – raw maximum bandwidth at the protocol level for NVMe is about 3-4 times that of SAS.  However, the performance improvements are not purely down to the hardware interface.  NVMe is a much more efficient protocol with less overhead than SAS or SATA.  This makes NVMe storage much less of a bottleneck than the flash storage that went before it.  It also positions things nicely for the eventual adoption of Storage Class Memory (SCM).
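As a rough illustration of that units difference, here’s a back-of-the-envelope calculation in Python comparing the nominal bandwidth of one SAS 3.0 link with one x4 PCIe 3.0 NVMe drive.  The encoding adjustments are simplified and real-world throughput will be lower, but the ratio lands in the 3-4x range quoted above.

```python
# Back-of-the-envelope comparison of nominal interface bandwidth.
# Figures are line rates with simple encoding adjustments; real-world
# throughput will be lower, but the ratio is what matters here.

# SAS 3.0: 12 Gb/s line rate, 8b/10b encoding
sas_gb_per_s = 12 * (8 / 10) / 8            # ~1.2 GB/s per link

# NVMe U.2 drive: x4 lanes of PCIe 3.0 at 8 GT/s per lane, 128b/130b encoding
nvme_gb_per_s = 8 * (128 / 130) * 4 / 8     # ~3.9 GB/s per drive

print(f"SAS 3.0 link:      ~{sas_gb_per_s:.1f} GB/s")
print(f"NVMe x4 PCIe 3.0:  ~{nvme_gb_per_s:.1f} GB/s")
print(f"Ratio:             ~{nvme_gb_per_s / sas_gb_per_s:.1f}x")
```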

Better, Faster, Stronger

With a revised backplane and NVMe drives, today’s storage arrays are much faster than their predecessors.  This results in a general increase in performance, with the next potential bottleneck being front-end connectivity.  Remember that Fibre Channel (FC) and iSCSI are simply transports for SCSI.  When FC was invented, it probably seemed logical to use SCSI as the storage protocol, rather than create a new one.  Early Fibre Channel networks (or SANs) still used hard disk drives, so the latency introduced by the network wasn’t much of a problem.

Today, that latency is an issue.  The answer is NVMe over Fabrics (NVMe-oF).  The term NVMe-oF encompasses a range of solutions that use NVMe as the storage protocol over a fabric (transport layer) such as Fibre Channel or RDMA (RoCE and iWARP on Ethernet, or InfiniBand).  The result of using NVMe-oF to the host is an increase in performance and a further reduction in latency.

NVMe over Fibre Channel (FC-NVMe) can use existing Gen5 (16Gb/s) and Gen6 (32Gb/s) technology.  This means IT organisations that have deployed Gen5/6 switches and HBAs can start to use an FC-NVMe array today.  There’s no “rip and replace” of the storage network and the two protocols (SCSI and NVMe) can work side-by-side together.

Alternatively, RDMA can be used as the transport layer, running across Ethernet (RoCE – RDMA over Converged Ethernet) or iWARP (Internet Wide Area RDMA Protocol).  There are lots of technical differences between iWARP and RoCE that make one protocol more suitable than the other in certain scenarios; however, that’s out of scope for this discussion.  Typically, RoCE-based solutions have been brought to market first by vendors.
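To make the host side more concrete, below is a minimal sketch (in Python, wrapping the standard Linux nvme-cli tool) of discovering and connecting to an NVMe-oF target over RDMA.  The target address and NQN are hypothetical placeholders, and the host is assumed to have nvme-cli and the appropriate NVMe-oF kernel modules available; an FC-NVMe attach follows a similar pattern using the fc transport.

```python
# A minimal sketch of attaching a Linux host to an NVMe-oF target over RDMA
# using the standard nvme-cli tool.  The address and NQN are hypothetical
# placeholders - substitute values from your own environment.
import subprocess

TARGET_ADDR = "192.168.10.50"                    # hypothetical target IP address
TARGET_NQN = "nqn.2018-05.com.example:array01"   # hypothetical subsystem NQN

def run(cmd):
    """Run a command (requires root) and print its output."""
    print("$", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)

# Discover the subsystems exported by the target over RDMA on the default port (4420).
run(["nvme", "discover", "-t", "rdma", "-a", TARGET_ADDR, "-s", "4420"])

# Connect to the subsystem; its namespaces then appear as local /dev/nvmeXnY devices.
run(["nvme", "connect", "-t", "rdma", "-n", TARGET_NQN, "-a", TARGET_ADDR, "-s", "4420"])

# Confirm the remote namespaces are visible alongside any local NVMe drives.
run(["nvme", "list"])
```

From the host’s perspective the connected namespaces behave like local NVMe devices, which is what makes the “end-to-end” claim meaningful.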

Vendor Implementations

Who has NVMe support today?

Apeiron Data has a disaggregated scale-out NVMe solution based on storage enclosures and custom host bus adaptors.  The ADS1000 array supports up to 32 40GbE ports and 24 NVMe 2.5″ drives, with a total capacity of 38TB to 384TB per shelf.  Throughput is quoted at 72GB/s per shelf, with up to 18 million 4K IOPS.  Optane-based systems achieve 12µs latency, with NAND flash systems seeing 100µs (of which only 2.7µs is claimed to be the platform itself).  The disaggregated architecture means there is no traditional controller.  Instead, hosts are connected directly to drives, with redundant I/O modules (IOMs) in the chassis managing connectivity.

Dell EMC announced the new PowerMax platform recently at Dell Technologies World.  PowerMax is intended as a replacement for VMAX, with a fully configured system capable of a claimed 10 million IOPS, 150GB/s throughput and a 50% improvement in latency over VMAX 950F.  There are two models – the 2000, which offers one or two PowerBricks per system, and the 8000, with one to eight PowerBricks.  The 2000 models implement two 24-bay NVMe DAE drive shelves per PowerBrick, providing a maximum of 96 drives per system.  The 8000 series scales to 288 drives.  In both model types, the maximum drive capacity is 7.68TB.  Maximum effective capacities for the 2000 and 8000 series are 1PB and 4PB respectively.  PowerMax currently does not offer any NVMe-oF support.

E8 Storage has two storage appliance models that implement a disaggregated architecture.  The E8-D24 is a dual-controller design with eight 100GbE or 100Gb/s InfiniBand network ports.  The controller supports up to 24 drives and 154TB of raw capacity.  Performance figures are quoted at 10 million IOPS (read) and 1 million IOPS (write), with 40GB/s and 20GB/s throughput respectively.  Latency is quoted at 100µs (read) and 40µs (write).  The E8-S10 appliance has a single controller and four network ports, with up to 10 drives and a maximum 77TB of raw capacity.  Latency figures are equivalent to the E8-D24, with 4 million IOPS (read) and 500,000 IOPS (write) at 16GB/s (read) and 8GB/s (write) throughput.

Kaminario announced their K2.N NVMe-based platform in mid-2017.  The architecture uses a scale-out design comprising c.nodes (controllers) and m.nodes (media/storage) connected through a converged Ethernet fabric (up to 50GbE RoCE).  Each m.node can hold up to 24 NVMe SSDs in capacities from 960GB to 7.68TB.  Front-end connectivity supports NVMe-oF, Fibre Channel and iSCSI.  Performance is claimed to be around 100µs latency, with 400K IOPS and 5GB/s throughput per c.node.

NetApp recently announced the release of AFF A800, an NVMe-enabled all-flash platform that delivers “end-to-end NVMe” support.  The base hardware consists of two 2U controllers with 48 internal NVMe SSD slots in total, each controller connecting to each SSD with two lanes of PCIe Gen 3.  This translates to around 2GB/s of bandwidth per drive per controller.  NetApp is claiming performance figures of 300GB/s throughput, 11.4 million IOPS and latencies of 200µs or better, presumably in a fully configured system.  Note that at initial release, the maximum drive capacity is 7.6TB.  I expect this is because the only 15TB drives on the market are SAS devices.  Expansion shelves for the A800 also look to only accommodate SAS drives.  At the front end, the A800 currently offers only FC-NVMe, with NVMe over Ethernet likely in a future release.

Pavilion Data has a rack-scale NVMe solution based on standard NVMe drives, a centralised PCIe fabric and 10 dual-controller line cards based on a low-power Broadwell SoC.  Each controller supports four 100GbE ports and internal PCIe connectivity to up to 72 drives, for a maximum of 1PB in 4U.  Performance is claimed to be 120GB/s of throughput at 100µs latency.  Front-end host connectivity is over 40GbE or 100GbE RoCE using standard host-based NVMe-oF drivers.

Pure Storage announced the new FlashArray//X series, including the //X90, at Pure Accelerate in May 2018.  The FlashArray platform has been NVMe-enabled since FlashArray//M was released in 2015, which means most existing customers can transition their existing hardware to NVMe without replacing the entire platform.  The //X systems use DirectFlash, a custom NVMe NAND flash module that improves throughput and reduces latency to the NAND itself.  DirectFlash modules scale from 2.2TB to 18.3TB capacities, delivering up to a maximum of 3PB in 6U.  Pure is claiming latencies as low as 250µs with current protocols.  NVMe-oF support is a future enhancement planned for 2H2018.

Tegile released two NVMe-based storage arrays in 4Q2017.  The IntelliFlash N5200 and N5800 use a similar form factor to other vendors, with 24 NVMe-connected drives in a 2U footprint.  Tegile claims to be able to deliver 3 million IOPS and a consistent 200µs latency with full data services.  The N5200 scales from 23-184TB (960GB to 7.68TB drives), while the N5800 scales from 19-154TB (800GB to 6.4TB drives).  At the front end, both platforms support 8/16Gb/s Fibre Channel, 40GbE/10GbE and 1Gb/s Ethernet.

Vexata has two NVMe-based products, one using NAND flash (VX-100F), the other Intel Optane (3D XPoint) (VX-100M).  Both systems are based on a dual active/active controller architecture that separates the control and data planes.  Each chassis is capable of supporting up to 16 ESMs (Enterprise Storage Modules), each holding four NVMe or Optane drives.  Maximum capacity is 187.5TB for the VX-100F and 32TB for the VX-100M.  Front-end connectivity supports 16x 32Gb/s Fibre Channel.  Vexata claims 7 million IOPS and 70GB/s at just 200µs with a 70/30 read/write workload for the VX-100F, and 7 million IOPS at 40µs with 80GB/s throughput for the VX-100M.

Other Implementations

IBM has previewed NVMe over Fabrics for InfiniBand running on FlashSystem and Power9.  NetApp also has NVMe over Fabrics for InfiniBand on the EF570.  HPE has announced new Nimble solutions that are “NVMe Ready”.  Excelero has a software-defined solution that implements a proprietary form of RDMA called RDDA that meshes together NVMe drives across many servers (check out our Storage Unpacked podcast from last year that covers the NVMesh architecture).

The Architect’s View

There’s a lot of information to absorb in this blog post.  Congratulations if you made it all the way here!  The key takeaways are that internal infrastructures are evolving to use NVMe, and that we can expect to see a transition from traditional Fibre Channel and iSCSI to much faster storage networks.  Like the move to NAND flash before it, the move to NVMe will highlight which architectures can (and cannot) fully exploit the new medium.  NVMe over Fibre Channel provides a transition path for businesses heavily invested in Fibre Channel hardware (and cabling).  However, for super-low latency and high throughput, FC is being superseded by Ethernet.

We can imagine (once again) a cycle of premium-priced products that depreciate in cost over time.  We also have some new architectures, like disaggregated and consolidated platforms that look to challenge the traditional shared storage architecture.

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2007-2019 – Post #1CB1 – Chris M Evans, first published on https://www.architecting.it/blog, do not reproduce without permission. Disclaimer: NetApp, Pure Storage and Vexata are clients of Brookend Ltd.