NVMe over Fabrics – Caveat Emptor?

Chris Evans – All-Flash Storage, NVMe, Opinion, Storage

For the past 20 years, Fibre Channel has been the dominant enterprise data centre storage networking technology.  By this, I mean traditional Fibre Channel, not variants that work over IP or use Converged Ethernet (more on that in a moment).  How will the NVMe transition work and will Fibre Channel continue to dominate?

Of course, storage protocols other than Fibre Channel do exist.  Smaller organisations have typically used iSCSI because of its relative ease of implementation, but Fibre Channel has been more successful, for reasons we’ll get into.

As someone who used Fibre Channel in the early days (and mainframe ESCON before that), I could see that the relative simplicity and benefits of shared storage would be significant for data centre operations.  As a result, getting Fibre Channel out of the data centre will be harder than people think – and that means developing a transition strategy as we move to a world of NVMe.

Success Factors

Why was Fibre Channel so successful?  With the proliferation of data in separate servers and the use of direct-connect SCSI, storage was in a bit of a mess towards the late 1990s.  If you’ve ever walked into a data centre without shared storage (and I have), you will recognise the support headache of managing many servers, each with potentially differing drive types, sizes and capacities.  Some of the issues include:

  • Inventory – with so many different drives deployed over time, IT teams needed to hold large inventories of spare drive types, and to understand where and how larger drives could (or could not) be substituted for smaller ones.
  • Downtime – not all drive replacements could be done online; taking systems down for drive replacements and data rebuilds was disruptive and could affect the bottom line.
  • Fragmentation – as drive capacities increased (and smaller capacities were discontinued), it was increasingly likely that individual servers were overprovisioned with storage capacity that would never be used.
  • Risk – probably the most important factor. With so many drives in a large data centre, administrators would be replacing drives on a daily basis, introducing the risk of choosing the wrong drive – or even the wrong server – for replacement.

Centralising storage took away most of the issues on this list.  Just as importantly, it enabled scalability, increased performance and reliability, and reduced arguably the most important factor – risk.

Reassuringly Expensive

I’ve never seen Fibre Channel as a complex solution, although I agree that it was expensive.  When HBAs cost upwards of $1000 each, fitting two into a server for redundancy added noticeably to the cost of the server estate.  Add in around $500 per switch port and it’s easy to see why Fibre Channel was used in the enterprise, while SMBs focused on iSCSI.
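To put rough numbers on that expense, here’s a minimal sketch using the indicative prices above – two HBAs per server plus a matching switch port for each.  The figures are the illustrative list prices quoted in this post, not a current quote, and the helper function is purely hypothetical.

    # Rough per-server cost of redundant Fibre Channel connectivity, using the
    # indicative prices quoted above (illustrative figures only).
    HBA_COST = 1000          # dollars per HBA
    SWITCH_PORT_COST = 500   # dollars per switch port

    def fc_cost_per_server(hba_count: int = 2) -> int:
        """Each HBA needs a matching switch port, so cost scales with HBA count."""
        return hba_count * (HBA_COST + SWITCH_PORT_COST)

    if __name__ == "__main__":
        # Dual-fabric design: two HBAs and two switch ports per server.
        print(f"FC connectivity per server: ${fc_cost_per_server():,}")        # $3,000
        print(f"Across a 100-server farm:   ${100 * fc_cost_per_server():,}")  # $300,000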

However, expensive as it was, Fibre Channel was and still is incredibly reliable.  From a management perspective, it’s also very practical because fabric definitions can be centralised and pushed out to multiple switches simultaneously.  New features like Smart SANs and Peer Zoning can reduce the management overhead even further.

Most of the time, getting things right with Fibre Channel is about getting the right standards in place.  After that, everything is relatively easy.
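As one illustration of the kind of standard that makes life easy, here’s a minimal sketch of a single-initiator zoning naming convention, where every zone pairs one host HBA alias with one array port alias and the zone name is derived mechanically from its members.  The convention, aliases and helper functions are my own hypothetical example, not any vendor’s tooling.

    # A hypothetical naming standard for single-initiator zoning: one zone per
    # (host HBA alias, array port alias) pair, named predictably from its members.
    def zone_name(host_alias: str, array_alias: str) -> str:
        """Build a predictable zone name, e.g. 'z_websrv01_hba0__arrayA_p1'."""
        return f"z_{host_alias}__{array_alias}"

    def build_zones(host_aliases: list[str], array_aliases: list[str]) -> dict[str, tuple[str, str]]:
        """Return a mapping of zone name to its (host HBA, array port) member pair."""
        return {
            zone_name(h, a): (h, a)
            for h in host_aliases
            for a in array_aliases
        }

    if __name__ == "__main__":
        zones = build_zones(["websrv01_hba0", "websrv01_hba1"],
                            ["arrayA_p1", "arrayA_p2"])
        for name, members in zones.items():
            print(name, "->", members)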

Here’s some additional background on NVMe and Fibre Channel in a Storage Unpacked podcast episode from November 2018.

Rip and Replace

So why has Fibre Channel stood the test of time better than other storage networking solutions?  I think we can see a number of reasons.

  • Industry adoption – the storage industry has widely adopted Fibre Channel on almost all shared storage platforms. It’s the protocol of choice for these systems and the de-facto industry standard.
  • Investment – businesses have invested in Fibre Channel. This isn’t just about physical switches and HBAs, but also the wider infrastructure of cabling, port monitoring and aggregation tools, and physical trunking.
  • People – let’s not forget that storage teams still exist in many organisations. The storage department has a significant investment in Fibre Channel skills.  There’s also a philosophical conflict in design methodology between networking and storage teams – and that’s fine.  Storage networks should be designed differently from host traffic networks.

The last point probably explains some of the lack of adoption of FCoE.  Vendors like Cisco would, of course, like to have their products as the only solution for all networking in the data centre.  There are some benefits to consolidating storage and IP networking, but this can easily escalate into turf wars.  Storage teams (rightly or wrongly) simply weren’t accepting of a move to a protocol and infrastructure over which they would have no control.

You can listen to a discussion on Ethernet versus Fibre Channel in this recent Storage Unpacked podcast episode.

Transition Strategy

The transition away from Fibre Channel will be a gradual but inevitable one, as we move to a more dynamic and software-defined data centre.  The transition reflects how the enterprise data centre itself has evolved.  It’s been ten years since EMC (now Dell EMC) first introduced flash storage into modern shared storage arrays.  In that time, we’ve seen a transformation through hybrid to all-flash and now the adoption of NVMe technology.  All of these platforms still exist within a data centre somewhere.

NVMe Transition

The move to NVMe will also be a transition.  The majority of IT environments won’t need NVMe performance across the board.  In the move to all-flash, only the worthiest of applications were placed on flash media because the price of flash storage was high.  The same economics apply to the move to NVMe.  We can identify a number of transition steps, summarised in the sketch after this list:

  • NVMe back-end. Move to an NVMe-based storage array.  With existing Fibre Channel connectivity at the front-end, vendor solutions are delivering response times in the 200-300µs range, with millions of IOPS and tens to hundreds of gigabytes per second of throughput.  This step can be taken as part of a normal refresh cycle.
  • NVMe front-end with Fibre Channel. Move to NVMe at the front-end over existing Fibre Channel equipment.  This does need Gen5 (16Gb/s) or Gen6 (32Gb/s) capable hardware, but will give another performance boost.  The impact of making this change is relatively small, as NVMe and SCSI can coexist as protocols on the same Fibre Channel infrastructure.
  • NVMe front-end with Ethernet. Where performance dictates even faster speeds, NVMeoF using Ethernet switching will give even lower latency capability, around the 100µs level.
  • SCM NVMe. The final transition is to use storage-class memory such as Optane, over high-speed Ethernet or InfiniBand.  Vendors are quoting figures around the 40-50µs response time mark (or lower) with this technology.
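Pulling those steps together, here’s a minimal sketch that tabulates the transition tiers against the rough response-time figures quoted above.  The numbers are the indicative upper bounds from the bullets (not benchmarks), the FC-NVMe front-end step has no figure quoted so it is recorded as such, and the helper function is purely illustrative.

    # Indicative response times for each NVMe transition step, taken from the
    # rough figures quoted in the list above (upper bound of the quoted range;
    # illustrative, not benchmark results).
    TRANSITION_STEPS = [
        # (step, transport, approx. response time in microseconds; None = no figure quoted)
        ("NVMe back-end",            "SCSI over Fibre Channel",           300),
        ("NVMe front-end, FC-NVMe",  "Fibre Channel Gen5/Gen6",           None),
        ("NVMe front-end, NVMeoF",   "Ethernet",                          100),
        ("SCM with NVMeoF",          "high-speed Ethernet or InfiniBand",  50),
    ]

    def first_step_meeting(target_us: float):
        """Return the earliest step whose quoted figure meets a latency target."""
        for step, transport, latency_us in TRANSITION_STEPS:
            if latency_us is not None and latency_us <= target_us:
                return step, transport
        return None

    if __name__ == "__main__":
        for step, transport, latency_us in TRANSITION_STEPS:
            quoted = f"~{latency_us}µs" if latency_us is not None else "no figure quoted"
            print(f"{step:26s} {transport:36s} {quoted}")
        print("First step meeting a 100µs target:", first_step_meeting(100))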

The Case for Ethernet

So why not just move today to Ethernet as the front-end technology?  While there’s no reason not to, wholesale replacement of technology across the board introduces risk.  IT has a history of implementing skunkworks projects to test things out, so the right approach may be to implement NVMeoF in a limited way, where any issues or caveats of the technology can be evaluated and ironed out. 

What might these be?  Well, some NVMeoF transports (such as those using RoCEv1) can’t be routed, which forces a flat layer-2 network.  Remember the impact of RSCNs on large-scale Fibre Channel networks?  Building one very large NVMeoF SAN may show similar problems.  Then there are the questions of how to manage maintenance, push out changes and code upgrades, and handle multi-pathing, name resolution and so on.
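To make the routing caveat concrete, here’s a minimal sketch that checks whether a host and an NVMeoF target sit within the same IP subnet, which is effectively what a non-routable transport like RoCEv1 demands.  The addresses and prefix length are hypothetical, chosen purely for illustration.

    import ipaddress

    def same_subnet(host_ip: str, target_ip: str, prefix_len: int) -> bool:
        """RoCEv1 frames aren't routable, so host and target must share a layer-2
        segment; a common subnet is the usual proxy for that constraint."""
        host_net = ipaddress.ip_network(f"{host_ip}/{prefix_len}", strict=False)
        return ipaddress.ip_address(target_ip) in host_net

    if __name__ == "__main__":
        # Hypothetical addresses: one host and two candidate NVMeoF targets.
        print(same_subnet("10.10.20.5", "10.10.20.200", 24))   # True  - reachable without routing
        print(same_subnet("10.10.20.5", "10.10.30.200", 24))   # False - would require routing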

The Architect’s View

NVMeoF is new.  As with any technology, having a transition plan that doesn’t result in mass rip and replace will reduce the risk of the transition.  Picking solutions that can support today’s protocols and move to NVMeoF seamlessly will mitigate some of the expected issues.  Of course, if you absolutely have to have that 100µs latency, who am I to argue!

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2007-2019 – Post #482B – Chris M Evans, first published on https://www.architecting.it/blog, do not reproduce without permission.