There’s a standard adage in the storage industry: if you want lower latency and better performance, move storage closer to the processor. Typically, this means implementing DAS – Direct Attached Storage – and eliminating a shared network. So how can WekaIO Matrix, a distributed scale-out file system, claim to be better than DAS yet still offer full data integrity?
Over the last 20 years, we’ve seen two main models for implementing shared storage: SAN and scale-out. SANs (Storage Area Networks) came about as a solution to many of the issues associated with DAS. The benefits included resiliency (RAID protection), better and easier management, consolidation (and so less waste) and improved performance. The performance gain was achieved by spreading data across many HDD spindles and aggregating the meagre IOPS count of individual drives.
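The spindle-aggregation argument is easy to put into numbers. The sketch below is illustrative only: the per-drive figure of 150 IOPS is an assumed ballpark for a spinning disk, not a number from any vendor, and the model ignores RAID write penalties, controller limits and caching.

```python
def aggregate_iops(per_drive_iops: float, drive_count: int) -> float:
    """Idealised random IOPS when I/O is spread evenly across all drives."""
    if per_drive_iops < 0 or drive_count < 0:
        raise ValueError("per_drive_iops and drive_count must be non-negative")
    return per_drive_iops * drive_count


# ASSUMPTION: an illustrative per-drive figure for a hard disk.
ASSUMED_HDD_IOPS = 150

if __name__ == "__main__":
    for drives in (1, 8, 64, 256):
        total = aggregate_iops(ASSUMED_HDD_IOPS, drives)
        print(f"{drives:>4} drives: {total:>8.0f} IOPS (idealised)")
```

Real arrays fall short of this linear ideal, but the shape of the curve is why wide striping was the main performance lever in the HDD era.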
Scale-out storage has typically operated on unstructured file or object data, although solutions like Datera and SolidFire offer iSCSI block devices. Implementing scale-out as a file system has particular challenges in order to ensure data integrity across multiple nodes. File metadata and file locking requirements introduce complexity and as a result, can reduce performance and increase latency.
Just use DAS
An easy alternative to either scale-out or SAN is simply to implement direct-attached storage again. Modern applications like NoSQL databases (think MongoDB, Cassandra) and Hadoop are designed for scale-out at the application layer with local DAS attached. With this model, the application owner deals with the issues of data integrity. However, this comes at a price. Data protection and resiliency are now application responsibilities; failed nodes have to be rebuilt across the east-west network in the data centre. The old issues of wastage with DAS appear again.
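To get a feel for the cost of rebuilding a failed node over the network, here is a rough estimate. Every input is hypothetical (node capacity, link speed and the fraction of the link the rebuild is allowed to consume), and the model assumes the network is the only bottleneck.

```python
def rebuild_hours(node_capacity_tb: float,
                  link_gbit_per_s: float,
                  usable_fraction: float = 1.0) -> float:
    """Hours needed to re-copy a node's data across one network link.

    node_capacity_tb: data to restore, in decimal terabytes (10**12 bytes).
    link_gbit_per_s:  link speed in gigabits per second.
    usable_fraction:  share of the link available to the rebuild (0 < f <= 1).
    """
    if node_capacity_tb < 0 or link_gbit_per_s <= 0:
        raise ValueError("capacity must be >= 0 and link speed must be > 0")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    bytes_to_copy = node_capacity_tb * 1e12
    bytes_per_second = link_gbit_per_s * 1e9 / 8 * usable_fraction
    return bytes_to_copy / bytes_per_second / 3600


if __name__ == "__main__":
    # ASSUMPTION: a 10 TB node rebuilt over 10GbE at half the link's capacity.
    print(f"{rebuild_hours(10, 10, usable_fraction=0.5):.1f} hours")
```

The point is not the exact figure but that rebuild traffic competes with application traffic for as long as the copy runs.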
Faster than DAS
To address the shortcomings of DAS, WekaIO has developed Matrix, a scale-out parallel file system for high-performance, low-latency applications. Matrix runs on Linux and uses NVMe storage on commodity hardware or within public cloud. WekaIO claims its performance can be faster than DAS. With all of the overheads that exist to keep distributed data in sync, how can that be possible?
The Matrix architecture is shown in figure 1. At the lowest level, Matrix talks directly to PCIe-connected NVMe drives and to NICs using SR-IOV. The tasks of managing media and cross-node communications are handled by agents and processes running in LXC containers. This provides the flexibility to scale as more devices are added to a node and also allows logical communication between processes using virtual or physical networking. A back-end task like data protection doesn’t need to use a different process to talk to a local or remote NVMe drive. It just uses the logical network.
On an application server, the Matrix platform appears as a local file system through the WekaIO VFS driver. This allows the solution to be implemented as an HCI-style deployment, as storage-less nodes that just run the VFS driver, or as a platform that exposes standard file and object protocols (NFS, SMB, S3).
We discuss the Matrix architecture in more depth in this Storage Unpacked episode from 2018. You can also read more on Matrix in the NVMe in the Data Centre report (register/download here).
How can this solution be faster than storage using the Linux I/O stack? Take a look at the Linux Storage Stack Diagram (figure 2) and the amount of indirection in place to cater for multiple device types, I/O protocols and features like multi-pathing. Although there has been some simplification with NVMe, there is still a lot of overhead in place. Matrix cuts through that by using a custom file system and direct management of hardware peripherals.
OK, the logic stacks up, so where’s the proof? The first place to look is storage industry benchmarks. Matrix comes out top on the SPEC SFS2014 benchmark results for almost all workloads (e.g. database). Naturally, I’ll caveat that data with the usual disclaimer to check out the details of the testing. You can see, for example, that in software builds, NetApp has submitted a faster result (builds and throughput), albeit with a higher response time. Both WekaIO and NetApp use a lot of DRAM in these tests, which obviously helps. However, I’d be more interested in ORT (overall response time) as a measurement here because it helps to understand how well Matrix scales across multiple nodes.
What about a real-world example? WekaIO recently presented at Storage Field Day 18. In one of the presentations, Shimon Ben-David, Sr Director of Sales Engineering, showed a Matrix system delivering 10GB/s of throughput, even when individual nodes were failed during the test. You can watch this video here.
The limitation on performance in this instance was the bandwidth of the network card and not Matrix itself. The video is a little hard to follow if you’re not familiar with standard benchmark tools (lots of fio involved), but it is a real-world live demonstration of the Matrix software. All of the SFD18 presentations for WekaIO can be found here.
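A quick way to sanity-check a "network-bound" result is to convert NIC line rates into byte throughput and compare them with the measured figure. The sketch below does that for a few standard Ethernet speeds. The NIC speed used in the demonstration isn't stated here, so the list of speeds and the `efficiency` parameter are assumptions for illustration only.

```python
def nic_ceiling_gbytes(line_rate_gbit: float, efficiency: float = 1.0) -> float:
    """Upper bound on throughput in GB/s for a NIC of the given line rate.

    efficiency is a hypothetical factor (0 < e <= 1) for protocol overhead;
    the default of 1.0 gives the raw line rate divided by 8 bits per byte.
    """
    if line_rate_gbit <= 0 or not 0 < efficiency <= 1:
        raise ValueError("line rate must be > 0 and efficiency in (0, 1]")
    return line_rate_gbit / 8 * efficiency


if __name__ == "__main__":
    measured_gbytes = 10.0  # throughput quoted for the SFD18 demonstration
    # ASSUMPTION: common Ethernet speeds, not the demo's actual hardware.
    for gbit in (10, 25, 40, 100):
        ceiling = nic_ceiling_gbytes(gbit)
        share = measured_gbytes / ceiling
        print(f"{gbit:>3}GbE: ceiling {ceiling:5.2f} GB/s, "
              f"10GB/s is {share:.0%} of one link")
```

If the measured throughput sits close to the ceiling for the installed NICs, the network rather than the storage software is the likely limit.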
The Architect’s View
WekaIO is one of a number of new companies bringing super-fast scale-out storage to the enterprise. NVMe and NVMe-oF are providing the basis for this revolution. Imagine what Matrix could be like if storage-class memory were introduced into the product. If I had to choose one key architectural aspect of the Matrix solution, it would be the elimination of many layers of O/S indirection in order to gain high performance. WekaIO is effectively re-writing the file system layer and below. It may well be that this is now the only way to capitalise on the performance offered by new media.
It’s easy to think that shared storage is on the way out, because it can’t deliver the performance of local storage. WekaIO is showing this not to be true. I expect that SANs and scale-out will simply evolve and be key to building robust scale-out applications that will continue to need underlying data services.
I was personally invited to attend Storage Field Day 18, with the event teams covering my travel and accommodation costs. However, I was not compensated for my time. I am not required to blog on any content; blog posts are not edited or reviewed by the presenters or the respective companies prior to publication.
Copyright (c) 2007-2019 – Post #C160 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.