“Build it and they will come” they say, or at least, something similar. So, Vast Data has built a new scale-out storage platform and hopes that customers will flock to adopt it. It’s 2019 – do we need another scale-out storage architecture?
I’m sure when we reflect on products that have gone before, we see opportunities to do things better. In fact, the ability to innovate is partially driven by hindsight, but also by new technology. The team at Vast Data took a look at the storage media landscape and played a what-if game with projections for the future. What if cheap, low-cost flash storage was available? What if persistent storage-class memory was ubiquitous? How could a better storage product be built from these technologies?
The clear trajectory for flash storage is one of reducing costs and increasing capacities. This trend continues with QLC flash, a 4-bits-per-cell technology that will scale to tens of terabytes in a single SSD. As we discussed with Steve Hanna last year, QLC drives are already hitting historically low prices and will continue to go lower. In fact, I checked the current retail pricing of the Micron 5210 ION drive at the time of writing this post, and the price had already fallen 30% since November 2018 (four months earlier).
You can listen to the full discussion with Steve below.
The problem with QLC compared to previous generations of NAND flash is endurance. QLC drives typically have a DWPD of less than 1, depending on the workload profile, but hold up well with large-block sequential write I/O. You can see this in the image from Micron that maps out the performance of the 5210 ION drive going from highly random to highly sequential I/O. If I/O can be optimised to be infrequent and sequential, then these drives can last a long time. Keep that thought in mind for a moment.
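To put DWPD in context, here is a rough back-of-envelope calculation. The figures below (8TB capacity, 0.5 DWPD, a five-year rating period) are illustrative assumptions, not Micron specifications:

```python
def drive_lifetime_years(capacity_tb: float, dwpd: float,
                         daily_writes_tb: float, rated_years: float = 5) -> float:
    """Estimate drive lifetime from its endurance rating.

    DWPD (drive writes per day) is rated over a warranty period, so the
    total write budget is capacity * dwpd * days in the rating period.
    """
    total_write_budget_tb = capacity_tb * dwpd * rated_years * 365
    return total_write_budget_tb / (daily_writes_tb * 365)

# A hypothetical 8 TB QLC drive rated at 0.5 DWPD over 5 years,
# absorbing 1 TB of writes per day, lasts 20 years in theory.
years = drive_lifetime_years(8, 0.5, 1)
```

The catch, of course, is that the DWPD rating itself assumes a particular workload mix; random small-block writes amplify wear and eat the budget far faster, which is why keeping writes sequential matters so much.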
Now let’s talk about storage-class or persistent memory. SCM is a new form of storage that acts like persistent memory. In other words, data can be written at the byte level (byte addressability) and the contents of the device are not lost when the power is turned off. In developing storage systems, persistence and performance are a great combination.
Many existing storage designs use DRAM caching to improve performance. This introduces design dependencies that constrain the ability to scale, or degrade performance in failure scenarios. Typically, DRAM-cached writes have to be mirrored to a second controller, and cached reads have to be synchronised in case the persistent copy of the data on disk/SSD changes.
QLC & SCM Together
Imagine putting both SCM and QLC technologies together. QLC provides cheap, scalable capacity. SCM mitigates the issues of QLC endurance while offering performance closer to DRAM – with persistence. So, many of the complications of building scale-out storage are removed when using QLC and SCM in partnership. This is exactly what Vast Data has done in building their new storage platform.
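The pairing can be sketched conceptually: small incoming writes land in persistent SCM (and can be acknowledged immediately, since SCM is durable), and only when a full, wide stripe has accumulated is it flushed sequentially to QLC. The class below is my mental model of that write path, not VAST's implementation, and the stripe size is deliberately tiny for illustration:

```python
STRIPE_SIZE = 8  # blocks per stripe (tiny, for illustration only)

class WriteBuffer:
    """Toy model: buffer writes in persistent SCM, flush full stripes to QLC."""

    def __init__(self) -> None:
        self.scm: list[bytes] = []        # persistent write buffer (SCM)
        self.qlc: list[list[bytes]] = []  # sealed, immutable stripes (QLC)

    def write(self, block: bytes) -> None:
        self.scm.append(block)  # durable in SCM, so safe to acknowledge now
        if len(self.scm) == STRIPE_SIZE:
            self.qlc.append(self.scm)  # one large sequential write to flash
            self.scm = []

buf = WriteBuffer()
for i in range(20):
    buf.write(bytes([i]))
# 20 small writes produced only 2 sequential stripe writes to QLC;
# the remaining 4 blocks sit safely in SCM awaiting the next flush.
```

The key point is that QLC only ever sees large sequential writes, the I/O profile it tolerates best, while SCM absorbs the small random ones.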
Vast Data has implemented what they term a disaggregated shared-everything architecture (DASE). In this model, storage is deployed in dedicated disk enclosures that store a combination of QLC NAND flash and storage-class memory. Enclosures are linked together over an NVMe fabric based on Ethernet. In turn, data access is managed through controllers that can be either physical appliances or software running alongside applications on host servers. Today, these controllers expose either an NFSv3 interface or S3 object API. Additional protocols are a roadmap item.
The DASE architecture allows any controller to talk to any NVMe device (either QLC flash or SCM) across the entire infrastructure. Because the controllers maintain no state, there is no need to co-ordinate metadata management between them; instead, all metadata structures are held on persistent SCM. The result is a scalable architecture with no inherent pinch points, one that could scale to thousands of storage nodes and tens of thousands of controllers.
If all the metadata lives on persistent media, the obvious question is how it can be accessed consistently and efficiently. Vast Data has created a tree structure, similar to a B-tree or binary tree, that the company calls a V-tree. The shallow design of the V-tree allows any piece of metadata to be retrieved in seven or fewer steps through the structure. This enables all the metadata to reside on SCM, distributed and replicated across each of the storage nodes. SCM access latencies are in the order of 10 microseconds, so locating and updating metadata structures is extremely fast. The design also allows a metadata update to be treated as an atomic operation, which provides the serialisation needed to manage access from many controllers.
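The "seven or fewer steps" claim follows directly from fan-out: a tree with fan-out f addresses f^d entries in d levels. The fan-out value below is illustrative, not VAST's actual figure:

```python
import math

def levels_needed(entries: int, fanout: int) -> int:
    """Tree levels required to address `entries` items at a given fan-out."""
    return max(1, math.ceil(math.log(entries, fanout)))

# With a fan-out of 512, seven levels already address 512**7 = 2**63,
# roughly 9.2e18 entries - far more metadata objects than any real system.
depth = levels_needed(10**15, 512)  # a quadrillion entries needs only 6 levels
```

At ~10 microseconds per SCM access, even a worst-case seven-step walk costs on the order of 70 microseconds, which is why keeping the whole structure on SCM rather than DRAM-plus-disk is viable.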
Both metadata and content are kept in the element store on the storage nodes. Although I'm not sure of the specific implementation, one way to understand the element store is to liken it to an enormous key-value store, in which the values hold either metadata or data.
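To make that analogy concrete, here is a toy sketch of a single flat key-value namespace holding both metadata records and data blocks. This is my mental model only; the key scheme and record format are invented for illustration:

```python
# One flat namespace; key prefixes distinguish metadata from data blocks.
element_store: dict[bytes, bytes] = {}

def put_metadata(path: str, record: bytes) -> None:
    """Store a metadata record (e.g. file attributes) under a 'meta:' key."""
    element_store[b"meta:" + path.encode()] = record

def put_data(block_id: int, payload: bytes) -> None:
    """Store a raw data block under a 'data:' key."""
    element_store[b"data:%d" % block_id] = payload

put_metadata("/exports/home/file.txt", b'{"size": 4096, "blocks": [42]}')
put_data(42, b"\x00" * 4096)
```

The appeal of the model is uniformity: a single lookup mechanism (the V-tree over SCM) serves both namespace operations and block reads.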
Data across nodes is protected with a new form of erasure coding. The algorithm allows very wide stripes (up to 500+10), reducing the protection overhead to as little as 2%. Data is written in complete, new stripes and never updated in place. If this sounds familiar, it should: XtremIO uses a similar idea, and it's no surprise that Vast Data CEO Renen Hallak worked on, and holds patents covering, the XtremIO design. This is a clear example of the benefit of hindsight, enabling a more effective protection scheme now that SCM is widely available.
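The 2% figure falls straight out of the stripe geometry: overhead is parity capacity relative to data capacity. A quick check, contrasted with a conventional narrow stripe:

```python
def ec_overhead(data_shards: int, parity_shards: int) -> float:
    """Extra raw capacity consumed by parity, as a fraction of usable data,
    for an N+M erasure-code stripe."""
    return parity_shards / data_shards

# A 500+10 stripe: 10 parity shards per 500 data shards = 2% overhead.
wide = ec_overhead(500, 10)
# Compare a conventional 4+2 stripe: 50% overhead for the same 2-failure
# tolerance per stripe.
narrow = ec_overhead(4, 2)
```

The trade-off, of course, is that a 510-shard stripe needs 510 independent failure domains and a fast, parallel rebuild path, which is exactly what the disaggregated shared-everything layout provides.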
Another benefit of writing very wide stripes to persistent flash is the ability to write sequentially and infrequently. As a result, the DASE architecture can extend the life of QLC to the point where Vast Data will guarantee the drives for up to 10 years.
Flash systems inevitably employ some form of data reduction, as the price delta between disk and flash has historically been high. Reducing the effective $/GB cost of media was initially a key feature of all-flash storage systems. The DASE architecture uses a data reduction technique that looks for deltas between a set of reference "fingerprint" blocks and other content stored in the system. As data is written to persistent media, only the reference blocks and the deltas against them are stored. Think of this as a little like incremental backups: with an initial full and a set of incrementals, we can create synthetic full images. The DASE data reduction strategy works in a similar way.
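A toy version of the idea: keep a reference block and store only the byte positions where a similar block differs. This is not VAST's algorithm (a real system would use similarity hashing and variable-length deltas); it simply illustrates why storing deltas against a reference beats storing every block in full:

```python
def delta_encode(reference: bytes, block: bytes) -> list[tuple[int, int]]:
    """Record only the byte positions where `block` differs from `reference`.
    Assumes equal-length blocks, purely for illustration."""
    return [(i, b) for i, (r, b) in enumerate(zip(reference, block)) if r != b]

def delta_decode(reference: bytes, delta: list[tuple[int, int]]) -> bytes:
    """Reconstruct the original block from the reference plus its delta."""
    out = bytearray(reference)
    for i, b in delta:
        out[i] = b
    return bytes(out)

ref = b"A" * 4096
similar = bytearray(ref)
similar[100] = ord("B")  # a block that differs from the reference in one byte
delta = delta_encode(ref, bytes(similar))  # one tuple stored instead of 4 KB
assert delta_decode(ref, delta) == bytes(similar)
```

As with synthetic full backups, any block can be rebuilt on read from its reference plus the stored delta, so the reduction is transparent to clients.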
The idea of scaling RAID stripes and reducing protection overhead to a mere 2% of capacity is appealing. To get there, though, you have to deploy a lot of storage, probably 2PB and upwards, so TCO savings will be proportional to the capacity deployed. As a result, Vast Data's new storage platform is really aimed at customers with large on-premises data requirements. The three deployment models reflect this: enclosure & server appliance at petabyte entry-level capacity, enclosure & containers at 10PB or more, and software only at 100PB and more.
If you want to learn more about the architecture, I recommend watching the following video from Storage Field Day 18 with CEO Renen Hallak.
The Architect’s View
Do we need another storage platform? One aspect to consider when answering that question is the impact new technology has on system architecture and design. Early flash was expensive (though with good endurance), making it hard to build a cheap all-flash array. Over time, the absolute cost of media has decreased while the endurance problem has worsened, and that necessitates new designs. Similarly, the availability of SCM means historical caching challenges can be eliminated, something that previously couldn't have been achieved without some kind of closely coupled RDMA-based architecture.
How many potential customers need 10 or even 100 petabytes of storage capacity? No doubt there will be many, as unstructured data is a big growth area. Possibly the biggest challenge for Vast Data is convincing customers that a TCO and price point can be achieved that justifies putting all data (active and inactive) on a single unified flash platform. The economics of this will be interesting, because flash prices continue to decline and this model will only become more financially attractive over time.
You can listen to an architectural discussion of the VAST platform on these two Storage Unpacked podcasts, with Howard Marks.
I was personally invited to attend Storage Field Day 18, with the event teams covering my travel and accommodation costs. However, I was not compensated for my time. I am not required to blog on any content; blog posts are not edited or reviewed by the presenters or the respective companies prior to publication.
Copyright (c) 2007-2019 – Post #6355 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission. Images courtesy and copyright of Vast Data, Inc.