StorONE recently issued a press release claiming the fastest RAID rebuild for an HDD storage array, with a single 14TB drive (half populated) rebuilt in less than 2 hours. This is an impressive number (with caveats we’ll discuss in a moment). However, with all-flash systems becoming more common, is rebuild time still an issue?
It’s a fact of life that eventually, all hardware fails (the converse statement is that eventually, all software works). Component and systems failures occur across IT infrastructure, whether as a single hard drive, a memory DIMM or an entire server. There are also other risk factors that affect the ability to persistently store data, including power loss, fire and flood. All IT systems have to be designed for failure scenarios, so IT architects build in resiliency and redundancy as a matter of course.
Since the publication of the seminal white paper “A Case for Redundant Arrays of Inexpensive Disks (RAID)” in 1987, the IT industry has come to expect some form of hardware or software-based protection for persistent storage by using a set of inexpensive (or as it became, independent) disks.
Today, the concept of RAID seems so simple. Create a protection mechanism by distributing data and some redundant rebuild information (parity) across a set of cheap disks and use the redundancy to recover the lost data in the event of a hardware failure. We expect to see RAID solutions in everything from enterprise storage arrays to home NAS platforms.
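The parity mechanism behind single-parity RAID really is that simple: XOR the data blocks together, and XOR again to recover a lost one. The sketch below is purely illustrative, not any vendor's implementation; the block contents and the `xor_blocks` helper are invented for this example.

```python
# Minimal sketch of RAID-style parity protection using XOR.
# Block names and contents are invented for illustration.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Three data blocks striped across three disks, parity on a fourth.
data = [b"disk-A-block", b"disk-B-block", b"disk-C-block"]
parity = xor_blocks(data)

# Simulate losing disk B: XOR the survivors with parity to rebuild it.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == b"disk-B-block"
```

Because XOR is its own inverse, any single missing block can be reconstructed from the survivors plus parity, which is exactly what a rebuild does at scale.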
Over the years, RAID solutions have expanded to cater for ever-larger HDDs. RAID-5 (a single redundant disk with distributed data and parity) isn’t good enough for high-capacity drives, as the probability of a second failure or an unrecoverable read error (URE) on the surviving drives during a rebuild is high enough to risk data loss. RAID-6 adds an extra level of protection but also requires more computing power to create the parity data at write time. There’s a clear risk-versus-cost equation in play, and larger HDDs make that risk calculation less trivial because the access time of HDDs hasn’t changed much in 30 years.
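The URE risk is easy to quantify on the back of an envelope. The figures below are assumptions (a 7+1 RAID-5 group of 14TB drives and the 1-in-10^15 bits error rate commonly quoted on HDD datasheets), not measurements from any specific system.

```python
# Back-of-envelope URE risk during a RAID-5 rebuild.
# All figures are illustrative assumptions, not vendor measurements.
drive_tb = 14
surviving_drives = 7        # data read to rebuild one failed drive (7+1 group)
ure_rate = 1e-15            # unrecoverable read errors per bit (typical HDD spec)

bits_read = surviving_drives * drive_tb * 1e12 * 8
p_ure = 1 - (1 - ure_rate) ** bits_read
print(f"Chance of hitting at least one URE during the rebuild: {p_ure:.0%}")
```

With these assumptions the rebuild reads nearly 800 trillion bits, so the chance of tripping over at least one URE is better than a coin flip; this is exactly why single parity is considered risky at today's capacities.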
Erasure coding has been used (initially by object storage vendors) to implement both greater protection and geo-redundancy. The technology is similar to RAID in that it creates redundant data fragments which are spread across multiple storage drives and/or systems. In the event of a failure, the data can still be read from the remaining devices, assuming enough fragments survive. The calculation overhead and network access times of erasure coding have been seen as barriers to replacing RAID, especially in systems with small-block random I/O (more on this in a moment).
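The trade-off between protection and capacity overhead in a k+m erasure-coding layout (k data fragments, m coding fragments) is simple arithmetic. The scheme widths below are hypothetical examples chosen for comparison, not a recommendation.

```python
# Compare failure tolerance vs raw-capacity overhead for k+m layouts.
# The scheme widths are illustrative examples only.

def overhead(k, m):
    """Raw bytes stored per usable byte for k data + m coding fragments."""
    return (k + m) / k

schemes = {
    "RAID-5-style (7+1)": (7, 1),
    "RAID-6-style (6+2)": (6, 2),
    "Erasure code (10+4)": (10, 4),
}
for name, (k, m) in schemes.items():
    print(f"{name}: tolerates {m} failed devices, {overhead(k, m):.2f}x overhead")
```

A 10+4 layout survives four concurrent device failures for only 1.4x raw capacity, which is why wide erasure codes are attractive for large object stores despite the extra compute.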
In the enterprise, rebuild times have been a significant factor in storage array design. As drive capacities have increased, rebuilds can extend into days, rather than minutes or hours. In an HDD-based system, the reason is obvious. Hard drives are great at sequential I/O, but terrible at random access. A RAID rebuild introduces a randomising factor into normal I/O operations, so it has to run at a low priority or it will have a direct impact on host I/O. A RAID rebuild from a failed drive requires every other drive in the set to take part, impacting the entire RAID group. As a result, many array vendors implemented “predictive sparing”, where SMART and other environmental data are used to detect a potential failure. Data is copied from a failing drive before a “hard fail”, as a sequential copy is quicker and less impactful than a full RAID rebuild.
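Some rough arithmetic shows why a sequential copy beats a throttled parity rebuild. Both throughput figures below are assumptions for illustration: roughly 200MB/s sustained sequential throughput for a modern HDD, and an arbitrary 40MB/s effective rate for a rebuild running at low priority behind host I/O.

```python
# Rough rebuild-time arithmetic for a 14TB HDD.
# Both throughput figures are assumed for illustration.
drive_tb = 14
seq_mb_s = 200       # drive-to-drive sequential copy (predictive sparing)
rebuild_mb_s = 40    # parity rebuild throttled behind host I/O

def hours(capacity_tb, mb_per_s):
    """Hours to move capacity_tb terabytes at mb_per_s megabytes/second."""
    return capacity_tb * 1e6 / mb_per_s / 3600

print(f"Sequential copy:          {hours(drive_tb, seq_mb_s):.1f} hours")
print(f"Throttled parity rebuild: {hours(drive_tb, rebuild_mb_s):.1f} hours")
```

With these assumptions, the copy completes in under a day while the throttled rebuild stretches to roughly four days, during which the RAID set runs with reduced protection.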
Of course, predictively sparing a drive risks the scenario where the disk wasn’t actually failing but experiencing a short-term blip (such as local vibration). Data can be rebuilt unnecessarily if the algorithm used to predict failure is over-zealous (I’ve seen this happen before).
All-Flash & Hybrid
What about all-flash systems? RAID rebuilds are still relevant because flash drives can also fail. IBM FlashSystem flash drives are pushing 40TB in capacity, with larger drives on the horizon from all vendors, so even with flash, rebuild time is a factor.
Hard drives will exist in the enterprise for some time (at least the next 5-10 years), as price parity with SSDs hasn’t been reached. Even when it does, systems won’t be replaced overnight, so RAID rebuilds will be a topic of discussion for some years yet.
Surely, we can’t be using the same RAID and rebuild techniques referenced in the 1987 paper? Well, yes and no. There are many examples in the industry of modified RAID designs and data layout techniques that look to mitigate some of the challenges of traditional rebuild techniques.
- XIV, developed by Moshe Yanai and acquired by IBM in 2007, used 1MB chunks of data spread across all drives to optimise the IOPS available from an entire system.
- HP EVA implemented vRAID at a block level, allocating free space across all drives in a redundant storage set (RSS). The RSS model ensures data isn’t too widely spread across a single disk group.
- Dell EMC XtremIO uses a protection mechanism called XDP, which offers wide striping and significant efficiency in all-flash systems.
- HPE 3PAR uses a method that divides physical disks into “chunklets” that are recombined as local drives with multiple RAID protection schemes across the same physical infrastructure.
There are many more examples here, including the implementation used by StorONE (more details can be found here). Some vendors are still using standard RAID groups and pools, despite the evolution of techniques to improve RAID performance. This legacy approach will be an issue in the future, especially with flash drives where endurance is a finite commodity and every rebuild has a cost.
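The common thread in these designs is declustering: by spreading small chunks across the whole pool, a rebuild reads (and writes) a little from every surviving drive in parallel, rather than funnelling everything onto one hot spare. The sketch below uses invented figures to show the effect of pool width; it does not model any particular vendor's layout.

```python
# Why wide striping speeds rebuilds: every surviving drive contributes
# a small slice of bandwidth in parallel. Figures are invented.
drive_tb = 14
per_drive_mb_s = 50    # rebuild bandwidth each drive can spare alongside host I/O

def rebuild_hours(n_drives):
    """Hours to re-create one failed drive's data across n_drives - 1 survivors."""
    return drive_tb * 1e6 / (per_drive_mb_s * (n_drives - 1)) / 3600

for n in (8, 24, 96):
    print(f"{n}-drive pool: ~{rebuild_hours(n):.1f} hours")
```

Under these assumptions, the same 14TB of lost data takes around 11 hours to rebuild in an 8-drive pool but under an hour in a 96-drive pool, because rebuild bandwidth scales with pool width.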
One aspect of data protection within a storage system to consider is whether architectural changes offer a way to provide more efficient data protection. All of the examples referenced above resulted from a rethink in system design that put data protection and resiliency at the heart of the solution.
As we adopt new media, the characteristics of these devices allow, and almost demand, a rethink of the architectural design of storage systems. As we discussed in this post, large-capacity HDDs and SSDs introduce techniques such as zoning and SMR, with the result that these devices favour sequential, rather than update-in-place, I/O. StorONE and VAST Data are two examples of vendors modifying their architectures to make use of a combination of fast persistent media (Optane) and cheaper capacity drives (like QLC NAND flash or HDDs). In both instances, new I/O doesn’t hit the “archive” tier directly but is instead coalesced into large sequential writes for greater media efficiency, write I/O performance and endurance. I’ve put “archive” in quotes as the QLC NAND is probably best described as a “read-optimised” tier, with Optane acting as the “write-optimised” tier.
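The coalescing idea can be sketched as a toy log-structured write path: small writes land in a fast buffer tier and are flushed to the capacity tier as one large sequential stripe. Tier names, the stripe size and the `write` helper are all invented for illustration and bear no relation to either vendor's actual implementation.

```python
# Toy sketch of write coalescing: small random writes accumulate in a
# fast buffer tier and flush as one large sequential stripe.
# All names and sizes are invented for illustration.
STRIPE_SIZE = 8        # blocks coalesced into one sequential stripe

buffer_tier = []       # stand-in for fast persistent media (e.g. Optane)
capacity_tier = []     # stand-in for QLC NAND / SMR HDD stripes

def write(block):
    buffer_tier.append(block)
    if len(buffer_tier) == STRIPE_SIZE:
        # One large sequential write replaces eight small random ones.
        capacity_tier.append(tuple(buffer_tier))
        buffer_tier.clear()

for i in range(20):
    write(f"block-{i}")

print(f"{len(capacity_tier)} stripes flushed, {len(buffer_tier)} blocks buffered")
```

The capacity tier only ever sees full-stripe sequential writes, which is what preserves QLC endurance and keeps SMR/zoned devices happy.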
Finally, it’s worth highlighting that we’ve discussed storage systems extensively here. As software plays an increasing role in persistent storage solutions, we need to look wider at the SDS solutions in the market. How well do SDS solutions manage data resiliency and rebuilds? How efficient is the process and what impact do they have on host I/O? We know from this story that early vSAN implementations had problems, for example.
Software-defined storage can protect data as well as traditional systems. One differentiating factor for SDS and container-attached storage would be to expose the impact of a disk failure, estimate recovery time and provide a risk analysis to application and system managers. This kind of data tends to be lacking in software-based solutions. Another area of differentiation, especially for container-attached storage, is to move away from mirrors or replicas and implement efficient RAID/erasure coding across the infrastructure. These are evolutions we discuss in recent predictions articles (parts iii and iv here).
The Architect’s View
Is RAID rebuild time still an important metric? Yes, it definitely is, but fast rebuild times can’t come at the cost of application performance. Furthermore, we need to see solutions both managing and taking advantage of new media, while offering better insight into the management and rebuild of data during recovery. Hardware failure is a normal part of IT operations and the better we manage it, the more resilient our systems will be.
Copyright (c) 2007-2021 – Post #7b65 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.