Last week I heard about a large organisation that experienced a double-disk failure on one of their storage arrays. As a result, major systems were down for nearly 8 hours. Double-disk failures (where a second disk fails within a RAID group while the first failed disk is being rebuilt) are rare, but they do happen. With the increasing use of thin provisioning (TP), the potential impact of a RAID group failure is no longer limited to a small set of LUNs, as TP implementations tend to wide-stripe LUNs across many RAID groups. This means there is a design trade-off between risk (the chance of failure) and impact (how many services are affected). The question is, how does this vary between storage vendors?
Array Design Options
Storage vendors implement RAID protection in a number of different ways. For example, early EMC Symmetrix arrays used RAID-1 groups and created LUNs by splitting a physical disk into slices called hypers. A LUN would be made of hypers located on different physical drives and back-end connections. The risk of failure was based on losing both physical disks that made up the LUN. On a well laid out system, the hypers would be distributed across disks and back-end directors, balancing both load and risk. If a disk pair did fail, it would affect only LUNs that had both hypers on the failed disks. Every other LUN would remain accessible, although any LUN with a hyper on one of the failed disks would no longer have RAID protection.
Hitachi implemented a slightly different model, creating LUNs from fixed RAID groups of disks. A RAID-5 implementation, for example, would use RAID-5 3+1 (4 disks) or RAID-5 7+1 (8 disks) groups. If two disks from a RAID group failed, then all LUNs created from that RAID group would be lost.
Both of these solutions have limited impact, being related to only the failed disks. Of course if many hosts have been built from those LUNs, then at the host level the impact is wider.
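The fixed RAID group model can be sketched in a few lines of Python. This is only an illustration under assumed names: the layout, the `lost_luns` function and the `tolerated` parameter are hypothetical and do not come from any vendor tool. A group is treated as lost once it has more failed disks than its RAID level tolerates.

```python
def lost_luns(raid_groups, lun_to_group, failed_disks, tolerated=1):
    """Return the LUNs lost when `failed_disks` are all failed at once.

    raid_groups:  dict mapping group name -> set of disk ids
    lun_to_group: dict mapping LUN name -> group name
    tolerated:    failures a group survives (1 for RAID-5, 2 for RAID-6)
    """
    failed = set(failed_disks)
    # A group is dead when it holds more failed disks than it can tolerate.
    dead_groups = {
        name for name, disks in raid_groups.items()
        if len(disks & failed) > tolerated
    }
    return sorted(lun for lun, group in lun_to_group.items() if group in dead_groups)


# Hypothetical layout: two RAID-5 3+1 groups with two LUNs carved from each.
raid_groups = {"rg0": {0, 1, 2, 3}, "rg1": {4, 5, 6, 7}}
lun_to_group = {"lun0": "rg0", "lun1": "rg0", "lun2": "rg1", "lun3": "rg1"}

print(lost_luns(raid_groups, lun_to_group, [1, 2]))  # both failures in rg0
print(lost_luns(raid_groups, lun_to_group, [1, 5]))  # one failure per group
```

With both failures in the same group only that group's LUNs are lost; with one failure in each group nothing is lost, which is the limited impact described above.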
Now let’s look at some other implementations. The HP EVA platform distributes data protection across all disks within a single disk group, or pool. Best practice suggests having as few pools as possible, as the spare drive overhead is taken proportionally from every hard drive, so having more pools means more spare disk overhead. If you had only one pool of disks configured as a single RAID group, a double disk failure would have a substantial impact, affecting every LUN and host configured on the system. For this reason, EVA doesn’t simply have one large RAID pool but uses multiple RAID sets within the pool, known as redundant storage sets or RSS. LUNs are still distributed across all RSS groups, but the risk of failure is significantly reduced, as multiple disk failures can occur within the same pool without causing an outage as long as they occur in separate RSS groups.
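The RSS idea can be modelled roughly as follows. This is a sketch under stated assumptions rather than a description of how EVA firmware actually assigns disks: the `build_rss_groups` and `pool_survives` helpers are hypothetical, the RSS size of 8 is chosen for illustration, and each RSS is assumed to tolerate a single disk failure.

```python
def build_rss_groups(n_disks, rss_size):
    """Split disks 0..n_disks-1 into consecutive groups of `rss_size` (assumed layout)."""
    disks = list(range(n_disks))
    return [set(disks[i:i + rss_size]) for i in range(0, n_disks, rss_size)]


def pool_survives(rss_groups, failed_disks, tolerated=1):
    """The pool stays up if no single RSS holds more than `tolerated` failed disks."""
    failed = set(failed_disks)
    return all(len(group & failed) <= tolerated for group in rss_groups)


# Hypothetical pool: 24 disks in three RSS groups of 8.
groups = build_rss_groups(24, 8)

print(pool_survives(groups, [0, 8, 16]))  # three failures, one per RSS
print(pool_survives(groups, [0, 1]))      # two failures in the same RSS
```

Because LUNs are spread across every RSS, a `False` result here would mean every LUN in the pool is affected, not just those in the failed RSS.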
Now consider XIV. This design treats the array as one large pool with RAID-1 mirroring at a block level (1GB blocks to be precise). All LUNs are spread evenly across all disks to provide maximum performance. Imagine 1TB drives, each holding 1000 blocks of data. In a 180-drive system, every disk should then share at least 10 blocks of data with every other disk, each block being mirrored between the pair. A failure in any disk pair means every LUN risks being affected, the extent of any impact being a purely random event.
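The arithmetic behind that estimate can be checked with a short calculation. It assumes that the partner disk for each mirrored block is chosen uniformly from the other disks, which is a simplification and not a statement of how XIV really places data; the `expected_shared_blocks` function is hypothetical. The result depends on whether the 1000 blocks per disk counts every copy on the disk, or only primary copies with a similar number of mirror copies held on top.

```python
def expected_shared_blocks(copies_per_disk, n_disks):
    """Expected number of mirrored blocks that one specific pair of disks has in common,
    assuming each block's partner is spread evenly over the other n_disks - 1 disks."""
    return copies_per_disk / (n_disks - 1)


# 1000 copies per disk in total (primary and mirror combined).
print(round(expected_shared_blocks(1000, 180), 1))

# 1000 primary copies plus roughly 1000 mirror copies per disk.
print(round(expected_shared_blocks(2000, 180), 1))
```

Under these assumptions the figure is roughly 5.6 blocks per disk pair in the first case and roughly 11 in the second, which is in line with the estimate above. Either way the number is greater than zero, so losing any two disks loses some data.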
Random versus Controlled
So the first question is, do you want your failure to be totally random in terms of what’s affected, or do you want to be able to limit the damage and at least have some control? The XIV model provides no control and so all users are at equal risk. It’s also not easy to work out who’s affected until the time of the failure; possibly the worst time to be performing damage control.
Other vendor solutions do give more control over failures. Where disks can be physically pooled, the RAID type can be matched to the value of the data. For instance, RAID-6 groups can recover from double disk failures; the trade-off is lower performance due to the double parity calculations. Wide striping across an entire array is certainly not a recommended solution; smaller pools reduce both the chance of failure and the impact. Pre-emptive replacement is worth mentioning too. Most vendors attempt to replace a drive before it has failed, by predicting when drive failure is imminent. Moving data from a functioning drive is much quicker than recreating that data from parity.
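To put a rough number on the risk side, the sketch below computes the chance that a second disk failure lands in the same protection group as the first. It assumes failures are independent and equally likely on every disk, which real drives from the same batch may not honour, and the function name is hypothetical. A group size equal to the whole array stands in for a layout where every disk shares mirrored data with every other disk.

```python
from fractions import Fraction


def second_failure_in_same_group(n_disks, group_size):
    """Probability that a second, uniformly random failed disk sits in the
    same protection group as the first failed disk."""
    return Fraction(group_size - 1, n_disks - 1)


n_disks = 180
for label, size in [("RAID-1 pair", 2), ("RAID-5 3+1", 4),
                    ("RAID-5 7+1", 8), ("whole-array mirror", n_disks)]:
    p = second_failure_in_same_group(n_disks, size)
    print(f"{label:20s} {float(p):.1%}")
```

Small groups make a damaging second failure unlikely but concentrate the loss in a few LUNs; the whole-array case makes some loss certain on any second failure. A RAID-6 group would need a third failure before losing data, which this simple model does not cover.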
One point well worth discussing is that of recovery time. In traditional RAID systems, a failed device is rebuilt from parity, which requires reading all the surviving disks in the RAID group, performing parity calculations and writing data to a replacement for the failed device. If a hot spare is used and that disk doesn’t become the permanent replacement drive, then data has to be copied back to the permanent replacement for the failed device. In XIV’s favour is the potential time to recover failed disks. First, data protection is only RAID-1, so there’s no parity calculation. Second, data rebuild is achieved by replicating all the unmirrored blocks of data from the failed disk. These blocks should be distributed evenly across all of the remaining disks in the array. XIV retains free space on all disks, so remirroring will read from and write to all disks in the array at the same time, potentially delivering a very fast recovery time. In practice, recovery time will depend on the activity on the array.
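A back-of-the-envelope comparison shows why the distributed rebuild can be so much faster. The throughput figures below are invented purely for illustration, the `rebuild_hours` helper is hypothetical, and the model only looks at write bandwidth: it ignores the read side, parity calculation and host workload.

```python
def rebuild_hours(data_gb, write_mb_per_s, writers=1):
    """Hours needed to write `data_gb` when `writers` disks each accept
    `write_mb_per_s` of rebuild traffic (1 GB taken as 1000 MB)."""
    seconds = data_gb * 1000 / (write_mb_per_s * writers)
    return seconds / 3600


# Traditional rebuild: 1TB written to a single spare at an assumed 50 MB/s.
traditional = rebuild_hours(1000, 50)

# Distributed remirror: the same 1TB spread over 179 surviving disks, each
# assumed to spare only 10 MB/s for rebuild traffic alongside host I/O.
distributed = rebuild_hours(1000, 10, writers=179)

print(f"single spare: {traditional:.1f} hours")
print(f"distributed:  {distributed * 60:.0f} minutes")
```

Even with each disk contributing only a fraction of its bandwidth, the rebuild window drops from hours to minutes under these assumptions, which shortens the period in which a second failure can cause data loss.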
Every storage vendor implements data availability differently. There is a trade-off between recovery time, risk and impact, which is often difficult to quantify. I would respectfully suggest that there is still a requirement to understand the impact of disk failure for an array, not only from the design perspective but also in terms of the potential business impact; it makes sense to minimise risk wherever possible.