Thanks to all those who posted in response to Understanding EVA earlier this week, especially Cleanur who added a lot of detail. Based on the additional knowledge, I’d summarise again:
- EVA disks are placed in groups – usually recommended to be one single group unless there’s a compelling reason not to (like different disk types e.g. FC/FATA).
- Disk groups are logically divided into Redundancy Storage Sets (RSSs), each of 6 to 11 disks depending on the number of disks in the group; the ideal size is 8 drives.
- Virtual LUNs are created across all disks in a group. However, to minimise the risk of data loss from disk failure, equal slices of LUNs (called PSEGs) are created in each RSS, with additional parity to recreate the data within the RSS if a disk failure occurs. PSEGs are 2MB in size.
- In the event of a drive failure, data is moved dynamically/automagically to spare space reserved on each remaining disk.
I’ve created a new diagram to show this relationship. The vRAID1 devices are pretty much as before, although now numbered as 1-1 & 1-2 to show the two mirrors of each PSEG. For vRAID5, there are 4 data PSEGs and 1 parity PSEG, which initially hit RSS1, then RSS2, then back to RSS1 again. I haven’t shown it, but presumably the EVA does a calculation to ensure that the data resides evenly on each disk.
So here’s some maths on the numbers. There are many good links worth reading; try here and here. I’ve taken the simplest formula and churned the numbers on a 168-drive array with a realistic MTBF (mean time between failures) of 100,000 hours. Before people leap in and quote the manufacturers’ numbers that Seagate et al provide, which are higher figures, remember that arrays will predictively fail a drive, and in any case with temperature variation, heavy workload, manufacturing defects etc, the real-world MTBF is lower than the manufacturers’ figures (as Google have already pointed out).
I’ve also assumed a repair (i.e. replace) time of 8 hours, which seems reasonable for arrays unattended overnight. If disks are not grouped, so that any two of the 168 drives failing together causes data loss, then the MTTDL (mean time to data loss) is about 44,553 hours, or just over five years. This is for a single array; imagine if you had 70-80 of them, where the risk would be higher still. Now, with the disks in groups of 8 (meaning that data will be written across only 8 disks at a time), the MTTDL from a double disk failure becomes 1,062,925 hours, or just over 121 years. This is without any parity.
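For anyone wanting to check the arithmetic, here is a minimal sketch. The formula isn't spelled out above, so this assumes the common simplified double-failure approximation, MTTDL = MTBF² / (N × (N−1) × MTTR) per group, divided by the number of groups. That assumption does reproduce both figures quoted. The function and variable names are my own, for illustration only.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl_double_failure(disks, mtbf_hours, mttr_hours, group_size=None):
    """Approximate mean time to data loss (hours) from a double disk failure.

    Assumes data is lost when a second disk in the same group fails while
    the first is still being replaced. With group_size=None the whole
    array is treated as one group.
    """
    if group_size is None:
        group_size = disks
    if disks % group_size != 0:
        raise ValueError("disks must divide evenly into groups")
    groups = disks // group_size
    # MTTDL for one group of N disks: MTBF^2 / (N * (N - 1) * MTTR)
    per_group = mtbf_hours ** 2 / (group_size * (group_size - 1) * mttr_hours)
    # Any one of the groups failing loses data, so divide by the group count.
    return per_group / groups


# Figures from the text: 168 drives, 100,000 hour MTBF, 8 hour replace time.
ungrouped = mttdl_double_failure(168, 100_000, 8)
grouped = mttdl_double_failure(168, 100_000, 8, group_size=8)

for label, hours in (("ungrouped", ungrouped), ("groups of 8", grouped)):
    # Hours are truncated to whole numbers, as in the text.
    print(f"{label}: {int(hours):,} hours (~{hours / HOURS_PER_YEAR:.1f} years)")
```

The improvement ratio works out as (168 × 167) / (21 × 8 × 7), which is roughly 24 times better, simply because a second failure only matters if it lands in the same group of 8 rather than anywhere in the array.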
Clearly grouping disks into RSSs does improve things, and quite considerably so, even if no parity is implemented, so thumbs up to RSSs from a mathematical perspective. However, if a double disk failure does occur within an RSS, then every LUN in the disk group is impacted, as data is spread across the whole disk group. So it’s a case of very low probability, very high impact.
Mark & Marc commented on 3Par’s implementation being similar to EVA. I think XIV sounds similar too. I’ll do more investigation on this as I’d like to understand the implications of double disk failures on all array types.