Unreliable Disks for Better Scale-out Storage

Chris Evans | Storage

Recently I was asked to review a document that referenced a piece of work from Google on the need to relax the resiliency levels of hard drives and SSDs.  The premise is interesting: hyper-scalers claim they could do a better job of managing performance and availability if the HDDs they use were modified to provide more information on error conditions and failures in the drive media.

The Curse of Reliable Drives

Modern HDDs are remarkably reliable devices.  Barring the occasional manufacturing error, both HDDs and SSDs are commodity components with a huge amount of smarts built into their controllers.  This intelligence is needed to manage the foibles of the media, whether coping with the likes of shingled recording or managing the degradation of NAND silicon.  The controller software is capable of redirecting I/O, reallocating sectors and (in the case of SSDs) distributing I/O evenly across the media, catering for a huge number of failure scenarios.
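As a rough illustration of the kind of masking going on, here's a minimal Python sketch of a sector read path inside a hypothetical drive controller (the media object, spare pool and remap table are assumptions for illustration, not any vendor's firmware): the drive retries, relocates a recovered sector to a spare, and only reports an error to the host once every retry has failed.

```python
# Hypothetical sketch of a controller masking media errors from the host.
SPARE_POOL = list(range(1_000_000, 1_000_100))   # reserved spare sectors
remap_table = {}                                 # logical sector -> spare sector

def read_sector(media, logical, retries=3):
    """Read a sector, retrying and silently remapping marginal sectors."""
    physical = remap_table.get(logical, logical)
    for attempt in range(retries):
        data, ok = media.read(physical)          # assumed low-level media API
        if ok:
            if attempt > 0:                      # recovered after a retry:
                spare = SPARE_POOL.pop()         # relocate to a spare sector
                media.write(spare, data)
                remap_table[logical] = spare
            return data                          # the host never sees the retries
    raise IOError("unrecoverable read error")    # only after every retry fails
```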

Hyperscale Requirements

Where most customers would be happy for the drive to deal with and mask transient and permanent media errors, the hyper-scalers (particularly Google in this case) would rather many of those errors were exposed to the host, allowing Google’s storage software layer to make decisions on how to cope with the failure.  For example, a hard drive will retry a failed read to ensure the error wasn’t just transient.  This retry process takes time and increases the latency of the request, affecting the response time of an I/O that is distributed across many devices.  Google would much rather the drive simply failed, and “failed fast”, allowing software to read or rebuild the data from elsewhere.
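As a minimal sketch of what “fail fast” could look like at the host (the drive and replica objects here are assumptions, not Google's actual storage API), the storage layer gives the local read a tight deadline and, rather than sitting behind the drive's internal retry cycle, serves the request from another copy of the data:

```python
import concurrent.futures

FAST_FAIL_TIMEOUT = 0.05    # e.g. a 50 ms budget before giving up on the drive
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def read_block(primary_drive, replicas, block_id):
    """Fail fast: don't wait out the local drive's own retry cycle."""
    future = _pool.submit(primary_drive.read, block_id)   # hypothetical drive API
    try:
        return future.result(timeout=FAST_FAIL_TIMEOUT)
    except (concurrent.futures.TimeoutError, IOError):
        # Abandon the slow or failed read and serve the request from a
        # replica (or a parity rebuild) held on other devices.
        for replica in replicas:
            try:
                return replica.read(block_id)
            except IOError:
                continue
        raise
```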

Everything Has Been Done Before

This idea of pushing resiliency intelligence up into software isn’t brand new.  Pure Storage uses the technique in its FlashArray code, choosing to rebuild data from parity rather than wait for a drive to respond when it may be busy, for example during garbage collection.  In Pure’s case, the process of managing the drive presumably comes from testing against the hardware (I have no idea how much bespoke code, if any, runs on the drives Pure uses), whereas Google is specifically asking manufacturers to offer APIs and response codes that allow alternative actions to be taken.  Remember also that XIO has been managing drive failure scenarios since the company was founded, using that IP to provide years of maintenance-free availability on its ISE devices.
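For illustration, here's a minimal RAID-5-style sketch of reading “around” a busy drive by reconstructing its block from the surviving blocks and parity (the stripe layout is an assumption for the example, not Pure's actual FlashArray format):

```python
# XOR the surviving data blocks with the parity block to reconstruct the
# missing one, instead of waiting for the slow or failed drive to answer.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def rebuild_missing(stripe, missing_index):
    """stripe: list of data blocks plus one parity block; one entry is unavailable."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)

# Example: 3 data blocks plus parity, with drive 1 unresponsive.
data = [b"\x01" * 4, b"\x02" * 4, b"\x04" * 4]
parity = xor_blocks(data)
stripe = [data[0], None, data[2], parity]
assert rebuild_missing(stripe, 1) == data[1]
```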

Rebuild Times

I’ve also previously thought that HDDs could be modified to make drive rebuilds quicker.  Imagine a drive is suspected of failing.  As the drive reads and writes data to service host requests, it repositions the read/write head and passes over many tracks and cylinders that aren’t being accessed.  It should be possible to read that data as the head passes and make it available in cache, over a second channel, to speed up the rebuild process.  This is just one idea; Google has many others, including its own take on parallel access (with more drive actuator arms), alternative form factors (to increase storage density), host-managed retries and background task management.  Check out the full document here.
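A purely hypothetical sketch of that idea, assuming the drive could expose sectors the head reads “for free” while seeking (no such firmware interface exists today, and the names below are invented for illustration):

```python
class OpportunisticRebuilder:
    """Sketch: feed a rebuild with data captured as a side effect of host I/O."""

    def __init__(self, rebuild_channel):
        self.rebuild_channel = rebuild_channel   # hypothetical second channel
        self.already_copied = set()

    def on_host_read(self, media, sector):
        # Hypothetical firmware call: returns the requested sector plus any
        # sectors the head happened to read while seeking to it.
        data, overflights = media.read_with_overflights(sector)
        for s, payload in overflights.items():
            if s not in self.already_copied:
                self.rebuild_channel.send(s, payload)   # stream to the rebuild target
                self.already_copied.add(s)
        return data
```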

New Features for All

How practical would these new features be for “the rest of us”?  Certainly, I imagine array vendors would see significant benefits from improved data virtualisation, as would the software-defined storage companies.  I doubt these features would have much value for consumers, as most are focused on dealing with the effects of managing thousands or hundreds of thousands of drives.  This means we could end up with a two-tier drive hierarchy based on reliability – almost true enterprise drives.

SNIA Recommendations

SNIA have put together a response document (the one I mentioned reviewing earlier), which discusses how some of these features could be implemented.  There aren’t many hard drive vendors these days (hello Seagate, WD and HGST), so reaching consensus on new features might not be that hard.  You can read the report here.

The Architect’s View®

We’ve already seen the hyper-scale compute companies (Google/Facebook/Microsoft) develop server and rack architectures that provide cheaper, more efficient deployment and management than buying from standard vendors (check out the Open Compute Project, for example).  So why not try to optimise the components further?  Some of the changes suggested by Google could be implemented easily; however, a lot depends on the continued take-up of hard drives as the medium of choice for the data centre.  NAND flash may well affect this, especially if QLC technology can be delivered relatively quickly and reliably.  It’s clear from the development of Kinetic drives that HDD vendors want to remain relevant.  Maybe they will need to listen to the hyperscalers’ requests or find NAND flash taking over their business more quickly than expected.

Copyright (c) 2007-2022 – Post #48B7 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.