Flash Capacities and Failure Domains

Intel Ruler SSD and Chassis — The Intel SSD DC P4500 Series in the “ruler” form factor was designed to optimize rack efficiency and will be available by end of 2017. (Credit: Intel Corporation)

Chatting with good friend Enrico Signoretti earlier today on the subject of 1PB flash in 1U, I was reminded of the new Intel Ruler form factor. In case you missed the news, in August Intel debuted a long, thin form-factor SSD dubbed the ruler that stacks back to front and vertically in a server, potentially allowing up to 1PB per 1U of rack space. Product details are scarce; however, from the images shown, the ruler SSD will be easier to hot swap in a server and have better heat dissipation, as I imagine the whole length of the body will be a passive heatsink. The images also suggest that a single server could hold 32 ruler SSDs, each of 32TB, based on a chip size of 1TB. I’m guessing 32 active media slots per ruler and 4 for over-provisioning.
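To make the capacity arithmetic explicit, here is a minimal Python sketch. It assumes the figures guessed at above (32 rulers per 1U, 32 active chips per ruler, 1TB per chip); none of these are confirmed Intel specifications, and the function names are purely illustrative.

```python
def ruler_capacity_tb(active_chips: int, chip_tb: float) -> float:
    """Usable capacity of one ruler in TB; spare chips are not counted."""
    return active_chips * chip_tb


def chassis_capacity_tb(rulers: int, active_chips: int, chip_tb: float) -> float:
    """Usable capacity of a chassis holding `rulers` identical rulers."""
    return rulers * ruler_capacity_tb(active_chips, chip_tb)


if __name__ == "__main__":
    # Assumed figures from the text above, not a published spec.
    total = chassis_capacity_tb(rulers=32, active_chips=32, chip_tb=1.0)
    print(f"{total:.0f} TB per 1U")  # 32 x 32 x 1 TB = 1024 TB, roughly 1PB
```

Changing `chip_tb` is enough to see how quickly the chassis total grows as chip density increases.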

Getting back to the discussion with Enrico, we were talking about failure domains. In this server form factor, the failure domain is either the Ruler blade or the server, as the chassis design shown by Intel implies dual controllers. The Ruler is hot-swappable, which reduces the risk somewhat. I would also imagine that the blades themselves are in a redundant configuration to add an extra level of resiliency.

What happens if we get 32TB and larger? With QLC and a 1.5TB chip size, we could easily see 1.5PB in a chassis. How much failure can a single ruler tolerate before the whole device fails? Ultimately this is Enrico’s issue (I think): with larger and larger devices, we are at risk of huge rebuilds. More importantly, to make device pricing viable, Intel needs to be able to easily repair failed devices.
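To put a rough number on "huge rebuilds", the sketch below estimates how long it takes to re-create the contents of one failed ruler at a sustained rebuild rate. The rebuild rates are assumptions chosen for illustration only; real rates depend on the controller, the protection scheme and how much foreground I/O the system has to serve at the same time.

```python
def rebuild_hours(device_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite `device_tb` of data at a sustained rate.

    Uses decimal units (1 TB = 1,000,000 MB) and ignores any slowdown
    from competing host I/O, so it is a best-case estimate.
    """
    if rebuild_mb_per_s <= 0:
        raise ValueError("rebuild rate must be positive")
    seconds = device_tb * 1_000_000 / rebuild_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # 32TB ruler (1TB chips) and 48TB ruler (1.5TB chips, 32 active slots).
    for device_tb in (32, 48):
        # Hypothetical sustained rebuild rates in MB/s.
        for rate in (500, 1000, 2000):
            hours = rebuild_hours(device_tb, rate)
            print(f"{device_tb}TB at {rate} MB/s: {hours:.1f} hours")
```

Even under these optimistic assumptions the rebuild window runs to many hours, and the system is exposed to a second failure for that whole period.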

SSD reliability is currently as good as that of hard drives when measured by MTBF or AFR (Annual Failure Rate). As we scale up, it will be interesting to see if this level of reliability can be maintained. This raises the question of whether the NAND or the controller is the more likely failure point, and how many chips each controller channel will drive.
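As a rough illustration of what a given AFR means for a fully populated chassis, the sketch below converts MTBF to AFR and estimates the chance of seeing at least one ruler failure per year. It assumes independent failures with a constant failure rate, and the 1% AFR is a placeholder value rather than a published figure for any device.

```python
import math

HOURS_PER_YEAR = 8760


def mtbf_to_afr(mtbf_hours: float) -> float:
    """AFR implied by an MTBF, assuming a constant (exponential) failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures(devices: int, afr: float) -> float:
    """Expected number of device failures per year."""
    return devices * afr


def p_any_failure(devices: int, afr: float) -> float:
    """Probability of at least one failure per year, assuming independence."""
    return 1 - (1 - afr) ** devices


if __name__ == "__main__":
    afr = 0.01  # placeholder AFR, not a vendor specification
    print(f"expected failures per year: {expected_failures(32, afr):.2f}")
    print(f"chance of at least one:     {p_any_failure(32, afr):.1%}")
```

The point is less the exact numbers than the scaling: the per-device rate may match a hard drive, but each failure now takes 32TB or more with it.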

When Hitachi introduced their FMD, lots of additional intelligence went into the controller, but RAID wasn’t included. Device failure was still managed across multiple devices. With 32 Rulers in a server and data potentially spread across multiple Rulers, it would probably make sense to use erasure coding; however, that’s not efficient with small-block writes. Implementing the right level of data protection could be an issue for these devices.
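The small-block write problem can be shown with a simplified I/O count. The sketch below uses the classic read-modify-write model for parity-based protection: updating one data fragment means reading the old data and old parities, then writing the new data and new parities. The 10+2 layout across Rulers is a hypothetical example, and real implementations can reduce the cost with caching or log-structured writes.

```python
def storage_overhead(data: int, parity: int) -> float:
    """Raw capacity consumed per unit of usable capacity for a k+m scheme."""
    return (data + parity) / data


def small_write_ios(parity: int) -> int:
    """Device I/Os to update ONE data fragment via read-modify-write.

    Reads: old data fragment + each old parity fragment.
    Writes: new data fragment + each new parity fragment.
    """
    return 2 * (1 + parity)


def full_stripe_write_ios(data: int, parity: int) -> int:
    """Device I/Os to write a whole stripe; no reads are needed."""
    return data + parity


if __name__ == "__main__":
    k, m = 10, 2  # hypothetical 10+2 layout spread across 12 Rulers
    print(f"overhead:            {storage_overhead(k, m):.2f}x")
    print(f"small write:         {small_write_ios(m)} I/Os for 1 fragment")
    print(f"full-stripe write:   {full_stripe_write_ios(k, m)} I/Os for {k} fragments")
```

In this model a full-stripe write costs 1.2 I/Os per fragment of user data, while an isolated small write costs 6, which is why erasure coding suits large object and file writes better than random block updates.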

The Architect’s View

How comfortable would you feel about storing 1PB of active data in a single chassis? In the quest for ever higher densities, this concentration of data could be an issue. Object and file-based storage will probably be fine with erasure coding, but block-based storage less so. This exposure is one of the reasons, I think, that Pure Storage went for a more active blade and passive backplane in FlashBlade, with less to go wrong in the chassis. Funny that even as we move forward, the old issues of storage still exist.
