Dude, Here's Your 300TB Flash Drive!

This week, Pure Storage announced FlashBlade//E and the aspiration to reach 300TB DirectFlash Modules by 2026. With the best current commodity drives stagnant at 30TB, how will Pure get to 300TB, and why does this matter?

Background

Back in 2018, we discussed the concept of a 100TB drive, which had been designed and built by Nimbus Data. At the time, Samsung had released 30TB drives, Seagate had a 60TB “concept” HDD, and we were starting to talk about fault domains and flash density as the ruler format started to emerge.

The Nimbus drive was challenged on several fronts. It used the SATA interface at 500MB/s, which meant single-threaded reading/writing and a de-facto ceiling of 0.43 DWPD, because the interface can move only around 43TB in a day. The drive used a 3.5” form factor, whereas all modern flash drives had moved to 2.5”. It’s possible to fit either two or four 2.5” drives into the same volume as a 3.5” drive, so from a space perspective, this is either a 50TB or 25TB equivalent. As an archive flash drive, the Nimbus DC could have been viable if the price was right (we believe around $0.50/GB – $0.60/GB).
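
The 0.43 DWPD figure follows directly from the interface speed. A minimal sketch of the arithmetic, assuming decimal units and a drive that writes at the full 500MB/s for 24 hours (the function name is ours, for illustration only):

```python
def max_dwpd(interface_mb_per_s: float, capacity_tb: float) -> float:
    """Upper bound on drive writes per day set by interface bandwidth alone."""
    seconds_per_day = 24 * 60 * 60
    # Decimal units: 1 TB = 1,000,000 MB
    tb_per_day = interface_mb_per_s * seconds_per_day / 1_000_000
    return tb_per_day / capacity_tb


nimbus_dwpd = max_dwpd(500, 100)           # 100TB drive on a 500MB/s SATA link
days_to_fill = 1 / nimbus_dwpd             # time to write the drive once

print(f"{nimbus_dwpd:.2f} DWPD")           # 0.43 DWPD
print(f"{days_to_fill:.1f} days to fill")  # 2.3 days to fill
```

In other words, simply filling the drive once takes more than two days, regardless of what the NAND itself could sustain.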

Failure Domains

Putting 100TB of data onto a single drive can be problematic. If the drive completely fails, then the RAID or erasure coding rebuild would be significant, possibly into petabyte scale. That’s a lot of data traversing the I/O bus for no application benefit. Rebuilds hit application performance and, of course, put data at risk until the rebuild completes.
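
To see how a single drive failure reaches petabyte scale, here is a rough sketch. It assumes a conventional erasure coding scheme in which rebuilding one lost fragment means reading the equivalent data from each of k surviving drives; the 10-data-fragment layout is a hypothetical example, not any particular vendor's design.

```python
def rebuild_traffic_tb(drive_tb: float, data_fragments: int,
                       fill_fraction: float = 1.0) -> float:
    """Approximate I/O for one drive rebuild: reads from survivors plus the rewrite."""
    rebuilt = drive_tb * fill_fraction   # data that must be recreated
    read = rebuilt * data_fragments      # k surviving fragments read per lost fragment
    return read + rebuilt                # total TB crossing the I/O bus


# Hypothetical 10+2 layout, drives assumed full
for capacity in (30, 100, 300):
    total = rebuild_traffic_tb(capacity, data_fragments=10)
    print(f"{capacity}TB drive: ~{total / 1000:.2f}PB of rebuild I/O")
```

Under these assumptions, a full 100TB drive generates a little over 1PB of rebuild traffic, and a 300TB drive more than 3PB.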

Ideally, we want to minimise the amount of rebuild work required. A preferred option is to simply copy data to another device before a failure occurs, known as predictive sparing, which has been a part of enterprise storage for decades. The art of the process is to know when a drive is likely to fail and pre-empt the failure at the right time. Over-aggressive predictive sparing creates unnecessary I/O and physical drive replacements. Under-aggressive predictive sparing results in more I/O-expensive RAID rebuilds.

As drive capacities have increased, the impact of managing failures has grown. In the HDD market, drives can reallocate bad sectors, but this is generally a sign of an impending failure. Bad sector relocation also impacts performance. SSDs can have bad sectors but, more importantly, have a level of endurance based on write activity (DWPD). Your SSD will fail at some point. Note that HDD vendors have also been adding workload limits to drives for some time (we reported on this back in 2016).

Rebuild Management

If we reach 300TB drives, the rebuild overhead on a device failure will be enormous. That’s assuming, of course, that all failures result in an entire drive replacement. The SAS protocol has introduced the idea of Logical Depopulation for HDDs – effectively marking an entire platter as unusable. We discussed the concept in a Storage Unpacked podcast in September 2022.

Both NVMe SSDs and SMR hard drives can be divided into zones, almost like smaller, logical devices. It seems reasonable to expect that drive vendors can minimise rebuilds and manage device failures more effectively with features like namespaces (if they’re not already doing it).
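
As a simple illustration of why partial failure handling matters, the sketch below compares the rebuild scope of a whole-drive failure with the loss of a single region. The split into ten equally sized regions is a hypothetical figure chosen for the example; real zone, namespace or platter sizes will differ.

```python
def rebuild_scope_tb(drive_tb: float, regions: int, failed_regions: int = 1) -> float:
    """Capacity needing rebuild when only some equally sized regions fail."""
    if not 0 <= failed_regions <= regions:
        raise ValueError("failed_regions must be between 0 and regions")
    return drive_tb * failed_regions / regions


whole_drive = rebuild_scope_tb(300, regions=10, failed_regions=10)
one_region = rebuild_scope_tb(300, regions=10, failed_regions=1)

print(f"Whole-drive failure: {whole_drive:.0f}TB to rebuild")  # 300TB
print(f"Single-region failure: {one_region:.0f}TB to rebuild")  # 30TB
```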

The key to this process is not to think of the SSD or HDD as a black-box device but to expose the internal management to the connected host. We can go back to our podcast from 2018 (Storage for Hyperscalers) to see that this process has been going on for some time.

Of course, managing partial failures only helps on the cost side of things if drives are repairable. HDDs generally aren’t repaired; they are recycled or sent to landfill. We recorded a podcast back in 2019 where we discussed recycling versus reuse.

If you can’t repair an SSD or HDD, then the impact of a failure directly affects the TCO. The reliability of devices (MTBF, AFR) has hardly changed in years, typically around 0.55% AFR (an MTBF of roughly 2 million hours).
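
MTBF and AFR are two views of the same failure rate. A minimal sketch of the conversion, assuming a constant failure rate (the usual exponential model) and a drive powered on for all 8,760 hours of the year:

```python
import math

HOURS_PER_YEAR = 8760  # assumes 24x7 operation


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualised failure rate implied by an MTBF, constant failure rate assumed."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse of afr_from_mtbf: the MTBF in hours implied by an AFR."""
    return -HOURS_PER_YEAR / math.log(1 - afr)


afr_2m = afr_from_mtbf(2_000_000)
mtbf_055 = mtbf_from_afr(0.0055)

print(f"2M hours MTBF -> {afr_2m:.2%} AFR")                     # 0.44% AFR
print(f"0.55% AFR -> {mtbf_055 / 1e6:.2f} million hours MTBF")  # about 1.59
```

On this model, 2 million hours works out at about 0.44% and 0.55% at about 1.6 million hours, so the two figures quoted are in the same range rather than an exact pair.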

Density

Problem number two – product density – how can we get more capacity into the same or similar form factors to those we use today?

First, we can assume that hard drives will not scale effectively in the future. In the past, hard drive capacity increased through improvements in areal density (the number of data bits per square inch of HDD platter). An increase in areal density is, at best, a two-dimensional change (the X and Y axes), with some benefit achieved by adding more platters to the Z-axis. The gain from additional platters is now marginal and unlikely to improve unless we choose to entirely redesign the HDD architecture (which would increase unit costs). Areal density improvements have been slowing as manufacturers struggle to make new technologies viable.

In the SSD market, vendors have improved density using the Z-axis through tunnelling and stacking. 3D-NAND started to be used by storage appliance vendors around 2016 (see this post). Today, media vendors have started shipping 232-layer NAND, with forecasts of 400-500 layers possible within a single die; however, these are not being used in enterprise storage devices yet. This means we can expect a doubling or quadrupling of capacity from just the 3D-NAND aspect alone.

PLC (penta-level cell) NAND may make it to the market in the future; however, the improvements from this technology are marginal and offset by further reductions in endurance. A more likely scenario is that we see increased use of hybrid devices, where the sections of the NAND are reconfigured to operate at anything from SLC to PLC. Inactive (or read-intensive) data gets moved to the PLC sections, while active data sits in SLC.
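
The relative contribution of layer count and bits per cell can be estimated with some simple ratios. This is a sketch under a simplifying assumption: that capacity per die scales linearly with both layers and bits per cell, which ignores die area, overprovisioning and yield.

```python
def density_gain(layers_now: int, layers_next: int,
                 bits_now: int = 4, bits_next: int = 4) -> float:
    """Capacity multiplier from more layers and/or more bits per cell (linear model)."""
    return (layers_next / layers_now) * (bits_next / bits_now)


# Layer growth alone, from 232 layers, with QLC (4 bits per cell) held constant
gain_400 = density_gain(232, 400)
gain_500 = density_gain(232, 500)

# Moving from QLC (4 bits) to PLC (5 bits) with no change in layers
gain_plc = density_gain(232, 232, bits_now=4, bits_next=5)

print(f"400 layers: {gain_400:.2f}x")  # 1.72x
print(f"500 layers: {gain_500:.2f}x")  # 2.16x
print(f"QLC to PLC: {gain_plc:.2f}x")  # 1.25x
```

The layer count does most of the work here; PLC adds 25% at best, which is why we describe its contribution as marginal.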

A 2Tb NAND die is on the horizon. We expect there are other techniques being developed that will continue the growth in density. You can hear more about the challenges of increasing NAND density in a recent podcast we recorded in November 2022.

One other area to consider. IBM already applies heavy compression on FlashCore Modules. The FCM3 has a raw capacity of 38.4TB and 87.95TB effective (depending on compression ratio). So, another angle to increase density is to apply data optimisation techniques like compression with greater effectiveness.
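
The FCM3 figures imply a compression ratio of roughly 2.3:1. As a quick illustration, the sketch below derives that ratio and then asks how much raw NAND a 300TB effective module would need at the same ratio. Applying IBM's ratio to a 300TB target is our own what-if, not a statement about how Pure Storage measures DFM capacity.

```python
def compression_ratio(effective_tb: float, raw_tb: float) -> float:
    """Effective-to-raw capacity ratio."""
    return effective_tb / raw_tb


def raw_needed_tb(target_effective_tb: float, ratio: float) -> float:
    """Raw capacity required to present a given effective capacity."""
    return target_effective_tb / ratio


fcm3_ratio = compression_ratio(effective_tb=87.95, raw_tb=38.4)
raw_for_300 = raw_needed_tb(300, fcm3_ratio)

print(f"FCM3 ratio: {fcm3_ratio:.2f}:1")                     # 2.29:1
print(f"Raw NAND for 300TB effective: {raw_for_300:.0f}TB")  # 131TB
```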

Cost

Now to the third question, that of cost. If we can pack more data onto the same number of NAND chips, then the cost of storage should decline in line with the increase in density. We’ve seen this trend in the HDD market for many years. The BOM (bill of materials) remains roughly the same between generations of products, resulting in new model pricing at around $600/drive. The same logic should apply to NAND if the manufacturing process remains broadly the same between generations.
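
The effect of a static BOM on unit cost is easy to show. The sketch below holds the drive price at $600 and varies the capacity; the capacity points are illustrative values chosen for the example, not a product roadmap.

```python
def cost_per_tb(drive_price_usd: float, capacity_tb: float) -> float:
    """Price per terabyte for a drive sold at a fixed unit price."""
    return drive_price_usd / capacity_tb


DRIVE_PRICE = 600  # per-drive price quoted above

# Illustrative capacity generations (assumed values)
costs = {tb: cost_per_tb(DRIVE_PRICE, tb) for tb in (10, 20, 30)}

for tb, cost in costs.items():
    print(f"{tb}TB drive: ${cost:.0f}/TB")
# 10TB drive: $60/TB
# 20TB drive: $30/TB
# 30TB drive: $20/TB
```

If NAND follows the same pattern, each doubling of capacity at a constant BOM halves the cost per TB.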

The custom drives produced by Pure Storage, IBM and ScaleFlux all use additional onboard processing. There is a cost associated with this, however. Vendors are moving from FPGAs to ASICs and custom SoCs. At scale, this transition may be more cost-effective for larger capacity drives while adding other benefits (like the compression mentioned above).

We discuss the impacts of computational storage in our recent eBook – Intelligent Data Devices, available for purchase and download today.

Cost of Repair

We should go back and look again at the ability to repair HDDs and SSDs. In the HDD market, the BOM and lifetime of a drive have been relatively static. The drive vendors know the expected number of returns under warranty and can build that into their margins. Returned drives most likely won’t be repaired but are hopefully recycled in some form.

In the SSD market, the unit cost for high-capacity devices is much higher, so it makes sense to repair them where possible. As a comparison, imagine purchasing a car and scrapping the whole vehicle because the tyres needed replacing!

SSDs can be repaired, but the use of commodity devices introduces a supply chain challenge. The customer returns failed drives to the appliance vendor under warranty, who must then return them to the device vendor under whatever equivalent warranty exists between the two. With a unit cost of $5000 – $10,000 for the most expensive drives, building an efficient returns process is critical for both the end user and the storage vendor.

Without an efficient supply chain and repair process, media vendors will be reluctant to push capacities past current levels unless the bill of materials cost remains roughly consistent with today’s prices. Similarly, appliance vendors may want to limit the cost of an individual drive in the systems they deploy.

The Architect’s View®

So, how does this all play out in terms of Pure Storage and 300TB DFMs? We think a 300TB capacity is achievable. This milestone will be reached through improved NAND density (PLC and 3D-NAND), data optimisation (compression, de-duplication) and re-factoring (smaller processors onboard, more NAND chips).

The ability to manage devices through an abstracted Flash Translation Layer and direct data placement means 300TB rebuilds should be manageable. We believe that Pure Storage is probably also working on other techniques and ideas yet to be announced.

Third, the cost profile will be managed through controlled endurance (FTL), a better AFR than the market average and repairable hardware.

What about the rest of the market? Only IBM has gone down the route of developing custom modules. Hitachi was an early developer of custom drives but appears to have moved away from the technology. The remainder of the market uses commodity components.

Does this matter?

There are two aspects to consider. The first is cost: if a 300TB DFM is (significantly) cheaper per TB than ten 30TB drives, then yes, the difference does matter.

Second, there’s the question of the environmental impact. If ten 30TB drives require more space and need more power and cooling, then the TCO is impacted. At some point, data centre costs will be an issue for all businesses as sustainability rises higher up the CTO’s agenda.

2026 is a short timeline to achieve a 6x improvement in DFM capacity. We will be watching closely to see if, when and how Pure Storage achieves this milestone.
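
To put the 6x target in context, we can work out the compound annual growth it implies. This sketch assumes a three-year window (2023 to 2026); a different starting point would change the result.

```python
def required_annual_growth(multiple: float, years: float) -> float:
    """Compound annual growth rate needed to reach a capacity multiple."""
    return multiple ** (1 / years) - 1


growth = required_annual_growth(multiple=6, years=3)  # years is an assumption

print(f"Required growth: {growth:.0%} per year")  # 82% per year
```

That is close to doubling DFM capacity every year for three years running.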

Copyright (c) 2007-2023 – Post #c333 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission. Pure Storage is a Tracked Vendor by Architecting IT in storage systems and software-defined storage.