A few years ago I worked with a company called Storage Fusion that provided a SaaS offering for analysing storage systems. With a few simple scripts it was possible to collect the configuration information from one or more storage arrays and use that data to gain insights into metrics such as utilisation, performance and efficiency. The benefit that interested me most was the ability to (anonymously) collate and analyse configuration metadata covering petabytes of storage, either as a snapshot or historically over time. This data provided valuable insights into market trends, but with deeper analysis (not available at the time) it could also have helped vendors deliver better, lower-cost products.
Valuable as the data was, Storage Fusion’s platform was limited. The metadata available was based pretty much on configuration information that showed disk and LUN layouts, the efficiency of features like thin provisioning and so on. What wasn’t possible was to collect data at the level of granularity that has been built into newer storage architectures, such as those from Nimble Storage (InfoSight), Tegile (Intellicare), Pure Storage (Pure1) and PernixData (Architect). These systems have two specific features that provide much greater insight; first, they collect metrics that analyse all aspects of the platform, including specifics of the workload the system runs. Second, they collate that data and perform both historical and “what-if” analysis across many customers. So how could this data be used? Here are a few ideas and examples:
- Identifying failing hardware. Obviously storage media fails from time to time, however being able to determine whether a batch of media has a worse failure rate than expected can both help to identify manufacturing defects and resolve them before other customers are affected. Shipping a drive to the customer before they are aware of a problem is a cool feature.
- Optimise Drive Usage. As systems move to flash, one constraint is the limited endurance SSDs offer. Initial deployments of flash were a little bit of a guessing game and vendors undoubtedly erred on the side of caution. However, with field data, companies like SolidFire were able to provide guarantees around drive lifetime (in this case an unlimited wear warranty), thanks to telemetry received back from the field on the amount of data customers were actually writing. This has also allowed vendors to safely introduce the use of TLC NAND, which has lower endurance than both SLC and MLC.
- Improve reliability/availability. Ultimately having more data on system operations means being able to proactively address customer problems and improve uptime and availability.
- Reduce Cost. At some stage or other everything comes down to cost. Being able to use cheaper drives and reduce parts replacements means vendors can pass on cost reductions to their customers, keeping them competitive.
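To make the endurance point above concrete, here is a minimal sketch of the kind of calculation field telemetry enables. The drive capacity, TBW rating and write rates below are illustrative numbers of my own, not figures from any vendor:

```python
# Sketch: estimating SSD lifetime from field telemetry on actual customer
# write rates. Seeing real (lower-than-worst-case) write rates is what lets
# vendors offer wear guarantees and adopt lower-endurance TLC NAND safely.

def estimated_lifetime_years(rated_tbw: float, daily_writes_tb: float) -> float:
    """Years until the drive's rated terabytes-written (TBW) is exhausted."""
    return rated_tbw / (daily_writes_tb * 365)

# Hypothetical 3.84 TB drive rated at 7,000 TBW (roughly 1 DWPD for 5 years).
rated_tbw = 7000.0

# Fleet telemetry might show customers writing far less than the worst case:
observed_daily_writes_tb = [0.9, 1.2, 0.7, 2.1, 1.5]  # TB/day, per drive

worst_case = max(observed_daily_writes_tb)
print(f"Heaviest writer lasts ~{estimated_lifetime_years(rated_tbw, worst_case):.1f} years")
```

Even the heaviest writer in this imaginary fleet would take around nine years to wear out the drive, which is exactly the sort of evidence that justifies an aggressive endurance guarantee.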
SDS and Analytics
As we move to a software-defined world, how will new storage solutions, based around open source or software-only products, deliver the analytics that are currently integrated into hardware appliances? At this stage I don’t believe I’ve seen anything in solutions like EMC ScaleIO, VMware Virtual SAN or Ceph that would deliver feedback on the infrastructure. In fact, in many instances failures aren’t even dealt with proactively; the software simply waits for the device to fail. Dealing with failure before it occurs is a much better way to operate infrastructure, especially at scale. Imagine if we waited for aircraft parts or bridges to fail before tackling the problem.
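As a sketch of what “dealing with failure before it occurs” might look like in a software-only stack, here is a toy pre-failure check. The attribute names, thresholds and drive records are all hypothetical; in practice this telemetry would come from SMART data (e.g. collected via smartctl) aggregated across a fleet:

```python
# Hypothetical sketch of a proactive drive check, the kind of logic that is
# missing when software simply waits for a device to fail. Thresholds and
# attribute names are illustrative, not from any real product.

WARN_THRESHOLDS = {
    "reallocated_sectors": 10,  # growing remapped sectors often precede failure
    "pending_sectors": 1,       # sectors awaiting reallocation
    "media_wear_pct": 90,       # percentage of rated endurance consumed
}

def drives_to_replace(fleet: list[dict]) -> list[str]:
    """Flag drives whose counters meet or exceed any warning threshold,
    so a replacement can ship before the drive actually fails."""
    flagged = []
    for drive in fleet:
        if any(drive.get(attr, 0) >= limit
               for attr, limit in WARN_THRESHOLDS.items()):
            flagged.append(drive["serial"])
    return flagged

fleet = [
    {"serial": "A1", "reallocated_sectors": 0, "media_wear_pct": 40},
    {"serial": "B2", "reallocated_sectors": 24, "media_wear_pct": 55},
    {"serial": "C3", "pending_sectors": 2, "media_wear_pct": 30},
]
print(drives_to_replace(fleet))  # B2 and C3 are flagged before they fail
```

Trivial as it is, this is the shape of the feedback loop the hardware appliance vendors have built and the software-defined products so far have not.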
Copyright (c) 2007-2022 – Brookend Ltd first published on https://www.architecting.it/blog, do not reproduce without permission. Post #3322.