The Wisdom of the (Storage) Crowd

Commercial storage platforms have always benefited from the centralised collection and analysis of configuration information. Nimble pioneered the modern approach of using analytics to improve product resiliency and reduce the number of engineers required to support customers. Ceph was recently extended with a telemetry module that can now deliver anonymised data into a central repository. With Ceph possibly the first Open Source storage solution to implement this kind of data collection, how can we expect the future of storage telemetry to look?

Background

Storage vendors have collected system metrics for decades. Initially this was achieved with individual dial-home modems connected to each storage array. Over time, the process became more sophisticated, with data collected over the public Internet. NetApp was one of the first vendors to introduce an in-built “AutoSupport” collection process that was configurable by the customer (more on this in a moment).

Call home typically worked in two modes. In the first, systems would alert immediately when a problem, such as a device failure, was identified. In many cases, the first time a customer became aware of a problem was when they received a call or email from the vendor. The second mode implemented a regular process to collect configuration information from across the entire storage estate.
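To make the two modes concrete, here is a minimal sketch of how an agent on an array might separate them. This isn't how any particular vendor implements call home; the class name, the payload fields and the daily interval are all assumptions for illustration, and the `outbox` list simply stands in for the link back to the vendor.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CallHomeAgent:
    """Hypothetical call-home agent; names and payload fields are illustrative."""
    system_id: str
    snapshot_interval_s: float = 24 * 60 * 60   # assumed: one configuration upload per day
    outbox: list = field(default_factory=list)  # stands in for the connection to the vendor
    _last_snapshot: float = float("-inf")       # guarantees a snapshot on the first tick

    def _send(self, kind: str, body: dict, now: float) -> None:
        message = {"system": self.system_id, "kind": kind, "time": now, "body": body}
        self.outbox.append(json.dumps(message))

    def report_fault(self, component: str, detail: str, now: float) -> None:
        # Mode 1: a fault is sent immediately, regardless of the schedule.
        self._send("alert", {"component": component, "detail": detail}, now)

    def tick(self, config: dict, now: float) -> bool:
        # Mode 2: configuration is sent only once the interval has elapsed.
        if now - self._last_snapshot < self.snapshot_interval_s:
            return False
        self._send("config_snapshot", config, now)
        self._last_snapshot = now
        return True


agent = CallHomeAgent("array-01")
agent.tick({"firmware": "1.2.3"}, now=0.0)      # first tick sends a snapshot
agent.tick({"firmware": "1.2.3"}, now=3600.0)   # an hour later: nothing is sent
agent.report_fault("disk-7", "device failure", now=3700.0)  # sent straight away
```

The point of keeping the two paths separate is that an alert should never wait behind the collection schedule, while the bulkier configuration data can be batched.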

ML/AI

In recent years, we’ve seen increasing sophistication in the data analysis that vendors perform. Nimble (and now HPE) has delivered InfoSight with all of their storage product shipments and is extending this technology to other hardware devices. The latest versions of InfoSight technology provide the capability both to identify outliers for the customer and to detect potential bugs and misconfigurations that could impact uptime. The result is a seven 9’s level of availability for Nimble systems.
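As a rough idea of what "identifying outliers" across a fleet can mean, the sketch below flags systems whose metric sits far from the fleet median, using the median absolute deviation (a modified z-score). This is a generic statistical technique and not a description of how InfoSight works; the function name, the 3.5 threshold and the latency figures are my own assumptions.

```python
from statistics import median


def find_outliers(readings: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return the system IDs whose reading has a modified z-score above threshold."""
    values = list(readings.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against, so this simple test cannot decide
    # 0.6745 scales the MAD so the score is comparable to a standard z-score
    # when the data is roughly normally distributed.
    return sorted(
        name for name, v in readings.items()
        if 0.6745 * abs(v - centre) / mad > threshold
    )


# Made-up average read latencies (ms) reported by five systems.
latency_ms = {
    "array-01": 1.0,
    "array-02": 1.1,
    "array-03": 0.9,
    "array-04": 1.2,
    "array-05": 9.0,
}
print(find_outliers(latency_ms))  # ['array-05']
```

A median-based score is used here rather than a mean and standard deviation because a single badly behaving system would otherwise drag the baseline towards itself.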

Machine learning is becoming increasingly crucial for vendors as a method of differentiation. With so many choices available, potential customers can select traditional storage, Open Source solutions or push data into the public cloud. If storage vendors can offer features that increase availability and resiliency, then this can be used as a partial justification for staying with commercial on-premises solutions.

Open Source Telemetry

Ceph introduced the Telemetry module in the Mimic (13.2.x) release. At this point, the data is pretty basic and shows simple statistics like cluster capacity, pool data and software version release. Some of the parameter constructs remind me of early ONTAP AutoSupport configurations.

The Ceph data is by no means comprehensive. On the other hand, the introduction of telemetry is at least a starting point from which the scope of data can be expanded. So far, it appears data collection is focused on information that will improve the efficiency of Ceph software and/or reduce configuration challenges. But there’s scope for so much more; collecting data about media is one simple example.

Ownership

Going forward, I see three main problems with Open Source telemetry. The first is data ownership. When you sign up for a commercial platform, the vendor contract typically provides for the opt-in collection of data. This information is anonymised and effectively becomes the property of the vendor. They can choose to share the information or use it to develop better products. Here’s NetApp’s policy (PDF) and collection process (PDF).

An excellent example of this is a recent report presented at USENIX FAST 2020. The data covers 1.4 million SSDs in ONTAP systems in the field and was collected by ActiveIQ (the evolution of AutoSupport). This data is a rich source of information on reliability and usage patterns across the estate. We’ll look at the details in another post, but one interesting observation is the endurance rates of drives. Some 99% of SSDs don’t even use up 1% of their PE cycle (endurance) capabilities. Knowing this level of data allows drives to be retained in use for longer, potentially reducing the long-term cost of flash SSDs.
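To show why endurance data matters for drive retention, here is a back-of-envelope projection of how long a drive would take to reach its rated PE cycle limit. It assumes wear continues linearly at the rate seen so far, which is a simplification, and the figures in the example (3,000 rated cycles, 30 used after three years) are invented for illustration rather than taken from the report.

```python
def projected_years_remaining(rated_pe_cycles: int,
                              used_pe_cycles: int,
                              age_years: float) -> float:
    """Years until the rated PE cycle limit is reached, assuming linear wear."""
    if rated_pe_cycles <= 0 or used_pe_cycles < 0 or age_years <= 0:
        raise ValueError("rated cycles and age must be positive, used cycles non-negative")
    if used_pe_cycles == 0:
        return float("inf")  # no wear observed, so no projection is possible
    cycles_per_year = used_pe_cycles / age_years
    remaining_cycles = max(rated_pe_cycles - used_pe_cycles, 0)
    return remaining_cycles / cycles_per_year


# Illustrative drive: 1% of a 3,000-cycle rating consumed after three years.
print(projected_years_remaining(3000, 30, 3.0))  # 297.0
```

A drive wearing that slowly would reach its rated limit long after any realistic service life, which is the kind of evidence needed to justify keeping it in use for longer.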

So, from an ownership perspective, who will own the data for Open Source collection and will we have both the privacy tools and legal frameworks in place to make data collection practical?

Cost

The second challenge is the cost. Collecting and storing data on this scale is hugely expensive. Pure Storage claims to collect over 1 trillion data points per day from their 10,000+ systems in the field (7PB of information). This Tech Field Day video from 2015 provides additional background.
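To put that claim into per-system terms, the quick calculation below divides the quoted daily total across the quoted fleet size. Since the fleet is given as "10,000+", the result is an upper bound on the average rate per system; the variable names are mine.

```python
SECONDS_PER_DAY = 24 * 60 * 60

points_per_day = 1_000_000_000_000  # "over 1 trillion data points per day"
systems = 10_000                    # "10,000+ systems", so a lower bound on fleet size

per_system_per_day = points_per_day / systems
per_system_per_second = per_system_per_day / SECONDS_PER_DAY

print(f"{per_system_per_day:,.0f} data points per system per day")
print(f"about {per_system_per_second:,.0f} data points per system per second")
```

That works out to 100 million data points per system per day, or a little over 1,100 every second from each array.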

While collecting this data is extremely valuable, all vendors will point to cost considerations. They have gone through the process of bringing the data in-house and combining that with a mix of public cloud capabilities. It’s a constant effort to keep the data collection and analysis process efficient.

So, who pays for this with Open Source?

Analysis

The third issue is related to the cost discussion. It’s essential to have good data, but getting value out of that information takes skills and expertise. Storage vendors have teams of data scientists looking at how to understand and exploit the hidden value in AutoSupport-type data. This analysis is generally focused on improving product reliability. However, there are also scenarios where this data can be used to aid the sales process by showing customers their growth patterns in both performance and capacity.

So, who pays for the data scientists in Open Source? Do we make the data freely available (within agreed criteria) for anyone to analyse? Is there value in giving this data back to the community?

One example of where this kind of data sharing has been useful is with the regular reports we see from Backblaze. The company shares reliability statistics on hard drives through quarterly and annual reports. You can listen to a conversation with Andy Klein where we discuss the process on a recent Storage Unpacked podcast.

The Architect’s View

I would like to see some standards developed that could be used by all storage platforms (appliances or software) to deliver anonymised metrics into a central repository. Although the most obvious companies to fund this would be the media vendors, I suspect these companies may not be too keen on exposing any flaws or issues in their products.

Perhaps we need the Open Source storage community to drive this initiative. In whatever way we move forward, detailed metrics will remain relevant because, without an understanding of the hardware components, Open Source storage will always be at a disadvantage to traditional systems.

Are you aware of any other initiatives in place that are aiming to replicate the telemetry process put in place by Ceph? Drop us an email if you know of any other solutions taking this approach. We’d love to update this post and share the information.