The $2 billion Gamble on Data Management

What a week it’s been in the data storage industry. First, Rubrik Inc announced a Series E round of funding totalling $261 million. This was followed barely 24 hours later by Veeam Software announcing it had raised $500 million, in what seems to be regarded as an unexpected move. Across Actifio, Cohesity, Rubrik and Veeam, venture capitalists have invested close to $2 billion. This is one massive gamble on data management, but is it money well spent?

Data Protection Origins

To date, the funding for each of the four companies I’ve mentioned is as follows:

Actifio – $352.5 million
Cohesity – $410 million
Rubrik – $533 million
Veeam – $500 million (plus initial self-funding)

This is a total just shy of $1.8 billion. Now clearly, there are more data management companies than just these four. However, the reason they are interesting as a group is the source of their data. All of these companies provided copy data management or data protection functionality first and then moved to a wider data management remit. In the case of the Cohesity/Rubrik/Veeam triumvirate, the focus was even narrower, in that the data initially came from the backup of virtual machines.
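As a quick check on the total quoted above, the four funding figures can be summed in a couple of lines. This is only a throwaway sketch; the figures are the ones listed in this post, and the variable names are my own.

```python
# Funding per company in millions of US dollars, as listed above.
funding_musd = {
    "Actifio": 352.5,
    "Cohesity": 410.0,
    "Rubrik": 533.0,
    "Veeam": 500.0,  # excludes the initial self-funding
}

total_musd = sum(funding_musd.values())
print(f"Total: ${total_musd:,.1f} million")
```

The sum comes to $1,795.5 million, which is where the "just shy of $1.8 billion" figure comes from.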

Data Sources

Why is this a source of potential problems? First of all, let’s look at the initial premise of these companies. Providing a better way to do backup and being able to optimise the data is a “good thing”. The process by which Rubrik, for example, streamlined backup operations through policy and automation was much more sensible than the historical process of building schedules and trying to fit them into a timeline. There’s also much more functionality being offered here through copy data management (minimising secondary copies of data) and in new features like instant restores.

So focusing on data protection alone, these vendors offered great advances over traditional offerings, neatly packaged as an appliance. As a result, they’ve been “kicking butt” in the data protection space.

Metadata

What’s being touted as the next generation of value comes from these data management companies having access to the metadata describing the data being protected. As we all know, metadata is data about data. In this context, the question we have to ask is how far that metadata descends into understanding the content being protected. To explain this, let’s use an example.

VM Backup

Imagine we are backing up virtual machines. Most VMs are self-contained, in that the data, application and operating system all sit within the same volumes or LUNs. If the data is in a structured format, then the VM will have a database installed. From a data mining perspective, we can pretty much throw away all the content of the VM except what’s in the database. This is where the true value of the VM’s content lies.

Now work with me for a moment. I’m sure some readers are thinking – hold on, isn’t there value in the metadata of the VM itself? Yes, there is, but only at a very basic level. So, for example, knowing the data is protected or knowing the content holds personal information is valuable. It allows IT organisations to certify that the data is being managed in a compliant way, but it doesn’t speak to the value of the data in the database.

Content is King

The true value of data is the content itself. Value comes from being able to bring together multiple datasets and derive additional insights from the information as a whole. So, specifically in this instance, this means bringing together both unstructured and structured content. How is this being done? Getting into structured data requires a number of processes:

Identifying the database itself – not too hard, but does require knowing software versions and data formats.
Accessing the data – a bit harder – databases use well-described file formats, so this is possible.
Indexing and storing – a little more tricky, the schema of the data needs to be understood.

Let’s not forget that databases can be encrypted, so the data management solution would need encryption keys. Remember also that the data is being backed up regularly, so it will represent a time series of the actual content. To achieve real indexing of the data, these solution providers will need to implement some form of ETL process, akin to building a traditional data warehouse.
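To make the three steps above a little more concrete, here is a minimal sketch of what an ETL pass over a series of backups might look like. It is purely illustrative and not how any of the vendors mentioned actually work: it assumes the database files have already been pulled out of the VM images, uses SQLite as a stand-in for a real database engine, ignores encryption, and every function, table name and date in it is hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def is_sqlite_file(path: Path) -> bool:
    """Step 1: identify a database by its file header."""
    with path.open("rb") as fh:
        return fh.read(16) == SQLITE_MAGIC


def extract_rows(db_path: Path, table: str) -> list:
    """Step 2: access the data. The table and column names must be known."""
    con = sqlite3.connect(db_path)
    try:
        # Table name is interpolated for brevity; fine for a sketch only.
        return con.execute(f"SELECT id, name FROM {table} ORDER BY id").fetchall()
    finally:
        con.close()


def load_snapshot(warehouse: sqlite3.Connection, snapshot_date: str, rows: list) -> None:
    """Step 3: index and store rows keyed by backup date (a time series)."""
    warehouse.executemany(
        "INSERT INTO customer_history (snapshot_date, id, name) VALUES (?, ?, ?)",
        [(snapshot_date, row[0], row[1]) for row in rows],
    )
    warehouse.commit()


def make_fake_backup(path: Path, names: list) -> None:
    """Create a toy database standing in for one recovered from a VM image."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    con.executemany("INSERT INTO customers (name) VALUES (?)", [(n,) for n in names])
    con.commit()
    con.close()


def run_demo() -> list:
    """Load two hypothetical nightly backups and count rows per snapshot."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE customer_history (snapshot_date TEXT, id INTEGER, name TEXT)"
    )
    backups = {"2019-01-14": ["Alice"], "2019-01-15": ["Alice", "Bob"]}
    with tempfile.TemporaryDirectory() as tmp:
        for snapshot_date, names in backups.items():
            db_path = Path(tmp) / f"{snapshot_date}.db"
            make_fake_backup(db_path, names)
            if is_sqlite_file(db_path):
                rows = extract_rows(db_path, "customers")
                load_snapshot(warehouse, snapshot_date, rows)
    return warehouse.execute(
        "SELECT snapshot_date, COUNT(*) FROM customer_history "
        "GROUP BY snapshot_date ORDER BY snapshot_date"
    ).fetchall()


if __name__ == "__main__":
    print(run_demo())
```

Even in this toy form, the extraction step only works because the schema (a `customers` table with `id` and `name` columns) is known in advance, which is exactly the part that gets harder across many applications and software versions.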

Obviously, at the moment, this is not what’s being built. The data is sitting in the original format (VM image, files) and being de-duplicated for efficiency. One last comment on this point: I know most or all of the vendors listed can support databases natively, so they wouldn’t have to go through the “unpacking” process of accessing the data via the file system. However, many IT organisations prefer the simplicity of VM backup in the first place.

OK, so we have our structured data. What about unstructured data protection? As far as I am aware, none of the companies discussed here offer unstructured backup at large scale, although Cohesity does have a scale-out storage offering.

Data Asset Management

I would suggest that businesses with very large amounts of data will already be running data warehouses, data lakes and other processes to analyse their historical content. Without a way to bring together all of this structured and unstructured content in one place, are we not just delivering data asset management?

We discussed the idea of storage management and data asset management on a recent Storage Unpacked podcast. Prior to that, we also discussed the process of managing unstructured data and how we need some better definitions and tools. It’s worth listening to both podcasts as they help frame the discussion on where these companies are headed.

The Architect’s View®

So far at least, what I see from the gang of four listed above is actually data asset management, rather than full data analytics. This, in itself, isn’t a bad thing. With data becoming so valuable (and being core to almost every business), we need to ensure continuous availability and meet the compliance standards expected of enterprises. At the same time, backup sprawl needs to be avoided at all costs because of the risk and cost it represents.

However, at the moment, I don’t see any value being offered past this point. There’s no content mining being discussed. These vendors are not bringing together multiple data sources and creating new insights in the way that businesses like Hitachi Vantara are with Pentaho, for example.

The question is, do the VCs know more than we do? Is the promise of data mining just around the corner and is that the justification for spending $2 billion of their hard-earned cash? Or, is the data asset management market deemed valuable enough?