Data management has had a new lease of life, driven by a focus on data protection. New start-ups have introduced products that exploit “secondary data” as a source for seeding test/development environments and other non-production uses.
In the past (see podcast below), we’ve questioned whether these solutions are delivering true data management value, or simply providing better data asset management. With the introduction of the Cohesity MarketPlace, are we seeing the next step towards exploiting the value of data, rather than just managing it more efficiently?
Storage vs Data Management
The difference between storage and data management is an important one. Historically, storage administrators have worked to provide reliable, predictable access to block, file and object storage. This work focused not on the content, but simply on the infrastructure.
Secondary data management products have had a similar focus. These solutions move a little further up the stack, consolidating similar data types (so-called copy data management) and offering a more flexible approach to data reuse.
- Exploiting secondary data with NDAS from NetApp
- The $2 billion Gamble on Data Management
- The Britpop Battle of Rubrik and Cohesity
There’s no doubting these systems have value. After all, the investment community wouldn’t have gambled over $2 billion on them otherwise. However, even now, the solutions in the market haven’t actually exploited the value of the data by implementing add-on services. That is, until Cohesity introduced MarketPlace.
MarketPlace allows customers to run AI/ML, analytics or other data-intensive applications against content stored in the Cohesity DataPlatform. At a basic level, we can see why this could be beneficial. Simple functions could include virus and malware detection, identifying compliance-managed data or spotting potential ransomware.
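To give a flavour of what a simple scan of this kind might involve, here is a minimal sketch of one well-known ransomware heuristic: flagging content whose byte entropy is near-random, as encrypted data tends to be. This is purely illustrative; the threshold and approach are assumptions, not how Cohesity’s applications actually work.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_encrypted(data: bytes, threshold: float = 7.5) -> bool:
    """Encrypted or compressed content sits close to 8 bits/byte."""
    return shannon_entropy(data) >= threshold

# Repetitive plain text scores low; a uniform spread of bytes scores high.
print(looks_encrypted(b"hello world " * 100))   # False: plain text repeats
print(looks_encrypted(bytes(range(256)) * 16))  # True: uniform byte spread
```

A real scanner would combine this with file-type checks (compressed archives are also high-entropy) and change-rate analysis across backups, but the principle is the same.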
In practice though, the opportunity for MarketPlace applications is much greater than executing tasks that could have been performed on the primary copy of data in the first place. Instead, DataPlatform and other solutions like it should provide a single copy of data against which more complex searches can be performed. By bringing data together in one place, the benefits of searchability should be greatly improved (more on this later).
Can processing secondary data be genuinely useful? Think for a moment about what might be in a secondary data platform. Solutions from Rubrik and Cohesity initially focused on backing up virtual machines. VMs typically run traditional databases like SQL Server and manage structured content. There may be some unstructured content too, but it’s unlikely to sit within the virtual machine itself; more likely it lives on a connected file system or object store. The wrapper of the VM is useless for data analysis and may as well be excluded from any processing.
For businesses to see any value, is access to this relatively structured data going to be enough? In the specific case of Cohesity, the DataPlatform also offers the ability to store unstructured content, so there is the ability to bring in more data sources. However, I don’t think a single secondary data platform alone will be enough for organisations with large volumes of data. There will need to be some process for getting data from elsewhere in the organisation.
You can learn a little more about what Cohesity is thinking in this recent podcast episode with Rawlinson Rivera, Field CTO at Cohesity.
Let’s also stop for a moment and think about the complexity of processing structured data stored on a secondary platform. Imagine a single SQL Server (or any SQL) database that is constantly gaining new records while both logically and physically deleting others. As each backup is taken, which version of that database should we include in searches? The latest one? Should we include older versions in order to capture deleted data? How can we identify logically deleted records that sit within the database and perhaps shouldn’t be included in a search? GDPR, anyone?
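To make the logical-deletion problem concrete, here is a minimal sketch using SQLite, assuming the common soft-delete convention of a `deleted_at` column. This is an assumption for illustration; every application encodes deletion differently, which is exactly why a generic search over backup data is hard.

```python
import sqlite3

# Hypothetical schema: many applications mark rows as logically deleted
# with a "deleted_at" timestamp rather than physically removing them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, deleted_at TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Alice", None),
    (2, "Bob", "2019-03-01T10:00:00"),  # logically deleted, still on disk
    (3, "Carol", None),
])

# A naive search over the backed-up copy sees all three rows...
naive = conn.execute("SELECT name FROM customers").fetchall()

# ...while a schema-aware search must filter out soft-deleted records,
# or it resurfaces data the business considers erased (a GDPR concern).
aware = conn.execute(
    "SELECT name FROM customers WHERE deleted_at IS NULL"
).fetchall()

print(len(naive), len(aware))  # 3 2
```

A secondary platform scanning thousands of databases would need this kind of per-application knowledge for each one, which is the crux of the problem.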
The traditional solution has been to use ETL (Extract, Transform, Load) processes to take data from existing platforms and use only the data needed for analytics. Solutions like Hitachi Vantara’s Pentaho platform provide the ability to aggregate data in this way from multiple disparate data sources. How exactly will the likes of Cohesity perform this task in MarketPlace applications?
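The ETL pattern itself can be sketched in a few lines of Python. The sources, field names and merge key below are invented for illustration and say nothing about how Pentaho (or a MarketPlace application) actually implements this.

```python
import sqlite3

# Extract: toy records standing in for two disparate sources,
# e.g. a CRM export and a billing system (hypothetical schemas).
crm_rows = [{"customer": "Alice", "region": "emea"},
            {"customer": "Bob", "region": "apac"}]
billing_rows = [{"cust_name": "alice", "spend": "120.50"},
                {"cust_name": "bob", "spend": "80.00"}]

def transform(row: dict, source: str) -> dict:
    # Transform: normalise differing schemas into one common shape.
    if source == "crm":
        return {"name": row["customer"].lower(),
                "region": row["region"], "spend": None}
    return {"name": row["cust_name"].lower(),
            "region": None, "spend": float(row["spend"])}

# Load: merge into a single analytics table on the common key.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE analytics (name TEXT PRIMARY KEY, region TEXT, spend REAL)")
for row in crm_rows:
    t = transform(row, "crm")
    db.execute("INSERT INTO analytics (name, region) VALUES (?, ?)",
               (t["name"], t["region"]))
for row in billing_rows:
    t = transform(row, "billing")
    db.execute("UPDATE analytics SET spend = ? WHERE name = ?",
               (t["spend"], t["name"]))

print(db.execute("SELECT name, region, spend FROM analytics ORDER BY name").fetchall())
# [('alice', 'emea', 120.5), ('bob', 'apac', 80.0)]
```

Even in this toy form, the transform step requires knowing each source’s schema in advance, which is precisely what a generic secondary data platform lacks.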
There’s a lot to overcome in order to make use of secondary data. There’s also a question as to whether the secondary platform should even be the right place to perform this type of task. After all, secondary platforms are limited in scale and specifically designed for processing backup data. While there might be some “spare cycles” when other work could be done, it doesn’t seem logical that this will be enough to perform complex analytics work.
It also doesn’t make sense to build out a secondary platform just for analytics work when a dedicated cluster could be built instead. In fact, it would make more sense to push secondary data to the public cloud, where compute is cheap and can be purchased on demand, rather than sizing on-premises hardware to meet a high-water mark of application performance.
Perhaps, though, there’s a more long-term strategy in play here. DataPlatform already supports the public cloud as a deployment location. Cohesity could have a long-term strategy that will use the public cloud for more demanding analytics workloads and keep the simple stuff on-premises. This hasn’t been announced and is pure speculation but does seem to be a logical progression.
The Architect’s View
I’m yet to be convinced that data protection solutions can be transformed into effective data warehouses. Basic ETL processing seems to be missing from the equation, as does access to a large volume of unstructured data.
Looking at how Cohesity MarketPlace has been implemented, third parties will be able to write to a standard set of APIs that presumably expose data to applications, with a framework in place to ensure that resources aren’t swallowed up by poorly written applications.
But without the ability to run in the public cloud, on-premises analytics applications look set to be severely limited in scope. There’s definite promise here, but I think there’s still a lot of work needed to deliver on it. This video from Storage Field Day 18 provides an introduction to MarketPlace.
Post #44ad. Copyright (c) 2019 Brookend Ltd. No reproduction in whole or part without permission.