Pure Storage Acquires StorReduce

Congratulations to the folks at StorReduce on being acquired by Pure Storage. Although the terms of the deal were not disclosed, hopefully this represents a successful exit for an interesting technology.

Background

I first looked at StorReduce in March 2015, including a trial of the software. The premise of the technology is simple. StorReduce acts as a gateway to the public cloud, exposing an S3 interface to the end user. Behind that interface, the software de-duplicates the data and stores the reduced result on S3. If you’re not aware, AWS S3 doesn’t pass any de-duplication savings on to the customer. Data written to an S3 bucket may be de-duplicated at the back end, but the customer pays for logical consumption. In other words, store the same data twice and you pay twice, even though the content is identical. Store it 100 times and you pay 100 times.
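The gap between logical and physical consumption can be sketched in a few lines of Python. This is a minimal illustration that assumes fixed-size chunks and SHA-256 fingerprints; it is not StorReduce’s actual engine, and the class name `DedupStore`, the chunk size and the sample payload are all hypothetical.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed-size chunking
        self.chunks = {}              # SHA-256 digest -> chunk bytes
        self.objects = {}             # object key -> (size, list of digests)

    def put(self, key, data):
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Only a chunk that has never been seen consumes new space.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.objects[key] = (len(data), digests)

    def get(self, key):
        _, digests = self.objects[key]
        return b"".join(self.chunks[d] for d in digests)

    @property
    def logical_bytes(self):
        # What a service billing on logical consumption would charge for.
        return sum(size for size, _ in self.objects.values())

    @property
    def physical_bytes(self):
        # What actually has to be stored after de-duplication.
        return sum(len(c) for c in self.chunks.values())


# Hypothetical 16 KiB payload built from distinct 32-byte blocks.
payload = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(512))

store = DedupStore()
for n in range(100):
    store.put(f"copy-{n}", payload)

print("logical bytes: ", store.logical_bytes)   # 1638400
print("physical bytes:", store.physical_bytes)  # 16384
```

Writing the same object 100 times grows the logical figure 100-fold, while the physical figure stays at the size of a single copy. A production engine would more likely use variable-size chunking and a persistent index, which this sketch leaves out.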

Costs climb quickly when data is written as part of something like a data protection process. Imagine standard backups that repeat a full copy every week. Although most of the data won’t have changed between one weekly full backup and the next, conventional wisdom is to repeat the full, because restoring from an initial backup plus thousands of incremental fragments would take forever. This is, of course, a throwback to the days of tape. Modern data protection should be able to create synthetic fulls, where a full backup is reconstructed from many incremental backups to look like a full copy. Still, you get the idea of why de-duplication in the cloud is important.
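To put rough numbers on the weekly-full example, here is a back-of-the-envelope calculation. The 10 TB full size, 52-week retention and 2% weekly change rate are assumptions chosen purely for illustration, and the model assumes each week’s changed data is entirely new and that there is no duplication within a single full.

```python
def weekly_full_savings(full_tb, weeks, weekly_change_rate):
    """Compare the logical size of repeated fulls with the unique data behind them."""
    logical_tb = full_tb * weeks
    # The first full is all new data; each later full adds only the changed fraction.
    unique_tb = full_tb + full_tb * weekly_change_rate * (weeks - 1)
    savings = 1 - unique_tb / logical_tb
    return logical_tb, unique_tb, savings


# Hypothetical inputs: 10 TB fulls, kept for a year, 2% of data changing per week.
logical, unique, savings = weekly_full_savings(full_tb=10, weeks=52, weekly_change_rate=0.02)
print(f"logical {logical:.1f} TB, unique {unique:.1f} TB, savings {savings:.1%}")
# logical 520.0 TB, unique 20.2 TB, savings 96.1%
```

Under these assumptions the customer is billed for 520 TB of logical capacity to protect roughly 20 TB of unique data. Real change rates and retention periods vary widely, so the result is only indicative.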

Data protection is an easy example to understand; however, the real benefits of de-duplication occur at scale. In reality, savings of 90-95% come from huge volumes of similar or related data. It’s unlikely that this will simply be backup data; more likely it will be a range of content including analytics, archives and machine-created information.

That raises the question of what Pure will use the technology for.

Implementation

The value of the StorReduce technology is in the performance and efficiency of the de-duplication engine. The ability to successfully emulate the S3 API comes as a side benefit. An original StorReduce press release claimed a single EC2 instance of the software could process up to 600 MB/s of read or write traffic and manage 10 PB of capacity. The de-duplication savings quoted ranged from 50% to 95%. This is a wide variation, but it obviously depends on the type of data being stored.

So what would Pure Storage use the technology for? Today, FlashBlade offers on-premises scale-out file and object storage. The first, most obvious idea would be to extend the on-premises namespace to the public cloud and offload data into AWS S3. This could create tiers of active and inactive data, with inactive content stored efficiently in the public cloud. De-duplication in this instance wouldn’t simply be about saving capacity, but also about the speed of read/write from the cloud and minimising egress traffic.

A second option would be for Pure to offer its own SaaS platform based on EC2, automating the offload process and storing customer data in the public cloud. This one is perhaps less likely, as many customers would expect data encryption to be in place if the SaaS solution were multi-tenanted. Encryption immediately defeats most of the savings from de-duplication, because identical data encrypted under different keys no longer looks identical.
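The effect of encryption on de-duplication is easy to demonstrate. The sketch below assumes each tenant encrypts with its own key before the data reaches the de-duplication engine. The `toy_encrypt` function is a stand-in built from SHA-256 purely so the example runs with the standard library; it is not a secure cipher, and the tenant keys and sample data are hypothetical.

```python
import hashlib


def toy_encrypt(key, data):
    """XOR data with a SHA-256-derived keystream. Illustration only, NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(p ^ k for p, k in zip(data[offset:offset + 32], pad))
    return bytes(out)


def unique_chunks(blobs):
    """Count distinct blobs by fingerprint, as a de-duplication engine would."""
    return len({hashlib.sha256(b).digest() for b in blobs})


# Five hypothetical tenants store exactly the same chunk of data.
chunk = b"identical customer data " * 128
tenant_keys = [hashlib.sha256(f"tenant-{i}".encode()).digest() for i in range(5)]

plain_copies = [chunk for _ in tenant_keys]
encrypted_copies = [toy_encrypt(key, chunk) for key in tenant_keys]

print("unique chunks, unencrypted:", unique_chunks(plain_copies))
print("unique chunks, encrypted:  ", unique_chunks(encrypted_copies))
```

Unencrypted, the five copies collapse to a single stored chunk. Once each tenant applies its own key, every copy has a different fingerprint and nothing can be de-duplicated across tenants.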

A third option could be to de-duplicate customer data within FlashBlade. This seems less likely as Pure’s business is all about selling capacity.

Finally, a fourth option could be to extend cloud offload for array-based snapshots from FlashArray. This already exists with a product called CloudSnap. Adding StorReduce could make this service cheaper and more efficient. This would be in line with the offerings from Dell EMC and HPE (3PAR) and further strengthen Pure’s recent enterprise maturity move.

The Architect’s View

Whatever the plan, it’s good to see interesting technology being put to use and I look forward to seeing what happens. On a related note, we have a Storage Unpacked podcast in the works that talks about managing large volumes of unstructured data. You’ll also find links here to other content that talks about some of the S3 API issues and the problems of managing unstructured data.
