Chris Mellor recently speculated on Compellent and the additional horsepower of their new storage product releases – series 40 controllers with Storage Center 5.4. Over dinner last night, we discussed how various features could be integrated into the current architecture. Many of them are surprisingly easy to achieve (or so we believe). One feature under the spotlight was primary data de-duplication. Here’s how it could be done.
What Primary De-Duplication Is
De-duplication has typically been used to reduce the size of infrequently accessed archive data in dedicated archive appliances. Blocks of identical data are "deduped" by removing them and only retaining pointers to a single physical copy on disk. De-duplication appliances benefit from storing lots of identical (typically read-only) data, such as that generated by backup processes. However, de-duplication is also valuable in other environments where data is duplicated, such as email archives and virtual environments.
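To make that concrete, here is a minimal Python sketch of content-addressed block storage, where identical blocks are kept once and everything else is just a pointer. The block size, class name and methods are illustrative assumptions for the example, not anything a real array exposes; a production implementation would also keep reference counts so a physical copy is only freed when its last pointer disappears.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size; real arrays vary


class DedupStore:
    """Toy content-addressed block store: identical blocks are stored once,
    and each logical block is simply a pointer (hash) to the physical copy."""

    def __init__(self):
        self.physical = {}   # content hash -> block bytes (single physical copy)
        self.pointers = []   # logical block index -> content hash

    def write(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        # Only store the bytes if this content has never been seen before.
        self.physical.setdefault(digest, block)
        self.pointers.append(digest)
        return digest

    def read(self, index: int) -> bytes:
        return self.physical[self.pointers[index]]


# Two identical 4 KiB blocks consume one physical copy.
store = DedupStore()
store.write(b"A" * BLOCK_SIZE)
store.write(b"A" * BLOCK_SIZE)
store.write(b"B" * BLOCK_SIZE)
print(len(store.pointers), "logical blocks,", len(store.physical), "physical blocks")
# -> 3 logical blocks, 2 physical blocks
```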
How Compellent Could Do It
There are a number of features of the Compellent architecture that could enable de-duplication, including:
- Snapshot Support – Storage Center already supports snapshots. This functionality creates point-in-time images of LUNs, retaining only pointers to shared blocks of data, much as de-duplication does.
- Metadata – The architecture already retains metadata on referencing LUNs, I/O activity and so on. It wouldn’t be difficult to extend that to include a unique hash code per block.
- Write New – All changed blocks are written as new blocks of data. Old data is simply invalidated unless it forms part of a snapshot. Therefore, if a block of data is referenced by multiple LUNs and any one LUN is updated, the changed data would be written as a new block and the old block retained for the other LUN references. Over time the level of de-duplication would therefore erode until the next de-duplication pass runs.
- Background Processing – The existing storage controllers already run scheduled tasks to manage Data Progression, moving blocks between tiers of storage based on historical usage patterns. It would be simple to add another task that scans for and consolidates blocks with identical hash codes (a sketch of such a task follows this list).
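Putting the last three points together, a post-process de-duplication task might look something like the sketch below. The data structures (a per-LUN block map and a physical block store) and function names are hypothetical stand-ins; a real controller would operate on on-disk metadata rather than Python dictionaries.

```python
import hashlib
from collections import defaultdict


def background_dedupe(block_map, block_store):
    """Hypothetical scheduled task: group physical blocks by content hash,
    repoint every LUN reference at one surviving copy, and free the rest.

    block_map:   {(lun_id, lba): physical_block_id}
    block_store: {physical_block_id: bytes}
    """
    by_hash = defaultdict(list)
    for phys_id, data in block_store.items():
        by_hash[hashlib.sha256(data).hexdigest()].append(phys_id)

    for phys_ids in by_hash.values():
        keep, *duplicates = phys_ids
        if not duplicates:
            continue
        dup_set = set(duplicates)
        # Repoint every LUN reference from a duplicate to the kept copy...
        for key, phys_id in block_map.items():
            if phys_id in dup_set:
                block_map[key] = keep
        # ...then release the now-unreferenced physical blocks.
        for phys_id in duplicates:
            del block_store[phys_id]


def write_new(block_map, block_store, lun_id, lba, data, new_block_id):
    """Write-new semantics: an update always lands in a fresh physical block,
    so other LUNs still pointing at the old block are unaffected."""
    block_store[new_block_id] = data
    block_map[(lun_id, lba)] = new_block_id


# Hypothetical usage: two LUNs hold identical blocks; the scan merges them.
store = {1: b"A" * 4096, 2: b"A" * 4096, 3: b"B" * 4096}
refs = {("lun0", 0): 1, ("lun1", 0): 2, ("lun1", 1): 3}
background_dedupe(refs, store)
print(refs)   # both LUNs now point at physical block 1
print(store)  # physical block 2 has been freed
```

A real implementation would also verify byte-for-byte equality before merging, since a hash collision, however unlikely, must never be allowed to corrupt primary data.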
De-duplicating primary data has certain risks. There is, for example, a risk of creating hot spots of data access, where shared blocks become heavily accessed. This can occur in virtual desktop (VDI) implementations, for example. NetApp introduced PAM (Performance Acceleration Module) cards to get around this kind of problem, as their architecture isn't capable of the granular (i.e. block-level) data placement required to address it. The Compellent architecture can do this already by promoting "hot blocks" to a faster tier of storage within a single LUN (a rough sketch of that kind of per-block promotion follows). This ability is a key differentiator over other de-duplication implementations and would make Compellent hardware well suited to primary de-duplication.
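As an illustration of block-level promotion, the sketch below counts reads per physical block and bumps hot blocks up a tier during a scheduled rebalance. The tier names and threshold are assumptions made up for the example, not Compellent's actual Data Progression policy.

```python
from collections import Counter

TIERS = ["ssd", "fast_hdd", "capacity_hdd"]  # hypothetical tiers, fastest first
PROMOTE_THRESHOLD = 1000                     # hypothetical reads per scan interval


class TieredBlocks:
    """Toy model of per-block tiering: heavily read (possibly shared) blocks
    are promoted to a faster tier, regardless of which LUN touches them."""

    def __init__(self):
        self.tier = {}        # block_id -> tier name
        self.reads = Counter()

    def read(self, block_id):
        self.reads[block_id] += 1

    def rebalance(self):
        # Scheduled task: promote each hot block one tier, then reset counters.
        for block_id, count in self.reads.items():
            if count >= PROMOTE_THRESHOLD:
                current = TIERS.index(self.tier.get(block_id, "capacity_hdd"))
                self.tier[block_id] = TIERS[max(current - 1, 0)]
        self.reads.clear()
```

Because placement is tracked per block rather than per LUN, a de-duplicated block shared by hundreds of virtual desktops could sit on SSD while the rest of each desktop image stays on capacity disk.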
Of course, all this discussion is pure speculation, as I have no prior knowledge of Compellent's roadmap or future strategy. It is fun to try to second-guess things, though, and you never know, maybe this will become a feature in the future.