
Compellent: New Features Speculation


Chris Mellor recently speculated on Compellent and the additional horsepower of their new storage product releases: Series 40 controllers running Storage Center 5.4.  Over dinner last night, we discussed how various features could be integrated into the current architecture.  Many of them are surprisingly easy to achieve (or so we believe).  One feature under the spotlight was primary data de-duplication.  Here’s how it could be done.

What Primary De-Duplication is

De-duplication has typically been used to reduce the size of infrequently accessed archive data in dedicated archive appliances.  Blocks of identical data are “deduped” by removing them and retaining only pointers to a single physical copy on disk.  De-duplication appliances benefit from storing lots of identical (typically read-only) data, such as that generated by backup applications.  However, de-duplication is also valuable in other environments where data can be duplicated, such as email archives and virtual environments.
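A minimal Python sketch of this pointer mechanism (purely illustrative, using SHA-256 as the block hash; nothing here reflects any real appliance’s implementation):

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical blocks are stored once."""
    def __init__(self):
        self.blocks = {}    # hash -> single physical copy of the block
        self.pointers = []  # logical block order -> hash (the "pointers")

    def write(self, block: bytes) -> None:
        h = hashlib.sha256(block).hexdigest()
        self.blocks.setdefault(h, block)  # keep only the first copy seen
        self.pointers.append(h)

    def read(self, i: int) -> bytes:
        return self.blocks[self.pointers[i]]

store = DedupStore()
for b in (b"A" * 4096, b"B" * 4096, b"A" * 4096):
    store.write(b)
# three logical blocks, but only two physical copies on "disk"
assert len(store.pointers) == 3 and len(store.blocks) == 2
```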

How Compellent Could Do It

There are a number of features of the Compellent architecture that could enable de-duplication including:

  • Snapshot Support – Storage Center already supports snapshots. This functionality creates point-in-time images of LUNs, retaining only pointers to shared blocks of data, much as de-duplication does.
  • Metadata – The architecture already retains metadata on referencing LUNs, I/O activity and so on.  It wouldn’t be difficult to extend that to include a unique hash code per block.
  • Write New – All changed blocks are written as new blocks of data.  Old data is simply invalidated unless it forms part of a snapshot.  Therefore, if a block of data is referenced by multiple LUNs and any LUN is updated, the changed data would be re-written as a new block and the old block retained for the other LUN references.  Over time the level of de-duplication would decrease.
  • Background Processing – The existing storage controllers already run scheduled tasks to manage Data Progression, moving blocks between tiers of storage as dictated by historical usage patterns.  It would be simple to add another task to scan for and consolidate blocks with identical hash codes.
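Taken together, these features suggest a post-process pass rather than inline dedup: a scheduled task hashes blocks and collapses duplicates into a single physical copy.  A minimal Python sketch (all names invented for illustration; this is speculation, not Storage Center internals):

```python
import hashlib
from collections import defaultdict

def dedup_scan(blocks):
    """One background pass: group blocks by hash, keep one physical
    copy per group, and return a remap so all references share it."""
    by_hash = defaultdict(list)
    for block_id, data in blocks.items():
        by_hash[hashlib.sha256(data).hexdigest()].append(block_id)
    remap = {}
    for ids in by_hash.values():
        keeper = min(ids)        # arbitrarily keep the lowest-numbered block
        for dup in ids:
            remap[dup] = keeper  # duplicates now point at the keeper
    return remap

blocks = {0: b"x" * 512, 1: b"y" * 512, 2: b"x" * 512}
remap = dedup_scan(blocks)
assert remap == {0: 0, 1: 1, 2: 0}  # block 2 collapsed onto block 0
```

In a write-new architecture, an update to block 2 would simply allocate a fresh block and drop its remap entry, leaving block 0 intact for any other references.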

De-duplicating primary data has certain risks.  There is, for example, a risk of creating hot-spots of data access, where shared blocks become heavily accessed; this can occur in Virtual Desktop implementations.  NetApp introduced PAM (Performance Acceleration Module) cards to get around this kind of problem, as their architecture isn’t capable of the granular (i.e. block-level) data placement required to overcome it.  The Compellent architecture can do this already by promoting “hot blocks” to a faster tier of storage within a single LUN.  This ability is a key differentiator over other de-duplication implementations and would make Compellent hardware suitable for primary de-duplication.
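The “hot block” promotion described above amounts to counting accesses per block and moving the busiest to a faster tier.  A hedged sketch (the access log, threshold and function name are all invented for the example):

```python
from collections import Counter

def promote_hot_blocks(access_log, threshold):
    """Return the set of block ids accessed more than `threshold`
    times; a tiering task would move these to faster storage."""
    counts = Counter(access_log)
    return {blk for blk, n in counts.items() if n > threshold}

# shared block 7 is hammered by many VDI clients; blocks 1 and 2 are cool
hot = promote_hot_blocks([7, 7, 7, 7, 1, 2, 7], threshold=3)
assert hot == {7}
```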

Of course, all this discussion is pure speculation, as I have no prior knowledge of Compellent’s roadmap or future strategy.  It is fun to try and second-guess things, though, and you never know: maybe this will become a feature in the future.

About Chris M Evans

Chris M Evans has worked in the technology industry since 1987, starting as a systems programmer on the IBM mainframe platform, while retaining an interest in storage. After working abroad, he co-founded an Internet-based music distribution company during the .com era, returning to consultancy in the new millennium. In 2009 Chris co-founded Langton Blue Ltd (www.langtonblue.com), a boutique consultancy firm focused on delivering business benefit through efficient technology deployments. Chris writes a popular blog at http://blog.architecting.it, attends many conferences and invitation-only events and can be found providing regular industry contributions through Twitter (@chrismevans) and other social media outlets.

  • Ernst Lopes Cardozo (http://www.aranea.nl)

    Hi Chris,
    What you describe is necessary for de-duplication, but not sufficient. The essence of dedup is that the hash code for each new block has to be compared to the hash codes of all the existing blocks to check whether it is a duplicate. However, the hash codes for subsequent blocks are completely unrelated; there is no locality in the hash table, so the best one can do is keep a table of sorted hash codes in memory. We have two choices. Either we use a long hash (e.g. 256 bits), where the chance that two different blocks have the same hash code is very small (around 10E-70), and we end up with table entries of about 200-300 bytes per block. Or we use a shorter and thus weaker hash code, so that the chance of a false match is 1% or more; now the hash table is smaller, but we have to read the suspected duplicate block and do a byte-by-byte comparison to make sure it actually is a duplicate.
    The size of the dedup table (with the hash codes plus pointers) is a critical issue: if the block size is 4KB, 100TB of storage means 2.6E+10 entries. At 250 bytes per entry, that is 6.1TB of RAM – a bit impractical. We can choose a larger block size, e.g. 512KB (reducing the de-duplication factor), and still have a table of 48.8GB.
    Deduplication is easy if done on a relatively small amount of data, but of course, the benefits are equally smaller.
    For system images in a virtual machine environment, a much simpler method is to clone a master image and create any number of writable snapshots: these all share the common blocks and have private copies of the changed blocks: configuration parameters, swap and temp files, etc.
    The hot spot “problem” you describe is actually the best part of cloning or de-duplication: with N virtual machines accessing the same blocks, we can be sure these will reside in cache and be served much faster than if each machine had a private copy that had to be read from disk. Call it cache-deduplication if you want.
    For real time (on-the-fly) dedup, take a look at ZFS – you don’t have to speculate since you can read the code as it is open source.
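Ernst’s table-sizing arithmetic can be checked directly; his figures line up if read in binary units (TiB/GiB) with 250-byte entries:

```python
TIB, GIB = 2**40, 2**30
capacity = 100 * TIB      # the 100TB array from the comment above
entry_bytes = 250         # hash code plus pointers per table entry

entries_4k = capacity // (4 * 2**10)      # 4KB blocks
entries_512k = capacity // (512 * 2**10)  # 512KB blocks

assert entries_4k == 26_843_545_600                        # ~2.6E+10 entries
assert round(entries_4k * entry_bytes / TIB, 1) == 6.1     # ~6.1TB of RAM
assert round(entries_512k * entry_bytes / GIB, 1) == 48.8  # ~48.8GB table
```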



