This is one of a series of posts discussing the new features in Windows Server 2012, now shipping and previously in public beta as Windows Server 8. You can find references to other related posts at the end of this article. This post reviews the storage feature – data de-duplication.
Everyone wants to reduce the amount of primary storage they consume and keep growth as low as possible. There are many space reduction technologies that can be used and one of the most beneficial is de-duplication. In a nutshell, de-duplication removes multiple physical copies of identical pieces of data on disk and replaces them with pointers to a single copy of the content. A read request to any “logical” copy of the data will access the single physical copy. If any of the logical copies are updated, the update is written as new data (leaving the shared copy intact for the other logical copies) and the pointers are updated accordingly.
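To make the pointer idea concrete, here is a minimal sketch in Python. It is a toy model rather than how Windows implements the feature: the `DedupStore` class, the tiny fixed 4-byte chunk size and the use of SHA-256 digests as pointers are all assumptions chosen for illustration.

```python
import hashlib


class DedupStore:
    """Toy chunk store: identical chunks are kept once and shared by reference."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes (the single physical copy)
        self.files = {}   # file name -> list of digests (the "pointers")

    def _split(self, data):
        size = self.chunk_size
        return [data[i:i + size] for i in range(0, len(data), size)]

    def write(self, name, data):
        pointers = []
        for chunk in self._split(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only if not seen before
            pointers.append(digest)
        self.files[name] = pointers

    def read(self, name):
        # A read follows the pointers back to the shared physical chunks.
        return b"".join(self.chunks[d] for d in self.files[name])

    def physical_size(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupStore(chunk_size=4)
store.write("a.txt", b"AAAABBBBCCCC")
store.write("b.txt", b"AAAABBBBCCCC")  # logical copy: no new physical chunks
store.write("b.txt", b"AAAAXXXXCCCC")  # update: only the changed chunk is new data
print(store.read("a.txt"), store.read("b.txt"), store.physical_size())
```

The update to `b.txt` adds one new chunk and repoints that file, while `a.txt` still reads its original content from the untouched shared chunks. The sketch never reclaims chunks that are no longer referenced; a real system needs some form of garbage collection for that.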
Windows Server 2012 introduces the use of de-duplication on primary storage. The feature is implemented as a post-processing task, meaning the I/O path to disk isn’t interrupted to de-duplicate the write, but rather the de-duplication process is done as a scheduled background task. There are pros and cons to both approaches. De-duplicating inline takes additional processor and memory resources and could add to response times. Post-processing takes more disk space as new writes are not initially de-duplicated.
Microsoft have spent time developing their own IP with respect to de-duplication. Their technology, known as ChunkStash, uses a variable chunk size to de-duplicate data at the block level. Microsoft had initially determined that file-level de-duplication wouldn’t provide sufficient savings. Whilst de-duplication technology is nothing new, the benefit here is that it’s integrated into the filesystem, so it can be used on any underlying storage, whether that be internal DAS or an external appliance or array. In addition, the metadata for de-duplication is retained within the NTFS structure, so if a logical volume is moved to another host, the data can still be accessed.
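As a rough illustration of what variable chunk size means, the sketch below splits data at positions chosen by the content itself rather than at fixed offsets. This is a generic content-defined chunking example, not Microsoft's algorithm: the `split_chunks` function, the window size, the mask and the chunk limits are all assumed values, and a BLAKE2 hash of the trailing window stands in for the rolling hash a real implementation would use.

```python
import hashlib
import random

WINDOW = 16      # bytes examined when deciding on a boundary (assumed value)
MASK = 0x3F      # cut when the low 6 bits are zero: roughly 1-in-64 positions
MIN_CHUNK = 16   # never cut before this many bytes
MAX_CHUNK = 256  # always cut at this many bytes


def split_chunks(data):
    """Split data where a hash of the trailing window matches a bit pattern."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < MIN_CHUNK:
            continue
        window = data[i + 1 - WINDOW:i + 1]
        digest = hashlib.blake2b(window, digest_size=4).digest()
        fingerprint = int.from_bytes(digest, "big")
        if (fingerprint & MASK) == 0 or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"HELLO" + original  # five bytes inserted at the front

chunks_a = split_chunks(original)
chunks_b = split_chunks(shifted)
shared = set(chunks_a) & set(chunks_b)
print(len(chunks_a), len(chunks_b), len(shared))
```

With fixed-size chunks, the five inserted bytes would shift every later boundary and few chunks would still match. Because the boundaries here depend on the data, the two chunk lists tend to realign shortly after the insertion point, which is the property that makes variable-size chunking attractive for de-duplication.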
I configured the de-duplication option on one of my lab servers and used some Hyper-V virtual machines as test data. Configuration is pretty simple. De-duplication is a role service under the File and Storage Services hierarchy. Once installed, data de-duplication can be configured on a per-volume basis, as shown in the screenshots in this post. As mentioned earlier, de-duplication runs as a background task and can be configured to wait a number of days before considering files eligible for de-duplication. The task itself is scheduled to run at a fixed time, so the process can be run during quiet periods.
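The "wait a number of days" setting can be pictured as a simple age filter that the background task applies before it touches a file. The sketch below is only a model of that idea: the `eligible_files` helper is hypothetical, and measuring age from the last modification time is an assumption on my part.

```python
import os
import tempfile
import time

SECONDS_PER_DAY = 86400


def eligible_files(root, min_age_days, now=None):
    """Return paths under root last modified at least min_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * SECONDS_PER_DAY
    selected = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.getmtime(path) <= cutoff:
                selected.append(path)
    return sorted(selected)


# Demo with two throwaway files: one backdated by ten days, one just written.
with tempfile.TemporaryDirectory() as root:
    old_path = os.path.join(root, "old.vhd")
    new_path = os.path.join(root, "new.vhd")
    for path in (old_path, new_path):
        with open(path, "wb") as handle:
            handle.write(b"data")
    ten_days_ago = time.time() - 10 * SECONDS_PER_DAY
    os.utime(old_path, (ten_days_ago, ten_days_ago))
    picked = [os.path.basename(p) for p in eligible_files(root, min_age_days=5)]

print(picked)
```

With a five-day threshold, only the backdated file is selected; the freshly written one is left alone until it has aged, which keeps frequently changing data out of the de-duplication pass.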
With a few virtual machines and other data, I easily achieved a 51% de-dupe rate and have since managed to increase this saving further. It’s simple to show that the de-dupe feature works, but the benefit is dependent on the data itself. To help with that, Microsoft offers a free tool that analyses a volume or UNC share and reports back on potential savings. I’ll look at this in more detail in a separate post and discuss some of the new PowerShell commands that display more detail on de-duplicated volumes.
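For a feel of how a savings figure is arrived at, the sketch below compares the logical size of some data with the size of its unique chunks. It is not Microsoft's evaluation tool and does not reproduce my lab result: the `estimate_savings` function, the fixed 4 KiB chunk size and the two made-up "virtual disks" are assumptions for illustration only.

```python
import hashlib


def estimate_savings(blobs, chunk_size=4096):
    """Return the fraction of space saved if identical chunks were stored once."""
    logical = 0
    unique = {}  # digest -> chunk length
    for data in blobs.values():
        logical += len(data)
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            unique[hashlib.sha256(chunk).hexdigest()] = len(chunk)
    physical = sum(unique.values())
    return 1 - physical / logical if logical else 0.0


# Two made-up "virtual disks" sharing three chunks and differing in one.
base = b"".join(bytes([value]) * 4096 for value in (10, 11, 12))
blobs = {
    "vm1.vhd": base + b"\x01" * 4096,
    "vm2.vhd": base + b"\x02" * 4096,
}
savings = estimate_savings(blobs)
print(f"{savings:.1%}")
```

Here eight logical chunks reduce to five unique ones, so the estimate is 37.5%. Real results depend entirely on how much of the data is genuinely repeated, which is why analysing a volume before enabling the feature is worthwhile.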
The Architect’s View
Few storage vendors have chosen to implement primary de-dupe and we shouldn’t be surprised by that when their main source of revenue is selling more hardware. At the same time, Microsoft is turning Windows into a storage platform and using features like de-dupe to differentiate itself and provide value-add. As I’ve said elsewhere, de-dupe of Hyper-V guests provides potentially huge benefits. But there are downsides. De-dupe creates more random I/O, which is great with SSDs but not so good with SATA drives. This means a degree of caution is required when tuning the de-dupe options, and the benefits may therefore not be as good as initially indicated. Still, this is a great feature and one that will benefit many customers.
- Data Deduplication in Windows 2012
- Eliminating Duplicated Primary Data (Microsoft Research)
- Introduction to Data Deduplication in Windows 2012
- Primary Storage De-duplication: Only for SSD Arrays?
- ChunkStash: Speeding up Inline Storage Deduplication using Flash Memory – Microsoft Research – (also PDF)
- Planning to Deploy Data De-Duplication
- Windows Server 2012 (Windows Server “8”) – Virtual Fibre Channel
- Windows Server 2012 (Windows Server “8”) – Resilient File System
- Windows Server 2012 (Windows Server “8”) – Storage Spaces
Comments are always welcome; please indicate if you work for a vendor as it’s only fair. If you have any related links of interest, please feel free to add them as a comment for consideration.