Efficient Immutable Snapshots

Chris Evans

The risk of ransomware attacks has made snapshot technology one of the primary tools for data recovery.  However, not all snapshots are the same.  Efficient immutable snapshots are needed to ensure recovery from any point in time, which these days could mean months into the past.

The Problem

Ransomware attacks have become big business.  The average impact of an attack runs into millions of dollars, while the total impact in 2021, by one estimate, was $159 billion in downtime alone.  The dwell time (the time hackers spend in the network) is increasing and could be as long as 60-90 days or more.  Backup systems are being targeted to prevent the ability to restore following an attack.  Essentially, the criminals have become more sophisticated in both approach and impact.

Snapshots vs Backup

IT organisations should take a “defence in depth” approach to ransomware and other malware attacks.  This means creating a series of defensive layers, each of which offers an option for recovery.  Rather than relying solely on backups or snapshots, for example, organisations should employ both techniques.  Snapshots provide a fast recovery time compared to backups, while an “offline” or air-gapped backup adds additional security.  Neither solution offers full coverage, but each is needed to address specific weaknesses or requirements for recovery.

Immutability

One key feature of snapshot technology should be immutability.  By definition, snapshots are naturally immutable.  A snapshot represents a point-in-time copy of data and is generally not made directly available to the application or host server.  If the contents of a snapshot are needed, this is achieved either by cloning a new volume from the snapshot or by rolling primary data back to the snapshot’s earlier point in time (see Figure 1).

Figure 1 – typical snapshot timeline

However, we can apply the immutability term another way.  When a snapshot schedule is established, the frequency and retention time are set based on the requirements of the application.  For example, a snapshot schedule may take a copy of data every 4 hours and retain the contents for one week.  After that, application recovery is expected to use backups.  During this period, snapshots should be immutable in the sense that they cannot be deleted from the primary system until the expiration date has been reached.
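
To make the scheduling sense of immutability concrete, here is a minimal sketch in Python.  The policy values mirror the example above, but the function names and structure are illustrative rather than taken from any particular platform.

```python
from datetime import datetime, timedelta

# Hypothetical retention policy, mirroring the example above: a snapshot
# is taken every 4 hours and retained for one week.
RETENTION = timedelta(weeks=1)

def expiry_time(created: datetime) -> datetime:
    """The earliest moment at which a snapshot may be deleted."""
    return created + RETENTION

def is_deletable(created: datetime, now: datetime) -> bool:
    # Immutability in the scheduling sense: deletion only becomes
    # valid once the retention period has fully elapsed.
    return now >= expiry_time(created)

snap = datetime(2022, 6, 1)
print(is_deletable(snap, datetime(2022, 6, 3)))  # False - inside retention
print(is_deletable(snap, datetime(2022, 6, 9)))  # True - one week has passed
```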

Override

Why is snapshot immutability so important?  A critical attack vector for ransomware hackers is to manually expire snapshots, removing the capability to recover from data deletion or malicious encryption.  Most storage systems (and, for that matter, virtualisation platforms) have a facility that allows an administrator to freely delete snapshot images without any additional validation.

There’s a good reason for this capability.  Snapshots increase the volume of data stored, and the efficiency of that additional capacity is based on the underlying system block size.  For example, vSphere still uses a block size of 1MB as the minimum allocation.  Any change of data smaller than this will result in an entire 1MB block being retained within a snapshot, even if the backing storage can work with smaller granularity (more on this in a moment). 

Snapshot Requirements

If IT organisations intend to rely on snapshots for recovery, then five factors come into play.  These criteria are critical for efficient snapshot implementations. 

Granularity

Storage systems should be capable of managing the smallest possible block size, ideally aligned to the file system block size.  For NTFS and ReFS (Windows), this value is recommended to be 4KB.  For ext4 systems on Linux, this figure is also 4KB.  Remember that the industry moved HDDs to 4K format many years ago, while SSDs generally work on page sizes that are a multiple of 4KB. 
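
As a quick sanity check, the file system’s block size can be queried directly.  A minimal example for Linux follows; the 4096-byte result is typical for ext4, though not guaranteed on every volume.

```python
import os

# statvfs reports the preferred file system block size; on an ext4
# volume this is commonly 4096 bytes, matching the guidance above.
st = os.statvfs("/")
print(f"file system block size: {st.f_bsize} bytes")
```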

Snapshot Count

Systems must provide the capability to take hundreds of thousands of snapshots.  For example, imagine a system with 1000 volumes, where snapshots are taken every 4 hours and retained for three months.  This basic schedule alone represents 540,000 snapshots (1000 volumes × 6 snapshots per day × 90 days).  However, IT organisations may want to extend snapshot retention into the 6 to 12-month period and/or take snapshots much more frequently (hourly or even every few minutes).  Modern storage systems should support an effectively infinite number of snapshots, limited only by the additional physical space that retaining snapshot data (and metadata) introduces.
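
The arithmetic behind that figure is worth spelling out; a trivial calculation, approximating three months as 90 days:

```python
volumes        = 1000
snaps_per_day  = 24 // 4   # one snapshot every 4 hours
retention_days = 90        # roughly three months

total = volumes * snaps_per_day * retention_days
print(f"{total:,} snapshots held at steady state")  # 540,000
```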

Snapshot Efficiency

We can look at efficiency from two viewpoints.  Firstly, there’s the block size granularity just discussed.  The smaller the block size, the less data is retained unnecessarily. 

For example, a 1TiB file system formatted with a 4KiB block size contains 268,435,456 blocks.  If 1% of the data on the file system changes between snapshots, approximately 2,684,355 blocks will be updated (10.24GiB).  A system with 4KiB granularity will retain exactly 10.24GiB of additional space.

However, with 1MiB blocks, the amount of wasted snapshot space depends on the distribution of the updates.  In the best-case scenario, the 10.24GiB of changes aligns neatly to 10,240 1MiB blocks.  In the worst-case scenario, the 4KiB updates are distributed randomly across the 1TiB file system, averaging around 2.5 updated 4KiB blocks per 1MiB block, so almost every 1MiB block is touched and the snapshot stores effectively the entire contents of the file system (1TiB) for a 1% change.

In the real world, the amount of additional data stored will be somewhere between the two extremes.  However, we know that the bigger the file system (or storage system) snapshot block size, the greater the amount of wastage.  In addition, with larger blocks, the amount of snapshot data stored will be unpredictable and only measurable as data is written to the file system and snapshots are taken.
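
A simple model puts numbers on the two extremes.  The sketch below assumes the 4KiB updates land uniformly at random across the file system, which is a simplification of real workloads, but it shows how quickly large-block snapshots degrade towards the worst case:

```python
import math

FS_SIZE     = 2**40       # 1TiB file system
SMALL_BLOCK = 4 * 2**10   # 4KiB granularity
LARGE_BLOCK = 2**20       # 1MiB granularity (e.g. vSphere)

updated = int(0.01 * FS_SIZE / SMALL_BLOCK)  # ~2.68M changed 4KiB blocks

# 4KiB granularity: retained space equals the changed data exactly.
fine = updated * SMALL_BLOCK

# 1MiB granularity with random placement: a 1MiB block escapes retention
# only if no update lands in it, with probability ~exp(-updated/M).
M = FS_SIZE // LARGE_BLOCK                   # 1,048,576 large blocks
touched = M * (1 - math.exp(-updated / M))
coarse = touched * LARGE_BLOCK

print(f"4KiB granularity: {fine / 2**30:7.2f} GiB retained")
print(f"1MiB granularity: {coarse / 2**30:7.2f} GiB retained (expected)")
```

Under these assumptions, the 4KiB system retains the expected 10.24GiB, while the 1MiB system retains roughly 945GiB: effectively the worst case for a 1% change.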

One final comment on snapshot efficiency. Wherever possible, snapshots should tier down to the cheapest tier of storage. Most snapshots will never be used, so there’s no benefit in leaving the data on fast media. If and when a snapshot is needed, any data not on the fastest tier of storage can be re-promoted to maintain performance.

Metadata Management

The second efficiency metric is the capability of the storage system to manage the metadata associated with millions of snapshots.  As we discovered back in 2014 with the destructive upgrade required for XtremIO XIOS 3.0, managing metadata associated with thin-provisioned file systems requires a lot of DRAM.  If your storage platform architecture needs all the metadata related to snapshots to be in memory, then there’s a direct limitation on the scalability of snapshots.
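
A back-of-envelope estimate shows why.  The 32 bytes per block-map entry used below is an assumed figure purely for illustration; real implementations vary widely:

```python
BYTES_PER_ENTRY = 32          # assumed metadata cost per changed block
changed_blocks  = 2_684_354   # 1% of a 1TiB volume at 4KiB granularity
snaps_per_vol   = 540         # 6 per day, retained for 90 days

per_snapshot = changed_blocks * BYTES_PER_ENTRY
per_volume   = per_snapshot * snaps_per_vol

print(f"{per_snapshot / 2**20:.0f} MiB of metadata per snapshot")
print(f"{per_volume / 2**30:.0f} GiB per volume")  # ~42TiB across 1000 volumes
```

Even at this modest change rate, keeping every snapshot’s metadata resident in DRAM quickly becomes untenable, which is why metadata architecture matters as much as raw snapshot limits.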

Management of metadata is also important from the perspective of snapshot consolidation.  When a snapshot is deleted from within a chain of linked snapshots, for example, some storage systems struggle to complete the process in a timely fashion.  Ultimately, the process of creating, managing, and deleting snapshots should have no impact on production operations.

Immutability

Snapshots must offer immutability to prevent accidental or deliberate deletion.  Once set, the administrator should not be able to override snapshot settings.  However, we need to consider the scenario that snapshot growth risks compromising the available physical storage capacity within a system.  In this instance, an override is necessary but should require additional validation, preferably with authorisation through a route that involves trusted members of the IT team in contact with the vendor.  The process must be foolproof enough to prevent spoofing by a hacker.
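
As a sketch of what such a guard might look like (the function and the two-approver rule are hypothetical, not any vendor’s actual mechanism):

```python
from datetime import datetime

def may_delete(expiry: datetime, now: datetime, approvals: set) -> bool:
    """Permit deletion only on expiry, or via a dual-approval override."""
    if now >= expiry:
        return True  # retention period has elapsed naturally
    # Emergency override (e.g. capacity exhaustion) requires two distinct
    # approvers, so a single compromised credential cannot expire snapshots.
    return len(approvals) >= 2

expiry = datetime(2022, 7, 1)
print(may_delete(expiry, datetime(2022, 6, 1), set()))                  # False
print(may_delete(expiry, datetime(2022, 6, 1), {"it-admin", "vendor"})) # True
```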

The Architect’s View®

Ransomware has the power to destroy businesses, but with the right data protection strategy in place, the risks can be mitigated.  Snapshots provide one line of defence that enables fast restores and, if implemented correctly, should give peace of mind that data can be recovered in any attack event.  Remember, though, that storage systems are not all built with the same capabilities.  When picking a platform, look out for our five criteria as your minimum requirement for data recovery.

Copyright (c) 2007-2022 – Post #6f40 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.