My Love/Hate Relationship with Snapshots

I recently recorded a “Restore it All” podcast with Curtis Preston in which we discussed the benefits and disadvantages of snapshots. While I’m not against snapshot technology, I do think the use of point-in-time copies needs to be managed with care. Otherwise, we continue to create technical debt that will need to be unpicked later. Here’s why.

The History of Snapshots

I tried to find the first occurrence of snapshot technology, as a guide to how the feature has been used over the years. It’s not clear which product first implemented snapshots. However, NetApp with WAFL/ONTAP and StorageTek with Iceberg look to be front runners. If anyone has documentary evidence of the first implementations, I’d be interested to hear from them.

I used snapshots on Iceberg around 1995, evaluating the first system in the UK and helping IBM understand the technology when they resold it as RVA (RAMAC Virtual Array). Iceberg was a mainframe storage solution, so it presented what looked like LUNs (volumes) to the system, although volumes inherently stored files.

NetApp initially operated only as a network file system platform, with snapshots offering the ability to take point-in-time copies of volumes and file shares. Later, of course, the platform evolved to expose block-based LUNs.

In both instances (Iceberg and ONTAP), the implementation of snapshots is determined by the underlying metadata that represents a volume. For ONTAP this is 4KB blocks, and for Iceberg I believe this was 2KB. Coincidentally, both platforms use a redirect-on-write approach for new data. As updates occur, the new data is written into free space, meaning snapshots don’t have to perform a “copy on write” to preserve the snapshot image.
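To make the redirect-on-write idea concrete, here is a minimal toy sketch in Python. It is not how ONTAP or Iceberg is implemented; the class and method names are hypothetical and the "volume" is just an in-memory list. It only illustrates that a snapshot can be a copy of the block map (metadata), while new writes land in free space.

```python
class RedirectOnWriteVolume:
    """Toy volume: a block map points logical block numbers at physical slots."""

    def __init__(self):
        self.store = []       # append-only list standing in for physical blocks
        self.block_map = {}   # logical block number -> index into self.store
        self.snapshots = {}   # snapshot name -> frozen copy of the block map

    def write(self, lbn, data):
        # New data always goes to free space; the old block is left untouched.
        self.store.append(data)
        self.block_map[lbn] = len(self.store) - 1

    def snapshot(self, name):
        # Only metadata is copied here, no data blocks are moved or duplicated.
        self.snapshots[name] = dict(self.block_map)

    def read(self, lbn, snapshot=None):
        mapping = self.snapshots[snapshot] if snapshot else self.block_map
        return self.store[mapping[lbn]]


vol = RedirectOnWriteVolume()
vol.write(0, b"original")
vol.snapshot("monday")
vol.write(0, b"updated")   # redirected to a new slot; "monday" still sees the old one
```

After these calls, reading block 0 from the live volume returns the updated data, while reading it through the "monday" snapshot still returns the original. A copy-on-write design would instead have to copy the old block somewhere safe before overwriting it in place.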

Platform-Specific

I’m not going to go into detail on the specific implementations of snapshot technology. However, to be space-efficient, snapshots have to offer incremental capabilities, that is, to store only changed data and keep track of the changes over time. This means any individual snapshot can be constructed from unchanged and protected changed data, typically blocks within the architecture of the filesystem or storage implementation. The smaller the blocks, the more efficient the snapshot.

Making snapshots efficient is officially a good thing (compared to a clone, which is a full image copy of a volume). But here’s where we have a problem. Understanding the content of a particular snapshot requires knowing how to reconstitute the data, either through access to the blocks or by having the volume presented to a host using the original access protocols (e.g. block or file).

Why is this an issue? Well, imagine moving from a system that uses 4KB internal blocks to one that uses 6KB. Assuming it was even possible to extract the raw snapshot data, how would the 4KB changes align on a 6KB system? This is a contrived example; however, many storage solutions also implement thin provisioning and compression, so moving a raw snapshot from one system to another while retaining the space efficiency is effectively impossible.
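The alignment problem in the contrived 4KB-to-6KB example can be shown with a little arithmetic. The sketch below is purely illustrative (the function name is hypothetical, and no real array exposes its blocks this way): it works out which 6KB target blocks a single changed 4KB source block would touch.

```python
def target_blocks(src_index, src_size=4096, dst_size=6144):
    """Return the target block numbers overlapped by one source block."""
    start = src_index * src_size      # first byte of the source block
    end = start + src_size - 1        # last byte of the source block
    return list(range(start // dst_size, end // dst_size + 1))


# Source block 0 (bytes 0-4095) fits inside target block 0.
# Source block 1 (bytes 4096-8191) straddles target blocks 0 and 1.
# Source block 2 (bytes 8192-12287) fits inside target block 1.
for i in range(3):
    print(i, target_blocks(i))
```

In this toy model, one in every three changed 4KB blocks straddles two 6KB blocks, so a 4KB change could force 12KB to be tracked on the target. That is before thin provisioning or compression are taken into account.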

Data Types

Let’s look now at how data is presented to the host. In volume snapshots of file data (e.g. an ONTAP volume), the snapshot will contain changed files from the previous snapshot or original copy. We can identify changes by examining each file and directory to see what’s changed since last time, a so-called “tree walk”, which isn’t that efficient. An alternative is to have the file system tell us what’s changed. For example, the ONTAP SnapDiff API provides access to changed data without the need to walk the directory structure.
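For illustration, a “tree walk” can be sketched in a few lines of Python using only the standard library. This is a simplified assumption of how a backup tool might detect changes (comparing size and modification time against a saved index); the function names are hypothetical and real products are more sophisticated.

```python
import os


def tree_walk(root):
    """Visit every file under root and record (size, mtime) per relative path."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            index[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return index


def changed_since(previous, current):
    """Compare two indexes; return (new or modified paths, deleted paths)."""
    changed = sorted(p for p, meta in current.items() if previous.get(p) != meta)
    deleted = sorted(p for p in previous if p not in current)
    return changed, deleted
```

The cost is visible in the structure: every file has to be visited with a stat call on each run, even if nothing has changed. A change-reporting API avoids that by having the file system hand over the list of changed paths directly.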

Unfortunately, SnapDiff-type solutions can be inefficient from a storage capacity perspective. Imagine a 1GB file in which one 4KB block changes. If the API can’t report back which part of the file changed, the whole file gets backed up again. This is one reason why people love snapshots; their efficiency is based on physical file-to-storage mapping. Also, file-watching technologies can be resource intensive on highly active file systems, so these aren’t a great solution either.
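The gap between file-level and block-level change tracking is easy to quantify. The following Python sketch is an assumption-laden illustration, not any vendor’s method: it compares two versions of a buffer by hashing fixed 4KB blocks (the helper names and the use of SHA-256 are my choices) and works out the ratio for the 1GB example, taking 1GB as 2^30 bytes.

```python
import hashlib

BLOCK = 4096  # 4KB block size, matching the example in the text


def block_hashes(data, block=BLOCK):
    """Hash each fixed-size block of a bytes-like object."""
    return [hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)]


def changed_blocks(old, new, block=BLOCK):
    """Return the indexes of blocks that differ between two versions."""
    old_h, new_h = block_hashes(old, block), block_hashes(new, block)
    return [i for i in range(max(len(old_h), len(new_h)))
            if i >= len(old_h) or i >= len(new_h) or old_h[i] != new_h[i]]


# A small stand-in for a large file: 16 blocks, with one byte changed in block 5.
old = bytes(16 * BLOCK)
new = bytearray(old)
new[5 * BLOCK + 10] = 0xFF
print(changed_blocks(old, bytes(new)))   # [5]

# For the 1GB file in the text: file-level backup vs. one changed 4KB block.
ratio = (1 << 30) // BLOCK               # 262144 times more data copied
```

A file-level approach that only knows “this file changed” copies the whole gigabyte, roughly 262,144 times more data than the single block a snapshot would need to retain.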

Rehydration

Whether using snapshots on block or file volumes, when the data comes out of the original platform, some degree of rehydration is needed. This may be to reconstitute the volume and extract changed files or to put a restored copy of data onto another storage system. While the technology exists to create synthetic volumes and mount those directly, building a metadata store of individual file content still requires a volume to be rehydrated in some form. This is inevitable because our data is in a logical, not physical format.

Structured Data

What happens if our data is in a structured format like a database? At this point, there’s a second layer of indirection to consider between physical storage and the data itself. The first layer is between raw blocks and the file system. The second is how the files of a database combine to create an application that is accessed through APIs like SQL.

The interactions between individual files that underpin a database are obviously complex, to the extent that we need to implement backup APIs on databases to flush data from internal caches. This makes it difficult (or almost impossible) to process individual file updates as a way of determining a database has changed. It would be much better if the database itself provided a stream of backup data in a form that can be used incrementally.

Application-Focused

As we move to hybrid computing, data within applications has to become less dependent on the underlying storage platform. In the public cloud, for example, we simply don’t know or have any exposure to how data is being stored within the infrastructure. We rely on vendors to provide access to data in a form we can use within an application.

Look at AWS snapshots as an example. We can take snapshots of EBS volumes to S3 that are incremental and only consume S3 storage when data changes. But AWS doesn’t expose the raw data to us, so we have to use AWS APIs to restore those snapshots to somewhere they can be used (another EBS volume).

What if we want to move one snapshot to another platform, or extract some data from a snapshot? We have to restore the entire volume first. So, we can’t just move our snapshots between platforms as we move the application.

The Architect’s View

As I said at the outset, I’m not against snapshots per se. As an extra protection mechanism, they’re quick and easy. Most storage platforms offer quick backup and restore with snapshots. Most also provide ways to move those snapshots to a secondary platform to protect against hardware failure.

However, from an application perspective, snapshots represent platform lock-in. Snapshots aren’t cross-vendor portable and require post-processing to extract logical content. I wouldn’t advocate avoiding snapshots. Instead, start to focus on backups that work at the application layer. This can include using snapshots to feed the backup system. The backup software can then do the storage optimisation tasks. That way, if a restore needs to go cross-platform or cross-vendor, there’s no lock-in remaining.