The need for self-describing secondary data

As primary data becomes ever more portable, we need to ensure that our backups have the same levels of flexibility to be moved, stored and used wherever we need them. Why is this important, and what features should we look out for?

First, we need to go back in time, look at data protection and think about how we got here.

The Age of Tape

Anyone who’s been working with backups for any length of time will know the pain of dealing with tape media. Although tape offers low-cost and (today) high-performing sequential access, one issue with using tape for backups is the difficulty of knowing what a given tape actually contains.

Many enterprise IT organisations have tapes that go back years if not decades. Tracking tapes against a source backup system is fraught with challenges. I’ve seen a range of scenarios, including where the bar code on the exterior of a tape cartridge has been lost or faded, making it impossible to determine the media contents.

Although tape has lots of good portability and cost benefits, media capacities have increased to a level where the sequential nature of tape makes it hard to be efficient in managing timely access to data.

LTFS

LTFS, or Linear Tape File System, was developed by IBM as a solution to make tape content self-describing. The idea was to integrate metadata along with data by splitting the tape into two partitions, one holding an index and the other holding the file data. An LTFS-formatted tape can be read on any suitable device and operating system that supports the format. This gives the appearance of a traditional “random access” file system, but on sequential media. Obviously, the random access nature comes with potentially terrible latency.

The Age of Disk

Although LTFS was a cool idea, the technology doesn’t perform well if multiple passes of the tape media are needed (that latency problem again). Rewriting data is also a huge challenge for an active backup or archive, because tape is sequential media.

In any case, as LTFS was being developed, VTLs or Virtual Tape Libraries were making bigger inroads into the enterprise data centre. Backups moved to disk systems that were able to optimise content through techniques like data de-duplication and compression. As a result, LTFS wasn’t as successful as the technology promised to be. However, VTLs were successful and arguably this was the start of enterprise backups moving to disk and away from tape.

Metadata

Even if LTFS had been successful, the underlying requirement for portable backups remains the same: metadata and data must be stored together within the secondary content. This enables the data to be read without going back through the original backup platform and, crucially, without needing the backup software database that describes the content and its place on media. Without the metadata that describes backup data, the only way to understand backup content is to scan it in its entirety in the hope that some useful data can be retrieved from it.

Self-Describing

What could “self-describing” actually mean? There are two main components. The first is to understand the format of the backup data itself. Whether the data is stored as a series of files or as big chunks of data like a tarball or ZIP, we need to know how to read that data back. In this instance, ZIP and tar are good examples as they are encapsulated self-describing data formats. They can be read and understood by most operating systems as both have become de-facto standards.

The second requirement is to understand how to translate data back to an origin. While a ZIP file might describe a set of files, it provides no context as to where that data came from. We need to know the source server or application, the date and time of backup, whether the content includes PII, and any security credentials needed to access the content. We should also be storing details of how long the data is expected to be kept, either as an expiration date or retention time.
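To make the two components concrete, here is a minimal sketch in Python of a tar archive that carries its own description. It is an illustration only, not an existing standard or any vendor's format: the manifest name `MANIFEST.json`, its field names and the `write_backup`/`read_manifest` helpers are all hypothetical, and credentials are deliberately left out.

```python
import io
import json
import tarfile
from datetime import datetime, timedelta, timezone

# Hypothetical manifest name and fields; no standard defines these.
MANIFEST_NAME = "MANIFEST.json"


def write_backup(files, source, retention_days, taken_at, contains_pii=False):
    """Build a tar archive (as bytes) holding the files plus a manifest.

    `files` maps archive paths to their content as bytes.
    """
    manifest = {
        "source": source,                      # origin server or application
        "taken_at": taken_at.isoformat(),      # date and time of backup
        "expires_at": (taken_at + timedelta(days=retention_days)).isoformat(),
        "contains_pii": contains_pii,
        "files": sorted(files),
    }
    members = {MANIFEST_NAME: json.dumps(manifest).encode("utf-8")}
    members.update(files)

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_manifest(archive_bytes):
    """Recover the description using only the archive itself."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return json.load(tar.extractfile(MANIFEST_NAME))


if __name__ == "__main__":
    taken = datetime(2024, 1, 1, tzinfo=timezone.utc)
    archive = write_backup({"etc/app.conf": b"x=1\n"}, "app-server-01", 30, taken)
    print(read_manifest(archive)["expires_at"])  # 2024-01-31T00:00:00+00:00
```

The point of the sketch is that `read_manifest` needs nothing but the archive: no backup catalogue, no original platform. A real format would also have to deal with encryption, integrity checks and how credentials are referenced rather than embedded.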

Secondary Data Standards

Why should we care if backup data is in a standard, portable format? Let’s consider where application deployments may head in the future.

Currently, we’re starting to see multi-cloud deployments, initially using either separate IaaS solutions or a mix of IaaS and SaaS. IT organisations might have data in Office 365, use a CRM tool, have traditional applications on-premises and be experimenting with micro-services and containers.

As organisations move to a model where the underlying platform is fully abstracted enough to make use of a range of services, decisions on application deployment will be based on cost, efficiency, reliability and proximity to the customer or end user. We also have to consider that existing public cloud providers will continue to evolve their service offerings to a point that service choice may be decided by the features on offer.

The result is a sliding scale or spectrum of scenarios where we move from inflexibility in service choice to total flexibility as to where applications are deployed and data processed. Achieving total flexibility will depend on a range of issues being solved, such as efficient service orchestration, application latency, resiliency and of course, whatever data framework is in place.

One other aspect of portability is worth considering. If backup data can be made portable and self-describing, then it makes the process of creating an “air gap” between live data and backup data more practical. This can be really useful to protect against ransomware or other malicious hacking attacks.

A Data Framework

What do we mean by data framework? The idea is perhaps worth exploring in a separate post but imagine for a moment the scenario we already presented. Messaging and collaboration data is stored in Office 365; traditional applications on a virtual server farm; new applications running as micro-services.

What happens if we want to move any of these workloads to a new provider? What happens if we want to search across the entire data set of content? For primary data, these are relatively easy questions to answer. We can simply move the content and start working in the new location. Some data, like messaging/collaboration content, would definitely be more challenging, whereas other data, like unstructured content (files and objects), would be relatively easy.

Now think 6-12 months down the line. Imagine a range of services have been running in AWS and using the new native backup service. If these applications move to Azure, how will historical backups be managed? Is it even possible to restore a backup taken 6 months previously in AWS into Azure? This is just one simple example, but we can imagine more complex ones where a mix of different services is used to deliver a single application.

Secondary Data Portability

Here’s where portability becomes important. Backup data should be as portable as application code and primary data. We need the ability to cater for movement between platforms, including our secondary data. Without this, IT organisations will have to either retain infrastructure to cater for future restores (a “museum” environment), spend lots of money moving secondary data between platforms, or accept much less flexibility in deploying services between service offerings.

VTL technology from the 2000s was a good solution to eliminate tape from the backup cycle. However, it created lock-in. Although the data access format (e.g. a virtual tape) is standard, getting to that data means going through the “front-end” access protocol and reading the content as a backup. Backup data can be moved by the backup application from one location to another, but the entire contents have to be “re-hydrated” to move to another storage medium (or vendor), before being stored again on the new solution. With de-duplication rates of 95-99% on these platforms, wouldn’t it be easier to move the physical content directly, rather than having to extract it in full, then recompress and store it again?
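To put rough numbers on the re-hydration cost, here is a small back-of-the-envelope sketch in Python. It assumes a “de-duplication rate” means the percentage of logical data that doesn't need to be physically stored; the 100 TB figure and the function names are purely illustrative.

```python
def physical_tb(logical_tb, dedup_percent):
    """Capacity actually stored for a given logical backup size."""
    return logical_tb * (100 - dedup_percent) / 100


def rehydration_factor(dedup_percent):
    """How many times more data moves when content is re-hydrated
    than if the de-duplicated physical content were moved directly."""
    return 100 / (100 - dedup_percent)


if __name__ == "__main__":
    for pct in (95, 99):
        print(pct, physical_tb(100, pct), rehydration_factor(pct))
    # 95 5.0 20.0
    # 99 1.0 100.0
```

Under that assumption, 100 TB of logical backups occupies between 1 TB and 5 TB on disk, so a re-hydrated migration moves 20 to 100 times more data than copying the stored content would.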

Backup Service Portability

There’s one more consideration to think about. If our data is portable, why not have portability of the backup solution itself? After all, when we move applications around, the chances are that the definitions that determine how frequently data is protected, including levels of security and compliance, will stay the same. Backup software is just another application. So how about standardising on the backup service definitions, or at least allowing backup to be deployed easily and seamlessly on multiple types of infrastructure?

The Architect’s View

To be fair, backup software vendors have been moving away from proprietary storage for some time. Most, if not all, support object and NFS targets. The problem here is the ability to pick that data up and use it elsewhere. Backup software has morphed into appliances. Perhaps we need to see that software becoming easier to deploy across private and public clouds.

We already have standards bodies across the technology industry and even have de-facto standards like the S3 API. Is it too much to ask that we get some standardisation on either backup data or the definitions by which we manage backups? This might not be popular with backup vendors that have a desire to lock us into solutions. But surely this has to be the right approach as we move to a multi-cloud world. Otherwise, we’re going to be creating an ever-increasing pool of unmanageable data, increasing costs and generating unending technical debt.