For as long as I can remember, we’ve protected data by referencing physical or pseudo-physical entities. In the mainframe days, we backed up physical volumes. In the client/server era, we backed up physical servers. Over the last 20 years, we’ve backed up virtual machines – effectively software instances of a physical server. As we move into a more distributed world, we need to start thinking differently and organising our backups based on the application and data, not on the device being protected.
The Backup Inventory
Look at pretty much any backup software today and you will see that the main index for referencing objects under protection is a physical or logical entity. We back up virtual machines (usually in their entirety), just as we backed up physical servers before that. Even file server data protection references physical appliances and volumes.
Why did we end up here? Historically, applications were deployed on physical machines that rarely changed. An application was deployed on one or more servers that could be in place for 3-5 years before being refreshed. In general, system admins knew what each server was used for and most were assigned a single application. Keeping track of the server/application relationship wasn’t that hard.
As we moved to a virtual world, things became a little more dynamic. Virtual machines are typically long-lived but are likely to get replaced rather than refreshed, as it’s (in many cases) easier to move data to a new VM than do major upgrades to one in place.
With the ability to create hundreds or thousands of virtual machines a month, aligning ownership to a VM requires good naming standards and the efficient use of tagging or custom attributes. As anyone who has tried to enforce naming standards and structures in freeform fields will know, it’s hard to ensure people adhere to the rules, and inevitably resources get mis-attributed. Of course, the backup application needs to support collecting and tracking these tags too; otherwise, future searches for backups will be useless.
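As a rough illustration of what tag-driven ownership could look like, the sketch below groups machines by an application tag and lists anything left unattributed. The inventory records, the machine names and the `app` tag key are all assumptions made for the example, not the format of any particular hypervisor or backup product.

```python
from collections import defaultdict

# Hypothetical inventory records; the "app" tag key is an assumed convention.
inventory = [
    {"name": "vm-0142", "tags": {"app": "Payroll", "env": "prod"}},
    {"name": "vm-0207", "tags": {"app": "payroll "}},
    {"name": "vm-0311", "tags": {"env": "test"}},  # no application tag
]


def build_application_map(machines, tag_key="app"):
    """Group machine names by application tag.

    Returns (application_map, unattributed), where unattributed lists the
    machines that carry no usable application tag.
    """
    app_map = defaultdict(list)
    unattributed = []
    for machine in machines:
        raw = machine.get("tags", {}).get(tag_key, "")
        app = raw.strip().lower()  # tolerate freeform-field inconsistencies
        if app:
            app_map[app].append(machine["name"])
        else:
            unattributed.append(machine["name"])
    return dict(app_map), unattributed


app_map, unattributed = build_application_map(inventory)
```

Normalising case and whitespace only papers over the simplest freeform-field mistakes. The unattributed list is the more useful output, because it shows which machines would be invisible to an application-based search.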
The use of container technology will make the backup process even more complex. Now we have container definitions to maintain, plus the data within the container itself. Exactly how persistent container-based data is protected will depend on the application being used. Databases, for example, could just be protected at the application level. Other container apps may be more complex or require backing up a file share or a local host directory.
It’s also possible that we’re protecting applications directly from the application platform. We’ve already mentioned databases and file shares, but there’s also data in the public cloud, either as SaaS or IaaS. In total, there’s a lot of variation in formats and content, all of which could be refactored at some point.
It’s clear to see how we got to where we are. Implementing data protection on physical and virtual entities is just an extension of the way backup was originally built. It was easy to simply back up an entire virtual machine, just like we did physical ones. In the physical world, it made sense to back up an entire boot disk and data. In the virtual world, the rationale is weaker, as we can easily rebuild VMs or virtual instances from master copies.
The legacy view that still exists today is that we should continue protecting data based on the package used to deliver the application. Increasingly, that will be a flawed assumption.
What’s the Problem?
Why should we care if we’re backing up via virtual instance names? Here are just a few issues.
- Abstraction – Application packaging is changing, which means it’s now as easy to run one application on-premises as a VM, as a container or in the public cloud. We’re starting to care less about the package (like the Windows or Linux platform) and more about the application and data.
- Longevity – Application packages will come and go, but the data will remain. The data will outlive the delivery mechanism, so it makes more sense to protect the data itself, not protect the “package”. Of course, it would be folly to simply throw the packaging away, because sometimes we might want to recover part of the O/S for some reason. However, we should be able to easily separate packaging from the data.
- Scale – IT organisations are scaling from hundreds of servers to thousands of VMs and tens of thousands of containers. Keeping track of exactly what is being used for each application becomes harder with each instantiation. As we consider serverless technologies, this situation isn’t going to get any easier.
- Automation – nobody wants to track applications manually. We want automation to do this on our behalf. Backup software should know what application each virtual server or container instance belongs to.
- Refactoring – typically called “modernising”, but really refactoring from (for example) VMs to containers. There are genuine savings to be made in repackaging applications. Just imagine though – how would you restore an application from a VM backup to a container that was refactored three months ago? Restore the VM first?
- Portability – today, moving applications to and from the cloud isn’t straightforward. Even if we’re not moving back and forth on a weekly basis, we need to be able to encapsulate application data and backups so they can move easily between platforms, and this requirement is likely to grow in the future.
- Compliance – being able to search across data sources where the data source can be tagged and mapped to an application is a real benefit for compliance and audit. Where backups and secondary data are being used for this purpose, it makes sense to align the data to an application from day one.
So, there are lots of good reasons why focusing on data from an application perspective makes more sense. How are we going to do it?
An Application View
Probably the most obvious step is to simply restructure backup platform GUIs to show systems in an application-focused order. This means building a tree, similar to Active Directory, that shows business functions and applications in a hierarchical or grouped structure. Each application then has one or more sub-components against which backup policy can be applied.
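One way to picture that hierarchy is as a small data model: business function, then application, then sub-component, with policy attached at the component level. The following is a minimal sketch only. The class names, package labels and policy names are hypothetical and do not come from any backup product.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Component:
    name: str
    package: str  # illustrative labels such as "vm", "container", "file-share"
    policy: str   # name of a backup policy (hypothetical)


@dataclass
class Application:
    name: str
    components: List[Component] = field(default_factory=list)


@dataclass
class BusinessFunction:
    name: str
    applications: List[Application] = field(default_factory=list)


def policy_report(function: BusinessFunction) -> List[Tuple[str, str, str, str]]:
    """Flatten the hierarchy into (application, component, package, policy) rows."""
    return [
        (app.name, comp.name, comp.package, comp.policy)
        for app in function.applications
        for comp in app.components
    ]


finance = BusinessFunction(
    name="Finance",
    applications=[
        Application(
            name="payroll",
            components=[
                Component("payroll-db", "vm", "daily-35-day-retention"),
                Component("payroll-web", "container", "config-only"),
            ],
        )
    ],
)

rows = policy_report(finance)
```

The point of the shape is that the application, not the machine, is the key a user would browse or search by. The package is just an attribute of each component.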
If a component moves to another platform, for example, a database moving from on-premises to AWS RDS, then the backup process may change, but the backup software simply tracks the source as a different package. This also allows backup proxies to operate on behalf of a central backup platform. In this example, the on-premises software could trigger the RDS backup using AWS documented APIs.
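The idea that the component keeps its identity while its package changes can be sketched as a simple dispatch on package type. Everything here is illustrative: the handler functions are placeholders that return a description instead of doing any work, and a real proxy would call the provider’s documented API at that point.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Component:
    application: str
    name: str
    package: str
    history: List[str] = field(default_factory=list)


def backup_vm(component: Component) -> str:
    # Placeholder: a real handler would drive an image-level VM backup.
    return f"image-level backup of {component.name}"


def backup_managed_db(component: Component) -> str:
    # Placeholder: a real handler would request a snapshot via the cloud API.
    return f"provider snapshot requested for {component.name}"


# Hypothetical package labels mapped to the handler for each one.
HANDLERS: Dict[str, Callable[[Component], str]] = {
    "vm": backup_vm,
    "managed-db": backup_managed_db,
}


def run_backup(component: Component) -> str:
    """Pick the handler for the current package and record the result."""
    result = HANDLERS[component.package](component)
    component.history.append(f"{component.package}: {result}")
    return result


db = Component(application="payroll", name="payroll-db", package="vm")
run_backup(db)

# The database moves to a managed cloud service; only the package changes.
db.package = "managed-db"
run_backup(db)
```

Because the history hangs off the component and not the package, backups taken before and after the move sit in one timeline for the same application.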
An Application Format
One of the challenges, of course, that this transformation would encounter is a need for portable backup formats. A snapshot copy of a traditional database isn’t going to easily restore into a cloud instance. Vendors are already addressing this problem, with some using content independent formats, while others are “cracking open” the backup image and extracting data from the contents.
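To make the portability problem concrete, a backup could carry a small manifest describing what it contains, independent of the package it came from. This is a sketch under assumptions: the field names and the format labels (`logical-sql-dump`, `block-snapshot`) are invented for illustration and do not represent any vendor’s format.

```python
import json


def build_manifest(application, component, source_package, data_format, created):
    """Describe a backup in terms of the application, not the machine."""
    return {
        "manifest_version": 1,
        "application": application,
        "component": component,
        "source_package": source_package,
        "data_format": data_format,
        "created": created,
    }


def can_restore_to(manifest, accepted_formats):
    """A target can take the backup only if it understands the data format."""
    return manifest["data_format"] in accepted_formats


manifest = build_manifest(
    application="payroll",
    component="payroll-db",
    source_package="vm",
    data_format="logical-sql-dump",
    created="2019-09-01T02:00:00Z",
)

# Round-trip through JSON to show the description travels with the backup.
encoded = json.dumps(manifest, sort_keys=True)
restored = json.loads(encoded)
```

A content-independent format such as a logical dump can be offered to any target that accepts it, whereas a block-level snapshot ties the restore to the original package.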
Update: Since writing this post, I’ve watched a video in which the VMware Project Pacific team highlight the benefit of treating a group of VMs or containers as a single application. This describes exactly how we need to start understanding our backup definitions.
The Architect’s View
There are a lot of steps needed in order to realistically transition to application-focused backups. The ideas and challenges outlined in this article apply equally to primary data and we already see roadblocks in taking advantage of multiple platforms and cloud offerings. I’d like to see vendors agreeing on standards that would allow application mapping at the virtual machine/instance and container level as a starting point. This would at least provide a way to automate the building of an application map for data protection.
There’s still some thinking to do about exactly how this would all work. However, it is achievable and should be part of data protection modernisation. Let’s hope we can start to see this thinking being developed by our data protection vendors in the not too distant future.
Copyright (c) 2007-2019 Brookend Ltd, no reproduction without permission. Post #7F92.