Data Protection in a Multi-Cloud World

Chris Evans – Cloud, Data Management, Data Mobility, Data Protection, Enterprise, Opinion

We’re moving to a multi-cloud world – there’s no doubting it.  I hear from end users and vendors alike that, increasingly, IT organisations are looking to use on-premises infrastructure alongside multiple cloud service providers.  From an operational perspective, how do we protect and maintain ongoing access to data and applications?  What’s the right strategy for dealing with data protection in a multi-cloud world?

Defining Multi-Cloud

First, we should put context around what multi-cloud actually means.  There are lots of definitions in play these days.  Hybrid Cloud and Multi-Cloud come up time and time again.  Here’s my attempt at putting a stake in the ground on how we should define these concepts.

  • Hybrid-cloud – a mix of on-premises and public cloud services used to deliver applications.  Typically, hybrid cloud has referred to a single cloud provider, such as AWS or Azure, providing an extension of on-premises computing.  This could mean “cloud-bursting” temporarily into the public cloud, or shifting some data or application workload there permanently.  In general (my personal definition), I see hybrid cloud as meaning on-premises plus a single public cloud provider.
  • Multi-cloud – a mix of multiple on-premises and public cloud providers used to deliver applications.  Isn’t that just how we defined hybrid?  Yes – and no.  Multi-cloud could mean using only the public cloud and having nothing on-premises.  Alternatively, it could mean many interwoven services, or multiple entirely disparate ones.

It’s easy to think that hybrid is a subset of multi-cloud, but I think the two simply intersect.  In addition, a multi-cloud strategy may mean no direct interoperability between provider platforms.  For example, Microsoft’s Office 365 SaaS solution could be used for email, while AWS is used for traditional databases and Google is used for analytics.  All that being said, I do think that over time workloads will become more dynamic and, taken to its ultimate conclusion, applications will span clouds and/or move between them for resiliency, performance and cost reasons.  If we do ever get there, we can easily see there will be challenges.  More on that in a moment.

Risk vs Benefit

Why are hybrid and multi-cloud solutions proving popular?  I think there are a number of factors aligning, each of which contributes to the transformation businesses are undertaking.

  • Cost – By this, I don’t mean simply the price of acquiring a solution, but also the division of capital vs operational expenditure.  Public Cloud allows businesses to be much more efficient in consuming technology, only paying for what they need.
  • Flexibility – hybrid solutions can be more dynamic than either on-premises or public cloud alone.  The ability to burst workloads for a short period is a good example.  Another is being able to make use of the latest hardware features when they become available.
  • Combinatorial Services – OK, it’s a mouthful, but effectively the public cloud service providers are innovating around AI/ML and data management faster than enterprises could deploy new solutions themselves.  With a good hybrid strategy, businesses can exploit the best that each service provider has to offer.

So a hybrid or multi-cloud strategy is worth pursuing.

Defining Data Protection

We all know what data protection is, I’m sure.  However, it is worth re-iterating what we expect from a data protection solution, especially when looking through the lens of multi-cloud.  Data protection is about:

  • Protecting and recovering data assets from loss in the event of hardware or facilities failure
  • Protecting from malicious damage (hacking), user error (fat fingers) or logical corruption (software bugs)

We measure data protection requirements using metrics like RPO (recovery point objective – the maximum tolerable age of recovered data, compared to the original copy, in the event of a disaster or data loss) and RTO (recovery time objective – the expected time taken to restore data from backup or recover a business process after such an event).  These metrics define SLOs – service level objectives – that might be enforced through service level agreements, or SLAs.
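
As a purely illustrative aside, RPO and RTO targets are often captured as policy data that backup tooling can act on.  The sketch below uses hypothetical Python field names of my own invention, not any particular product’s schema:

```python
# Hypothetical backup policy capturing RPO/RTO targets as data.
# Field names are illustrative, not taken from any specific product.
backup_policy = {
    "application": "orders-db",          # the asset being protected
    "rpo_minutes": 60,                   # tolerate at most one hour of lost updates
    "rto_minutes": 240,                  # service must be restorable within four hours
    "schedule": "hourly snapshot, daily full copy replicated offsite",
    "slo_enforced_by_sla": True,         # breaching the SLO has contractual consequences
}

# A scheduler could, for example, derive a snapshot interval from the RPO,
# sampling twice per RPO window to leave headroom.
snapshot_interval_minutes = backup_policy["rpo_minutes"] // 2
print(snapshot_interval_minutes)  # 30
```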

The Traditional Model

The traditional view of data protection is to run a backup solution that periodically takes point-in-time copies of data that can be used for subsequent restores.  We also supplement this (where finances allow) with other tools like snapshots, array-based replication and application-based replication.  Backup products have historically been deployed and operated close to the data, usually in the same data centre, with some way of moving backup copies (or images) offsite.  Point-in-time copies are important, because they create an air gap between live data and the state of data at the time the copy was created.  This helps to mitigate the corruption and malicious damage issues.

This process has worked well for many years, although we’ve recently seen a minor revolution in data protection as new start-up vendors have come to market with appliance-based solutions that aim to consolidate secondary storage use cases.  Although the appliance model is currently popular, it is still the data protection software that delivers the value of any backup platform.  Hardware packaging is simply a means to an end.  We discussed the distinction between data management and data asset management in the context of data protection platforms in a recent episode of Storage Unpacked.

Data and Metadata

Pieces of LEGO – is this your backup data?

Whatever the solution, all backup platforms have at least two things in common – backup data that exists as copies of the original data, and metadata to describe the backup copies themselves.  The data we keep for restores could be in any format – on tape, on disk, on flash.  It can be de-duplicated and compressed – as long as the metadata explains how to reconstitute it when needed.  You can think of backup data as a set of LEGO bricks, while the metadata are the instructions for making a particular model.  As long as we have the pieces, we can make pretty much anything.  Remember too that with LEGO, the same bricks can be used in multiple models, just as de-duplicated content can be shared between backup images.
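
To make the LEGO analogy concrete, here’s a minimal sketch (illustrative Python, not a real product’s catalogue format) of how de-duplicated data chunks might be kept separate from the metadata that describes how to reassemble one particular backup image:

```python
from dataclasses import dataclass
from typing import Dict, List

# The "bricks": de-duplicated data chunks, stored once and addressed by fingerprint.
chunk_store: Dict[str, bytes] = {
    "sha256:aa11": b"<block of file data>",
    "sha256:bb22": b"<another block>",
}

# The "instructions": metadata describing how to rebuild one backup image.
@dataclass
class BackupImage:
    source: str            # e.g. "vm-web01:/var/lib/mysql" (placeholder)
    created_at: str        # point-in-time of the copy
    chunk_refs: List[str]  # ordered fingerprints; chunks may be shared with other images

image = BackupImage(
    source="vm-web01:/var/lib/mysql",
    created_at="2019-05-01T02:00:00Z",
    chunk_refs=["sha256:aa11", "sha256:bb22"],
)

def restore(img: BackupImage) -> bytes:
    """Reassemble the original data by following the metadata."""
    return b"".join(chunk_store[ref] for ref in img.chunk_refs)

print(restore(image))
```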

Backup in a Distributed World

Ensuring ongoing data protection has a few challenges when we start distributing applications across multiple platforms.  Here are some of the most obvious.

Disparate source and target platforms.  The source platform ends up being different from the recovery platform.  Imagine having backed up a virtual instance on AWS, then needing to restore it on Google Cloud Platform.  Depending on exactly how the backup was taken, there would be problems restoring the image between environments, because virtual instances are created, stored and operated in different ways on each platform.  This is more than simply a problem at the level of the hypervisor, with differences around licensing, drivers, volume sizes, networking configurations and so on.  The disparity applies even more when we use platform-specific constructs like databases.  If the backup solution can’t recover cross-platform, then restore becomes a two-stage process – first recover to the original destination, then move the data between platforms.

Security.  Cloud providers (and on-premises solutions) each use different security models.  This makes it hard to move workloads between systems, but also creates challenges at an application level, because logical security design may include segmented security models, such as separate Active Directory domains.

Access to the Platform.  Exactly how is the backup and restore performed?  Cloud service providers like AWS and Azure use snapshots to secure data.  These are dependent on the native snapshot capability of the platform and a “create VM from snapshot” function for restore.  Cloud providers simply don’t provide access to the low-level storage functions that would make cross-platform backup and restore possible.  Snapshots also need to be synchronised with the application, otherwise the result is a “crash-consistent” copy, with the risk that application data may not be fully recoverable.
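
As a sketch of what “access to the platform” looks like in practice, the snippet below uses the AWS boto3 SDK to snapshot an EBS volume.  The volume ID and tags are placeholders, and the application quiesce step is shown only as a comment – how you flush and freeze the application depends entirely on the workload:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# 1. Quiesce the application first (flush buffers, lock tables or freeze the
#    filesystem).  Skipping this step yields only a crash-consistent copy.

# 2. Take the platform-native snapshot (placeholder volume ID).
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="App-consistent backup of orders-db",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [{"Key": "application", "Value": "orders-db"}],
    }],
)

# 3. Release the application and record the snapshot ID in the backup catalogue.
print(snapshot["SnapshotId"])

# Restore is the platform-specific reverse: create a volume (or VM) from the
# snapshot.  There is no low-level access that would let another cloud read it.
```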

Platform Constraints.  Here we are talking about the restrictions or constraints imposed by the platform.  For example, most public cloud providers charge for data egress from their platform.  Going all-in with one provider and then moving petabytes of backup data to another would simply not be cost-effective.  This might preclude the use of platform-native data protection, because the backup data isn’t easily portable.
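
To put the egress point into perspective, here is some back-of-the-envelope arithmetic.  The per-gigabyte rate is an assumption for illustration only – actual pricing varies by provider, region and volume tier:

```python
# Illustrative egress cost for moving backup data between clouds.
# The $0.05/GB rate is an assumption, not a quoted price.
backup_size_tb = 1024          # one petabyte expressed in terabytes
egress_cost_per_gb = 0.05      # assumed blended rate in USD

total_cost = backup_size_tb * 1024 * egress_cost_per_gb
print(f"Approximate egress cost: ${total_cost:,.0f}")   # roughly $52,000 for 1 PB
```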

Inventory.  Maintaining an inventory is probably the biggest challenge.  Active virtual machines, by their nature, are documented within an implicit inventory.  However, what about virtual machines, databases or other content that existed three, six or twelve months ago?  Somehow those data sources need tracking and associating with a source application.  This scenario becomes even more complex as we think about container-based workloads, where the application is encapsulated in a potentially short-lived wrapper.
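
A sketch of the kind of record that is needed – again purely illustrative Python, not a real product’s schema – might map an application to every data source and backup image it has ever produced, including instances that no longer exist:

```python
# Hypothetical application-centric inventory.  The point is that backup images
# stay attached to the application, even after the VM or container that
# produced them has been deleted.
inventory = {
    "orders-service": {
        "current_sources": ["k8s:prod/orders-api", "rds:orders-db"],
        "retired_sources": [
            {"id": "vm-web01", "decommissioned": "2018-11-30"},
        ],
        "backup_images": [
            {"image_id": "img-2018-11-01", "source": "vm-web01", "expires": "2019-11-01"},
            {"image_id": "img-2019-04-28", "source": "rds:orders-db", "expires": "2020-04-28"},
        ],
    },
}

# Twelve months on, "img-2018-11-01" is still findable via the application name,
# even though vm-web01 itself is long gone.
print(inventory["orders-service"]["backup_images"][0])
```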

There are, of course, also challenges around compliance, privacy and cost.  And don’t forget lock-in: legacy backup platforms committed their customers to staying with that solution forever, or at least until the backups were no longer required.

Strategies for Multi-Cloud

Now we know the issues, what strategies can we take to make data protection work for us?  Here are a few thoughts.

  • Separate O/S and Application Data.  This probably seems pretty obvious, but wherever possible, separate the application data from the operating system itself.  Virtual machines are bulky and awkward to move around, so move just the data instead.
  • Create portable backups.  Write backup data to a medium that can be moved around, centralised or replicated (object stores are a good example).
  • Self-referential backups.  Create self-referencing backups that contain both metadata and data, so each copy describes itself (see the sketch after this list).
  • Integrate deeply.  Use solutions that offer deep integration to get application data out of a platform effectively.
  • Treat backup as an infrastructure service.  Data protection is a service feature, just like DNS or authentication.
  • Maintain an application inventory.  Use tools that map applications and data to the containers that were used to deliver them.
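
As a sketch of the “portable” and “self-referential” points above (assuming boto3 and a placeholder bucket name), a backup could be written to an object store alongside a manifest that describes how to reconstitute it, so the copy is not tied to the platform that created it:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-backup-bucket"   # placeholder name

def write_portable_backup(app: str, image_id: str, data: bytes, chunk_refs: list) -> None:
    """Store the backup data and a self-describing manifest side by side."""
    s3.put_object(Bucket=BUCKET, Key=f"{app}/{image_id}/data", Body=data)
    manifest = {
        "application": app,
        "image_id": image_id,
        "format": "tar+zstd",          # assumption: any documented, portable format
        "chunk_refs": chunk_refs,      # everything needed to reconstitute the data
    }
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{app}/{image_id}/manifest.json",
        Body=json.dumps(manifest).encode(),
    )

# Because the object store can be replicated to, or read from, any platform,
# the backup (data plus metadata) moves as a single self-contained unit.
```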

The last point is probably the most under-served in the industry.  The site knowledge of the people running the infrastructure is used as a substitute for proper tooling, which risks exposing organisations to, at best, an inability to recover and, at worst, outright data loss.

The Architect’s View

The benefits of going multi-cloud have to outweigh the disadvantages.  It’s clear from this discussion that using multiple compute platforms can lead to a disparate and disjointed backup strategy that doesn’t bring protection solutions and data together in a single logical view.  But it’s more than just having a joined-up view.  Without mobility of backup data (which is just another piece of our data asset), we can’t make applications agile and this puts an artificial constraint on the ability to fully exploit multi-cloud.

One thing this post doesn’t touch on is having a data framework and strategy.  For example, do I need a joined-up process to search data across multiple platforms, which could be SaaS, PaaS or IaaS solutions?  We’ll cover this in another post, as part of the ongoing discussion on data management.

Copyright (c) 2007-2019 – Post #9EA3 – Brookend Ltd, first published on https://blog.architecting.it, do not reproduce without permission.