The Performance Requirements for Instant Restores

Chris Evans – Data Mobility, Data Practice: Data Protection, Enterprise, Virtualisation

Instant restores make light work of recovering applications in virtual server environments and help businesses meet their recovery time objectives.  To make the restored virtual instance truly useful, storage platforms need to have specific characteristics.  We look at those requirements and show how instant recovery can deliver more than just backup & restore. 

Background

Modern data protection has moved to a model where snapshots are used to capture data for possible future recovery.  Whether at the hypervisor or cloud instance layer, the process is essentially the same – take a copy of the virtual volumes mounted to a virtual machine or instance.  When the time comes to restore, the snapshot provides a “crash copy” of the running application; ideally, the snapshot process also quiesces running applications and flushes buffered data before the copy is taken, so the image is application-consistent.  The benefit of an instant restore lies initially in meeting the recovery time objective, enabling the business to get back up and running within minutes rather than hours.
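
To make the idea concrete, here is a minimal Python sketch of that quiesce-then-snapshot flow, assuming a Linux guest where fsfreeze is available; the take_hypervisor_snapshot() call is a hypothetical placeholder for whatever API the hypervisor actually exposes.

    # Minimal quiesce-then-snapshot sketch.  Assumes a Linux guest with fsfreeze;
    # take_hypervisor_snapshot() is a hypothetical placeholder, not a real API.
    import subprocess
    from contextlib import contextmanager

    @contextmanager
    def quiesced(mount_point: str):
        """Flush buffered writes and freeze the filesystem for the snapshot window."""
        subprocess.run(["fsfreeze", "--freeze", mount_point], check=True)
        try:
            yield
        finally:
            # Always thaw, even if the snapshot call fails.
            subprocess.run(["fsfreeze", "--unfreeze", mount_point], check=True)

    def take_hypervisor_snapshot(vm_name: str) -> str:
        raise NotImplementedError("placeholder for the hypervisor's snapshot API")

    def application_consistent_snapshot(vm_name: str, mount_point: str) -> str:
        # Quiesce first, so the point-in-time copy is application-consistent
        # rather than just a crash copy.
        with quiesced(mount_point):
            return take_hypervisor_snapshot(vm_name)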

A correctly implemented backup and storage solution should ensure that only the deltas are retained with each snapshot, that is, the changed blocks since the last copy was taken.  Behind the scenes, metadata aggregation can also present each backup as an individual (synthetic) full copy, depending on the application requirements.
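
As an illustration of the principle (not any vendor's on-disk format), the toy Python model below stores only the changed blocks for each snapshot and shows how a metadata walk can materialise any point in time as if it were a full copy.

    # Toy model of delta snapshots: each snapshot holds only changed blocks, and
    # the walk below rebuilds a "synthetic full" for any point in the chain.
    from typing import Dict, List

    Block = bytes
    Snapshot = Dict[int, Block]          # block number -> changed data

    def synthesize_full(chain: List[Snapshot], upto: int, nblocks: int) -> List[Block]:
        """Rebuild a full image from a base-plus-deltas chain; the newest delta wins."""
        image: List[Block] = [b"\x00" * 4096] * nblocks
        for snap in chain[: upto + 1]:        # replay base, then each delta in order
            for blkno, data in snap.items():
                image[blkno] = data
        return image

    # Example: a base full copy, then two deltas that each change one block.
    base = {i: b"A" * 4096 for i in range(4)}
    delta1 = {2: b"B" * 4096}
    delta2 = {0: b"C" * 4096}
    restored = synthesize_full([base, delta1, delta2], upto=2, nblocks=4)
    assert restored[0][:1] == b"C" and restored[2][:1] == b"B"

The same metadata walk is what allows the backup platform to present each recovery point as a full copy without duplicating unchanged blocks.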

Recovery

A complete virtual instance restore might be required for several reasons, not all of which are data-loss specific.  These could include:

  • Data recovery due to corruption – data has been lost due to some corruption issue, perhaps in application code.
  • Data recovery due to ransomware – a ransomware or malware attack has encrypted data.
  • Data recovery due to user error – a user mistake has deleted records that can’t be easily recovered through the application.
  • Data validation – a restore is required to check for malware before confirming the backup image is a “good copy”.
  • Failed upgrade – an application upgrade has failed and needs to be reverted.
  • Data extraction – data needs to be extracted for analytics purposes, and the production image must be kept separate to avoid performance impacts or perhaps to obfuscate personal information in the process of analysis.
  • Test/Development seeding – an image copy is required to push into development or acceptance testing of new code. 

The interesting aspect we see here is that the recovery process extends the use of point-in-time data in a way that would have been cumbersome or impractical in the days before virtualisation.  Even with virtualisation, instant restores need disk-based backup to be truly practical for everyday use.  This is because the virtualisation process essentially randomises data on disk or flash media – the “I/O blender” effect.  Note: some degree of mitigation on some hypervisors is possible, for example, VMware vSphere vVols treats each virtual machine as an independent set of volumes rather than a blended datastore.

Another characteristic of the restore scenarios is the temporary nature of the data.  A restored VM might be needed for only a few minutes or a few hours.  Conversely, if the use case is for application diagnostics, the recovered copy could be needed for days or weeks.  In the case of a ransomware attack, the recovered image could be required indefinitely. 

Workflow

Figure 1 shows a typical workflow for virtual machine recovery.  Either the virtual machine or datastore holding virtual machines is mapped to the hypervisor.  Individual VMs are then imported back into the hypervisor inventory and can be migrated onto production storage if the VM is to be kept long-term.  In this workflow, we’ve assumed that data will be accessed through the application interfaces for a running virtual instance, but it’s equally possible to mount a volume directly to another host and access the data that way. 

Figure 1 – Restore Options
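
The workflow can also be read as a short orchestration script.  The Python sketch below mirrors the steps in Figure 1; every function in it is a hypothetical stub standing in for the backup platform's or hypervisor's real API, so the flow, rather than the calls, is the point.

    # Sketch of the Figure 1 workflow.  All functions are hypothetical stubs.
    def present_backup_as_datastore(backup_id: str) -> str:
        print(f"exporting backup {backup_id} as an NFS/iSCSI datastore")
        return f"restore-ds-{backup_id}"

    def mount_datastore_on_hypervisor(datastore: str) -> None:
        print(f"mounting {datastore} on the hypervisor cluster")

    def register_vm(datastore: str, vm_name: str) -> str:
        print(f"importing {vm_name} from {datastore} into the inventory")
        return vm_name

    def power_on(vm: str) -> None:
        print(f"powering on {vm}")

    def storage_migrate(vm: str, target: str) -> None:
        print(f"migrating {vm} onto {target}")   # e.g. Storage vMotion to primary storage

    def instant_restore(backup_id: str, vm_name: str, keep_long_term: bool) -> None:
        ds = present_backup_as_datastore(backup_id)
        mount_datastore_on_hypervisor(ds)
        vm = register_vm(ds, vm_name)
        power_on(vm)                             # the application is usable at this point
        if keep_long_term:
            storage_migrate(vm, target="production-datastore")
        # otherwise the VM keeps running from secondary storage and is discarded later

    instant_restore("2022-06-01-0300", "app-vm-01", keep_long_term=True)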

We generally assume that secondary storage is slower and cheaper than primary storage.  However, when using backups for instant restores, the secondary system needs the ability to act like primary storage for the lifetime of the recovered VM.  This aspect of restore introduces some interesting assumptions on the cost and delivery of secondary platforms compared to primary ones.

Requirements

So, what specific requirements should we expect of secondary storage and the management of instant restores?

  • Metadata management – The backup solution (including secondary storage) must be capable of efficiently manipulating backup images and presenting them back to the production platform in a way that can easily be used by end users.
  • Responsive QoS – the storage platform needs the capability to apply quality of service to the data that comprises the restored virtual instance.  This process must be as granular and instantaneous as possible, which could mean “warming up” a fast tier of storage with a virtual image (a simple per-VM rate-limiting sketch follows this list). 
  • Discardable – images should be easily discardable once the recovery timeline is completed.  Once again, this process needs good metadata management and dynamic storage tiering to place the right blocks of data in the right place at the right time. 
  • Efficiency – restores must be efficient – essentially “thin restores” to match “thin snapshots”.  If instant restores are to form part of production processes, then the additional storage overhead must be minimised.
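
On the QoS point, the sketch below shows one simple way a per-VM IOPS budget could be modelled – a token bucket that refills continuously, so a busy restored instance cannot starve other workloads sharing the secondary platform.  It is illustrative only; real arrays enforce this kind of limit in the data path.

    # Token-bucket model of per-VM quality of service: each restored instance
    # gets an IOPS budget that refills over time.  Illustrative only.
    import time

    class IopsBucket:
        def __init__(self, iops_limit: float, burst: float):
            self.rate = iops_limit          # tokens (I/Os) added per second
            self.capacity = burst           # maximum burst size
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self, cost: float = 1.0) -> bool:
            """Return True if the I/O may proceed now, False if it should be queued."""
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

    # Example: cap a restored VM at 500 IOPS with a burst of 100 I/Os.
    restored_vm_qos = IopsBucket(iops_limit=500, burst=100)
    admitted = sum(restored_vm_qos.allow() for _ in range(1000))
    print(f"admitted {admitted} of 1000 back-to-back I/Os")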

Boundaries

As we start to think about the needs of secondary storage and virtual instances, we can see that the requirements align nicely with primary storage solutions.  This shouldn’t be a surprise because virtualisation and abstraction, in general, have created randomised, de-duplicated storage profiles.  As a result, primary and secondary storage could simply be the same solution, just implemented with different classes of storage media and (obviously) as two separate platforms to create distinct fault domains. 

The Architect’s View®

The concept of instant restores is a powerful one.  With it, we can easily recover an application to any point in the past, limited only by the number of snapshots that have been retained.  Unlike some backup implementations in the public cloud, where snapshots stay on the same infrastructure they protect, once the snapshot image sits on a physically distinct platform, the copy is a true backup.  With the right storage implementation, instant restores become a data workflow tool rather than just a backup/restore operation. 

Could we apply the same process to data backed up and restored from a containerised application in Kubernetes?  It’s totally possible, although the mechanics of introducing an external PV to match a PVC are still a little clumsy.  However, this use case could be where instant restores shine because the application component would also be effectively instant.  Just imagine the ability to create thousands of instant application clones and the use cases that could create….
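
For reference, the Kubernetes snapshot-restore pattern looks something like the sketch below: a new PersistentVolumeClaim declares a dataSource pointing at a VolumeSnapshot, and the cloned application simply mounts that claim.  The names and sizes are illustrative; the manifest shape follows the standard CSI snapshot API.

    # Build a PVC manifest that restores from an existing VolumeSnapshot.
    # Claim name, snapshot name, size and storage class are illustrative values.
    import json

    def pvc_from_snapshot(claim_name: str, snapshot_name: str, size: str, storage_class: str) -> dict:
        return {
            "apiVersion": "v1",
            "kind": "PersistentVolumeClaim",
            "metadata": {"name": claim_name},
            "spec": {
                "storageClassName": storage_class,
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": size}},
                "dataSource": {
                    "apiGroup": "snapshot.storage.k8s.io",
                    "kind": "VolumeSnapshot",
                    "name": snapshot_name,
                },
            },
        }

    # kubectl apply -f - will accept this JSON directly.
    print(json.dumps(pvc_from_snapshot("app-restore-pvc", "app-snap-0300", "100Gi", "fast-csi"), indent=2))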

Copyright (c) 2007-2022 – Post #fd87 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.