Conflating Data Protection and Data Mobility

Chris Evans | Cloud, Data Management, Data Mobility

Sitting on the sidelines, I’ve been watching some of the traffic and communications coming from VeeamON this week. Chris Mellor just posted a quick article summing up the ambitions of Veeam as a company. It struck me that we continue to conflate the ideas of data protection and data mobility. To me, these are two totally different concepts that really shouldn’t be associated with each other. Here’s why.

Aims & Goals

Firstly, data protection is there to do exactly as it says – protect our data. That means continued access, even in the event of hardware failure, software bugs, user error or malicious attacks. We measure how quickly we can restore service as RTO (Recovery Time Objective) and hope that for some systems this is zero. We measure how much data we can afford to lose as RPO (Recovery Point Objective) and again, if it’s our bank details, we hope this value is zero.
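
As a rough illustration of how those two metrics relate to a real incident (my own sketch, nothing vendor-specific), the achieved RPO and RTO fall out of three timestamps: the last recovery point, the moment of failure and the moment service comes back:

```python
from datetime import datetime

def achieved_rpo_rto(last_recovery_point, failure_time, service_restored):
    """Return the (RPO, RTO) actually achieved for a single incident.

    RPO: data written after the last recovery point is lost.
    RTO: elapsed time before the service is usable again.
    """
    rpo = failure_time - last_recovery_point
    rto = service_restored - failure_time
    return rpo, rto

# Example: hourly backups, failure at 10:45, service back at 11:15
rpo, rto = achieved_rpo_rto(
    datetime(2020, 5, 20, 10, 0),   # last backup completed
    datetime(2020, 5, 20, 10, 45),  # failure occurs
    datetime(2020, 5, 20, 11, 15),  # service restored
)
print(rpo)  # 0:45:00 of updates potentially lost
print(rto)  # 0:30:00 of downtime
```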

The purpose of data protection is to provide business continuity. The technique relies on taking facsimiles of our production data and storing them elsewhere, so that we can restore them if required. I deliberately didn’t use the terms snapshots, copies or replicas here, because these are implementation-specific terms. How we make the facsimile comes down to a range of variables, not least of which is cost.

One key imperative of any data protection scheme is that it provides a point-in-time reference to our production data. If the protection is done via physical media (such as shared storage), generally this is a time-based reference point. In some cases, the reference point is transactional. Sometimes it’s a combination of the two, taken as one (a snapshot) and recovered as another (a transaction point).
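
As a sketch of that combined approach (the data structures here are invented purely for illustration), recovery restores the most recent snapshot and then rolls forward through the transaction log to the chosen point:

```python
def recover_to_transaction(snapshot, transaction_log, target_txn_id):
    """Restore a snapshot, then roll forward to a specific transaction.

    snapshot: dict representing the dataset at the snapshot point
    transaction_log: ordered list of (txn_id, key, value) committed after it
    target_txn_id: the transaction point we want to recover to
    """
    state = dict(snapshot)  # start from the point-in-time copy
    for txn_id, key, value in transaction_log:
        if txn_id > target_txn_id:
            break           # stop at the requested transaction point
        state[key] = value  # apply each committed change in order
    return state

# Hourly snapshot plus a log of later transactions
snap = {"balance:alice": 100}
log = [(101, "balance:alice", 90), (102, "balance:alice", 75)]
print(recover_to_transaction(snap, log, target_txn_id=101))
# {'balance:alice': 90} – the snapshot rolled forward to transaction 101
```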

Data mobility works towards a different goal. It aims to make data and applications accessible wherever we want them to be. These days, data accessibility isn’t a problem. Networking is ubiquitous. With the right security credentials, I can reach any device. Of course, the data or application protocols may not work over distance. I don’t recommend connecting an iSCSI LUN over hundreds of miles (although I’ve seen it done) or using a remote SQL client to insert records into a database (which I’ve also seen). So effective data mobility, which retains performance and doesn’t impact availability, is hard, because we have to overcome the latency problem.
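
To put a rough number on the latency problem (back-of-envelope figures of my own, assuming roughly 5 microseconds per kilometre in fibre), a synchronous protocol can’t complete operations any faster than the round-trip time allows:

```python
def max_sync_ops_per_second(distance_km, per_op_round_trips=1):
    """Upper bound on synchronous operations/second over a WAN link.

    Assumes ~5 microseconds per km each way in fibre and ignores
    switching, queueing and protocol overhead, so real-world figures
    will be worse than this.
    """
    one_way_s = distance_km * 5e-6
    round_trip_s = 2 * one_way_s
    return 1 / (round_trip_s * per_op_round_trips)

# A LUN mounted ~1,000 km away: at best ~100 serialised writes per second,
# versus tens of thousands on local storage.
print(round(max_sync_ops_per_second(1000)))
```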

Copy Data Management

One solution to making data globally available has been Copy Data Management or CDM.  Copies taken from snapshots or replicas are frequently used to seed other environments, including moving data from one location to another.  CDM keeps on top of these replicas over their lifetime, ensuring they are kept on the right type of storage and retained only as long as necessary.  From a physical perspective, techniques like de-duplication ensure only the minimum amount of storage space is used in storing successive copies.
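
As an illustration of the lifecycle side of CDM (a simplified sketch, with policy values invented for the example), each secondary copy is periodically assessed, demoted to cheaper storage as it ages and expired once it’s no longer needed:

```python
from datetime import datetime, timedelta

# Example policy only: retain copies for 30 days, demote to cheap storage after 7
RETENTION = timedelta(days=30)
DEMOTE_AFTER = timedelta(days=7)

def manage_copy(copy, now):
    """Decide what to do with a single secondary copy."""
    age = now - copy["created"]
    if age > RETENTION:
        return "expire"                 # no longer needed, reclaim the space
    if age > DEMOTE_AFTER and copy["tier"] == "performance":
        return "move-to-capacity-tier"  # keep it, but on cheaper storage
    return "retain"

copies = [
    {"id": "dev-refresh-01", "created": datetime(2020, 4, 1), "tier": "performance"},
    {"id": "analytics-02", "created": datetime(2020, 5, 18), "tier": "performance"},
]
now = datetime(2020, 5, 21)
for c in copies:
    print(c["id"], "->", manage_copy(c, now))
```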

Concurrency

The problem with solutions like CDM is that they are secondary copies, which, as soon as they are taken, no longer represent production. Any processing we do on these copies is intrinsically inaccurate. How inaccurate depends on the rate of update of the data and the frequency of copying. Doing trend analysis on data that’s a month or more old may be fine. Doing fraud analysis on the same data is probably an issue due to the time-critical nature of the process.
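
A rough way to quantify that inaccuracy (my framing, with made-up numbers): the worst-case staleness of a secondary copy is the copy interval plus the time the copy takes, and the drift is that window multiplied by the update rate:

```python
def worst_case_drift(copy_interval_hours, copy_duration_hours, updates_per_hour):
    """Estimate how many updates a secondary copy can be missing.

    Just before the next copy lands, the current one is
    (interval + duration) hours behind production.
    """
    staleness_hours = copy_interval_hours + copy_duration_hours
    return staleness_hours * updates_per_hour

# Nightly copies of a system doing 10,000 updates/hour:
# up to ~250,000 updates are absent from the copy being analysed.
print(worst_case_drift(copy_interval_hours=24,
                       copy_duration_hours=1,
                       updates_per_hour=10_000))
```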

One alternative is to suspend a production application while data is moved. This means interrupting operations while the transition occurs. Another is to accept conflicts and use eventual consistency, where distributed data may not be entirely accurate. Yet another option is to use a globally consistent database like Google Spanner, with all the complexity that brings.
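
For the eventual-consistency option, here’s a minimal sketch of what “not entirely accurate” can mean in practice, using a last-writer-wins merge (my choice of policy, purely for illustration):

```python
def last_writer_wins(replica_a, replica_b):
    """Merge two replicas of the same records by keeping the newest write.

    Any update that loses the timestamp comparison is silently discarded,
    which is exactly the kind of inaccuracy eventual consistency accepts.
    """
    merged = dict(replica_a)
    for key, (value, ts) in replica_b.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

# The same customer record updated in two locations before synchronisation
site_1 = {"cust:42:email": ("old@example.com", 1000)}
site_2 = {"cust:42:email": ("new@example.com", 1005)}
print(last_writer_wins(site_1, site_2))
# site_1's update is discarded as far as the merged view is concerned
```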

Granularity

An additional problem with taking point-in-time copies of application data is that they represent the entire dataset at that point. Obviously, we can track changes at multiple levels between copies. This might mean at the block level (which is useful for little more than saving space), at the file level (which has some degree of structure) or at the application level. But the reason we take an entire copy is that we might need to recover the entire dataset. However, for data mobility, the only change tracking that is of any use is that which is transactionally consistent. We don’t really need to keep moving the entire dataset around all the time.
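
To show why moving only transactionally consistent changes matters (a sketch with invented numbers, not a measurement), compare shipping a full point-in-time copy against shipping just the committed transactions since the last synchronisation:

```python
def bytes_to_move(dataset_size_gb, txns_since_last_sync, avg_txn_bytes,
                  full_copy=False):
    """Compare moving a full point-in-time copy with moving only the
    transactionally consistent changes since the last synchronisation."""
    if full_copy:
        return dataset_size_gb * 1024**3
    return txns_since_last_sync * avg_txn_bytes

# A 2 TB database with 50,000 committed transactions (~1 KB each) since sync
full = bytes_to_move(2048, 50_000, 1024, full_copy=True)
delta = bytes_to_move(2048, 50_000, 1024)
print(f"full copy: {full / 1024**3:.0f} GiB, "
      f"transaction delta: {delta / 1024**2:.0f} MiB")
```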

What is My Point?

What does this have to do with VeeamON?  Chris’ article highlights the final stage Veeam is looking to achieve:

Automation: Veeam’s idea of nirvana in which data becomes self-managing, via data analysis, pattern recognition, and machine learning, and so automatically backed up, migrated to ideal locations, secured during anomalous activity, and recovered instantaneously.

This is a laudable target. If applied purely to secondary data, it makes sense. If this applies to production data, then we won’t get there from secondary copies. Why? Because the way we store and update data, whether as files, objects or records in a database, is fundamentally connected with how we manage it. Data protection has to be aligned structurally with the way we update data, not just how we store it on physical media. Otherwise, we end up keeping far more data than we need, or wasting huge cycles picking apart the data to understand structurally what needs to be kept and what can be thrown away.

The Architect’s View

I’ve argued that block is the wrong medium for persistent container storage, and I think that CDM isn’t the right approach for managing data mobility. Instead, if we want to protect, move and otherwise manage our data, we need to do it at the primary, not the secondary, level. From this perspective, I think that distributed file systems will prove more practical, but there’s still work to be done and an opportunity that’s being missed. Data could be more mobile, while kept in sync, if we directed our AI efforts at primary rather than secondary data.

All this, of course, depends on how mobile we need data to be. Will multi-cloud really be a requirement, or will IT organisations and businesses simply operate with their vendor of choice?

Comments are always welcome; please read our Comments Policy.  If you have any related links of interest, please feel free to add them as a comment for consideration.  

Copyright (c) 2007-2020 – Post #9EE0 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.