Technology fads come and go in this world – remember 3D and curved televisions? One part of data management I’ve been pondering recently is CDM, or Copy Data Management. Is this a product, a feature, or just a complete myth?
It’s worth setting the scene for a moment and looking at what created the need for CDM in the first place. Data has value to businesses. First and foremost, it is used to deliver primary services. This could be structured content in databases or unstructured content in file systems and object stores.
Both the content and value of data changes over time as part of normal business processes. We protect data through backups and replication in order to recover from equipment failure, software bugs and both malicious and unintended corruption or deletion.
The use of data isn’t a straight-line process. Occasionally we take forks in the road and spin off secondary copies for a range of uses. This can be to seed test/dev environments, putting into sandboxes to test software upgrades or for analytics purposes (and more).
The challenge with secondary data copies is minimising the amount of additional storage space they consume. Estimates for the duplication of data claim anything up to 9 times as many copies of data are created compared to the original primary copy (and worse in certain industries). That represents a lot of additional waste if the process of managing these copies isn’t implemented effectively.
Where primary data is likely to have an (almost) infinite lifetime, secondary copies might be needed only for hours, days or weeks. There may be a requirement to maintain copies as read-only instances or to make them writeable. A backup or archive of a copy may also be needed after the secondary use has been completed.
Copy Data Management, as a feature, steps in to optimise the creation and lifetime management of secondary data copies. The aim is to ensure that data is retained only for as long as necessary and that the additional storage required to keep those copies is the minimum possible.
There are also ancillary benefits to using a CDM process. The most obvious is that of data security and compliance. Using a CDM workflow allows IT organisations to assign owners to data from beginning to end, ensuring that both the use and users of secondary data can be easily identified and tracked.
Here’s where things start to get a little opaque. Any data that contains personally identifiable information (PII) is going to need management with robust processes that maintain protection of that content. Security settings can’t be any lower than production standards; however, this hampers the work of developers.
This is where obfuscation comes in, through processes like ETL (Extract/Transform/Load) that mask personally identifiable information before it reaches developers. An ETL process allows test data to operate under a much lighter security regime and to be distributed more widely without the fear of releasing personal information. So any data taken from production systems will need to go through a cleaning process to make it suitable for use in test/dev.
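As a minimal sketch of the masking idea described above: during the "transform" stage of an ETL pipeline, PII fields can be replaced with stable, irreversible tokens before the copy lands in test/dev. The field names and hashing approach here are illustrative assumptions, not a specific product's behaviour.

```python
import hashlib

# Hypothetical PII fields to obfuscate; a real pipeline would derive
# these from a data classification catalogue, not a hard-coded set.
PII_FIELDS = {"name", "email", "phone"}

def mask_value(value: str) -> str:
    """Replace a PII value with a stable, irreversible token."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def transform_record(record: dict) -> dict:
    """Return a copy of the record with PII fields obfuscated."""
    return {
        k: mask_value(v) if k in PII_FIELDS else v
        for k, v in record.items()
    }

production_row = {"id": 42, "name": "Jane Doe", "email": "jane@example.com"}
test_row = transform_record(production_row)
```

Using a hash rather than random values keeps the masking deterministic, so joins between masked tables still line up – useful when seeding a whole test environment from several related production extracts.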
So, does CDM exist as a process, feature or product? Firstly, depending on the use case, I think copy data management as a process has significant value. There are three main scenarios in which copies of data are needed.
- Operational – this includes reasons to make copies for operational purposes like testing software upgrades, identifying bugs or BC/DR processes. Copies are short-lived and usually destroyed on completion or validation of the task.
- Application – copies are taken as part of normal application workflow and used to import data into secondary applications, such as reporting tools or data warehouses and data lakes. The lifetime of the data is dependent on the use case but could be short-lived for import tasks or long-lived for secondary application uses.
- Development – copies are created to help developers test new software, upgrades and bug fixes. The process of creating copies may involve extracting a “golden master” that is then obfuscated and used for all subsequent tertiary copies. Replicas will have a finite lifetime, could be refreshed frequently and archived for future use.
I’ve excluded mentioning backup in these requirements because data protection for backup is generally done to recover production rather than for secondary use cases. Obviously, backup can be used as part of a copy data management process (discussed in a moment) and will be needed to re-protect copies that themselves become “primary”.
How much of the above work is process and how much is simply about reducing the number of copies? Operational and application scenarios will probably see data sit within the same physical infrastructure and under the same security rules as production systems. Of course, we have to be careful here, as it’s possible to impact production performance when too many copies of data share the same source.
Development use cases will need to go through obfuscation before being copied again. This could also mean moving data to another platform that is cheaper than production, including public cloud.
CDM could be implemented without the need for an entire product; instead, more rigorous process can be put in place, using strategies such as the following:
- Time-limit the creation of copies that aren’t directly used for production applications.
- Identify business owners for all copies, and charge those owners for the data they consume.
- Make use of space-efficient snapshots where possible or de-duplicating storage.
- Make use of backup tools or other copy processes where snapshots are impractical (such as moving copies to another platform).
The last bullet point highlights the need to offer multiple “data sources”. These could be platform snapshots, data dumps or traditional backups. Each can provide a different vector, such as speed of creation, efficiency or data mobility.
Should end-users be given the ability to create secondary copies via APIs or automatically through GUIs? Personally, I think this is an important feature to offer and probably the one area where a specific CDM tool adds value. A managed API or GUI provides the ability to track the owners of copied data and to limit both the volume of copies and the security around them.
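To illustrate what a managed self-service entry point might enforce, here is a sketch of a "request a copy" function that records the owner of every copy and caps copy sprawl with a per-owner quota. The quota value, registry and naming scheme are all hypothetical assumptions for the sketch.

```python
# Illustrative per-owner quota; a real tool would make this policy-driven.
MAX_COPIES_PER_OWNER = 3

# In-memory registry mapping owner -> list of copy names they hold.
copy_registry: dict[str, list[str]] = {}

def request_copy(owner: str, source: str) -> str:
    """Create a tracked secondary copy for an owner, enforcing the quota."""
    existing = copy_registry.setdefault(owner, [])
    if len(existing) >= MAX_COPIES_PER_OWNER:
        raise RuntimeError(f"{owner} has reached the copy quota")
    copy_name = f"{source}-copy{len(existing) + 1}"
    existing.append(copy_name)
    return copy_name
```

Because every copy passes through a single entry point, the registry doubles as an audit trail: the volume of copies is bounded, and each one is attributable to an owner.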
A self-service tool also offers the ability to enforce ETL processes and prevent the inappropriate use of production data. It would be interesting to see data protection solutions build in features that let non-authorised users recover only obfuscated versions of production data, while truly authorised users can recover the original.
The Architect’s View
The premise of this post was to ask whether CDM is a process, product or myth. In most respects, I think CDM is a process, one that can be enabled by tools if the amount of data management involved is significant. A tool provides the ability to add metadata in order to track the lifetime of copies from creation to destruction. Is CDM a myth? In one respect, I think it is. Data has a journey within organisations, and that journey has always needed management. Copy Data Management is just another task that forms part of an overall data management strategy, and in that respect it always has been and always will be needed.
Copyright (c) 2007-2019 Brookend Ltd, no reproduction without permission, in part or whole. Post #47H2.