The Great Cloud Repatriation Debate – Data Storage

Chris Evans

In the first post of this short mini-series, we looked at the issues of compute repatriation from the public cloud back on-premises.  In this post, we examine the challenges involved in getting data back to core data centres, especially with increased volumes of edge computing either in play or on the way.

Background

Compared to compute, managing data storage represents a unique challenge across the IT landscape.  Data is the asset of value to an organisation, while compute is merely the engine to process it (we’ll discuss applications separately in a follow-up post).  Businesses need to retain data much longer than the infrastructure on which it resides.  In regulated industries, this requirement could be the lifetime of a customer (or patient) plus another 30 years. 

Data (ideally) should exist as a single copy, or, if copies are being taken, the business should understand which is the “golden copy” and which are forks that can, at some point, be deleted.  At the same time, data grows, so in the future, a fork may become a primary copy.

Our data must be securely protected, free from malware and isolated from ransomware, and it must withstand hardware and software failures as well as user error.  The demands on storage and data administrators are, therefore, significant.

Storage and Data

Modern IT environments offer a range of data storage media, both in the public cloud and on-premises.  Generally (although not exclusively), block-based protocols are used for applications with low latency requirements.  On-premises, this may be shared SAN storage or HCI, whereas in the public cloud, solutions such as AWS EBS (Elastic Block Store) and GCP Persistent Disk are offered.

Although this is another generalisation, block storage is typically closely coupled with the compute platform, so it is rarely replicated directly between on-premises and the public cloud.  As a rule of thumb, it’s easier to move an application or an entire virtual image than to replicate individual block devices.  In fact, the public clouds don’t offer any capability to access block storage directly from outside the cloud infrastructure (without a lot of additional work).

Unstructured content, in the form of file and object storage, represents a much greater challenge to manage, as the volumes of data are usually much higher than block storage.  Unstructured data is the growth area for businesses and will dwarf the volume of structured content being retained by companies.  As a result, the remainder of our discussion will focus on this area.

Requirements

Ultimately, businesses want to do two main things with data.

  • Store it – data isn’t always actively used, so it needs to be retained for future use.  This includes primary data, secondary copies (in the form of backups or for post-processing) and archive, where the data is retained for extended periods of time. 
  • Process it – Why keep data if you don’t plan to use it?  There are regulatory requirements, but obviously, at some point, data will be processed as part of normal business activity. 

These two requirements align with how we view the best way to store data.  When storage is the focus, capacity and cost become our priorities.  When we process data, performance and cost become more important.  As a result, over the last 60 years, data has lived in a hierarchy of storage solutions, from tape to persistent memory.

Inertia

Getting in the way of our optimal solution is physics and, specifically, the speed at which we can move data between geographic locations.  The challenges of data inertia (which some people incorrectly call data gravity) are well known.  The gravity aspect applies to scenarios with such massive quantities of data that the inertia cost of moving it around flexibly is too high.  So, the data “attracts” applications to it rather than the other way around.
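
As a rough, back-of-the-envelope illustration of that inertia, the sketch below calculates how long a bulk transfer between locations would take.  The 100TB data set, link speeds and 70% sustained utilisation are assumptions chosen purely for illustration.

```python
# Rough transfer-time arithmetic for bulk data movement between locations.
# The 100 TB data set, link speeds and 70% sustained utilisation are
# illustrative assumptions, not figures from any specific environment.

def transfer_days(dataset_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Days needed to move dataset_tb terabytes over a link_gbps link."""
    bits = dataset_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)   # sustained throughput
    return seconds / 86_400

for link in (1, 10, 100):                             # Gbit/s
    print(f"100 TB over {link:>3} Gbit/s: {transfer_days(100, link):5.1f} days")
```

At roughly two weeks for 100TB over a 1Gbit/s link, it’s easy to see why large data sets “attract” applications rather than moving to them.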

There is a cost involved when moving data around storage infrastructure, both financial and environmental, so any data movement needs to represent useful work.  The financial challenge when transferring data in and out of the public cloud (and processing it) is also different to on-premises.  Cloud service providers (CSPs) charge for retention (capacity), egress (moving out of a platform) and access (I/O operations).  Each CSP also has a unique charging structure, with biases towards one of the three metrics.  However, some may not actually have access or egress charges.
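
As a hedged sketch of how those three charging metrics combine (the unit prices below are hypothetical placeholders, not any particular CSP’s published rates), a monthly storage bill can be modelled as:

```python
# A simple model of a monthly cloud storage bill, built from the three metrics
# described above: capacity retained, egress and access (I/O) requests.
# All unit prices are hypothetical placeholders, not real CSP pricing.

def monthly_storage_cost(capacity_gb: float, egress_gb: float, requests: int,
                         price_per_gb_month: float = 0.02,
                         price_per_gb_egress: float = 0.09,
                         price_per_1k_requests: float = 0.0004) -> float:
    return (capacity_gb * price_per_gb_month                  # retention (capacity)
            + egress_gb * price_per_gb_egress                 # egress (moving data out)
            + (requests / 1_000) * price_per_1k_requests)     # access (I/O operations)

# Example: 50 TB retained, 5 TB repatriated on-premises, 10 million requests.
print(f"${monthly_storage_cost(50_000, 5_000, 10_000_000):,.2f}")
```

Repatriating data shifts which of the three terms dominates the bill, which is why each CSP’s bias towards capacity, egress or access charges matters when planning data movement.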

Ultimately, the best I/O is the one we don’t have to do.  With that in mind, does this influence our architectural decisions?

Abstraction

To answer that question, we must ask what data really is.  At the simplest level, it’s a series of bits and bytes that, semantically, have meaning.  Just as virtual servers have become an abstract concept, with paravirtualisation introducing virtual devices that have no physical-world equivalent, data is an abstract concept separate from storage.

Physical storage is simply the current location where data is stored.  We’ve abstracted the storage of data so much that file systems can exist in system memory, be emulated on object stores, and be kept on multiple physical devices at the same time (through tiering).
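
One way to see this abstraction in practice is with a library such as Python’s fsspec, which presents the same file interface whether the backing store is local disk, system memory or an object store.  The snippet below is only an illustration and assumes the fsspec package is installed (the object store line additionally needs s3fs and credentials, so it is left commented out).

```python
# The same file-system calls work against very different physical backends.
import fsspec

mem = fsspec.filesystem("memory")                 # a file system held in RAM
with mem.open("/demo/report.csv", "wb") as f:
    f.write(b"region,capacity_tb\nedge-01,12\n")

local = fsspec.filesystem("file")                 # ordinary local disk
# s3 = fsspec.filesystem("s3")                    # object store presented as files

print(mem.ls("/demo"))                            # data looks the same regardless of media
```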

It makes sense then that we should treat data with the same level of abstraction we use for application “containers” like virtual instances and Kubernetes.  Metadata describes what we’re storing; the physical contents live on whatever media delivers the best capacity/performance/cost ratio for the access pattern required.  This brings us to the conclusion that data must move around infrastructure to remain optimised.  The big question is where the data moves to/from, how much we move and how often.
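
A minimal sketch of that metadata-driven view might look like the following, where the tier names, thresholds and access-pattern fields are hypothetical and chosen purely for illustration:

```python
# Data is described by metadata; its physical location is just today's answer
# to a capacity/performance/cost question. Tier names and thresholds below are
# hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    owner_app: str                 # context: which application/process owns it
    size_tb: float
    reads_per_day: int
    current_tier: str = "unplaced"

def choose_tier(ds: DataSet) -> str:
    """Pick a tier from the access pattern, not from where the data sits today."""
    if ds.reads_per_day > 10_000:
        return "nvme-flash"        # performance-biased placement
    if ds.reads_per_day > 100:
        return "object-standard"   # balanced capacity/cost
    return "object-archive"        # capacity/cost-biased, rarely read

ds = DataSet("telemetry-2023", "edge-ingest", size_tb=40, reads_per_day=25)
ds.current_tier = choose_tier(ds)
print(ds)
```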

Practicalities

OK, so we’ve talked a lot of theory; what about the practical concept of data mobility?  First, let’s address the issue of whether we copy or move.  If data is being used in a read-only mode, then copying is perfectly reasonable.  If there is some degree of change, then we must consider whether those updates can be tracked and re-aligned with the primary copy.  If the rate of change is expected to be substantial, then data should be moved rather than copied.  However, this decision is a spectrum of choices with no right (or wrong) answer.
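
That copy-versus-move spectrum can be framed as a simple policy.  The sketch below uses an arbitrary 10% daily change-rate threshold; in practice, each business would pick its own cut-off.

```python
# Decide whether a data set should be copied or moved to a new location.
# The 10% daily change-rate threshold is an arbitrary, illustrative value.

def copy_or_move(read_only: bool, daily_change_rate: float,
                 changes_can_be_resynced: bool) -> str:
    if read_only:
        return "copy"      # no divergence from the primary copy to worry about
    if changes_can_be_resynced and daily_change_rate < 0.10:
        return "copy"      # updates can be tracked and re-aligned later
    return "move"          # high churn: keep a single primary copy

print(copy_or_move(read_only=False, daily_change_rate=0.25,
                   changes_can_be_resynced=True))   # -> "move"
```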

Next, we should consider data pipelines and flows within a business.  Modern data flows can be very fluid, with data coming from edge or remote locations into core data centres and then potentially being processed on-premises or in the public cloud.  Data fluidity is a fact of life for modern enterprises.

Strategies

All the points we’ve discussed so far bring us to the following conclusions.

  • Data must be abstracted from the storage on which it resides.  Storage isn’t the value of the data, just a place to keep it today (or tomorrow). 
  • Data needs enough context to ensure that, when it moves around, we can readily identify the source application or process. 
  • We need good (and efficient) tools to move data around and process it.  Equally, when data is transferred out of a file system structure, we should move the entire data set, not individual files or folders. 
  • We need a greater understanding of cost models and the implications of data movement, including visibility of the sustainable aspect of mass data movement.
  • Every business should have a data map showing where data resources are located, consolidating what can be done into strategic locations and solutions (a minimal sketch of such a map follows this list). 
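
As a final illustration, the data map in that last point could start life as something as simple as a catalogue of entries like the one below; the asset names, locations and retention periods are hypothetical examples.

```python
# A minimal data-map entry: where each data asset lives, who owns it and which
# copy is the golden one. All field values are hypothetical examples.
from dataclasses import dataclass

@dataclass
class DataMapEntry:
    asset: str
    locations: list[str]           # every place a copy currently exists
    golden_copy: str               # which of those locations is authoritative
    owner: str
    retention_years: int
    size_tb: float

data_map = [
    DataMapEntry("customer-records", ["dc-core-01", "aws-s3-eu"], "dc-core-01",
                 owner="crm", retention_years=30, size_tb=8.5),
    DataMapEntry("iot-telemetry", ["edge-sites", "gcp-archive"], "gcp-archive",
                 owner="ops", retention_years=7, size_tb=220.0),
]

for entry in data_map:
    print(f"{entry.asset}: golden copy in {entry.golden_copy}, "
          f"{len(entry.locations)} location(s), {entry.size_tb} TB")
```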

The Architect’s View®

This post has evolved into more of a walk-through of ideas than any specific strategy that can be applied consistently to all businesses.  There’s no one solution to solve all problems, but there are vendors with good technology out there.  The ultimate conclusion of the points raised in this post is that every enterprise organisation needs a data strategy, but the specifics of that strategy will be unique to that business.  There is no generic model to apply.  However, knowing where your data assets are, quantifying them and optimising them seems like an excellent first step. 


Copyright (c) 2007-2023 – Post #e341 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.