# Data-Centric Architectures – Data Mobility

Chris Evans

This is one of a series of posts looking at building a Data-Centric Architecture for the enterprise.  In this post, we start the discussion on data mobility, looking at the perfect model of data accessibility and the short-term compromises that need to be made based on today’s technology.

We’ve talked a lot about data mobility in recent years, with almost five years of conversations on the topic (here’s an early post).  In an ideal world, data would be ubiquitous and have no penalties for access.  We would simply point our application at some abstracted reference to a data source, and off we go.  In reality, though, life isn’t like that.

### Data Inertia

There are lots of challenges in making data globally available.  Physics puts roadblocks in the way by enforcing an important rule known as the speed of light, which limits how quickly signals can travel over both optical connections and copper wire.  The exact amount of latency incurred over distance depends on the transmission medium (e.g. optical fibre or copper) and the devices through which the traffic is routed.  Fibre optic cable introduces around 5µs of latency per kilometre, so a 200km round trip (100km each way) would have an overhead, at best, of one millisecond, plus additional travel time through the (storage) networking equipment.
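
The arithmetic is easy to sketch.  The snippet below is only an illustration: the 5µs per kilometre figure comes from the paragraph above, while the function name and the `equipment_us` parameter (a stand-in for delay added by switches and routers) are our own assumptions.

```python
# Best-case propagation delay over fibre, using the ~5 microseconds
# per kilometre figure quoted in the text.
FIBRE_LATENCY_US_PER_KM = 5.0


def round_trip_latency_ms(one_way_km: float, equipment_us: float = 0.0) -> float:
    """Return the round-trip latency in milliseconds.

    one_way_km   -- distance between application and data, each way
    equipment_us -- hypothetical extra delay (in microseconds) added by
                    networking equipment over the whole round trip
    """
    propagation_us = 2 * one_way_km * FIBRE_LATENCY_US_PER_KM
    return (propagation_us + equipment_us) / 1000.0


if __name__ == "__main__":
    for km in (10, 100, 1000):
        print(f"{km} km each way: {round_trip_latency_ms(km):.2f} ms round trip")
```

For 100km each way this gives the one millisecond quoted above, before any equipment delay is added.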

The impact of latency on an application is dependent on how sensitive the application is to the latency overhead of each I/O.  Highly random workloads and those with a high degree of serialisation (those that can’t be highly parallelised) will suffer the most.
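
A rough sketch shows why serialised workloads suffer the most.  It assumes a deliberately simple model in which each stream of I/O waits for one request to complete before issuing the next, so the throughput ceiling is the number of outstanding I/Os divided by the latency of each one.  The function name and the example latencies are illustrative assumptions, not measurements.

```python
def max_iops(latency_ms: float, outstanding_ios: int = 1) -> float:
    """Upper bound on I/Os per second for a given per-I/O latency.

    Assumes every outstanding I/O takes latency_ms to complete and a new
    one is issued as soon as the previous one finishes.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    if outstanding_ios < 1:
        raise ValueError("outstanding_ios must be at least 1")
    return outstanding_ios * 1000.0 / latency_ms


if __name__ == "__main__":
    local_ms = 0.2             # assumed latency of local flash storage
    remote_ms = local_ms + 1.0  # the same I/O with a 1 ms round trip added
    for depth in (1, 32):
        print(
            f"outstanding I/Os={depth}: "
            f"local ceiling {max_iops(local_ms, depth):,.0f} IOPS, "
            f"remote ceiling {max_iops(remote_ms, depth):,.0f} IOPS"
        )
```

In this model, an application that can keep many I/Os in flight recovers much of the lost throughput, whereas a fully serialised application cannot.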

Data has inertia that increases with the volume (or perhaps more appropriately, mass) of data being moved, with distance and the I/O pattern.  The following pseudo-equation represents the parameters involved:

Where I is the inertia, m is the “mass” of data in total, d is the distance between data source and application, r is the degree of randomness of I/O, p is the level of parallelisation in the application, and v is the volume of data being transferred.

I don’t like the term “data gravity” as this takes no account of the distance between applications and data, with “classical” gravity diminishing over distance, the opposite of the challenges with data and applications.  However, data gravity has been adopted as a term, partially because it makes a good soundbite.

### Apparent Mobility

Physics precludes our ability to offer real data ubiquity, so we need to find ways to give an application the appearance of data mobility.  In a totally abstracted environment, data would be available to any application, any time.  In reality, we don’t run applications in many locations simultaneously or move applications around on a frequent basis.  It’s possible to compromise on the requirements of global data access to make data appear to be fully mobile.  We’ll go back to that in a moment.

### Patchwork Solutions

What do we see happening in enterprises today?   Typically, there are several techniques in operation.

• Clone – produce multiple copies of data and ship them to wherever they are needed.  This process introduces several challenges.  First, any update to the source data results in the copies being out of date.  So, unless the source is cold and not actively updated, all copies are immediately stale.  Second, keeping copies accurate for fast-changing workloads is a challenge.  Modern media operates with sub-millisecond latency, making it hard to keep remote copies actively updated in real-time and impossible over any distance if the updates need to be synchronous.
• Cache – keep some data local to an application and only retrieve data from the source where necessary.  Caching is affected by the level of randomness of data and the volume of writes versus reads.  For example, 100% random read I/O has no predictability and can’t be cached effectively.
• Partition – divide data into subsets based on access profiles.  Here we’re assuming applications have a limited access landscape.  For example, on a VMware vSphere datastore, the blocks comprising a single VM are only accessed by one host at any one time, making it relatively easy to partition the data (logically) within a single datastore.
• Abstract – separate physical data from metadata.  Filesystems, for example, use metadata to describe file layout, file names, access times and capacities.  It’s possible to distribute metadata dynamically and keep the actual data in one place.  This is what we did with Mobilus.io and the proof of concept we developed (more on this later).
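
The point about caching and randomness can be illustrated with a small simulation.  The sketch below is our own toy model, not taken from any product: a least-recently-used (LRU) cache of block IDs, with hypothetical helper names and workload sizes, comparing uniformly random reads against reads concentrated on a small hot set.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache that tracks only which block IDs are resident."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id: int) -> None:
        if block_id in self._blocks:
            self._blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # would be fetched from the remote source
        self._blocks[block_id] = True
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict the least recently used

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def simulate(block_ids, capacity: int) -> float:
    """Replay a sequence of block reads and return the cache hit ratio."""
    cache = LRUCache(capacity)
    for block_id in block_ids:
        cache.read(block_id)
    return cache.hit_ratio


def uniform_reads(rng, total_blocks: int, reads: int):
    """Every block is equally likely to be read (100% random)."""
    return [rng.randrange(total_blocks) for _ in range(reads)]


def skewed_reads(rng, total_blocks: int, hot_blocks: int, reads: int, hot_share=0.9):
    """A share of reads (hot_share) goes to a small hot set of blocks."""
    return [
        rng.randrange(hot_blocks) if rng.random() < hot_share
        else rng.randrange(total_blocks)
        for _ in range(reads)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    total, cache_size, reads = 10_000, 500, 50_000
    print("uniform:", round(simulate(uniform_reads(rng, total, reads), cache_size), 3))
    print("skewed: ", round(simulate(skewed_reads(rng, total, 400, reads), cache_size), 3))
```

With uniformly random reads, the hit ratio stays close to the fraction of the data that fits in the cache, whereas a workload with a hot set smaller than the cache is served mostly from local copies.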

There are many vendors offering solutions in this market.  We’ll touch on them in more detail in subsequent posts.

### Compromises

What can we accept in terms of compromises with global data availability?  Let’s quickly discuss CAP Theorem (and PACELC) as this helps to explain the choices.  CAP Theorem states that a distributed data store can only offer two out of three guarantees of Consistency, Availability and Partition tolerance.  We can abandon consistency and still offer availability and partition tolerance (the eventual consistency model historically associated with object stores like S3, which has offered strong read-after-write consistency since December 2020).  We can keep all of our data in a single “device”, avoiding distributed networking or hardware failures.  Or we can accept split-brain issues with a lack of partition tolerance.  PACELC Theorem goes a step further to introduce the impact of latency into the model.

So, what can we accept?

• Increased latency – if data isn’t local, latency may increase in solutions that cache or abstract data.  Latency can be a big problem for some applications, but in other cases may not be as much of an issue.  Latency challenges also occur with synchronous replication.
• Short Downtime – moving data between locations means stopping access in one location and restarting it in another.  This may incur a short period of downtime while a replicated copy fully catches up, or while the metadata primary is flipped to the new location and data catches up in the background.
• Partitioning – some data may need to be partitioned to make one location the primary and others secondary.  The primary locations have the best performance, while the secondaries see some impact.  This asymmetric design may be more appealing than a symmetric model where all locations suffer a performance impact.
• Data Inconsistency – with high write I/O profiles, keeping data replicated across multiple sites can be challenging or impossible.  Some degree of data inaccuracy may be acceptable, depending on the application.
• Reduced Availability – instead of making data available across every location, distributed data could be supported across only a few locations to reduce the impact of managing availability.
• Increased storage cost – one solution is to keep multiple copies of data across locations.  This strategy increases storage costs while introducing the challenge of keeping all copies consistent.
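
To put a number on the latency compromise with synchronous replication, here is a minimal sketch.  It assumes a write is acknowledged only after the remote copy has confirmed it, and it reuses the figure of around 5µs per kilometre for fibre quoted earlier.  The function names and the 100µs media latency are illustrative assumptions.

```python
# Roughly 5 microseconds per kilometre for fibre, as quoted earlier.
FIBRE_LATENCY_US_PER_KM = 5.0


def sync_write_latency_us(media_us: float, distance_km: float) -> float:
    """Latency of a write that waits for a remote copy to acknowledge.

    media_us    -- assumed latency of the local storage media
    distance_km -- one-way distance to the remote copy
    Ignores networking equipment and remote media latency, so this is a
    best-case figure.
    """
    round_trip_us = 2 * distance_km * FIBRE_LATENCY_US_PER_KM
    return media_us + round_trip_us


def slowdown(media_us: float, distance_km: float) -> float:
    """How many times slower a synchronous write is than a local one."""
    return sync_write_latency_us(media_us, distance_km) / media_us


if __name__ == "__main__":
    media_us = 100.0  # hypothetical sub-millisecond media
    for km in (1, 10, 100, 1000):
        print(f"{km:>5} km: {slowdown(media_us, km):.1f}x the local write latency")
```

The faster the local media, the larger the relative penalty, which is why an asymmetric design with a single well-performing primary can be the more attractive compromise.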

### Striving for Perfection

It’s clear that no system is perfect, mainly due to the physics involved in moving data around a distributed infrastructure.  However, we can establish a set of requirements and look to mitigate the challenges already listed.

1. Provide a consistent metadata view of data.  By this, we mean offer a global namespace that covers block, file, object and any other storage (or data) protocol.  By consistent, we mean the data looks as if it is local, with the same access controls in every location.
2. Extend metadata to be application-aware.  For example, allow the metadata for block or filesystem data to identify widely known database applications and to record I/O quality of service requirements.
3. Implement multiple consistency controls.  Allow the data owner to choose between eventual or strong consistency on a dataset basis.
4. Abstract logical data from physical storage.  This is an essential requirement to allow the characteristics of data to be separated from the persistent storage on which the data is held.
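
As a way of making the four requirements concrete, the sketch below models them as a simple metadata record and namespace.  This is purely illustrative and is not the Mobilus design: every class, field and path name is a hypothetical of our own.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Consistency(Enum):
    """Requirement 3: the data owner chooses a model per dataset."""
    EVENTUAL = "eventual"
    STRONG = "strong"


@dataclass
class DatasetMetadata:
    logical_path: str                  # requirement 1: name in a global namespace
    protocol: str                      # e.g. "block", "file" or "object"
    consistency: Consistency           # requirement 3
    application: Optional[str] = None  # requirement 2: application awareness
    max_latency_ms: Optional[float] = None  # requirement 2: I/O QoS hint
    physical_locations: List[str] = field(default_factory=list)  # requirement 4


class Namespace:
    """Maps logical names to metadata, independent of where data lives."""

    def __init__(self) -> None:
        self._datasets: Dict[str, DatasetMetadata] = {}

    def register(self, meta: DatasetMetadata) -> None:
        self._datasets[meta.logical_path] = meta

    def resolve(self, logical_path: str) -> DatasetMetadata:
        return self._datasets[logical_path]

    def relocate(self, logical_path: str, new_locations: List[str]) -> None:
        # Only the physical placement changes; the logical name, access
        # path and consistency choice seen by the application do not.
        self._datasets[logical_path].physical_locations = list(new_locations)


if __name__ == "__main__":
    ns = Namespace()
    ns.register(DatasetMetadata(
        logical_path="/finance/ledger",
        protocol="file",
        consistency=Consistency.STRONG,
        application="database",
        max_latency_ms=2.0,
        physical_locations=["site-a"],
    ))
    ns.relocate("/finance/ledger", ["site-b"])
    print(ns.resolve("/finance/ledger"))
```

The design choice to note is that an application only ever refers to the logical path, so data can be moved or copied between locations by updating metadata rather than by changing the application.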

These principles were partly used in our thinking and design of Mobilus.  I developed a proof-of-concept file system that enabled replication between two locations with significant latency.  The replication seeded data by identifying the application and moving files (or partial files) that would be needed on start-up.  With more information on how an application processes data, greater accuracy could be achieved.

This solution is not infallible.  Large data sets with highly random access would cause problems.  However, for a vast number of use-cases, the Mobilus solution would work just fine.

### The Architect’s View™

The four principles listed above offer a starting point in developing distributed data solutions.  We’ll use (and expand) these to cover platforms and products reviewed in upcoming posts.  In the next technical post, we’ll review why block storage isn’t a long-term solution for data mobility and what we can do about it.

Copyright (c) 2007-2021 – Post #1a7e – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.