Block is Not the Solution for Persistent Container Storage

It appears we’re reaching a consensus that persistent storage is needed for containers. Despite early resistance, based on the assumption that containers and their data should be transient, the logic of data persistence is starting to take hold. To be honest, it simply makes sense. Yes, I could keep my data in sync between multiple nodes by copying data to a new container when one fails, but why do that when persistence takes that need away? Unfortunately, we have headed down a false path by using block devices for container storage. Block is not the right solution and we need to realise this sooner rather than later.

Container Persistence

Today many storage vendors offer plugins that attach block storage to a container. If you take the view that a persistent container looks a lot like a VM, then it’s possible to see how this analogy leads to using a block device for a container. That’s what we do with VMs. That’s what we do a lot of the time with instances in public cloud and OpenStack. However, a container is not a VM. Unlike that of a VM, the container root file system doesn’t reside on a block device. Instead, it is part of a union file system that allows the underlying image to be changed without affecting any customisation that may have been done in the container itself.

Storage for application data within a container doesn’t work well on the union file system. This is because each write to a file has to search the file system layers to determine whether an earlier copy of the file exists. The result is an impact on performance. So data needs to be stored on something mapped into the container. To date, the easiest way to do this has been to add a block device.
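To make the layer lookup concrete, here is a toy model of a union file system with copy-on-write. It is a sketch for illustration only, not how any particular storage driver is implemented; the `UnionFS` class and its method names are hypothetical. It shows why the first small write to a large file in a lower layer is expensive: the whole file is copied into the writable layer before the change is applied.

```python
class UnionFS:
    """Toy model of a layered (union) file system with copy-on-write."""

    def __init__(self, image_layers):
        # image_layers: read-only dicts of path -> bytes, lowest layer first.
        self.image_layers = list(image_layers)
        self.upper = {}  # the container's writable layer

    def read(self, path):
        # Check the writable layer first, then image layers from top to bottom.
        if path in self.upper:
            return self.upper[path]
        for layer in reversed(self.image_layers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write_at(self, path, offset, data):
        """Write data at offset; return how many bytes had to be copied up."""
        copied = 0
        if path not in self.upper:
            # First write: search the layers and copy the whole file up.
            try:
                original = self.read(path)
            except FileNotFoundError:
                original = b""
            self.upper[path] = original
            copied = len(original)
        buf = bytearray(self.upper[path])
        if offset > len(buf):
            buf.extend(b"\0" * (offset - len(buf)))  # pad sparse writes
        buf[offset:offset + len(data)] = data
        self.upper[path] = bytes(buf)
        return copied
```

In this model a two-byte update to a 1,000-byte file in the image copies all 1,000 bytes on the first write and nothing on later writes, while the image layer itself stays untouched.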

Mapping a Block Device

I say “add a block device”, but the process isn’t that simple. What actually happens is a plugin goes through the following process:

Locate/provision a block device – may be iSCSI or FC connected through a call to the storage platform.
Attach the block device to the host that will run the container.
Format the device with a file system (if it is a new volume).
Mount the volume to the host.
Mount the file system into the container.

The mechanics of the process will vary per plugin, but the overall sequence is generally the same across all vendors.
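The steps above can be sketched as a function that builds, but does not run, the commands a plugin might issue for an iSCSI volume. This is an assumption-laden illustration rather than any vendor's implementation: the function name, parameters and the choice of ext4 are hypothetical, and the provisioning call to the storage platform is vendor specific, so it appears only as a comment.

```python
def plan_volume_attach(target_iqn, portal, device, host_mount,
                       container_path, image, is_new_volume):
    """Return the ordered commands (as argv lists) for attaching a volume.

    Nothing is executed here; a real plugin would also handle discovery,
    errors, retries and clean-up.
    """
    steps = []
    # Step 1: locate/provision the LUN via the storage platform's API
    # (vendor specific, so not shown).
    # Step 2: attach the block device to the host by logging in to the target.
    steps.append(["iscsiadm", "-m", "node", "-T", target_iqn,
                  "-p", portal, "--login"])
    # Step 3: format the device, but only if the volume is new.
    if is_new_volume:
        steps.append(["mkfs.ext4", device])
    # Step 4: mount the file system on the host.
    steps.append(["mount", device, host_mount])
    # Step 5: map the mounted file system into the container.
    steps.append(["docker", "run", "-d",
                  "-v", f"{host_mount}:{container_path}", image])
    return steps
```

Returning a plan instead of executing it keeps the sketch safe to run anywhere and makes the ordering easy to inspect.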

Block Device Issues

So, barring a little performance delay as the device is created, this process works, right? Well, yes, but it has problems. Some of the issues include:

Physical Security

An external block device will use a protocol like iSCSI or Fibre Channel. These were originally intended as solutions for attaching LUNs/volumes to physical hosts and as such the security settings at the storage layer track host/LUN mappings. From a security and audit perspective, there is no inherent process for tracking and securing the mapping of a LUN to a container. What’s the actual problem here? Well, SCSI (whether over iSCSI or FC) addresses an entity that sits on a real or logical piece of hardware. However, a container is the logical instantiation of an application that can exist on many hosts or in the public cloud. This means we’re attempting to map physical storage to a set of transient processes.

Imagine I kick off a container that runs a MySQL or MongoDB database instance. If that container dies and I restart it, there is no way to know that the new container is associated with or related to the previous one. There is no “application token” associated with the container that says (for example) this is part of the Finance application and should have access to the LUN.

Logical Security

If you’re running Docker or your container environment as root, you have a problem. Any container can access any directory on the host; you simply specify it on a “docker run” command. If the system screws up, or the administrator makes a mistake, then data from one application can be mapped to another, with no protection. I doubt many organisations bother to map UID/GID settings to each directory and container; most probably run Docker as root. We shouldn’t underestimate how big a gap this is and the risk it exposes.
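To illustrate the kind of check that is missing by default, here is a minimal sketch of a guard that only allows a host directory to be mapped into a container if it sits under that application's own root. Everything in it is hypothetical (the `ALLOWED_ROOTS` table, the paths and the function name), and it only normalises the path text; it does not resolve symbolic links or replace proper UID/GID controls.

```python
import posixpath

# Hypothetical mapping of application name -> the only host directory
# tree that application's containers may mount.
ALLOWED_ROOTS = {
    "finance": "/mnt/volumes/finance",
    "web": "/mnt/volumes/web",
}


def is_mount_allowed(app, host_path):
    """Return True if host_path lies inside the root assigned to app."""
    root = ALLOWED_ROOTS.get(app)
    if root is None:
        return False  # unknown applications get nothing
    # Collapse "." and ".." so "/mnt/volumes/finance/../web" is caught.
    normalised = posixpath.normpath(host_path)
    return normalised == root or normalised.startswith(root + "/")
```

A wrapper around container start-up could call a check like this before passing any host path to the container runtime.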

Metadata

So many issues in this area. LUNs have no metadata associated with them. The metadata is in the file system that gets formatted onto the LUN. Somewhere, details of the LUN itself (such as which application it was used for) need to be kept and tracked. Storage arrays aren’t designed for this, which is why we’ve retained spreadsheets for years. Admittedly today’s storage platforms are better, but keeping track of which LUN is associated with a specific application is hard. What happens if the hardware is refreshed? What happens in DR? Here we would need some kind of name service that maps application names to physical/virtual storage entities.
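As a rough sketch of what such a name service might look like, the following in-memory registry maps a stable, application-level name to whatever array and LUN currently hold the data. All names here (`VolumeRecord`, `VolumeNameService`, the field names and the example identifiers) are assumptions made for illustration; a real service would need persistence, access control and replication of its own.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VolumeRecord:
    application: str  # business application that owns the data
    array: str        # storage system currently holding the LUN
    lun_id: str       # identifier of the LUN on that array
    site: str         # data centre or cloud region


class VolumeNameService:
    """Maps stable application-level names to physical/virtual volumes."""

    def __init__(self):
        self._records = {}

    def register(self, name, record):
        self._records[name] = record

    def resolve(self, name):
        return self._records[name]

    def migrate(self, name, **changes):
        # Hardware refresh or DR failover: the name and owning application
        # stay the same, only the physical location changes.
        self._records[name] = replace(self._records[name], **changes)
        return self._records[name]
```

With something like this in place, a restarted container would ask for "finance/mysql-data" rather than a specific LUN, and a hardware refresh would become a single `migrate` call instead of a spreadsheet update.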

Data Integrity

One of the biggest issues for LUNs is data integrity. If you present the same LUN to two hosts, you can expect data corruption without some kind of clustering or locking mechanism. This makes it hard to move an application from one host to another without shutting it down first. This also represents a dilemma for provisioning: should a LUN/volume be automatically available to all hosts that might access it, or should it be added “on demand”? Making a LUN/volume available across many locations through techniques like replication is hard. Point-to-point replicas are possible, but many storage appliances don’t support one-to-many or many-to-many relationships.

Scalability

There’s a limit to the number of block devices that can be added to a host. Thankfully that limit has increased in recent years, but it does exist. There’s also a limit on the number of devices external storage can create and a limit on the rate at which those LUNs can be created and destroyed. All of this limits the scale at which containers can be used.

Why File is Better

A file system makes much more sense for container data. Files have metadata associated with them. A file system can be part of a global namespace that can be aligned with business applications. A file system can be abstracted from the underlying storage hardware. A file system provides native locking, allowing the same data to be shared across multiple containers with relative safety. A file system can have security controls set and implemented at the application level, irrespective of the underlying storage.
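As a small illustration of the locking point, here is a sketch that uses an advisory lock so two writers appending to the same file do not interleave their records. It assumes a POSIX system and uses Python's `fcntl.flock`; the function names and the log file are hypothetical, and whether such locks are honoured across hosts depends on the file system and protocol in use.

```python
import fcntl
import os
import tempfile


def append_record(path, record):
    """Append one line to a shared file while holding an exclusive lock."""
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the data to storage before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def demo_shared_log():
    # Stand-in for two containers writing to the same mounted directory.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.log")
        append_record(path, "container-a")
        append_record(path, "container-b")
        with open(path) as f:
            return f.read().splitlines()
```

The lock is advisory, so it only protects writers that also take it, which is why the text above says "relative safety" rather than a guarantee.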

Probably the most useful reason for using a file system for container data is portability. A scale-out distributed file system can make data available across multiple sites and geographies (including public cloud) at the same time. This visibility means an application in containers can be moved around with the data following it.

“Why not use an object store?” I hear you say. Well, for some data, yes, object is a great solution. However, objects are typically immutable, so they are updated or replaced in their entirety. An object store that emulates a file system could have performance issues when small regions within a file have to be modified. Natively, object doesn’t implement locking, which is a problem if you want to enable concurrent access.
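The cost of whole-object updates can be shown with a toy comparison. The sketch below patches four bytes in a 4 KiB file in place, then makes the same change against a hypothetical in-memory object store that only supports whole-object put and get. `ToyObjectStore` is an assumption for illustration and does not model any real product's API.

```python
import os
import tempfile


class ToyObjectStore:
    """Hypothetical in-memory object store: whole-object put and get only."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        self._objects[key] = bytes(body)
        return len(body)  # bytes sent to the store

    def get(self, key):
        return self._objects[key]


def update_file_in_place(path, offset, data):
    # A file system lets us overwrite just the bytes that changed.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
    return len(data)


def update_object(store, key, offset, data):
    # Whole-object semantics: read, patch in memory, rewrite everything.
    body = bytearray(store.get(key))
    body[offset:offset + len(data)] = data
    return store.put(key, body)


def compare_small_update(size=4096, patch=b"abcd"):
    """Return (bytes written to the file, bytes written to the object store)."""
    original = b"\0" * size
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as f:
            f.write(original)
        file_bytes = update_file_in_place(path, 100, patch)
    store = ToyObjectStore()
    store.put("data.bin", original)
    object_bytes = update_object(store, "data.bin", 100, patch)
    return file_bytes, object_bytes
```

Here the file update writes 4 bytes while the object update rewrites all 4,096, and the gap grows with object size.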

The Architect’s View

Data persistence is a good thing. Making containers look like virtual machines is not. Block devices for containers are an interim step, but we should be architecting for global file systems. Global may mean within a single data centre or on and off-premises in a hybrid solution. File is the future, with object for scaling immutable content. Let’s not get tied up over-engineering block support; let’s start looking at global file systems instead.