Ever since persistent storage was introduced into computing, we’ve had challenges with efficient storage management. The storage administrator typically held the role of gatekeeper in provisioning and reclaiming storage resources. This world is changing as scale and time-to-delivery demand a transition away from manual management practices. Are we approaching storage automation the right way, and is this a symptom of a more significant challenge in the enterprise private cloud?
Twenty years ago, storage arrays were big ugly beasts that needed lots of care and attention. This meant understanding everything from disk layouts and IOPS per spindle to RAID groups and many other factors. Storage administration was akin to an art – and paid well as a result.
Thankfully, the world has moved on, and we have a different breed of storage solutions today. All-flash platforms and automated hybrid storage have reduced the need to understand the specifics of storage hardware. APIs and CLIs allow us to automate many of the tasks previously done through GUIs. Vendors are building AIOps-type features into their platforms that reduce the administrative burden of detecting and fixing performance problems.
All of these advancements in storage technology were inevitable as the scale of storage deployments increased. In the mainframe days, I extracted individual files from failed IBM 3380 drives. This level of management is unimaginable today, and rightly so. However, the increasing scale at which storage is deployed and consumed means that day-to-day manual administration isn’t practical. We need a new approach.
In a recent presentation at Storage Field Day 19, Audrius Stripeikis from Dell EMC shared findings from their latest research on storage management automation. The data showed that for customers with ten or more platforms, automation was an essential component of service delivery. Customers were effectively dealing with three main challenges:
- Scale – coping with the number of requests, volumes of storage and platforms.
- Consistency – the ability to provision storage through a standard set of APIs as part of an “infrastructure as code” approach.
- Self-Service – removing the human element and making the provisioning process automated.
I would add a fourth challenge to this list: risk. When IT organisations need to manage storage across tens or hundreds of devices in many data centre locations, the risk of mistakes increases.
The evolution of IT organisations is (apparently) to move to a DevOps model. Instead of placing all responsibility with server, network and storage teams, DevOps and Site Reliability Engineers take responsibility for managing access to infrastructure (see figure 1 from the Dell EMC presentation). In one respect, I view this approach as making sense. The diagram shows the use of a service catalogue to define the infrastructure capabilities available. APIs are used to control access to resources. Ownership of problems becomes shared between the operations and developer groups, aligning more accurately with the people making changes.
However, the actual workflow in larger organisations is much more complex. Firstly, someone still needs to be responsible for deploying and managing infrastructure. Storage and other components have to be racked, stacked and interconnected. Failures need to be managed and rectified. Some team still needs to perform capacity planning and hardware refreshes.
Managers & Consumers
I would redraw this diagram and show things slightly differently. First, we need to divide responsibilities into managers and consumers.
Infrastructure managers take responsibility for deploying infrastructure into a framework from where it is consumed. This process includes the tasks we’ve already discussed: deploying, configuring and supporting the hardware. In modern environments, this doesn’t have to mean also provisioning individual resources to users.
Infrastructure consumers, by definition, consume storage resources. Traditionally the storage administrator would map resources to servers; however, today we can automate these tasks via APIs and CLIs. We can even use role-based access to partition and segment resources to groups of users.
The current Dell EMC strategy is to allow consumers access to APIs that automate storage provisioning. Exploiting these APIs depends on how the customer implements their workflow but could include using Ansible or scripting with Python.
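As a sketch of what that scripted consumption might look like, the fragment below builds a volume-provisioning request against a REST endpoint. The URL, payload fields and token handling are illustrative assumptions for this post, not Dell EMC’s actual API schema; a real workflow would use the vendor’s documented API or their Ansible modules.

```python
import json
import urllib.request


def make_volume_payload(name: str, size_gb: int, pool: str) -> dict:
    """Build a provisioning payload. Field names are illustrative,
    not a real vendor schema."""
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    return {"name": name, "capacity_gb": size_gb, "storage_pool": pool}


def build_provision_request(api_base: str, token: str, payload: dict):
    """Prepare an authenticated POST; the caller decides when to send it
    (e.g. with urllib.request.urlopen) so the sketch stays side-effect free."""
    return urllib.request.Request(
        url=f"{api_base}/volumes",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical array endpoint and token, for illustration only.
payload = make_volume_payload("app01-data", 500, "gold-pool")
req = build_provision_request("https://array.example.com/api/v1", "TOKEN", payload)
print(req.full_url)
```

The same request could equally be expressed as an Ansible task; the point is that the consumer drives provisioning through code rather than a ticket to a storage administrator.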
Where previously Dell EMC may have advocated a framework that abstracted direct access to the storage (more on this in a moment), the current thinking is that this process is too complicated. So, DevOps and SRE teams simply have access to APIs and can directly provision resources.
This simplistic approach seems to solve many problems. Time to delivery is shortened, and the human factor of involving a storage administrator is removed. The risk of making mistakes could be reduced with automation. Unfortunately, life isn’t that simple.
Here are just a few issues with providing direct access to infrastructure within a private data centre.
- Resources aren’t infinite. One hugely important task provided by the storage administrator was that of gatekeeper. Storage can be expensive, so having a workflow step to validate requirements before provisioning is a vital step in controlling costs. In the public cloud, CSPs want to encourage unfettered access to storage and other resources, because that’s how they make money. This isn’t the case for most businesses that have limited IT budgets.
- Accountability. How will consumed resources be tracked? The benefit of automation brings with it the ability to use storage for short periods (especially with containers). If the usage time is shorter than the billing (or measuring) cycle, resources can go uncharged (and appear to be unused), while capacity on any particular day may be constrained. I’ve seen this happen (and used it myself) to reduce billing costs by migrating data to tape just before the billing data is collected, then recalling the data the next day.
- Multi-tenancy. Platforms may be used by many IT groups because the shared model is efficient. Without some degree of control, which might include capacity and performance quotas, resource usage can have a significant impact on production applications.
- Maintenance. Imagine the requirement to take systems down for support (or at least restrict certain activities). When end users have direct access to a platform, there’s no easy way to restrict the creation of new storage or prevent more advanced features (like failover) from being used.
Finally, in this list, there’s the challenge of placement. By this, we mean the process of deciding which storage platform to use for new storage allocations. Some choices are easy and determined by physical location or connectivity. Some are not so simple, such as picking the right storage platform based on free capacity or performance. There may also be a requirement to limit access during storage decommissioning. This isn’t a simple case of switching off access, because there are scenarios where extensions to existing capacity should be performed on the same array as a current allocation, even if that array is being decommissioned.
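A placement policy of this kind can be expressed as a simple selection function. The sketch below is an illustrative example under assumptions of my own (the array names, the free-capacity tiebreak and the decommissioning rule are invented, not any vendor’s actual placement engine): extensions stay on the array holding the existing allocation, even one being decommissioned, while new volumes go to the eligible array with the most free capacity.

```python
from dataclasses import dataclass


@dataclass
class Array:
    name: str
    free_gb: int
    decommissioning: bool = False


def choose_array(arrays, size_gb, existing_allocation=None):
    """Pick a target array for a new allocation.

    Extensions follow the existing allocation (even onto a decommissioning
    array); new allocations exclude decommissioning arrays and pick the
    one with the most free capacity. Real engines would also weigh
    performance, connectivity and physical location.
    """
    by_name = {a.name: a for a in arrays}
    if existing_allocation in by_name:
        target = by_name[existing_allocation]
        if target.free_gb >= size_gb:
            return target.name
        raise RuntimeError(f"no capacity on {existing_allocation} for extension")
    candidates = [a for a in arrays if not a.decommissioning and a.free_gb >= size_gb]
    if not candidates:
        raise RuntimeError("no eligible array for new allocation")
    return max(candidates, key=lambda a: a.free_gb).name


arrays = [
    Array("array-a", free_gb=2000, decommissioning=True),
    Array("array-b", free_gb=5000),
    Array("array-c", free_gb=800),
]
print(choose_array(arrays, 100))                                 # -> array-b
print(choose_array(arrays, 100, existing_allocation="array-a"))  # -> array-a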
My personal preference is to build a framework around storage and other infrastructure. The SREs and DevOps teams still get programmatic and automated access to resources, but requests go through additional workflow that manages some of the challenges already identified, such as correct data placement and billing accountability.
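A minimal sketch of what that framework layer might do, assuming a hypothetical design of my own (per-team quotas and a chargeback ledger; not a description of any shipping product): the consumer still makes a programmatic request, but the workflow validates and records it before anything touches the array.

```python
from datetime import datetime, timezone


class ProvisioningWorkflow:
    """Thin gatekeeper between consumers and the array APIs.

    Requests are validated against a per-team quota and logged for
    chargeback before provisioning. Hypothetical design for illustration.
    """

    def __init__(self, quotas_gb):
        self.quotas_gb = dict(quotas_gb)  # team -> remaining GB
        self.ledger = []                  # chargeback records

    def request_volume(self, team, size_gb):
        remaining = self.quotas_gb.get(team, 0)
        if size_gb > remaining:
            raise PermissionError(f"{team} quota exceeded ({remaining} GB left)")
        self.quotas_gb[team] = remaining - size_gb
        self.ledger.append({
            "team": team,
            "size_gb": size_gb,
            "when": datetime.now(timezone.utc).isoformat(),
        })
        # ...here the workflow would call the array API (or a placement
        # function) to actually create the volume...
        return True


wf = ProvisioningWorkflow({"payments": 1000})
wf.request_volume("payments", 400)
print(wf.quotas_gb["payments"])  # -> 600
```

Consumers keep their automation; the framework simply inserts the gatekeeping steps (quota, billing, placement) that a storage administrator used to perform by hand.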
EMC previously developed this kind of strategy with a solution called ViPR. The product used IP from iWave Software, acquired by EMC at the end of 2012. While the concept of ViPR was sound, the implementation wasn’t so good, and the product never really took off.
On reflection, I can’t help thinking that the problem statement as outlined at the beginning of this post is perhaps the wrong one. Instead of trying to shoehorn automation into the provisioning of traditional storage, perhaps storage itself needs to change.
Over the past ten or twenty years, storage platforms have definitely evolved. So much of the manual management of hardware has gone away. Software-defined storage in many guises has improved the efficiency of storage administrators.
However, we still face many challenges. Data is becoming more independent and detached from the physical form on which it is stored. We need to move away from LUNs, which were, in fact, the logical instantiation of physical disks. Data needs more metadata to track it, because platforms like containers and Kubernetes are too ephemeral to act as the tracking mechanism.
Some solutions that could solve these storage management challenges are already available. Start-ups like StorageOS, Portworx and Kasten are looking at data differently. In the traditional world, Tintri has a platform that abstracts away any requirement to think about the storage underpinning virtual machines. There are others and undoubtedly more to come.
The Architect’s View
As a vendor of traditional storage products, Dell EMC is naturally trying to make the best of their existing portfolio. PowerMax, Isilon and Unity (in various forms) have had a long run. Dell EMC has been less successful in the software-defined space, failing to gain traction with ScaleIO or ViPR.
Is this because the customer base is more complex and slow-moving than the rest of the industry? Who’s to say? While I think the attempts at automation are laudable, in the long run, Dell EMC needs some tools in the armoury to support more cloud-native style deployments. Of course, while the majority of customers continue to pay for high-end enterprise arrays, the transition to cloud-native will be a long one.
Post #782F. Copyright (c) 2020 Brookend Ltd. No reproduction in whole or part without permission.
Disclaimer: Chris was invited to Storage Field Day 19, with GestaltIT covering travel and accommodation. There is no requirement to blog or produce any content from the event and no content is vendor-reviewed before publication.