The Need for APIs in Storage and Data Protection

The idea of programmatically managing infrastructure provides capabilities to scale, manage and control technology much more effectively than can be achieved through manual methods. But the use of APIs means so much more than simply providing an interface that can be automated. APIs introduce greater efficiency and accountability into storage platforms. We look at how that impacts the way in which various aspects of data management are being implemented.

Days of the Admin

Look back 20 years and you will see a method of operation where storage and backup administrators used GUIs and CLIs to do their work. As someone who has done both of these jobs over many years, I can say that the work was often repetitive and time-consuming. Where possible, I (and colleagues) made every attempt to automate and simplify the process. This included using techniques like building scripts and frameworks for handling large-scale or repetitive tasks.

Boundaries

Unfortunately, we could only work within the confines of the interfaces available. Some vendors introduced automation wrappers around storage functions. EMC’s ViPR was an obvious attempt at this, as were many storage management platforms. Backup software vendors introduced “master of masters” and other features to make it easier to manage multiple backup software instances. The real step forward though was in using APIs – a subject I discussed nearly 7 years ago.

Application Programming Interfaces

Why are APIs different? If we look at GUIs (graphical user interfaces) and CLIs (command-line interfaces), both were designed with the human in mind. They were created for administrators to interface with the software or an application. Building automation around a CLI, for example, introduces additional complications that could be eliminated if we take out the human design dependencies.
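To make the point about human design dependencies concrete, here is a minimal sketch that reads the same volume list two ways. The CLI layout, the field names and the JSON payload are all invented for illustration and do not correspond to any real product.

```python
import json
import re

# Hypothetical human-oriented CLI output (layout invented for illustration).
CLI_OUTPUT = """\
Volume Name    Size     Status
-----------    -----    ------
vol_db01       500 GB   online
vol_logs       1.5 TB   online
"""

# The same information as a hypothetical management API might return it.
API_RESPONSE = (
    '[{"name": "vol_db01", "size_bytes": 500000000000, "status": "online"},'
    ' {"name": "vol_logs", "size_bytes": 1500000000000, "status": "online"}]'
)

UNITS = {"GB": 10**9, "TB": 10**12}


def volumes_from_cli(text):
    """Scrape volumes from column-formatted text; fragile if the layout changes."""
    volumes = []
    for line in text.splitlines()[2:]:  # skip the header and separator rows
        match = re.match(r"(\S+)\s+([\d.]+) (GB|TB)\s+(\S+)", line)
        if match:
            name, number, unit, status = match.groups()
            volumes.append({
                "name": name,
                "size_bytes": int(float(number) * UNITS[unit]),
                "status": status,
            })
    return volumes


def volumes_from_api(payload):
    """Structured data needs no layout knowledge or unit conversion."""
    return json.loads(payload)
```

Both functions return the same list here, but the CLI version depends on column order, spacing and unit formatting, any of which a vendor might change between releases.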

APIs are application programming interfaces that facilitate machine-to-machine interaction. That’s not to say that APIs can’t be used by a human operator, but the main premise of an API is to provide a common standard for machines (and applications) to communicate with each other. The API has been around for probably 50 years in some form or other. In storage, we see examples like the POSIX API for file systems, the de-facto S3 API for object storage and NVMe, SCSI and ATA, all of which are forms of APIs.

So, APIs provide:

Standardisation – a common way to interface between applications, operating systems and infrastructure.
Abstraction – functionality that doesn’t require knowledge of the underlying hardware or software implementation.
Automation – the ability to write code to interface with an application or system.

Exactly why these features are important will become obvious later.

APIs and Data Management

We’ve already highlighted a few examples of APIs in the storage protocol layer. What about data management? There are two main scenarios where APIs can be used.

Management

Probably the most obvious is to provide management access to storage tasks, such as provisioning or making backups. A storage platform or backup solution exposes an API that provides the capability to do tasks that would otherwise have been done in a GUI or CLI.

The main benefit here is to automate tasks, either to reduce repetitive work or to deploy storage and backup as part of a larger automation framework. Storage could, for instance, be automatically provisioned with each virtual machine created in public cloud. An API could be used to trigger a snapshot of that data when the application is quiesced as part of a backup.
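As a rough sketch of those two scenarios, the code below provisions a volume as one step of a VM deployment and takes a snapshot while the application is quiesced. FakeStorageAPI is an in-memory stand-in; its method names, and the quiesce and resume callbacks, are assumptions made for illustration rather than any vendor's real interface.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class FakeStorageAPI:
    """In-memory stand-in for a storage platform's management API (hypothetical)."""
    volumes: dict = field(default_factory=dict)
    snapshots: list = field(default_factory=list)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def create_volume(self, name, size_gb):
        self.volumes[name] = size_gb
        return name

    def create_snapshot(self, volume):
        if volume not in self.volumes:
            raise KeyError(f"unknown volume: {volume}")
        snap_id = f"snap-{next(self._ids)}"
        self.snapshots.append((snap_id, volume))
        return snap_id


def deploy_vm(api, vm_name, disk_gb):
    """Provision storage as one step of a larger VM deployment workflow."""
    volume = api.create_volume(f"{vm_name}-data", disk_gb)
    return {"vm": vm_name, "volume": volume}


def backup_vm(api, vm, quiesce, resume):
    """Quiesce the application, snapshot its volume, then always resume."""
    quiesce()
    try:
        return api.create_snapshot(vm["volume"])
    finally:
        resume()  # runs even if the snapshot call fails
```

The try/finally matters in practice: an automated workflow should never leave an application quiesced because a storage call failed part-way through.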

The same logic can be applied to data protection solutions where management functions allow backup definitions to be created dynamically or ad-hoc backups to be triggered as part of a nightly backup.

Automation provides scalability and increases accuracy by removing the human aspect of storage and data management. Well-written APIs will provide the ability to run many tasks concurrently or at least within a set time period. They will also allow the end user to mix and match management methods, so that tasks can be performed from the GUI, API or CLI at the same time. In many cases, the CLI and GUI implementations simply exploit an existing underlying API.
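Running many tasks concurrently is straightforward once the interface is programmatic. The sketch below uses Python's standard thread pool; provision_volume is a placeholder that only sleeps to simulate a blocking API call, so the function name and the latency are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def provision_volume(name):
    """Stand-in for a blocking call to a management API (hypothetical)."""
    time.sleep(0.05)  # simulate network and array latency
    return f"{name}: created"


def provision_all(names, max_workers=8):
    """Issue the requests concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(provision_volume, names))
```

With eight workers, sixteen requests complete in roughly two rounds of latency rather than sixteen. A real platform may also impose its own limits on concurrent requests, which the client would need to respect.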

How far should this automation go? As I’ve already discussed in other posts, I think we should be able to fully deploy and configure software-defined storage with little or no manual intervention. This also applies to backup infrastructure as part of an on-demand or cloud architecture.

Data Access

The other area where APIs provide benefit is in data access. We’ve mentioned POSIX already. This is an agreed standard way to implement file system services. Another is SCSI, which is generally seen as a way to access storage devices like disk drives. However, SCSI commands can also be used for tape drives and have been used as a way of interfacing with automated tape libraries for many years.

In modern usage, we’ve seen two examples where APIs provide more efficient data access. The first is to extract backup data from virtual machine environments. VMware’s vSphere Storage APIs – Data Protection (which we will call VADP for convenience), for example, provides an access method to retrieve changed or updated data at the block level directly from the hypervisor running virtual machines.
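The idea of retrieving only changed blocks can be illustrated with a toy model. This is not the VADP interface; the class, its methods and the change ID scheme are invented here simply to show how a checkpoint lets a backup copy only what has been written since the last run.

```python
class TrackedDisk:
    """Toy virtual disk that records which blocks change (not the VADP API)."""

    def __init__(self, block_count, block_size=4096):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(block_count)]
        self.change_id = 0      # increments on every write
        self.last_changed = {}  # block index -> change_id of the latest write

    def write(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.change_id += 1
        self.blocks[index] = data
        self.last_changed[index] = self.change_id

    def changed_since(self, change_id):
        """Return sorted indices of blocks written after the given change ID."""
        return sorted(i for i, c in self.last_changed.items() if c > change_id)


def incremental_backup(disk, since):
    """Copy only the changed blocks and return them with the new checkpoint."""
    changed = {i: disk.blocks[i] for i in disk.changed_since(since)}
    return changed, disk.change_id
```

The backup software stores the returned checkpoint and passes it back on the next run, so each incremental touches only blocks written in between.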

Another example of APIs in backing up data can be seen with Acropolis File Services from Nutanix. AFS (now called Nutanix Files) provides a scalable file system that runs across Nutanix HCI infrastructure. AFS APIs offer the capability to receive a stream of changed data for data protection or run other functions like virus scanning, without impacting the timestamps on the file system.

Data and Control

In both examples here, we can see that the data access methods for management tasks like backup are separated from the way in which applications (including humans) will access data. This makes it easier to implement more efficient and secure access that can be tracked and audited separately to application-based traffic.

VADP

Looking back at the three tenets of APIs we quoted (standardisation, abstraction & automation), how do these apply to VADP? We can see that it’s not necessary to know which specific host is running a virtual machine guest. We can either query the hypervisor and find out or expect the API to provide the information from a cluster of hypervisor servers (in this case via vCenter).

Irrespective of whether we are running Windows, Linux or another operating system in a guest, the data stream is standardised. Obtaining data is easy to automate. It’s just a case of running a script or other piece of software that accesses the API when backups need to be taken. This is what backup software vendors exploit, as it replaces the significant overheads of deploying agents onto every guest.

File System APIs

Of course, we already have a process for taking backups from file systems. NDMP (initially developed by NetApp and Legato Systems) has provided a method for moving data from filers (NAS appliances) to media devices like tape. However, NDMP was built as a way to remove the need to push data through a backup server and allow direct filer-to-media access. AFS offers a much more flexible approach.

In a similar way to VADP, AFS APIs provide standardisation and abstraction. They remove the need to know which specific node is storing file data. They provide a constant stream of changed data that doesn’t require scanning an entire file system. In that respect, AFS offers greater scalability and is less impactful on the data itself. We talked about the AFS APIs for backup on a recent Storage Unpacked episode that covers the work being done by HYCU Inc with Nutanix on developing the AFS backup APIs.
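The scalability difference between scanning and consuming a change stream can be shown with a toy model. ToyFileShare and both functions are hypothetical and are not the Nutanix APIs; the point is only that a scan examines every file, whereas a change feed examines only what changed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileShare:
    """Toy file share that keeps an ordered log of writes (hypothetical)."""
    files: dict = field(default_factory=dict)      # path -> sequence of last write
    changelog: list = field(default_factory=list)  # ordered (sequence, path) entries
    clock: int = 0

    def write(self, path):
        self.clock += 1
        self.files[path] = self.clock
        self.changelog.append((self.clock, path))


def changed_by_scan(share, since):
    """Walk every file and compare its sequence; cost grows with total files."""
    examined = 0
    changed = set()
    for path, sequence in share.files.items():
        examined += 1
        if sequence > since:
            changed.add(path)
    return changed, examined


def changed_by_feed(share, cursor):
    """Read only log entries past the cursor; cost grows with the change rate."""
    entries = share.changelog[cursor:]  # entry with sequence s sits at index s - 1
    return {path for _, path in entries}, len(entries)
```

On a share with 1,000 files and two recent writes, the scan examines 1,000 entries while the feed reads two, and both report the same changed paths.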

Security

One area in this discussion I see as extremely important is that of security. In many scenarios, data protection software had the “keys to the kingdom” in respect of global access to data across the entire infrastructure. Naturally, this was a pretty obvious requirement. Without global access, backup and restore become cumbersome or continue to need human intervention.

How many IT organisations bother to fully audit the use of backup credentials? How many organisations fully audit backup operations? I class these as two separate tasks because backup software credentials to data are separate from credentials used by human operators to access backup software.

Having a separate API for backing up data provides the ability to reduce the potential attack surface if backup credentials are compromised. Data is being accessed only when it changes and through more manageable pipelines.
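One way to picture the reduced attack surface is a backup-only endpoint that accepts a narrowly scoped credential and records every call. The scope name, the classes and the audit record layout below are assumptions made for illustration, not a description of any particular product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credential:
    name: str
    scopes: frozenset  # scope names such as "read:changes" are hypothetical


@dataclass
class BackupEndpoint:
    """Toy backup-only endpoint: narrow scope, every call audited."""
    audit_log: list = field(default_factory=list)

    def read_changes(self, cred, dataset):
        allowed = "read:changes" in cred.scopes
        # Record the attempt before deciding, so denials are audited too.
        self.audit_log.append((cred.name, "read_changes", dataset, allowed))
        if not allowed:
            raise PermissionError(f"{cred.name} lacks read:changes")
        return f"changed data for {dataset}"
```

A compromised backup credential in this model can only read changed data through one audited path, rather than browse the whole file system as an ordinary user would.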

The Architect’s View

As we move to a more automated method of infrastructure and application deployment, we will need efficient automation tools. Storage is getting there pretty quickly.

The idea that backup administrators should be manually configuring backups is also a thing of the past. Data protection is already another checklist item in application deployment.

The question we need to ask is, with APIs as the preferred method of operation, how are vendors enabling storage and backup infrastructure to be both deployed and managed via API? How is the industry trying to standardise API specifications, to have, for example, a minimum viable set of API calls for any backup solution to create and query backup status?
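To give a feel for what a minimum viable set of calls might look like, here is a purely hypothetical sketch. No such standard exists as far as this article describes; the interface, method names and status values are invented to show how tooling could be written once against a common contract.

```python
from abc import ABC, abstractmethod
from enum import Enum


class BackupStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class BackupAPI(ABC):
    """Hypothetical vendor-neutral minimum: create a backup, query its status."""

    @abstractmethod
    def create_backup(self, source: str) -> str:
        """Start a backup of the source and return a backup identifier."""

    @abstractmethod
    def get_status(self, backup_id: str) -> BackupStatus:
        """Return the current status of a previously created backup."""


class InMemoryVendor(BackupAPI):
    """Toy implementation standing in for one vendor's product."""

    def __init__(self):
        self._jobs = {}

    def create_backup(self, source):
        backup_id = f"job-{len(self._jobs) + 1}"
        self._jobs[backup_id] = BackupStatus.SUCCEEDED  # toy: completes instantly
        return backup_id

    def get_status(self, backup_id):
        return self._jobs[backup_id]


def protect(api: BackupAPI, sources):
    """Tooling written once against the interface works with any implementation."""
    return {source: api.get_status(api.create_backup(source)) for source in sources}
```

The protect function never refers to a specific vendor, which is the practical benefit a shared specification would bring.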

In the same way that S3 has become a de-facto standard for object storage that all vendors (except possibly Microsoft) follow, isn’t it about time we had a practical successor to NDMP that the industry could align behind? That way, rather than waste time coding for so many solutions, we could start comparing them on the merits of their features and implementations.