Databases - The Fourth (Storage) Protocol?

We’re used to block, object and file as standard protocols or types for data storage. Could databases, or data structures, become a fourth protocol or data type that can be stored, retrieved, and managed as easily as the other three?

Background

Back in 2019, we suggested that databases were about to become a significant area of competition in the public cloud. Part of the reasoning was the emergence of cloud-based versions of existing open-source databases. AWS had just announced DocumentDB (compatible with the MongoDB 3.6 API), and, as we discussed two years later, cloud vendors were able to charge more per virtual instance by running managed databases on top of them. Of course, AWS wasn’t first in the DBaaS (database as a service) market. MongoDB first launched Atlas back in 2016, for example, while Fauna (developers of FaunaDB) was founded in 2012.

DBaaS

We spoke to Evan Weaver, co-founder and CTO of Fauna, on a podcast back in 2020. The platform is interesting as the end user has no concept of how the database services are delivered but simply stores and retrieves data through an endpoint.

Our recent podcast with Mat Keep from MongoDB goes further into the options now available to developers. There’s a continuum of solutions available, from serverless, fully managed DBaaS, through MongoDB Atlas running on the public cloud, to self-managed deployments on-premises or in the public cloud.

Pure Storage announced Portworx Data Services back in September 2021. This solution enables end users to run common databases on Kubernetes. We looked at what the Portworx solution offered in a podcast (available here) and blog posts – here and here.

As an example of how easy it now is to deploy databases on Kubernetes, we ran some performance testing in 2022 that used common database platforms built within minutes using some standardised scripting. Here’s our discussion of the work we did, which was sponsored by Ondat.

Time to Value

For developers, DBaaS represents an opportunity to massively speed up the deployment of a test environment, which ten years ago could have taken weeks to put in place. Now, within minutes, a database can be created and easily accessed through an endpoint and typically some SDK tools that provide standardised commands in most common development languages. In return for less control, the developer gets to be more productive.

Arguably, offloading performance tuning and management to a vendor that does this work across thousands or tens of thousands of instances makes more sense than retaining those skills in-house. These solutions can also be more cost-effective for IT organisations because billing is consumption-based, charged either by time or by transaction count, so costs track actual usage. Legacy environments always carried the risk of licence compliance issues, which the service-based model removes. Of course, there’s nothing to stop an IT department from mixing licensing models, as long-running platforms may be more cost-effective on perpetual or annual renewal licences.

The Fourth Protocol

So, could “data structure” be a fourth data protocol? Today we have the following.

Block – the most basic form of data storage, offering high granularity (usually 512-byte blocks or multiples) but with no data awareness.
File – semi-structured content stored in a hierarchical structure (a file system) which adds data security, locking and metadata to content. However, most file data still need another application to understand the details of the contents.
Object – also semi-structured but with fewer data integrity capabilities. Objects are generally managed via CRUD (create, read, update, delete), where an update replaces the whole object, in effect a delete followed by a create. As with file content, object storage generally needs another application to interpret the contents (even if that is something as simple as a PDF reader).
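To make the differences concrete, here is a minimal sketch of the three access models using in-memory stand-ins. The class and function names (BlockDevice, ObjectStore, patch_file) are our own illustrations, not any vendor's API, and real protocols add far more (security, locking, metadata).

```python
import io

BLOCK_SIZE = 512  # the 512-byte granularity mentioned above


class BlockDevice:
    """Toy block device: fixed-size blocks addressed by number, no data awareness."""

    def __init__(self, num_blocks: int) -> None:
        self._num_blocks = num_blocks
        self._data = bytearray(num_blocks * BLOCK_SIZE)

    def _offset(self, lba: int) -> int:
        if not 0 <= lba < self._num_blocks:
            raise IndexError("block address out of range")
        return lba * BLOCK_SIZE

    def write_block(self, lba: int, payload: bytes) -> None:
        if len(payload) != BLOCK_SIZE:
            raise ValueError("payload must be exactly one block")
        start = self._offset(lba)
        self._data[start:start + BLOCK_SIZE] = payload

    def read_block(self, lba: int) -> bytes:
        start = self._offset(lba)
        return bytes(self._data[start:start + BLOCK_SIZE])


def patch_file(f: io.BytesIO, offset: int, payload: bytes) -> None:
    """File-style access: seek to a byte offset and overwrite in place."""
    f.seek(offset)
    f.write(payload)


class ObjectStore:
    """Toy object store: whole objects addressed by key, managed via CRUD."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def create(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    def retrieve(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        del self._objects[key]

    def update(self, key: str, body: bytes) -> None:
        # No partial edit: the old object is dropped and a new one is stored.
        self.delete(key)
        self.create(key, body)
```

In all three cases the storage layer hands back bytes and nothing more. Interpreting those bytes is left to the application, which is the gap a data structure type would fill.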

A data structure type would add greater structure to content while still being managed and accessed through standard APIs or SDKs. We would need some protocol standards in place. Just as NFS and SMB are different protocols for the same file data type, separate standards could cover (for example) relational, document and key/value data stores.
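As a thought experiment, the sketch below shows what a common access contract with more than one "dialect" might look like. Everything here is hypothetical: DataStructureStore, KeyValueStore, DocumentStore and store_and_fetch are names we have invented for illustration, and no such standard exists today.

```python
from typing import Any, Protocol


class DataStructureStore(Protocol):
    """Hypothetical common contract every data structure endpoint would honour."""

    def put(self, key: str, value: Any) -> None: ...
    def get(self, key: str) -> Any: ...
    def delete(self, key: str) -> None: ...


class KeyValueStore:
    """Key/value dialect: opaque values, looked up by key only."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value

    def get(self, key: str) -> bytes:
        return self._items[key]

    def delete(self, key: str) -> None:
        del self._items[key]


class DocumentStore:
    """Document dialect: values are dicts, so the store can query their fields."""

    def __init__(self) -> None:
        self._docs: dict[str, dict[str, Any]] = {}

    def put(self, key: str, value: dict[str, Any]) -> None:
        self._docs[key] = dict(value)  # store a copy, not the caller's object

    def get(self, key: str) -> dict[str, Any]:
        return dict(self._docs[key])

    def delete(self, key: str) -> None:
        del self._docs[key]

    def find(self, field: str, value: Any) -> list[str]:
        # Data awareness: the store itself understands document contents.
        return sorted(k for k, d in self._docs.items() if d.get(field) == value)


def store_and_fetch(store: DataStructureStore, key: str, value: Any) -> Any:
    """Client code written once against the shared contract."""
    store.put(key, value)
    return store.get(key)
```

The same store_and_fetch call works against either dialect, while find is available only where the store understands the structure of what it holds. That mirrors the NFS and SMB comparison: one data type, several protocols.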

The Architect’s View®

Databases are a perfect example of utility computing that can be consumed through a generic endpoint. Global databases are already available – for example, FaunaDB and Google Cloud Spanner. In a quick roundup, we also found SkySQL from MariaDB and Couchbase Capella in addition to the solutions already discussed. There are definitely many more, especially from public cloud platforms.

We believe that data structures should be viewed as the fourth storage data type. In the public cloud, the data structure access model can be delivered directly via an API (such as FaunaDB), as a managed solution (like MongoDB Atlas), or as a managed service from a cloud service provider (AWS, Azure, Google Cloud).

What about on-premises? For traditional storage platforms, we already have solutions that can “multi-task” and deliver block, file, and object at the same time. With Portworx Data Services, Pure Storage has a solution that could offer databases on top of a FlashBlade or FlashArray. Currently, that implementation would need servers to run a resilient Kubernetes cluster, but there’s no reason that functionality couldn’t be integrated directly into the platform in the way file services on FlashArray are now offered. For other storage vendors, a data structure protocol is also achievable, as almost all modern storage solutions are just servers with lots of physical storage.

To compete with the flexibility of the public cloud, on-premises storage vendors need to add new and compelling features. DBaaS in the public cloud is already so easy that many developers wouldn’t even consider using on-premises solutions if they could be avoided. Attracting developers to use on-premises on-demand databases seems like a significant opportunity that’s not currently being addressed. Which vendor will step up first and fill the gap?