How to pick the right shared storage for modern databases

Gene Amdahl famously said that “the best I/O is the one you don’t have to do”, and having experienced the challenges of building massive Oracle databases back in the 1990s, I agree. In today’s environment of diverse database solutions, which storage design best fits the requirements of modern databases?

Touching briefly on the Amdahl database challenge: around 1995, I worked on a project to run Oracle on a mainframe (crazy, I know), with a shedload of system memory on an Amdahl 5995 12-CPU system. The most noticeable issues came from disk I/O – initially at start-up to load the database into memory and again whenever the system crashed (which it did regularly), to write the contents of memory to dump datasets. Imagine running on today’s fastest Xeon processors and backing everything with 5400 RPM laptop HDDs and you get the scale of the problem. When everything was up and being read from memory, performance was great, but not so if we had to do any I/O.

The topic of storage and databases is one I’ve been discussing for years. In this article for Computer Weekly back in 2014, for example, I looked at the storage requirements for in-memory databases such as SAP HANA.

Requirements

Historically, database storage requirements have focused on the specifics of storage technology. Database administrators would typically ask for RAID-5 storage for data and RAID-1/10 for logs. It’s surprising to see this thinking pervading design considerations even today (look to the section labelled “SSDs for IO Intensive Applications” in this MongoDB requirements guide). The reason for the split was simple: RAID-1/10 provides lower latency for write I/O, which makes up the majority of log activity. Of course, modern application design looks to cast aside shared storage, hence the focus on the hardware specifics.
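To illustrate why write-heavy logs were steered towards mirroring, here is a minimal sketch that counts back-end disk operations per small random host write. The penalty values are the commonly quoted textbook figures (assumed here, not taken from any vendor or from the MongoDB guide), and the function name is hypothetical.

```python
# Back-end disk operations generated by one small random host write.
# Assumption: commonly quoted textbook write penalties, ignoring any cache.
WRITE_PENALTY = {
    "RAID-1/10": 2,  # write the block to both sides of the mirror
    "RAID-5": 4,     # read old data, read old parity, write new data, write new parity
}


def backend_write_ops(host_writes: int, raid_level: str) -> int:
    """Return the number of disk operations needed to service host_writes."""
    return host_writes * WRITE_PENALTY[raid_level]


if __name__ == "__main__":
    for level in WRITE_PENALTY:
        print(level, backend_write_ops(1000, level))
```

Under these assumptions, the same log workload costs twice as many disk operations on RAID-5 as on RAID-1/10, before any caching is taken into account.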

These days we’re moving further towards more in-memory computing where the entire contents of the database are kept in RAM. We’ve discussed this recently on Storage Unpacked, firstly with Apache Ignite and GridGain Systems (episode #169) and also in respect to Redis (episode #147).

Relative to overall system cost, memory prices are far lower than they were 10-15 years ago. As a result, we’re seeing more RAM being used to keep as much data as possible close to the processor. How does this change the profile of external I/O?

Read & Write Bifurcated

We can think of in-memory or large memory-based databases as having two main requirements. The first is to load data on start-up. In this instance, the aim is to get all of our data into RAM as fast as possible. This is especially true with clustered databases like Apache Ignite, where a node failure or software issue means reloading the entire data contents. With a multi-terabyte database, throughput is critical.
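A rough way to see why throughput dominates at start-up is to divide the database size by the sustained read rate. The sketch below does only that; the 4 TB size and the two throughput figures are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def load_time_seconds(db_bytes: int, throughput_bytes_per_sec: int) -> float:
    """Time to stream a database into RAM at a sustained read rate.

    Assumption: the load is purely throughput-bound (no index rebuild,
    no CPU or network bottleneck).
    """
    return db_bytes / throughput_bytes_per_sec


if __name__ == "__main__":
    TB = 10**12  # decimal terabyte
    GB = 10**9   # decimal gigabyte
    db_size = 4 * TB  # hypothetical multi-terabyte database
    for rate in (2 * GB, 20 * GB):  # hypothetical sustained read rates
        minutes = load_time_seconds(db_size, rate) / 60
        print(f"{rate // GB} GB/s -> {minutes:.1f} minutes")
```

With these example numbers, a tenfold increase in read throughput cuts the reload from roughly half an hour to a few minutes, which is the window during which a failed node is out of service.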

The second requirement is for checkpointing. At some point in time, a database has to write updates to persistent media. In-memory solutions might do this infrequently through checkpoints, while traditional databases with lots of RAM might write continuously (with minimal read I/O). In both cases, checkpointing becomes a write-intensive activity.

So, generally speaking, read I/O will be throughput-sensitive, while write I/O will be latency-sensitive.
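The latency sensitivity of writes can be shown with simple arithmetic. This sketch assumes a single writer that waits for each write to be acknowledged before issuing the next, with no batching or group commit; both the model and the function name are simplifying assumptions for illustration.

```python
def max_sync_commits_per_second(write_latency_us: float) -> float:
    """Upper bound on commits per second for one writer that waits for
    each write to be acknowledged before starting the next one.

    Assumption: one storage write per commit, no batching or group commit.
    """
    return 1_000_000 / write_latency_us


if __name__ == "__main__":
    for latency_us in (1000, 100):  # hypothetical write latencies in microseconds
        rate = max_sync_commits_per_second(latency_us)
        print(f"{latency_us} us -> {rate:.0f} commits/s")
```

In this model, the ceiling on commit rate is set entirely by write latency, regardless of how much bandwidth the storage can deliver.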

Asymmetric

If you want to build a system to manage high volumes of write I/O, then the traditional method to cope with the demand is to use DRAM caching on the storage platform. This design stretches back almost 30 years, to when EMC built the first integrated cached disk arrays, or ICDA for short. The cache acts as a buffer to soak up writes, coalesce them and flush to disk. In an age of spinning media, this makes a lot of sense, especially with RAID involved, where the data needs to be striped and parity calculated.
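The soak-up, coalesce and flush behaviour can be sketched with a toy write-back cache. This is a simplified illustration only, not how any particular array implements its cache; the class and method names are hypothetical, and it ignores cache protection, capacity limits and RAID striping.

```python
from typing import Dict, List, Tuple


class WriteCache:
    """Toy write-back cache: absorbs block writes in memory, flushes in runs."""

    def __init__(self) -> None:
        self._dirty: Dict[int, bytes] = {}
        self.host_writes = 0

    def write(self, block: int, data: bytes) -> None:
        # Acknowledge straight away; rewriting a block replaces the cached copy,
        # so repeated updates to the same block reach disk only once.
        self._dirty[block] = data
        self.host_writes += 1

    def flush(self) -> List[Tuple[int, List[bytes]]]:
        """Return (start_block, [data, ...]) runs of contiguous dirty blocks."""
        runs: List[Tuple[int, List[bytes]]] = []
        for block in sorted(self._dirty):
            if runs and block == runs[-1][0] + len(runs[-1][1]):
                runs[-1][1].append(self._dirty[block])  # extend the current run
            else:
                runs.append((block, [self._dirty[block]]))  # start a new run
        self._dirty.clear()
        return runs


if __name__ == "__main__":
    cache = WriteCache()
    for block, data in [(10, b"a"), (11, b"b"), (12, b"c"), (11, b"B"), (20, b"d")]:
        cache.write(block, data)
    runs = cache.flush()
    print(cache.host_writes, "host writes ->", len(runs), "disk writes:", runs)
```

In this example, five host writes collapse into two sequential disk writes, which is the kind of reduction that made caching so effective in front of spinning media.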

Scaling with DRAM can be expensive (as I discussed in this post two years ago) and puts constraints on how far systems can scale. Dell EMC’s newest platform PowerStore, for example, consumes up to 2.5TB of DRAM with just 4 CPUs (112 cores) and only 96 drives.

Persistent Memory

Some vendors have taken a different approach and bypass writing I/O to cache. Instead, these systems use Intel Optane as a fast write tier that eventually cascades down to cheaper NAND flash media. We can see this architecture in systems from StorONE and VAST Data. Pavilion Data does something similar but uses a large number of additional controllers to improve performance and writes directly to NVMe SSDs, not Optane. When Optane is used as a write tier, the bulk of data can be stored on cheaper NAND media like QLC, as I discuss in this recent post.

Eliminating DRAM has other benefits. You can read more about this in a recent post here.

The Architect’s View

Where does this position us when thinking about modern databases? First, we need to remember that we can get the best benefit from DRAM when it’s close to the CPU and used for “real work”. DataCore has used this approach in its solutions for quite some time. This process is about avoiding I/O, as we said at the very top of this article. To really optimise performance, the long-term trend is to keep as much data as possible in DRAM.

Storage is the backing store, doing what it does best in keeping data safe. When in-memory databases need to commit data, the write process needs to operate fast and with the lowest latency possible, because the database will be waiting for that process to complete.

Looking back to the MongoDB article we referenced earlier, it’s clear that an assumption is being made to use local storage instead of a shared solution. Shared arrays may seem like yesterday’s technology, but that’s far from reality. In-memory databases need to replicate for resiliency, and that also includes the storage. Shared storage offers a much more efficient solution and also meets the needs of large DRAM-heavy databases that have more specialist I/O requirements. As I mentioned in this post last year, SAN 2.0 is about evolving to relevant new architectures, not throwing away the benefits of shared infrastructure that’s been a proven solution for decades.