Caching vs Tiering - Architecting IT

New technologies are coming to storage and data centres that add to the hierarchy of choices available to system and solution architects. New media expands on the portfolio of hard drive and SSD technologies that have represented the bulk of persistent storage used in the enterprise. These solutions can be used either as a cache or as a tier of storage. We look at both to determine the pros and cons of each approach.

What is Caching?

Caching is a technique used to speed up the relatively poor access times of external persistent media. In a typical computer system with hard drives and no caching, each I/O event would take at best ten milliseconds to occur and could be longer with a high degree of random data. If we compare the speed of system memory and HDDs, the hard drive is around 100,000 times slower than DRAM (10 milliseconds compared to 100 nanoseconds). To put that in human terms, if a DRAM access took one second, an HDD on the same scale would take 100,000 seconds, or a little over a day (around 1.16 days), to answer each I/O.
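The scaling comparison can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two latency figures quoted above; the variable names are illustrative.

```python
# Latency figures quoted in the text, expressed in nanoseconds.
DRAM_LATENCY_NS = 100            # 100 nanoseconds
HDD_LATENCY_NS = 10 * 1_000_000  # 10 milliseconds

SECONDS_PER_DAY = 24 * 60 * 60

# How many times slower the hard drive is than DRAM.
slowdown = HDD_LATENCY_NS // DRAM_LATENCY_NS

# If one DRAM access is stretched to one second, an HDD access
# takes `slowdown` seconds on the same scale.
scaled_hdd_days = slowdown / SECONDS_PER_DAY

print(f"HDD is {slowdown:,}x slower than DRAM")
print(f"Scaled HDD access time: {scaled_hdd_days:.2f} days")
```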

Caching uses faster media (typically system DRAM) to store a copy of the data that would usually reside on disk or SSD. By accessing the data from the cache, overall access times are vastly improved, and this directly translates to better application performance and lower response times.

Cache all the Data?

Why not just put all data in DRAM? Well, DRAM is expensive and volatile. The amount of DRAM that can be deployed into a single server is also finite, typically a few terabytes. Even scaling to the terabyte level can require multi-socket servers, which introduces extra cost and the potential for wasted resources. Fortunately, we don’t need to put all data into DRAM. Instead, caching techniques can rely on a characteristic of active workloads known as the working set.

Working Set

Applications read and write data from either structured content or unstructured files and objects. In both cases, only a subset of data will generally be active at any one time. This feature is known as the Working Set. In a database, the Working Set could be the most recently created transactions or table indexes. In unstructured data, the Working Set could be a frequently accessed set of files. An efficient cache will store the entire Working Set on fast media.
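To illustrate why cache size relative to the Working Set matters, here is a minimal sketch of a least-recently-used (LRU) cache simulation. LRU is an assumption chosen for illustration, as the article doesn't name an eviction policy, and the function name and access pattern are hypothetical.

```python
from collections import OrderedDict


def lru_hit_ratio(accesses, capacity):
    """Replay a sequence of keys through an LRU cache and return the hit ratio."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True              # simulate fetching from the backing store
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(accesses)


# A working set of 100 blocks, scanned repeatedly (10 passes).
working_set = list(range(100))
accesses = working_set * 10

fits = lru_hit_ratio(accesses, capacity=100)      # cache holds the whole working set
too_small = lru_hit_ratio(accesses, capacity=50)  # cache holds only half of it

print(f"capacity 100: {fits:.0%} hits")
print(f"capacity  50: {too_small:.0%} hits")
```

With the whole Working Set cached, only the first pass misses. With half the capacity and this deliberately unfriendly cyclic pattern, LRU evicts each block just before it is needed again, so every access misses. Real workloads are rarely this extreme, but the sketch shows how sharply performance can fall when the Working Set doesn't fit.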

Cache Characteristics

We can summarise the characteristics of a cache as follows:

A copy of a subset of the data, with the original stored elsewhere on persistent media.
A fast type of media, typically DRAM (or as we will see, persistent memory).
A temporary copy of data that changes as the working set changes over time.

There are typically three types of cache used in applications, operating systems and storage systems.

Read Cache – the cache services requests for read I/O only. If the data isn’t in the cache, it is read from persistent storage (also known as the backing store).
Write Cache – all new data is written to the cache before a subsequent offload to persistent media. Write cache designs include write-through, write-around and write-back.
Read/Write Cache – a cache that services both read and write requests.
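The three write cache designs named above differ in when data reaches persistent media. The following is a minimal sketch using plain dictionaries to stand in for the cache and the backing store; the class and method names are hypothetical, and a real implementation would also need the data protection discussed below.

```python
class WriteCache:
    """Toy cache in front of a backing store; both are plain dicts."""

    POLICIES = ("write-through", "write-around", "write-back")

    def __init__(self, policy):
        if policy not in self.POLICIES:
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.cache = {}          # fast media (e.g. DRAM)
        self.backing_store = {}  # persistent media
        self.dirty = set()       # keys cached but not yet persisted

    def write(self, key, value):
        if self.policy == "write-through":
            # Update the cache and persistent media before acknowledging.
            self.cache[key] = value
            self.backing_store[key] = value
        elif self.policy == "write-around":
            # Bypass the cache and drop any stale cached copy.
            self.backing_store[key] = value
            self.cache.pop(key, None)
        else:  # write-back
            # Acknowledge once the cache is updated; persist later.
            self.cache[key] = value
            self.dirty.add(key)

    def flush(self):
        """Offload dirty entries to the backing store (write-back only)."""
        for key in self.dirty:
            self.backing_store[key] = self.cache[key]
        self.dirty.clear()

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.backing_store[key]  # cache miss: go to persistent media
        self.cache[key] = value
        return value


wb = WriteCache("write-back")
wb.write("block-7", b"new data")
print("block-7" in wb.backing_store)  # not yet persisted
wb.flush()
print("block-7" in wb.backing_store)  # persisted after the flush
```

In the write-back case, the window between `write` and `flush` is where data exists only in the cache, which is why this design needs extra protection against loss.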

Caches must be managed efficiently to ensure data integrity and consistency. For example, a read cache has to invalidate entries for data that has changed or been updated. Write caches need additional protection to guard against data loss before data is written to permanent media. In scale-out solutions, cache management gets even more complicated, as each node within a cluster of servers has to be aware of the cache content and data updates of the other cluster members. Caches can also introduce unpredictable I/O performance if the cache becomes full or if the contents of the cache don’t accurately match the working set.

What is Tiering?

Tiering is the process of using multiple types of persistent storage media to optimise hardware costs based on the performance needs of an application and its associated data. Tiering offers the capability to place data on the most appropriate media, the one that delivers the right cost/performance profile.

Given a choice, we would store all our data on the fastest media available. However, as an old storage adage says, capacity is free, but you pay for performance. For example, solid-state disks offer much greater performance than HDDs but come at a higher $/GB cost. If the price of storage media isn’t a consideration, then placing all data onto SSDs offers the best and most consistent performance level. Unfortunately, as data storage volumes have increased, tiering is necessary to ensure inactive data resides on cheaper media such as HDDs.
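The cost argument for tiering can be expressed as a simple weighted average. This is a rough sketch only: the function name is hypothetical, and the relative costs used (SSD at five times the cost of HDD per GB) are placeholder values chosen for illustration, not real prices.

```python
def blended_cost_per_gb(hot_fraction, fast_cost, slow_cost):
    """Average cost per GB when `hot_fraction` of data sits on the fast tier."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    return hot_fraction * fast_cost + (1.0 - hot_fraction) * slow_cost


# Placeholder relative costs (assumptions, not market prices).
HDD_COST = 1.0
SSD_COST = 5.0

all_ssd = blended_cost_per_gb(1.0, SSD_COST, HDD_COST)
tiered = blended_cost_per_gb(0.1, SSD_COST, HDD_COST)  # 10% of data is active

print(f"all-SSD: {all_ssd:.2f} cost units per GB")
print(f"tiered:  {tiered:.2f} cost units per GB")
```

Under these assumed figures, keeping only the active 10% of data on SSD costs well under a third of an all-SSD design, which is the trade-off tiering is built to exploit.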

Tiering Algorithms

Over the years, we’ve seen many tiering algorithms used to actively place data on the most optimal storage media. Within storage systems, automated tiering moves data up or down between solid-state and hard disk drives. This process is generally reactive, making placement choices based on historical data.

The results from these traditional tiering processes are less than desirable. Data can suddenly become active and be on the wrong tier of storage, while other data can go inactive and be using precious resources. Data movement between tiers typically needs some buffer space to work and also consumes media I/O cycles that could be given to the application.
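A reactive tiering pass of the kind described above can be sketched as follows. This is an illustrative simplification rather than any vendor's algorithm: the function name, the extent identifiers and the "most accessed in the last interval" ranking rule are all assumptions.

```python
from collections import Counter


def plan_tier_moves(access_log, current_tier, fast_slots):
    """Return (promote, demote) lists based on historical access counts.

    access_log   -- iterable of extent ids accessed during the last interval
    current_tier -- dict mapping extent id -> "ssd" or "hdd"
    fast_slots   -- number of extents the SSD tier can hold
    """
    counts = Counter(access_log)
    # Rank every known extent by access count, hottest first.
    # Ties are broken by extent id to keep the plan deterministic.
    ranked = sorted(current_tier, key=lambda e: (-counts[e], e))
    hot = set(ranked[:fast_slots])

    promote = [e for e in ranked if e in hot and current_tier[e] == "hdd"]
    demote = [e for e in ranked if e not in hot and current_tier[e] == "ssd"]
    return promote, demote


current = {"a": "ssd", "b": "ssd", "c": "hdd", "d": "hdd"}
log = ["c", "c", "c", "a", "a", "d"]  # accesses seen in the last interval

promote, demote = plan_tier_moves(log, current, fast_slots=2)
print("promote to SSD:", promote)
print("demote to HDD: ", demote)
```

Because the plan is built entirely from the previous interval, extent "c" served all of its recent I/O from HDD before being promoted, and there is no guarantee it stays active afterwards. That lag is the weakness of reactive placement.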

Tier Characteristics

Tiering characteristics can be described as follows:

Tiering stores an entire set of data across multiple media types.
Media offers variations in performance (bandwidth/throughput) and latency.
Tiering algorithms attempt to place data on the most appropriate media based on I/O requirements.

The brief introduction here sets the scene on the basics of tiering and caching. Of course, we could write an entire book on the topics of caching/tiering design. It’s worth remembering, though, that both caching and tiering introduce some degree of compromise in different architectures. These compromises are generally around consistent performance, cost and complexity.

Choices

In storage systems, vendors have typically used caching to improve I/O performance, building in read and write cache functionality. Caching can smooth I/O performance of disk systems and effectively absorb peaks in demand. Tiering exists across the market of storage solutions, allowing vendors to offer customers cost-effective products.

As we highlighted at the start of this article, the choice of media is ever-expanding, offering new ways to implement tiering and caching technologies.

Persistent Memory

Persistent memory, sometimes called storage-class memory, is a new class of devices that deliver high performance, with memory-like access protocols. 3D-XPoint from Intel and Micron, for example, is available as either a solid-state disk or as a memory DIMM. The SSD format delivers latency lower than NAND-based SSDs, with much higher write endurance and greater throughput. You can read more about new storage technologies and their characteristics in the following series of related articles.

Note that there are now also many variations of NAND-based SSDs, including higher-capacity TLC/QLC drives and high-performance models like Samsung Z-NAND.

PM in Storage Arrays

Storage array vendors are now using 3D-XPoint (in the form of Intel Optane) as both a cache and tier of storage. Optane is particularly useful as a write cache due to the improved endurance it provides over NAND SSDs (using the Optane as a read cache doesn’t really do justice to the benefits of the technology).

Optane as a tier of storage delivers high performance and low latency for the most demanding of applications, perhaps where the Working Set size represents a large part of the data and where a cache would be ineffective. An example here could be AI/analytics workloads where most of the data is being actively accessed and benefits from low latency.

Repeating History

In many respects, the introduction of persistent memory is similar to the introduction of SSDs a decade ago. Many vendors simply replaced one or all of the HDDs in a system with SSDs and naturally saw an increase in performance. Unfortunately, this process didn’t always fully exploit the new media. Vendors then moved on to build storage systems specifically to exploit the characteristics of SSDs, resulting in solutions that were much more efficient and predictable than their retrofitted counterparts. We’re at that point again today with the current implementations of persistent memory technologies.

Caching and Tiering

The characteristics of new media mean that traditional caching and tiering processes need to be revised. Optane, for example, is capable of acting as both a write cache and tier at the same time. Vendors including StorONE and VAST Data use a combination of Optane and QLC NAND to deliver performance and capacity tiers, landing data directly onto Optane persistent memory. You can find more on the VAST Data architecture in this series of podcasts and blog post. We’ve discussed StorONE in a recent podcast and will have more content coming soon.

One common thread with both of these solutions is the use of the technologies in new ways that modify the traditional views of tiering and caching. For example, caching can be dropped in favour of directly accessing fast media, removing the complexity of cache management.

The Architect’s View

New media with new characteristics offers the ability to review and revise storage architectural designs. As I discussed in the “20-Year Architectures” article, we can’t simply continue to bolt on new hardware. At some point a complete redesign is necessary.

I expect we will see the adoption of technologies like Optane drive demand, reduce costs and, in turn, make the media more attractive to storage system designers. The most efficient solutions will exploit the unique characteristics of new media, justifying the continued use of dedicated storage appliances for many years to come.