Do we need Storage QoS? - Architecting IT

I recently had a discussion with a vendor (who shall remain nameless) as to whether we really need Quality of Service in shared storage arrays. His thinking went as follows: if we have a storage array and network with sufficient bandwidth/IOPS, then why bother implementing QoS? At first this seems like a reasonable assumption: if I have more resources than required, I can cater for all requirements, so what’s the problem? To test whether this makes sense, let’s step back and look at how persistent storage has been delivered over the last 15-20 years.

The Problem

Persistent storage has always been the bottleneck in computing because I/O to disk and tape occurs so much slower than operations in the processor and memory. The differences are huge, with storage being three to six orders of magnitude slower than data movement within the processor and memory (milliseconds compared to microseconds and nanoseconds). There was good reason for Gene Amdahl to say “the best I/O is the one you don’t have to do”. External I/O slows things down. Because of this, storage has always worked to deliver I/O requests as fast as possible. It’s the difference between the “McDonald’s” method of food delivery and booking into a restaurant where the time slots are allocated in advance.

To extend the analogy further, with McDonald’s, customers are served pretty much in order, even if their selection isn’t immediately available. Choose the wrong queue and you could be behind someone who is indecisive, is ordering for a coach-load of people or simply has a slow server. There’s no prioritisation or special treatment; delivery time is unpredictable. More bandwidth is provided by adding more servers (which has limits of scalability). Restaurants, by comparison, book time slots to ensure that the chefs can deliver the food in a timely manner. The cooking is spread out (hopefully) evenly across the evening to provide a more consistent experience. Slots are limited, curated and managed. Turn up without a booking and you will be turned away. Restaurants “scale up” by adding more covers (seats & tables) and matching this with more staff.

When storage arrays were built from hard disk drives (HDDs), I/O response was unpredictable and highly variable, depending on the workload profile of the I/O requests. Vendors used techniques like caching, pre-fetch, queue re-ordering and destaging to mitigate the peaks and streamline the I/O. Some vendors implemented prioritisation techniques that were not QoS but aimed at getting as much back-end I/O completed as possible. With flash, these problems have been less apparent, as SSDs provide higher throughput and much lower latency than HDDs, even with random workloads (subject to managing issues like garbage collection). I/O to hosts is more predictable and consistent but still occurs over shared components like front-end ports, internal software queues, back-end controllers and shared SSDs.

Noisy Neighbour

Because of this shared nature, it’s possible to experience the ‘noisy neighbour’ problem, where one host monopolises the I/O traffic to the detriment of others. Even with SSDs on the back end, front-end queues in hardware (like FC HBAs) and software queues (like those updating metadata) will still see contention and potentially some delay. QoS allows that contention to be controlled and SSDs allow the I/O to those hosts to be delivered consistently.
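To make the noisy neighbour effect concrete, here is a minimal sketch in Python. It is a toy model, not a description of how any particular array works: the fixed per-I/O service time, the host names and the burst sizes are all assumptions chosen for illustration. It compares a single shared first-in, first-out queue with a simple per-host round-robin scheme, which stands in for the kind of contention control QoS can provide.

```python
from collections import deque

# Assumption: every I/O costs the same fixed time on the shared component.
SERVICE_TIME_MS = 0.1


def fifo_completion_times(requests):
    """Serve requests strictly in arrival order; return completion times per host."""
    clock = 0.0
    done = {}
    for host in requests:
        clock += SERVICE_TIME_MS
        done.setdefault(host, []).append(clock)
    return done


def round_robin_completion_times(requests):
    """Give each host its own queue and serve one I/O from each queue in turn."""
    queues = {}
    for host in requests:
        queues.setdefault(host, deque()).append(host)
    clock = 0.0
    done = {}
    while any(queues.values()):
        for host, queue in queues.items():
            if queue:
                queue.popleft()
                clock += SERVICE_TIME_MS
                done.setdefault(host, []).append(clock)
    return done


def mean(values):
    return sum(values) / len(values)


# Hypothetical burst: 'noisy' submits 1000 I/Os just ahead of 10 from 'quiet'.
burst = ["noisy"] * 1000 + ["quiet"] * 10
fifo = fifo_completion_times(burst)
fair = round_robin_completion_times(burst)

print(f"quiet host, shared FIFO queue: {mean(fifo['quiet']):.2f} ms mean completion")
print(f"quiet host, per-host queues:   {mean(fair['quiet']):.2f} ms mean completion")
```

In this toy model the quiet host waits roughly 100 ms on average behind the burst in the shared queue, against roughly 1 ms when the queues are served in turn. The total work done is identical in both cases; only the ordering changes.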

So QoS does have a place, even with a system that appears to have plenty of I/O capacity, if for no other reason than to ensure the I/O capability is shared evenly between all the systems. QoS becomes even more useful when there is contention for other resources such as SSDs, processors or system memory. Prioritisation can be used to determine which workloads are throttled first, protecting the mission-critical systems. Finally, we should remember that QoS also allows cloud-based deployments to ensure that customers (internal or external) only get the resources they pay for (politely known as a “consistent experience”).
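One common way to cap a workload at the level it has paid for is a token bucket. The sketch below is a generic illustration rather than any vendor’s implementation; the class name, the `iops_limit` and `burst` parameters and the figures used are assumptions. Time is passed in explicitly so the behaviour is easy to reason about.

```python
class TokenBucket:
    """Toy IOPS limiter: tokens refill at a fixed rate, each I/O spends one."""

    def __init__(self, iops_limit, burst):
        self.rate = float(iops_limit)   # tokens added per second
        self.capacity = float(burst)    # maximum tokens that can be saved up
        self.tokens = float(burst)      # start with a full bucket
        self.last = 0.0                 # time of the previous request, in seconds

    def allow(self, now):
        """Return True if one I/O may be admitted at time `now` (seconds)."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted_in_one_second(bucket, offered_iops):
    """Offer evenly spaced I/Os for one second and count how many are admitted."""
    step = 1.0 / offered_iops
    return sum(bucket.allow(i * step) for i in range(offered_iops))


# Hypothetical tenants: both are limited to 100 IOPS with a burst allowance of 10.
greedy = admitted_in_one_second(TokenBucket(iops_limit=100, burst=10), 1000)
modest = admitted_in_one_second(TokenBucket(iops_limit=100, burst=10), 50)

print(f"offered 1000 IOPS, admitted {greedy}")
print(f"offered 50 IOPS, admitted {modest}")
```

The tenant that offers 1000 IOPS is held to roughly its 100 IOPS limit plus the initial burst, while the tenant that stays under its limit is never throttled. A real array would typically queue or delay the excess I/O rather than reject it, and could apply different limits per priority class.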

This final point is quite important. We are moving to a model that delivers IT as a service for all components, not just storage. Today, SSD is the fastest medium widely adopted in the industry; tomorrow it could be NVDIMM or 3D XPoint. Without some service-based controls, IT organisations will find it difficult to introduce new technology without affecting the user experience. Separating the service from the underlying technology allows that technology to be delivered in the optimal way.