This is one of a series of posts discussing storage array architectures.
In the first post, I discussed the shared storage model architectures typified by what we sometimes think of as Enterprise arrays, but which I’ve called monolithic. This term harks back to the mainframe days of large single computers (see Wikipedia definition), hence its use to describe storage arrays with a large single cache. In the last 10 years we have seen a move away from the single shared cache to a distributed cache architecture built from multiple storage engines or nodes, each with independent processing capability but sharing a fast network interconnect. Probably the best-known implementations of this technology have come from 3Par (InServ), IBM (XIV) and EMC (VMAX). Let’s have a look at these architectures in more detail.
The VMAX architecture consists of one to eight VMAX engines (storage nodes) connected together by what is described as the Virtual Matrix Architecture. Each engine acts as a storage array in its own right, with front-end host port connectivity, back-end disk directors, cache (which presumably is mirrored internally) and processors. The VMAX engines connect together using the Matrix Interface Board Enclosure (MIBE), which is duplicated for redundancy. The virtual matrix enables inter-engine memory access, which is required to provide connectivity when the host access port isn’t on the same engine as the data. There are two diagrams in the gallery at the end of this post, one showing the logical view of the interconnected engines and the second showing how back-end disk enclosures are dedicated to each engine.
What’s not clear from the documentation is how the virtual matrix architecture operates, other than that it is based on RapidIO. I’m not sure whether VMAX engines have direct access to the cache in other engines or whether the processor of the connected engine has to be involved. In addition, can an engine access cache in another engine purely to manage throughput of the local host and disk connections? I’m not entirely sure.
3Par storage arrays consist of multiple storage nodes joined through a high-speed interconnect. They describe this as their InSpire architecture. From 2 to 8 nodes are connected (in pairs) to a passive backplane with up to 1.6Gb/s of bandwidth between each pair of nodes. 3Par use the diagram shown here to demonstrate their architecture and with 8 nodes, the number of connections can easily be seen. I’ve also shown how connectivity increases in 2, 4, 6 and 8 node implementations. InServ arrays write cache data in pairs, so each node has a partner. Should one node of a pair fail, the cache of the surviving partner is immediately written to another node (if one is present), so protecting the cache data.
The InServ and VMAX architectures are very similar but differ from each other in one subtle but important way. 3Par InServ LUNs are divided into chunklets (256MB slices of disk) that are spread across all disks within the complex. So as an array is deployed and LUNs are created, all of the nodes in the array are involved in serving data. VMAX uses the Symmetrix architecture of hypers (large slices of disk) to create LUNs, with four hypers used to create a 3+1 RAID-5 LUN, for example. As new engines are added to a VMAX array, the data is not redistributed across the new physical spindles, so data access is unbalanced across the VMAX engines and physical disks. In this way, InServ has better opportunities to optimise the use of nodes, although within VMAX the use of Virtual Provisioning can help to spread load across disks in a more even fashion. In addition, a fully configured VMAX array has up to 128Gb/s of bandwidth across the VMA, exceeding InServ’s capacity.
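To make the balancing point concrete, here is a toy model of the two behaviours. It is only a sketch: the round-robin placement, the function names and the disk counts are my own assumptions for illustration, not the actual allocators used by 3Par or EMC.

```python
from collections import Counter

def wide_stripe(num_chunks, disks):
    """Toy wide striping: chunks are placed round-robin across every disk."""
    return Counter(chunk % disks for chunk in range(num_chunks))

def no_redistribution(num_chunks, old_disks, new_disks):
    """Toy model of growth without rebalancing: existing chunks stay on the
    original disks and the newly added disks start out empty."""
    load = Counter({disk: 0 for disk in range(old_disks + new_disks)})
    for chunk in range(num_chunks):
        load[chunk % old_disks] += 1
    return load

# Hypothetical example: 1200 chunks, 16 original disks, 8 disks added later.
striped = wide_stripe(1200, 24)
grown = no_redistribution(1200, 16, 8)

print("wide striping, chunks per disk:", sorted(set(striped.values())))
print("no redistribution, chunks per disk:", sorted(set(grown.values())))
```

In the first case every disk carries the same share of the data; in the second the original disks carry everything and the new disks carry nothing until new LUNs are created on them.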
In my opinion the tradeoff here comes down to increased scalability with dedicated nodes versus the latency introduced when data isn’t located on the local node. In the 3Par model, data is always being accessed across nodes. In the EMC model, nodes only exchange data when the LUN’s physical disks aren’t located on the local node. Scaling out the node count leads to two problems. Firstly, as more nodes are added, the number of node-to-node connections grows quadratically: a full mesh of n nodes needs n(n-1)/2 links. For an 8-node array, there are at least 28 node-to-node connections (not including additional connections for redundancy). This increases to 120 for 16 nodes (more than a four-fold increase in connectivity for double the nodes) and 496 connections for 32 nodes, to which VMAX can theoretically scale. The second issue is that of diminishing returns. As more nodes are added, more overhead is required to service data not found on the local node. This leads to a situation where the benefit of adding further nodes is so small as to make it not worth doing.
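The connection counts can be checked with a few lines of Python. This is a minimal sketch assuming a full mesh with exactly one link per pair of nodes and no redundant links; the function name is my own.

```python
def full_mesh_links(nodes: int) -> int:
    """Links needed to connect every node directly to every other node."""
    return nodes * (nodes - 1) // 2

for n in (2, 4, 6, 8, 16, 32):
    print(f"{n:>2} nodes -> {full_mesh_links(n):>3} links")
```

Doubling the node count roughly quadruples the number of links, so the link count grows with the square of the node count.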
The IBM XIV array takes a different approach to node configuration, one that is directly tied to the underlying data protection mechanism of the hardware. XIV uses only RAID-1 style protection, based on 1MB chunks of data known as partitions. Data is dispersed across nodes in an even and pseudo-random fashion, ensuring that for any LUN, data is written across all nodes. The architecture is shown in the XIV picture in the gallery at the end of this post. Nodes (known in XIV as modules) are divided into interface and data types. Interface modules have cache, processors, data disks and host interfaces. Data modules have no host interfaces but still have cache, processors and disk. Each module has twelve (12) 1TB SATA drives. As data is written to the array, the 1MB partitions are written across all drives and modules, ensuring that the two mirrored copies of any single partition do not reside on the same module. Sequential partitions for a LUN are also spread across modules. The net effect is that all modules are involved in servicing all volumes and the loss of any single module does not cause data loss.
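The placement rule described above can be sketched as follows. This is not IBM's distribution algorithm, which I haven't seen documented in detail; it is a hypothetical illustration using Python's `random` module, and the module count of 15 and the function name are assumptions. The only properties it demonstrates are the ones stated in the text: copies are spread pseudo-randomly, and the two copies of a partition never share a module.

```python
import random

DRIVES_PER_MODULE = 12  # from the text: twelve drives per module

def place_partitions(num_partitions, modules, seed=0):
    """Return a list of ((module, drive), (module, drive)) pairs, one per
    1MB partition, with the two mirrored copies on different modules."""
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    layout = []
    for _ in range(num_partitions):
        # sample() picks two distinct modules, so the copies never collide
        primary_mod, mirror_mod = rng.sample(range(modules), 2)
        primary = (primary_mod, rng.randrange(DRIVES_PER_MODULE))
        mirror = (mirror_mod, rng.randrange(DRIVES_PER_MODULE))
        layout.append((primary, mirror))
    return layout

# Hypothetical example: a 10,000-partition volume on a 15-module system.
layout = place_partitions(10_000, modules=15)
modules_used = {copy[0] for pair in layout for copy in pair}
print("modules holding data:", len(modules_used))
```

Because every module ends up holding part of the volume, losing one module still leaves a surviving copy of each partition on some other module.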
Whilst XIV might be tuned for performance, there is still the inherent risk (however small) that a double disk failure results in significant data loss, as all LUNs are spread across all disks. Additionally, the XIV architecture requires that every write operation go through the Ethernet switches, as data is written to the cache on the primary and secondary modules before being confirmed to the host. As a consequence, the overall bandwidth of a single module is limited to the available network capacity, which is 6Gb/s for interface modules and 4Gb/s for data modules. This value halves if either of the Ethernet switches fails.
The multi-node storage arrays on the market today are all implemented in slightly different ways. Each has positive and negative points that contribute to the overall decision on which platform to choose for your data. Whether any of them is suitable for “Enterprise” class data is an open question that continues to be the subject of much debate. From my perspective I would want a “tier 1” storage array to provide high levels of availability and performance, something each of these devices is capable of achieving.
Next I’ll discuss modular arrays and the benefits of dual controller architecture.