XtremIO: What You Need to Know (Updated)

With much anticipation, EMC has finally gone GA on their all-flash array, XtremIO, built on the technology of the company of the same name that they acquired in May 2012. The eighteen months that have elapsed since then have seen the rise of a number of competitors and a lot of rumour, discussion and teaser presentations from EMC themselves. So why are EMC so enthused about their technology, what does it offer and how does it compare to the competition?

The Facts

EMC describes XtremIO as a scale-out storage array architecture. It consists of a basic building block known as an X-Brick. At GA, a system can consist of one to four X-Bricks, which together form a tightly coupled node-based architecture. A single X-Brick consists of two 1U controllers (x86 servers), a 25-slot Disk Array Enclosure (DAE) and two 1U battery backup units.

Additional X-Bricks consist of only two controllers and a DAE, plus the addition of a pair of Infiniband switches at first expansion. Therefore a single X-Brick occupies 6U; additional bricks occupy 5U of space. Within the DAE, each SAS drive is a 400GB eMLC SSD. The controllers each have two 8Gb/s Fibre Channel ports, two 10GbE iSCSI ports, two 40Gb/s Infiniband ports (for node communications) and one 1Gb/s management port. In terms of capacity, today’s XtremIO solution scales from a single X-Brick with 10TB of capacity (7.5TB usable) to four X-Bricks with 30TB usable. A capacity increase is expected in 1Q2014 with double capacity SSDs offering up to 60TB in a single four-node cluster. Eight-node clusters are also being talked about, but I have no timescale for that.
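The capacity figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes all 25 DAE slots are populated with 400GB drives (which is what the 10TB raw figure implies) and takes the 7.5TB usable figure per X-Brick as quoted. The function and constant names are mine.

```python
# Figures come from the text; names are illustrative only.
SLOTS_PER_DAE = 25          # assumes every slot holds a drive
SSD_GB = 400                # current eMLC SSD size
USABLE_TB_PER_BRICK = 7.5   # usable figure as quoted, not derived


def cluster_capacity(bricks):
    """Return (raw_tb, usable_tb) for a cluster of `bricks` X-Bricks."""
    raw_tb = bricks * SLOTS_PER_DAE * SSD_GB / 1000  # decimal terabytes
    usable_tb = bricks * USABLE_TB_PER_BRICK
    return raw_tb, usable_tb


for n in (1, 2, 4):
    raw, usable = cluster_capacity(n)
    print(f"{n} X-Brick(s): {raw:.0f}TB raw, {usable:.1f}TB usable")
```

Doubling SSD_GB to 800 and scaling the usable figure in proportion gives the 60TB quoted for a four-brick cluster in 1Q2014.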

The Technology

EMC are claiming a number of distinct and “game changing” technologies within their architecture including:

Content-based Data Placement – As X-Bricks are added to an array, performance will scale out linearly. This is due to the way in which data is dispersed across all of the nodes; every node participates in delivering storage for every LUN, very much in the XIV model. Incoming data is divided into 4KB chunks which are then fingerprinted based on their content (think hashed) and distributed evenly across all nodes in the complex.
Dual-Stage Metadata Engine – As data is stored to disk, it is passed to the owning “engine” or node. This node then decides onto which physical SSD (and where) the data will actually reside. All of this metadata is kept in memory rather than read from disk, to improve performance.
Data Protection – XtremIO claims 8% overhead on data protection, which seems odd as the raw to usable figure is 10->7.5TB, which I make a 25% overhead. However, digging into the detail we see that the way the low overhead is achieved is through using a very wide-stripe RAID implementation that looks like RAID-4 with double parity, or a 23+2 stripe. Two parity blocks in every 25 equates to our quoted 8%. The use of a fixed RAID-4 implementation is very reminiscent of NetApp’s Data ONTAP WAFL.
Shared In-Memory Data – the storage of all metadata in memory means copy services (snapshots, clones, VM copies) can be achieved extremely quickly with no overhead. Again, this is very reminiscent of WAFL, albeit that the data is all permanently in memory.
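To make content-based placement concrete, here is a minimal sketch of the idea: cut incoming data into 4KB chunks, fingerprint each one, and let the fingerprint choose the owning node. The use of SHA-1 and a simple modulo are my assumptions for illustration; EMC has not published the actual hash or mapping in the material I have seen.

```python
import hashlib
import os
from collections import Counter

CHUNK_SIZE = 4096  # 4KB chunks, as described above


def fingerprint(chunk: bytes) -> int:
    """Content fingerprint; SHA-1 is an assumption, not EMC's stated hash."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def owning_node(chunk: bytes, node_count: int) -> int:
    """Pick a node from the fingerprint (modulo is an illustrative choice)."""
    return fingerprint(chunk) % node_count


def place(data: bytes, node_count: int) -> Counter:
    """Count how many chunks of `data` land on each node."""
    counts = Counter()
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        counts[owning_node(chunk, node_count)] += 1
    return counts


# Random data stands in for a host workload.
sample = os.urandom(CHUNK_SIZE * 4000)
print(place(sample, 4))
```

With random content the chunks spread close to evenly across the four nodes, which is where the linear scaling claim comes from. Identical chunks always produce the same fingerprint and so always land on the same node.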

What these features translate to is 100,000 IOPS of random write 4KB data and a quoted average latency of 0.5ms with a 50% read, 50% write workload.

The Fine Detail

Of course nothing is ever what it seems on the surface and we should dig down to the fine detail. Let’s start with scale out. XtremIO is a tightly-coupled scale out solution. By that I mean every node is involved in the presentation of every LUN and there is a lot of data traversing the back-end Infiniband network. However there are some things to note about this design. First, data is spread across every node for performance purposes but not duplicated across it. There is no data redundancy at the node level. The loss of a node results in data loss for every LUN in the array. This is why EMC have gone to great lengths to protect a single node with dual controllers and no single point of failure.

Data Loss

As with EMC VMAX, node failure is a catastrophe. A comparison can be made here to IBM’s XIV in terms of data loss. None other than EMC’s Barry Burke took IBM to task for their flawed architecture design in this post (1.024: something you should know (about xiv)) where he talks about catastrophic data loss on disk failure; now EMC have designed this feature into their own products. Another comparison can be made with HP’s 3PAR technology, also a closely coupled node-based architecture. However the difference here is that data is spread across the nodes in a redundant format and a 3PAR system can survive the loss of a node without impacting data availability (see The HP 3PAR Architecture, page 7).

Let’s talk next about that metadata. Keeping all the data in memory is a good move for performance, but what about dynamically scaling out? In a 2-node XtremIO array, the calculation on metadata will pick one of two nodes for every fingerprinted block of data. If I want to move to four nodes, that calculation is now wrong and has to be modified. But what about both the placement and calculation of the location of existing data? How do I find data in an expanded 2->4 node cluster? Do I maintain two algorithms? Is data moved as the cluster is expanded? Does all the metadata have to be rebuilt? You won’t be surprised to hear that EMC doesn’t support non-disruptive expansion upgrades in version 1.0 of the product.
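The scale of the expansion problem is easy to demonstrate if we assume, purely for illustration, that the owning node is chosen as fingerprint modulo node count. That mapping is my assumption, not a documented XtremIO algorithm, and the block IDs below simply stand in for real content.

```python
import hashlib


def owner(block_id: int, node_count: int) -> int:
    """Owning node for a block under a hypothetical hash-modulo scheme."""
    digest = hashlib.sha1(block_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % node_count


def fraction_moved(blocks: int, old_nodes: int, new_nodes: int) -> float:
    """Fraction of blocks whose owner changes when the cluster is resized."""
    moved = sum(
        1 for b in range(blocks)
        if owner(b, old_nodes) != owner(b, new_nodes)
    )
    return moved / blocks


share = fraction_moved(100_000, 2, 4)
print(f"{share:.1%} of blocks change owner going from 2 to 4 nodes")
```

Under this scheme roughly half of all blocks end up with a different owner after a 2->4 expansion, so either that data is physically moved or the array has to remember which calculation applied to which block. Either way it is not hard to see why a first release would take the expansion offline.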

XDP

I touched earlier on the data protection mechanism, known as XDP by EMC. In fact it’s a dual-parity RAID-4-like implementation with a width of 23+2 (23 data blocks, 2 parity blocks). Like Data ONTAP, EMC wait to be able to write a whole stripe to disk, storing updates in memory until that point (which again explains the high level of hardware redundancy). However, unlike WAFL, XDP is apparently optimised for partial stripe writing. That surely means some stripes will be less than 23+2 in size, once the array becomes full (as the array doesn’t do garbage collection and simply overwrites data). If that’s the case then this means at best XDP will achieve 8% overhead, but as the array fills up, this overhead will increase when data is written to released or dirty pages. For workloads with large block transfers, this may not be a big issue, but where hosts are doing small block (e.g. 2K/4K) I/O, the data layout could quickly become fragmented, increasing XDP overhead. One final comment about the RAID-4-like model. The Achilles heel for NetApp with this design was the inability to implement block-level tiering. This will apply to XtremIO too.

Update: Based on comments from “Felix” from EMC the above section needs revision; EMC’s documentation states XtremIO does a partial stripe update, not a partial stripe write, which means data stays at 23+2 for all RAID stripes, so there would be no reduction in RAID efficiency as the array fills up. However it does mean that XDP has to do updates in place for stripes – described in the documentation as occurring “almost never”. I have left the original text in place, as a reference to explain the update and the details in the comments.
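For reference, the arithmetic behind both the original concern and the correction is straightforward. The sketch below computes parity overhead for a dual-parity stripe of a given width. Only 23+2 is a width XtremIO is documented to use; the narrower widths are hypothetical and show what shortened stripes would have cost had they existed.

```python
def parity_overhead(data_blocks: int, parity_blocks: int = 2) -> float:
    """Share of a stripe consumed by parity (dual parity by default)."""
    return parity_blocks / (data_blocks + parity_blocks)


# 23+2 is the documented XDP width; the others are hypothetical.
for width in (23, 15, 7, 3):
    print(f"{width}+2 stripe: {parity_overhead(width):.1%} overhead")
```

At a fixed 23+2 the overhead stays at 8% however full the array gets, which is the point of the correction above.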

Advanced Features

One final thought on advanced features. Today XtremIO has no remote replication capabilities but can do snapshots (EMC use the example of VAAI-like full copy of a VM). This is a good feature (as it was in Data ONTAP 20 years ago), but has one fatal flaw. Today many customers of VMAX use volume clones for a form of local data protection. A LUN is cloned (i.e. a full copy) to a different RAID set of physical disks to protect against drive failure. XtremIO snapshots won’t provide that equivalent protection as all the data sits on exactly the same set of SSDs.

It’s No VMAX

At first, I looked at XtremIO as a future replacement for VMAX. It has the same hardware/redundancy design as VMAX and VNX and nodes scale out similarly to VMAX. Today’s customers would certainly be comfortable with that. However when you dig into the detail, there are design issues around the risk/impact of hardware failure, lack of tiering capability and limited practical scalability (unless EMC start expanding X-Bricks with additional shelves). XtremIO isn’t about to replace VMAX/VNX any time soon. So in the short term, XtremIO has to stack up to the existing competition. The startups haven’t been shy in coming forward to stick the knife into EMC, which they can hardly be blamed for, based on EMC’s previous marketing strategies (see my related links for details). XtremIO doesn’t offer the same capacity as Pure Storage; it isn’t as fast as Violin Memory; it doesn’t scale as well as Kaminario or SolidFire.

Compare XtremIO to the established vendors; Hitachi have a flash solution based around their FMDs, recently doubled in capacity; HP have the 3PAR 7450 architecture delivering 500,000+ IOPS with better resiliency. Both of these solutions provide access to existing features (snapshots, replication, tiering, resilience) without implementing yet another architecture and management platform.

The Architect’s View

XtremIO isn’t the game changer that EMC promised us. No doubt EMC will sell a boat-load of this product to their existing customers, but the technology keeps us in that monolithic shared-storage model, which will increasingly look outdated as time goes by. Is it then a stop-gap before VMAX-2? The problem there is that VMAX is 20+ year old technology and has its own issues. What are EMC to do? There must be some interesting discussions going on in the halls of Hopkinton these days; EMC could be at one of those famous inflection points and not even know it.