There have been many attempts over the years to develop new storage architectures that aim to resolve some of the presumed issues with using shared or local storage. Most recently we’ve seen hyper-converged and hardware-based designs emerge as viable alternatives to SAN. Nebulon uncloaked from stealth in June 2020 to add a new solution, branded cloud-defined storage, to this market. What is it, and what problems is Nebulon looking to solve?
Shared storage in the form of SANs has been a dominant architecture for the past 20 years. Consolidation into a single appliance improved resiliency, maintenance and efficiency. The trade-off comes with cost and configuration complexity (although it is debatable whether these issues still apply). SAN does have scalability challenges, though, and that can be a problem for large enterprises.
HCI or hyper-converged infrastructure puts storage back into the server, taking advantage of the move to server virtualisation to run storage services in a local virtual machine on each node/server. Together, a cluster of nodes provides resilient storage, albeit at the cost of local CPU and memory resources that are consumed in each server to run the storage virtual machine (SVM).
The Nebulon solution is a hardware add-in card (AIC) that effectively implements the features of a mini-storage array within each server or node. The overhead of assigning server resources to an SVM is eliminated and moved to hardware. Storage within each node is connected via the AIC or services processing unit (SPU), with the SPU, in turn, emulating a local storage controller to the host. Servers in a cluster or “nPod” are connected over 10/25Gb Ethernet between SPUs, enabling mirroring for data protection.
A single nPod scales to 32 servers, with a maximum of 255 volumes (LUNs) per SPU and up to 10,000 volumes per nPod. Each server (effectively each SPU) can support up to 24 physical drives in 1TB, 2TB or 4TB capacities.
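Working from those published limits, a quick back-of-the-envelope calculation shows the raw capacity ceiling of a fully populated nPod (assuming the largest 4TB drives throughout):

```python
# Back-of-the-envelope nPod sizing, using the limits quoted above.
SERVERS_PER_NPOD = 32    # max servers (one SPU each) per nPod
DRIVES_PER_SPU = 24      # max physical drives behind each SPU
MAX_DRIVE_TB = 4         # largest supported drive capacity

raw_tb_per_server = DRIVES_PER_SPU * MAX_DRIVE_TB
raw_tb_per_npod = SERVERS_PER_NPOD * raw_tb_per_server

print(raw_tb_per_server)  # 96 (TB raw per server)
print(raw_tb_per_npod)    # 3072 (TB, ~3PB raw per nPod)
```

Note that these are raw figures; mirroring between SPUs for data protection will reduce the usable capacity accordingly.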
Services processing units plug into the GPU slot of a standard server and are PCIe 3.0 compliant, full-length, full-height and double-wide in size. A GPU slot is required as the SPU draws 85W of power, slightly more than the standard 75W permitted for a PCIe slot (GPUs have separate additional power connectors).
At first glance, the Nebulon solution looks exciting, but where’s the benefit? In exchange for eliminating the consumption of CPU and memory by an SVM, each server now loses a GPU slot and gains another piece of hardware. However, putting this aside for a moment, there are some positive benefits:
- O/S agnostic. SPU drives appear as standard SAS devices to the installed operating system. This means a single cluster can support bare metal or hypervisor installations, including mixed clusters if required. Traditional HCI solutions generally don’t support mixed clusters (except for NetApp’s HCI offering).
- Reduced licensing cost. Many enterprise applications license per socket, so it makes sense to optimise the available server CPU resources for applications. As licensing can be expensive, there’s a significant saving in eliminating as much of this overhead as possible.
These benefits are useful, but perhaps the most significant difference between Nebulon and HCI/SAN solutions is the storage management process. Nebulon uses the term “cloud-defined storage”, and this refers to the similarities in operational management compared to how cloud service providers have implemented their storage solutions (more on this later).
The Nebulon solution is divided into two components. We’ve already outlined the data plane, which is implemented through SPUs. The management plane is based in the public cloud and used to provide remote management for the SPU infrastructure. The configuration of volumes, mirroring, snapshots, and other functionality is delivered through a SaaS management portal called Nebulon ON. As figure 3 shows, Nebulon ON communicates with SPUs through a secure, encrypted API. Through a “security triangle”, configuration changes can be issued to SPUs through authorised clients via a local web browser.
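Nebulon hasn’t published the internals of this API, but conceptually the split can be sketched as follows. All class and method names here are invented for illustration only; the point is that the cloud side holds the desired state, and the SPU simply applies whatever configuration it is pushed:

```python
# Hypothetical sketch of a cloud-hosted management plane pushing
# configuration to an on-premises SPU. Names are invented for
# illustration; this is not Nebulon's actual API.
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    size_gb: int
    mirrored: bool = True  # mirrored across SPUs for protection


@dataclass
class SpuConfig:
    spu_serial: str
    volumes: list = field(default_factory=list)


class ManagementPlane:
    """Cloud side: holds desired state for every SPU (cf. Nebulon ON)."""

    def __init__(self):
        self.desired = {}  # spu_serial -> SpuConfig

    def create_volume(self, spu_serial, name, size_gb):
        cfg = self.desired.setdefault(spu_serial, SpuConfig(spu_serial))
        cfg.volumes.append(Volume(name, size_gb))
        return cfg


class Spu:
    """Data plane: applies whatever config the management plane sends."""

    def __init__(self, serial):
        self.serial = serial
        self.applied = SpuConfig(serial)

    def apply(self, cfg):
        # In the real product this would arrive over a secure,
        # encrypted API, not a local method call.
        self.applied = cfg


mgmt = ManagementPlane()
spu = Spu("SPU-0001")
cfg = mgmt.create_volume("SPU-0001", "db-vol", 512)
spu.apply(cfg)
print(spu.applied.volumes[0].name)  # db-vol
```

The useful property of this model is that the data plane holds no authoritative configuration of its own, which is what makes hardware replacement so straightforward.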
Many storage administrators may be concerned by the concept of allowing configuration access through an external SaaS portal. It’s worth looking back at the process of storage administration from a SAN perspective to see why this architecture is more scalable than direct configuration.
In SAN provisioning, access to data is achieved through storage configurations (LUN creation, WWN and target mapping) in conjunction with configuration data from the host in the form of HBA settings and WWN (world-wide name) definitions. The provisioning process is incredibly static, with practical limits on scalability. For example, swapping out a failed HBA requires significant reconfiguration work because each HBA has a unique WWN identifier.
The Nebulon architecture is much more dynamic. If an SPU fails, the replacement is simply provided with the “personality” of the previous card and data is immediately accessible again. There is little or no involvement from local administrators. This ability is where the second aspect of “cloud-defined storage” is derived; the idea of cloud scalability. The cloud portal maintains all metadata, pushing configuration information to the SPUs in a significantly more scalable process than traditional SAN management. How does this align with cloud service providers?
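The contrast between the two models can be sketched in a few lines (purely illustrative, not real tooling): the SAN model keys access to a burned-in hardware identifier, while the personality model keys it to metadata that can be reassigned:

```python
# Illustrative contrast: identity-by-hardware (SAN/WWN) is brittle,
# while identity-by-metadata (SPU "personality") survives a swap.

# SAN model: LUN masking is keyed to the HBA's burned-in WWN.
san_masking = {"50:01:43:80:11:22:33:44": ["lun0", "lun1"]}
replacement_hba_wwn = "50:01:43:80:aa:bb:cc:dd"
# After swapping a failed HBA, the new WWN has no access until an
# administrator rebuilds zoning and masking by hand:
print(san_masking.get(replacement_hba_wwn))  # None

# Personality model: the cloud portal keys everything to a logical
# identity; a replacement SPU simply inherits it.
personalities = {"personality-42": {"volumes": ["lun0", "lun1"]}}
spu_assignment = {"SPU-OLD-SERIAL": "personality-42"}
# Hardware swap: the portal reassigns the personality to the new card...
spu_assignment = {"SPU-NEW-SERIAL": "personality-42"}
# ...and the same volumes are immediately accessible again.
print(personalities[spu_assignment["SPU-NEW-SERIAL"]]["volumes"])
```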
Public cloud service providers (AWS and Azure in particular) have started to move many storage and networking functions back into hardware. We’ve discussed these before in the form of Nitro and Pensando.
To achieve scalability to thousands (or tens of thousands) of servers, enterprises need a better way to access and manage data. The cloud model of offloading storage and networking to hardware, while separating the management and data planes, provides this level of scalability.
Although not explicitly discussed in our briefings, it’s clear that the Nebulon architecture can provide higher levels of dynamic reconfiguration and workload mobility. The volumes from one server could, for example, be replicated to another with more CPU/memory resources (or even GPUs if slots are available).
This ability, perhaps, is where the Nebulon architecture offers the most potential. IT organisations gain the benefit of distributed storage across thousands of nodes, without needing to deploy shared storage. A single node can run any O/S, yet still be part of a cluster. Data mobility is easy to achieve, including the option to push boot images remotely to each server. This adds significant hands-off capability.
So where are the disadvantages of this architecture? Naturally, like any solution with locally deployed storage, fragmentation of resources is an issue. Nebulon SPUs can share storage across the network, but this introduces dependency and resiliency issues. The current use of the GPU slot is disappointing, and PCIe 4.0 won’t change the 75W limit per PCIe slot. This is a generation 1 product though, so I expect there will be some optimisations to get the SPU resources more compact (as we saw with the second generation of Nitro) and potentially into a standard PCIe bay.
I was expecting that host connectivity would be through NVMe. Siamak Nazari, CEO, has indicated that the choice of SAS volumes in the first implementation is to provide wider compatibility. Again, looking at the use of NVMe/PCIe, connecting internal server drives directly to the SPU does imply maintenance challenges (replacing cards), issues with adding redundancy (dual SPU configurations) and a potential bottleneck through the card itself.
Who does Nebulon compete against? We’ve already mentioned Pensando and the proprietary Nitro hardware from AWS, although this isn’t freely available to buy. The closest competitor is perhaps Excelero, with NVMesh. Here, the Excelero software uses add-in cards to provide direct access to remote NVMe SSDs across the network. This design is slightly different from the current Nebulon architecture, but remember this is version 1 of the SPU technology. EXTEN and Lightbits Labs also have some similarities.
Then there’s the obvious comparison with HCI, vSAN and similar distributed storage solutions. However, I think the similarities are not that strong, as the removal of O/S dependencies changes the model.
What could we expect from Nebulon in future product generations? Today, the SPU is based on an 8-core 3GHz ARM processor with 32GB of NVRAM (battery-backed). There are options for optimisation by incorporating Intel Optane or MRAM (as discussed in a recent Storage Unpacked podcast). NVMe support will improve performance and scalability, as it could be possible to eliminate some of the channelling of work through each SPU from remote nodes. NVMe also promises zoned storage and greater individual device scalability.
The Architect’s View
The Nebulon architecture is an intriguing solution and one that many enterprises may see as a great way to build a large-scale compute farm without the dependencies of either SAN or locked-in storage. As we move to more generic and dynamic compute (with containers and open-source hypervisors), pushing peripheral functionality to hardware does make a lot of sense.
Nebulon intends to sell SPUs through hardware partners like HPE and Supermicro. If the solution can be priced competitively, I can see a lot of demand from medium to large enterprises. This is definitely a company to watch, and with the previous track record of the founders, one that’s likely to be highly successful.
Copyright (c) 2007-2020 Brookend Limited. No reproduction without permission in part or whole. Post #ca0d.