Liqid’s PCIe Fabric is the Key to Composable Infrastructure

Chris Evans | Composable Infrastructure, Enterprise

Liqid Inc., a relatively new start-up, has developed a software-composable infrastructure (SCI) solution that looks different to those from Dell and HPE.  At the core of their disaggregated architecture is a PCIe switch that provides the fabric for connecting servers and devices together.  This switching technology could be the piece of the puzzle that allows SCI to become a practical reality over the next decade.

SCI

For some additional background, I recommend reading my previous post on SCI.  In summary, software-composable infrastructure enables the physical components of a server to be combined programmatically, either through a GUI or via APIs.  Once the infrastructure is in place, changing the configuration is done in software, without the need for manual re-cabling or re-racking.  This makes it possible to build up and tear down an almost limitless range of infrastructure configurations within minutes.
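
To make the idea of composing in software more concrete, here is a minimal sketch of what such an API-driven workflow might look like.  The controller address, endpoints and payload fields are hypothetical, invented purely for illustration, and are not Liqid's actual API.

```python
# Hypothetical sketch: composing a bare-metal machine through an SCI
# controller's REST API.  The host name, endpoints and payload fields are
# invented for illustration and do not represent Liqid's actual API.
import requests

SCI_API = "https://sci-controller.example.local/api/v1"

# Describe the machine we want: a compute node plus devices from shared pools.
desired_machine = {
    "name": "ml-train-01",
    "compute_node": "node-07",   # physical server supplying CPU and memory
    "devices": {
        "gpu": 4,                # drawn from the shared GPU enclosure
        "nvme": 2,               # drawn from the shared NVMe enclosure
    },
}

# Ask the controller to map those devices to the server over the fabric.
response = requests.post(f"{SCI_API}/machines", json=desired_machine, timeout=30)
response.raise_for_status()
print("Composed:", response.json())

# Tearing the machine down later would return the devices to the free pool, e.g.:
# requests.delete(f"{SCI_API}/machines/ml-train-01", timeout=30)
```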

Fabric

The key component of an SCI solution is the fabric interconnect.  If we imagine a server with CPU, memory, persistent storage and FPGAs or GPUs installed, the peripheral devices are all accessed over the PCIe bus.  The Liqid SCI solution disaggregates those PCIe devices, removing them from individual servers and placing them in shared enclosures.  The SCI software (in this case Liqid Command Center) is then able to map devices to servers across a PCIe fabric.  This includes hot-plugging devices into running servers.
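
From the server's point of view, a device attached over the fabric surfaces as an ordinary PCIe device.  On a Linux host, a hot-added device can be picked up by rescanning the PCI bus, after which it appears in sysfs alongside locally installed devices; the snippet below is a minimal sketch of that check, assuming a Linux server with root privileges and sysfs mounted at /sys.

```python
# Minimal sketch: rescan the PCI bus and list the PCIe functions the kernel
# can see.  Fabric-attached devices show up here just like local ones.
# Assumes a Linux host, root privileges and sysfs mounted at /sys.
from pathlib import Path

def rescan_pci_bus() -> None:
    """Ask the kernel to re-enumerate the PCI bus, picking up hot-added devices."""
    Path("/sys/bus/pci/rescan").write_text("1")

def list_pci_devices() -> None:
    """Print the vendor and device IDs of every visible PCIe function."""
    for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
        vendor = (dev / "vendor").read_text().strip()
        device = (dev / "device").read_text().strip()
        print(f"{dev.name}  vendor={vendor}  device={device}")

if __name__ == "__main__":
    rescan_pci_bus()
    list_pci_devices()
```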

The PCIe fabric is implemented as a switch (or fabric of switches), with either optical or copper connections to the peripheral enclosures and the servers using the devices.  At a high level, this is similar to how shared storage was consolidated in the early 2000s using Fibre Channel.

Consolidation

The storage analogy is probably a good one to use.  Storage Area Networks (SANs) provided several distinct benefits:

  • Resource pooling – storage was consolidated in a single array, reducing waste and over-provisioning.
  • Reduced maintenance – with storage in a single place, maintenance was simplified to changing drives in a shared chassis.  This provided the ability to add spare drives and build in features like automated and predictive sparing.
  • Improved performance – data could be spread across multiple spindles, improving performance.
  • Improved resiliency – features such as RAID could be implemented more efficiently, improving the overall resiliency of solutions.
  • Increased availability – with storage outside the server, a single server failure wouldn’t strand data in a single place.

These benefits seem well understood today.  As we look to build more complex hardware-accelerated applications using GPUs and FPGAs, the same challenges those benefits addressed apply once more.  GPUs are expensive and require powerful servers to drive them; FPGAs can be reprogrammed dynamically and so are easily reused for new tasks on an hourly or daily basis.

If these devices sat inside individual servers, we would see resource wastage and face the old issues of maintenance and resiliency all over again.  The ability to rebuild infrastructure dynamically means resources can be directed exactly where and when they are needed, or configurations built for short-term or recurring workload patterns.

Performance

The challenge with implementing shared resources is ensuring the fabric interconnect operates fast enough to allow the components to be disaggregated.  Fibre Channel in the early 2000s was far faster than individual drive I/O and added negligible overhead to the application (things are different now, which is driving new solutions).

Similarly, any SCI fabric needs to offer low enough latency to justify the disaggregation.  Liqid’s fabric switch adds an overhead of around 150 nanoseconds, small enough to have no noticeable impact on the application.
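
To put that figure in context, a quick back-of-envelope comparison against typical device latencies shows why the extra hop is effectively invisible.  The reference latencies below are rough assumed figures for illustration, not measurements.

```python
# Back-of-envelope: how much does a ~150 ns fabric hop add to a typical device
# round trip?  The reference latencies are rough assumed figures, not measurements.
FABRIC_HOP_NS = 150

reference_latencies_ns = {
    "NVMe flash read (~80 microseconds)": 80_000,
    "Small GPU DMA transfer (~10 microseconds)": 10_000,
}

for operation, base_ns in reference_latencies_ns.items():
    overhead_pct = FABRIC_HOP_NS / base_ns * 100
    print(f"{operation}: +{FABRIC_HOP_NS} ns is roughly {overhead_pct:.2f}% overhead")
```

Even on the faster of the two assumed operations, the hop adds only a couple of percent; for storage-class traffic it disappears into the noise.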

Switch vs Chassis

Why a switch over the chassis-based solutions from HPE and Dell?  A switched fabric implementation removes the restrictions of using a chassis, which is inevitably proprietary in nature.  Customers can then choose their preferred server vendor and models.  Servers can be more easily swapped out and replaced over time.  Chassis-based solutions will always have challenges of scale and restrict the types of servers that can be deployed, whereas a fabric can be agnostic to the server vendor.

As a side note, this recent report (registration required) from Supermicro highlights the potential benefits of disaggregation in delivering a green data centre, compared with embedded components that are discarded when a server reaches the end of its useful life.

The Architect’s View®

Over the past 10 years, the public cloud has transformed our view of how IT should be delivered.  The idea of service catalogues has been around for over 20 years; however, the public cloud took that notion a step further and gave us true on-demand, scalable infrastructure services.  On-premises infrastructure needs new ways to stay relevant in this new world order.  Composable Infrastructure could be one way to compete with public cloud dominance.

For SCI itself to be relevant, the benefits need to far outweigh the disadvantages.  Businesses are not going to accept poor performance in exchange for additional flexibility.  Liqid seems to have the right approach in making the fabric agnostic to the servers and peripherals being installed.  Using PCIe retains native levels of performance while adding more flexibility than could be achieved with chassis-based solutions.

Note: Liqid also now supports Ethernet and InfiniBand fabrics.  For more background, I recommend watching the Liqid presentations from Tech Field Day, the first video of which is embedded here.


Post #3408. Copyright (c) 2019 Brookend Ltd. No reproduction in whole or part without permission.