At VMworld 2021, VMware announced Project Capitola, an initiative to bring tiered memory to the ESXi hypervisor. We look at the requirements for this technology and how it compares to similar approaches that already exist in the market.
Ask anyone who’s spent time administering a VMware vSphere environment and they will tell you that, almost invariably, system memory is the bottleneck rather than processor performance or storage. In the three-way balance between processor capacity, memory and storage, the evolution of system design and the demands of applications have generally left system memory in shortest supply.
VMware has done an excellent job of abstracting these three resources: storage is made more efficient with thin provisioning, memory is optimised with ballooning, and CPU resources are only consumed when VMs are active. However, modern applications, especially those running ML/AI workloads, run significantly better with large amounts of memory. The logic is simple: applications execute faster when external I/O to disk is minimised, and that is achieved by putting more data into memory in the first place.
Modern Intel x86 platforms are NUMA (non-uniform memory access) platforms, so the amount of system memory available is directly related to the number of CPU sockets. The latest Intel Ice Lake processors support up to 6TB of memory per socket, with practical deployments limited by the capacity per DIMM and the need to populate DIMMs evenly across all available memory channels.
The NUMA design means a processor can only directly address its own local memory. Any requests for memory managed by another processor go over an internal interconnect called UPI (Ultra Path Interconnect). UPI is the latest evolution of the point-to-point interconnects (beginning with QPI) that replaced the Front-Side Bus (FSB), a design in which all system memory and external I/O was managed through a shared interface, making FSB more analogous to a symmetric multi-processing design. The move away from the FSB provides much greater bandwidth between memory and processor, at the cost of some performance penalty when memory isn’t local to the CPU.
Modern operating systems like Linux are NUMA-aware. They divide the available memory and processors into nodes to ensure that memory requests (where possible) align with the processor on which instructions are being executed.
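As an aside, Linux exposes its NUMA topology through sysfs, which is what tools like numactl build on. A minimal sketch in C (assuming a Linux host; the function name is my own, but the /sys/devices/system/node path is the standard kernel interface) that counts the nodes the kernel reports:

```c
#include <ctype.h>
#include <dirent.h>
#include <string.h>

/* Count the NUMA nodes the Linux kernel exposes under sysfs.
   Each node appears as a directory named node0, node1, ... */
int count_numa_nodes(void) {
    DIR *d = opendir("/sys/devices/system/node");
    if (d == NULL)
        return -1;  /* sysfs not available (non-Linux or restricted container) */
    int nodes = 0;
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        /* Match "node" followed by a digit, skipping files like "possible" */
        if (strncmp(entry->d_name, "node", 4) == 0 &&
            isdigit((unsigned char)entry->d_name[4]))
            nodes++;
    }
    closedir(d);
    return nodes;
}
```

On a two-socket server this would typically report two nodes; `numactl --hardware` shows the same topology along with per-node memory sizes.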
In an ideal world, all memory I/O would be symmetric and infinitely scalable. However, the practicalities of electrical design, performance requirements and cost all dictate the scalability of on-board DRAM and DIMM slots available in motherboard design.
These limitations are widely accepted in computing and mitigated through techniques like tiering and caching. Storage tiering has existed forever, and is still in use today, despite what some CEOs thought would happen. Tiering started in mixed HDD-based arrays, then brought flash into the mix. Many modern systems now combine NAND flash and Intel Optane using both caching and tiering techniques.
With the announcement of Project Capitola, VMware intends to do for system memory what has already been achieved with persistent storage, namely, to provide seamless access to a hierarchy of memory technologies. The aim is to extend the virtual memory “address space” and offer capabilities to meet the needs of memory-intensive modern workloads.
As with all newly announced VMware Projects, the details are thin on the ground. Session MCL1453 at VMworld 2021 discusses the concepts behind Capitola and the benefits the technology looks to bring. A VMware blog post also provides a little more information.
From the information presented so far, it looks like the technology being developed with Capitola will use a mix of DDR DRAM, Intel Optane Persistent Memory, CXL connected devices, remotely connected CXL memory and NVMe devices. We expect that NVMe devices will be Intel Optane or equivalently fast and resilient technologies, but NAND could be in the mix too (as could other solutions like ReRAM and MRAM).
Although the implementation details of Capitola haven’t been released, there appear to be two main operating modes. The hypervisor could directly access DRAM or other memory-like storage on the DDR bus, or could page between “primary” DRAM and “secondary” DRAM. These terms aren’t scientific but give an indication of what could be achieved. The ESXi hypervisor already performs memory management, so the new secret sauce will be the capability to bring all of these memory types together. CXL, for example, will deliver persistent memory across the PCIe 5.0 physical layer, which should remove the dependency on connecting Intel Optane as persistent memory DIMMs.
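To illustrate the kind of policy a tiered-memory system might apply, here is a deliberately simplified sketch in C. The structure and function names are invented for illustration (ESXi’s actual algorithms are not public): pages accrue access counts, and a rebalance pass promotes the hottest pages into the limited fast tier.

```c
#define FAST_PAGES 4    /* capacity of the "primary" (fast) tier, in pages */
#define TOTAL_PAGES 16  /* all pages are backed by the "secondary" tier    */

/* Toy model of tiered memory: track per-page access counts and flag
   which pages currently occupy the fast tier. */
typedef struct {
    unsigned hits[TOTAL_PAGES];
    int in_fast[TOTAL_PAGES];  /* 1 if the page lives in the fast tier */
} tiered_mem;

/* Record an access to a page. */
void touch(tiered_mem *m, int page) {
    m->hits[page]++;
}

/* Promote the FAST_PAGES most-accessed pages into the fast tier and
   demote everything else; a real hypervisor would also migrate the
   page contents and update the VM's address mappings. */
void rebalance(tiered_mem *m) {
    for (int i = 0; i < TOTAL_PAGES; i++)
        m->in_fast[i] = 0;
    for (int k = 0; k < FAST_PAGES; k++) {
        int best = -1;
        for (int i = 0; i < TOTAL_PAGES; i++)
            if (!m->in_fast[i] && (best < 0 || m->hits[i] > m->hits[best]))
                best = i;
        m->in_fast[best] = 1;
    }
}
```

The real challenge, of course, is doing this transparently and at nanosecond latencies across memory types with very different performance characteristics.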
Another aspect of CXL will be the ability to access remote memory in another cluster node, outside the server running a VM. VMware is calling this “phase 2” of Project Capitola. Setting aside the performance implications for a moment, a shared memory pool could help balance resources while making it possible to move virtual machines incredibly quickly between physical servers. Theoretically, at least, it could even be possible to create a VM that spans physical machines, although I can’t see the performance being that great.
Memory extension technologies already exist today. MemVerge is pioneering the use of Intel Optane as an extension to DRAM through LD_PRELOAD, a dynamic-linker feature in Linux. This mechanism allows MemVerge to “hijack” the standard memory management calls and implement features such as in-memory snapshots. We recorded a podcast last year with MemVerge CEO Charles Fan, which you can find here and embedded into this post.
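The LD_PRELOAD mechanism works by loading a shared library ahead of glibc, so its symbols shadow the standard allocator. A minimal sketch in C of such an interposer (illustrative only; MemVerge’s actual implementation is far more sophisticated) shows the idea: built with `gcc -shared -fPIC -o preload.so preload.c` and activated via `LD_PRELOAD=./preload.so`, every malloc() in the target process passes through this function, which is where a tiering library could choose between DRAM and persistent memory.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

static void *(*real_malloc)(size_t) = NULL;
static size_t bytes_requested = 0;

/* Serve any allocation made re-entrantly while dlsym() resolves the
   real malloc (dlsym itself may allocate on some glibc versions). */
static char bootstrap[4096];
static size_t bootstrap_used = 0;

void *malloc(size_t size) {
    if (real_malloc == NULL) {
        static int resolving = 0;
        if (resolving) {
            if (bootstrap_used + size > sizeof bootstrap)
                return NULL;  /* bootstrap pool exhausted */
            void *p = bootstrap + bootstrap_used;
            bootstrap_used += (size + 15) & ~(size_t)15;  /* 16-byte align */
            return p;
        }
        resolving = 1;
        real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        resolving = 0;
    }
    bytes_requested += size;
    /* A tiering library would decide here which tier backs the request;
       this sketch just counts bytes and delegates to the real allocator. */
    return real_malloc(size);
}

/* Expose a statistic gathered by the interposer (name is my own). */
size_t malloc_bytes_seen(void) {
    return bytes_requested;
}
```

The same interposition trick underpins many allocator and profiling tools; the hard part MemVerge solves is placing and migrating data between tiers without the application noticing.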
We also recorded a podcast last year with Daniel Waddington from IBM covering IBM Research’s Memory Centric Active Storage project, also worth a listen. From the hardware perspective, this podcast looking at MRAM from Everspin gives an idea of how new technologies are gradually making it to the data centre.
The Architect’s View™
The storage and memory hierarchy continues to expand, with gains to be made by replacing storage read/write semantics with memory-based load/store. The initial and most obvious benefits of Project Capitola could be extended with further integration into the SmartNIC market (see this post on Project Monterey). The CXL standards, for example, provide the capability to share memory between the CPU and a SmartNIC offloading complex data processing. Capitola could also make it possible to natively share devices across VMs, much as GPUs can be shared today, but without the need for additional software (VIBs).
One final thought. Companies like Liqid and Fungible provide hardware-level disaggregation and re-assembly of “virtual physical servers” in configurations that traditional server vendors don’t otherwise offer. At first glance, these technologies may appear to compete with VMware and the goals of Project Capitola. We see them instead as highly complementary, together providing the ultimate degree of flexibility in the data centre.
Exactly how this project pans out is going to be one area of great interest in 2022.
Copyright (c) 2007-2021 – Post #27fc – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.