The widespread adoption and general acceptance of containerised application deployment have made Kubernetes one of the primary platforms for modern enterprises. However, the current model is flawed due to a lack of integration between hardware and software. In this article, we argue that the future of Kubernetes and containers should be a version of the modern mainframe.
The story of improvement in information technology is one of evolving hardware, followed by software to exploit the new capabilities. This “yin/yang” process has spanned more than 70 years since the introduction of commercial computing, but became more structured with the arrival of the IBM System/360 mainframe in the 1960s.
From the 1960s until the late 1990s, IBM mainframes (and their clones) formed the core of mission-critical computing for the largest enterprises on the planet. Those systems still exist today, albeit not with the same ubiquity. The evolution through System/360, System/370, System/390 and onto the “Z” branding saw a transformation from 24-bit to 64-bit addressing, a continuous expansion of the instruction set architecture (ISA) and constant improvement in performance and efficiency.
The IBM mainframe was nothing without software. Two main platforms provided application and infrastructure management capabilities that are aligned with technologies in use today.
MVS, which derived from the earlier MVT, was the most common platform for running applications. IBM rebranded MVS as OS/390 in 1995 and as z/OS at the start of the new millennium. However, the core capabilities still exist, namely:
- Support for multi-tenancy with individual address spaces and a hardware-based security model.
- Support for concurrent workloads, with the capability to run OLTP, user sessions, batch work and long-running tasks all on the same platform.
- Hardware-based network and storage offloads.
- Built-in monitoring and observability, integrated access and identity management, and workload and performance load balancing.
MVS was (and is) capable of supporting thousands of parallel activities with a high level of resource efficiency.
The second platform is z/VM, which began as VM/370 in 1972. z/VM implements virtualisation capabilities similar to those from VMware (the VM in both solutions means virtual machine), enabling a single hardware platform to run multiple instantiations of MVS, z/OS or other IBM operating systems. Since the introduction of VM/370, the mainframe architecture has been extended to support virtualisation in hardware, both through LPARs (logical partitions) and within the instruction set architecture. Much of this work was pioneered by Gene Amdahl, initially while working at IBM, then as the founder of Amdahl Corporation.
Although the mainframe represented a complete ecosystem for computing, the platform remained expensive to buy and operate. The natural monopoly IBM once had was eroded through the development and introduction of low-cost “departmental” servers, followed by the rise of x86-based systems running Windows and Linux.
The founders of VMware spotted an opportunity to develop technology first pioneered by IBM in the 1970s and use it to consolidate the “one application, one server” sprawl of Windows NT and its successors. VMware has since done an incredible job of transforming the computing landscape to one where server virtualisation is the norm rather than the exception. This transformation introduced new capabilities not possible with single servers, such as high availability, fault tolerance and software maintenance and upgrades without an outage.
In the modern data centre, the mainframe represents only a fraction of what was deployed a few decades ago. We would also suggest that many companies would be happy to replace their mainframe systems if equivalent platforms were available.
Drive for Efficiency
Each technology generation introduces benefits and eventually disadvantages. Mainframes standardised computing and built the framework for operational processes we still use today. The introduction of server virtualisation delivered consolidation savings and enabled 24/7 operations through clustering and technologies such as vMotion.
The introduction of containers has taken the efficiency paradigm a step further by eliminating the individual operating system instances that run within a virtual server environment. Containers run at a level of efficiency we experienced in the mainframe days, with perhaps only function-as-a-service likely to offer even greater optimisation.
Unfortunately, the container model doesn’t offer native resiliency. If a container crashes, the application is down (and in the early instantiations, the data inside it was also lost). Kubernetes was developed to bring resiliency to container-based applications by building a framework that enables the developer to create in-built redundancy and workload orchestration. A Kubernetes cluster uses application definitions in code to maintain the desired state across non-resilient hardware.
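The idea of maintaining desired state can be illustrated with a toy reconciliation loop. This is a minimal sketch, not how Kubernetes itself is implemented: the `Cluster` class, the `reconcile` and `fail_node` functions, the node names and the "least loaded node" placement rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model (hypothetical): maps a node name to the replicas it runs."""
    nodes: dict[str, int] = field(default_factory=dict)

    def running(self) -> int:
        return sum(self.nodes.values())


def reconcile(cluster: Cluster, desired_replicas: int) -> list[str]:
    """Start or stop replicas until the running count matches the desired count.

    Placement is simply "least loaded node", an assumption made for brevity.
    Returns a log of the actions taken.
    """
    actions = []
    while cluster.running() < desired_replicas:
        node = min(cluster.nodes, key=cluster.nodes.get)
        cluster.nodes[node] += 1
        actions.append(f"start replica on {node}")
    while cluster.running() > desired_replicas:
        node = max(cluster.nodes, key=cluster.nodes.get)
        cluster.nodes[node] -= 1
        actions.append(f"stop replica on {node}")
    return actions


def fail_node(cluster: Cluster, name: str) -> None:
    """Simulate losing a node and every replica running on it."""
    del cluster.nodes[name]


cluster = Cluster(nodes={"node-a": 0, "node-b": 0, "node-c": 0})
reconcile(cluster, desired_replicas=3)            # one replica lands on each node
fail_node(cluster, "node-b")                      # non-resilient hardware fails
actions = reconcile(cluster, desired_replicas=3)  # the loop repairs the gap
print(actions)
```

The hardware offers no resiliency of its own in this model. The software notices the difference between declared and actual state and starts a new copy elsewhere.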
The current design for Kubernetes clusters implements a loosely coupled architecture. Each server acts independently, running an isolated operating system and communicating with other servers (or nodes) over the network. At the storage layer, block-based storage devices provide persistent storage resources to individual containers. However, to share storage, data must be presented using a network file system (object storage is also possible, but its integration with Kubernetes is still in its early days).
Resiliency in Kubernetes is created at the application layer. The developer must implement redundancy through the duplication of processes, each of which consumes processing time, memory, and storage. For example, modern databases such as MongoDB or Redis will run multiple active images that coordinate with each other in a primary/secondary configuration. If the primary is lost, the secondary takes over. Theoretically, a third image could then be started to replace the failed primary. However, the entire data set supported by the application may need to be recreated, representing a significant cost incurred for managing the failure (of course, some of this redundancy is used to deliver scaling).
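The cost of that failure handling can be sketched in a few lines. The example below is an illustrative model only and does not represent the replication protocol of MongoDB, Redis or any other product. The `Member` class, the `fail_over` and `bytes_to_recreate` functions, the member names and the 500 GB data set are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    role: str          # "primary" or "secondary"
    synced_bytes: int  # portion of the data set this member holds


def fail_over(members: list[Member], failed: str) -> list[Member]:
    """Drop the failed member, promote a secondary if needed, add a replacement."""
    survivors = [m for m in members if m.name != failed]
    if survivors and not any(m.role == "primary" for m in survivors):
        # Promote the most up-to-date surviving secondary.
        max(survivors, key=lambda m: m.synced_bytes).role = "primary"
    # On a loosely coupled node, the replacement starts with no data.
    survivors.append(Member(f"{failed}-replacement", "secondary", 0))
    return survivors


def bytes_to_recreate(members: list[Member], dataset_bytes: int) -> int:
    """Data that must cross the network to bring every member up to date."""
    return sum(dataset_bytes - m.synced_bytes for m in members)


DATASET = 500 * 10**9  # hypothetical 500 GB data set
replica_set = [
    Member("db-0", "primary", DATASET),
    Member("db-1", "secondary", DATASET),
]
after = fail_over(replica_set, failed="db-0")
to_copy = bytes_to_recreate(after, DATASET)
print(f"{to_copy / 10**9:.0f} GB must be copied to the replacement")
```

In this model, losing a single process means the whole data set is copied again, which is the overhead the argument here is concerned with.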
Data recreation isn’t only needed in a failure scenario. During planned maintenance, for example, an application mirror will lose consistency while a server is down unless shared storage enables the mirror copy to be restarted elsewhere within a cluster and still access the persistent data image. All of this east-west traffic is an overhead that could be directed at customer-facing demand.
The concept of platforms running under Kubernetes was arguably driven by the idea of twelve-factor applications. There are lots of positive benefits to 12-factor apps, but the idea of running an entire application as a collection of stateless processes causes issues when attempting to provide long-term resiliency in an efficient way.
The ideology of how Kubernetes has been implemented introduces significant inefficiencies. Applications run many images of themselves in memory and on persistent storage media, replicating that data across the network in the event of a failure (or even planned outage). Is there a better way to combine the benefits of hardware and software, creating a “modern mainframe”?
Shared Storage & Memory
In a Kubernetes cluster, the loosely coupled model provides failure isolation between individual servers, including for the benefit of planned maintenance and upgrades. Unfortunately, the loosely coupled design is wasteful of resources, even in scale-out applications.
Shared storage can provide some benefits, delivering much more efficient storage utilisation (deduplication can be implemented across application mirrors), while enabling less impactful application restarts. For example, restarting an application mirror on an alternative node in a cluster should only require the re-synchronisation of any application data created while the mirror restarts. This shared configuration can be achieved today with network-attached storage, although the storage then becomes a single point of failure for the cluster. This in itself isn’t a problem. Shared storage has been supporting large-scale application deployments for decades. Vendors such as Infinidat and Hitachi Vantara offer 100% availability guarantees on multi-petabyte systems with throughput measured in gigabytes per second.
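A rough calculation shows why the re-synchronisation case matters. The figures below (a 2 TiB data set, 50 MiB/s of new writes and a two-minute restart) are hypothetical and chosen only to illustrate the difference in scale. The function names are ours.

```python
def full_rebuild_bytes(dataset_bytes: int) -> int:
    """Without shared storage, a restarted mirror copies the whole data set."""
    return dataset_bytes


def delta_resync_bytes(dataset_bytes: int,
                       write_rate_bytes_per_s: int,
                       downtime_s: int) -> int:
    """With shared storage, only data written during the restart is replayed."""
    return min(dataset_bytes, write_rate_bytes_per_s * downtime_s)


GIB = 2**30
MIB = 2**20

dataset = 2048 * GIB   # hypothetical 2 TiB data set
write_rate = 50 * MIB  # hypothetical 50 MiB/s of new writes
downtime = 120         # hypothetical two-minute restart, in seconds

full = full_rebuild_bytes(dataset)
delta = delta_resync_bytes(dataset, write_rate, downtime)
print(f"full rebuild: {full / GIB:.0f} GiB, delta resync: {delta / GIB:.2f} GiB")
```

Under these assumptions, the delta is a few gibibytes against a rebuild of two tebibytes. The exact ratio will vary with write rate and restart time, but the traffic saved is what could otherwise serve customer-facing demand.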
What if system memory were shared? In the mainframe world, symmetric multi-processing was standard, with all system memory equally accessible to all processors. That concept has been designed out of the x86 architecture, which today uses a NUMA (non-uniform memory access) design.
Within the x86 architecture, system memory is mapped to individual processor sockets. On Intel processors, this design was implemented with the introduction of QuickPath Interconnect (QPI) in 2008 and superseded by Ultra Path Interconnect in 2017. QPI replaced the legacy Front-Side Bus (FSB) design and was aimed at broader processor interchangeability. This paper (link) provides a good overview of the QPI design compared to the implementation of FSB and the implications for memory design.
Memory sharing across servers was never a viable option until the development of CXL (Compute Express Link). CXL is designed for CPU-to-device memory sharing using the PCI Express interface. We saw some early proof-of-concept shared memory designs using CXL at Flash Memory Summit in 2022, while this Blocks and Files article details several pooled memory designs at FMS 2023.
Recent announcements from the PCI-SIG indicate that PCI Express will see enhancements in external and optical connectivity, making it possible to build memory “fabrics” like those first introduced with storage area networking two decades ago.
Shared memory over CXL won’t be as fast as system memory but could be used for storing application data. MemVerge already offers software to pool shared memory, so why not use this type of technology to share application data between Kubernetes nodes in a loosely coupled cluster?
Of course, we’ll need some operating system modifications to enable cross-cluster despatching. Our cluster design isn’t totally symmetric, so some degree of locality management may be required. In addition, shared storage might need tweaking to take advantage of the shared memory model and reduce the external I/O (by implementing a shared storage cache, for example).
Other O/S modifications could include building in greater use of monitoring and load balancing (eBPF already provides some of this). It should be possible to build asymmetric clusters with lower-powered nodes to run monitoring and management tasks (including data protection).
Advances in SmartNICs provide the offload for networking and storage, so assuming we could add those components to the shared memory model, the server becomes nothing more than a shell for code execution. If CXL becomes truly fabric-aware, then we will have composability for almost every aspect of computing and could see a new breed of servers that are focused on processor efficiency and are little more than compute nodes that plug into a PCI Express backplane. Technologies such as Nebulon can even provide the remote O/S boot capabilities.
The Architect’s View®
So, theoretically, we could build a “modern mainframe”, but does it make sense to do it? Current thinking in the industry is to push as much resiliency awareness as possible into software and the application. In itself, Kubernetes doesn’t implement resiliency but provides the capability to restart and manage resilient applications, so developers still have work to do.
It doesn’t seem unreasonable to build some resiliency back into the hardware if there is a good reason for this approach. Cost savings could definitely be made, while management overheads would be reduced (cluster sprawl, for example).
I’d love to see a prototype or proof-of-concept developed once we have more of the component pieces in place. Until then, perhaps the dream of a future mainframe will remain an interesting academic exercise.
Copyright (c) 2007-2023 – Post #cde4 – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.