Optimising Kubernetes with StormForge

One challenging aspect of running multiple workloads on the same physical infrastructure is managing the resource utilisation and contention across competing pieces of code. At CFD13, StormForge presented two solutions to optimise for a range of metrics, including cost and performance. In this post, we dig into the requirements for the technology and how it might evolve.

Background

Multitasking is a key tenet of all general-purpose operating systems. The performance of modern processors, memory, networking, and storage has enabled competing workloads to sit on a single set of infrastructure. This capability has been used for nearly two decades with x86 server virtualisation through VMware ESXi and open-source solutions like KVM and Xen. Before that, we had decades of workload virtualisation on IBM mainframes.

Despite significant savings from server virtualisation, the implementation mimics physical servers, with multiple operating system installations to support and maintain. Each of these consumes resources to manage what are generally single-application workloads. Containerisation, which was popularised first through Docker, then made mainstream through Kubernetes, provides a platform to run hundreds of containerised applications across a resilient infrastructure. This implementation reduces the overhead of running multiple operating systems.

Load Balancing

As virtual server environments developed and matured, the need for load balancing and workload optimisation became more critical. VMware, for example, provides the capability to overcommit on CPU and memory resources, while storage can be thin provisioned. In tandem, the hypervisor implements techniques such as memory ballooning, page sharing, compression and swapping to optimise memory usage. The configured size of a virtual machine dictates the maximum usable memory, providing a natural limit for any applications running in the VM.

In a Kubernetes environment, CPU share and memory usage are defined at either the namespace or pod level (if defined for a namespace, the pod definitions are mandatory). Both pods and namespaces use two settings for CPU and memory. A resource request specifies the minimum CPU or memory a pod requires when scheduled. A resource limit sets the maximum amount of CPU or memory the pod may consume. If capacity is available, pods can use more than the request value, up to the limit.
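As a minimal sketch of how requests and limits appear in a pod definition, the Python snippet below builds a manifest as a dictionary and prints it as JSON, which kubectl accepts alongside YAML. The pod name, container name, image and the specific values are placeholders chosen for illustration and are not taken from the StormForge presentations.

```python
import json

# Hypothetical pod definition: "web" and "nginx:1.25" are placeholder names.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {
        "containers": [
            {
                "name": "web",
                "image": "nginx:1.25",
                "resources": {
                    # Request: the minimum the scheduler must find on a node.
                    "requests": {"cpu": "250m", "memory": "128Mi"},
                    # Limit: the ceiling the container may not exceed.
                    "limits": {"cpu": "500m", "memory": "256Mi"},
                },
            }
        ]
    },
}

# Emit a manifest that could be saved to a file and applied to a cluster.
print(json.dumps(pod, indent=2))
```

Here the container is guaranteed a quarter of a core and 128 MiB when scheduled, and can burst up to half a core and 256 MiB if the node has spare capacity.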

CPU and memory are treated slightly differently. Memory values are absolute and measured in bytes. CPU resources are also absolute, representing the number of physical or virtual CPU cores needed to run the pod. However, during times of contention, the CPU requests of competing pods are used to weight CPU usage. A pod that reaches its CPU limit is throttled, whereas a pod that exceeds its memory limit is killed with an OOM (out of memory) error. Both CPU and memory resource limits are implemented using Linux cgroups.
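The arithmetic behind these settings can be sketched in a few lines of Python. The conversion factors below (1,024 shares per core, a 100 ms scheduling period) approximate how CPU requests and limits have been mapped onto cgroup v1 values; cgroup v2 uses a different weighting scale, and the quantity parsers handle only a simplified subset of the suffixes Kubernetes accepts. The function names and pod names are illustrative assumptions.

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '1.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' to bytes (binary suffixes only)."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already in bytes


def cpu_shares(request_millicores: int) -> int:
    """Relative weight used under contention (cgroup v1 style, minimum of 2)."""
    return max(2, request_millicores * 1024 // 1000)


def cfs_quota_us(limit_millicores: int, period_us: int = 100_000) -> int:
    """CPU time allowed per scheduling period before throttling starts."""
    return limit_millicores * period_us // 1000


def contended_split(requests: dict, node_millicores: int) -> dict:
    """Share a saturated node between pods in proportion to their weights."""
    shares = {name: cpu_shares(m) for name, m in requests.items()}
    total = sum(shares.values())
    return {name: node_millicores * s / total for name, s in shares.items()}


# Two hypothetical pods competing for every cycle on a 2-core node.
print(contended_split({"api": 250, "batch": 750}, node_millicores=2000))
print(memory_to_bytes("256Mi"), cfs_quota_us(cpu_to_millicores("500m")))
```

With both pods fully busy, the split follows the 1:3 ratio of their requests rather than the absolute values, which is the weighting behaviour described above. Memory has no equivalent: the limit is a hard byte count.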

Note: we’ve only included a brief introduction to Kubernetes resource management here. Further information is available online at https://kubernetes.io/docs/home/.

Optimising Kubernetes

In production Kubernetes clusters, running applications unconstrained is not desirable. With poor coding and software bugs, applications can suffer memory leaks that would eventually consume all the resources on a node (remember, nodes typically run with no swap space). An application could also be a “noisy neighbour” and starve others of CPU resources. Setting resource limits on pods and/or namespaces helps mitigate these problems.

However, memory settings that are too aggressive risk OOM errors, while aggressive CPU restrictions could directly affect throughput and performance.

Choices

As highlighted in the first of three presentations from StormForge at CFD13, Kubernetes users typically choose one of three options: over-provision resources (or deliberately underutilise), risk performance and reliability issues with aggressive settings, or spend time and effort manually managing configurations.

In large and complex environments, manual management isn’t an option. We’ve seen this readily demonstrated in the storage world, where vendors like EMC introduced solutions to automate tiering (FAST, or fully automated storage tiering). When a Kubernetes environment runs dozens, perhaps hundreds, of individual pods, the number of variables that must be considered at any instant to set the correct configuration is enormous and well beyond what manual intervention can handle.

No developer wants to push an unreliable application into production, so the easy answer is overprovisioning. Very quickly, costs can escalate and offset some of the efficiency benefits of containerisation.
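To put rough numbers on how overprovisioning erodes those savings, here is a small Python sketch. All figures are hypothetical and chosen purely for illustration; they are not StormForge data or real cloud pricing.

```python
def overprovision_report(requested_cores: float, used_cores: float,
                         cost_per_core_hour: float) -> dict:
    """Summarise how much requested CPU sits idle and what it costs."""
    slack = requested_cores - used_cores
    return {
        "utilisation": round(used_cores / requested_cores, 3),
        "idle_cores": slack,
        "wasted_per_hour": round(slack * cost_per_core_hour, 2),
        "wasted_per_month": round(slack * cost_per_core_hour * 24 * 30, 2),
    }


# Hypothetical cluster: 100 cores requested, 35 actually consumed,
# at an assumed price of 0.04 per core-hour.
print(overprovision_report(requested_cores=100, used_cores=35,
                           cost_per_core_hour=0.04))
```

Because the scheduler reserves capacity based on requests rather than actual usage, the idle cores in this sketch still have to be paid for, even though no workload is consuming them.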

StormForge

StormForge has developed two solutions to address the optimisation challenge. These are based on experimentation and observation, respectively. Optimize Pro load tests applications before deployment into a production environment to find the optimum resource configuration. Optimize Live observes production environments and recommends resource patches, which can be applied automatically or after manual approval.

The Optimize Pro solution looks at CPU, memory, and a range of internal application-specific parameters to find the optimum setting for resources based on goal setting. Developers choose from typical metrics such as cost versus performance, then run a series of load-test simulations, which at each iteration uses machine learning to vary configuration parameters towards reaching the desired goal. The result is a series of graphs (which do require some interpretation) that show the potential correlation in parameter settings. Embedded below is the second video from the CFD13 presentations, which explains the whole process in more detail.
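The general shape of an experiment loop like this can be sketched in Python. To be clear, this is not StormForge's algorithm: plain random search stands in for the machine learning step, the load test is replaced by a made-up formula, and every name, weight and range is an assumption for illustration only.

```python
import random


def simulated_load_test(cpu_millicores: int, memory_mib: int) -> tuple:
    """Stand-in for a real load test; returns (latency_ms, hourly_cost)."""
    latency = 50 + 40_000 / cpu_millicores + 8_000 / memory_mib
    cost = cpu_millicores * 0.01 + memory_mib * 0.002
    return latency, cost


def score(latency: float, cost: float,
          latency_weight: float = 1.0, cost_weight: float = 1.0) -> float:
    """Goal setting: a lower score is better; weights express the trade-off."""
    return latency_weight * latency + cost_weight * cost


def search(trials: int = 50, seed: int = 0) -> dict:
    """Try candidate configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        candidate = {
            "cpu_millicores": rng.randrange(100, 2001, 50),
            "memory_mib": rng.randrange(64, 2049, 64),
        }
        latency, cost = simulated_load_test(**candidate)
        trial_score = score(latency, cost)
        if best is None or trial_score < best["score"]:
            best = {**candidate, "latency_ms": latency,
                    "cost": cost, "score": trial_score}
    return best


print(search())
```

A guided optimiser would use the results of earlier trials to choose the next candidate rather than sampling blindly, which is where fewer iterations and better results should come from. Changing the weights in `score` is the toy equivalent of choosing between cost and performance goals.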

Optimize Live, which was announced on 23 February 2022, wasn’t presented at CFD13, as the event took place before the release date. However, a video demonstrating Optimize Live had been posted at the time of writing and is embedded here.

The Architect’s View™

As usual within IT, what’s old is new. The challenge of running multi-processing systems at scale requires an intelligent approach to automate the efficient use of resources. The problem today is no different from that of four decades ago, except that, arguably, the level of complexity has massively increased.

I liked the StormForge approach as it allows developers to ensure applications are tested and optimised before being put into production. Goal setting also enables applications to be re-tested and tuned after major code updates. Once in production, Optimize Live ensures applications stay optimised, based on the inevitable unpredictability of production versus test scenarios.

Where will this technology go next? If most Kubernetes deployments are expected to run in the public cloud, then StormForge seems a natural acquisition for one of the hyper-scalers or even perhaps an outlier like NetApp, looking to build their compute management business. This is one start-up to keep watching.

Disclaimer: Cloud Field Day was an invitation-only event. Tech Field Day contributed to flights and accommodation costs. There is no requirement to blog on or discuss vendor presentations. Content is not reviewed by presenters before publication.