Kubernetes Clusters – Pets or Cattle?

As Kubernetes and container-based workloads mature, typical “day 2” operations result in many management containers being added to a single Kubernetes cluster. This can include storage, backup, monitoring, workload balancing and more. With so many extra services installed before useful applications, are we moving to a configuration where Kubernetes clusters become pets rather than cattle?

Background

The Pets vs Cattle scenario is summed up nicely in this post from Randy Bias and is based on the “Cattle, not Pets” analogy used by Bill Baker in a SQL Server scaling presentation ten years ago. Essentially, the term contrasts two models of application deployment. In the traditional model, a single server is maintained and managed closely through the lifetime of the application it supports. In scale-out environments, individual nodes can be thought of as dispensable and are simply replaced on failure.

If we look at another well-worn IT quote – “Hardware eventually fails, software eventually works” we can see why software-based resilience provides an increased level of availability over a single hardware system. In lots of places in IT, we add redundancy to supplement the components most likely to fail, including power supplies and persistent storage. The alternative is to deploy hardware (or virtual hardware) with an assumption of failure, then use application resiliency to mitigate the problem.

Kubernetes

Containerised environments provide resiliency using many worker nodes and multiple master nodes. A failure of any individual node (virtual or physical) shouldn’t impact running applications, especially if those applications themselves are also redundantly deployed across the infrastructure. Kubernetes provides the tools and features to monitor containerised applications, ensuring the deployed state matches the desired state.
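The idea of keeping the deployed state in line with the desired state can be sketched as a simple comparison loop. This is a minimal illustration in plain Python, not how Kubernetes itself is implemented; the names `Action`, `reconcile` and `apply_actions`, and the use of replica counts as the only piece of state, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str   # "start" or "stop"
    app: str    # application name
    count: int  # number of replicas to start or stop


def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions that move observed replica counts to the desired ones."""
    actions = []
    for app in sorted(set(desired) | set(observed)):
        want = desired.get(app, 0)
        have = observed.get(app, 0)
        if have < want:
            actions.append(Action("start", app, want - have))
        elif have > want:
            actions.append(Action("stop", app, have - want))
    return actions


def apply_actions(observed: dict, actions: list) -> dict:
    """Simulate carrying out the actions and return the new observed state."""
    result = dict(observed)
    for action in actions:
        delta = action.count if action.kind == "start" else -action.count
        result[action.app] = result.get(action.app, 0) + delta
    # Applications with no replicas left are dropped from the observed state.
    return {app: n for app, n in result.items() if n > 0}
```

Running `reconcile` repeatedly against whatever is currently observed is what makes individual node failures tolerable: a lost replica simply shows up as a difference to correct on the next pass.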

Loaded

In a basic environment, applications can use all the available resources for “useful” work. Increasingly though, we’re experiencing a maturity in the use of Kubernetes that means “infrastructure applications” get added to the container ecosystem, carving out some of those resources to run value-add services. We’ve previously looked at container-native storage and are in the process of evaluating container-native backup, where, in both instances, the container infrastructure supports the service. Many other solutions add in monitoring (Prometheus), visualisation (Grafana), configuration management, networking, and security.

Vanilla

I don’t have a problem with the need to deploy additional services into a Kubernetes cluster. After all, Kubernetes is specifically designed to be extensible. The challenge is whether these additional components push developers to build Kubernetes clusters that are then treated in the same way as “legacy” virtual instances and physical servers.

As a case in point, take this example from Cloud Field Day 13. The team from Kasten demonstrate the ability to use HashiCorp Vault for key management of the K10 platform and to use Grafana for visualisation.

The processes shown by Onkar Bhat in this demonstration aren’t hard to execute, but when taken in conjunction with the previous steps to build and configure a cluster, add persistent storage capabilities, and add K10, there’s a lot going on.

Configuration & Process

There are three main pieces to a cluster build. The first is the set of definitions that specify applications and how they should be deployed. The second is the scripting process (and tools) that takes the definitions and applies them to a cluster. The configurations, plus the data backing them, need to be stored somewhere, and this represents the third aspect of a build: data to import into a cluster, or data created within a cluster that needs to be protected against complete cluster failure or the need to rebuild.
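One way to picture the three pieces is as a small inventory that records, for each component, its definition and whether any data it creates is also held outside the cluster. The sketch below is purely illustrative: the `Component` type, the `rebuild_gaps` function, the file paths and the bucket name are all hypothetical and are not taken from any specific tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    definition: str                       # path to the deployment definition (hypothetical)
    creates_data: bool = False            # does it hold state created inside the cluster?
    external_store: Optional[str] = None  # where that state is kept outside the cluster


def rebuild_gaps(components: list) -> list:
    """Names of components whose in-cluster data has no copy outside the cluster."""
    return [c.name for c in components if c.creates_data and c.external_store is None]


# Hypothetical inventory: every name, path and location here is made up.
inventory = [
    Component("backup", "defs/backup.yaml", True, "s3://example-bucket/backup"),
    Component("dashboards", "defs/dashboards.yaml", True),
    Component("web", "defs/web.yaml"),
]

# Components listed here would lose their state in a tear down and rebuild.
gaps = rebuild_gaps(inventory)
```

A check like this could run as part of the build scripting, so that a component with manually created, unexported state is flagged before a rebuild rather than discovered afterwards.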

In the video posted above, around the 40-minute mark, we discuss the level of manual configuration used to set up the Grafana dashboard. In this instance, the data wasn’t exportable elsewhere.

Data Mobility

A Kubernetes cluster build is a well-understood process that can easily be automated. In fact, I create and destroy clusters quite frequently as part of infrastructure testing. Protecting data (and configuration state) between builds is a little more complex, although platforms such as K10 are designed to solve this problem. But the end-to-end process of protecting and then rebuilding an entire cluster involves many steps. The cluster’s owner is probably familiar with them, but other users could find them a challenge to recreate.

The Architect’s View™

With many processes required to build a working Kubernetes cluster, should we be treating clusters like pets and ensuring they live as long-term infrastructure entities? Alternatively, should we be rigorous and ensure the entire deployment can be recreated through code? In my view, we must first question what benefits containerisation provides. Taking a stance that Kubernetes clusters should be 100% code-driven is like following the “stateless apps” mantra. Containerised apps were never going to be entirely stateless. The final end state isn’t black or white but some shade of grey.

I think there’s a middle ground that takes a view on the longevity of Kubernetes clusters while ensuring that rebuilds can be completed relatively painlessly. To do that, we need to be sure that any data created by a cluster can be easily stored and recreated. If data protection software is deployed within a cluster, then that software needs to be easily re-installed and mapped to some offline storage (like an S3 bucket) that makes the re-instantiation process easy.

I think we’re increasingly likely to treat Kubernetes clusters like pets because increasing complexity makes tear down and rebuild that much riskier (and longer). However, I think we should make cluster management as automated as possible because, after all, that’s the benefit of infrastructure as code.

Disclaimer: Cloud Field Day was an invitation-only event. Tech Field Day contributed to flights and accommodation costs. There is no requirement to blog on or discuss vendor presentations. Content is not reviewed by presenters before publication.