The Great Cloud Repatriation Debate – Compute

Chris Evans

In the past week, we’ve seen two announcements from cloud companies that talk about extending the life of existing hardware.  “Sweating the asset” is a well-known strategy for reducing costs, so in a world of “as-a-service” public cloud infrastructure, is there anything a typical business can do to reduce its bottom-line bill?

Background

Meta (the owner of Facebook, WhatsApp and Instagram) has announced plans to reduce costs by extending the lifetime of servers and networking equipment from four to five years.  Google is following a similar path, extending hardware lifetime from four to six years, with an anticipated saving of $3.4 billion for FY2023.  Microsoft has also extended to six years (as reported here), while AWS announced intentions to extend hardware lifetime back in February 2022. 

We should highlight that much of this infrastructure will be for back-end services rather than being directly used by cloud computing customers.  However, the logic of this article applies in both cases.

Sweating the Asset

The concept of running hardware for longer, or “sweating the asset”, is a well-known strategy in IT.  It also applies outside of IT, where businesses choose to retain assets for longer, pushing back the replacement cycle and amortising costs over an extended period.  In our personal lives, we do the same thing, opting to keep vehicles or even mobile phones for a little longer as a way of saving money.

In the data centre, the logic is relatively simple.  If a business buys a server or storage array (or networking equipment, for that matter), the capital cost gets amortised over the lifetime of the hardware.  Extending the lifetime reduces the annual amortisation charge and defers the capital cost of the eventual replacement.
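
As a quick illustration (using an assumed purchase price, not any vendor’s actual figure), here’s a minimal sketch of how the annual amortisation charge falls as the hardware lifetime is stretched:

```python
# Minimal sketch: annual amortisation charge for a hypothetical $50,000 server
# amortised straight-line over different hardware lifetimes (figures are
# illustrative assumptions, not taken from any vendor price list).

def annual_amortisation(capital_cost: float, lifetime_years: int) -> float:
    """Straight-line amortisation: capital cost spread evenly over the lifetime."""
    return capital_cost / lifetime_years

capital_cost = 50_000.00  # assumed purchase price

for lifetime in (3, 4, 5, 6):
    charge = annual_amortisation(capital_cost, lifetime)
    print(f"{lifetime}-year lifetime: ${charge:,.0f} per year")

# Moving from a 3-year to a 5-year cycle cuts the annual charge from
# ~$16,667 to $10,000 and defers the capital spend on a replacement.
```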

However, hardware vendors have always been keen to “encourage” regular forklift upgrades of hardware because sales teams are incentivised on boxes shipped.  Typical 3-year leases/purchases generally include maintenance but significantly increase this cost from year four onwards to “encourage” a refresh.  In many cases, the business case for a refresh only stacks up because the year 4/5 maintenance cost is artificially inflated, making the replacement look more attractive than it really is.  When hardware solutions were bespoke, the additional cost to the vendor could (perhaps) be justified because the solution provider (vendor or reseller) would be required to hold inventory to replace failed parts.  Today, when server and storage solutions are built from off-the-shelf components, this stance is no longer fully justified. 

Refresh

So, why did we refresh hardware on a three or four-year cycle?  Reasons include:

  • More bang for the buck.  When Intel and other processor architecture vendors refreshed every three or four years, systems built from those components would see a similar refresh cycle.  The customer gets a new platform with faster processors, more cores, faster memory, and perhaps faster interfaces like PCI Express. 
  • New features.  New hardware means new features, like NVM Express, CXL or trusted computing.  Intel used the hardware refresh cycle to introduce Optane DIMMs and persistent memory support, for example.  The current range of servers has introduced new media form factors to improve efficiency and power/cooling issues.
  • Increased failure risk.  We know that many hardware components fail according to a statistical model known as the “bathtub curve”.  The question to answer is when the increasing (wear-out) part of that curve actually begins.  We may be refreshing hardware that would have maintained a constant failure rate for several more years, adding no extra risk by keeping it (see the sketch after this list). 
  • Efficiency.  Storage is one area where the efficiency model was pushed very hard.  Storage hardware vendors justified upgrades based on space, power, and cooling savings from the rapid increase in HDD density.  With SSDs, this justification no longer applies, to the extent that some vendors now offer lifetime warranties for hardware that remains under a support contract. 
  • Availability of Parts.  Eventually, we must accept that components will be refreshed with new solutions.  HDDs, SSDs, HBAs, and network cards all get superseded by new models.  When alternative components need to be used, we start introducing risk and changes to support, the deployment of new drivers, and other issues.
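
To make the bathtub-curve point a little more concrete, here’s a minimal sketch of a hazard-rate model.  The early-failure, constant and wear-out parameters are purely illustrative assumptions, not measured component data:

```python
# Minimal sketch of a "bathtub curve" hazard rate: the sum of a decaying
# early-failure term, a constant useful-life term, and a growing wear-out
# term. All parameters are illustrative assumptions, not real component data.
import math

def bathtub_hazard(t_years: float,
                   infant: float = 0.08,    # early-failure contribution at t=0
                   decay: float = 2.0,      # how quickly early failures fade
                   base: float = 0.02,      # constant rate during useful life
                   wearout: float = 0.005,  # wear-out growth factor
                   onset: float = 6.0) -> float:
    """Approximate annualised failure rate at age t_years."""
    early = infant * math.exp(-decay * t_years)
    late = wearout * math.exp(max(0.0, t_years - onset))
    return early + base + late

for year in range(0, 9):
    print(f"year {year}: ~{bathtub_hazard(year) * 100:.1f}% annualised failure rate")

# With these assumed parameters, the failure rate barely moves between years
# 3 and 6, so refreshing at year 3 buys little additional reliability.
```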

We shouldn’t forget that for hardware vendors, product sales are all-important.  Each financial quarter introduces new targets to meet.  No sale means no quota met and no commission.  This is one of the reasons vendors (and sales folks) are keen to move to recurring revenue models.  The service is disconnected from the hardware used to provide it.  This transition means the hardware vendor can extend the life of on-premises hardware without the three-year refresh cycle – assuming the solution still meets SLAs. 

Saving the Planet

Before we move on and discuss the public cloud, we should address the environmental impact of endless refresh cycles.  Computer hardware takes resources to manufacture, resources to recycle and isn’t a closed loop.  We still need to mine for new rare earth metals, for example, because IT use is expanding, and recycling isn’t 100% efficient.

Rather than blindly refreshing technology on a fixed timeline, we should be looking at a range of factors, including the environmental cost.  Two obvious metrics are the cost of manufacture and the cost of operation.  When the cost to operate becomes greater than the cost of replacement (over a fixed period), then an upgrade is potentially justified.  However, when we say “cost”, we’re not referring exclusively to dollars, pounds or euros.  There’s a cost for using raw materials that is increasingly being measured by the environmental and human impact. 
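
As a rough illustration of the operate-versus-replace comparison, here’s a minimal sketch expressed in emissions rather than currency.  The power draw, grid intensity and embodied-carbon figures are all assumptions chosen for illustration:

```python
# Minimal sketch of the "cost to operate vs cost to replace" comparison,
# here expressed in kgCO2e rather than currency. All figures (power draw,
# grid intensity, embodied carbon) are illustrative assumptions.

HOURS_PER_YEAR = 8_760
GRID_INTENSITY = 0.4          # assumed kgCO2e per kWh

def operating_emissions(power_watts: float, years: float) -> float:
    """Emissions from running a server continuously for a number of years."""
    kwh = power_watts / 1_000 * HOURS_PER_YEAR * years
    return kwh * GRID_INTENSITY

old_server_watts = 500        # assumed draw of the existing server
new_server_watts = 350        # assumed draw of a more efficient replacement
embodied_new = 1_300          # assumed kgCO2e to manufacture the replacement

horizon_years = 3
keep = operating_emissions(old_server_watts, horizon_years)
replace = embodied_new + operating_emissions(new_server_watts, horizon_years)

print(f"Keep the old server for {horizon_years} years: ~{keep:,.0f} kgCO2e")
print(f"Replace it now:                     ~{replace:,.0f} kgCO2e")

# Whichever figure is lower over the chosen horizon wins on this metric alone;
# the same structure works with currency, or a blend of both.
if replace < keep:
    print("On these assumptions, the refresh is justified.")
else:
    print("On these assumptions, keeping the existing hardware is the lower-impact option.")
```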

We’ve deliberately kept this discussion brief, as this isn’t the place to examine the argument in detail.  You can find more thoughts in our recent post on sustainability.

The Cloud

What happens when we run our infrastructure in the public cloud?  One of the most widely acknowledged benefits of cloud computing is the ability to consume resources as they’re needed and hand them back when finished.  Charging is usage-based, typically on capacity and time. 

The “as-a-service” model is great if you have variability in demand.  The option to scale up and down, rather than retain hardware to meet a high-watermark usage level, can result in significant savings.  Conversely, if your IT workload is constant, then the on-demand offerings of the public cloud aren’t as price competitive.  If we look at a typical business and its IT organisation, most server, networking, and storage infrastructure is deployed and used daily.  It’s only the burst in demand that must be managed (organic growth should be predictable). 
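
A minimal sketch of that trade-off, using an assumed hourly rate and demand profile rather than real cloud prices, shows why a bursty workload favours on-demand while a flat one does not:

```python
# Minimal sketch of the "scale to demand vs provision for the high watermark"
# trade-off. The hourly rate and the demand profile are assumed, illustrative
# values, not real cloud prices.

HOURLY_RATE = 0.10            # assumed cost per instance-hour
HOURS_PER_MONTH = 730

# Assumed demand profile: instances needed in each of 12 months (one burst month).
monthly_demand = [10, 10, 11, 10, 12, 10, 10, 25, 10, 11, 10, 10]

# On-demand: pay only for what runs each month.
on_demand = sum(n * HOURS_PER_MONTH * HOURLY_RATE for n in monthly_demand)

# Fixed capacity: provision for the peak (high watermark) all year round.
peak = max(monthly_demand)
fixed_capacity = peak * HOURS_PER_MONTH * HOURLY_RATE * 12

print(f"On-demand, scale with usage: ${on_demand:,.0f} per year")
print(f"Provisioned for the peak:    ${fixed_capacity:,.0f} per year")

# A single burst month makes on-demand far cheaper; with a flat profile the
# two figures converge and on-demand pricing loses its advantage.
```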

If we’re paying monthly for services, there’s no equivalent to sweating the asset.  The cloud vendor owns the asset and gains the benefit from leaving the hardware on the floor for longer.  The cloud service provider (CSP) could choose to reduce the hourly charge for older equipment, but there’s no requirement to do that.  In reality, the customer has no idea whether the underlying infrastructure was deployed yesterday or two years previously.  That’s the nature of the public cloud.

Repatriation

On reading this article, the immediate response may be to suggest the repatriation of workloads back on-premises, where hardware assets can be kept for longer.  However, it’s too late to move back to a private data centre if the infrastructure doesn’t already exist (unless there’s spare capital sitting around unspent).  In any case, the CSPs are ahead of the game and have been offering long-term commitment services for quite some time. 

Almost all the long-term commitment options available are based on one- or three-year terms.  The customer commits to a contract over that period, benefiting from a reduction in the equivalent monthly on-demand charge of as much as 80% (typically between 30% and 70%).  A simple break-even sketch follows the list below. 

  • AWS offers reserved instances (RIs) on EC2, with discounts of up to 72% compared to on-demand pricing.  There are three choices – standard, convertible and scheduled RIs – depending on the usage profile.  Rather than tag a specific instance as being reserved, AWS matches running instances against RI profiles and applies the discount to those that match; the remaining workload is billed at on-demand rates.  This enables rebuilds and repurposing of EC2 instances.  AWS also offers Savings Plans, which reduce the cost of EC2 and other compute services in return for long-term commitments and upfront payments.
  • Microsoft Azure offers savings of up to 80% when using Azure Reserved Virtual Machine Instances, compared to pay-as-you-go pricing.  Customers have the choice to change instance configuration and can terminate early for a fee.  Microsoft also offers On-Demand Capacity Reservation where customers want to guarantee service availability but don’t have a workload ready to run (for example, a DR scenario). 
  • Google offers two scenarios for Committed Use Discounts (CUDs) in Google Cloud Platform.  Resource-based CUDs provide savings based on resource usage, while flexible CUDs are targeted at customers with predictable spending.  Discounts apply to both hardware and software offerings (for example, O/S licences).  Google also offers Sustained Use discounts for virtual instances that run more than 25% of the time over a measured monthly period. 
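
As a simple illustration of the commitment decision (the rate and discount below are assumptions, not actual AWS, Azure or Google prices), the break-even point can be sketched as follows:

```python
# Minimal sketch of a commitment break-even calculation. The on-demand rate
# and the discount percentage are assumptions for illustration; they are not
# actual AWS, Azure or Google list prices.

ON_DEMAND_RATE = 0.20     # assumed $/hour for the instance type
COMMIT_DISCOUNT = 0.40    # assumed 40% discount for a 1-year commitment
HOURS_PER_YEAR = 8_760

committed_cost = ON_DEMAND_RATE * (1 - COMMIT_DISCOUNT) * HOURS_PER_YEAR

# Break-even: the utilisation (fraction of the year the instance actually
# runs) above which the commitment is cheaper than paying on-demand.
break_even_utilisation = committed_cost / (ON_DEMAND_RATE * HOURS_PER_YEAR)

print(f"Committed cost for the year: ${committed_cost:,.0f}")
print(f"Break-even utilisation:      {break_even_utilisation:.0%}")

# With a 40% discount, the instance only needs to run more than ~60% of the
# year for the commitment to pay off; below that, on-demand is cheaper.
```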

If you’re not already using a discount scheme, it’s worth looking into the options available.  All the cloud vendors have tools to help optimise costs and make recommendations on commitment options.  However, one word of caution: the committed discount model is like returning to the days of forklift upgrades.  At the end of the discount period, another decision point is reached – extend again, re-platform, or pay on-demand charges. 

Committed use discounts only work well with a high degree of standardisation (virtual instances of the same configuration) and come with restrictions on how dynamically the instances may be used across multiple regions.  This constraint removes some of the flexibility to choose from a range of instance types.  However, there are options to work around these problems (which are out of the scope of this article), so check the small print for details. 

We suggest investigating the use of discounts and reviewing their usage regularly. 

Repatriate or Not?

With the choice of discounting available in the public cloud, does it make sense to consider repatriation?  Unfortunately, the choices and decision points are unique to each business, so picking a single rule of thumb is impossible.  Having said that, here are some points to consider. 

  • Every IT team must have a TCO framework that shows the unit cost to run a single instance in the public cloud or on-premises.  Without this, it’s impossible to make any planning decisions.  Pick a time period and see how the costs change when extending past a traditional three years.  Look at the options of outright purchase, lease contract or “as-a-service” (a simple unit-cost sketch follows this list). 
  • Use on-premises for non-critical workloads, including test/development and services with less stringent SLAs.  These systems can be deployed onto older hardware with cheaper maintenance contracts (e.g. only 9-5 Mon-Fri rather than 24/7).
  • Extend the life of storage systems where the power/cooling saving is negligible.  There’s a considerable cost incurred when performing a refresh, so the benefit must be worth the effort.  Most storage networking (typically Fibre Channel) is undersubscribed when designed, so it should be retained for as long as is practicable (standards and features don’t change much either). 
  • Standardise.  Keep cloud instances consistently similar and align them across clouds and on-premises.  We don’t expect IT organisations to make configurations identical, but processor family/generation, system memory, networking performance, and storage type/capacity can all be aligned.  This process makes it easier to compare on-premises costs with the public cloud.
  • Rightsize.  Use tools to determine whether resources can be optimised better, using smaller instance sizes and reducing storage capacity.  Rightsizing is generally easier to do on-premises, where resource over-subscription can be managed more easily. 
  • Be wary of long-term on-premises commitment subscriptions.  At this time, we don’t see much value in committing to long-term compute-as-a-service subscriptions unless the offering is packaged with software to build a software-defined data centre.  Check to see what happens at the end of the 1/3-year commitment in terms of pricing, as these solutions look to have the same forklift upgrade challenges as converged infrastructure and traditional storage. 
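
As promised above, here’s a minimal sketch of the kind of unit-cost comparison a TCO framework enables.  Every figure (host price, opex, VM density, committed cloud rate) is an assumption for illustration:

```python
# Minimal sketch of a unit-cost TCO comparison: monthly cost per virtual
# instance for on-premises hardware (3- vs 5-year refresh) and a committed
# cloud instance. Every figure here is an assumption for illustration.

def on_prem_unit_cost(capital: float, lifetime_years: int,
                      annual_opex: float, instances_per_host: int) -> float:
    """Monthly cost per instance: amortised capital plus opex, divided by VM density."""
    monthly_host_cost = capital / (lifetime_years * 12) + annual_opex / 12
    return monthly_host_cost / instances_per_host

# Assumed inputs: host price, power/cooling/maintenance opex, VM density.
capital, annual_opex, density = 50_000, 6_000, 40

cloud_committed_per_instance = 45.0   # assumed $/month with a 3-year commitment

print(f"On-prem, 3-year refresh: ${on_prem_unit_cost(capital, 3, annual_opex, density):.2f} per instance/month")
print(f"On-prem, 5-year refresh: ${on_prem_unit_cost(capital, 5, annual_opex, density):.2f} per instance/month")
print(f"Cloud, committed:        ${cloud_committed_per_instance:.2f} per instance/month")

# Extending the refresh cycle lowers the on-prem unit cost; the framework
# makes it possible to compare that figure directly with a cloud commitment.
```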

In modern IT operations, virtualisation is intrinsically linked with server hardware.  If older hardware is to be retained, software licensing will also need to be considered.  The same three-year rule generally applies here too, although vendors like VMware and Nutanix have introduced subscription-based pricing.  An alternative scenario is to consider multiple hypervisor vendors.  This is an option we’ll discuss in another post.

The Architect’s View®

Cloud companies are clearly looking to make savings by extending the lifetime of hardware, something on-premises IT organisations can also achieve.  We also believe there are several other aspects in play.  First, Intel was late with Sapphire Rapids, so systems based on it are only just starting to roll out into the public cloud.  Rather than replace like-for-like, which is spending for no benefit, CSPs have chosen to extend the lifetime of existing infrastructure.

Second, there’s the sustainability aspect of frequent technology refreshes.  CSPs want to appear in tune with a general desire to reduce the environmental impact of cloud computing.  Third, there’s the general slow-down in demand.  By extending the hardware lifetime for resources used in the public cloud, CSPs can save money (in many cases quite substantially, as we’ve already mentioned).

We don’t expect mass repatriation in the face of increased costs and reduced end-user demand.  Instead, we expect a “rebalancing” as businesses learn where best to place workloads and applications.  This rebalancing is likely to accelerate as IT organisations become more acquainted with cloud choices and build more mature hybrid architectures. 


Copyright (c) 2007-2023 – Post #da6b – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.