Google Cloud introduces new compute-intensive instance powered by AmpereOne

Chris Evans

Google Cloud has announced a new compute-intensive virtual instance powered by Ampere Computing’s AmpereOne Arm processor.  What does it offer, and where does it fit into the computing hierarchy?

Background

At Next 2023, Google announced the preview of a new C3A instance powered by the AmpereOne processor.  Ampere Computing was founded in 2017 by former Intel president Renee James, first developing the Altra family of Arm processors, followed by AmpereOne.

Altra offers 32 to 128 Arm cores running at up to 3.0 GHz, based on the Arm Neoverse N1 architecture, with eight-channel DDR4 memory and 128 lanes of PCIe Gen 4 I/O connectivity.  The family is divided into two products – Altra and Altra Max – depending on the number of cores.  The processors also offer slightly different levels of system-level cache and I/O connectivity.  Oracle Cloud Infrastructure (OCI) introduced its A1 instances, powered by the Ampere Altra, in May 2021. 

AmpereOne uprates the specifications of the Altra significantly, with 136 to 192 cores, eight-channel DDR5 memory and 128 lanes of PCIe Gen 5 I/O bandwidth.  This processor drives the new C3A instances on Google Cloud, with 1 to 80 vCPUs, DDR5 memory and 100Gb/s Ethernet networking.  The AmpereOne is a custom design based on the Armv8.6+ instruction set, maintaining compatibility with the previous Altra products.

CISC

The commercial justification for Arm in the data centre focuses on the balance of price versus performance compared to traditional x86 processors.  However, the ability to introduce Arm as an option has only been possible due to the slowing of Moore’s Law and the transition to much greater parallel processing.

If we look back at processor architectures of the past, there have been two primary choices in design – complex instruction set computing (CISC) and reduced instruction set computing (RISC).

Side note: there’s also VLIW, but we’ll set that aside for the moment.

In both RISC and CISC, the architecture defines an instruction set of low-level machine code instructions that perform data processing tasks.  CISC instructions can require multiple clock cycles to execute because, by definition, they are complex instructions.  Over time, CISC designs have steadily added to the range of available instructions, with some notable enhancements to the x86 architecture, including Intel VT-x and AMD-V, which provided hardware-based virtualisation (and drove the adoption of VMware technology), and the AVX, AVX2 and AVX-512 extensions for vector processing (used in AI).

Adding custom instructions to an instruction set architecture (ISA) improves straight-line performance and enables more complex processing (such as the tasks required to develop AI models).  However, these additions come at the cost of increased die complexity, a higher transistor count and greater power draw (reflected in TDP, or thermal design power). 
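As an illustration of how software exploits these optional extensions (our sketch, not from the original announcement), the usual pattern is to detect the feature at runtime and dispatch to a vectorised path.  This minimal C example assumes GCC or Clang on x86 for the `__builtin_cpu_supports()` builtin:

```c
/* Minimal sketch: runtime dispatch between a scalar loop and an AVX2
 * path on x86.  Build with: gcc -O2 -mavx2 avx_demo.c */
#include <immintrin.h>
#include <stdio.h>

/* Scalar fallback: one addition per loop iteration. */
static void add_scalar(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* AVX2 path: eight 32-bit additions per 256-bit vector instruction. */
static void add_avx2(const float *a, const float *b, float *out, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)          /* handle any remainder */
        out[i] = a[i] + b[i];
}

int main(void) {
    float a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Dispatch based on what the CPU actually supports. */
    if (__builtin_cpu_supports("avx2"))
        add_avx2(a, b, out, 16);
    else
        add_scalar(a, b, out, 16);

    printf("out[15] = %.1f\n", out[15]);  /* expect 45.0 */
    return 0;
}
```

The per-feature dispatch is exactly the cost the CISC approach imposes on software: every new extension adds another code path to maintain.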

RISC

Reduced instruction set computing, or RISC, was initially developed by John Cocke of IBM, with early work in the late 1960s and through the 1970s.  Cocke’s team developed the ACS (Advanced Computer Systems) supercomputer.  However, IBM cancelled the project in 1969, as the instruction set (using a RISC-like approach) wasn’t compatible with the mainframe System/360 architecture, where IBM generated most of its income.  IBM also developed a simplified processor (the 801) for telephone call switching, where the speed of individual instructions was critical and complex instructions weren’t needed (another RISC example).

The detailed proof that RISC systems can outperform CISC is outside the scope of this blog post, but we suggest reading the paper “The Case for the Reduced Instruction Set Computer” by Patterson and Ditzel, and the book Computer Wars by Ferguson and Morris.  Both highlight how a reduced instruction set can implement more complex operations from sequences of simple instructions while eliminating processing bottlenecks and waiting times.
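To make the contrast concrete (our illustration, not taken from the paper): a single x86 instruction can read, modify and write memory in one step, whereas a RISC ISA such as AArch64 expresses the same operation as separate load, add and store instructions, each simple enough to pipeline efficiently.  The instruction sequences in the comments below are what compilers typically emit for this C function:

```c
/* Sketch: one C statement, with the code a compiler typically
 * generates for each architecture (shown as comments). */
void increment(int *counter) {
    *counter += 1;
    /* x86-64 (CISC): one read-modify-write instruction
     *     add dword ptr [rdi], 1
     *
     * AArch64 (RISC): three simple instructions
     *     ldr w1, [x0]        // load from memory
     *     add w1, w1, #1      // add in a register
     *     str w1, [x0]        // store back to memory
     */
}
```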

Moore’s Law

As Moore’s Law has started to fail, due in part to the power and complexity issues of placing ever more transistors on a single CPU die, we’ve seen Intel and AMD implement multi-core processors, with each core effectively an individual CPU in its own right.  Intel 4th generation Xeon processors scale to 60 cores at 1.90 GHz with a TDP of 350W, with many power/cores/frequency combinations available (Intel’s website lists 55).  Compare this capability to the specifications of the AmpereOne we discussed earlier, with up to 192 cores at up to 3.0 GHz in the same power bracket. 

Offload

We can’t compare the clock speeds of CISC and RISC processors directly, because an application compiled for a RISC system will require more instructions than one compiled for a CISC system to complete the same tasks.  However, modern computing has evolved in two ways that make RISC more practical.

  • High parallelisation – modern workloads, including AI processing and those based on microservices architectures, are highly multi-threaded and well suited to massive multi-processing (see the sketch after this list). 
  • Bespoke accelerators and offloads – GPUs, FPGAs and SmartNICs all enable the offload of complex tasks to custom hardware, where a bespoke component can perform work like encryption or vector processing more effectively than a generalised CPU. See our Intelligent Data Devices report (below) for more details.
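A hedged sketch of the first point: an embarrassingly parallel job split across worker threads in C, the shape of workload that scales naturally onto a many-core part like Altra or AmpereOne.  The thread count and work size here are arbitrary values for illustration:

```c
/* Minimal sketch: splitting an embarrassingly parallel job across
 * POSIX threads.  Build with: gcc -O2 -pthread parallel_demo.c */
#include <pthread.h>
#include <stdio.h>

#define THREADS 8            /* arbitrary; scale to the vCPU count */
#define N       (1 << 20)

static double data[N];

struct slice { int start, end; double sum; };

/* Each worker sums its own slice; no shared mutable state, no locks. */
static void *worker(void *arg) {
    struct slice *s = arg;
    s->sum = 0.0;
    for (int i = s->start; i < s->end; i++)
        s->sum += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[THREADS];
    struct slice slices[THREADS];
    int chunk = N / THREADS;

    for (int t = 0; t < THREADS; t++) {
        slices[t].start = t * chunk;
        slices[t].end   = (t == THREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &slices[t]);
    }

    double total = 0.0;
    for (int t = 0; t < THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += slices[t].sum;
    }
    printf("total = %.0f\n", total);   /* expect 1048576 */
    return 0;
}
```

Because each thread works independently, throughput scales with core count rather than single-core clock speed – precisely the trade a high-core-count Arm design makes.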

So, Arm may be more practical when workloads are highly threaded and can make use of dozens of cores, and/or where computationally expensive tasks can be offloaded to add-in cards (AICs) like GPUs.  Google and Oracle (and, of course, AWS, which got there first with Graviton) are hoping that the price/performance calculation will make Arm a more favourable architecture than x86 for certain types of workloads.  This could include databases, infrastructure applications or AI inferencing.

Intelligent Data Devices 2023 Edition – A Pathfinder Report

This Architecting IT report looks at the developing market of SmartNICs, DPUs and computational storage devices, as data centres disaggregate data management processes, security and networking. Premium download – $295.00 (BRKWP0303-2023)

The Architect’s View®

The introduction of the C3A is another example of how the IT market is moving away from its dependency on the Intel x86 architecture.  Intel did very well with x86 over the 2000s and 2010s, partly due to the addition of hardware-specific extensions that further enabled server virtualisation, plus the rise of Linux as a free, stable, and open operating system.  However, we’ve seen this kind of processor cycle occur before, when the ubiquitous mainframe was challenged by SPARC, POWER (IBM disrupting itself) and HP’s PA-RISC designs. 

The slowing of Moore’s Law has demonstrated a need for alternative processing architectures, which has helped NVIDIA become a $1.1 trillion company, worth more than Intel and AMD combined ($161 billion and $170 billion, respectively).  Arm is also riding this wave, with a rumoured $52 billion valuation at its recent IPO.

What does all this say about the future of processor architectures?  In the public cloud, competition is intense between providers.  Customers want to manage costs, so one option is to offer more price/performance optimised instance types.  The major cloud vendors can afford this approach due to economies of scale, so we’re likely to see this trend continue.

What about private data centres?  In the 2000s, the leading on-premises infrastructure vendors sold systems based on the Itanium architecture, which Intel and HP hoped would displace x86.  Itanium was eventually discontinued in 2020 (people may remember Windows Server support for Itanium processors), and x86 remained the dominant design. 

HPE has a current series of ProLiant servers based on Ampere Altra processors (the RL series), although you’d be hard-pressed to find them on the website without knowing they exist.  Dell had some Arm-based servers ten years ago but only offers Intel and AMD-based servers today.

Bamboo Systems, one vendor that offered Arm-based systems for on-premises use, is no more.  Only SoftIron appears to offer Arm-based nodes, for the storage portion of its HCI infrastructure.

There’s a risk that cloud customers become very comfortable with Arm architectures in the public cloud, including the improved price/performance they represent, with the result that moving back on-premises becomes more expensive than expected. 

Arm in the public cloud may be about more than just price/performance and represent an opportunity for long-term lock-in. 


Copyright (c) 2007-2023 – Post #f32c – Brookend Ltd, first published on https://www.architecting.it/blog, do not reproduce without permission.