ARM: from RISC and Embedded systems to Supercomputers
ARM is a 30-year-old UK-based semiconductor IP (Intellectual Property) company that started as Advanced RISC Machines Ltd., a joint venture that included the former UK company Acorn Computers, Apple Computer (now Apple Inc.) \cite{venture}, and VLSI Technology (a US company later bought by Philips for USD 1 billion and now part of the Philips spin-off NXP Semiconductors). ARM was bought in 2016 by the Japanese conglomerate and investment company SoftBank Group, which also held a major stake in the US carrier Sprint and now holds significant T-Mobile stock following the Sprint and T-Mobile merger that was completed in April 2020. In September 2020, NVIDIA announced that it is buying ARM \cite{ai}, but the deal may take time to complete since it is running into some licensing issues regarding the ARM subsidiary ARM China.
Unlike Intel, AMD, and Freescale, ARM does not produce any computer chips, but licenses its architecture designs and product ecosystem to other vendors. It has particularly focused on low-power CPUs and GPUs for mobile and embedded devices, and has over the last several years acquired dozens of smaller companies, such as Falanx, a Norwegian startup that is now known as ARM Norway and is behind its Mali GPUs.
ARM now offers a wide range of IP design products, including CPUs, GPUs, and microcontrollers. ARM CPUs and NPUs (Neural Processing Units) include Cortex-A, Cortex-M, Cortex-R, Neoverse, Ethos, and SecureCore. Mobile companies such as Apple and Samsung often license the designs from ARM and integrate them into their own system-on-chip (SoC) designs together with other components such as GPUs (sometimes ARM's Mali) or modem/radio baseband components (for mobile phones). In fact, due to the proliferation of mobile and handheld devices such as tablets, ARM was claimed to have 75% of the world's CPU market in 2005. As the demand for processing power in these devices has increased, so has the demand for their CPUs; since 2018, ARM has also offered server-class CPU designs. If NVIDIA's purchase of ARM goes through, it will be very interesting to see how it impacts future ARM designs, which are already influenced by AI workloads.
ARM-based Astra
The first ARM-based supercomputer to appear on the Top-500 list was Astra at Sandia National Labs, announced at SC18. The HPE (Hewlett Packard Enterprise)-built system ranked at number 204 with 1.529 petaFLOPS, and made number 36 on the HPCG (High Performance Conjugate Gradient) benchmark, a benchmark more targeted to its use. Astra was the first system in the US DOE NNSA (National Nuclear Security Administration) Vanguard program, which looks at prototype systems for advanced architectures.
A single Astra node is about 100 times faster than the ARM-based CPU chips found in cell phones. The Astra system has 5,184 ARM-based Cavium ThunderX2 28-core processors. Cavium Inc. grew from designing networking processors to itself being bought in 2018 by Marvell Technology Group, a Bermuda-registered conglomerate.
Michael Aguilar, the main systems administrator of Astra, gave a nice introduction to the system in 2019 at Stanford \cite{insidehpc}. He describes how they ported many familiar packages, ranging from OpenMPI to UCX. One of the interesting challenges they faced was that, unlike Intel cores, the ARM cores did not appear next to each other, so they behaved in a very non-uniform fashion. Socket-direct, a new feature on the architecture, seems to alleviate many performance issues associated with core locality, but requires a new awareness at this level. They have had good results with the system so far, with speedups ranging from 1.6X on Monte Carlo codes to 1.87X on a linear solver relative to the Intel Haswell-based Trinity ASC platform they used as a baseline.
As for power consumption, however, they did not find the ARM system to be notably more efficient than the Intel-based ones, although this has not been a focus of their work either.
Fugaku and ARM
In summer 2020, the new Japanese Fugaku supercomputer at RIKEN gained the top spot on the Top-500 list. Named “High Peak” after Mt. Fuji, it uses ARM-based A64FX 48C chips, jointly developed by RIKEN and Fujitsu, and Fujitsu's Tofu (torus fusion) interconnect. The A64FX chip is fabricated in TSMC's 7nm FinFET process and has 8.8 billion transistors and 594 signal pins. It also has 512-bit x 2 SVE (Scalable Vector Extension) vector units \cite{global}.
Note that one of the biggest differences between x86 chips from Intel and AMD, from a vector operation point of view, is that modern Intel chips have one 512-bit AVX-512 vector unit, whereas AMD chips have 2-3 128-bit units.
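To make the vector widths mentioned above more concrete, the following minimal Python sketch counts how many 64-bit values fit in a vector of a given width and estimates a theoretical 64-bit peak for one A64FX chip. It is an illustration under stated assumptions, not a vendor formula: the function names are made up for this example, each lane is assumed to complete one fused multiply-add (2 floating-point operations) per cycle, and the 2.2 GHz clock is an assumed value that does not come from this article.

```python
def lanes(vector_bits, element_bits=64):
    """Number of elements of the given size that fit in one vector."""
    return vector_bits // element_bits


def peak_gflops(cores, pipelines_per_core, vector_bits, clock_ghz):
    """Theoretical 64-bit peak in GFLOPS.

    Assumes one fused multiply-add (2 floating-point operations)
    per lane, per pipeline, per cycle.
    """
    flops_per_cycle = cores * pipelines_per_core * lanes(vector_bits) * 2
    return flops_per_cycle * clock_ghz


# Lanes of 64-bit values for the vector widths discussed in the text.
for bits in (128, 256, 512):
    print(f"{bits}-bit vector: {lanes(bits)} doubles per instruction")

# Assumption: clock frequency chosen for illustration, not from the article.
ASSUMED_CLOCK_GHZ = 2.2

# 48 compute cores with 2 x 512-bit SVE pipelines each, as stated in the text.
a64fx_peak = peak_gflops(cores=48, pipelines_per_core=2,
                         vector_bits=512, clock_ghz=ASSUMED_CLOCK_GHZ)
print(f"A64FX theoretical peak: {a64fx_peak:.1f} GFLOPS")
```

Sustained performance on real codes is typically well below such a theoretical peak, since it depends on memory bandwidth and on how well the code vectorizes.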
Fugaku maintained its number-1 spot in November 2020 \cite{supercomputer}. Unlike the next three systems (Summit, Sierra, and Sunway), it had added over 300,000 cores since the previous list (from 7,299,072 cores to 7,630,848 cores). In addition to its 48 compute cores, the A64FX chip has 2 or 4 assistant cores that run the OS (operating system).
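The core counts quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the Top-500 core count includes only the 48 compute cores of each chip (not the assistant cores); that is an assumption of this example rather than something stated in the article, although both totals divide evenly by 48, which is consistent with it.

```python
CORES_JUNE_2020 = 7_299_072
CORES_NOV_2020 = 7_630_848

# Assumption: only the 48 compute cores per chip are counted in the totals.
COMPUTE_CORES_PER_CHIP = 48

added_cores = CORES_NOV_2020 - CORES_JUNE_2020

chips_june, rem_june = divmod(CORES_JUNE_2020, COMPUTE_CORES_PER_CHIP)
chips_nov, rem_nov = divmod(CORES_NOV_2020, COMPUTE_CORES_PER_CHIP)
added_chips = chips_nov - chips_june

growth_percent = 100.0 * added_cores / CORES_JUNE_2020

print(f"Added cores: {added_cores:,}")
print(f"Implied chips: {chips_june:,} -> {chips_nov:,} (+{added_chips:,})")
print(f"Remainders when dividing by 48: {rem_june}, {rem_nov}")
print(f"Growth: {growth_percent:.2f}%")
```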
The Fugaku system is also very energy efficient, ranking number 10 on the Green500 list, unlike Summit, Sierra, and Sunway. Selene, the NVIDIA DGX A100 system at NVIDIA with 555,520 cores, ranked number 5 on both the Top-500 and Green500 lists in Nov. 2020.
The Fujitsu A64FX is also featured in Stony Brook's HPE Apollo 80/Cray computer Ookami ("wolf" in Japanese), announced in Nov. 2020. (Cray was formally acquired by HPE in Sept. 2019.) The Ookami system is the first system announced outside Japan that will feature the A64FX.
Atos and AMD EPYC
Another interesting entry on the Nov. 2020 Top-500 list is the number-7 JUWELS Booster Module, an Atos BullSequana XH2000 based on AMD EPYC chips and installed at Forschungszentrum Jülich (FZJ) in Germany.
Atos is a French company that emerged from the merger of two French companies and a Dutch IT company. It has since grown through the merger with or acquisition of several companies, including Siemens IT (2011) and Bull (2014), the latter a well-known HPC vendor in Europe. It has also strengthened its US presence through the acquisition of Xerox ITO (2014) and Syntel (2018).
As a historical note, Heide \cite{Heide_1991} \cite{Heide_2008} describes how both Bull and IBM have roots dating back to punched-card machines. Bull is in fact named after the Norwegian pioneer Fredrik Rosing Bull, who in 1918 started to build punched-card machines. Bull died in 1924, but his work was continued by others in Norway. In 1931 the business was relocated to Paris, France, where, after several reorganizations, it emerged as Bull, a subsidiary of Atos Technologies.
In May 2020, Atos announced that it was adding the new NVIDIA A100 GPUs to the Jülich system \cite{atos}, which clearly contributed to its number-7 ranking on the Nov. 2020 Top-500 list.
The Nanometer Race: Intel versus TSMC
Currently, just three chip manufacturers in the world are using node technologies smaller than 14nm: Intel, TSMC (Taiwan Semiconductor Manufacturing Company), and Samsung. China's SMIC has yet to produce processes smaller than 14nm. The nm (nanometer) denomination previously referred to gate length, but has in the last several years been tied to "technology node" size and become more vendor-dependent. For instance, Intel's 10nm nodes are assumed to be comparable to the foundries' 7nm nodes.
So how is this affecting supercomputing?
The chips in Fugaku and AMD's recent 7nm EPYC chips are both manufactured at TSMC, which also produces the new 5nm M1 chip for Apple. However, TSMC reports that it is struggling to meet the demand for the Apple chips, so Apple may once again have to use Samsung as well.
Intel announced through its earnings statement in July 2020 that it is having issues with its 7nm manufacturing, causing an expected delay of one year or more for its Ponte Vecchio GPU chips. This may in turn delay the Aurora system planned at Argonne National Laboratory \cite{foundries} \cite{issues}, one of the two new exascale systems planned in the US in 2021. The AMD-powered Frontier system at Oak Ridge National Laboratory is supposedly still on schedule, and will likely become the first exascale system in the US.
Of course, if we allow for low or mixed precision and look at the HPL-AI benchmark (unlike the Linpack benchmark used for the Top-500 rankings, which only allows full 64-bit precision), we are already in the exascale era. Fugaku also ranked number 1 on this benchmark in Nov. 2020, peaking at 2.0 exaFLOPS \cite{reloada}.
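The general idea behind mixed-precision solvers of this kind is to do the expensive elimination in low precision and then recover 64-bit accuracy through a few cheap refinement steps. The Python sketch below illustrates that idea only; it is not the benchmark's implementation. It uses classical iterative refinement, simulates 32-bit arithmetic by rounding through `struct`, repeats the elimination for every correction instead of reusing stored factors, and all function names and the small test matrix are made up for this example.

```python
import struct


def f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting in simulated 32-bit."""
    n = len(b)
    # Augmented matrix [A | b], rounded to 32-bit.
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            factor = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(factor * M[k][j]))
    # Back substitution, also rounded to 32-bit.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, b, x):
    """r = b - A x, computed in full 64-bit precision."""
    return [bi - sum(a * xj for a, xj in zip(row, x))
            for row, bi in zip(A, b)]


def refine(A, b, iterations=5):
    """Low-precision solve followed by 64-bit iterative refinement."""
    x = solve_low_precision(A, b)
    for _ in range(iterations):
        r = residual(A, b, x)            # 64-bit residual
        d = solve_low_precision(A, r)    # cheap low-precision correction
        x = [xi + di for xi, di in zip(x, d)]
    return x


# Small, well-conditioned test system (made up for this example).
A = [[4.0, 1.0, 0.0, 0.5],
     [1.0, 5.0, 1.0, 0.0],
     [0.0, 1.0, 6.0, 1.0],
     [0.5, 0.0, 1.0, 7.0]]
x_true = [0.1, -0.2, 0.3, -0.4]
b = [sum(a * xj for a, xj in zip(row, x_true)) for row in A]

x_lo = solve_low_precision(A, b)
x_hi = refine(A, b)
err_lo = max(abs(u - v) for u, v in zip(x_lo, x_true))
err_hi = max(abs(u - v) for u, v in zip(x_hi, x_true))
print(f"32-bit only error:   {err_lo:.3e}")
print(f"after refinement:    {err_hi:.3e}")
```

Refinement of this kind only converges when the matrix is reasonably well conditioned relative to the low precision used, which is why the example uses a diagonally dominant matrix.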
EuroHPC JU and the next wave of European supercomputers
Europe has had some systems in the top ten of the Top-500 list, including the current Atos systems, but is now pushing hard for its new EuroHPC JU (European HPC Joint Undertaking), with a roadmap that includes the following systems:
Pre-Exascale class:
- LUMI (Large Unified Modern Infrastructure) at CSC's center in Kajaani, Finland \cite{lumi} \cite{destiny}
- MareNostrum 5 at Barcelona Supercomputing Centre, Spain
- Leonardo at CINECA, Italy \cite{insidehpca}
Petascale supercomputers:
- LuxProvide, Luxembourg
- IZUM, Slovenia
- IT4Innovations National Supercomputing Centre, Czech Republic
- Sofiatech, Bulgaria
- Minho Advanced Computing Centre (MACC), Portugal
LUMI will be an HPE Cray EX supercomputer with 64-core AMD EPYC CPUs and the new AMD Instinct GPUs, and is projected to be faster than the current number-1 system, Fugaku. Leonardo has been announced as a 240-petaFLOPS Atos BullSequana XH2000 with ParTec's ParaStation Modulo software. MareNostrum 5's design has not been announced as of this article's writing.
FPGA: finally maturing for HPC?
Field-Programmable Gate Arrays (FPGAs), which let you encode algorithms and functions in hardware, offer great potential speedups for user application codes, but have been, and still are, fairly hard for scientists to program. However, two of the world's largest CPU makers have now bought FPGA companies: Intel acquired Altera in 2015, and AMD announced in October 2020 that it will be buying the FPGA company Xilinx Inc., a deal that is expected to go through by the end of 2021. Although Xilinx is headquartered in San Jose, it has a strong presence in Europe and Asia, and it has also started to offer FPGA-based accelerator cards targeting AI and cloud workloads. Xilinx's new data center accelerator card, the Alveo U250, targets AI/ML, graphics computing, and HPC computations \cite{card}.
FPGAs and other emerging accelerators will be discussed in a future department feature.