The Instinct MI300A compute APU achieves up to 4x higher performance than conventional accelerators

--

Some workloads are limited by the raw computing power of current hardware. Others, however, are limited not by the accelerator's computing power but by data transfers. When the processor and the accelerator are separate devices, each with its own memory, moving data between processor memory and accelerator memory can take longer than the computation itself.

The AMD Instinct MI300A is the first high-performance solution to move beyond the classic arrangement of a CPU with its own memory and a GPU with its own memory, interconnected by a relatively slow PCIe interface. On the MI300A, the memory is unified and shared, and the CPU and GPU parts have equal access to it through a unified address space. If the GPU is to work with some data, the data therefore doesn't have to be copied from one memory to the other (and the result possibly copied back); everything happens in one place.

For tasks that are limited precisely by data transfers, the performance gain of the MI300A is huge and can reach up to four times that of a classic solution built from a separate processor and accelerator.

The following graph shows how much of the processing time individual hardware solutions spend on the computation itself (dark) and how much on data transfers (light). This ratio also explains why increasing computing power has only a minimal effect on the overall performance of accelerators on this type of task.
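The effect can be sketched with a toy model of a transfer-bound task. All the numbers below (data volume, link speed, throughput) are illustrative assumptions, not MI300A measurements; the point is only the shape of the arithmetic:

```python
# Toy model: total task time = data movement over the CPU<->GPU link + computation.
# Illustrative numbers only; chosen so that transfers dominate the task.

def total_time(bytes_moved, link_gb_s, flops_needed, gflops):
    """Seconds for one task: transfer over the interconnect plus the calculation."""
    transfer = bytes_moved / (link_gb_s * 1e9)   # time spent moving data
    compute = flops_needed / (gflops * 1e9)      # time spent computing
    return transfer + compute

DATA = 64e9    # assume 64 GB shuffled across the link per task
WORK = 1.6e13  # assume 16 TFLOPs of actual computation per task

classic = total_time(DATA, link_gb_s=64, flops_needed=WORK, gflops=50_000)
faster  = total_time(DATA, link_gb_s=64, flops_needed=WORK, gflops=100_000)  # 2x compute power
unified = total_time(0,    link_gb_s=64, flops_needed=WORK, gflops=50_000)   # no transfers at all

print(f"classic: {classic:.2f} s, 2x compute: {faster:.2f} s, "
      f"unified memory: {unified:.2f} s")
# Doubling compute power barely helps (1.32 s -> 1.16 s), while removing
# the transfers entirely yields roughly a 4x speedup (1.32 s -> 0.32 s).
```

Under these assumed numbers, doubling the accelerator's FLOPS shaves off only about 12% of the task time, whereas eliminating the transfers, which is what the unified memory of the MI300A does, cuts it roughly fourfold.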


The Instinct MI300A grew out of the original Exascale Heterogeneous Processor (EHP), aka Exascale APU, project, which was being discussed as early as 2017. In retrospect, it is interesting how AMD had to adapt to changes in technology along the way. For example, the original assumption was that two quad-core processor chiplets would be used, i.e. 8 cores per APU in total. In the end there are 24 of them on the APU (three chiplets of eight cores each).


Exascale Heterogeneous Processor (AMD, 2017)

On the other hand, HBM memory has developed more slowly than originally expected, a consequence of memory manufacturers deciding to make it a high-end solution that pays off only in the most powerful accelerators (instead of the widely applicable product originally intended). Instead of the originally considered HBM4, which was supposed to be stacked on top of low-clocked graphics chips (so the HBM stacks wouldn't overheat), HBM3 had to be used, placed conventionally "next to" them. That eliminated the need to keep the graphics chiplets at low clocks (~1 GHz), and AMD could afford clocks slightly above 2 GHz.

Despite this, the originally targeted energy efficiency was exceeded: instead of the planned 50 GFLOPS per watt, the Instinct MI300A achieves 80-111 GFLOPS per watt (both figures refer to general-purpose double-precision compute). What hasn't changed significantly is the number of stream processors, originally planned at 16,384 and finally reaching 14,592.
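As a sanity check, the quoted 80-111 GFLOPS/W range falls out of simple arithmetic if we assume AMD's publicly listed figures of roughly 61.3 TFLOPS of FP64 vector throughput and a configurable TDP between 550 W (air-cooled) and 760 W (liquid-cooled). These two inputs are assumptions taken from AMD's public specifications, not numbers stated in the article:

```python
# Back-of-the-envelope check of the 80-111 GFLOPS/W efficiency range.
# Assumed inputs (from AMD's public MI300A specs, not from the article):
FP64_GFLOPS = 61_300   # ~61.3 TFLOPS double-precision vector throughput
TDP_RANGE_W = (550, 760)  # configurable TDP: air-cooled vs liquid-cooled

eff_high = FP64_GFLOPS / TDP_RANGE_W[0]  # best case: lowest power draw
eff_low = FP64_GFLOPS / TDP_RANGE_W[1]   # worst case: highest power draw

print(f"{int(eff_low)}-{int(eff_high)} GFLOPS per watt")  # -> 80-111 GFLOPS per watt
```

Under these assumptions the two endpoints land almost exactly on the 80 and 111 GFLOPS/W figures quoted above, which suggests the range simply reflects the two TDP configurations.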

However, one thing that wasn't discussed at all in 2017, and that the MI300A ultimately handles very well, is AI acceleration. For AI calculations in double precision, the efficiency relative to the original plan is as much as 2x higher than the values mentioned in the previous paragraph.


