Apple M4: A boring chip with old cores from the previous generation, or a performance champion?

--

M4

We have already covered the unveiling of the iPad with the M4 processor. The chip has 4 large performance cores, and the novelty is that it has six small efficiency cores instead of four, which should help multi-threaded performance. It also has a 10-core GPU (1280 shaders) with ray tracing support, but this appears to be carried over unchanged from the Apple M3 generation and is therefore not new. The cheaper version of the M4 has only 3 large cores, while the 6 small cores and the 10-core GPU are retained.

What is new is support for faster LPDDR5X-7700 memory, which gives the processor higher bandwidth (120 GB/s, 20% more than the M3) and probably helps both CPU and GPU performance. Capacities range from 8 to 24 GB. The 16-core NPU for AI acceleration is probably also new. Apple quotes performance of up to 38 TOPS (but we do not know whether that figure applies to INT8 or perhaps INT4 data types).

An important detail may be that, although the M4 is also manufactured on a 3nm process, it uses the second generation of this technology, apparently N3E, whereas the M3 is manufactured on the first, “base” version, N3B. The N3E process apparently solves some of the problems of N3B and is supposed to achieve better performance and energy efficiency, though perhaps at the cost of slightly worse transistor density. Yields may also be better, which could be why Apple hastened the release of the M4 rather than holding it until the fall.

M4 processor (Apple presentation)

Author: Apple

Apple did not say in the M4 presentation that the cores have a new architecture compared to the M3. The company compared the processor to the older M2, and its slide carried the same description of architectural improvements that it gave for the step from the M2 to the M3 in the fall. This suggests that the processor has the same core architecture as the M3, so no radical increase in single-thread performance was expected.

Extremely high performance (in Geekbench)?

However, the first leaked benchmarks show a surprisingly large gain over the M3, larger than the generational gain the M3 itself delivered. Results have appeared in the Geekbench database in which this processor achieves multi-threaded scores (which are not very meaningful in this test) of around 14,500 points. What matters is the single-threaded score, which reaches 3,750 to 3,800 points, an unprecedented result.

Different scores for Apple tablets with M4 processor in Geekbench

Author: Geekbench browser

For comparison, the Ryzen 9 7950X from 2022 scores just over 3000 points under Windows, and this year’s Core i9-14900KS scores over 3100. Those figures are at least for the older version of the benchmark, whereas the M4 was tested with the latest one (results also fluctuate a lot and can be improved with, for example, fast memory or by turning off VBS protection in Windows). It is likely that the iOS tablet platform by itself gives a processor a slightly higher score than the same processor would hypothetically achieve under Windows (scores are generally higher on Linux as well), but that certainly doesn’t explain the whole difference.

Apple’s M3 processor, released in the fall, introduced a new core architecture with a wider design and more execution units, and was also the first to use a 3nm process. It would therefore be astonishing if another completely new architecture arrived so quickly.

And it looks like, despite the high score, the underlying architecture might indeed be the core from the M3. Mostly, but not entirely: information has surfaced that this CPU additionally supports SME instructions for the matrix calculations used in AI, the ARM equivalent of Intel’s AMX instructions.

Extreme frequency and SME

Geekbench 6.3 for ARM is said to use these SME instructions, and this optimization appears to raise the core’s performance by more than 100% in a single subtest of the whole benchmark, Object Detection. Because the jump is so large, it strongly affects the overall average. This seems to be the first factor that pushed the single-threaded score so far above the M3’s.
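To get a feel for how much one doubled subtest can move a composite score, here is a minimal sketch. It assumes the composite is a plain geometric mean over 16 equally weighted subtests, which is a simplification (the real Geekbench scoring also weights its sections), and the function name `composite_gain` is hypothetical.

```python
import math

def composite_gain(subtest_ratios):
    """Geometric mean of per-subtest speedup ratios (new score / old score)."""
    log_sum = sum(math.log(r) for r in subtest_ratios)
    return math.exp(log_sum / len(subtest_ratios))

# Assumption: 16 equally weighted subtests, 15 unchanged and one doubled.
ratios = [1.0] * 15 + [2.0]
gain = composite_gain(ratios)

# Under these assumptions the composite rises by about 4.4%.
print(f"composite gain: {(gain - 1) * 100:.1f}%")
```

Under these assumptions a single doubled subtest lifts the whole composite by a little over 4%, even though nothing else changed.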

Breakdown of performance gains between M4 and M3 for individual Geekbench 6.3 subtests

Author: Geekbench browser

The second reason for the large performance increase is clock frequency. According to Geekbench 6.3’s detection, it has risen to 4.40 GHz, which is respectable considering that in the M1 era clocks were only slightly above 3 GHz and it was often assumed that they could not go much higher because of the wide core.

Apple M4 in the Geekbench database

Author: Geekbench browser

However, Apple has evidently worked on making the new core that debuted in the M3 able to run at higher frequencies (some latencies have reportedly been increased, for example), and it is now reaping the benefits after switching to the better N3E version of the 3nm process. Compared with the M3, whose large cores run at 4.05 GHz (already a lot for ARM CPUs), this is an increase of about 8.6%.

IPC is only 3% higher, possibly due to memory. The core is probably not new

A calculation has surfaced on Twitter showing that, at this frequency, the M4 achieves 7.6% higher performance per 1 MHz (i.e. IPC) in Geekbench 6.3, but without the Object Detection subtest, which uses SME instructions, the increase would be only 3.0%.

This already modest IPC improvement could partly reflect some minor changes in the M4 architecture, or perhaps fixes for performance-limiting errata. But part of that 3% is probably simply due to the LPDDR5X-7700 memory with its 20% higher bandwidth.

So both explanations are possible. The Apple M4 may already have a partially new core, but even then it would probably not count as a new architecture, rather an improved version of the one from the fall. But it may also be a core that is almost exactly the same, only clocked higher, which is probably the more likely option.

SME instructions do not have to be executed directly by the CPU core; they are apparently handled by a separate AMX acceleration coprocessor, which Apple has had in its processors for some time.

The AMX unit did not have a publicly documented instruction set, and programmers could use it only through Apple’s libraries, which provided implementations of various algorithms accelerated on the unit. But if this block has now been reworked so that it can be programmed via SME instructions, then code in third-party applications will probably be able to use it directly.

However, SME will probably be fairly “single-purpose”: it will be used mainly or exclusively in AI applications that run inference on the CPU. The 100% performance increase seen in the Object Detection test therefore cannot be expected to appear anywhere other than in AI tasks (which, however, also have the option of using the GPU or NPU).

Apple could achieve a more broadly useful improvement in performance per 1 MHz by providing in-core SIMD units wider than 128 bits through the SVE and SVE2 instructions. These would compete with Intel’s and AMD’s AVX2 and AVX10 at 256-bit width, or with AVX-512 at 512-bit width (such wide units are not very likely, however).

So far it has looked as though Apple is not interested in SVE / SVE2, and Qualcomm’s Oryon core (in the Snapdragon X Elite and Plus) does not support these instructions either. It is speculated that ARM wants more money for ARMv9 and SVE / SVE2, which is holding back the adoption of these instructions. However, the fact that Apple has adopted SME instructions could indicate some chance that the companies have agreed on licensing and that these more powerful SIMD instructions could one day appear in Apple processors.

A 12% increase in single-threaded performance, but double that in Geekbench

So, leaving SME aside, the theoretical single-threaded performance gain of the M4 is 3% from IPC (thanks to memory and, possibly, other things) and 8.6% from the higher frequency. Taken together, this means that the M4 at 4.40 GHz has a general single-threaded performance gain of about +11.9% over the previous M3 clocked at 4.05 GHz, although the Geekbench 6.3 score shows roughly twice that gain once SME is included in the average (up to +23%, depending on which exact result in the database you choose for comparison).
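The arithmetic above can be checked in a few lines. This is a minimal sketch using only the figures quoted in the text (a 3% per-clock gain, and clocks of 4.05 GHz and 4.40 GHz); the variable names are illustrative. The two gains compound multiplicatively, which is why the result is slightly more than the plain sum of 3% and 8.6%.

```python
m3_clock_ghz = 4.05
m4_clock_ghz = 4.40
ipc_gain = 0.03  # per-clock gain without the SME-accelerated subtest

# Relative clock increase of the M4 over the M3.
clock_gain = m4_clock_ghz / m3_clock_ghz - 1

# Per-clock and clock gains multiply rather than add.
total_gain = (1 + ipc_gain) * (1 + clock_gain) - 1

print(f"clock gain: {clock_gain * 100:.1f}%")  # 8.6%
print(f"total gain: {total_gain * 100:.1f}%")  # 11.9%
```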

So, as you can see, averaging the subtests in these suites of synthetic benchmarks is sometimes quite tricky. Without the SME effect, the score would probably be only around 3450 points in the single-threaded test, not 3800. In previous versions, the results were similarly skewed by synthetic memory tests (GB4) or by cryptographic extensions (GB5, where Intel’s Ice Lake, Tiger Lake and Rocket Lake cores benefited in turn).

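The estimate of roughly 3450 points can be reproduced from the article's own figures. The sketch below assumes that the top result of 3800 points corresponds to the +23% gain, backs out the implied M3 baseline, and then applies the +11.9% gain without SME. The baseline is derived here for illustration; it is not a measured value.

```python
m4_score_with_sme = 3800
gain_with_sme = 0.23      # upper-end Geekbench 6.3 gain quoted in the text
gain_without_sme = 0.119  # clock gain combined with the 3% per-clock gain

# M3 score implied by the two figures above (an assumption, not a measurement).
implied_m3_score = m4_score_with_sme / (1 + gain_with_sme)

# Hypothetical M4 score if the SME-accelerated subtest gained no more than the rest.
m4_score_without_sme = implied_m3_score * (1 + gain_without_sme)

print(round(implied_m3_score), round(m4_score_without_sme))
```

Under these assumptions the score without SME lands in the mid-3400s, in line with the figure given above.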
Even with 3450 points, the M4 should have practically the highest single-threaded performance on the market (although we do not rule out that the Core i9-14900K and i9-14900KS could reach this score with optimal memory modules and under Linux, which gives better scores than Windows), but it would no longer be an extreme lead over everything else.

Sources: AnandTech, Tom’s Hardware, Nguyen Phi Hung (https://twitter.com/negativeonehero/status/1788364108737466609, https://twitter.com/negativeonehero/status/1788576876468007209)

