The current eMAG (Skylark) isn't exactly new either; it's a 16 nm design from last year. They wanted to launch Quicksilver in 2019, but there's only about a month of it left.
In multi-thread benchmarks of raw memory I/O we found a clear performance leader in Ampere’s eMAG, outperforming both AMD’s EPYC and Intel’s XEON CPUs by a factor of 6 or higher.
That doesn't sound right. Neither AMD nor Intel gets more than a handful of GB/s in basic memory I/O? Any idea what could be wrong?
"It should be noted that Kinvolk has ongoing cooperation with both Ampere Computing and Packet, and used all infrastructure used in our benchmarking free of charge. Ampere Computing furthermore sponsored the development of the control plane automation used to issue benchmark runs, and to collect resulting data points, and to produce charts."
I'm not saying anything was done intentionally, but optimizations were likely made on the Ampere side.
It doesn't sound remotely right to me either. It could be NUMA, since the Intel and AMD systems are NUMA but the eMAG is not. The code for this benchmark appears to be https://github.com/akopytov/sysbench/blob/master/src/tests/m... which... is not an interesting way to benchmark a large server IMO. Running a single process with a lot of threads and a lot of RAM on a NUMA server is going to perform poorly (unless you do a lot of tuning, which I don't recommend either). "Microservices" might run a lot faster.
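To make the NUMA point concrete, here's a purely illustrative sketch (not the Kinvolk harness; file and sizes are made up): a single process faults in a big buffer from the main thread, so first-touch puts all pages on one node, and every worker thread on the other socket then streams it over the interconnect.

    /* numa_firsttouch.c - hypothetical sketch of NUMA-unaware benchmarking.
     * The main thread faults in the whole buffer (first-touch places the
     * pages on its node); workers on the remote socket then read it across
     * the interconnect. Build: gcc -O2 -pthread numa_firsttouch.c
     * Compare: ./a.out  vs  numactl --interleave=all ./a.out */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTHREADS 8
    #define BUF_SIZE (1UL << 30)            /* 1 GiB shared buffer */

    static uint8_t *buf;

    static void *worker(void *arg)
    {
        (void)arg;
        uint64_t sum = 0;
        for (size_t i = 0; i < BUF_SIZE; i += 64)   /* one read per cache line */
            sum += buf[i];
        return (void *)(uintptr_t)sum;
    }

    int main(void)
    {
        buf = malloc(BUF_SIZE);
        memset(buf, 1, BUF_SIZE);   /* first touch: all pages land on the main thread's node */

        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        puts("done");
        return 0;
    }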
That's only fine if you know all the code running in all parts (containers) on the same hardware node. Code running in one container can influence data/code in other containers (when some third party has a form of code execution).
Their tests disabled hyperthreading on Intel due to security concerns, and also on AMD on the speculation that security concerns might arise there in the future (if I read everything correctly).
In the memcopy benchmark, which is designed to stress both memory I/O as well as caches, Intel’s XEON shows the highest raw performance.
I am not surprised by that, given that x86 has a single instruction that will copy an arbitrary number of bytes in cacheline-sized chunks, something that ARM does not have.
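For reference, this is the instruction being referred to (REP MOVSB, which ERMSB-capable CPUs execute in cache-line-sized chunks). A hypothetical inline-asm sketch, x86-64 with GCC/Clang only; the wrapper name is mine:

    /* rep_movsb.c - hypothetical sketch of the single-instruction copy.
     * "rep movsb" moves RCX bytes from [RSI] to [RDI]; with ERMSB the
     * microcode does this in cache-line-sized chunks. AArch64 instead
     * needs an explicit load/store loop (e.g. LDP/STP pairs).
     * Build: gcc -O2 rep_movsb.c */
    #include <stddef.h>
    #include <stdio.h>

    static void copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }

    int main(void)
    {
        char src[64] = "one instruction, arbitrary length";
        char dst[64] = {0};
        copy_rep_movsb(dst, src, sizeof src);
        puts(dst);
        return 0;
    }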
I'm very impatient for RISC-V to arrive so we can see how it does on performance/security.
Don't forget to disable a lot of Intel features, like SMT/Hyper-Threading, if you want a fully secure environment.
Very impressive perf on ARM's side, given it competes against decades of x86-specific optimisation in the code.
Intel, for example, long intentionally kept float performance close to that of integers of the same size, so there was no perf difference in scripting languages that use floats internally for all computations.
ARM sucks at web benchmarks because ARM never put any emphasis on FP perf. Many ARM cores simply don't have FP units at all. The most popular JS VM, V8, does a lot of useless float-to-integer and back conversions under the hood, and that doesn't help either. They are almost free on x86, but they degrade JS perf on smartphones by double digits.
Second, vector math and vector float math have close to no use in web loads, but a lot of devs still try to put SSE instructions everywhere, simply because on x86 SSE is many times faster than scalar math and many bit manipulations.
ARM, on the other hand, is relatively good at doing a lot of ops on byte and double data, because it was historically never aimed at number crunching with extra-wide vector instructions.
For the same reason, ARM's UCS-2 and UTF-16 parsing performance is that bad. All kinds of parsers exploit fast register renaming on x86 to run tzcnt with very good perf, but they have to fall back to relatively slow SIMD bitmasks on ARM. You can feel that a lot when you work with VMs/interpreters that use UCS-2 as their internal Unicode representation.
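The idiom being referred to, sketched hypothetically (function name and buffer are mine): a SIMD compare plus _mm_movemask_epi8 produces a bitmask, and tzcnt/ctz locates the first matching byte. NEON has no direct movemask equivalent, so an ARM port needs extra reduction steps.

    /* find_byte.c - hypothetical sketch of the movemask + tzcnt scan idiom
     * used by many string/JSON/UTF-16 parsers. x86-64 only (SSE2).
     * Build: gcc -O2 find_byte.c */
    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>

    /* index of the first occurrence of byte c in the 16-byte block at p, or 16 */
    static int find_byte16(const uint8_t *p, uint8_t c)
    {
        __m128i chunk  = _mm_loadu_si128((const __m128i *)p);
        __m128i needle = _mm_set1_epi8((char)c);
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        return mask ? __builtin_ctz(mask) : 16;   /* compiles to tzcnt/bsf */
    }

    int main(void)
    {
        const uint8_t buf[16] = "find the \" here";
        printf("first '\"' at index %d\n", find_byte16(buf, '"'));
        return 0;
    }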
Hardware peripherals were always x86-optimised too. Yes, almost every device you can hook onto PCIe has been extensively optimised to work well with x86-style DMA, with higher-level APIs like I/O virtualisation and DMA offload engines, and with assumptions about typical controller, memory, and cache latency.
Yes, even endianness conversion is there to make x86 jump ahead. Almost all "enterprise hardware" intentionally uses little endian in its protocols, to avoid endianness conversion on x86. Of course, that comes at the cost of doing it on big-endian machines, which can include ARM. (A quick sketch follows below.)
P.S. On the other hand, nearly all peripheral ICs aimed at the embedded market prefer big endian, for the opposite reason.
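To illustrate the conversion cost being discussed, a hypothetical sketch (glibc assumed): decoding a little-endian wire field with le32toh() compiles to nothing on a little-endian host and to a byte swap on a big-endian one.

    /* wire_field.c - hypothetical sketch: decoding a little-endian field.
     * le32toh() is a no-op on little-endian hosts and a byte swap on
     * big-endian ones, which is the asymmetry described above.
     * Build (glibc): gcc -O2 wire_field.c */
    #include <endian.h>    /* le32toh (glibc) */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* the value 16 encoded little-endian, as a typical "enterprise" protocol would */
        const uint8_t wire[4] = { 0x10, 0x00, 0x00, 0x00 };
        uint32_t raw;
        memcpy(&raw, wire, sizeof raw);        /* raw copy, no interpretation yet */
        printf("field = %u\n", le32toh(raw));  /* prints 16 on any host */
        return 0;
    }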
ARM is doing OK with the current generation of HPC systems, and the post-K system, whose name I forget, should be rather impressive at floating point. SIMD width is not all that matters, after all. (Obviously this is ARMv8 and up, which requires floating point.)