You could probably get very close to this solution with C or C++, plus AVX intrin...

bigiain · on Oct 28, 2021

I suspect by the time you’ve disabled all the compiler “optimisations” that would likely knock a few orders of magnitude of performance off the directly translated algorithm, you may as well have written the assembler to start with…

And you probably can’t fine-tune your C/C++ to get this performance without knowing exactly what processor instructions you are trying to trick the compiler into generating anyway.

kccqzy · on Oct 28, 2021

> without knowing exactly what processor instructions you are trying to trick the compiler into generating

In fact, in this case you know exactly what processor instructions the compiler is going to generate. You are using AVX intrinsics after all.

And no, compiler optimizations work well with these intrinsics.
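To make the point about intrinsics concrete, here is a minimal sketch in C, assuming GCC or Clang on x86. It is not code from the program under discussion; the function names `add_bytes` and `add_bytes_avx2` are made up for illustration. Each intrinsic normally corresponds to one AVX2 instruction (for example, `_mm256_add_epi8` to `vpaddb`), while the compiler remains free to schedule, inline, and allocate registers around them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: add `delta` to every byte, 32 bytes per iteration.
 * The target attribute lets this one function use AVX2 without -mavx2. */
__attribute__((target("avx2")))
static void add_bytes_avx2(uint8_t *buf, size_t len, uint8_t delta)
{
    const __m256i d = _mm256_set1_epi8((char)delta);
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(buf + i));
        v = _mm256_add_epi8(v, d);               /* wrapping byte add */
        _mm256_storeu_si256((__m256i *)(buf + i), v);
    }
    for (; i < len; i++)                          /* scalar tail */
        buf[i] = (uint8_t)(buf[i] + delta);
}
#endif

/* Use the AVX2 path when the CPU supports it, otherwise plain scalar code. */
static void add_bytes(uint8_t *buf, size_t len, uint8_t delta)
{
    assert(buf != NULL || len == 0);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        add_bytes_avx2(buf, len, delta);
        return;
    }
#endif
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)(buf[i] + delta);
}
```

Compiling with `-O2 -S` and reading the generated assembly is the usual way to check which instructions the compiler actually emitted for the loop.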

akie · on Oct 29, 2021

I mean, did you see the very complicated, extremely optimized C and C++ code lower on the page? Despite that, they "only" got to 10% of the performance of the ASM code.

saagarjha · on Oct 29, 2021

The speed of this program is partly because it's written in assembly, but mostly because it's written by someone who is quite clever and clearly put a large amount of time into this problem. None of the other solutions spend much time trying to fit their data into the CPU cache, nor do they have to drop to using splicing for zero copies, and not one is doing anything nearly as clever as this program is to generate its numbers. All of this would be possible to mostly translate to C++ with AVX intrinsics, but the real accelerator here is not the choice of language, it's the person behind the code.

cormacrelf · on Oct 29, 2021

Now that I have seen the power of madvise + huge pages, everything looks like a nail. Author reckons 30% from less page table juggling. There are techniques here that apply outside assembly.
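As a rough illustration of the madvise + huge pages idea outside assembly, here is a minimal C sketch, assuming Linux with transparent huge pages available. It is not the author's code: the helper name `alloc_thp_hint` and the 2 MiB huge page size are assumptions for illustration. The hint is advisory, so the sketch treats a failed `madvise` as non-fatal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages, the common size on x86-64. */
#define HUGE_PAGE_SIZE ((size_t)2 << 20)

/* Hypothetical helper: allocate at least `len` bytes aligned to the huge
 * page size and ask the kernel to back them with transparent huge pages.
 * Sets *hinted to 1 if the kernel accepted the hint, 0 otherwise.
 * Returns NULL on allocation failure; release the buffer with free(). */
static void *alloc_thp_hint(size_t len, int *hinted)
{
    assert(len > 0);
    size_t rounded = (len + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
    void *p = NULL;

    if (posix_memalign(&p, HUGE_PAGE_SIZE, rounded) != 0)
        return NULL;

    *hinted = 0;
#ifdef MADV_HUGEPAGE
    /* Advice only: the kernel may ignore or reject it (e.g. THP disabled). */
    if (madvise(p, rounded, MADV_HUGEPAGE) == 0)
        *hinted = 1;
#endif
    return p;
}
```

Fewer, larger pages mean fewer page table entries and TLB misses for a buffer that is written and handed off repeatedly, which is the kind of saving the comment above describes.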

DeathArrow · on Oct 29, 2021

It's not ASM that makes the code fast, it's the way he laid out data and code. C/C++ should be able to approach 90% of the speed of this.

gpderetta · on Nov 1, 2021

Most other implementations do not use vmsplice, which is a huge win for this specific problem.

Of course all the AVX and cache optimizations are also exceedingly clever.
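For readers unfamiliar with the Linux `vmsplice(2)` system call mentioned above, here is a minimal C sketch of handing a user buffer to a pipe without copying it through `write`. It is an illustration under assumptions, not the program's actual output path: the helper name `splice_out` is hypothetical, and `fd` must be the write end of a pipe.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes at `buf` into the pipe behind `fd`.
 * Returns 0 once everything has been spliced, -1 on error (see errno).
 * Caveat: the pipe refers to the caller's pages, so the buffer should not
 * be modified until the reader has consumed the data. */
static int splice_out(int fd, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(fd, &iov, 1, 0);
        if (n < 0)
            return -1;
        /* vmsplice may accept only part of the buffer; advance and retry. */
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
    }
    return 0;
}
```

This only works when the destination is a pipe (for example `./prog | pv > /dev/null`); for a terminal or a regular file, `vmsplice` fails and a fallback to `write` would be needed.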