You could probably get very close to this solution with C or C++ plus AVX intrinsics. Some might consider that "cheating", since intrinsics occupy a grey area between a higher-level language and asm.
I suspect that by the time you’ve disabled all the compiler “optimisations” that would likely knock a few orders of magnitude of performance off the directly translated algorithm, you may as well have written the assembler to start with…
And you probably can’t fine-tune your C/C++ to get this performance without knowing exactly which processor instructions you are trying to trick the compiler into generating anyway.
I mean, did you see the very complicated, extremely optimized C and C++ code lower on the page? Despite that, they "only" got to 10% of the performance of the ASM code.
This program is fast partly because it's written in assembly, but mostly because it was written by someone who is quite clever and clearly put a large amount of time into this problem. None of the other solutions spend much time trying to fit their data into the CPU cache, nor do they have to drop to using slicing for zero copies, and not one is doing anything nearly as clever as this program is to generate its numbers. Most of this could be translated to C++ with AVX intrinsics, but the real accelerator here is not the choice of language; it's the person behind the code.
Now that I have seen the power of madvise + huge pages, everything looks like a nail. The author reckons about 30% comes from less page table juggling. There are techniques here that apply outside assembly.
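For anyone who wants to try the madvise + huge pages trick outside assembly, here is a minimal C sketch. It is not the author's code. It assumes Linux with transparent huge pages available and a 2 MiB huge page size (typical on x86-64); `HUGE_PAGE_SIZE` and `alloc_huge_hinted` are names made up for illustration. The hint is advisory, so the kernel may ignore it, and any speedup depends on the workload.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages, the usual size on x86-64 Linux. */
#define HUGE_PAGE_SIZE ((size_t)2 << 20)

/* Hypothetical helper: allocate a buffer aligned to a huge-page boundary
 * and hint that the kernel may back it with transparent huge pages.
 * Returns NULL on allocation failure; release the buffer with free(). */
static void *alloc_huge_hinted(size_t len, size_t *rounded_len)
{
    /* Round the request up to a whole number of huge pages. */
    size_t n = (len + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
    assert(n % HUGE_PAGE_SIZE == 0);

    void *p = NULL;
    if (posix_memalign(&p, HUGE_PAGE_SIZE, n) != 0)
        return NULL;

#ifdef MADV_HUGEPAGE
    /* Advisory only: a failure here is not fatal, the buffer still works
     * with ordinary 4 KiB pages. */
    if (madvise(p, n, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");
#endif

    if (rounded_len)
        *rounded_len = n;
    return p;
}
```

Call it as `size_t n; char *buf = alloc_huge_hinted(1 << 20, &n);` and use `buf` like any other heap buffer. Fewer, larger pages mean fewer TLB misses and page table walks when streaming through a big output buffer, which is presumably where the reported gain comes from.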