I agree about the scalar code. I know from experience that JIT compilers can be awesome; sometimes they do outperform AOT.

Automatic vectorization, on the other hand…

Modern SIMD arrived in 1999 with SSE1 in the Pentium III. In the 20+ years since, very smart compiler developers have tried to improve their automatic vectorizers, yet they have achieved only very limited success so far.

They do a good job when all of the following are true: (1) the operations are purely vertical, (2) there are no branches or conditions, and (3) the numbers being handled are either FP32 or FP64.
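To make those three conditions concrete, here is a rough sketch of two loops. The function names `scale_add` and `find_byte` are made up for illustration. The first loop meets all three conditions and is the kind of code I'd expect a vectorizer to handle well. The second breaks two of them (integer bytes, a data-dependent early exit) and is the kind of loop I would not count on being vectorized; what actually happens depends on the compiler, version and flags, so check the generated assembly.

```cpp
#include <cstddef>
#include <cstdint>

// Purely vertical: element i of the output depends only on element i of the
// inputs. No branches in the loop body, FP32 data.
void scale_add(float* dst, const float* a, const float* b, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = a[i] * k + b[i];
}

// Not FP32/FP64, and the loop exits early on a data-dependent condition.
// Returns the index of the first byte equal to v, or n if there is none.
std::size_t find_byte(const std::uint8_t* p, std::size_t n, std::uint8_t v) {
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] == v)
            return i;
    return n;
}
```

A hand-written SIMD version of the second loop would typically compare 16 or 32 bytes at once, extract a bitmask and count trailing zeros, which is a very different shape from the scalar source.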

I think building a sufficiently good automatic vectorizer is a borderline impossible task. Even when the runtime is very sophisticated, like modern Java with its several tiers of progressively better compiled code driven by real-time profiler data, the problem is still extremely hard to solve.

For instance, here’s a fast way to sort 4 floats with SSE: https://godbolt.org/z/c97Yf5js8

I don’t believe a compiler could have possibly figured out these shuffles and blends from any reasonable scalar implementation.
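For comparison, this is roughly what I'd call a reasonable scalar implementation: a 5-comparator sorting network. It is not the code behind the link, just my assumption of a typical scalar starting point, and the names `cswap` and `sort4` are made up. It ignores NaNs.

```cpp
#include <algorithm>

// Compare-exchange: after the call, lo <= hi.
inline void cswap(float& lo, float& hi) {
    const float mn = std::min(lo, hi);
    const float mx = std::max(lo, hi);
    lo = mn;
    hi = mx;
}

// Sorts 4 floats ascending, in place, with the comparator sequence
// (0,1) (2,3) (0,2) (1,3) (1,2).
void sort4(float v[4]) {
    cswap(v[0], v[1]);
    cswap(v[2], v[3]);
    cswap(v[0], v[2]);
    cswap(v[1], v[3]);
    cswap(v[1], v[2]);
}
```

Each `cswap` maps nicely to scalar min/max instructions, but nothing in this source hints at keeping all four values in one register and permuting lanes between the min/max steps.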