
Here’s an example of a simple C++ function, manually vectorized + unrolled for optimal performance on modern processors: https://stackoverflow.com/a/59495197/126995

Of all C++ compilers, only clang generates similar machine code from straightforward source, and only with some special compiler switches, most importantly -ffast-math. The dot product of 2 vectors is a trivially simple algorithm, sum( a[ i ] * b[ i ], i = 0..N-1 ), so I wouldn’t expect clang to auto-vectorize more complicated stuff. Finally, C++ compilers are designed to work offline so the optimizer can afford to spend substantial CPU time searching for a best optimization, a JIT compiler like Java runtime simply doesn’t have time for that.
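To make the -ffast-math point concrete, here is a minimal sketch in portable C++ (not the code from the linked answer, and the function names are made up for illustration). Floating-point addition is not associative, so splitting the sum across several accumulators changes the order of additions and can change the rounding of the result. A compiler generally needs permission, such as -ffast-math, to do that reordering on its own. Doing it by hand looks roughly like this:

```cpp
#include <cstddef>

// Straightforward version: one accumulator, additions happen strictly in order.
float dotScalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Hand-reordered version (hypothetical name): four independent accumulators.
// This is the reassociation a vectorizer would have to perform by itself.
float dotUnrolled4(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // Combine the partial sums, then handle the leftover elements.
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

For inputs where every intermediate value is exactly representable, both functions return the same number. For general data they can differ in the last bits, which is exactly why the compiler won't make this transformation without being told it's allowed.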



That's not quite what I'm saying though. There are of course going to be exceptions, especially in highly specialized cases.

> Finally, C++ compilers are designed to work offline so the optimizer can afford to spend substantial CPU time searching for a best optimization, a JIT compiler like Java runtime simply doesn’t have time for that.

Not necessarily. A JIT can also see exactly what the program is doing, see that 1000 out of a million lines of code are performance critical, and throw all the effort on optimizing that, while armed with stats about how the program works in practice and which branches are taken how often.

You also don't need to wait for it, since there's no reason why that work can't be done on a separate thread without blocking execution.

It can also generate native code for your specific CPU, so it may well do much better than GCC there.


I agree about the scalar code. I know from experience that JIT compilers can be awesome, and sometimes they do indeed outperform AOT.

Automatic vectorization, on the other hand…

Modern SIMD arrived in 1999, with SSE1 in the Pentium III. For the 20+ years which followed, very smart compiler developers have tried to improve their automatic vectorizers. Yet they have achieved only very limited success so far.

They do a good job when all of the following are true: (1) pure vertical operations (2) no branches or conditions (3) the numbers being handled are either FP32 or FP64.
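A rough illustration of those three conditions, as a sketch with made-up function names. The first loop meets all of them: each output element depends only on the same-index inputs, there is no branch, and the data is FP32. The second breaks two at once, with an early-exit branch and 8-bit integer data, which is the kind of loop that tends to be much harder for an auto-vectorizer. Whether a particular compiler handles either one depends on the compiler version and flags.

```cpp
#include <cstddef>
#include <cstdint>

// Vertical, branch-free, FP32: y[i] = k * x[i] + y[i].
// Each lane is independent of every other lane.
void scaleAdd(float* y, const float* x, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = k * x[i] + y[i];
}

// Data-dependent early exit over 8-bit integers.
// Returns the index of the first element greater than limit, or n if none.
std::size_t firstAbove(const std::uint8_t* v, std::uint8_t limit, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (v[i] > limit)
            return i;
    return n;
}
```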

I think building a sufficiently good automatic vectorizer is a borderline impossible task. Even when the runtime is very sophisticated like modern Java, which recompiles hot code into several progressively better versions based on real-time performance profiler data, the problem is still extremely hard to solve.

For instance, here’s a fast way to sort 4 floats with SSE: https://godbolt.org/z/c97Yf5js8 I don’t believe a compiler could have possibly figured out these shuffles and blends from any reasonable scalar implementation.
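For comparison, a "reasonable scalar implementation" might look something like the sketch below. This is not the code behind the godbolt link, just an assumed starting point with hypothetical names: a standard 5-comparator sorting network for 4 elements, written with std::min and std::max. Getting from this to a handful of shuffles and blends is the leap the vectorizer would have to make.

```cpp
#include <algorithm>
#include <array>

// Compare-exchange: afterwards lo holds the smaller value, hi the larger.
inline void compareSwap(float& lo, float& hi) {
    const float smaller = std::min(lo, hi);
    const float larger = std::max(lo, hi);
    lo = smaller;
    hi = larger;
}

// Sorts 4 floats in ascending order using a fixed 5-comparator network.
// Assumes no NaN values in the input.
std::array<float, 4> sort4(std::array<float, 4> v) {
    compareSwap(v[0], v[1]);
    compareSwap(v[2], v[3]);
    compareSwap(v[0], v[2]);  // smallest element is now in v[0]
    compareSwap(v[1], v[3]);  // largest element is now in v[3]
    compareSwap(v[1], v[2]);  // order the two middle elements
    return v;
}
```

The sequence of comparisons is fixed and branch-free, so it is about as SIMD-friendly as scalar sorting code gets, yet the data still has to move between lanes, which is where the hand-picked shuffles come in.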



