
Great article. Very comprehensive and well written, while not skipping the gotchas the author hit along the way.

I don't really know much about how the JVM works, but it's interesting that this is (if I understood correctly) in effect calling out to some native code. In the JVM world, when are things added to the bytecode as a new instruction, and when are they wrapped up in a native code call?

I'd expect a new vector instruction for the stack machine, but I'm guessing that's less flexible and less performant..?

"Practically speaking, you end up with code that looks like it creates many objects, but actually compiles to code that allocates no memory. It is extremely disconcerting."

It's interesting comparing people working in Java vs C++. Java makes it so impractical to really reason about the code directly that users end up relying a lot more on tooling for benchmarking and introspection (and hence the tools are top-class)

"Java’s ability to print its compiler output as assembly "

How do you do this? Is this integrated into the IDE?

While the tools are great and you can dissect all your performance problems, ultimately feedback from the compiler should be feeding back into the editor. I feel that some of the issues the author hit shouldn't happen. You should, for instance, be immediately able to see whether a function is being inlined while you're editing, without having to run extra tools



> In the JVM world, when are things added to the bytecode as a new instruction, and when are they wrapped up in a native code call?

These aren't native code calls but compiler intrinsics. The Java compiler (the JIT compiler, that is) translates certain Java method invocations into specific machine instruction patterns.
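
A classic example (mine, not the parent's): Integer.bitCount looks like an ordinary static call in the bytecode, but HotSpot treats it as an intrinsic and, on x86 CPUs with POPCNT, emits a single instruction instead of a call:

    // Integer.bitCount is a well-known HotSpot intrinsic: the JIT
    // replaces the call with a POPCNT instruction where the CPU has it.
    static int ones(int x) {
        return Integer.bitCount(x);
    }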

> Java makes it so impractical to really reason about the code directly that users ends up relying a lot more on tooling for benchmarking and introspection (and hence the tools are top-class)

As others have mentioned, you can dump the generated machine code, but you are correct that Java is, by design, higher level than C++, in the sense that low level details are less under the programmer's control and more under the compiler's. So when you see a `new Foo()` in Java, you cannot tell whether an object will actually be allocated on the heap, or perhaps "scalarized" and stored directly in registers. In fact, what happens is not even decided statically and could vary dynamically, depending on what other code does, as compilation is done JIT, and optimisations are decided based on the program's current execution profile. As a result, even though all instance methods in Java are semantically virtual, in practice they are very often inlined, and often don't even compile to a call instruction, let alone a vtable call.
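
To make that concrete (a sketch of my own, not from this thread): the method below looks like it allocates, but if escape analysis proves the object never leaves the method, scalar replacement can turn its fields into plain registers and no heap allocation happens:

    // Sketch: p never escapes distSq, so the JIT may scalar-replace it;
    // despite the visible `new`, the compiled code can be pure arithmetic.
    record Point(int x, int y) {}

    static int distSq(int x, int y) {
        Point p = new Point(x, y);
        return p.x() * p.x() + p.y() * p.y();
    }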

Unlike C++, Java's goal — generally speaking — is to get very good average-case performance for little effort rather than give the programmer a lot of knobs to control the worst case. Of course, sometimes there are knobs — such as the Vector API, which is useful in cases where the JIT cannot automatically vectorise loops — but the balance is different from C++.
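
For a flavour of what that knob looks like, a minimal sketch (assuming JDK 16+ with --add-modules jdk.incubator.vector):

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorSpecies;

    // Explicit SIMD via the incubating Vector API: one full hardware
    // vector per iteration, then a scalar tail for the leftovers.
    static void scale(float[] a, float factor) {
        VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;
        int i = 0;
        for (; i < S.loopBound(a.length); i += S.length()) {
            FloatVector.fromArray(S, a, i).mul(factor).intoArray(a, i);
        }
        for (; i < a.length; i++) {
            a[i] *= factor;  // scalar tail
        }
    }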


Thanks for the explanation. So I'm gathering that the distinction is that inlined native code will be explicit, while an intrinsic is up to the compiler. The intrinsic provides a fallback if you run on a dumb/old JVM. That does seem much cleaner than adding new bytecodes


Yes. Also, the compiler knows what the intrinsics do (or don't do), so they participate in various compiler analyses, unlike opaque native calls, which the compiler is not usually able to move around.


> Vector API, which is useful in cases the JIT cannot automatically vectorise loops

For vertical-only operations (vectorizable loops), people often use GPUs instead of SIMD on CPUs. Graphics cards have way more resources, in both GFlops and memory bandwidth.

I don't think sorting an array has anything to do with vectorizing a loop.


It would not be hard to write a quicksort that could be automatically vectorized. For example, this is a perfectly valid implementation of the partition operation around the pivot.

    int partition(int[] array, int from, int to, int pivot) {
        int[] buf1 = new int[to - from];
        int[] buf2 = new int[to - from]; // buffers could be preallocated, as in the blogpost

        int idx1 = 0;
        int idx2 = 0;

        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value < pivot) {
                buf1[idx1++] = value;
            }
        }

        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value >= pivot) {
                buf2[idx2++] = value;
            }
        }

        System.arraycopy(buf1, 0, array, from, idx1);
        System.arraycopy(buf2, 0, array, from + idx1, idx2); // the >= half goes after the < half
        return from + idx1; // index of the first element >= pivot
    }

It would be relatively straightforward for a compiler to automatically vectorize the two loops I've coded up. I don't know if there's one that would, but with the vpcompress instruction I think the absence of a compiler doing it automatically would simply be a function of it not being a common use case.


Seems that[0] clang doesn't succeed at vectorizing that, instead just doing a bunch of boring regular unrolling. I'd assume it wouldn't be too hard for clang to be able to understand that pattern, just no one has bothered to (and it won't pop up much in regular code, because usually it'd be done with vector pushes and not a preallocated buffer).

[0]: https://godbolt.org/z/hzsdhGorr


It looks like GCC and MSVC don't either, but ICC does.


Good point on ICC! I've compared against it a couple of times before, but never got any satisfying results, so I forgot about it.


I think clang 13 has the best automatic vectorizer so far, yet it failed to do the job: https://godbolt.org/z/Y9796r1T1

This is despite the fact that, unlike Java's runtime, C++ compilers work offline and don't care too much about compilation speed; a JIT compiler simply doesn't have time for expensive optimizations.


> a JIT compiler simply doesn’t have time for expensive optimizations

That is a common misconception, which might apply to some JIT compilers. For example, OpenJDK only needs to compile the hot methods, and even then it employs tiered compilation, where more powerful optimisations are applied over time, so the JIT can take as much time as it needs (it runs concurrently with the already-compiled program) and has more profiling information than is available to an AOT compiler. I.e. the same optimisation might be much quicker for a JIT than for an AOT compiler, because the JIT doesn't need to do as deep an analysis of the program.
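
(You can watch the tiering happen: HotSpot's standard -XX:+PrintCompilation flag logs each method as it is compiled, including the tier level, 1 through 4. The class name below is a placeholder.)

    # -XX:+PrintCompilation is a standard HotSpot flag; MyApp is a placeholder
    java -XX:+PrintCompilation MyApp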

The advantage AOT compilers have over JITs is that they don't require warmup and they don't consume resources at runtime, but JITs have a greater potential for optimisation, and AOT compilers generally require more low-level information from the programmer to produce good code. There are various projects for caching JIT output and getting something that's close to the best of both worlds.

Having said that, compilers aren't all developed on the same schedule, and different ones might prioritise different optimisations.


Also, another thing that critics forget is that all modern JVM implementations provide JIT caches with PGO feedback, so the quality of the code can, in theory, be improved to some local maximum after several runs.

.NET Framework 4.6 did provide some mechanism (MGO, Managed PGO), but it required calling into runtime APIs and was a bit clunky. Starting with .NET 6 they are using an approach more similar to that of the JVMs.

As far as I am aware, only the Java (including the Android stepchild) and .NET ecosystems have such JITs with PGO feedback between runs.


Also, the JVM can frequently avoid compiling large swaths of the code entirely (my methods ended up with about 10 uncommon traps each). https://shipilev.net/jvm/anatomy-quarks/29-uncommon-traps/
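
A sketch of my own of the kind of code that earns a trap: if profiling says a branch has never been taken, C2 can compile just the hot path and plant an uncommon trap that deoptimises back to the interpreter if the cold branch is ever hit.

    // Sketch: if `rare` is never true during warmup, the JIT may not
    // compile the slow branch at all, leaving an uncommon trap that
    // falls back to the interpreter should it ever be taken.
    static int f(int x, boolean rare) {
        if (rare) {
            return slowPath(x);  // cold: likely compiled as a trap
        }
        return x * 2;            // hot path
    }

    static int slowPath(int x) {
        return x - 1;            // placeholder body
    }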


I don’t see an easy way to vectorize those loops, given that the loop iterations have a dependency on idx1 and idx2, respectively.

Can you explain?

(Also, if you fill idx2 downward from `to`, I think you need only one buffer and one System.arraycopy call, at the cost of no longer having a stable sort.)
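
For concreteness, a sketch of that single-buffer variant (my code, not the parent's): elements below the pivot fill the buffer from the front, the rest from the back, so one copy suffices, but the upper half comes back reversed:

    // Single-buffer partition: < pivot fills from the front, >= pivot
    // from the back. One arraycopy, but the >= half ends up reversed,
    // so the overall sort is no longer stable.
    static int partition(int[] array, int from, int to, int pivot) {
        int[] buf = new int[to - from];
        int lo = 0, hi = buf.length - 1;
        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value < pivot) {
                buf[lo++] = value;
            } else {
                buf[hi--] = value;
            }
        }
        System.arraycopy(buf, 0, array, from, buf.length);
        return from + lo;  // index of the first element >= pivot
    }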


In AVX-512, there's an instruction vpcompressd, which literally does the equivalent of 16 iterations of that loop (the "store a subset of items to memory contiguously" part at least). The compiler would have to figure out how the conditional write & increment work together, but it's definitely possible as ICC manages to: https://godbolt.org/z/n68ca4aYe

For non-AVX-512 targets, the compression part gets more complicated, but you can still do it with a shuffle and a lookup table of indexes built from the mask.
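
On the Java side, the Vector API exposes this operation directly: Vector.compress(mask) (JDK 19+, still incubating) lowers to vpcompressd where AVX-512 is available. Here is a rough sketch of the "< pivot" half of the partition; note the destination needs a vector's worth of slack, since a full vector is stored each step:

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    // Sketch: pack all elements < pivot contiguously into dst using
    // compress(), which maps to vpcompressd on AVX-512 hardware.
    // dst needs at least one vector length of slack past the result.
    static int compressLess(int[] src, int[] dst, int pivot) {
        VectorSpecies<Integer> S = IntVector.SPECIES_PREFERRED;
        int out = 0;
        int i = 0;
        for (; i < S.loopBound(src.length); i += S.length()) {
            IntVector v = IntVector.fromArray(S, src, i);
            VectorMask<Integer> m = v.lt(pivot);
            v.compress(m).intoArray(dst, out);  // selected lanes land at the front
            out += m.trueCount();
        }
        for (; i < src.length; i++) {           // scalar tail
            if (src[i] < pivot) {
                dst[out++] = src[i];
            }
        }
        return out;  // count of elements < pivot
    }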


If you use Oracle's own IDE, it supports this out of the box, as it already did in Sun's days.

Then there are other ways depending on which JVM implementation is used.

In OpenJDK's case you can load a runtime plugin to do it:

https://github.com/AdoptOpenJDK/jitwatch
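
(The plugin route aside, plain HotSpot can also dump the JIT's output itself, provided the hsdis disassembler library is installed; the class name is a placeholder.)

    # requires the hsdis disassembler plugin; MyApp is a placeholder
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyApp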


Oh, very cool. I'm really glad these things exist. I skimmed the intro video and it does look like a separate tool, though. It would be extra slick if these things got highlighted as you coded. But I get that usually you only really bring out the big guns when you're optimizing a tight loop somewhere - and then you probably want the whole toolbox and not just some annotations.

Thanks for letting me know about this :)


The reasoning is that these are development tools and thus shouldn't be shipped by default.

I don't agree, but it is what it is; at least it does exist in some form.

Here .NET is better: you can just press F12 (Show Assembly) during a debug session, and it will show it just like for C and C++ sessions.

Compiler Explorer and SharpLab make use of it.

Node allows showing some assembly stuff, but if you want all the possible debugging options for assembly code, you will need a debug build of Node.js.


You can't really have bytecode instructions for the Vector API functions because they're generic. They're instead methods, which the JIT compiler can handle specially if it so chooses. Effectively, for pure computational operations (i.e. not ones doing complex stack operations or jumps), there's no real difference between a method call and a new bytecode (except that adding a new bytecode is a lot more work, because everything must be updated to support it, whereas a new method is just a new method).


> You should, for instance, be immediately able to see whether a function is being inlined while you're editing

But it’s a dynamic decision, except in the most trivial cases. How can you know it without running the program?



