
Great article. Very comprehensive and well written, while not skipping the gotchas the author hit along the way.

I don't really know much about how the JVM works, but it's interesting that this is (if I understood correctly) in effect calling out to some native code. In the JVM world, when are things added to the bytecode as a new instruction, and when are they wrapped up in a native code call?

I'd expect a new vector instruction for the stack machine, but I'm guessing that's less flexible and less performant..?

"Practically speaking, you end up with code that looks like it creates many objects, but actually compiles to code that allocates no memory. It is extremely disconcerting."

It's interesting comparing people working in Java vs C++. Java makes it so impractical to really reason about the code directly that users end up relying a lot more on tooling for benchmarking and introspection (and hence the tools are top-class)

"Java’s ability to print its compiler output as assembly "

How do you do this? Is this integrated into the IDE?

While the tools are great and you can dissect all your performance problems, ultimately feedback from the compiler should be feeding back into the editor. I feel that some of the issues the author hit shouldn't happen. You should, for instance, be immediately able to see whether a function is being inlined while you're editing, without having to run extra tools



> In the JVM world, when are things added to the bytecode as a new instruction, and when are they wrapped up in a native code call?

These aren't native code calls but compiler intrinsics. The Java compiler (the JIT compiler, that is) translates certain Java method invocations into specific machine instruction patterns.
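
A classic example (mine, not the parent's): Integer.bitCount looks like an ordinary static call in the bytecode, but HotSpot treats it as an intrinsic and, on x86 CPUs with POPCNT, emits a single instruction instead of a call:

    // Integer.bitCount is a well-known HotSpot intrinsic: the JIT
    // replaces the call with a POPCNT instruction where the CPU has it.
    static int ones(int x) {
        return Integer.bitCount(x);
    }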

> Java makes it so impractical to really reason about the code directly that users ends up relying a lot more on tooling for benchmarking and introspection (and hence the tools are top-class)

As others have mentioned, you can dump the generated machine code, but you are correct that Java is, by design, higher level than C++, in the sense that low level details are less under the programmer's control and more under the compiler's. So when you see a `new Foo()` in Java, you cannot tell whether an object will actually be allocated on the heap, or perhaps "scalarized" and stored directly in registers. In fact, what happens is not even decided statically and could vary dynamically, depending on what other code does, as compilation is done JIT, and optimisations are decided based on the program's current execution profile. As a result, even though all instance methods in Java are semantically virtual, in practice they are very often inlined, and often don't even compile to a call instruction, let alone a vtable call.
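
To make that concrete (a sketch of my own, not from this thread): the method below looks like it allocates, but if escape analysis proves the object never leaves the method, scalar replacement can turn its fields into plain registers and no heap allocation happens:

    // Sketch: p never escapes distSq, so the JIT may scalar-replace it;
    // despite the visible `new`, the compiled code can be pure arithmetic.
    record Point(int x, int y) {}

    static int distSq(int x, int y) {
        Point p = new Point(x, y);
        return p.x() * p.x() + p.y() * p.y();
    }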

Unlike C++, Java's goal — generally speaking — is to get very good average-case performance for little effort rather than give the programmer a lot of knobs to control the worst case. Of course, sometimes there are knobs — such as the Vector API, which is useful in cases where the JIT cannot automatically vectorise loops — but the balance is different from C++.
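
For a flavour of what that knob looks like, a minimal sketch (assuming JDK 16+ with --add-modules jdk.incubator.vector):

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorSpecies;

    // Explicit SIMD via the incubating Vector API: one full hardware
    // vector per iteration, then a scalar tail for the leftovers.
    static void scale(float[] a, float factor) {
        VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;
        int i = 0;
        for (; i < S.loopBound(a.length); i += S.length()) {
            FloatVector.fromArray(S, a, i).mul(factor).intoArray(a, i);
        }
        for (; i < a.length; i++) {
            a[i] *= factor;  // scalar tail
        }
    }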


Thanks for the explanation. So I'm gathering that the distinction is that inlined native code will be explicit, while an intrinsic is up to the compiler. The intrinsic provides a fallback if you run on a dumb/old JVM. That does seem much cleaner than adding new bytecodes


Yes. Also, the compiler knows what the intrinsics do (or don't do), so they participate in various compiler analyses, unlike opaque native calls, which the compiler is not usually able to move around.


> Vector API, which is useful in cases the JIT cannot automatically vectorise loops

For vertical-only operations (vectorizable loops), people often use GPUs instead of SIMD on CPUs. Graphics cards have way more resources, in both GFlops and memory bandwidth.

I don't think sorting an array has anything to do with vectorizing a loop.


It would not be hard to write a quicksort that could be automatically vectorized. For example, this is a perfectly valid implementation of the partition operation around the pivot.

    int partition(int[] array, int from, int to, int pivot) {
        int[] buf1 = new int[to - from];
        int[] buf2 = new int[to - from]; // buffers could be preallocated, as in the blogpost

        int idx1 = 0;
        int idx2 = 0;

        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value < pivot) {
                buf1[idx1++] = value;
            }
        }

        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value >= pivot) {
                buf2[idx2++] = value;
            }
        }

        System.arraycopy(buf1, 0, array, from, idx1);
        System.arraycopy(buf2, 0, array, from + idx1, idx2); // the >= half goes after the < half
        return from + idx1; // index of the first element >= pivot
    }

It would be relatively straightforward for a compiler to automatically vectorize the two loops I've coded up. I don't know if there's one that would, but with the vpcompress instruction I think the absence of a compiler doing it automatically would simply be a function of it not being a common use case.


Seems that[0] clang doesn't succeed at vectorizing that, instead just doing a bunch of boring regular unrolling. I'd assume it wouldn't be too hard for clang to be able to understand that pattern, just no one has bothered to (and it won't pop up much in regular code, because usually it'd be done with vector pushes and not a preallocated buffer).

[0]: https://godbolt.org/z/hzsdhGorr


It looks like GCC and MSVC don't either, but ICC does.


Good point on ICC! I've compared against it a couple of times before, but never got any satisfying results, so I forgot about it.


I think clang 13 has the best automatic vectorizer so far, yet it failed to do the job: https://godbolt.org/z/Y9796r1T1

This is despite the fact that, unlike Java's runtime, C++ compilers work offline and don't care too much about compilation speed; a JIT compiler simply doesn't have time for expensive optimizations.


> a JIT compiler simply doesn’t have time for expensive optimizations

That is a common misconception, which might apply to some JIT compilers. For example, OpenJDK only needs to compile the hot methods, and even then it employs tiered compilation, where more powerful optimisations are applied over time, so the JIT can take as much time as it needs (it runs concurrently with the already-compiled program) and has more profiling information than is available to an AOT compiler. I.e. the same optimisation might be much quicker for a JIT than for an AOT compiler, because the JIT doesn't need to do as deep an analysis of the program.
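
(You can watch the tiering happen: HotSpot's standard -XX:+PrintCompilation flag logs each method as it is compiled, including the tier level, 1 through 4. The class name below is a placeholder.)

    # -XX:+PrintCompilation is a standard HotSpot flag; MyApp is a placeholder
    java -XX:+PrintCompilation MyApp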

The advantage AOT compilers have over JITs is that they don't require warmup and they don't consume resources at runtime, but JITs have a greater potential for optimisation, and AOT compilers generally require more low-level information from the programmer to produce good code. There are various projects for caching JIT output and getting something that's close to the best of both worlds.

Having said that, compilers aren't all developed on the same schedule, and different ones might prioritise different optimisations.


Also, another thing that critics forget is that all modern JVM implementations provide JIT caches with PGO feedback, so the quality of the code can, in theory, be improved to some local maximum after several runs.

.NET Framework 4.6 did provide some mechanism (MGO, Managed PGO), but it required calling into runtime APIs and was a bit clunky. Starting with .NET 6 they are using an approach more similar to that of the JVMs.

As far as I am aware, only the Java (including the Android stepchild) and .NET ecosystems have such JITs with PGO feedback between runs.


Also, the JVM can frequently avoid compiling large swaths of the code entirely (my methods ended up with about 10 uncommon traps each). https://shipilev.net/jvm/anatomy-quarks/29-uncommon-traps/
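
A sketch of my own of the kind of code that earns a trap: if profiling says a branch has never been taken, C2 can compile just the hot path and plant an uncommon trap that deoptimises back to the interpreter if the cold branch is ever hit.

    // Sketch: if `rare` is never true during warmup, the JIT may not
    // compile the slow branch at all, leaving an uncommon trap that
    // falls back to the interpreter should it ever be taken.
    static int f(int x, boolean rare) {
        if (rare) {
            return slowPath(x);  // cold: likely compiled as a trap
        }
        return x * 2;            // hot path
    }

    static int slowPath(int x) {
        return x - 1;            // placeholder body
    }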


I don’t see an easy way to vectorize those loops, given that the loop iterations have a dependency on idx1 and idx2, respectively.

Can you explain?

(Also, if you fill idx2 downward from `to`, I think you need only one buffer and one System.arraycopy call, at the cost of no longer having a stable sort.)
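
For concreteness, a sketch of that single-buffer variant (my code, not the parent's): elements below the pivot fill the buffer from the front, the rest from the back, so one copy suffices, but the upper half comes back reversed:

    // Single-buffer partition: < pivot fills from the front, >= pivot
    // from the back. One arraycopy, but the >= half ends up reversed,
    // so the overall sort is no longer stable.
    static int partition(int[] array, int from, int to, int pivot) {
        int[] buf = new int[to - from];
        int lo = 0, hi = buf.length - 1;
        for (int i = from; i < to; i++) {
            int value = array[i];
            if (value < pivot) {
                buf[lo++] = value;
            } else {
                buf[hi--] = value;
            }
        }
        System.arraycopy(buf, 0, array, from, buf.length);
        return from + lo;  // index of the first element >= pivot
    }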


In AVX-512, there's an instruction vpcompressd, which literally does the equivalent of 16 iterations of that loop (the "store a subset of items to memory contiguously" part at least). The compiler would have to figure out how the conditional write & increment work together, but it's definitely possible as ICC manages to: https://godbolt.org/z/n68ca4aYe

For non-AVX-512 targets, the compression part gets more complicated, but you can still do it with a shuffle and a lookup table of indexes built from the mask.
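
On the Java side, the Vector API exposes this operation directly: Vector.compress(mask) (JDK 19+, still incubating) lowers to vpcompressd where AVX-512 is available. Here is a rough sketch of the "< pivot" half of the partition; note the destination needs a vector's worth of slack, since a full vector is stored each step:

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    // Sketch: pack all elements < pivot contiguously into dst using
    // compress(), which maps to vpcompressd on AVX-512 hardware.
    // dst needs at least one vector length of slack past the result.
    static int compressLess(int[] src, int[] dst, int pivot) {
        VectorSpecies<Integer> S = IntVector.SPECIES_PREFERRED;
        int out = 0;
        int i = 0;
        for (; i < S.loopBound(src.length); i += S.length()) {
            IntVector v = IntVector.fromArray(S, src, i);
            VectorMask<Integer> m = v.lt(pivot);
            v.compress(m).intoArray(dst, out);  // selected lanes land at the front
            out += m.trueCount();
        }
        for (; i < src.length; i++) {           // scalar tail
            if (src[i] < pivot) {
                dst[out++] = src[i];
            }
        }
        return out;  // count of elements < pivot
    }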


If you use Oracle's own IDE, it supports this out of the box, as it already did in Sun's days.

Then there are other ways depending on which JVM implementation is used.

In OpenJDK's case you can load a runtime plugin to do it:

https://github.com/AdoptOpenJDK/jitwatch
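
(The plugin route aside, plain HotSpot can also dump the JIT's output itself, provided the hsdis disassembler library is installed; the class name is a placeholder.)

    # requires the hsdis disassembler plugin; MyApp is a placeholder
    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyApp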


Oh, very cool. I'm really glad these things exist. I skimmed the intro video and it does look like a separate tool, though. It would be extra slick if these things got highlighted as you coded. But I get that usually you only really bring out the big guns when you're optimizing a tight loop somewhere - and then you probably want the whole toolbox and not just some annotations.

Thanks for letting me know about this :)


The reasoning is that these are development tools and thus shouldn't be shipped by default.

I don't agree, but it is what it is; at least it does exist in some form.

Here .NET is better: you can just press F12 (Show Assembly) during a debug session, and it will show it just like for C and C++ sessions.

Compiler Explorer and SharpLab make use of it.

Node allows showing some assembly stuff, but if you want all the possible debugging options for assembly code, you will need a debug build of Node.js.


You can't really have bytecode instructions for the Vector API functions because they're generic. They're instead methods, which the JIT compiler can handle specially if it so chooses. Effectively, for pure computational operations (i.e. not ones doing complex stack operations or jumps), there's no real difference between a method call and a new bytecode (except that adding a new bytecode is a lot more work, because everything must be updated to support it, whereas a new method is just a new method).


> You should, for instance, be immediately able to see whether a function is being inlined while you're editing

But it’s a dynamic decision, except in the most trivial cases. How can you know it without running the program?



