I thought Java had switched to using the slightly cheaper "lock xadd" for AtomicInteger/etc. updates? (see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7023898) It doesn't address the main issue of memory synchronization between cores, but it should make things a little better?
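For reference, a hedged C sketch (function names are mine) contrasting the two idioms that bug report is about. On x86, GCC typically lowers the retry loop to "lock cmpxchg" inside a branch and the builtin fetch-and-add to a single "lock xadd":

    #include <stdint.h>

    /* retry loop: fetch-and-add built from compare-and-swap */
    int32_t fetch_add_cas(volatile int32_t *p, int32_t v)
    {
        int32_t old;
        do {
            old = *p;
        } while (!__sync_bool_compare_and_swap(p, old, old + v));
        return old;
    }

    /* single instruction: GCC emits "lock xadd" for this on x86 */
    int32_t fetch_add_xadd(volatile int32_t *p, int32_t v)
    {
        return __sync_fetch_and_add(p, v);
    }

Both are full barriers on x86; the xadd version just can't fail and retry, so it should behave better under contention.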
Maybe I'm missing something, but I'm very surprised that the JVM would be implementing atomic increment with a loop that does "lock cmpxchg", retrying if it fails. The same can be accomplished much more easily (and safely, and probably with better performance) with "lock add".
For example, take this C program which uses the GCC atomic builtin __sync_add_and_fetch():
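(The exact listing isn't reproduced here; a minimal stand-in for the kind of program described:)

    /* counter.c -- build with: gcc -O2 -S counter.c */
    long counter;

    void bump(void)
    {
        __sync_add_and_fetch(&counter, 1);
    }

When the return value is unused, as above, GCC typically compiles the builtin down to a single "lock addq $1, counter(%rip)" -- no compare-and-swap loop anywhere.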
The partial flags update stall to which INC is vulnerable is tiny -- on the order of 10 cycles -- and generally not a significant factor except in tiny loops with carried flag dependencies that are executed millions of times. Also, the stall has been largely eliminated on recent Intel µarchs (Sandy Bridge and later). That said, ADD is no worse and sometimes better, so there's really no good reason to use INC.
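To make the hazard concrete, here's an illustrative sketch (mine, not from the Intel manual) of a carried flag dependency. INC writes OF/SF/ZF/AF/PF but leaves CF untouched, so the ADC below needs the CF produced by the earlier ADD merged with the flags produced by INC:

    #include <stdint.h>

    static uint64_t carry_chain(uint64_t a, uint64_t b, uint64_t n)
    {
        uint64_t carry_out = 0;
        __asm__ ("addq %[b], %[a]\n\t"  /* sets CF from the add             */
                 "incq %[n]\n\t"        /* partial flag write: CF preserved */
                 "adcq $0, %[c]"        /* reads CF -> partial-flags merge  */
                 : [a] "+r" (a), [n] "+r" (n), [c] "+r" (carry_out)
                 : [b] "r" (b)
                 : "cc");
        return a + n + carry_out;
    }

On pre-Sandy Bridge cores that merge is where the stall comes from; writing "addq $1, %[n]" instead updates all the flags and avoids it.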
Intel hasn't published much documentation on the precise architectural techniques used; examples 3-25 and 3-26 and the surrounding text in their optimization manual give a vague description ("In Intel microarchitecture code name Sandy Bridge, the cost of partial flag access is replaced by the insertion of a micro-op instead of a stall.") and include an example of a code sequence that would incur such a stall on earlier µarchs but, on Sandy Bridge, is faster than a sequence written to avoid the stall.
One simple and common optimization is to perform register renaming on the flags register. More often than not, the flags updated by INC or DEC are simply overwritten by a later arithmetic instruction without ever being read; renaming alone can eliminate the stall in those cases.
Another simple optimization would be for the front end to perform macro-op fusion of an INC or DEC with a following branch that is known to use only the flag bits written by the INC; the fused macro-op can then issue without waiting on the other flag bits, avoiding a control-flow stall.
However, I have no inside knowledge about what particular changes Intel made or didn't make to remove this particular hazard; I only know that it seems to have been almost entirely eliminated.
Just a random fact for those who use Snort[1]: this article brings up the same reason why stream processing in Snort is single-threaded. See their attempt to make it multi-threaded:
Edit: to clarify, Snort can run across multiple threads, but a single stream is handled by a single thread. When they tried to process the same data in multiple threads at once, cache synchronization killed performance.
Poor cache utilization on a traditional shared-memory architecture when passing data (packets) from core to core. I wonder whether, while going down this road, they made sure key structures were cache-line aligned.
Correct: the %r8-%r15 registers were added by the x86_64 architecture and are 64 bits wide, just as the others now are. They are commonly used for passing arguments to functions.
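A quick sketch of what "passing arguments" looks like in practice; this assumes the System V AMD64 calling convention used on Linux and similar systems (the Windows x64 convention differs):

    /* Under the SysV AMD64 ABI, the first six integer arguments arrive
       in %rdi, %rsi, %rdx, %rcx, %r8, and %r9, in that order. */
    long f(long a,   /* %rdi */
           long b,   /* %rsi */
           long c,   /* %rdx */
           long d,   /* %rcx */
           long e,   /* %r8  */
           long g)   /* %r9  */
    {
        return a + b + c + d + e + g;
    }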
The memory semantics are standard x86, but you are right that the registers are from x86_64 (and 64-bit mode at that). I'll update the article to clarify that point.
At what point do we just call x86_64 "x86"? Do you, for example, have any x86 hardware that doesn't support x86_64? If so, what is it and what is it used for?
>> The first AMD64-based processor, the Opteron, was released in April 2003.
>> The first processor to implement Intel 64 was the multi-socket processor Xeon code-named Nocona in June 2004.
>> Intel's official launch of Intel 64 (under the name EM64T at that time) in mainstream desktop processors was the N0 Stepping Prescott-2M. All 9xx, 8xx, 6xx, 5x9, 5x6, 5x1, 3x6, and 3x1 series CPUs have Intel 64 enabled, as do the Core 2 CPUs, as will future Intel CPUs for workstations or servers. Intel 64 is also present in the last members of the Celeron D line.
>> The first Intel mobile processor implementing Intel 64 is the Merom version of the Core 2 processor, which was released on 27 July 2006.
"For it to end up with the right value at the end (M x N), two things need to be true." <-- the two things (immediate visibility and atomicity) are not strictly required. this is only required if all intermediate values are to be observed by the running threads. otherwise you can end up with the correct answer (M x N) without requiring threads to coordinate each write.
What I was trying to get across is that, in the trivial implementation, visibility and atomicity are required. There are obviously better ways for threads to count correctly in parallel with much better performance -- but not ones that Java will automatically recognize from the obvious implementation of the code.
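For concreteness, a minimal pthreads sketch of that trivial implementation (names and constants are mine): N threads each bump a shared counter M times. With a plain "counter++" the updates race and the final value comes up short; making each increment atomic restores M x N:

    #include <pthread.h>
    #include <stdio.h>

    #define N 4          /* threads */
    #define M 1000000    /* increments per thread */

    static long counter; /* shared */

    static void *worker(void *unused)
    {
        (void)unused;
        for (long i = 0; i < M; i++)
            __sync_add_and_fetch(&counter, 1); /* plain counter++ would race */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[N];
        for (int i = 0; i < N; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld (expected %ld)\n", counter, (long)M * N);
        return 0;
    }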
They're not saying "Java is atomic and volatile"; they're saying how atomic and volatile are implemented in Java on x86 -- that is, Java's atomic and volatile implementations (possessive case). In the context of the article, atomic and volatile are things that languages do, have, or implement, not properties of the language.