I thought Java had switched to using the slightly cheaper "lock xadd" for AtomicInteger/etc. updates? (see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7023898) It doesn't address the main issue of memory synchronization between cores, but it should make things a little better?
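For reference, a hedged C sketch (function names are mine) contrasting the two idioms that bug report is about. On x86, GCC typically lowers the retry loop to "lock cmpxchg" inside a branch and the builtin fetch-and-add to a single "lock xadd":

    #include <stdint.h>

    /* retry loop: fetch-and-add built from compare-and-swap */
    int32_t fetch_add_cas(volatile int32_t *p, int32_t v)
    {
        int32_t old;
        do {
            old = *p;
        } while (!__sync_bool_compare_and_swap(p, old, old + v));
        return old;
    }

    /* single instruction: GCC emits "lock xadd" for this on x86 */
    int32_t fetch_add_xadd(volatile int32_t *p, int32_t v)
    {
        return __sync_fetch_and_add(p, v);
    }

Both are full barriers on x86; the xadd version just can't fail and retry, so it should behave better under contention.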
Maybe I'm missing something, but I'm very surprised that the JVM would be implementing atomic increment with a loop that does "lock cmpxchg", retrying if it fails. The same can be accomplished much more easily (and safely, and probably with better performance) with "lock add".
For example, take this C program which uses the GCC atomic builtin __sync_add_and_fetch():
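(The exact listing isn't reproduced here; a minimal stand-in for the kind of program described:)

    /* counter.c -- build with: gcc -O2 -S counter.c */
    long counter;

    void bump(void)
    {
        __sync_add_and_fetch(&counter, 1);
    }

When the return value is unused, as above, GCC typically compiles the builtin down to a single "lock addq $1, counter(%rip)" -- no compare-and-swap loop anywhere.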
The partial flags update stall to which INC is vulnerable is tiny -- on the order of 10 cycles -- and generally not a significant factor except in tiny loops with carried flag dependencies that are executed millions of times. Also, the stall has been largely eliminated on recent Intel µarchs (Sandy Bridge and later). That said, ADD is no worse and sometimes better, so there's really no good reason to use INC.
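To make the hazard concrete, here's an illustrative sketch (mine, not from the Intel manual) of a carried flag dependency. INC writes OF/SF/ZF/AF/PF but leaves CF untouched, so the ADC below needs the CF produced by the earlier ADD merged with the flags produced by INC:

    #include <stdint.h>

    static uint64_t carry_chain(uint64_t a, uint64_t b, uint64_t n)
    {
        uint64_t carry_out = 0;
        __asm__ ("addq %[b], %[a]\n\t"  /* sets CF from the add             */
                 "incq %[n]\n\t"        /* partial flag write: CF preserved */
                 "adcq $0, %[c]"        /* reads CF -> partial-flags merge  */
                 : [a] "+r" (a), [n] "+r" (n), [c] "+r" (carry_out)
                 : [b] "r" (b)
                 : "cc");
        return a + n + carry_out;
    }

On pre-Sandy Bridge cores that merge is where the stall comes from; writing "addq $1, %[n]" instead updates all the flags and avoids it.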
Intel hasn't published much documentation on the precise architectural techniques used; examples 3-25 and 3-26 and the surrounding text in their optimization manual give a vague description ("In Intel microarchitecture code name Sandy Bridge, the cost of partial flag access is replaced by the insertion of a micro-op instead of a stall.") and include an example of a code sequence that would incur such a stall on earlier µarchs but, on Sandy Bridge, is faster than a sequence written to avoid the stall.
One simple and common optimization is to perform register renaming on the flags register. More often than not, the flags updated by INC or DEC are simply overwritten by a later arithmetic instruction without ever being read; renaming alone can eliminate the stall in those cases.
Another simple optimization would be for the front end to perform macro-op fusion of an INC or DEC with a following branch that is known to use only the flag bits written by the INC; the fused macro-op can then issue without waiting on the other flag bits, avoiding a control-flow stall.
However, I have no inside knowledge about what particular changes Intel made or didn't make to remove this particular hazard; I only know that it seems to have been almost entirely eliminated.
Just a random fact for those who use Snort[1]: this article brings up the same reason why stream processing in Snort is single-threaded. See their attempt to make it multi-threaded:
Edit: to clarify, Snort can run across multiple threads, but a single stream is handled by a single thread. When they tried to process the same data in multiple threads at once, cache synchronization killed performance.
Poor cache utilization on a traditional shared-memory architecture when passing data (packets) from core to core. I wonder whether, while going down this road, they made sure key structures were cache-line aligned.
Correct: the %r8-%r15 registers were added by the x86_64 architecture and are 64 bits wide, just as the others now are. They are commonly used for passing arguments to functions.
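A quick sketch of what "passing arguments" looks like in practice; this assumes the System V AMD64 calling convention used on Linux and similar systems (the Windows x64 convention differs):

    /* Under the SysV AMD64 ABI, the first six integer arguments arrive
       in %rdi, %rsi, %rdx, %rcx, %r8, and %r9, in that order. */
    long f(long a,   /* %rdi */
           long b,   /* %rsi */
           long c,   /* %rdx */
           long d,   /* %rcx */
           long e,   /* %r8  */
           long g)   /* %r9  */
    {
        return a + b + c + d + e + g;
    }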
The memory semantics are standard x86, but you are right that the registers are from x86_64 (and 64-bit mode at that). I'll update the article to clarify that point.
At what point do we just call x86_64 "x86"? Do you, for example, have any x86 hardware that doesn't support x86_64? If so, what is it and what is it used for?
>> The first AMD64-based processor, the Opteron, was released in April 2003.
>> The first processor to implement Intel 64 was the multi-socket processor Xeon code-named Nocona in June 2004.
>> Intel's official launch of Intel 64 (under the name EM64T at that time) in mainstream desktop processors was the N0 Stepping Prescott-2M. All 9xx, 8xx, 6xx, 5x9, 5x6, 5x1, 3x6, and 3x1 series CPUs have Intel 64 enabled, as do the Core 2 CPUs, as will future Intel CPUs for workstations or servers. Intel 64 is also present in the last members of the Celeron D line.
>> The first Intel mobile processor implementing Intel 64 is the Merom version of the Core 2 processor, which was released on 27 July 2006.
"For it to end up with the right value at the end (M x N), two things need to be true." <-- the two things (immediate visibility and atomicity) are not strictly required. this is only required if all intermediate values are to be observed by the running threads. otherwise you can end up with the correct answer (M x N) without requiring threads to coordinate each write.
What I was trying to get across is that, in the trivial implementation, visibility and atomicity are required. There are obviously better ways for threads to count correctly in parallel with much better performance -- but not ones that Java will automatically recognize from the obvious implementation of the code.
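For concreteness, a minimal pthreads sketch of that trivial implementation (names and constants are mine): N threads each bump a shared counter M times. With a plain "counter++" the updates race and the final value comes up short; making each increment atomic restores M x N:

    #include <pthread.h>
    #include <stdio.h>

    #define N 4          /* threads */
    #define M 1000000    /* increments per thread */

    static long counter; /* shared */

    static void *worker(void *unused)
    {
        (void)unused;
        for (long i = 0; i < M; i++)
            __sync_add_and_fetch(&counter, 1); /* plain counter++ would race */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[N];
        for (int i = 0; i < N; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld (expected %ld)\n", counter, (long)M * N);
        return 0;
    }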
They're not saying "Java is atomic and volatile"; they're saying how atomic and volatile are implemented in Java on x86 -- that is, Java's atomic and volatile implementations (possessive case). In the context of the article, atomic and volatile are things that languages do, have, or implement, not properties of the language.