
Though it's only 10 pages and short for a paper, I'll take a shot at a simpler explanation.

Compilers are expected to optimize your code, and the primary way to optimize code is through rearranging your statements.

Consider the following:

  int i=0;
  i++;
  sleep(1); // Yeah, sleep isn't a proper memory barrier.
  i++;
  sleep(1); //  But in my experience, beginners understand sleep. So shoot me.
  i++;
  sleep(1);
  i++;
  sleep(1);
  i++;
  sleep(1);
The compiler will often "rearrange" these ++ statements so that they all happen back-to-back, ultimately as follows:

  int i=0;  i++;  i++;  i++;  i++;  i++;
  sleep(1);
  sleep(1);
  sleep(1);
  sleep(1);
  sleep(1);
Then, folding the constants:

  int i=5;
  sleep(5);
Simple enough. Except... what if Thread#2 had the following:

  while(i<2); // Infinite loop waiting for i to equal 2
  foo();
Then in Thread#3:

  while(i<3); // Infinite loop waiting for i to become 3
  bar();
Then in Thread#4:

  while(i<4); // Infinite loop waiting for i to become 4
  baz();
Then in Thread#5:

  while(i<5); // Infinite loop waiting for i to become 5
  foobar();
As we can see here, "i" is a synchronization variable — but the compiler can only know that if it knows what the other threads do. Once i no longer steps from 1 to 2 to 3 to 4 to 5, the threads no longer synchronize and the code gains a race condition: because i jumps straight to 5, all four spin loops exit immediately, and foo(), bar(), baz(), and foobar() may all run at once.

-----------

For better or worse, modern programmers must think about the messages passed between threads. After all, semaphores are often i++ and i-- statements at the lowest level (perhaps with a touch of atomic_swap, or a lock prefix, depending on your architecture).

Modern code must mark the variables that matter for inter-thread synchronization, to selectively disable the compiler's optimizer (funnily enough, the same marking is also needed to strongly order the L1 cache and the out-of-order core of modern processors).

As such, proper threading requires a top-to-bottom, language-level memory model: the "knowledge" that the i++ statements cannot be optimized or combined across the sleep statements.

---------

This is no longer an issue on modern platforms. Today, we have C++11's memory model, which strongly defines where and when optimizations can occur, with "seq_cst" (sequentially consistent) memory ordering as the default.

There is also a faster, but slightly harder to understand, memory model of acquire and release. This acquire/release ordering is especially useful on more weakly ordered systems like ARM and POWER9.

Your mutex_lock() and mutex_unlock() statements will have these memory-barriers which tell the compiler, CPU, and L1 cache to order the code in ways the programmer expects. No optimizations are allowed "over" the mutex_lock() or mutex_unlock() statements, thanks to the memory model.

But back in 2004, before the memory model was formalized, it was impossible to write a truly portable POSIX-threads implementation. (Fortunately, compilers at the time recognized the issue and solved it in their own ways: Windows had the InterlockedExchange family of calls, and GCC had its own memory model. But the details were non-standard and non-portable.)



