This article gives the impression that everything is the compiler's fault when you end up with conflicting reads on different cores, but that's not right.
From the point of view of someone outside the CPU, yes, you can say that simultaneous reads will never give different answers. But everything happening at that level barely resembles the original software. Dozens of instructions are in flight at any moment, overlapping each other, starting and finishing in very different orders from how they're stored in memory. This happens even if you write the machine code directly, byte by byte.
From the point of view of the software, running a bunch of instructions in sequence, you do get stale values. You can have two threads wait for a signal, then both read a value, and both get different results. You can have a thread set a flag after it's done editing some values, have another thread wait for the flag, and then after waiting it sees the edits as still incomplete.
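To make that second scenario concrete, here's roughly what it looks like in C++ with plain (non-atomic) variables; the names are invented, and on a weakly ordered machine the assert really can fire:

    #include <cassert>
    #include <thread>

    // Broken on purpose: plain globals, no atomics, no fences.
    int  payload = 0;
    bool ready   = false;

    void writer() {
        payload = 42;    // (1) finish editing the values
        ready   = true;  // (2) set the flag -- (2) may become visible before (1)
    }

    void reader() {
        // This is also a data race (UB in C++), which is why the compiler is
        // even allowed to hoist the `ready` read out of the loop entirely.
        while (!ready) { }       // wait for the flag
        assert(payload == 42);   // can fail: the edits may still look incomplete
    }

    int main() {
        std::thread a(writer), b(reader);
        a.join(); b.join();
    }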
The exact types of nonsense depend on the memory model, but things as simple as two MOV instructions in a row can violate strict ordering. It's not just that the compiler might stash something in a register, but that the CPU itself will make swaths of very observable changes within the rules set by the memory model.
You can't trust the hardware coherency protocol to do much for you until you follow the platform-specific rules to tell the CPU to make something act coherent.
Strictly speaking, the Java Memory Model derives from the data-race-free model of the early '90s. Java was the first programming language to explicitly incorporate it as part of the specification, but the main derivation actually comes from the C++ memory model, which built into it the basic atomic memory model that most derivatives rely on--Java doesn't have the weaker atomics support that C++ added, just sequentially-consistent atomics. Java also introduced some frankly confusing (and incorrect, per my understanding) semantics for what happens in the face of data races, where everybody else just shrugged and said "let's make it UB and call it a day."
Okay, it looks like those were added in Java 9, which was after I stopped following Java (and well after the C++ memory model was largely settled, in 2007).
Not with the Unsafe class. It was never intended to be used by anything except the built-in concurrency support classes, and it was just barely good enough. External libraries which made use of the Unsafe class did so by studying how it was used, and then inferring how it worked. Java 9 formalized everything with the VarHandle class, and it offers many more features than the Unsafe class.
Thanks for the clarification. Java was obviously a trailblazer on formally introducing advanced memory models on practical languages, but it did take a few tries to get it correct.
Why is Hans Boehm referring to memory model fixes and the JMM in the C++11 spec (Java started tackling this problem in 2004, see Goetz) if Java copied C++?
> You can't trust the hardware coherency protocol to do much for you until you follow the platform-specific rules to tell the CPU to make something act coherent.
If software has to perform some operation to maintain coherency then your hardware is not cache coherent. Which some aren't, and especially some sub-sets of operations aren't (instruction cache vs. data operations, or CPU vs. DMA). But for load/store operations by CPUs, they're all coherent.
Barrier/fence instructions are about enforcing ordering on when memory operations to more than one location can be performed or become visible, with respect to one another.
It's coherent behind the scenes but it often presents an incoherent view to the software. It's not acting coherently when the rearrangement of memory operations makes you see different orderings from different cores.
When a CPU has a loose memory model and is aggressively making use of the reordering capabilities, the cache being coherent internally is basically just an implementation detail. It's not part of the visible ABI.
I'm using coherency as in the term of art, not a colloquial meaning. Every agent observes stores to a location in the same order[*]. Cache coherency says nothing about observed ordering of stores to different locations.
[*] Although store forwarding throws a bit of a spanner in that definition, there can still be reordering occurring absent that local reordering.
Hmm, I don't see how store forwarding breaks the illusion of a total order of stores to a single memory location; at least in 5 minutes of thinking I can't come up with a litmus test that would demonstrate it. In fact even c++ relaxed stores and loads preserve this ordering.
I think your definition is correct without the asterisk.
Fine, with that specific term of art meaning then ignore my second post. I stand by my original statement that you can't trust it to "do much for you". Just replace the last word with "act consistent" or "act ordered". Per-address ordering is nearly useless by itself. And if you had a CPU that didn't guarantee that, you'd observe almost no difference.
Well no, if you don't have a coherent system then your memory operations aren't reliable. You can lose updates or read stale data. Look at what software has to do in incoherent systems: specific flush and invalidate points, which are not the same as ordering barriers.
Your CPU guarantees a lot, cache coherency to start with. But also a very well defined ordering model and ordering instructions. It's not necessarily trivial to program for, but that doesn't mean you can't trust it.
A system without cache coherency can still promise that updates won't be lost. There are lots of ways to write rules around update propagation, and cache coherency is just one of them.
> or read stale data
Cache coherency doesn't protect you from stale data unless you only read one memory address ever.
> Look at what software has to do in incoherent systems: specific flush and invalidate points, which are not the same as ordering barriers.
That depends on the memory model. You could have a system that doesn't guarantee cache coherency in general but works fine if you put in ordinary memory barriers.
> A system without cache coherency can still promise that updates won't be lost. There are lots of ways to write rules around update propagation, and cache coherency is just one of them.
Not without coherency performed in software though, which is the point.
> Cache coherency doesn't protect you from stale data unless you only read one memory address ever.
It does. Observing updates to different locations in other than sequential order does not mean the data is stale. I guess that's also a colloquial language issue. The data is up to date according to the constraints of the memory model.
> That depends on the memory model. You could have a system that doesn't guarantee cache coherency in general but works fine if you put in ordinary memory barriers.
What property of cache coherency could you lose and still have it working?
> Not without coherency performed in software though, which is the point.
How would you do cache coherency in software? I don't follow.
> It does. Observing updates to different locations in other than sequential order does not mean the data is stale. I guess that's also a colloquial language issue. The data is up to date according to the constraints of the memory model.
When you read address X that refers to address Y, it's impossible to read Y and know it's the version that X refers to (or a more recent version). I would call that Y being "stale, as per the memory model". I'm pretty sure that's a normal use of the word "stale" in the context of memory models?
> What property of cache coherency could you lose and still have it working?
Imagine a CPU with a weak memory model, where multithreaded code without memory barriers loses the property of always seeing the same order of accesses to a specific address. That would break some things, but that code was broken anyway.
Normal memory barrier semantics could force certain accesses to be viewed in the same order, including important accesses that would normally be enforced by cache coherency. You don't need it to be enforced twice. The code will run fine.
> How would you do cache coherency in software? I don't follow.
Hardware caches which are not coherent require, e.g., writebacks and flushes to stay coherent with other agents.
> When you read address X that refers to address Y, it's impossible to read Y and know it's the version that X refers to (or a more recent version). I would call that Y being "stale, as per the memory model". I'm pretty sure that's a normal use of the word "stale" in the context of memory models?
It isn't, because it's relative. "Most recent" is according to the observer, and if you couldn't previously observe something that is "newer" (within the rules of memory consistency model), then it is not stale.
> Imagine a CPU with a weak memory model, where multithreaded code without memory barriers loses the property of always seeing the same order of accesses to a specific address. That would break some things, but that code was broken anyway.
Not same order of access, same order of stores. If memory location x receives two stores, A and B, and CPU1 sees A, B and CPU2 sees B, A, now one thinks the location contains B and the other thinks it contains A. In the end, all CPUs can see different values at all memory locations. A normal barrier doesn't solve this, the stores are already done.
> It isn't, because it's relative. "Most recent" is according to the observer, and if you couldn't previously observe something that is "newer" (within the rules of memory consistency model), then it is not stale.
Why is most recent according to the observer, and not the core(s) that actually did the writes?
The value in Y caused the value in X. Surely that's worth something in terms of ordering?
Writer
0: mov $0 $random_cell // zero initialize $random_cell
1: mov $0 $random_cell_ready // zero initialize $random_cell_ready
3: rng r1 // generates a non-zero random number in r1, takes ~300 cycles
4: mov r1 $random_cell // write generated value to memory location $random_cell
5: mov 1 $random_cell_ready // sets a flag to notify that the value has been written
Reader
0: test $random_cell_ready
1: jz 0 // spin-wait for cell ready
2: mov $random_cell r1
Consider a 1-wide OoO[0] machine with a maximally relaxed memory model. There are no caches, but magic memory with 1 clock cycle latency, so no coherence issues: at time t the writer starts computing a random number (writer:3); rng is going to take a few hundred cycles before writing the result to r1. Next clock cycle t+1 it can't execute writer:4 as r1 is not ready. Instead it executes writer:5, which has no dependencies. writer:4 is only executed at t+300.
At time t, the reader will read a zero from the flag, so at t+1 it loops back. At t+2 it sees the flag set to 1. So at t+3 the jump falls through and at t+4 we read 0 from $random_cell. That's obviously wrong as the data is not there. It is certainly not reading stale data, as that's literally the last value that was written to this memory location: a newer value hasn't even been computed yet.
To fix this you need a fence #StoreStore between writer:4 and writer:5 and a corresponding fence #LoadLoad between reader:1 and reader:2 to implement release and acquire semantics [1]. In particular the fence on the writer side will stall the pipeline [2] until previous stores in program order have committed.
As you can see there are no caches, only a single shared memory which is necessarily always coherent. There is no stale data. Yet we need fences for correctness.
You can now reintroduce caches, but MESI gives the illusion to the rest of the CPU that it is actually talking with uncached, fully coherent memory, except that latency is now variable.
Ergo, generally, fences and caches are completely orthogonal concepts.
[0] BTW the example would be broken even on a machine with scoreboarding but no renaming (so not fully OoO).
[1] proper release and acquire require slightly stronger fences, but these are sufficient for this example
[2] a better option is to just prevent further stores to be dispatched.
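For reference, the same writer/reader pattern rendered with C++11 atomics (a sketch with invented names; the release store and acquire load play the role of the #StoreStore and #LoadLoad fences above):

    #include <atomic>
    #include <cstdint>

    uint32_t random_cell = 0;                    // plain data
    std::atomic<uint32_t> random_cell_ready{0};  // the flag

    void writer(uint32_t rng_value) {
        random_cell = rng_value;
        // release store: earlier stores must become visible before the flag does
        // (the #StoreStore side)
        random_cell_ready.store(1, std::memory_order_release);
    }

    uint32_t reader() {
        // acquire load: later loads can't be satisfied ahead of the flag load
        // (the #LoadLoad side)
        while (random_cell_ready.load(std::memory_order_acquire) == 0) { /* spin */ }
        return random_cell;  // guaranteed to see rng_value, not 0
    }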
Your example has a single store, so there is no reordering. There is no value before 8, so it doesn't require any barrier. If you made an additional store to the array, that would be equivalent to my example.
I said there was a value before 8, I just didn't describe it well enough.
First off, your example looks a little under-constrained (what if reader:0 happens before writer:0?), so let's assume all the writes of 0 at the start of every program happen before any other instructions.
Let there be a third thread, "builder". It writes 0 to every address in the array, then fills it with new values including an 8.
The "writer" thread loops repeatedly over the array until it finds an 8, then stores the address in X.
The "reader" thread waits for X to be nonzero then loads the value at [X].
In the toy coherent computer, reader will always load an 8.
But in a very weak memory model, one that doesn't enforce dependent loads, reader is allowed to get a 0.
The write to X and the write of 8 don't have to show up in any particular order.
But in that situation I would say that X is "more recent" than 8, because of causality. And if you asked me yesterday I would have called the 0 "stale".
Your example was hard to follow. What do you mean, exactly? Write it in terms of memory operations performed and values observed by each CPU and address, with assertions that are or are not violated according to your example.
> You could have a system that doesn't guarantee cache coherency in general but works fine if you put in ordinary memory barriers.
How would that work? In such a system, either you have no caches or memory barriers would need to pessimistically flush all dirty lines to memory and send invalidation and synchronization messages to all other cores. In practice such a system, far from being fine, would be so slow as to be unusable if barriers had such semantics. Even implementing c++ relaxed semantics would be very expensive.
> In such a system, either you have no caches or memory barriers would need to pessimistically flush all dirty lines to memory and send invalidation and synchronization messages to all other cores.
Why would you need to avoid caches or flush to memory?
And invalidation and synchronization messages are already part of a normal CPU's overhead, so I don't see why restricting some of them to memory barriers would increase overhead.
In other words, assume you still have a cache coherency protocol, but you're lazy about certain states in the absence of memory barriers, so the default behavior is not always coherent.
How would you implement the c++11 memory model in a non-cc system exactly? Let's say you build a linked list, spanning a few cachelines, and then publish it by storing the address of the first node into an atomic with release semantics. Another thread load-consumes its address and starts traversing it.
On a cc system the only thing the release barrier needs to ensure is that all previous stores commit to L1 (in any order) before the final store. There is typically nothing to do on the consume side as the CPU pipeline preserves causality. Everything else is taken care of by plain MESI, which will transfer exactly and on-demand those cache lines that contain the list nodes and no more.
What would the acquire/consume barriers do on a non-cc system? Consider that, generally, neither the CPU nor the compiler actually have an understanding of the list data structure itself.
In that example I don't think anything really changes. With cache coherency a release barrier makes sure all previous stores are committed to L1, and by implication any old versions of the cache line have been invalidated. Without cache coherency a release barrier makes sure all previous stores are committed to L1, and explicitly says that any old versions of the cache line have been invalidated.
If you had a design a lot like MESI, you wouldn't really need release semantics, you'd just have an extra "Shared but maybe stale" state instead of just Invalid, and consume would coerce all those lines into Invalid and they'd have to be re-acquired. But re-acquiring those lines is no worse than if you had vanilla MESI. If you had an Owned state that needs to broadcast changes, you could avoid most broadcasts until seeing a release or similar barrier.
In both of these situations you'd probably make the CPU try to sync lines right away, but it could squeeze out some more performance when it's not immediately mandatory.
So to clarify, I understand from this comment and others in this thread that you are talking about a mostly CC system, with coherency messages and the full MESI transitions; the only relaxation is that you allow an additional Stale state. I don't see what you gain here: you still need to send RFOs for every write (including plain stores) to not-yet-exclusive lines, as you can't delay sending invalidates to other caches until a fence, as otherwise two cores might be writing to the same cacheline at the same time.
I guess this improves the false sharing scenario, as you can read data from a stale line if the data you care about is not what actually caused the line to go stale and a barrier is needed to fully invalidate it. But the cost is that your release barriers, instead of being a cheap, purely core-local effect, now have to cross the coherence fabric. This can move the cost from a dozen cycles to hundreds.
It's not supposed to be a particularly valuable model, it's just to show how it could exist.
> the cost is that your release barriers, instead of being a cheap, purely core-local effect, now have to cross the coherence fabric. This can move the cost from a dozen cycles to hundreds.
That's not the intent.
The barrier pushes the lines in this new state to I, but under MESI they already would have been I.
There is no need to have any performance loss compared to MESI. If no lines are in this state then there is no extra work. And if it would be helpful, you can preemptively work on converting those lines to S in the background.
Edit: Oh wait you said release barrier. No, if you're doing the simplest plan based on MESI then there is no need for anything extra when releasing. Those considerations were only if you wanted to look at MOSI where it's already doing write broadcasts.
Sorry, I'm losing track of what you are proposing. I guess you mean MOESI plus an additional Stale state? Can you describe all the transitions and when they are performed?
It seems to be a state that can send stale data to loads, but will get invalidated if the CPU performs a barrier.
It doesn't work of course, obviously because there is no forward progress. CPU1 can order all previous stores, then later store some flag or lock variable, and that will never propagate to CPU2 spinning on that waiting for the value.
But also because CPU2 and CPU3 can see different values depending on the state of their caches. If one had no such line and the other had a valid-stale line, then they will end up seeing different values. And the writeback out of CPU1's cache needs to write back and invalidate all possible older such written-to lines. By the time you make it work, it's not a cache it's a store queue but worse because it has to do all these coherency operations and barriers have to walk it and flash or flush it, etc.
Can't I say that spinning without a barrier is doing things wrong?
I guess if I need an emergency fix I can make those lines invalidate themselves every thousand cycles.
> But also because CPU2 and CPU3 can see different values depending on the state of their caches. If one had no such line and the other had a valid-stale line, then they will end up seeing different values.
But only if there's no memory barriers, so it shouldn't be a big deal. And the valid-stale lines can't be used as the basis for new writes.
> And the writeback out of CPU1's cache needs to write back and invalidate all possible older such written-to lines.
The only such lines are in this new state that's halfway between S and I. They don't need to be marked any more invalid. Zero bus traffic there.
Yes, I saw that flaw, but I think in their model barriers are always associated with stores (so atomic_thread_fence is not implementable but store_release is), which fixes your first example. I agree that in any case you end up doing more work than in the typical model to make other scenarios work.
> In that example I don't think anything really changes. With cache coherency a release barrier makes sure all previous stores are committed to L1, and by implication any old versions of the cache line have been invalidated.
That is not what a release barrier does.
> Without cache coherency a release barrier makes sure all previous stores are committed to L1, and explicitly says that any old versions of the cache line have been invalidated.
That is not a "normal barrier" though, writeback and invalidate operations are software coherency.
> If you had a design a lot like MESI, you wouldn't really need release semantics,
This is not the case. Release barrier can be required even if your cache coherency operations completed in FIFO order, because reordering could be done before cache coherency.
> That is not a "normal barrier" though, writeback and invalidate operations are software coherency.
I think I was unclear when I said "the cache line". I meant the one containing a releasing store.
Let me try wording it a different way. A store_release requires all previous writes to be ordered before it. This obviously includes other memory addresses, but its own memory address isn't an exception. So even without cache consistency as a general rule, the nature of a release gives you all the ordering you need in this situation.
I'm sorry for mentioning the word "invalidate", because that's the implementation and not the semantics.
> This is not the case. Release barrier can be required even if your cache coherency operations completed in FIFO order, because reordering could be done before cache coherency.
So I meant acquire but I think there's also a clear solution based on exactly what I said.
The lines that could possibly be affected by the FIFO are all in the "Shared but maybe stale" state. Consume turns those into Invalid. So any read that's from after the Consume, reordered before it, should see those lines as Invalid.
> I think I was unclear when I said "the cache line". I meant the one containing a releasing store.
Not sure what you mean by that.
> Let me try wording it a different way. A store_release requires all previous writes to be ordered before it. This obviously includes other memory addresses, but its own memory address isn't an exception. So even without cache consistency as a general rule, the nature of a release gives you all the ordering you need in this situation.
We're talking about cache coherency, not memory consistency. Coherency is not about ordering, it's about ensuring agents don't see stale data.
> So I meant acquire but I think there's also a clear solution based on exactly what I said.
The same goes for acquire though.
> The lines that could possibly be affected by the FIFO are all in the "Shared but maybe stale" state. Consume turns those into Invalid. So any read that's from after the Consume, reordered before it, should see those lines as Invalid.
Implementation of CPU memory pipelines and cache coherency aren't really something you can just get a bit of a feel for and then handwave about.
When I talk about how the cache protocol has to do XYZ to implement a barrier, you complain that that isn't what a barrier is.
When I talk about what memory barriers do in pure terms, you complain that I'm not mentioning the cache protocol.
When I give an example of a cache protocol in isolation, you start talking about which memory barriers I'm missing.
I don't know what you want.
> Coherency is not about ordering, it's about ensuring agents don't see stale data.
Well, if I go by "if it's part of the memory model then it's not stale", then you can allow a relaxed ordering on single addresses without having stale data.
When a core takes Exclusive control of a cache line, put all other Shared copies into the state "might be an old version, but that's allowed by the protocol and the memory model".
Some instructions can read "might be an old version, but that's allowed by the protocol and the memory model" values and some can't. The exact details of "some instructions" are flexible/irrelevant. See the memory model (not provided) for details.
There. Done. Minimal proof established of a design that doesn't always guarantee cache coherency, but can enforce as much cache coherency as you need. You don't need to add any explicit writebacks or flushes to manage it from software, and enforcing it doesn't take any significant effort beyond a normal CPU's coherency protocol.
Agents will never be confused. They know that Shared means everyone has the same value, and "might be an old version, but that's allowed by the protocol and the memory model" does not mean everyone has the same value. They know that transitioning from "might be an old version, but that's allowed by the protocol and the memory model" to Shared or Exclusive requires reading the data anew, just like transitioning from Invalid to Shared to Exclusive.
If agents want to always be as up to date as possible, they can simply not use this state. If an agent wants to be up to date some of the time, then it can allow this state but purge it at will.
This state only allows for "old" values going back to the most recent purge, so it's not a useless act to read from it. And this state can give you data faster than acquiring a Shared state, so there's a reason to use it.
> The same goes for acquire though.
I'm pretty sure the entire point of acquire is that you can't reorder reads from after it to before it.
I don't want anything, I was correcting your misconceptions.
> Well, if I go by "if it's part of the memory model then it's not stale", then you can allow a relaxed ordering on single addresses without having stale data.
I don't know what you're talking about. Memory ordering is not about ordering of a single address. That's cache coherency.
[snip]
> I'm pretty sure the entire point of acquire is that you can't reorder reads from after it to before it.
And you're still wrong. Acquire barrier can be required even if you receive coherency updates in a sequential order. Barriers in modern processors do not flush, invalidate, or perform coherency operations.
This is what I mean by you can't just handwave with a basic idea about the programming semantics (which aren't particularly complicated). It's easy to think up some vaguely plausible sounding implementation of those things, but the reality is infinitely more complicated. Real cache coherency protocols are verified with formal proofs, and not because they are easy. I guarantee if you handwave a new coherency state or give up some property of coherency, you will have bugs.
> I don't know what you're talking about. Memory ordering is not about ordering of a single address. That's cache coherency.
The ordering of a single address is relevant to both the cache protocol and the memory model.
That section is describing a cache protocol.
> And you're still wrong. Acquire barrier can be required even if you receive coherency updates in a sequential order.
I agree. How does that make my statement wrong in any way?
> Real cache coherency protocols are verified with formal proofs, and not because they are easy. I guarantee if you handwave a new coherency state or give up some property of coherency, you will have bugs.
Do you think my description is impossible to fix, or are you just trying to impress on me that it's hard?
I don't feel like spending hours finding and editing a concurrency simulator today.
> From the point of view of the software, running a bunch of instructions in sequence, you do get stale values. You can have two threads wait for a signal, then both read a value, and both get different results. You can have a thread set a flag after it's done editing some values, have another thread wait for the flag, and then after waiting it sees the edits as still incomplete.
Isn't that an implementation of a semaphore? I would imagine that should already be done in the CPU implementation?
Only if you use the special instructions for that purpose. If you're just doing a regular read/write, you're vulnerable to this kind of problem.
If you do use the atomic instructions, you're guaranteed a consistent view across threads for that location. Check your platform documentation as to whether this is also a memory barrier, affecting ordering between the atomic instruction and other non-atomic instructions.
Uh, that's uncalled for. I think your hint is right there in the title: Myths Programmers Believe about CPU Caches. Was there an inaccuracy in the article with respect to CPU Caches?
My favourite myth is that I still believe it's possible to write code which operates totally out of L1 and L2 cache. Not just tight ASM on bare metal: C code or similar, compiled down, to run under a modern UNIX/POSIX os on a multi-core host.
I have never explored HOW this would work, or WHAT I would do to achieve it, but I believe it, implicitly. For it to be true I would have to understand the implications of every function and procedure call, stack operations, the actual size of objects/structs under alignment, optimisations in the assembler to align or select code which runs faster in the given ALU, none of which I know (if I ever did, certainly not any more).
But still: I believe this myth. I believe I know somebody who has achieved it, to test some ideas about a CPU's ability to saturate a NIC at wire-rate with pre-formed packets. He was able to show that by binding affinity to specific cores and running this code, he could basically flood any link his CPU was capable of being exposed to, for a given PCI generation of NIC speeds available to him (as I understand it). But that assumes that walking off L2 cache would somehow make it run SLOW enough to not be able to do this.
As mentioned in another comment, this literally happens during system bringup when you don't have working RAM yet - link training involves a lot of state and you can't do it in registers alone, so the cache is configured in such a way that you can use it as RAM (I /think/ this is just a special case of write-back where you never actually do the write?). As long as you're not doing DMA I don't see any reason why you wouldn't be able to bring up a system with empty RAM sockets and just launch Doom out of firmware flash.
> I /think/ this is just a special case of write-back where you never actually do the write?
Sort of! A typical implementation (from what I've seen, anyways) is that you have a memory mapped region which, when cache-as-RAM is activated, directly indexes into some cache (typically the LLC). From a hardware perspective, it's a full second address decode mode where you essentially just access the data array without performing tag check/write. When coming in to CAR mode, the cache typically needs to flush, but when leaving it really doesn't need to do anything (assuming it didn't update the tag array and left all lines as invalid).
With the size of some modern SoC's LLC, you could fairly easily run DOOM out of CAR. It'll depend on the SoC, but there's no reason why DMA wouldn't work as other agents would still be able to send read and write requests to the CAR memory region.
Modern CPUs are pushing 1MB L1 and 8MB L2 or more. You can fit a dozen FreeRTOS instances in that space. It would be pretty cool to see someone build a system that used a high end CPU but didn't have any installed ram, though I'm not sure if the CPU microcode would be ok with that.
Some SoCs will let you directly/explicitly use the cache as (ostensibly secondary) RAM, remapping your configuration from
    [cpu] -- [cache] -- [ram]
to
         [cpu]
        /     \
    [cache]  [ram]
Here's TI's documentation on doing just that for the CC13x2/CC26x2 family of MCUs: [0]. In this case, it's a seemingly paltry 8K of "RAM", but that's a lot for microcontroller developers! In this configuration and for this particular MCU, the CPU will actually run at a reduced speed (60% of its norm), probably because it needs to synchronize directly against the RAM. If you were to use it in CaR-only mode (which I don't think TI exposes), i.e. with the cache being used instead of rather than in addition to any RAM, you probably wouldn't need to run at those reduced speeds (and might even be able to run at higher clock speeds!).
I guess that means my favourite myth is that I've done it. Get the OS out of the way, CPU pinning, user-level network stack, no system calls after setup, lay out all your memory carefully in a cache-line-aware fashion noting the virtual addresses & bits 6-10, I/O via shared memory to another core on the same package.
I convinced myself looking at latency benchmarks as I went about optimising it that it was all in L1.
So I'm imagining speedyOS, which assumes one core for the OS and runs its userspace jobs on the other cores. If you wire a job to a core and it's been coded to fit in L1/L2 then... it just "does", as long as the other jobs are somehow coerced to run on the other cores, "swapping" in and out as need be. If you can prevent any clock tick disrupting your run state on this one "golden" user core, it just runs as fast as it can, subject to the OS timeslice effects on it and any unavoidable synchronisation into other bits of the combined CPU/ALU/Memory/Bus system as a whole.
Hmm, I think the discussion is missing a few things needed for a complete picture of the situation.
First, ARM and x86 coherency models differ, so a big disclaimer is needed regarding the protocol. Most ARM processors use the MOESI protocol instead of the MESI protocol.
Second, synchronization isn't just because of register volatility and such. Synchronization is needed in general because without the appropriate lock/barrier instructions, compilers make assumptions about how loads and stores may be reordered with respect to one another.
> First, ARM and x86 coherency models differ, so a big disclaimer is needed regarding the protocol. Most ARM processors use the MOESI protocol instead of the MESI protocol.
MOESI is called out.
> The above are just some of the possible scenarios that can occur. In reality, there are numerous variations of the above design, and no 2 implementations are the same. For example, some designs have an O/F state.
However that's not an ARM v x86 thing; AMD has at least previously used MOESI.
> Second, synchronization isn't just because of register volatility and such. Synchronization is needed in general because without the appropriate lock/barrier instructions, compilers make assumptions about how loads and stores may be reordered with respect to one another.
Compiler barriers are different than hardware memory barriers. There are reasons to have one, but not the other.
Thanks I missed that line regarding the O states. Still, a word about write reordering on ARM would probably be useful (unless I missed that also).
I understand that synchronization in code vs hardware is different, but the blog explicitly moves out of hardware-land into source code land with references to Java volatile and such.
The blog brings up Java volatiles (but it would also apply to C++ atomics) explicitly to point out that volatile has no cache coherency implications on a typical MESI (and variants) machine. The fences required to maintain language level memory model guarantees act at a level above the L1 cache; once the data reaches L1 (i.e. the coherence point), the fences have done their job.
[I'm ignoring remote fences which are a specialized and not yet mainstream feature]
Awesome, more of a source than I found. I knew previous designs were MOESI, but a quick search didn't give any results wrt Zen. Thanks for the citation!
Honestly, the best resources I know are the ISA/architecture manuals directly from each vendor (Intel, AMD, Arm) for the various CPUs and/or GPUs of interest. Prior to that, my experience largely comes from doing (working with lower level code, benchmarks, looking at assembly, profiling hardware counters, etc.).
We worked with the Intel guys, in my last gig. They were incredibly helpful.
They were an impressive lot, and they helped us out, quite a bit. They often sent engineers over, for weeks at a time, to help us optimize.
The cache thing was a 100X improvement thing, and it came from the oddest places. There's a lot of "that doesn't make sense!" stuff, with preserving caches.
I don't remember all the tricks, but we were constantly surprised.
One thing that saved us, was instrumentation. Intel had a bunch of utilities that they wrote, and that kept showing us that the clever thing we did, was not so clever.
Dealing with caches, memory ordering, and memory barriers can be truly mind-warping stuff, even for those who have spent years dealing with basic cache coherency before. If you want a challenge, try to absorb all this in one sitting.
I kept an earlier version of this close to hand at all times a couple of jobs ago where we were using our own chips with a very weak memory ordering. The implementation team had mostly come from Alpha, which had the weakest memory ordering ever, and in the intervening years a lot of bugs related to missing memory barriers had crept into the kernel because nobody was using anything nearly as weak. I specifically remember at least one in NBD, at least one in NFS, and many in Lustre. Pain in the ass to debug, because by the time you can look at anything the values have "settled" and seem correct.
For extra fun, as weak as the memory ordering was, the first run of chips didn't even get that right. LL/SC wouldn't work reliably unless the LL was issued twice, so we actually modified compilers to do that. Ew.
You don't need that full documentation. Really, I could simplify what most programmers would need to understand about memory ordering down to this text:
There are a few models of cross-thread memory ordering that you can choose between. If you have never been exposed to this field before, the naïve model of memory ordering you probably think is going on is sequential consistency. This is not implemented in hardware because oh-gosh-it's-expensive, and if there's no indication of what memory ordering model your library or language is using, it's likely defaulting to sequential consistency.
But don't worry: you don't have to wrestle with the more complex memory orderings, because you get to pretend everything is sequentially consistent if you write proper synchronization. The easiest way to satisfy proper synchronization is an acquire-release model. Before reading any data that may have been written by a different thread, you need to do an acquire load. After writing any data that may be read by a different thread, you need to do a release store. The basic flow is write, then release store, then change thread, then acquire load, then read. Follow these rules, and things stay simple.
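If it helps, here's a minimal sketch of that pattern in C++ terms, shaped as a toy spinlock (illustration only; real code should reach for std::mutex):

    #include <atomic>

    // Toy spinlock: acquire when taking it, release when dropping it.
    // Plain reads/writes done between lock() and unlock() look sequentially
    // consistent to anyone else who also takes the lock.
    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            // acquire: later reads/writes can't move before this, and we see
            // everything the previous owner wrote before it released
            while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
        }
        void unlock() {
            // release: everything we wrote is visible to the next acquirer
            locked.store(false, std::memory_order_release);
        }
    };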
There is a theoretical slight relaxation of the above model that works on most hardware called release-consume in the C/C++ memory model. It isn't implemented by any compiler for arcane compiler reasons I won't get into, but the Linux kernel (which implements the memory model itself for $REASONS) does rely heavily on it.
The final major memory ordering is relaxed atomics. Their semantics are... weird. Essentially, you get what the hardware gives you, plus whatever fun the compiler can toss in, and the guarantees are minimal. I can't recommend many cases where you can safely use it, and if you have to ask if you should use it, the answer is no.
> You get to pretend everything is sequentially consistent if you write proper synchronization. The easiest way to satisfy proper synchronization is an acquire-release model
But acquire/release give, generally, a partial ordering, not a total order like sequential consistency. That's often enough of course.
The data-race-free theorem states that, in the absence of data races, acquire/release is indistinguishable from sequential consistency. Define data races to be UB, as C/C++ do, and you get to the state that acquire/release lets you pretend everything (except atomic operations themselves) is sequentially consistent.
My understanding is that that only applies to programs that use mutex lock/unlock operations (which do have acquire/release semantics), but not to programs that use acq/rel memory operations in general. For example: https://www.hboehm.info/c++mm/sc_proof.html which contains the SC proof for lock operations that you mention, but also a counterexample for acq/rel atomics.
The original data-race-free models are based on acquire/release semantics that underlie the C++ memory model (https://pages.cs.wisc.edu/~markhill/papers/topds93_drf1.pdf), so the use of acquire/release atomics should provide the same guarantees as mutex lock/unlock.
However, the sequential consistency guarantee doesn't necessarily apply to atomics themselves, and I think the difference between the data-race-free-0 and data-race-free-1 models is whether or not they would extend the guarantee to the release-acquire atomic operations.
To clarify, you are saying that there is a model that guarantees DRF-SC even if the atomic operations are not themselves SC? Aside from the fact that I'm not sure such a guarantee would be useful or even meaningful, I think you can extend Boehm's counterexample to add non-atomic cells (and conditional reads) and show SC violations.
My understanding of the C++ memory model since the early standardization discussions was that DRF-SC is only generally guaranteed[1] if the acquire/release operations themselves were SC and can't be easily recovered otherwise.
I.e.: full SC requires all operations to be SC, DRF-SC requires, in addition of no data races, only the synchronization edges to be SC.
I suspect that's what the drf-1 model in the paper you have linked specifies. But it will take me a bit to digest it and I'll readily admit that I might be wrong (thanks for the paper BTW).
This is at the penumbra of my knowledge of memory ordering, so it's entirely possible that I'm completely incorrect here, especially because confirming correctness requires spending a lot of time making sure that the various papers are all using the same definitions for the various words.
I’m nowhere near as deep in this rabbit hole. But I recall running/seeing some benchmarks on different memory orderings and the differences were not that big, on x86 I believe. Obviously there’s a lot that can go wrong with micro-benchmarks in these incredibly complex systems, but I still got the feeling that memory orderings are perhaps not worth the immense complexity that they introduce, for let’s say the majority of programmers, even low level folks.
At least on x86, acq/rel load/stores vs relaxed is basically free. Seq/cst loads are also free. Seq/cst are relatively fast, but at around 20-30 clock cycles still measurably slower than everything else.
The catch is that x86 only has seq/cst atomic RMW so even if you ask for, say, a relaxed CAS or XADD, you will still get an expensive one.
So the c++11 memory model allows you to more easily maintain correctness (and your sanity), but for performance you still have to know how to map to the underlying microarchitecture.
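To make the catch concrete, a small sketch; the lowering noted in the comments is the typical x86 one, so check your own compiler's output:

    #include <atomic>

    std::atomic<int> counter{0};

    int bump_relaxed() {
        // Even with relaxed ordering this typically compiles to LOCK XADD on
        // x86; "relaxed" mostly just gives the compiler more freedom around it.
        return counter.fetch_add(1, std::memory_order_relaxed);
    }

    int bump_seq_cst() {
        // Same LOCK XADD on x86, so the cost difference is essentially nil.
        return counter.fetch_add(1);  // default is memory_order_seq_cst
    }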
> At least on x86, acq/rel load/stores vs relaxed is basically free.
The corollary to this is that using them doesn't buy you any performance (under x86).
(Also it makes it difficult to test that your code itself synchronizes correctly if you're developing on x86, since bugs may only show up under ARM or RISC or whatever, and even there, only some of the time... and that's why project loom, tsan, and miri exist.)
> Seq/cst are relatively fast, but at around 20-30 clock cycles
Did you mean to say seq/cst store?
Also, what operation is “set the value to X unconditionally and return me the previous value”? Is that possible with a store or something different? (Golang calls this op atomic swap)
In either case, sounds like the room for optimizing for performance with granular memory models on x86 is even narrower than I thought.
> set the value to X unconditionally and return me the previous value
That would be atomic::exchange, which maps to XCHG on x86 and which, like all atomic RMWs on x86, is sequentially consistent.
Incidentally seq-cst stores are also typically lowered to XCHG on x86 as opposed to the more obvious MFENCE+MOV.
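A tiny sketch of both points (assumed std::atomic usage; the lowering noted is the common one, not a guarantee):

    #include <atomic>

    std::atomic<int> cell{0};

    int swap_in(int x) {
        // "Set the value to x unconditionally and return the previous value."
        // On x86 this is XCHG, which is implicitly locked and seq-cst.
        return cell.exchange(x);
    }

    void publish(int x) {
        // Plain seq-cst store; commonly lowered to XCHG (old value discarded)
        // on x86 rather than MOV + MFENCE.
        cell.store(x);  // default is memory_order_seq_cst
    }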
There is still room for optimization, as if you can implement your algos with just load/stores and as few strategically placed RMW as you can, it can be a win.
Of course if there is any contention, cache coherence traffic is going to dominate over any atomic cost.
This matches my own micro-benchmarks in golang ish. I see basically either ~4ns for any write op, including store, swap, add, etc. And ~1ns for loads. I assume it’s all seq-cst.
As noted elsewhere, x86 itself really doesn't have much difference. But it matters a lot more on other architectures--I caused like a 20-30% regression in JS performance on ARM by changing one atomic variable in the engine to sequentially-consistent instead of release-acquire.
My recommendations boil down to the following:
* If you're ever truly unsure, just stick with sequentially-consistent unless performance is so critical you need to get off of it. Correctness is more important than speed!
* You can use acquire/release if you've got something that smells sufficiently like a lock (there's a clear scope with beginning and end, and most of the memory accesses outside the acquire/release themselves are regular, unsynchronized accesses). Most of this code should probably be hidden in libraries anyways, but this probably should be your basic default if you're working with atomics if there's only one atomic variable in play.
* The other memory orderings I wouldn't recommend at all. Release/consume, even were it implemented by compilers as intended, requires a particular (though common) set of circumstances to work correctly. Relaxed affords no synchronization opportunities, and the one use case I can think of for it involves atomic read-modify-write operations, which I think all hardware makes as strong as a release+acquire anyways.
In short, worrying about sequential consistency versus release/acquire can be helpful, and I think there are simple enough rules-of-thumb to make it worthwhile to summarize it. The other memory orderings, not so much.
> I caused like a 20-30% regression in JS performance on ARM by changing one atomic variable in the engine to sequentially-consistent instead of release-acquire.
Very interesting. Was this overall performance? What type of workload was involved?
Yeah it seems like acq-rel is the only other one worth keeping an eye out for. When using atomics you have different logical ops with a certain happens-before relation between them anyway. Figuring out whether these ops map to acq-rel seems like a reasonable task to take on, given the total effort. The main argument against it is lack of testing infrastructure (since indeed correctness is more important). With something like Loom (the Rust project) it’s significantly easier to prevent subtle bugs. I wish it was more widely available.
Oh, you suppose the authors spent their time writing that for no reason? You know better than them? Your explanation is so incomplete I hardly know where to begin. Maybe with the fact that you're talking about local ordering as it may or may not be perturbed by a compiler, while the document is talking about the observed order elsewhere (other processors or main memory). That observed order can vary depending on the processor's ordering model, and to ensure it (e.g. across all modifications to a complex shared data structure) without doing the memory equivalent of an fsync() on every file write (disastrous!) you need memory barriers. All of the bugs I mentioned, all of which were crashes, were related to exactly the distinction you're missing. Yours is the kind of mythology that OP sought to address. Please read the document - particularly the section on CPU memory barriers and anywhere that mentions Alpha - before spreading more misinformation.
> Oh, you suppose the authors spent their time writing that for no reason? Your explanation is so incomplete I hardly know where to begin.
No. I am suggesting that you don't need the full rigor of understanding the memory barrier semantics to be able to effectively write multithreaded code correctly. Completeness was never a goal of my explanation; sufficiency was. Furthermore, my focus is on a software memory model, not the hardware memory model.
> Maybe with the fact that doing acquire/release for one value has varying effect (sometimes none) on others, so you'd have to do it separately for all of the many values that might have changed.
You have misinterpreted my words, I think. At no point do I suggest that you should insert an acquire/release for each, individual value. Rather, you need to do an acquire at the beginning of a block of code (wherein you can read and write multiple values) and a release at the end of a block of code, much as you would with a mutex (except it's not necessary that the regions of code actually be mutually excluded from executing on multiple threads). The only requirement is that the store be followed by a release operation that crosses threads to an acquire operation that is followed by a load.
> Then we can move on to the concept of dependent reads, and so on.
I allude to those under the part where I mention "There is a theoretical slight relaxation of the above model that works on most hardware called release-consume in the C/C++ memory model." That you did not pick up on that is perhaps because you yourself are unfamiliar with the release-consume portion of the C/C++ memory model. If you need a refresher, I might point you to the thread where I actually go into why it exists, and why it's not implemented by compilers, here: https://news.ycombinator.com/item?id=36059369. (There's a reason I'm not going into it in a facile explanation!)
> Please read the document before you spread more misinformation.
Oh, do trust me, I have read that document, and far more, on memory models. I am not spreading misinformation. Perhaps my word choice and my writing is inartful or oversimplified in the goal of making the topic more accessible rather than seeking to enshrine comprehensive knowledge in thick tomes that many will give up reading. Perhaps you yourself may not be sufficiently informed as to pick up on the specific terminology that I use, which derives from the dominant memory model in use by, well, everybody.
No, you didn't. It's the C++ memory model, which is borrowed by C, then every compiler IR targeting something in the C/C++ space, and then every other language that decided to support language-level atomics.
Your mistake is thinking solely in terms of the hardware memory model. Indeed, if you care only about x86 (and ignore the potential for compiler optimizations), most of the variety provided by the C/C++ memory model is unnecessary, as every operation on the x86 (well, except for the exceptions) are inherently release/acquire operations.
Please do read up on the C/C++ memory model added in C11 and C++11. It may serve you to understand this space better.
So your incomplete explanations are based on a single language model instead of a single processor model? Sorry, that's a distinction without a difference. It's no excuse for misleading people who might use other languages or processors. And stop with the implied insults about my level of knowledge, when you're the one clearly fixated on only one environment. That's both rude and stupid. I've worked on many CPU architectures, on NUMA and COMA systems long before most people even knew such things existed, at many different levels from the first instruction after an exception (or restart) on up. I say that not to make my own appeal to authority, but to refute yours and to underscore that these "irrelevant" details are in fact highly relevant and important to some of us out here in the wider world. I haven't been alone in any of those things. There are still many programmers working in "exotic" environments where these things matter, and we both rely on their work every day. You should at the very least qualify your statements to say that they're only true for application level programmers like yourself.
> So your incomplete explanations are based on a single language model instead of a single processor model?
It's the language memory model incorporated by all major programming languages. That you are unaware of this tells me that you are a hardware engineer, not a software engineer.
> I've worked on many CPU architectures
And I've worked on computer architectures where it's not clear how to even translate "volatile" correctly because the memory system is that weird. I didn't bring that up before because I'm not interested in appeal-to-authority until people start accusing me of being an idiot who doesn't know what they talk about. I could bring up more bona fide credentials, but what's the point? You've already dismissed my expertise.
> You should at the very least qualify your statements to say that they're only true for application level programmers like yourself.
I did. Every single message, in fact:
> what most programmers would need to understand about memory ordering
> Furthermore, my focus is on a software memory model, not the hardware memory model.
> It's the C++ memory model
And actually, I would go further and suggest that kernel programmers might be better served by adopting the language memory model offered by their compiler rather than insisting on rolling their own and yelling at the compiler when they mess it up.
That's pretty rich, since you were the one who jumped in to dismiss others'.
> you are a hardware engineer, not a software engineer
Incorrect. I've merely worked close to the hardware enough to understand why these things matter. Someone has to support the abstraction on which people like you depend, which requires understanding both the abstraction and the domain where it doesn't exist yet. BTW that's why kernel folks don't rely on your abstraction; that would be a circular dependency.
I suggest that this whole fracas could have been avoided if you weren't so prone to make assumptions (including that one about me), over-generalize from your own experience, and meet any disagreement with ever-increasing levels of condescension. I'm sure you're good at what you do, but try to accept that others are also good at what they do and got that way by learning about things you consider irrelevant or exotic. All I was trying to do before you decided to play "I'm smarter" was share some of that information for the next generation of system programmers.
> That's pretty rich, since you were the one who jumped in to dismiss others'.
Where did I do that? You were the one to accuse me of "you suppose the authors spent their time writing that for no reason?", to which I directly responded that no, I did not. My goal was to produce a smaller, more concise comment that could be reasonably offered up as something that "every programmer should know" (with particular emphasis on the word "every"), which by its very nature, ought to be incomplete.
The central myth is that the average programmer should care.
The typical programmer should treat CPU caches as what they are designed to be: mostly transparent. You work in a high level language and leave the tricky details to a library and your compiler.
It's only a small minority that should really worry about these things.
In my daily work, I see more often premature microoptimizations (in part using the myths from the article) which are entirely unnecessary rather than code that needs to optimize for those things.
You should still at least care about 'CPU friendly' data layout in memory and data access patterns in your code to make the CPU's life easier, compiler magic won't help all that much there.
This can often trivially give you a 10x, and sometimes a 100x performance difference for real-world single-threaded code, especially if it needs to work on big data sets. Your fancy high level compiler won't magically reshuffle your data in memory to help with prefetching (at least I'm not aware of a language that does).
Most high level languages popular today are "rooted" in the 90's when the latency gap between CPU and memory practically didn't exist, and thus don't care about this specific aspect. Explicit control over memory layout is probably also one of the main reasons why C stood the test of time so well.
> This can often trivially give you a 10x, and sometimes a 100x performance difference for real-world single-threaded code
That's one hell of a lot of speed up, can you explain how you managed that? I mean I can certainly believe it but not when you put "trivially" in front of it, I could only conceive of that in very special cases and with very careful coding.
Sometimes you have some very very high-use variables, with multiple threads each having one and writing to it often. If multiple of those variables are located on the same cache line, control of it will bounce wildly between cores and this can completely trash your performance. The fix there is absolutely trivial: add padding.
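A minimal sketch of that fix in C++ (the counter struct, thread count, and loop sizes are all invented for illustration): without the alignas, several counters can land on one 64-byte line and every increment invalidates the other cores' copies.

    #include <atomic>
    #include <thread>

    // Hypothetical per-thread counters. alignas(64) pads each one out to its
    // own cache line so the cores stop fighting over a single line.
    struct alignas(64) PerThreadCounter {
        std::atomic<long> value{0};
    };

    PerThreadCounter counters[4];

    int main() {
        std::thread workers[4];
        for (int t = 0; t < 4; ++t)
            workers[t] = std::thread([t] {
                for (int i = 0; i < 10000000; ++i)
                    counters[t].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& w : workers) w.join();
    }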
Sometimes you're iterating through a huge number of objects, and only using a couple fields out of each object. If you rearrange the data so each field is stored in a different arena, you can cut the number of memory accesses by a huge amount, and let prefetching work a lot better. That's usually not as trivial but it's simple work.
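Roughly what that rearrangement looks like in C++ (field names invented; the "arena per field" layout is often called struct-of-arrays):

    #include <vector>

    // Array-of-structs: summing only `mass` still drags every cold field
    // through the cache, line by line.
    struct Particle { float x, y, z, mass; int id; /* ...more cold fields... */ };

    float total_mass_aos(const std::vector<Particle>& ps) {
        float sum = 0;
        for (const auto& p : ps) sum += p.mass;
        return sum;
    }

    // Struct-of-arrays: the hot field is contiguous, so every 64-byte line
    // delivers 16 useful floats and the prefetcher sees a trivial pattern.
    struct Particles {
        std::vector<float> x, y, z, mass;
        std::vector<int> id;
    };

    float total_mass_soa(const Particles& ps) {
        float sum = 0;
        for (float m : ps.mass) sum += m;
        return sum;
    }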
Another often-trivial one is avoiding linked lists whenever feasible.
> Sometimes you have some very very high-use variables, with multiple threads each having one and writing to it often.
And how often do you actually write such code? Most people: Between rarely and never. You write single threaded code or use synchronization primitives. Sure if you are writing a library squeezing performance out of some parallel processing problem then this is relevant, but that's a niche scenario.
> The fix there is absolutely trivial: add padding.
Most common case is that this just wastes memory. Don't micro-optimize before you know that this is actually a problem.
> Another often-trivial one is avoiding linked lists whenever feasible.
Again, not true, depending on your use case. If the operations that you commonly perform on the data structure are things like inserting/deleting elements, then of course you should use a linked list or whatever container data structure your language provides. How caches play into this is at most a second-order effect in the common case, unless you really want to optimize a tight loop in a performance-critical application.
I've seen too many prematurely applied fancy data structures where it turned out that all this extra complexity was entirely unnecessary and just made things harder to maintain.
> Most common case is that this just wastes memory. Don't micro-optimize before you know that this is actually a problem.
You don't have many variables that fit the description I gave. You're already doing profiling if you can pick them out, and preemptively spacing them would barely take any memory, and it's the kind of problem that's hard to thoroughly test unless you have a 64 core machine sitting around.
> If the operations that you commonly perform on the data structure are things like inserting/deleting elements, then of course you should use a linked list
For situations where linked list and array are both usable, then even if you very commonly insert and delete you're usually better off with a data structure that's built on top of fixed-size arrays. Iterating an array is so fast that it makes up for the cost of shifting around a surprisingly large number of elements.
> or whatever container data structure your language provides.
But which one? Languages tend to have a lot.
And the ones that give you a one-size-fits-all data structure usually don't have a built-in linked list anyway.
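For what it's worth, this is the kind of comparison I mean, sketched in C++ (the sorted-insert workload and sizes are just an example): the vector pays for shifting contiguous elements, the list pays for walking scattered nodes to find the insertion point, and on modern hardware the vector usually wins anyway.

    #include <algorithm>
    #include <cstdlib>
    #include <list>
    #include <vector>

    int main() {
        std::vector<int> v;
        std::list<int> l;
        for (int i = 0; i < 100000; ++i) {
            int x = std::rand();
            // vector: binary search, then shift a contiguous block (cache friendly)
            v.insert(std::upper_bound(v.begin(), v.end(), x), x);
            // list: O(1) splice, but only after a linear walk over scattered nodes
            l.insert(std::find_if(l.begin(), l.end(),
                                  [x](int y) { return y > x; }), x);
        }
    }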
Avoid cache line ping-pong, don't waste cache, and make accesses as spatially predictable as possible. Okay, I can see where you're coming from, thanks.
I haven't seen such dramatic speedups in my own code for quite a while since I don't tend to start with a worst case version.
With some googling it's easy to find cases like this where switching from a traditional OOP approach of "unique objects in random heap locations" to a DOD approach gains a 173x speedup overall (it's not just the tight memory layout of course, but also all the code simplifications this enables, like tighter loops and then easier integration of multithreading):
This is partly wrong. For example, it's important how the fields are organized in a structure, and only the developer knows what belongs together and only they know the access patterns across threads (or at least should know). This is far from being a micro-optimization.
If I do it wrong, are we talking about my application (say, some web app) becoming noticeably slower? Or we would need to be running several million operations per second before anything a user could notice the difference?
I suspect only very high end games, image processing apps and that kind of thing would ever need to care about the order of fields in a struct... perhaps the author of my web server as well, but even that only if I have a ridiculously high load on my server (which I don't, almost no one does).
So, no I don't think I should know that until I develop one of those highly niche applications myself.
> are we talking about my application (say, some web app) becoming noticeably slower? Or we would need to be running several million operations per second before anything a user could notice the difference?
If you have an algorithm with quadratic or higher complexity somewhere, you will be in the millions even if you only have a few thousand entries to deal with. Now the worst-case cache miss (all the way down to RAM) is modeled as roughly 1000 cycles. Put that right in the middle of that O(n^2) algorithm and your web app will be able to handle 1 request per second on a good day.
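To put rough numbers on that (all of them assumed for illustration, not measured):

    // Back-of-envelope: a few thousand entries under an O(n^2) loop,
    // pessimistically treating every access as a miss all the way to DRAM.
    constexpr long long entries = 3000;
    constexpr long long misses  = entries * entries;   // ~9 million accesses
    constexpr long long cycles  = misses * 1000;       // ~1000 cycles per miss
    constexpr double    seconds = cycles / 3.0e9;      // ~3 s on a 3 GHz core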
Maybe you shouldn't be doing O(n^2) inside a request handler in the first place. This has nothing to do with caches.
And even if you do quadratic operations in your handler, how often do you write a new such handler? Most of the time you work on stuff around that, supporting infra, testing, etc. None of those needs to be cache optimized either.
It was mostly meant as an example of millions of operations even at a small scale. Sadly I have seen way too much accidental O(n^3) code, and not all of it could be "fixed" trivially.
> This has nothing to do with caches.
Would you rather handle 1000 request per second or 1? Caching has the potential to make an already bad situation so much worse.
> Would you rather handle 1000 request per second or 1?
Would you rather like to approach this problem by optimizing memory layout which, let's be honest, tends to give you more in the region of 1-10% improvements rather than 10-100x, or would you rather like to try to go from O(n^3) to O(n^2) or O(n log n) or similar?
The real meat is in the complexity class. Cache effects are where you go when there is nothing else to optimize and it actually matters. Preemptively making all your data structures 2-3x as big because of not-actually-necessary padding does not sound right to me, but YMMV.
How many of your structs does your code access a million times a second? For most programs, that number is firmly 0.
While you are optimizing your struct layout and decreasing your app start time from 1.122765 seconds to 1.122764 seconds, I ship production code at 2x the rate, because I only optimize for performance where it matters while otherwise optimizing for maintainability, testability, extensibility, dev fun and actually shipping.
Loved the article, appreciate the share. A meta topic but I wish people would spend more time discussing what they like about something than the opposite.
This author took great pains to give many caveats around their writing and distilled down a very complex topic into something that could be consumed in less than 10 minutes.
I'm smarter for having read this.
Best of luck to any poor soul who attempts to summarize the work of many architects and then share it with this group of people.
SQLite database locks can be more quickly obtained and released if CPU affinity is set for the database processes, allowing all I/O activity to share the same cache(es).
I have read (but cannot remember where) that this can increase performance by thousands of DML operations per second.
This is really going to be hardware- and OS-dependent. Most schedulers try to avoid changing CPU affinity for active processes, so you're likely looking at the delta for the worst-case scenario rather than the average case. But yes, affinity definitely helps!
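In case it helps, this is roughly what pinning looks like on Linux (the helper name is mine; other OSes have their own APIs, and pinning the whole process would go through sched_setaffinity instead):

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one core so its working set stays in that
    // core's private caches (Linux-specific; returns 0 on success).
    int pin_current_thread(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }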
On the related topic of store buffers and write coalescing, can anyone point me to some good articles discussing how this works in practice for merging small writes to the same cache line? I have a workload where we are appending payloads of varying sizes (between 1 and 64 bytes) to an in-memory log. One option is to just store each entry in its own cache line. A more space-efficient option is to pack the writes tightly one after the other. My hypothesis is that due to write coalescing in store buffers this will also be more memory-bandwidth efficient, since writes to the same cache line will be merged. Is this correct? Note that metadata for each entry (e.g. the size) will be stored separately.
On a typical write-back cache, the cache line is written back to memory (or to a lower hierarchy) only when needed: either some other core wants it or it needs to be evicted to make space for other data. For a write-mostly task, if you write to the same line again, it is likely to be still in the M state and data can be written to it immediately. Later, when the line is written back, the separate writes are effectively coalesced. But if some other core is concurrently reading from the head of the log, it can be suboptimal as the cache line will keep bouncing between M and S state, generating coherence traffic. This is still more scalable than two threads writing to the same cache line, of course.
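To make the two layouts from the question concrete, a rough C++ sketch (type names invented; per-entry sizes kept in a separate structure, as described):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Option A: one entry per 64-byte line. No sharing within a line, but up
    // to 63 wasted bytes per entry.
    struct alignas(64) PaddedEntry {
        std::uint8_t payload[64];
    };

    // Option B: pack payloads back to back. Consecutive stores to the same
    // line stay dirty in L1 (or merge in the store buffer) until write-back,
    // so fewer lines are written back per byte of payload.
    struct PackedLog {
        std::vector<std::uint8_t> buf;
        void append(const std::uint8_t* p, std::size_t n) {
            buf.insert(buf.end(), p, p + n);
        }
    };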
In Java, threads that are accessing non-synchronized variables can read different information. It's in the JVM spec, and was a surprise back then. In C you could assume (at least in some runtimes) that you only need to sync on write.
Exactly. I was referring to that bit at the end of the article:
"In the case of Java volatiles, part of the solution is to force all reads/writes to bypass the local registers, and immediately trigger cache reads/writes instead. As soon as the data is read/written to the L1 cache, the hardware-coherency protocol takes over and provides guaranteed coherency across all global threads. Thus ensuring that if multiple threads are reading/writing to the same variable, they are all kept in sync with one another. And this is how you can achieve inter-thread coordination in as little as 1ns."
That last sentence can be pedantically true, if you look at it in a very strict context. It doesn't say the variable can be returned in 1ns - but even if it did, it says the inter-thread coordination can take place in "as little as" 1ns. If an L1 cache immediately returns its cached value for a memory address because it has the data in E or M state, that's technically a form of inter-thread coordination, just using already-coordinated state.
Language-specific synchronization primitives such as mutexes or volatile variables use memory barriers to ensure the caches are flushed and/or a CPU core reads from main memory directly:
https://en.wikipedia.org/wiki/Memory_barrier
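In code, the pattern those primitives enable looks roughly like this (C++ atomics rather than Java volatiles; a minimal sketch, not anything from the article): the atomic flag is what creates the happens-before edge, so once the reader sees the flag it is guaranteed to see the payload as well.

    #include <atomic>
    #include <thread>

    int payload = 0;                    // plain, non-atomic data
    std::atomic<bool> ready{false};     // plays the role of the volatile flag

    int main() {
        std::thread writer([] {
            payload = 42;               // ordinary store
            ready.store(true);          // seq_cst publish
        });
        std::thread reader([] {
            while (!ready.load()) { }   // spin until published
            // no data race: the load that saw `true` orders us after the writer
            int seen = payload;         // guaranteed to be 42
            (void)seen;
        });
        writer.join();
        reader.join();
    }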