
Edit: Looks like the slides had an inaccuracy (see replies). Huh, looks like I learned something today :)

I think a good way of summarizing volatile is this slide from my parallel architectures class [1]:

    > Class exercise: describe everything that might occur during the 
    > execution of this statement
    >     volatile int x = 10
    >
    > 1. Write to memory
    > 
    > Now describe everything that might occur during the execution of
    > this statement
    >     int x = 10
    > 
    > 1.  Virtual address to physical address conversion (TLB lookup)
    > 2.  TLB miss
    > 3.  TLB update (might involve OS)
    > 4.  OS may need to swap in page to get the appropriate page 
    >     table (load from disk to physical address)
    > 5.  Cache lookup (tag check)
    > 6.  Determine line not in cache (need to generate BusRdX)
    > 7.  Arbitrate for bus
    > 8.  Win bus, place address, command on bus
    > 9.  All caches perform snoop (e.g., invalidate their local 
    >     copies of the relevant line)
    > 10. Another cache or memory decides it must respond (let’s 
    >     assume it’s memory)
    > 11. Memory request sent to memory controller
    > 12. Memory controller is itself a scheduler
    > 13. Memory controller checks active row in DRAM row buffer.
    >     (May need to activate a new DRAM row. Let’s assume it does.)
    > 14. DRAM reads values into row buffer
    > 15. Memory arbitrates for data bus
    > 16. Memory wins bus
    > 17. Memory puts data on bus
    > 18. Requesting cache grabs data, updates cache line and tags, 
    >     moves line into exclusive state
    > 19. Processor is notified data exists
    > 20. Instruction proceeds
    > * This list is certainly not complete, it’s just 
    >   what I came up with off the top of my head. 
It's also worth mentioning that all of this assumes a uniprocessor model; out-of-order execution is still possible even there, which leads to complications in any sort of multithreaded or networked system (see #5, 6, 7, and 8 in the OP article).

I think a lot of the confusion stems from the illusion of a uniprocessor, in-order execution model that programmers who have never dealt with system-level code tend to hold. I think in the future, performant software will require a bit more understanding of the underlying hardware from your average software developer -- especially when you care about any sort of parallelism. It doesn't help that almost all common CS curricula ignore parallelism until the 3rd year or later.

[1] http://www.cs.cmu.edu/~418/lectures/12_snoopimpl.pdf - the last 2 slides



That interpretation of those slides is incorrect. "volatile" means nothing more than "ensure that a store instruction is issued". It absolutely does not bypass any of the mechanisms listed. Write a test program and look at the assembler output on multiple architectures for proof. (Or look at the intermediate output from Clang.)
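For example, here's a minimal test you can paste into godbolt (the function and variable names are mine) -- compile with -O2 and compare the output of the two functions:

    volatile int vx;
    int nx;

    void store_volatile(void) {
        vx = 10;  /* volatile: both stores must be emitted */
        vx = 10;
    }

    void store_plain(void) {
        nx = 10;  /* non-volatile: typically collapsed to one store */
        nx = 10;
    }
In both cases the stores compile to ordinary mov instructions; there is no cache- or TLB-bypassing logic anywhere to be found.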

Looking at the formatting on the actual slides, I think the 1st is meant to be a question, and the 2nd is the answer. That the first contains the word "volatile" and the second doesn't looks to me like an editing error; they probably both said "volatile" at one time (or didn't), and the prof failed to update one when updating the other.


> looks to me like an editing error; they probably both said "volatile" at one time (or didn't), and the prof failed to update one when updating the other

Isn't it sobering to think that a university slide could have a minor error like that, someone could read it and internalise it as being very important, and then go off and ask interview questions about it (as suggested on the slide!!!!) for the rest of their career!

(Not the fault of the student in this thread, of course.)


I don't understand these slides. The volatile keyword does not magically bypass the mechanism by which modern CPUs write to main memory. Am I missing something, or are they somehow meant to be ironic?


It should (in theory) bypass any caches between the CPU and physical memory. Of course, this is compiler/arch/OS dependent, so YMMV...

The slide is admittedly a bit vague; the point is mostly to convey "lots of complicated things that you probably haven't considered are going on in the background to speed up memory accesses in a uniprocessor model." Keep in mind the class is exploring parallel architectures, and that lecture is about snooping-based cache coherence.


Check for yourself - look at the compiler output using https://godbolt.org for volatile on a variety of architectures. Ask yourself 'where is the logic to bypass the cache or virtual memory?' You won't find it.


The volatile keyword certainly has implications for cache coherency, but it cannot bypass the TLB or somehow magically avoid the need to involve the memory controller. Unless I'm grossly misunderstanding something, a majority of the points on the second slide should also be on the first.


Yep. The slide is completely wrong. It is showing low-level architecture details that would be 100% identical between the two cases. Volatile changes nothing on that list.

Volatile just makes sure the compiler actually emits every access. Otherwise, a pair of writes to the same memory location could be optimized by eliminating the first write; volatile forbids that. Of course, the CPU itself may then perform the same optimization, so volatile alone is not good enough for IO.


> It is showing low-level architecture details that would be 100% identical between the two cases.

To be as charitable as I can possibly be, the only part that could theoretically make sense is that the compiler could emit non-temporal store instructions to bypass the cache. I know compilers currently don't do that for volatile, but I don't know why.
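For reference, this is roughly what a hand-written non-temporal store looks like with SSE2 intrinsics (my own sketch; compilers don't emit this for volatile):

    #include <emmintrin.h>  /* SSE2: _mm_stream_si32 compiles to MOVNTI */

    int buffer[1024];

    void write_around_the_cache(int value) {
        /* non-temporal hint: write around the cache hierarchy */
        _mm_stream_si32(&buffer[0], value);
        /* non-temporal stores are weakly ordered, so fence before
           other agents are allowed to observe the write */
        _mm_sfence();
    }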


> the only part that could theoretically make sense is that the compiler could emit non-temporal store instructions to bypass the cache. I know compilers currently don't do that for volatile, but I don't know why.

Two reasons:

First, using nontemporal accesses would break mixed volatile and non-volatile accesses to the same memory, something which is not defined by the C standard but which some programs rely on anyway.

Second, more importantly: why would they?

- If the address you’re accessing points to hardware registers, the page table entry should be marked non-cacheable, which makes nontemporal accesses unnecessary. And if for some reason it’s not marked properly, nontemporal accesses wouldn’t be sufficient to guarantee that things work anyway, because nontemporal is just a hint which the hardware may not respect. In any case, at least on x86, AFAIK the only nontemporal instructions access 128+ bits of memory at a time, which wouldn’t even work for hardware registers (which generally require you to use a specific access size).

- If the address you’re using points to regular memory, on the other hand, volatile is probably being used to implement atomics, in which case bypassing the cache is unnecessary and also slow. In theory, compilers could compile volatile into accesses surrounded by memory barrier instructions, which would enforce a stronger memory ordering (while being faster than bypassing the cache entirely), especially useful on architectures with weaker memory models than x86. In fact, that’s what volatile does in Java. But in C, it’s pretty long-established that volatile accesses should just compile to regular load/store instructions, and any necessary barriers must be inserted manually. People writing high-performance code wouldn’t be happy if the compiler started inserting unnecessary barrier instructions for them… In any case, usage of volatile for atomics is deprecated in favor of C/C++11 atomics, which do insert barriers for you.
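As a rough sketch of the difference, using C11 <stdatomic.h> (my own example, not anything a compiler does for you):

    #include <stdatomic.h>
    #include <stdbool.h>

    int payload;
    volatile bool ready_v;
    atomic_bool ready_a;

    /* Broken as an inter-thread publish: volatile only orders this
       store relative to other volatile accesses, so the compiler may
       sink the payload store past the flag -- and so may the CPU on
       weakly ordered architectures. */
    void publish_volatile(int value) {
        payload = value;
        ready_v = true;
    }

    /* The atomic store (seq_cst by default) orders the payload write
       before the flag and emits whatever barriers the target needs. */
    void publish_atomic(int value) {
        payload = value;
        atomic_store(&ready_a, true);
    }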


I think the reason is that the details are too complicated to be captured by the volatile keyword.

For instance, the processor I use has a controller that enforces consistency on IO memory operations, so volatile works 'fine'. I know that; the compiler, which targets a core rather than a specific implementation, has no idea.


I don't get this, most likely due to my ignorance, but I thought volatile doesn't necessarily force anything to RAM; it can just push the write out so that cache coherence handles the rest between cores (and perhaps peripherals). MESI can do the work without actually hitting memory.

If you want to actually force it to RAM, then perhaps you'd need a memory barrier.

This is not my area though. Wrong? Right?


Yes, and you'll either need to set up the memory mapping as uncached or issue the correct cache flush/invalidate operations.
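For the uncached-mapping option, a rough userspace sketch (the physical address is hypothetical; on Linux, an O_SYNC mapping of /dev/mem is uncached on many ports):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REG_BASE 0x3F200000UL  /* hypothetical device register block */

    int main(void) {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);  /* O_SYNC requests an uncached mapping */
        if (fd < 0) return 1;
        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, REG_BASE);
        if (regs == MAP_FAILED) return 1;
        regs[0] = 10;  /* volatile makes the compiler emit the store; the
                          uncached mapping keeps it out of the data cache */
        munmap((void *)regs, 4096);
        return close(fd) ? 1 : 0;
    }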


What happens with this code?

    volatile int x;
    int          y;
    int          z;
    
    x = 10;
    x = 20;
    y = x;
    z = x;
Answer:

    the constant 10 is written to x
    the constant 20 is written to x
    the value of x is read and written into y
    the value of x is read and written into z
Now, what happens with this code?

    int x;
    int y;
    int z;
    
    x = 10;
    x = 20;
    y = x;
    z = x;
One answer is the same as the above. Another valid answer is:

    the constant 20 is written to x
    the constant 20 is written to y
    the constant 20 is written to z
Why? Because x is not used between the two assignments, so the first will never be seen. Also, x is not used between its assignment and the assignment to y, so the compiler can do constant propagation.

All volatile does is tell the compiler "all writes must happen, and no caching of reads".
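The "no caching of reads" half is just as important. A minimal sketch (my example; try it at -O2):

    #include <stdbool.h>

    volatile bool done;  /* set from an interrupt handler or another thread */

    void wait_for_done(void) {
        /* volatile forces a fresh load of done on every iteration;
           without it, the compiler may hoist the load out of the loop
           and spin forever on a stale register value */
        while (!done)
            ;
    }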


Understood, but I think we're talking about different things (though this is very much not my area).

You're saying volatile is acting as a kind of memory barrier instruction for the compiler - got it. But I'm saying I understand that at the CPU level, just considering x86 instructions, writes don't have to be forced to RAM, despite a common assumption that they are; they can remain in caches. See johntb86's reply confirming this.


> Now describe everything that might occur during the execution of this statement.

Fun fact: with swap + loopback devices + FUSE/network filesystems, or with userfaultfd, arbitrary userspace code execution (including IO to remote machines) might occur during that one statement.


How does volatile bypass, for example, a TLB lookup or miss?


I didn't write the slides myself, but I think the implication is that the TLB is not consulted at all and the physical address is resolved again for every memory access. Of course, this is compiler/architecture/OS dependent, so YMMV. The point is mostly to convey "lots of stuff you probably didn't consider is going on in the background and may have a nontrivial impact on parallelism."


> I think the implication is that the TLB is not consulted at all

This is not true.


Could you clarify? I am merely a student of that class and we didn't discuss TLBs in detail so I'm all ears for details.


On an architecture with protected virtual memory (the one being described in the slides) there is no compiler control over the TLB; there is no mechanism for the compiler to bypass it. Neither in theory nor in practice do the semantics of volatile bypass anything on that list you have for the non-volatile case. It just isn't true. There must be some misunderstanding somewhere that can only be clarified viva voce.

If you are still in the class I'd love to hear a clarification - maybe I'm wrong!



