Don't you need some way of telling the compiler that you want barriers here? I think otherwise the helper thread could run on another CPU, and the two CPUs would operate on their own cached copies of foo. But then again, I'm not 100% sure how that works.
There are barriers for join. Without barriers, the risks are compiler reordering, the compiler lifting the variable into a register, and thread scheduling. The CPU cache would not be the direct cause of any "stale" reads. https://news.ycombinator.com/item?id=36333034
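To make the register-lifting risk concrete, here is a minimal C++ sketch of one way to rule it out when you can't rely on join. The thread doesn't say which language foo lives in, so C++ is an assumption, and `payload`, `ready`, and `publish_and_consume` are made-up names for illustration. The acquire load in the loop can't be hoisted out by the compiler, and the release/acquire pair orders the plain write to `payload` before the read of it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, not taken from the code under discussion.
int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag that publishes the data

int publish_and_consume() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread consumer([&seen] {
        // The acquire load must be re-executed on every iteration, so the
        // compiler cannot cache the flag in a register and spin forever.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        // Ordered after the producer's write by the release/acquire pair.
        seen = payload;
    });

    payload = 7;                                   // plain write...
    ready.store(true, std::memory_order_release);  // ...published by this store

    consumer.join();
    assert(!consumer.joinable());  // the thread has finished and been joined
    return seen;
}
```

If `ready` were a plain `bool`, this would be a data race, and the compiler would be free to read it once and never again.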
Well, I knew there were possible issues from both the compiler and the CPU. It seems you are right that the cache is kept coherent; however, there is another issue owing to out-of-order execution of CPU instructions. Either way, gpderreta is probably right that thread.join tells the compiler to make sure it's all taken care of correctly.
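For what it's worth, here is roughly what that looks like in C++ (an assumption on my part, since the thread doesn't name the language; `foo` stands in for the shared variable and `run_helper_and_read_foo` is a made-up name). In C++ the completion of the helper thread synchronizes with the return from `join()`, so a plain write in the helper is visible to the joining thread afterwards without any explicit barrier.

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-in for the shared variable discussed above.
int foo = 0;

int run_helper_and_read_foo() {
    foo = 0;  // written before the helper starts, so the helper sees it

    std::thread helper([] {
        foo = 42;  // plain, non-atomic write in the helper thread
    });

    // The helper's completion synchronizes with join() returning, so
    // everything the helper wrote happens-before the code below.
    helper.join();
    assert(!helper.joinable());

    return foo;  // no data race: the write is ordered before this read
}
```

Reading `foo` before the `join()` call would be a data race; it is the join itself that supplies the ordering for both the compiler and the CPU.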