> But C compilers are super careful to compile structured assembly style code without breaking it, even though they don’t formally promise to do so.
You have never worked on a C compiler, I take it. All production compilers will at some point turn even straight-line code into a DAG and relinearize it without care to the original structure of code. Any semantics beyond that which is given in the C specification (which are purely dynamic, mind you) are not considered for preservation. In particular, static intuitions about relative positions (or even the number!) of call statements to if statements or other control flow are completely and totally wrong and unreliable.
This is the main reason why people keep harping on "C is an abstract machine." When people think of machine code, there is an assumption about the set of semantics that are relevant to the machine code, and that set is often much, much broader than the set of semantics actually preserved by the C specification.
Structured assembly semantics are what C users in a lot of domains expect and production compilers like clang at -O3 will obey the programmer in all of the cases that are necessary to make that code work:
- Aliasing rules even under strict aliasing give a must-alias escape hatch to allow arbitrary pointer casts.
- Pointer math is treated as integer math. LLVM internally considers typed pointer math to be equivalent to casting to int and doing math that way.
- Effects are treated with super high levels of conservatism. A C compiler is way more conservative about the possible outcomes of a function call or a memory store than a JS JIT for example.
- Lots of other stuff.
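For comparison with the first bullet, here is a minimal sketch of the type-punning routes the C standard itself allows (character-type access and `memcpy`), as opposed to relying on how a particular compiler treats a direct pointer cast. The helper names `float_bits` and `first_byte` are made up for illustration, and the sketch assumes a 32-bit `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is 32 bits wide on this target (checked at compile time). */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: return the bit pattern of a float.
 * Reading the float through a uint32_t lvalue, i.e. *(uint32_t *)&f,
 * is what the effective-type ("strict aliasing") rules forbid.
 * Copying the bytes with memcpy is well defined. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Hypothetical helper: access through unsigned char is always permitted,
 * whatever the type of the underlying object. */
unsigned char first_byte(const float *f)
{
    assert(f != NULL);
    return *(const unsigned char *)f;
}
```

On a target with IEEE 754 single-precision floats, `float_bits(1.0f)` would be `0x3F800000`. Optimizing compilers commonly turn the `memcpy` into a plain register move, so the defined form need not cost anything.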
It doesn’t matter that the compiler turns the program into a DAG or any other representation. Structured assembly semantics are obeyed because of the combination of conservative constraints that the compiler places on itself.
But bottom line: lots of C code expects structured assembly and gets it. WebKit’s JavaScript engine, any malloc implementation, and probably most (if not all) kernels are examples. That code isn’t wrong. It works in optimizing C compilers. The only thing wrong is the spec.
…as I take it, for languages without undefined behavior.
> Structured assembly semantics are what C users in a lot of domains expect and production compilers like clang at -O3 will obey the programmer in all of the cases that are necessary to make that code work
No, they will not, and the people working on Clang will tell you that they won't.
> LLVM internally considers typed pointer math to be equivalent to casting to int and doing math that way.
LLVM IR is not C, but I am fairly sure that it is still illegal to access memory that is outside of the bounds of an object even in IR.
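To make the bounds point concrete at the C level (this says nothing about LLVM IR itself), here is a small sketch of which pointer computations the C standard defines. `elem_at` and `round_trip` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return a pointer to element i of an array of len ints,
 * or NULL if i is out of range.
 * For an array of len elements, base + i is defined only for 0 <= i <= len;
 * the one-past-the-end pointer (i == len) may be formed but not dereferenced.
 * Forming base + i for i > len is already undefined, even if never accessed. */
int *elem_at(int *base, size_t len, size_t i)
{
    assert(base != NULL);
    if (i >= len)
        return NULL;
    return base + i;
}

/* Hypothetical helper: the integer detour. Converting a valid void pointer to
 * uintptr_t and back yields a pointer that compares equal to the original.
 * The standard does not promise anything about arithmetic done on the
 * integer in between; that part is implementation-defined at best. */
int *round_trip(int *p)
{
    uintptr_t bits = (uintptr_t)(void *)p;
    return (int *)(void *)bits;
}
```

The bounds check in `elem_at` compares indices rather than pointers on purpose: computing the out-of-range pointer first and testing it afterwards would already be too late.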
> A C compiler is way more conservative about the possible outcomes of a function call or a memory store than a JS JIT for example.
Well, yes, because JavaScript JITs own the entire ABI and have full insight into both sides of the function call.
> That code isn’t wrong. It works in optimizing C compilers. The only thing wrong is the spec.
The code is wrong, and that it works today doesn't mean it will work tomorrow. If you think that all numbers ending in "3" are prime because you've only looked at "3", "13" and "23", would you say that "math is wrong" because it tells you that this isn't true in general?
There is a broad pattern of software that uses undefined behavior that gets compiled exactly as the authors of that software want. That kind of code isn’t going anywhere.
You’re kind of glossing over the fact that it is rare for the compiler to perform an optimization that is correct under C semantics but not under structured assembly semantics, because under both sets of rules you have to assume that memory accesses have profound effects (stores have super weak may-alias rules) and calls clobber everything. Put those things together and it’s unusual that a programmer expecting proper structured assembly behavior from their illegally (according to spec) computed pointer would get anything other than correct structured assembly behavior. Like, except for obvious cases, the C compiler has no clue what a pointer points at. That’s great news for professional programmers who don’t have time for bullshit about abstract machines and just need to write structured assembly. You can do it because either way C has semantics that are not very amenable to analysis of the state of the heap.
Partly it’s also because if the optimizations went further, they would break too much code. So it’s just not going to happen.
You say that this doesn't happen, and yet we have patches like this in JavaScriptCore: https://trac.webkit.org/changeset/195906/webkit. Pointers are hard to reason about, but 1. undefined behavior extends to a lot of things that aren't pointers and 2. compilers keep getting better at this. For example, it used to be that you could "hide" code inside a function and the compiler would have no idea what you were doing, but today's compilers inline aggressively and are better at finding this sort of thing. And it isn't just WebKit: other large projects have struggled with this as well. The compiler discarded a NULL pointer check in the Linux kernel (which I can't find a good link to, so I'll let Chris Lattner paraphrase the issue for me): http://blog.llvm.org/2011/05/what-every-c-programmer-should-... ; here's one where it crippled sanitization of return addresses in NaCl: https://bugs.chromium.org/p/nativeclient/issues/detail?id=24...
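The general shape of the discarded NULL check is easy to sketch: a pointer is dereferenced first and tested afterwards, so the compiler may assume it is non-null and drop the test. The code below is an illustration of that pattern and its well-defined ordering, not the kernel code; `struct packet` and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type, for illustration only. */
struct packet {
    int length;
};

/* The risky shape looks like this:
 *
 *     int len = p->length;      // dereference happens first
 *     if (p == NULL)            // compiler may treat this as always false
 *         return -1;
 *
 * Dereferencing a null pointer is undefined, so after the load the compiler
 * is allowed to conclude p != NULL and delete the check.
 *
 * Well-defined ordering: test the pointer before touching it. */
int packet_length(const struct packet *p)
{
    if (p == NULL)
        return -1;
    return p->length;
}

/* If the contract really is "p is never NULL", state it as a precondition
 * instead of writing a check that comes after the dereference. */
int packet_length_nonnull(const struct packet *p)
{
    assert(p != NULL);
    return p->length;
}
```

Under structured assembly intuitions both orderings look equivalent, because the load from address zero would simply trap or read whatever is mapped there. Under the C rules only the version above is guaranteed to keep its check.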