I'm interested in things like how the intermediate representation allows easier optimisations, the control flow graph optimisation algorithms, and the general optimisation passes. Can anyone suggest a good resource where I can learn this stuff well enough to implement it? Where do the people working on this compiler tech get their background/base knowledge from?
Generally speaking, the Rust language is quite rich, so it's difficult for the computer to reason about performance, but easy for it to reason about the high-level semantics. On the other end, the LLVM 'assembly' is very primitive, so it's easy to reason about performance, but hard to transform in a way that provably preserves the desired high-level semantics.
An intermediate language allows you to describe the high-level semantics in a more primitive form, so it is easier to both manipulate and reason about while preserving those semantics. In this case (as far as I understand it), MIR needs to be as primitive as possible while still modeling Rust's scoping, safety, and type rules. This layering is a design pattern that occurs quite frequently in compilers; a rough sketch of what such a lowering looks like is below.
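To make that concrete, here's a hand-written sketch of how a small Rust function might lower to a MIR-like form. This is schematic, not actual rustc output: named variables become numbered locals, control flow becomes explicit basic blocks, and `_0` is the return slot.

    // Source:
    fn square_if_positive(x: i32) -> i32 {
        if x > 0 { x * x } else { 0 }
    }

    // Schematic MIR-like lowering (illustrative only):
    // fn square_if_positive(_1: i32) -> i32 {
    //     bb0: {
    //         _2 = Gt(_1, const 0);                     // evaluate the condition
    //         switchInt(_2) -> [true: bb1, false: bb2]; // explicit branch
    //     }
    //     bb1: { _0 = Mul(_1, _1); goto -> bb3; }
    //     bb2: { _0 = const 0;     goto -> bb3; }
    //     bb3: { return; }
    // }

Everything is still typed and scoped the way Rust demands, but there are no nested expressions or implicit control flow left to get in the way of analysis.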
"but easy for it to reason about the high-level semantics."
Is it? Does it make it possible to reason about things such as whether two mappings can be fused, for example? I keep wondering about the possibility of very-high-level algebraic transformations of programs (although admittedly, this could be shifted into some kind of CASE tool instead).
It makes it easy to verify that the high-level semantics are only those allowed by the language. It's fairly easy to look at an AST and tell if it is a legal construct in your language.
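As a toy illustration (a made-up mini-language, nothing to do with rustc's actual AST), legality checking on a tree is a local, recursive matter:

    // "Legal" here just means no Add applied to a Bool.
    #[derive(Clone, Copy, PartialEq)]
    enum Ty { Int, Bool }

    enum Expr {
        Int(i64),
        Bool(bool),
        Add(Box<Expr>, Box<Expr>),
    }

    fn type_of(e: &Expr) -> Option<Ty> {
        match e {
            Expr::Int(_) => Some(Ty::Int),
            Expr::Bool(_) => Some(Ty::Bool),
            Expr::Add(a, b) => {
                if type_of(a)? == Ty::Int && type_of(b)? == Ty::Int {
                    Some(Ty::Int)
                } else {
                    None // illegal construct caught directly on the AST
                }
            }
        }
    }

    fn main() {
        let ok = Expr::Add(Box::new(Expr::Int(1)), Box::new(Expr::Int(2)));
        let bad = Expr::Add(Box::new(Expr::Int(1)), Box::new(Expr::Bool(true)));
        assert!(type_of(&ok).is_some());
        assert!(type_of(&bad).is_none());
    }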
> Is it? Does it make it possible to reason about things such as whether two mappings can be fused, for example?
It makes it possible to know that you are "fusing mappings." In assembly, you would have a hard time knowing that this is what you were doing, since it expresses the computation at too fine a granularity to recover that intent.
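For instance (a sketch in today's Rust, not anything MIR actually does), the equivalence you'd want to exploit is easy to state while the "two maps" structure is still visible:

    fn main() {
        let v = vec![1, 2, 3];
        // Two separate mappings...
        let a: Vec<i32> = v.iter().map(|x| x + 1).map(|x| x * 2).collect();
        // ...are equivalent to one fused mapping. At this level the
        // rewrite is easy to justify; once compiled down to assembly,
        // the "two maps" structure is gone and the transformation is
        // much harder to recognize.
        let b: Vec<i32> = v.iter().map(|x| (x + 1) * 2).collect();
        assert_eq!(a, b);
    }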
Interesting. From what I understand, lifetime resolution is pretty deterministic: mallocs always happen at the beginning of a lifetime and frees always happen at the end of it. Would it be possible during the MIR stage to perform analysis and optimization to modify lifetimes? (e.g. array preallocation, chunked frees, etc.)
Ownership and borrowing are actually two complementary systems, not one. Because of the ownership/initialization rules Rust has, we know that mallocs occur when the object is constructed and that the data is freed when it goes out of scope. But for the borrow checker, dynamic memory has really been abstracted away - the borrow checker is guaranteeing the validity of pointers into the stack just as much as pointers into the heap. In fact, a language could have ownership without having borrowing (at one point, Rust did).
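For example (a minimal sketch), the borrow checker rejects this even though no heap allocation is involved anywhere:

    fn main() {
        let r;
        {
            let x = 5; // a plain stack value, no malloc in sight
            r = &x;    // error[E0597]: `x` does not live long enough
        }
        println!("{}", r);
    }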
The main thing performing borrow checking on MIR allows for is actually a more expressive lifetime system, particularly what are called single entry, multiple exit (SEME) or non-lexical lifetimes - that is, knowing that in one branch a borrow lasts longer than in another branch. The classic example of how this is useful is that the naive way to write "get or insert" for a hashmap is a borrow checker error today. With non-lexical lifetimes it won't be an error, because the compiler will know that you only perform the "insert" when you didn't get a reference back from the "get."
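A minimal sketch of that example (hypothetical function names; note that the trickier variant that returns the reference out of both arms needs more than non-lexical lifetimes alone):

    use std::collections::HashMap;

    fn report_or_insert(map: &mut HashMap<u32, String>, k: u32) {
        // Lexical borrow checking treats the shared borrow from `get`
        // as alive for the whole match, so the `insert` below conflicts
        // with it. Non-lexical lifetimes end the borrow at its last use
        // (the Some arm), and this compiles.
        match map.get(&k) {
            Some(v) => println!("already present: {}", v),
            None => {
                map.insert(k, String::from("default"));
            }
        }
    }

    fn main() {
        let mut map = HashMap::new();
        report_or_insert(&mut map, 1);
        report_or_insert(&mut map, 1); // second call hits the Some arm
    }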
It's possible that, because MIR carries stronger guarantees than LLVM IR can normally assume, there will be optimization passes that make sense on it - reordering code paths the way you were suggesting, for example - but I don't know.
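One illustration of the kind of guarantee that could be exploited (illustrative reasoning only, not a description of an actual rustc pass): Rust's rules say a `&mut` can't alias anything else, which licenses rewrites a C compiler couldn't make without aliasing information.

    // Because `a: &mut i32` and `b: &i32` are guaranteed not to alias,
    // an optimizer may assume `*b` is unchanged by stores through `a`
    // and fold the two loads of `*b` into one.
    fn example(a: &mut i32, b: &i32) -> i32 {
        *a = 1;
        let x = *b; // cannot have been changed by the store to *a
        *a = 2;
        let y = *b; // optimizer may reuse `x` here
        x + y
    }

    fn main() {
        let mut n = 0;
        let m = 7;
        println!("{}", example(&mut n, &m)); // prints 14
    }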