Does anyone know of literature describing why compilers can't do these optimizations on top of vanilla code, without requiring an extra DSL? Is there just no ergonomic way to express the extra information about what operation orderings are optimal or allowed for specific platforms to the compiler?
Give or take some nitpicking about where a language extension stops and a DSL begins, those things are the same; the only question is the syntax. The key thing the referenced paper shows is a set of composable operations (an extremely nice property to have in a language).
There are other solutions. E.g., Polly [0] is capable of creating the 6-loop tiled square matrix multiplication routine from the vanilla 3-loop implementation. Much like auto-vectorization though, there are limits. E.g., if changing the scheduling also requires a change in your data representation -- like AoSoA rather than AoS or SoA to fit collocated data into a cache line -- you'll have to manually write the scheduling code as well since a vanilla loop no longer indexes the right things.
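To make the Polly point concrete, here is a sketch (names and the tile size are mine, and T = 64 is illustrative rather than tuned) of the vanilla 3-loop matrix multiply next to the kind of 6-loop tiled version a polyhedral optimizer can derive from it automatically:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Vanilla 3-loop square matmul: C += A * B, row-major N x N.
void matmul_naive(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// The 6-loop tiled shape a polyhedral optimizer like Polly can produce:
// iterate over T x T tiles, then within each tile, so the working set
// stays in cache. std::min handles the ragged edge tiles when N % T != 0.
void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t N) {
    const std::size_t T = 64;  // illustrative tile size, not tuned
    for (std::size_t ii = 0; ii < N; ii += T)
        for (std::size_t kk = 0; kk < N; kk += T)
            for (std::size_t jj = 0; jj < N; jj += T)
                for (std::size_t i = ii; i < std::min(ii + T, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + T, N); ++k)
                        for (std::size_t j = jj; j < std::min(jj + T, N); ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```

The two compute identical results; only the iteration order (and hence cache behavior) differs, which is exactly why a compiler pass is allowed to do this rewrite on its own.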
One benefit of making a 'properly new' language is that you don't have to support old code; I've definitely seen a lot of cool optimizations dropped because while they'd help in one place, they'd blow up a loop somewhere else, and ultimately there's only so far you can go with using heuristics to make your optimization apply to one but not the other. Sometimes (e.g. taking advantage of UB) this can even be functional -- 'breaking' some pieces of non-conformant code in exchange for getting better performance elsewhere. This is especially relevant in broad-domain languages like C/C++; Linus in particular has opined multiple times against the GCC devs choosing to implement new optimizations that ended up breaking code which had been working fine in Linux for years up to that point.
You can quite often write C code that compilers will autovectorize into what you want, if the thing in question is a simple case, or if you put effort into doing the non-trivial aspects yourself.
E.g. for the loop reordering example [1], in C you could just, well, manually reorder the loops, and the simple loop would autovectorize to AVX2 just fine.
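As an illustration of that kind of manual reordering (my own example, not the one from the article): swapping the j and k loops of a row-major matmul makes the innermost loop stride-1 over both B and C, which compilers at -O3 with AVX2 enabled will typically vectorize without any hints.

```cpp
#include <cstddef>

// ikj ordering: the innermost j loop reads B[k*N + j] and writes C[i*N + j]
// with unit stride, so the compiler can autovectorize it. The naive ijk
// order instead strides B by N in the inner loop, which blocks vectorization.
void matmul_ikj(const double* A, const double* B, double* C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t k = 0; k < N; ++k) {
            const double a = A[i * N + k];
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += a * B[k * N + j];
        }
}
```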
The things that are messy-to-impossible to achieve in current traditional compilers are things where the compiler thinks it does a smart transformation, but it ends up making things worse.
Here’s a short example to show how this gets hard.
Suppose you add the ability to do point-wise arithmetic on C++ vectors by implementing operator+ and the like.
Let’s compute A = (B + C) * D
The addition result is stored in a temporary vector and then traversed again for the point-wise multiplication.
It would be far better to avoid the temporary object (and its associated memory allocation and cache pressure) and compute everything at once before moving on to the next index.
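A minimal operator set (my own sketch) that exhibits exactly this problem: each operator allocates and fills a brand-new vector, so (B + C) * D materializes a full-size temporary for B + C and then walks all the memory a second time.

```cpp
#include <cstddef>
#include <vector>

// Naive point-wise operators: each call allocates a fresh result vector.
std::vector<double> operator+(const std::vector<double>& x,
                              const std::vector<double>& y) {
    std::vector<double> r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] + y[i];
    return r;
}

std::vector<double> operator*(const std::vector<double>& x,
                              const std::vector<double>& y) {
    std::vector<double> r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] * y[i];
    return r;
}

// Usage: the inner (B + C) is a hidden heap allocation plus an extra
// full pass over memory before the multiply even starts.
// std::vector<double> A = (B + C) * D;
```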
That becomes ridiculously non-trivial to do ergonomically in C++ without extensive template metaprogramming, with little help from the compiler.
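The classic metaprogramming workaround is expression templates: the operators return lightweight proxy objects that merely record the computation, and the actual loop runs once, fused, when the expression is materialized. A stripped-down sketch (all names mine; real libraries like Eigen do this with far more machinery):

```cpp
#include <cstddef>
#include <vector>

// A lazy expression node: holds references to its operands and evaluates
// element-by-element on demand. No intermediate vector is ever allocated.
template <typename L, typename R, typename Op>
struct Expr {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return Op::apply(lhs[i], rhs[i]); }
    std::size_t size() const { return lhs.size(); }
};

struct Add { static double apply(double a, double b) { return a + b; } };
struct Mul { static double apply(double a, double b) { return a * b; } };

// Operators build expression nodes instead of computing anything.
template <typename L, typename R>
Expr<L, R, Add> operator+(const L& l, const R& r) { return {l, r}; }
template <typename L, typename R>
Expr<L, R, Mul> operator*(const L& l, const R& r) { return {l, r}; }

// Materialize the whole expression tree in a single fused loop.
template <typename E>
std::vector<double> eval(const E& e) {
    std::vector<double> out(e.size());
    for (std::size_t i = 0; i < e.size(); ++i) out[i] = e[i];
    return out;
}

// Usage: one loop, no temporary for (B + C).
// std::vector<double> A = eval((B + C) * D);
```

Note the expression nodes hold references, so the whole thing must be consumed within the full expression that builds it; getting lifetimes, mixed types, and readable error messages right is exactly the "extensive template metaprogramming" tax mentioned above.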
One might even argue that C/C++ semantics, in terms of sequence points and the like, forbid a hypothetical compiler from reordering function calls across loops in this way.
Of course, the juicy arithmetic goes far beyond pointwise operations too…
Yeah, HPC is still the domain of humans for the time being.
idk if Mojo is 'the answer', but eventually a compiled language with first-class tensors and concurrency will be necessary to simultaneously support heterogeneity, performance, and usability. https://chapel-lang.org is the only other one I know of that's currently in active development (?)