The Death of Optimizing Compilers

danso · on March 14, 2015

On the topic of compiler optimization, Peter Seibel's interview with Fran Allen, a Turing Award recipient for her "pioneering contributions to the theory and practice of optimizing compiler techniques", is a good read:

http://books.google.com/books?id=nneBa6-mWfgC&lpg=PA9&pg=PA5...

She considers the development of C to be a "big blow", saying that "C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers...are basically not taught much anymore in the colleges and universities"

Her whole interview is great...in fact the whole book is fantastic, one of the best books about programming I've ever read.

Animats · on March 14, 2015

"C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine."

She's right. As I occasionally point out on why C programs are so buggy, the three big problems in C are "how big is it", "who owns it", and "who locks it". The language doesn't help with any of those. Those problems also inhibit optimization.

One optimization that's been lost to history is optimizing subscript checking. This was done in some Pascal compilers, and about 95% of subscript checks can be optimized out. Where there's a loop involved, it's often possible to make one check at loop entry, rather than a check at every reference. Sometimes that one check can be proven true from FOR statement bounds, and can be eliminated. This is usually the case for numeric matrix work, where it really matters.

I had a talk with one of the designers of Rust about this last Wednesday. They're not optimizing subscript checks yet. They hope to do so. They're currently limited by being a front-end to LLVM, which doesn't really comprehend subscript checks. The Go compiler does some subscript check optimizations, but only on FOR statements of restricted form.

C, Go, and Rust all lack multidimensional arrays. This prevents a whole range of array-related optimizations which FORTRAN compilers have had all the way back to Backus's first FORTRAN compiler in 1956. This is part of why the supercomputer crowd still uses FORTRAN.

I was trying to push the Go crowd towards multidimensional arrays. People in love with the slicing syntax wanted to extend it to multidimensional arrays. There were long arguments over how far to take that. Should you be able to slice out an arbitrary sub-hypercube of an N-dimensional array? If you allow that, you have to have a slice representation which is general but inefficient. After much discussion, nothing happened.

Meanwhile, the number-crunching crowd is trying to compile Matlab, which has multi-dimensional arrays but not much else.

sitkack · on March 14, 2015

So much is right, in many senses about FORTRAN and SQL. Both are expressive and have a limited enough semantics that allows them to be very aggressively optimized for the underlying hardware and data.

I have removed thousands of lines of Python by putting the entire working set into sqlite, transforming and projecting it relationally. Optimizations are orthogonal to the algorithm, mostly by adding and index, partition or view.

Halide is a very interesting piece of research that makes this decoupling of algorithms and schedules a core tenet.

http://people.csail.mit.edu/jrk/halide12/

I'd really like to see a relational set theoretic fork of Rust with some Fortran thrown in. I have no idea how they would all mix, but goal seems interesting.

CyberDildonics · on March 14, 2015

Julia is definitely worth a look as it has taken on some of these issues (multi-dimensional arrays, slicing, some bounds checking). Now that it has a good garbage collector its remaining achilles heals are multi-threading (there are multiple processes that can send messages though) and long start up times for some modules (no JIT caching yet). It does make a lot of really good decisions at the language level though and is faster than many native languages while being more flexible than many scripting languages.

cbsmith · on March 14, 2015

> One optimization that's been lost to history is optimizing subscript checking.

Depends upon how you look at it. Most JIT compilers do this.

kibwen · on March 15, 2015

It's not a priority for Rust to hoist subscript checks in loops, because Rust uses iterators pervasively and iterators are guaranteed to check the bounds of the array no more than once.

dbaupp · on March 16, 2015

LLVM (used by rustc) is perfectly capable of hoisting and eliminating subscript checks, although I believe rustc doesn't necessarily drive LLVM in the optimal way, yet: https://github.com/rust-lang/rust/issues/22987

E.g. this has no bounds checks, and is in fact vectorised:

  pub fn bad_sum(x: &[i32]) -> i32 {
      let x = x; // *
      let mut sum = 0;
      for i in 0..x.len() {
          sum += x[i]
      }
      sum
  }

The line marked * is needed because of https://github.com/rust-lang/rust/issues/22924 which is caused by https://llvm.org/bugs/show_bug.cgi?id=22786 .

Animats · on March 15, 2015

Write a matrix multiply. Or a linear equation solver.

pcwalton · on March 19, 2015

The bounds checks for matrix multiplication over fixed length arrays are completely optimized out by LLVM, and LLVM will also optimize the bounds checks for non-fixed-length arrays in many cases. The slice length is an SSA value, so it's subject to constant propagation and SCCP.

Lofkin · on March 15, 2015

Julia and Python Numba are two fast (almost native) high level languages for multdimensional array programming.

DannyBee · on March 14, 2015

Fran is right in the sense that C is a very very hard language to do these things for. But she's wrong in the sense that it caused massive amounts of funding to be poured into trying to make it happen anyway.

It's like saying Itanium was a complete failure for compiler optimization.

It is in the sense that they couldn't get compilers that produced good enough code for it (well, the could, just not compilers that ran fast).

But the amount of research and funding used to try to make it happen actually did advance the state of the art in compilers. The techniques we learned weren't all useless, and we discovered a lot about the limits of what can be done.

Fran, btw, is one of the smartest people you'll ever meet. Just last year I discovered, through some IBM Research colleagues, a really great algorithm that she came up with for region analysis, that solves a lot of the problems of common ones, cleanly, and nobody has rediscovered it in 35 years yet. (Sadly, this falls into the "i wish they had published more papers on stuff" area)

beagle3 · on March 14, 2015

> It is in the sense that they couldn't get compilers that produced good enough code for it (well, they could, just not compilers that ran fast).

Can you elaborate on that? My last update is from 2002, from an Intel guy who was working on it and said they couldn't. I was under the impression that no much has changed since - but perhaps I am wrong? (and perhaps I misunderstood him in 2002)?

DannyBee · on March 15, 2015

They could do it by ~2005.

In any case, it's always been possible. At worst, you make large integer linear programming based models of what you want to happen, and solve them. You can do optimal register allocation and scheduling, etc.

You can basically shove everything you want into the ILP model.

It's just going to take a ridiculous amount of memory and time to solve.

Elzair · on March 14, 2015

Can you share this algorithm?

DannyBee · on March 16, 2015

It was given to me by Kenny Zadeck in the form of an an unpublished chapter of a book that they had been working on.

I was asked not to reshare it because of that, so i haven't.

The book, sadly, still has not been published.

DannyBee · on March 14, 2015

I'd be interested to see this talk, because as far as i know, it's pretty far off base.

1. He says" As computation has become cheaper, users have correspondingly expanded the volume of data that they are handling, and optimization remains a critical challenge for the occasional "hot spots" in the code.""

Except, uh, a lot of people have applications whose profiles are mostly flat, because they've spent a lot of time optimizing them. We build optimizing compilers that can speed up these apps anyway.

2."Have compilers become so smart that they automatically turn clean high-level code for these hot spots into optimized code, removing the need for humans to be optimization experts? The reality, unfortunately, is very much the opposite: general-purpose "optimizing" compilers are falling farther and farther behind the actual capabilities of modern processors."

This is just flat out wrong.

That said, Daniel is a very smart (and obviously pretty opinionated :P) guy, so i'm really interested to see what he has to say.

haberman · on March 14, 2015

I don't know Dan's work/thinking well enough to anticipate what he might talk about. But if someone wanted me to speak on this prompt, I would talk about modern examples where optimizing compilers significantly underperform hand-written assembly:

- LuaJIT's interpreter (here is Mike Pall's famous post giving a detailed explanation of why this is, incidentally prompted by a question I asked: http://lua-users.org/lists/lua-l/2011-02/msg00742.html)

- x264, which has extensive acceleration implemented in assembly for a/v codecs (they even built their own assembly abstraction layer: http://x264dev.multimedia.cx/archives/191)

- GMP, which has extensive acceleration implemented in assembly for arbitrary-precision arithmetic.

I wouldn't go so far as to argue that any of this implies the death of optimizing compilers, so I will be very interested to see what Dan has to say.

nkurz · on March 14, 2015

  > The reality, unfortunately, is very much the opposite: 
  > general-purpose "optimizing" compilers are falling farther 
  > and farther behind the actual capabilities of modern 
  > processors.

  This is just flat out wrong.

Saying this is wrong implies that the gap between ideal processor-specific assembly and generated code is closing, and that compilers today can achieve a higher percentage of potential performance than they could in the past. Do you have evidence that suggests this is true? Are you possibly conflating "optimizing the code as written including all unintentially specified corner cases" with "optimizing the production of the desired answer"?

In my personal experience (which is mostly compression and scientific computing), I find that I can get significantly higher x64 performance from Haswell than from earlier Intel generations --- frequently more than twice as fast as I can do for Sandy Bridge, due to a combination of better memory interactions, fewer scheduling gotcha's, and wider integer vectors. Very rarely do I find that compiler generated code (Clang, GCC, icc) achieves this level of speedup, and even more rarely do I find that the same code gets optimal performance on both of these generations. Even more rarely does the generated assembly surprise me by exceeding my expectations of maximal performance.

I'm not sure where Daniel is going with his presentation, but what I often find myself wanting is a non-brain-dead non-optimizing compiler. It makes me sad to know how to make code run fast, but be unable to keep the compiler from slowing it down. Obviously, this is a rare case, since only a tiny percentage of code will ever be allowed to be this highly optimized. But I sure wish that there was a way to return to "glorified assembly", and that I had a way to tell the compiler to keep its grubby little fingers off certain small sections of my consciously crafted code.

DannyBee · on March 14, 2015

"aying this is wrong implies that the gap between ideal processor-specific assembly and generated code is closing, and that compilers today can achieve a higher percentage of potential performance than they could in the past. Do you have evidence that suggests this is true? "

Yes. We now do things like http://drona.csa.iisc.ernet.in/~uday/publications/pluto+.pdf

Which will fully, automatically, parallelize and optimize loop nests to the target architecture, including tiling, interchange, cache blocking, blah blah blah Basically anything you can think of. You can't do this by hand. You can do something by hand, maybe tiling, maybe unrolling, maybe interchange but the math is too complex for you to get it right in the general case and to combine all these things at once. We couldn't do this 20 years ago. We couldn't even approach the performance of most architectures.

We can also vectorize whatever you want. The only hard part is calculating the cost vs benefit.

Past that, now we actually have the opposite problem. We come so close to optimal on most architectures that we can't do much more without using NP complete algorithms instead of heuristics. We can only try to get little niggles here and there where the heuristics get slightly wrong answers. We decided we care more about compile speed than doing stuff like this.

"Very rarely do I find that compiler generated code (Clang, GCC, icc) achieves this level of speedup, and even more rarely do I find that the same code gets optimal performance on both of these generations."

I don't know your code, so i really can't speak to this at all, but it's the opposite experience for us. In fact, moving platforms now makes little performance difference. Getting the compiler up to snuff for that architecture does. If what you say is true, it just tells me you need to spend a small amount of time figuring out why the compiler is doing the wrong thing. It's almost certainly just a small bug or tweak somewhere.

gnufx · on March 15, 2015

Which compiler is this which can, for instance, take Netlib LAPACK and run serial Linpack as fast as OpenBLAS on recent x86_64? (Other common hotspots are available.) Enquiring HPC minds want to know.

nkurz · on March 14, 2015

Yes. We now do things like http://drona.csa.iisc.ernet.in/~uday/publications/pluto+.pdf

Thanks, I'll read this over.

In fact, moving platforms now makes little performance difference.

Interestingly, we are saying the same thing but coming to opposite conclusions. I agree that compiler generated code produces approximately the same performance on Sandy Bridge vs Haswell. But I also know that the performance potential on Haswell can be much higher than that of Sandy Bridge, thus see this as a drop in compiler performance. Whereas you seem to be concluding that because the compiler achieves comparable performance, there has been little hardware improvement across generations.

We decided we care more about compile speed than doing stuff like this.

I've never really understood the emphasis on compile speed --- it seems misguided. Sure, you don't want to wait for the compiler to slowly redo the same suboptimal "optimization" on every recompilation, but shouldn't there be some way to cache the previous "best" so you can start from there? And wouldn't it be more efficient if there was some way that you could specify to the compiler where you want it to spend its time, maybe even offline so it won't hold up everything else? Not just the spectrum from "don't optimize and while you are at it please pass all function variables on the stack" to "do the best you can as long as it doesn't take longer than a millisecond", but "do what it takes and I'll tell you when it's good enough"?[1]

It's almost certainly just a small bug or tweak somewhere.

Perhaps I can try to send you an example the next time I come across a compact example. I'm dubious that increasing the automatization is the right approach --- what I think I want is easier access to "manual mode". And I'm equally dubious that it's going to be an easy fix, but would be wonderful it it was.

As a quick example of the sort of issue, Haswell allows two 32B loads and one 32B store per cycle, and has three address generation ports to support this. Two of the AGP's can be used for stores or loads, and one (Port 7) is only for stores. But Port 7 can only generate "simple" addresses --- fixed offsets from a register.

I've yet to find a way to convince a compiler to generate two loads and a store that can be executed in a single cycle without resorting to assembly. Effectively, the cost of an instruction varies depending on what other instructions are executed the same cycle. They can't be scored independently, and the compiler gets stuck in a "local optimum" rather than finding the "global optimum".

Just the ability to say "No, really, I want both of these to be loop variables, and this one to be incremented, and this one to be decremented, and I want the loop to terminate when the decrement hits zero" would be tremendous. Or a least, I'd like to be able to specify this as a starting point for optimization, and have some guarantee that the compiler won't "simplify" by combining the registers unless the "simplification" actually has better performance.

[1] I feel like Dan Luu's excellent and under-discussed article on Software Testing is very applicable here: http://danluu.com/testing/ Like software testing, compilers seem to have settled into the strange expectation that they should do the same thing over and over again very quickly, rather than doing something once and well.

vardump · on March 16, 2015

I pretty much share your sentiment on this subject. My conclusion is was to play with specialized JITs. Although so far my attempts have been more like code generators that just concatenate instructions and take care of loop target alignment and so on. SSE2/SSE4.2/AVX2 depending on runtime CPU architecture. Achieved performance has been very good. There seems to be a huge potential, shame I have so little time to work on this.

Large pages (2MB+) can sometimes contribute a nice amount of extra performance, of course depending on access patterns. It can also have negative effect under some circumstances, like some very random access patterns. Gigabyte pages could help there, but support isn't great.

Other thing I've investigated is memory channel interleaving. Local memory seems to be mostly 64 bytes each channel round robin, but I guess it can be more complicated too. NUMA systems seem to be either round robin 4096 byte per NUMA region or all CPU local memory in one multi-gigabyte (?) chunk. Understanding memory interleaving can help balancing the work between different memory channels.

sitkack · on March 14, 2015

Isn't what you want is a way for the compiler to find the fast encoding? http://en.wikipedia.org/wiki/Superoptimization

It would be interesting to have an EDSL for representing compiler optimizations that could be tweaked on a per program basis.

While we are at it, maybe swapping out cache associativity, write back semantics and the branch predictor.

nkurz · on March 14, 2015

Yes, that would be great, but I've yet to find a superoptimizer that lives up to its claims: https://news.ycombinator.com/item?id=5509254 Possibly the gap between predicted and actual performance is just too large? But CPU-internal performance monitors are sufficiently advanced that it seems reasonable that one could check actual performance rather than relying on flawed heuristics.

One application I'm thinking about right now is how to optimize "interpolative coding", which is a very tight way of compressing inverted indexes (better than Golomb/Rice/Elias), but so far very slow. It's not immediately obvious how vectorization can make it faster, but possibly some use of the BMI instructions should help, perhaps combined with a big lookup table. It's hard even to get close enough to the right form of solution that a superoptimizer can make headway.

sitkack · on March 14, 2015

Cool application. I had read about FM-Indexes a little bit, I find succinct data structures fascinating. Reading up on the new BMI instructions [-1] and "interpolative coding".

I was wondering if something like, "ISDL: An Instruction Set Description Language for Retargetability" [0] would help? It would be fun to create a combinatoric instruction sequencer and then run them all in an attempt to find some sort of probabilistic heuristic about instruction costs and efficiencies. Something like FFTW, but it analyzes then entire universe of possibilities. Some of the research in synthesizing new instruction sets or operating on compressed instructions might attack the problem from the other side [1].

Stay Fresh!

[-1] https://chessprogramming.wikispaces.com/BMI1

[0] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.8...

[1] http://www.well.com/~cwf/pro/vita.html#Books (look for 'compress')

chubot · on March 14, 2015

Regarding #2, it's conventional wisdom that you can't beat a compiler, but I'm sure DJB is going to show examples where he beats the compiler significantly.

That doesn't surprise me for two reasons: 1) DJB is a great programmer. 2) Compilers are so complex that there are lots of non-obvious reasons that they miss optimizations.

I'm not an expert, but it appears to me that, in the last 10 years, processors have gotten complex at higher rate than compilers have gotten better. There's just more room to do nontrivial things that have come up only recently.

I found this interesting portable ASM DSL on DJB's site: http://cr.yp.to/qhasm.html

Honest question: Can you write C that compiles to what his qasm programs do with a modern C compiler? (I suspect the answer is no, because DJB writes tons of C and wouldn't have written it othewise)

Older perf notes on his site: http://cr.yp.to/hardware/x86.html

On the other hand, DJB writes a very particular type of software (crypto and mathematical code lately), so I suspect his experience won't be that generalizable. But it will be correct and important, because his niches are important.

Dewie · on March 14, 2015

> Regarding #2, it's conventional wisdom that you can't beat a compiler, but I'm sure DJB is going to show examples where he beats the compiler significantly.

Is the fact that a compiler can beat you sometimes[1] really that controversial?

And he better have a large class of problems in which he can beat the compiler, without using too much time[2]. Or else the title is just needlessly provocative.

And thirdly: many of just rely on the compiler doing a good enough job. Not a better job than we would do if we were more careful.

[1] Or often times. Depends on who you ask, and in what domains.

[2] Or else you can just say that a programmer can beat a compiler given enough time, where time can be arbitrarily long. But all of us have some kind of time constraint.

mpweiher · on March 14, 2015

Hmm...the abstract matches my experience pretty closely.

A lot of the non-critical code is perfectly fine, performance-wise, in bytecode interpreted Squeak Smalltalk or even Ruby. On the other hand, perf critical code tends to require re-thinking your abstractions and often your layering, which can give you several orders of magnitude of performance improvement (I've frequently had 10x - 1000x).

So I don't need the optimizer for the first part, and it doesn't really help me with the second part. It can and sometimes does take out some of the drudgery, but it can also get in the way by being somewhat unpredictable.

What I find interesting is the approach talked about by VPRI of allowing manual optimization specifications sitting next to high-level code (some of the earliest AOP examples were like that). Less magic, more control, much more predictability and power.

vardump · on March 14, 2015

2 is very much true. Compilers are pretty clever using general purpose registers, but falling behind when it comes to using SIMD vectorization in non-vector context. A human "compiler" can beat best compilers by 2-10x pretty often by just using new CPU features compilers have yet to "learn".

readerrr · on March 14, 2015

On 2.: Maybe he is talking about branch prediction vs lack of this feature in compilers. What do I mean by that is that optimizing compilers( at least in c ) don't optimize based on measurements from actual programs. A compiler that would run the program for a certain amount of time, to profile, could find the weak spots and optimize.

ericfontaine · on March 14, 2015

In addition to GCC, a bunch of other compilers including Intel and Visual Studio do profile based optimization, as well as many dynamic compilers, such as firfox javascript or Java VMs. Profil-guided opitmization has been studied for decades: https://en.wikipedia.org/wiki/Profile-guided_optimization

mansr · on March 14, 2015

GCC can do that.

readerrr · on March 14, 2015

You are mistaken, unless you can prove me wrong. Care to elaborate?

thoughtpolice · on March 14, 2015

GCC has supported `-fprofile-generate` and `-fprofile-use` for profile-guided optimization since (at least[1]) GCC 3.4, which was released over 10 years ago:

https://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Optimize-Option...

Firefox is an example of an application that ships production binaries with PGO.

[1] It was actually supported earlier than this, but I didn't search further; the 3.4.0 manual says "New -fprofile-generate and -fprofile-use command-line options to simplify the use of profile feedback", implying it was there before in a less convenient form.

fche · on March 14, 2015

Or rather, "I may be mistaken in my grand assertion, but I'll place the burden of proof on you".

dalke · on March 14, 2015

I assume mansr refers to profile guided optimizations in gcc and many other compilers. See http://en.wikipedia.org/wiki/Profile-guided_optimization.

ericfontaine · on March 14, 2015

I would add for 2. that there are newer architectures, especially low power and various experimental parallel ones, that could use better optimizing compilers.

jeffreyrogers · on March 14, 2015

For those who aren't aware, this is an abstract for a talk to be given next month by Daniel Bernstein, who has written several very high quality C programs including qmail, daemontools, and djbfft, which is a very fast library for computing FFTs, so he presumably knows what he is talking about.

amelius · on March 14, 2015

I think what we need is a programming language that addresses performance orthogonally from correctness/functionality. What this means is that a programmer could write his code in a very easy-to-understand functional style (say), and then he could write a library of rewrite rules that allows the program to be executed more efficiently.

Dewie · on March 14, 2015

That sounds like library writers being able to bake in optimizations (rewrite rules) into their libraries. Which makes sense; you distribute the optimizations to domain experts (like data structures) instead of just having general optimizations that the compiler writers had time to add.

mike_hearn · on March 14, 2015

And the counter argument:

http://www.chrisseaton.com/rubytruffle/cext/

dalke · on March 14, 2015

I read it, but I don't see the counter-argument. It seems to say that people who want a C extension for Ruby, for performance reasons over native Ruby, can use a system which dynamically compiles the C extension code.

While this proposal seems to be that fewer and fewer programmers know how to optimize against the machine.

mike_hearn · on March 15, 2015

So optimising compilers are becoming more important, not less, then.

Edit: to clarify, the interesting thing about that blog is not the cross language interop magic, though that is pretty cool, it's that a smart compiler that can compile C and ruby together simultaneously can delete vast swathes of abstraction and overhead to produce radical speedups, just through better optimisation. So whilst things might not be moving very fast in the world of more traditional compilers, research compilers with a focus on dynamic languages are still showing big speedups. It's a bit early to proclaim the death of optimising compilers just yet, though perhaps I've just misunderstood the abstract. I'm not sure the latest hanging fruit for compiler developers is using more and more exotic machine instructions

dalke · on March 15, 2015

I view it as two different things.

There are many optimizations which work on a simple machine model that hasn't changed much since the VAX. There's memory with constant time lookup, some registers, and assembly code.

Real hardware changed long ago. As a simple example, on traditional hardware it's best to save and re-use intermediate values, but on modern hardware RAM lookup is so slow that it can be faster to recompute rather than doing a random lookup. An optimizing compiler isn't going to do much if the code wasn't designed to match the hardware, or expects a different hardware model.

While lowering the impedance mismatch between two systems, whether it be two different languages as in your example, or serialization/deserialization, or call thunking, belongs to an old and well-understood set of optimizations which are nearly insensitive to the machine model.

RAM lookup, and cache performance, and cache lines, are important for high performance code. Though as the abstract says, most people only work with "freezingly cold" code, so aren't aware of the limitations in the conceptual machine model vs. the actual machine. Fixing this almost always requires deep changes to the code, which an optimizing compiler can't do.

101914 · on March 14, 2015

"...most users still spend time waiting for computers."

ufo · on March 14, 2015

I'm only seeing the abstract here. Is there a link for the full tutorial?

morcheeba · on March 14, 2015

The presentation hasn't happened yet. It will take place at ETAPS 2015: 11-18 April 2015, London, UK

Dewie · on March 14, 2015

I don't have much faith in future processors if it really is impractical to make optimizing compilers for them.

Animats · on March 14, 2015

Well, that's happened in the past. Itanium and the Cell made it to volume production while presenting extremely hard problems for a compiler.

I once met the group from HP who were trying to write an optimizing Itanium compiler. It wasn't going well at all. The Itanium could execute several instructions simultaneously, but the compiler had to block them together and specify this. About a hundred instructions at a time had to be scheduled together to get decent optimization.