Tail calls are a very common optimization; both Clang and GCC have been performing it successfully for a while. What is new is getting a guarantee that applies to all build modes, including non-optimized builds.
If you're interested in this optimization for performance reasons, why would you want an otherwise non-optimized build? It seems that the only important case is the optimized build... where for some reason you're not getting this optimization without explicitly asking for it.
So the question remains... why is the compiler missing this chance to turn the call into a tail call unless it's explicitly marked for optimization?
Are you saying that the achievement outlined in the article is making sure that this optimization is turned on in debug mode?
Because I thought the entire point of the article was that they got a marked speed-up in production mode, not in debug mode. And I've yet to see (what must be obvious to everyone else) why the compiler is missing this obvious optimization in production mode without the explicit markup.
And if the compiler is not missing this optimization in production mode, why is the article saying they've sped up their code by explicitly marking it for optimization?
The C programming language does not provide any guarantees that tail calls will be optimized out in this way. Compilers are free to do it if they want to and it doesn't violate the "as-if" rule, but they are neither required to do it, nor required not to do it.
However, if you can statically GUARANTEE that this "optimization" happens (which is what this new Clang feature does), it opens up a new kind of programming pattern that was simply not possible in C before; it was only really possible in assembly. The reason you couldn't do it in C is that if the compiler decides not to optimize the tail call (which it is free to do), every call adds a frame to the stack, and you will get a stack overflow pretty much instantly.
This MUSTTAIL thing is basically an extension to the C programming language itself: if you can guarantee that the tail call happens, you can structure your code differently because there's no risk of blowing the stack. Another language that actually does this is Scheme, which guarantees that tail calls are eliminated, for the very same reason. If a Scheme interpreter doesn't do this, it's not a valid Scheme interpreter.
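For reference, in Clang (13 and later) the guarantee is spelled as a statement attribute on the return. A minimal toy sketch of the pattern (my own example, not the article's code):

    #include <stdio.h>

    #define MUSTTAIL __attribute__((musttail))

    /* Sum 1..n with an accumulator. Without a guaranteed tail call, a large n
     * blows the stack in an unoptimized build; with musttail, Clang either
     * reuses the caller's frame or refuses to compile the return. */
    static long sum_to(long n, long acc) {
        if (n == 0)
            return acc;
        MUSTTAIL return sum_to(n - 1, acc + n);
    }

    int main(void) {
        printf("%ld\n", sum_to(100000000L, 0L));  /* fine even at -O0 */
        return 0;
    }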
"Debug mode" vs. "Release mode" doesn't really enter into it: if this optimization doesn't happen, the program is no longer correct, regardless of what optimization level you've selected. Asking "why is the compiler missing this obvious optimization in production mode" is missing the point: you have to guarantee that it always happens in all compilation modes, otherwise the code doesn't work at all.
The code has to be structured entirely differently for this optimisation to be possible.
If the code were restructured without marking the optimisation as mandatory, it would break in debug mode (as it would quickly blow the stack).
What's the argument for building in debug mode vs specifying -g to the compiler to build with debug symbols?
I've previously encountered bugs that stubbornly refused to reproduce in a non-optimized build.
One thing that comes to mind is `#ifdef DEBUG`-style macros, although I'm curious if there's anything else I'm missing. Omitted stack frames, e.g. from inlining, perhaps?
One reason that this is not done by default is that it can make debugging surprising: since tail-calling functions don't leave a stack frame behind, stack traces are much less useful. This is not so bad in the case of a tail-recursive function which calls itself, but if proper tail calls are done between multiple functions, you could have A call B, which tail calls C, which tail calls D, and if D crashes the stack trace is just (D called from A).
I’m sure there are some good ways around this problem, and I would love to have proper tail calls available in most languages.
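To make the backtrace point concrete, here's a toy chain (hypothetical names, nobody's real code):

    #define MUSTTAIL __attribute__((musttail))

    /* A calls B normally, but B and C forward with guaranteed tail calls,
     * so their frames are reused. If D crashes, a backtrace shows only
     * D and A; B and C have vanished. */
    static int D(int x) { return x; /* imagine the crash here */ }
    static int C(int x) { MUSTTAIL return D(x + 1); }
    static int B(int x) { MUSTTAIL return C(x + 1); }
    int A(int x) { return B(x) + 1; /* ordinary call: A keeps its frame */ }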
In OP's case, the individual functions are replacing small fragments of code that would have been basic blocks inside a larger function. The tail calls are replacing gotos. Those things also wouldn't have produced a stack trace.
There are other reasons why this style of programming isn't the default. It requires restructuring the program around tail calls. You commandeer the registers that are normally used for function parameters and force the compiler to use them for your most important variables. But at the same time, it also means that calling a normal function will compete for those registers. This technique works better if the inner loop that you're optimizing doesn't have many function calls in its fast path (before you break it down into a network of tail-called functions).
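Roughly what that pattern looks like, as a sketch (the opcode handlers here are made up, not the OP's actual code): every former basic block becomes a small function sharing one prototype, the hot state (instruction pointer, accumulator) travels in the argument registers, and each handler forwards with a guaranteed tail call where a goto or switch arm used to be.

    #define MUSTTAIL __attribute__((musttail))

    /* All handlers share one prototype so every forwarding call is a valid
     * tail call. */
    typedef long handler_fn(const unsigned char *ip, long acc);

    static handler_fn op_next, op_add, op_halt;

    /* Plays the role of the old switch: pick the handler for the next byte. */
    static long op_next(const unsigned char *ip, long acc) {
        switch (*ip) {
        case 0:  MUSTTAIL return op_halt(ip, acc);
        default: MUSTTAIL return op_add(ip, acc);
        }
    }

    static long op_add(const unsigned char *ip, long acc) {
        acc += *ip;                           /* the "basic block" body */
        MUSTTAIL return op_next(ip + 1, acc);
    }

    static long op_halt(const unsigned char *ip, long acc) {
        (void)ip;
        return acc;                           /* the only return that pops back out */
    }

    long run(const unsigned char *program) { return op_next(program, 0); }

Calling run((const unsigned char *)"\x05\x03") returns 8: the string's trailing NUL byte acts as the halt opcode, and nothing in between ever deepens the stack.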
I must be really missing something obvious, since I don't know why so many replies are talking about debugging.
We're talking about the speed of production code (aren't we?), which already elides a lot of debugging features. The article doesn't say that they've found a feature for faster debugging; they're saying that explicitly marking tail calls makes their production code faster... so I'm still lost as to why the compiler can't find these optimizations in production builds without explicit markup.
People are talking about debugging because debugging needs to be possible.
If you blow the stack in a debug build, your application fucking crashes. That makes debug builds useless and can make some problems very difficult to debug. Therefore we need a mechanism that guarantees the code doesn't blow the stack in debug builds either. "Only optimized production builds are executable" is not a path you want to go down if it's avoidable.
If you don't have markup to enforce TCO in every build, some build will be broken or will need to use a different code path altogether to avoid that breakage.
If you write this style of code and don't get the optimization, then your test code and debug sessions are dog slow. Much, much slower than the old giant-switch-style parser.
Similarly, if you make a mistake that would cause the optimization not to be applied, you'd rather get a compiler error than suddenly have a 10X performance regression.
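A toy example of that compile-time guarantee (made-up functions, not from the article):

    #define MUSTTAIL __attribute__((musttail))

    static long step(long x) { return x + 1; }

    static long ok(long x) {
        MUSTTAIL return step(x);      /* guaranteed tail call */
    }

    static long not_ok(long x) {
        /* The return value is no longer just the call, so this can't be a
         * tail call. Clang rejects the attribute at compile time instead of
         * quietly emitting an ordinary call and leaving you to discover the
         * regression with a profiler (or a stack overflow). */
        MUSTTAIL return step(x) + 1;
    }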