Yep, that's exactly the right takeaway. :) I have verified this on ARM64 also, but architectures that pass parameters on the stack will make this technique not really worth it.
Yes, there are portability issues with the tail call approach, so there would need to be a fallback on non-x64/ARM64 platorms. This would add complexity. But it's exciting to think that it could unlock significant performance improvements.
It's a bit of a weird case for Haskell. Because of laziness, you don't always want tail recursion (in fact sometimes you definitely don't want the tail recursive version). The classic example of this is foldl vs foldr.
-- Tail-recursive so must be good right?
-- That way lies thunk accumulation and *non-constant*
-- additional memory usage for a lot of functions f
foldl f acc [] = acc
foldl f acc (x:xs) = foldl f (f acc x) xs
-- In contrast, this definition often results in constant
-- additional space usage despite the non-tail call.
foldr f acc [] = acc
foldr f acc (x:xs) = f x (foldr f acc xs)
Tail recursion is an unalloyed good only for strict languages.
Although I'm not sure why you put mutual in parentheses and maybe you meant to address laziness with that?
You are right indeed regarding that there are cases where trail recursion does not work. But in this case I would say that the culprit is the laziness of f.
I put mutual in between parentheses because not all languages support optimising mutually tail recursive functions.
On the matter of portability, I wonder if it would be possible to use some macro magic and/or a code generator to convert the tail-call version back into a more traditional while/switch loop, for the stack-based architectures.
While the tail call version is more architecture dependent, it's nevertheless more portable than assembly language. It's still C.
I haven't implemented the fallback yet, but my thought was to keep most of the structure the same (dispatch function tail calls to the function for an individual operation) but then have the individual operation just return instead of tail calling back to dispatch.
I think this would be portable while still avoiding some of the performance problems associated with a traditional switch (like the bounds check).
I'm curious how this approach would fair for microcontrollers, well cortex m0-m4's actually. They're arm32 bit and may not have enough useful call registers. Still parsing protobuf's or interpreting WASM faster would result directly in better battery performance, etc.
Re: PHP, Ruby, etc. yes, computed gotos help some, but I think tail calls would help much more. That was our experience at least. I wrote more about this here: https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html
Yes, there are portability issues with the tail call approach, so there would need to be a fallback on non-x64/ARM64 platorms. This would add complexity. But it's exciting to think that it could unlock significant performance improvements.