
> With the tail-call approach, each bytecode now gets its own function, and the pathological case for the C/C++ compiler is gone. And as shown by the experience of the Google protobuf developers, the tail-call approach can indeed be used to build very good interpreters. But can it push to the limit of hand-written assembly interpreters? Unfortunately, the answer is still no, at least not in its current state.

> The main blockade to the tail-call approach is the callee-saved registers. Since each bytecode function is still a function, it is required to abide by the calling convention; specifically, every callee-saved register must retain its old value at function exit.

This is correct: the waste of callee-saved registers is a shortcoming of the approach I published for protobuf parsing (linked from the first paragraph above). More recently I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".

I need to spend some time reading this article in more detail to understand this new work more fully. I would like to know whether a new calling convention in Clang would give the same performance benefits, or whether Deegen is able to perform optimizations that go beyond this. Inline caching seems like a higher-level technique that operates above the level of individual opcode dispatch, and is therefore somewhat orthogonal.
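
For anyone who hasn't seen the pattern, here is a minimal sketch of tail-call dispatch in C using Clang's musttail statement attribute. The opcode encoding, names, and State layout are invented for illustration; they are not taken from the article or from the protobuf parser.

    #include <stdint.h>

    typedef struct { uint64_t regs[256]; } State;
    typedef void Handler(State *st, const uint8_t *pc);

    extern Handler *dispatch[256];   /* one handler function per opcode */

    /* Each bytecode gets its own small function. The musttail call compiles
       to a jump, so dispatch never grows the stack and the compiler only has
       to register-allocate one short function at a time. */
    void op_add(State *st, const uint8_t *pc) {
        st->regs[pc[1]] = st->regs[pc[2]] + st->regs[pc[3]];
        pc += 4;
        __attribute__((musttail)) return dispatch[*pc](st, pc);
    }

    void op_halt(State *st, const uint8_t *pc) {
        (void)st; (void)pc;          /* a plain return exits the interpreter */
    }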



> More recently I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".

As explained in the article, LLVM already has exactly the calling convention you are looking for: the GHC convention (cc 10). You can use it to "pin" values to registers by passing arguments in the right slots. If you pin your argument to a register that is callee-saved in the C calling convention, it won't get clobbered when you make a C call.


I tried exposing the "cc 10" and "cc 11" calling conventions to Clang, but was unsuccessful. With "cc 10" I got crashes at runtime I could not explain. With "cc 11" I got a compile-time error:

> fatal error: error in backend: Can't generate HiPE prologue without runtime parameters

If "cc 10" would work from Clang I'd be happy to use that. Though I think reverse_cc could potentially still offer benefits by ordering arguments such that callee-save registers are assigned first. That is helpful when calling fallback functions that take arguments in the normal registers.


If you haven't already, I'd recommend compiling llvm/clang in debug or release+asserts mode when working on it. The codebase leans heavily on assertions, even for relatively trivial things (a missing implementation of some case, an invalid combination of arguments, ...), which means it's easy to run into weird crashes further down the line when assertions are disabled.
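
For reference, the standard upstream CMake flags for a release+asserts build look roughly like this (generator and source path are whatever your checkout uses):

    cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
          -DLLVM_ENABLE_ASSERTIONS=ON \
          -DLLVM_ENABLE_PROJECTS=clang ../llvm
    ninja clang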


Yeah, you should do it at the LLVM IR level.


Adding calling conventions to clang is fairly straightforward (e.g. https://reviews.llvm.org/D125970), but there's a limited-size structure somewhere in clang that makes doing so slightly contentious, partly because calling conventions are shared across all architectures. At some point we'll run out of bits and have to do something invasive and/or slow to free up space.


Yes, I've already experimented with doing this, see: https://github.com/haberman/llvm-project/commit/e8d9c75bb35c...

But when I tried it on my actual code, the results weren't quite as good as I hoped, due to sub-optimal register allocation.


I think I remember one called preserve_none. That might have been in a downstream fork though.

The calling convention you want for a coroutine switch is: the caller saves everything live, and the callee saves nothing. That means spilling is done only for the live set, onto the stack in use before switching, and the return after the stack switch restores it.

So if a new convention is proposed that addresses both fast stackful coroutines and bytecode interpreter overhead, that seems reasonable to me.
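
If preserve_none is the attribute I'm thinking of, recent Clang spells it __attribute__((preserve_none)); whether your toolchain has it is an assumption. Combined with musttail it gives roughly the shape discussed above: no callee-saved registers, with interpreter state handed from handler to handler entirely in argument registers. A minimal sketch (names and State layout invented for illustration):

    #include <stdint.h>

    typedef struct { uint64_t regs[256]; } State;

    /* preserve_none drops the callee-saved registers from these functions'
       convention, so no interpreter state is saved/restored at handler
       boundaries; musttail keeps the dispatch itself a plain jump. */
    __attribute__((preserve_none)) void op_next(State *st, const uint8_t *pc);

    __attribute__((preserve_none)) void op_add(State *st, const uint8_t *pc) {
        st->regs[pc[1]] = st->regs[pc[2]] + st->regs[pc[3]];
        __attribute__((musttail)) return op_next(st, pc + 4);
    }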


Have you read the copy-and-patch compilation paper? It seems broadly similar to what you're describing, and to Deegen. It uses GHC's calling convention for its snippets, which has only caller-saved registers and passes all arguments in registers, in order to optimize stitching the snippets together (via JIT templating there, though with tail calls in your case it would probably land in the same design space).


The copy-and-patch paper is also written by me. Deegen is a follow-up to copy-and-patch, and it will use copy-and-patch as a tool to build its JIT tiers in the future.


Haha, whoops! I didn't realize that :)



For your (and maybe other people's) information, this post is authored by the person you are replying to.



