
> With the tail-call approach, each bytecode now gets its own function, and the pathological case for the C/C++ compiler is gone. And as shown by the experience of the Google protobuf developers, the tail-call approach can indeed be used to build very good interpreters. But can it push to the limit of hand-written assembly interpreters? Unfortunately, the answer is still no, at least not in its current state.

> The main blockade to the tail-call approach is the callee-saved registers. Since each bytecode function is still a function, it is required to abide by the calling convention; specifically, every callee-saved register must retain its old value at function exit.

This is correct: the waste of callee-saved registers is a shortcoming of the approach I published for protobuf parsing (linked from the first paragraph above). More recently I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".

I need to spend some time reading this article in more detail to understand this new work more fully. I would like to know whether a new calling convention in Clang would give the same performance benefits, or whether Deegen is able to perform optimizations that go beyond this. Inline caching seems like a higher-level technique that operates above the level of individual opcode dispatch, and is therefore somewhat orthogonal.
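
For anyone who hasn't seen the pattern, here is a minimal sketch of tail-call dispatch in C using Clang's musttail statement attribute. The opcode encoding, names, and State layout are invented for illustration; they are not taken from the article or from the protobuf parser.

    #include <stdint.h>

    typedef struct { uint64_t regs[256]; } State;
    typedef void Handler(State *st, const uint8_t *pc);

    extern Handler *dispatch[256];   /* one handler function per opcode */

    /* Each bytecode gets its own small function. The musttail call compiles
       to a jump, so dispatch never grows the stack and the compiler only has
       to register-allocate one short function at a time. */
    void op_add(State *st, const uint8_t *pc) {
        st->regs[pc[1]] = st->regs[pc[2]] + st->regs[pc[3]];
        pc += 4;
        __attribute__((musttail)) return dispatch[*pc](st, pc);
    }

    void op_halt(State *st, const uint8_t *pc) {
        (void)st; (void)pc;          /* a plain return exits the interpreter */
    }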



> More recently I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".

As explained in the article, LLVM already has exactly the calling convention you are looking for: the GHC convention (cc 10). You can use it to "pin" values to registers by passing arguments in the right slots. If you pin your argument to a register that is callee-saved in the C calling convention, it won't get clobbered when you make a C call.


I tried exposing the "cc 10" and "cc 11" calling conventions to Clang, but was unsuccessful. With "cc 10" I got crashes at runtime I could not explain. With "cc 11" I got a compile-time error:

> fatal error: error in backend: Can't generate HiPE prologue without runtime parameters

If "cc 10" would work from Clang I'd be happy to use that. Though I think reverse_cc could potentially still offer benefits by ordering arguments such that callee-save registers are assigned first. That is helpful when calling fallback functions that take arguments in the normal registers.


If you haven't already, I'd recommend compiling llvm/clang in debug or release+asserts mode when working on it. The codebase leans heavily on assertions, even for relatively trivial things (a missing implementation of some case, an invalid combination of arguments, ...), which means it's easy to run into weird crashes further down the line when assertions are disabled.
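
For reference, the standard upstream CMake flags for a release+asserts build look roughly like this (generator and source path are whatever your checkout uses):

    cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
          -DLLVM_ENABLE_ASSERTIONS=ON \
          -DLLVM_ENABLE_PROJECTS=clang ../llvm
    ninja clang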


Yeah, you should do it at the LLVM IR level.


Adding calling conventions to clang is fairly straightforward (e.g. https://reviews.llvm.org/D125970), but there's a limited-size structure somewhere in clang that makes doing so slightly contentious, partly because calling conventions are shared across all architectures. At some point we'll run out of bits and have to do something invasive and/or slow to free up space.


Yes, I've already experimented with doing this, see: https://github.com/haberman/llvm-project/commit/e8d9c75bb35c...

But when I tried it on my actual code, the results weren't quite as good as I hoped, due to sub-optimal register allocation.


I think I remember one called preserve_none. That might have been in a downstream fork though.

The calling convention you want for a coroutine switch is: the caller saves everything live, and the callee saves nothing. That means spilling is done only for the live set, onto the stack in use before switching, and the return after the stack switch restores it.

So if a new convention is proposed that addresses both fast stackful coroutines and bytecode interpreter overhead, that seems reasonable to me.
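
If preserve_none is the attribute I'm thinking of, recent Clang spells it __attribute__((preserve_none)); whether your toolchain has it is an assumption. Combined with musttail it gives roughly the shape discussed above: no callee-saved registers, with interpreter state handed from handler to handler entirely in argument registers. A minimal sketch (names and State layout invented for illustration):

    #include <stdint.h>

    typedef struct { uint64_t regs[256]; } State;

    /* preserve_none drops the callee-saved registers from these functions'
       convention, so no interpreter state is saved/restored at handler
       boundaries; musttail keeps the dispatch itself a plain jump. */
    __attribute__((preserve_none)) void op_next(State *st, const uint8_t *pc);

    __attribute__((preserve_none)) void op_add(State *st, const uint8_t *pc) {
        st->regs[pc[1]] = st->regs[pc[2]] + st->regs[pc[3]];
        __attribute__((musttail)) return op_next(st, pc + 4);
    }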


Have you read the copy-and-patch compilation paper? It seems broadly similar to what you're describing, and to Deegen. It uses GHC's calling convention for its snippets, which has only caller-saved registers and passes all arguments in registers, in order to optimize stitching the snippets together (via JIT templating there, though with tail calls in your case it would probably land in the same design space).


The copy-and-patch paper is also written by me. Deegen is a follow-up to copy-and-patch, and it will use copy-and-patch as a tool to build its JIT tiers in the future.


Haha, whoops! I didn't realize that :)



For your (and maybe other people's) information, this post is authored by the person you are replying to.



