> I just meant that even being aware this is a naive copy-and-patch JIT, my first impression was that the code was slightly worse than I expected.
> "there’s only so much one can do without touching the data model"
You probably want to look at the other link in that PR, which demonstrated how well copy-and-patch can do for another dynamic language (Lua): [1]
Of course, whether or not CPython could eventually make it to that point (or even further) is a different story: they operate under far tighter constraints than a project developed for academia. But copy-and-patch can do a lot even for dynamic languages :)
That's correct. Lua function calls are not that easy to remove, as functions are first-class values in Lua and can be redefined at any time. To remove a function call, you need speculative compilation and OSR exits, which is outside the scope of the baseline JIT.
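To illustrate the idea, here is a minimal sketch (not Deegen or LuaJIT code; all names are made up) of why removing the call requires a guard plus a deoptimization path:

```cpp
struct VMFrame;
struct FunctionObject { void (*entry)(VMFrame*); };

// Hypothetical helper: abandon the optimized code and resume in the
// interpreter at the equivalent bytecode location (the "OSR exit").
void DeoptToInterpreter(VMFrame* frame);

// The callee the optimizing compiler speculated on when it compiled this
// call site. Because functions are first-class values in Lua, the variable
// holding the callee may have been reassigned since then.
FunctionObject* g_speculatedCallee;

void SpeculatedCall(FunctionObject* actualCallee, VMFrame* frame)
{
    if (actualCallee != g_speculatedCallee)
    {
        // Speculation failed: give up on the optimized code path.
        DeoptToInterpreter(frame);
        return;
    }
    // Speculation held: the direct (potentially inlined) call is valid.
    actualCallee->entry(frame);
}
```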
That's an interesting approach :)
Though it only works if the control flow in the language is exactly "paired" (no continue/break, no goto, etc), I guess?
I agree with your point, but I want to point out that Deegen also provides APIs that hide all the details of inline caching. The bytecode only specifies the body lambda and the effect lambdas, and Deegen lowers them to an efficient implementation automatically, including exotic optimizations that fuse the cached IC effect into the opcode so that one indirect branch can be avoided.
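For readers unfamiliar with inline caching, here is a rough sketch of the pattern such an API ultimately lowers to (illustrative C++ with made-up names, not Deegen's actual lambda-based API):

```cpp
#include <cstdint>

struct Shape;                       // "hidden class" describing an object's layout
struct TableObject { Shape* shape; uint64_t* slots; };

// Hypothetical slow path: find where 'propId' lives for objects of this shape.
uint32_t SlowPathLookup(Shape* shape, uint32_t propId);

// One inline cache site, associated with a particular bytecode instruction.
struct PropertyIC {
    Shape* cachedShape = nullptr;
    uint32_t cachedSlot = 0;
};

uint64_t GetById(PropertyIC& ic, TableObject* obj, uint32_t propId)
{
    if (obj->shape == ic.cachedShape) {
        // Cached "effect": the shape matched, so the lookup is skipped entirely.
        return obj->slots[ic.cachedSlot];
    }
    // "Body": do the full lookup once, then record the result for next time.
    uint32_t slot = SlowPathLookup(obj->shape, propId);
    ic.cachedShape = obj->shape;
    ic.cachedSlot = slot;
    return obj->slots[slot];
}
```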
If the LuaJIT interpreter were to employ IC, it would have to undergo a major rewrite (due to its hand-written assembly nature) to reach code as efficient as what LJR generates automatically. This is one advantage of our meta-compiler approach.
Finally, although we have not run an experiment on this, my subjective opinion is already in the article:
> LuaJIT’s hand-written assembly interpreter is highly optimized already. Our interpreter generated by Deegen is also highly optimized, and in many cases, slightly better-optimized than LuaJIT. However, the gain from those low-level optimizations are simply not enough to beat LuaJIT by a significant margin, especially on a modern CPU with very good instruction-level parallelism, where having a few more instructions, a few longer instructions, or even a few more L1-hitting loads have negligible impact on performance. The support of inline caching is one of the most important high-level optimizations we employed that contributes to our performance advantage over LuaJIT.
That is, if you compare the assembly of the LJR and LuaJIT interpreters, I believe LJR's assembly is slightly better (though I would anticipate only a marginal performance difference). But that's also because we employed a different boxing scheme (again...). If you forced LJR to use LuaJIT's boxing scheme, I guess the assembly would end up similar, since LuaJIT's hand-written assembly is already optimal, or at least very close to optimal.
Clang/LLVM accepts (little-known but documented) flags that let you align code blocks to a custom alignment, though they only work at the file level. See [1].
However, I'm not sure if doing so is useful or necessary. Interpreter performance is sensitive to code layout (which affects hardware branch predictor accuracy), but I don't think there is a general way to optimize the code layout to make the branch predictor as happy as possible.
So if you changed your code alignment and saw a perf change, it's more likely caused by random perf gains/losses from the code layout change itself, not because 1-byte alignment is better than 16-byte alignment or vice versa.
I just want to point out that LLVM is not a runtime dependency. It is only a build-time dependency if you want to build LJR from source. Once LJR is built, it is stand-alone and does not need LLVM at runtime.
It should be a build-time dependency. There's no JIT here, so there's no calling into LLVM's JIT, so I'd hope it's equivalent to using Clang to create a native binary that implements the Lua language.
Could probably retarget it to emit assembly and commit that for a loose interpretation of building from source.
The copy-and-patch paper is also written by me. Deegen is a follow-up work to copy-and-patch, and it will use copy-and-patch as a tool to build its JIT tiers in the future.
> More recently I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".
As explained in the article, LLVM already has exactly the calling convention you are looking for: the GHC convention (cc 10). You can use it to "pin" registers by passing arguments in the right spots. If you pin your argument in a register that is callee-saved under the C calling convention, it won't get clobbered across a C call.
I tried exposing the "cc 10" and "cc 11" calling conventions to Clang, but was unsuccessful. With "cc 10" I got crashes at runtime I could not explain. With "cc 11" I got a compile-time error:
> fatal error: error in backend: Can't generate HiPE prologue without runtime parameters
If "cc 10" would work from Clang I'd be happy to use that. Though I think reverse_cc could potentially still offer benefits by ordering arguments such that callee-save registers are assigned first. That is helpful when calling fallback functions that take arguments in the normal registers.
If you haven't already, I'd recommend compiling LLVM/Clang in debug or release+asserts mode when working on it.
The codebase leans heavily on debug-only assertions, even for relatively trivial things (a missing implementation of some case, an invalid combination of arguments, ...), which means it's pretty easy to run into weird crashes down the line when assertions are disabled.
Adding calling conventions to Clang is fairly straightforward (e.g. https://reviews.llvm.org/D125970), but there's a limited-size structure somewhere in Clang that makes doing so slightly contentious, partly because the bits are shared across all architectures. At some point we'll run out of bits and have to do something invasive and/or slow to free up space.
I think I remember one called preserve_none. That might have been in a downstream fork though.
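If it is available (this sketch assumes a Clang that exposes `__attribute__((preserve_none))` and statement-level `musttail`; the handler names are made up), a "no callee-saved registers" interpreter handler could look roughly like this:

```cpp
struct VM;  // interpreter state, kept in registers across handlers

// With preserve_none the handler has no callee-saved registers to preserve,
// so the VM state passed in registers is never spilled just to satisfy the
// default C calling convention.
__attribute__((preserve_none))
void op_next(VM* vm, const unsigned char* pc);

__attribute__((preserve_none))
void op_add(VM* vm, const unsigned char* pc)
{
    // ... execute the opcode ...
    // Tail-call the next handler so no frame survives across opcodes.
    __attribute__((musttail)) return op_next(vm, pc + 1);
}
```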
The calling convention you want for a coroutine switch is: the caller saves everything live, the callee saves nothing. That way spilling is done only on the live set, to the stack in use before the switch; the return after the stack switch then restores it.
So if a new convention is proposed that solves both fast stackful coroutines and bytecode interpreter overhead, that seems reasonable to me.
Yes, the object can become unreachable, but we won't know that until the end of the current GC cycle (figuring that out is the whole purpose of the GC). So yes, even if it has become unreachable, it is considered live (and must be kept live) until the end of the cycle.
I think the point you missed is that the function 'cellContainsLiveObject' is not used by the GC; it is used by the allocator to tell whether the cell is available for allocation. So it's fine if the function returns true while the object is actually unreachable, but not the other way around.
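In other words, the contract is one-sided. A minimal hypothetical sketch (made-up names, not the project's actual code):

```cpp
struct GCCell {
    bool isOnFreeList;   // set once the cell has been swept and handed back to the allocator
    // ... object payload ...
};

// Used by the allocator (not by the GC) to decide whether a cell is still
// occupied, i.e. not available for allocation. It may return true for an
// object that has already become unreachable (we won't learn that until the
// current GC cycle completes), but it must never return false for an object
// the mutator can still reach; otherwise the allocator would hand out memory
// that is still in use.
bool cellContainsLiveObject(GCCell* cell)
{
    return !cell->isOnFreeList;
}
```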
> He (Abbot) proposed instead an alternative framework called Merit, Fairness, and Equality (MFE) whereby university applicants are treated as individuals and evaluated through a rigorous and unbiased process based on their merit and qualifications alone
I am fully in support of this, though it seems to me it is too idealized to be practical. How can one reach a fair judgment of a student based only on a 1000-word essay in his/her application (which might not even have been written by the applicant)?
However, I'm still saddened that MIT's response to this incident is simply "it is Abbot's right of free expression to say whatever he wants", with nothing about what he actually said or whether it at least makes some sense. It's as if MIT treated Abbot as an unknowing child whose nonsensical words should be tolerated, which is disturbing. Below is part of the mailing-list letter I received:
> Freedom of expression is a fundamental value of the Institute.
> I believe that, as an institution of higher learning, we must ensure that different points of view – even views that some or all of us may reject – are allowed to be heard and debated at MIT. Open dialogue is how we make each other wiser and smarter.
> This commitment to free expression can carry a human cost. The speech of those we strongly disagree with can anger us. It can disgust us. It can even make members of our own community feel unwelcome and illegitimate on our campus or in their field of study.
> I am convinced that, as an institution, we must be prepared to endure such painful outcomes as the price of protecting free expression – the principle is that important.
> "there’s only so much one can do without touching the data model"
You probably want to look at the other link in that PR, which demonstrated how well copy-and-patch can do for another dynamic language (Lua): [1]
Of course, whether or not CPython could eventually make it to that point (or even further) is a different story: they are under a way tighter constraint than just developing something for academia. But copy-and-patch can do a lot even for dynamic languages :)
[1] https://sillycross.github.io/2023/05/12/2023-05-12/