I’ll repeat what I said at that time: one of the benefits of the new design is that it’s less vulnerable to the whims of the optimizer: https://news.ycombinator.com/item?id=43322451
If getting optimal code relies on a pile of heuristics going in your favor, you’re vulnerable to the possibility that someday the heuristics will go the other way. Tail duplication is what we want in this case, but it’s possible that a future version of the compiler could decide it’s not desirable because of the increased code size.
With the new design, the Python interpreter can express the desired shape of the machine code more directly, leaving it less vulnerable to the whims of the optimizer.
- functional programming model --- some folks find the lack of traditionally mutable variables limiting
- output is an STL, or a DXF using polylines
- native objects are spheres, cylinders, cubes, with functions for hull and Minkowski, so filleting and other traditional CAD operations can be difficult
> Hotspot is the choice for high performance programs. Approaching its performance even with C++ requires a dedicated team of experts.
It's very surprising to hear you say this, as it's so contrary to my experience.
From the smallest programs (Computer Language Benchmarks Game) to pretty big programs (web browsers), from low-level programs (OS kernels) to high-level programs (GUI Applications), from short-lived programs (command-line utilities) to long-lived programs (database servers), it's hard to think of a single segment where even average Java programs will out-perform average C, C++, or Rust programs.
I hadn't heard of QuestDB before, but it sounds like it's written in zero-GC Java using manual memory management. That's pretty unusual for Java, and would require a team of experts to pull off, I'd think. It also sounds like it drops to C++ and Rust for performance-critical tasks.
It's a statement of my experience with the performance achieved in practice by real developers who lack dedicated language support teams, and even by the ones who enjoy them. I could point to gRPC: gRPC-Java is slapping gRPC-C++ sideways. Why is that? Because as a codebase grows increasingly complex, C-style lifetime management becomes too difficult for developers to reason about, and they fall back on the slower features of the language platform, like reference-counting smart pointers.
I think hybrid implementations, where a project enjoys the beneficial aspects of the language runtime at large but delegates small, critical functions to other languages, make sense. That keeps the C, C++, or Rust code contained within boundaries that are easy to reason about, and doesn't let those language platforms dictate the overall architecture of the program.
If gRPC overhead is critical to your system, you've probably already lost the plot on performance in your overall architecture.
You make a fair point about smart pointers, and median "modern C++" practices with STL data structures are unimpressive performance-wise compared to tuned custom data structures, but I can't imagine that idiomatic Java with GC overhead on top is any better.
That effort was focused primarily on learnability and teachability, but it seems like deeper arena support could help even experienced devs if it made patterns like linked lists fundamentally easier to work with.
Thanks for those links. Have you tried using arenas that give out handles (often just indexes) instead of mutable references? It's less convenient, and you're not leveraging the borrow checker, but I would imagine it supports Send well.
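Roughly this shape, as a minimal sketch (Arena, Handle, and Node are made-up names here, not any particular crate's API):

    // An arena that hands out plain index handles. Handles are Copy and
    // carry no lifetimes, so they are trivially Send/Sync; all access goes
    // back through the arena instead of through long-lived &mut borrows.
    struct Arena<T> {
        items: Vec<T>,
    }

    #[derive(Clone, Copy, PartialEq, Eq)]
    struct Handle(usize);

    impl<T> Arena<T> {
        fn new() -> Self {
            Arena { items: Vec::new() }
        }

        fn alloc(&mut self, value: T) -> Handle {
            self.items.push(value);
            Handle(self.items.len() - 1)
        }

        fn get(&self, h: Handle) -> &T {
            &self.items[h.0]
        }

        fn get_mut(&mut self, h: Handle) -> &mut T {
            &mut self.items[h.0]
        }
    }

    // A linked-list node stores handles rather than Box/Rc, so back-pointers
    // and cycles cause no fights with the borrow checker.
    struct Node {
        value: i32,
        next: Option<Handle>,
    }

    fn main() {
        let mut arena: Arena<Node> = Arena::new();
        let tail = arena.alloc(Node { value: 2, next: None });
        let head = arena.alloc(Node { value: 1, next: Some(tail) });

        // Walk the list by asking the arena for each node.
        let mut cur = Some(head);
        while let Some(h) = cur {
            let node = arena.get(h);
            println!("{}", node.value);
            cur = node.next;
        }
    }

The trade-off is that handles can dangle or be used against the wrong arena; generational indexes (a handle carrying a generation counter that's checked on access) are the usual mitigation.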
The build of Python that I used has tail calls enabled (option --with-tail-call-interp). So that was in place for the results I published. I'm not sure if this optimization applies to recursive tail calls, but if it does, my Fibonacci test should have taken advantage of the optimization.
That tells you how much I know about the feature. :)
But in any case, I'm positive that the flag was enabled, so my results are with tail calls. I suppose part of the difference between 3.13 and 3.14 could be thanks to this.
Good to know! Thanks for confirming. Yes, I would guess that the tail call interpreter explains part of the difference between 3.13 and 3.14. The overall improvement to the interpreter has previously been measured at 1-5%, or even 10-15% depending on the compiler version you are using: https://blog.nelhage.com/post/cpython-tail-call/
If your benchmark setup is easy to re-run, it would be awesome to see numbers that compare the tail call interpreter to the build where it is disabled, to isolate how much improvement is due to that.
The article says exactly this in bold at the bottom:
> If you can break up a task into many parts, each of which is highly local, then memory access in each part will be O(1). GPUs are already often very good at getting precisely these kinds of efficiencies. But if the task requires a lot of memory interdependencies, then you will get lots of O(N^⅓) terms. An open problem is coming up with mathematical models of computation that are simple but do a good job of capturing these nuances.
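For anyone wondering where the ⅓ comes from, the usual back-of-the-envelope argument is geometric (a sketch, assuming memory stored at bounded density in 3D space with a bounded signal speed):

    V ∝ N  ⇒  r_max ∝ V^(1/3) = N^(1/3)  ⇒  t_access = Θ(N^(1/3))

That is, N bits occupy volume proportional to N, the farthest bit sits at a distance proportional to the cube root of that volume, and a round trip at bounded speed costs time proportional to that distance.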
LuaJIT bucks the trend of slow-warmup JITs. It is extremely quick to compile and load, and its interpreter is very fast -- faster than the JIT-compiled code from LuaJIT v1 IIRC, and certainly faster than the standard Lua interpreter.
It wasn't until LuaJIT that I realized that JITs didn't inherently have to be slow, lumbering beasts that take hundreds of milliseconds just to wake from their slumber.
Yet I've witnessed Lua 5.1 launching faster than luajit for some of my use cases.
My point still stands though. Don't just use LuaJIT thinking it will magically make things faster in all cases. If you are embedding, LuaJIT is a no-brainer. If you are using a stand-alone interpreter, measure if you care about reality.
This was an interesting article, but it made me even more interested in the author's larger take on R as a language:
> In the years since, my discomfort has given way to fascination. I’ve come to respect R’s bold choices, its clarity of focus, and the R community’s continued confidence to ‘do their own thing’.
I would love to see a follow-up article about the key insights that the author took away from diving more deeply into R.
Varint encoding is something I've peeked at in various contexts. My personal bias is towards the prefix-style, as it feels faster to decode and the segregation of the meta-data from the payload data is nice.
But, the thing that tends to tip the scales is the fact that in almost all real world cases, small numbers dominate - as the github thread you linked relates in a comment.
The LEB128 fast-path is a single conditional with no data-dependencies:
    if x & 0x80 == 0 { return x; }
Modern CPUs will predict that branch really well, and you'll pay almost zero cost for the fast path, which also happens to be the dominant path.
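Fleshed out, a full unsigned-LEB128 decoder with that fast path up front looks roughly like this (a sketch of the standard algorithm, not any particular library's code):

    // Decode an unsigned LEB128 value; returns (value, bytes consumed).
    fn decode_uleb128(buf: &[u8]) -> Option<(u64, usize)> {
        let first = *buf.first()?;

        // Fast path: high bit clear means the value fits in one byte.
        // Small values dominate in practice, so this branch predicts well.
        if first & 0x80 == 0 {
            return Some((first as u64, 1));
        }

        // Slow path: accumulate 7 bits per byte until a byte with the
        // high bit clear terminates the sequence.
        let mut value: u64 = 0;
        let mut shift = 0;
        for (i, &byte) in buf.iter().enumerate() {
            if shift >= 64 {
                return None; // sequence too long for a u64
            }
            value |= ((byte & 0x7f) as u64) << shift;
            if byte & 0x80 == 0 {
                return Some((value, i + 1));
            }
            shift += 7;
        }
        None // ran out of input mid-sequence
    }

For example, decode_uleb128(&[0x2a]) yields Some((42, 1)) via the fast path, and decode_uleb128(&[0xac, 0x02]) yields Some((300, 2)) via the slow path.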