I’ll repeat what I said at that time: one of the benefits of the new design is that it’s less vulnerable to the whims of the optimizer: https://news.ycombinator.com/item?id=43322451
If getting optimal code relies on a pile of heuristics going in your favor, you’re vulnerable to the possibility that someday the heuristics will go the other way. Tail duplication is what we want in this case, but it’s possible that a future version of the compiler could decide it’s not desirable because of the increased code size.
With the new design, the Python interpreter can express the desired shape of the machine code more directly, leaving it less vulnerable to the whims of the optimizer.
- functional programming model --- some folks find the lack of traditionally mutable variables limiting
- output is an STL, or a DXF using polylines
- native objects are spheres, cylinders, cubes, with functions for hull and Minkowski, so filleting and other traditional CAD operations can be difficult
> Hotspot is the choice for high performance programs. Approaching its performance even with C++ requires a dedicated team of experts.
It's very surprising to hear you say this, as it's so contrary to my experience.
From the smallest programs (Computer Language Benchmarks Game) to pretty big programs (web browsers), from low-level programs (OS kernels) to high-level programs (GUI Applications), from short-lived programs (command-line utilities) to long-lived programs (database servers), it's hard to think of a single segment where even average Java programs will out-perform average C, C++, or Rust programs.
I hadn't heard of QuestDB before, but it sounds like it's written in zero-GC Java using manual memory management. That's pretty unusual for Java, and would require a team of experts to pull off, I'd think. It also sounds like it drops to C++ and Rust for performance-critical tasks.
It's a statement of my experience with the performance achieved in practice by real developers who lack dedicated language support teams, and even by the ones who enjoy them. I could point to gRPC: gRPC-Java is slapping gRPC-C++ sideways. Why is that? Because as a codebase grows increasingly complex, C-style lifetime management becomes too difficult for developers to reason about, and they fall back on the slower features of the language platform, like reference-counting smart pointers.
I think hybrid implementations, where a project enjoys the beneficial aspects of the language runtime at large but delegates small, critical functions to other languages, make sense. That keeps the C, C++, or Rust code contained within boundaries that are easy to reason about, and doesn't let those language platforms dictate the overall architecture of the program.
If gRPC overhead is critical to your system, you've probably already lost the plot on performance in your overall architecture.
You make a fair point about smart pointers, and median "modern C++" practices with STL data structures are unimpressive performance-wise compared to tuned custom data structures, but I can't imagine that idiomatic Java with GC overhead on top is any better.
That effort was focused primarily on learnability and teachability, but it seems like deeper arena support could help even experienced devs if it made patterns like linked lists fundamentally easier to work with.
Thanks for those links. Have you tried using arenas that give out handles (often just indexes) instead of mutable references? It's less convenient, and you're not leveraging the borrow checker, but I would imagine it supports Send well.
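Roughly this shape, as a minimal sketch (Arena, Handle, and Node are made-up names here, not any particular crate's API):

    // An arena that hands out plain index handles. Handles are Copy and
    // carry no lifetimes, so they are trivially Send/Sync; all access goes
    // back through the arena instead of through long-lived &mut borrows.
    struct Arena<T> {
        items: Vec<T>,
    }

    #[derive(Clone, Copy, PartialEq, Eq)]
    struct Handle(usize);

    impl<T> Arena<T> {
        fn new() -> Self {
            Arena { items: Vec::new() }
        }

        fn alloc(&mut self, value: T) -> Handle {
            self.items.push(value);
            Handle(self.items.len() - 1)
        }

        fn get(&self, h: Handle) -> &T {
            &self.items[h.0]
        }

        fn get_mut(&mut self, h: Handle) -> &mut T {
            &mut self.items[h.0]
        }
    }

    // A linked-list node stores handles rather than Box/Rc, so back-pointers
    // and cycles cause no fights with the borrow checker.
    struct Node {
        value: i32,
        next: Option<Handle>,
    }

    fn main() {
        let mut arena: Arena<Node> = Arena::new();
        let tail = arena.alloc(Node { value: 2, next: None });
        let head = arena.alloc(Node { value: 1, next: Some(tail) });

        // Walk the list by asking the arena for each node.
        let mut cur = Some(head);
        while let Some(h) = cur {
            let node = arena.get(h);
            println!("{}", node.value);
            cur = node.next;
        }
    }

The trade-off is that handles can dangle or be used against the wrong arena; generational indexes (a handle carrying a generation counter that's checked on access) are the usual mitigation.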
The build of Python that I used has tail calls enabled (option --with-tail-call-interp). So that was in place for the results I published. I'm not sure if this optimization applies to recursive tail calls, but if it does, my Fibonacci test should have taken advantage of the optimization.
That tells you how much I know about the feature. :)
But in any case, I'm positive that the flag was enabled, so my results are with tail calls. I suppose part of the difference between 3.13 and 3.14 could be thanks to this.
Good to know! Thanks for confirming. Yes, I would guess that the tail call interpreter explains part of the difference between 3.13 and 3.14. The overall improvement to the interpreter has previously been measured at 1-5%, or even 10-15% depending on the compiler version you are using: https://blog.nelhage.com/post/cpython-tail-call/
If your benchmark setup is easy to re-run, it would be awesome to see numbers that compare the tail call interpreter to the build where it is disabled, to isolate how much improvement is due to that.
The article says exactly this in bold at the bottom:
> If you can break up a task into many parts, each of which is highly local, then memory access in each part will be O(1). GPUs are already often very good at getting precisely these kinds of efficiencies. But if the task requires a lot of memory interdependencies, then you will get lots of O(N^⅓) terms. An open problem is coming up with mathematical models of computation that are simple but do a good job of capturing these nuances.
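For anyone wondering where the ⅓ comes from, the usual back-of-the-envelope argument is geometric (a sketch, assuming memory stored at bounded density in 3D space with a bounded signal speed):

    V ∝ N  ⇒  r_max ∝ V^(1/3) = N^(1/3)  ⇒  t_access = Θ(N^(1/3))

That is, N bits occupy volume proportional to N, the farthest bit sits at a distance proportional to the cube root of that volume, and a round trip at bounded speed costs time proportional to that distance.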
LuaJIT bucks the trend of slow-warmup JITs. It is extremely quick to compile and load, and its interpreter is very fast -- faster than the JIT-compiled code from LuaJIT v1 IIRC, and certainly faster than the standard Lua interpreter.
It wasn't until LuaJIT that I realized that JITs didn't inherently have to be slow, lumbering beasts that take hundreds of milliseconds just to wake from their slumber.
Yet I've witnessed Lua 5.1 launching faster than luajit for some of my use cases.
My point still stands though. Don't just use LuaJIT thinking it will magically make things faster in all cases. If you are embedding, LuaJIT is a no-brainer. If you are using a stand-alone interpreter, measure if you care about reality.
This was an interesting article, but it made me even more interested in the author's larger take on R as a language:
> In the years since, my discomfort has given way to fascination. I’ve come to respect R’s bold choices, its clarity of focus, and the R community’s continued confidence to ‘do their own thing’.
I would love to see a follow-up article about the key insights that the author took away from diving more deeply into R.
Varint encoding is something I've peeked at in various contexts. My personal bias is towards the prefix-style, as it feels faster to decode and the segregation of the meta-data from the payload data is nice.
But, the thing that tends to tip the scales is the fact that in almost all real world cases, small numbers dominate - as the github thread you linked relates in a comment.
The LEB128 fast-path is a single conditional with no data-dependencies:
    if x & 0x80 == 0 { return x; }
Modern CPUs will predict that branch really well, and you'll pay almost zero cost for the fast path, which also happens to be the dominant path.
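Fleshed out, a full unsigned-LEB128 decoder with that fast path up front looks roughly like this (a sketch of the standard algorithm, not any particular library's code):

    // Decode an unsigned LEB128 value; returns (value, bytes consumed).
    fn decode_uleb128(buf: &[u8]) -> Option<(u64, usize)> {
        let first = *buf.first()?;

        // Fast path: high bit clear means the value fits in one byte.
        // Small values dominate in practice, so this branch predicts well.
        if first & 0x80 == 0 {
            return Some((first as u64, 1));
        }

        // Slow path: accumulate 7 bits per byte until a byte with the
        // high bit clear terminates the sequence.
        let mut value: u64 = 0;
        let mut shift = 0;
        for (i, &byte) in buf.iter().enumerate() {
            if shift >= 64 {
                return None; // sequence too long for a u64
            }
            value |= ((byte & 0x7f) as u64) << shift;
            if byte & 0x80 == 0 {
                return Some((value, i + 1));
            }
            shift += 7;
        }
        None // ran out of input mid-sequence
    }

For example, decode_uleb128(&[0x2a]) yields Some((42, 1)) via the fast path, and decode_uleb128(&[0xac, 0x02]) yields Some((300, 2)) via the slow path.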