The Structure and Performance of Efficient Interpreters [pdf] (jilp.org)
98 points by ingve on March 29, 2017 | hide | past | favorite | 18 comments



This is an old, famous paper that got a lot of arguments wrong, and a lot right.

The famous result is the old computed-goto vs. switch argument, which basically comes down to switch adding bounds-checking overhead. He didn't get that, and most commenters didn't either. He also cemented the poor wording "threaded interpreter" and "indirect branching" for what are really different concepts. And in the end it's a fairly unimportant factor in making interpreters fast.

Indirect call overhead is significant and measurable. JIT compilers made that concern obsolete.

He also didn't get the case against perl. perl has huge op and data overhead and is way too dynamic. But its dispatch loop is better than the others': every perl op already returns the pointer to the next op, so there's no separate dispatch step.

Much better interpreter-optimization papers came later with Lua, reiterating old Lisp results. Note: not LuaJIT; plain Lua already had a near-perfect interpreter loop. Lua uses the Lisp architecture of small, tagged, dynamic data plus a tiny opset, efficiently stored in a single word with 2-3 arguments. Python, with its linearizing CFG optimizer, tried to mimic this a bit, but is still too slow. Many one-word primitives allow fast stack allocation and updates, and one-word ops allow fast icache throughput.


> Much better interpreter optimization papers came later with lua, reiterating old lisp results.

Can you point to some of those papers?


Thanks for the overview and perspective. When you say Lua's interpreter loop is perfect, are there any better? Is Lua state of the art here?


For unjitted interpreters written in C, Lua and a few other Lisp- and Lua-derived interpreters are perfect.

With a JIT added, LuaJIT is perfect, but that's a completely different beast. Also the Self/Strongtalk-derived JITs, of course: HotSpot and V8.


I also imagine this only applies inside a C interpreter. Would one written in .NET require a different solution?


With a managed runtime/interpreter, most of what makes an interpreter fast or slow is taken out of your hands.


Something tells me that Mike Pall had read this...

This provides a reasonable tl;dr, with nice ASCII figures:

http://web.archive.org/web/20170102051538/http://article.gma...


Very poor; there's no mention of continuation-passing style interpreters.

If anyone's curious about tales from the battlefield, highly recommend Cliff Click's talks (any of them) [1]

[1] https://youtu.be/Hqw57GJSrac?t=341


Wow, the class this video is taken from looks really interesting, and has more videos! Thanks for the link.

http://soft-dev.org/events/vmss16/


Thank you for this link! The info-gem bandwidth of this man is off the scales.


Like the others have said, this looks great! Very dense with information too, although that has a downside: I usually have little trouble following a tech talk, but I think I need a transcript to follow this one...

Sadly, the archive.org links on their website aren't working :(

I've just sent them a tweet, maybe they will fix it


Less than a minute into it and I'm already sitting down.


This is pretty old, from 2003, and I think modern Intel processors nullify the purported benefits of "indirect branching" (aka "computed goto") versus switch statements. A few years ago I learned of gcc's computed goto (also supported by clang, IIRC) and reimplemented lightgrep's core regex engine with computed gotos instead of the switch statement. I could not measure a difference, so I went back to the clearer, more portable switch statement.


What it comes down to is that predictable indirect branches are well-predicted. It does not require any insider information to verify this about Intel architecture processors. I built a "fake interpreter" a while back and found that branch misses in predictable sequences of interpreter code are very unlikely unless there is a longish common prefix of interpreter indirect branches in two interpreter sequences (i.e. one sequence goes ABCDEFGH and the other goes ABCDEFGZ).

This led us down a path of also using an interpreter within our own regex engine (Hyperscan) - not for the individual engines like NFA/DFA/literal match, but for the overall orchestration layer. It's been quite an improvement in code cleanliness and has generally been performance-neutral or even a win. When you replace a great big ugly clagged "if this rare thing happens, go do this code, or else if this happens, go do this other thing, then ..." with an interpreter, you can get nice performance benefits in the common case - even when there are no branch misses in the "giant clagged pile of code" anti-pattern.


That said, it's not too surprising that it would be faster on MIPS. MIPS would have far, far fewer transistors working on branch prediction than x86.


That's actually really interesting. All the literature I've read suggested that switch statements were some horrible performance killer and you had to choose between speed and portability. Cool to see that it's changed.

I think even the Lua paper had a bit about that.


Darek Mihocka has a nice writeup on inner interpreter efficiency on modern-ish architectures: http://www.emulators.com/docs/nx25_nostradamus.htm

[edit: get his name right. you'd think i'd know it, since he rubs my nose into some bugs I wrote 30+ years ago every time we meet. hi, darek! :-)]


Modern branch predictors have made switch-based interpreters more viable: https://hal.inria.fr/hal-01100647/document



