This is an old, famous paper that got a lot of its arguments wrong, and a lot right.
The famous result is the old computed goto vs. switch argument, which basically comes down to the fact that switch adds bounds-checking overhead. He didn't get that, and neither did most commenters. He also cemented the poor wording "threaded interpreter" and "indirect branching" for different concepts.
It's also a largely unimportant factor in making interpreters fast.
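To make the comparison concrete, here is a minimal sketch (mine, not the paper's; the toy opcode set is invented for illustration) of the two dispatch styles in C. The switch version lets the compiler emit a range check plus a jump table; the second version uses the GNU C label-as-value extension ("computed goto"), so each opcode ends in a bare indirect jump.

    /* Toy bytecode machine: same program run through both dispatch styles. */
    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    static void run_switch(const unsigned char *code) {
        int stack[16], sp = 0;
        for (;;) {
            switch (*code++) {          /* compiler may emit a bounds check + jump table */
            case OP_PUSH1: stack[sp++] = 1; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
            case OP_HALT:  return;
            }
        }
    }

    static void run_goto(const unsigned char *code) {
        /* GNU C computed goto: one indirect jump per opcode, no range check */
        static void *labels[] = { &&push1, &&add, &&print, &&halt };
        int stack[16], sp = 0;
        goto *labels[*code++];
    push1: stack[sp++] = 1;                  goto *labels[*code++];
    add:   sp--; stack[sp - 1] += stack[sp]; goto *labels[*code++];
    print: printf("%d\n", stack[sp - 1]);    goto *labels[*code++];
    halt:  return;
    }

    int main(void) {
        const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        run_switch(prog);
        run_goto(prog);
        return 0;
    }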
Indirect call overhead is significant and measurable, but JIT compilers made that argument obsolete.
He also didn't get the case against perl. perl has huge op and data overhead and is way too dynamic. But its dispatch loop is better than the others': each perl op already returns a pointer to the next op, so there is no separate dispatch step.
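Roughly what that looks like, as a sketch with invented names rather than perl's actual internals: every op handler returns a pointer to its successor, so the runloop is nothing but a call in a loop.

    #include <stdio.h>

    typedef struct Op Op;
    typedef Op *(*OpFn)(Op *self);

    struct Op {
        OpFn  run;    /* handler for this op */
        Op   *next;   /* successor, precomputed when the op tree is built */
        int   arg;
    };

    static int acc;

    static Op *op_add(Op *self)   { acc += self->arg; return self->next; }
    static Op *op_print(Op *self) { printf("%d\n", acc); return self->next; }
    static Op *op_halt(Op *self)  { (void)self; return NULL; }

    int main(void) {
        Op halt  = { op_halt,  NULL,   0 };
        Op print = { op_print, &halt,  0 };
        Op add2  = { op_add,   &print, 2 };
        Op add1  = { op_add,   &add2,  1 };

        /* the whole "dispatch loop": keep calling whatever handler the last op returned */
        for (Op *op = &add1; op != NULL; op = op->run(op))
            ;
        return 0;
    }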
Much better interpreter-optimization papers came later with lua, reiterating old lisp results. Note: not luajit; lua already had a perfect interpreter loop. lua uses the lisp architecture of small tagged dynamic data plus a tiny opset, efficiently stored in a single word with 2-3 arguments. python, with its linearizing cfg optimizer, tried to mimic it a bit, but is still too slow. One-word primitives allow fast stack allocation and updates, and one-word ops give good icache throughput.
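A sketch of that one-word encoding: opcode plus up to three operand fields packed into a 32-bit word. The field widths here (6/8/9/9) echo Lua 5.1's layout, but the macros and the toy program are illustrative, not lua's source.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t Instruction;

    #define OPCODE(i)  ((i) & 0x3F)          /* bits 0-5   */
    #define ARG_A(i)   (((i) >> 6)  & 0xFF)  /* bits 6-13  */
    #define ARG_B(i)   (((i) >> 14) & 0x1FF) /* bits 14-22 */
    #define ARG_C(i)   (((i) >> 23) & 0x1FF) /* bits 23-31 */

    #define MAKE_ABC(op, a, b, c) \
        ((Instruction)(op) | ((Instruction)(a) << 6) | \
         ((Instruction)(b) << 14) | ((Instruction)(c) << 23))

    enum { OP_LOADK, OP_ADD, OP_HALT };

    int main(void) {
        int regs[8] = {0};
        const Instruction code[] = {
            MAKE_ABC(OP_LOADK, 0, 7, 0),   /* r0 = 7       */
            MAKE_ABC(OP_LOADK, 1, 35, 0),  /* r1 = 35      */
            MAKE_ABC(OP_ADD,   2, 0, 1),   /* r2 = r0 + r1 */
            MAKE_ABC(OP_HALT,  0, 0, 0),
        };
        for (const Instruction *pc = code;; pc++) {
            Instruction i = *pc;
            switch (OPCODE(i)) {
            case OP_LOADK: regs[ARG_A(i)] = (int)ARG_B(i); break;
            case OP_ADD:   regs[ARG_A(i)] = regs[ARG_B(i)] + regs[ARG_C(i)]; break;
            case OP_HALT:  printf("r2 = %d\n", regs[2]); return 0;
            }
        }
    }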
Like the others have said, this looks great! Very dense with information too, although that has a downside: I usually have little trouble following a tech talk, but I think I need a transcript to follow this one...
Sadly, the archive.org links on their website aren't working :(
I've just sent them a tweet, maybe they will fix it
This is pretty old, from 2003, and I think modern Intel processors nullify the purported benefits of "indirect branching" (aka "computed goto") versus switch statements. A few years ago I learned of gcc's computed goto (also supported by clang, IIRC) and I reimplemented lightgrep's core regex engine with computed gotos instead of the indirect switch statement. I could not measure a difference, and I went back to the clearer, more portable switch statement.
What it comes down to is that predictable indirect branches are well-predicted. It does not require any insider information to verify this about Intel architecture processors. I built a "fake interpreter" a while back and found that branch misses in predictable sequences of interpreter code are very unlikely unless two interpreter sequences share a longish common prefix of indirect branches (e.g. one sequence goes ABCDEFGH and the other goes ABCDEFGZ).
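For anyone who wants to poke at this themselves, here is a small harness in the spirit of that experiment (my sketch, assuming a GNU C compiler for the computed gotos, not the original code): two sequences identical up to the last opcode, run in alternation so the final indirect branch is the hard one. Run it under something like perf stat -e branch-misses and compare against a run that uses only one of the sequences.

    #include <stdio.h>

    enum { A, B, C, D, E, F, G, H, Z, END, NOPS };

    static volatile unsigned long sink;   /* keeps the handlers from being optimized away */

    static void run(const unsigned char *code) {
        static void *labels[NOPS] = {
            &&op_a, &&op_b, &&op_c, &&op_d, &&op_e,
            &&op_f, &&op_g, &&op_h, &&op_z, &&op_end
        };
        goto *labels[*code++];
    op_a:   sink += 1;  goto *labels[*code++];
    op_b:   sink += 2;  goto *labels[*code++];
    op_c:   sink += 3;  goto *labels[*code++];
    op_d:   sink += 4;  goto *labels[*code++];
    op_e:   sink += 5;  goto *labels[*code++];
    op_f:   sink += 6;  goto *labels[*code++];
    op_g:   sink += 7;  goto *labels[*code++];
    op_h:   sink += 8;  goto *labels[*code++];
    op_z:   sink += 9;  goto *labels[*code++];
    op_end: return;
    }

    int main(void) {
        /* ABCDEFGH vs ABCDEFGZ: identical until the final opcode */
        const unsigned char seq1[] = { A, B, C, D, E, F, G, H, END };
        const unsigned char seq2[] = { A, B, C, D, E, F, G, Z, END };
        for (long i = 0; i < 10 * 1000 * 1000; i++) {
            run(seq1);
            run(seq2);
        }
        printf("%lu\n", sink);
        return 0;
    }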
This led us down a path of also using an interpreter within our own regex engine (Hyperscan) - not for individual engines like NFA/DFA/literal match but for the overall orchestration layer. It's been quite an improvement in code cleanliness and has generally been either performance-neutral or even a win. When you replace a great big ugly clagged "if this rare thing happens, go do this code, or else if this happens, go do this other thing, then ..." with an interpreter, you can get nice performance benefits in the common case - even when there are no branch misses in the "giant clagged pile of code" anti-pattern.
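Not Hyperscan's actual bytecode, just a sketch of the general shape of that refactor: each pattern gets a short program containing only the checks it actually needs, so the common case runs two or three steps instead of wading through every rare-case branch.

    #include <stdio.h>

    enum { CHECK_MIN_LEN, CHECK_OFFSET_LIMIT, REPORT, DONE };

    typedef struct {
        unsigned char op;
        unsigned int  arg;
    } Step;

    /* Returns 1 if the match was reported, 0 if some check rejected it. */
    static int run_match_program(const Step *p, unsigned long long offset,
                                 unsigned long len, unsigned int id) {
        for (;; p++) {
            switch (p->op) {
            case CHECK_MIN_LEN:      if (len < p->arg) return 0;     break;
            case CHECK_OFFSET_LIMIT: if (offset > p->arg) return 0;  break;
            case REPORT:             printf("match: id=%u @%llu\n", id, offset); break;
            case DONE:               return 1;
            }
        }
    }

    int main(void) {
        /* A pattern with no rare-case requirements gets the shortest possible program. */
        const Step simple[] = { { REPORT, 0 }, { DONE, 0 } };
        /* A pattern with extra constraints gets extra steps - and only that pattern pays. */
        const Step fussy[]  = { { CHECK_MIN_LEN, 8 }, { CHECK_OFFSET_LIMIT, 100 },
                                { REPORT, 0 }, { DONE, 0 } };

        run_match_program(simple, 42, 3, 1);
        run_match_program(fussy, 42, 12, 2);
        return 0;
    }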
That's actually really interesting. All the literature I've read suggested that switch statements were some horrible performance killer and you had to choose between speed and portability. Cool to see that that's changed.