This is an old, famous paper that got a lot of its arguments wrong, and a lot right.
The famous result is the old computed goto vs. switch argument, which basically comes down to the fact that switch adds bounds-checking overhead. He didn't get that, and neither did most commenters. He also cemented the poor wording "threaded interpreter" and "indirect branching" for different concepts.
It's also a largely unimportant factor in making interpreters fast.
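To make the comparison concrete, here is a minimal sketch (mine, not the paper's; the toy opcode set is invented for illustration) of the two dispatch styles in C. The switch version lets the compiler emit a range check plus a jump table; the second version uses the GNU C label-as-value extension ("computed goto"), so each opcode ends in a bare indirect jump.

    /* Toy bytecode machine: same program run through both dispatch styles. */
    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    static void run_switch(const unsigned char *code) {
        int stack[16], sp = 0;
        for (;;) {
            switch (*code++) {          /* compiler may emit a bounds check + jump table */
            case OP_PUSH1: stack[sp++] = 1; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
            case OP_HALT:  return;
            }
        }
    }

    static void run_goto(const unsigned char *code) {
        /* GNU C computed goto: one indirect jump per opcode, no range check */
        static void *labels[] = { &&push1, &&add, &&print, &&halt };
        int stack[16], sp = 0;
        goto *labels[*code++];
    push1: stack[sp++] = 1;                  goto *labels[*code++];
    add:   sp--; stack[sp - 1] += stack[sp]; goto *labels[*code++];
    print: printf("%d\n", stack[sp - 1]);    goto *labels[*code++];
    halt:  return;
    }

    int main(void) {
        const unsigned char prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
        run_switch(prog);
        run_goto(prog);
        return 0;
    }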
Indirect call overhead is significant and measurable, but JIT compilers made that argument obsolete.
He also didn't get the case against perl. perl has huge op and data overhead and is way too dynamic. But its dispatch loop is better than the others': each perl op already returns a pointer to the next op, so there is no separate dispatch step.
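Roughly what that looks like, as a sketch with invented names rather than perl's actual internals: every op handler returns a pointer to its successor, so the runloop is nothing but a call in a loop.

    #include <stdio.h>

    typedef struct Op Op;
    typedef Op *(*OpFn)(Op *self);

    struct Op {
        OpFn  run;    /* handler for this op */
        Op   *next;   /* successor, precomputed when the op tree is built */
        int   arg;
    };

    static int acc;

    static Op *op_add(Op *self)   { acc += self->arg; return self->next; }
    static Op *op_print(Op *self) { printf("%d\n", acc); return self->next; }
    static Op *op_halt(Op *self)  { (void)self; return NULL; }

    int main(void) {
        Op halt  = { op_halt,  NULL,   0 };
        Op print = { op_print, &halt,  0 };
        Op add2  = { op_add,   &print, 2 };
        Op add1  = { op_add,   &add2,  1 };

        /* the whole "dispatch loop": keep calling whatever handler the last op returned */
        for (Op *op = &add1; op != NULL; op = op->run(op))
            ;
        return 0;
    }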
Much better interpreter-optimization papers came later with lua, reiterating old lisp results. Note: not luajit; lua already had a perfect interpreter loop. lua uses the lisp architecture of small tagged dynamic data plus a tiny opset, efficiently stored in a single word with 2-3 arguments. python, with its linearizing cfg optimizer, tried to mimic it a bit, but is still too slow. One-word primitives allow fast stack allocation and updates, and one-word ops give good icache throughput.
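A sketch of that one-word encoding: opcode plus up to three operand fields packed into a 32-bit word. The field widths here (6/8/9/9) echo Lua 5.1's layout, but the macros and the toy program are illustrative, not lua's source.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t Instruction;

    #define OPCODE(i)  ((i) & 0x3F)          /* bits 0-5   */
    #define ARG_A(i)   (((i) >> 6)  & 0xFF)  /* bits 6-13  */
    #define ARG_B(i)   (((i) >> 14) & 0x1FF) /* bits 14-22 */
    #define ARG_C(i)   (((i) >> 23) & 0x1FF) /* bits 23-31 */

    #define MAKE_ABC(op, a, b, c) \
        ((Instruction)(op) | ((Instruction)(a) << 6) | \
         ((Instruction)(b) << 14) | ((Instruction)(c) << 23))

    enum { OP_LOADK, OP_ADD, OP_HALT };

    int main(void) {
        int regs[8] = {0};
        const Instruction code[] = {
            MAKE_ABC(OP_LOADK, 0, 7, 0),   /* r0 = 7       */
            MAKE_ABC(OP_LOADK, 1, 35, 0),  /* r1 = 35      */
            MAKE_ABC(OP_ADD,   2, 0, 1),   /* r2 = r0 + r1 */
            MAKE_ABC(OP_HALT,  0, 0, 0),
        };
        for (const Instruction *pc = code;; pc++) {
            Instruction i = *pc;
            switch (OPCODE(i)) {
            case OP_LOADK: regs[ARG_A(i)] = (int)ARG_B(i); break;
            case OP_ADD:   regs[ARG_A(i)] = regs[ARG_B(i)] + regs[ARG_C(i)]; break;
            case OP_HALT:  printf("r2 = %d\n", regs[2]); return 0;
            }
        }
    }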
Like the others have said, this looks great! Very dense with information too, although that has a downside: I usually have little trouble following a tech talk, but I think I need a transcript to follow this one...
Sadly, the archive.org links on their website aren't working :(
I've just sent them a tweet, maybe they will fix it
This is pretty old, from 2003, and I think modern Intel processors nullify the purported benefits of "indirect branching" (aka "computed goto") versus switch statements. A few years ago I learned of gcc's computed goto (also supported by clang, IIRC) and I reimplemented lightgrep's core regex engine with computed gotos instead of the indirect switch statement. I could not measure a difference, and I went back to the clearer, more portable switch statement.
What it comes down to is that predictable indirect branches are well-predicted. It does not require any insider information to verify this about Intel architecture processors. I built a "fake interpreter" a while back and found that branch misses in predictable sequences of interpreter code are very unlikely unless two interpreter sequences share a longish common prefix of indirect branches (e.g. one sequence goes ABCDEFGH and the other goes ABCDEFGZ).
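For anyone who wants to poke at this themselves, here is a small harness in the spirit of that experiment (my sketch, assuming a GNU C compiler for the computed gotos, not the original code): two sequences identical up to the last opcode, run in alternation so the final indirect branch is the hard one. Run it under something like perf stat -e branch-misses and compare against a run that uses only one of the sequences.

    #include <stdio.h>

    enum { A, B, C, D, E, F, G, H, Z, END, NOPS };

    static volatile unsigned long sink;   /* keeps the handlers from being optimized away */

    static void run(const unsigned char *code) {
        static void *labels[NOPS] = {
            &&op_a, &&op_b, &&op_c, &&op_d, &&op_e,
            &&op_f, &&op_g, &&op_h, &&op_z, &&op_end
        };
        goto *labels[*code++];
    op_a:   sink += 1;  goto *labels[*code++];
    op_b:   sink += 2;  goto *labels[*code++];
    op_c:   sink += 3;  goto *labels[*code++];
    op_d:   sink += 4;  goto *labels[*code++];
    op_e:   sink += 5;  goto *labels[*code++];
    op_f:   sink += 6;  goto *labels[*code++];
    op_g:   sink += 7;  goto *labels[*code++];
    op_h:   sink += 8;  goto *labels[*code++];
    op_z:   sink += 9;  goto *labels[*code++];
    op_end: return;
    }

    int main(void) {
        /* ABCDEFGH vs ABCDEFGZ: identical until the final opcode */
        const unsigned char seq1[] = { A, B, C, D, E, F, G, H, END };
        const unsigned char seq2[] = { A, B, C, D, E, F, G, Z, END };
        for (long i = 0; i < 10 * 1000 * 1000; i++) {
            run(seq1);
            run(seq2);
        }
        printf("%lu\n", sink);
        return 0;
    }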
This led us down a path of also using an interpreter within our own regex engine (Hyperscan) - not for individual engines like NFA/DFA/literal match but for the overall orchestration layer. It's been quite an improvement in code cleanliness and has generally been either performance-neutral or even a win. When you replace a great big ugly clagged "if this rare thing happens, go do this code, or else if this happens, go do this other thing, then ..." with an interpreter, you can get nice performance benefits in the common case - even when there are no branch misses in the "giant clagged pile of code" anti-pattern.
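Not Hyperscan's actual bytecode, just a sketch of the general shape of that refactor: each pattern gets a short program containing only the checks it actually needs, so the common case runs two or three steps instead of wading through every rare-case branch.

    #include <stdio.h>

    enum { CHECK_MIN_LEN, CHECK_OFFSET_LIMIT, REPORT, DONE };

    typedef struct {
        unsigned char op;
        unsigned int  arg;
    } Step;

    /* Returns 1 if the match was reported, 0 if some check rejected it. */
    static int run_match_program(const Step *p, unsigned long long offset,
                                 unsigned long len, unsigned int id) {
        for (;; p++) {
            switch (p->op) {
            case CHECK_MIN_LEN:      if (len < p->arg) return 0;     break;
            case CHECK_OFFSET_LIMIT: if (offset > p->arg) return 0;  break;
            case REPORT:             printf("match: id=%u @%llu\n", id, offset); break;
            case DONE:               return 1;
            }
        }
    }

    int main(void) {
        /* A pattern with no rare-case requirements gets the shortest possible program. */
        const Step simple[] = { { REPORT, 0 }, { DONE, 0 } };
        /* A pattern with extra constraints gets extra steps - and only that pattern pays. */
        const Step fussy[]  = { { CHECK_MIN_LEN, 8 }, { CHECK_OFFSET_LIMIT, 100 },
                                { REPORT, 0 }, { DONE, 0 } };

        run_match_program(simple, 42, 3, 1);
        run_match_program(fussy, 42, 12, 2);
        return 0;
    }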
That's actually really interesting. All the literature I've read suggested that switch statements were some horrible performance killer and you had to choose between speed and portability. Cool to see that that's changed.