This is one of those gotchas that trap senior devs far more than junior devs. C was my first language, so I intuitively associate switch with a jump table (i.e. extremely fast). Even though I knew it was unlikely to be as performant in JS, it never occurred to me that it would actually be slow until I did a deep dive into low-level JS optimization.
Related: Matt Godbolt’s talk on emulating 6502 in JS. He also mentions the poor performance of switch for opcodes: https://youtu.be/7WuRq-Wmw5o
I think you've slightly misunderstood TFA. The issue there wasn't that the switch statement was slow, per se - it was that V8's optimizing compiler didn't yet support such statements, so V8 was bailing out of its JIT step and executing the function via a (dog-slow) interpreter. In other words the function would have run painfully slow even if it returned before reaching the switch statement.
It's also maybe worth adding that gotchas like this were super common way back in the early days TFA describes, but AFAIK they are no longer a concern now. Back when V8's optimizer was primitive it used to bail out for all kinds of reasons - because it saw a try/catch, because a function was too long, or even just because it got confused. Nowadays it's a different world - I do a lot of JS perf work, but it's probably been two years since I've seen V8 deopt anything for any reason.
I think you also misread the article. It clearly said that an optimized switch would still run as a series of "if" tests, and that when he got the compiler to generate that code, it wasn't faster.
TFA doesn't cover any cases where switch statements got optimized. It starts with the relevant code getting deopted, then the author tries a fix but the code is still deopted, then he removes the switch statements.
(Mind you I'm not saying switch statements are fast - they may well be slower than the alternatives even after being optimized. I'm just pointing out that TFA isn't about the performance of switch statements, it's about avoiding deopts.)
The EcmaScript switch spec basically says that each case expression is evaluated in turn until one is strictly equal to the switch value or default is reached, so to follow the spec the basic case is equivalent to a bunch of if-elses.
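A minimal sketch of that equivalence (the opcode value and handlers here are hypothetical, just to make the snippet runnable):

// Hypothetical opcode and handlers, for illustration only.
const op = 0x01;
const brk = () => console.log("BRK");
const ora = () => console.log("ORA");

// What the spec describes for this switch...
switch (op) {
  case 0x00: brk(); break;
  case 0x01: ora(); break;
  default: console.log("unknown opcode", op);
}

// ...is observably the same as this chain of strict-equality tests:
if (op === 0x00) { brk(); }
else if (op === 0x01) { ora(); }
else { console.log("unknown opcode", op); }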
To optimize this, the runtime first has to figure out that all case values are constants (and insert guards to de-opt if that stops holding), so in essence the machinery for running this code optimized has to be very conservative (which is probably why they placed arbitrary limits that really aren't suited for emulators).
TFA references a post that states that optimized switch statements in V8 are still just if-elseif-else and not jump tables, at least back when it was written. They might be faster than unoptimized JS, but certainly not as fast as jump tables. The author didn't even bother to split up the functions/switch blocks further when they learned about the requirements to get V8 to actually optimize a switch, as it seemed not worth it.
Sure, none of that is in doubt. But in my experience the performance cost of hot code getting deopted is so massive as to overwhelm virtually any other factor, so realistically, fixing the deopt is what solved TFA's problem. Of course getting rid of the switch statements was presumably an additional speedup, but since TFA doesn't compare those cases I guess that's neither here nor there.
Famously, function inlining decisions were based on the source length of the function (comments included), not the size of its AST, which led to situations where removing comments allowed a function to be inlined and thus run faster.
Interpreters and basic compilers are remarkably similar, and it's easy to transform between them if you do some extra busy work. Once you have your final AST, instead of outputting bytecode instructions, output the implementation of each instruction. Or vice versa.
You have code that handles different types in the interpreter? Great, bake it into the compiled code.
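A toy sketch of that transform (a made-up two-instruction bytecode, nothing resembling V8's actual pipeline): the interpreter dispatches on every instruction each time it runs, while the "compiler" walks the bytecode once, emits each instruction's implementation as source, and builds a function out of it.

// Hypothetical bytecode: PUSH <value> and ADD, for illustration only.
const PUSH = 0, ADD = 1;
const program = [PUSH, 2, PUSH, 3, ADD];

// Interpreter: dispatch happens at run time, on every instruction.
function interpret(code) {
  const stack = [];
  for (let pc = 0; pc < code.length; ) {
    switch (code[pc++]) {
      case PUSH: stack.push(code[pc++]); break;
      case ADD: stack.push(stack.pop() + stack.pop()); break;
    }
  }
  return stack.pop();
}

// "Baseline compiler": walk the bytecode once, emit the body of each
// instruction's handler as source, and build a function from it.
function compile(code) {
  let src = "const stack = [];\n";
  for (let pc = 0; pc < code.length; ) {
    switch (code[pc++]) {
      case PUSH: src += `stack.push(${code[pc++]});\n`; break;
      case ADD: src += "stack.push(stack.pop() + stack.pop());\n"; break;
    }
  }
  return new Function(src + "return stack.pop();");
}

console.log(interpret(program)); // 5
console.log(compile(program)()); // 5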
The transform from bytecode interpreter to compiler is always pretty simple, but the other way around gets more difficult the smarter the compiler is. So yes it is very possible to have a baseline javascript compiler and no interpreter.
I guess I'm differing on this point. By "baking in" the dynamic dispatch the final compiled code is, in my mind, an interpreter. Which is how I always thought of the "Full Codegen" step of the old v8 architecture.
I guess the lines can be a bit blurry especially when the compiled code still relies heavily on dynamic dispatch, but V8 generated machine code for the target CPU, marked the page as executable, and then it would call that just generated code. I think most people would call that a JIT-compiler.
It wasn’t a very advanced compiler in the sense that it generated very inefficient machine code. But it still ran circles around a tree-walking or bytecode interpreter. Over time the compiler became more advanced and multi-tiered, but at the cost of startup time, and mobile was taking off, where memory was more scarce than on desktop. x86 or ARM code is a lot bigger than a specialized bytecode that is optimized for space efficiency. So eventually the Ignition interpreter was added. But unlike some of the other Javascript engines that started as an interpreter and later added native code generation, V8 started out generating native code and only later added a bytecode interpreter.
Well let's say you wrote C++ where most of your data is in a super union of float, string*, bool, int, map*, etc, with gobs of repetitive dynamic dispatch whenever you touch one of those unions. That wouldn't mean your code is being interpreted, would it?
Could you please explain what you mean by type information? I thought you never have type information in JavaScript - except for the base types (bool, number, string, object, array, function), but you always have that information...
Basically, modern JS engines are fast because they do a lot of inferring about types. They do this both statically (by inspecting the code at compile time) and dynamically (by watching the code as it executes), and they do it for objects as well as literals.
For a hand-waving theoretical example, if you have code like:
var a = { foo:1 }
var b = { foo:2 }
var c = { foo:3 }
doSomething(a)
doSomething(b)
doSomething(c)
then modern JS engines can see that the "doSomething" function always gets called with arguments of the same type signature, and if it's a hot function they might optimize the function around that information.
If you want to know more, searching on "V8 hidden classes" should turn up relevant articles.
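One concrete illustration of the sort of thing those articles cover (a common hidden-class example; the functions here are hypothetical): objects whose properties are created in a different order typically end up with different hidden classes, so a call site that previously saw only one shape starts seeing two and may optimize worse.

// Same properties, different creation order - typically different hidden classes.
function makeOne() { return { x: 1, y: 2 }; }
function makeOther() { const o = { y: 2 }; o.x = 1; return o; }

function doSomething(p) { return p.x + p.y; }

doSomething(makeOne());   // the call site sees one shape...
doSomething(makeOther()); // ...then another, becoming polymorphic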
In my first job out of school, I was writing driver code in C. In one of my projects, I had a code block that wasn't working, but all it did was a switch on ints. On a hunch, I tried ifs, and it worked. Turns out I found a bug in the in-house GCC fork. I'm sure my manager thought I was an idiot--I was a new grad saying my project isn't working, and I think it's a compiler bug.
We always assume compilers to be infallible, and for the most part this is right. But I've found that special compilers for special targets, or certified compilers for special uses, often have issues. Why did you have a GCC fork?
I did! The compiler team agreed that it was broken and gave me a flag to disable jump tables. Surprisingly, that didn't cause any noticeable performance regressions.
> would actually be slow, until I did a deep dive into low level JS optimization
Can you explain why this is?
Intuitively it would seem like optimizing a switch statement would be really low-hanging fruit. I would have thought it would be optimized into something almost exactly like what the author hand rolled.
You can switch on arbitrary types in JS and you can add expressions:
switch (3) {
  case true: break;
  case "ab" + "c": break;
  case (Math.random() >= 0.5 ? "heads" : new Error("tails")): break;
}
The optimizer will have to be persuaded that the switch and each case is integral before optimizing with a jump table. On the other hand, Arrays typically have a fast path for integer indices.
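For contrast, a minimal sketch of the array-of-functions alternative (hypothetical handlers, roughly the shape TFA ends up with): the opcode indexes straight into a dense array, so the lookup rides that integer-index fast path instead of depending on the switch getting optimized.

// Dense array of 256 handlers indexed directly by opcode (illustration only).
const handlers = new Array(256).fill(op => { throw new Error("bad opcode " + op); });
handlers[0x00] = () => console.log("BRK");
handlers[0xEA] = () => console.log("NOP");

function step(opcode) {
  handlers[opcode](opcode); // plain integer-indexed element access
}

step(0xEA); // NOP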
The article quotes a Stack Overflow post with the conditions under which V8 will optimize the switch statement:
> @LGB actually in V8 (JS engine used by google chrome) you need to jump through a lot of hoops to get switch case optimized: All the cases must be of same type. All the cases must either be string literals or 31-bit signed integer literals. And there must be less than 128 cases. And even after all those hoops, all you get is what you would have gotten with if-elses anyway (I.E. no jump tables or sth like that). True story.
In the case of this emulator, it wasn’t being optimized because it had more than 128 cases.
That's not quite true, because arbitrary expressions can change meaning, e.g. if a function is redefined.
E.g. Math.random() will have to be called each time that case branch is evaluated, because that function could have been redefined or could return a different value.
I don't know. Maybe it's because I've spent too long thinking about such things, but interpreter dispatch routines are one of those areas of code where micro-optimizations are very pernicious and one should expect to review the landscape each time to rediscover best practice. See all the ways of compiling and executing Forth words for examples.
>> I intuitively associate switch with a jmp table
Me too, but I try to keep the set of values dense and use a mask (op & 0xff) to help the compiler know it doesn't need to do bounds checking. Verify that these changes actually help, of course.
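A quick JS sketch of the masking idea, since that's TFA's context (the fetch here is hypothetical): the mask pins the value to 0..255, so the set of possible case values (or table slots) is dense and bounded.

// Hypothetical fetch of a raw byte, for illustration only.
const fetch = () => 0xEA;
const op = fetch() & 0xFF; // guaranteed to be in 0..255
switch (op) {
  case 0x00: console.log("BRK"); break;
  case 0xEA: console.log("NOP"); break;
  // ...the rest of a dense 0x00..0xFF range...
  default: console.log("unhandled opcode", op);
}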
I like function tables. I think they are cleaner than switch statements
But (maybe this is an invalid test) checking today, my results suggest a switch is faster than a function table in 2022? At least at a base level. Maybe the more complicated the code for each case gets, the less likely it is to get optimized... which is true in general? Large code blocks are less likely to get optimized, and a switch is considered one large code block.
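For reference, this is roughly the shape of micro-benchmark I mean (the handler bodies and iteration count are made up, and JIT warmup or dead-code elimination can easily skew it, so treat the result as suggestive rather than proof):

// Two equivalent dispatchers over four trivial "opcodes" (illustration only).
function viaSwitch(op, x) {
  switch (op) {
    case 0: return x + 1;
    case 1: return x - 1;
    case 2: return x * 2;
    default: return x;
  }
}
const table = [x => x + 1, x => x - 1, x => x * 2, x => x];
function viaTable(op, x) { return table[op](x); }

function bench(name, fn) {
  let acc = 0;
  const t0 = performance.now();
  for (let i = 0; i < 1e7; i++) acc = fn(i & 3, acc);
  console.log(name, (performance.now() - t0).toFixed(1), "ms", acc);
}
bench("switch", viaSwitch);
bench("table", viaTable);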
In the meantime the entire v8 execution pipeline has been rewritten: a new optimising compiler was swapped in, a baseline interpreter was added, and the baseline compiler was removed (in more or less that order iirc).
Wouldn’t at all be surprising if that lack of optimisation had been long fixed.
Are function tables really all that much cleaner? Honest question. If I can offer a counterpoint: a function table requires us to write some dispatch routine that looks at the intention (the keys of the function table), looks up the function, finds a way to bind the arguments to it (language specific), and finally calls it. Isn't that more complexity in the code?
Isn't switch-case simple because it is available as a construct in the language?
From the title, I expected to find one of the frequently called case statements was unoptimized and they fixed it. Instead, the fix was rewriting all case statements in a way the javascript engine would optimize. Very frustrating if you're unaware of the obscure rules around it.
Not quite - the javascript engine failed to optimise reasonable situations, so the dude translated to a code structure (arrays of functions) that didn't require optimisation.
Essentially he was forced to manually code what should've been done under the hood anyway.
I'm unsure. They're different enough in structure that it feels to me it should be a conscious decision. In a switch statement there's usually one large scope, while in a function table there are a lot of little scopes. The pathway through the code during execution is different as well. Massive switch statements are also not reasonable situations from the perspective of a compiler, as they see little use outside of VMs.
Sure, I mean more in terms of producing a compiled form with O(1) rather than O(n) performance.
> Massive switch statements are also not reasonable situations from the perspective of a compiler, as they see little use outside of VMs.
I'm not really following this logic, massive switch statements are those that can benefit most significantly from the generation of jump tables.
And what needs to be done to generate them doesn't differ significantly based on the branch count (once you hit case 0x100 you'll need to switch to 16-bit offsets, so binary size will be impacted).
The fact that its primary use case is reasonably niche is fairly irrelevant given there's bugger all compiler overhead required for support.
Now from a code perspective, I would say that a statement that large probably isn't ideal (depending on the size of your case blocks).
> Now from a code perspective, I would say that a statement that large probably isn't ideal (depending on the size of your case blocks).
This is what I had meant, yes. Optimizing such large blocks is quite difficult. A case label in C and C-like languages does not even need a block, making it harder.
> The fact that it's primary use case is reasonably niche is fairly irrelevant given there's bugger all compiler overhead required for support.
Even static C compilers still struggle with optimizing the switch-in-for-loop pattern, let alone a JIT compiler. Clang does the best job nowadays, but it still lags behind other methods like computed goto or continuation-passing. The Python VM will switch to computed goto if it's available on the C compiler.
It does take work to make a fast VM; it feels unreasonable to expect a compiler to do a good job of optimizing one without effort on the writer's side.
In the grand scheme of things, VMs and interpreters are a class of program that, among the programs people are likely to write, tends toward perhaps surprisingly pathological behaviour.
This kind of stuff is usually at least somewhat frustrating in the moment, but it’s just the nature of things: we’re always mapping some task onto some machine and it wouldn’t be such a great story if it wasn’t such an elegant hack. This one is particularly interesting because it’s one VM mapping onto a second VM and the impedance mismatch is very meta.
My only nitpick with a great read is that I want to know a little more gory details of how the automation was done, it could easily have involved a third or even fourth VM!
But knowing how the JIT deoptimizes is a good thing for anyone programming perf-sensitive code for that JIT. Developing a sense of where branchy logic might deopt is useful even if your own code isn’t perf-sensitive but could be called in a perf-sensitive call stack, because it’ll give you a good heuristic for when optimization is premature. Optimizing for the JIT isn’t and shouldn’t be a guiding principle for every dev, but it’s harmless or good for us to be aware.
> Optimizing for the JIT isn’t and shouldn’t be a guiding principle for every dev, but it’s harmless or good for us to be aware.
No, it can be quite harmful: you could end up overoptimizing for one engine and ruin performance on another engine, or even a different version of the same engine. When making non-trivial optimizations that are designed around the quirks of the optimizing compiler you should be very careful about the tradeoffs and also what the penalty will be if the optimization breaks in the future.
Yep, exactly. There are very few codebases that can actually write code directly to the optimizer and they are usually developed very closely with a specific compiler. For other cases usually you can trust that a number of basic optimizations can be performed and then be cautious about what the various compilers can do. As you gain more experience you can slowly expand the set of things you would expect to be "easy" for the compiler–when those passes don't run, they're generally considered to be bugs.
It is no different from compiled languages; that is one of the reasons why most compilers, regardless of the language, let you look at the generated assembly.
The difference is control. You can choose which compiler to use for your project, even an outdated version if it works for your use case. But you have no control over what JIT engine visits your site.
For inner loops of interpreters in low-level languages, unless you are using some labels-are-values extension, it's often tough to convince the thing to generate the right code.
For anyone looking for an open source web-based TI emulator I just ported wxWabbitEmu (which I believe supports the TI-84) [1]. I only skinned the TI-85 since it's the calc I use, but the skins are just CSS.
I'm curious what the performance is like. I run it on my phone as a basic calculator, haven't messed with running binaries at all.
This is really cool! It's a shame that jsTIfied isn't open source; I would have liked to use it for live demos in my roundup [1] of cool TI software projects. Your project with a skin would be perfect.
Using a map or array instead of switch-case is a common optimization, but please do a benchmark if you're not sure whether you need it.
Shameless plug for my 6502 JS emulator / assembler in an Observable notebook, which takes the opcode, converts it into pseudo-assembly, and then uses a switch statement on the mnemonic. I expected it would be horribly slow, but maybe this switch on the mnemonic (50-something instructions) is OK?
> I was used to dealing with obfuscated code from Minecraft
I wonder how common this is for engineers. I’ve been in industry for the last 4-5 years or so, and it was how I got a pretty big jump into programming. Talked to a few similarly aged engineers at FAANG and a couple others got into programming through Minecraft as well.
Minecraft is too new for me, but I did get most of my pre-professional experience from modding in video games, other people I've worked with can say the same.
I understand that switch-cases may not be optimized well in JS, but what surprises me is that all the instructions in the cases are now grouped into functions and this is faster than regular switch case.
When I write Javascript, I expect function calls to be slow. So I see three possibilities:
- (unoptimized) switch statement is very slow in JS, so slow that function calls will be faster
- function calls are not that slow in the end.
- both
It's a shame that the switch statement cannot be relied on to be fast, though. They are very readable and, indeed, feel like code that can be highly optimized. I guess doing so would risk making the JIT compilation too slow, or not enough code found on the web relies on switch statements for this to matter too much.
Is this still true today? (I've seen a comment that says the 128 limit is still in V8's code)
How is this handled in SpiderMonkey, JavascriptCore or other JS engines?
TFA is basically about a limitation in V8's compiler, circa 2013. Based on the bug report it looks like that limitation was gone a few years later, and either way V8 switched to a whole different compiler back in 2017 or so.
Interestingly I wrote a 6502 (actually Atari 6507) emulator and Z80 emulator for the MIPS R3000 in the original SONY PSX, and the only way to get the speed I needed was to use calculated jump tables rather than relying on a switch case. I know that C provides that calculated jump, but even that wasn't enough. Eventually I rewrote the entire emulator in R3000. And then I eventually switched to using CPU retargeting/recompilation from 6502 to R3000, and then used the same recompilation trick to get 6502 emulation running at a decent speed on the ARM CPU in the Nintendo Gameboy Advance.
That depends a lot on the compiler, I guess in the 90's compilers didn't optimize switch-case that well yet.
In modern compilers a switch-case may end up faster than a function-pointer jump table, because even though the compiler will turn the switch-case into a jump table as well, the code that's called through the jump table doesn't have the prologue/epilogue overhead of regular function call. In my emulators I also haven't seen a difference between a regular switch case, and 'computed goto' (but depending on how the decoder loop looks like, computed goto may still provide have an advantage, just not in my CPU emulators).
To add to this I didn't do a function-pointer jump table. A function pointer jump table was too slow and involved multiple look-ups. I took the opcode of the 6502 or Z80 instruction, shifted it, then added that value to the PC (program counter) register. The new value in the program counter was effectively the goto. All the assembly functions that handled each opcode were aligned on a multiple of powers of two value, 2048 bytes IIRC. Though it might have been 1024 bytes, this is over 25 years ago. The instruction immediately after the PC write set the the return address for the function about to be called, the return address write worked due to the pipelining in the MIPS CPU. The return address was the instruction immediately after the PC write. The opcode function executes, and then returns back to the write instruction, which then falls through to the no-op (opcode zero), which handled the gathering of the next opcode to be processed, any updates to the display and hardware, cleaned up the stack, and then looped to process the next opcode.
The old compilers optimized the switch/case, just not as well as something from 2022, and there were lots of edge cases too. And limited registers, and small caches and so forth. Plus, these were CPUs that were at the boundary of what could be done in terms of emulation of another CPU and everything else about the device being emulated too. Today we have the advantage of far more advanced compiler technology and far more advanced CPU technology that runs crazy fast, and a greater understanding of how to emulate the machines and best practices for creating those emulators. 25 years ago, not so much. The PS2 and the XBOX were the first consoles where you could just take MAME or the 2600 VCS emulator I wrote and compile it to the target console and "it just worked." I ported MAME to the PS2 and ran six different games on a tumbling cube, and then MAME to the XBOX shortly afterwards, and Dolphin emulator on XBOX and PS2 as well - though again, smacking up against what that generation of consoles could realistically emulate.
Don't know the guts of the optimization, but 128 64bit pointers is 1024, which is a round number. Could be to avoid memory or cache blowup, or maybe they're using that 8th bit for something internally
Tangentially related, but I remember reading years ago a piece from a graphics driver developer (nvidia maybe?) that one of the reasons why the opengl driver was slower than their directx driver (I believe) was that the structure of OpenGL was such that you had to put a massive switch in the code which really hurt performance.
Very unlikely that an actual switch statement was the problem (if that actually exists, and wasn't just 'figure of speech'). OpenGL state is a huge collection of tiny knobs and toggles ('switches' if you like) and each state change may in turn influence other state, while D3D (since D3D10 at least) groups the same state into a small number of immutable state objects.
Worth noting that SO comment calls out that the linked line points to the code as it was in 2013 and that there have been changes since then:
> The link into the V8 source code above is pretty old (2013), so I tried to find the modern equivalent. I didn't find a hard limit, but found several heuristics that decide between table lookups and tree (binary search) lookups (ia32, x86). When I plug in my numbers I don't quite get a borderline case where I found it, so I'm not sure this is the actual cause or whether there's another optimization not being triggered elsewhere.
I get the same sensation too. It's kind of funny how these little reminders of people and interactions from cemetech and Omni pop up elsewhere in the tech community. I'm still proud of some of the stuff I wrote back in those days.