It was almost definitely HP Dynamo. (Edit: if you combine ideas from HP Dynamo, SafeTSA JIT-optimized bytecode, and the IBM AS/400's TIMI (Technology Independent Machine Interface), you get a better version of the current Android Runtime for bytecode-distributed apps that compile ahead of time to native code and self-optimize at runtime based on low-overhead profiling.)
The really nice thing about Dynamo was that it was a relatively simple trace-based JIT compiler from native code to native code (plus a native code interpreter for non-hotspots). This meant that it would automatically inline hotspots across DLLs and through C++ virtual method dispatches (with appropriate guards to jump back to interpreter mode if the virtual method implementation didn't match or the PLT entry got modified). They didn't have to do any special-casing of the interpreter to handle virtual method calls or cross-DLL calls; it's just a natural consequence of a trace-based JIT from native code to native code.
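To make the guard idea concrete, here's roughly what an optimized trace looks like at the C level once a virtual call has been inlined. This is just my illustration; all the names are made up, and the real Dynamo worked on raw PA-RISC machine code, not C source:

    struct Shape;
    struct ShapeVtbl { double (*area)(const struct Shape *); };
    struct Shape     { const struct ShapeVtbl *vtbl; double r; };

    static double area_circle(const struct Shape *s);
    static const struct ShapeVtbl circle_vtbl = { area_circle };  /* vtable seen while recording the trace */

    static double area_circle(const struct Shape *s)
    {
        return 3.14159265358979 * s->r * s->r;
    }

    /* Side exit: in real Dynamo this jumps back to the interpreter; here it
       just falls back to ordinary virtual dispatch. */
    static double trace_side_exit(const struct Shape *s)
    {
        return s->vtbl->area(s);
    }

    /* The optimized trace: area_circle() inlined behind a guard on the vtable. */
    double traced_area(const struct Shape *s)
    {
        if (s->vtbl != &circle_vtbl)            /* guard: receiver matches what was traced */
            return trace_side_exit(s);
        return 3.14159265358979 * s->r * s->r;  /* inlined body */
    }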
The only downsides of something like Dynamo are (1) a bit of complexity and space usage, (2) some startup overhead due to starting in interpretive mode, and (3) higher overhead if your program is unusual in not having a roughly Zipf distribution of CPU usage across its code.
Ever since I read about Michael Franz et al.'s SafeTSA, an SSA-based alternative to JVM bytecode that more quickly generated higher-performing native code, I've had a long-term back-burner idea to write a C compiler that generates native code in a particular way (functions are all compiled to arrays of pointers to straight-line extended basic blocks) that makes tracing easier, and also stores a SafeTSA-like SSA bytecode along with the native code. That way, a Dynamo-like runtime wouldn't need an interpreter, and when it came time to generate an optimized trace, it could skip the first step of decompiling native code to an SSA form. (Also, the SSA would be a bit cleaner as input for an optimizer, as the compilation-decompilation round-trip tends to make the SSA a bit harder to optimize, as shown by Franz's modification of Pizza/JikesRVM to run both SafeTSA and JVM bytecode.) Once you have your trace, you don't need on-stack replacement to get code in a tight loop to jump into the optimized trace; you just swap one pointer to native code in the function's array of basic blocks. (All basic blocks are straight-line code, so the only way to loop is to jump back to the start of the same basic block via the array of basic block pointers.)
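A minimal sketch of what I mean (my own toy layout, not anything Franz or the Dynamo folks actually built): each function is a table of pointers to extended basic blocks, every block returns the index of its successor, and installing an optimized trace for the hot loop is one store into the table.

    #include <stddef.h>

    typedef size_t (*ebb_fn)(void *frame);   /* one straight-line extended basic block */
    #define N_BLOCKS 3
    struct frame { long i, n, acc; };

    /* Blocks emitted ahead of time.  A real implementation would tail-jump
       through the table; a small driver loop stands in for that here. */
    static size_t bb0_entry(void *f) { struct frame *fr = f; fr->i = 0; fr->acc = 0; return 1; }
    static size_t bb1_loop(void *f)  { struct frame *fr = f; fr->acc += fr->i++; return fr->i < fr->n ? 1 : 2; }
    static size_t bb2_exit(void *f)  { (void)f; return N_BLOCKS; }   /* sentinel: done */

    /* An optimized trace generated at run time for the hot loop. */
    static size_t bb1_optimized(void *f) { struct frame *fr = f; fr->acc = fr->n * (fr->n - 1) / 2; fr->i = fr->n; return 2; }

    static ebb_fn blocks[N_BLOCKS] = { bb0_entry, bb1_loop, bb2_exit };

    long run(long n)
    {
        struct frame fr = { 0, n, 0 };
        size_t next = 0;
        while (next < N_BLOCKS)
            next = blocks[next](&fr);    /* all control flow goes through the block table */
        return fr.acc;
    }

    void install_trace(void) { blocks[1] = bb1_optimized; }   /* one pointer swap, no on-stack replacement */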
The background for HP Dynamo is that during the Unix wars, there were a bunch of RISC system vendors vying for both the high-end workstation and server markets. Sun had SPARC, SGI had MIPS, DEC had Alpha AXP (and earlier, some MIPS DECstations) and HP had PA-RISC. The HP Dynamo research project wanted to show that emulation via dynamic recompilation could be fast, so to get an apples-to-apples comparison for emulation overhead, they wrote a PA-RISC emulator for PA-RISC.
This project grew into an insanely powerful tool. It's called DynamoRIO and is still under active development and use today. It's one of the coolest technologies I've ever worked with.
It's used by the WinAFL fuzzer to provide basic block coverage for black-box binaries.
Yes, I have poked around DynamoRIO a few times. It's now geared toward dynamically modifying binaries for various purposes, from code coverage to fuzzing to performance and memory profiling.
There doesn't currently appear to be a turn-key solution similar to the original Dynamo. DynamoRIO could be used to put a small conditional tracing stub at the start of every basic block at application startup time and then do some binary rewriting, similar to the original Dynamo, but there don't seem to be downloadable binaries that currently do this.
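If you wanted to experiment, the basic-block event in the client API is where you'd hang such a stub. Something along these lines (written from memory and untested, so treat the details as approximate):

    #include "dr_api.h"

    /* Clean-call target: record that the basic block identified by `tag` ran.
       A Dynamo-style client would do its cheap conditional tracing here. */
    static void at_bb(app_pc tag)
    {
        (void)tag;   /* e.g. bump a counter keyed by tag, or append to a trace buffer */
    }

    static dr_emit_flags_t event_bb(void *drcontext, void *tag, instrlist_t *bb,
                                    bool for_trace, bool translating)
    {
        /* Insert a call to at_bb(tag) at the top of every basic block. */
        dr_insert_clean_call(drcontext, bb, instrlist_first(bb), (void *)at_bb,
                             false /* don't save fp state */, 1, OPND_CREATE_INTPTR(tag));
        return DR_EMIT_DEFAULT;
    }

    DR_EXPORT void dr_client_main(client_id_t id, int argc, const char *argv[])
    {
        dr_register_bb_event(event_bb);
    }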
This dynamic optimization would be much easier and lower overhead (but less general) with cooperation from the compiler.
Could such a compiler include the runtime for this in the binary as an option? That might make it a lot more likely to be used by people, because it is all nice and stand-alone.
Who would benefit from this most? Is the benefit so diffuse it would almost have to be an open-source project without funding? Or could there be parties that see enough of an advantage to fund this?
I guess you could try to get a certain instruction set vendor (probably RISC-V, maybe ARM or x86 based) to offer this as a boost for their chips. I guess the "functions are arrays of pointers to blocks" compilation could benefit from hardware acceleration.
You could presumably statically link in the runtime. Also, without the dynamically-optimizing runtime, it would run just fine, just a bit slower than normal native code due to the extra indirection. Lots of indirect calls also increase the chances of address mispredictions due to tag collisions in the BTB (Branch Target Buffer).
Function calls and jumps through arrays of pointers are already how virtual method calls (and optimized virtual method tail calls) are executed. Though, in this case, the table offsets would be held in a register instead of as immediate values embedded in the instruction. I'm not aware of any instruction set where they've decided it's worthwhile to add instructions specifically to speed up C++ virtual member function dispatch, so I doubt they'd find optimizing this worthwhile.
Also, if things go according to plan, your hot path is a long straight run of code, with only occasional jumps through the table.
The GP only asked about CPU instructions for faster indirect jumps, but I should add that there are at least four things that would help a system designed for pervasive dynamic re-optimization of native code:
1. Two special registers (trace_position pointer and trace_limit pointer) for compact tracing of native code. If the position is less than the limit, then for every backward branch, indirect jump, and indirect function call, the branch target is stored at the position pointer, and the position pointer is incremented. Both trace_position and trace_limit are initialized to zero at thread start, disabling tracing. When the profiling timer handler (presumably a SIGVTALRM handler on Linux) executes, it would use some heuristic to determine whether tracing should start. If so, it would store the resumption instruction pointer to the start of a thread_local trace buffer, set trace_position to point to the second entry in the trace buffer, and set trace_limit to one past the end of the trace buffer. There is no need to implement a separate interrupt for when the trace buffer fills up; it just turns off tracing. Instead, re-optimizing the trace can be delayed until the next time the profiling timer handler is invoked. (A rough software stand-in for this is sketched after this list.)
2. A lighter-weight mechanism for profiling timers that can be both set up and handled without switching from user space to kernel space. Presumably it looks like a cycle counter register and a function pointer register that gets called when the counter hits zero. Either the size of the ABI's stack red zone would be hard-coded, or there would need to be another register for how much to decrement the stack pointer to jump over the red zone when going into the signal handler.
3. Hardware support for either unbiased reservoir sampling or a streaming N-most-frequent algorithm[0] to keep track of the instruction pointers of the instructions causing pipeline stalls. This helps static instruction scheduling for those spots where the CPU's re-order buffer isn't large enough to prevent stalls. (Lower-power processors/VLIWs typically don't execute out of order, so this would be especially useful there.) Reservoir sampling can be efficiently approximated using a linear feedback shift register PRNG logical-anded against a mask based on the most significant set bit in a counter (a rough sketch of that approximation appears after this list). I'm not aware of efficient hardware approximations of a streaming N-most-frequent algorithm. One of the big problems with Itanium is that it relies on very good static instruction scheduling by the compiler, but that involves being good at guessing which memory reads are going to be cache misses. On most RISC processors, the number of source operands is less than the number of bytes per instruction, so you could actually encode which argument wasn't available in cases where, for instance, you're adding two registers that were both recently destinations of load instructions.
4. A probabilistic function call instruction. For RISC processors, the target address would be IP-relative, with the offset stored as an immediate value in the instruction. The probability that the call is taken would be encoded in the space usually used to indicate which registers are involved. This allows lightweight profiling by calling into a sampling stub that looks back at the function return address (sketched after this list). Presumably some cost-estimation heuristic would be used to determine the probability embedded in the instruction, to make the sampling roughly weighted by cost.
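To make item 1 concrete, here is a rough software stand-in (untested; the whole point is that hardware would do the per-branch store for free). It also shows the setitimer/SIGVTALRM plumbing that item 2 wants to replace with a pure user-space mechanism:

    #define _GNU_SOURCE   /* for REG_RIP on x86-64 Linux */
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/time.h>
    #include <ucontext.h>

    #define TRACE_CAP 4096

    static __thread uintptr_t trace_buf[TRACE_CAP];
    static __thread uintptr_t *trace_position;   /* NULL => tracing disabled */
    static __thread uintptr_t *trace_limit;

    /* In hardware this would happen implicitly on every backward branch,
       indirect jump, and indirect call; in software, emitted code or an
       instrumentation stub would have to call it. */
    static inline void trace_branch_target(uintptr_t target)
    {
        if (trace_position && trace_position < trace_limit)
            *trace_position++ = target;
        /* when the buffer fills, tracing just stops; the trace gets picked
           up and optimized at the next timer tick */
    }

    static int should_start_tracing(void) { return trace_position == NULL; }  /* placeholder heuristic */

    static void on_profile_tick(int sig, siginfo_t *si, void *uc_)
    {
        (void)sig; (void)si;
        ucontext_t *uc = uc_;
        if (should_start_tracing()) {
            /* record where the thread will resume, then "arm" the trace registers */
            trace_buf[0]   = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
            trace_position = &trace_buf[1];
            trace_limit    = &trace_buf[TRACE_CAP];
        }
        /* else: if a finished trace is sitting in trace_buf, hand it to the optimizer */
    }

    void start_profiling(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_profile_tick;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigaction(SIGVTALRM, &sa, NULL);

        struct itimerval it = { { 0, 10000 }, { 0, 10000 } };  /* ~10 ms of CPU time */
        setitimer(ITIMER_VIRTUAL, &it, NULL);
    }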
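And here is the reservoir-sampling approximation from item 3 as I imagine it (an illustration of the trick, not a description of any shipping hardware): a 16-bit LFSR as the PRNG, with an acceptance mask derived from the most significant set bit of the event count, so the acceptance probability tracks the K/n of true reservoir sampling in power-of-two steps.

    #include <stdint.h>

    #define RESERVOIR_SIZE 16
    static uintptr_t reservoir[RESERVOIR_SIZE];   /* sampled stall IPs */
    static uint64_t  stall_count;                 /* n: stall events seen so far */
    static uint16_t  lfsr = 0xACE1u;              /* any nonzero seed */

    /* Standard 16-bit Fibonacci LFSR step (taps 16, 14, 13, 11). */
    static uint16_t lfsr_step(void)
    {
        uint16_t bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1u;
        lfsr = (uint16_t)((lfsr >> 1) | (bit << 15));
        return lfsr;
    }

    /* Mask with floor(log2(x)) low bits set, so (rand & mask) == 0 happens
       with probability about 1/x, rounded to a power of two. */
    static uint16_t accept_mask(uint64_t x)
    {
        uint16_t mask = 0;
        while (x > 1) { mask = (uint16_t)((mask << 1) | 1u); x >>= 1; }
        return mask;
    }

    /* Conceptually invoked by the pipeline with the IP of a stalling instruction. */
    void record_stall(uintptr_t ip)
    {
        stall_count++;
        if (stall_count <= RESERVOIR_SIZE) {              /* fill phase: keep everything */
            reservoir[stall_count - 1] = ip;
            return;
        }
        /* True reservoir sampling would replace with probability K/n;
           this accepts with probability ~1/2^floor(log2(n/K)). */
        if ((lfsr_step() & accept_mask(stall_count / RESERVOIR_SIZE)) == 0)
            reservoir[lfsr_step() % RESERVOIR_SIZE] = ip; /* fresh bits pick the victim slot */
    }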
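Finally, the sampling stub for item 4 is tiny on the software side; it's essentially this (GCC/Clang-style, and hypothetical in that the probabilistic call instruction deciding whether the stub gets called at all doesn't exist):

    #include <stdint.h>

    #define SAMPLE_SLOTS 1024
    static uint64_t sample_counts[SAMPLE_SLOTS];

    /* The call site that "won the coin flip" sits immediately before the
       return address, so charge the sample to it. */
    void __attribute__((noinline)) profile_sample_stub(void)
    {
        uintptr_t caller = (uintptr_t)__builtin_return_address(0);
        sample_counts[(caller >> 2) % SAMPLE_SLOTS]++;   /* crude hash of the call site */
    }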