There is still significant sharing that can be achieved inside a VM. Plus, a lot of the sharing comes from zero pages (full of 0s), which is still performed across VMs.
Another benefit of the salting mechanism is that it allows the administrator to define groups of mutually trusted VMs within which sharing will be performed.
disclaimer: I work at VMware and wrote the salting code.
A lot of those optimizations would no longer yield any benefits[0]. CPU architecture has evolved a lot in 16 years, especially in branch/code prediction, to the point where a correctly predicted branch (even without branch_likely) has almost no cost.
As a CPU architect, I can confirm that all those except possibly 2) will not yield significant benefits. Prefetching hints will only be useful when the particular code fragment is highly memory-bound because most wide superscalar microarchitectures will easily hide L1/L2 miss latencies.
My qp trie code <http://dotat.at/prog/qp/> got a performance boost of about 20% by adding prefetch hints in the obvious places. The inner loop is fetch / compute / fetch / compute, chaining down into the trie. The next fetch will (usually) be at some small offset from a pointer we can get immediately after the preceding fetch, so prefetch the base pointer, then compute to work out the exact offset.
If a DDR stall is 50+ CPU cycles (probably a lot more with today's 2–3 GHz CPUs), I am not sure superscalar microarchitectures would help that much.
At least in my case of a networking packet-forwarding app, I had the profiling data to prove that it was an issue.
The app code is not that long, ~2000 lines after cleanup, but it has a lot of table lookups (DDR stalls) and branches for error-condition checks.
A MIPS is probably the exact opposite to modern (which actually means anything P6 and above) x86 CPUs in terms of performance characteristics. If I were to guess what member of the x86 family might actually benefit from such optimisation, it would be NetBurst (which itself has very different performance characteristics from every other x86 family that came before or after it.)
I was trying to optimize a network app, with the goal of getting to 1 million pps. With the 200MHz CPUs of that time, one cache miss is 50+ cycles, or 25% of the per-packet CPU budget; prefetch helped a lot in that case.
I've occasionally wondered how long it takes highly optimized C/C++ to be surpassed by optimizing compilers as CPUs advance, whether because the hand optimizations make compiler optimization harder, or because they target assumptions about CPU architecture that are no longer valid.
That is, what is the shelf life of a very low level CPU optimization for Intel hardware.
While it seemed to cover more of the compiler optimizations and how to do some low-level benchmarking and optimizing, and wasn't really addressing when those might become obsolete, it was really interesting and informative. Thanks!
It's more a matter of what you consider to be obvious in this case. In my experience, especially regex matching can become very not-so-obvious if you're coming back to code you wrote some time ago. Or if someone else is reading it.