I pretty much share your sentiment on this subject. My conclusion was to play with specialized JITs, although so far my attempts have been more like code generators that just concatenate instructions and take care of loop target alignment and so on, emitting SSE2, SSE4.2 or AVX2 code depending on the CPU detected at runtime. The performance achieved has been very good. There seems to be huge potential; it's a shame I have so little time to work on this.
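To make the "depending on runtime CPU" part concrete, here is a minimal sketch of the selection step only, not of the code generator itself. It assumes Linux, where `/proc/cpuinfo` lists feature flags such as `sse2`, `sse4_2` and `avx2`; the helper names `read_cpu_flags` and `pick_isa` are made up for illustration.

```python
# Preference order: widest instruction set first, plain scalar code as fallback.
PREFERENCE = [("avx2", "AVX2"), ("sse4_2", "SSE4.2"), ("sse2", "SSE2")]


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags as a set (empty if they cannot be read)."""
    try:
        with open(path) as f:
            for line in f:
                # x86 Linux reports a line like "flags : fpu vme ... avx2 ..."
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return set()


def pick_isa(flags):
    """Pick the best instruction set the generator should emit for these flags."""
    for flag, name in PREFERENCE:
        if flag in flags:
            return name
    return "scalar"


if __name__ == "__main__":
    print(pick_isa(read_cpu_flags()))
```

A real generator would more likely query CPUID directly, but the decision logic would look much the same.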
Large pages (2 MB+) can sometimes contribute a nice amount of extra performance, depending of course on access patterns. They can also have a negative effect under some circumstances, such as some very random access patterns. Gigabyte pages could help there, but support isn't great.
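One low-effort way to try 2 MB pages on Linux is to ask for transparent huge pages with `madvise`. The sketch below assumes Linux and Python 3.8+ (where `mmap.madvise` and `mmap.MADV_HUGEPAGE` are available); `alloc_with_thp_hint` is a hypothetical helper name. The hint is only a request: the kernel may ignore it, and the mapping is not guaranteed to start on a 2 MB boundary.

```python
import mmap

HUGE_PAGE = 2 * 1024 * 1024  # 2 MiB, the common x86-64 large page size


def alloc_with_thp_hint(size):
    """Map anonymous memory and ask the kernel to back it with huge pages.

    Returns (buffer, hinted); hinted is False if the hint could not be given.
    """
    # Round the length up to a whole number of 2 MiB pages.
    length = -(-size // HUGE_PAGE) * HUGE_PAGE
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

    hinted = False
    advice = getattr(mmap, "MADV_HUGEPAGE", None)
    if advice is not None and hasattr(buf, "madvise"):
        try:
            buf.madvise(advice)  # a request only; the kernel decides
            hinted = True
        except OSError:
            pass  # e.g. transparent huge pages disabled on this system
    return buf, hinted


if __name__ == "__main__":
    buf, hinted = alloc_with_thp_hint(10 * 1024 * 1024)
    print(len(buf), hinted)
    buf.close()
```

Whether this helps or hurts still has to be measured for the access pattern in question.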
Another thing I've investigated is memory channel interleaving. Local memory seems to be mostly interleaved round robin across channels in 64-byte units, but I guess it can be more complicated too. NUMA systems seem to either interleave round robin in 4096-byte units across NUMA regions or map all of each CPU's local memory as one multi-gigabyte (?) chunk. Understanding memory interleaving can help balance the work between different memory channels.
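As a rough illustration of why this matters, here is a toy model of the simple case, assuming plain 64-byte round robin over physical addresses. Real memory controllers may hash higher address bits, so treat it as a sketch; `channel_of` and `bytes_per_channel` are made-up names.

```python
GRANULE = 64  # assumed interleave unit in bytes (the simple case described above)


def channel_of(addr, n_channels, granule=GRANULE):
    """Channel serving a physical address under plain round-robin interleaving."""
    return (addr // granule) % n_channels


def bytes_per_channel(start, length, n_channels, granule=GRANULE):
    """Count how many bytes of [start, start + length) land on each channel."""
    counts = [0] * n_channels
    addr, end = start, start + length
    while addr < end:
        # Advance to the next interleave boundary or the end of the range.
        next_boundary = (addr // granule + 1) * granule
        chunk = min(end, next_boundary) - addr
        counts[channel_of(addr, n_channels, granule)] += chunk
        addr += chunk
    return counts


if __name__ == "__main__":
    # A sequential scan spreads evenly over two channels...
    print(bytes_per_channel(0, 4096, 2))
    # ...but a 128-byte stride keeps hitting the same channel.
    print({channel_of(i * 128, 2) for i in range(32)})
```

In this model a sequential scan is balanced, while a stride equal to the granule times the channel count sends every access to one channel.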