Optimizing code for instruction cache (eetimes.com)
34 points by smcl on Aug 7, 2010 | 5 comments


If you don't know some basics, you can get in trouble.

But if you do and if you get particularly clever, you can get in trouble on the next generation of whatever you're designing for.

TANSTAAFL.

With some of that C code shown, I'd try a "static inline" on the declaration, and see how well the C compiler and its code generator dealt with it. (If I can get out of dealing with it and cede the work to the compiler, all the better...)
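
Something along these lines; the saturating add below is just a made-up stand-in, not code from the article:

    #include <stdint.h>

    /* Hypothetical hot helper from an inner loop. Marking it "static inline"
     * lets the compiler inline it at each call site (no call/return overhead,
     * better layout of the hot path) or keep it as a local function if
     * inlining isn't worthwhile. */
    static inline int32_t sat_add(int32_t a, int32_t b)
    {
        int64_t sum = (int64_t)a + b;
        if (sum > INT32_MAX) return INT32_MAX;
        if (sum < INT32_MIN) return INT32_MIN;
        return (int32_t)sum;
    }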

And the cache locality discussions are the same ones that arise with virtual memory out in main memory; if you bounce around all over the place in address space (e.g., setting up the wrong stride on an array, or partitioning your data differently from your access pattern), you'll incur page faults.
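
A toy example of the stride point, assuming an ordinary row-major C array (the sizes here are arbitrary):

    #include <stddef.h>

    #define ROWS 1024
    #define COLS 1024

    static double grid[ROWS][COLS];   /* C arrays are row-major */

    /* Good locality: the inner loop walks memory in the order it is laid out. */
    double sum_row_major(void)
    {
        double s = 0.0;
        for (size_t r = 0; r < ROWS; r++)
            for (size_t c = 0; c < COLS; c++)
                s += grid[r][c];
        return s;
    }

    /* Poor locality: the inner loop strides COLS * sizeof(double) bytes per
     * access, touching a new cache line (and, for large arrays, potentially a
     * new page) nearly every time. */
    double sum_col_major(void)
    {
        double s = 0.0;
        for (size_t c = 0; c < COLS; c++)
            for (size_t r = 0; r < ROWS; r++)
                s += grid[r][c];
        return s;
    }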

And if you get particularly good at packing your data, you can sometimes trigger word tearing when you're accessing adjacent data within the same granule from different processes or different threads.

Caches and cache designs, too, are individually funky. There are processors that got faster by going to smaller and fewer caches with better (lower) latency. So if you fell out of cache on the new design, you'd assume you'd do worse, but with the lower latency out to memory, you actually did better pulling from main memory than you did pulling from the second-level cache on the previous generation.


Actually, in just about anything non-JITted (without faking it by hinting the compiler), I see no reason why (basic) locality optimizations would change with different architectures. A sufficiently smart branch predictor deciding what to load into caches could take care of such a thing, but that can't really be relied on (too hard to predict), and it would likely benefit from such hints as well.
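
For what it's worth, a rough sketch of the kind of hinting I mean, assuming GCC/Clang extensions; the error-handling function is invented purely for illustration:

    /* GCC/Clang extensions, not portable C. The hints tell the compiler which
     * paths are hot and cold so it can lay the code out with the common path
     * dense in the instruction cache. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    __attribute__((cold, noinline))
    static void report_error(int code)     /* invented example, rarely executed */
    {
        (void)code;                         /* a real handler would log, etc. */
    }

    int process(int value)
    {
        if (unlikely(value < 0)) {          /* hinted as the rare branch */
            report_error(value);
            return -1;
        }
        return value * 2;                   /* common path stays compact */
    }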

This is all taking into account that there is indeed too much of a "good" thing. More extreme methods of optimization can definitely shoot you in the foot, and not just while you're writing them.


The word-tearing case had unrelated data values packed within the same granule of cache storage, and that derailed the running environment in a very subtle way. With just the right (wrong) timing, you very occasionally saw slightly different values in the adjacent variable within the granule when (apparently unrelated) threads were spun up and their updates got tangled and torn. No shared references. Just sharing that granule.
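
Not the original code, obviously, but a minimal sketch of the shape of the problem; here the packing into one granule is simulated with adjacent bit-fields, and the threads and loop counts are made up:

    #include <pthread.h>
    #include <stdio.h>

    /* Two logically unrelated fields packed into one granule. Adjacent
     * bit-fields share a memory location, so updating one is really a
     * read-modify-write of storage that also holds the other. */
    struct packed_flags {
        unsigned a : 4;
        unsigned b : 4;
    };

    static struct packed_flags flags;

    static void *writer_a(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            flags.a = i & 0xF;    /* can clobber a concurrent update to b */
        return NULL;
    }

    static void *writer_b(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            flags.b = i & 0xF;
        return NULL;
    }

    int main(void)
    {
        pthread_t ta, tb;
        pthread_create(&ta, NULL, writer_a, NULL);
        pthread_create(&tb, NULL, writer_b, NULL);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        /* Occasionally one writer's last value is lost or mixed. Giving each
         * field its own word (or its own cache line) removes the tearing. */
        printf("a=%u b=%u\n", (unsigned)flags.a, (unsigned)flags.b);
        return 0;
    }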


It's worth mentioning that they're writing for DSP systems. More mainstream CPUs have a larger instruction cache (and smarter eviction policies), but the hot code paths are (I imagine; I'm not terribly familiar with DSP) likely not much bigger, so the problems they mention need solving less frequently in typical applications.


> It's worth mentioning that they're writing for DSP systems.

Yes.

> More mainstream CPUs have a larger instruction cache (and smarter eviction policies), but the hot code paths are (I imagine; I'm not terribly familiar with DSP) likely not much bigger, so the problems they mention need solving less frequently in typical applications.

I'd have assumed the exact opposite. I assume that the size of the "hot enough" spots grows at least somewhat with the size of the binary. (It probably grows much more slowly than linearly.) DSP binaries are smaller, so I'd assume that their hot spots are smaller even given "equivalent" programs.

But, DSP programs are different. Their data is more "regular" and their instruction sets are specialized for their typical uses. I'd expect better instruction cache behavior from DSP programs.

Fun fact: access times and feature sizes are related, so if you maximize clock rates, the maximum size of a one-cycle cache is basically independent of the access time. I suspect that almost all implementations can afford such a cache.

I suspect that the same rule applies to L2 caches, but it's not clear that every implementation can afford the area. However, I suspect that implementations that have an L3 have maximized their L2.

DSP implementations are more likely to have traded cache size for dollars, so I expect that their caches are somewhat less effective on the same programs.

These two factors have opposing effects.



