CPU Cache Essentials

js2 · on June 25, 2014

Seems obligatory to link to "Memory part 2: CPU caches" [1] from Ulrich Drepper's "What every programmer should know about memory" [2].

1. http://lwn.net/Articles/252125/

2. http://lwn.net/Articles/250967/

astigsen · on June 25, 2014

Direct link to the original presentation by Scott Meyers: http://vimeo.com/97337258 (the article is just a summary)

bainsfather · on June 26, 2014

I tried watching the talk, but he goes very slowly - could be compressed 3x easily.

The summary gives you the content of the talk much quicker.

gioele · on June 26, 2014

> use as much of cache line as possible.

> For example in the next code when else clause is happening, we are throwing complete cache line which was fetched by accessing the is_alive member

    struct Obj {
        bool is_alive;
        …
    };

    std::vector<Object> objs;

    for (auto o: objs) {
        if (o.is_alive)
            do_stuff(o);
        else {
            // just thrown a cache line
        }
    }

So, how do you fix that?

My first idea would be to remove the is_alive member from the objects and creating a is_alive array of boolean, one for each object. But this opens a whole new set of problems (how to map from objects to is_alive indexes, static vs dynamic set of object that can be alive, etc.).

ANTSANTS · on June 27, 2014

If you really want to optimize data cache usage, use a bitvector and (either through compiler intrinsics or inline assembly) your CPU's bit-scanning instructions. This can reduce the storage requirement to 1 bit per object and skip up to 32 or 64 not-alive objects in a single instruction.

https://news.ycombinator.com/item?id=6176091

devbug · on June 27, 2014

As Arelius mentioned, it is more performant to just keep an array of alive objects as you're only touching memory that needs to be touched. That said, however, bitvictors/bitsets are incredibly useful if you have no easy or clean way of separating variations of objects.

An example would be "scene-graphs" in my game engine. Ostensibly they are stored as components (but my game engine doesn't have components) which are accessed by an unique identifier via an indirection table. Something like Bitsquid.[0]

Like Bitsquid, my "scene-graphs" can be linked to other "scene-graphs." This allows me to "glue" weapons to character's hands or armour to their torsos. This means I have "scene-graphs" that can be updated (nodes repositioned) in parallel (as they have no data dependencies) and others that cannot (linked "scene-graphs"). For performance reasons, I identify which "scene-graphs" can be updated in parallel, so I can kick them off to appropriate tasks which satisfy their data-dependencies. Thus enters the bitvector/bitset because copying my "scene-graphs" around and then adjusting other "scene-graphs" pointers (that point to the moved "scene-graph") when linking or unlinking results in a complete iteration (once if you batch) over all the "scene-graphs," or some form of indirection (some with memory fragmentation too!). Not to mention such approach complicates the code for handling "scene-graphs" a lot. Beyond that, I have to sort my "scene-graphs" (pointer list) by link depth when updating, so I don't update children before their parents, and the like. It's messy, but bitvectors/bitsets are a great solution.

I am experimenting with other methods to manage linked "scene-graphs" which fall under a split (linked/unlinked) array method, but I don't have anything yet, especially when I consider that I get a free iteration over all scene-graphs anyways (which hides a lot of the cost of the current method).

[0]: http://bitsquid.blogspot.ca/2011/09/managing-decoupling-part...

Edit: And I completely forgot to mention the performance concerns of having multiple cores touch the same memory: irregular stalls!

Arelius · on June 26, 2014

Or even better, how about we just have an array of only alive objects.

phserr · on June 25, 2014

I see cache related optimizations being discussed most often in the context of C or C++. Being inexperienced in this area, I wonder how would does it affect garbage collected languages/systems. I understand that that are different approaches to GC, but can someone recommend some discussion to read on this topic? (cache optimization either in GC development itself or in development with GC languages)

Arelius · on June 26, 2014

The same principals apply, it just becomes more difficult to understand what your language may be doing for you in this regard.

Compound types in most languages will store their members contiguously, at least in cases where easy to do so. And arrays are generally contiguous in addition.

The problems often crop up in not being able to have compound value-types and are instead often included by reference.

Some languages, including C# IIRC, allow you to define specific types to be treated by value (structs in C#) which can often help. It may also be better to have a few arrays with primitive types in them, rather than a single array of compound types, for instance.

voidlogic · on June 25, 2014

Are you concerned about garbage collectors that do relocation/compaction or what? Otherwise garbage collection doesn't really effect much of what is discussed here (except of course if you are preempted by the GC running and it empties the cache of useful things (to you), but this is rare). Maybe when you talk about GC you mean the heap allocator (which of course is related but different)?

phserr · on June 25, 2014

What I meant was just how having a less direct control over memory (like you do with c/c++) affects (edit: in a practical way) the efficiency of cache usage. Having less control means being unable to pack things together as much as one would want, but I'm sure sophisticated GC algorithms (relocating/compacting/generational/incremental) directly or indirectly deal with that somehow. As I said, I'm inexperienced with GC and my understanding of them is probably a bit naive.

sliverstorm · on June 25, 2014

From what I've read, the real worry with (for example) C++ is not whether you can control packing or memory ordering. The real worry is when you create data structures that are fundamentally cache-unfriendly. The compiler won't reachitect an object into other objects to make it more memory friendly.

Simple example from the slides on optimizing for the PlayStation 3; if your object is large and full of different data, you miss a lot and burn lots of cache space every time you access it because it is so large. If your object is small and just a collection of pointers, you burn little cache space when you load it, and data can be arranged more appropriately.

http://research.scee.net/files/presentations/gcapaustralia09...

voidlogic · on June 26, 2014

Just to you are aware, you are presenting a false dichotomy. There are plenty of languages that are both garbage collected and give you quite a bit of control about memory layout (D, Go, Rust, etc).

For example, in Go if I wanted have a slice of Foobar ([]Foobar) but wanted to have them allocated close to one another I could write:

  const neededFoobars = 100
  backingSlice := make([]Foobar, neededFoobars)
  seqfoobarPtrs := make([]*Foobar, neededFoobars)
  for i := 0; i < neededFoobars; i++ {
      seqfoobarPtrs[i] = &backingSlice[i]
  }

dbaupp · on June 26, 2014

(FWIW, Rust is not garbage collected.)

voidlogic · on June 26, 2014

Rust has optional garbage collection

FreeFull · on June 26, 2014

There is a library type that does reference counting. It's not built into the language, beyond happening to be in the standard libraries.

dbaupp · on June 27, 2014

It doesn't. There is a type called Gc that may be garbage collected in future, but at the moment it is just a bad reference counted pointer (the Rc type is better if you want reference counting). And it's not even guaranteed that a GC will ever be implemented, Rust provides other abstractions so GC would be rarely used even if it was implemented.

gamebak · on June 25, 2014

Well when working with entities/objects you are not really getting more efficient at the cache level than an array. As seen in this example with array parsing.

Make sure to use simple arrays when you want to store some data, it's not just in C but it will work well in other languages as well including javascript, php, python, perl, etc

Because that's how processors algorithms are designed.