I have no expertise in this area, but I think the presentation explains the issue quite well. The workloads they consider involve a lot of indirection and have poor memory locality, so many memory operations end up waiting on cache misses, but the same work has to be done many times independently on different data.
Normal multithreading is not a solution because it is not granular enough: you cannot context switch in the middle of each memory access to run a different thread while waiting for the access to complete. You could manually unroll loops and, instead of following only one path through the graph, follow several paths at the same time by interleaving their operations. This hides the waiting times, because you can keep working on other paths, but it makes your code more complex and increases register pressure: you still have the same number of registers but are now operating on multiple independent paths. Letting the hardware interleave many parallel paths keeps your code simpler, and since each hardware thread has its own registers, it does not increase register pressure either.
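To make the interleaving idea concrete, here is a minimal sketch with a made-up Node type (the next/value fields and the two-chain setup are my own assumptions, not anything from the talk): the hand-interleaved version can overlap the cache misses of independent chains, at the price of more live state and messier code.

    // Minimal sketch of manual interleaving; Node, next, value are hypothetical.
    #include <cstdint>

    struct Node {
        Node*    next;
        uint64_t value;
    };

    // One path at a time: every iteration stalls on the dependent load of n.
    uint64_t sum_single(Node* n) {
        uint64_t sum = 0;
        while (n) {
            sum += n->value;
            n = n->next;   // next iteration cannot start before this miss resolves
        }
        return sum;
    }

    // Two independent paths interleaved by hand: while one chain's load is in
    // flight the other chain can make progress, but twice as much state is live.
    uint64_t sum_two_interleaved(Node* a, Node* b) {
        uint64_t sum_a = 0, sum_b = 0;
        while (a && b) {
            sum_a += a->value;
            sum_b += b->value;
            a = a->next;   // the two chains miss the cache independently
            b = b->next;
        }
        while (a) { sum_a += a->value; a = a->next; }
        while (b) { sum_b += b->value; b = b->next; }
        return sum_a + sum_b;
    }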
The solution is SIMD with a queue of 'to be processed' nodes, and a SIMD program that grabs 32 nodes at once from the queue, processes them all in parallel, and adds any new nodes to process to the end of the queue. There is then plenty of parallelism, great branch predictor behaviour, and the length of the queue ensures that blocking on memory reads pretty much never happens.
Downside: your business logic for what to do with each node needs to be written in SIMD intrinsics. But that still has to be cheaper than asking Intel to design a new CPU for you.
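For what it's worth, the pattern could look roughly like this. It's only a sketch under my own assumptions (8-wide AVX2 instead of 32 lanes, a structure-of-arrays graph with no shared children, "double the value" standing in for real business logic, compile with -mavx2), not the code from the presentation:

    #include <immintrin.h>
    #include <vector>
    #include <cstdint>

    struct Graph {
        std::vector<int32_t> node_value;                  // one entry per node
        std::vector<std::vector<int32_t>> node_children;  // adjacency lists
    };

    // Assumes a tree/DAG without shared children; a real traversal
    // would also need a visited set.
    int64_t process_worklist(const Graph& g, std::vector<int32_t> work)
    {
        int64_t total = 0;
        size_t head = 0;

        // Process the queue in batches of 8 node indices.
        while (work.size() - head >= 8) {
            __m256i idx = _mm256_loadu_si256(
                reinterpret_cast<const __m256i*>(&work[head]));

            // One gather fetches the value field of all 8 nodes;
            // each lane may miss the cache independently.
            __m256i vals = _mm256_i32gather_epi32(g.node_value.data(), idx, 4);

            // Stand-in for real per-node work: double and accumulate.
            __m256i doubled = _mm256_add_epi32(vals, vals);
            alignas(32) int32_t tmp[8];
            _mm256_store_si256(reinterpret_cast<__m256i*>(tmp), doubled);

            for (int lane = 0; lane < 8; ++lane) {
                total += tmp[lane];
                // Append newly discovered nodes to the back of the queue.
                for (int32_t child : g.node_children[work[head + lane]])
                    work.push_back(child);
            }
            head += 8;
        }

        // Once we drop below a full batch, just finish scalar for brevity.
        for (; head < work.size(); ++head) {
            total += 2 * static_cast<int64_t>(g.node_value[work[head]]);
            for (int32_t child : g.node_children[work[head]])
                work.push_back(child);
        }
        return total;
    }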
Does this actually help? I don't know much about microarchitectural details, so I may be completely wrong. Let's take a simple example: I have some nodes and want to load a field and then do some calculations with the value. I load the pointers of 32 nodes into SIMD registers and issue gather instructions, and I get 32 cache misses. Now it seems to me there is nothing I can do but wait. I cannot put this work item back into the queue in the middle of a memory access, and even if I could, the overhead of pulling items out of queues and putting them back every other instruction or so seems completely prohibitive.
But with hundreds of hardware threads I could just continue executing some other thread that has already had its pointer dereferenced. SIMD is great when you can issue loads and they are all satisfied from the same cache line, but if you have to gather data, the values arrive one after another and you cannot do any dependent work until the last one has arrived. I guess the whole point is that the threads can all be at slightly different execution points: while some are waiting for reads to complete, others whose reads have already completed can continue working without having to wait for an entire group of reads to finish.
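Concretely, the kind of stall I have in mind looks like this (a minimal AVX2 sketch, 8 lanes instead of 32, indices instead of raw pointers; node_field and idx are placeholders):

    #include <immintrin.h>
    #include <cstdint>

    // One gather, then a dependent use: the add cannot issue until the
    // slowest of the 8 gathered lanes has arrived from memory.
    int32_t gather_then_use(const int32_t* node_field, __m256i idx)
    {
        // 8 potentially independent cache misses issued by one instruction...
        __m256i vals = _mm256_i32gather_epi32(node_field, idx, 4);

        // ...but this dependent add waits for the *last* lane; lanes that hit
        // in cache still sit idle until then.
        __m256i doubled = _mm256_add_epi32(vals, vals);

        // Horizontal sum just so the result is used.
        alignas(32) int32_t tmp[8];
        _mm256_store_si256(reinterpret_cast<__m256i*>(tmp), doubled);
        int32_t sum = 0;
        for (int i = 0; i < 8; ++i) sum += tmp[i];
        return sum;
    }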
This is also why hyperthreading yields a benefit for sparse/graph workloads. I was getting good results on the KNL Xeon Phi with its four hyperthreads per core.
But you easily hit the 'outstanding loads' limit if your hyperthreads all issue gather instructions every 4-8 instructions. The architecture needs to be balanced around that.
Those nodes in the queue have to come from somewhere and that is where the inefficiency lies.
If you use SIMD gather instructions to fetch all the children of a node, that is still going to be very inefficient on e.g. a Xeon. Each of those gathered loads is likely to hit a different cache line, which is accessed only once, so 7/8ths of the memory bandwidth spent pulling it in is wasted. SIMD is also not going to help if you are memory bandwidth or latency bound, and we're talking ~0.1 ops/byte here.
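To put rough numbers on that (all sizes assumed, back-of-the-envelope only):

    // Back-of-the-envelope numbers behind the claims above; all sizes assumed.
    constexpr int    cache_line_bytes = 64;
    constexpr int    useful_bytes     = 8;   // one 64-bit field per gathered lane
    constexpr double wasted_fraction  =      // 56 of 64 bytes never touched
        1.0 - double(useful_bytes) / cache_line_bytes;        // = 7/8

    constexpr int    ops_per_node     = 5;   // a handful of ALU ops per node
    constexpr double ops_per_byte     =      // arithmetic intensity
        double(ops_per_node) / cache_line_bytes;              // ~0.08 ops/byte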