What kinds of challenges are there to writing things efficiently?
From skimming Wikipedia, it looks like a big challenge is cache pollution. Is it possible that the hit to cache locality is what inhibits uptake? After all, most threads in the OS sit idle doing nothing, which means you’re penalized for any “hot code” that’s largely serial (i.e., you typically have a small number of “hot” applications unless your problem is embarrassingly parallel).
> What kinds of challenges are there to writing things efficiently?
The challenge is mostly that you have to create enough fine-grained parallelism, and that per-thread performance is relatively low. Amdahl's law is in full effect here: a sequential part is going to bite you hard. That's why each die on this chip has two sequential-performance cores.
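To make the Amdahl's law point concrete, here's a quick back-of-the-envelope in Python (the serial fractions and thread counts are illustrative, not measurements of this chip):

```python
def amdahl_speedup(serial_fraction: float, n_threads: int) -> float:
    """Maximum speedup with n_threads when serial_fraction of the work
    cannot be parallelized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# Even 5% serial work caps a 64-thread core hard:
print(amdahl_speedup(0.05, 64))    # ~15.4x, far short of 64x
print(amdahl_speedup(0.05, 1000))  # ~19.6x, asymptote is 1/0.05 = 20x
print(amdahl_speedup(0.0, 64))     # 64.0x only when nothing is serial
```

With low per-thread performance, the serial part runs on a slow thread too, which is exactly why parking it on a fast sequential core matters.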
The graph problems this processor is designed to handle have plenty of parallelism, and most of the time those threads will be waiting on (uncached) 8-byte DRAM accesses.
> From skimming Wikipedia, it looks like a big challenge is cache pollution
This processor has tiny caches, and the programmer decides which accesses are cached. In practice, you cache the thread's stack and do all the large graph accesses uncached, letting the barrel processor hide the latency. There are very fast scratchpads on this thing for when you do need to exploit locality.
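The latency-hiding arithmetic behind a barrel processor is simple enough to sketch; a rough first-order model (the cycle counts below are illustrative, not this chip's actual specs):

```python
import math

def threads_to_hide_latency(mem_latency_cycles: int, compute_cycles: int) -> int:
    """Minimum hardware threads a barrel processor needs so the pipeline
    never stalls: while one thread waits on an uncached DRAM access, the
    remaining threads must supply enough work to cover the wait, i.e.
    (T - 1) * compute_cycles >= mem_latency_cycles."""
    return math.ceil(mem_latency_cycles / compute_cycles) + 1

# e.g. ~400-cycle uncached DRAM access, ~8 cycles of work per access:
print(threads_to_hide_latency(400, 8))  # 51 threads keep the pipe full
```

This is why the design trades single-thread speed for a large number of hardware threads: as long as there are more ready threads than the memory latency demands, the DRAM wait is effectively free.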
In some respects, yes. Arguably, it's much harder to max out GPUs, because they have the added difficulty of scheduling and executing threads in blocks and warps. If not all N threads in a warp have their input data ready, or if they branch differently, some execution units sit idle.
It is not hard at all to fully utilize a GPU with a problem that maps well onto that type of architecture. It's impossible to fully utilize a GPU with a problem that does not.
A barrel processor would make branching more efficient than on a GPU, at the cost of throughput. The set of problems that are economically interesting and would strongly profit from that is rather small, hence these processors remain niche. Conversely, the incentive to shape problems so they map well onto a GPU is higher, because GPUs are cheap and ubiquitous.
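To put a number on the divergence penalty being traded off here, a toy model of SIMT lane utilization (it assumes the usual first-order behavior: divergent paths within a warp execute one after another, and all paths cost the same):

```python
WARP_SIZE = 32  # lanes per warp; 32 on NVIDIA hardware

def simt_utilization(path_counts: list[int]) -> float:
    """Fraction of SIMD lanes doing useful work when one warp's threads
    diverge across branch paths that must execute serially.
    path_counts: number of threads taking each distinct path."""
    k = len(path_counts)        # serialized passes over the warp
    useful = sum(path_counts)   # lane-slots with real work across all passes
    return useful / (WARP_SIZE * k)

print(simt_utilization([32]))           # 1.0  -> no divergence
print(simt_utilization([16, 16]))       # 0.5  -> two-way split halves throughput
print(simt_utilization([8, 8, 8, 8]))   # 0.25 -> four-way split
```

A barrel processor sidesteps this loss entirely, since each of its threads issues independently; the model just shows why that only pays off when your workload actually diverges this badly.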