That's a valid point, but it doesn't answer the question.
Obviously the memory throughput will not be as high as with matrix calculations, but the algorithm could still be optimized for the GPU. GPUs can do random access and have large, fast memories. What is the difference in memory line size? 64 bytes vs. 128 bytes?
GPUs would still be faster than CPUs. You describe them as high-latency but their memory latency is comparable to CPUs. That's why ethash mining or equihash mining (workloads bottlenecked by short ≤32-byte random memory reads) is still faster on GPUs than on CPUs. Also see https://news.ycombinator.com/item?id=22505029
32-byte accesses are not short. 8-byte accesses (a double-precision float) are shorter, and that's what makes sparse matrix multiplication hard on a GPU.
Also, the SHA256(d?) employed by ethash is actually quite long: 80 cycles at the very least (one cycle per round). In mining you can interleave the computation for one header with the memory loads required by the computation for another header, and from what I know that is what CUDA code on a GPU will do.
The sheer amount of compute power makes ethash mining faster on GPU.
Reads shorter than 64 bytes on a CPU all cost you the same: a 64-byte packet on the memory bus, because that's the atom size of a modern CPU's DDR4 memory controller...
On GPUs the atom size is 32/64 bytes. So GPUs are always better than or equal to CPUs when it comes to small reads/writes.
It's true that the compute side of ethash is not negligible, but to give you one more data point: on equihash even less compute is spent on hashing, and GPUs still dominate CPUs.
A slightly more elaborate answer than the sibling post, to drive home how much happens on a simple read that is not cached:
- request to L1D cache, misses
- request to L2D cache, misses
- request is put on the mesh network to reach L3, likely misses
- L3 requests a load from the memory controller; the load is put in a queue
- DRAM access latency, ~100-150
- above chain in reverse
This is the best case scenario on a miss, because there could be a DTLB miss on the address (which is why huge pages are crucial in the paper), or there could be dirty cache lines somewhere in other cores that trigger the coherency mechanism.
The bottleneck is the DRAM access latency in the middle of that chain.
edit: reading the paper, it's pointers to the data that are stored, so I have to add the following as well: each lookup is two dependent random-access loads.