That's a valid point, but it doesn't answer the question.
Obviously the memory throughput will not be as high as with matrix calculations, but the algorithm could still be optimized for the GPU. GPUs can do random access and have large, fast memories. What is the difference in memory line size? 64 bytes vs. 128 bytes?
GPUs would still be faster than CPUs. You describe them as high-latency but their memory latency is comparable to CPUs. That's why ethash mining or equihash mining (workloads bottlenecked by short ≤32-byte random memory reads) is still faster on GPUs than on CPUs. Also see https://news.ycombinator.com/item?id=22505029
32-byte accesses are not short. 8-byte accesses (a double-precision float) are shorter, and that's what makes sparse matrix multiplication hard on a GPU.
Also, the SHA256(d?) employed by ethash is actually quite long: 80 cycles at the very least (one cycle per round). In mining you can interleave the computation for one header with the memory loads required by the computation for another header, and from what I know that is what CUDA code on a GPU will do.
The sheer amount of compute power makes ethash mining faster on GPU.
Reads shorter than 64 bytes on a CPU all cost you the same: a 64-byte packet on the memory bus, because that's the atom size of a modern CPU's DDR4 memory controller...
On GPUs the atom size is 32/64 bytes. So GPUs are always better than or equal to CPUs when it comes to small reads/writes.
It's true that the compute side of ethash is not negligible, but to give you one more data point: on equihash even less compute is spent on hashing, and GPUs still dominate CPUs.
A slightly more elaborate answer than the sibling post, to drive home how much happens on a simple read that is not cached:
- request to L1D cache, misses
- request to L2D cache, misses
- request is put on the mesh network to reach L3, likely misses
- L3 requests a load from the memory controller; the load is put in a queue
- DRAM access latency, ~100-150
- above chain in reverse
This is the best case scenario on a miss, because there could be a DTLB miss on the address (which is why huge pages are crucial in the paper), or there could be dirty cache lines somewhere in other cores that trigger the coherency mechanism.
The bottleneck is the DRAM access latency in the middle of that chain.
edit: reading the paper, it's pointers to the data that are stored, so I have to add the following as well: each lookup is two dependent random-access loads.