I don't see how they would work, TBH. I've written a good amount of low-level, accelerated code for deep learning kernels, and I've yet to see a case where sparse is faster than dense. Moreover, based on my knowledge of the low-level details, I don't see how typical "academic" pruning can be made fast unless you're using model-specialized hardware. The way you make things fast on CPU/GPU/TPU is by loading as wide a vector as possible, having as few branches as possible, and helping your prefetcher as much as possible. Sparsity gets in the way of all three, _especially_ on the GPU.
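To make that concrete, here's a minimal sketch (my own toy example, not from any particular library) contrasting a dense matvec with a hand-rolled CSR sparse matvec. The dense path is one contiguous sweep the hardware loves; the CSR path replaces it with gathers (`x[indices[k]]`) and data-dependent inner-loop bounds, which is exactly what kills vectorization, branch prediction, and prefetching:

```python
import numpy as np

def dense_matvec(A, x):
    # Contiguous row sweeps: wide vector loads, fixed trip counts,
    # stride-1 access the prefetcher can follow.
    return A @ x

def to_csr(A):
    # Build toy CSR arrays (values, column indices, row pointers)
    # from a dense matrix, just for this illustration.
    data, indices, indptr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if v != 0.0:
                data.append(v)
                indices.append(j)
        indptr.append(len(data))
    return np.array(data), np.array(indices), np.array(indptr)

def csr_matvec(data, indices, indptr, x):
    # Sparse path: every nonzero costs an indirect load x[indices[k]]
    # (a gather), and each row's trip count indptr[i+1]-indptr[i] is
    # data-dependent -- hostile to SIMD, branch prediction, and
    # prefetching, especially on GPUs where divergent lanes stall.
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, 3.0],
              [4.0, 5.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])
data, indices, indptr = to_csr(A)
print(dense_matvec(A, x))              # [ 7.  9. 14.]
print(csr_matvec(data, indices, indptr, x))  # same result, worse access pattern
```

Both compute the same product; the point is that even at high sparsity, the irregular memory traffic in the second loop often costs more than the multiplies it skips, unless the sparsity is structured (e.g. block or 2:4 patterns with hardware support).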
