If this is the kind of workload you're dealing with, then you really should look at CuPy.
I know you dismissed this because you don't like the NVIDIA dependency, but it's not much different from switching a BLAS library or using Intel's numerically optimised NumPy distribution. Your code stays the same; you just change the import and get magic speed.
If you still refuse to look at it, then perhaps consider cuBLAS[1], which you can swap out for any other BLAS library (e.g. [2]). It's one thing AMD has actually bothered to do, and they have versions available for CPUs[3] and GPUs[4].
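To show what "just change the import" means in practice, here's a minimal sketch. The CuPy API mirrors NumPy's, so the usual pattern is to bind whichever one is available to a common alias; the fallback branch is an assumption for machines without an NVIDIA GPU.

```python
# Sketch of the drop-in swap: use CuPy when it's installed (needs an
# NVIDIA GPU + CUDA), otherwise fall back to plain NumPy.
try:
    import cupy as xp  # NumPy-compatible API, arrays live on the GPU
except ImportError:
    import numpy as xp  # CPU fallback, same call signatures

a = xp.random.rand(1000, 1000)
b = xp.random.rand(1000, 1000)
c = a @ b  # dispatched to cuBLAS when xp is cupy

print(c.shape)
```

The rest of the numerical code is written against `xp` and never needs to know which backend it got.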
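If you go the BLAS-swapping route, it's worth checking which implementation your NumPy build is actually linked against before and after the switch; a quick way to do that:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries NumPy was built against
# (e.g. OpenBLAS, MKL, or AMD's AOCL-BLAS after a swap).
np.show_config()
```

If the output still names the old library after you've installed a different BLAS, NumPy isn't picking it up and you won't see any speedup.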
[1] https://developer.nvidia.com/cublas
[2] https://towardsdatascience.com/is-your-numpy-optimized-for-s...
[3] https://developer.amd.com/amd-aocl/blas-library/
[4] https://rocmdocs.amd.com/en/latest/ROCm_Tools/rocblas.html