MPI and OpenMP are the primary abstractions from the hardware in HPC: MPI abstracts distributed-memory parallelism and OpenMP abstracts shared-memory parallelism. Many researchers write their codes purely in terms of those two, often combining both in the same code. When you do, you really do not need to worry about the architectural details most of the time.
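To give a flavor of what that looks like, here is a minimal hybrid MPI + OpenMP sketch in C (the problem size and the cyclic partitioning are purely illustrative): OpenMP parallelizes the loop within each rank, and MPI combines the per-rank partial sums.

    /* Minimal hybrid MPI + OpenMP sketch: each rank sums part of a series
       with its own OpenMP threads, then MPI combines the per-rank results.
       Compile with something like: mpicc -fopenmp hybrid.c */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000   /* illustrative problem size */

    int main(int argc, char **argv) {
        int provided, rank, size;
        /* Request funneled threading since only the main thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Shared-memory parallelism: OpenMP threads split this rank's work. */
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < N; i += size)
            local += 1.0 / (1.0 + i);

        /* Distributed-memory parallelism: combine results across ranks. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }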
Still, some researchers who like to squeeze out more performance do fiddle with a lot of small architectural details. For example, loop unrolling is pretty common and can get quite confusing in my opinion. I vaguely recall something about helping vectorization by preferring addition over multiplication on a particular CPU architecture, but I do not think I've seen that in practice.
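For reference, hand unrolling looks roughly like this (a sketch that assumes n is a multiple of 4; a real version would handle the remainder):

    /* Plain loop: one multiply per iteration, plus loop overhead each time. */
    void scale(double *x, double a, int n) {
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }

    /* Hand-unrolled by 4: fewer branches and more independent work per
       iteration for the CPU (or the compiler's vectorizer) to overlap.
       Assumes n is a multiple of 4 for brevity. */
    void scale_unrolled(double *x, double a, int n) {
        for (int i = 0; i < n; i += 4) {
            x[i]     *= a;
            x[i + 1] *= a;
            x[i + 2] *= a;
            x[i + 3] *= a;
        }
    }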
Preventing cache misses is another major one, where some codes are written so that the most frequently needed data stays in the CPU's cache rather than having to be fetched from main memory. Most codes only handle this by looping in column-major order over arrays in Fortran or row-major order in C, but the concept can be taken further. If you know the cache sizes of your processors, you could hypothetically restructure some operations so that all of the needed data fits inside the cache, minimizing cache misses. I've never seen this in practice, but it was actively discussed in the scientific computing course I took in 2013.
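Roughly, both ideas look like this in C (the sizes below are arbitrary, chosen only so the tiles divide evenly; the second function is the technique usually called cache blocking or tiling):

    #define N 1024     /* illustrative matrix size */
    #define BLOCK 64   /* tile size; would be tuned to the cache in practice */

    /* C is row-major, so the contiguous index j goes in the innermost loop. */
    void scale_matrix(double a[N][N], double s) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] *= s;
    }

    /* Blocked (tiled) transpose: each BLOCK x BLOCK tile of a and b stays
       in cache while it is being touched, instead of thrashing on the
       strided accesses of a naive transpose. */
    void transpose_blocked(double a[N][N], double b[N][N]) {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++)
                        b[j][i] = a[i][j];
    }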
Whether GPUs are worth using depends heavily on the problem being solved, with some problems mapping onto GPUs very well and others being too difficult. I'm not too knowledgeable about that, unfortunately.
Of course, not every problem can be reduced to BLAS calls, but if you are doing dense linear algebra, a tuned BLAS implementation should handle most of the cache concerns for you.
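For example, a single call into a tuned BLAS (OpenBLAS, MKL, and so on) does the blocking internally; a rough sketch through the CBLAS interface, assuming square row-major matrices:

    #include <cblas.h>   /* link against a tuned BLAS, e.g. -lopenblas */

    /* C = A * B for row-major n x n matrices; the library handles the
       cache blocking and vectorization internally. */
    void matmul(int n, const double *A, const double *B, double *C) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n,
                    B, n,
                    0.0, C, n);
    }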
I'm not sure how much multiplication vs. addition matters on a modern chip. You can have a bazillion instructions in flight, after all, as long as they don't have any dependencies, so I'd go with whichever option shortens the data dependencies on the critical path. The computer will figure out where to park the longer-latency instructions if it needs to.
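A toy sketch of what I mean (both functions compute the same sum, modulo floating-point reassociation, and n is assumed to be a multiple of 4):

    /* One long dependency chain: every add must wait for the previous one. */
    double sum_serial(const double *x, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Four independent accumulators: adds from different chains can be in
       flight at the same time, shortening the critical path. */
    double sum_split(const double *x, int n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }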
You're right that the addition vs. multiplication issue likely does not matter on a modern chip. I just gave the example because it shows how the CPU architecture can affect how you write the code. I do not recall precisely when or where I heard the idea, but it was about a decade ago --- ages ago by computing standards.