MPI and OpenMP are the primary abstractions from the hardware in HPC: MPI abstracts distributed-memory parallelism and OpenMP abstracts shared-memory parallelism. Many researchers write their codes purely in terms of those two, often combining both in the same code. When you do, you really do not need to worry about the architectural details most of the time.
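To give a flavor of what that looks like, here is a minimal hybrid MPI + OpenMP sketch in C (the problem size and the cyclic partitioning are purely illustrative): OpenMP parallelizes the loop within each rank, and MPI combines the per-rank partial sums.

    /* Minimal hybrid MPI + OpenMP sketch: each rank sums part of a series
       with its own OpenMP threads, then MPI combines the per-rank results.
       Compile with something like: mpicc -fopenmp hybrid.c */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000   /* illustrative problem size */

    int main(int argc, char **argv) {
        int provided, rank, size;
        /* Request funneled threading since only the main thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Shared-memory parallelism: OpenMP threads split this rank's work. */
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < N; i += size)
            local += 1.0 / (1.0 + i);

        /* Distributed-memory parallelism: combine results across ranks. */
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }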
Still, some researchers who like to squeeze out more performance do fiddle with a lot of small architectural details. For example, loop unrolling is pretty common and can get quite confusing in my opinion. I vaguely recall something about helping vectorization by preferring addition over multiplication on a particular CPU architecture, but I do not think I've seen that in practice.
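For reference, hand unrolling looks roughly like this (a sketch that assumes n is a multiple of 4; a real version would handle the remainder):

    /* Plain loop: one multiply per iteration, plus loop overhead each time. */
    void scale(double *x, double a, int n) {
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }

    /* Hand-unrolled by 4: fewer branches and more independent work per
       iteration for the CPU (or the compiler's vectorizer) to overlap.
       Assumes n is a multiple of 4 for brevity. */
    void scale_unrolled(double *x, double a, int n) {
        for (int i = 0; i < n; i += 4) {
            x[i]     *= a;
            x[i + 1] *= a;
            x[i + 2] *= a;
            x[i + 3] *= a;
        }
    }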
Preventing cache misses is another major one, where some codes are written so that the most frequently needed data stays in the CPU's cache rather than having to be fetched from main memory. Most codes only handle this by looping in column-major order over arrays in Fortran or row-major order in C, but the concept can be taken further. If you know the cache sizes of your processors, you could hypothetically restructure some operations so that all of the needed data fits inside the cache, minimizing cache misses. I've never seen this in practice, but it was actively discussed in the scientific computing course I took in 2013.
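Roughly, both ideas look like this in C (the sizes below are arbitrary, chosen only so the tiles divide evenly; the second function is the technique usually called cache blocking or tiling):

    #define N 1024     /* illustrative matrix size */
    #define BLOCK 64   /* tile size; would be tuned to the cache in practice */

    /* C is row-major, so the contiguous index j goes in the innermost loop. */
    void scale_matrix(double a[N][N], double s) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] *= s;
    }

    /* Blocked (tiled) transpose: each BLOCK x BLOCK tile of a and b stays
       in cache while it is being touched, instead of thrashing on the
       strided accesses of a naive transpose. */
    void transpose_blocked(double a[N][N], double b[N][N]) {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++)
                        b[j][i] = a[i][j];
    }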
Whether GPUs are worth using depends heavily on the problem being solved, with some problems mapping onto GPUs very well and others being too difficult. I'm not too knowledgeable about that, unfortunately.
Of course, not every problem can be reduced to BLAS calls, but if you are doing dense linear algebra, a tuned BLAS implementation should handle most of the cache concerns for you.
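For example, a single call into a tuned BLAS (OpenBLAS, MKL, and so on) does the blocking internally; a rough sketch through the CBLAS interface, assuming square row-major matrices:

    #include <cblas.h>   /* link against a tuned BLAS, e.g. -lopenblas */

    /* C = A * B for row-major n x n matrices; the library handles the
       cache blocking and vectorization internally. */
    void matmul(int n, const double *A, const double *B, double *C) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n,
                    B, n,
                    0.0, C, n);
    }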
I'm not sure how much multiplication vs. addition matters on a modern chip. You can have a bazillion instructions in flight, after all, as long as they don't have any dependencies, so I'd go with whichever option shortens the data dependencies on the critical path. The computer will figure out where to park the longer-latency instructions if it needs to.
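A toy sketch of what I mean (both functions compute the same sum, modulo floating-point reassociation, and n is assumed to be a multiple of 4):

    /* One long dependency chain: every add must wait for the previous one. */
    double sum_serial(const double *x, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Four independent accumulators: adds from different chains can be in
       flight at the same time, shortening the critical path. */
    double sum_split(const double *x, int n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }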
You're right that the addition vs. multiplication issue likely does not matter on a modern chip. I just gave the example because it shows how the CPU architecture can affect how you write the code. I do not recall precisely when or where I heard the idea, but it was about a decade ago --- ages ago by computing standards.