For most HPC workloads, you will not be able to maximize parallelism and throughput without intimate knowledge of the hardware architecture and its behavior. As a general principle, you want the topology of the software to match the topology of the hardware as closely as possible for optimal scaling. Efficient HPC software is strongly shaped by the nature of the hardware it runs on.
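To make "matching topology" concrete, here is a minimal Linux-specific sketch of one common instance of the idea: spawn one worker thread per core and pin each thread to its core, so the thread layout mirrors the core layout. This is an illustrative assumption on my part, not a recipe from any particular platform; it relies on glibc's pthread_setaffinity_np and contiguous core numbering, and a production code would discover NUMA nodes, caches, and SMT siblings with something like hwloc instead of hardcoding this.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

/* Worker payload: in a real code this would operate on a partition
 * of the data allocated on the same NUMA node as this core. */
static void *worker(void *arg) {
    long core = (long)arg;
    /* Pin this thread to its core so the OS scheduler cannot
     * migrate it away from its cache-warm data. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* ... do core-local work here ... */
    return NULL;
}

int main(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t *threads = malloc(ncores * sizeof(*threads));
    /* One pinned thread per core: the software topology (threads)
     * mirrors the hardware topology (cores). */
    for (long i = 0; i < ncores; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncores; i++)
        pthread_join(threads[i], NULL);
    free(threads);
    return 0;
}
```

The same principle extends upward: ranks mapped to sockets and nodes, communication patterns mapped to the interconnect fabric.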
When I wrote code for new HPC hardware, people were always surprised when I asked for the system hardware and architecture docs instead of the programming docs. But if you understood the hardware design, the correct way to design software for it became obvious from first principles. The programming docs typically contained quite a few half-truths intended to make things seem easier for developers than they actually were. In fact, some HPC platforms failed in large part because, in order to appear "easy to use", they consistently misrepresented what was required of developers to achieve maximum performance, and then failed to deliver the performance the silicon was capable of if you actually wrote software the way the marketing implied would be effective.
You can write HPC code on top of abstractions, and many people do, but the performance and scaling losses are often an unavoidable integer factor. As with most software, this is considered an acceptable loss in many cases if it allows less capable developers to write the code. HPC is like any other type of software in that most developers who notionally specialize in it struggle to produce consistently good results. Much of the expensive hardware used in HPC is there to mitigate the performance losses of weaker software designs.
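For a toy example of the kind of integer-factor loss I mean (my own illustration, not tied to any particular platform or abstraction layer): summing the same matrix with and against the hardware's cache-line layout does identical arithmetic, yet the strided version is typically several times slower because nearly every load misses cache. An abstraction that hides memory layout can silently put you on the slow side of this divide.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

/* Traverse the matrix row by row: accesses are contiguous, so each
 * cache line fetched is fully used before the next one is needed. */
static double sum_rows(const double *m) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i * N + j];
    return s;
}

/* Traverse column by column: each access strides N doubles, so the
 * hardware fetches a whole cache line to use a single value. */
static double sum_cols(const double *m) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i * N + j];
    return s;
}

int main(void) {
    double *m = malloc((size_t)N * N * sizeof(double));
    for (size_t i = 0; i < (size_t)N * N; i++) m[i] = 1.0;

    clock_t t0 = clock();
    double a = sum_rows(m);
    clock_t t1 = clock();
    double b = sum_cols(m);
    clock_t t2 = clock();

    /* Print the sums so the compiler cannot discard the loops. */
    printf("row-major: %.2fs  col-major: %.2fs  (sums: %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, a, b);
    free(m);
    return 0;
}
```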
In HPC there are no shortcuts to actually understanding how the hardware works if you want maximum performance. That is no different from regular software; in HPC the hardware systems are just bigger and more complex.