I’m with you generally, but having written some code targeting these instructions from a disinterested third-party perspective: some instructions differ enough in performance, or even behavior, that you can legitimately be driven to inspect the particular CPU model and not just the feature bits cpuid reports.
Off the top of my head, SSSE3 has a very flexible instruction (pshufb) that permutes the 16 bytes of one xmm register at byte granularity, using each byte of another xmm register to control the permutation. On many chips it’s extremely cheap (e.g. 1 cycle), and its flexibility suggests certain algorithms that completely tank on other machines, e.g. old mobile x86 chips where it runs in microcode and takes dozens or maybe even hundreds of cycles to retire. There the best solution is a short sequence of simpler instructions instead of that single permute, often only two or three depending on what you’re up to. You could certainly just use that replacement sequence everywhere, but if you want the best performance _everywhere_, you need to not only look for the SSSE3 bit but also somehow decide whether that permute is fast, so you can use it when it is.
Much more seriously, Intel’s and AMD’s instructions sometimes behave differently while both staying within specification. The approximate reciprocal and reciprocal square root instructions (rcpps, rsqrtps) are specified loosely enough that they can deliver significantly different results. An algorithm tuned on Intel to function perfectly might see an intermediate value from one of these approximate instructions come out slightly different on AMD, and before you know it you have a number slightly less than zero where you expected zero, a NaN, a square root of a negative number, etc. That sort of slight variation can easily lead to a user-visible bug, a crash, or even an exploitable bug like a buffer under/overflow. Even exhaustively tested code can fail if it runs on a chip that’s not the one you exhaustively tested on. Again, you might simply decide not to use these loosely specified instructions (which I entirely support), but if you’re shooting for the absolute maximum performance, you’ll find yourself tuning the constants of your algorithms up or down a few ulps depending on the particular CPU manufacturer or model.
I’ve even discovered problems when using the high-level C intrinsics that correspond to these instructions across CPUs from the same manufacturer (Intel). AVX-512 provided new versions of these approximations with increased precision, the instruction variants with a “14” in their mnemonic (e.g. vrcp14ps). If you use intrinsics, instruction selection is up to your compiler, and you might find that compiling a piece of code targeting AVX2 picks the old low-precision version, while the compiler helpfully picks the new increased-precision instructions when targeting AVX-512. This leads to the same sorts of problems described in the previous paragraph.
I really wish you could just read cpuid, and for the most part you’re right that it’s the best practice, but for absolutely maximum performance from this sort of code, sometimes you need more information, both for speed and safety. I know this was long-winded, and again, I entirely understand your argument and almost totally agree, but it’s not 100%, more like 100-epsilon%, where that epsilon itself is sadly manufacturer-dependent.
(I have never worked for Intel or AMD. I have been both delighted and disappointed by chips from both of them.)