I suspect by the time you’ve disabled all the compiler “optimisations” that would lie knock a few orders of magnitude of performance off the directly translated algorithm, you may as well have written the assembler to start with…
And you probably can’t fine tune your C/C++ to get this performance without knowing exactly what processor instructions you are trying to trick the compiler into generating anyway.
And you probably can’t fine tune your C/C++ to get this performance without knowing exactly what processor instructions you are trying to trick the compiler into generating anyway.