We did something slightly similar. For the very few isolated things where it made sense (e.g. image upload/download and conversions in the GPU driver that weren't supported, or weren't large enough to be worth firing off a GPU job for), the code was initially written in C, using compiler annotations to specify things like alignment or permitted pointer aliasing so that the compiler would generate the code we wanted. GCC and Clang both support vector extensions that allow somewhat portable implementations of things like scatter-gather, shuffling, or masking elements within a single register. Those are hard to express in "plain" C in a way that is both readable for humans and reliably generates the expected code across compiler versions.
But because we needed to support other compilers and platforms, we actually ended up importing the assembly generated from those source files into the actual build.
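For anyone who hasn't used these extensions, here is a rough sketch of the style of code I mean. It is not our actual driver code: the function names and the RGBA to BGRA swizzle are made up for illustration, and it assumes GCC or Clang, 16-byte-aligned buffers, and a pixel count that is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension: 16 bytes, i.e. four RGBA8 pixels per vector. */
typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* Swap the R and B bytes of each of the four pixels (hypothetical helper). */
static inline u8x16 swap_r_b(u8x16 v)
{
#if defined(__clang__)
    /* Clang takes the lane indices as constant arguments. */
    return __builtin_shufflevector(v, v, 2, 1, 0, 3, 6, 5, 4, 7,
                                   10, 9, 8, 11, 14, 13, 12, 15);
#else
    /* GCC takes the lane indices as an integer vector mask. */
    const u8x16 idx = { 2, 1, 0, 3, 6, 5, 4, 7,
                        10, 9, 8, 11, 14, 13, 12, 15 };
    return __builtin_shuffle(v, idx);
#endif
}

/* Hypothetical conversion routine. `restrict` promises the compiler that
 * dst and src do not alias; __builtin_assume_aligned promises 16-byte
 * alignment. Both promises are the caller's responsibility to keep. */
void rgba_to_bgra(uint8_t *restrict dst, const uint8_t *restrict src,
                  size_t n_pixels)
{
    assert(n_pixels % 4 == 0); /* sketch only: no tail handling */

    const uint8_t *s = __builtin_assume_aligned(src, 16);
    uint8_t *d = __builtin_assume_aligned(dst, 16);

    for (size_t i = 0; i < n_pixels * 4; i += 16) {
        u8x16 v;
        memcpy(&v, s + i, sizeof v); /* normally compiles to one vector load */
        v = swap_r_b(v);
        memcpy(d + i, &v, sizeof v); /* normally compiles to one vector store */
    }
}
```

Compiling a file like this with `-O2 -S` produces an assembly listing, which is the kind of output that can then be checked in and assembled on toolchains that don't understand the extensions.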