The inner loop of those comparisons is indeed where there is still speed to be gained, as noted in the last part of the post. The kind of optimizations you describe are extremely effective, but they qualify as 'micro-optimizations', and I expressly left them out because they hurt readability considerably. But you're right: if that's what it takes, then so be it, and readability will have to suffer for the sake of the last few percent of speed. By my estimate, the maximum gain from this optimization is about 20% relative to the final runtime. (The inner loop will step 8 bytes at a time, but will need more instructions per iteration.)
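
To make the "8 bytes at a time" trade-off concrete, here is a minimal sketch in C of a word-at-a-time byte search. It assumes the inner loop is looking for a single byte value in a buffer; the function name `find_byte_wordwise` and its signature are made up for illustration and are not taken from the post. Each iteration covers 8 bytes but costs an XOR, a subtract and two ANDs instead of one compare, which is the "more instructions" part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the first occurrence of c in
 * buf[0..len), or NULL if it does not occur. Illustrative sketch only. */
static const unsigned char *find_byte_wordwise(const unsigned char *buf,
                                               size_t len, unsigned char c)
{
    assert(buf != NULL);

    const uint64_t ones    = 0x0101010101010101ULL;
    const uint64_t highs   = 0x8080808080808080ULL;
    const uint64_t pattern = ones * c;   /* c replicated into all 8 bytes */

    size_t i = 0;

    /* Fast path: test 8 bytes per iteration. */
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w);   /* alignment-safe 8-byte load */
        uint64_t x = w ^ pattern;        /* bytes equal to c become 0x00 */
        /* Classic "does this word contain a zero byte" test. */
        if ((x - ones) & ~x & highs)
            break;                       /* a match is somewhere in this word */
    }

    /* Slow path: pin down the exact byte, and handle the tail. */
    for (; i < len; i++) {
        if (buf[i] == c)
            return buf + i;
    }
    return NULL;
}
```

The readability cost is visible even in this small version: the magic constants and the zero-byte test need comments to be understood, while the plain byte loop does not.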
I agree that doing this directly in your code is extremely ugly, but note that you can get this speedup just by dropping in a call to glibc's memchr(), which hides the ugliness behind a well-known interface.
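
As a rough illustration of the drop-in approach, the sketch below replaces a hand-written byte loop with repeated calls to the standard `memchr()`. The helper `count_byte` and the idea of counting occurrences are assumptions chosen for the example, not something the post specifies; the point is only that the calling code stays readable while the library does the wide scanning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count how many times c occurs in buf[0..len).
 * All of the scanning is delegated to memchr(). */
static size_t count_byte(const char *buf, size_t len, char c)
{
    assert(buf != NULL);

    size_t n = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        /* memchr returns a pointer to the first match, or NULL if none. */
        const char *hit = memchr(p, c, (size_t)(end - p));
        if (hit == NULL)
            break;
        n++;
        p = hit + 1;   /* resume just past the match */
    }
    return n;
}
```

For example, `count_byte("a\nb\nc\n", 6, '\n')` counts the three newlines in the buffer. How much faster this is than a plain loop depends on the libc build and on how far apart the matches are, so it is worth measuring on the real data.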