The blog post is effectively implementing memmem() or strstr(), i.e. searching for a short string within a long one. If we are allowed to use GNU extensions, the cleanest solution is to call memmem(). Without memmem(), I would implement Boyer–Moore or Knuth–Morris–Pratt, which scale better than the O(MN) implementation in the blog post. Time complexity matters.
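For concreteness, here is a minimal sketch of both options: calling the GNU memmem() extension, and a plain Knuth–Morris–Pratt search as a portable O(N + M) fallback. The helper names kmp_search and build_failure are my own, not from the blog post, and this is illustrative rather than tuned code.

```c
#define _GNU_SOURCE   /* needed for memmem() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build the KMP failure table: fail[i] is the length of the longest
 * proper prefix of needle[0..i] that is also a suffix of it. */
static void build_failure(const char *needle, size_t m, size_t *fail) {
    fail[0] = 0;
    size_t k = 0;
    for (size_t i = 1; i < m; i++) {
        while (k > 0 && needle[i] != needle[k])
            k = fail[k - 1];
        if (needle[i] == needle[k])
            k++;
        fail[i] = k;
    }
}

/* O(N + M) search; returns a pointer to the first match or NULL. */
static const char *kmp_search(const char *hay, size_t n,
                              const char *needle, size_t m) {
    if (m == 0) return hay;
    size_t *fail = malloc(m * sizeof *fail);
    if (!fail) return NULL;
    build_failure(needle, m, fail);
    size_t k = 0;
    const char *result = NULL;
    for (size_t i = 0; i < n; i++) {
        while (k > 0 && hay[i] != needle[k])
            k = fail[k - 1];
        if (hay[i] == needle[k])
            k++;
        if (k == m) {                     /* full match ends at index i */
            result = hay + i - m + 1;
            break;
        }
    }
    free(fail);
    return result;
}

int main(void) {
    const char hay[] = "find the needle in this haystack";
    const char needle[] = "needle";

    /* GNU extension: one library call does the job. */
    const char *p = memmem(hay, strlen(hay), needle, strlen(needle));
    /* Portable fallback. */
    const char *q = kmp_search(hay, strlen(hay), needle, strlen(needle));

    printf("memmem: %s\nkmp:    %s\n", p ? p : "(none)", q ? q : "(none)");
    return 0;
}
```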
Yes, this was my takeaway - it was a very poor choice of example, because a known faster algorithm will crush the sort of micro-optimizations done here while also being more readable.
Or perhaps it was actually an incredibly good example, despite itself: a great illustration of precisely why jumping to micro-optimization isn't always the right solution.
This is also a problem where the asymptotically fastest solutions are pretty much guaranteed to be faster in practice, since they are just as cache-friendly and will almost always execute fewer instructions.
That is not always the case, but this is one of the times when it is.