This is a side note to the main point being made, but on modern CPUs, "rep movsb" is just as fast as the fastest vectorized version, because the CPU knows to accelerate it. The name of the kernel function "copy_user_enhanced_fast_string" hints at this: the CPU features are ERMS ("Enhanced Repeat Move String", which makes "rep movsb" faster for anything above a certain length threshold) and FSRM ("Fast Short Repeat Move String", which makes "rep movsb" faster for shorter moves too).
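For anyone who wants to check their own machine, here is a minimal sketch of querying the two feature bits, assuming GCC or Clang on x86 (it relies on `__get_cpuid_count` from `<cpuid.h>`). ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9, and FSRM in EDX bit 4 of the same leaf. The struct and function names are made up for illustration, and on non-x86 targets the sketch simply reports both features as absent.

```c
#include <stdbool.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header; assumption: not MSVC */
#endif

/* Hypothetical container for the two "rep movsb" feature flags. */
struct rep_movs_features {
    bool erms;  /* Enhanced Repeat Move String */
    bool fsrm;  /* Fast Short Repeat Move String */
};

/* Query CPUID leaf 7, subleaf 0 and extract the ERMS and FSRM bits.
   Returns both flags as false if the leaf is unavailable or the
   target is not x86. */
static struct rep_movs_features detect_rep_movs_features(void)
{
    struct rep_movs_features f = { false, false };
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* __get_cpuid_count returns 0 when the requested leaf is unsupported. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.erms = ((ebx >> 9) & 1u) != 0;  /* EBX bit 9 */
        f.fsrm = ((edx >> 4) & 1u) != 0;  /* EDX bit 4 */
    }
#endif
    return f;
}

/* Print the detected flags in a human-readable form. */
static void print_rep_movs_features(void)
{
    struct rep_movs_features f = detect_rep_movs_features();
    printf("ERMS: %s, FSRM: %s\n",
           f.erms ? "yes" : "no",
           f.fsrm ? "yes" : "no");
}
```

On Linux the same information shows up as the `erms` and `fsrm` flags in /proc/cpuinfo, which is usually the quicker check.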
Which of these is "faster" depends greatly on whether you have the very rare memcpy-only workload, or if your program actually does something useful. Many people believe, often with good evidence, that the most important thing is for memcpy to occupy as few instruction cache lines as is practical, instead of being something that branches all over kilobytes of machine code. For comparison, see the x86 implementations in LLVM libc.
It depends on the CPU. There is no good reason for "rep movsb" to be slower at any big enough data size.
On a Zen 3 CPU, "rep movsb" becomes faster than or the same as anything else above a length slightly greater than 2 kB.
However, there is a range of multi-megabyte lengths, corresponding roughly to sizes that exceed the L2 cache but still fit in the L3 cache, where for some weird reason "rep movsb" becomes slower than SIMD non-temporal stores.
At lengths exceeding the L3 size, "rep movsb" again becomes the fastest copy method.
Also worth noting that Linux has changed the way it uses ERMS and FSRM in x86 copy multiple times since kernel 6.1, the version used in the article. As a data point, my machine that has FSRM and ERMS (surprisingly, the latter is not implied by the former) hits 17 GB/s using plain old pipes and a 32 KiB buffer on Linux 6.8.
It is likely that on recent CPUs they are always faster than C loop versions.
On my Zen 3 CPU, for lengths of 2 kB or smaller it is possible to copy faster than with "rep movsb", but only by using SIMD instructions (or, equivalently, the builtin "memcpy" provided by most C compilers), not with a C loop. The exception is when the compiler recognizes the C loop and replaces it with the builtin memcpy, which some compilers will do at high optimization levels.
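To make the variants being compared concrete, here is a rough sketch of the two non-library ones: a bare "rep movsb" wrapper and a plain byte loop. It assumes GCC or Clang extended inline asm on x86-64 and falls back to memcpy on other targets. The function names are invented for illustration, and this is not how the kernel or any libc implements its copy routines.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes using a single "rep movsb".
   Assumption: x86-64 with GCC/Clang inline asm; other targets fall
   back to memcpy. The ABI guarantees the direction flag is clear on
   function entry, so the copy runs forward. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    void *d = dst;
    /* RDI = destination, RSI = source, RCX = byte count; all three
       are modified by the instruction, hence the "+" constraints. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
    return dst;
}

/* Naive byte-at-a-time C loop. At higher optimization levels the
   compiler may recognize this pattern and emit a memcpy call instead. */
static void *copy_byte_loop(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++) {
        d[i] = s[i];
    }
    return dst;
}
```

Timing these against the library memcpy across a range of lengths (a few bytes up to well past the L3 size) is the kind of measurement the numbers above come from. The crossover points will differ by CPU.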