This is a side note to the main point being made, but on modern CPUs, "rep movsb...

Lockal · 2024-08-26T00:21:44 1724631704

This is not the full truth: "rep movsb" is fast only up to another threshold, after which either normal or non-temporal stores are faster.

All thresholds are described in https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch...

And they are not final, i.e., Noah Goldstein still updates them every year.

jeffbee · 2024-08-26T01:33:33 1724636013

Which of these is "faster" depends greatly on whether you have the very rare memcpy-only workload, or whether your program actually does something useful. Many people believe, often with good evidence, that the most important thing is for memcpy to occupy as few instruction cache lines as is practical, instead of being something that branches all over kilobytes of machine code. For comparison, see the x86 implementations in LLVM libc.

https://github.com/llvm/llvm-project/blob/main/libc/src/stri...

adrian_b · 2024-08-26T05:43:00 1724650980

It depends on the CPU. There is no good reason for "rep movsb" to be slower at any big enough data size.

On a Zen 3 CPU, "rep movsb" becomes faster than or the same as anything else above a length slightly greater than 2 kB.

However there is a range of multi-megabyte lengths, which correspond roughly with sizes below the L3 cache but exceeding the L2 cache, where for some weird reason "rep movsb" becomes slower than SIMD non-temporal stores.

At lengths exceeding the L3 size, "rep movsb" becomes again the fastest copy method.

The Intel CPUs have different behaviors.
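
For illustration, the size-based choice described above could be sketched in C roughly like this. The 2048-byte cutoff is only the approximate Zen 3 figure from this comment, `REP_MOVSB_MIN`, `copy_rep_movsb` and `copy_dispatch` are hypothetical names, and the inline asm assumes GCC or Clang on x86-64 (other targets fall back to `memcpy`). It ignores the L2-to-L3 range where non-temporal stores were observed to win.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff, taken from the rough Zen 3 figure above; tune per CPU. */
#define REP_MOVSB_MIN 2048

/* Copy n bytes with "rep movsb" on x86-64; use plain memcpy elsewhere. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    void *d = dst;
    /* RDI = destination, RSI = source, RCX = byte count.
       The ABI guarantees the direction flag is clear on entry. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    return memcpy(dst, src, n);
#endif
}

/* Short copies go to the compiler/library memcpy, long ones to rep movsb. */
static void *copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n < REP_MOVSB_MIN)
        return memcpy(dst, src, n);
    return copy_rep_movsb(dst, src, n);
}
```

Whether this beats the libc memcpy on a given machine is something to measure, not assume.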

jeffbee · 2024-08-25T23:53:07 1724629987

Also worth noting that Linux has changed the way it uses ERMS and FSRM in x86 copy multiple times since kernel 6.1, which is used in the article. As a data point, my machine that has FSRM and ERMS (surprisingly, the latter is not implied by the former) hits 17 GB/s using plain old pipes and a 32 KiB buffer on Linux 6.8.
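
If you want to check which of the two flags your own CPU reports, a minimal sketch with GCC or Clang's `<cpuid.h>` could look like the following. ERMS is CPUID leaf 7, subleaf 0, EBX bit 9, and FSRM is EDX bit 4 of the same leaf; the struct and function names are made up for this example, and on non-x86 builds both fields simply stay 0.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical container for the two feature bits discussed above. */
struct rep_features {
    int erms; /* Enhanced REP MOVSB/STOSB */
    int fsrm; /* Fast Short REP MOV */
};

static struct rep_features query_rep_features(void)
{
    struct rep_features f = {0, 0};
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid_count returns 0 if leaf 7 is not supported. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.erms = (ebx >> 9) & 1; /* CPUID.(EAX=7,ECX=0):EBX bit 9 */
        f.fsrm = (edx >> 4) & 1; /* CPUID.(EAX=7,ECX=0):EDX bit 4 */
    }
#endif
    return f;
}
```

On Linux the same information shows up as `erms` and `fsrm` in the flags line of /proc/cpuinfo.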

koverstreet · 2024-08-26T00:32:45 1724632365

I'm still waiting for rep movsb and rep stosb to be fast enough to delete my simple C loop versions, for short memcpys.

adrian_b · 2024-08-26T05:59:16 1724651956

It is likely that on recent CPUs they are always faster than C loop versions.

On my Zen 3 CPU, for lengths of 2 kB or smaller it is possible to copy faster than with "rep movsb", but by using SIMD instructions (or equivalently the builtin "memcpy" provided by most C compilers), not with a C loop (unless the compiler recognizes the C loop and replaces it with the builtin memcpy, which is what some compilers will do at high optimization levels).
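
To make the distinction concrete, here is the kind of simple C loop in question next to the builtin-memcpy version. The function names are made up for this example. An optimizing compiler may or may not turn the first into the second; that depends on the compiler, the flags, and whether it can prove the buffers do not overlap (hence the `restrict` here).

```c
#include <stddef.h>
#include <string.h>

/* Byte-at-a-time loop. Some compilers recognize this pattern at high
   optimization levels and replace it with a memcpy call. */
static void copy_loop(unsigned char *restrict dst,
                      const unsigned char *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Same effect through memcpy, which the compiler is free to expand
   with SIMD loads/stores or leave as a library call. */
static void copy_builtin(unsigned char *restrict dst,
                         const unsigned char *restrict src, size_t n)
{
    memcpy(dst, src, n);
}
```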

haberman · 2024-08-26T21:36:52 1724708212

If that’s the case, when can I expect C compilers to inline variable-length memcpy() the way they will inline fixed-length memcpy today?
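
For anyone unsure what the fixed-length versus variable-length distinction looks like in practice, a small sketch (hypothetical function names):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Size is a compile-time constant: optimizing compilers typically
   lower this to a single 8-byte load with no call at all. */
static uint64_t load_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Size is only known at run time: this usually stays a call into the
   library memcpy, which then does its own size dispatch. */
static void copy_n(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}
```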