
I realize one needs a catchy title and some storytelling to get people to read a blog article, but for a summary of the main points:

* This is not about a build step that makes the app perform better

* The app isn't 10x faster (or faster at all; it's the same binary)

* The author ran a benchmark two ways, one of which inadvertently included the time taken to generate sample input data, because it was coming from a pipe

* Generating the data before starting the program under test fixes the measurement
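
In shell terms, the fix amounts to something like this (the file path is made up for illustration):

    # Generate the input up front, then start the program under test with the
    # data already on disk, so its stdin read never waits on a slow producer.
    xxd -r -p <<< '60016000526001601ff3' > /tmp/input.bin
    ./zig-out/bin/count-bytes < /tmp/input.bin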




Another semi-summary of the core part of the article:

>"echo '60016000526001601ff3' | xxd -r -p | zig build run -Doptimize=ReleaseFast" is much faster than "echo '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/count-bytes" (compiling + running the program is faster than just running an already-compiled program)

>When you execute the program directly, xxd and count-bytes start at the same time, so the pipe buffer is empty when count-bytes first tries to read from stdin, requiring it to wait until xxd fills it. But when you use zig build run, xxd gets a head start while the program is compiling, so by the time count-bytes reads from stdin, the pipe buffer has been filled.

>Imagine a simple bash pipeline like the following: "./jobA | ./jobB". My mental model was that jobA would start and run to completion and then jobB would start with jobA’s output as its input. It turns out that all commands in a bash pipeline start at the same time.
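
Not from the article, but a quick way to convince yourself of that last point:

    # Both sides of the pipe start at the same time: the right-hand side prints
    # its timestamp immediately, then blocks for ~2s waiting for the payload.
    { sleep 2; echo payload; } | { date +'%T right side started'; cat; }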


That doesn’t make sense unless you have only 1 or 2 physical CPUs with contention. On a modern CPU the latter should be faster, and I’m left unsatisfied with the correctness of the explanation. Am I just being thick or is there a more plausible explanation?


The latter is faster in actual CPU time. However, note that in TFA the measurement only starts when the program starts; it does not start at the start of the pipeline.

Because the compilation time overlaps with the pipes filling up, blocking on the pipe is mostly excluded from the measurement in the former case (by the time the program starts there’s enough data in the pipe that the program can slurp a bunch of it, especially reading it byte by byte), but included in the latter.
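
A crude way to reproduce that head start without involving the compiler (the sleep stands in for the build step):

    # xxd fills the pipe buffer while the reader is still "busy", so the first
    # read() in count-bytes returns immediately instead of blocking.
    echo '60016000526001601ff3' | xxd -r -p | { sleep 1; ./zig-out/bin/count-bytes; }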


My hunch is that if you added the buffered reader and kept the original xxd in the pipe you’d see similar timings.

The amount of input data here is just laughably small to be causing a huge timing discrepancy.

I wonder if there’s an added element where the constant syscalls are reading on a contended mutex and that contention disappears if you delay the start of the program.


Good hunch. On my machine (13900k) & zig 0.11, the latest version of the code:

    INFILE="$(mktemp)" && echo $INFILE && \
      echo '60016000526001601ff3' | xxd -r -p > "${INFILE}" && \
      zig build run -Doptimize=ReleaseFast < "${INFILE}"

    execution time: 27.742µs

vs

    echo '60016000526001601ff3' | xxd -r -p | zig build run -Doptimize=ReleaseFast

    execution time: 27.999µs

The idea that the overlap of execution here by itself plays a role is nonsensical. The overlap of execution + reading a byte at a time causing kernel mutex contention seems like a more plausible explanation, although I would expect someone more knowledgeable about (& more motivated to do) kernel perf measurements to confirm. If this is the explanation, I'm kind of surprised that there isn't a lock-free path for pipes in the kernel.


Based on what you've shared, the second version can start reading instantly because "INFILE" was populated in the previous test. Did you clear it between tests?

Here are the benchmarks before and after fixing the benchmarking code:

Before: https://output.circle-artifacts.com/output/job/2f6666c1-1165...

After: https://output.circle-artifacts.com/output/job/457cd247-dd7c...

What would explain the drastic performance increase if the pipelining behavior is irrelevant?


That was just a typo in the comment. The command run locally was just a straight pipe.

Using both invocation variants, I ran:

8a5ecac63e44999e14cdf16d5ed689d5770c101f (before buffered changes)

78188ecbc66af6e5889d14067d4a824081b4f0ad (after buffered changes)

On my machine, they're all equally fast at ~28 µs. Clearly the changes only had an impact on machines with a different configuration (kernel version or kernel config or xxd version or hw).

One hypothesis outlined above is that when you pipeline all 3 applications, the single-byte-reader version is doing back-to-back syscalls, and that's causing contention between your code and xxd on a kernel mutex, leading to things going to sleep for extra long.

It's not a strong hypothesis, though, just because of how little data there is and the fact that it doesn't repro on my machine. To get a real explanation, I think you have to actually do some profiling measurements on a machine that can repro and dig in to obtain a satisfactory explanation of what exactly is causing the problem.
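
For example, on a Linux machine with strace/perf available (these flags are only a starting point, not something anyone in the thread actually ran):

    # Count syscalls made by count-bytes in the fully pipelined (slow) case...
    echo '60016000526001601ff3' | xxd -r -p | strace -c -f ./zig-out/bin/count-bytes
    # ...and watch context switches and read() entries during the same run.
    echo '60016000526001601ff3' | xxd -r -p | perf stat -e 'sched:sched_switch,syscalls:sys_enter_read' ./zig-out/bin/count-bytes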


This @mtlynch


To sanity check myself, I reran this without the buffered reader and still don't see the slow execution time:

    echo '60016000526001601ff3' | xxd -r -p | zig build run -Doptimize=ReleaseFast

    execution time: 28.889µs

So I think my machine config for whatever reason isn't representative of whatever OP is using.

Linux-ck 6.8 CONFIG_NO_HZ=y CONFIG_HZ_1000=y

Intel 13900k

zig 0.11

bash 5.2.26

xxd 2024-02-10

Would be good if someone who can repro it compares the two invocation variants with the buffered reader implemented & lists their config.


It depends on where the timing code is. If the timer starts after all the data has already been loaded, the time recorded will be lower (even if the total time for the whole process is higher).
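
For what it's worth, timing the whole pipeline from the shell sidesteps the timer-placement question entirely (invocation borrowed from upthread):

    # Wall-clock the entire pipeline instead of relying on the program's own
    # timer, which only starts once stdin already has data waiting.
    time ( echo '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/count-bytes )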


I’m not following how that would result in a 10x discrepancy. The amount of data we’re talking about here is laughably small (it’s like 32 bytes or something)


I’ll admit to not having looked at the details at all, but a possible explanation is that almost all the time is spent on inter-process communication overhead, so if that also happens before the timer starts (e.g., the data has already been transferred and is just waiting to be read from a local buffer), then the measured time will be significantly lower.


> The amount of data we’re talking about here is laughably small

So is the runtime.


I would definitely classify the title as clickbait because the app didn't go "10x faster".



