I realize one needs a catchy title and some storytelling to get people to read a blog article, but for a summary of the main points:
* This is not about a build step that makes the app perform better
* The app isn't 10x faster (or faster at all; it's the same binary)
* The author ran a benchmark two ways, one of which inadvertently included the time taken to generate sample input data, because it was coming from a pipe
* Generating the data before starting the program under test fixes the measurement
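For concreteness, a minimal sketch of that fix (the file name input.bin is invented here; the hex payload and binary path are the ones from the article's commands):

    # Stage the sample input up front, so the measurement no longer includes
    # xxd starting up or the pipe filling.
    echo '60016000526001601ff3' | xxd -r -p > input.bin
    ./zig-out/bin/count-bytes < input.bin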
Another semi-summary of the core part of the article:
>"echo '60016000526001601ff3' | xxd -r -p | zig build run -Doptimize=ReleaseFast" is much faster than "echo '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/count-bytes" (compiling + running the program is faster than just running an already-compiled program)
>When you execute the program directly, xxd and count-bytes start at the same time, so the pipe buffer is empty when count-bytes first tries to read from stdin, requiring it to wait until xxd fills it. But when you use zig build run, xxd gets a head start while the program is compiling, so by the time count-bytes reads from stdin, the pipe buffer has been filled.
>Imagine a simple bash pipeline like the following: "./jobA | ./jobB". My mental model was that jobA would start and run to completion and then jobB would start with jobA’s output as its input. It turns out that all commands in a bash pipeline start at the same time.
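You can see this for yourself with a throwaway pipeline (nothing here is from the article, and GNU date is assumed for %N; both sides report their start time on stderr so the pipe isn't involved):

    # Both subshells print a timestamp the moment they launch.
    (date +%T.%N >&2; sleep 2) | (date +%T.%N >&2; cat > /dev/null)
    # The two timestamps come out together: the right-hand side does not wait
    # for the left-hand side to finish before it starts.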
That doesn’t make sense unless you have only 1 or 2 physical CPUs with contention. On a modern CPU the latter should be faster, and I’m left unsatisfied by the correctness of the explanation. Am I just being thick, or is there a more plausible explanation?
The latter is faster in actual CPU time; however, note that in TFA the measurement only starts when the program starts, not at the start of the pipeline.
Because the compilation time overlaps with the pipe filling up, blocking on the pipe is mostly excluded from the measurement in the former case (by the time the program starts there’s enough data in the pipe that the program can slurp a bunch of it, especially when reading it byte by byte), but included in the latter.
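A quick way to mimic that head start without involving the compiler at all (the sleep just stands in for compile time; the payload and path are from the article):

    # Give xxd a one-second head start, the way compilation does under
    # `zig build run`. By the time count-bytes runs, its input is already
    # sitting in the pipe buffer, so even byte-at-a-time reads return
    # immediately instead of blocking.
    echo '60016000526001601ff3' | xxd -r -p | { sleep 1; ./zig-out/bin/count-bytes; }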
My hunch is that if you added the buffered reader and kept the original xxd in the pipe you’d see similar timings.
The amount of input data here is laughably small to be producing such a huge timing discrepancy.
I wonder if there’s an added element where the constant read syscalls are contending on a kernel mutex, and that contention disappears if you delay the start of the program.
The idea that the overlap of execution here by itself plays a role is nonsensical. The overlap of execution plus reading a byte at a time causing kernel mutex contention seems like a more plausible explanation, although I would expect someone more knowledgeable about (and more motivated to capture) kernel perf measurements to confirm. If this is the explanation, I'm kind of surprised that there isn't a lock-free path for pipes in the kernel.
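For what it's worth, the raw syscall count of byte-at-a-time reads is easy to see without the Zig program at all, using dd as a stand-in (sample.bin is a made-up name; this shows the number of read(2) calls, not mutex contention):

    echo '60016000526001601ff3' | xxd -r -p > sample.bin
    # Roughly one read(2) per byte of data...
    strace -c -e trace=read dd if=sample.bin of=/dev/null bs=1
    # ...versus a single read(2) for the whole file.
    strace -c -e trace=read dd if=sample.bin of=/dev/null bs=65536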
Based on what you've shared, the second version can start reading instantly because "INFILE" was populated in the previous test. Did you clear it between tests?
Here are the benchmarks before and after fixing the benchmarking code:
On my machine, they're all equally fast at ~28 us. Clearly the changes only had an impact on machines with a different configuration (kernel version or kernel config or xxd version or hw).
One hypothesis outlined above is that when you pipeline all 3 applications, the single-byte-reader version is doing back-to-back syscalls, and that's causing contention between your code and xxd on a kernel mutex, leading to processes sleeping for longer than expected.
It's not a strong hypothesis though, just because of how little data there is and the fact that it doesn't repro on my machine. To get a real explanation, I think you have to actually do some profiling on a machine that can repro and dig in until you obtain a satisfying explanation of what exactly is causing the problem.
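A possible starting point for that kind of digging, on a machine that does repro (these invocations are my guesses, not the author's):

    # Wall time spent in each read(2) while count-bytes is fed through the pipe.
    echo '60016000526001601ff3' | xxd -r -p | strace -T -e trace=read ./zig-out/bin/count-bytes
    # Context-switch counts for the whole pipeline.
    perf stat -e context-switches -- \
        sh -c "echo '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/count-bytes"

Comparing those numbers between a machine that repros and one that doesn't would at least tell you whether the extra time is spent sleeping in read.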
It depends on where the timing code is. If the timer starts after all the data has already been loaded, the time recorded will be lower (even if the total time for the whole process is higher).
I’m not following how that would result in a 10x discrepancy. The amount of data we’re talking about here is laughably small (it’s like 32 bytes or something).
I’ll admit to not having looked at the details at all, but a possible explanation is that almost all the time is spent on inter-process communication overhead, so if that also happens before the timer starts (e.g., the data has already been transferred and is just waiting to be read from a local buffer), then the measured time will be significantly lower.
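To bracket that from the outside (purely illustrative; the article's timer lives inside count-bytes itself):

    # Timing the whole pipeline includes xxd and the pipe hand-off...
    time (echo '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/count-bytes)
    # ...while timing the program with pre-staged input leaves them out,
    # which is roughly what the internal timer sees when it gets a head start.
    echo '60016000526001601ff3' | xxd -r -p > input.bin
    time ./zig-out/bin/count-bytes < input.bin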