I realize one needs a catchy title and some storytelling to get people to read a blog article, but for a summary of the main points:
* This is not about a build step that makes the app perform better
* The app isn't 10x faster (or faster at all; it's the same binary)
* The author ran a benchmark two ways, one of which inadvertently included the time taken to generate sample input data, because it was coming from a pipe
* Generating the data before starting the program under test fixes the measurement
Another semi-summary of the core part of the article:
>"echo '60016000526001601ff3' | xxd -r -p | zig build run -Doptimize=ReleaseFast" is much faster than "echo '60016000526001601ff3' | xxd -r -p | ./zig-out/bin/count-bytes" (compiling + running the program is faster than just running an already-compiled program)
>When you execute the program directly, xxd and count-bytes start at the same time, so the pipe buffer is empty when count-bytes first tries to read from stdin, requiring it to wait until xxd fills it. But when you use zig build run, xxd gets a head start while the program is compiling, so by the time count-bytes reads from stdin, the pipe buffer has been filled.
>Imagine a simple bash pipeline like the following: "./jobA | ./jobB". My mental model was that jobA would start and run to completion and then jobB would start with jobA’s output as its input. It turns out that all commands in a bash pipeline start at the same time.
That doesn’t make sense unless you have only 1 or 2 physical CPUs with contention. In a modern CPU the latter should be faster and I’m left unsatisfied by the correctness of the explanation. Am I just being thick or is there a more plausible explanation?
The latter is faster in actual CPU time; however, note that in TFA the measurement only starts with the program, not with the start of the pipeline.
Because the compilation time overlaps with the pipes filling up, blocking on the pipe is mostly excluded from the measurement in the former case (by the time the program starts there’s enough data in the pipe that the program can slurp a bunch of it, especially reading it byte by byte), but included in the latter.
My hunch is that if you added the buffered reader and kept the original xxd in the pipe you’d see similar timings.
The amount of input data here is just laughably small to account for such a huge timing discrepancy.
I wonder if there’s an added element where the constant syscalls are reading on a contended mutex and that contention disappears if you delay the start of the program.
The idea that the overlap of execution here by itself plays a role is nonsensical. The overlap of execution + reading a byte at a time causing kernel mutex contention seems like a more plausible explanation, although I would expect someone more knowledgeable (& more motivated) about capturing kernel perf measurements to confirm. If this is the explanation, I'm kind of surprised that there isn't a lock-free path for pipes in the kernel.
Based on what you've shared, the second version can start reading instantly because "INFILE" was populated in the previous test. Did you clear it between tests?
Here are the benchmarks before and after fixing the benchmarking code:
On my machine, they're all equally fast at ~28 us. Clearly the changes only had an impact on machines with a different configuration (kernel version or kernel config or xxd version or hw).
One hypothesis outlined above is that when you pipeline all 3 applications, the single-byte-reader version is doing back-to-back syscalls, and that's causing contention between your code and xxd on a kernel mutex, leading to things going to sleep extra long.
It's not a strong hypothesis though, just because of how little data there is and the fact that it doesn't repro on my machine. To get a real explanation, I think you have to actually do some profiling measurements on a machine that can repro and dig in to obtain a satisfying explanation of what exactly is causing the problem.
It depends on where the timing code is. If the timer starts after all the data has already been loaded, the time recorded will be lower (even if the total time for the whole process is higher).
I’m not following how that would result in a 10x discrepancy. The amount of data we’re talking about here is laughably small (it’s like 32 bytes or something)
I’ll admit to not having looked at the details at all, but a possible explanation is that almost all the time is spent on inter process communication overhead, so if that also happens before the timer starts (eg, the data has been transferred, just waiting to be read from a local buffer) then the measured time will be significantly lower.
Back in college, a friend of mine decided to learn how to program. He had never programmed before. He picked up the DEC FORTRAN-10 manual and read it cover to cover.
He then wrote a program that generated some large amount of data and wrote it to a file. Being much smarter than I am, his first program worked the first time.
But it ran terribly slow. Baffled, he showed it to his friend, who exclaimed why are you, in a loop, opening the file, appending one character, and closing the file? That's going to run incredibly slowly. Instead, open the file, write all the data, then close it!
The reply was "the manual didn't say anything about that or how to do I/O efficiently."
I firmly believe that teaching how to idiomatically do both character and line oriented file IO should be the first thing any language tutorial teaches, almost. Just as soon as you’ve introduced enough syntax.
FWIW, the Epitech cursus starts your very first C programming lesson by making you write a "my_putchar" function that does a "write" syscall with a single character. Then you spend the next few days learning how to create my_putnbr, my_putstr, etc, using that single building block.
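For anyone who hasn't seen that style of exercise, here's a minimal sketch of what such a write(2)-only toolkit might look like; the my_* names come from the comment above, and this is just an illustration of the idea, not Epitech's actual reference code:

    /* Everything built on a single write(2) of one character. */
    #include <unistd.h>

    void my_putchar(char c)
    {
        write(1, &c, 1);            /* one syscall per character */
    }

    void my_putstr(const char *s)
    {
        while (*s)
            my_putchar(*s++);       /* a string costs one syscall per byte */
    }

    void my_putnbr(int n)
    {
        if (n < 0) {
            my_putchar('-');
            n = -n;                 /* note: overflows for INT_MIN */
        }
        if (n >= 10)
            my_putnbr(n / 10);
        my_putchar('0' + n % 10);
    }

    int main(void)
    {
        my_putstr("answer: ");
        my_putnbr(42);
        my_putchar('\n');
        return 0;
    }

Running this under strace makes the cost model very visible: every character you print is its own write call.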
I think that's the right choice, by the way. Baby developers don't need to learn efficient I/O, they need to learn how you make the pile of sand do smart things.
And if you've spent weeks doing I/O one syscall per character, getting to the point you write hundreds of lines that way, the moment some classmate shows you that you can 100x your program's performance by batching I/O gets burned in your memory forever, in a way "I'm doing it because the manual said so" doesn't.
I said idiomatic because that’s the form they’re going to encounter it in the wild, in library doc, in stack overflow answers.
I don't think the sort of low-level bit banging you propose is a worthwhile use of a student's time, given the vast amount they have to learn that won't be immediately obsolete.
I firmly disagree with this. We're talking about learning C, not learning an arbitrary programming language. The course of study the GP comment suggests teaches syscalls, pipes, buffering, and most important, it teaches mechanical sympathy. All of which are things a C programmer needs to understand.
More programming tasks than you might imagine are low-level bit banging, and C remains the language of choice for doing them. It might be Zig one day, and if so, the same sort of deep-end dive into low-level detail will remain a good way to approach such a language.
I would argue that they should be teaching how file IO works at a low level at some point (preferably as one of the first "complicated" bits).
Everybody should, at some early point, interact with basic file descriptors and see how they map to syscalls. Preferably including building their own character and line oriented abstractions on top of that, so they can see how they work.
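As a sketch of what "building your own character and line oriented abstractions on top of raw file descriptors" could look like, here's a toy buffered reader in C; the names (buf_reader, br_getc, br_getline) are made up for illustration, and real implementations like stdio or Zig's buffered reader do considerably more:

    #include <stdio.h>
    #include <unistd.h>

    #define BR_CAP 4096

    struct buf_reader {
        int    fd;
        size_t pos, len;
        char   buf[BR_CAP];
    };

    /* Character-oriented layer: returns the next byte, or -1 on EOF/error.
     * read(2) is only called when the buffer is empty, so most calls cost
     * no syscall at all. */
    int br_getc(struct buf_reader *r)
    {
        if (r->pos == r->len) {
            ssize_t n = read(r->fd, r->buf, BR_CAP);
            if (n <= 0)
                return -1;
            r->pos = 0;
            r->len = (size_t)n;
        }
        return (unsigned char)r->buf[r->pos++];
    }

    /* Line-oriented layer built on the character-oriented one.
     * Returns the line length, or -1 on EOF with nothing read. */
    long br_getline(struct buf_reader *r, char *out, size_t cap)
    {
        size_t i = 0;
        int c = -1;
        while (i + 1 < cap && (c = br_getc(r)) != -1 && c != '\n')
            out[i++] = (char)c;
        out[i] = '\0';
        if (i == 0 && c == -1)
            return -1;
        return (long)i;
    }

    int main(void)
    {
        struct buf_reader r = { .fd = 0 };   /* read from stdin */
        char line[4096];
        long n, lines = 0;
        while ((n = br_getline(&r, line, sizeof line)) != -1)
            lines++;
        printf("%ld lines\n", lines);
        return 0;
    }

Comparing this against a version that calls read(fd, &c, 1) per byte, with strace -c counting syscalls, is exactly the kind of lesson being argued for here.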
I'm convinced that IO is in the same category as parallelism; most devs understand it poorly at best, and the ones who do understand it are worth their weight in gold.
I beat benchmark after benchmark in 80s on disk I/O. I was amazed that nobody else figured out what my compiler was doing - using a 16K buffer for the floppy drive rather than 512 bytes.
Heh, wasn't 16k most of the memory in the machine? Large buffers do have other interesting and fun side effects, though back then you probably didn't have any threads or any/many of the things buffers cause these days.
I don't think they need to teach all of CS; I would agree your average language tutorial probably doesn't need a chapter on tree traversal or transitive closures or whatever.
I do think they should all teach IO, though. If I could only pick 1 part of a language to understand really, really well it would be IO.
The vast majority of apps spend the vast majority of their productive time on IO. Your average CRUD app is almost entirely IO. The user sends IO to the app, the app send IO to the database, the database does IO to get the results and then the IO propagates back up. The only part that isn't basically pure IO is transforming and marshalling the DB records into API responses.
If you add parallelism, that's like 98% of what most apps do. Parallel IO dominates most apps.
I don't want to belittle the author, but I am surprised that people using a low-level language on Linux wouldn't know how Unix pipelines work or that reading one byte per syscall is quite inefficient. I understand that the author is still learning (aren't we all?), but it just feels like pretty fundamental knowledge. At the same time, the author managed to get better performance than the official thing had. I guess many things feel fundamental in retrospect.
>I am surprised, that people using a low-level language on Linux wouldn't know ... that reading one byte per syscall is quite inefficient.
In my defense, it wasn't that I didn't realize one byte per syscall was inefficient; it was that I didn't realize that I was doing one syscall per byte read.
I'm coming back to low-level programming after 8ish years of Go/Python/JS, so I wasn't really registering that I'd forgotten to layer in a buffered reader on top of stdin's reader.
Alex Kladov (matklad) made an interesting point on the Ziggit thread[0] that the Zig standard library could adjust the API to make this kind of mistake less likely:
>I’d say readByte is a flawed API to have on a Reader. While you technically can read a byte-at-time from something like TCP socket, it just doesn’t make sense. The reader should only allow reading into a slice.
That’s a rather short time. (It’s a lot of cycles, but there are plenty of things one might do on a computer that take time comparable to this, especially anything involving IO. It’s only a small fraction of a disk head seek if you’re using spinning disks, and it’s only a handful of non-overlapping random accesses even on NVMe.)
So, when you benchmark anything and get a time this short, you should make sure you’re benchmarking the right thing. Watch out for fixed costs or correct for them. Run in a loop and see how time varies with iteration count. Consider using a framework that can handle this type of benchmarking.
A result like “this program took half a millisecond to run” doesn’t really tell you much about how long any part of it took.
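A low-tech way to apply that advice, sketched here in C (the same structure works in any language): time increasing iteration counts and watch how the per-iteration number behaves. The work() function is a placeholder for whatever you're actually measuring.

    #include <stdio.h>
    #include <time.h>

    static volatile unsigned long sink;

    static void work(void)
    {
        /* placeholder workload */
        unsigned long x = 0;
        for (int i = 0; i < 1000; i++)
            x += (unsigned long)i * i;
        sink = x;
    }

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        for (long iters = 1000; iters <= 1000000; iters *= 10) {
            double t0 = now_sec();
            for (long i = 0; i < iters; i++)
                work();
            double t1 = now_sec();
            /* If the per-iteration time keeps dropping as iters grows, a
             * fixed cost (startup, first read from a pipe, cold caches) is
             * still polluting the smaller measurements. */
            printf("%8ld iters: %.1f ns/iter\n",
                   iters, (t1 - t0) * 1e9 / iters);
        }
        return 0;
    }

If the numbers don't converge as the iteration count grows, you're still measuring a fixed cost rather than the thing itself.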
Zig certainly needs more work. That part is more on familiarity with Zig and how intuitive it is or isn't.
In any case I would recommend anyone investigating things like that to run things through strace. It is often my first step in trying to understand what happens with anything - like a cryptic error "No such file or directory" without telling me what a thing tried to access. You would run:
Am I doing it wrong? Or is there more training involved before one could usefully integrate this into debugging? Because to me, the output is pretty inscrutable.
There is a lot of output here, but you can grep around or filter with strace CLI. If you used -f option you should get PID numbers later on. Then you can look for all execve's to see how PIDs map to parts of the pipeline. For now maybe grep the log file with something like: "grep -e clone -e execve -e write -e read". You can do this with strace CLI, but I never remember the syntax and usually analyze the log extensively.
I think something like this could work:
strace -f -e execve,clone,write,read -o strace.log sh -c '...'
clone is the modern fork, i.e. the creation of a new process, which happens before the eventual execve (with echo being a shell builtin, there will probably be just a clone and no execve).
strace tells you every syscall the process under it makes. So very helpful to understanding how a program interacts with the operating system - and I/O as all IO mechanisms are managed by the operating system.
As for how to filter this I'll leave that to the other comments, but I personally would look at the man page or Google around for tips
There are many fundamental things People Should Know™, but we are not LLMs that have ingested entire libraries worth of books.
Exploratory programming and being curious about strange effects is a great way to learn the fundamentals. I already knew how pipes and processes work, but I don't know the Ethereum VM.
The author now knows both.
I feel like you are underestimating how many fundamental misunderstandings people can have (including both of us) even though they have deep understanding of adjacent issues.
This deleterious effect is a factor in computing. We deal with it every few years: kids graduate, having ignored the history of their elders in order to focus on the new and cool - and once they hit industry, find that they really, really should have learned POSIX or whatever.
It's not uncommon. As a professional developer I have observed this obfuscation of prior technology countless times, especially with junior devs.
There is a lot to learn. Always. It doesn't ever stop.
This is exactly what surprised me as well. I'm literally now learning WebStreams[1] in JS in depth (vs the traditional Node Streams), and I've seen too many times the comparison of how "pipe() and pipeTo() behave just like Unix's pipes |". Reading this article makes me think this might not be the best comparison, especially since for many webdevs it's the first time approaching some CS concepts. OTOH, the vast majority of webdevs don't really need to learn WebStreams in depth.
Most people have gaps somewhere in their knowledge. I learned very early on, as a general superstition, to always try to batch things that deal with the world outside, like file writes, allocations, network requests, etc. But for years I had no idea what a syscall even was.
There is general wisdom about bash pipelines here that I think most people will miss simply because of the title. Interesting though, my mental model of bash piping was wrong too.
There were several reasons why pipes were added to Unix, and the ability to run producer/consumer processes concurrently was one of them. Before that (and for many years after on non-Unix systems) the most prevalent paradigm was indeed to run multi-stage pipelines with the moral equivalent of the following:
stage1.exe /in:input.dat /out:stage1.dat
stage2.exe /in:stage1.dat /out:stage2.dat
del stage1.dat
stage3.exe /in:stage2.dat /out:result.dat
del stage2.dat
Pipes are so useful. I find myself more and more using shell script and pipes for complex multi-stage tasks. This also simplifies any non-shell code I must write, as there are already high quality, performant implementations of hashing and compression algorithms I can just pipe to.
My biggest annoyance is when I get some tooling from some other team, and they're like "oh just extend this Python script". It'll operate on local files, using shell commands, in a non-reentrant way, with only customization from commenting out code. Maybe there's some argparse but you end up writing a program using their giant args as primitives.
Guys just write small programs and chain them. The wisdom of the ancients is continuously lost.
Python comes with a built-in module called fileinput that makes this very easy. It reads from the files named in sys.argv[1:], or from stdin if no files are given or a filename is a dash.
It's just a preference thing, I loathe the small program chaining style and cannot work with it at all. Give me a python script and I'm good though. I can't for the life of me imagine why people would want to do pseudo programming through piping magic when chaining is so limited compared to actual programming
This is of course a false dichotomy, there's nothing pseudo about using bash (perhaps you mean sudo?) and bash scripts orchestrate what you call 'actual' programs.
I commonly write little Python scripts to filter logs, which read from stdin. That means I can filter a log straight to stdout.
"The programmer scoffed at Master Foo and rose to depart. But Master Foo nodded to his student Nubi, who wrote a line of shell script on a nearby whiteboard, and said: “Master programmer, consider this pipeline. Implemented in pure C, would it not span ten thousand lines?”"
Well they all do, but in terms of ease of use, tar and zip are much simpler to implement in a cli pipeline than to write bespoke code. At least that has been my experience.
It is hard to compete with "| gzip" in any programming language. Just importing a library and you're already well past that. Just typing "import" and you're tied! Overbudget if I drop the space in "| gzip".
This is one of the reasons why, for all its faults, shell just isn't going anywhere any time soon.
You can also (assuming your language supports it), execute gzip, and assuming your language gives you some writable-handle to the pipe, then write data into it. So, you get the concurrency "for free", but you don't have to go all the way to "do all of it in process".
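In C, for example, that pattern is roughly popen(3): spawn gzip, get a writable handle to its stdin, and stream data into it while gzip compresses concurrently. A rough sketch (the output filename and data are made up):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* gzip runs as a child process; popen hands us a writable FILE*
         * connected to its stdin via a pipe. */
        FILE *gz = popen("gzip -c > out.gz", "w");
        if (!gz) {
            perror("popen");
            return 1;
        }

        for (int i = 0; i < 1000000; i++)
            fprintf(gz, "line %d\n", i);   /* compressed as we generate it */

        /* pclose waits for gzip to finish and reports its exit status. */
        if (pclose(gz) != 0)
            fprintf(stderr, "gzip reported an error\n");
        return 0;
    }

You get the concurrency of a shell pipeline without linking a compression library, at the cost of depending on gzip being installed.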
I've also done the "trick" of executing [bash, -c, <stuff>] in a higher language, too. I'd personally rather see the work better suited for the high language done in the higher language, but if shell is easier, then as such it is.
It's sort of like unsafe blocks: minimize the shell to a reasonable portion, clearly define the inputs/outputs, and make sure you're not vulnerable to shell-isms, as best as you can, at the boundary.
But I still think I see the reverse far more often. Like, `steam` is … all the time, apparently … exec'ing a shell to then exec … xdg-user-dir? (And the error seems to indicate that that's it…) Which seems more like the sort of "you could just exec this yourself?". (But Steam is also mostly a web-app of sorts, so, for all I know there's JS under there, and I think node is one of those "makes exec(2) hard/impossible" langs.)
Sometimes you want the intermediate files as well, though. For example, if doing some kind of exploratory analysis of the different output stages of the pipeline, or even just for debugging.
Tee can be useful for that. Maybe pv (pipe viewer) too. I have not tried it yet.
SPONGE(1) moreutils SPONGE(1)
NAME
sponge - soak up standard input and write to a file
SYNOPSIS
sed '...' file | grep '...' | sponge [-a] file
DESCRIPTION
sponge reads standard input and writes it out to the specified file.
Unlike a shell redirect, sponge soaks up all its input before writing
the output file. This allows constructing pipelines that read from and
write to the same file.
Usually mental models develop "organically" from when one was a n00b, without much thought, and sometimes it can take a long time for them to be unseated, even though it's kind of obvious in hindsight that the mental model is wrong (e.g. one can see that from "slow-program | less", and things like that).
I think a main reason for this is that you can have a "good enough" working mental model of a process, that holds up to your typical use cases and even moderate scrutiny. It's often only once you run into a case where your mental model fails that you even think to challenge the assumptions it was built on - at least, this has been my experience.
Can’t speak for OP, but one might reasonably expect later stages to only start execution once at least some data is available—rather than immediately, before any data is available for them to consume.
Of course, there many reasons you wouldn’t want this—processes can take time to start up, for example—but it’s not an unreasonable mental model.
Well, it could be implemented like this, it's just more cumbersome than "create N-1 anonymous pipes, fork N processes, wait for the last process to finish": at the very least you'll need to select() on the last unattached pipe, and when job control comes into the picture, you'd really would like the "setting up the pipeline" and "monitoring the pipeline's execution" parts to be disentangled.
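For the curious, here's roughly what "one anonymous pipe, two forks, wait for both" looks like for a two-stage pipeline, with job control and most error handling omitted; this is a sketch of the idea, not how any particular shell actually structures its code:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        if (pipe(fds) == -1) {
            perror("pipe");
            return 1;
        }

        pid_t producer = fork();
        if (producer == 0) {
            dup2(fds[1], STDOUT_FILENO);   /* stdout -> write end of the pipe */
            close(fds[0]);
            close(fds[1]);
            execlp("echo", "echo", "hello from the producer", (char *)NULL);
            _exit(127);
        }

        pid_t consumer = fork();
        if (consumer == 0) {
            dup2(fds[0], STDIN_FILENO);    /* stdin <- read end of the pipe */
            close(fds[0]);
            close(fds[1]);
            execlp("wc", "wc", "-c", (char *)NULL);
            _exit(127);
        }

        /* Parent: close both ends so the consumer sees EOF once the
         * producer exits, then wait for the whole pipeline. */
        close(fds[0]);
        close(fds[1]);
        waitpid(producer, NULL, 0);
        waitpid(consumer, NULL, 0);
        return 0;
    }

Both children are started immediately and run concurrently; nothing in this structure waits for the producer before launching the consumer, which is exactly the behavior the article's author found surprising.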
Not even that they might be particularly slow to start in absolute terms, but just that they might be slow relative to how fast the previous stage starts cranking out some input for it.
I've known this about Unix pipes for a very long time. Whenever they are introduced it is always mentioned, but I guess people can miss it.
Though now I will break your mind as mine was broken not long ago. PowerShell, which is often said to be a better shell, works like that (i.e. the way the author originally assumed): it doesn't run things in parallel. I think the same can be said about Windows cmd/batch, but don't cite me on that. That one thing makes PowerShell insufficient to ever be a full replacement for a proper shell.
Not exactly. Non-native PowerShell pipelines are executed in a single thread, but the steps are interleaved, not buffered. That is, each object is passed through the whole pipeline before the next object is processed. This is non-ideal for high-performance data processing (e.g. `cat`ing a 10GB file, searching through it and gzipping the output), but for 99% of daily commands, it does not make any difference.
cmd.exe uses standard OS pipes and behaves the same as UNIX shells, same as Powershell invoking native binaries.
Oh, that's what I missed! I managed to find out about it while trying to do an equivalent of `curl ... | tar xzf -` in Powershell. I was stumped. I guess the thing is that a Unix shell would do a subshell automatically.
> Though now I will break your mind as my mind was broken not a long time ago. Powershell, which is often said to be a better shell, works like that. It doesn't run things in parallel. I think the same is to be said about Windows cmd/batch, but don't cite me on that. That one thing makes Powershell insufficient to ever be a full replacement of a proper shell.
A pipeline in PowerShell is definitely streaming unless you accidentally force the output into a list/array at some point; e.g. try this for yourself (somewhere you can interrupt the script, obviously, as it's going to run forever).
Whether it runs in parallel depends on the implementation of each side. Interpreted PowerShell code does not run in parallel unless you run it as a job, use ForEach-Object -Parallel, or explicitly put it on another thread. But the data is not collected together before being sent from one step to the next.
The character "|" was introduced into computing in the language NPL at IBM in December 1964 as a notation for bitwise OR, replacing ".OR.", which had been used by IBM in its previous programming language, FORTRAN IV (OR was written between dots to distinguish it from identifiers, marking it as an operator).
The next year the experimental NPL (New Programming Language) was rebranded as PL/I and became a commercial IBM product.
Following PL/I, other programming languages began to use "&" and "|" for AND and OR, including the B language, the predecessor of C.
The pipe and its notation were introduced in the Third Edition of UNIX (based on a proposal made by M. D. McIlroy) in 1972, after the language B had been in use for a few years and before the development of C. The oldest documentation about pipes that I have seen is in the "UNIX Programmer's Manual, Third Edition" from February 1973.
Before NPL, the vertical bar had already been used in the Backus-Naur notation, introduced in the ALGOL 60 report, as a separator between alternatives in the description of the language's grammar, so with a meaning somewhat similar to OR.
That is right, but I did not want to provide too many details that did not belong to the topic.
FORTRAN IV did not have bit strings, it had only Boolean values ("LOGICAL").
Therefore all the logical operators could be applied only to Boolean operands, giving a Boolean result.
The same was true for all earlier high-level programming languages.
The language NPL, renamed PL/I in 1965, was the first high-level programming language to introduce bit string values, so the AND, OR and NOT operators could operate on bit strings, not only on single Boolean values.
If PL/I had remained restricted to the smaller character set accepted by FORTRAN IV in source texts, it would have retained the FORTRAN IV operators ".NOT.", ".AND.", ".OR.", extending their meaning to bit string operators.
However, IBM decided to extend the character set, which allowed the use of dedicated symbols for the logical operators and also for other operators that previously had to use keywords, like the relational operators, and for new operators introduced by PL/I, like the concatenation operator.
Not to get all semiotic about it, but |x| notation is a pair of vertical lines. I'm sure that someone has written a calculator program where two 0x7C characters bracketing a symbol means absolute value, but if I've ever seen it, I can't recall.
Although 0x7C is overly specific, since if a sibling comment is correct (I have no reason to think otherwise), | for bitwise OR originates in PL/I, where it would have been encoded in EBCDIC, which codes it as 0x4F.
I'm not really disagreeing with you, the |abs| notation is quite a bit older than computers, just musing on what should count as the first use of "|". I'm inclined to say that it should go to the first use of an encoding of "|", not to the similarly-appearing pen and paper notation, and definitely not the first use of ASCII "|" aka 0x7C in a programming language. But I don't think there's a right answer here, it's a matter of taste.
Because one could argue back to the Roman numeral I, if one were determined to do so: when written sans serif, it's just a vertical line, after all. Somehow, abs notation and "first use of an encoded vertical bar" both seem reasonable, while the Roman numeral and specifically-ASCII don't, but I doubt I can unpack that intuition in any detail.
The language APL\360 of IBM (August 1968) and the other APL dialects that have followed it have used a single "|" as a monadic prefix operator that computes the absolute value and also as a dyadic infix operator that computes the remainder of the division (but with the operand order reversed in comparison with the language C, which is usually much more convenient, especially in APL, where this order avoids the need for parentheses in most cases).
I was so confused about why this mattered/made such a difference - until I went back and re-read from the top: OP does the benchmark timing in `main`, in the Zig app under test.
If you don't do that, and use the `time` CLI for example, this wouldn't have been a problem in the first place. Though, sure, then you couldn't have compared against compiling fresh and running, and at least on small inputs you'd have wanted to do the input prep first anyway.
But I think if you put the benchmark code inside the DUT you're setting yourself up for all kinds of gotchas like this.
There seems to be a small misunderstanding on the behavior of pipes here. All the commands in a bash pipeline do start at the same time, but output goes into the pipeline buffer whenever the writing process writes it. There is no specific point where the "output from jobA is ready".
The author's example code ("jobA starts, sleeps for three seconds, prints to stdout, sleeps for two more seconds, then exits"; "jobB starts, waits for input on stdin, then prints everything it can read from stdin until stdin closes") measures 5 seconds not because the input to jobB isn't ready until jobA terminates, but because jobB is waiting for the pipe to close, which doesn't happen until jobA ends. That explains the timing of the output:
$ ./jobA | ./jobB
09:11:53.326 jobA is starting
09:11:53.326 jobB is starting
09:11:53.328 jobB is waiting on input
09:11:56.330 jobB read 'result of jobA is...' from input
09:11:58.331 jobA is terminating
09:11:58.331 jobB read '42' from input
09:11:58.333 jobB is done reading input
09:11:58.335 jobB is terminating
The bottom line is that it's important to actually measure what you want to measure.
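For concreteness, jobB's read loop is presumably something along these lines (a sketch, not the author's actual code): it blocks in read(2) until the call returns 0, and read only returns 0 once every write end of the pipe has been closed, i.e. when jobA exits.

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n;

        fprintf(stderr, "jobB is waiting on input\n");
        while ((n = read(STDIN_FILENO, buf, sizeof buf)) > 0) {
            /* Each chunk arrives as soon as the writer produces it... */
            fwrite(buf, 1, (size_t)n, stdout);
        }
        /* ...but this line is reached only after jobA exits and the
         * kernel delivers EOF (read returns 0). */
        fprintf(stderr, "jobB is done reading input\n");
        return 0;
    }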
>All the commands in a bash pipeline do start at the same time, but output goes into the pipeline buffer whenever the writing process writes it. There is no specific point where the "output from jobA is ready".
Right, I didn't mean to give the impression that there's a time at which all input from jobA is ready at once. But there is a time when jobB can start reading stdin, and there's a time when jobA closes the handle to its stdout.
The reason I split jobA's output into two commands is to show that jobB starts reading 3 seconds after the command begins, and jobB finishes reading 2 seconds after reading the first output from jobA.
This post is another example of why I like Zig so much. It seems to get people talking about performance in a way which helps them learn how things work below today's heavily abstracted veneer.
If you want to create something like the pipe behaviour the author expected (buffer all output before sending to the next command), the sponge command from moreutils can help.
My first guess involved caching but I was thinking about whether the binary itself had to be read from disk or was already cached in RAM. Great linux-fu post.
If I were trying to optimize my code, I would start with loading the entire benchmark bytecode to memory, then start the counter. Otherwise I can't be sure how much time is spent reading from a pipe/file to memory, and how much time is spent in my code.
Then I would try to benchmark what happens if it all fits in L1 cache, L2, L3, and main memory.
Of course, if the common use case is reading from a file, network, or pipe, maybe you can optimize that, but I would take it step by step.
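A sketch of that first step in C (the article's program is in Zig, but the shape is the same; count_bytes here is just a stand-in for the real work, and error handling is omitted): slurp stdin completely, then start the clock around only the part you care about.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Stand-in for the real work being benchmarked. */
    static size_t count_bytes(const unsigned char *data, size_t len)
    {
        size_t count = 0;
        for (size_t i = 0; i < len; i++)
            if (data[i] != 0)
                count++;
        return count;
    }

    int main(void)
    {
        /* 1. Load everything before the timer starts, so pipe/file
         *    latency never lands inside the timed region. */
        size_t cap = 1 << 20, len = 0, n;
        unsigned char *data = malloc(cap);
        while ((n = fread(data + len, 1, cap - len, stdin)) > 0) {
            len += n;
            if (len == cap)
                data = realloc(data, cap *= 2);
        }

        /* 2. Only now measure the part we care about. */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t result = count_bytes(data, len);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("result=%zu in %.1f us\n", result, us);
        free(data);
        return 0;
    }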
And that the hardware is not overbooked. I found that my ci/cd runs would vary between 8 and 14 minutes (for a specific task in the pipeline, no cache involved) between reruns.
And it seemed correlated to time of day. So pretty sure they had some contention there.
Edit: and that was with the same number of CPUs reported to the OS, at least.
The TL;DR is that the build step masks the wait for input from a shell pipe. With a side dish of "do buffered input" and then a small "avoid memory allocation for fun."
You can easily hit a similar problem in other languages too. For example, in Rust, std::fs::File isn't buffered, so reading single bytes from it will also be rather slow.