There really is no benefit to splitting functionality from its test. Then you just have a commit in the history which is not covered by tests.
Splitting "handling the happy vs error path" sounds even worse. Now I first have to review something that's obviously wrong (lack of error handling). That would commit code that is just wrong.
What is next, separating the idea from making it typecheck?
One should split commits into the minimum size that makes sense, not smaller.
"Makes sense" should be "passes tests, is useful for git bisect" etc, not "has less lines than arbitrary number I personally like to review" - use a proper review tool to help with long reviews.
Depends entirely on your workflow - we squash PRs into a single commit, so breaking a PR into pieces is functionally identical to not doing so for the purposes of the commit history. It does, however, make it easier to follow from the reviewer's perspective.
Don't give me 2000 lines unless you've made an honest good-faith attempt to break it up, and if it really can't be broken up into smaller units that make sense, at least break it up into units that let me see the progression of your thought as you solve the problem.
Worth pointing out that with Nix/NixOS this problem doesn't exist.
The problem in other distros is that if you prefix PATH so that it contains your executable "foo", and then run a program that invokes "foo" from PATH and expects it to do something else, the program breaks.
With Nix, this problem does not exist because all installed programs invoke all other programs not via PATH but via full absolute paths starting with /nix/store/HASH...
NixOS simultaneously smooths the path to using absolute paths while putting some (admittedly minor) speed-bumps in the way when avoiding them. If you package something up that uses relative paths it will probably break for someone else relatively quickly.
What that means is that you end up with a system in which absolute paths are used almost everywhere.
This is why the killer feature of NixOS isn't that you can configure things from a central place; RedHat had a tool to do that at least 25 years ago; it's that since most of /etc/ is read-only, you must configure everything from a central place, which has two important effects:
1. The tool for configuring things in a central place can be much simplified since it doesn't have to worry about people changing things out from under it
2. Any time someone runs into something that is painful with the tool for configuring things in a central place, they have to improve the tool (or abandon NixOS).
If it's a one-off, you just use something like "nix shell" to add it to your PATH for running the script.
For non-one-off sorts of things, you would substitute in the Nix expression "${gnugrep}/bin/grep". The "${gnugrep}" will expand to "/nix/store/grep-hash" and also record a dependency on the gnugrep package, so that the grep install won't get garbage-collected as long as your package is still around.
Here's an example[1] from a package expression for the e-mail client I use, which shells out to base64 and file. Upstream relies on these two programs being in $PATH, but this replaces the string used for shelling out with the absolute path in the Nix store.
For shell scripts, I'll just do something like this near the top:
GREP="${GNU_GREP:-$(command -v grep)}"
Then I use "$GREP" in the script itself, and develop with grep in my path, but it's trivial to prepend all of my dependencies when I bundle it up for nix.
[user@nixos:~]$ which grep
/run/current-system/sw/bin/grep
[user@nixos:~]$ ls -l /run/current-system/sw/bin/grep
lrwxrwxrwx 1 root root 65 Jan 1 1970 /run/current-system/sw/bin/grep -> /nix/store/737jwbhw8ji13x9s88z3wpp8pxaqla92-gnugrep-3.12/bin/grep
Basically, it is still in your environment, so I don't see how he can claim that this problem doesn't exist in Nix, unless you use flakes like a proper Nix aficionado.
Yes, the original comment that this problem doesn't exist in Nix is wrong for a typical user environment.
It does contain the issue a bit though:
I'm running isync in a systemd service, yet the program "mbsync" is not in my path. I have several services installed, yet their programs aren't in my path. My e-mail client shells out to "file" for mime-type verification, yet "file" is not in my path.
Run "compgen -c |wc -l" to get a list of commands; its over 7000 on my Ubuntu system and right around 2000 on my NixOS system.
As an aside, the packages that put the most executables in my path are probably going to be in the path for most NixOS installs (231 just for coreutils+util-linux):
True enough, but in my experience it's not really much of a problem because if I'm not doing Nix, then I'm doing containers which are widely available.
What can be a problem is muscle memory, when you expect it to autocomplete one way and it doesn't because something you want now shares first two or three letters with something else in your path. That's where FIGNORE comes in.
The goal of such sandboxing is that you can allow the agent to freely write/execute/test code during development, so that it can propose a solution/commit without the human having to approve every dangerous step ("write a Python file, then execute it" is already a dangerous step). As the post says: "To safely run a coding agent without review".
You would then review the code, and use it if it's good. Turning many small reviews where you need to be around and babysit every step into a single review at the end.
What you seem to be asking for (shipping the generated code to production without review) is a completely different goal and probably a bad idea.
If there really were a tool that can "scan the generated code" so reliably that it is safe to ship without human review, then that could just be part of the tool that generates the code in the first place so that no code scanning would be necessary. Sandboxing wouldn't be necessary either then. So then sandboxing wouldn't be "half the picture"; it would be unnecessary entirely, and your statement simplifies to "if we could auto-generate perfect code, we wouldn't need any of this".
Yeah I think we're actually agreeing more than it seems. I'm not arguing for shipping without review - more that the review itself is where things fall through.
In practice, that "single review at the end" is often a 500-line diff that someone skims at 5pm. The sandbox did its job, the code runs, tests pass. But the reviewer misses that the auth middleware doesn't actually check token expiry, or that there's a path traversal buried in a file upload handler. Not because they're bad at reviewing - because AI-generated code has different failure modes than human-written code and we're not trained to spot them yet.
Scanning tools don't replace review, they're more like a checklist that runs before the human even looks at it. Catches the stuff humans consistently miss so the reviewer can focus on logic and architecture instead of hunting for missing input validation.
If that's the goal, why not just have Claude Code do it all from your phone at that point? Test it when it's done by pulling down the branch locally. Not 100% frictionless, but if it messes up an OS it would be Anthropic's, not yours.
idk, having your own sandbox matters if you're doing anything sensitive or want offline capability. Claude Code on phone is fine for simple stuff but you're still sending everything through Anthropic's infra - latency, costs, and you're trusting them with whatever code/data you're working on. plus what happens when they change pricing or terms? I've been burned by that kind of dependency before. self-hosted gives you control, even if it's more work to set up.
The 2MiB are per SSH "channel" -- the SSH protocol multiplexes multiple independent transmission channels over TCP [1], and each one has its own window size.
rsync and `cat | ssh | cat` only use a single channel, so if their counterparty is an OpenSSH sshd server, their throughput is limited by the 2MiB window limit.
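The practical impact of that fixed window is easy to estimate: a fixed-window protocol's throughput is capped at window size divided by round-trip time. A quick sketch (the RTT values are illustrative, not measured):

```python
# Throughput cap of a fixed-window protocol: window_bytes / rtt_seconds.
def max_throughput_gbit(window_bytes: float, rtt_seconds: float) -> float:
    return window_bytes / rtt_seconds * 8 / 1e9  # -> Gbit/s

WINDOW = 2 * 1024 * 1024  # OpenSSH's 2 MiB per-channel window

lan = max_throughput_gbit(WINDOW, 0.0002)  # 0.2 ms RTT: cap is comfortably high
wan = max_throughput_gbit(WINDOW, 0.050)   # 50 ms RTT: cap collapses

print(f"LAN cap: {lan:.1f} Gbit/s, WAN cap: {wan:.3f} Gbit/s")
```

So on a LAN the window barely matters, but over a cross-continent link a single channel is capped around a third of a gigabit regardless of the pipe.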
rclone seems to be able to use multiple ssh channels over a single connection; I believe this is what the `--sftp-concurrency` setting controls.
Some more discussion about the 2MiB limit and links to work for upstreaming a removal of these limits can be found in my post [3].
Looking into it just now, I found that the SSH protocol itself already supports dynamically growing per-channel window sizes with `CHANNEL_WINDOW_ADJUST`, and OpenSSH seems to generally implement that. I don't fully grasp why it doesn't just use that to extend as needed.
I also found that there's an official `no-flow-control` extension with the description
> channel behaves as if all window sizes are infinite.
>
> This extension is intended for, but not limited to, use by file transfer applications that are only going to use one channel and for which the flow control provided by SSH is an impediment, rather than a feature.
So this looks exactly as designed for rsync. But no software implements this extension!
I wrote those things down in [4].
It is frustrating to me that we're only a ~200 line patch away from "unlimited" instead of shitty SSH transfer speeds -- for >20 years!
I get 40 Gbit/s over a single localhost TCP stream on my 10-year-old laptop with iperf3.
So TCP does not seem to be a bottleneck if 40 Gbit/s is "high" enough, which it probably is currently for most people.
I have also seen plenty of situations in which TCP is faster than UDP in datacenters.
For example, on Hetzner Cloud VMs, iperf3 gets me 7 Gbit/s over TCP but only 1.5 Gbit/s over UDP. On Hetzner dedicated servers with 10 Gbit links, I get 10 Gbit/s over TCP but only 4.5 Gbit/s over UDP. But this could also be due to my use of iperf3 or its implementation.
I also suspect that TCP being a protocol whose state is inspectable by the network equipment between endpoints allows implementing higher performance, but I have not validated if that is done.
Aspera was/is designed for high-latency links, i.e. sending multiple terabytes from London to New Zealand, or LA.
For that use case, Aspera was the best tool for the job. It's designed to be fast over links that single TCP streams couldn't saturate.
You could, if you were so bold, stack up multiple TCP links and send data down those. You got the same speed, but possibly not the same efficiency. It was a fucktonne cheaper to do, though.
> I get 40 Gbit/s over a single localhost TCP stream on my 10-year-old laptop with iperf3.
Do you mean literally just streaming data from one process to another on the same machine, without that data ever actually transiting a real network link? There are so many caveats to that test that it's basically worthless for evaluating what could happen on a real network.
To measure the claimed overhead (TCP the protocol being slow), one should exclude factors that necessarily affect alternative protocols as well (e.g. latency) as much as possible, which is what this test does.
It sounds like your reasoning starts from an assumption that any claimed slowness of TCP would be something like a fixed per-packet overhead or delay that could be isolated and added back into the result of your local testing to get a useful prediction. And it sounds like you think alternative protocols must be equally affected by latency.
But it's much more complicated than that; TCP interacts with latency and congestion and packet loss as both cause and effect. If you're testing TCP without sending traffic over real networks that have their own buffering and congestion control and packet reordering and loss, you're going to miss all of the most important dynamics affecting real-world performance. For example, you're not going to measure how multiplexing multiple data streams onto one TCP connection allows head of line blocking to drastically inflate the impact of a lost or reordered packet, because none of that happens when all you're testing is the speed at which your kernel can context-switch packets between local processes.
And all of that is without even beginning to touch on what happens to wireless networks.
Somebody made a claim that TCP isn't high performance without specifying what that means; I gave a counterexample of just how high performance TCP is, picking some arbitrary notion of "high performance".
Almost like it makes the point that arguing about "high performance" is useless without saying what that means.
That said:
> you're not going to measure how multiplexing multiple data streams onto one TCP connection
Of course not: When I want to argue against "TCP is not a high performance protocol", why would I want to measure some other protocol that multiplexes connections over TCP? That is not measuring the performance of TCP.
I could conjure any protocol that requires acknowledgement from the other side for each emitted packet before sending the next, and then claim "UDP is not high performance" when running that over UDP - that doesn't make sense.
No, the example exercises the full TCP protocol including all its logic.
There have been plenty of examples of protocols where that is far from infinitely fast and does not scale even on localhost, e.g. with OpenVPN or any protocol that requires full acknowledgements from the other side before sending more.
High performance means transferring files from NZ to a director's yacht in the Mediterranean with a 40Mbps satellite link and getting 40Mbps, to the point that the link is unusable for anyone else.
UDP by itself cannot be used to transfer files or any other kind of data with a size bigger than an IP packet.
So it is impossible to compare the performance of TCP and UDP.
UDP is used to implement various other protocols, whose performance can be compared with TCP. Any protocol implemented over UDP must perform better than TCP, at least in some specific scenarios; otherwise there would be no reason for it to exist.
I do not know how UDP is used by iperf3, but perhaps it uses some protocol akin to TFTP, i.e. it sends a new UDP packet when the other side acknowledges the previous UDP packet. In that case the speed of iperf3 over UDP will always be inferior to that of TCP.
Sending UDP packets without acknowledgment will always be faster than any usable transfer protocol, but the speed in this case does not provide any information about the network, only about the speed of executing a loop in the sending computer and network-interface card.
You can transfer data without using any transfer protocol, by just sending UDP packets at maximum rate, if you accept that a fraction of the data will be lost. The fraction that is lost can be minimized, but not eliminated, by using an error-correcting code.
> perhaps it [..] sends a new UDP packet when the other side acknowledges the previous UDP packet. In that case the speed of iperf3 over UDP will always be inferior to that of TCP
It does not, otherwise measuring 4.5 Gbit/s would be impossible by a factor of ~100x as per the bandwidth-delay calculation (the ping is around the usual 0.2 ms).
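For reference, the stop-and-wait bound works out as follows (the packet size is an assumed typical Ethernet MTU payload; RTT as above):

```python
# Stop-and-wait throughput: at most one packet in flight per round trip.
packet_bytes = 1500        # assumed typical MTU-sized payload
rtt = 0.0002               # 0.2 ms LAN round trip

stop_and_wait_gbit = packet_bytes / rtt * 8 / 1e9
measured_gbit = 4.5        # the iperf3 UDP result quoted above

print(f"stop-and-wait cap: {stop_and_wait_gbit:.2f} Gbit/s")
print(f"measured is ~{measured_gbit / stop_and_wait_gbit:.0f}x higher")
```

That gives roughly 0.06 Gbit/s, i.e. the measured rate is ~75x above what per-packet acknowledgment would allow.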
With iperf3, as with many other UDP measurement tools, you set a sending rate and the other side reports how many bytes arrived.
It is a long time since I have last used iperf3, but now that you have mentioned it I have also remembered this.
So the previous poster misinterpreted the iperf3 results by believing that UDP was slower. iperf3 cannot demonstrate a speed difference between TCP and UDP: for the former the speed is determined by the network, while for the latter it is determined by the "--bandwidth" iperf3 command-line option. The poster has probably just seen some default UDP speed.
The simple model for scp and rsync (it's likely more complex in rsync):
for loop over all files. for each file, determine its metadata with fstat, then fopen and copy bytes in chunks until done. Proceed to next iteration.
I don't know what rsync does on top of that (pipelining could mean many different things), but my empirical experience is that copying one 1 TB file is far faster than copying a billion 1 kB files (both sum to ~1 TB), and that load balancing/partitioning/parallelizing the tool when copying large numbers of small files leads to significant speedups, likely because the per-file overhead is hidden by the parallelism (in addition to dealing with individual copies stalling due to TCP or whatever else).
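That experience matches a toy serial-copy model (the per-file overhead and bandwidth figures here are made up purely for illustration):

```python
def copy_time_s(n_files: int, file_bytes: float,
                per_file_overhead_s: float = 0.005,
                bandwidth_bytes_s: float = 500e6) -> float:
    """Naive serial copy: fixed per-file setup cost plus raw transfer time."""
    return n_files * per_file_overhead_s + n_files * file_bytes / bandwidth_bytes_s

one_big = copy_time_s(1, 1e12)            # one 1 TB file
many_tiny = copy_time_s(10**9, 1e3)       # a billion 1 kB files, same total bytes

print(f"1 x 1 TB:    {one_big / 3600:.2f} h")
print(f"1e9 x 1 kB:  {many_tiny / 3600:.1f} h")
```

With these numbers the raw transfer time is identical, but the per-file cost dominates the small-file case by orders of magnitude, which is exactly the overhead parallelism hides.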
I guess the question is whether rsync is using multiple threads or otherwise accessing the filesystem in parallel, which I do not think it does, while tools like rclone, kopia, and aws sync all take advantage of parallelism (multiple ongoing file lookups and copies).
> I don't know what rsync does on top of that (pipelining could mean many different things), but my empirical experience is that copying one 1 TB file is far faster than copying a billion 1 kB files (both sum to ~1 TB), and that load balancing/partitioning/parallelizing the tool when copying large numbers of small files leads to significant speedups, likely because the per-file overhead is hidden by the parallelism (in addition to dealing with individual copies stalling due to TCP or whatever else).
That's because of fast paths:
- For a large file, assuming the disk isn't fragmented to hell and beyond, there isn't much to do for rsync / the kernel: the source reads data and copies it to the network socket, the receiver copies data from the incoming network socket to the disk, the kernel just dumps it in sequence directly to the disk, that's it.
- The slightly less performant path is on a fragmented disk. Source and network still don't have much to do, but the kernel has a bit more work every now and then to find a contiguous block on the disk to write the data to. For spinning rust HDDs, the disk also has to do some seeking.
- Many small files? Now that's more nasty. First, the source side has to do a lot of stat(2) calls to get basic attributes of the file. For HDDs, that seeking can incur a sometimes significant latency penalty as well. Then, this information needs to be transferred to the destination, the destination has to do the same stat call again, and then the source needs to transfer the data, involving more seeking, and the destination has to write it.
- The utter worst case is when the files are plenty and small, but large enough to not fit into an inode as inline data [1]. That means two writes and thus seeks per small file. Utterly disastrous for performance.
And that's before stepping into stuff such as systems disabling write caches, soft-RAID (or the impact of RAID in general), journaling filesystems, filesystems with additional metadata...
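As a rough illustration of that worst case (the 8 ms average seek time is an assumed HDD figure, not measured):

```python
seek_s = 0.008          # assumed average HDD seek time
n_files = 1_000_000     # a million small files
seeks_per_file = 2      # metadata/inode write plus separate data-block write

seek_overhead_hours = n_files * seeks_per_file * seek_s / 3600
print(f"~{seek_overhead_hours:.1f} hours spent on seeks alone")
```

Around four and a half hours of pure seeking before a single byte of useful bandwidth is counted, which is why the two-writes-per-file case is so disastrous on spinning rust.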
> I guess the question is whether rsync is using multiple threads or otherwise accessing the filesystem in parallel
No, that is not the question. Even Wikipedia explains that rsync is single-threaded. And even if it was multithreaded "or otherwise" used concurrent file IO:
The question is whether rsync _transmission_ is pipelined or not, meaning: Does it wait for 1 file to be transferred and acknowledged before sending the data of the next?
Somebody has to go check that.
If yes: Then parallel filesystem access won't matter, because a network roundtrip has brutally higher latency than reading data sequentially off an SSD.
Note that rsync on many small files is slow even within the same machine (across two physical devices), suggesting that the network roundtrip latency is not the major contributor.
The filesystem access and general threading is the question because transmission is pipelined and not a thing "somebody has to go check". You just quoted the documentation for it.
The dead time isn't waiting for network trips between files, it's parts of the program that sometimes can't keep up with the network.
I quoted the documentation that claims _something_ is pipelined.
That is extremely vague on what that is and I also didn't check that it's true.
Both the original claim "the issue is the serialization of operations" and the counter-claim sound like extreme guesswork to me. If you know for certain, please link the relevant code.
Otherwise somebody needs to go check what it actually does; everything else is just speculating "oh surely it's the files" and then people remember stuff that might just be plain wrong.
The question was what exactly rsync pipelines, and whether it serialises its network sends. If true, that would be a plausible cause of parallelism speeding it up.
Serial local reads are not a plausible cause, because the author describes working on NVMe SSDs, which have such low latency that they cannot explain reading 59 GB across 3000 files taking 8 minutes.
However:
You might actually be half-right because in the main output shown in the blog post, the author is NOT using local SSDs. The invocation is `rsync ... /Volumes/mercury/* /Volumes/...` where `mercury` is a network share mount (and it is unspecified what kind of share that is). So in that case, every read that looks "local" to rsync is actually a network access. It is totally possible that rsync treats local reads as fast and thus they are not pipelined.
In fact, it is even highly likely that rsync will not / cannot pipeline reading files that appear local to it, because normal POSIX file IO does not really offer any way to do non-blocking reads of regular files, so the only way to do that is with threads, which rsync doesn't use.
So while "the dead time isn't waiting for network trips between files" would be wrong -- it absolutely would wait for network trips between files -- your "filesystem access and general threading is the question" would be spot-on.
So in that case rclone is just faster because it reads from his network mount in parallel. This would also explain why he reports `tar` as not being faster, because that, too, reads files serially from the network mount. Presumably this situation could be avoided by running rsync "normally" via SSH, so that file reads are actually fast on the remote side.
The situation is extra confused by the author writing below his run output:
even experimenting with running the rsync daemon instead of SSH
when in fact the output above didn't rsync over SSH at all.
Another weird thing I spotted is that the rsync output shown in the post
Unmatched data: 62947785101 B
seems impossible: The string "Unmatched data" doesn't seem to exist in the rsync source code, and hasn't since 1996. So it is unclear to me what version of rsync was used.
> Serial local reads are not a plausible cause, because the author describes working on NVMe SSDs, which have such low latency that they cannot explain reading 59 GB across 3000 files taking 8 minutes.
But the people you responded to were talking about slowdowns that exist in general, not just ones that apply directly to the post.
For the post, my personal guess is that per-file overhead isn't a huge factor here, and it's mostly rsync having trouble doing >1Gbps over the network.
> In fact, it is even highly likely that rsync will not / cannot pipeline reading files that appear local to it, because normal POSIX file IO does not really offer any way to do non-blocking reads of regular files, so the only way to do that is with threads, which rsync doesn't use.
Makes sense.
> it absolutely would wait for network trips between files
I don't see why you're saying this. I expect it to serially read files and then put that data into a buffer that can have data from multiple files at the same time. In other words, pipelined networking. As long as the transfer queue doesn't bottom out it shouldn't have to wait for any network round trips. What leads you to think otherwise?
> But the people you responded to were talking about slowdowns that exist in general, not just ones that apply directly to the post.
I think that's incorrect though. These slowdowns do not exist in general (see my next reply where I run rsync and it immediately maxes out my 10 Gbit/s).
I think original poster digiown is right with "Note there is no intrinsic reason running multiple streams should be faster than one [EDIT: 'at this scale']. It almost always indicates some bottleneck in the application". In this case it's the user running rsync as a serially-reading program reading from a network mount.
> rsync having trouble doing >1Gbps over the network
rsync copies at 10 Gbit/s without problem between my machines.
Though I have to give `-e 'ssh -c aes256-gcm@openssh.com'` or aes128-gcm, otherwise encryption bottlenecks at 5 Gbit/s with the default `chacha20-poly1305@openssh.com`.
> I don't see why you're saying this.
Because of the part you agreed making sense: It read each file with the sequence `open()/read()/.../read()/close()`, but those files are on the network mount ("/Volumes/mercury"), so each `read()` of size `#define IO_BUFFER_SIZE (32*1024)` is a network roundtrip.
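Back of the envelope, assuming each 32 KiB read() of the network mount costs one ~0.2 ms round trip (both figures are guesses): that alone is in the right ballpark for the observed runtime.

```python
total_bytes = 59e9        # ~59 GB transferred in the post
read_size = 32 * 1024     # rsync's IO_BUFFER_SIZE
rtt = 0.0002              # assumed round trip to the network share

n_reads = total_bytes / read_size
roundtrip_overhead_min = n_reads * rtt / 60
print(f"{n_reads:.2e} reads -> ~{roundtrip_overhead_min:.0f} min of round trips")
```

Roughly 1.8 million serial reads, or about six minutes of waiting on round trips, against the ~8 minutes reported.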
I see, so you're saying the file end of rsync is forced to wait for the network because the filesystem itself waits, not the network end of rsync. That makes sense.
Though I wonder what the actual delay is. The numbers in the post implied several milliseconds, enough to maybe account for 30 seconds of the 8 minutes. But maybe changing files resets the transfer speed a bunch.
To be clear, when I said "delay" in my last post I meant the per-file penalty, not the round-trip latency.
Radxa Orion O6/.DS_Store, 6KB at 4.8MBps, that means it took 1.3ms to transfer
Radxa Orion O6/Micro Center Visit Details.pages, 141KB at 9.8MBps, 14ms
Radxa Orion O6/Radxa Orion O6.md, 19KB at 1.9MBps, 10ms
We know the link can do well over 100MBps, so that time is almost all overhead. But it doesn't seem to be a simple delay. Perhaps a fresh TCP window scaling up on a per-file basis? That would be an unfortunate filesystem design.
Since the two 4MB files both get up to ~100MBps, the same speed as the 250MB file, it seems like the maximum impact from switching files isn't much more than 15ms. If the average is below 10ms then we're looking at half a minute wasted over 3564 files. If the average is 20ms then switching files is responsible for 71 seconds wasted.
By that estimate the file-level serialization is a real issue, but the bigger issue is whatever's preventing that 250MB file from ramping up all the way to 10Gbps.
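For what it's worth, the arithmetic behind those estimates (per-file penalties assumed as above):

```python
n_files = 3564
for per_file_ms in (10, 20):    # assumed average per-file switching penalty
    wasted_s = n_files * per_file_ms / 1000
    print(f"{per_file_ms} ms/file -> {wasted_s:.0f} s wasted on file switching")
```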
I guess until we know what that network share is and how it works, we cannot really progress. Taking 10 ms to fetch the 19 KB file is definitely not great, it would be ~50 LAN roundtrips.
> whatever's preventing that 250MB file from ramping up all the way to 10Gbps
Probably that network share does small reads if small reads are requested.
Might be good if the author ran `dd` or a similar tool where the read buffer size can be controlled.
I'm not sure why, but just like with scp, I've achieved significant speedups by tarring the directory first (optionally compressing it), transferring, and then decompressing. Maybe because it lets the tar-and-send, and the receive-and-untar/uncompress, happen on different threads?
It's typically a disk-latency thing, as just stat-ing the many files in a directory can have significant latency implications (especially on spinning HDDs) vs opening a single file (the tar) and read()-ing that one file into memory before writing to the network.
If copying a folder with many files is slower than tarring that folder and then moving the tar (not counting the untar), then disk latency is your bottleneck.
Not useful very often, but fast and kind of cool: you can also just netcat the whole block device if you wanted a full filesystem copy anyway. Optionally zero all empty space first with a tool like zerofree, and use on-the-fly compression/decompression with lz4 or lzo. Of course, none of the block devices should be mounted, though you could probably get away with a source that's mounted read-only.
dd is not a magic tool that can deal with block devices while others can't. You can just cp myLinuxInstallDisk.iso to /dev/myUsbDrive, too.
Okay. In this case the whole operation is faster end to end. That includes the time it takes to tar and untar. Maybe those programs do something more efficient in disk access than scp and rsync?
I've never verified this, but it feels like scp starts a new TCP connection per file. If that's the case, then scp-ing a tarred directory would be faster because you only hit the slow start once. https://www.rfc-editor.org/rfc/rfc5681#section-3.1
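A rough slow-start model (assuming an initial window of 10 MSS-sized segments per RFC 6928 and doubling per RTT; real stacks differ) shows how many round trips a fresh connection burns before reaching line rate:

```python
mss = 1460                 # assumed bytes per segment
cwnd = 10 * mss            # assumed initial congestion window (10 segments)
rtt = 0.0002               # 0.2 ms LAN round trip
target_rate = 1.25e9       # 10 Gbit/s in bytes per second

rtts = 0
while cwnd / rtt < target_rate:
    cwnd *= 2              # slow start: window doubles every RTT
    rtts += 1
print(f"~{rtts} RTTs ({rtts * rtt * 1000:.0f} ms) to reach line rate")
```

On a low-latency LAN that ramp is only a few RTTs, so per-connection slow start would mostly hurt on high-latency links, or when the per-file cost is multiplied across thousands of files.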
Good point. Seems like I enabled it in ~/.ssh/config ages ago and forgot about it. Nonetheless, it's good to check whether it's enabled when using rsync to transfer large, already-well-compressed files.
In my experience, "purely functional" always means "you can express pure functions on the type level" (thus guaranteeing that it is referentially transparent and has no side effects) -- see https://en.wikipedia.org/wiki/Pure_function
Splitting "handling the happy vs error path" sounds even worse. Now I first have to review something that's obviously wrong (lack of error handling). That would commit code that is just wrong.
What is next, separating the idea from making it typecheck?
One should split commits into the minimum size that makes sense, not smaller.
"Makes sense" should be "passes tests, is useful for git bisect" etc, not "has less lines than arbitrary number I personally like to review" - use a proper review tool to help with long reviews.
reply