How useful should copy_file_range be? (lwn.net)
48 points by chmaynard on Feb 19, 2021 | 23 comments


As a Go user and former contributor, it makes me pleased that the rigor of the Go team occasionally gives the Linux kernel developers heartburn. As long as everyone stays professional, the end result is better for both groups.

Linux gets feature velocity by playing fast and loose sometimes with stability. Demanding users like the Go authors are a necessary and welcome counterbalance.


It seems kind of strange that this is a kernel function at all. It seems like something that should live in libc or the like. Is there a performance benefit from having it up in kernel space? It seems a bit outside of the scope of the kernel IMHO.

I can understand functions like sendfile() being able to cut down on context switches and being helpful for bulk data transfer, but is that the case here? How much benefit do you get from copy_file_range() vs. a read/write loop?

I do note with some amusement how the kernel developer basically went "Why are you using copy_file_range on things that aren't actually files?"


Given all of the effort being put into zero-copy read/write in the kernel, I would assume there are significant performance gains available.

I suspect that some correctly aligned ranges could be copied with CoW semantics, skipping the read/write altogether.


On NFS you can avoid network traffic completely.


Is this theoretical or is there support for it already in the kernel and NFS daemons?

Reading the manpage for copy_file_range the notes section states:

       copy_file_range() gives filesystems an opportunity to  implement  "copy
       acceleration"  techniques,  such  as  the use of reflinks (i.e., two or
       more inodes that share pointers to the same copy-on-write disk  blocks)
       or server-side-copy (in the case of NFS).
But it doesn't mention whether these techniques are actually used. I guess it at least helps future-proof your code.

Edit: I tested this on my Ubuntu 20.04 machine with a 1GB file full of random data sitting in the page cache. Using copy_file_range I could make a local-local copy in 0.595 seconds on average; a primitive read/write loop took 0.600 seconds. But these values are somewhat noisy and the difference is within the error margin. It doesn't appear that my 5.4.0 kernel on ext4 employs the reflink optimization.


It's totally practical and already implemented by Linux on both the server and the client side.


I can't share details, but I've seen the copy_file_range optimization deliver roughly a 25% throughput increase for a prior employer. It's more than theoretical.


ext4 has no reflink support. btrfs does by default, xfs under some configurations. server-side offload is supported by nfs and cifs but also needs server-side support.


Even if it weren't smart, you'd still get a huge benefit from not context switching between kernel and user space the way a standard copy implementation does. Far fewer syscalls. There's a real performance benefit to be had there.


ext4 does not support sharing blocks across multiple files, indeed. Try btrfs or perhaps xfs.


>The copy_file_range() system call looks like a relatively straightforward feature; it allows user space to ask the kernel to copy a range of data from one file to another, hopefully applying some optimizations along the way.


That's exactly what I was asking. The Go developers are knocking themselves out chasing this syscall in the hopes that it might improve performance? Has it been benchmarked?


It sounds wrong to depend on the file length for correctness even on physical file systems. What if the file length shrinks during the copy? You need to just keep going until you can't anymore...


That's assuming you want to copy the whole thing. And even if you do, this is what you do with copy_file_range as well; in many cases you can do it with a single call instead of multiple read/write calls, while also being able to take advantage of performance optimizations (such as reflinking).


> That's assuming you want to copy the whole thing

Right, but if you don't, then you definitely don't need to query the file length in the first place...


It seems that copy_file_range is in fact quite similar to splice, which the article also mentions as a fallback. But what's the fundamental difference between them? Is it that copy_file_range only works on regular files on a non-virtual file system, while splice requires one end to be a pipe?


Something to change the start point of an existing file would be neat. Sort of like truncate(), but for the start.


When you think about how files are stored on the filesystem it becomes clear why this functionality doesn't exist. Each file is basically a list of blocks and some metadata like the total length of the file. What it doesn't have is a length for each block--they are assumed to be full length except for the last, which is stored as the total length of the file.

So if you wanted to add bytes to the front of the file you would have to allocate new blocks to store it, but since there is no map of length for each block you would have to only move it by exact block lengths. Same for shrinking the file by cutting off the head, you can't handle values other than full blocks.

It's certainly possible to build a filesystem where this would work, but when you wrote programs using the feature they wouldn't be portable to any other commonly used filesystem. People also don't change filesystems very often, so even if you got the change into Ext and waited a decade many people would still be incompatible.

Finally, it's a feature that is helpful only rarely. So there isn't enough demand to push through such a massive change given the headwinds it has.


I don't understand this comment. All modern filesystems are extent-based. If you have enough contiguous space, a hundred gigabyte file could easily be stored in a single extent. Block-based allocation hasn't been used since ext2/fat32. Could you elaborate?


That's true, though it seems like storing a block offset for just the first block of a file would satisfy a lot of use cases. Then a trunc() from the front would only rewrite one block and some pointers.


fallocate(..., FALLOC_FL_COLLAPSE_RANGE) will do that but it comes with limitations such as alignment requirements and limited filesystem support.

You could also try creating a new file and use copy_file_range to copy the tail of the file to the new one, then move it over the old one. That might reuse a good chunk of the storage on a CoW filesystem.


Why not optimistically copy the file until EOF and report number of bytes copied? Why is stat() consulted at all? That seems broken.


I used to ponder the idea that partial file copies could be done with file fragmentation.



