I wrote a little copy program at my last job to copy files in a reasonable time frame on 5PB to 55PB filesystems.

https://github.com/hpc/dcp

We got an IEEE paper out of it:

http://conferences.computer.org/sc/2012/papers/1000a015.pdf

A few people are extending the concept to other tools -- those should be available at http://fileutils.io/ relatively soon.

We also had another tool written on top of https://github.com/hpc/libcircle that would gather metadata on a few hundred million files in a few hours (we had to limit the speed so it wouldn't take down the filesystem). For a slimmed-down version of that tool, take a look at https://github.com/hpc/libdftw
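
Roughly, a tool built on libcircle is just two callbacks over a distributed work queue. Here's a stripped-down sketch of a parallel tree walk, going from the README-style API (CIRCLE_init, CIRCLE_cb_create, CIRCLE_cb_process, handle->enqueue/dequeue) -- it's not the actual dcp/dwalk code, and ROOT_DIR is a placeholder:

    /* Sketch of a distributed tree walk on top of libcircle.
     * Follows the example API from the libcircle README; error
     * handling is minimal and ROOT_DIR is a made-up path. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <libcircle.h>

    #define ROOT_DIR "/scratch/data"   /* hypothetical starting point */

    /* Runs on rank 0: seed the distributed queue with the root directory. */
    static void create_work(CIRCLE_handle *handle)
    {
        char root[CIRCLE_MAX_STRING_LEN];
        snprintf(root, sizeof(root), "%s", ROOT_DIR);
        handle->enqueue(root);
    }

    /* Runs on every rank: pop a path, stat it, enqueue children if it
     * is a directory. libcircle does the work stealing between ranks. */
    static void process_work(CIRCLE_handle *handle)
    {
        char path[CIRCLE_MAX_STRING_LEN];
        handle->dequeue(path);

        struct stat st;
        if (lstat(path, &st) != 0)
            return;

        if (S_ISDIR(st.st_mode)) {
            DIR *d = opendir(path);
            if (d == NULL)
                return;
            struct dirent *de;
            while ((de = readdir(d)) != NULL) {
                if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
                    continue;
                char child[CIRCLE_MAX_STRING_LEN];
                snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
                handle->enqueue(child);
            }
            closedir(d);
        }
        /* else: a real tool would copy the file or record its metadata here */
    }

    int main(int argc, char **argv)
    {
        CIRCLE_init(argc, argv, CIRCLE_DEFAULT_FLAGS);
        CIRCLE_cb_create(&create_work);
        CIRCLE_cb_process(&process_work);
        CIRCLE_begin();     /* runs until the global queue drains */
        CIRCLE_finalize();
        return 0;
    }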



And it's interesting and useful for scientific computing where you already have an MPI environment and distributed/parallel filesystems. However, it's not really applicable to this workload, as the paper itself says.

There is a provision in most file systems to use links (symlinks, hardlinks, etc.). Links can cause cycles in the file tree, which would result in a traversal algorithm going into an infinite loop. To prevent this from happening, we ignore links in the file tree during traversal. We note that the algorithms we propose in the paper will duplicate effort proportional to the number of hardlinks. However, in real world production systems, such as in LANL (and others), for simplicity, the parallel filesystems are generally not POSIX compliant, that is, they do not use hard links, inodes, and symlinks. So, our assumption holds.

The reason this cp took such a large amount of time was the need to preserve hardlinks and the resizing of the hashtable used to track the device and inode of the source and destination files.
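
Concretely, preserving hardlinks means remembering the (st_dev, st_ino) pair of every source file with st_nlink > 1 and, when the same pair shows up again, creating a link at the destination instead of copying the data again. A simplified sketch of that bookkeeping (not GNU cp's actual code -- cp uses a growable hash table, not the fixed-size linear table here):

    /* Illustrative sketch: remember (device, inode) pairs of multiply-
     * linked source files so later occurrences become link() calls. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define MAX_TRACKED 1024            /* arbitrary limit for the sketch */

    struct seen { dev_t dev; ino_t ino; char dst[4096]; };
    static struct seen table[MAX_TRACKED];
    static size_t nseen;

    /* Return the destination path already recorded for this (dev, ino),
     * or NULL if we have not seen it yet (in which case record it). */
    static const char *remember(dev_t dev, ino_t ino, const char *dst)
    {
        for (size_t i = 0; i < nseen; i++)
            if (table[i].dev == dev && table[i].ino == ino)
                return table[i].dst;
        if (nseen < MAX_TRACKED) {
            table[nseen].dev = dev;
            table[nseen].ino = ino;
            snprintf(table[nseen].dst, sizeof(table[nseen].dst), "%s", dst);
            nseen++;
        }
        return NULL;
    }

    /* copy_one(): hypothetical per-file step of a cp-like tool. */
    static int copy_one(const char *src, const char *dst)
    {
        struct stat st;
        if (lstat(src, &st) != 0)
            return -1;

        if (S_ISREG(st.st_mode) && st.st_nlink > 1) {
            const char *prev = remember(st.st_dev, st.st_ino, dst);
            if (prev != NULL)
                return link(prev, dst);  /* recreate the hardlink, skip the data */
        }
        /* ... otherwise copy the file contents as usual ... */
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3)
            return 1;
        return copy_one(argv[1], argv[2]) == 0 ? 0 : 1;
    }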


Sure, but if you read that article you walk away with the sense that that's a lot of files to copy. And the GP built a tool for jobs 2-3 orders of magnitude larger?! Clearly there are tradeoffs forced on you at that size...


Author of the paper here. The file operations are distributed strictly without links; otherwise we could make no guarantees that work wouldn't be duplicated, or even that the algorithm would terminate. We were lucky in that the parallel file system itself wasn't POSIX, so we didn't have to make our tools POSIX either.


Man! And here I am feeling like a champ for conquering NTFS's long file name limitations :/


Is your conquering public? We have issues occasionally, and a toolkit would be nice :)


A couple of nitpicks (quick illustration of both below):

Check the return value of malloc.

You don't need a \ when breaking function parameters across lines.
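
Something like this, for both points (a generic illustration, not taken from the dcp source):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A function call or declaration can span lines freely in C;
     * trailing backslashes are only needed inside #define macros. */
    static char *duplicate_message(const char *text,
                                   size_t extra_padding)
    {
        size_t len = strlen(text) + extra_padding + 1;

        char *buf = malloc(len);
        if (buf == NULL) {             /* always check malloc's return value */
            perror("malloc");
            return NULL;
        }

        snprintf(buf, len, "%s", text);
        return buf;
    }

    int main(void)
    {
        char *msg = duplicate_message("hello", 16);
        if (msg == NULL)
            return 1;
        puts(msg);
        free(msg);
        return 0;
    }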


NB, that PDF seems to have a number of formatting glitches, e.g., the first sentence: "The amount of scienti c data". Numerous others as well, under both xpdf and evince.


There is an fi ligature between the i and the c; perhaps those PDF renderers don't support it? Could be some sort of font loading issue.


Perhaps. That's only one of many similar issues.



