A weird CPIO discrepancy (colindou.ch)
46 points by vimda on Dec 31, 2023 | 20 comments


Probably the most authoritative current definition of the cpio file format is the description of the pax utility in the Single UNIX Specification, which states that "FIFO special files, directories, and the trailer shall be recorded with c_filesize equal to zero."

What to do with a CPIO archive that has a non-zero c_filesize for a directory is an interesting question. It seems to me equally likely that such a broken archive has a genuinely empty c_filedata or that it contains actual, mostly meaningless data there (most likely a series of struct dirents in some format, as returned by calling read(2) on a directory).
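
To make the ambiguity concrete, here is a minimal Python sketch that walks an archive and flags directories whose header claims a non-zero c_filesize. It assumes the SVR4 "newc" layout (magic 070701, 13 fields of 8 hex digits, 4-byte alignment), which is an assumption on my part since the pax text describes the older octal format. The helper names (iter_newc, newc_entry, suspicious_dirs) are made up for illustration.

```python
import stat

NEWC_MAGICS = (b"070701", b"070702")  # newc, and newc with checksum
HEADER_LEN = 110  # 6-byte magic + 13 fields of 8 hex digits each


def pad4(n):
    """Round n up to the next multiple of 4 (newc alignment)."""
    return (n + 3) & ~3


def iter_newc(data):
    """Yield (name, mode, filesize) for each entry until the trailer or EOF."""
    pos = 0
    while pos + HEADER_LEN <= len(data):
        header = data[pos:pos + HEADER_LEN]
        if header[:6] not in NEWC_MAGICS:
            raise ValueError(f"bad magic at offset {pos}")
        fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        mode, filesize, namesize = fields[1], fields[6], fields[11]
        name_start = pos + HEADER_LEN
        # namesize counts the terminating NUL, so drop the last byte.
        name = data[name_start:name_start + namesize - 1].decode()
        if name == "TRAILER!!!":
            return
        yield name, mode, filesize
        data_start = pad4(name_start + namesize)
        # A parser that trusts c_filesize skips this many bytes, even for a
        # directory; that is exactly where the two interpretations diverge.
        pos = pad4(data_start + filesize)


def newc_entry(name, mode, payload=b"", filesize=None):
    """Build one newc entry; filesize can be overridden to fake a bad header."""
    raw_name = name.encode() + b"\0"
    size = len(payload) if filesize is None else filesize
    fields = [0, mode, 0, 0, 1, 0, size, 0, 0, 0, 0, len(raw_name), 0]
    entry = b"070701" + b"".join(b"%08X" % f for f in fields) + raw_name
    entry += b"\0" * (pad4(len(entry)) - len(entry))
    entry += payload
    entry += b"\0" * (pad4(len(entry)) - len(entry))
    return entry


def suspicious_dirs(data):
    """Names of directory entries recorded with a non-zero c_filesize."""
    return [name for name, mode, size in iter_newc(data)
            if stat.S_ISDIR(mode) and size != 0]


archive = (
    newc_entry("etc", 0o040755)
    + newc_entry("etc/hosts", 0o100644, b"127.0.0.1 localhost\n")
    + newc_entry("bad", 0o040755, filesize=4096)  # header lies, no data follows
    + newc_entry("TRAILER!!!", 0)
)
print(suspicious_dirs(archive))
```

This prints ['bad']. Note that because the sketch trusts c_filesize, it skips past the trailer after the bad entry and simply runs out of input, which is one of the two failure modes described above.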


I wrote this. My 2024 side project is to build a Linux user space, but I definitely didn't expect to come across weird things so early


As someone who has been wanting to do the same, please blog about it! Good luck!


The problem is: like regexes, there's no 1 true cpio but rather many flavors.

pax's cpio supports[0]:

- bcpio - Old binary cpio format. Selected by -6.

- cpio - Old octal character cpio format. Selected by -c.

- sv4cpio - SVR4 hex cpio format.

- sv4crc - SVR4 hex cpio format with checksums. This is the default format for creating new archives.

GNU cpio supports[1]:

- binary

- old ASCII [hex?]

- new ASCII

- crc

- HPUX binary

- HPUX old ASCII

0. http://www.mirbsd.org/MirOS/cats/mir/cpio/cpio-20200904.pdf

1. https://www.gnu.org/software/cpio/manual/cpio.html


cpio archives have some very useful properties (you can merge them by concatenation!), but the tooling is quite a PITA.

On an unrelated note, is there any use case for cpio files other than initramfs? Back in the day rpm2cpio was the recommended way to unpack rpm files, but nowadays rpm2archive is much faster.


RPM files are also no longer reliably CPIO compatible. 99.999% are, but as CPIO doesn't allow file sizes larger than 4 GB, creating such an RPM will use a custom archive scheme incompatible with CPIO.

However, it's not particularly common practice to stuff a 4 GB file into an RPM, so at least at the moment, it's not a very big practical issue.
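
For the arithmetic behind that 4 GB figure, here is a small sketch. It assumes the RPM payload uses the SVR4 "new ASCII" (newc) header, where c_filesize is 8 hex digits; the old octal-character format is shown only for comparison. The constant and function names are made up for illustration.

```python
# Widths of the c_filesize field in two common cpio header layouts.
NEWC_FILESIZE_HEX_DIGITS = 8     # SVR4 "new ASCII" header (magic 070701)
ODC_FILESIZE_OCTAL_DIGITS = 11   # old octal-character header (magic 070707)

# Largest value each field can hold.
newc_max = 16 ** NEWC_FILESIZE_HEX_DIGITS - 1
odc_max = 8 ** ODC_FILESIZE_OCTAL_DIGITS - 1


def fits_in_newc(size):
    """True if a file of this many bytes can be described by a newc header."""
    return 0 <= size <= newc_max


print(newc_max, newc_max == 2 ** 32 - 1)
print(odc_max, odc_max == 2 ** 33 - 1)
print(fits_in_newc(4 * 1024 ** 3))
```

So the newc limit is one byte short of 4 GiB, and a file of exactly 4 GiB already does not fit.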


macOS packages are gzipped cpio archives:

    $ file Diva.au.pkg/Payload
    Payload: gzip compressed data, from Unix, original size modulo 2^32 25391104
    $ gzip -dc Diva.au.pkg/Payload > unpacked
    $ file unpacked
    unpacked: ASCII cpio archive (pre-SVR4 or odc)
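
Roughly the same check can be sketched in Python: decompress the payload and look at the first bytes. The magic values are the well-known cpio ones; the function name and the stand-in payload are made up, since I'm not shipping a real .pkg here.

```python
import gzip

# Six-byte ASCII magics used by the character-based cpio formats.
CPIO_MAGICS = {
    b"070707": "odc (old octal character format)",
    b"070701": "newc (SVR4 hex format)",
    b"070702": "crc (SVR4 hex format with checksums)",
}


def identify_cpio(gzipped):
    """Return a short description of the cpio flavor inside gzipped bytes."""
    raw = gzip.decompress(gzipped)
    magic = raw[:6]
    if magic in CPIO_MAGICS:
        return CPIO_MAGICS[magic]
    # Old binary cpio stores octal 070707 as a 16-bit integer, either endianness.
    if raw[:2] in (b"\xc7\x71", b"\x71\xc7"):
        return "binary (old binary format)"
    return None


# Stand-in for a real Payload: just an odc magic followed by filler bytes.
fake_payload = gzip.compress(b"070707" + b"0" * 70)
print(identify_cpio(fake_payload))
```

With a real package you would pass the bytes read from the Payload file instead of fake_payload.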


A long time ago, when Apple was on the verge of releasing OS X, they ran a series of dev conferences. These were to inform then-current Mac devs on the huge changes from 9 to X.

One slide was "we've come up with a new way to install apps! Just run /bin/sh configure, then make, then make install!". After the wave of consternation swept the assembled devs, the next slide was "only joking". With reassurance that you'll still be able to drag an icon to install.

Very amusing to me that CPIO is underneath the pkg install method/format. I wonder what those old school Mac devs would have thought?


You can also concatenate tar files, you just need to remove the last two zero blocks before gluing the files together (GNU tar has an option to do this for you, it's somewhere in the man, or use head -c-2b) or tell tar to ignore them when extracting.
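
The "ignore them when extracting" route can be sketched with Python's tarfile module, which has an ignore_zeros option that skips the end-of-archive zero blocks and keeps reading. The helper names below are made up; the archives are built in memory so the example is self-contained.

```python
import io
import tarfile


def make_tar(name, payload):
    """Build a one-member uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_names(data, ignore_zeros):
    """List member names, optionally reading past end-of-archive zero blocks."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:",
                      ignore_zeros=ignore_zeros) as tar:
        return tar.getnames()


# Plain byte concatenation, zero blocks and all.
glued = make_tar("a.txt", b"first\n") + make_tar("b.txt", b"second\n")

print(list_names(glued, ignore_zeros=False))  # stops at the first zero block
print(list_names(glued, ignore_zeros=True))   # sees both archives
```

The first call only lists a.txt; the second lists both members.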


Beware that tar has some bugs regarding certain kinds of concatenation. I forget the exact details, but double check that the sequence of files/arguments you input gives a valid tarball result. Probably a good idea to always do this when creating an archive, actually.


Another nice feature I used a long time ago from a shell script: you can send filenames to archive via a pipe (stdin). So you can fork a cpio and send it names file by file over a period of time as the files are created. That wasn't possible with tar at the time. (I think it's the same today)


You can do it with GNU tar -T but it often requires shell redirection or spawning a process to be concise.

    tar zcvf tarball.tar.gz -T <(file_name_generator_function_or_process)

Or

    echo /etc/hosts | tar zcvf tarball.tar.gz -T-

I'm long in the tooth enough to remember when tar on SunOS was used for system and data backups to SCSI tape drives.


Not SunOS; this was with tar on AIX (V3, so long ago). But thanks for the GNU tar parameter hint!


I miss using the word tarballs, I really like that word.


Are there any guidelines for tar files? Are you supposed to list all containing directories too?

I noticed when creating container image layers the answer seems to be generally no, but AWS lambda functions with these types of images error with missing directory errors. Is AWS too strict or not?
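
If you want to see whether a given tarball is one of the "no directory entries" kind, here is a rough Python sketch that reports parent directories with no entry of their own. The function name and the sample path are made up, and this says nothing about what AWS actually checks.

```python
import io
import posixpath
import tarfile


def missing_parent_dirs(tar_bytes):
    """Return parent directories that no member of the archive declares."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        names = {member.name.rstrip("/") for member in tar.getmembers()}
    missing = set()
    for name in names:
        parent = posixpath.dirname(name)
        # Walk upwards until we reach the archive root or a declared entry.
        while parent not in ("", ".", "/") and parent not in names:
            missing.add(parent)
            parent = posixpath.dirname(parent)
    return sorted(missing)


# A layer-like archive holding a single file and no directory entries.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = b"print('hi')\n"
    info = tarfile.TarInfo("app/lib/handler.py")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
layer = buf.getvalue()

print(missing_parent_dirs(layer))
```

For this sample it reports app and app/lib as undeclared.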


There are 6 slightly different variants of tar that GNU tar can unpack. In general the tar formats do not have to store the directory record, but the more recent ones can, so as to preserve its metadata. Also, modern tar formats generally store UID and GID as a symbolic name, which can be useful on a different system, in contrast to a numeric UID/GID. The format still is this 70's absurdity with octal-in-ASCII headers (like the “traditional” CPIO) and block aligned. Well, it is a format designed for magnetic tape.

In practice there are huge incompatibilities between non-standard SVr4 and GNU implementations, but both of them go to great lengths to detect that the format is the other one. And then there are two POSIX/SUS variants that can be easily distinguished from each other (and the SVr4/GNU variants) which are most of what you will see today.


> The format still is this 70's absurdity with octal-in-ASCII headers (like the “traditional” CPIO) and block aligned. Well, it is a format designed for magnetic tape.

Yeah, most of the ways tar is weird stop seeming weird when you consider the context. It can easily be appended without rewinding the tape to update metadata, it can be read/edited/fixed with just a text editor, it ends with 2 blocks of null bytes because tape doesn't tell you when you hit the end of your data like a file would.

Also, I have a love-hate relationship with the "numbers as octal in ASCII" thing - on the bright side, it avoids questions of endianness;P (Edit: Well, I guess it's big endian...)
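
For anyone who hasn't looked inside one: a short Python sketch that pulls the octal-in-ASCII fields out of the first 512-byte header and checks the trailing zero blocks. The offsets are the standard ustar ones (name at 0, mode at 100, size at 124, magic at 257); the helper name is made up, and the archive is written in ustar format so the first block is the file's own header.

```python
import io
import tarfile

BLOCK = 512  # tar works in fixed 512-byte blocks


def first_header_fields(tar_bytes):
    """Decode a few fields from the first header block of a ustar archive."""
    header = tar_bytes[:BLOCK]

    def text(offset, length):
        # Fields are ASCII, padded or terminated with NULs and/or spaces.
        return header[offset:offset + length].rstrip(b"\0 ").decode("ascii")

    return {
        "name": text(0, 100),
        "mode": int(text(100, 8), 8),    # octal digits stored as ASCII
        "size": int(text(124, 12), 8),   # octal digits stored as ASCII
        "magic": text(257, 6),
    }


payload = b"hello\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo("hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
data = buf.getvalue()

fields = first_header_fields(data)
print(fields)
# The archive ends with (at least) two all-zero blocks.
print(data[-2 * BLOCK:] == bytes(2 * BLOCK))
```

The raw size field here is just the ASCII string 00000000006, which is why a text editor is enough to read or repair it.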


...and the above is the reason why the Linux kernel uses cpio. Yes, the default tools suck, but the format itself is much simpler and there is only one version.


Instead of an arbitrary octet-stream link, I would highly recommend hosting your bad cpio archive on GitHub. A `hexdump -C` would be preferred.


  curl --silent https://blog.colindou.ch/lets-make-an-os/bad.cpio | hexdump -C | less



