The article spends a lot of time discussing XZ's behavior when reading corrupt archives, but in practice this is not a case that will ever happen.
Say you have a file `m4-1.4.19.tar.xz` in your source archive collection. The version I just downloaded from ftp.gnu.org has the SHA-256 checksum `63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96`. The GNU site doesn't have checksums for archives, but it does have a PGP signature file, so it's possible to (1) verify the archive signature and (2) store the checksum of the file.
If that file is later corrupted, the corruption will be detected via the SHA-256 checksum. There's no reason to worry about whether the file itself has an embedded CRC-whatever, or the behavior with regards to variable-length integer fields. If a bit gets flipped then the checksum won't match, and that copy of the file will be discarded (= replaced with a fresh copy from replicated storage) before it ever hits the XZ code.
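Concretely, the "verify before you ever decompress" step is about ten lines of Python. This is a minimal sketch; the filename and checksum are just the ones from the example above, and a real setup would read the expected hash from a signed manifest rather than hard-coding it:

```python
import hashlib
import lzma

ARCHIVE = "m4-1.4.19.tar.xz"
EXPECTED_SHA256 = "63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96"

def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 so large archives don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

if sha256_of(ARCHIVE) != EXPECTED_SHA256:
    # Corrupt copy: discard it and restore from replicated storage
    # before the data ever reaches the XZ code.
    raise SystemExit(f"{ARCHIVE}: checksum mismatch, restore from backup")

# Only a verified copy is handed to the decompressor.
with lzma.open(ARCHIVE) as f:
    data = f.read()
print(f"decompressed {len(data)} bytes")
```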
If this is how the lzip format was designed -- optimized for a use case that doesn't exist -- it's no wonder it sees basically no adoption in favor of more pragmatic formats like XZ or Zstd.
I have no idea why you think that use case doesn't exist. Your whole notion of archiving seems to be that its only job is to ensure a blob doesn't change (same hash). But that's far from the only use of an archive. (Hell, even granting that, you're assuming you know the correct hash of the file to begin with, which isn't guaranteed.)
"Repairing" corrupt archives, as in to get as much as usable data from that archive is a pretty useful thing and I have done it multiple times. For example, an archive can have hundreds of files inside and if you can recover any of them that's better than nothing. It is also one of the reason I still use WinRAR occasionally due to its great recovery record (RR) feature.
>replaced with a fresh copy from replicated storage
The process of long-term archival starts with replication. A common approach is two local copies on separate physical media, and one remote copy in a cloud storage service with add-only permissions. This protects against hardware failure, bad software (accidental deletion, malware), natural disasters (flood, fire) and other 99th-percentile disaster conditions. The cloud storage providers will have their own level of replication (AWS S3 has a 99.999999999% durability SLA).
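For the local copies, the glue is little more than a periodic scrub against a checksum manifest. Here's a rough sketch, with made-up paths and a made-up manifest format; the cloud copy would be checked the same way through its API, or trusted to the provider's own integrity machinery:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical layout: one manifest of known-good checksums, two local replicas.
MANIFEST = Path("archive-manifest.json")   # {"m4-1.4.19.tar.xz": "63aede5c..."}
REPLICAS = [Path("/mnt/disk-a/archive"), Path("/mnt/disk-b/archive")]

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = json.loads(MANIFEST.read_text())
for name, expected in manifest.items():
    copies = [replica / name for replica in REPLICAS]
    good = [p for p in copies if p.exists() and sha256_of(p) == expected]
    bad = [p for p in copies if p not in good]
    for victim in bad:
        if good:
            # Repair a bad or missing replica from any verified copy.
            victim.write_bytes(good[0].read_bytes())
            print(f"repaired {victim} from {good[0]}")
        else:
            print(f"ALL local copies of {name} are bad: restore from the cloud copy")
```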
If you have only one copy of some important file and you discover it no longer matches the stored checksum, then that's not a question of archival, but of data recovery. There's no plausible mechanism by which a file might suffer a few bitflips and be saved by careful application of a CRC -- bitrot often zeros out entire sectors.
> There's no plausible mechanism by which a file might suffer a few bitflips and be saved by careful application of a CRC -- bitrot often zeros out entire sectors.
A CRC, no, absolutely not. But this is exactly what PAR2 recovery records do, they do it well, and they (or their equivalents) need to be easier to enable in more places.
Setting up a replication and durability scheme is a major pain in the ass. Passing the `--add-recovery-record` switch on the command line is very, very easy, and it is good enough for many cases where "best effort" protection against corruption is all that is needed.
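To be clear about what a recovery record buys you: PAR2 uses Reed-Solomon coding, which can rebuild many damaged blocks. The toy sketch below uses a single XOR parity block instead, which can only rebuild one known-bad block, but it shows the principle of storing redundancy next to the data. The block size is arbitrary and this has no relation to the actual PAR2 format:

```python
BLOCK = 4096

def make_recovery_block(data: bytes) -> bytes:
    """XOR all (zero-padded) blocks together into one parity block."""
    blocks = [data[i:i + BLOCK].ljust(BLOCK, b"\0") for i in range(0, len(data), BLOCK)]
    parity = bytearray(BLOCK)
    for b in blocks:
        parity = bytearray(x ^ y for x, y in zip(parity, b))
    return bytes(parity)

def rebuild_block(blocks: list[bytes], parity: bytes, lost_index: int) -> bytes:
    """Recover the block at lost_index by XOR-ing the parity with every surviving block."""
    out = bytearray(parity)
    for i, b in enumerate(blocks):
        if i != lost_index:
            out = bytearray(x ^ y for x, y in zip(out, b.ljust(BLOCK, b"\0")))
    return bytes(out)

data = b"hello world" * 1000
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
parity = make_recovery_block(data)
recovered = rebuild_block(blocks, parity, lost_index=1)
assert recovered == blocks[1].ljust(BLOCK, b"\0")
```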
It's simply not sufficient to have multiple copies. It's way too easy to propagate errors in a way that slips under the radar and then screws you over five years later. The main idea of long-term archival is redundancy. Replication is one form of redundancy, but it's not the only one, and not the only one you should use.
This is nonsense. Replication to storage in different failure domains is quite sufficient to ensure long-term data preservation, and errors cannot "propagate" to archives unless your risk model involves angry wizards.
None of this addresses the criticism levied in the article, nor does it defend xz's inconsistent design decisions that are all over the place.
Why shouldn't we try to squeeze the highest rate of data recovery out of the unlikely event that we're left with the only remaining copy if it costs nothing extra (just the choice of one archiver over another)?
Should you choose xz for future archival purposes? No.
Should Debian make an active effort to switch away from xz now? Probably not, as their primary concern is distribution, not archival, and xz is good enough.
> None of this addresses the criticism levied in the article
The article's criticism isn't worth addressing (or reading). Nothing it complains about is important.
> nor does it defend xz's inconsistent design decisions that are all
> over the place.
The level of inconsistency that the article complains about doesn't matter. Pretty much every popular format looks like that. Try writing a Matroska or PDF decoder some time.
> Why shouldn't we try to squeeze the highest rate of data recovery
> out of the unlikely event that we're left with the only remaining
> copy if it costs nothing extra
Because it isn't important.
For cases where recovery of data from a corrupt compressed stream is important, you'd wrap the compressed data in a container with built-in error correction. Then you'd use that format's error correction to recover the correct compressed data stream, and feed that to your decompressor.
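A deliberately crude sketch of that wrapping: here the "container with built-in error correction" is just three copies of the compressed stream with a byte-wise majority vote, standing in for a real code like Reed-Solomon. It survives corruption as long as no more than one copy is damaged at any given position, and the decompressor only ever sees the repaired stream:

```python
import lzma
from collections import Counter

def wrap(payload: bytes, copies: int = 3) -> bytes:
    # Toy "container with built-in error correction": a repetition code.
    # A real container would use something like Reed-Solomon instead.
    return payload * copies

def unwrap(container: bytes, copies: int = 3) -> bytes:
    n = len(container) // copies
    chunks = [container[i * n:(i + 1) * n] for i in range(copies)]
    # Byte-wise majority vote corrects any position where only one copy is corrupted.
    return bytes(Counter(column).most_common(1)[0][0] for column in zip(*chunks))

original = b"some data worth archiving" * 100
container = bytearray(wrap(lzma.compress(original)))
container[5] ^= 0xFF                      # simulate corruption of one byte in the first copy
repaired = unwrap(bytes(container))
assert lzma.decompress(repaired) == original
```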
> Should you choose xz for future archival purposes? No.
Yes you should. XZ is fine, despite the article's silliness. It's better than gzip or bzip2 and nearly as widely supported.
If there's a replacement for XZ as a general-purpose compression format for archived data then it'll be selected on the quality of its compression, not whether the bitstream format can produce valid output from invalid input.
> The level of inconsistency that the article complains about doesn't matter. Pretty much every popular format looks like that. Try writing a Matroska or PDF decoder some time.
So a format should be designed wrong because other formats are also designed wrong?
> Because it isn't important.
So? You don't have to go out of your way to micro-optimize; others, like TFA's author, are already doing it for you. You just have to pick the micro-optimized product off the shelf.
"It isn't important" is a complete non-argument when the bad choices are made for no reason and no benefit at all and especially not even making the design process quicker and easier.
> If there's a replacement for XZ as a general-purpose compression format for archived data then it'll be selected on the quality of its compression, not whether the bitstream format can produce valid output from invalid input.
The article also addresses design problems that harm the compression ratio.
To my own surprise, I ran into a corrupted file in 2022. Only 70 MB, so really tiny compared to many files handled today.
The file had been built in Europe, and after installing it on a US system we wondered why it didn't work.
I hadn't seen anything like that in many years, so it took us a while to even consider the possibility that the file was corrupted.
(No, we didn't spend any time checking whether the corruption showed any interesting pattern like a single bit flip or a block of zeros. Transferring it again simply fixed it.)
I've had a few experiences trying to recover data from old hard drives or even tape drives. The general experience was that either everything works perfectly or the drive is covered in bad sectors and large chunks are unreadable. I don't dispute that bitrot exists, but there does seem to be an awful lot of discussion on the internet about an issue that is not the most likely failure mode.
Data corruption generally happens at two levels:
* At the sector level in physical media (tapes, disk drives, flash). The file will be largely intact, but 4- or 8-KiB chunks of it will be zeroed out.
* At the bit level, when copying goes wrong. Usually this is bad RAM, sometimes a bad network device, very occasionally a bad CPU. You'll see patterns like "every 64th byte has had its high bit set to 1".
In both cases, the only practical option is to have multiple copies of the file, plus some checksum to decide whether a copy is "good". Files on disk can be restored from backup, bad copies over the network can be detected by software and re-transmitted.
It is a case that will happen when you archive files long-term, which is exactly what the article discusses. Bit rot is a real thing that really happens. The article makes the case that XZ is a poor choice when you may have only one copy left and can't just download a fresh, uncorrupted one.
If you want to archive data, you need multiple copies.
XZ or not doesn't matter here. Even if you have a completely uncompressed tar file, if you only have one copy of it and lose access to that copy (whether bitrot, or disaster, or software error) then you've lost the data.
No, you need redundancy. Multiple copies aren't sufficient without an appropriate form of data storage. I have no idea why people think a single solution is necessary or sufficient.
I don't know what you mean by that, and I suspect you don't either.
If I have a copy on my NAS, on a local backup disk, and in GCS, then there's no plausible risk to that data. I could go further and put another copy into AWS Glacier, or write it to tape and store it at my bank. At enterprise price points there's vendors like Iron Mountain who will store tapes by the container-load.
To claim that multiple copies is insufficient is absurd.
In the case of storage and servers, redundancy usually maps to uptime and availability. Think RAID or HA.
Copies are distinct copies. If your RAID catches fire, you want a copy that's somewhere else. Think external drive.
In terms of backups, while you might want redundancy for availability, you want distinct copies in separate places ideally. So that if your building catches fire and takes your RAID, at least you have a copy somewhere else.
A lot of these copies aren't in real time. That's what makes them an easy backup solution. A backup is a snapshot that allows you to go back to a point in time and see/recover things. Redundancy won't protect you against someone deleting their home folder. If the copies are in real time, that's gone too.
So, copies aren't always backups and a backup isn't always a copy.