The article spends a lot of time discussing XZ's behavior when reading corrupt archives, but in practice this is not a case that will ever happen.
Say you have a file `m4-1.4.19.tar.xz` in your source archive collection. The version I just downloaded from ftp.gnu.org has the SHA-256 checksum `63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96`. The GNU site doesn't have checksums for archives, but it does have a PGP signature file, so it's possible to (1) verify the archive signature and (2) store the checksum of the file.
If that file is later corrupted, the corruption will be detected via the SHA-256 checksum. There's no reason to worry about whether the file itself has an embedded CRC-whatever, or the behavior with regards to variable-length integer fields. If a bit gets flipped then the checksum won't match, and that copy of the file will be discarded (= replaced with a fresh copy from replicated storage) before it ever hits the XZ code.
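Concretely, the "verify before you ever decompress" step is about ten lines of Python. This is a minimal sketch; the filename and checksum are just the ones from the example above, and a real setup would read the expected hash from a signed manifest rather than hard-coding it:

```python
import hashlib
import lzma

ARCHIVE = "m4-1.4.19.tar.xz"
EXPECTED_SHA256 = "63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96"

def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 so large archives don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

if sha256_of(ARCHIVE) != EXPECTED_SHA256:
    # Corrupt copy: discard it and restore from replicated storage
    # before the data ever reaches the XZ code.
    raise SystemExit(f"{ARCHIVE}: checksum mismatch, restore from backup")

# Only a verified copy is handed to the decompressor.
with lzma.open(ARCHIVE) as f:
    data = f.read()
print(f"decompressed {len(data)} bytes")
```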
If this is how the lzip format was designed -- optimized for a use case that doesn't exist -- it's no wonder it sees basically no adoption in favor of more pragmatic formats like XZ or Zstd.
I have no idea why you think that use case doesn't exist. Your whole notion of archiving seems to be that its only job is to ensure a blob doesn't change (same hash). But that's far from the only use of an archive. (Hell, even granting that, you're assuming you know the correct hash of the file to begin with, which isn't guaranteed.)
"Repairing" corrupt archives, as in to get as much as usable data from that archive is a pretty useful thing and I have done it multiple times. For example, an archive can have hundreds of files inside and if you can recover any of them that's better than nothing. It is also one of the reason I still use WinRAR occasionally due to its great recovery record (RR) feature.
>replaced with a fresh copy from replicated storage
The process of long-term archival starts with replication. A common approach is two local copies on separate physical media, and one remote copy in a cloud storage service with add-only permissions. This protects against hardware failure, bad software (accidental deletion, malware), natural disasters (flood, fire) and other 99th-percentile disaster conditions. The cloud storage providers will have their own level of replication (AWS S3 has a 99.999999999% durability SLA).
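For the local copies, the glue is little more than a periodic scrub against a checksum manifest. Here's a rough sketch, with made-up paths and a made-up manifest format; the cloud copy would be checked the same way through its API, or trusted to the provider's own integrity machinery:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical layout: one manifest of known-good checksums, two local replicas.
MANIFEST = Path("archive-manifest.json")   # {"m4-1.4.19.tar.xz": "63aede5c..."}
REPLICAS = [Path("/mnt/disk-a/archive"), Path("/mnt/disk-b/archive")]

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest = json.loads(MANIFEST.read_text())
for name, expected in manifest.items():
    copies = [replica / name for replica in REPLICAS]
    good = [p for p in copies if p.exists() and sha256_of(p) == expected]
    bad = [p for p in copies if p not in good]
    for victim in bad:
        if good:
            # Repair a bad or missing replica from any verified copy.
            victim.write_bytes(good[0].read_bytes())
            print(f"repaired {victim} from {good[0]}")
        else:
            print(f"ALL local copies of {name} are bad: restore from the cloud copy")
```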
If you have only one copy of some important file and you discover it no longer matches the stored checksum, then that's not a question of archival, but of data recovery. There's no plausible mechanism by which a file might suffer a few bitflips and be saved by careful application of a CRC -- bitrot often zeros out entire sectors.
> There's no plausible mechanism by which a file might suffer a few bitflips and be saved by careful application of a CRC -- bitrot often zeros out entire sectors.
A CRC, no, absolutely not. But this is exactly what PAR2 recovery records do, they do it well, and they (or their equivalents) need to be easier to enable in more places.
Setting up a replication and durability scheme is a major pain in the ass. Passing the `--add-recovery-record` switch on the command line is very, very easy, and it is good enough for many cases where "best effort" protection against corruption is all that is needed.
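To be clear about what a recovery record buys you: PAR2 uses Reed-Solomon coding, which can rebuild many damaged blocks. The toy sketch below uses a single XOR parity block instead, which can only rebuild one known-bad block, but it shows the principle of storing redundancy next to the data. The block size is arbitrary and this has no relation to the actual PAR2 format:

```python
BLOCK = 4096

def make_recovery_block(data: bytes) -> bytes:
    """XOR all (zero-padded) blocks together into one parity block."""
    blocks = [data[i:i + BLOCK].ljust(BLOCK, b"\0") for i in range(0, len(data), BLOCK)]
    parity = bytearray(BLOCK)
    for b in blocks:
        parity = bytearray(x ^ y for x, y in zip(parity, b))
    return bytes(parity)

def rebuild_block(blocks: list[bytes], parity: bytes, lost_index: int) -> bytes:
    """Recover the block at lost_index by XOR-ing the parity with every surviving block."""
    out = bytearray(parity)
    for i, b in enumerate(blocks):
        if i != lost_index:
            out = bytearray(x ^ y for x, y in zip(out, b.ljust(BLOCK, b"\0")))
    return bytes(out)

data = b"hello world" * 1000
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
parity = make_recovery_block(data)
recovered = rebuild_block(blocks, parity, lost_index=1)
assert recovered == blocks[1].ljust(BLOCK, b"\0")
```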
It's simply not sufficient to have multiple copies. It's way too easy to propagate errors in a way that slips under the radar and then screws you over five years later. The main idea of long-term archival is redundancy. Replication is one form of redundancy, but it's not the only one, and not the only one you should use.
This is nonsense. Replication to storage in different failure domains is quite sufficient to ensure long-term data preservation, and errors cannot "propagate" to archives unless your risk model involves angry wizards.
None of this addresses the criticism levied in the article, nor does it defend xz's inconsistent design decisions that are all over the place.
Why shouldn't we try to squeeze the highest rate of data recovery out of the unlikely event that we're left with the only remaining copy if it costs nothing extra (just the choice of one archiver over another)?
Should you choose xz for future archival purposes? No.
Should Debian make an active effort to switch away from xz now? Probably not, as their primary concern is distribution, not archival, and xz is good enough.
> None of this addresses the criticism levied in the article
The article's criticism isn't worth addressing (or reading). Nothing it complains about is important.
> nor does it defend xz's inconsistent design decisions that are all
> over the place.
The level of inconsistency that the article complains about doesn't matter. Pretty much every popular format looks like that. Try writing a Matroska or PDF decoder some time.
> Why shouldn't we try to squeeze the highest rate of data recovery
> out of the unlikely event that we're left with the only remaining
> copy if it costs nothing extra
Because it isn't important.
For cases where recovery of data from a corrupt compressed stream is important, you'd wrap the compressed data in a container with built-in error correction. Then you'd use that format's error correction to recover the correct compressed data stream, and feed that to your decompressor.
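A deliberately crude sketch of that wrapping: here the "container with built-in error correction" is just three copies of the compressed stream with a byte-wise majority vote, standing in for a real code like Reed-Solomon. It survives corruption as long as no more than one copy is damaged at any given position, and the decompressor only ever sees the repaired stream:

```python
import lzma
from collections import Counter

def wrap(payload: bytes, copies: int = 3) -> bytes:
    # Toy "container with built-in error correction": a repetition code.
    # A real container would use something like Reed-Solomon instead.
    return payload * copies

def unwrap(container: bytes, copies: int = 3) -> bytes:
    n = len(container) // copies
    chunks = [container[i * n:(i + 1) * n] for i in range(copies)]
    # Byte-wise majority vote corrects any position where only one copy is corrupted.
    return bytes(Counter(column).most_common(1)[0][0] for column in zip(*chunks))

original = b"some data worth archiving" * 100
container = bytearray(wrap(lzma.compress(original)))
container[5] ^= 0xFF                      # simulate corruption of one byte in the first copy
repaired = unwrap(bytes(container))
assert lzma.decompress(repaired) == original
```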
> Should you choose xz for future archival purposes? No.
Yes you should. XZ is fine, despite the article's silliness. It's better than gzip or bzip2 and nearly as widely supported.
If there's a replacement for XZ as a general-purpose compression format for archived data then it'll be selected on the quality of its compression, not whether the bitstream format can produce valid output from invalid input.
> The level of inconsistency that the article complains about doesn't matter. Pretty much every popular format looks like that. Try writing a Matroska or PDF decoder some time.
So a format should be designed wrong because other formats are also designed wrong?
> Because it isn't important.
So? You don't have to go out of your way to micro-optimize; others, like TFA's author, are already doing it for you. You just have to pick the micro-optimized product off the shelf.
"It isn't important" is a complete non-argument when the bad choices are made for no reason and no benefit at all and especially not even making the design process quicker and easier.
> If there's a replacement for XZ as a general-purpose compression format for archived data then it'll be selected on the quality of its compression, not whether the bitstream format can produce valid output from invalid input.
The article also addresses design problems that harm the compression ratio.
To my own surprise, I ran into a corrupted file in 2022. Only 70 MB, so really tiny compared to many files handled today.
The file had been built in Europe, and after installing it on a US system we wondered why it didn't work.
I hadn't seen anything like that in many years, so it took us a while to even consider the possibility that the file was corrupted.
(No, we didn't spend any time checking whether the corruption showed any interesting pattern like a single bit flip or a block of zeros. Transferring it again simply fixed it.)
I've had a few experiences trying to recover data from old hard drives or even tape drives. The general experience was that either everything works perfectly or the drive is covered in bad sectors and large chunks are unreadable. I don't dispute that bitrot exists, but there does seem to be an awful lot of discussion on the internet about an issue that is not the most likely failure mode.
Data corruption generally happens at two levels:
* At the sector level in physical media (tapes, disk drives, flash). The file will be largely intact, but 4- or 8-KiB chunks of it will be zeroed out.
* At the bit level, when copying goes wrong. Usually this is bad RAM, sometimes a bad network device, very occasionally a bad CPU. You'll see patterns like "every 64th byte has had its high bit set to 1".
In both cases, the only practical option is to have multiple copies of the file, plus some checksum to decide whether a copy is "good". Files on disk can be restored from backup, bad copies over the network can be detected by software and re-transmitted.
It is a case that will happen when you archive files long-term, which is exactly what the article discusses. Bit rot is a real thing that really happens. The article makes the case that XZ is a poor choice when you may have only one copy left and can't just download a fresh, uncorrupted one.
If you want to archive data, you need multiple copies.
XZ or not doesn't matter here. Even if you have a completely uncompressed tar file, if you only have one copy of it and lose access to that copy (whether bitrot, or disaster, or software error) then you've lost the data.
No, you need redundancy. Multiple copies aren't sufficient without an appropriate form of data storage. I have no idea why people think a single solution is necessary or sufficient.
I don't know what you mean by that, and I suspect you don't either.
If I have a copy on my NAS, on a local backup disk, and in GCS, then there's no plausible risk to that data. I could go further and put another copy into AWS Glacier, or write it to tape and store it at my bank. At enterprise price points there's vendors like Iron Mountain who will store tapes by the container-load.
To claim that multiple copies is insufficient is absurd.
In the case of storage and servers, redundancy usually maps to uptime and availability. Think RAID or HA.
Copies are distinct copies. If your RAID catches fire, you want a copy that's somewhere else. Think external drive.
In terms of backups, while you might want redundancy for availability, you want distinct copies in separate places ideally. So that if your building catches fire and takes your RAID, at least you have a copy somewhere else.
A lot of these copies aren't in real time. That's what makes them an easy backup solution. A backup is a snapshot that allows you to go back to a point in time and see/recover things. Redundancy won't protect you against someone deleting their home folder. If the copies are in real time, that's gone too.
So, copies aren't always backups and a backup isn't always a copy.