Xz format considered inadequate for long-term archiving (2016) (nongnu.org)
248 points by goranmoomin on July 24, 2022 | 157 comments


We're talking about long-term archiving here. That means centuries.

My brother, an archaeological archivist of ancient (~2000 BCE) Mesopotamian artifacts, has a lot to say about archival formats. His raw material is mostly fired clay tablets. Those archives keep working, partially, even if broken. That's good, because many of them are in fact broken when found.

But their ordering and other metadata about where they were found is written in archaeologists' notebooks, and many of those notebooks are now over a century old. Paper deteriorates. If a lost flake of paper from a page in the notebook rendered the whole notebook useless, that would be a disastrous outcome for that archive.

A decade ago I suggested digitizing the notebooks and storing the bits on CD-ROMs. He laughed, saying "we don't know enough about the long-term viability of CD-ROMs and their readers."

Now, when they get around to it, they're digitizing the notebooks and storing them as PDFs on backed-up RAID10 volumes. But they're also printing them on acid-free paper and putting them with the originals in the vaults where they store old books.

My point: planning for centuries-long archiving is difficult. Formats with redundancy, or at least forward error correction codes, are very helpful. Formats that can be rendered useless by a few bit-flips, not so much.


I would be equally concerned about the stability of the file formats for the data stored inside the archives. Even plain ASCII text files have not been around very long - about 60 years since standardisation, but it took a while for the standard to become largely universal. And ASCII is pretty restricted in what it can represent. Note that I'm talking about plain text files, not things like Markdown which might use ASCII.

Most complex file formats suffer from variants. Some, like Markdown and RTF, just have multiple versions. Some, like TIFF and PDF, are envelope formats, so the possible contents of the envelope change over time, introducing incompatibility. Then there is bit-rot as formats go out of use, e.g. .DOC (as opposed to .DOCX).

My own objectives are simple compared to your brother's. I want to preserve simple formatted text files until about 40 years from now, in a way that is likely to allow cut and paste. I started accumulating them about 20 years back. Note that this is before Markdown (which is in any case poor for recording formatting). LaTeX was around and seemed OK in terms of expected lifetime, but is poor for cut and paste because the rendering of a chunk of text depends on instructions which are not local to it. I settled on RTF, which carries significant long-term risk for both compatibility and availability, but is documented well enough that migrating out may be possible.

That's just formatted text. Images have been worse, particularly if you are handling meta-data such as camera characteristics, satellite orientation, etc.


I very much doubt the bit representation matters much as long as it is simple. Even if ASCII text viewers were lost, they would be extremely simple to reimplement. It is a counterpoint to things like LaTeX, which would be hard to recreate.


Not to sound like the usual evangelist, and I am sure everyone has already thought of these points, but regarding RAID-10, is that the best call?

My thought process is that the majority of modern hardware/software RAID solutions don’t do error correction properly. And even assuming something that does actually do checksumming and such semi-properly is employed, I think if we are talking tens or hundreds of years from now, it’ll be nearly impossible to find a compatible hardware card should the ones in use die.

I’m aware it already sucks trying to build a project six months in the future, let alone six or sixty years, but perhaps something purely software-based that does care a ton about integrity, like ZFS, would be the best bet in terms of long-term compatibility of a hard-drive storage solution.

So long as the drives can still be plugged into a system, any system, and even if OpenZFS eventually drops backwards compatibility with the version used to create the pool, it’ll likely still be possible to virtualize whatever version of Linux/BSD/Illumos is compatible with that particular version of ZFS and then import the pool.


I think QR codes are actually great for this (so long as you are storing basic data like plain text that is likely to be recoverable for a long time). They have built-in error correction, and software for reading them is extremely widespread.

However, more than just error correction, we should really try to make formats that are resilient to data corruption. For example, zip files seem much more resilient than gzipped tarballs because each file is compressed separately.

Digital copies are really not that durable, we just sometimes confuse ease of copying with durability. This sometimes helps, but only if you can have distributed copies.


> Now, when they get around it it, they're digitizing the notebooks and storing them on PDFs on backed-up RAID10 volumes. But they're also printing them on acid-free paper and putting them with the originals in the vaults where they store old books.

I actually have a personal digitization project for some stuff I've inherited, and it's nice to get a little validation for my strategy.

Basically my plan is to scan the documents/photos, create some kind of printed book with the most important/interesting ones and an index, and have an M-DISC with all the scans in the back.


What about archiving them in torrent format? That way, as long as there is one nerd out there who values history, there will be a copy.


Bittorrent is not archival, it's distribution.


Can't distribution be archival?

Thomas Jefferson: "Let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident."


No, distribution cannot be archival because you don't know when someone else will stop distributing. Proof of that is how many torrents have zero seeders and zero leechers, rendering them useless.


The hardest part is digitizing those notes. Once digitized, the best approach is to make copies every few years. Copying of digital data is easy and lossless, and the cost of digital storage is constantly going down.


Correct, xz is no longer particularly useful, mostly annoying.

For read-only long-term filesystem-like archives, use squashfs with whatever compression option is convenient. The cool thing about squashfs is you can efficiently list the contents and extract single files. That's why it's the best option I know of for long-term filesystem-esque archives.

https://en.m.wikipedia.org/wiki/SquashFS
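
A rough sketch of that workflow, with made-up names (the zstd option assumes a reasonably recent squashfs-tools):

    mksquashfs notebooks/ archive.sqsh -comp zstd          # build the image
    unsquashfs -l archive.sqsh                             # list contents without extracting
    unsquashfs -d restored/ archive.sqsh path/to/one/file  # pull out a single file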

For everything else, or if you want (very) fast compression/decompression and/or general high-ratio compression, use zstd.

https://github.com/facebook/zstd

I've used this combination professionally to great effect. You're welcome :)


I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file, in case you don’t value your sanity). People have made a number of those on top of tar, but they are all limited in various ways by the creators’ requirements, and hardly ubiquitous; and, well, tar is nuts—both because of how difficult and dubiously compatible it is to store some aspects of filesystem metadata in it, and because of how impossible it is to not store others except by convention. Not to mention the useless block structure compared to e.g. cpio or even ar. (Seek indexes for gzip and zstd are also a solved problem in that annoying essentially-but-not-in-practice way, but at least the formats themselves are generally sane.)

Incidentally, the author of the head article also disapproves of the format (not compression) design of zstd[1] on the same corruption-resistance grounds (although they e.g. prohibit concatenability), even though to me the format[2] seems much less crufty than xz.

[1] https://lists.gnu.org/archive/html/lzip-bug/2016-10/msg00002...

[2] https://www.rfc-editor.org/rfc/rfc8878.html


> I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file).

I wholeheartedly agree! That use-case is not currently covered by any widely known OSS project AFAIK.


What about tar?


Tar is really great for its intended use case (archiving files to tape, with the constraints of late 1970s computing power), and kind of weird in basically all other uses. The block format wastes tons of space unless you compress it with run length encoding (which is why .tar.gz is so common). It's dirt simple but also kind of brittle. Its data model is not really a good fit for storing things that don't look like posix files, too much metadata in some ways while missing other kinds.


I'm actually working on just such a thing! It's definitely a low-priority side project at this point, but I think the general technique has legs. It was born out of a desire for easier to use streaming container formats that can save arbitrary data streams.

I call it SITO and it's based on a stream of messagepack objects. If folks are interested in such a thing, I can post what I have currently of the spec, and I'd love to get some feedback on it.

I agree that TAR is problematic, compression is problematic, and there needs to be a mind towards ECC from the get-go.

I could really use some technical discussion to hammer out some of the design decisions that I'm iffy about, and also home in on what is essential for the spec. For example, how to handle seek-indexing, and whether I should use named fields vs a fixed schema, or allow both.


This is about long-term archiving though. For that you want a wide-spread format that is well-documented and has many independent implementations.

Like zip/7z (with external error recovery), or maybe RAR (with error-recovery records)

Fast compression or decompression is almost entirely meaningless in that context, and compression ratio is also only of secondary importance.

This is why PDF is still considered the best format for long-term archiving of documents, even though there might be things that compress better (djvu/jp2)


I have come to the same conclusion and was surprised to find that RAR is the only popular archive format that includes parity.

Personally I use 7zip for compression and par2[0] for parity.

[0] https://en.wikipedia.org/wiki/Parchive
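
For anyone who hasn't used it, the par2 workflow looks roughly like this (10% redundancy is just an example):

    par2 create -r10 archive.7z      # writes archive.7z.par2 plus recovery volumes
    par2 verify archive.7z.par2      # check the archive against the parity data
    par2 repair archive.7z.par2      # attempt reconstruction after corruption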


I'm still looking for a full-fledged backup and archiving solution that has the following characteristics:

  0) error-correction (not just detection!)
  1) end-to-end encryption
  2) deduplication
  3) compression
  4) cross-platform implementations
  5) (at least one) user interface
  6) open source
Both Borg [0] and Restic [1] have long standing open issues for error-correction, but seem to consider it off strategy. I find that decision kind of strange, since to me the whole purpose of a backup solution is to restore your system to a correct state after any kind of incident.

My current solution is an assembly of shell scripts that combine borg with par2, but I'm rather unhappy with it. For one, I trust my home-brewed solution rather faintly (i.e. similar to `don't roll your own crypto`, I think there should be an adage `don't roll your own backup solutions`). In addition I think an error-correcting mechanism should be available also for the less technology-savvy.

[0]: https://github.com/borgbackup/borg/issues/225

[1]: https://github.com/restic/restic/issues/256
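
(For the curious, the core of those scripts is roughly the shape below -- names and paths are made up, and it's a sketch rather than the real thing:)

    # assumes BORG_REPO is already exported
    borg create --compression zstd ::$(date +%F) /home/me/documents
    borg export-tar ::$(date +%F) /backups/$(date +%F).tar
    par2 create -r10 /backups/$(date +%F).tar    # parity for the exported archive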


Paper/master thesis/nerd snipe idea: does the availability of Reed-Solomon information for unencrypted files weaken the encryption of their encrypted counterparts?


I have yet to find a conclusive analysis on how well RAR with recovery works for different failure modes.

I mean I can guess it works pretty well for single-bit flips, but how about burst errors, how long can those be? Usually you want to have protection from at least 1 or 2 filesystem blocks, which can be 4 or 8k or even more, depending on the file system. How about repeating error patterns, data deletions, etc.?


PAR files (parity) are battle tested, if you are really concerned about recovering from corruption.

https://en.m.wikipedia.org/wiki/Parchive


It appears lzip is resistant to single bit flips, but can't be configured with more resistance.


I like the inbuilt parity as you don't have to hold on to another set of files.


> The cool thing about squashfs is you can efficiently list the contents and extract single files.

What’s the story look like around reading files out of squashfs archives stored in an S3 compatible storage system? Can what you mention above be done with byte range requests versus retrieving the entire object?


https://dr-emann.github.io/squashfs/squashfs.html suggests this may be possible with a small handful of HTTP requests (less than 10, likely 5 or 6).


Thank you!


https://github.com/mhx/dwarfs

"DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS and it uses less CPU resources."


DwarFS may be good, but it's not in the Linux kernel (it depends on FUSE). That makes it less universal, potentially significantly slower for some use cases, and also less thoroughly tested. SquashFS is used by a lot of embedded Linux distros among other use cases, so we can have pretty high confidence in its correctness.


Thank you coldblues, I'd not heard of dwarfs!

Went ahead and submitted, I think it deserves its own discussion:

https://news.ycombinator.com/item?id=32216275


Are you recommending zstd based on compression speed and ratio only? Because as the linked article explains, those are not the only criteria. How does zstd rate with everything else?


Yes, I felt that's the biggest deficiency of the article: not covering zstd.

A comment here says that the author of the article disapproves of zstd with arguments similar to those against xz. I have not verified the claim.


zstd still doesn't have a seekable format as part of the official standard (I wish it did): https://github.com/facebook/zstd/issues/395#issuecomment-535...


Zstd is still just as bad when it comes to the most important point:

>Xz does not provide any data recovery means

For the common use case of compressing a tar archive, this is a critical flaw. One small area of corruption will render the remainder of the archive unreadable.

I don’t know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2. The only compressed format left in common use that handles corruption is the hardware compression in LTO tape drives.


> I don’t know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2.

Because monotonically increasing technological progress is a commonly-believed fairy tale. Nowadays, capabilities are lost as often as they're gained. Either because the people who designed the successor are being opinionated or just focusing on something else.


Zstd is terrible for archiving since it doesn't even detect corruption. The --check switch described in the manpage as enabling checksums (in a super-confusing way) seems to do absolutely nothing.

You can test by intentionally corrupting a .zstd file that was created with checksums enabled and then watch as zstd happily proceeds to decompress it, without any sort of warning. This is the stuff of nightmares.

After all these years, RAR remains the best option for archiving.


I got "Decoding error (36)" when data was wrong, so --check (enabled by default during compression) is working for me:

  echo 'a b c' > test
  zstd test
  zstdcat test.zst
  sed -i s/a/b/g test.zst # corrupt the file on purpose
  zstdcat test.zst
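
There's also a test mode that verifies integrity (including the checksum) without writing any output:

  zstd -t test.zst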


zstd hardly replaces xz; the compression ratio is quite a bit worse. zstd seems more like a replacement for gz.


Most tests I've seen, such as [0] don't support your statement. Zstd can compress almost as well as xz but decompresses much faster.

It can also compress more than xz with tweaks, though I don't know the compute/memory tradeoffs.

[0] https://archlinux.org/news/now-using-zstandard-instead-of-xz...

Edit to fix typo.


Sure, if you look at the Pareto frontier for xz and zstd, zstd does not seem like a “replacement” for xz. It’s not a replacement for PPMd.

The problem is that xz has kind of horrible performance when you crank it up to the high settings. On the medium settings, you can get the same ratio for much, much less CPU (round-trip) by switching to zstd.

YMMV, use your own corpus and CPU to test it.


Depends on your use case. Saving the last couple of Bytes is often less important than fast compression.


Do you have any data to support this claim? In my experience, zstd is way better in every way compared to gzip. Additionally, xz is good compression but horribly crazy slow to decompress. Xz also only operates on one single file at a time, which is annoying.


I just did a quick test with a copy of GNU M4, which is reasonably representative of a source code archive.

  $ time xz -9k m4-1.4.19.tar 
  real 0m2.928s
  user 0m2.871s
  sys 0m0.056s
  $ time zstd -19 m4-1.4.19.tar
  real 0m3.411s
  user 0m3.380s
  sys 0m0.032s
  $ ls -l m4-1.4.19.tar*
  -rw-rw-r-- 1 john john 14837760 Jul 24 14:40 m4-1.4.19.tar
  -rw-rw-r-- 1 john john  1674612 Jul 24 14:40 m4-1.4.19.tar.xz
  -rw-rw-r-- 1 john john  1726155 Jul 24 14:40 m4-1.4.19.tar.zst
In this test, XZ was both faster and had better compression than Zstd.


Howdy jmillikin.

That is a very small file.

See: https://news.ycombinator.com/item?id=25455846#:~:text=xz%20o....

> xz on highest (normal) compression level easily beats the compression ratio of zstd on highest compression level (--ultra -22) on any data I've tested. However with xz reading the compressed files easily becomes a bottleneck, zstd has great read speeds regardless of compression ratio


Are you agreeing or disagreeing with ars' claim that XZ provides a better compression ratio than Zstd? My data shows that it's true in at least one common use case (distribution of open-source software source archives).

I've seen similar comparative ratios from files up to the multi-gigabyte range, for example VM images. In what cases have you seen XZ produce worse compression ratios than Zstd?


Generally speaking, the top-end of xz very slightly beats the top-end of zstd. However, xz typically takes several times as long to extract. And generally I've seen xz take longer to compress than zstd, as well.

Example with a large archive (representative of compiled software distribution, such as package management formats):

    $ time xz -T0 -9k usrbin.tar 
    
    real 2m0.579s
    user 8m46.646s
    sys 0m2.104s
    
    $ time zstd -T0 -19 --long usrbin.tar 
    real 1m47.242s
    user 6m34.845s
    sys 0m0.544s
    /tmp$ ls -l usrbin.tar*
    -rw-r--r-- 1 josh josh 998830080 Jul 23 23:55 usrbin.tar
    -rw-r--r-- 1 josh josh 189633464 Jul 23 23:55 usrbin.tar.xz
    -rw-r--r-- 1 josh josh 203107989 Jul 23 23:55 usrbin.tar.zst
    /tmp$ time xzcat usrbin.tar.xz >/dev/null

    real 0m9.410s
    user 0m9.339s
    sys 0m0.060s
    /tmp$ time zstdcat usrbin.tar.zst >/dev/null
    
    real 0m0.996s
    user 0m0.894s
    sys 0m0.065s
Comparable compression ratio, faster to compress, 10x faster to decompress.

And if you do need a smaller compression ratio than xz, you can get that at a cost in time:

    $ time zstd -T0 -22 --ultra --long usrbin.tar 

    real 4m32.056s
    user 9m2.484s
    sys 0m0.644s
    $ ls -l usrbin.tar*
    -rw-r--r-- 1 josh josh 998830080 Jul 23 23:55 usrbin.tar
    -rw-r--r-- 1 josh josh 189633464 Jul 23 23:55 usrbin.tar.xz
    -rw-r--r-- 1 josh josh 186113543 Jul 23 23:55 usrbin.tar.zst
And it still takes the same amount of time to extract, 10x faster than xz.


That seems fine -- it's a tradeoff between speed and compression ratio, which has existed ever since compression went beyond RLE.

Zstd competes against Snappy and LZ4 in the market of transmission-time compression. You use it for things like RPC sessions, where the data is being created on-the-fly, compressed for bandwidth savings, then decompressed+parsed on the other side. And in this domain, Zstd is pretty clearly the stand-out winner.

When it comes to archival, the wall-clock performance is less important. Doubling the compress/decompress time for a 5% improvement in compression ratio is an attractive option, and high-compression XZ is in many cases faster than high-compression Zstd even delivering better ratios.

---

EDIT for parent post adding numbers: I spot-tested running zstd with `-22 --ultra` on files in my archive of source tarballs, and wasn't able to find cases where it outperformed `xz -9`.


I think you're missing the point that in terms of tradeoffs people are willing to make: absolute compression ratio loses to 80% of the compression ability with big gains to decompression speed (aka include round trip cpu time if you want something to agree / disagree with, we're not talking about straight compression ratios).

Arch Linux is a case study in a large distributor of open source software that switched from xz compressed binaries to zstd and they didn't do it for teh lulz[0].

[0] https://archlinux.org/news/now-using-zstandard-instead-of-xz...


I'm not missing the point. I'm responding to the thread, which is about whether XZ offers better compression ratios than Zstd.

Whether it's faster in some, many, or most cases isn't really relevant.


Yup, and "how much better" is about 1%. "zstd and xz trade blows in their compression ratio. Recompressing all packages to zstd with our options yields a total ~0.8% increase in package size on all of our packages combined, but the decompression time for all packages saw a ~1300% speedup."


Yes, because ratio is the only thing that matters. Decompression speed doesn't matter at all. Who cares if Zstd is 9 times faster?


In my experience the normal squashfs kernel driver is quite slow at listing/traversing very large archives (several GB+). For some reason squashfuse is MUCH faster for just looking around inside.


Does SquashFS support cross-file compression? - i.e. how well does it compress a folder with a number of similar files?


I recently came to the same determination looking for a better way to package sosreports (diagnostic files from Linux machines). The pieces are there for indexed file lists and also seekable compression, but basically nothing else implements them in a combined fashion with a modern compression format (mainly zstd).


I use lzop, as it has faster compression/decompression. Is there a specific reason to prefer zstd?


Zstd can achieve good compression ratio for many data patterns and is super fast to create, and decompresses at multiple GB/s with a halfway decent CPU, often hitting disk bandwidth limits.

I've never tried lzop or met someone who advocated to use it. Needs research, perhaps. Until then I'm healthily skeptical.


LZO has been around since the 90's. There are multiple distros which use it (as an option) for archive downloads. It was the recommended algorithm to use for btrfs compressed volumes for years. It's pretty standard, about as common as .bz2 in my experience.


LZO has been around for 25 years if that helps.

I've also used it especially when resources are constrained (low performance CPU, ram, or when executable size really matters)


When does it make sense to use lzop (or the similar but more widely recommended LZ4) for static storage? My impression was that it was a compressor for when you want to put fewer bytes onto the transmission / storage medium at negligible CPU cost (in both directions) because your performance is limited by that medium (fast network RPC, blobs in databases), not because you want to take up as little space as possible on your backup drive. And it does indeed lose badly in compression ratio even to zlib on default settings (gzip -6), let alone Zstandard or LZMA.


I used lzop in the past when speed was more of a concern than compressed size (big disk images etc.)

For a couple of years now I have switched to zstd. It is both fast and well compressed by default in that use case, no need to remember any options.

No, I have NOT done any deeper analysis, except a couple of comparisons which I haven't even documented. But they ended up slightly in favor of zstd, so I switched. Nothing dramatic that would make me say forget lzop, though.

Edit: NOT


> For read-only long-term filesystem-like archives, use squashfs with whatever compression option is convenient

Is there a way to extract the files without mounting the filesystem?



I thought DwarFS[0] is better than SquashFS?

[0] https://github.com/mhx/dwarfs


How good is zstd for long term archival vs e.g. 7z? The latter appears (I have no data to back it up) to be vastly more popular at the moment.


7z was popular in the Windows world when I still used that more than 10 years ago because Windows contained nothing reasonable.

I have never really seen it in the Linux world. There are several alternatives installed in most distros, all except zstd discussed in the article.


It was really disappointing when dpkg actively deprecated support for lzma format archives (which I feel they should never do) and went all-in on xz. The decompressor now needs the annoying flexibility mentioned in this article, and the only benefit of the format--the ability to do random access on the file--is almost entirely defeated by dpkg using it to compress a tar file (which barely supports any form of accelerated access even when uncompressed; the best you can do is kind of attempt to skip through file headers, which only helps if the files in the archive are large enough). And, to add insult to injury, the files are now all slightly larger to account for the extra headers :/.

Regardless, this is a pretty old article and if you search for it you will find a number of discussions that have already happened about it that all have a bunch of comments.

https://news.ycombinator.com/item?id=20103255

https://news.ycombinator.com/item?id=16884832

https://news.ycombinator.com/item?id=12768425


pixz (https://github.com/vasi/pixz) is a nice parallel xz that additionally creates an index of tar files so you can decompress individual files. I wonder if dpkg could be extended to do something similar.
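
If I remember the invocation right, it's something like this:

    pixz foo.tar                         # produces foo.tpxz with an index of the tar members
    pixz -l foo.tpxz                     # list the members
    pixz -x dir/file < foo.tpxz | tar x  # decompress only what's needed for one member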


I disagree with the premise of the article. Archive formats are all inadequate for long-term resilience and making them adequate would be a violation of the “do one thing and do it right” principle.

To support resilience, you don’t need an alternative to xz, you need hashes and forward error correction. Specifically, compress your file using xz for high compression ratio, optionally encrypt it, then take a SHA-256 hash to be used for detecting errors, then generate parity files using PAR[1] or zfec[2] to be used for correcting errors.

[1] https://wiki.archlinux.org/title/Parchive

[2] https://github.com/tahoe-lafs/zfec
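
In shell terms, the pipeline I mean is roughly this (par2 shown; zfec would be analogous):

    xz -9 data.tar                               # high-ratio compression -> data.tar.xz
    sha256sum data.tar.xz > data.tar.xz.sha256   # error detection
    par2 create -r10 data.tar.xz                 # error correction, 10% redundancy as an example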


Folks seem to be comparing xz to zstd, but if I am understanding correctly the true competitor to xz is the article author’s “lzip” format, which uses the same LZMA compression as xz but with a much better designed container format (at least according to the author).


I'd not say it's necessarily better designed - it's just simpler. A few bytes of headers, LZMA compressed data, a checksum, done. No support for seeking and stuff like that.


Emphasis on “according to the author,” yeah. And the concerns are silly — FEC can and should be added outside the compressed data (e.g., with par2).


But why outside the compression container? I'd love to have formats that let me repair their contents up to the number of parity bits they store.


I agree that a single file is super convenient. It looks like the in-progress par3 will support that (par3 inside zip, I'd add an mtree file as well for metadata if needed):

https://parchive.github.io/doc/Parity_Volume_Set_Specificati...

I don't use par2 right now for backups due to the extra hassle, although I might use par3. It might end up a waste of space, though, since I haven't had an issue with corrupted files in at least 20 years (it would be great for optical media that can get scratched).

For the longest term archiving I'm not sure any kind of compression is a good idea, I'd think along the lines of uncompressed copy + parity info in a simple container where all files are contiguous (maybe a raw zip file without compression).


The par3 spec has been in development for so long that I have a hard time believing in its completion and uptake.

I've thought of other integrations as well. A file system that would have integrated parity (much like ZFS/btrfs checksumming, but recoverable up to the parity you allocated). Or even integrated into 'regular' file formats: there could be an MP3 tag/header that holds parity for its data stream. Alas, it seems that people have decided this will be solved through other means; par2 is rare as it is.

Shameless plug: I wrote a GUI tool to make it slightly easier to work with par2 and a big dir of archive data: https://pypi.org/project/par2deep/


The vast majority of the discussion is around xz's inability to deal with corrupted data. That said, couldn't you argue that that needs to be solved at a lower level (storage, transport)? I'm not convinced the compression algorithm is the right place to tackle this.

Just use a file system that does proper integrity checking/resilvering. Also use TLS to transfer data over the network.


The article is about usage of xz for long term archival, so transport is not relevant, the concern seems to be bitrot and forward compatibility.

Storage with integrity checking would be the solution to bitrot, but TFA also seems concerned with "how do you unarchive/recover a random file you found?" which seems a somewhat valid concern.

And xz does have support for integrity checking, so it seems reasonable to have a discussion on whether that is a good support, rather than on whether it should be there at all.


> xz does have support for integrity checking

Archives need not only a way to check their integrity but also error correction, which xz does not have.

However, you can easily combine xz with par2, which does provide error correction.


> couldn’t you argue that that needs to be solved at a lower level

Self-referentially incorruptible data is the standard to beat here, moving the concern to a different layer doesn't increase the efficiency or integrity of the data itself.

It is arguably less efficient, as you now rely on some lower layer of protection in addition to whatever is built into the standard itself.

It is less flexible - a properly protected archive format could be scrawled on to the side of a hill, or more reasonably onto an archive medium (BD-disk), and should be able to survive any file-system change, upgrade, etc. Self-repairing hard drives with multiple redundancies are nice, but not cheap, and not wide-spread.

It also does nothing for actually protecting the data - I don't care how advanced the lower-level storage format is, if you overwrite data e.g. with random zeros (conceivable due to a badly behaving program with too much memory access, e.g. a misbehaving virus, or a bad program or kernel driver someone ran with root access; also conceivable due to EM interference or solar radiation causing a program to misbehave), the file system will dutifully overwrite the correct data with incorrect data, including updating whatever relevant checks it has for the file. The only way around this is to maintain a historical archive of all writes ever made, and that is evidently both absurdly impractical (how do you maintain the integrity of this archive? With a self-referentially incorruptible data archive perhaps?) and expensive.

Compared to a single file, which can be backed up, agnostic to the filesystem/hardware/transport/major world-ending events, which can be simply read/recovered, far into the future. There's a pretty clear winner here.


> Self-referentially incorruptible data is the standard to beat here, moving the concern to a different layer doesn't increase the efficiency or integrity of the data itself.

I respectfully disagree. By putting it in the layer below, there is the ability to do repairs.

For example, consider storing XZ files on a Ceph storage cluster. Ceph supports Reed-Solomon coding. This means that if data corruption occurs, Ceph is capable of automatically repairing the data corruption by recomputing the original file and writing it back to disk once more.

Even if XZ were able to recover from some forms of data corruption, is it realistic that such repairs propagate back to the underlying data store? Likely not.


You are thinking in the wrong direction, I tried to explain but maybe I can be clearer:

If you can't read the data in question, you cannot do the repairs; it doesn't matter if you do Reed-Solomon coding or not. You are thinking about coding against failures of the underlying hardware, which is the kind of data corruption such coding is designed to fix - it does not solve the problem of writes coming from above.

To do that, you actually have to decode the data in question and perform a reed-solomon encoding on the actual file inside of the archive, and this only gets worse e.g. if you have nested archives.

If the data is self-referentially repairable, however, it doesn't matter if the file gets overwritten with e.g. a cat gif; the format will work around that. The filesystem, on the other hand, will have written the cat gif to the file and updated the Reed-Solomon encoding for your file, assuming (incorrectly) the file writes were valid.

I suppose you could mandate that for any file to be written to your filesystem it must first be completely decompressed, and then store some encoding information alongside the archive, but this would be inefficient to the extreme, since merely copying a file onto the system would mean you have to decompress the file and then checksum it.

At any rate, even if you did decompress the file in question, you have failed to separate the layers like you want to, since now you have mandated the XZ and LZMA algorithms also be baked directly into the filesystem itself.

Better not to needlessly couple the filesystem to some compression algorithm, let the compression system handle its own error correction.


You use CoW on the filesystem level and take a snapshot every 5 mins. Your concerns about writes coming from above are gone.

The point about being able to use different media like bluray drives is a valid point but since xz doesn't do any correction it doesn't really matter, it has to be done out-of-band anyway.


Right, but my concerns about an overly complex system coming in as a stand-in for a simple one are not.

The simple, cost-effective thing is not to engineer a complex redundancy system above and below to try to adhere to some misguided "separation of concerns"; it's to use the simplest, most effective solution that presents itself.

When you try to separate things which should not be separated in a software (or other) system, you get high coupling, low cohesion. Not everything should be attempted to be "de-coupled".


The article spends a lot of time discussing XZ's behavior when reading corrupt archives, but in practice this is not a case that will ever happen.

Say you have a file `m4-1.4.19.tar.xz` in your source archive collection. The version I just downloaded from ftp.gnu.org has the SHA-256 checksum `63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96`. The GNU site doesn't have checksums for archives, but it does have a PGP signature file, so it's possible to (1) verify the archive signature and (2) store the checksum of the file.

If that file is later corrupted, the corruption will be detected via the SHA-256 checksum. There's no reason to worry about whether the file itself has an embedded CRC-whatever, or the behavior with regards to variable-length integer fields. If a bit gets flipped then the checksum won't match, and that copy of the file will be discarded (= replaced with a fresh copy from replicated storage) before it ever hits the XZ code.
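
Concretely, that amounts to nothing more than:

    gpg --verify m4-1.4.19.tar.xz.sig m4-1.4.19.tar.xz   # check the upstream signature once
    sha256sum m4-1.4.19.tar.xz                           # record the checksum, compare on every later read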

If this is how the lzip format was designed -- optimized for a use case that doesn't exist -- it's no wonder it sees basically no adoption in favor of more pragmatic formats like XZ or Zstd.


I have no idea why you think that use case does not exist. Your whole idea about archiving seems to be that it is to ensure a blob doesn't change (same hash). But that's far from the only use of an archive. (Hell, even with that, you are assuming you know the correct hash of the file to begin with, which isn't guaranteed.)

"Repairing" corrupt archives, as in to get as much as usable data from that archive is a pretty useful thing and I have done it multiple times. For example, an archive can have hundreds of files inside and if you can recover any of them that's better than nothing. It is also one of the reason I still use WinRAR occasionally due to its great recovery record (RR) feature.

>replaced with a fresh copy from replicated storage

Lots of times you don't have another copy.


The process of long-term archival starts with replication. A common approach is two local copies on separate physical media, and one remote copy in a cloud storage service with add-only permissions. This protects against hardware failure, bad software (accidental deletion, malware), natural disasters (flood, fire) and other 99th-percentile disaster conditions. The cloud storage providers will have their own level of replication (AWS S3 has a 99.999999999% durability SLA).

If you have only one copy of some important file and you discover it no longer matches the stored checksum, then that's not a question of archival, but of data recovery. There's no plausible mechanism by which a file might suffer a few bitflips and be saved by careful application of a CRC -- bitrot often zeros out entire sectors.


> There's no plausible mechanism by which a file might suffer a few bitflips and be saved by careful application of a CRC -- bitrot often zeros out entire sectors.

A CRC, no, absolutely not. But this is exactly what PAR2 recovery records do, they do it well, and they (or their equivalents) need to be easier to enable in more places.

Setting up a replication and durability scheme is a major pain in the ass. Passing the `--add-recovery-record` switch on the command line is very, very easy, and it is good enough for many cases where "best effort" protection against corruption is all that is needed.


It's simply not sufficient to have multiple copies. It's way too easy to propagate errors in a way that slips under the radar, which then screws you over 5 years later. The main idea of long-time archival is redundancy. Replication is one form of redundancy, but it's not the only one and not the only one you should use.


This is nonsense. Replication to storage in different failure domains is quite sufficient to ensure long-term data preservation, and errors cannot "propagate" to archives unless your risk model involves angry wizards.


Bit rot is a thing. It seems like you have a different idea of what archival means than most of us.


None of this addresses the criticism levied in the article, nor does it defend xz's inconsistent design decisions that are all over the place.

Why shouldn't we try to squeeze the highest rate of data recovery out of the unlikely event that we're left with the only remaining copy if it costs nothing extra (just the choice of one archiver over another)?

Should you choose xz for future archival purposes? No.

Should Debian make an active effort to switch away from xz now? Probably not, as their primary concern is distribution, not archival, and xz is good enough.


  > None of this addresses the criticism levied in the article
The article's criticism isn't worth addressing (or reading). Nothing it complains about is important.

  > nor does it defend xz's inconsistent design decisions that are all
  > over the place.
The level of inconsistency that the article complains about doesn't matter. Pretty much every popular format looks like that. Try writing a Matroska or PDF decoder some time.

  > Why shouldn't we try to squeeze the highest rate of data recovery
  > out of the unlikely event that we're left with the only remaining
  > copy if it costs nothing extra
Because it isn't important.

For cases where recovery of data from a corrupt compressed stream is important, you'd wrap the compressed data in a container with built-in error correction. Then you'd use that format's error correction to recover the correct compressed data stream, and feed that to your decompressor.

  > Should you choose xz for future archival purposes? No.
Yes you should. XZ is fine, despite the article's silliness. It's better than gzip or bzip2 and nearly as widely supported.

If there's a replacement for XZ as a general-purpose compression format for archived data then it'll be selected on the quality of its compression, not whether the bitstream format can produce valid output from invalid input.


> The level of inconsistency that the article complains about doesn't matter. Pretty much every popular format looks like that. Try writing a Matroska or PDF decoder some time.

So a format should be designed wrong because other formats are also designed wrong.

> Because it isn't important.

So? You don't have to go out of your way to micro-optimize, others like TFA's author are already doing it for you. You just have to pick the micro-optimized product off the shelf.

"It isn't important" is a complete non-argument when the bad choices are made for no reason and no benefit at all and especially not even making the design process quicker and easier.

> If there's a replacement for XZ as a general-purpose compression format for archived data then it'll be selected on the quality of its compression, not whether the bitstream format can produce valid output from invalid input.

The article also addresses design problems that harm the compression ratio.


In 2022, it should be quite rare for an archive to get corrupted.

Internet connections are so much better now and almost 100% of downloads complete successfully.


To my own surprise I have seen a corrupted file in 2022. Only 70 MB, so really tiny compared to many files handled today.

The file had been built in Europe and after installing it to a US system we wondered why it did not work.

Haven't seen anything like that for many years so it took us a while to even consider the option that it could be corrupted.

(No, we did not spend any time checking whether the corruption showed any interesting pattern like a single bit flip, a block of zeros or anything like that. Transferring it again just fixed it.)


The larger the file the more chance it will get corrupted by cosmic rays. Media also physically decays over time.

Error correction and redundancy is essential, which is why I use par2 and dvdisaster on all my archives.


Shouldn't the download client check for corruption before the download completes?


I've had a few experiences trying to recover data from old hard drives or even tape drives. The general experience was that either it works perfectly or the drive is covered in bad sectors and large chunks are unreadable. I don't dispute bitrot exists, but there does seem to be an awful lot of discussion on the internet about an issue that is not the most likely failure mode.


Bitrot is generally from two sources:

* At the sector level in physical media (tapes, disk drives, flash). The file will be largely intact, but 4- or 8-KiB chunks of it will be zero'd out.

* At the bit level, when copying goes wrong. Usually this is bad RAM, sometimes a bad network device, very occasionally a bad CPU. You'll see patterns like "every 64th byte has had its high bit set to 1".

In both cases, the only practical option is to have multiple copies of the file, plus some checksum to decide whether a copy is "good". Files on disk can be restored from backup, bad copies over the network can be detected by software and re-transmitted.


> In both cases, the only practical option is to have multiple copies of the file, plus some checksum to decide whether a copy is "good".

Well, or FEC blocks (eg, par2). Might be insufficient for the every 64th byte case, but probably enough for a few zeroes sectors.


It is a case that will happen, when you long-term archive files. Which is exactly what the article discusses. Bit-rot is a real thing that really happens. The argument makes the case that XZ is a poor choice for such a case where you possibly have only one copy and can't just download a new un-corrupted copy.


If you want to archive data, you need multiple copies.

XZ or not doesn't matter here. Even if you have a completely uncompressed tar file, if you only have one copy of it and lose access to that copy (whether bitrot, or disaster, or software error) then you've lost the data.


No, you need redundancy. Multiple copies isn't sufficient without an appropriate form of data storage. I have no idea why people think a single solution is necessary or sufficient.


I don't know what you mean by that, and I suspect you don't either.

If I have a copy on my NAS, on a local backup disk, and in GCS, then there's no plausible risk to that data. I could go further and put another copy into AWS Glacier, or write it to tape and store it at my bank. At enterprise price points there's vendors like Iron Mountain who will store tapes by the container-load.

To claim that multiple copies is insufficient is absurd.


What’s the difference between redundancy and multiple copies? I think you’re agreeing with GP but you’ve framed your comment as disagreement.


In the case of storage and servers, redundancy usually maps to uptime and availability. Think RAID or HA.

Copies are distinct copies. If your RAID catches fire, you want a copy that's somewhere else. Think external drive.

In terms of backups, while you might want redundancy for availability, you want distinct copies in separate places ideally. So that if your building catches fire and takes your RAID, at least you have a copy somewhere else.

A lot of these copies aren't in real time. That's what makes them an easy backup solution. A backup is a snapshot that allows you to go back to a point in time and see/recover things. Redundancy won't protect you against someone deleting their home folder. If the copies are in real time, that's gone too.

So, copies aren't always backups and a backup isn't always a copy.


Archival formats have always been of interest to me, given the very practical need to store a large amount of backups across any number of storage mediums - documents, pictures, music, sometimes particularly good movies, even the occasional software or game installer.

Right now, I've personally settled on using the 7z format: https://en.wikipedia.org/wiki/7z

The decompression speeds feel good, the compression ratios also seem better than ZIP and somehow it still feels like a widely supported format, with the 7-Zip program in particular being nice to use: https://en.wikipedia.org/wiki/7-Zip

Of course, various archivers on *nix systems also seem to support it, so so far everything feels good. Though of course the chance of an archive getting corrupt and no longer being able to decompress it and read all of those files, versus just using the filesystem and having something like that perhaps happen to a single file, still sometimes bothers me.

Then again, on a certain level, I guess nothing is permanent and at least it's possible to occasionally test the archives for any errors and look into restoring them from backups, should something like that ever occur. Might just have to automate those tests, though.

Yet, for the most part, going with an exceedingly boring option like that seems like a good idea, though the space could definitely use more projects and new algorithms for even better compression ratios, so at the very least it's nice to see attempts to do so!


7zip also uses LZMA under the hood, just like xz and lzip, but its recovery is also poor.


A format for archives, that is contractually built to last, has an impressive test suite, is easily browseable, where blobs can be retrieved individually if needed and is already known everywhere ?

Sounds like SQlite, yet again: https://www2.sqlite.org/sqlar.html


> A format for archives, that is contractually built to last

Who would you be contracting to build a file format?


Not the file format itself but SQLite: Airbus is using it for its A350 line, so it is guaranteed they will do what it takes to keep it working for as long as the plane exists.


I routinely distribute a ~5meg .xz (~20meg uncompressed) to 20k+ servers across multiple physical data centers on a regular basis. Haven't seen a single failure. It ends up being 1.85megs smaller than the tgz version. Unless someone comes up with a better solution (ie: smaller), I probably won't change that any time soon.


You may be interested in bsdiff[0] or Courgette[1]. If the file is a new version of what you deployed previously, you can use these programs to produce binary patch files. Diff the old and new binaries, and you'll only need to transmit the patch.

bsdiff is a generic binary differ that is widely available. Courgette is optimized for executables and has an uncommon build system, but claims to produce files ~85% smaller than bsdiff.

[0] https://www.daemonology.net/bsdiff/

[1] https://www.chromium.org/developers/design-documents/softwar...
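
Basic bsdiff usage, with made-up file names:

    bsdiff old-release.bin new-release.bin update.patch    # on the build machine
    bspatch old-release.bin new-release.bin update.patch   # on each server: rebuilds new-release.bin from the old one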


True. The issue is that this adds additional complexity on the receiving end that I didn't want to engineer for.


> Unless someone comes up with a better solution (ie: smaller)

Be careful what you wish for! If you care about smaller above all else, there are much better compression schemes nowadays.

NNCP is in the range of practical, and beats LZMA: https://bellard.org/nncp/

If you really want to go crazy, large language models like BLOOM can be repurposed for compression; the Chinchilla paper lists a 0.3 bit-per-byte compression ratio on GitHub code.

Of course, the cost is in GPU hardware, or in time.


You are not using for long term archival though, it seems you are using it for deployment.


  But the tradeoff between availability and integrity is different for data transmission than for data archiving. When transmitting data, usually the most important consideration is to avoid undetected errors (false negatives for corruption), because a retransmission can be requested if an error is detected. Archiving, on the other hand, usually implies that if a file is reported as corrupt, "retransmission" is not possible. Obtaining another copy of the file may be difficult or impossible. Therefore accuracy (freedom from mistakes) in the detection of errors becomes the most important consideration.
Part of the issue with archival is transmission. I transmit these files over the internet (github->cloudflare->server) and I haven't seen a single failure of a file to unxz after transmission. This implies that I should see issues... but I really haven't.


Indeed. But if you transmit over TCP/IP, maybe SSL, probably over reliable links (github<->cloudflare<->server), seeing corruption is unlikely. And if unxz fails, in your case, you can just re-transmit.

So using xz may work well for you and using it is not a big deal, but that does not make the format reliable for long term archival and as a sibling wrote, you probably could use lzip instead for the same benefits (and actually saving a few bytes because there are no extraneous headers) but none of the problems discussed in the article.


What about lzip? The author claims the compression size is comparable or better.


It will be the same; it also just uses LZMA. It has fewer headers, but that's not going to make much of a difference on a 5M file.


Compressed to 5mb. Original is about 20.


I haven't tried that one yet. I'll play around and see what I can find out. 7zip didn't perform as well.

  tar c -C ./build $(BINARY) | gzip -9 - > $(PKG_NAME_GZ)
  tar c -C ./build $(BINARY) | xz -z -9e - > $(PKG_NAME)
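
The lzip equivalent would presumably just swap the filter (the output variable name here is made up):

  tar c -C ./build $(BINARY) | lzip -9 > $(PKG_NAME_LZ)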


It should, if the compression algorithm is actually the same, just the format specification that differs.


I'd recommend checking out zpaq[1]; it's purposed for backups, and has great compression (even on a low setting) for large 100GB+ file collections. However for smaller stuff I use zstd at level 22 in a tar for most things since it's much faster, though a little heavier.

[1] http://mattmahoney.net/dc/zpaq.html


ZPAQ is the name of the tool but ZPAQ is also the name of the container format that gets used. ZPAQ embeds the decompression algorithm in the archive. One could store zstd-compressed blocks in ZPAQ archives as soon as a zpaql decompressor exists (e.g., for brotli there is a slow one implemented in a python subset and compiled to zpaql https://github.com/pothos/zpaqlpy).

I don't know exactly whether other formats are better for seeking and streaming, but since the baseline is tar, ZPAQ (in the 2.0 spec) is already better as it supports deduplication and files can even be updated append-only, and the compression is not an afterthought wrapped around it but well integrated.


One thing that seems to be unmentioned so far in the conversation: xz is public domain, while lzip is subject to the full-blown GPL (v2 or later).

In any case, I don't really bother with compression for my own archival needs. Storage is cheap, and encrypted data is kinda hard to reasonably compress anyway.


Why would you not compress first, then encrypt?


Because I usually think about encryption long before I think about compression; the latter's a bit of an afterthought. Ain't the most logical answer in the world, and if I planned from the outset to both compress and encrypt then I'd do it in that order, but compression usually doesn't cross my mind for archival (whereas encryption is pretty much the default for any data I have that's worth archiving).


This can lead to side-channel attacks, see https://en.m.wikipedia.org/wiki/CRIME


Isn't that a limitation of the implementation, which can be worked around by creating a new implementation with whatever licensing you want based on the format specification?

OTOH, a limitation of the spec cannot be worked around by any new implementation.

(OT - isn't it generally recommended to compress before encrypting? Encrypting is CPU-intensive so the less you have to encrypt the better, also length can be a side-channel, and don't some encryption methods leak the existence of patterns in the source data which compression will eliminate?)


Interesting. I used xz until today for compression, but I think I will use gzip and zstd from now on.


I'll continue using xz - the compression ratio is far better than gzip or zstd. Maybe there will be some new future format to switch to, but zstd is not it.


The rest of the sane world will continue to use gzip because the marginal compression savings is not worth the compatibility issues. I have lost count of the number of times I have urgently needed to extract an xz archive on a system only to find out xz isn't available. Most notably Solaris and embedded systems.


> The rest of the sane world will continue to use gzip because the marginal compression savings is not worth the compatibility issues.

It is definitely worth it to use a modern compression format. I regularly take backups of one service I run. Originally I used gzip, and then on someone's recommendation I tried zstd. Here are the results:

    gzip compression time: 204s
    zstd compression time: 19.1s

    gzip decompression time: 28.2s
    zstd decompression time: 6.4s

    gzip compressed size: 3.0GB
    zstd compressed size: 1.7GB
I ain't ever going back to gzip, especially when it wastes so much disk space and CPU cycles.

> sane world [..] Solaris

...I think we're living in a different world. (:


lzip? Not sure how closely commentators have read to have missed this (obviously plugged by the author) choice...

You are not getting compatibility out of xz, not even with itself, if you cared about maximum compatibility you'd be using zip (understood and supported by all major OS at this point), and it sounds like if you just cared about compression ratio you'd use lzip...

If it were speed, I could see zstd (or maybe bzip2? not sure how it stacks up), or, for plain compatibility with linux/embedded folks, gzip. The use case for XZ seems actually pretty marginal; it's like holding onto .rar files or some obscure file-splitting tool...


I don't need compatibility with other machines, I just need to know that in 20 years I'll be able to decompress my files.

lzip might get there eventually, but it's not there now. xz has enough usage behind it that I'm not concerned.

bzip2 has enough usage but the compression ratio is worse. zstd has neither the usage nor the ratio.


> I just need to know that in 20 years I'll be able to decompress my files

The way the author was talking I wouldn't be sure that's even true, if the authors are mucking about with the container format and xz makes some non-compatible breaking change, you may not even know you can't open old xz files with the new version (it reads versions based upon heuristics rather than a version number, IIRC from the article).


Yup. To my surprise xz -1 is faster and compresses better than gzip -6, the default setting. Guess what I'm using now? I'll use zstd when the GNU tar in my distribution supports it directly without resorting to pipes and shell magic. It would be nice if one could specify the compression level on the GNU tar command line. BSD tar already does all of the above.
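
For reference, a newer GNU tar (1.31+, if I remember right) can already do this, though I haven't checked how far back the -I trick works:

    tar --zstd -cf backup.tar.zst somedir/             # default zstd level
    tar -I 'zstd -19 -T0' -cf backup.tar.zst somedir/  # choose the level explicitly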


zstd has already gotten me better results, but not all the machines I administer have it installed or even available.


look into lzip as well


This is fairly old. When it came up last time, there were robust arguments that xz was characterized unfairly, and that the author’s format wasn’t very good at recovering in most cases either.


Can you serialise a ZFS filesystem into a disk image? I feel like ZFS is the leader in data integrity, redundancy, and compression?


Yes, you can snapshot to a file.
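
Roughly, with placeholder pool/dataset names:

    zfs snapshot tank/archive@2022-07-24
    zfs send tank/archive@2022-07-24 > archive-2022-07-24.zfs   # serialised stream; compress/par2 it as desired
    zfs receive tank/restored < archive-2022-07-24.zfs          # recreate the dataset later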


How about just not compressing things for archival? A few bit errors in uncompressed files would end up as just a few bad characters, whereas a few errors in an uncorrectable compression format might render the entire content useless. Sure, the files are huge, but we're talking about long-term archival. In fact, if the documents are that important, have RAID-style redundancy and multiple-bit ECC in multiple geographic locations as well.


The format of the uncompressed files matters just as much in terms of bit error resilience.

Compression or not, you can always use an additional tool to produce and store extra parity data which can be used to both correct and repair bit errors, which seems like the correct answer for digital archival.


Could anyone recommend a broad scoped evaluation of current compression / archiving formats / algorithms which explores their various merits and failings?


xz got a bit of hype there about 10 years ago, I used it until a couple years ago when I noticed how slow it was with huge DB dumps and how much faster zstd was while still having decent compression.

So I have no idea about all this low level stuff, I just know that zstd is overall better for sysadmins.

But next time I'm doing any sort of scripting that involves compression I'll take a look at squashfs now due to this thread.


Here's a thought: vinyl.

While I haven't done intensive research on this, it occurs to me that plastic lasts a long time. Vinyl records are a format that seems fit for long-term archiving. The format is so obvious that it could be reverse engineered by any future civilization.

So at least they'll know something about our taste in music.


About 4 years ago I had to choose a compression format for streaming database backups, so I compared every option supported by 7z, and xz was the best compromise between performance and compression ratio.


This serves as another example to me that governance and conflict resolution in the Debian project is really poor.

Maintainers are free to do whatever they want, even if it doesn't make any sense at all.


xz makes sense for Debian. The article author links to a Debian mailing list thread as if that somehow provides evidence that Debian made a bad decision. I just read the thread and that is the opposite of that thread's conclusion. The consensus in that thread is that Debian's existing use of xz is just fine for its own purposes.

Debian uses external cryptographic verification for its packages and apt archives so it does not need additional robustness in the compression format.


I was surprised to find that .tar format has no checksums/crc. You need .tar.gz to get a crc during compression.


No one seems to mention lrzip ?


(2016)


At least it isn't considered harmful


Wow, xz just got owned. How do you recover from this?


This article is not new; this is the second time I've encountered it. It seems it was published in 2016.

Most users of xz probably won't read it and will keep using it. And it's probably fine for most use cases, though lzip would most likely be better in every scenario if available.



