Correct, xz is no longer particularly useful, mostly annoying.
For read-only long-term filesystem-like archives, use squashfs with whatever compression option is convenient. The cool thing about squashfs is you can efficiently list the contents and extract single files. This is why it's hands-down the best option for long-term filesystem-esque archives.
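For example, something like this (paths are made up; the zstd option needs a reasonably recent squashfs-tools build):

$ mksquashfs /path/to/tree archive.squashfs -comp zstd
$ unsquashfs -l archive.squashfs                     # list contents without extracting
$ unsquashfs -d out archive.squashfs some/dir/file   # pull out a single file into ./out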
I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file, in case you don’t value your sanity). People have made a number of those on top of tar, but they are all limited in various ways by the creators’ requirements, and hardly ubiquitous; and, well, tar is nuts—both because of how difficult and dubiously compatible it is to store some aspects of filesystem metadata in it, and because of how impossible it is to not store others except by convention. Not to mention the useless block structure compared to e.g. cpio or even ar. (Seek indexes for gzip and zstd are also a solved problem in that annoying essentially-but-not-in-practice way, but at least the formats themselves are generally sane.)
Incidentally, the author of the head article also disapproves of the format (not compression) design of zstd[1] on the same corruption-resistance grounds (although they e.g. prohibit concatenability), even though to me the format[2] seems much less crufty than xz.
> I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file).
I wholeheartedly agree! That use-case is not currently covered by any widely known OSS project AFAIK.
Tar is really great for its intended use case (archiving files to tape, with the constraints of late-1970s computing power), and kind of weird in basically all other uses. The block format wastes tons of space unless you compress it, since even simple run-length encoding squeezes out the zero padding (which is why .tar.gz is so common). It's dirt simple but also kind of brittle. Its data model is not really a good fit for storing things that don't look like POSIX files: too much metadata in some ways, while missing other kinds entirely.
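A quick way to see the overhead for yourself (assuming GNU tar defaults, where every member is rounded up to 512-byte blocks and the whole archive to a 10 KiB record):

$ echo hi > tiny.txt
$ tar cf tiny.tar tiny.txt
$ ls -l tiny.txt tiny.tar     # a 3-byte file turns into a ~10 KiB archive
$ gzip -k tiny.tar
$ ls -l tiny.tar.gz           # the zero padding compresses down to almost nothing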
I'm actually working on just such a thing! It's definitely a low-priority side project at this point, but I think the general technique has legs. It was born out of a desire for easier to use streaming container formats that can save arbitrary data streams.
I call it SITO and it's based on a stream of MessagePack objects. If folks are interested in such a thing, I can post what I currently have of the spec, and I'd love to get some feedback on it.
I agree that TAR is problematic, compression is problematic, and there needs to be a mind towards ECC from the get-go.
I could really use some technical discussion to hammer out some of the design decisions that I'm iffy about, and also home in on what is essential for the spec. For example, how to handle seek-indexing, and whether I should use named fields vs. a fixed schema, or allow both.
This is about long-term archiving though. For that you want a wide-spread format that is well-documented and has many independent implementations.
Like zip/7z (with external error recovery), or maybe RAR (with error-recovery records).
Fast compression or decompression is almost entirely meaningless in that context, and compression ratio is also only of secondary importance.
This is why PDF is still considered the best format for long-term archiving of documents, even though there might be things that compress better (djvu/jp2).
I'm still looking for a full-fledged backup and archiving solution that has the following characteristics:
0) error-correction (not just detection!)
1) end-to-end encryption
2) deduplication
3) compression
4) cross-platform implementations
5) (at least one) user interface
6) open source
Both Borg [0] and Restic [1] have long-standing open issues for error correction, but seem to consider it off-strategy. I find that decision kind of strange, since to me the whole purpose of a backup solution is to restore your system to a correct state after any kind of incident.
My current solution is an assembly of shell scripts that combine borg with par2, but I'm rather unhappy with it. For one, I only faintly trust my home-brewed solution (i.e. similar to `don't roll your own crypto`, I think there should be an adage `don't roll your own backup solution`). In addition, I think an error-correcting mechanism should also be available to the less technology-savvy.
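Roughly the kind of pairing I mean, sketched with made-up names (one could also run par2 over the repository's own files instead of an exported tar):

$ borg create --compression zstd /path/to/repo::snap-2024-01 ~/data
$ borg export-tar /path/to/repo::snap-2024-01 snap.tar
$ par2 create -r10 snap.par2 snap.tar    # ~10% Reed-Solomon recovery data
$ par2 repair snap.par2                  # later, if snap.tar turns up damaged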
Paper/master's thesis/nerd-snipe idea: does the availability of Reed-Solomon recovery information for unencrypted files weaken the encryption of their encrypted counterparts?
I have yet to find a conclusive analysis on how well RAR with recovery works for different failure modes.
I mean I can guess it works pretty well for single-bit flips, but how about burst errors, how long can those be? Usually you want to have protection from at least 1 or 2 filesystem blocks, which can be 4 or 8k or even more, depending on the file system. How about repeating error patterns, data deletions, etc.?
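For anyone who wants to poke at it, the relevant switches are roughly these (the percentage is arbitrary):

$ rar a -rr5% backup.rar somedir/    # create with a ~5% recovery record
$ rar t backup.rar                   # test integrity
$ rar r backup.rar                   # attempt repair using the recovery record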
> The cool thing about squashfs is you can efficiently list the contents and extract single files.
What does the story look like for reading files out of squashfs archives stored in an S3-compatible storage system? Can what you mention above be done with byte-range requests versus retrieving the entire object?
"DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS and it uses less CPU resources."
DwarFS may be good, but it's not in the Linux kernel (it depends on FUSE). That makes it less universal, potentially significantly slower for some use cases, and also less thoroughly tested. SquashFS is used by a lot of embedded Linux distros among other use cases, so we can have pretty high confidence in its correctness.
Are you recommending zstd based on compression speed and ratio only? Because as the linked article explains, those are not the only criteria. How does zstd rate with everything else?
Zstd is still just as bad when it comes to the most important point:
>Xz does not provide any data recovery means
For the common use case of compressing a tar archive, this is a critical flaw. One small area of corruption will render the remainder of the archive unreadable.
I don't know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2.
The only compressed format left in common use that handles corruption is the hardware compression in LTO tape drives.
> I don't know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2.
Because monotonically increasing technological progress is a commonly-believed fairy tale. Nowadays, capabilities are lost as often as they're gained. Either because the people who designed the successor are being opinionated or just focusing on something else.
Zstd is terrible for archiving since it doesn't even detect corruption. The --check switch described in the manpage as enabling checksums (in a super-confusing way) seems to do absolutely nothing.
You can test by intentionally corrupting a .zst file that was created with checksums enabled and then watch as zstd happily proceeds to decompress it, without any sort of warning. This is the stuff of nightmares.
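If anyone wants to reproduce the kind of test I'm describing, it goes roughly like this (offsets arbitrary; behaviour may differ between zstd versions):

$ zstd --check somefile               # produces somefile.zst with a content checksum
$ dd if=/dev/urandom of=somefile.zst bs=1 count=4 seek=5000 conv=notrunc   # stomp on a few bytes
$ zstd -t somefile.zst                # integrity test
$ zstd -d somefile.zst -o out.bin     # decompress and see whether it complains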
After all these years, RAR remains the best option for archiving.
Sure, if you look at the Pareto frontier for xz and zstd, zstd does not seem like a “replacement” for xz. It’s not a replacement for PPMd.
The problem is that xz has kind of horrible performance when you crank it up to the high settings. On the medium settings, you can get the same ratio for much, much less CPU (round-trip) by switching to zstd.
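Easy enough to check on your own data with something like this (levels picked arbitrarily as "medium"):

$ time xz -T0 -6 -k bigfile
$ time zstd -T0 -15 -k bigfile
$ ls -l bigfile.xz bigfile.zst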
Do you have any data to support this claim? In my experience, zstd is way better than gzip in every way. Additionally, xz compresses well but is crazy slow to decompress. Xz also only operates on a single file at a time, which is annoying.
I just did a quick test with a copy of GNU M4, which is reasonably representative of a source code archive.
$ time xz -9k m4-1.4.19.tar
real 0m2.928s
user 0m2.871s
sys 0m0.056s
$ time zstd -19 m4-1.4.19.tar
real 0m3.411s
user 0m3.380s
sys 0m0.032s
$ ls -l m4-1.4.19.tar*
-rw-rw-r-- 1 john john 14837760 Jul 24 14:40 m4-1.4.19.tar
-rw-rw-r-- 1 john john 1674612 Jul 24 14:40 m4-1.4.19.tar.xz
-rw-rw-r-- 1 john john 1726155 Jul 24 14:40 m4-1.4.19.tar.zst
In this test, XZ was both faster and had better compression than Zstd.
> xz on highest (normal) compression level easily beats the compression ratio of zstd on highest compression level (--ultra -22) on any data I've tested. However with xz reading the compressed files easily becomes a bottleneck, zstd has great read speeds regardless of compression ratio
Are you agreeing or disagreeing with ars' claim that XZ provides a better compression ratio than Zstd? My data shows that it's true in at least one common use case (distribution of open-source software source archives).
I've seen similar comparative ratios from files up to the multi-gigabyte range, for example VM images. In what cases have you seen XZ produce worse compression ratios than Zstd?
Generally speaking, the top-end of xz very slightly beats the top-end of zstd. However, xz typically takes several times as long to extract. And generally I've seen xz take longer to compress than zstd, as well.
Example with a large archive (representative of compiled software distribution, such as package management formats):
$ time xz -T0 -9k usrbin.tar
real 2m0.579s
user 8m46.646s
sys 0m2.104s
$ time zstd -T0 -19 --long usrbin.tar
real 1m47.242s
user 6m34.845s
sys 0m0.544s
/tmp$ ls -l usrbin.tar*
-rw-r--r-- 1 josh josh 998830080 Jul 23 23:55 usrbin.tar
-rw-r--r-- 1 josh josh 189633464 Jul 23 23:55 usrbin.tar.xz
-rw-r--r-- 1 josh josh 203107989 Jul 23 23:55 usrbin.tar.zst
/tmp$ time xzcat usrbin.tar.xz >/dev/null
real 0m9.410s
user 0m9.339s
sys 0m0.060s
/tmp$ time zstdcat usrbin.tar.zst >/dev/null
real 0m0.996s
user 0m0.894s
sys 0m0.065s
Comparable compression ratio, faster to compress, 10x faster to decompress.
And if you do need a smaller compressed size than xz produces, you can get that at a cost in time:
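For example, an invocation along these lines (no numbers here, since they vary a lot with the input and machine):

$ time zstd -T0 --ultra -22 --long=31 usrbin.tar

(Note that decompressing a --long=31 frame needs --long=31, or a raised --memory limit, on the other end.)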
That seems fine -- it's a tradeoff between speed and compression ratio, which has existed ever since compression went beyond RLE.
Zstd competes against Snappy and LZ4 in the market of transmission-time compression. You use it for things like RPC sessions, where the data is being created on-the-fly, compressed for bandwidth savings, then decompressed+parsed on the other side. And in this domain, Zstd is pretty clearly the stand-out winner.
When it comes to archival, the wall-clock performance is less important. Doubling the compress/decompress time for a 5% improvement in compression ratio is an attractive option, and high-compression XZ is in many cases faster than high-compression Zstd while delivering better ratios.
---
EDIT for parent post adding numbers: I spot-tested running zstd with `-22 --ultra` on files in my archive of source tarballs, and wasn't able to find cases where it outperformed `xz -9`.
I think you're missing the point about the tradeoffs people are actually willing to make: absolute compression ratio loses out to 80% of the compression ability with big gains in decompression speed (i.e. include round-trip CPU time if you want something to agree or disagree with; we're not talking about straight compression ratios).
Arch Linux is a case study in a large distributor of open source software that switched from xz compressed binaries to zstd and they didn't do it for teh lulz[0].
Yup, and "how much better" turns out to be about 1%:
"zstd and xz trade blows in their compression ratio. Recompressing all packages to zstd with our options yields a total ~0.8% increase in package size on all of our packages combined, but the decompression time for all packages saw a ~1300% speedup."
In my experience the normal squashfs kernel driver is quite slow at listing/traversing very large archives (several GB+). For some reason squashfuse is MUCH faster for just looking around inside.
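If anyone wants to compare the two paths themselves, it's roughly this (mount points are made up):

$ sudo mount -o loop,ro archive.squashfs /mnt/sq    # in-kernel squashfs driver
$ squashfuse archive.squashfs ~/sq                  # FUSE driver
$ time find /mnt/sq > /dev/null
$ time find ~/sq > /dev/null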
I recently came to the same determination looking for a better way to package sosreports (diagnostic files from Linux machines). The pieces are there for indexed file lists and also seekable compression, but basically nothing else implements them in a combined fashion with a modern compression format (mainly zstd).
Zstd can achieve good compression ratio for many data patterns and is super fast to create, and decompresses at multiple GB/s with a halfway decent CPU, often hitting disk bandwidth limits.
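zstd's built-in benchmark mode makes it easy to see where the speed/ratio curve sits for your own data, e.g.:

$ zstd -b1 -e19 -T0 somefile    # benchmark compression levels 1 through 19 in memory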
I've never tried lzop or met someone who advocated to use it.
Needs research, perhaps. Until then I'm healthily skeptical.
LZO has been around since the 90's. There are multiple distros which use it (as an option) for archive downloads. It was the recommended algorithm to use for btrfs compressed volumes for years. It's pretty standard, about as common as .bz2 in my experience.
When does it make sense to use lzop (or the similar but more widely recommended LZ4) for static storage? My impression was that it was a compressor for when you want to put fewer bytes onto the transmission/storage medium at negligible CPU cost (in both directions) because your performance is limited by that medium (fast network RPC, blobs in databases), not because you want to take up as little space as possible on your backup drive. And it does indeed lose badly in compression ratio even to zlib on default settings (gzip -6), let alone Zstandard or LZMA.
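That impression is easy to sanity-check on a representative file, e.g.:

$ lzop bigfile          # bigfile.lzo (lzop keeps the input by default)
$ gzip -6 -k bigfile    # bigfile.gz
$ zstd -3 -k bigfile    # bigfile.zst
$ ls -l bigfile*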
I used lzop in the past when speed was more of a concern than compressed size (big disk images etc.)
For a couple of years now I have switched to zstd. It is both fast and well compressed by default in that use case, no need to remember any options.
No, I have NOT done any deeper analysis, except a couple of comparisons which I haven't even documented. But they ended up in slight favor of zstd, so I switched. Nothing dramatic enough, though, to make me say "forget lzop".
For read-only long-term filesystem-like archives, use squashfs with whatever compression option is convenient. The cool thing about squashfs is you can efficiently list the contents and extract single files. This is why it's hands-down the best option for long-term filesystem-esque archives.
https://en.m.wikipedia.org/wiki/SquashFS
For everything else, or if you want (very) fast compression/decompression and/or general high-ratio compression, use zstd.
https://github.com/facebook/zstd
I've used this combination professionally to great effect. You're welcome :)