Correct, xz is no longer particularly useful, mostly annoying.
For read-only long-term filesystem-like archives, use squashfs with whatever compression option is convenient. The cool thing about squashfs is you can efficiently list the contents and extract single files. This is why it's hands-down the best option for long-term filesystem-esque archives.
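For example, something like this (paths are made up; the zstd option needs a reasonably recent squashfs-tools build):

$ mksquashfs /path/to/tree archive.squashfs -comp zstd
$ unsquashfs -l archive.squashfs                     # list contents without extracting
$ unsquashfs -d out archive.squashfs some/dir/file   # pull out a single file into ./out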
I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file, in case you don’t value your sanity). People have made a number of those on top of tar, but they are all limited in various ways by the creators’ requirements, and hardly ubiquitous; and, well, tar is nuts—both because of how difficult and dubiously compatible it is to store some aspects of filesystem metadata in it, and because of how impossible it is to not store others except by convention. Not to mention the useless block structure compared to e.g. cpio or even ar. (Seek indexes for gzip and zstd are also a solved problem in that annoying essentially-but-not-in-practice way, but at least the formats themselves are generally sane.)
Incidentally, the author of the head article also disapproves of the format (not compression) design of zstd[1] on the same corruption-resistance grounds (although they e.g. prohibit concatenability), even though to me the format[2] seems much less crufty than xz.
> I still think there is a place for a streamable, concatenable archive format with no builtin compression, plus an index sidecar for it when you want to trade streamability for seeking (PDF does something like this within a single file).
I wholeheartedly agree! That use-case is not currently covered by any widely known OSS project AFAIK.
Tar is really great for its intended use case (archiving files to tape, with the constraints of late-1970s computing power), and kind of weird in basically all other uses. The block format wastes tons of space unless you compress it, since even simple run-length encoding squeezes out the zero padding (which is why .tar.gz is so common). It's dirt simple but also kind of brittle. Its data model is not really a good fit for storing things that don't look like POSIX files: too much metadata in some ways, while missing other kinds entirely.
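A quick way to see the overhead for yourself (assuming GNU tar defaults, where every member is rounded up to 512-byte blocks and the whole archive to a 10 KiB record):

$ echo hi > tiny.txt
$ tar cf tiny.tar tiny.txt
$ ls -l tiny.txt tiny.tar     # a 3-byte file turns into a ~10 KiB archive
$ gzip -k tiny.tar
$ ls -l tiny.tar.gz           # the zero padding compresses down to almost nothing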
I'm actually working on just such a thing! It's definitely a low-priority side project at this point, but I think the general technique has legs. It was born out of a desire for easier to use streaming container formats that can save arbitrary data streams.
I call it SITO and it's based on a stream of MessagePack objects. If folks are interested in such a thing, I can post what I currently have of the spec, and I'd love to get some feedback on it.
I agree that TAR is problematic, compression is problematic, and there needs to be a mind towards ECC from the get-go.
I could really use some technical discussion to hammer out some of the design decisions that I'm iffy about, and also home in on what is essential for the spec. For example, how to handle seek-indexing, and whether I should use named fields vs. a fixed schema, or allow both.
This is about long-term archiving though. For that you want a wide-spread format that is well-documented and has many independent implementations.
Like zip/7z (with external error recovery), or maybe RAR (with error-recovery records).
Fast compression or decompression is almost entirely meaningless in that context, and compression ratio is also only of secondary importance.
This is why PDF is still considered the best format for long-term archiving of documents, even though there might be things that compress better (djvu/jp2).
I'm still looking for a full-fledged backup and archiving solution that has the following characteristics:
0) error-correction (not just detection!)
1) end-to-end encryption
2) deduplication
3) compression
4) cross-platform implementations
5) (at least one) user interface
6) open source
Both Borg [0] and Restic [1] have long-standing open issues for error correction, but seem to consider it off-strategy. I find that decision kind of strange, since to me the whole purpose of a backup solution is to restore your system to a correct state after any kind of incident.
My current solution is an assembly of shell scripts that combine borg with par2, but I'm rather unhappy with it. For one, I only faintly trust my home-brewed solution (i.e. similar to `don't roll your own crypto`, I think there should be an adage `don't roll your own backup solution`). In addition, I think an error-correcting mechanism should also be available to the less technology-savvy.
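Roughly the kind of pairing I mean, sketched with made-up names (one could also run par2 over the repository's own files instead of an exported tar):

$ borg create --compression zstd /path/to/repo::snap-2024-01 ~/data
$ borg export-tar /path/to/repo::snap-2024-01 snap.tar
$ par2 create -r10 snap.par2 snap.tar    # ~10% Reed-Solomon recovery data
$ par2 repair snap.par2                  # later, if snap.tar turns up damaged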
Paper/master's thesis/nerd-snipe idea: does the availability of Reed-Solomon recovery information for unencrypted files weaken the encryption of their encrypted counterparts?
I have yet to find a conclusive analysis on how well RAR with recovery works for different failure modes.
I mean I can guess it works pretty well for single-bit flips, but how about burst errors, how long can those be? Usually you want to have protection from at least 1 or 2 filesystem blocks, which can be 4 or 8k or even more, depending on the file system. How about repeating error patterns, data deletions, etc.?
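For anyone who wants to poke at it, the relevant switches are roughly these (the percentage is arbitrary):

$ rar a -rr5% backup.rar somedir/    # create with a ~5% recovery record
$ rar t backup.rar                   # test integrity
$ rar r backup.rar                   # attempt repair using the recovery record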
> The cool thing about squashfs is you can efficiently list the contents and extract single files.
What does the story look like for reading files out of squashfs archives stored in an S3-compatible storage system? Can what you mention above be done with byte-range requests versus retrieving the entire object?
"DwarFS compression is an order of magnitude better than SquashFS compression, it's 6 times faster to build the file system, it's typically faster to access files on DwarFS and it uses less CPU resources."
DwarFS may be good, but it's not in the Linux kernel (it depends on FUSE). That makes it less universal, potentially significantly slower for some use cases, and also less thoroughly tested. SquashFS is used by a lot of embedded Linux distros among other use cases, so we can have pretty high confidence in its correctness.
Are you recommending zstd based on compression speed and ratio only? Because as the linked article explains, those are not the only criteria. How does zstd rate with everything else?
Zstd is still just as bad when it comes to the most important point:
>Xz does not provide any data recovery means
For the common use case of compressing a tar archive, this is a critical flaw. One small area of corruption will render the remainder of the archive unreadable.
I don't know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2.
The only compressed format left in common use that handles corruption is the hardware compression in LTO tape drives.
> I don't know how we found ourselves with the best formats for data recovery/robustness fading into obscurity - i.e. lzip and, to a lesser extent, bzip2.
Because monotonically increasing technological progress is a commonly-believed fairy tale. Nowadays, capabilities are lost as often as they're gained. Either because the people who designed the successor are being opinionated or just focusing on something else.
Zstd is terrible for archiving since it doesn't even detect corruption. The --check switch described in the manpage as enabling checksums (in a super-confusing way) seems to do absolutely nothing.
You can test by intentionally corrupting a .zst file that was created with checksums enabled and then watch as zstd happily proceeds to decompress it, without any sort of warning. This is the stuff of nightmares.
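If anyone wants to reproduce the kind of test I'm describing, it goes roughly like this (offsets arbitrary; behaviour may differ between zstd versions):

$ zstd --check somefile               # produces somefile.zst with a content checksum
$ dd if=/dev/urandom of=somefile.zst bs=1 count=4 seek=5000 conv=notrunc   # stomp on a few bytes
$ zstd -t somefile.zst                # integrity test
$ zstd -d somefile.zst -o out.bin     # decompress and see whether it complains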
After all these years, RAR remains the best option for archiving.
Sure, if you look at the Pareto frontier for xz and zstd, zstd does not seem like a “replacement” for xz. It’s not a replacement for PPMd.
The problem is that xz has kind of horrible performance when you crank it up to the high settings. On the medium settings, you can get the same ratio for much, much less CPU (round-trip) by switching to zstd.
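Easy enough to check on your own data with something like this (levels picked arbitrarily as "medium"):

$ time xz -T0 -6 -k bigfile
$ time zstd -T0 -15 -k bigfile
$ ls -l bigfile.xz bigfile.zst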
Do you have any data to support this claim? In my experience, zstd is way better than gzip in every way. Additionally, xz compresses well but is crazy slow to decompress. Xz also only operates on a single file at a time, which is annoying.
I just did a quick test with a copy of GNU M4, which is reasonably representative of a source code archive.
$ time xz -9k m4-1.4.19.tar
real 0m2.928s
user 0m2.871s
sys 0m0.056s
$ time zstd -19 m4-1.4.19.tar
real 0m3.411s
user 0m3.380s
sys 0m0.032s
$ ls -l m4-1.4.19.tar*
-rw-rw-r-- 1 john john 14837760 Jul 24 14:40 m4-1.4.19.tar
-rw-rw-r-- 1 john john 1674612 Jul 24 14:40 m4-1.4.19.tar.xz
-rw-rw-r-- 1 john john 1726155 Jul 24 14:40 m4-1.4.19.tar.zst
In this test, XZ was both faster and had better compression than Zstd.
> xz on highest (normal) compression level easily beats the compression ratio of zstd on highest compression level (--ultra -22) on any data I've tested. However with xz reading the compressed files easily becomes a bottleneck, zstd has great read speeds regardless of compression ratio
Are you agreeing or disagreeing with ars' claim that XZ provides a better compression ratio than Zstd? My data shows that it's true in at least one common use case (distribution of open-source software source archives).
I've seen similar comparative ratios from files up to the multi-gigabyte range, for example VM images. In what cases have you seen XZ produce worse compression ratios than Zstd?
Generally speaking, the top-end of xz very slightly beats the top-end of zstd. However, xz typically takes several times as long to extract. And generally I've seen xz take longer to compress than zstd, as well.
Example with a large archive (representative of compiled software distribution, such as package management formats):
$ time xz -T0 -9k usrbin.tar
real 2m0.579s
user 8m46.646s
sys 0m2.104s
$ time zstd -T0 -19 --long usrbin.tar
real 1m47.242s
user 6m34.845s
sys 0m0.544s
/tmp$ ls -l usrbin.tar*
-rw-r--r-- 1 josh josh 998830080 Jul 23 23:55 usrbin.tar
-rw-r--r-- 1 josh josh 189633464 Jul 23 23:55 usrbin.tar.xz
-rw-r--r-- 1 josh josh 203107989 Jul 23 23:55 usrbin.tar.zst
/tmp$ time xzcat usrbin.tar.xz >/dev/null
real 0m9.410s
user 0m9.339s
sys 0m0.060s
/tmp$ time zstdcat usrbin.tar.zst >/dev/null
real 0m0.996s
user 0m0.894s
sys 0m0.065s
Comparable compression ratio, faster to compress, 10x faster to decompress.
And if you do need a smaller compressed size than xz produces, you can get that at a cost in time:
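For example, an invocation along these lines (no numbers here, since they vary a lot with the input and machine):

$ time zstd -T0 --ultra -22 --long=31 usrbin.tar

(Note that decompressing a --long=31 frame needs --long=31, or a raised --memory limit, on the other end.)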
That seems fine -- it's a tradeoff between speed and compression ratio, which has existed ever since compression went beyond RLE.
Zstd competes against Snappy and LZ4 in the market of transmission-time compression. You use it for things like RPC sessions, where the data is being created on-the-fly, compressed for bandwidth savings, then decompressed+parsed on the other side. And in this domain, Zstd is pretty clearly the stand-out winner.
When it comes to archival, the wall-clock performance is less important. Doubling the compress/decompress time for a 5% improvement in compression ratio is an attractive option, and high-compression XZ is in many cases faster than high-compression Zstd while delivering better ratios.
---
EDIT for parent post adding numbers: I spot-tested running zstd with `-22 --ultra` on files in my archive of source tarballs, and wasn't able to find cases where it outperformed `xz -9`.
I think you're missing the point about the tradeoffs people are actually willing to make: absolute compression ratio loses out to 80% of the compression ability with big gains in decompression speed (i.e. include round-trip CPU time if you want something to agree or disagree with; we're not talking about straight compression ratios).
Arch Linux is a case study in a large distributor of open source software that switched from xz compressed binaries to zstd and they didn't do it for teh lulz[0].
Yup, and "how much better" turns out to be about 1%:
"zstd and xz trade blows in their compression ratio. Recompressing all packages to zstd with our options yields a total ~0.8% increase in package size on all of our packages combined, but the decompression time for all packages saw a ~1300% speedup."
In my experience the normal squashfs kernel driver is quite slow at listing/traversing very large archives (several GB+). For some reason squashfuse is MUCH faster for just looking around inside.
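If anyone wants to compare the two paths themselves, it's roughly this (mount points are made up):

$ sudo mount -o loop,ro archive.squashfs /mnt/sq    # in-kernel squashfs driver
$ squashfuse archive.squashfs ~/sq                  # FUSE driver
$ time find /mnt/sq > /dev/null
$ time find ~/sq > /dev/null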
I recently came to the same determination looking for a better way to package sosreports (diagnostic files from Linux machines). The pieces are there for indexed file lists and also seekable compression, but basically nothing else implements them in a combined fashion with a modern compression format (mainly zstd).
Zstd can achieve good compression ratio for many data patterns and is super fast to create, and decompresses at multiple GB/s with a halfway decent CPU, often hitting disk bandwidth limits.
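zstd's built-in benchmark mode makes it easy to see where the speed/ratio curve sits for your own data, e.g.:

$ zstd -b1 -e19 -T0 somefile    # benchmark compression levels 1 through 19 in memory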
I've never tried lzop or met someone who advocated to use it.
Needs research, perhaps. Until then I'm healthily skeptical.
LZO has been around since the 90's. There are multiple distros which use it (as an option) for archive downloads. It was the recommended algorithm to use for btrfs compressed volumes for years. It's pretty standard, about as common as .bz2 in my experience.
When does it make sense to use lzop (or the similar but more widely recommended LZ4) for static storage? My impression was that it was a compressor for when you want to put fewer bytes onto the transmission/storage medium at negligible CPU cost (in both directions) because your performance is limited by that medium (fast network RPC, blobs in databases), not because you want to take up as little space as possible on your backup drive. And it does indeed lose badly in compression ratio even to zlib on default settings (gzip -6), let alone Zstandard or LZMA.
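That impression is easy to sanity-check on a representative file, e.g.:

$ lzop bigfile          # bigfile.lzo (lzop keeps the input by default)
$ gzip -6 -k bigfile    # bigfile.gz
$ zstd -3 -k bigfile    # bigfile.zst
$ ls -l bigfile*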
I used lzop in the past when speed was more of a concern than compressed size (big disk images etc.)
For a couple of years now I have switched to zstd. It is both fast and well compressed by default in that use case, no need to remember any options.
No, I have NOT done any deeper analysis, except a couple of comparisons which I haven't even documented. But they ended up in slight favor of zstd, so I switched. Nothing dramatic enough, though, to make me say "forget lzop".
For read-only long-term filesystem-like archives, use squashfs with whatever compression option is convenient. The cool thing about squashfs is you can efficiently list the contents and extract single files. This is why it's hands-down the best option for long-term filesystem-esque archives.
https://en.m.wikipedia.org/wiki/SquashFS
For everything else, or if you want (very) fast compression/decompression and/or general high-ratio compression, use zstd.
https://github.com/facebook/zstd
I've used this combination professionally to great effect. You're welcome :)