Highlight (benchmark of Perl source code): The results follow: xz -T16 -9 -k - 2...

zimpenfish · 2025-02-01T18:48:03 1738435683

Did a test on a 1.3G text file (output of `find -f -printf ...`); Macbook Pro M3 Max 64GB. All timings are "real" seconds from bash's builtin `time`.

    files.txt         1439563776
    bzip2 -9 -k       1026805779 71.3% c=67 d=53
    zstd --long -19   1002759868 69.7% c=357 d=9
    xz -T16 -9 -k      993376236 69.0% c=93 d=9
    zstd -T12 -16      989246823 68.7% c=14 d=9
    bzip3 -b 256       975153650 67.7% c=174 d=187
    bzip3 -b 256 -j12  975153650 67.7% c=46 d=189
    bzip3 -b 511       974113769 67.6% c=172 d=187
    bzip3 -b 511 -j12  974113769 67.6% c=77s d=186

I'll stick with zstd for now (unless I need to compress the Perl source, I guess.)

(edited to add 12 thread runs of bzip3 and remove superfluous filenames)

zimpenfish · 2025-02-01T21:31:04 1738445464

Since I only have 12 perf cores on this Mac, I tried the xz test again with 12 threads.

    xz -T12 -9 -k      993376236 69.0% c=83 d=9

~10% faster for compression with the same size output.

yencabulator · 2025-02-02T20:02:11 1738526531

That d=9 sure wins the day there, for me.

wongarsu · 2025-02-01T19:43:30 1738439010

Additional benchmarks on the same dataset:

    uncompressed               - 19'291'709'440
    bzip2 -9                   -  3'491'493'993 (sanity check)
    zstd -16 --long            -    593'915'849
    zstd -16 --long=31         -    122'909'756 (requires equivalent argument in decompressor due to needing ~4GB RAM)
    zstd -19 --long            -    505'728'419
    zstd -19 --long=31         -    106'601'594 (requires equivalent argument in decompressor)
    zstd --ultra -22           -    240'330'522
    zstd --ultra -22 --long=31 -     64'899'008 (requires equivalent argument in decompressor)
    rar a -m5 -md4g -s -mt8    -     64'837'044

As you notice my sanity check actually has a slightly different size. Not sure why. The benchmark is a bit underspecified because new perl versions were released in the interim. I used all releases up to perl-5.37.1 to get to the correct number of files. Just treat all numbers to have about 2% uncertainty to account for this difference.

I can't provide compression/decompression times, but the --long or --long=31 arguments should not have major impact on speed, they mostly impact used memory. --long=31 requires setting the same in the decompressor, making that option mostly useful for internal use, not archives meant for public consumption.

As you can see, the benchmark chosen by the author mostly comes down to finding similar data that's far away. I wonder if bzip3 can do this better than other algorithms (especially in less memory) or simply chose default parameters that use more memory.

Edit: added more benchmarks

linsomniac · 2025-02-01T18:02:33 1738432953

My standard test is compressing a "dd" disc image of a Linux install (I use these for work), with unused blocks being zeroed. Results:

    Uncompressed:      7,516,192,768
    zstd:              1,100,323,366
    bzip3 -b 511 -j 4: 1,115,125,019

palaiologos · 2025-02-01T20:59:14 1738443554

Hi, tool author here!

Thank you for your benchmark!

As you may be aware, different compression tools fill in different data type niches. In particular, less specialised statistical methods (bzip2, bzip3, PPMd) generally perform poorly on vaguely defined binary data due to unnatural distribution of the underlying data that at least in bzip3's case does not lend well to suffix sorting.

Conversely, Lempel-Ziv methods usually perform suboptimally on vaguely defined "textual data" due to the fact that the future stages of compression that involve entropy coding can not make good use of the information encoded by match offsets while maintaining fast decompression performance - it's a long story that I could definitely go into detail about if you'd like, but I want to keep this reply short.

All things considered, data compression is more of an art than science, trying to fit in an acceptable spot on the time to compression ratio curve. I created bzip2 as an improvement to the original algorithm, hoping that we can replace some uses of it with a more modern and worthwhile technology as of 2022. I have included benchmarks against LZMA, zstandard, etc. mostly as a formality; in reality if you were to choose a compression method it'd be very dependent on what exactly you're trying to compress, but my personal stance is that bzip3 would likely be strictly better than bzip2 in all of them.

bzip3 usually operates on bigger block sizes, up to 16 times bigger than bzip2. additionally, bzip3 supports parallel compression/decompression out of the box. for fairness, the benchmarks have been performed using single thread mode, but they aren't quite as fair towards bzip3 itself, as it uses a way bigger block size. what bzip3 aims to be is a replacement for bzip2 on modern hardware. what used to not be viable decades ago (arithmetic coding, context mixing, SAIS algorithms for BWT construction) became viable nowadays, as CPU Frequencies don't tend to change, while cache and RAM keep getting bigger and faster.

linsomniac · 2025-02-01T22:10:14 1738447814

Thanks for the reply. I just figured I'd try it and see, and the bzip3 results are extremely good. I figured it was worth trying because a fair bit of the data in that image is non-binary (man pages, config files, shell/python code), but probably the bulk of it is binary (kernel images, executables).

andix · 2025-02-01T22:03:22 1738447402

Shouldn't a modern compression tool, targeting a high compression rate, try to switch its compression method on the fly depending on the input data?

I have no idea about compression, just a naive thought.

supertrope · 2025-02-02T01:04:05 1738458245

7-Zip can apply a BCJ filter before LZMA to more effectively compress x86 binaries. https://www.7-zip.org/7z.html. Btrfs’ transparent compression feature checks if the first block compressed well; if not it gives up for the rest of the file.

kristofferR · 2025-02-02T05:09:52 1738472992

https://en.wikipedia.org/wiki/PAQ

nmz · 2025-02-02T23:51:55 1738540315

If the focus is on text, then the best example is probably the sqlite amalgation file which is a 9mb C file.

linsomniac · 2025-02-01T22:05:12 1738447512

A couple other data points:

    zstd --long --ultra -2:                                1,062,475,298
    zstd --long=windowLog --zstd=windowLog=31 --ultra -22: 1,041,203,362

So for my use case the additional settings don't seem to make sense.

SeptiumMMX · 2025-02-01T19:57:02 1738439822

Given that it's BWT, the difference should be the most prominent on codebases with huge amounts of mostly equivalent files. Most compression algorithms won't help if you get an exact duplicate of some block when it's past the compression window (and will be less efficient if near the end of the window).

But here's a practical trick: sort files by extension and then by name before putting them into an archive, and then use any conventional compression. It will very likely put the similar-looking files together, and save you space. Done that in practice, works like a charm.

hcs · 2025-02-02T00:50:02 1738457402

Handy tip for 7-Zip, the `-mqs` command line switch (just `qs` in the Parameters field of the GUI) does this for you. https://7-zip.opensource.jp/chm/cmdline/switches/method.htm#...

ku1ik · 2025-02-02T09:07:25 1738487245

Ooh, that’s neat. How much improved do you get from this? Is it more single or double digit % diff?

sevg · 2025-02-01T17:30:32 1738431032

To make your comment more useful you’ll want to include compression and decompression time.

Using the results from the readme, seems like bzip3 performs competitively with zstd on both counts.

idoubtit · 2025-02-01T18:30:05 1738434605

I've experimented a bit with bzip3, and I think the results in the readme are not representative. I think it's a handmade pick, with an uncommon input and unfair choices of parameters. And it's made with a HDD, which skews the results even more.

For instance, with a 800 MB SQL file, for the same compression time and optimal parameters (within my capacity), bzip3 produced a smaller file (5.7 % compression ration) than zstd (6.1 % with `--long -15`). But the decompression was about 20× slower (with all cores or just one).

I'm not claim my stupid benchmark is better or even right. It's just that my results were very different from bzip3's readme. So I'm suspicious.

dralley · 2025-02-01T17:33:55 1738431235

Also the compression levels..

linsomniac · 2025-02-01T17:41:20 1738431680

I believe the compression levels are included in the list above.

dralley · 2025-02-01T17:46:17 1738431977

Not for zstd or lzma

linsomniac · 2025-02-01T17:59:53 1738432793

Added, thanks.

jl6 · 2025-02-01T17:46:46 1738432006

A 4x improvement over lzma is an extraordinary claim. I see the author has also given a result after applying lrzip (which removes long-range redundancies in large files), and the difference isn’t so great (but bzip3 still wins). Does the amazing result without lrzip mean bzip3 is somehow managing to exploit some of that long-range redundancy natively?

I’d be astonished if such a 4x result generalized to tarballs that aren’t mostly duplicated files.

wongarsu · 2025-02-01T18:06:11 1738433171

Currently running my own benchmarks, my preliminary results are that zstd becomes competitive again once you add the --long option (so `zstd --long -16 all.tar` instead of `zstd -16 all.tar`). Which is an option that not everyone might be aware of, but whose usefulness should be intuitive for this benchmark of >200 very similar files.

eqvinox · 2025-02-01T23:11:41 1738451501

I'd argue that's actually the lowlight of the README since that is a very poor choice of benchmark. Combining a multitude of versions of the same software massively favors an algorithm good at dealing with this kind of repetitiveness in a way that will not be seen in typical applications.

The "Corpus benchmarks" further down in the README are IMHO much more practically relevant. The compression ratio of bzip3 is not significantly better, but the runtime seems quite a bit lower than lzma at least.

miohtama · 2025-02-01T17:28:17 1738430897

In Linux source benchmark results are interestingly more equal, LZMA still holding up well.

What makes Perl source benchmark special? Deduplication?

mmastrac · 2025-02-01T17:30:14 1738431014

An old friend use to say that Perl is line noise that was given sentience.

ape4 · 2025-02-01T17:43:48 1738431828

This is the source - which is probably C.

bhaney · 2025-02-01T18:13:04 1738433584

It's 246 C files and 3163 perl files

loeg · 2025-02-01T18:07:50 1738433270

Why -T12 for zstd and T16 for xz? How many threads is bzip3 using?

zimpenfish · 2025-02-01T19:07:44 1738436864

From the source, it looks like bzip3 defaults to 1 thread if not explicitly set by arguments.

JoshTriplett · 2025-02-01T17:46:57 1738432017

...using zstd level 16, when zstd goes up to 22. And without turning on zstd's long-range mode.