One of the coolest parts of zstd is the simple support for custom dictionaries. If you have a lot of mid-sized blobs that you want to compress separately (so that you can decompress them separately), you can create a common dictionary that covers the entire corpus. In real-world use cases the compression ratio can go from 3x to 9x: https://github.com/facebook/zstd#the-case-for-small-data-com...
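For anyone who hasn't tried it, the library side is roughly the sketch below. This is only an illustration: the synthetic sample corpus, buffer sizes, and compression level are made up, and the CLI exposes the same workflow via zstd --train.

    // Sketch of zstd's dictionary workflow: train once over many small
    // samples, then compress each blob independently against the shared
    // dictionary so every blob stays individually decompressible.
    #include <string>
    #include <vector>
    #include <zdict.h>  // ZDICT_trainFromBuffer
    #include <zstd.h>   // ZSTD_compress_usingDict

    int main() {
        // Synthetic corpus of small, similar records (stand-in for real blobs).
        std::vector<std::string> samples;
        for (int i = 0; i < 10000; ++i)
            samples.push_back("{\"id\":" + std::to_string(i) +
                              ",\"status\":\"ok\",\"region\":\"us-east-1\"}");

        // ZDICT wants one concatenated buffer plus the size of each sample.
        std::string concatenated;
        std::vector<size_t> sampleSizes;
        for (const auto& s : samples) {
            concatenated += s;
            sampleSizes.push_back(s.size());
        }

        // Train a dictionary (16 KB budget here; tune for your data).
        std::vector<char> dict(16 * 1024);
        size_t dictSize = ZDICT_trainFromBuffer(
            dict.data(), dict.size(),
            concatenated.data(), sampleSizes.data(),
            static_cast<unsigned>(sampleSizes.size()));
        if (ZDICT_isError(dictSize)) return 1;  // e.g. corpus too small or too uniform

        // Compress each blob against the shared dictionary.
        ZSTD_CCtx* cctx = ZSTD_createCCtx();
        for (const auto& s : samples) {
            std::vector<char> out(ZSTD_compressBound(s.size()));
            size_t written = ZSTD_compress_usingDict(
                cctx, out.data(), out.size(), s.data(), s.size(),
                dict.data(), dictSize, /*compressionLevel=*/3);
            if (ZSTD_isError(written)) return 1;
            // Persist out[0..written); decompress later with
            // ZSTD_decompress_usingDict() and the same dictionary bytes.
        }
        ZSTD_freeCCtx(cctx);
        return 0;
    }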
What was the final resolution of the patent issue with this library? Is there an active covenant from Facebook not to sue users?
This is a little-known feature borrowed from zlib, and it possibly predates even that.
I ran into this many moons ago while working on a large J2ME app, trying to shrink the zip file. We hit the point where looking at the compression algorithm was our next most reasonable step, and since I had prior zlib experience from decoding PNGs, I dug into it.
I discovered that Sun had partly exposed the dictionary support, and I ended up filing an enhancement request with some numbers from my POC. But as it turns out, Sun had already begun work on their dense archive format, which achieved a multiple of the improvements I was getting, so it went nowhere.
Alphabetizing the constant pool got us a fraction of the benefit, and I discovered that, because of the way the constants are stored, suffix sorting them got us only another 1.1% compression, so I dropped it.
Which only makes things more confusing. Is it dual-licensed with both licenses applying, or dual-licensed at your choice? That only complicates the patent question further.
The whole patent issue was about the BSD + Patents license Facebook used (they addressed the concerns over React by relicensing to MIT and dropping the PATENTS file).
The PATENTS file appears to be present in zstd 1.3.0 but not 1.3.1, so it looks like a similar change has been carried over here?
I'm left with questions though, because the development branch has both a LICENSE file (BSD) and a COPYING file (GPL).
I'd like to be able to load log files into a database and have them take the same space as the compressed log files, not the uncompressed size (which can easily be 20x more). I guess this requires a database built on compact data structures or something similar, but surely simple support for dictionary compression of database text fields would help? Text fields are small chunks of text of the same type, so they should be amenable to good compression with a shared dictionary. I've searched around and found some support in RocksDB (sketched below), but that's pretty much it.
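For reference, the RocksDB support I found looks roughly like this. It's a hedged sketch: max_dict_bytes and zstd_max_train_bytes are the CompressionOptions fields I'm aware of, so check your RocksDB version's options.h for the exact names and defaults; the database path and sample record are made up.

    // Hedged sketch: per-SST-file dictionary compression in RocksDB.
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.compression = rocksdb::kZSTD;

        // Size of the dictionary RocksDB trains per SST file, and how many
        // bytes of sampled blocks it may use for training.
        options.compression_opts.max_dict_bytes = 16 * 1024;
        options.compression_opts.zstd_max_train_bytes = 100 * 16 * 1024;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/logdb", &db);
        if (!s.ok()) return 1;

        // Small, same-shaped text values are exactly what benefits here.
        db->Put(rocksdb::WriteOptions(), "log:000001",
                "2018-04-27T12:00:00Z INFO request handled in 12ms");
        delete db;
        return 0;
    }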
Most compression libraries offer dictionary support, although it's somewhat obscure. For example, there is no function in zlib to actually create a dictionary for you. Bindings and higher-level libraries often ignore the dictionary support entirely.
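zlib just takes raw bytes: you hand the same byte string to deflateSetDictionary() on the way in, and on the way out you supply it via inflateSetDictionary() once inflate() reports Z_NEED_DICT. A minimal sketch, with a made-up dictionary string:

    // Minimal sketch of zlib's dictionary hooks. zlib never builds the
    // dictionary for you; kDict below is a made-up byte string of common
    // substrings (zlib favors strings placed near the end of the dictionary).
    #include <string>
    #include <vector>
    #include <zlib.h>

    static const char kDict[] = "\"region\":\"us-east-1\",\"status\":\"ok\",\"id\":";

    std::vector<unsigned char> compress_with_dict(const std::string& in) {
        z_stream zs{};
        deflateInit(&zs, Z_BEST_COMPRESSION);
        deflateSetDictionary(&zs, reinterpret_cast<const Bytef*>(kDict),
                             sizeof(kDict) - 1);

        std::vector<unsigned char> out(deflateBound(&zs, in.size()));
        zs.next_in   = reinterpret_cast<Bytef*>(const_cast<char*>(in.data()));
        zs.avail_in  = static_cast<uInt>(in.size());
        zs.next_out  = out.data();
        zs.avail_out = static_cast<uInt>(out.size());
        deflate(&zs, Z_FINISH);  // single shot; output fits within deflateBound()
        out.resize(zs.total_out);
        deflateEnd(&zs);
        return out;
    }
    // Decompression: call inflate(), and when it returns Z_NEED_DICT,
    // supply the same bytes via inflateSetDictionary() and continue.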
"Zstandard, or zstd as short version, is a fast lossless compression algorithm, targeting real-time compression scenarios at zlib-level and better compression ratios. It's backed by a very fast entropy stage, provided by Huff0 and FSE library."
A very important aspect of this project is the zlibWrapper. It provides an interface for transparently reading zstd-compressed, zlib-compressed, or uncompressed files with one API. It's quite fast, provides an excellent balance between speed and compression ratio, and is generally my preferred way to work with compressed data.
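Roughly, your existing zlib streaming code doesn't change: you include the wrapper's header instead of zlib.h, link against the wrapper, and the calls get routed. The sketch below is ordinary zlib inflate() code plus that header swap; the wrapper-specific behavior (format auto-detection on read, and switching deflate() output to zstd) is from memory of zlibWrapper/README.md in the zstd repo, so verify against that before relying on it.

    // Sketch: plain zlib inflate() code, but built against zstd's
    // zlibWrapper so the same loop decodes zstd or zlib streams.
    #include <vector>
    #include <zstd_zlibwrapper.h>  // drop-in stand-in for <zlib.h>

    std::vector<unsigned char> decompress_any(const std::vector<unsigned char>& in) {
        z_stream zs{};
        inflateInit(&zs);  // wrapper is expected to sniff zstd vs zlib framing

        std::vector<unsigned char> out;
        unsigned char chunk[16 * 1024];
        zs.next_in  = const_cast<unsigned char*>(in.data());
        zs.avail_in = static_cast<uInt>(in.size());
        int ret = Z_OK;
        while (ret == Z_OK) {
            zs.next_out  = chunk;
            zs.avail_out = sizeof(chunk);
            ret = inflate(&zs, Z_NO_FLUSH);
            out.insert(out.end(), chunk, chunk + (sizeof(chunk) - zs.avail_out));
        }
        inflateEnd(&zs);
        return out;  // caller should check that ret ended up as Z_STREAM_END
    }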
Their emphasis on branchless coding is even more valuable now than it was when zstd was released, since we are living in a post-Spectre/Meltdown world.
I dislike Facebook as a company, but I cannot find fault with their engineering.
[Edit/Off-topic: I want someone to make an LZ77/DEFLATE/Led Zeppelin pun-based product sometime. Please. That joke has been begging to be made since at least 1990.]
In this case, though, the core innovation actually happened pre-Facebook. Post-Facebook it got more polish, tweaking, multi-threading, etc., but I don't recall any breakthrough kind of change.
Which algorithm is best is use-case dependent. As of now, zstd offers best-in-class compression for a wider variety of use cases than lz4. lz4 (created by the same author) still wins for high-throughput software compression, yes. But zstd --fast 4 or 5 are getting pretty close.
It's not obvious to me what the relevant measurements are on the zstd side, but I'm pretty sure lz4 wins considerably where code size and RAM footprint are major considerations, as in some bootloader and embedded firmware situations.
I wonder if it's possible to create a blockchain where the proof-of-work consists of building a better compression dictionary instead of doing useless hashing.
To give a little detail on why this is not possible: a proof-of-whatever should be quick to verify but hard to generate. Compression might sound like it can meet both requirements. However, the raw data needs to be very large for the problem to be non-trivial, and large data is a red flag in any blockchain application.
I wish he’d stop tweaking it at this point. Newer versions are incompatible with older versions, which is a cardinal sin if you want mass adoption for something like this.
Or at least identify a stable subset of some sort and put it into another tool that I would not be hesitant to use.
So when I need something very fast I use Snappy or lz4, and when I need decent compression ratio I use pigz.
Looks like xenial has 0.5, so that's consistent with the grandparent's claim. xenial-updates has 1.3.1, however. Seems like the best option for Ubuntu users would be to pull this from the xenial-updates repo. Also note that Xenial is about to be replaced by 18.04.
Also, be careful with universe packages. They are fast and loose with them. The version of redis commonly deployed on 14.04 has a known security issue that bit us some time ago (CVE-2015-4335). See https://bugs.launchpad.net/ubuntu/trusty/+source/redis/+bug/... .
Zstandard 0.5 (like all versions below 0.8.1) was a pre-release, where forward compatibility was never intended. However, Zstandard is still backward compatible with versions as far back as 0.4. All releases since August 2016 have been forward and backward compatible.