Brotli is such an ugly hack (hardcoded dictionary with a snapshot of the world a...

duskwuff · on May 14, 2021

The contents of that hardcoded dictionary are really weird, too. It includes a lot of bizarrely specific entries like:

    in popular culture
    Holy Roman Emperor
    It is important to
    examples include the
    have speculated that
    aria-hidden="true">·<
    (new Date).getTime()}
    :url" content="http://

Many of the English phrases (like "in popular culture"!) are highly characteristic of content from the English Wikipedia. I've speculated before that this may be the result of the use of a compression benchmark which included (or consisted entirely of) content from that site, such as https://cs.fit.edu/~mmahoney/compression/textdata.html

If you're curious, here's a full dump of the dictionary:

https://gist.github.com/duskwuff/8a75e1b5e5a06d768336c8c7c37...

hifly · on May 15, 2021

however the idea seems sound? I can't help wondering why Google who can fund GP-3, couldn't come up with a better initial dictionary? especially for small responses it is a big win

nvllsvm · on May 14, 2021

It's also impossible to identify whether an arbitrary file is compressed with brotli. It lacks a magic number. https://github.com/google/brotli/issues/298

JyrkiAlakuijala · on May 15, 2021

Static dictionary is not why Brotli reaches excellent compression density. Much of it comes from the more advanced context modeling and in general more dynamic entropy modeling.

Brotli wins over Zstd on web compression even when you fill the static dictionary with zeros.

No Mountain View based engineers participated in the development of Brotli, it was built in Zurich. The static dictionary was distilled from a language corpus originally having 40 languages. To reduce the binary size while keeping effectiveness, I reduced it to six: English, Spanish, Chinese, Russian, Hindi, Arabic. What got into the dictionary and in which order it was (both words and transforms) were chosen by how much entropy they reduced in a relatively large test corpus including diverse material (for example over 110 natural languages).

Brotli was originally designed as a faster to decode replacement of LZMA for fonts to enable WOFF2. LZMA was too slow for that use. WOFF2 is just geometry, no natural languages there. There, W3C observed that Brotli performed similarly to LZMA in density but was significantly (~5x) faster to decode -- and enabled the WOFF2 launch.

Unlike SDCH, Brotli's performance does not degrade when the static dictionary ages. This is because Brotli only uses invalid LZ77 commands to codify static dictionary entries. If you don't use them, you are not paying for them. Just like Zstd, users can bring their own dictionaries when they prefer that -- a functionality that was in Brotli before it was introduced in Zstd.

Unlike ZStd, Brotli degrades less for exotic languages heavy on utf-8 (Korean, Vietnamese, Chinese, Japanese, etc.), and especially so for mixed data such as HTML-commands mixed with the above utf-8 sources.

Unlike LZMA, Brotli's context modeling works for very short data, too -- It becomes about 0.6 % worse on gigabyte corpora, but can be 15+ % better on shortest documents.

Unlike Zstd, Brotli offers the data in a streamable order, i.e., hides less data during the transfer. The user can decode much more out of a brotli stream than a respective fraction of zstd stream (shadowed amount is in tens of bits vs. tens of kilobytes in zstd). This is because brotli does not reshuffle the data within the blocks for less cpu-work at decoding time. Any shadowing of data will make other dependant resources loads start later or dependant processing (Javascript or HTML parsing) to happen later.

Most of Brotli's advances come from other algorithms than LZ77. The LZ77 part of ZStd and Brotli are essentially the same. The LZ77 algorithm is proven optimal when the data is infinitely long as well as the sliding window -- this means that the longer the data, the less difference you see. If you benchmark Brotli vs. Zstd with real life data (like HTML pages of 75 kB) you see a different performance behaviour than if you compare them using a 100 MB or 10 GB file. There, most of Brotli's benefits disappear.

In a gigabyte multi-corpus comparison, Brotli still compresses ~5 % more than Zstd. See: https://github.com/google/brotli/issues/642 -- aggregated here: https://encode.su/threads/2947-large-window-brotli-results-a...

Zstd's encoder has seen more work during the years and some features (like the long maching) is missing from Brotli. Reaching parity in the encoder could reduce the gap. An interesting option would be to have a single encoder for both.

Compiling is also difficult: One continual topic has been performance degradations when moving from GCC/CLANG to MSVC. Brotli was never optimized for MSVC and it seems something is going badly wrong there. Also, in the past several benchmarks were comparing a non-optimized build of Brotli vs a release build of Zstd. This was because until summer 2020 Brotli compiled without options produced a non-optimized build -- you'd need to configure release specifically to get it right.