I see its dictionary is optimized for web content.
Seems a bit like cheating ;)
https://en.wikipedia.org/wiki/Brotli
> Unlike most general purpose compression algorithms, Brotli uses a pre-defined 120 kilobyte dictionary. The dictionary contains over 13,000 common words, phrases and other substrings derived from a large corpus of text and HTML documents.[6][7] A pre-defined dictionary can give a compression density boost for short data files.
It leads people to believe this will be a compression improvement for all their content, but after they do the work to integrate it they may discover that it's only a slight improvement because their content isn't identical to the content its static dictionary was designed for.
It's at least a very good codec, though, so it's still a win for other data. Just smaller than you might expect.
Also note that the dictionary is strongly biased towards English. Sure, there is some Russian, Chinese, Arabic (and probably some other scripts in there which I don't recognize), but there seem to be more English words in there than all those others combined. If you're compressing small documents in any language other than English, Brotli might not be worth it.
> Unlike other algorithms compared here, brotli includes a static dictionary. It contains 13’504 words or syllables of English, Spanish, Chinese, Hindi, Russian and Arabic, as well as common phrases used in machine readable languages, particularly HTML and JavaScript. The total size of the static dictionary is 122’784 bytes. The static dictionary is extended by a mechanism of transforms that slightly change the words in the dictionary. A total of 1’633’984 sequences, although not all of them unique, can be constructed by using the 121 transforms. To reduce the amount of bias the static dictionary gives to the results, we used a multilingual web corpus of 93 different languages where only 122 of the 1285 documents (9.5 %) are in languages supported by our static dictionary.
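Brotli itself isn't in the Python standard library, but DEFLATE's preset-dictionary feature (zlib's `zdict` parameter) illustrates the same mechanism the quote describes: priming the compressor with bytes it can back-reference without ever transmitting them. The dictionary and payload below are made up for the demo; the point is the size difference on a short file.

```python
import zlib

# Toy "static dictionary" of common HTML substrings -- a hypothetical
# stand-in for Brotli's ~120 kB built-in one.
DICT = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
        b'<title></title></head><body></body></html>')

def deflate(data: bytes, zdict=None) -> bytes:
    """Compress data, optionally priming the window with a preset dictionary."""
    if zdict is None:
        c = zlib.compressobj(level=9)
    else:
        c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()

payload = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
           b'<title>Hi</title></head><body>Hi there</body></html>')

without_dict = deflate(payload)
with_dict = deflate(payload, DICT)

# The preset dictionary lets the compressor emit back-references to bytes
# it never actually sent -- which is exactly where short files gain most.
print(len(payload), len(without_dict), len(with_dict))
```

The gain shrinks as the input grows, since a bigger input quickly builds up its own history to reference, which matches the quote's "boost for short data files".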
It isn't a static dictionary; it just has a (large) initial value, unlike gzip.
Brotli is a great compressor, especially at levels 2-5. Unfortunately, the Google paper on Brotli only runs tests at levels 1 and 11, which I don't get at all given that their stated goal was to replace gzip.
I'm not sure there's anything wrong with it. But I'd be interested to know how it compares across different languages, like Norwegian and Japanese. I'm not saying it's the case, but there's a bit of a difference between "better than gzip for web content" and "better than gzip for web content in English". It'd be nice to see a test across something like the various international Project Gutenberg collections.
Huh, yeah, that does feel like cheating. Even having some sort of shared dictionary feels better than a hard-coded one. I wonder how well it performs on things that aren't text-based?
Well, unlike with fax machines, web protocols shift over time. New tags and conventions are invented. A known shared dictionary seems a better method than a hard-coded static one.
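The catch with a negotiated shared dictionary is coordination: both sides must hold the byte-identical dictionary, or decompression fails. zlib makes this concrete, since a stream compressed with a preset dictionary carries an Adler-32 checksum of that dictionary. A sketch with two made-up "revisions" of a shared dictionary:

```python
import zlib

# Two hypothetical revisions of a negotiated shared dictionary.
DICT_V1 = b'<div class="post"><span class="author"></span></div>'
DICT_V2 = DICT_V1 + b'<nav></nav>'  # a later revision -- different bytes

def compress_with(dictionary: bytes, data: bytes) -> bytes:
    c = zlib.compressobj(zdict=dictionary)
    return c.compress(data) + c.flush()

msg = b'<div class="post"><span class="author">alice</span></div>'
blob = compress_with(DICT_V1, msg)

# Matching dictionary: round-trips fine.
assert zlib.decompressobj(zdict=DICT_V1).decompress(blob) == msg

# Mismatched dictionary: zlib detects it via the checksum and raises an
# error rather than silently producing garbage.
try:
    zlib.decompressobj(zdict=DICT_V2).decompress(blob)
except zlib.error:
    print("dictionary mismatch rejected")
```

So an evolving shared dictionary needs versioning in the protocol, which is extra machinery a hard-coded static dictionary like Brotli's avoids entirely.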