I see its dictionary is optimized for web content. Seems a bit like cheating ;) ...

return0 · on March 22, 2016

How is it cheating if it was developed specifically for that?

Lagged2Death · on March 22, 2016

"Cheating" might be a little strong, but "not actually comparable to general purpose tools like Gzip" seems justifiable.

kami8845 · on March 22, 2016

If we are talking about web content then how exactly is it not comparable?

NelsonMinar · on March 22, 2016

What's wrong with cheating?

kg · on March 22, 2016

It leads people to believe this will be a compression improvement for all their content, but after they do the work to integrate it they may discover that it's only a slight improvement because their content isn't identical to the content its static dictionary was designed for.

It's at least a very good codec, though, so it's still a win for other data. Just smaller than you might expect.
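A rough way to see this effect without Brotli itself is zlib's preset-dictionary support in the Python standard library. This is only a sketch: `HTML_DICT`, `deflate`, `saving`, and both sample inputs are made up for illustration, and the dictionary here is not Brotli's real one.

```python
import zlib

# Hypothetical preset dictionary made of common HTML boilerplate (an assumption
# for illustration; this is NOT Brotli's actual static dictionary).
HTML_DICT = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
             b'<title></title></head><body><div class="container"></div></body></html>')

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress with zlib, optionally priming it with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def saving(data: bytes) -> int:
    """Bytes saved by using the dictionary, compared with plain zlib."""
    return len(deflate(data)) - len(deflate(data, HTML_DICT))

# Input that resembles the dictionary, and input that does not.
web_page = (b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>Hi</title>'
            b'</head><body><div class="container">Hello</div></body></html>')
other = b"Dette er en kort norsk setning uten noe markup i det hele tatt."

print("web page saving:", saving(web_page))
print("other text saving:", saving(other))
```

The input that looks like the dictionary should save far more bytes than the one that does not, which is the gap between "content the dictionary was designed for" and everything else.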

xnyhps · on March 22, 2016

Also note that the dictionary is strongly biased towards English. Sure, there is some Russian, Chinese, Arabic (and probably some other scripts in there which I don't recognize), but there seem to be more English words in there than all those others combined. If you're compressing small documents in any language other than English, it might not be worth it to use Brotli.

Edit: They wrote this about it in http://www.gstatic.com/b/brotlidocs/brotli-2015-09-22.pdf :

> Unlike other algorithms compared here, brotli includes a static dictionary. It contains 13’504 words or syllables of English, Spanish, Chinese, Hindi, Russian and Arabic, as well as common phrases used in machine readable languages, particularly HTML and JavaScript. The total size of the static dictionary is 122’784 bytes. The static dictionary is extended by a mechanism of transforms that slightly change the words in the dictionary. A total of 1’633’984 sequences, although not all of them unique, can be constructed by using the 121 transforms. To reduce the amount of bias the static dictionary gives to the results, we used a multilingual web corpus of 93 different languages where only 122 of the 1285 documents (9.5 %) are in languages supported by our static dictionary.
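
The figures in that quote can be checked with a few lines of arithmetic. The variable names are only labels for the numbers quoted above; the average entry length is derived here and is not stated in the paper.

```python
# Numbers taken from the quoted passage.
words = 13_504          # entries in the static dictionary
transforms = 121        # transforms applied to each entry
dict_bytes = 122_784    # total dictionary size in bytes
supported_docs = 122    # corpus documents in dictionary-supported languages
total_docs = 1285       # corpus documents overall

total_sequences = words * transforms
avg_entry_len = dict_bytes / words
supported_share = supported_docs / total_docs

print(total_sequences)             # 1633984, matching the quoted 1'633'984
print(round(avg_entry_len, 2))     # 9.09 bytes per entry on average
print(f"{supported_share:.1%}")    # 9.5%
```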

xnyhps · on March 22, 2016

Here's the list of words, by the way. I couldn't find it anywhere in non-hexadecimal form:

https://gist.github.com/xnyhps/677f7c1b444f346bef99

(I cleaned it up a bit to remove newlines and tabs, and a couple that are entirely of unprintable characters.)

prirun · on March 23, 2016

It isn't a static dictionary; it just has a (large) initial value, unlike gzip.

Brotli is a great compressor, especially at levels 2-5. Unfortunately, the Google paper on Brotli runs tests at levels 1 and 11. I don't get that at all when their stated goal was to replace gzip.

e12e · on March 22, 2016

I'm not sure there's anything wrong with it. But I'd be interested to know how it compares across different languages, like Norwegian and Japanese. I'm not saying it's the case, but there's a bit of a difference between "better than gzip for web content" and "better than gzip for web content in English". It'd be nice to see a test across something like the various international Project Gutenberg collections.

ape4 · on March 22, 2016

What if a new keyword is added to JavaScript / HTML?

jzymbaluk · on March 22, 2016

Then I imagine that keyword would be added to the static dictionary.

Dylan16807 · on March 22, 2016

Then you lose 6 bytes off of optimal.

donatj · on March 22, 2016

Huh, yeah, that does feel like cheating. Even having some sort of shared dictionary feels better than a hard-coded one. I wonder how well it performs on things that aren't text-based?

efuquen · on March 22, 2016

Since when has being designed and optimized for a specific use case ever been "cheating" in software development?

IncRnd · on March 22, 2016

Static dictionaries are used frequently in data communications to "break the bounds" of Shannon's theorem.

See static Huffman coding in fax machines.

donatj · on March 22, 2016

Well, unlike fax machines, though, our protocols shift over time. New tags and conventions are invented. A known shared dictionary seems like a better method than a hard-coded static dictionary.
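
To make the "known shared dictionary" idea concrete, here is a minimal sketch using zlib preset dictionaries from the Python standard library. Everything in it is an assumption for illustration: the `DICTIONARIES` registry, the one-byte version prefix, and the `encode`/`decode` helpers are hypothetical and do not describe how Brotli or any real protocol negotiates dictionaries.

```python
import zlib

# Hypothetical registry: dictionary version -> bytes both peers have agreed on.
# Version 2 adds newer tags without breaking peers that still send version 1.
DICTIONARIES = {
    1: b'<div class=""></div><span></span><a href=""></a>',
    2: (b'<div class=""></div><span></span><a href=""></a>'
        b'<dialog></dialog><template></template>'),
}

def encode(data: bytes, version: int) -> bytes:
    """Compress with the chosen dictionary and prefix its version number."""
    comp = zlib.compressobj(level=9, zdict=DICTIONARIES[version])
    return bytes([version]) + comp.compress(data) + comp.flush()

def decode(blob: bytes) -> bytes:
    """Read the version prefix, then decompress with the matching dictionary."""
    version, body = blob[0], blob[1:]
    decomp = zlib.decompressobj(zdict=DICTIONARIES[version])
    return decomp.decompress(body) + decomp.flush()

snippet = b'<dialog><template><span>hi</span></template></dialog>'
old = encode(snippet, 1)
new = encode(snippet, 2)

print(len(old), len(new))
print(decode(old) == snippet and decode(new) == snippet)
```

Markup that uses the newer tags should come out smaller with the version 2 dictionary, while older payloads still decode. The trade-off is that both sides have to distribute and agree on each dictionary version, which a hard-coded dictionary avoids.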