
I see its dictionary is optimized for web content. Seems a bit like cheating ;) https://en.wikipedia.org/wiki/Brotli

> Unlike most general purpose compression algorithms, Brotli uses a pre-defined 120 kilobyte dictionary. The dictionary contains over 13000 common words, phrases and other substrings derived from a large corpus of text and HTML documents.[6][7] A pre-defined dictionary can give a compression density boost for short data files.


How is it cheating if it was developed specifically for that?


"Cheating" might be a little strong, but "not actually comparable to general purpose tools like Gzip" seems justifiable.


If we are talking about web content then how exactly is it not comparable?


What's wrong with cheating?


It leads people to believe this will be a compression improvement for all their content, but after they do the work to integrate it they may discover that it's only a slight improvement because their content isn't identical to the content its static dictionary was designed for.

It's at least a very good codec, though, so it's still a win for other data. Just smaller than you might expect.
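The effect described above can be reproduced on a small scale with zlib's preset-dictionary support in Python's standard library. This is only a sketch: `PRESET_DICT` and both sample documents are made up for illustration, and this is DEFLATE with a hand-written dictionary, not Brotli or its actual built-in dictionary.

```python
import zlib

# Hypothetical preset dictionary: substrings we expect in small HTML pages.
# (Illustrative only; this is NOT Brotli's built-in dictionary.)
PRESET_DICT = (
    b'<!DOCTYPE html><html><head><title></title></head>'
    b'<body><div class="</div></body></html>'
)

def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress with zlib, optionally primed with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress; the receiver must hold the same dictionary as the sender."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

# Content that resembles the dictionary, and content that does not.
html_doc = (
    b'<!DOCTYPE html><html><head><title>Hi</title></head>'
    b'<body><div class="a">x</div></body></html>'
)
other_doc = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit."

for name, doc in (("html", html_doc), ("other", other_doc)):
    plain = deflate(doc)
    primed = deflate(doc, PRESET_DICT)
    # Round-trip check: the dictionary is required on both ends.
    assert inflate(primed, PRESET_DICT) == doc
    print(f"{name}: raw={len(doc)} plain={len(plain)} primed={len(primed)}")
```

The short HTML document shrinks noticeably when primed, because most of it can be encoded as references into the dictionary. The unrelated text gains little or nothing. The gap also narrows as inputs get longer, since the compressor can then find repeats within the input itself.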


Also note that the dictionary is strongly biased towards English. Sure, there is some Russian, Chinese, Arabic (and probably some other scripts in there which I don't recognize), but there seem to be more English words in there than all those others combined. If you're compressing small documents in any language other than English, it might not be worth it to use Brotli.

Edit: They wrote this about it in http://www.gstatic.com/b/brotlidocs/brotli-2015-09-22.pdf :

> Unlike other algorithms compared here, brotli includes a static dictionary. It contains 13’504 words or syllables of English, Spanish, Chinese, Hindi, Russian and Arabic, as well as common phrases used in machine readable languages, particularly HTML and JavaScript. The total size of the static dictionary is 122’784 bytes. The static dictionary is extended by a mechanism of transforms that slightly change the words in the dictionary. A total of 1’633’984 sequences, although not all of them unique, can be constructed by using the 121 transforms. To reduce the amount of bias the static dictionary gives to the results, we used a multilingual web corpus of 93 different languages where only 122 of the 1285 documents (9.5 %) are in languages supported by our static dictionary.
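The figures in that quote are easy to sanity-check. A quick sketch, using only the numbers quoted above (the variable names are mine):

```python
# Numbers taken from the quoted passage of the Brotli paper.
WORDS = 13_504          # dictionary entries
TRANSFORMS = 121        # transforms applicable to each entry
DICT_BYTES = 122_784    # total size of the static dictionary

# Each entry can be combined with each transform.
sequences = WORDS * TRANSFORMS

# Average length of a dictionary entry, in bytes.
avg_entry_len = DICT_BYTES / WORDS

# Share of test documents in languages the dictionary supports.
supported_share = 122 / 1285

print(sequences)                  # 1633984
print(round(avg_entry_len, 2))    # 9.09
print(f"{supported_share:.1%}")   # 9.5%
```

So the 1’633’984 figure is exactly words times transforms, and the average entry is about 9 bytes long.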


Here's the list of words, by the way. I couldn't find it anywhere in non-hexadecimal form:

https://gist.github.com/xnyhps/677f7c1b444f346bef99

(I cleaned it up a bit to remove newlines and tabs, and a couple that are entirely of unprintable characters.)


It isn't a static dictionary; it just has a (large) initial value, unlike gzip.

Brotli is a great compressor, especially at levels 2-5. Unfortunately, the Google paper on Brotli runs tests at levels 1 and 11. I don't get that at all when their stated goal was to replace gzip.


I'm not sure there's anything wrong with it. But I'd be interested to know how it compares across different languages, like Norwegian and Japanese. I'm not saying it's the case, but there's a bit of a difference between "better than gzip for web content" and "better than gzip for web content in English". It'd be nice to see a test across something like the various international Project Gutenberg collections.


What if a new keyword is added to JavaScript / HTML?


Then I imagine that keyword would be added to the static dictionary.


Then you lose 6 bytes off of optimal.


Huh, yeah that does feel like cheating. Even having some sort of shared dictionary feels better than a hard-coded one. I wonder how well it performs on things that aren't text based?


Since when has being designed and optimized for a specific use case ever been "cheating" in software development?


Static dictionaries are used frequently in data communications to "break the bounds" of Shannon's theorem.

See static Huffman coding in fax machines.
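For anyone unfamiliar with the idea: with a static code, both ends agree on the code table ahead of time, so no table has to be sent with the data. A toy sketch, assuming a made-up four-symbol table (`STATIC_CODE` is hypothetical and is not the actual table any fax standard uses):

```python
# Hypothetical fixed prefix code, agreed on in advance by sender and receiver.
# Frequent symbols get short codes; no code is a prefix of another.
STATIC_CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}
DECODE_TABLE = {bits: sym for sym, bits in STATIC_CODE.items()}

def encode(text: str) -> str:
    """Encode text as a bit string using the shared static table."""
    return "".join(STATIC_CODE[ch] for ch in text)

def decode(bits: str) -> str:
    """Decode by accumulating bits until they match a code word."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE_TABLE:
            out.append(DECODE_TABLE[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a code word")
    return "".join(out)

message = "abacad"
bits = encode(message)
assert decode(bits) == message
print(bits, len(bits))  # 01001100111 11
```

Six symbols take 11 bits instead of the 12 a fixed 2-bit code would need, and the table itself costs nothing on the wire. The catch is the same as with any static dictionary: it only pays off when the data matches the statistics the table was built for.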


Well, unlike fax machines, though, web protocols shift over time. New tags and conventions are invented. A known shared dictionary seems a better method than a hard-coded static dictionary.



