That's extremely interesting. Would you happen to still have the code lying arou...

vardump · on Aug 1, 2019

I might still have it on some hard disk that's been sitting unplugged in storage for ages, but it's probably long since lost. I wrote it by trying out different things and seeing how each one affected the gzipped size.

Just use some HTML parser and prune HTML comment nodes and empty elements when it's safe (for example, removing even an empty div is not safe!), collapse whitespace, etc. If the majority of text nodes are in lower case, make sure tags, attribute names, etc. are as well. Make sure all attribute values are written the same way, say attr=5 everywhere, rather than a mix of attr=5, attr='5' and attr="5". Etc. That's all there is to it.
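Not the original code, just a rough sketch of the idea using Python's standard html.parser. The names Normalizer, normalize and gzipped_size are made up for illustration. It drops comments, collapses whitespace outside pre/textarea/script/style, lowercases tag and attribute names (HTMLParser does that on its own), and writes attribute values unquoted when that looks safe. It deliberately does not prune empty elements, since that is the risky part, and it would also throw away IE conditional comments.

```python
import gzip
import html
import re
from html.parser import HTMLParser

# Elements whose text content must be kept verbatim.
PRESERVE = {"pre", "textarea", "script", "style"}
# Conservative guess at values that can be written without quotes.
SAFE_UNQUOTED = re.compile(r"[A-Za-z0-9_.-]+")


class Normalizer(HTMLParser):
    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.preserve_depth = 0

    def handle_starttag(self, tag, attrs):
        parts = []
        for name, value in attrs:
            if value is None:
                parts.append(f" {name}")            # boolean attribute
            elif SAFE_UNQUOTED.fullmatch(value):
                parts.append(f" {name}={value}")    # one consistent style
            else:
                parts.append(f' {name}="{html.escape(value)}"')
        self.out.append(f"<{tag}{''.join(parts)}>")
        if tag in PRESERVE:
            self.preserve_depth += 1

    def handle_startendtag(self, tag, attrs):
        # HTML ignores the trailing slash, so treat <br/> like <br>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in PRESERVE and self.preserve_depth > 0:
            self.preserve_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.preserve_depth == 0:
            data = re.sub(r"\s+", " ", data)        # collapse whitespace runs
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")               # e.g. the doctype

    def handle_comment(self, data):
        pass                                        # comments are dropped


def normalize(html_text):
    parser = Normalizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def gzipped_size(text):
    return len(gzip.compress(text.encode("utf-8"), compresslevel=9))
```

To try it the way described above, run normalize() on a real page and compare gzipped_size() before and after each tweak. Whether a given tweak helps depends on the page, so measure rather than assume.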

Whitespace collapsing alone already saved a lot. It also gets rid of high-frequency chars like linefeeds, which leaves shorter Huffman codes for the data that actually matters.
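To get a feel for how much of the symbol histogram whitespace takes up, something like this works. It is only an illustration: the sample markup and the helper names whitespace_share and collapse are invented here, and collapse is the naive version that ignores pre and friends.

```python
import re
from collections import Counter


def whitespace_share(text):
    """Fraction of characters in text that are whitespace."""
    counts = Counter(text)
    ws = sum(n for ch, n in counts.items() if ch.isspace())
    return ws / len(text)


def collapse(text):
    # Naive: every whitespace run becomes a single space.
    return re.sub(r"\s+", " ", text)


# Invented sample: indented, pretty-printed list markup.
sample = "".join(
    f'<ul>\n    <li>\n        <a href="/item/{i}">Item {i}</a>\n    </li>\n</ul>\n'
    for i in range(50)
)
small = collapse(sample)

# Most frequent symbols before and after collapsing.
print(Counter(sample).most_common(3))
print(Counter(small).most_common(3))
print(whitespace_share(sample), whitespace_share(small))
```

The more of the histogram that spaces and linefeeds occupy, the more they compete with real content for the short codes.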

Study how LZ77 and Huffman coding work.