on the principle that high entropy means it compresses badly. However, that compresses each line on its own rather than against the entire file, so it has a little trouble with very short lines, which compress badly whatever they contain.
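For anyone who wants to poke at the idea, here is a minimal sketch of the per-line approach using Python's `zlib`. It is not the Perl command or the tool discussed here; `line_ratio` and the sample lines are made up for illustration.

```python
import zlib

def line_ratio(line: str) -> float:
    """Compressed size / raw size for one line, compressed on its own."""
    raw = line.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)

if __name__ == "__main__":
    # Hypothetical sample lines: a very short one, an ordinary one,
    # and one that embeds a random-looking hex string.
    samples = [
        "x = 1",
        "total = total + value  # running sum of all values so far",
        "k = 'c2f1a9e07b3d48865e0f9a1b7d6c4e23f8a0b5d1'",
    ]
    for s in samples:
        print(f"{line_ratio(s):.2f}  {s}")
```

The short-line problem shows up right away: the zlib header and checksum alone cost 6 bytes, so a line like `x = 1` ends up with a ratio above 1 even though nothing about it is random.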
which is valid code but indeed seems kind of high in entropy. I was also able to fool it into not flagging a high-entropy line by adding a comment in natural English to it.
I'm on the go but it would be interesting to see comparisons between the Perl command and this tool. The benefit of the Perl command is that it would run out of the box on any non-Windows machine so it might not need to be as powerful to gain adoption.
I learned Go many years ago doing some Advent of Code problems. As I solved each problem, my housemate pestered me for a look and then rewrote my solutions (each needing 10-50 lines of Go) into Ruby one-liners, all the while making fun of Go and my silly programs. I wasn't intending to, but I ended up learning a lot of Ruby that night too.
I guess you could take all lines in the file except the one you're testing, compress them, and measure the compressed size, then add the line and measure again. The delta should then be a fairer score. You could even do this by concatenating all code files and then testing line by line across the entire repo, but that would probably be too slow.
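A rough sketch of that delta idea, assuming `zlib` as the compressor (`marginal_cost` is a name I made up, and it recompresses everything for each line, so it is quadratic in the file size):

```python
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def marginal_cost(lines: list, index: int) -> int:
    """Extra compressed bytes needed for lines[index], given all other lines.

    The tested line is appended at the end so the compressor has already
    seen the rest of the file when it reaches it.
    """
    rest = lines[:index] + lines[index + 1:]
    without = "\n".join(rest).encode("utf-8")
    with_line = "\n".join(rest + [lines[index]]).encode("utf-8")
    return compressed_size(with_line) - compressed_size(without)

def score_file(text: str) -> list:
    """Return (cost, line) pairs, most expensive first."""
    lines = text.splitlines()
    scored = [(marginal_cost(lines, i), line) for i, line in enumerate(lines)]
    return sorted(scored, reverse=True)
```

One caveat: DEFLATE only looks back 32 KiB, so on a large file or a whole concatenated repo only the last 32 KiB before the appended line actually act as context.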
> best compression and therefore best entropy estimate
That's a good point. But the Hutter Prize is for compressing a 1 GB file. On inputs as short as a line of code, gzip doesn't do so badly. For a longer line:
Are there any command-line tools for zip or similar that allow you to predefine a dictionary over one or more files, and then use that dictionary to compress small files?
Which would require the dictionary as a separate input when decompressing, of course?
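Not a command-line answer, but Python's `zlib` module exposes zlib's preset dictionary through the `zdict` argument, which behaves the way you describe: the same dictionary has to be supplied again when decompressing. A small sketch (the helper names and the sample corpus are mine, purely for illustration):

```python
import zlib

def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    # zdict primes the compressor's window with text it may refer back to.
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    # The identical dictionary is required here, or decompression fails.
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

if __name__ == "__main__":
    # Hypothetical "rest of the file" used as the dictionary.
    corpus = b"for i in range(len(items)):\n    total += items[i]\n"
    line = b"for i in range(len(items)):"
    plain = zlib.compress(line, 9)
    primed = compress_with_dict(line, corpus)
    print(len(line), len(plain), len(primed))
    assert decompress_with_dict(primed, corpus) == line
```

zlib only uses up to 32 KiB of dictionary, so for a whole repo you would have to pick what goes into it.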
gzip (or really DEFLATE) does actually come with a small predefined dictionary (the "fixed Huffman codes" in RFC 1951) which is somewhat optimised for Latin letters in UTF-8, but I have not verified that this is indeed what ends up being used when compressing individual lines of source code.
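This can be checked without much effort, since the first three bits of a raw DEFLATE stream say which block type the compressor chose. A sketch, assuming Python's `zlib` (`deflate_block_type` is a made-up helper, and it only inspects the first block):

```python
import zlib

BLOCK_TYPES = {0: "stored", 1: "fixed", 2: "dynamic"}

def deflate_block_type(data: bytes) -> str:
    """Report the DEFLATE block type zlib picked for the first block."""
    # wbits=-15 asks for a raw DEFLATE stream with no zlib/gzip header,
    # so the first byte belongs to the first block header.
    c = zlib.compressobj(9, zlib.DEFLATED, -15)
    raw = c.compress(data) + c.flush()
    # Bit 0 is BFINAL; bits 1-2 are BTYPE (packed least-significant bit first).
    btype = (raw[0] >> 1) & 0b11
    return BLOCK_TYPES[btype]

if __name__ == "__main__":
    for line in [b"x = 1", b"total = total + value  # running sum"]:
        print(deflate_block_type(line), line)
```

Strictly speaking the fixed codes are a fixed table of code lengths (8 bits for byte values 0-143, 9 bits for 144-255) rather than a dictionary of strings, so they save the cost of sending a Huffman table but do not favour particular letters within ASCII.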
It did react to this line