Interesting. If I had to do this, I would have done something like

    perl -lne 'next unless $_; $z = qx(echo "$_" | gzip | wc -c); printf "%5.2f    %s\n", $z/length($_), $_'
on the principle that high-entropy text compresses badly. However, that uses each line as its own dictionary rather than the entire file, so it struggles a little with very short lines, which always compress badly.

It did react to this line

    return map { $_ > 1 ? 1 : ($_ < 0 ? 0 : $_) } @vs;
which is valid code but does indeed seem fairly high in entropy. I was also able to fool it into not detecting a high-entropy line by adding a comment of natural English to it.

I'm on the go, but it would be interesting to see comparisons between the Perl command and this tool. The benefit of the Perl command is that it runs out of the box on any non-Windows machine, so it might not need to be as powerful to gain adoption.


I learned Go many years ago doing some Advent of Code problems. As I solved each problem, my housemate pestered me for a look and then rewrote my solutions (each needing 10-50 lines of Go) into Ruby one-liners, all the while making fun of Go and my silly programs. I wasn’t intending to, but I ended up learning a lot of Ruby that night too.

Thank you for continuing the tradition.


I guess you could take all lines in the file except the one you're testing and measure the compressed size, then add the line back and measure again. The delta should then be fairer. You could even do this by concatenating all the code files and then testing line by line across the entire repo, but that would probably be too slow.
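
A rough sketch of that in shell (slow, since it recompresses the whole file once per line; the file to scan is the script's argument):

  #!/bin/sh
  # For each line N: compressed size of the whole file minus the
  # compressed size of the file with line N deleted.
  f="$1"
  full=$(gzip -c "$f" | wc -c)
  n=0
  while IFS= read -r line; do
    n=$((n + 1))
    without=$(sed "${n}d" "$f" | gzip -c | wc -c)
    printf '%4d  %s\n' "$((full - without))" "$line"
  done < "$f"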


I have done this trick several times, though I would use a better compressor than gzip.

xz or zstd may be better choices, or you can look at Hutter Prize [1] winners for the best compression and therefore the best entropy estimate.

[1] http://prize.hutter1.net/


> best compression and therefore best entropy estimate

That's a good point. But the Hutter Prize is for compressing a 1 GB file. On inputs as short as a line of code, gzip doesn't do so badly. For a longer line:

  $ INPUT='    bool isRegPair() const { return kind() == RegisterPair || kind() == LateRegisterPair || kind() == SomeLateRegisterPair; }'
  $ echo "$INPUT" | gzip | wc -c
  95
  $ echo "$INPUT" | bzip2 | wc -c
  118
  $ echo "$INPUT" | xz -F xz | wc -c
  140
  $ echo "$INPUT" | xz -F lzma | wc -c
  97
  $ echo "$INPUT" | zstd | wc -c
  92
For a shorter line:

  $ INPUT='        ASSERT(regHi().isGPR());'
  $ echo "$INPUT" | gzip | wc -c
  48
  $ echo "$INPUT" | bzip2 | wc -c
  73
  $ echo "$INPUT" | xz -F xz | wc -c
  92
  $ echo "$INPUT" | xz -F lzma | wc -c
  51
  $ echo "$INPUT" | zstd | wc -c
  46


Are there any command-line tools for zip or similar that allow you to predefine a dictionary over one or more files, and then use that dictionary to compress small files?

That would require the dictionary as a separate input when decompressing, of course.


zstd supports shared dictionaries easily.

It also has vastly superior compression and performance compared to gzip, even without one.
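
For example (a sketch; the file names are made up, and you'd want a reasonably large training set):

  # Train a shared dictionary on a corpus of similar files
  $ zstd --train src/*.c -o code.dict
  # Compress a small file against the dictionary
  $ zstd -D code.dict snippet.c -o snippet.c.zst
  # Decompressing requires the same dictionary
  $ zstd -D code.dict -d snippet.c.zst -o snippet.out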


gzip (or really DEFLATE) does actually come with a small predefined dictionary (the "fixed Huffman codes" in the RFC), which is somewhat optimised for Latin letters in UTF-8, but I have not verified that this is indeed what ends up being used when compressing individual lines of source code.


I use Yelp's secret scanner, detect-secrets, as a pre-commit hook; it's pretty easy to set up via pre-commit's install mechanism.
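
Roughly like this (the rev here is illustrative; pin whatever the current release is):

  $ pip install pre-commit detect-secrets
  $ detect-secrets scan > .secrets.baseline
  $ cat > .pre-commit-config.yaml <<'EOF'
  repos:
    - repo: https://github.com/Yelp/detect-secrets
      rev: v1.5.0
      hooks:
        - id: detect-secrets
          args: ['--baseline', '.secrets.baseline']
  EOF
  $ pre-commit install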



