Interesting. If I had to do this, I would have done something like

    perl -lne 'next unless $_; $z = qx(echo "$_" | gzip | wc -c); printf "%5.2f    %s\n", $z/length($_), $_'
on the principle that high-entropy text compresses badly. However, that uses each line as its own dictionary rather than the entire file, so it struggles a little with very short lines, which always compress badly.

It did react to this line

    return map { $_ > 1 ? 1 : ($_ < 0 ? 0 : $_) } @vs;
which is valid code but does indeed seem fairly high in entropy. I was also able to fool it into not detecting a high-entropy line by adding a comment of natural English to it.

I'm on the go, but it would be interesting to see comparisons between the Perl command and this tool. The benefit of the Perl command is that it runs out of the box on any non-Windows machine, so it might not need to be as powerful to gain adoption.


I learned Go many years ago doing some Advent of Code problems. As I solved each problem, my housemate pestered me for a look and then rewrote my solutions (each needing 10-50 lines of Go) into Ruby one-liners, all the while making fun of Go and my silly programs. I wasn’t intending to, but I ended up learning a lot of Ruby that night too.

Thank you for continuing the tradition.


I guess you could take all lines in the file except the one you're testing and measure the compressed size, then add the line back and measure again. The delta should then be fairer. You could even do this by concatenating all the code files and then testing line by line across the entire repo, but that would probably be too slow.
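
A rough sketch of that in shell (slow, since it recompresses the whole file once per line; the file to scan is the script's argument):

  #!/bin/sh
  # For each line N: compressed size of the whole file minus the
  # compressed size of the file with line N deleted.
  f="$1"
  full=$(gzip -c "$f" | wc -c)
  n=0
  while IFS= read -r line; do
    n=$((n + 1))
    without=$(sed "${n}d" "$f" | gzip -c | wc -c)
    printf '%4d  %s\n' "$((full - without))" "$line"
  done < "$f"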


I have done this trick several times, though I would use a better compressor than gzip.

xz or zstd may be better choices, or you can look at Hutter Prize [1] winners for the best compression and therefore the best entropy estimate.

[1] http://prize.hutter1.net/


> best compression and therefore best entropy estimate

That's a good point. But the Hutter Prize is for compressing a 1 GB file. On inputs as short as a line of code, gzip doesn't do so badly. For a longer line:

  $ INPUT='    bool isRegPair() const { return kind() == RegisterPair || kind() == LateRegisterPair || kind() == SomeLateRegisterPair; }'
  $ echo "$INPUT" | gzip | wc -c
  95
  $ echo "$INPUT" | bzip2 | wc -c
  118
  $ echo "$INPUT" | xz -F xz | wc -c
  140
  $ echo "$INPUT" | xz -F lzma | wc -c
  97
  $ echo "$INPUT" | zstd | wc -c
  92
For a shorter line:

  $ INPUT='        ASSERT(regHi().isGPR());'
  $ echo "$INPUT" | gzip | wc -c
  48
  $ echo "$INPUT" | bzip2 | wc -c
  73
  $ echo "$INPUT" | xz -F xz | wc -c
  92
  $ echo "$INPUT" | xz -F lzma | wc -c
  51
  $ echo "$INPUT" | zstd | wc -c
  46


Are there any command-line tools for zip or similar that allow you to predefine a dictionary over one or more files, and then use that dictionary to compress small files?

That would require the dictionary as a separate input when decompressing, of course.


zstd supports shared dictionaries easily.

It also has vastly superior compression and performance compared to gzip, even without one.
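
For example (a sketch; the file names are made up, and you'd want a reasonably large training set):

  # Train a shared dictionary on a corpus of similar files
  $ zstd --train src/*.c -o code.dict
  # Compress a small file against the dictionary
  $ zstd -D code.dict snippet.c -o snippet.c.zst
  # Decompressing requires the same dictionary
  $ zstd -D code.dict -d snippet.c.zst -o snippet.out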


gzip (or really DEFLATE) does actually come with a small predefined dictionary (the "fixed Huffman codes" in the RFC), which is somewhat optimised for Latin letters in UTF-8, but I have not verified that this is indeed what ends up being used when compressing individual lines of source code.


I use Yelp's secret scanner, detect-secrets, as a pre-commit hook; it's pretty easy to set up via pre-commit's install mechanism.
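
Roughly like this (the rev here is illustrative; pin whatever the current release is):

  $ pip install pre-commit detect-secrets
  $ detect-secrets scan > .secrets.baseline
  $ cat > .pre-commit-config.yaml <<'EOF'
  repos:
    - repo: https://github.com/Yelp/detect-secrets
      rev: v1.5.0
      hooks:
        - id: detect-secrets
          args: ['--baseline', '.secrets.baseline']
  EOF
  $ pre-commit install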



