
It has to cater for any possible input. Even with special-case handling for this particular (generally uncommon) case of vast runs of the same value, the compressed data will probably be packetized somehow, and each packet can reproduce only so many repeats, so you'll need enough repeated packets to reproduce the full output. With 10 GB, it mounts up.
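To make that per-packet limit concrete: DEFLATE (the format behind gzip) can encode at most a 258-byte match per back-reference, which caps its ratio near 1032:1 on a run of identical bytes no matter how long the run is. A quick sketch with Python's stdlib `zlib` (a much smaller input than the article's, but the ceiling is the same):

```python
import zlib

# 10 MB of zeroes stands in for the article's much larger file
data = b"\x00" * 10_000_000
compressed = zlib.compress(data, level=9)

ratio = len(data) / len(compressed)
print(f"{len(data)} -> {len(compressed)} bytes (ratio ~{ratio:.0f}:1)")
# The ratio plateaus near DEFLATE's ~1032:1 theoretical ceiling, because
# each back-reference can cover at most 258 bytes of output.
```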

I tried this on my computer with a couple of other tools, after creating a file full of 0s as per the article.

gzip -9 turns it into 10,436,266 bytes in approx 1 minute.

xz -9 turns it into 1,568,052 bytes in approx 4 minutes.

bzip2 -9 turns it into 7,506 (!) bytes in approx 5 minutes.

I think OP should consider getting bzip2 on the case. 2 TBytes of 0s should compress nicely. And I'm long overdue an upgrade to my laptop... you probably won't be waiting long for the result on anything modern.
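The same comparison can be reproduced in miniature with Python's stdlib, which happens to bundle all three formats (`zlib` for gzip's DEFLATE, `lzma` for xz, `bz2` for bzip2). The absolute sizes won't match the 10 GB CLI numbers above, but the ordering does:

```python
import bz2
import lzma
import zlib

# 10 MB of zeroes as a small stand-in for the 10 GB file
data = b"\x00" * 10_000_000

sizes = {
    "gzip/zlib -9": len(zlib.compress(data, level=9)),
    "xz/lzma -9": len(lzma.compress(data, preset=9)),
    "bzip2 -9": len(bz2.compress(data, compresslevel=9)),
}
for name, size in sizes.items():
    print(f"{name:>12}: {size:,} bytes")
# bzip2's run-length pre-pass makes it the clear winner on this input,
# just as in the 10 GB experiment above.
```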



The reason the discussion in this thread centers on gzip (and brotli / zstd) is that those are standard compression schemes that HTTP clients will generally support (RFCs 1952, 7932, and 8478).

As far as I can tell, the biggest amplification you can get out of zstd is 32768 times: per the standard, the maximum decompressed block size is 128 KiB, and the smallest compressed block is a 3-byte block header followed by a single byte (an RLE block). Indeed, compressing a 1 GiB file of zeroes yields 32.9 KiB of output, which is quite close to that theoretical maximum.
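The 32768x figure falls straight out of those two numbers (the 4-byte minimum being the 3-byte block header plus the single repeated byte); a back-of-the-envelope check:

```python
# zstd amplification bound, per the RFC 8478 limits quoted above
max_block_decompressed = 128 * 1024  # 128 KiB max decompressed block size
min_block_compressed = 3 + 1         # 3-byte block header + 1-byte RLE payload

max_ratio = max_block_decompressed // min_block_compressed
print(max_ratio)  # 32768

# At that ratio, a 1 GiB all-zero input can't compress below:
expected = (1 * 2**30) / max_ratio
print(f"{expected / 1024:.0f} KiB")  # 32 KiB, close to the observed 32.9 KiB
```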

Brotli promises to allow for blocks that decompress up to 16 MiB, so that actually can exceed the compression ratios that bzip2 gives you on that particular input. Compressing that same 1 GiB file with `brotli -9` gives an 809-byte output. If I instead opt for a 16 GiB file (dd if=/dev/zero of=/dev/stdout bs=4M count=4096 | brotli -9 -o zeroes.br), the corresponding output is 12929 bytes, for a compression ratio of about 1.3 million; theoretically this should be able to scale another 2x, but whether that actually plays out in practice is a different matter.

(The best compression for brotli should be available at -q 11, which is the default, but it's substantially slower to compress compared to `brotli -9`. I haven't worked out exactly what the theoretical compression ratio upper bound is for brotli, but it's somewhere between 1.3 and 2.8 million.)
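For what it's worth, the "about 1.3 million" figure from the 16 GiB experiment works out like this (a sanity check on the observed numbers, not a bound):

```python
input_size = 16 * 2**30  # 16 GiB of zeroes from the dd pipeline above
output_size = 12929      # bytes of `brotli -9` output reported above

ratio = input_size / output_size
print(f"{ratio:,.0f}")  # ~1,328,786:1 -- the "about 1.3 million" figure
```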

Also note that zstd provides very good compression ratios for its speed, so in practice most use cases benefit from using zstd.


That's a good point, thanks - I was thinking of this from the point of view of the client downloading a file and then trying to examine it, but of course you'd be much better off fucking up their shit at an earlier stage in the pipeline.





