I mean, sure, this is compression in the sense that I can send you a tiny "compressed text" and all you need is this multi-terabyte model to decompress it!
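
To be fair, that is roughly how these LLM-compression schemes work: the model only supplies next-token probabilities, and an entropy coder (e.g. arithmetic coding) turns them into bits. A minimal sketch of the code length such a setup could reach, assuming both ends share the same model (gpt2 and the transformers/torch imports are stand-ins I picked, not anything specific from the linked post):

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "gpt2"  # stand-in; any causal LM both sides share works the same way
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

    def ideal_bits(text: str) -> float:
        # Sum of -log2 p(token | prefix): roughly what an arithmetic coder
        # driven by this model would emit (first token and coder overhead ignored).
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        logp = torch.log_softmax(logits[0, :-1], dim=-1)
        bits = -logp[torch.arange(ids.shape[1] - 1), ids[0, 1:]] / math.log(2)
        return bits.sum().item()

    text = "The quick brown fox jumps over the lazy dog."
    print(f"{ideal_bits(text):.0f} bits vs {len(text.encode()) * 8} bits raw")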



But we don’t usually count the decompression program size when evaluating compression ratio. E.g., 7-Zip is about 1 MB, but you don’t think about that when evaluating particular 7z files.


We do when it's the Hutter Prize; otherwise it's easy to cheat.

But sure, it's a constant factor, so if you compress enough data you can always ignore it.
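
To put rough numbers on that constant factor (figures made up for illustration): with a hypothetical 10 GB decompressor and a 10:1 compressor, the effective ratio approaches 10:1 as the corpus grows.

    GB = 10**9
    decompressor = 10 * GB  # hypothetical shared model / decompressor size
    ratio = 10              # hypothetical raw compression ratio
    for data in (100 * GB, 1_000 * GB, 100_000 * GB):  # 100 GB, 1 TB, 100 TB
        effective = data / (data / ratio + decompressor)
        print(f"{data / GB:>9,.0f} GB -> effective ratio {effective:.2f}:1")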


We would if it’s a multi-gigabyte program the receiver doesn’t have installed.


Maybe not multi-gigabyte, but on a new system/phone in a year, you're basically guaranteed to find at least one tiny model. We may even get some "standard" model everyone can reliably use as a reference.


At that point it would be useful. Although I wonder if it wouldn’t make more sense to train one specifically for the job. Current LLMs can predict HTML, sure, but they’re large and slow for the task.


Yeah, sometimes compression ratio isn't the right question when there are other practical concerns like disk space or user experience.

But I do want to point out that almost everyone installs at least one multi-gigabyte program to decompress other files, and that is the OS.


If the only thing the OS could do is decompress files, then we'd be rightly upset at that for being multi-gigabyte as well. :)


There's an existing idea called shared dictionary compression: everyone pre-agrees on some statistical priors about the data and uses them to improve the compression ratio.

This is just the gigascale version of that.
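
Zstd's trained dictionaries are the everyday small-scale version. A minimal sketch with the python-zstandard bindings (the sample documents and sizes are made up):

    import zstandard as zstd

    # Both sides pre-agree on a dictionary trained from representative samples.
    samples = [
        (f"<html><head><title>Post {i}</title></head>"
         f"<body><h1>Post {i}</h1><p>Lorem ipsum dolor sit amet, item {i}.</p>"
         f"</body></html>").encode()
        for i in range(2000)
    ]
    shared_dict = zstd.train_dictionary(4 * 1024, samples)

    doc = b"<html><head><title>Post 9999</title></head><body><h1>Post 9999</h1></body></html>"
    plain = zstd.ZstdCompressor().compress(doc)
    primed = zstd.ZstdCompressor(dict_data=shared_dict).compress(doc)
    print(len(doc), len(plain), len(primed))  # the dictionary-primed output should be smallest

    # The receiver needs the exact same dictionary to decompress.
    assert zstd.ZstdDecompressor(dict_data=shared_dict).decompress(primed) == doc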



