When comparing compression ratios in serious publications, it is customary to include the size of the decompression utility and any dictionaries required for decompression in one's numbers, especially when these are comparable in size to, or larger than, the corpus used for testing. It is otherwise trivial to come up with compression schemes that show amazing performance on some corpus, simply by smuggling arbitrarily large parts of the corpus inside the decompression utility.
In this case, the entire, exact large language model used to compress the data is required for decompression. Since its size is not accounted for, the paper's numbers are meaningless.
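A back-of-the-envelope illustration of the objection, with made-up corpus and output sizes, and ~13.5 GB assumed for LLaMA-7B's fp16 weights (7B params × 2 bytes):

```python
# A "compression ratio" only means something if everything the decoder needs
# is counted, including the model weights. Numbers below are illustrative only.

corpus_bytes     = 100_000_000       # assumed 100 MB test corpus
compressed_bytes = 10_000_000        # hypothetical 10:1 output of the LLM scheme
model_bytes      = 13_500_000_000    # ~13.5 GB of LLaMA-7B fp16 weights

naive_ratio  = corpus_bytes / compressed_bytes
honest_ratio = corpus_bytes / (compressed_bytes + model_bytes)

print(f"naive ratio : {naive_ratio:.2f}x")    # looks great
print(f"honest ratio: {honest_ratio:.4f}x")   # < 1: the "archive" is bigger than the corpus
```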
This rapidly becomes similar to the Pi filesystem joke, in that hunting through the huge dictionary for the best match to your sequence begins to eat up any gains (especially with the delta coding that is needed to make the ratios realize their potential).
And the model has to stay fixed: all your encoded text is tied to this exact version of this model, so it can never be updated, or you have to re-encode everything when you do.
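In practice that means every archive has to pin the exact weights it was encoded against, along the lines of the sketch below (the header format and helper names are hypothetical):

```python
import hashlib
import json
import struct

def weights_fingerprint(path):
    """SHA-256 over the exact weight file the encoder loaded."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_archive(payload, model_name, weights_path):
    # Prepend a small header identifying the model version used for encoding.
    header = json.dumps({"model": model_name,
                         "weights_sha256": weights_fingerprint(weights_path)}).encode()
    return struct.pack(">I", len(header)) + header + payload

def read_archive(blob, weights_path):
    # Refuse to decode unless the local weights match the encoder's weights exactly.
    (hdr_len,) = struct.unpack(">I", blob[:4])
    header = json.loads(blob[4:4 + hdr_len])
    if header["weights_sha256"] != weights_fingerprint(weights_path):
        raise ValueError("archive was encoded against different weights; "
                         "decode with the original model or re-encode everything")
    return blob[4 + hdr_len:]
```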
> We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens.
> Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
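For context, this is the general shape of such a scheme (a rank-coding sketch of the idea, not the paper's exact construction; `predict` is a hypothetical stand-in for a LLaMA-7B forward pass):

```python
import zlib
import numpy as np

def compress(tokens, predict):
    """predict(context_tokens) -> probability vector over the vocab.
    Assumes the vocab size fits in uint16 (LLaMA's is 32,000)."""
    ranks = []
    for i, tok in enumerate(tokens):
        order = np.argsort(-predict(tokens[:i]))           # most likely token first
        ranks.append(int(np.nonzero(order == tok)[0][0]))  # rank of the actual next token
    # With a strong predictor most ranks are 0 or tiny, so even a generic coder
    # squeezes the rank stream far below the raw token stream.
    return zlib.compress(np.asarray(ranks, dtype=np.uint16).tobytes())

def decompress(blob, predict):
    """Only works with bit-identical predictions, i.e. the exact same model weights."""
    ranks = np.frombuffer(zlib.decompress(blob), dtype=np.uint16)
    tokens = []
    for r in ranks:
        order = np.argsort(-predict(tokens))
        tokens.append(int(order[r]))
    return tokens
```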
Curious whether this sort of trained compression could radically reduce the size of log files / analytics storage (or whether there are already simple methods for that which perform well?)
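For the "simple methods that already exist" half of the question, dictionary-trained zstd is the usual answer for large numbers of short, similar records. A rough sketch, assuming the python-zstandard bindings and a made-up sample file:

```python
# Requires the python-zstandard package (pip install zstandard); the file name
# and sample record below are made up.
import zstandard as zstd

# Train a small shared dictionary on a representative sample of log lines.
samples = open("sample.log", "rb").read().splitlines()
shared_dict = zstd.train_dictionary(112_640, samples)      # ~110 KiB dictionary

compressor   = zstd.ZstdCompressor(dict_data=shared_dict, level=19)
decompressor = zstd.ZstdDecompressor(dict_data=shared_dict)

record = b'{"ts":"2023-05-16T12:00:00Z","level":"INFO","msg":"request served"}'
blob = compressor.compress(record)
assert decompressor.decompress(blob) == record
# The dictionary is stored once and shared across millions of records, so its
# size is amortized -- unlike shipping a multi-GB model with every archive.
```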
Note that there are prizes available for genuine AI based improvements to English text compression, e.g. https://en.wikipedia.org/wiki/Hutter_Prize