There is a way to do this same compression by utilizing the raw probability distributions that the LLM produces as output. Fabrice Bellard has some experiments doing exactly this with transformers: https://bellard.org/nncp/
The idea is that if you can produce an accurate probability distribution over the next bit/byte/token, then you can compress things with an entropy coder (Huffman coding, range encoding, asymmetric numeral systems, etc.). This comment is too small a space to fully explain how they work, but it suffices to say that pretty much every good compression algorithm models probability distributions in some way.
Exactly this. The way to do it is to use the LLM as the statistical model inside an arithmetic coder: you use its next-token probabilities to encode the next token. The KL divergence between the LLM's probabilities and the empirical distribution of the corpus you actually compress is the compression inefficiency.
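To make the prediction-to-compression link concrete, here is a toy sketch (the file name is a placeholder, and the context-free letter model is just a stand-in for an LLM's conditional next-token distribution). The sum of -log2 p over the symbols is the size an ideal entropy coder would reach; an arithmetic/range coder or ANS gets within a few bits of it:

import math
from collections import Counter

text = open("alice.txt", encoding="utf-8").read()    # placeholder input file

counts = Counter(text)
total = sum(counts.values())
model = {ch: n / total for ch, n in counts.items()}  # P(next char), ignoring context

# ideal code length: -log2 p(symbol) bits for every symbol the model had to predict
bits = sum(-math.log2(model[ch]) for ch in text)
orig_bits = 8 * len(text.encode("utf-8"))
print(f"ideal compressed size: {bits / orig_bits:.1%} of the original")

An LLM conditioned on the preceding context assigns much higher probabilities to what actually comes next, so the same sum shrinks dramatically.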
> The idea is that if you can produce an accurate probability distribution over the next bit/byte/token...
But how can you get credible probability distributions from the LLMs? My understanding is that the outputs specifically can't be interpreted as a probability distribution, even though superficially they resemble a PMF, due to the way the softmax function tends to predict close to 100% for the predicted token. You can still get an ordered list of most probable tokens (which I think beam search exploits), but they specifically aren't good representations of the output probability distribution since they don't model the variance well.
My understanding is that minimizing perplexity (what LLMs are generally optimized for) is equivalent to finding a good probability distribution over the next token.
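For what it's worth, the softmax over the logits is a full distribution over the vocabulary, and the average -log2 p of the tokens that actually occur is exactly the bits per token an ideal coder needs (perplexity is just that average exponentiated). A toy sketch with made-up logits, not from any real model:

import math

logits = [4.1, 2.3, 0.7, -1.0, -3.2]          # one made-up score per vocabulary token
z = [math.exp(x) for x in logits]
probs = [x / sum(z) for x in z]               # softmax: sums to 1, nothing is exactly 100%

actual_next = 1                               # index of the token that really occurred
bits = -math.log2(probs[actual_next])         # ideal code length for that token
print(probs, f"-> {bits:.2f} bits if token {actual_next} is the one that occurs")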
$ zstd --train trainpart.* -o dictionary
Save dictionary of size 112640 into file dictionary
$ zstd -vD dictionary text.txt
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
text.txt : 15.41% ( 170 KiB => 26.2 KiB, text.txt.zst)
For this example, zstd warns that the dictionary training set is 10x-100x too small to be efficient. Realistically, I guess you'd train it over, e.g., the entire Gutenberg library. Then you can distribute specific books to people who already have the dictionary.
> Interesting you're showing 15% - 16%, and the LLM technique showed 15%.
Yeah. But I don't think it's hinting at any fundamental theoretical limit.
Both the LLM and my examples were trained on data including the full text of Alice in Wonderland, which we're "compressing". Probably many copies of it, for the LLM. In theory they should both be able to reach 0% (or very close).
So both the blog post and my examples are a bit silly, like "losslessly compressing" an image by diffing it against a lossy JPEG, then claiming a higher compression ratio than PNG/JPEG XL because the compression program is a 1TB binary that bundles Sloot-style blurry copies of every known image.
In fact, by just adding `--maxdict=1MB` to my first example, `zstd -D` gets down to 13.5%. Probably lower with further tweaking. And adding an explicit `cat text.txt >> FullTextsSample.txt` brings `zstd --patch-from` down to… Uh. 0.02%. 40 bytes total. …And probably around a third of that is headers and checksum… So… Yeah. A bit silly.
I think a better comparison should probably do the following (rough sketch after the list):
- Have a clean separation between training data and data to be compressed. Usually the data being compressed should be similar to, but not included in, the training data.
- Use the same training data for both the LLM and conventional compressor.
- Include the dictionary/model size. And compare methods at the same dictionary/model size.
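Something like this, using the python-zstandard bindings (the paths are placeholders; I'm assuming that package's train_dictionary / ZstdCompressor API):

import os
import zstandard as zstd

# dictionary trained only on the training corpus
train_files = [os.path.join("train", f) for f in os.listdir("train")]
samples = [open(f, "rb").read() for f in train_files]
dict_data = zstd.train_dictionary(1_000_000, samples)            # ~1 MB dictionary
print("dictionary size:", len(dict_data.as_bytes()), "bytes")    # count this against the method

# the text being compressed is held out, i.e. not present in train/
test = open("test/alice.txt", "rb").read()
compressed = zstd.ZstdCompressor(level=19, dict_data=dict_data).compress(test)
print(f"ratio: {len(compressed) / len(test):.1%}")

The LLM side of the comparison would get the same train/ directory, and its weights would be charged against it the same way the dictionary size is here.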
Also, as an aside, the method in the blog post could probably also get smaller by storing token probability ranks for most of its current explicit letters.
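Roughly this, as a toy (a bigram letter model standing in for the LLM, placeholder file name): replace each character with its rank in the model's sorted predictions, then hand the rank stream to an ordinary compressor. Good predictions mean mostly-zero ranks, which compress very well:

import zlib
from collections import Counter, defaultdict

text = open("alice.txt", encoding="utf-8").read()    # placeholder input file

# toy predictor: rank characters by how often they follow the previous character
follows = defaultdict(Counter)
for prev, ch in zip(text, text[1:]):
    follows[prev][ch] += 1
order = {prev: {c: i for i, (c, _) in enumerate(cnt.most_common())}
         for prev, cnt in follows.items()}

ranks = bytes(order[prev][ch] for prev, ch in zip(text, text[1:]))
print("zlib on the raw text:", len(zlib.compress(text.encode("utf-8"), 9)))
print("zlib on the ranks   :", len(zlib.compress(ranks, 9)))

(This toy cheats the same way the dictionary examples above do, since the model is built from the text itself; the point is only the shape of the technique.)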
A problem with this is that the inference has to be 100% deterministic. For example, if you change the tiling order of a matmul you may get slightly different results. It's easy to imagine people changing tiling in llama.cpp or other libraries, and you are very unlikely to get bit-exact results from different libraries.
Interestingly (though maybe not relevant here) you can also get different results from multi-user inference systems depending on what other requests are in the batch. It's possible to avoid this but I'm pretty sure most systems don't.
The "slightly different" bit of course makes it worse - it will work 99% of the time.
You could probably set the seed and temperature and store these. Running with a few variations and keeping the best would probably work even better, at the expense of more computation.
Not necessarily. This also uncovers the weakness of the NYT lawsuit.
Imagine in your corpus of training data is the following:
- bloga.com: "I read in the NYT that 'it rains cats and dogs twice per year'"
- blogb.com: "according to the NYT, 'cats and dogs level rain occurs 2 times per year.'"
- newssite.com: "cats and dogs rain events happen twice per year, according to the New York Times"
Now, you chat with an LLM trained on this data, asking it "how many times per year does it rain cats and dogs?"
"According to the New York Times, it rains cats and dogs twice per year."
NYT content was never in the training data, but it -is- mentioned a lot across CommonCrawl-approved sources, so it gets a higher probability association with the next token.
Zoom that out to full articles quoted throughout the web, and you get false positives.
They were getting huge chunks of NYT articles out, verbatim. I remember being stunned. Then I remember finding out there was some sort of trick to it that made it seem sillier.
Does it matter? What's the legal view on "I downloaded some data which turns out to be copied from a copyrighted source and it was probably trivial to figure it out, then trained the LLM on it"? I mean, they work on data processing - of course they would expect that if someone responds with 10 paragraphs in reporting style, under a link to NYT... that's just the article.
I genuinely don’t know the answer but I can see it being more complicated than “OpenAI purposefully acquired and trained on NYT articles”.
If Stack Overflow collects a bunch of questions and comments and exposes them as a big dataset licensed as Creative Commons, but it actually contains quite a bit of copyrighted content, whose responsibility is it to validate copyright violations in that data? If I use something licensed as CC in good faith and it turns out the provider or seller of that content had no right to relicense it, am I culpable? Is this just a new lawsuit where I can seek damages for the lawsuit I just lost?
> I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!
> The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from.
I don't think that's what I was driving at. Monolith users in this scenario would be knowingly using copyrighted content with the clear intent to "de-copyright" it for distribution purposes by mixing it up into a new output via a reversible process. Which seems like it probably violates copyright because the intent is to distribute a copyrighted work even if the process makes programmatic detection difficult during distribution. This may operate within the wording of the law but it clearly is being done in bad faith to the spirit of the law (and this seems like standard file encryption of a copyrighted work where you are also publicly distributing the decryption key... and transmitting a copyrighted work over TLS today doesn't absolve anyone of liability). You seem to be suggesting this is what OpenAI has done via the transformer model training process - and acting in bad faith. Which is certainly possible but won't be proven unless their court case reveals it. I'm asking about the opposite: what if they acted in good faith?
What I'm getting at is that it's plausible that a LLM is trained purely on things that were available and licensed as Creative Commons but that the data within contains copyrighted content because someone who contributed to it lied about their ownership rights to provide that content under a Creative Commons license, i.e. StackOverflow user UnicornWitness24 is the perpetrator of the copyright violation by copying a NYT article into a reply to bypass a paywall for other users and has now poisoned a dataset. And I'm asking: What is the civil liability for copyright violations if the defendant was the one who was actually defrauded or deceived and was acting in good faith and within the bounds of the law at the time?
it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports.
But yes, open to interpretation as far as where LLM training falls.
LLMs blow away traditional compressors because they are very good predictors, and prediction and compression are the same operation.
You can convert any predictor into a lossless compressor by feeding the output probabilities into an entropy coding algorithm. LLMs can get compression ratios as high as 95% (0.4 bits per character) on English text.
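Back-of-envelope for those numbers (the 2.5 bits/char figure is my own rough ballpark for a general-purpose compressor on English, not from the article):

bits_per_char_raw = 8            # plain ASCII text
bits_per_char_gzip = 2.5         # rough ballpark for a general-purpose compressor
bits_per_char_llm = 0.4          # the figure quoted above for strong LLMs
for bpc in (bits_per_char_gzip, bits_per_char_llm):
    ratio = bpc / bits_per_char_raw
    print(f"{bpc} bits/char -> {ratio:.0%} of original size, ~{1 - ratio:.0%} saved")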
It's only equivalent for a very narrow sense of 'prediction', namely modelling conditional probability distributions over known data.
There's no sense, for example, in which deriving a prediction about the nature of reality from a novel scientific theory is 'compression'.
E.g., suppose we didn't know a planet existed, and we looked at orbital data. There's no sense in which compressing that data would indicate another planet existed.
It's a great source of confusion that people think AI/ML systems are 'predicting' novel distributions of observations (science), vs. novel observations of the same distribution (statistics).
It should be more obvious that the latter is just compression, since it's just taking a known distribution of data and replacing it with a derivative optimal value.
Science predicts novel distributions based on theories, ie., it says the world is other than we previously supposed.
It doesn’t matter how your predictor works, whether it’s a scientific theory, a statistical model, or a magic oracle. You can always perfectly convert its predictions into compression using entropy coding. The conversion process is black-box.
Sure it is! If we were trying to compress an archive of orbital data, one way to do it would be "initial positions + periodic error correction". If you have the new planet, your errors will be smaller and can be represented in less space at the same precision.
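A toy version of that argument (nothing to do with real orbital mechanics; all numbers made up): compress an observation series as predictor + residuals, and see that a predictor which includes the "extra planet" term leaves residuals that take fewer bytes. The byte count of the residuals is standing in for the "periodic error correction" above:

import math
import random
import zlib

random.seed(0)
true = [math.sin(0.01 * t) + 0.2 * math.sin(0.037 * t) for t in range(5000)]
obs = [x + random.gauss(0, 1e-4) for x in true]          # noisy "observations"

def residual_bytes(predict):
    # quantize the residuals and DEFLATE them as a stand-in for a real entropy coder
    res = [int(round((o - predict(t)) * 1e5)) & 0xFFFF for t, o in enumerate(obs)]
    raw = b"".join(r.to_bytes(2, "little") for r in res)
    return len(zlib.compress(raw, 9))

without_term = lambda t: math.sin(0.01 * t)                            # predictor missing a term
with_term = lambda t: math.sin(0.01 * t) + 0.2 * math.sin(0.037 * t)   # predictor with it

print("residuals, weaker predictor:", residual_bytes(without_term), "bytes")
print("residuals, better predictor:", residual_bytes(with_term), "bytes")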
A statistical model of orbits, without a theory of gravity, is less compressed when you assume more objects. Take all the apparent positions of objects in the sky, {(object, x1, x2, t),...}. Find a statistical model of each point at t+1, so y = (o, x1, x2, t+1). There is no sense in which you're deriving a new object in the sky from this statistical model -- it is only a compression of observable orbits.
When you say, "if you have the new planet", you're changing the data generating process (theory) to produce a new distribution of points {(o', x1', x2', t'), ...} to include an unseen object. You're then comparing two data generating models (two theories) for their simplicity. You're not comparing the associative models.
Call the prior theory 8-planets, so 8P generates x1,x2,t; and the new theory 9P which generates x1',x2',t'
You're then making a conditional error distribution when comparing two rival theories. The 9P theory will minimize this error.
But in no sense can the 9P theory be derived from the initial associative statistical distribution. You are, based on theory (science, knowledge, etc.), choosing to add a planet, vs., e.g., correcting for measurement error, modifying Newton's laws, changing the angle of the earth wrt the solar system... or one of an infinite number of theories which all produce the same error minimization.
The sense of "prediction" that science uses (via Popper et al.) is deriving the existence of novel phenomena that do not follow from prior observable distributions.
By the way, if you want to see how well gzip actually models language, take any gzipped file, flip a few bits, and unzip it. If it gives you a checksum error, ignore that. You might have to unzip in a streaming way so that it can't tell the checksum is wrong until it's already printed the wrong data that you want to see.
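Something along these lines, if you want to try it (placeholder file name; zlib's streaming decompressor with the gzip wrapper only checks the CRC at the very end, so you see the garbled output first):

import zlib

data = bytearray(open("alice.txt.gz", "rb").read())      # any gzip file you have around
data[len(data) // 2] ^= 0x01                             # flip one bit mid-stream

d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)        # gzip framing
out = b""
try:
    for i in range(0, len(data), 4096):                  # feed it in chunks, streaming
        out += d.decompress(bytes(data[i:i + 4096]))
except zlib.error:
    pass                                                 # corrupted stream / bad checksum: keep what we got
print(out.decode("latin-1")[-2000:])                     # tail of the (possibly garbled) output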
You can use a preshared dictionary with bzip2 and zstd, so you can get that down a lot by using different dictionaries depending on certain rules. I don't know if it helps with literature, but I had great success in sending databases with free text like that. In the end it was easier to just use one dictionary for everything and skip the rules.
I mean, sure this is compression in the sense that I can send you a tiny "compressed text" and all you need is this multi-terabyte model to decompress it!
But we don't usually count the decompression program size when evaluating compression ratio. E.g. 7-Zip is about 1 MB, but you don't think about that when evaluating particular 7z files.
Maybe not multi-gigabyte, but in a new system/phone in a year, you're basically guaranteed to find at least one tiny model. We may even get some "standard" model everyone can reliably use as a reference.
At that point it would be useful. Although, I wonder if it wouldn’t make more sense to train one specifically for the job. Current LLMs can predict HTML, sure, but they’re large and slow for the task.
There's an existing idea called shared dictionary compression. Everybody pre-agrees on some statistical priors about the data, and you use them to improve the compression ratio.
Potentially hot take, but compression is what I think language is for.
It’s easier to transmit a story than the entire sensory experience of something happening, so it saves us time and attention. Being able to convert complex tasks and ideas into short stories has likely served us well since the brain burns a lot of calories.
I don't think this is a hot take at all; it matches my understanding. One of the reasons language itself is so difficult (miscommunication, etc.) is that we each have a mostly similar but not identical "compression table" of ideas that words map to, which is why we spend so much time aligning on terms, to ensure correct "decompression".
We need compression because internally cognition has a very "wide" set of inputs/outputs (basically meshed), but we only have a serial audio link with relatively narrow bandwidth, so the whole thing that allows us to even discuss this is basically a neat evolutionary hack.
The model is always wrong, since it predicts a probability distribution over all possible tokens, but the target has 100% probability for one token and 0 for all others.