Learning and compression are closely related. When compressing, you separate systematic rules from unsystematic parameters in the original data. When you learn, you do the same, but the unsystematic parameters become noise. If you throw the noise away, you get a lossy compressor.
Depending on how you define “learning” (there are different basic frameworks, like PAC learning), there are a number of analyses showing that learning and compression are equivalent.
Even quite broadly, Bayesian methods can be interpreted as a rate-distortion problem from information theory, which is a framework for lossy compression.
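As a toy illustration of the “keep the rule, drop the noise” view (a minimal sketch of my own, not from the paper; the linear model and numbers are assumptions for illustration):

```python
import numpy as np

# Toy data: a systematic linear trend plus unsystematic noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 1000)
y = 3.0 * x + 0.5 + 0.1 * rng.normal(size=x.size)

# "Learning" = extracting the systematic rule (two parameters).
slope, intercept = np.polyfit(x, y, deg=1)

# Lossy "compression": store only the rule, throw the residual noise away.
y_hat = slope * x + intercept           # reconstruction from just 2 numbers
distortion = np.mean((y - y_hat) ** 2)  # what the lossy code gives up

print(f"stored params: slope={slope:.3f}, intercept={intercept:.3f}")
print(f"reconstruction MSE (the 'distortion'): {distortion:.4f}")
```

Storing two parameters instead of 1000 samples is the “rate” side; the reconstruction error is the “distortion” side, which is the trade-off the rate-distortion framing above makes explicit.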
There are countless papers showing amazing results on MNIST and CIFAR-10. No non-experts should be reading any paper like this.
Just like we cure cancer in mice all the time, we solve these datasets all the time. And it means nothing. Wait to see if the method is actually useful.
I’m surprised this is downvoted. I liked this paper (I’m mildly biased since I’ve been doing generative modeling work and am very excited about flows) but was disappointed to see MNIST and CIFAR-10.
"No non experts should be reading any paper like this." I agree non-experts have no need to read this paper, but not because of what datasets it used. If you can read and understand a technical paper at this level, I’d say you are an expert by definition.
It may mean nothing "within" MNIST and CIFAR-10 - but if you want to show how your new technology works, why not use the usual sets? Is it not like "Hello World" - not innovation, just demonstration?
I suppose you mean, "Yes, you can do 'hello world', but now show us something that makes your product worthwhile: a convincing use case, a record worth noting."
20 years ago that was true. MNIST and CIFAR-10 were a vehicle to show that your method had promise.
Today, any bad idea you throw at a wall will work on MNIST and CIFAR-10. Even buggy broken code will generally work. We've advanced to the point where these datasets are totally meaningless.
In 1910 if you could show that your airplane idea could fly 100ft, you had something promising. In 2010 we're so used to building airplanes and understand the physics involved to the point where this isn't a meaningful test of the promise of your new aircraft concept.
It's worse than "in mice" in biology/cancer papers. We cure everything in mice. Basically nothing ever transfers to humans. Same with MNIST/CIFAR-10. Everything works, basically nothing matters.
Or another way to put it. A "Hello World"-level compiler would be interesting to report on in 1960. That same compiler would be trivial today and anyone could build it in minutes/hours.
I disagree with this. Binarized MNIST samples of any reasonable quality are (still) tricky to get right without a hierarchical system (read: VQ-VAE tokens or some such encoder space). Same with really solid CIFAR-10. "Scaling down" is a different problem than scaling up; not everything transfers, but saying "everything works on MNIST / CIFAR-10" in generative modeling is a bit glib.
I would much prefer to see early work with solid small-scale results on arXiv than have people hold concepts for another 6 months while scaling up. Let that be for a v2; if you cannot put early but concrete results on arXiv, where else is there?
Recall that a lot of nice papers were mostly MNIST / CIFAR-10 level results at first, followed by scale (thinking of VQ-VAE, PixelCNN/RNN, PerceiverAR, and many others that worked well at scale later). That doesn't mean every result will scale up, but we have a lot of tricks to scale "small-scale" models using pretrained latent spaces and so on. The first diffusion results were also pretty small scale... a different time, but I don't think things are so different today.
That said, I can agree that you need to be a bit in the weeds on the research side to be diving deep on this - but I expect lots of follow-up clarifications or blog posts on this type of work.
What has changed since Adam? Learning rate scales/schedules? I cannot think of many massive changes since ~2014; most of the setups from that era (grad clip + medium-ish LR, some ramp-up or roll-off at the end) still work fine for me today.
(Note: There are many, many great optimization papers since 2014 - I just don't see them show up in general recipes in open source too often)
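For concreteness, here is the kind of ~2014-era recipe being described, as a hedged PyTorch sketch; the model, data, and hyperparameters are placeholders of my own, not from any particular paper:

```python
import torch
import torch.nn as nn

# Placeholder model and synthetic data, just to make the recipe runnable.
model = nn.Linear(128, 10)
data = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(100)]

# Adam with a medium-ish LR, a short linear warmup, a cosine roll-off at the
# end, and gradient clipping each step.
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=10)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=90)
sched = torch.optim.lr_scheduler.SequentialLR(opt, [warmup, decay], milestones=[10])

loss_fn = nn.CrossEntropyLoss()
for xb, yb in data:
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad clip
    opt.step()
    sched.step()
```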
You can find more Twitter threads on the paper.
And here is some more discussion on Reddit: https://www.reddit.com/r/MachineLearning/comments/15rrljw/ba...
It's the first paper from Alex Graves in quite a while. Also, he is with NNAISENSE now.