I just asked a Stability employee and they said the current models ran into an overfitting issue, probably due to some duplicated data somewhere in their dataset, which consists of 1.5T tokens. The 800B figure is the number of tokens they've been trained on so far. The plan is to keep going and train on the rest of the data once the issue is resolved.
I've asked this question in a few places and never been able to get an answer - maybe you know...
Q: Why are these LLMs trained for only a single epoch, and why do they perform worse if the dataset is repeated?
This seems like it might be related to data duplication being suspected as the cause of the overfitting.
Why don't LLMs need multi-epoch training at a low learning rate to generalize? If they are managing to learn from a single epoch, that sounds more like they may be memorizing!
Never repeating your training data is what you'd ideally like to do when training basically any ML model. If you do that, you don't really need to worry about overfitting, since the model is constantly trying to fit a stream of new data. To reduce its training error it actually has to model the structure of the data rather than just memorize it, since each training step involves data it has never seen before. Larger models are more prone to overfitting but also learn several orders of magnitude faster. If you can use larger models without being concerned about overfitting, it's generally desirable to do so. It's just that most tasks don't actually have enough data to support doing that. Thankfully, text modeling does have enough data.
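To make the single-pass regime concrete, here's a rough sketch of what such a training loop looks like (PyTorch, with a toy model and random token ids standing in for a real corpus - not anyone's actual training code): there is no epoch loop at all, the loader streams each sequence once, so every optimizer step is fitting data the model has never seen before.

    import torch
    import torch.nn as nn
    from torch.utils.data import IterableDataset, DataLoader

    VOCAB, SEQ_LEN = 1000, 64

    class TokenStream(IterableDataset):
        # Streams each sequence exactly once; random ids stand in for a real tokenized corpus.
        def __init__(self, num_sequences):
            self.num_sequences = num_sequences
        def __iter__(self):
            for _ in range(self.num_sequences):
                yield torch.randint(0, VOCAB, (SEQ_LEN,))

    class TinyLM(nn.Module):
        # Toy next-token model: embedding -> GRU -> vocab logits.
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, 128)
            self.rnn = nn.GRU(128, 128, batch_first=True)
            self.head = nn.Linear(128, VOCAB)
        def forward(self, x):
            h, _ = self.rnn(self.embed(x))
            return self.head(h)

    model = TinyLM()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TokenStream(num_sequences=1000), batch_size=8)

    # Single pass, no epoch loop: every step fits previously unseen data.
    for batch in loader:
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

With a real corpus you'd obviously shard and shuffle, but the key point is the same: overfitting requires revisiting examples, and this loop never does.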
So when, for example, we train an ImageNet model over multiple epochs using rotation/scaling/etc. augmentation, is it really better to think of this as one epoch over a unique set of images rather than as multi-epoch training per se? I was really thinking of augmentation as a way to get coverage over the input space rather than as a way of ensuring the training data doesn't repeat, but I guess it serves both purposes.
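E.g. something like this is what I have in mind (a small torchvision sketch with a random placeholder image, just to illustrate): because the transforms are sampled randomly each time an image is drawn, the network never actually sees the same pixels twice across "epochs".

    import numpy as np
    import torch
    from PIL import Image
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
        transforms.RandomRotation(15),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    # Placeholder image; in practice this would be an ImageNet sample.
    img = Image.fromarray(np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8))

    epoch_1_view = augment(img)  # what the model would see in "epoch 1"
    epoch_2_view = augment(img)  # what it would see in "epoch 2"
    print(torch.equal(epoch_1_view, epoch_2_view))  # almost certainly False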
It does still seem that many LLMs are overfitting / memorizing to a fair degree though - maybe just because they are still too big for the amount of data they are trained on? It seems like a bit of a balancing act - wanting an LLM to generalize, yet also to serve as somewhat of a knowledge store for rare data it has only seen once.