I just asked a Stability employee and they said the current models ran into an overfitting issue, probably due to some duplicated data somewhere in their dataset, which consists of 1.5T tokens. The 800B figure is the number of tokens they've been trained on so far. The plan is to keep going and train on the rest of the data once the issue is resolved.
I've asked this question in a few places and never been able to get an answer - maybe you know...
Q: Why are these LLMs trained for only a single epoch, and why do they perform worse if the dataset is repeated?
This seems like it might be related to data duplication being suspected as the cause of the overfitting.
Why don't LLMs need multi-epoch training at a low learning rate to generalize? If they are managing to learn from a single epoch, that sounds more like they may be memorizing!
Never repeating your training data is what you'd ideally like to do when training basically any ML model. If you do that, you don't really need to worry about overfitting, since the model is constantly trying to fit a stream of new data. To reduce its training error it actually has to model the structure of the data rather than just memorize it, since each training step involves data it has never seen before. Larger models are more prone to overfitting but also learn several orders of magnitude faster. If you can use larger models without being concerned about overfitting, it's generally desirable to do so. It's just that most tasks don't actually have enough data to support doing that. Thankfully, text modeling does have enough data.
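To make the single-pass regime concrete, here's a rough sketch of what such a training loop looks like (PyTorch, with a toy model and random token ids standing in for a real corpus - not anyone's actual training code): there is no epoch loop at all, the loader streams each sequence once, so every optimizer step is fitting data the model has never seen before.

    import torch
    import torch.nn as nn
    from torch.utils.data import IterableDataset, DataLoader

    VOCAB, SEQ_LEN = 1000, 64

    class TokenStream(IterableDataset):
        # Streams each sequence exactly once; random ids stand in for a real tokenized corpus.
        def __init__(self, num_sequences):
            self.num_sequences = num_sequences
        def __iter__(self):
            for _ in range(self.num_sequences):
                yield torch.randint(0, VOCAB, (SEQ_LEN,))

    class TinyLM(nn.Module):
        # Toy next-token model: embedding -> GRU -> vocab logits.
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, 128)
            self.rnn = nn.GRU(128, 128, batch_first=True)
            self.head = nn.Linear(128, VOCAB)
        def forward(self, x):
            h, _ = self.rnn(self.embed(x))
            return self.head(h)

    model = TinyLM()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TokenStream(num_sequences=1000), batch_size=8)

    # Single pass, no epoch loop: every step fits previously unseen data.
    for batch in loader:
        inputs, targets = batch[:, :-1], batch[:, 1:]
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

With a real corpus you'd obviously shard and shuffle, but the key point is the same: overfitting requires revisiting examples, and this loop never does.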
So when, for example, we train an ImageNet model over multiple epochs using rotation/scaling/etc. augmentation, is it really better to think of this as one epoch over a unique set of images rather than as multi-epoch training per se? I was really thinking of augmentation as a way to get coverage over the input space rather than as a way of ensuring the training data doesn't repeat, but I guess it serves both purposes.
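E.g. something like this is what I have in mind (a small torchvision sketch with a random placeholder image, just to illustrate): because the transforms are sampled randomly each time an image is drawn, the network never actually sees the same pixels twice across "epochs".

    import numpy as np
    import torch
    from PIL import Image
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
        transforms.RandomRotation(15),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    # Placeholder image; in practice this would be an ImageNet sample.
    img = Image.fromarray(np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8))

    epoch_1_view = augment(img)  # what the model would see in "epoch 1"
    epoch_2_view = augment(img)  # what it would see in "epoch 2"
    print(torch.equal(epoch_1_view, epoch_2_view))  # almost certainly False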
It does still seem that many LLMs are overfitting / memorizing to a fair degree though - maybe just because they are still too big for the amount of data they are trained on? It seems like a bit of a balancing act - wanting an LLM to generalize, yet also to serve as somewhat of a knowledge store for rare data it has only seen once.