Wow, they hot-swapped activation functions (GELU -> ReLU) during training. They are indeed very similar activation functions, but it's kinda crazy to me that you can make that kind of a change to a model while it's training, preserving all weights and other state, and just keep going. They changed weight clipping thresholds on the fly too.
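For anyone wondering how that can even work: GELU and ReLU have no learnable parameters, so swapping one for the other doesn't touch the state dict at all. A rough sketch of the idea, assuming a PyTorch-style model where the activation is a submodule (not their actual code):

    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, d_model=1024):
            super().__init__()
            self.fc1 = nn.Linear(d_model, 4 * d_model)
            self.act = nn.GELU()  # activation kept as a swappable submodule
            self.fc2 = nn.Linear(4 * d_model, d_model)

        def forward(self, x):
            return self.fc2(self.act(self.fc1(x)))

    model = Block()
    # ... training ...
    # Mid-training "hot swap": neither GELU nor ReLU holds any weights,
    # so every learned parameter in model.state_dict() is left untouched.
    model.act = nn.ReLU()

All the swap changes is the forward computation from the next step onward.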
They also swapped out the optimizer several times from what I can tell, switching between Adam, "Fake SGD", and "Vanilla SGD" multiple times.
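Swapping the optimizer is similarly mundane at the code level: you just construct a new optimizer over the same parameters at a checkpoint boundary. A sketch along the same hypothetical lines (the real caveat is that Adam's moment estimates simply get thrown away):

    import torch

    model = torch.nn.Linear(1024, 1024)  # stand-in for the real model
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # ... train for a while ...

    # Swap Adam -> vanilla SGD. The weights live in `model`, not in the
    # optimizer, so nothing learned is lost; only Adam's running first/second
    # moment buffers are discarded.
    opt = torch.optim.SGD(model.parameters(), lr=1e-4)

Whether the run tolerates losing that optimizer state is a separate question, but mechanically it's a two-line change.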
Even without the huge number of hardware/driver issues they seemed to be having with the GPUs in their big training cluster(s), this puts into perspective how hard it is to train enormous models like this. Many of the failures don't have an immediately obvious cause. Plus, there aren't all that many places out there doing training at this scale, so I imagine a lot of this has to be figured out in-house.
It's pretty reassuring to see that constantly fiddling with the model and adjusting learning rates on the fly is normal even at leading research labs. On the other hand, it only makes the replication crisis worse.
After a quick look through, I really hope releasing raw notes like this becomes more of a trend!
Not sure if your comment is meant as a disagreement or a question.
Generally the way hyperparameters are adjusted is some mix of intuition/experience and random/grid search. Plus, most people don't have the resources/infra to do a large-scale grid search on a model that might take a day or more to train. It's somewhat principled, but a random search is often just as good as fiddling with the numbers by hand, and you frequently have to figure out why something worked post hoc. You also accept that you might never have a good explanation - for all you know it's dataset dependent - and trust that your results are good enough to convince peer review (and you can show that this other parameter set was worse, so you didn't use it). It's hacky in the sense that a lot of the work in getting to state of the art (moving the needle on a benchmark by less than 1%) involves playing with the numbers until you get the best results. For example, here the engineers modify the learning rate between runs. I don't think they really had any theoretical reason behind the step changes beyond "this will probably work better, because we've seen that effect when training similar-sized models".
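For concreteness, "random search" at smaller scale is often nothing fancier than this kind of loop (a toy sketch; train_and_eval is a hypothetical stand-in for a full training-plus-validation run):

    import random

    def train_and_eval(lr, batch_size, dropout):
        """Hypothetical: train with these settings, return validation loss."""
        ...

    best = None
    for _ in range(20):  # budget: 20 full training runs
        cfg = {
            "lr": 10 ** random.uniform(-5, -3),            # log-uniform LR
            "batch_size": random.choice([256, 512, 1024]),
            "dropout": random.uniform(0.0, 0.3),
        }
        loss = train_and_eval(**cfg)
        if best is None or loss < best[0]:
            best = (loss, cfg)

At GPT-3 scale you obviously can't afford 20 full runs, which is exactly why it degenerates into hand-tuning a single run.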
Adjusting the learning rate schedule is one of the simplest knobs to tweak. When you're working with huge models, you generally want to use as big a batch size as you can get away with to reduce training time. That runs a bit counter to earlier thinking, where LeCun said something like "friends don't let friends use batch sizes > 32".
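To show how cheap that knob is: a warmup-then-decay schedule is a handful of lines to define and trivial to re-tweak between restarts. Illustrative numbers only, not what they actually used, using PyTorch's LambdaLR:

    import torch

    model = torch.nn.Linear(1024, 1024)                 # stand-in model
    opt = torch.optim.SGD(model.parameters(), lr=3e-4)  # lr here is the peak LR

    warmup_steps, total_steps = 2_000, 100_000          # made-up numbers

    def lr_lambda(step):
        # Linear warmup to the peak LR, then linear decay to 10% of it.
        if step < warmup_steps:
            return step / warmup_steps
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return max(0.1, 1.0 - 0.9 * progress)

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    # per training step: opt.step(); sched.step()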
There are also some more guided methods, like exploring the parameter space in a Bayesian way (e.g. trying to efficiently work out which knobs make the most difference).
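A minimal version of that with something like scikit-optimize might look like the following (sketch only; train_and_eval is again a hypothetical stand-in):

    from skopt import gp_minimize
    from skopt.space import Integer, Real

    def train_and_eval(lr, batch_size):
        """Hypothetical: train with these settings, return validation loss."""
        ...

    space = [
        Real(1e-5, 1e-3, prior="log-uniform", name="lr"),
        Integer(128, 1024, name="batch_size"),
    ]

    def objective(params):
        lr, batch_size = params
        return train_and_eval(lr=lr, batch_size=batch_size)

    # A Gaussian-process surrogate picks the next configs to try, spending the
    # 20-run budget where it expects the largest improvement.
    result = gp_minimize(objective, space, n_calls=20, random_state=0)
    print(result.x, result.fun)

Again, only feasible when a single run is cheap enough to repeat 20 times.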
They seem to be adjusting the LR between epochs as well when the loss explodes, not just between runs. But I haven't read through the whole thing yet; maybe they trained the whole thing properly from start to finish at the end. Otherwise that would be extremely hacky and irreproducible.
Yeah I think for now they were just trying to get any comparable results due to a near complete lack of details on GPT-3. They seemed to have a hard deadline for the task.
The time and expense of training a model at this size does not lend itself well to trial and error. It's simply impractical to iteratively try ~20 different learning schedules.
Hideously inefficient and hacky to have someone manually tweaking things, but not terribly different from the state of the art for scientific research. As long as they state the objectives of their manual control and produce a log of what they did, someone else could try to replicate it.
They hot-swapped all kinds of model hyperparameters, such as the activation function and the optimizer. It doesn't look like there was a principled reason why they kept switching. Maybe as they were training, their data scientists kept finding ways to improve the model? Not sure, but it looks extremely hacky to me. Not something a team kicked off one day and forgot about until it finished training.
I've started reading from the bottom and haven't read the whole thing yet. But their default action, as stated in the log, when facing exploding gradients or unstable training is to just roll back to a checkpoint and lower the LR. Other proposed actions, such as clamping activations, are also just pretty standard things to try.
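The recovery recipe itself is simple enough to sketch (general pattern in PyTorch terms, not their actual tooling):

    import torch

    def recover_from_divergence(model, opt, ckpt_path, lr_scale=0.5):
        """Roll back to the last good checkpoint and restart with a lower LR."""
        state = torch.load(ckpt_path)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["optimizer"])
        for group in opt.param_groups:
            group["lr"] *= lr_scale  # how much to lower it is a judgment call

    # inside the training loop, after computing `loss`:
    #     if not torch.isfinite(loss):
    #         recover_from_divergence(model, opt, "last_good.pt")
    #     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

The hard part isn't the code, it's deciding when to pull the trigger and by how much.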
I guess since their goal is just to end up with a trained model, it doesn't really matter. But it doesn't seem to be an easily reproducible process, and like I said, a bit hacky in my opinion.
I'm surprised how many hardware issues they were having.
I am oncall for a ~10k node system, and this log looks pretty similar to my workload... Yet Facebook only had 1% of the number of machines I look after for this! With far fewer machines, they should have far fewer failures!
I suspect they are doing a bad job of root causing failures to make sure they never happen again. For example, that Nvidia infoROM message should have ended up with all the logs and a couple of troublesome boards sent to Nvidia engineering to find out why the corruption happens, how to make it never happen again, how to scan to find out if it has happened, how to auto-undo the corruption, etc.
The same with the InfiniBand bandwidth issues - get that stuff sent to someone who can hook up a logic analyzer or look at traces to find out exactly why it's happening, and adjust the design of the hardware, firmware or software to make sure it can't happen again and that you have good visibility of any future similar issues beyond just "it's kinda slow, shrug".
People are surprised by how ad-hoc the whole approach was. But line 1 in the logbook states the goal of the project:
> Goal: Get a 175B dense model up and running by any means necessary.
"by any means necessary" is engineering speak for "just keep solving problems, in the hackiest way possible, if necessary, and don't stop until the goal is achieved".
From the note:
> "AKA: Help! I’m oncall, it’s 3am, and everything is on fire!"
I didn't think ML model training ever needed on-call, especially for research-oriented projects like this one. But apparently it's a thing. So is this what MLOps is about?