Wow, they hot-swapped activation functions (GELU -> ReLU) during training. They are indeed very similar activation functions, but it's kinda crazy to me that you can make that kind of a change to a model while it's training, preserving all weights and other state, and just keep going. They changed weight clipping thresholds on the fly too.
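For anyone wondering how that can even work: GELU and ReLU have no learnable parameters, so swapping one for the other doesn't touch the state dict at all. A rough sketch of the idea, assuming a PyTorch-style model where the activation is a submodule (not their actual code):

    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, d_model=1024):
            super().__init__()
            self.fc1 = nn.Linear(d_model, 4 * d_model)
            self.act = nn.GELU()  # activation kept as a swappable submodule
            self.fc2 = nn.Linear(4 * d_model, d_model)

        def forward(self, x):
            return self.fc2(self.act(self.fc1(x)))

    model = Block()
    # ... training ...
    # Mid-training "hot swap": neither GELU nor ReLU holds any weights,
    # so every learned parameter in model.state_dict() is left untouched.
    model.act = nn.ReLU()

All the swap changes is the forward computation from the next step onward.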
They also swapped out the optimizer several times from what I can tell, switching between Adam, "Fake SGD", and "Vanilla SGD" multiple times.
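Swapping the optimizer is similarly mundane at the code level: you just construct a new optimizer over the same parameters at a checkpoint boundary. A sketch along the same hypothetical lines (the real caveat is that Adam's moment estimates simply get thrown away):

    import torch

    model = torch.nn.Linear(1024, 1024)  # stand-in for the real model
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # ... train for a while ...

    # Swap Adam -> vanilla SGD. The weights live in `model`, not in the
    # optimizer, so nothing learned is lost; only Adam's running first/second
    # moment buffers are discarded.
    opt = torch.optim.SGD(model.parameters(), lr=1e-4)

Whether the run tolerates losing that optimizer state is a separate question, but mechanically it's a two-line change.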
Even without the huge number of hardware/driver issues they seemed to be having with the GPUs in their big training cluster(s), this puts into perspective how hard it is to train enormous models like this. Many of the failures don't have an immediately obvious cause. Plus, there aren't all that many places out there doing training at this scale, so I imagine a lot of this has to be figured out in-house.
It's pretty reassuring to see that constantly fiddling with the model and adjusting learning rates on the fly is normal even at leading research labs. On the other hand, it only makes the replication crisis worse.
After a quick look through, I really hope releasing raw notes like this becomes more of a trend!
Not sure if your comment is meant as a disagreement or a question.
Generally the way hyperparameters are adjusted is some mix of intuition/experience and random/grid search. Plus, most people don't have the resources/infra to do a large-scale grid search on a model that might take a day or more to train. It's somewhat principled, but a random search is often just as good as fiddling with the numbers by hand, and you frequently have to figure out why something worked post hoc. You also accept that you might never have a good explanation - for all you know it's dataset dependent - and trust that your results are good enough to convince peer review (and you can show that this other parameter set was worse, so you didn't use it). It's hacky in the sense that a lot of the work in getting to state of the art (moving the needle on a benchmark by less than 1%) involves playing with the numbers until you get the best results. For example, here the engineers modify the learning rate between runs. I don't think they really had any theoretical reason behind the step changes beyond "this will probably work better, because we've seen that effect when training similar-sized models".
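For concreteness, "random search" at smaller scale is often nothing fancier than this kind of loop (a toy sketch; train_and_eval is a hypothetical stand-in for a full training-plus-validation run):

    import random

    def train_and_eval(lr, batch_size, dropout):
        """Hypothetical: train with these settings, return validation loss."""
        ...

    best = None
    for _ in range(20):  # budget: 20 full training runs
        cfg = {
            "lr": 10 ** random.uniform(-5, -3),            # log-uniform LR
            "batch_size": random.choice([256, 512, 1024]),
            "dropout": random.uniform(0.0, 0.3),
        }
        loss = train_and_eval(**cfg)
        if best is None or loss < best[0]:
            best = (loss, cfg)

At GPT-3 scale you obviously can't afford 20 full runs, which is exactly why it degenerates into hand-tuning a single run.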
Adjusting the learning rate schedule is one of the simplest knobs to tweak. When you're working with huge models, you generally want to use as big a batch size as you can get away with to reduce training time. That runs a bit counter to earlier thinking, where LeCun said something like "friends don't let friends use batch sizes > 32".
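To show how cheap that knob is: a warmup-then-decay schedule is a handful of lines to define and trivial to re-tweak between restarts. Illustrative numbers only, not what they actually used, using PyTorch's LambdaLR:

    import torch

    model = torch.nn.Linear(1024, 1024)                 # stand-in model
    opt = torch.optim.SGD(model.parameters(), lr=3e-4)  # lr here is the peak LR

    warmup_steps, total_steps = 2_000, 100_000          # made-up numbers

    def lr_lambda(step):
        # Linear warmup to the peak LR, then linear decay to 10% of it.
        if step < warmup_steps:
            return step / warmup_steps
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return max(0.1, 1.0 - 0.9 * progress)

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    # per training step: opt.step(); sched.step()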
There are also some more guided methods, like exploring the parameter space in a Bayesian way (e.g. trying to efficiently work out which knobs make the most difference).
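A minimal version of that with something like scikit-optimize might look like the following (sketch only; train_and_eval is again a hypothetical stand-in):

    from skopt import gp_minimize
    from skopt.space import Integer, Real

    def train_and_eval(lr, batch_size):
        """Hypothetical: train with these settings, return validation loss."""
        ...

    space = [
        Real(1e-5, 1e-3, prior="log-uniform", name="lr"),
        Integer(128, 1024, name="batch_size"),
    ]

    def objective(params):
        lr, batch_size = params
        return train_and_eval(lr=lr, batch_size=batch_size)

    # A Gaussian-process surrogate picks the next configs to try, spending the
    # 20-run budget where it expects the largest improvement.
    result = gp_minimize(objective, space, n_calls=20, random_state=0)
    print(result.x, result.fun)

Again, only feasible when a single run is cheap enough to repeat 20 times.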
They seem to be adjusting the LR between epochs as well when the loss explodes, not just between runs. But I haven't read through the whole thing yet; maybe they trained the whole thing properly from start to finish at the end. Otherwise that would be extremely hacky and irreproducible.
Yeah I think for now they were just trying to get any comparable results due to a near complete lack of details on GPT-3. They seemed to have a hard deadline for the task.
The time and expense of training a model at this size does not lend itself well to trial and error. It's simply impractical to iteratively try ~20 different learning schedules.
Hideously inefficient and hacky to have someone manually tweaking things, but not terribly different from the state of the art for scientific research. As long as they state the objectives of their manual control and produce a log of what they did, someone else could try to replicate it.
They hot-swapped all kinds of model hyperparameters, such as the activation function and the optimizer. It doesn't look like there was a principled reason why they kept switching. Maybe as they were training, their data scientists kept finding ways to improve the model? Not sure, but it looks extremely hacky to me. Not something a team kicked off one day and forgot about until it finished training.
I've started reading from the bottom and haven't read the whole thing yet. But their default action, as stated in the log, when facing exploding gradients or unstable training is to just roll back to a checkpoint and lower the LR. Other proposed actions, such as clamping activations, are also just pretty standard things to try.
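The recovery recipe itself is simple enough to sketch (general pattern in PyTorch terms, not their actual tooling):

    import torch

    def recover_from_divergence(model, opt, ckpt_path, lr_scale=0.5):
        """Roll back to the last good checkpoint and restart with a lower LR."""
        state = torch.load(ckpt_path)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["optimizer"])
        for group in opt.param_groups:
            group["lr"] *= lr_scale  # how much to lower it is a judgment call

    # inside the training loop, after computing `loss`:
    #     if not torch.isfinite(loss):
    #         recover_from_divergence(model, opt, "last_good.pt")
    #     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

The hard part isn't the code, it's deciding when to pull the trigger and by how much.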
I guess since their goal is just to end up with a trained model, it doesn't really matter. But it doesn't seem to be an easily reproducible process, and like I said, a bit hacky in my opinion.
I'm surprised how many hardware issues they were having.
I am oncall for a ~10k node system, and this log looks pretty similar to my workload... Yet Facebook only had 1% of the number of machines I look after for this! With far fewer machines, they should have far fewer failures!
I suspect they are doing a bad job of root causing failures to make sure they never happen again. For example, that Nvidia infoROM message should have ended up with all the logs and a couple of troublesome boards sent to Nvidia engineering to find out why the corruption happens, how to make it never happen again, how to scan to find out if it has happened, how to auto-undo the corruption, etc.
The same with the InfiniBand bandwidth issues - get that stuff sent to someone who can hook up a logic analyzer or look at traces to find out exactly why it's happening, and adjust the design of the hardware, firmware or software to make sure it can't happen again and that you have good visibility of any future similar issues beyond just "it's kinda slow, shrug".
People are surprised by how ad-hoc the whole approach was. But line 1 in the logbook states the goal of the project:
> Goal: Get a 175B dense model up and running by any means necessary.
"by any means necessary" is engineering speak for "just keep solving problems, in the hackiest way possible, if necessary, and don't stop until the goal is achieved".
From the note:
> "AKA: Help! I’m oncall, it’s 3am, and everything is on fire!"
I didn't think ML model training ever needed on-call, especially for research-oriented projects like this one. But apparently it's a thing. So is this what MLOps is about?