What prevents parallel LLM training? If you read book 1 first and then book 2, the resulting update in your knowledge will be roughly the same as if you had read the books in the reverse order. It seems reasonable to assume that if an LLM is trained on each book independently, the two deltas in the LLM weights can simply be added up.
This is not at all intuitive to me. It doesn't make sense from a human perspective, as each book changes you. Consider the trivial case of a series, where nothing will make sense if you haven't read the prior books (not that I think they feed it the book corpus in order, though maybe they should!). But even in a more philosophical sense, each book changes you, and the person who reads Harry Potter first and The Iliad second will have a different experience of each.
Then, with large language models, we have the concept of grokking something. If grokking happens in the middle of book 1, it is a different model that goes on to read book 2, and of course the same applies in reverse.
This isn't true. Set up even a simple ANN, a dense feed-forward network with three layers, you know the one. Then train two models, keeping everything identical except the order of the data. You'll end up with two different models even though you started from the same weights, hyperparameters, etc.
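A rough sketch of what I mean, as a toy PyTorch experiment (the data, architecture, and learning rate here are all made up, just placeholders to show the effect):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(100, 10)   # toy inputs
y = torch.randn(100, 1)    # toy targets

def train(order):
    # Re-seed so both runs start from identical initial weights.
    torch.manual_seed(42)
    model = nn.Sequential(
        nn.Linear(10, 16), nn.ReLU(),
        nn.Linear(16, 16), nn.ReLU(),
        nn.Linear(16, 1),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for i in order:  # one sample at a time, in the given order
        loss = nn.functional.mse_loss(model(X[i:i + 1]), y[i:i + 1])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.cat([p.detach().flatten() for p in model.parameters()])

w_forward = train(list(range(100)))
w_reversed = train(list(range(99, -1, -1)))
print(torch.norm(w_forward - w_reversed))  # clearly nonzero: two different models
```

Same initialization, same samples, same hyperparameters; the only thing that changes is the order, and the final weight vectors diverge.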
I'm not sure this is true. For instance, consider reading textbooks for linear algebra and functional analysis out of order. You might still grok the functional analysis if you read it first, but you'd be better served by reading the linear algebra one first.
In ordinary gradient descent the order does matter, since the position in weight space changes between updates. I think stochastic gradient descent does sum several gradients together within a minibatch, but I'm not sure what the trade-offs are or whether LLMs do so as well.
By the “delta in the LLM weights”, I assume you mean the gradients. You are effectively describing large-batch training (data parallelism), which is part of how you can scale up, but there are quickly diminishing returns to large batch sizes.
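To make the distinction concrete: gradients computed at the same weights do sum, and that is all a big batch (or data parallelism) amounts to. It's the deltas from separate sequential runs that don't combine that way. A quick sketch with a made-up toy model and data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
xa, ya = torch.randn(8, 10), torch.randn(8, 1)  # shard A ("book 1")
xb, yb = torch.randn(8, 10), torch.randn(8, 1)  # shard B ("book 2")

# Gradient of shard A alone, at the current weights.
model.zero_grad()
nn.functional.mse_loss(model(xa), ya, reduction='sum').backward()
grad_a = model.weight.grad.clone()

# Gradient of shard B alone, at the SAME weights (no step in between).
model.zero_grad()
nn.functional.mse_loss(model(xb), yb, reduction='sum').backward()
grad_b = model.weight.grad.clone()

# Gradient of both shards together, i.e. one big batch.
model.zero_grad()
(nn.functional.mse_loss(model(xa), ya, reduction='sum')
 + nn.functional.mse_loss(model(xb), yb, reduction='sum')).backward()
grad_batch = model.weight.grad.clone()

# Gradients taken at one point in weight space add up exactly...
print(torch.allclose(grad_batch, grad_a + grad_b))  # True
# ...but once you *step* on shard A before looking at shard B, the second
# gradient is evaluated at different weights and the equivalence breaks.
```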
The "deltas" are calculated by the error in how well the current state of the network predicts the output, backpropagated. Sequential runs are not commutative because the state changes.
Consider the trivial example of training a network to distinguish between sample A and sample B. Give it a hundred As in a row and it just learns "everything is A". Give it a hundred Bs in a row and it relearns "no, everything is B". To train it to distinguish, you must alternate As and Bs (and not too regularly, either!)