What prevents parallel LLM training? If you read book 1 first and then book 2, the resulting update in your knowledge will be roughly the same as if you had read the books in the reverse order. It seems reasonable to assume that if an LLM is trained on each book independently, the two deltas in the LLM weights can simply be added up.
This is not at all intuitive to me. It doesn't make sense from a human perspective, as each book changes you. Consider the trivial case of a series, where nothing will make sense if you haven't read the prior books (not that I think they feed it the book corpus in order, though maybe they should!). But even in a more philosophical sense, each book changes you, and the person who reads Harry Potter first and The Iliad second will have a different experience of each.
Then, with large language models, we have the concept of grokking something. If grokking happens in the middle of book 1, it is a different model that goes on to read book 2, and of course the same applies in reverse.
This isn't true. Set up even a simple ANN, a dense feed-forward network with three layers, you know the one. Then train two models, keeping everything identical except the order of the data. You'll end up with two different models even though you started from the same weights, hyperparameters, etc.
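A rough sketch of what I mean, as a toy PyTorch experiment (the data, architecture, and learning rate here are all made up, just placeholders to show the effect):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(100, 10)   # toy inputs
y = torch.randn(100, 1)    # toy targets

def train(order):
    # Re-seed so both runs start from identical initial weights.
    torch.manual_seed(42)
    model = nn.Sequential(
        nn.Linear(10, 16), nn.ReLU(),
        nn.Linear(16, 16), nn.ReLU(),
        nn.Linear(16, 1),
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for i in order:  # one sample at a time, in the given order
        loss = nn.functional.mse_loss(model(X[i:i + 1]), y[i:i + 1])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.cat([p.detach().flatten() for p in model.parameters()])

w_forward = train(list(range(100)))
w_reversed = train(list(range(99, -1, -1)))
print(torch.norm(w_forward - w_reversed))  # clearly nonzero: two different models
```

Same initialization, same samples, same hyperparameters; the only thing that changes is the order, and the final weight vectors diverge.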
I'm not sure this is true. For instance, consider reading textbooks for linear algebra and functional analysis out of order. You might still grok the functional analysis if you read it first, but you'd be better served by reading the linear algebra one first.
In ordinary gradient descent the order does matter, since the position in weight space changes between updates. I think stochastic gradient descent does sum several gradients together within a minibatch, but I'm not sure what the trade-offs are or whether LLMs do so as well.
By the “delta in the LLM weights”, I assume you mean the gradients. You are effectively describing large-batch training (data parallelism), which is part of how you can scale up, but there are quickly diminishing returns to large batch sizes.
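To make the distinction concrete: gradients computed at the same weights do sum, and that is all a big batch (or data parallelism) amounts to. It's the deltas from separate sequential runs that don't combine that way. A quick sketch with a made-up toy model and data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
xa, ya = torch.randn(8, 10), torch.randn(8, 1)  # shard A ("book 1")
xb, yb = torch.randn(8, 10), torch.randn(8, 1)  # shard B ("book 2")

# Gradient of shard A alone, at the current weights.
model.zero_grad()
nn.functional.mse_loss(model(xa), ya, reduction='sum').backward()
grad_a = model.weight.grad.clone()

# Gradient of shard B alone, at the SAME weights (no step in between).
model.zero_grad()
nn.functional.mse_loss(model(xb), yb, reduction='sum').backward()
grad_b = model.weight.grad.clone()

# Gradient of both shards together, i.e. one big batch.
model.zero_grad()
(nn.functional.mse_loss(model(xa), ya, reduction='sum')
 + nn.functional.mse_loss(model(xb), yb, reduction='sum')).backward()
grad_batch = model.weight.grad.clone()

# Gradients taken at one point in weight space add up exactly...
print(torch.allclose(grad_batch, grad_a + grad_b))  # True
# ...but once you *step* on shard A before looking at shard B, the second
# gradient is evaluated at different weights and the equivalence breaks.
```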
The "deltas" are calculated by the error in how well the current state of the network predicts the output, backpropagated. Sequential runs are not commutative because the state changes.
Consider the trivial example of training a network to distinguish between sample A and sample B. Give it a hundred As in a row and it just learns "everything is A". Give it a hundred Bs in a row and it relearns "no, everything is B". To train it to distinguish, you must alternate As and Bs (and not too regularly, either!)