
"In this work, we argue that the training loss instabilities observed in large-scale training should be associated with the time-domain correlation between the gradient estimates of earlier layers in the deep-learning models. Based on the identified connection, we propose several ways to mitigate the instabilities, along with the heuristic method that was known in the literature. We conclude that at this point, there is no silver bullet to solve the problem, and the appropriate remedy depends on the specific setup of the large-scale training run."



So, it's a form of superstition?





