"In this work, we argue that the training loss instabilities observed in large-scale training are associated with the time-domain correlation between gradient estimates of the earlier layers of deep-learning models. Based on this connection, we propose several ways to mitigate the instabilities, alongside a heuristic method previously known in the literature. We conclude that, at this point, there is no silver bullet for the problem: the appropriate remedy depends on the specific setup of the large-scale training run."
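The excerpt does not define "time-domain correlation between gradient estimates," so the sketch below is one plausible reading, not the paper's method: treat the per-step gradient estimates of a single (e.g. early) layer as a multivariate time series and measure the lag-1 Pearson correlation, averaged over coordinates. Uncorrelated minibatch noise should score near 0; persistent, drifting gradients score closer to 1. The function name and the synthetic AR(1) stand-in for "correlated gradients" are both illustrative assumptions.

```python
import numpy as np

def lag_autocorrelation(grads, lag=1):
    """Mean per-coordinate Pearson correlation between gradient
    estimates `lag` steps apart.

    grads: array of shape (T, D) -- one flattened gradient estimate
    per training step for a single layer (hypothetical input; a real
    run would log these during training).
    """
    a, b = grads[:-lag], grads[lag:]
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    num = (a * b).sum(axis=0)
    den = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + 1e-12
    return float(np.mean(num / den))

rng = np.random.default_rng(0)

# I.i.d. noise: successive "gradients" are independent -> score near 0.
white = rng.standard_normal((500, 32))

# AR(1) process as a toy model of temporally correlated gradients:
# each step retains 0.9 of the previous one -> score near 0.9.
ar = np.zeros((500, 32))
for t in range(1, 500):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal(32)

print(lag_autocorrelation(white))  # expect a value near 0
print(lag_autocorrelation(ar))     # expect a value near 0.9
```

In an actual training run, `grads` would be filled with snapshots of an early layer's gradient taken at consecutive optimizer steps; a score well above zero would be the kind of time-domain correlation the excerpt associates with loss instabilities.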