There are other mechanisms for dealing with vanishing and exploding gradients. I (maybe wrongly?) think of batch normalization as being most distinctively about fighting internal covariate shift: https://machinelearning.wtf/terms/internal-covariate-shift/
Karpathy covers this in Makemore, but the tl;dr is that if you don’t normalize the batch (essentially center and scale each layer’s activations to roughly zero mean and unit variance), then at backprop time you may get local gradients that are significantly smaller or larger than 1. This is a problem because, as you stack layers in sequence (passing outputs to inputs), the gradients multiply together (because of the Chain Rule), so what would have been a well-behaved gradient at the end layers has either vanished by the time it reaches the early layers (the upstream gradients were 0&lt;x&lt;1 at each layer) or exploded (the upstream gradients were x>>1). Batch normalization helps control the vanishing/exploding gradient problem in deep neural nets by normalizing the values passed between layers.
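Here’s a minimal sketch of the normalization step in PyTorch, roughly in the spirit of the Makemore version (names like `gain`/`bias` and the toy shapes are mine, and running statistics for inference are omitted), just to show the center-and-scale idea, not a drop-in replacement for `torch.nn.BatchNorm1d`:

```python
import torch

def batchnorm_forward(x, gain, bias, eps=1e-5):
    # x: (batch_size, n_hidden) pre-activations for one layer
    mean = x.mean(dim=0, keepdim=True)          # per-feature mean over the batch
    var = x.var(dim=0, keepdim=True)            # per-feature variance over the batch
    x_hat = (x - mean) / torch.sqrt(var + eps)  # center and scale to ~zero mean, unit variance
    return gain * x_hat + bias                  # learnable scale/shift so the layer isn't stuck at exactly 0/1

# Usage: normalize pre-activations before the nonlinearity.
batch_size, n_hidden = 32, 200
x = torch.randn(batch_size, n_hidden) * 5       # deliberately badly scaled activations
gain = torch.ones(1, n_hidden)
bias = torch.zeros(1, n_hidden)
out = batchnorm_forward(x, gain, bias)
print(out.mean().item(), out.std().item())      # roughly 0 and 1
```

Because each layer’s outputs get squashed back to roughly unit scale before being fed forward, the per-layer gradients stay closer to 1, and the chain-rule product across many layers is much less likely to collapse toward 0 or blow up.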