
Karpathy covers this in Makemore, but the tl;dr is that if you don’t normalize the batch (essentially centering and scaling your activations to roughly zero mean and unit variance), then at gradient/backprop time you may get values that are significantly smaller or larger than 1. This is a problem because, as you stack layers in sequence (passing outputs to inputs), the gradient compounds multiplicatively (because of the chain rule), so what starts as a well-behaved gradient at the output layers has either vanished (the upstream gradients were 0<x<1 at each layer) or exploded (the upstream gradients were x>>1) by the time it reaches the early layers. Batch normalization helps control the vanishing/exploding gradient problem in deep neural nets by normalizing the values passed between layers.
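To make that concrete, here's a minimal sketch of the normalization step batch norm performs on a layer's pre-activations. This is not Karpathy's exact Makemore code; the function and parameter names (gain, bias, eps) are illustrative, but the computation (center, scale, then apply a learnable scale/shift) is the standard training-time batch norm path.

  import torch

  def batchnorm(x, gain, bias, eps=1e-5):
      # x: (batch_size, n_hidden) pre-activations
      mean = x.mean(0, keepdim=True)             # per-feature mean over the batch
      var = x.var(0, keepdim=True)               # per-feature variance over the batch
      xhat = (x - mean) / torch.sqrt(var + eps)  # center and scale to ~unit variance
      return gain * xhat + bias                  # learnable scale/shift restore expressivity

  # Usage: normalize pre-activations before the nonlinearity
  x = torch.randn(32, 100) * 5 + 3               # batch of 32, 100 hidden units, badly scaled
  gain = torch.ones(1, 100)
  bias = torch.zeros(1, 100)
  h = torch.tanh(batchnorm(x, gain, bias))       # tanh no longer saturates, gradients stay well behaved

Because each layer's outputs are kept near unit scale, the per-layer Jacobian factors in the chain-rule product stay close to 1, which is what keeps the compounded gradient from vanishing or exploding.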



Got it, thanks.



