
> the lottery hypothesis

Isn't that another way of saying the optimization algorithm used to find the network's weights (gradient descent) cannot find the global optimum? I mean, this is nothing new; the curse of dimensionality prevents any numeric optimizer from completely minimizing a complicated error function, and that's been known for decades. AFAIK there is no algorithm that can find the global minimum of an arbitrary function. And this is what currently limits neural network models: they could be much simpler and less resource-hungry if we had better optimizers.
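To make this concrete, here's a minimal sketch (my own, not from the article): plain gradient descent on a one-dimensional function with two minima. Which minimum it ends up in depends entirely on the starting point; more iterations don't help.

    # toy example: gradient descent gets stuck in whichever basin it starts in
    def f(x):
        return x**4 - 3 * x**2 + x       # global min near x = -1.30, local min near x = 1.11

    def grad(x):
        return 4 * x**3 - 6 * x + 1

    def gradient_descent(x0, lr=0.01, steps=2000):
        x = x0
        for _ in range(steps):
            x -= lr * grad(x)
        return x

    print(gradient_descent(-2.0))   # ~ -1.30, the global minimum
    print(gradient_descent(+2.0))   # ~  1.11, a worse local minimum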




In practice, you don't want the global optimum, because you can't put all possible inputs in the training data and need your system to "generalize" instead. The global optimum of the training loss would mean overfitting.


Can someone explain this? Isn't it possible for the global optimum to also be the right generalisation optimum?


It's possible, but unlikely. The issue is that your training examples are essentially a noisy representation of the general function you are trying to get it to learn. Generally, any representation that fits too well will incorporate the noise, and that will distort the learned function (in the case of a NN it'll generally mean memorising the input data). Most function-fitting approaches are vulnerable to this.
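A toy illustration of that (my own sketch, with a noisy sine standing in for the "general function"): a very flexible polynomial drives the training error towards zero by chasing the noise, while a modest one tracks the underlying sine better.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(-1, 1, 10)
    y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)  # noisy samples

    x_dense = np.linspace(-1, 1, 200)
    y_clean = np.sin(np.pi * x_dense)                      # the function we actually want

    for degree in (3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        clean_mse = np.mean((np.polyval(coeffs, x_dense) - y_clean) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.4f}, error vs true sine {clean_mse:.4f}")

On most seeds the degree-9 fit passes almost exactly through the noisy points but approximates the noise-free sine worse than the degree-3 fit does.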


Hm. I see. But, ultimately, overfitting is a consequence of too many parameters absorbing the noise. Perhaps one could fit smaller models and add artificial noise.
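For what it's worth, the "artificial noise" idea has a classical reading for linear models: fitting on inputs jittered with Gaussian noise is roughly equivalent to ridge (L2) regression, so the noise acts as a regulariser instead of extra parameters soaking it up. A rough sketch, assuming a plain least-squares model:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))
    w_true = np.array([1.0, -1.0, 0.5])
    y = X @ w_true + rng.normal(0, 0.3, size=200)

    sigma = 0.5
    # least squares on many noise-jittered copies of the inputs
    X_noisy = np.vstack([X + rng.normal(0, sigma, size=X.shape) for _ in range(50)])
    y_noisy = np.tile(y, 50)
    w_noise, *_ = np.linalg.lstsq(X_noisy, y_noisy, rcond=None)

    # ridge regression with the matching penalty (lambda = n_rows * sigma^2)
    lam = len(X) * sigma**2
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

    print("noise-augmented fit:", np.round(w_noise, 3))
    print("ridge fit:          ", np.round(w_ridge, 3))

The two weight vectors come out close, which is the usual argument (Bishop, 1995) that training with input noise behaves like weight decay for linear models.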


The global optimum would be taken in reference to the training data (because that's all you have to set the weights). Unless the training data represents all real world data perfectly, fully optimizing for it will pessimize the model in relation to some set of real world data.


Regularisation should not be done with the optimiser but with the loss function and the architecture.


The entire reason SGD works is that the stochastic nature of updates on minibatches acts as an implicit regularizer. This one perspective built the foundations for all of modern machine learning.

I completely agree that the most effective regularization is inductive bias in the architecture. But bang for buck, given all the memory/compute savings it accomplishes, SGD is the exemplar of implicit regularization techniques.
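To spell out the mechanics (my own sketch, not anyone's specific method): a minibatch gradient is an unbiased but noisy estimate of the full-batch gradient, so every SGD step is effectively "full gradient plus noise", and that injected noise is what people mean by SGD acting as an implicit regularizer.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))
    w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ w_true + rng.normal(0, 0.1, size=1000)

    def grad(w, Xb, yb):
        # gradient of mean squared error for a linear model
        return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

    w = np.zeros(5)
    full_grad = grad(w, X, y)                          # full-batch gradient
    idx = rng.choice(len(X), size=32, replace=False)
    mini_grad = grad(w, X[idx], y[idx])                # minibatch estimate of the same thing

    print("full-batch:", np.round(full_grad, 2))
    print("minibatch: ", np.round(mini_grad, 2))
    print("noise in this SGD step:", np.round(mini_grad - full_grad, 2))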


Maybe it should not be done that way, but the large neural networks of this decade absolutely rely on this. A network at the global minimum of any of the (regularized) loss functions that are used these days would be waaay overfitted.


Regularization only helps you so much.


In addition to that, the hypothesis asserts that a local minimum is likely not good enough. This is different from a few years ago, when most thought that the solution space was full of roughly equivalent local minima, so parameter initialization wouldn't matter that much. But that is perhaps because the threshold for acceptable performance is higher now, so luck is more important.


I think you're right, but the issue might be local minima, which a better optimiser wouldn't help with much. A reason a larger network might work better is that there are fewer local minima in a higher dimension, too.


> there are fewer local minima in a higher dimension

Is it actually proven, or another hypothesis? What is the reason behind this?


Just reasoning about this from first principles, but intuitively, the more dimensions you have, the more likely you are to find a gradient in some dimension. In an N-dimensional space, a local minimum needs to be a minimum in all N dimensions, right? Otherwise the algorithm will keep exploring down the gradient. (Not an expert on this stuff.) The more dimensions there are, the more likely it seems that a gradient exists down to some deeper minimum from any given point.
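That intuition can be pushed a little further with a crude toy model (my own sketch, very much not a proof): treat the curvatures at a critical point as a random symmetric matrix, and being a minimum in all N directions at once becomes exponentially unlikely as N grows, so most critical points end up being saddles.

    import numpy as np

    rng = np.random.default_rng(0)

    def fraction_all_positive(n, trials=2000):
        count = 0
        for _ in range(trials):
            A = rng.normal(size=(n, n))
            H = (A + A.T) / 2                          # random symmetric "Hessian"
            if np.all(np.linalg.eigvalsh(H) > 0):      # positive curvature in every direction
                count += 1
        return count / trials

    for n in (1, 2, 4, 8):
        print(f"dim {n}: fraction of random critical points that are minima = {fraction_all_positive(n):.3f}")

The fraction drops off very quickly with dimension, which matches the common claim (e.g. Dauphin et al. 2014) that in high-dimensional loss landscapes the troublesome critical points are mostly saddle points rather than bad local minima.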



