
> the lottery hypothesis

Isn't that another way of saying the optimization algorithm used to find the network's weights (gradient descent) cannot find the global optimum? I mean, this is nothing new; the curse of dimensionality prevents any numeric optimizer from completely minimizing a complicated error function, and that's been known for decades. AFAIK there is no algorithm that can find the global minimum of an arbitrary function. And this is what currently limits neural network models: they could be much simpler and less resource-hungry if we had better optimizers.
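To make this concrete, here's a minimal sketch (my own, not from the article): plain gradient descent on a one-dimensional function with two minima. Which minimum it ends up in depends entirely on the starting point; more iterations don't help.

    # toy example: gradient descent gets stuck in whichever basin it starts in
    def f(x):
        return x**4 - 3 * x**2 + x       # global min near x = -1.30, local min near x = 1.11

    def grad(x):
        return 4 * x**3 - 6 * x + 1

    def gradient_descent(x0, lr=0.01, steps=2000):
        x = x0
        for _ in range(steps):
            x -= lr * grad(x)
        return x

    print(gradient_descent(-2.0))   # ~ -1.30, the global minimum
    print(gradient_descent(+2.0))   # ~  1.11, a worse local minimum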




In practice, you don't want the global optimum, because you can't put all possible inputs in the training data and need your system to "generalize" instead. The global optimum of the training loss would mean overfitting.


Can someone explain this? Isn't it possible for the global optimum to also be the right generalisation optimum?


It's possible, but unlikely. The issue is that your training examples are essentially a noisy representation of the general function you are trying to get it to learn. Generally, any representation that fits too well will incorporate the noise, and that will distort the learned function (in the case of a NN it'll generally mean memorising the input data). Most function-fitting approaches are vulnerable to this.
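A toy illustration of that (my own sketch, with a noisy sine standing in for the "general function"): a very flexible polynomial drives the training error towards zero by chasing the noise, while a modest one tracks the underlying sine better.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(-1, 1, 10)
    y_train = np.sin(np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)  # noisy samples

    x_dense = np.linspace(-1, 1, 200)
    y_clean = np.sin(np.pi * x_dense)                      # the function we actually want

    for degree in (3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        clean_mse = np.mean((np.polyval(coeffs, x_dense) - y_clean) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.4f}, error vs true sine {clean_mse:.4f}")

On most seeds the degree-9 fit passes almost exactly through the noisy points but approximates the noise-free sine worse than the degree-3 fit does.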


Hm. I see. But, ultimately, overfitting is a consequence of too many parameters absorbing the noise. Perhaps one could fit smaller models and add artificial noise.
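For what it's worth, the "artificial noise" idea has a classical reading for linear models: fitting on inputs jittered with Gaussian noise is roughly equivalent to ridge (L2) regression, so the noise acts as a regulariser instead of extra parameters soaking it up. A rough sketch, assuming a plain least-squares model:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 3))
    w_true = np.array([1.0, -1.0, 0.5])
    y = X @ w_true + rng.normal(0, 0.3, size=200)

    sigma = 0.5
    # least squares on many noise-jittered copies of the inputs
    X_noisy = np.vstack([X + rng.normal(0, sigma, size=X.shape) for _ in range(50)])
    y_noisy = np.tile(y, 50)
    w_noise, *_ = np.linalg.lstsq(X_noisy, y_noisy, rcond=None)

    # ridge regression with the matching penalty (lambda = n_rows * sigma^2)
    lam = len(X) * sigma**2
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

    print("noise-augmented fit:", np.round(w_noise, 3))
    print("ridge fit:          ", np.round(w_ridge, 3))

The two weight vectors come out close, which is the usual argument (Bishop, 1995) that training with input noise behaves like weight decay for linear models.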


The global optimum would be taken in reference to the training data (because that's all you have to set the weights). Unless the training data represents all real world data perfectly, fully optimizing for it will pessimize the model in relation to some set of real world data.


Regularisation should not be done with the optimiser but with the loss function and the architecture.


The entire reason SGD works is that the stochastic nature of updates on minibatches acts as an implicit regularizer. This one perspective built the foundations for all of modern machine learning.

I completely agree that the most effective regularization is inductive bias in the architecture. But bang for buck, given all the memory/compute savings it accomplishes, SGD is the exemplar of implicit regularization techniques.
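To spell out the mechanics (my own sketch, not anyone's specific method): a minibatch gradient is an unbiased but noisy estimate of the full-batch gradient, so every SGD step is effectively "full gradient plus noise", and that injected noise is what people mean by SGD acting as an implicit regularizer.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))
    w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    y = X @ w_true + rng.normal(0, 0.1, size=1000)

    def grad(w, Xb, yb):
        # gradient of mean squared error for a linear model
        return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

    w = np.zeros(5)
    full_grad = grad(w, X, y)                          # full-batch gradient
    idx = rng.choice(len(X), size=32, replace=False)
    mini_grad = grad(w, X[idx], y[idx])                # minibatch estimate of the same thing

    print("full-batch:", np.round(full_grad, 2))
    print("minibatch: ", np.round(mini_grad, 2))
    print("noise in this SGD step:", np.round(mini_grad - full_grad, 2))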


Maybe it should not be done that way, but the large neural networks of this decade absolutely rely on this. A network at the global minimum of any of the (regularized) loss functions that are used these days would be waaay overfitted.


Regularization only helps you so much.


In addition to that, the hypothesis asserts that a local minimum is likely not good enough. This is different from a few years ago, when most thought that the solution space was full of roughly equivalent local minima, so parameter initialization wouldn't matter that much. But that is perhaps because the threshold for acceptable performance is higher now, so luck is more important.


I think you're right, but the issue might be local minima, which a better optimiser wouldn't help with much. A reason a larger network might work better is that there are fewer local minima in a higher dimension, too.


> there are fewer local minima in a higher dimension

Is it actually proven, or another hypothesis? What is the reason behind this?


Just reasoning about this from first principles, but intuitively, the more dimensions you have, the more likely you are to find a gradient in some dimension. In an N-dimensional space, a local minimum needs to be a minimum in all N dimensions, right? Otherwise the algorithm will keep exploring down the gradient. (Not an expert on this stuff.) The more dimensions there are, the more likely it seems that a gradient exists down to some deeper minimum from any given point.
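That intuition can be pushed a little further with a crude toy model (my own sketch, very much not a proof): treat the curvatures at a critical point as a random symmetric matrix, and being a minimum in all N directions at once becomes exponentially unlikely as N grows, so most critical points end up being saddles.

    import numpy as np

    rng = np.random.default_rng(0)

    def fraction_all_positive(n, trials=2000):
        count = 0
        for _ in range(trials):
            A = rng.normal(size=(n, n))
            H = (A + A.T) / 2                          # random symmetric "Hessian"
            if np.all(np.linalg.eigvalsh(H) > 0):      # positive curvature in every direction
                count += 1
        return count / trials

    for n in (1, 2, 4, 8):
        print(f"dim {n}: fraction of random critical points that are minima = {fraction_all_positive(n):.3f}")

The fraction drops off very quickly with dimension, which matches the common claim (e.g. Dauphin et al. 2014) that in high-dimensional loss landscapes the troublesome critical points are mostly saddle points rather than bad local minima.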



