
TL;DR: For high-dimensional models (say, with millions to billions of parameters), there's always a good set of parameters nearby, and when we start descending towards it, we are highly unlikely to get stuck, because there's almost always a path down along at least one of those dimensions -- i.e., there are effectively no local optima. Once we've stumbled upon a good set of parameters, as measured on a validation set, we can stop.
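
To make the "descend until validation looks good, then stop" idea concrete, here is a minimal sketch of mini-batch SGD with validation-based early stopping. It uses a toy logistic-regression task with synthetic data; all names, sizes, and hyperparameters are illustrative, not anyone's actual setup.

    # Minimal sketch: mini-batch SGD plus validation-based early stopping,
    # on a hypothetical toy logistic-regression problem.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic binary-classification data (stand-in for a real task).
    X = rng.normal(size=(1000, 20))
    true_w = rng.normal(size=20)
    y = (X @ true_w + 0.5 * rng.normal(size=1000) > 0).astype(float)
    X_train, y_train = X[:800], y[:800]
    X_val, y_val = X[800:], y[800:]

    def loss_and_grad(w, xb, yb):
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))   # sigmoid
        loss = -np.mean(yb * np.log(p + 1e-12) + (1 - yb) * np.log(1 - p + 1e-12))
        grad = xb.T @ (p - yb) / len(yb)
        return loss, grad

    w = np.zeros(20)
    best_w, best_val, patience = w.copy(), np.inf, 0
    for epoch in range(200):
        for i in range(0, len(X_train), 32):  # mini-batch SGD steps
            _, g = loss_and_grad(w, X_train[i:i+32], y_train[i:i+32])
            w -= 0.1 * g
        val_loss, _ = loss_and_grad(w, X_val, y_val)
        if val_loss < best_val - 1e-4:        # validation improved: keep going
            best_val, best_w, patience = val_loss, w.copy(), 0
        else:
            patience += 1
            if patience >= 10:                # "good enough" by validation: stop
                break
    print("best validation loss:", best_val)
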

These intuitions are consistent with my experience... but I think there's more to deep learning.

For instance, these intuitions fail to explain "weird" phenomena such as "double descent" and the "interpolation threshold" (a toy illustration follows the links below):

* https://openai.com/blog/deep-double-descent/

* https://arxiv.org/abs/1809.09349

* https://arxiv.org/abs/1812.11118

* See also: http://www.stat.cmu.edu/~ryantibs/papers/lsinter.pdf
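
For a feel of what those papers describe, here is a hedged, self-contained illustration (not taken from any of the linked papers' code): a random-features regression fit by minimum-norm least squares, where test error typically spikes as the number of features approaches the number of training samples and then descends again past that point. All sizes and the ReLU feature map are illustrative assumptions.

    # Illustrative double-descent sweep: train/test MSE of a random-features
    # model as capacity crosses the interpolation threshold (~n_train features).
    import numpy as np

    rng = np.random.default_rng(1)
    n_train, n_test, d = 100, 1000, 20

    X_train = rng.normal(size=(n_train, d))
    X_test = rng.normal(size=(n_test, d))
    w_true = rng.normal(size=d)
    y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
    y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

    def random_features(X, W):
        return np.maximum(X @ W, 0.0)             # ReLU random features

    for n_feat in [10, 50, 90, 100, 110, 200, 500, 2000]:
        W = rng.normal(size=(d, n_feat)) / np.sqrt(d)
        F_train, F_test = random_features(X_train, W), random_features(X_test, W)
        beta = np.linalg.pinv(F_train) @ y_train  # minimum-norm least squares
        train_err = np.mean((F_train @ beta - y_train) ** 2)
        test_err = np.mean((F_test @ beta - y_test) ** 2)
        print(f"{n_feat:5d} features  train MSE {train_err:8.3f}  test MSE {test_err:8.3f}")

Running the sweep usually shows train error hitting ~0 near 100 features while test error peaks there, then improves again in the heavily overparameterized regime -- the shape the linked papers analyze.
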

We still don't fully understand why stochastic gradient descent works so well in so many domains.




