Double Descent in Human Learning (chris-said.io)
154 points by tim_sw on April 26, 2023 | 34 comments


The linear regression example is somewhat interesting, but it also suggests that the double descent might have a somewhat strange cause. What technically happens is that as you add more parameters you no longer optimize only for model fit, but also for the size of the parameters.

If you force the model to fit your training data perfectly then it is no wonder that you can only start to optimize the size of your parameters after you have enough leeway to easily fit the entire training set.

However, another way to achieve the same effect is to make explicit the assumption that the parameters should be small(ish). Fitting the whole training set is nice but ultimately pointless if you're just fitting noise. If the actual data is a simple polynomial + random noise, then fitting all the noise will give a worse estimate.
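A minimal sketch of what making that assumption explicit looks like, using ridge regression (the degree, noise level, and alpha are arbitrary toy values, not anything from the article):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 20)
    y = x**3 - x + rng.normal(scale=0.1, size=x.size)   # simple polynomial + random noise

    # far more polynomial features than the underlying signal needs
    X = PolynomialFeatures(degree=15, include_bias=False).fit_transform(x[:, None])

    ols = LinearRegression().fit(X, y)    # free to chase the noise
    ridge = Ridge(alpha=1e-2).fit(X, y)   # explicit "coefficients should be small" assumption

    # the unpenalized coefficients are typically far larger than the ridge ones
    print(np.abs(ols.coef_).max(), np.abs(ridge.coef_).max())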


I think you're right: cross-validation should enable you to detect overfitting. When you see severe overfitting, you can try a different model (say, a statistical model that includes a noise term), or add regularization hyperparameters, which you can tune via cross-validation (or simply by re-running the fit a number of times; the information contained in the hyperparameters alone tends to be very low, so there is little risk of overfitting in the hyperparameter tuning process).
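A hedged sketch of that tuning loop with scikit-learn's RidgeCV (the data and the alpha grid are made-up toy values):

    import numpy as np
    from sklearn.linear_model import RidgeCV

    # toy data standing in for whatever you're actually fitting
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 30))
    y = X[:, 0] + rng.normal(scale=0.5, size=50)

    # tune the single regularization strength by 5-fold cross-validation
    model = RidgeCV(alphas=np.logspace(-4, 2, 20), cv=5).fit(X, y)
    print(model.alpha_)   # the regularization strength picked by CV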


In case you need a refresher, I made a very visual introduction to Double Descent here: https://mlu-explain.github.io/double-descent/

(There’s a math-ier follow-up linked as well.)


Fantastic material. Big thanks!


Intuition for double descent: when you have about the same number of parameters as data points, the fitted curve looks like one of those shitty "polynomial fits" in Excel, where away from the data points it overshoots and undershoots wildly. With lots more parameters this calms down and you get a smooth interpolation between the points.
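A rough numpy sketch of that intuition, using minimum-norm least squares on Legendre polynomial features (all the sizes here are arbitrary toy values): the test error typically spikes when the number of features is close to the number of points and calms down again when there are many more.

    import numpy as np

    rng = np.random.default_rng(0)
    n_train = 20
    x_train = rng.uniform(-1, 1, n_train)
    x_test = np.linspace(-1, 1, 200)
    truth = lambda x: np.sin(3 * x)
    y_train = truth(x_train) + rng.normal(scale=0.1, size=n_train)

    def features(x, p):
        # p Legendre-polynomial features of x
        return np.polynomial.legendre.legvander(x, p - 1)

    for p in (5, 20, 200):   # under-, roughly equal-, and over-parameterized
        # lstsq returns the minimum-norm solution when the system is underdetermined
        w, *_ = np.linalg.lstsq(features(x_train, p), y_train, rcond=None)
        test_mse = np.mean((features(x_test, p) @ w - truth(x_test)) ** 2)
        print(p, test_mse)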

I can imagine some tasks where people have the same problem. You're overfit to a very specific task, but outside it "when you have a hammer everything looks like a nail" and you end up doing something dumb.


Gradient descent finds the minimum L2-norm solution of the least squares problem? So adding an L2 regularization term should do nothing if the objective is convex, right? Is this common knowledge? I must have missed the memo, or be getting rusty.


It depends on the inputs. E.g., if two elements of the input are highly correlated, the network can assign an arbitrarily high positive weight to one element and then correct this by assigning an arbitrarily high negative weight to the other, with the total contribution of the two elements being reasonable. However, if these input elements are not as tightly correlated in the test data, the results may be bad. L2 regularisation ensures that this does not happen. (In practice, the effect will be mitigated by the activation function, but weights do blow up in training, leading to "saturated" models, which exhibit weird behaviours.)
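A small sketch of that failure mode (the correlation and noise scales are made up for illustration): two nearly identical input columns let unregularized least squares pick huge opposite-sign weights that mostly cancel, while ridge keeps both small.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=1e-6, size=100)   # nearly identical copy of x1
    X = np.column_stack([x1, x2])
    y = x1 + rng.normal(scale=0.1, size=100)

    print(LinearRegression().fit(X, y).coef_)   # typically huge weights of opposite sign
    print(Ridge(alpha=1.0).fit(X, y).coef_)     # both weights stay small and share the signal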


There’s definitely more to the picture than that - it’s really about how models in the interpolation region behave (see https://mlu-explain.github.io/double-descent/ or https://mlu-explain.github.io/double-descent2/)


The formal claim is in Appendix A here: https://arxiv.org/pdf/2303.14151.pdf


You don't necessarily want the optimal solution: L2 regularization can give you a slightly worse fit on the training set, but one that generalizes better to unseen data.


It finds the minimum with regards to the training data. The double descent phenomenon is about error on unseen data.


The objective is usually not convex in deep learning.


I would have thought the "few parameter" high-error region corresponds more naturally to the part of language learning where the learner overgeneralizes and thinks everything is regular, and the "many parameter" high-error region corresponds to knowing the irregular forms for each word that you've encountered. But this blog seems to think of it the other way around. Maybe I'm missing something.


It can be argued that the human case correlates better with the "data double descent" model, assuming the brain network size does not change. The thing is... it does! It goes up and down as connections grow and get pruned. Biology is messy. Biological analogies can never be perfectly clean, especially for a system as giant as a human.


I’ve heard the argument made that LLMs are simply models overfit on the entirety of the internet. Double descent is a good counterpoint to that claim: it suggests something more interesting than simple overfitting is happening at large parameter counts.


I find it hard to believe that every single problem has a solution on the internet, which is what "overfitting" would imply.

I was playing Fallout: New Vegas on Wine, and for some reason, the music on my pipboy wasn't playing. I searched the internet for the Wine errors from the terminal to no avail, and as a last resort asked ChatGPT. It gave me step-by-step instructions on how to fix it, and it worked.

If that doesn't demonstrate that LLMs have some kind of internal model of the world and understanding of it, then I don't know what will.


An alternative explanation is that your google-fu is not as good as openai's crawlers or the WebText corpus (which can go a lot deeper than any of your searches).


Makes me wonder how Google would compare if a good portion of the entire world hadn't spent decades trying to game their system. OpenAI created a completely new system and reaps the benefits of training on data that hasn't been twisted-half-way-to-hell to exploit it.


I can’t find anything on Google anymore, so I’m not sure this proves much except that OpenAI has better search.


ML is fundamentally pattern finding and matching.

The fact that it is useful for finding patterns that may be different from those humans tend to find is not an indication of understanding of the underlying data.

It is no different than clustering in traditional stats. While those found patterns are sometimes incredibly useful, clustering knows nothing outside of the provided dataset.

As others have mentioned, Google's search results are actually really bad at surfacing novel results these days, due to many factors like battling SEO tricks, etc.

But while the results of LLMs are impressive, there is no mechanism for them to have an 'internal model of the world' in their current form.

It may help to remember that current LLMs would require an infinity of RAM to be even computationally complete right now.


> The fact that it is useful for finding patterns that may be different from those humans tend to find is not an indication of understanding of the underlying data.

Without invoking your own self-awareness as an argument, how do you know that other people "understand" stuff, and aren't merely "finding patterns"? In other words, in what way do you define "understanding", such that you can be sure that LLMs have no such thing?

> there is no mechanism for them to have an 'internal model of the world' in their current form.

How do you know that? We don't even know why humans have an internal model of the world. What if internal modelling of the world is just sufficiently-complex pattern-matching?


If clustering has worked on what amounts to basically the entire world of information, things get a bit fuzzy though. I don't suppose you are technically incorrect, it's just that these words lose practical meaning when we talk about models that encode tens or even hundreds of billions of parameters.

Predicting the "next token" requires an "internal model of the world". It might not be how we do it, but without something that acts like it I'd be very interested in how you think it comes up with its predictions.

Let's say it needs to continue a short story about a detective. The detective says at the end: "[...] I have seen every clue and thought of every scenario. I will tell you who the killer is:". Good luck continuing that with any sort of accuracy if you don't have some abstract map of how "people" act. You can see how I can think of a lot of examples that require something that acts as a model of the "world".

There's a definite structure and pattern to everything we do. This (to an LLM) hidden context gives rise to the words we write. To re-invent them, like it has to do, it must basically conjure up all this hidden state. I'm not saying it gets it right, I'm just saying that there is no other way than to model the world behind the text to even get into ballpark-right territory.


>It may help to remember that current LLMs would require an infinity of RAM to be even computationally complete right now.

Anything that is computationally complete needs an infinite amount of RAM. This is not unique to LLMs or even to machine learning.


https://www.gog.com/forum/fallout_series/new_vegas_music_is_...

I haven’t played this game so I don’t know what I’m searching for but this was my first result. Seems on the money, no?


That's the thing - it isn't.

I've been through this forum, many reddit posts and other sites - none of the solutions worked. What worked was that ChatGPT figured out that I need to add the following line to ~/.wine/system.reg:

    [Software\\Wine\\GStreamer]
    "DllOverrides"="mscoree,mshtml="
And install the 32-bit version of the GStreamer "good" plugins:

    sudo apt-get install gstreamer1.0-plugins-good:i386
If you happen to find these exact instructions anywhere on the internet, please share, as that will be enough to convince me that LLMs aren't anything more than glorified search engines.

Otherwise, I can't help but be skeptical. If nothing else, it's plausible that LLMs have some kind of internal representation of the world.


https://baronhk.wordpress.com/2021/10/05/wine-still-needs-32...

How about this guy? You sure the first instruction is necessary?

I do think there is some decent ability to piece things together. But this example seems too niche.


I'd be interested to know if anyone has studied how overfitting translates into the domain of LLM output. It's easy to understand when you're fitting a line through data or building a classifier: you overfit, your test-set loss is higher than your training-set loss, and this directly relates to worse performance of the model. For an LLM picking probable next words, what's the analogy, and does overfitting make it "worse" even if the test-set loss is higher?


How is double descent a good counterpoint to the claim?


That novel reasoning is due to the model transitioning to a generalization regime, not merely parroting training data?


Overfitting is a lack of generalisation. LLMs can produce novel output by using abstraction composition, which you cannot do without generalisation.

Other signs of non-overfitting: Abstraction laddering, task decomposition, novel (i.e. unseen) joke explanation


In the polynomial fit case: it’s pretty clear to me that if the number of parameters is higher than the degrees of freedom in the data, then the model can simply memorize the input.

That doesn’t explain why it generalizes well, though.


It's only mentioned at the end. If you use some regularization, then among the many ways to fit the data (almost) perfectly you choose the one with the smallest coefficients, which is often one that generalizes well. And even without explicit regularization, some training methods implicitly apply some form of regularization when the number of parameters exceeds the number of data points.
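A tiny numpy sketch of that implicit choice (the sizes are arbitrary): with more parameters than data points there are infinitely many exact fits, and a plain least-squares solve returns the one with the smallest L2 norm.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(10, 50))   # 10 data points, 50 parameters: underdetermined
    y = rng.normal(size=10)

    w_min, *_ = np.linalg.lstsq(A, y, rcond=None)   # the minimum-L2-norm exact fit

    # any other exact fit differs by a null-space direction and has a larger norm
    null_dir = np.linalg.svd(A)[2][-1]   # a direction with A @ null_dir ~ 0
    w_other = w_min + 3.0 * null_dir
    print(np.allclose(A @ w_other, y), np.linalg.norm(w_min), np.linalg.norm(w_other))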


Has anyone been able to reproduce double descent without trying to invert a seriously ill-posed linear system? Showing that a solver has numerical issues which go away when enough parameters are introduced does not really prove anything. This looks much more like regularization through floating-point rounding.
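To make that cutoff concrete, here is a toy illustration (the sizes and rcond values are arbitrary): np.linalg.pinv drops singular values below its rcond threshold, which behaves a lot like truncated-SVD regularization.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(30, 30))
    A[:, -1] = A[:, 0] + 1e-10 * rng.normal(size=30)   # make the system nearly singular
    y = rng.normal(size=30)

    for rcond in (1e-15, 1e-8):
        # singular values below rcond * s_max are dropped, i.e. truncated-SVD regularization
        w = np.linalg.pinv(A, rcond=rcond) @ y
        print(rcond, np.linalg.norm(w))   # the larger cutoff gives a much tamer solution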

By the way, the linked Colab notebook is missing a "self." in front of the "lamb" in the unused regularization branch of the fit function.


OpenAI paper on double descent: https://arxiv.org/pdf/1912.02292.pdf

They show it happens across a variety of architectures and real world datasets



