Hacker News | xksteven's comments

I'd be careful of over-applying the "bias-variance tradeoff." Defining the variance of a model is not a simple task, and I wouldn't say it is immediately obvious how bias-variance relates to small-data scenarios.

How much data is considered small? What is the complexity of the dataset itself?

Even in Machine Learning it is possible to learn from small datasets without transfer learning. See meta-learning for instance.


> I'd be careful of over-applying the "bias-variance tradeoff." Defining the variance of a model is not a simple task, and I wouldn't say it is immediately obvious how bias-variance relates to small-data scenarios.

It's very important for anyone studying ML to understand how bias-variance relates to sample size, so I'd encourage finding more resources if this note didn't help clarify! Here's another shot at a summary: for a fixed size of training sample, you must trade off between sensitivity to randomness in the sample (variance) and assumptions that bias the model you train.
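
To make that concrete, here's a toy sketch (plain numpy; the degrees, sample size, and noise level are arbitrary choices of mine): fit polynomials of increasing degree to small noisy samples and compare average test error. The low-degree fits are stable but biased; the high-degree fits chase the noise in each small sample.

  import numpy as np

  rng = np.random.default_rng(0)

  def true_fn(x):
      return np.sin(2 * np.pi * x)

  def avg_test_error(n_train, degree, n_trials=200):
      # Average test error over many random training samples of the same size:
      # high-degree (low-bias) fits swing wildly from sample to sample (variance),
      # while low-degree (high-bias) fits are stable but systematically off.
      x_test = np.linspace(0, 1, 100)
      errs = []
      for _ in range(n_trials):
          x = rng.uniform(0, 1, n_train)
          y = true_fn(x) + rng.normal(0, 0.3, n_train)
          coefs = np.polyfit(x, y, degree)
          errs.append(np.mean((np.polyval(coefs, x_test) - true_fn(x_test)) ** 2))
      return np.mean(errs)

  for degree in (1, 3, 9):
      print(degree, avg_test_error(n_train=10, degree=degree))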

It's true that quantifying variance and bias can be hard, and you need systems like PAC learning to go further and actually estimate sufficient sample sizes for a task. But you can still reason usefully about any system that involves using data to select (train) among a class of potential output models!

For example, the statement about meta-learning is incorrect, at least as far as I've seen the term used. Meta-learning involves learning hyperparameters (including functions) that are then used to train a model. The extra stage makes these models less biased, but they actually require more data. (Of course, in some meta-learning systems, hyperparameters are learned with the help of external data - a form of transfer learning.)
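
For concreteness, here's the shape of that extra stage as a rough MAML-style sketch in plain PyTorch (the toy linear tasks and all names here are mine, purely illustrative): an outer loop learns an initialization across many tasks, so the meta-stage itself consumes data from lots of tasks even though each individual task is small.

  import torch

  def sample_task(n=10, dim=5):
      # A toy family of related regression tasks (hypothetical stand-in for real tasks).
      w_true = torch.randn(dim, 1)
      x = torch.randn(2 * n, dim)
      y = x @ w_true + 0.1 * torch.randn(2 * n, 1)
      return (x[:n], y[:n]), (x[n:], y[n:])   # (support set, query set)

  w = torch.zeros(5, 1, requires_grad=True)   # meta-learned initialization
  meta_opt = torch.optim.SGD([w], lr=1e-2)

  for step in range(500):
      (xs, ys), (xq, yq) = sample_task()
      # Inner loop: adapt to this task's small support set.
      loss_s = ((xs @ w - ys) ** 2).mean()
      (gw,) = torch.autograd.grad(loss_s, (w,), create_graph=True)
      w_adapted = w - 0.1 * gw
      # Outer loop: the meta-objective is post-adaptation error on the query set.
      loss_q = ((xq @ w_adapted - yq) ** 2).mean()
      meta_opt.zero_grad()
      loss_q.backward()
      meta_opt.step()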


Sometimes reimplementation is impossible without the code, and the paper goes on to win awards because it's by a famous scientist. Then if the reimplementation doesn't work, most of the time the graduate students are blamed instead of the original work.

There are always assumptions. At least with public code and models those assumptions are laid bare for all to see and potentially expose any bad assumptions.


My understanding of the paper was that it shows "how" they're equivalent, as opposed to how to actually construct or learn such an approximator.

It's similar to showing a problem is in NP: you can reduce it to another problem already known to be in NP and be done with it.


Agree, but also think the result may be too general to be useful. Proving that you can rewrite any network learned with gradient descent this way kinda suggests that the architecture doesn't matter, but we know that's not true. Eg, why are networks with skip connections SO much better than networks without? What about batch normalization? This makes me suspicious that it's a nice theoretical result a bit too high level to be useful. Yes, it was proved years ago that you can train an arbitrary function with a wide enough two-layer net, but it's not a terribly practical way to approach the world. Now we have architectures much better than two-layer networks, and, for that matter, SVMs.

There are a number of problems with SVMs; complexity for training and inference scales with the amount of training data, which is pretty sad panda for complex problems.

Extremely spicy/cynical take: it's not cool to say "you all should go look at all these possible applications" when the thrust of the paper is to prop up the relevance of an obsolete approach. You gotta do the actual work to close the gap if you still want your PhD to be worth something...

That said, I haven't read the paper terribly closely, and am always happy to be proven wrong!


I'd be curious whether re-framing a trained neural network model as an SVM gives you insight into its support vectors, and maybe a little understanding of why the NN works the way it does?


>suggests that the architecture doesn't matter, but we know that's not true. Eg, why are networks with skip connections SO much better than networks without? What about batch normalization?

Is this true though, or does network architecture only matter in terms of efficiency? This is non-rhetorical; I really don't know much about deep learning. :) I guess I'm asking: with enough data and compute, is architecture still relevant?


These things matter a lot in practice. Imagine a giant million dimensional loss surface, where each point is a set of weights for the model. Then the gradient is pushing us around on this surface, trying to find a 'minimum.' Current understanding (for a while, actually) is that you never really hit minima so much as giant mostly-flat regions where further improvement maybe takes a million years. The loss surfaces for models with skip connections seem to be much, much nicer.

https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9067...

In effect, there's a big gap between an existence proof and actually workable models, and the tricks of the trade do quite a lot to close the gap. (And there are almost certainly more tricks that we're still not aware of! I'm still amazed at how late in the game batch normalization was discovered.)

OTOH, so long as you're using the basic tricks of the trade, IME architecture doesn't really matter much. Our recent kaggle competition for birdsong identification was a great example of this: pretty much everyone reported that the difference between five or so 'best practices' feature extraction architectures (various permutations of resnet/efficientnet) was negligible.


Thank you so much for your response, that example makes a lot of sense. Algorithmically speaking in computer science, we can formalize efficiency with complexity theory.

Can we do the same with neural networks? Is there a formalization of why 'skip connections' (which I know nothing about) are better, why transformers are more efficient than recurrence, etc.?

Is it useful to talk about their complexity, universal properties, or size? (I realize this is muddled a bit by the fact that hardware architecture can sometimes trump efficiency.)


Classically, the equivalent of complexity theory in machine learning is statistical learning theory, where the main question is: if I have a magical algorithm that always finds the function in my class that fits the data best, how big does my dataset (which is a sample from some unknown probability distribution) need to be to ensure that the function I pick is almost as good as the best function in the class with high probability? This is known as PAC (probably approximately correct) learning.
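
For a concrete flavor of that question, the classic bound for a finite hypothesis class in the realizable case is m >= (1/eps) * (ln|H| + ln(1/delta)). A quick sketch of plugging in numbers (the epsilon, delta, and class sizes below are just illustrative):

  import math

  def pac_sample_size(hypothesis_count, eps, delta):
      # Finite-class, realizable PAC bound: with probability >= 1 - delta,
      # any hypothesis consistent with this many samples has true error <= eps.
      return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / eps)

  # A more expressive class (more hypotheses) needs more data for the same guarantee.
  for h in (10**3, 10**6, 10**9):
      print(h, pac_sample_size(h, eps=0.05, delta=0.01))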

For many non-deep machine learning models like support vector machines (SVMs), "find the function in my class that fits the data best" can be posed as a convex optimization problem. So the magical algorithm actually exists. In this setting, the PAC analysis is the natural thing to study. (Typically, we find that you need more samples if your function class is more expressive, which agrees with the observation that deep learning requires huge data sets.)

In deep learning, the magical algorithm doesn't exist. The optimization problem is nonconvex, but everyone uses stochastic gradient descent (SGD), an algorithm that is only guaranteed to find the global optimum on convex problems. Theory suggests that SGD will often converge on a local optimum that is significantly worse than the global optimum. However, in practice this doesn't happen much! If the network is big enough and all the algorithm hyperparameters are tuned well, and you run the deep learning algorithm with different random seeds, the result will be about equally good every time.

ML theory people working in deep learning tend to focus on this phenomenon: why does SGD usually find good local optima? This is totally different from the PAC analysis, and the analogy with computational complexity is less crisp.


Skip connections have very good motivation (see one of my other comments in this thread), and attention is decently well motivated, especially as an improvement in the translation space where it was first introduced. I don't think there's any formal proof that attention >> convolutions with a wide receptive field.

It would be fantastic to have better measures of problem complexity. My thinking at this point is that huge parameter size makes it easier to search for a solution, but once we've found one, there should be interesting ways to simplify the function you've found. Recall above: there are many equivalent models with the same loss when the learning slows down... Some of these equivalent models have lots of zeros in them. We find that often you can prune 90% of the weights and still have a perfectly good model. Eventually you hit a wall, where it gets hard to prune more without large drops in model quality; this /might/ correspond to the actual problem complexity somehow, but the pruned model you happened to find may not actually be the best, and there may be better methods we haven't discovered yet.
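
A minimal sketch of that kind of magnitude pruning in PyTorch (the 90% fraction and the tiny model here are just placeholders; in practice you'd fine-tune after pruning):

  import torch
  import torch.nn as nn

  model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

  def prune_by_magnitude(model, fraction=0.9):
      # Zero out the smallest-magnitude weights globally across all weight matrices.
      weights = torch.cat([p.detach().abs().flatten()
                           for p in model.parameters() if p.dim() > 1])
      threshold = torch.quantile(weights, fraction)
      with torch.no_grad():
          for p in model.parameters():
              if p.dim() > 1:
                  p.mul_((p.abs() > threshold).float())

  prune_by_magnitude(model, fraction=0.9)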


> why are networks with skip connections SO much better than networks without?

What are the leading theories for why this seems to be the case? Fewer nodes to capture and direct decisions?


Oh there's plenty of good explanation in the neural network literature (my eli5: the skip connections make the default mapping an identity instead of a zero mapping; you can start by doing no harm, and improve from there). The method was suggested by knowledge from differential equations. All I'm saying is that the "everything is secretly an SVM" viewpoint is probably too coarse to explain these interesting and effective structural differences.
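
Concretely, that eli5 is just the difference between y = f(x) and y = x + f(x). A rough PyTorch sketch of a residual block (details like normalization placement vary by paper):

  import torch.nn as nn

  class ResidualBlock(nn.Module):
      def __init__(self, channels):
          super().__init__()
          self.body = nn.Sequential(
              nn.Conv2d(channels, channels, 3, padding=1),
              nn.BatchNorm2d(channels),
              nn.ReLU(),
              nn.Conv2d(channels, channels, 3, padding=1),
              nn.BatchNorm2d(channels),
          )
          self.relu = nn.ReLU()

      def forward(self, x):
          # If self.body learns to output zeros, the block is an identity map:
          # start by doing no harm, and improve from there.
          return self.relu(x + self.body(x))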



SVMs are solved via convex optimization methods which have taken more time to get on the GPU train.

On the other hand, there are GPU-accelerated SVM implementations such as: https://github.com/Xtra-Computing/thundersvm

A GitHub or Google search will reveal other GPU-accelerated SVM implementations.
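
If I remember the API right, thundersvm exposes a scikit-learn-style interface, so swapping it in looks roughly like this (check the repo's docs for the exact constructor arguments; the dataset here is just a synthetic placeholder):

  from sklearn.datasets import make_classification
  from thundersvm import SVC   # GPU-accelerated, scikit-learn-style interface

  X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
  clf = SVC(kernel="rbf", C=1.0)   # training runs on the GPU
  clf.fit(X, y)
  print(clf.predict(X[:10]))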


SVMs can also be trained using gradient descent.
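
Right - a linear SVM is just hinge loss plus L2 regularization, so something like scikit-learn's SGDClassifier gives you exactly that (a quick sketch on synthetic data):

  from sklearn.datasets import make_classification
  from sklearn.linear_model import SGDClassifier

  X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

  # hinge loss + L2 penalty == a linear SVM, fit by stochastic gradient descent
  clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4, max_iter=1000)
  clf.fit(X, y)
  print(clf.score(X, y))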


  Location: Miami, Florida
  Remote: Yes
  Willing to relocate: Yes
  Technologies: Python, Node, Pytorch, Tensorflow
  Résumé/CV: https://stevenbas.art/resources/resume/
  Email: xksteven@uchicago.edu
  Software Engineer
  I have been working on machine learning problems in computer vision and natural language processing for my PhD over the past 5 years.

