
If you use the right sort of kernel for an SVM, it becomes a neural network with automatic architecture derivation.

See slide 7:

http://www.cs.rpi.edu/~magdon/courses/LFD-Slides/SlidesLect2...
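To make that concrete, here is a rough sketch (mine, not from the slides; scikit-learn, the make_moons dataset, and the kernel parameters gamma=0.5, coef0=1.0 are all arbitrary choices for illustration). With a sigmoid kernel, the fitted SVM's decision function is a tanh hidden layer whose units are the support vectors, followed by a linear output layer, so the hidden width falls out of training rather than being picked by hand:

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

    # Sigmoid (tanh) kernel: K(a, b) = tanh(gamma * <a, b> + coef0)
    svm = SVC(kernel="sigmoid", gamma=0.5, coef0=1.0).fit(X, y)

    # The fitted SVM is literally a 2-layer network:
    #   hidden layer: one tanh unit per support vector (weights = support vectors)
    #   output layer: weights = dual coefficients, bias = intercept
    hidden = np.tanh(0.5 * X @ svm.support_vectors_.T + 1.0)
    manual_decision = hidden @ svm.dual_coef_.ravel() + svm.intercept_

    print("hidden width chosen by training:", svm.support_vectors_.shape[0])
    print("matches SVC.decision_function:",
          np.allclose(manual_decision, svm.decision_function(X)))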




Significantly, it becomes a simple, 2-layer neural network. The power of the advances in neural networks over the past decade has largely come from "deep" architectures with many layers. Very deep networks effectively learn the features from the data, rather than learning a decision surface over a set of hand-crafted features, as in learning with SVMs or shallow neural networks.


I thought it had been proven that a two-layer neural network has the same power as a deep one (obviously with a much greater width). It's just that deep neural networks are a lot more practical to train. So I'm not sure how important that distinction is.


This is something of an academic factoid that has nothing to do with the practice of training and using neural networks, or with the merits of deep networks that I was describing above.

Shallow feed-forward networks are "universal function approximators" [0] when the number of hidden neurons is finite but unbounded. Of course, the width of that hidden layer can grow exponentially in the depth of the deep network that you might wish to approximate [1].
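For reference, the statement being invoked in [0], in one common form (notation mine): for any continuous f on a compact set K, any non-polynomial activation sigma, and any epsilon > 0, there exist a finite width N and parameters a_i, w_i, b_i with

    \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} a_i \, \sigma(w_i^\top x + b_i) \Big| < \varepsilon

Nothing in the theorem bounds N, and nothing says gradient descent will actually find those parameters; that is exactly the width blow-up and the learnability gap discussed further down the thread.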

The statement that "[i]t's just that deep neural networks are a lot more practical to train" (emphasis mine) sounds somewhat reductive; it's not only that depth is a nice trick or hack for training speed, but that depth makes the success of deep networks over the past decade possible at all. We live in a world with bounded computing resources and bounded training data. You cannot subsume all deep networks into shallow networks, nor shallow networks into SVMs, in the real world. So I am pretty sure of how important that distinction is.

And what's more, depth extracts a hierarchy of interpretable features at multiple scales [2], and a decision surface embedded within that feature space, rather than a brittle decision surface in an extremely high-dimensional space with little semantic meaning. One of these approaches generalizes better than the other to unseen data.

[0] https://en.wikipedia.org/wiki/Universal_approximation_theore...
[1] https://pdfs.semanticscholar.org/f594/f693903e1507c33670b896...
[2] https://distill.pub/2017/feature-visualization/


An important addition to this is priors: deep networks let you express the prior that hierarchical representation, i.e. composing features into multiple layers of abstraction, makes sense (see e.g. conv nets).
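A minimal sketch of that prior (PyTorch and the specific layer sizes are my own assumptions, not from the comment): each conv stage only sees the previous stage's feature maps, so later layers are structurally forced to be compositions of earlier, more local features.

    import torch
    import torch.nn as nn

    # Each stage only sees the previous stage's feature maps, so whatever
    # the later layers learn is necessarily a composition of earlier
    # features -- that restriction is the architectural prior.
    hierarchical = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local edges / colour blobs
        nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, padding=1),  # combinations of those edges
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(32, 10),
    )

    x = torch.randn(1, 3, 32, 32)   # dummy image batch
    print(hierarchical(x).shape)    # torch.Size([1, 10])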


If an SVM kernel can replicate a 2-layer NN, why couldn't there be a kernel for an X-layer NN, and then autoderive the architecture just like SVMs can autoderive the correct number of neurons? Then there'd also be a more robust theoretical understanding of what's happening.


See my other point: there might be; in fact, there definitely is for any working NN, but as of now (2019, happy new year) we probably can't find it.


An infinitely wide 2-layer NN is universal in the same way a Turing machine is universal: sure, you can write any program; God help you if you try.


If I remember my Goodfellow correctly (and, quickly checking Wikipedia, I did: https://en.wikipedia.org/wiki/Universal_approximation_theore... ), there is a nuance here which is almost always missed: you can represent any function with a sufficiently wide 2-layer neural network, but the theorem says nothing about being able to tune the network until you find a correct setting (i.e. learnability).

This is important. Flippantly said, discarding learnability and speed of convergence, you can get the power of any neural network by the following algorithm (a toy sketch follows the list):

1. Randomly generate a sufficiently long bit pattern.
2. Interpret it as a program and run it on the test set.
3. Discard the result and repeat until the desired accuracy is reached.
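As a toy illustration of that (entirely impractical) procedure, here is a sketch that swaps "random bit pattern interpreted as a program" for random weights of a wide 2-layer net, and uses XOR as the test set (both substitutions are mine, just to make it runnable). It reaches the target accuracy by pure rejection sampling, with no learning at all:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy test set: XOR, which a wide-enough 2-layer net can represent.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    WIDTH = 64          # "sufficiently wide" hidden layer
    TARGET_ACC = 1.0    # desired accuracy on the test set

    def forward(params, X):
        W1, b1, W2, b2 = params
        h = np.tanh(X @ W1 + b1)              # hidden layer
        return (h @ W2 + b2 > 0).astype(int)  # output layer -> class label

    tries = 0
    while True:
        tries += 1
        # Step 1: "randomly generate a bit pattern" -> here, random weights.
        params = (rng.normal(size=(2, WIDTH)), rng.normal(size=WIDTH),
                  rng.normal(size=WIDTH), rng.normal())
        # Step 2: run it on the test set.
        acc = (forward(params, X) == y).mean()
        # Step 3: discard until the desired accuracy is reached.
        if acc >= TARGET_ACC:
            break

    print(f"hit {acc:.0%} accuracy after {tries} random draws")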


Fwiw the "deep learning" advances in NLP have typically still been from shallow networks, almost always less than 10 layers and usually more like 2.



