Back around 2008, SVMs were all the rage in computer vision. We would use hand-designed visual features with a linear SVM on top. That was how object detectors were built (remember DPM?).
Funny how SVMs are just max-margin loss functions and we just took for granted that you needed domain expertise to craft features like HOG/SIFT by hand.
By 2018, we use ConvNets to learn BOTH the features and the classifier. In fact, it’s hard to separate where the features end and the classifier begins (in a modern CNN).
So the pitch is that you don't have to do feature engineering... but instead it seems people do network-architecture engineering with feature-ish things like convolutions.
The performance is still better in most cases, but I often have to wonder: are people just doing feature engineering once removed, and is the better performance just the result of having WAY more parameters in the model?
> are people just doing feature engineering once removed and is the better performance just the result of having WAY more parameters in the model?
Not really, or sort of, depending on how you think.
A deep neural network does work - at least to some extent - because of the large number of parameters. However, it is practical because it can be trained in a reasonable amount of time.
Things like ResNets are useful because they allow us to train deeper networks.
You can create an SVM with the same number of parameters[1], and in theory it could be as accurate (this is basically the no free lunch theorem[2]). But you won't be able to train it to the same accuracy.
[1] Of course there are practical concerns about what you do for features, since hand-crafted features just aren't as good as neural network ones. One thing people do now is use the lower layers of a deep neural network as a feature extractor and then put an SVM on top of them as the classifier. This works quite well, and is reasonably fast to train.
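To make that concrete, here's a rough sketch of the "deep features + linear SVM" recipe (assuming PyTorch/torchvision and scikit-learn; the backbone choice, cut point, and variable names are my own illustration, not anything from the comment above):

```python
# Use a pretrained CNN as a frozen feature extractor, then train a linear SVM
# on the pooled features. Backbone and layer cut are illustrative choices.
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

cnn = models.resnet18(weights="DEFAULT")   # pretrained ImageNet weights
cnn.eval()
# Drop the final fully connected layer; keep everything up to global pooling.
extractor = torch.nn.Sequential(*list(cnn.children())[:-1])

def features(batch):
    # batch: (N, 3, 224, 224) tensor of normalized images
    with torch.no_grad():
        feats = extractor(batch)           # shape (N, 512, 1, 1) for resnet18
    return feats.flatten(1).numpy()        # shape (N, 512)

# X_train / y_train are placeholders for your own data:
# clf = LinearSVC(C=1.0).fit(features(X_train), y_train)
# preds = clf.predict(features(X_test))
```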
> But you won't be able to train it to the same accuracy.
I'm not sure I agree with this bit in theory. A neural network is a stack of basis functions, and that stack can itself be seen as a bunch of basis functions. And basis functions are what kernels represent. Trivially, you could then "copy" the weights that an ANN would learn into a kernel and obtain the same accuracy.
The reason this doesn't work in practice is that, with SVMs, you tend not to learn kernels from scratch but use (possibly a combination of) standard parameterized kernels [1]. The learning step in the SVM adapts this standard kernel to your dataset as much as the parameters allow, but this is sub-optimal compared to learning a kernel (or the corresponding basis functions) from scratch, built just for your data. With a well-trained ANN, the latter is what you get.
[1] There has been a fair amount of work on learning kernels too, but it's not as mainstream as using standard kernels.
> Trivially, you could then "copy" the weights that an ANN would learn into a kernel and obtain the same accuracy.
Sure.
> I'm not sure I agree with this bit in theory.
No one really agrees with it in theory - I'm not aware of a good theoretical explanation as to why some deep networks are easier to train. And yet there is a growing body of real, general practical heuristics that work pretty reliably.
This is pretty exciting! There are undiscovered ideas here. But it is unsatisfying from a theoretical standpoint at the moment.
Significantly, it becomes a simple, 2-layer neural network. The advances in neural networks over the past decade have largely relied on "deep" architectures with many layers. Very deep networks effectively learn the features from the data, rather than learning a decision surface over a set of hand-crafted features, as in learning with SVMs or shallow neural networks.
I thought it had been proven that a two-layer neural network has the same power as a deep one (obviously with a much greater width). It's just that deep neural networks are a lot more practical to train. So I'm not sure how important that distinction is.
This is something of an academic factoid that has nothing to do with the practice of training and using neural networks, or with the merits of deep networks that I was describing above.
Shallow feed-forward networks are "universal function approximators" [0] when the number of hidden neurons is finite but unbounded. Of course, the width of that layer can grow exponentially in the depth of the deep network that you might wish to approximate [1].
The statement that "[i]t's just that deep neural networks are a lot more practical to train" (emphasis mine) sounds somewhat reductive; it's not only that depth is a nice trick or hack for training speed, but that depth makes the success of deep networks in the past decade at all possible. We live in a world with bounded computing resources and bounded training data. You cannot subsume all deep networks into shallow networks, and shallow networks into SVMs in the real world. So I am pretty sure of how important that distinction is.
And what's more, depth extracts a hierarchy of interpretable features at multiple scales [2], and a decision surface embedded within that feature space, rather than a brittle decision surface in an extremely high-dimensional space with little semantic meaning. One of these approaches generalizes better than the other to unseen data.
An important addition to this is priors: deep networks let you express the prior that hierarchical representation, i.e. composing multiple layers of abstraction, makes sense (see e.g. conv nets).
If an SVM kernel can replicate a 2-layer NN, why couldn't there be a kernel for an X-layer NN, and then auto-derive the architecture just like SVMs can auto-derive the correct number of neurons? Then there'd also be a more robust theoretical understanding of what's happening.
If I remember my Goodfellow correctly (and, quickly checking Wikipedia, I did: https://en.wikipedia.org/wiki/Universal_approximation_theore... ), there is a nuance here which is almost always missed: you can represent any function with a sufficiently wide 2-layer neural network, but that doesn't say anything about being able to tune the network until you find a correct setting (i.e. learnability).
This is important. Flippantly said, discarding learnability and speed of convergence, you can get the power of any neural network with the following algorithm:
1. Randomly generate a sufficiently wide bit pattern.
2. Interpret it as a program and run it on the test set.
3. Discard results until the desired accuracy is reached.
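To make the representability-vs-learnability point concrete, here's a toy sketch (numpy assumed; everything below is my own illustration): a wide one-hidden-layer network with completely random hidden weights can fit a smooth 1-D function if you only solve for the output layer by least squares. Width buys representational power; it says nothing about how hard end-to-end training would be.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)[:, None]      # 1-D inputs on a grid
y = np.sin(2 * x).ravel()                 # target function sampled on the grid

width = 500
W = rng.normal(size=(1, width)) * 2.0     # random (untrained) hidden weights
b = rng.uniform(-3, 3, size=width)        # random hidden biases
H = np.maximum(0.0, x @ W + b)            # ReLU hidden activations, shape (200, width)

# Fit only the linear output layer by least squares.
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
print("max abs fit error on the grid:", np.max(np.abs(H @ w_out - y)))
```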
I ask because when we're training humans to understand things, there are a variety of benefits to separating feature understanding from the classifiers. In particular, you get gains in flexibility, extendability, and debuggability.
I get why people are happy to take the ConvNet gains and run with them for now. But have you seen any interesting work to get the benefits of separation in the new paradigm? (Or, alternately, is there a reason why those concerns are outmoded?)
That's actually closer to how deep learning started. Initially, deep learning mostly consisted of unsupervised (task-independent) features with a linear classifier on top. We had to fit an unsupervised model (e.g. an autoencoder) layer by layer before using the feature layers in a supervised task.
This was because we didn't understand how to train a deep model end-to-end until later. When we learned how to make that end-to-end training work, it tended to perform better because the learned features were task-specific.
You can still learn general features in a bunch of ways, in addition to the older method using autoencoders. For one example, multiple supervised heads with auxiliary losses can learn more general features.
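A sketch of that auxiliary-heads idea (PyTorch assumed; the architecture, task sizes, and loss weight below are made up for illustration): a shared trunk feeds both the main head and an auxiliary head, and the total loss is a weighted sum, which nudges the trunk toward features that serve more than one task.

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    def __init__(self, in_dim=784, hidden=256, n_main=10, n_aux=5):
        super().__init__()
        # Shared trunk learns the features used by both heads.
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.main_head = nn.Linear(hidden, n_main)   # main task classifier
        self.aux_head = nn.Linear(hidden, n_aux)     # auxiliary task classifier

    def forward(self, x):
        h = self.trunk(x)
        return self.main_head(h), self.aux_head(h)

def combined_loss(main_logits, aux_logits, y_main, y_aux, aux_weight=0.3):
    # Weighted sum of the main and auxiliary cross-entropy losses.
    ce = nn.functional.cross_entropy
    return ce(main_logits, y_main) + aux_weight * ce(aux_logits, y_aux)
```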
To be fair there is a lot of domain knowledge embedded in the use of a convolutional architecture. There is a fascinating paper where the authors don't even train the weights of the convolutional layers and are still able to achieve good performance.
Great article! I did not understand the part with kernels before, so I appreciate the simplified explanation. The interactive demo on https://www.csie.ntu.edu.tw/~cjlin/libsvm/ is really cool.
I notice this doesn't mention hinge loss, which is by far the simpler way of arriving at the SVM. The hinge loss is just max(0, 1 - t*y), where y is the output of the linear model and t = ±1 is the label. Thus, it takes the common-sense approach of not penalizing examples that are classified far enough from the decision boundary, and penalizing linearly after that.
In primal form, an SVM is literally just a linear model with hinge loss instead of log loss (logistic regression) or squared loss (ordinary linear regression). For apparently historical reasons, it is usually derived from the "hard-margin" SVM in dual form, motivated by trying to maximize the margin. This is complicated and not very intuitive.
This also causes people to conflate the kernel trick and the dual form, while in fact they have nothing to do with each other. You can use the kernel trick in the primal SVM just fine.
Stochastic gradient descent can also be used for primal methods, while it doesn't work in the dual. That makes the primal much faster for large problems than the dual.
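For anyone who hasn't seen it written down, here's a minimal sketch of that primal approach (numpy only; labels assumed to be in {-1, +1}, hyperparameters arbitrary): stochastic subgradient descent on the L2-regularized hinge loss max(0, 1 - t*(w·x + b)).

```python
import numpy as np

def train_linear_svm(X, t, lam=1e-3, epochs=20, lr=0.01, seed=0):
    """Primal linear SVM via stochastic subgradient descent on the hinge loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = t[i] * (X[i] @ w + b)
            # Subgradient of lam/2 * ||w||^2 + max(0, 1 - margin) on this example.
            if margin < 1:
                w -= lr * (lam * w - t[i] * X[i])
                b -= lr * (-t[i])
            else:
                w -= lr * (lam * w)
    return w, b

# Usage: w, b = train_linear_svm(X, t); predict with np.sign(X @ w + b).
```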
The hinge loss and the primal form of the SVM objective are really easy to understand. Yet every ML 101 class jumps into the dual formulation, talks about kernels, RKHS, and all the fancy stuff.
Once you realize that a linear SVM isn’t very different from logistic regression, it starts to all make sense (at least it did for me).
Key insight of the hinge-loss: once something is classified correctly beyond the margin, it incurs a loss of zero.
Now, something fun to think about. Draw the hinge loss. Now draw the ReLU (which is found all over the place in CNNs). Now think about L1 regularization (which was used to induce sparsity in compressed sensing). They are more similar in form than you would think.
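If you'd rather compute than draw, a tiny sketch (numpy assumed, values arbitrary): the hinge loss is just a ReLU applied to 1 - z, where z = t*y is the signed margin, so the two curves are the same kinked ramp, flipped and shifted.

```python
import numpy as np

z = np.linspace(-2, 3, 6)            # z = t * y, the signed margin
relu = np.maximum(0.0, z)            # ReLU of the margin
hinge = np.maximum(0.0, 1.0 - z)     # hinge loss of the margin
for zi, r, h in zip(z, relu, hinge):
    print(f"z={zi:5.1f}  relu(z)={r:4.1f}  hinge(z)={h:4.1f}")
```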
It's funny how closely everything is connected. It turns out you can even derive the hinge loss using a mixture of normal distributions, so it is also connected closely to OLS.
Some people have had good luck with hinge or multi-hinge loss for neural networks instead of the almost universal log loss, since of course the hinge loss can be used in things other than linear models. It doesn't care how you get the y output.
Wasn't the reason everyone cared about SVMs specifically the nonlinear kernel stuff that would push performance a smidge and produce new bests on existing benchmark datasets?
The QP-dual formulation never seemed like something that could scale, and linear SVMs never seemed all that much better than just lasso/elasticnet regression. (hmmmmm :) )
The nonlinear kernel SVM works at least as well in the primal, using just the representer theorem. Since it is unconstrained, all you need to do is create a kernel matrix and solve your system with your favorite convex optimizer like Newton's method (which can also work in lower dimensions).
Second-order methods like Newton's method converge better to the exact solution than SGD, although they usually don't reach a "pretty good" solution as fast. Coordinate descent methods in the dual also get very close to the exact solution, but Newton's method and friends are usually faster. With (quasi-)Newton methods in the primal, everything just comes down to solving linear systems, which is a much more well-studied problem.
I've even experimented successfully with kernelized models with millions of examples in low dimensions using the Fast Gauss Transform. That's impossible in the dual.
You can also generate low-rank kernel approximations [0] using the Nystroem method or the Fastfood transform that can then be used in a linear SVM. For example, if I had a problem with n=10^6, I can make a low-rank approximation of the kernel matrix (say d=1000) and feed that into a fast SGD optimizer.
This often works really well, and is usually pretty close to the exact kernel solution if the problem is of lower intrinsic dimensionality, which is usually true if the dual SVM is sparse in its basis vectors. This largely negates the sparsity advantage of the dual SVM. If a kernel approximation isn't good, then the dual SVM wouldn't be meaningfully sparse anyway, so there is still no advantage to the dual. Best to just solve the kernelized system in the primal, and use a second-order optimizer if needed.
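A sketch of that low-rank route in scikit-learn (the dataset and hyperparameters here are placeholders, not recommendations): approximate the RBF kernel with the Nystroem method, then train a linear SVM with SGD on the transformed features.

```python
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Low-rank (d=1000) approximation of the RBF kernel, fed into an SGD-trained
# linear SVM (hinge loss). gamma/alpha/max_iter are placeholder values.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=1000, random_state=0),
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=20),
)
# model.fit(X_train, y_train); model.score(X_test, y_test)
```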
It's similar to L2-regularized logistic regression, not pure logistic regression. Although in your defense, pure logistic regression is not used that often. I find the large-margin view very intuitive -- one finds a separator such that the labeled data points are separated and lie as far as possible from the line.
Logistic regression is usually assumed to be regularized in machine learning. It's not often you find practical problems that don't need regularization.
Indeed. You would notice that I had said as much. For the analogy between SVMs and LR to work you need squared-L2 regularization specifically; other regularizers, for example L1, won't work for the analogy. Squared L2 is the key for the connection with Hilbert spaces and kernels to fall out automagically.
There's been some work on variational Bayesian formulations of SVMs in the last few years. These can give actual uncertainty estimates and do automatic hyperparameter tuning. This one in particular is very cool:
It's interesting how quickly support vector machines went from the hot new thing to classify images to an afterthought after deep learning started having great results.
Noticed that too. It feels like it was just a few years, and all of a sudden everything is "deep" now.
The same thing happened with data storage. As soon as big data appeared everyone stopped doing just data and started doing "big data". Now the term is kind of a joke even.
I predict in a few years "deep learning" term will become mostly used in an ironic sense as well.
> I predict in a few years "deep learning" term will become mostly used in an ironic sense as well.
I may be a bit behind the times, but I'm also mystified by "deep learning's" popularity. Both giant neural nets and kernel methods have overfitting problems: torture a billion-parameter model long enough, and it will tell you what you want to hear.
SVMs address this by finding a large margin for error, which will hopefully improve generalization. DNNs (I think) do this by throwing more ("big") data at the problem and hoping that the training set covers all possible inputs. Work on adversarial learning suggests that DNNs go completely off the rails when presented with anything slightly unexpected.
My other comment addresses some of this, but you're overstating things a bit. Throwing more data at the model is one solution. It's just not the only, or even best, approach. Properly measured performance on good holdouts and the application of regularization avoids the worst of overfitting. This is standard practice in most of machine learning, not just deep learning.
Deep learning gets a lot of hype because for many applications deep models perform better and scale better without a lot of the tricks and extensions that are now possible with SVMs. You can even use a large-margin loss with deep models to get some of the benefits of SVMs.
Adversarial examples are way overblown. First, SVMs are not immune to them either. Second, very few applications are threatened by things like adversarial examples.
Empirically, CNNs generalize better on image recognition tasks than hand-built features. This comment doesn't make much sense and is needlessly obtuse in the face of progress, tbh.
That outcome doesn't seem terribly likely. It's true that, like big data, deep learning is often misused. This is largely because it works well enough in the off-the-shelf case and it's "easier" due to tooling, transfer learning, and free educational materials for beginners. However, deep learning also obtains state-of-the-art in a number of tasks and domains when you know what you're doing.
I don't think your scenario is likely to occur unless something else starts outperforming deep learning (in the broadest sense) _and_ there's an approachable alternative to solve the same problems at least as well.
They were also all the rage in pretty much everything else as well. In the problems not pushed out by DNNs, gradient boosting has pretty much replaced SVMs, as GBMs train faster and get better accuracy.
In the PDF, it said that the optimization problem in SVMs has a nice property in that it is quadratic, which means that there's a nice global minimum to go toward, and not lots of local minima like in NNs. That means, it seems, SVMs won't get stuck at a suboptimal solution.
Is that not a problem in DNNs now? Or is it that it's such high dimensionality that local minima don't stop the optimizer, because there's always another way around the local minimum?
Still a problem, but not a problem at the same time. Downhill in the loss function is always better, so you may not be at the global minimum but you might be at a good enough spot anyway. Using SGD gives you a bit of local-minimum-hopping ability as well. It's an interesting question to think about what the loss surface looks like, as it is so high-dimensional that it might just be saddles everywhere. The difference between a local minimum and the global one might be practically nothing in terms of classifier performance.
The convexity of the loss function for SVMs is a mixed blessing: you are guaranteed to be at the global minimum, but the optimization problem doesn't scale with large training data or big feature vectors. Hence the historical feature engineering efforts, or sacrificing this property for a more scalable optimization method.
So in short, the ability to use high-dimensional features in an NN means you don't get any guarantee that the minimum you find will be the best, but you lose less by not having to cut down the size of the input vector by hand (i.e. working directly with an image rather than some embedding with keypoint descriptors, etc.).
As others have said, you don't actually want the global optimum of a neural network because that would be terrible overfitting. There is some evidence that architectural tricks (like ResNet) that empirically help performance are making the loss landscape "more convex", though.
SVMs are a convex optimization problem, which means any local optimum is also globally optimal. In practice that leads to fast, provably optimal solutions.
With deep learning, the problem is non-convex, so the results are locally optimal but not necessarily globally optimal. We get around this with different techniques such as randomized initial weights, but with few exceptions, non-convex problems are NP-hard. Therefore there is no polynomial-time algorithm for finding the global optimum in deep learning unless P==NP.
I occasionally hear people claim deep learning finds the globally optimal solution, but that’s just not correct (or they mean they are doing an exhaustive search e.g. branch and bound, which has exponential worst case run time and isn’t practical).
That said, locally optimal solutions are often “good enough” for practical problems.
It's a small mercy that one does not find the global minimum of DNNs. Given their number of parameters and no explicit regularization term, they would overfit viciously. The inability to reach the global minimum and the jiggling around imposed by stochastic gradient descent act as implicit regularization.
SVMs have convex objective functions, but when people use SVMs, they are using some kind of features + an SVM on top. The success of the approach comes from both good features and waiting long enough for the optimizer to converge.
With DNNs, people learn everything (features + decision boundary), and this problem is not convex. Surprisingly DNNs work quite well in practice even though we were taught to be afraid of non-convex problems in grad school around 2005.
If, back in the early 2000s, we had stopped worrying about theoretical issues and explored more approaches like ConvNets, we might have had the deep learning revolution 10 years earlier.
You usually do not even want a global optimum with a high-capacity model such as a DNN anyway (SVMs memorize only a small number of data points), because reaching it possibly means that you are overfitting to your training dataset.
People who know more about deep learning than I do tend to argue that there is empirical evidence that non-convexity is a non-issue because the performance of local minima will be close to the performance of the global minimum, given a sufficient number of nodes[1][2]. One such quote:
> I once ran a small neural net 100 times on simple three-dimensional data, reselecting the initial weights to be small and random on each run. I found 32 distinct minima, each of which gave a different picture, and having about equal test set error.
> – Leo Breiman
Note also that "optimal" in each case means "relative to other models in the same family." But a large neural net and an SVM do not exist in the same hypothesis space. As an analogy, recall that Ordinary Least Squares is the BLUE (Best Linear Unbiased Estimator.) But that's only relative to other unbiased (which in particular means no regularization: not ridge/lasso/elasticnet) linear models over the same feature set. Just because OLS provably gives the best performance in this family doesn't mean it is going to do well compared to models outside that family. In the same way, just because SVM offers a convergence guarantee doesn't mean its performance will always be the same or better than an NN.
The real problem with SVMs is that when you use a kernel (which is the whole point; linear SVMs are basically just logistic regression with hinge loss instead of log loss), you introduce one landmark (feature) for every data point. So if your original data set was a manageable 1e6x1e3, then the SVM will view this data set as 1e6x1e6. It doesn't actually have to store that in memory thanks to the kernel trick [3], but training time still scales as O(N^2), where N is the number of observations (rows) in the original data. In practice, even with a highly optimized library like LIBSVM [4], you're not going to get good performance past N=1e6. (I've personally had the most success with SVMs on small datasets of only a few thousand rows.) Meanwhile an NN can very easily accommodate much larger training sets, with training time only growing as O(N), more or less. You can always sample a manageable training set from a larger data set, but if you're only using 1% of your data to train a model, that's a problem.
Deep NN is also a far more modular approach: if you know your data has a 2D structure for example, you can add some CNN layers and get translational invariance. Deep NN comes with a large toolbox for specializing the model to the data, while the options for tuning an SVM are much more limited: you can change out the kernel function (not that anything is likely to beat RBF) and play around with the regularization parameter, and that's it. Once you've tuned those hyper-parameters with a small grid search there's not much else left to try with SVMs. If they work, they work, otherwise you have to change approaches.
I think you're exactly right about the modularity aspect of DL; in fact I made a similar comment on this page, albeit speaking in terms of basis functions.
I have a minor nitpick regarding this point you make: "not that anything is likely to beat RBF". Depending on the data, specialized kernels can help immensely. An easy example is sequence classification, where something like a string kernel might work really well. Or image classification, where histogram-based kernels might prove superior.
Note that sometimes you might want to measure how good a kernel is for a problem not by its prediction accuracy alone but also by the number of support vectors it needs - if the final model retains ~100% of the training data as support vectors, it is not a great model in some (subjective) sense since it is memorizing a lot. Depending on the data, you might "beat" the RBF kernel on this aspect too.
Regarding the training time, there are some interesting tricks I've come across (but have not tried out yet): [1], [2].
[2] SVMPath - an algorithm to fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf