No, actually that's not the case. But I don't know how Pegasos can be adapted to use kernels. If you look at Figure 1 of the paper [1] you will see that the gradient of the objective function is used to update a single weight vector `w` at each step of the projected stochastic gradient descent. In a kernel model, the support vectors cannot all be collapsed into a single weight vector `w`; you would need to handle the kernel expansion against the support vectors explicitly. But then how do you select the support vectors out of all the samples in the dataset while keeping the algorithm online? The Pegasos paper does not mention it.
The set of support vectors is just the set of training examples that have non-zero alpha parameters. To implement the gradient update you just evaluate the support vector machine on the example (using the explicit kernel expansion), and then, if the example has a signed margin less than 1, you add y * eta to the corresponding alpha value.
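To make that concrete, here is a minimal sketch of that kernelized update in Python, assuming a single pass over a stream of (x, y) pairs with y in {-1, +1}, an RBF kernel chosen purely for illustration, and the standard Pegasos step size eta_t = 1/(lambda * t). The function names are mine, not from the paper, and I've included the regularization shrinkage of all the alphas, which is what makes them decay as described below:

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.1):
    # Any positive-definite kernel works; RBF is just an illustration.
    diff = np.asarray(x) - np.asarray(z)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def kernel_pegasos(stream, lam=0.01, kernel=rbf_kernel):
    """One pass of a kernelized Pegasos-style update over a stream of
    (x, y) pairs with y in {-1, +1}. Returns the support set (examples
    with non-zero alpha) and their alpha coefficients."""
    support_x, support_y, alpha = [], [], []
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / (lam * t)  # standard Pegasos step size
        # Evaluate the current model on x via the explicit kernel expansion.
        score = sum(a * sy * kernel(sx, x)
                    for a, sx, sy in zip(alpha, support_x, support_y))
        # Regularization shrinks every coefficient a little at each step,
        # which is why the alphas decay toward zero but are never clipped.
        alpha = [(1.0 - eta * lam) * a for a in alpha]
        if y * score < 1.0:
            # Margin violation: the example joins the support set
            # with coefficient y * eta (the update described above).
            support_x.append(x)
            support_y.append(y)
            alpha.append(eta * y)
    return support_x, support_y, alpha
```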
The difficulty with Pegasos for non-linear kernels is that the support set quickly becomes very large, so evaluating the model becomes very slow. Note that since the alpha values are not constrained to be non-negative (unlike in the standard dual algorithms), the alpha values never get clipped to zero; instead they just slowly converge to zero. It's still (I think) one of the fastest methods in terms of theoretical convergence guarantees, but perhaps not as fast as LaSVM or something similar in practice.
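To see why they decay rather than vanish (a back-of-the-envelope argument, assuming the standard Pegasos step size eta_t = 1/(lambda * t)): every alpha is multiplied by (1 - eta_t * lambda) = (1 - 1/t) at each step, and the product of (1 - 1/t) over steps t0+1, ..., T telescopes to t0/T. So a coefficient created at step t0 has been scaled down by a factor of t0/T by step T; it shrinks like 1/T but is never set exactly to zero.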
However, there's been a more general trend in machine learning to use linear models with lots of features instead of kernel models, partially because of these sorts of scalability issues.
Thanks for the reply. I was told on Twitter that this is similar to the kernel perceptron, which I don't know well either. There is a good introduction with a Python code snippet here:
However it seems that you need to compute the kernel expansion on the full set of samples (or maybe just the accumulated past samples?): this does not sound very online to me...
It's true that you need to compute the kernel dot product between every example you see and every example in the support set (every example that ever previously evaluated to a signed margin < 1). Whether it's online depends on your definition of "online". It's definitely not online in the sense of using memory independent of the number of examples, since you have to keep the support set around. I think there are results showing the support set grows linearly with the size of the training set under reasonable assumptions. However, it is online in the sense that it operates on a stream of data, computing predictions and updates for each example one by one. It's also online in the sense that its analysis is based on online learning theory (e.g. mistake/regret bounds). A lot of learning theory papers use "online" in the latter two senses, which is confusing if you expect the former.
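For comparison, here's a minimal sketch of the kernel perceptron mentioned above (not the snippet from the linked introduction; the linear kernel is just a placeholder). It has the same structure: the support set is just the accumulated past mistakes, and every prediction costs one kernel evaluation per stored example.

```python
import numpy as np

def linear_kernel(x, z):
    # Placeholder; swap in any positive-definite kernel.
    return float(np.dot(x, z))

def kernel_perceptron(stream, kernel=linear_kernel):
    """One pass of the kernel perceptron over a stream of (x, y) pairs
    with y in {-1, +1}. The support set is exactly the set of past
    mistakes."""
    support_x, support_y = [], []
    for x, y in stream:
        score = sum(sy * kernel(sx, x)
                    for sx, sy in zip(support_x, support_y))
        prediction = 1 if score >= 0.0 else -1
        if prediction != y:
            # Mistake: remember this example (its alpha is implicitly 1).
            support_x.append(x)
            support_y.append(y)
    return support_x, support_y
```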