Gradient Boosting Explained in 3D (arogozhnikov.github.io)
105 points by sshb on July 27, 2016 | 15 comments


I hate to be a curmudgeon, but is machine learning really necessary for a wavelet decomposition?

Can anyone give an example where this enables you to do things that couldn't be done better with other techniques?


Could you explain how wavelet decompositions/transforms could be used to learn a predictive model? In other words: given a labeled dataset D = {x_i, y_i}, how would you learn a function F(x) = y, where x is the input data (pixels, credit scores, etc.) and y is the target label (object label, investment risk, etc.)?

I'm not very well-versed in wavelet methods. But in computer vision and image processing, I've seen people apply wavelet transforms to images, extract the wavelet coefficients, and use the coefficients as the image's feature representation. These coefficients would then typically be fed to a traditional machine learning classifier, e.g. nearest neighbor, SVM, etc.

In other words, I've seen wavelet transforms used as feature extractors. I haven't seen wavelet transforms used to actually learn the predictive model F(x).
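
(If it helps to make that concrete, here's a rough, purely illustrative sketch of that pipeline - toy random "images", PyWavelets for the decomposition, scikit-learn's SVM as the downstream classifier; nothing here is tuned for real data:)

    # Illustrative only: wavelet coefficients as features, fed to an SVM.
    import numpy as np
    import pywt
    from sklearn.svm import SVC

    def wavelet_features(image, wavelet="haar", level=2):
        # 2D wavelet decomposition; flatten every coefficient array into one vector.
        coeffs = pywt.wavedec2(image, wavelet, level=level)
        arr, _ = pywt.coeffs_to_array(coeffs)
        return arr.ravel()

    # Toy data: 100 random 32x32 "images" with random binary labels.
    rng = np.random.default_rng(0)
    images = rng.normal(size=(100, 32, 32))
    labels = rng.integers(0, 2, size=100)

    X = np.stack([wavelet_features(img) for img in images])
    clf = SVC(kernel="rbf").fit(X, labels)  # the classifier, not the wavelet, learns F(x)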

Gradient boosting, on the other hand, is learning the predictive model F(x).

Said in another way: gradient boosting is learning F(x) = y.

Wavelet transforms learn g(x) = x^{hat}, such that F(g(x)) = y is "easier" to learn.

I hope I'm explaining things clearly - sorry in advance if I made any mistakes, particularly in my understanding of wavelet transforms/decompositions.


Can wavelet decompositions be trained for classification of feature vectors with many dimensions?


Yes, that is what they are for. In this case you probably don't even really need wavelets; you could just use a Fourier transform.


A Fourier transform is not a classification mechanism. You can transform something into the frequency domain, but that doesn't tell you which frequencies are most relevant for classifying samples.


Note that the chart appears not to work on iOS, as Plot.ly erroneously shows a "WebGL not supported" message (which is a bug on their end, as the library's official website shows the same issue). Issue on GitHub: https://github.com/plotly/plotly.js/issues/280

That's a shame, because their API is pretty good, as this demo illustrates.


+1'd the issue. I was sure it was a problem on Safari's side.


Plot.ly looks very nice! (on Linux, at least)


I really need to play with the 3d canvas context API...



Thanks!


I haven't read much on gradient boosting, so questions:

1. Where is the gradient? This explanation makes it sound like a straight Generalized Additive Model.

2. In fact, the explanation makes it sound worse than random forests. Wouldn't it quickly overfit? Where does the boosting come into play?


The explanation glosses over a few important details. Gradient boosting works by adding some small weight to the instances the model is incorrectly predicting. The amount of extra weight those instances get is a parameter that is tuned with validation - and because this parameter can be 0, if you are doing correct cross-validation, gradient-boosted trees are usually superior to random forests. You also need to tune the number of trees you use in gradient boosting, or else you will overfit.
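
(A minimal sketch of what that tuning looks like, using scikit-learn's implementation - the grid values below are arbitrary examples, not recommendations:)

    # Tune shrinkage and number of trees with cross-validation (toy example).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    param_grid = {
        "learning_rate": [0.01, 0.05, 0.1, 0.3],  # how strongly each new tree corrects
        "n_estimators": [50, 100, 200],           # too many trees -> overfitting
    }
    search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)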

Gradient boosting doesn't get nearly enough hype compared to things like neural nets. The significant majority of winning Kaggle solutions for non-image, non-text datasets use xgboost to do gradient boosting as part of an ensemble model. Furthermore, it is a really easy method to understand and use while still being state-of-the-art.
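
(For reference, typical xgboost usage looks roughly like this - the hyperparameters are placeholders, not tuned values:)

    # Sketch of xgboost on a tabular dataset (synthetic data, placeholder params).
    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=4)
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))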


These well-written slides explain where the "gradient" in "gradient boosting" comes from:

http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/sl...

The gist of it is: when you add a new decision tree that fits the residual error, the new tree is fitting the negative gradient of the loss function (i.e. the training error). Thus, adding the new decision tree to your existing ensemble takes a gradient-descent step that seeks to minimize the loss.

Boosting comes in because the model combines several weak learners/models (individual trees) into a strong learner (the ensemble of trees). Each individual tree breaks the input space into piecewise-constant regions that best approximate the target function. This representation will incur some error - so a new tree is fit to minimize the remaining error over the entire input space, again by breaking it into piecewise-constant regions, etc.

So it's not boosting in the traditional AdaBoost sense, where the final model is a linear combination of "dumb" classifiers. Instead, I'd liken it more to a cascade method: each tree T_{n} seeks to fix the errors of the previous tree T_{n-1}: https://en.wikipedia.org/wiki/Cascading_classifiers
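
(To make the "fit each new tree to the residual" step concrete, here's a toy from-scratch sketch with squared-error loss, where the negative gradient is simply y - F(x); the data and constants are arbitrary:)

    # Toy gradient boosting: each tree fits the current residuals (negative gradient).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    learning_rate = 0.1
    F = np.zeros_like(y)              # start from the zero model
    for _ in range(100):
        residual = y - F              # negative gradient of 0.5*(y - F)^2 w.r.t. F
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        F += learning_rate * tree.predict(X)  # a gradient-descent step in function space

    print("training MSE:", np.mean((y - F) ** 2))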

There's actually a cool facial landmark detector that uses this same cascading idea to train an extremely fast (and quite accurate) system. In essence, they use a cascade of random forests (in a gradient-boosting framework) to detect landmarks. The dlib library has a great implementation, along with a pretrained model. I've used it in my research, and while not perfect, have been satisfied with its results: http://blog.dlib.net/2014/08/real-time-face-pose-estimation....

http://www.cv-foundation.org/openaccess/content_cvpr_2014/pa...


Those slides were very helpful, thank you.



