
> All ML is density estimation using a parametric probability distribution.

Umm, is there a typo there? Otherwise it is so wrong that I cannot even begin to enumerate the flaws.

In fact, the key revolutionary idea in classification theory, courtesy of Vapnik, was that one needn't estimate the density at all.

I cannot over-emphasize the revolutionary part: prior to Vapnik, the over-arching consensus was that one should try to learn the density and then threshold it to obtain the classifier. The motivation for this is the result that the thresholded conditional probability is the optimal classifier for {0,1} loss.

What Vapnik showed was that it is better to learn the classifier directly (he called it structural risk minimization, SRM for short), without any parametric assumptions whatsoever. The main reasoning is that density estimation is freaking hard, and depending on a freaking hard task as a prerequisite is backwards. Learning to classify is actually easier than learning densities. It was one of those paradigm shifts that come to a field only once in a while.
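A minimal sketch of the contrast, assuming scikit-learn is available (the dataset and both models are illustrative choices, not anything from the thread): a plug-in classifier that fits class-conditional densities and thresholds the posterior, next to a discriminative one that learns the boundary directly.

    # Plug-in vs. direct classification -- illustrative sketch only.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB   # two-stage: fit class densities, threshold posterior
    from sklearn.svm import LinearSVC            # direct: learn the decision boundary itself

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Plug-in route: model p(x|y) with Gaussians, classify via the thresholded posterior.
    plug_in = GaussianNB().fit(X_tr, y_tr)

    # Direct route: minimize a surrogate of the classification risk (the SRM spirit),
    # making no attempt to model the data density at all.
    direct = LinearSVC(C=1.0).fit(X_tr, y_tr)

    print("plug-in accuracy:", plug_in.score(X_te, y_te))
    print("direct accuracy: ", direct.score(X_te, y_te))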

Even if we let SRM be, traditional statistics has had methods as old as time that do not assume any parametric model. Those into non-parametric statistics (I am not one) would be very upset with your claim :) The problem with parametric models is that you almost never know the parametric family. You can of course test, but how many would you test? There are an infinite number of such families. Non-parametric methods don't make parametric assumptions (they do make some, considerably weaker, assumptions; assumptions are necessary for learning). The drawback is that they are more expensive. Conversely, if you happen to know the correct form of the family, a parametric model would be hard to beat.
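To make the family-mismatch point concrete, a small sketch (assuming numpy and scipy; the bimodal data is a made-up example): a single Gaussian fit to data outside its family stays wrong no matter how much data arrives, while a kernel density estimate adapts, at a higher computational cost.

    # Parametric vs. non-parametric density estimation -- illustrative sketch.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Bimodal data: not inside the single-Gaussian family.
    data = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

    # Parametric: fit mean and variance of one Gaussian (the wrong family here).
    parametric = stats.norm(data.mean(), data.std())

    # Non-parametric: kernel density estimate, no fixed finite parametrization.
    kde = stats.gaussian_kde(data)

    x = 0.0  # the trough between the two modes, where the true density is low
    print("single-Gaussian fit says:", parametric.pdf(x))  # badly overestimates
    print("KDE says:              ", kde(x)[0])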



@srean -- I agree with you. To make the GP's statement true, you have to adopt sweepingly wide definitions of the terms "parametric" and "density estimation", and a very narrow concept of ML to boot. You have given some good background, and I feel the need to add more examples.

The idea of boosting comes to mind. A quintessential ML concept, widely applicable, but really nothing to do with density estimation.
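For readers who haven't seen it, a minimal AdaBoost with decision stumps, written from the standard textbook description (numpy only; all names here are illustrative). Note that nothing in it estimates a density: it just reweights mistakes and takes a weighted vote.

    # Minimal AdaBoost with decision stumps -- a sketch of the boosting idea.
    import numpy as np

    def fit_stump(X, y, w):
        """Best single-feature threshold classifier under weights w; y in {-1,+1}."""
        best = (None, None, None, np.inf)  # (feature, threshold, sign, weighted error)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] > t, 1, -1)
                    err = w[pred != y].sum()
                    if err < best[3]:
                        best = (j, t, s, err)
        return best

    def adaboost(X, y, rounds=20):
        n = len(y)
        w = np.full(n, 1.0 / n)
        ensemble = []
        for _ in range(rounds):
            j, t, s, err = fit_stump(X, y, w)
            err = max(err, 1e-12)
            alpha = 0.5 * np.log((1 - err) / err)   # this weak learner's vote weight
            pred = s * np.where(X[:, j] > t, 1, -1)
            w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
            w /= w.sum()
            ensemble.append((alpha, j, t, s))
        return ensemble

    def predict(ensemble, X):
        score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
        return np.sign(score)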

All the PAC learning results are another example. They prove very strong guarantees over very wide problem classes, yet have nothing to do with density estimation. They apply, for example, to problems where no density exists (just a probability measure), much less a smooth-enough density to estimate.
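For the flavor of such a result, here is the standard finite-class bound in the realizable case (a textbook statement, not something specific to this thread): if a learner outputs any hypothesis \hat h \in H consistent with m i.i.d. examples, then

    m \;\ge\; \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)
    \quad\Longrightarrow\quad
    \Pr\big[\mathrm{err}(\hat h) > \epsilon\big] \;\le\; \delta .

The guarantee holds for every distribution over the examples; no density, let alone a smooth one, appears anywhere.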

Another example of ML research that comes to mind is time- and space-efficient learning. There is a whole body of work on probabilistic analysis of data structures used for ML (I'm thinking partly of Alex Gray's work, http://www.cc.gatech.edu/~agray/). The work fits solidly into ML, but it's not about density estimation; it's about making provably fast algorithms for large problems.
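As a taste of that line of work, a sketch of fast nearest-neighbor queries with a k-d tree (assuming scipy is available; Gray's dual-tree algorithms go considerably further than this):

    # Fast nearest-neighbor queries via a k-d tree -- illustrative sketch.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    points = rng.standard_normal((100_000, 3))

    tree = cKDTree(points)                               # O(n log n) build
    dist, idx = tree.query(rng.standard_normal(3), k=5)  # ~O(log n) per query
    print(idx, dist)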


SVMs induce an implicit probability distribution over labelled samples of points, whether you like it or not.

Yes, "parametric" is often meant to imply a finite parametrization, but that's a dumb terminology.


For conversation, at least, it helps to stick with the agreed-upon definitions; for example, "parametric" means a finite parametrization. Otherwise the discussion is moot: if you use too-broad definitions, then "everything is parametric" becomes a tautology, a vacuous statement.

PAC and MDL aren't as much at odds as you make them out to be. If you are familiar with the PAC-Bayes and PAC-MDL theorems, you will know what I mean. My main issue was with the ridiculous (or, under your definition of what's 'parametric', vapid) claim that all of ML is parametric.

Secondly, I wanted to highlight that estimating the data distribution first and then finding the decision function is a bad way to go about ML. This is precisely why SVMs broke out of the then state of the art: the two-stage 'plug-in' approach used to be the preferred way. A decision function might be Bayes with respect to some density, but that's beside the point; it is ill-advised to fit the density and then derive the decision function, except in very special and narrow cases.

Even if you throw PAC out, parametric statistics (I am using the standard definition here) is a tool with a very narrow scope. If your problem happens to fall within that scope, then by all means use it.

BTW, you make a good point about transductive learning. It gets compared to semi-supervised learning, but they are not the same thing: both use unlabeled data, but in the transductive setting the points where you seek labels have to be given ahead of time. This enables very strong guarantees. I think its potential has not come to fruition yet.
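Spelling out the distinction in standard notation (my formalization, not the poster's):

    \text{Inductive: given } (x_1, y_1), \dots, (x_m, y_m) \sim P, \text{ output } f \text{ with small risk on a fresh } x \sim P.
    \text{Transductive: the test points } x_{m+1}, \dots, x_{m+k} \text{ are fixed and given up front;}
    \text{output labels } \hat y_{m+1}, \dots, \hat y_{m+k} \text{ for exactly those points, and no others.}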

EDIT: @murbard2 > Re: See what I did there.

Color me violently unimpressed. That is not a parametric distribution.

I continue to stress that plug-in estimates are in general not a good idea. Even die-hard Bayesians would not use them; they would rather estimate the MAP or posterior-mean decision function directly, rather than fit the data distribution and then derive the optimal decision function from it.
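In symbols, the direct route picks the action minimizing posterior expected loss (standard decision theory, stated here for concreteness):

    a^*(x) \;=\; \arg\min_{a} \int L(a, y)\, p(y \mid x, \mathcal{D})\, dy ,

rather than first fitting \hat p(x, y) and only then reading a classifier off \hat p.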

I find Bayesian inference to be a good idea, except that it is costly unless you use conjugate priors (these are motivated more by convenience than by the data), and it has a "turtles all the way down" problem.

It's not that you have advocated plug-in estimators, but given your claim that everything in ML is about fitting parametric densities, readers are likely to be misled into believing that the two-step way is a good idea.


I agree that Bayesian techniques can be costly. The argument I'm making is that machine learning techniques are best understood as approximations to Bayesian methods; not for some mathematical reason, but for an epistemological one.

You like SVMs? Fine, then why do large margins matter at all? "Intuition"?


The decision function can be seen as a distribution too. Say you're thinking of building a classifier: you can build a distribution over domain x {-1, 1}.

Give me f : A -> B and I'll give you P : (A x B) -> R+. See what I did there?
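One way to make that construction concrete (the base measure \mu below is my choice, not the poster's): pair any distribution \mu on A with the graph of f,

    P_f(a, b) \;=\; \mu(a)\, \mathbf{1}[\, b = f(a) \,] ,

which turns every decision function f : A \to B into a (degenerate) joint distribution on A \times B.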


And yes, you can try very hard to be a Popperian and avoid the problem of induction by inventing a contrived framework like PAC. Or you can embrace description length as a universal prior.


That sounds interesting; do you have a good reference for reading up on SRM?


It's hard to pick the one true reference. It is of course covered in Vapnik's two books ("The Nature of Statistical Learning Theory" and "Statistical Learning Theory"). Just to set expectations right, these aren't what I would call hacker-friendly books; they need a fair bit of maths. Another good one is the book by Devroye, Gyorfi and Lugosi ("A Probabilistic Theory of Pattern Recognition"); you will find more examples there. You could also google "statistical learning theory", which will give you lots of relevant hits, in particular course notes.


DGL is great, but also not "hacker friendly"



