
> All ML is density estimation using a parametric probability distribution.

Umm, is there a typo there? Otherwise it is so wrong that I cannot even begin to enumerate the flaws.

In fact, the key revolutionary idea in classification theory, courtesy of Vapnik, was that one needn't estimate the density at all.

I cannot over-emphasize the revolutionary part: prior to Vapnik, the over-arching consensus was that one should try to learn the density and then threshold it to obtain the classifier. The motivation for this is the result that the thresholded conditional probability is the optimal classifier for {0,1} loss.

What Vapnik showed was that it is better to learn the classifier directly (he called it structural risk minimization, SRM for short), without any parametric assumptions whatsoever. The main reasoning is that density estimation is freaking hard, and depending on a freaking hard task as a prerequisite is backwards. Learning to classify is actually easier than learning densities. It was one of those paradigm shifts that come to a field only once in a while.
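A minimal sketch of the contrast, assuming scikit-learn is available (the dataset and both models are illustrative choices, not anything from the thread): a plug-in classifier that fits class-conditional densities and thresholds the posterior, next to a discriminative one that learns the boundary directly.

    # Plug-in vs. direct classification -- illustrative sketch only.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB   # two-stage: fit class densities, threshold posterior
    from sklearn.svm import LinearSVC            # direct: learn the decision boundary itself

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Plug-in route: model p(x|y) with Gaussians, classify via the thresholded posterior.
    plug_in = GaussianNB().fit(X_tr, y_tr)

    # Direct route: minimize a surrogate of the classification risk (the SRM spirit),
    # making no attempt to model the data density at all.
    direct = LinearSVC(C=1.0).fit(X_tr, y_tr)

    print("plug-in accuracy:", plug_in.score(X_te, y_te))
    print("direct accuracy: ", direct.score(X_te, y_te))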

Even if we let SRM be, traditional statistics has had methods as old as time that do not assume any parametric model. Those into non-parametric statistics (I am not one) would be very upset with your claim :) The problem with parametric models is that you almost never know the parametric family. You can of course test, but how many would you test? There are an infinite number of such families. Non-parametric methods don't make parametric assumptions (they do make some, considerably weaker, assumptions; assumptions are necessary for learning). The drawback is that they are more expensive. Conversely, if you happen to know the correct form of the family, a parametric model would be hard to beat.
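To make the family-mismatch point concrete, a small sketch (assuming numpy and scipy; the bimodal data is a made-up example): a single Gaussian fit to data outside its family stays wrong no matter how much data arrives, while a kernel density estimate adapts, at a higher computational cost.

    # Parametric vs. non-parametric density estimation -- illustrative sketch.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Bimodal data: not inside the single-Gaussian family.
    data = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

    # Parametric: fit mean and variance of one Gaussian (the wrong family here).
    parametric = stats.norm(data.mean(), data.std())

    # Non-parametric: kernel density estimate, no fixed finite parametrization.
    kde = stats.gaussian_kde(data)

    x = 0.0  # the trough between the two modes, where the true density is low
    print("single-Gaussian fit says:", parametric.pdf(x))  # badly overestimates
    print("KDE says:              ", kde(x)[0])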



@srean -- I agree with you. To make the GP's statement true, you have to adopt sweepingly wide definitions of the terms "parametric" and "density estimation", and a very narrow concept of ML to boot. You have given some good background, and I feel the need to add more examples.

The idea of boosting comes to mind. A quintessential ML concept, widely applicable, but really nothing to do with density estimation.
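For readers who haven't seen it, a minimal AdaBoost with decision stumps, written from the standard textbook description (numpy only; all names here are illustrative). Note that nothing in it estimates a density: it just reweights mistakes and takes a weighted vote.

    # Minimal AdaBoost with decision stumps -- a sketch of the boosting idea.
    import numpy as np

    def fit_stump(X, y, w):
        """Best single-feature threshold classifier under weights w; y in {-1,+1}."""
        best = (None, None, None, np.inf)  # (feature, threshold, sign, weighted error)
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] > t, 1, -1)
                    err = w[pred != y].sum()
                    if err < best[3]:
                        best = (j, t, s, err)
        return best

    def adaboost(X, y, rounds=20):
        n = len(y)
        w = np.full(n, 1.0 / n)
        ensemble = []
        for _ in range(rounds):
            j, t, s, err = fit_stump(X, y, w)
            err = max(err, 1e-12)
            alpha = 0.5 * np.log((1 - err) / err)   # this weak learner's vote weight
            pred = s * np.where(X[:, j] > t, 1, -1)
            w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
            w /= w.sum()
            ensemble.append((alpha, j, t, s))
        return ensemble

    def predict(ensemble, X):
        score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
        return np.sign(score)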

All the PAC learning results are another example. They prove very strong guarantees over very wide problem classes, yet have nothing to do with density estimation. They apply, for example, to problems where no density exists (just a probability measure), much less a smooth-enough density to estimate.
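For the flavor of such a result, here is the standard finite-class bound in the realizable case (a textbook statement, not something specific to this thread): if a learner outputs any hypothesis \hat h \in H consistent with m i.i.d. examples, then

    m \;\ge\; \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)
    \quad\Longrightarrow\quad
    \Pr\big[\mathrm{err}(\hat h) > \epsilon\big] \;\le\; \delta .

The guarantee holds for every distribution over the examples; no density, let alone a smooth one, appears anywhere.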

Another example of ML research that comes to mind is time- and space-efficient learning. There is a whole body of work on probabilistic analysis of data structures used for ML (I'm thinking partly of Alex Gray's work, http://www.cc.gatech.edu/~agray/). The work fits solidly into ML, but it's not about density estimation; it's about making provably fast algorithms for large problems.
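As a taste of that line of work, a sketch of fast nearest-neighbor queries with a k-d tree (assuming scipy is available; Gray's dual-tree algorithms go considerably further than this):

    # Fast nearest-neighbor queries via a k-d tree -- illustrative sketch.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    points = rng.standard_normal((100_000, 3))

    tree = cKDTree(points)                               # O(n log n) build
    dist, idx = tree.query(rng.standard_normal(3), k=5)  # ~O(log n) per query
    print(idx, dist)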


SVMs induce an implicit probability distribution over labelled samples of points, whether you like it or not.

Yes, "parametric" is often meant to imply a finite parametrization, but that's a dumb terminology.


For conversation, at least, it helps to stick with the agreed-upon definitions; for example, "parametric" means a finite parametrization. Otherwise the discussion is moot: if you use too-broad definitions, then "everything is parametric" becomes a tautology, a vacuous statement.

PAC and MDL aren't as much at odds as you make them out to be. If you are familiar with the PAC-Bayes and PAC-MDL theorems, you will know what I mean. My main issue was with the ridiculous (or, under your definition of what's 'parametric', vapid) claim that all of ML is parametric.

Secondly, I wanted to highlight that estimating the data distribution first and then finding the decision function is a bad way to go about ML. This is precisely why SVMs broke out of the then state of the art: the two-stage 'plug-in' approach used to be the preferred way. A decision function might be Bayes with respect to some density, but that's beside the point; it is ill-advised to fit the density and then derive the decision function, except in very special and narrow cases.

Even if you throw PAC out, parametric statistics (I am using the standard definition here) is a tool with a very narrow scope. If your problem happens to fall within that scope, then by all means use it.

BTW, you make a good point about transductive learning. It gets compared to semi-supervised learning, but they are not the same thing: both use unlabeled data, but in the transductive setting the points where you seek labels have to be given ahead of time. This enables very strong guarantees. I think its potential has not come to fruition yet.
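Spelling out the distinction in standard notation (my formalization, not the poster's):

    \text{Inductive: given } (x_1, y_1), \dots, (x_m, y_m) \sim P, \text{ output } f \text{ with small risk on a fresh } x \sim P.
    \text{Transductive: the test points } x_{m+1}, \dots, x_{m+k} \text{ are fixed and given up front;}
    \text{output labels } \hat y_{m+1}, \dots, \hat y_{m+k} \text{ for exactly those points, and no others.}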

EDIT: @murbard2 > Re: See what I did there.

Color me violently unimpressed. That is not a parametric distribution.

I continue to stress that plug-in estimates are in general not a good idea. Even die-hard Bayesians would not use them; they would rather estimate the MAP or posterior-mean decision function directly, rather than fit the data distribution and then derive the optimal decision function from it.
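In symbols, the direct route picks the action minimizing posterior expected loss (standard decision theory, stated here for concreteness):

    a^*(x) \;=\; \arg\min_{a} \int L(a, y)\, p(y \mid x, \mathcal{D})\, dy ,

rather than first fitting \hat p(x, y) and only then reading a classifier off \hat p.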

I find Bayesian inference to be a good idea, except that it is costly unless you use conjugate priors (these are motivated more by convenience than by the data), and it has a "turtles all the way down" problem.

It's not that you have advocated plug-in estimators, but given your claim that everything in ML is about fitting parametric densities, readers are likely to be misled into believing that the two-step way is a good idea.


I agree that Bayesian techniques can be costly. The argument I'm making is that machine learning techniques are best understood as approximations to Bayesian methods; not for some mathematical reason, but for an epistemological one.

You like SVMs? Fine, then why do large margins matter at all? "Intuition"?


The decision function can be seen as a distribution too. Say you're thinking of building a classifier: you can build a distribution over domain x {-1, 1}.

Give me f : A -> B and I'll give you P : (A x B) -> R+. See what I did there?
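One way to make that construction concrete (the base measure \mu below is my choice, not the poster's): pair any distribution \mu on A with the graph of f,

    P_f(a, b) \;=\; \mu(a)\, \mathbf{1}[\, b = f(a) \,] ,

which turns every decision function f : A \to B into a (degenerate) joint distribution on A \times B.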


And yes, you can try very hard to be a Popperian and avoid the problem of induction by inventing a contrived framework like PAC. Or you can embrace description length as a universal prior.


That sounds interesting; do you have a good reference for reading up on SRM?


It's hard to pick the one true reference. It is of course covered in Vapnik's two books ("The Nature of Statistical Learning Theory" and "Statistical Learning Theory"). Just to set expectations right, these aren't what I would call hacker-friendly books; they need a fair bit of maths. Another good one is the book by Devroye, Gyorfi and Lugosi ("A Probabilistic Theory of Pattern Recognition"); you will find more examples there. You could also google "statistical learning theory", which will give you lots of relevant hits, in particular course notes.


DGL is great, but also not "hacker friendly"



