
Could anyone go into more detail on why a CRF is a good model for this kind of task? In the article they say that it was good on similar tasks:

> We chose to use a discriminative structured prediction model called a linear-chain conditional random field (CRF), which has been successful on similar tasks such as part-of-speech tagging and named entity recognition.

But they don't say why they chose it over some sort of Markov model (chain, hidden, etc.).



All the things you mentioned (plus e.g. Bayesian networks and restricted Boltzmann machines) are examples of graphical models. You can roughly think of (linear-chain) CRFs as being to HMMs what logistic regression is to Naive Bayes. HMMs and Naive Bayes learn a joint probability distribution over the data, while logistic regression and CRFs fit conditional probabilities.
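To see that analogy concretely, here's a minimal sketch with scikit-learn (my own toy example, not from the article): the same data fit with a generative classifier and its discriminative counterpart. With enough data and correlated features, the discriminative model often comes out ahead:

    # Generative vs. discriminative on the same toy data (my sketch).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=20,
                               n_informative=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Generative: models p(x, y), then applies Bayes' rule to predict.
    nb = GaussianNB().fit(X_tr, y_tr)
    # Discriminative: models p(y | x) directly.
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    print("Naive Bayes:         ", nb.score(X_te, y_te))
    print("Logistic regression: ", lr.score(X_te, y_te))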

If none of that makes sense, then, roughly: given enough data, the CRF (or any discriminative classifier) will tend to make better predictions because it doesn't try to directly model complicated things that don't actually matter for prediction. Because of this it can use richer, overlapping features without having to worry about how one feature relates to another. All this makes discriminative classifiers more robust when model assumptions are violated, because they don't sacrifice as much to remain tractable (or rather, the trade-offs they do make tend not to matter much when prediction accuracy is your main concern).

So in short: you use an HMM instead of a Markov chain when the sequence you're trying to predict is not directly visible. Say you want to predict parts of speech but only have access to words: you use the visible sequence of words to infer the hidden sequence of part-of-speech labels. You use CRFs instead of HMMs because they tend to make better predictors while remaining tractable. The downside is that discriminative classifiers won't necessarily learn the most meaningful decision boundaries, which starts to matter when you want to move beyond prediction.
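Here's a toy version of exactly that POS setup, using the sklearn-crfsuite library (my sketch; the feature names are made up, and it assumes `pip install sklearn-crfsuite`). The point is that each position gets an arbitrary dict of overlapping features of the visible words, with no independence assumptions:

    # Toy linear-chain CRF tagger with sklearn-crfsuite (my sketch).
    import sklearn_crfsuite

    def word_features(sent, i):
        w = sent[i]
        return {
            "word.lower": w.lower(),
            "suffix3": w[-3:],          # rich, overlapping features:
            "is_title": w.istitle(),    # no independence assumptions
            "prev": sent[i - 1].lower() if i > 0 else "<s>",
        }

    sents = [["The", "dog", "barks"], ["A", "cat", "sleeps"]]
    tags  = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB"]]

    X = [[word_features(s, i) for i in range(len(s))] for s in sents]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, tags)
    print(crf.predict(X))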


CRFs directly estimate the posterior/conditional model you care about (it tells you how to tag things), whereas an HMM estimates the joint model, which you then use for inference. The general feeling is that it is actually easier to learn the posterior model than the joint model. (And the insight of linear models like support vector machines is that it is easier still to just learn the most likely label than to estimate the label-given-observation probability distribution.)
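For reference, the two factorizations look like this (standard textbook forms in LaTeX notation; the feature functions f_k and weights \lambda_k are the usual CRF conventions, nothing specific to the article):

    % HMM: generative, factors the joint over hidden states y and observations x
    p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)

    % Linear-chain CRF: discriminative, normalized per input sequence by Z(x)
    p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T}
        \exp\!\Big( \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x, t) \Big)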

In fact a linear-chain CRF is little more than the discriminative version of an HMM. (And an HMM is just a sequential naïve Bayes classifier, and a linear-chain CRF is just a sequential logistic regression classifier. And, while I'm at it, a max-margin Markov network is just a sequential support vector machine.)


The linked book covers it, but it is 90 pages.

From a quick read, plain HMMs don't cope well with words that never appeared in training.
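A quick illustration of that unseen-word problem (my sketch, deliberately without smoothing): an unsmoothed maximum-likelihood emission table assigns probability zero to any word it has never seen, which zeroes out every tagging path that uses it. A CRF sidesteps this with shape features (suffix, capitalization) that still fire on unseen words:

    # Why unseen words hurt a plain HMM (my sketch, no smoothing).
    from collections import Counter

    emissions = Counter()   # (tag, word) counts from training data
    tag_counts = Counter()
    for word, tag in [("dog", "NOUN"), ("barks", "VERB")]:
        emissions[(tag, word)] += 1
        tag_counts[tag] += 1

    def p_word_given_tag(word, tag):
        # Maximum-likelihood estimate of the emission probability.
        return emissions[(tag, word)] / tag_counts[tag]

    print(p_word_given_tag("dog", "NOUN"))   # 1.0
    print(p_word_given_tag("cats", "NOUN"))  # 0.0 -> kills the whole path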

Markov logic networks are equivalent to CRFs (p. 22).



