This is a really cool and clear introduction to MAP/MLE, especially since you take great pains to explain what all of the notation means. I'll definitely be pointing some people I know to this blog.
OT on technical blogs:
Experts are often unable to put themselves in the shoes of someone with no experience, which really harms the pedagogy. When one practices a technical topic for a long time, concepts that were once foreign and difficult become instinctual. This makes it very hard to understand in what ways a beginner could be tripped up. It takes a large amount of thought to avoid this problem, which I think is why much introductory material (blog posts, books, etc.) is really sub-par.
Could someone explain in a bit more detail the move from 26 to 27? I don't get the significance of being "worried about optimization" or why/how we cancel p(x). I do get the later point about integration and the convenience of the reformulation. I just don't get why or how it is "allowed".
Sorry if this is obvious; I've been doing a lot of reading on this and have come across this step a few times before, but I seem to be missing some part of every explanation.
Because we're optimizing (taking an argmax) with respect to theta for some fixed dataset x, the 1/p(x) is just a constant factor -- p(x) is just some positive number (it's the probability of the data you actually observed, which doesn't depend on theta). It's like saying argmax_{theta} 0.87*f(theta) = argmax_{theta} f(theta).
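To make the constant-factor point concrete, here's a tiny numerical sketch (the objective and the constant are made up):

```python
import numpy as np

# Some arbitrary objective in theta, peaked at theta = 1.2.
theta = np.linspace(-3, 3, 601)
f = -(theta - 1.2) ** 2

# Stand-in for 1/p(x): a positive constant that doesn't depend on theta.
c = 0.87

# Scaling by a positive constant doesn't move the argmax.
assert theta[np.argmax(f)] == theta[np.argmax(c * f)]
```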
It's allowed because every candidate value of theta is scored with the same normalizer p(x). Dropping it rescales all the candidates' scores by the same positive constant, so their ordering (and hence the argmax) is unchanged, and the optimization gives the same result.
Nice write-up! Minor nitpick: ML/MAP estimators don't _require_ observations to be independent. At least, in my field we're looking at a single observation of a multivariate distribution, and we don't need to assume the elements are independent (i.e., we permit a non-diagonal covariance matrix). My intuition says this is equivalent to assuming multiple correlated scalar observations, but I'd have to sit down with some paper. Also, you use "trough" where I think you mean "through."
Dependence between elements of the same observation is irrelevant. The point is that different observations must be independent and identically distributed for the standard formulation of the likelihood to be valid.
Typically we write the likelihood function as
L(θ) = Π_i P(y_i | θ)
If you didn't have identically-distributed observations, the functional form of P would be different for each observation.
And if you didn't have independent observations, then you're basically screwed in the general case. That expression for L is basically the definition of probabilistic independence: a finite set of random variables is mutually independent if and only if their joint probability function is equal to the product of the individual variables' probability functions.
If you have dependence between observations, you lose the ability to write L in that nice form. This is a non-negotiable consequence of basic probability theory.
The only way to do MAP estimation without iid observations is to know the joint distribution of your entire dataset, and be able to maximize that distribution with respect to θ given an arbitrary data set. This is possible but it's not quite the same thing as dumping your data into a GLM.
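To see what the iid product form buys you, here's a minimal sketch (data and grid are made up): with independent observations the log-likelihood is a sum of per-observation terms, and for a Gaussian with known variance the maximizer is just the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=2.5, scale=1.0, size=1000)  # iid observations

# Independence turns log L(theta) into a sum of per-observation
# log-densities; for a unit-variance Gaussian, each term is
# -0.5 * (x_i - theta)^2 up to a constant.
thetas = np.linspace(0, 5, 501)
loglik = np.array([np.sum(-0.5 * (x - t) ** 2) for t in thetas])
theta_hat = thetas[np.argmax(loglik)]

# The grid maximizer sits at the grid point nearest the sample mean.
assert abs(theta_hat - x.mean()) <= 0.01  # grid resolution
```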
The post this is a reply to was correct, and this is not. E.g., a simple counterexample is finding the autocorrelation parameter in an AR(1) model for an economic time series. Under your suggested definition of MLE this can't be done, which is simply not the case.
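To illustrate the AR(1) case (my own sketch, with made-up parameters): the joint density factors via the chain rule into one-step conditionals p(y_t | y_{t-1}, φ), each Gaussian, so maximizing the conditional log-likelihood in φ reduces to least squares of y_t on y_{t-1}, even though the observations are dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true = 0.7
n = 5000

# Simulate an AR(1) series: y_t = phi * y_{t-1} + eps_t.
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + rng.normal()

# Chain-rule factorization: the conditional MLE of phi maximizes
# sum_t log p(y_t | y_{t-1}, phi), which is least squares here.
phi_hat = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
```

With 5000 points the estimate lands close to the true 0.7.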
In fact, not approaching the more general case is liable to confuse learners, as they may think that the independence assumption is somehow baked into MAP/MLE, which it is not.
But that form is not required. A quick counterexample: I'm trying to estimate a value u from N measurements, collected in a vector y. The measurements experience Gaussian noise with some general covariance matrix K (i.e., they are not independent).
Therefore, y is a sample from N([1, 1, ..., 1]^T u, K).
The MLE is then ([1, 1, ..., 1]K^{-1}[1, 1, ..., 1]^T)^{-1} [1, 1, ..., 1] K^{-1} y. Or in words, multiply y by the inverse covariance matrix, sum the result, and divide by the sum of all the elements in the inverse covariance matrix. As a sanity check, when the measurements _are_ independent, this reduces to a weighted average, where the observations are weighted by their inverse variances.
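A quick numerical check of this estimator (K, u, and the variances below are all made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
u_true = 3.0
n = 4
ones = np.ones(n)

# A symmetric positive-definite covariance with off-diagonal terms,
# i.e. correlated measurements.
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)

# One draw of y ~ N(ones * u, K).
y = u_true * ones + np.linalg.cholesky(K) @ rng.normal(size=n)

def mle(y, K):
    # (1^T K^{-1} 1)^{-1} 1^T K^{-1} y, as in the comment above.
    Kinv = np.linalg.inv(K)
    return (ones @ Kinv @ y) / (ones @ Kinv @ ones)

u_hat = mle(y, K)

# Sanity check: with a diagonal K, the same formula reduces to the
# inverse-variance weighted average.
var = np.array([1.0, 2.0, 4.0, 8.0])
w = 1.0 / var
assert np.isclose(mle(y, np.diag(var)), (w * y).sum() / w.sum())
```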