This is a really cool and clear introduction to MAP/MLE, especially since you take great pains to explain what all of the notation means. I'll definitely be pointing some people I know to this blog.
OT on technical blogs:
Experts are often unable to put themselves in the shoes of someone with no experience, which really harms the pedagogy. When one practices a technical topic for a long time, concepts that were once foreign and difficult become instinctual. This makes it very hard to understand in what ways a beginner could be tripped up. It takes a large amount of thought to avoid this problem, which I think is why much introductory material (blog posts, books, etc.) is really sub-par.
Could someone explain in a bit more detail the move from 26 to 27? I don't get the significance of being "worried about optimization" or why/how we cancel p(x). I do get the later point about integration and the convenience of the reformulation. I just don't get why or how it is "allowed".
Sorry if this is obvious; I've been doing a lot of reading on this and have come across this step a few times before, but I seem to be missing some part of every explanation.
Because we're optimizing (taking an argmax) with respect to theta for some fixed dataset x, the 1/p(x) is just a constant factor -- p(x) is just some positive number (it's the probability of the data you actually observed, which doesn't depend on theta). It's like saying argmax_{theta} 0.87*f(theta) = argmax_{theta} f(theta).
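To make the constant-factor point concrete, here's a tiny numerical sketch (the objective and the constant are made up):

```python
import numpy as np

# Some arbitrary objective in theta, peaked at theta = 1.2.
theta = np.linspace(-3, 3, 601)
f = -(theta - 1.2) ** 2

# Stand-in for 1/p(x): a positive constant that doesn't depend on theta.
c = 0.87

# Scaling by a positive constant doesn't move the argmax.
assert theta[np.argmax(f)] == theta[np.argmax(c * f)]
```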
It's allowed because every candidate value of theta is scored with the same normalizer p(x). Dropping it rescales all the candidates' scores by the same positive constant, so their ordering (and hence the argmax) is unchanged, and the optimization gives the same result.
Nice write-up! Minor nitpick: ML/MAP estimators don't _require_ observations to be independent. At least, in my field we're looking at a single observation of a multivariate distribution, and we don't need to assume the elements are independent (i.e., we permit a non-diagonal covariance matrix). My intuition says this is equivalent to assuming multiple correlated scalar observations, but I'd have to sit down with some paper. Also, you use "trough" where I think you mean "through."
Dependence between elements of the same observation is irrelevant. The point is that different observations must be independent and identically distributed for the standard formulation of the likelihood to be valid.
Typically we write the likelihood function as
L(θ) = Π_i P(y_i | θ)
If you didn't have identically-distributed observations, the functional form of P would be different for each observation.
And if you didn't have independent observations, then you're basically screwed in the general case. That expression for L is basically the definition of probabilistic independence: a finite set of random variables is mutually independent if and only if their joint probability function is equal to the product of the individual variables' probability functions.
If you have dependence between observations, you lose the ability to write L in that nice form. This is a non-negotiable consequence of basic probability theory.
The only way to do MAP estimation without iid observations is to know the joint distribution of your entire dataset, and be able to maximize that distribution with respect to θ given an arbitrary data set. This is possible but it's not quite the same thing as dumping your data into a GLM.
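To see what the iid product form buys you, here's a minimal sketch (data and grid are made up): with independent observations the log-likelihood is a sum of per-observation terms, and for a Gaussian with known variance the maximizer is just the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=2.5, scale=1.0, size=1000)  # iid observations

# Independence turns log L(theta) into a sum of per-observation
# log-densities; for a unit-variance Gaussian, each term is
# -0.5 * (x_i - theta)^2 up to a constant.
thetas = np.linspace(0, 5, 501)
loglik = np.array([np.sum(-0.5 * (x - t) ** 2) for t in thetas])
theta_hat = thetas[np.argmax(loglik)]

# The grid maximizer sits at the grid point nearest the sample mean.
assert abs(theta_hat - x.mean()) <= 0.01  # grid resolution
```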
The post this is a reply to was correct, and this is not. E.g., a simple counterexample is finding the autocorrelation parameter in an AR(1) model for an economic time series. Under your suggested definition of MLE this can't be done, which is simply not the case.
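To illustrate the AR(1) case (my own sketch, with made-up parameters): the joint density factors via the chain rule into one-step conditionals p(y_t | y_{t-1}, φ), each Gaussian, so maximizing the conditional log-likelihood in φ reduces to least squares of y_t on y_{t-1}, even though the observations are dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true = 0.7
n = 5000

# Simulate an AR(1) series: y_t = phi * y_{t-1} + eps_t.
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + rng.normal()

# Chain-rule factorization: the conditional MLE of phi maximizes
# sum_t log p(y_t | y_{t-1}, phi), which is least squares here.
phi_hat = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])
```

With 5000 points the estimate lands close to the true 0.7.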
In fact, not approaching the more general case is liable to confuse learners, as they may think that the independence assumption is somehow baked into MAP/MLE, which it is not.
But that form is not required. A quick counterexample: I'm trying to estimate a value u from N measurements, collected in a vector y. The measurements experience Gaussian noise with some general covariance matrix K (i.e., they are not independent).
Therefore, y is a sample from N([1, 1, ..., 1]^T u, K).
The MLE is then ([1, 1, ..., 1]K^{-1}[1, 1, ..., 1]^T)^{-1} [1, 1, ..., 1] K^{-1} y. Or in words, multiply y by the inverse covariance matrix, sum the result, and divide by the sum of all the elements in the inverse covariance matrix. As a sanity check, when the measurements _are_ independent, this reduces to a weighted average, where the observations are weighted by their inverse variances.
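A quick numerical check of this estimator (K, u, and the variances below are all made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
u_true = 3.0
n = 4
ones = np.ones(n)

# A symmetric positive-definite covariance with off-diagonal terms,
# i.e. correlated measurements.
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)

# One draw of y ~ N(ones * u, K).
y = u_true * ones + np.linalg.cholesky(K) @ rng.normal(size=n)

def mle(y, K):
    # (1^T K^{-1} 1)^{-1} 1^T K^{-1} y, as in the comment above.
    Kinv = np.linalg.inv(K)
    return (ones @ Kinv @ y) / (ones @ Kinv @ ones)

u_hat = mle(y, K)

# Sanity check: with a diagonal K, the same formula reduces to the
# inverse-variance weighted average.
var = np.array([1.0, 2.0, 4.0, 8.0])
w = 1.0 / var
assert np.isclose(mle(y, np.diag(var)), (w * y).sum() / w.sum())
```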