There has been a lot of rich discussion about the relative merits of the Bayesian and frequentist perspectives on statistical inference. If you are thinking about applying inference techniques to some problem, then it is well worth your time to make sure that you really, really understand this debate, because picking the correct tool for your job is likely to make your life a lot easier (and, yes, there are situations where the Bayesian perspective is clearly better and situations where it is clearly worse).
Unfortunately this post completely and totally ignores the things that will enable you to make this decision. This post is called "Understand the Math Behind it All", but you will learn nothing about math, or really, Bayesian statistics, at all. You will not learn, for example, how to apply Bayesian inference to problems, what it means to do basic Bayesian tasks like "use an expert" or "condition on evidence", or even what a Bayesian statistic is. In fact, Bayes' theorem is never even mentioned. There's just a hand-wavy collection of statements like "The Bayesian approach is to rely on past knowledge and then adjust accordingly". That is so vague that it is not even clear that they are talking about Bayesian analysis. This is the sort of statement that fools people into believing they understand something that they really don't.
If you really want to understand this material, you should watch the talk by Mike Jordan called "Are you a Bayesian or a Frequentist?" [1]. It's a bit much for beginners, but if you are willing to look up some of the math, it is entirely digestible, and it is by far the best comparison of the two communities I have found. I say "by far" because (1) it is a more or less complete representation of both communities, (2) it is a largely unbiased account of each community's strengths and weaknesses, and (3) it is as direct as it can get: it is not tied up in a lot of external knowledge, and it is intent on delivering this message directly rather than in an off-hand way as a means of getting to something else.

[1] http://videolectures.net/mlss09uk_jordan_bfway/
I'd also prefer not to encourage referring to people as "a Bayesian" or "a frequentist", as though it's a hard philosophical preference, a binary choice one has to make.
I realise there are historical reasons for this, but really guys, they're both tools. Know the pros and cons and pick the right one for the job; neither is uniformly better.
Which framework you use can be a fairly subtle decision sometimes, requiring you to think fairly deeply about what you really want to get out of the analysis, what assumptions you're comfortable making, and what kind of interpretation it would be most useful to be able to place on the results. But it's not just an arbitrary choice that you can leave to aesthetic or philosophical preference.
Thanks. Just yesterday I came across this first paragraph in a paper I now see is by Jordan:
"Statistics has both optimistic and pessimistic faces, with the Bayesian perspective often associated with the former and the frequentist perspective with the latter, but with foundational thinkers such as Jim Berger reminding us that statistics is fundamentally a Janus-like creature with two faces." [Janus is a Roman god with two faces]
In my current project (stroke and [since the leap] gesture recognition), I'm using the covariance matrix of a training set and the difference of a feature vector from the mean to calculate the Mahalanobis distance of the input vector from the training set. I plan to use this same covariance matrix in the Gaussian density estimation formula to generate a probability density function (and then use the likelihood function instead of the Mahalanobis distance). Still trying to mentally connect this to the bigger-picture stuff though.
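To make that concrete, here's a rough sketch of the kind of computation I mean (toy data, numpy/scipy; the names and numbers are just illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative training set: rows are feature vectors (e.g., stroke/gesture features).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))

mu = X_train.mean(axis=0)                # mean feature vector
cov = np.cov(X_train, rowvar=False)      # covariance matrix of the training set
cov_inv = np.linalg.inv(cov)

def mahalanobis(x, mu, cov_inv):
    """Mahalanobis distance of x from the training distribution."""
    d = x - mu
    return np.sqrt(d @ cov_inv @ d)

x_new = rng.normal(size=5)               # an incoming feature vector
print("Mahalanobis distance:", mahalanobis(x_new, mu, cov_inv))

# The same mean and covariance plugged into a Gaussian density give a likelihood instead.
print("Gaussian log-likelihood:", multivariate_normal(mu, cov).logpdf(x_new))
```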
Berger's actually written a seminal book on the topic called "Statistical Decision Theory and Bayesian Analysis"[1]. If you're interested in that area, consider checking it out of your library.
I'm not quite as sharp in the area as I used to be, but feel free to hit me up over email if you have more questions. I can't guarantee that I'll know the answers, but I'm happy to give it a shot.
My Bayesian theory is a bit rusty, but here we go.
Say we have data X, and some index \theta, possibly infinite-dimensional, into the family of functions that could describe the data.
The frequentist perspective classically conditions on \theta: it treats \theta as a fixed unknown and evaluates the expected loss over repeated draws of the data X, looking for procedures that behave well for every possible \theta. The Bayesian perspective, on the other hand, conditions on the observed data X: it places a prior over \theta and optimizes the expected loss under the posterior.
This has two impacts. First, all things being equal, frequentist statistics will tend to be more stable and more calibrated, but less coherent. It is commonly said that frequentist statistics will "insulate" one from poor decision making, and, all things being equal, that is true.
Specific, clear wins for frequentists are bootstrapping procedures (e.g., Efron's bootstrap, the b-of-n bootstrap, Jordan's own scalable "Bag of Little Bootstraps" from NIPS 2011), which are methods for building what are called "quantifiers" for "estimators". In short, this means that if you have some estimator (e.g., a classifier, or a mean, or whatever), you want to be able to quantify the certainty of that estimator -- so if you've only seen 5 examples, you want to express that you're less certain. This is clearly a frequentist application, not a Bayesian one, and in general it points to the fact that pure frequentist tools not only have a place in inference, but they fill a niche that Bayesian tools will not, and in some cases cannot, fill.
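To make the bootstrap idea concrete, here's a minimal sketch of Efron's (percentile) bootstrap for quantifying the uncertainty of a simple estimator, the sample mean; the data and numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=30)   # a small sample; the estimator is its mean

def bootstrap_ci(data, estimator, n_boot=5000, alpha=0.05, rng=rng):
    """Percentile bootstrap confidence interval for an arbitrary estimator."""
    n = len(data)
    stats = np.array([
        estimator(rng.choice(data, size=n, replace=True))  # resample with replacement
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return estimator(data), (lo, hi)

point, (lo, hi) = bootstrap_ci(data, np.mean)
print(f"mean = {point:.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```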
It sounds like you're just saying that if you want to know the frequentist properties of an estimator you have to be frequentist. That's a tautology.
The harder question is whether there are any decisions you'd prefer to make using a non-Bayesian procedure. That's basically a tautology in the other direction though.
As I said, my Bayesian theory is rusty, but there are no "frequentist properties" of an estimator. Frequentist inference is inference -- it doesn't make guarantees about the underlying thing it's approximating, it provides guarantees about its approximation.
The key here is that Bayesian and frequentist procedures provide different sorts of guarantees. Frequentists seek guarantees that hold across all the possible values of \theta that could describe the data X, while Bayesians place a prior over \theta (this might come from an "expert") and simply optimize the expectation conditioned on the data. Neither is "wrong", but in the case of the bootstrap, the result is calibrated in a way that Bayesian inference simply never will be (if it were, it would be frequentist).
EDIT: As for your second question, actually I think it's not more interesting. A classifier is a type of estimator, so all of the general frequentist guarantees still apply to decision making.
Frequentist statistics is about determining the repeated sampling properties of a procedure/statistic/estimator. It's about evaluation not estimation. "Optimizing \theta" or whatever you're envisioning is just one possible procedure you might be interested in evaluating. You can use the repeated sampling properties of your procedure to do frequentist inference or to evaluate other properties like unbiasedness, consistency, risk, etc. Typically the goal is to find procedures that have "good" frequentist (repeated sampling) properties. Most Bayesian-inclined statisticians would tend to argue that many frequentist properties are not important to applied data analysis or optimal decision making.
A possible example given in the lecture is figuring out whether two sets of numeric data were sampled i.i.d. from the same distribution (being hand-wavy about what this precisely means).
I don't see how you can sensibly approach this kind of problem in a Bayesian way. From a frequentist perspective you're sort of spoiled for choice about how to approach this problem.
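For instance, one standard frequentist choice is a permutation test; here's a rough sketch using the difference in means as the test statistic (made-up data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=40)    # two made-up samples
b = rng.normal(0.3, 1.0, size=35)

def permutation_test(a, b, n_perm=10000, rng=rng):
    """Two-sample permutation test using the difference in means as the statistic."""
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)          # relabel the data under the null hypothesis
        diff = perm[:len(a)].mean() - perm[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / n_perm             # two-sided p-value

obs, p = permutation_test(a, b)
print(f"observed difference = {obs:.3f}, p-value = {p:.3f}")
```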
The useless answer is that they both do different things, so it depends what of those things you want :)
One aspect of frequentist techniques that perhaps others haven't emphasised so much is that they tend to give guarantees about expected behaviour which hold uniformly over all possible values of the unknown parameters.
With the Bayesian approach, the guarantees you obtain will only hold in an 'averaged-out' sense over the prior distribution you specify.
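Loosely, in decision-theoretic notation (mine, and simplified): the frequentist controls the worst case over \theta, while the Bayesian controls the average over the prior \pi:

```latex
% Frequentist risk of a procedure \delta, with \theta held fixed:
R(\theta, \delta) = \mathbb{E}_{X \mid \theta}\left[ L(\theta, \delta(X)) \right]

% Worst-case (uniform) guarantee:        \sup_{\theta} R(\theta, \delta)
% Prior-averaged (Bayes) guarantee:      r(\pi, \delta) = \int R(\theta, \delta)\, \pi(\theta)\, d\theta
```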
If you're a bit paranoid and you want a probabilistic bound on what might happen in the worst case, you might sometimes find the former a little more comforting than the latter.
In particular if you don't have much data, the influence of the choice of prior will be bigger and so the distinction will matter more.
Hope that helps, and that any stats PhDs will correct me if I've over-simplified things here.
There are times when, even as a Bayesian, one is interested in calibration. Model checking without a specified alternative is an example. Frequentist ideas -- sampling from the model and comparing it to the observed distribution -- can be helpful here. I'm thinking of Rubin (1984): http://www.cs.princeton.edu/courses/archive/fall11/cos597C/r...
This is an attempt to give a pragmatic overview, without the math.
Bayesian methods rely on the concept of "priors". A prior is a probability distribution over the thing you are modeling, encoding what you already know (or assume) before seeing the data. Priors "seed" the model.
Bayesian methods generally need fewer samples of data to make predictions; the downside is that sensitivity to the prior (and to individual data points) increases. Frequentist approaches, by contrast, rely on much larger data sets and can handle noise more effectively.
Think about GMail's spam filter for example (a Bayesian approach): if you train HAM as SPAM, that is going to have a devastating effect on the efficacy of your filtering.
Thus, practically speaking, if you have tons of data and are looking for a signal, use a frequentist approach. If you're building, for example, an expert diagnostic engine (think Sherlock Holmes), requiring few pieces of information to make a prediction, consider a Bayesian approach.
I'm over-simplifying of course, but that seems to be the gist of it.
I don't think that's right. If you have enough data points, your prior gradually gets less relevant, because the data swamp it. And Bayesian statistics has the concept of an "ignorance prior", which mathematically represents the position where all possibilities are equally likely. Bayesian statistics also offers the possibility of adding more data after running your experiment and computing a new answer in a consistent way, whereas doing this with frequentist statistics is completely invalid.
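As a toy illustration of both points (prior sensitivity with little data, and consistent updating as more data arrives), here's a Beta-Binomial sketch starting from a uniform "ignorance" prior; the counts are made up:

```python
from scipy.stats import beta

# Beta-Binomial example: start from a uniform "ignorance" prior Beta(1, 1).
a, b = 1.0, 1.0

def update(a, b, successes, failures):
    """Bayesian update: the Beta prior is conjugate to Binomial data."""
    return a + successes, b + failures

# With only 5 observations, the prior still matters a lot...
a, b = update(a, b, successes=4, failures=1)
print("after 5 obs, posterior mean:", beta(a, b).mean())

# ...but we can keep folding in new data consistently; with enough of it, the data dominate.
a, b = update(a, b, successes=640, failures=360)
print("after 1005 obs, posterior mean:", beta(a, b).mean())
```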
One of the cooler things about the field of machine learning is that conferences like ICML, and little mini-summer schools like this are very "rich", so talks like this are all over the Internet.
For those who are curious, Tenenbaum has recently become very famous in the community for Bayesian-style work. His students are getting very prestigious jobs, so a talk like that is worth a look for sure.
That's a truly excellent talk and really even the first slide, titled "Statistical Inference," should be enough to gain a ton of information, so if you're intimidated by the length just give the first slide a try. Michael Jordan is one of my favorite statisticians/machine learning researchers, and if I see he's speaking somewhere I always try to go. I don't go to many stats talks, but his talks are always some of the most mathy I do see, and he's not afraid to dive in to the mathematical mechanics of methods.
If you actually want to understand the "math behind it all," do yourself a favor and read the first three chapters of _Probability Theory: The Logic of Science_ by E. T. Jaynes. Jaynes builds, from the ground up, probability theory as an extended logic that allows you to draw inferences from incomplete and uncertain information. In this logic, a probability represents a degree of belief, and Bayes' theorem becomes a rule for updating prior beliefs in light of new evidence. From this basis, Jaynes recovers the classical rules of probability (e.g., the sum and product rules), giving them clear interpretations.
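For reference, the product rule Jaynes derives and Bayes' theorem as it falls out of it (with I standing for background information):

```latex
% Product rule (symmetric in A and B):
P(A, B \mid I) = P(A \mid B, I)\, P(B \mid I) = P(B \mid A, I)\, P(A \mid I)

% Divide through by P(B \mid I) to get Bayes' theorem as an updating rule:
P(A \mid B, I) = \frac{P(B \mid A, I)\, P(A \mid I)}{P(B \mid I)}

% Sum rule:
P(A \mid I) + P(\bar{A} \mid I) = 1
```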
If you want to understand this stuff for real, it's hard to beat Jaynes. (Plus, he uses a robot mind (!) as a recurring expository device. That alone is worth the price of admission.)
"Whereas a frequentist model looks at an absolute basis for chances, something like the population of females is 52%, so that means that if I select someone at random from my office, I have a 52% chance of picking a female. The chances are purely based on the total probability. The Bayesian approach is to rely on past knowledge and then adjust accordingly. If I know that 75% of my office is male, and I grab a person, then I know that I have a 25% chance of picking a female."
This is a terrible example: it makes it look like frequentist statistics don't know about conditional probabilities.
My understanding is that frequentist statistics are all about point estimates (such as maximum likelihood) or sometimes confidence intervals. Say you want to run an elections poll, sample a bunch of people and ask who they have voted for.
Frequentists will average the data and say "Party A is at 51%" (point estimate) or "Party A is between 49% and 52%" (confidence interval). Asymptotically and under certain assumptions this value will converge to the "real" value and often you can also estimate the speed of convergence (with variance bounds).
Bayesians will instead start with a "prior", which is a probability distribution p(x) on "Party A is at x%". You can start from a uniform, non-informative prior, or if you have some information you can factor it into the prior. Then you take your polling data D and compute the conditional probability p(x | D), called the "posterior", which is the probability that "Party A is at x%" given the data. So you don't get a single number or an interval, you get a probability for each possible electoral outcome. Again, if you have infinite data this will converge to a distribution where all the mass is on a single point, which usually is the same one given by frequentist statistics.
The problem with Bayesian statistics is that you have to handle probability distributions instead of single numbers (as with point estimates), so the inference gets much harder. It has become practical only relatively recently, thanks to both algorithmic and hardware advances. On the other hand, the main advantage is that you get to know how uncertain your estimate is, which can make a huge difference when you have little data.
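Here's a toy version of the poll example, with made-up numbers: the frequentist point estimate and a normal-approximation confidence interval next to a Bayesian Beta posterior from a uniform prior:

```python
import numpy as np
from scipy.stats import beta

n, k = 1000, 510          # made-up poll: 510 of 1000 respondents voted for Party A

# Frequentist: point estimate and a 95% normal-approximation confidence interval.
p_hat = k / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(f"point estimate {p_hat:.3f}, 95% CI ({p_hat - 1.96 * se:.3f}, {p_hat + 1.96 * se:.3f})")

# Bayesian: uniform Beta(1, 1) prior, so the posterior is Beta(1 + k, 1 + n - k).
posterior = beta(1 + k, 1 + n - k)
cred = posterior.ppf([0.025, 0.975])
print(f"posterior mean {posterior.mean():.3f}, 95% credible interval ({cred[0]:.3f}, {cred[1]:.3f})")

# The posterior is a full distribution: e.g. the probability that Party A is above 50%.
print(f"P(share > 0.5 | data) = {posterior.sf(0.5):.3f}")
```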
Last note: another thing you can do when you have a prior is factor it into your estimator and take the maximum likelihood of the posterior as a point estimate. This is called Maximum a Posteriori (MAP) and called by some people "Bayesian", but I don't think Bayesians agree with that.
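For concreteness, the MAP estimate is just the mode of the posterior:

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\, p(\theta \mid D)
  = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)
```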
I don't think this is a worthwhile distinction, even if it's historically accurate. Both Bayesians and frequentists focus on point estimation and distributions. EAP and MAP are just as Bayesian as the full posterior distribution. And the sampling distribution is just as important to frequentists as the posterior distribution is to a Bayesian.
The key difference is whether inference is based on the sampling distribution or the posterior distribution.
The sampling distribution is not the likelihood. It's the fundamental basis of all frequentist inference. Amazingly, even though this is typically taught in introductory statistics, virtually no students actually digest its importance. You literally cannot understand frequentist statistics without understanding the idea of a sampling distribution.
The sampling distribution is the distribution of your statistic (MLE estimate, mean, EAP, MAP, or whatever you want) under repeated sampling from the population distribution. Frequentism is an evaluation procedure, which can be applied to any estimator whether it be Bayesian or something like MLE. Frequentists are interested in whether this distribution has "good" properties. Supposedly good properties include things like unbiasedness, consistency, minimum variance, etc. Inference is typically expressed as a function of this distribution (confidence intervals) or by comparing the sampling distribution under some restriction (the null hypothesis) to the actual value of the statistic in the observed sample.
Given that you can't typically sample from the population distribution, the practical question becomes how do you approximate the sampling distribution. Typically this is done by appealing to a central limit theorem. Bootstrapping provides another intuitive approximation.
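A toy simulation (my own, not anything from the thread) of what "approximating the sampling distribution" means: the true sampling distribution of the mean, available here only because we can simulate the population, next to the bootstrap approximation built from a single observed sample:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50

# "True" sampling distribution of the sample mean, available only because
# we get to draw from the population (exponential with mean 2) directly.
true_sampling = np.array([rng.exponential(2.0, n).mean() for _ in range(10000)])

# In practice we have one sample; the bootstrap approximates the sampling
# distribution by resampling that single sample with replacement.
sample = rng.exponential(2.0, n)
boot = np.array([rng.choice(sample, n, replace=True).mean() for _ in range(10000)])

print("true standard deviation of the sample mean:", true_sampling.std())
print("bootstrap estimate of that standard deviation:", boot.std())
```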
There are all sorts of problems with this approach to statistics, despite its success.
> Last note: another thing you can do when you have a prior is factor it into your estimator and take the maximum likelihood of the posterior as a point estimate. This is called Maximum a Posteriori (MAP) and called by some people "Bayesian", but I don't think Bayesians agree with that.
The general perspective is that, of course, you'd like to get the fully marginalized, exact posterior probability distribution. However, that's not computationally viable for most problems, so you have to resort to approximations, like MCMC, Variational methods and MAP. I would say that they're all definitely "Bayesian" methods, as long as you're aware of what you're doing, and that you check the quality of your approximation.
Thanks for your clarification! I'm not a statistician, I just happen to be surrounded by them :)
My impression is that among the Bayesians I know there is a general negative bias towards MAP, and Variational methods are vastly preferred. However I agree with you that, all being approximations, none of them is intrinsically better than the others.
In particular I don't understand all the hype around Variational Bayes, to me it seems like a "fat MAP", a MAP estimate with a Gaussian around it.
Right - MAP may (debatably) be a lousy Bayesian approximation, but it's still Bayesian :)
David MacKay's wonderful book made the observation that MAP is a variational method that uses a delta function. "From this perspective, any approximating distribution Q(x; θ) [like the Gaussian], no matter how crummy it is, has to be an improvement on the spike produced by the standard method! [MAP]"
I've only recently come across the technique myself. I think the hype is because it is new (well, "new" in an old-thing-made-new kind of way). What I find interesting is the duality-like relationship between MCMC and variational methods. Variational methods are optimization algorithms. I don't understand variational methods well enough to say anything insightful, but given the work showing the duality between optimization and probability*, I find this new wave of hype highly interesting.
Drawing on the concept of duality, I think variational methods will come to be seen as holding no more power than probabilistic techniques. But a duality is still great because you can plumb old techniques to get new results.
The problem with statistics is that it's complicated, both Bayesian and frequentist. Specifically, all statistical methods make assumptions about the data, some of which are quite subtle and take effort to understand. Their intricacy is the reason why so many scientists use them incorrectly. It's much less about whether a method is Bayesian or frequentist, and much more about whether the specific assumptions made by a method are suitable for the data. This requires a judgement call. One advantage of Bayesian methods over frequentist methods is that it's easier to incorporate what we know about the data into the model via the prior, but only in principle, because in practice doing a good job of this is pretty tricky.
Someday I will fundamentally understand Bayesian probability.
By understand I mean to grasp the links between thermodynamics, learning, black holes, cosmology, optimization, probability, Bayesianism, and quantum mechanics. Stuff like why techniques from thermodynamics and energy-based models are useful in machine learning, the well-known relationship between Shannon entropy and thermodynamic entropy, between entropy and decoherence in QM, the duality of optimization and probability, the complex Bayesian probability interpretations of quantum mechanics, the Bekenstein bound, and the holographic principle. I could ramble on at length, but fortunately I have much to do.
Someday, I hope to do the same! But that's a big chunk to chew. Lately, I've been nibbling, starting with covariance matrices, the Mahalanobis distance (a "moment measure" if I'm not mistaken, therefore a connection to... dundudun) and Fisher information.
Fisher information is key and turns up in a lot of fundamental places. I'm currently slowly working my way through a text on Information Geometry and another on Ideals and varieties. There is only a limited time one can devote to constant learning so I try to learn things that cut through as much territory as possible. I feel strong discomfort when reading about subjects like say machine learning where a lot of stuff is seemingly arbitrary rules of thumb* .
It turns out that a bunch of geometric ideas that are useful in physics also unify ML concepts: the idea of information geometry. There are three ways I have seen the concept used. One is based on differential geometry and treats sets of probability distributions as manifolds and their parameters as coordinates. Many concepts are unified and tricky ideas become tautologies within a solid framework (Fisher information as a metric): http://www.cscs.umich.edu/~crshalizi/notabene/info-geo.html
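The metric in question is the Fisher information, which gives the family of distributions a Riemannian structure:

```latex
g_{jk}(\theta)
  = \mathbb{E}_{x \sim p(\cdot \mid \theta)}
    \left[ \frac{\partial \log p(x \mid \theta)}{\partial \theta_j}\,
           \frac{\partial \log p(x \mid \theta)}{\partial \theta_k} \right]
```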
The second approach is in terms of varieties from algebraic geometry. Here statistical models of discrete random variables are the zero sets of certain collections of polynomials (which describe hypertetrahedra). Graphical models (hidden Markov models, neural nets, Bayes nets) are all treated on one footing.
The final approach is an interesting set of techniques where a researcher abstracts information retrieval using methods from quantum mechanics. The benefit is that you get a basic education in the math of QM as well.
* Arbitrary in the sense that you just have to accept a lot of things that only become less fuzzy over time, whereas a proper framework provides handholds that reward effort with proportional amounts of understanding. The last time I felt this way was when I was first learning functional programming 7 years ago. The terminology was different and heavy going compared to imperative programming, but I knew the rewards in understanding, expressiveness and flexibility would be well worth the effort. Confusion dissipated linearly with effort (unlike C++'s nonlinear relationship), and I knew that I was picking up a bunch of CS theory at the same time that would make learning programming (and C++) much easier.
Seems like everyone wants to write an intro to Bayesian stats (or, usually, to the subjective interpretation of probability and Bayes' theorem). It's like the new Monad tutorial.