It seems odd that the key feature of Naive Bayes (conditional independence of the individual features given the class) is only mentioned in passing, and it is never stated explicitly that the features are assumed conditionally independent.
This assumption is then used when transforming probabilities to log-probabilities in one go, without any mention of it, which could be particularly confusing for beginners.
I would recommend decomposing p(x|C) before applying the log transform for clarity.
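Roughly what I mean, as a sketch in my own words (Gaussian per-feature likelihoods are assumed here just for concreteness, and the function name is mine, not the post's): the naive assumption gives p(x|C) = p(x_1|C) * ... * p(x_d|C), so the log turns the product into a sum, log p(C) + sum_i log p(x_i|C).

    import numpy as np
    from scipy.stats import norm

    def log_posterior_unnormalized(x, prior, means, stds):
        # Conditional independence given the class:
        # p(x|C) = prod_i p(x_i|C)  ->  log p(x|C) = sum_i log p(x_i|C)
        per_feature_log_likelihoods = norm.logpdf(x, loc=means, scale=stds)
        return np.log(prior) + np.sum(per_feature_log_likelihoods)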
Thanks for pointing that out! This is a blog post for people who already know a bit about the theory and are interested in a possible implementation. But I see your point and will add it.
It's a probability (the conditional probability of the discrete random variable Y given the continuous X=x) multiplied by the density of the continuous random variable at X=x. It is the value of the joint density at X=x, Y=y.
p-values are unrelated to this but are computed by looking at the cumulative distribution function (the integral of the density).
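A tiny sketch of the distinction (arbitrary numbers and an arbitrary Gaussian, just to make the point):

    from scipy.stats import norm

    p_y = 0.4                            # P(Y=y): a genuine probability
    f_x_given_y = norm(0, 1).pdf(0.5)    # density of X at x=0.5 given Y=y: not a probability
    joint_density = p_y * f_x_given_y    # value of the mixed joint density at (x, y)

    # Probabilities come from the cdf, the integral of the density:
    prob_x_below = norm(0, 1).cdf(0.5)   # P(X <= 0.5 | Y=y)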
If you have a bag of [1.0 3.1 5.2 7.8 7.8 7.9 8.1 8.2 9.9 10.1], what is the probability of picking 8.0? It is definitely higher than the probability of picking 1.1, isn't it?
I hope this clarifies the pdf usage.
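Roughly what I have in mind, as a sketch (a kernel density estimate is my own choice here, just to get a smooth density out of the bag):

    import numpy as np
    from scipy.stats import gaussian_kde

    bag = np.array([1.0, 3.1, 5.2, 7.8, 7.8, 7.9, 8.1, 8.2, 9.9, 10.1])
    density = gaussian_kde(bag)
    print(density(8.0), density(1.1))  # the fitted density is much higher at 8.0 than at 1.1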
pdf = slope of the cdf. The value of the pdf at a given point is not a probability, it's the instantaneous rate of change of a probability. You need to integrate the pdf over a range to get a probability.
You could take the area under the pdf (i.e., integrate) for a window around a given x, or use the area under the tail of the pdf past x (i.e., a p-value).
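Concretely, something like this (a sketch with an arbitrary normal distribution and a window size picked purely for illustration):

    from scipy.stats import norm

    dist = norm(loc=7.5, scale=2.0)   # some fitted distribution (made up here)
    x, half_window = 8.0, 0.1

    density_value = dist.pdf(x)          # slope of the cdf at x, not a probability
    window_prob = dist.cdf(x + half_window) - dist.cdf(x - half_window)  # probability of landing near x
    tail_prob = 1.0 - dist.cdf(x)        # p-value-style tail area past x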
That gives a range of size 2.980232238769531250 * 10^-8 in which all numbers compare equal to 0.3639401 in IEEE 754 32-bit floating point. And since we're looking at a domain of [0, 1], that's also immediately our probability.
(I'm fully aware this isn't what you were asking, but I found it fun either way)
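If anyone wants to check that number, it's just the float32 spacing (ULP) at that value; a quick sketch with numpy (my own check, not from the post):

    import numpy as np

    x = np.float32(0.3639401)
    print(np.spacing(x))   # 2.9802322e-08, i.e. 2**-25 for values in [0.25, 0.5)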
It doesn't make sense, in my opinion (though I could be ignorant). For classification, wouldn't it be better to stick to discrete probability mass functions?
Might be worthwhile to add conditional risk to it, to generalize it to the minimum-risk classifier. That way you'd also distinguish it from, say, the scikit-learn implementation.
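Something along these lines, as a sketch (the loss matrix and names are mine, not scikit-learn's): the minimum-risk rule picks the action minimizing the conditional risk R(a_i|x) = sum_j L(a_i, C_j) p(C_j|x), which falls back to the usual argmax over the posteriors under 0-1 loss.

    import numpy as np

    def min_risk_class(posteriors, loss):
        # posteriors: shape (n_classes,), p(C_j | x)
        # loss: shape (n_actions, n_classes), loss[i, j] = cost of choosing i when the truth is j
        conditional_risk = loss @ posteriors   # R(a_i | x) for each action i
        return int(np.argmin(conditional_risk))

    # With 0-1 loss this coincides with argmax over the posteriors:
    posteriors = np.array([0.2, 0.7, 0.1])
    zero_one_loss = 1 - np.eye(3)
    assert min_risk_class(posteriors, zero_one_loss) == int(np.argmax(posteriors))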
I honestly couldn't find many implementations of Naive Bayes out there. The famous ones are too over-engineered to use as a learning tool. I think people appreciate an article like this for its step-by-step approach.
I went looking for something just like this a few weeks back. The examples I found were all either wildly over-mathematical or terribly unclear. This seems to strike a good balance. Thanks for the writeup!
It's hot right now, but a lot of people don't actually know anything about it. A lot of ML tutorials are either dumbed-down or already pretty technical. People appreciate clear demonstrations of fundamentals.