Naive Bayes classifier in 50 lines

ced · on Nov 27, 2011

Good post, but it's a sad reflection of code's failure as a medium of expression that we need 50 lines of Python to express one line of math.
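For reference, the "one line of math" is presumably the usual naive Bayes decision rule. The thread does not quote it, so the notation here is an assumption: y ranges over the class labels and x_1, ..., x_n are the feature values of the instance being classified.

```latex
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```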

sqrt17 · on Nov 27, 2011

If you look at the code, it's more like "write an ARFF file parser in 40 lines" and "do a NB classifier with add-one smoothing in 10 lines".

If you use a better-adapted input format and code things more concisely, you'd probably end up with two functions of 3-4 lines each; conversely, if you wanted to do things properly, you'd separate out ARFF file parsing and the Naive Bayes functionality.

All in all, the blog post wouldn't make me want to recommend their group to prospective (undergrad or graduate) students.

tel · on Nov 27, 2011

Seriously. I think you could golf NB into 2 lines pretty sensibly once you've got the data in. It's really just: compute two histograms, multiply, and maximize.
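A rough sketch of that "two histograms, multiply, maximize" idea, not golfed and not the code from the post. It assumes the data is already parsed into tuples of categorical values; the names `train` and `classify` are made up for illustration. It uses the add-one smoothing mentioned above and sums log-probabilities rather than multiplying raw probabilities, which gives the same argmax while avoiding underflow.

```python
from collections import Counter
from math import log


def train(rows, labels):
    """Build the two histograms: class counts and per-class feature-value counts."""
    class_counts = Counter(labels)
    feature_counts = Counter(
        (y, i, v) for row, y in zip(rows, labels) for i, v in enumerate(row)
    )
    # Number of distinct values seen at each feature position (for smoothing).
    n_values = [len(set(column)) for column in zip(*rows)]
    return class_counts, feature_counts, n_values


def classify(model, row):
    """Return the class with the highest smoothed log-posterior for `row`."""
    class_counts, feature_counts, n_values = model
    total = sum(class_counts.values())

    def score(y):
        # log P(y) + sum_i log P(x_i | y), with add-one smoothing on each term.
        s = log(class_counts[y] / total)
        for i, v in enumerate(row):
            s += log((feature_counts[(y, i, v)] + 1) / (class_counts[y] + n_values[i]))
        return s

    return max(class_counts, key=score)


# Toy data, invented for this example.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train(rows, labels)
prediction = classify(model, ("sunny", "hot"))
```

With the parsing kept out of the way, the classifier itself is the handful of lines inside `score` plus the final `max`.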

JulianMorrison · on Nov 27, 2011

That's not one line of math. That's one line of math and ten years of textbooks. Math just has a bigger standard library.

Groxx · on Nov 28, 2011

And no fear of namespace collisions, or of introducing additional symbols.

Want to match Math in size when implementing algorithms? Use APL. Want to avoid adding additional symbols all the time? Use J/K/etc (APL descendants). Want to avoid namespace collisions? Welcome to Java.Sun.Com.Math.Oh.For.Effing.Sake.FactoryInterfaceBuilder

freyrs3 · on Nov 27, 2011

The actual translation of the formula to Python is about 4 lines, at lines 30-34 of [1]. The rest is just IO plumbing.

[1] https://gist.github.com/731413/7ad1b4c04bc2d6b5033c5811efcb4...

twelvechairs · on Nov 27, 2011

This is an interesting point - I suppose it shows how sophisticated the language of mathematics is.

I think it's perhaps not the 50 lines that matter, though (most of them effectively just define what the mathematical symbols and grammar mean), but the one line, which tells you very effectively how everything relates. To me, that is what the Python version is missing.

agumonkey · on Nov 27, 2011

Tell that to Sir Alan Kay.

PS: I should add some data: http://tinlizzie.org/~awarth/

abecedarius · on Nov 27, 2011

http://code.google.com/p/aima-python/source/browse/trunk/lea... is shorter (see NaiveBayesLearner, around line 200), though that assumes some infrastructure.

bajer · on Nov 27, 2011

Upcoming:

- Raycaster in 1000 lines of Lisp

- Database management system in 20000 lines of C

- Python web framework in 2000 lines of Python

...