
I remember reading about that in "The Information", which describes how Claude Shannon did it.

Now I'll have to tweak my spam detection even more. Joking aside (and somebody correct me if I'm wrong), spam probably runs on a simple wordlist-type algorithm.

What, then, is the usefulness (I define usefulness extremely widely) of such a generator?



If you have a good generator for text, you have a useful language model that can be plugged into applications such as speech recognition, OCR, predictive text entry systems and compression.


Could you somehow use it in reverse? That is, could you take a random text generator for a certain language and use it to determine whether a given text is in that language?


Yes.

Given the string so far, see how probable it is that the generator would generate the next character, p(x_n | x_<n). Running through the whole string you can build up the log probability of the whole string: log p(x) = \sum_n log p(x_n | x_<n). Comparing the log probabilities under different models gives you a language classifier. For a first stab at the one-class problem, compare the log probability to what the model typically assigns to strings it randomly generates.
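To make that concrete, here is a rough sketch in Python. It assumes the "generator" is a character bigram model with add-one smoothing, so p(x_n | x_<n) collapses to p(x_n | x_{n-1}). The tiny training sentences and the function names (normalize, train, log_prob, classify) are made up for illustration; a real model would need far more text and probably a longer context.

```python
import math
import string
from collections import Counter

# Assumption: a fixed 27-symbol alphabet (a-z plus space).
ALPHABET = string.ascii_lowercase + " "

def normalize(text):
    # Lowercase, and map anything outside the alphabet to a space.
    return "".join(c if c in ALPHABET else " " for c in text.lower())

def train(text):
    # Count character pairs and how often each character appears as context.
    text = normalize(text)
    pairs = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return pairs, contexts

def log_prob(model, s):
    # log p(x) = sum_n log p(x_n | x_{n-1}), with add-one smoothing.
    # The first character has no context and is skipped for simplicity.
    pairs, contexts = model
    s = normalize(s)
    total = 0.0
    for prev, cur in zip(s, s[1:]):
        p = (pairs[(prev, cur)] + 1) / (contexts[prev] + len(ALPHABET))
        total += math.log(p)
    return total

def classify(models, s):
    # Pick the language whose model assigns the highest log probability.
    return max(models, key=lambda name: log_prob(models[name], s))

# Toy training data, purely illustrative.
models = {
    "english": train("the quick brown fox jumps over the lazy dog "
                     "and then the dog sleeps in the sun"),
    "german": train("der schnelle braune fuchs springt ueber den faulen hund "
                    "und dann schlaeft der hund in der sonne"),
}

print(classify(models, "the dog"))
print(classify(models, "der hund"))
```

For the one-class version you would have a single model and compare log_prob(model, s) / len(s) against the per-character values it gives to its own generated samples; where to put the cutoff is a judgment call.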

For more on information theory, modelling and inference you might like: http://www.inference.phy.cam.ac.uk/mackay/itila/book.html


I recently did exactly this to discriminate English text from gibberish.

https://github.com/rrenaud/Gibberish-Detector


It sounds like you are talking about a naive Bayesian classifier. PG wrote a couple of articles on his experience with these for spam filtering (http://www.paulgraham.com/spam.html and http://www.paulgraham.com/better.html). They're probably a decent high-level introduction to the area.

For a more in-depth, yet very accessible discussion, I would recommend "Speech and Language Processing" by Jurafsky & Martin (http://books.google.com/books/about/SPEECH_AND_LANGUAGE_PROC...). It's considered by many to be the Bible of NLP.



