I remember reading about that in "The Information", where it is described how Cl...

imurray · on June 13, 2011

If you have a good generator for text, you have a useful language model that can be plugged into applications such as speech recognition, OCR, predictive text entry systems and compression.

juretriglav · on June 13, 2011

Could you somehow use it in reverse? That is, could you take a random text generator for a certain language and use it to determine whether a given text is in that language or not?

imurray · on June 13, 2011

Yes.

Given the string so far, see how probable it is that the generator would generate the next character, p(x_n | x_<n). Running through the whole string you can build up the log probability of the whole string: log p(x) = \sum_n log p(x_n | x_<n). Comparing the log probabilities under different models gives you a language classifier. For a first stab at the one-class problem, compare the log probability to what the model typically assigns to strings it randomly generates.
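A rough sketch of the above in Python, in case it helps. It assumes a character bigram model, so x_<n is truncated to just the previous character, with add-one smoothing; both are simplifications chosen for illustration, not something the approach requires. The names (CharBigramModel, classify, ALPHABET, START) and the tiny training strings are made up for the example.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # assumed symbol set; other characters are skipped
START = "^"  # hypothetical start-of-string marker, not part of ALPHABET


class CharBigramModel:
    """Character bigram model: p(x_n | x_<n) is approximated by p(x_n | x_{n-1})."""

    def __init__(self, training_text):
        self.pair_counts = Counter()     # (previous char, char) -> count
        self.context_counts = Counter()  # previous char -> count
        prev = START
        for ch in training_text.lower():
            if ch not in ALPHABET:
                continue
            self.pair_counts[(prev, ch)] += 1
            self.context_counts[prev] += 1
            prev = ch

    def log_prob(self, s):
        """Return log p(s) = sum_n log p(x_n | x_{n-1}), with add-one smoothing."""
        total = 0.0
        prev = START
        for ch in s.lower():
            if ch not in ALPHABET:
                continue
            numerator = self.pair_counts[(prev, ch)] + 1
            denominator = self.context_counts[prev] + len(ALPHABET)
            total += math.log(numerator / denominator)
            prev = ch
        return total


def classify(s, models):
    """Pick the model name whose model assigns s the highest log probability."""
    return max(models, key=lambda name: models[name].log_prob(s))


# Toy training data, far too small for real use.
models = {
    "english": CharBigramModel(
        "the quick brown fox jumps over the lazy dog and the cat sat on "
        "the mat with the other animals"
    ),
    "german": CharBigramModel(
        "der schnelle braune fuchs springt über den faulen hund und die "
        "katze sitzt auf der matte"
    ),
}

print(classify("the lazy dog", models))
print(classify("der faulen hund", models))
```

For the one-class version you would keep a single model and compare log_prob(s) divided by the length of s against the per-character values that model gives to its own samples or to held-out text. A real system would want much more training data and probably a longer context than one character.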

For more on information theory, modelling and inference you might like: http://www.inference.phy.cam.ac.uk/mackay/itila/book.html

robrenaud · on June 13, 2011

I recently did exactly this to discriminate English text from gibberish.

https://github.com/rrenaud/Gibberish-Detector

donall · on June 13, 2011

It sounds like you are talking about a naive Bayesian classifier. PG wrote a couple of articles on his experience with these for spam filtering (http://www.paulgraham.com/spam.html and http://www.paulgraham.com/better.html). They're probably a decent high-level introduction to the area.

For a more in-depth, yet very accessible discussion, I would recommend "Speech and Language Processing" by Jurafsky & Martin (http://books.google.com/books/about/SPEECH_AND_LANGUAGE_PROC...). It's considered by many to be the Bible of NLP.