I remember reading about that in "The Information", which describes how Claude Shannon did it.
Now I'll have to tweak my spam detection even more. Joking aside, and somebody correct me if I'm wrong, spam detection probably runs on a simple wordlist-type algorithm.
What, then, is the usefulness (I define usefulness extremely widely) of such a generator?
If you have a good generator for text, you have a useful language model that can be plugged into applications such as speech recognition, OCR, predictive text entry systems and compression.
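To make that concrete, here is a minimal sketch of a character-level bigram model of the sort Shannon described; all names below are hypothetical, not something from this thread. Counting character transitions in a corpus gives you both a generator and the conditional probabilities a language model needs:

```python
import random
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count character transitions and normalize to probabilities."""
    counts = defaultdict(Counter)
    for prev, curr in zip(corpus, corpus[1:]):
        counts[prev][curr] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {c: n / total for c, n in followers.items()}
    return model

def generate(model, seed, length=100):
    """Sample one character at a time from the learned distribution."""
    out = [seed]
    for _ in range(length):
        dist = model.get(out[-1])
        if not dist:
            break
        chars, probs = zip(*dist.items())
        out.append(random.choices(chars, weights=probs)[0])
    return "".join(out)
```

The same table of conditional probabilities drives both directions: sample from it and you generate text; evaluate it on existing text and you have the building block for the applications above.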
Could you somehow use it in reverse? What I mean is, is it possible to get a random text generator for a certain language and then use it to determine whether a given text is in that language or not?
Given the string so far, see how probable it is that the generator would generate the next character, p(x_n | x_{<n}). Running through the whole string, you can build up the log probability of the whole string: log p(x) = \sum_n log p(x_n | x_{<n}). Comparing the log probabilities under different models gives you a language classifier. For a first stab at the one-class problem, compare the log probability to what the model typically assigns to strings it randomly generates.
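Building on the hypothetical bigram model sketched above, scoring and classifying might look like the following; the probability floor is a crude stand-in for proper smoothing, so unseen transitions don't send the log probability to minus infinity:

```python
import math

def log_prob(model, text, floor=1e-6):
    """Sum log p(x_n | x_{n-1}) over the string; unseen transitions
    fall back to a small floor probability so the sum stays finite."""
    total = 0.0
    for prev, curr in zip(text, text[1:]):
        total += math.log(model.get(prev, {}).get(curr, floor))
    return total

def classify(text, models):
    """Pick the language whose model assigns the highest log probability."""
    return max(models, key=lambda lang: log_prob(models[lang], text))

# Hypothetical usage: train one model per language, then compare.
# en = train_bigram(open("english.txt").read())
# de = train_bigram(open("german.txt").read())
# classify("the quick brown fox", {"en": en, "de": de})
```

For the one-class version, normalize by string length (log_prob(model, text) / max(len(text) - 1, 1)) and compare against the per-character scores the model assigns to strings it generates itself; anything far below that typical range is unlikely to be in the language.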