The explanation seems a little sloppy. If they used Markov chains, then a chain of length 1 is the same as raw frequency, and a chain of length > 1 will be a slightly better fit.
They've generalized the model, and their generalization seems to work a bit better.
That seems right, and makes the write-up a bit strange. They seem to be positioning it as if they're comparing two completely unrelated hypotheses: the frequency hypothesis versus the information-content hypothesis. But as you point out, their method of measuring information content (following Shannon) is simply n-gram frequency. The difference is that they set n=2,3,4 rather than n=1.
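To make the relationship concrete, here's a toy sketch of that kind of estimator (my own illustration, not the paper's code): a maximum-likelihood n-gram model with no smoothing, where per-token information content is -log2 P(word | previous n-1 words). With n=1 the "context" is empty and the measure collapses to plain unigram frequency, which is exactly the point about the two hypotheses being the same family of models.

```python
import math
from collections import Counter

def surprisal(corpus, n):
    """Estimate -log2 P(w | previous n-1 words) for every n-gram, using
    raw maximum-likelihood counts (no smoothing). With n=1 the context
    is the empty tuple, so this reduces to -log2 of unigram frequency,
    i.e. the frequency hypothesis falls out as the n=1 special case."""
    tokens = corpus.split()
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Count each context as the total count of n-grams that extend it,
    # so probabilities conditioned on it sum to 1.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    return {gram: -math.log2(count / contexts[gram[:-1]])
            for gram, count in ngrams.items()}

text = "the cat sat on the mat and the cat ran"
uni = surprisal(text, 1)  # n=1: plain frequency
bi = surprisal(text, 2)   # n=2: frequency in one-word context
```

In this tiny corpus, "cat" in isolation carries more information than "cat" following "the" (`bi[("the", "cat")] < uni[("cat",)]`), which is the sense in which the longer-context model is a refinement of, rather than an alternative to, the frequency measure.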
It's also not clear that it differs from the original motivation for the frequency hypothesis: (some) people advancing the "more common words are shorter" hypothesis draw in part on something like a compression argument, i.e. that common words are short for efficiency reasons. Arguing that predictable (low-information-content) words are short instead is an interesting refinement, but not a completely different claim. Choosing string length by individual word frequency alone is actually a sort of crappy compression algorithm, and this paper seems to show that English's built-in compression is better than that and takes sequence frequencies into account.