
The set of all possible written text is infinitely complex. That's not what is being modeled, though: LLMs model text that was written intentionally by humans. That's less complicated, but it's not simple enough for language rules to completely define it. Natural language is "context-sensitive", so the meaning of a written language statement is dependent on more than the language itself. That's why we can't just parse natural language like we can programming languages.

An LLM is a model of written text. It doesn't know anything about language rules. In fact, it doesn't follow any rules whatsoever. It only follows the model, which tells you what text is most likely to come next.
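To make "tells you what text is most likely to come next" concrete, here is a toy sketch of that idea using a bigram word model (real LLMs use neural networks over subword tokens, and the training text here is made up, but the principle is the same: no grammar rules anywhere, only frequencies learned from examples):

```python
from collections import Counter, defaultdict

# Count, for each word in the training text, which word follows it
# and how often. The "model" is nothing but these counts.
training_text = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    counts[prev][nxt] += 1

def most_likely_next(word):
    # Return the most frequent continuation seen in training.
    return counts[word].most_common(1)[0][0]

print(most_likely_next("the"))  # -> cat  ("cat" follows "the" most often)
```

The model never decides whether "the cat" is grammatical; it only reports what was most likely in the text it was trained on.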



The set of all strings over an alphabet is infinite in cardinality. I don’t think I would say that it is infinitely complex. I don’t know how you are defining complexity, but the shortest program that recognizes the language “all strings (over this alphabet)” is pretty short.

A program that reliably distinguishes human-written strings from arbitrary ones would be substantially longer than one that recognizes the language of all strings over a particular alphabet.

If you want to generate strings instead of recognizing them, a program that enumerates all possible strings over a given alphabet can also be pretty short.
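As a concrete illustration of "pretty short", here is one way such an enumerator might look in Python (the function name is made up for this sketch):

```python
from itertools import count, islice, product

# Enumerate every string over a given alphabet, in order of
# increasing length. The set is countably infinite, yet the
# generator that produces it is only a few lines long.
def all_strings(alphabet):
    for length in count(0):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)

print(list(islice(all_strings("ab"), 7)))
# -> ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The recognizer is even shorter: accept any string whose characters all come from the alphabet.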

Not sure what you mean by complexity.

I don’t know what you mean by “it doesn’t follow any rules at all”.


When you write a parser, you build it out of language (syntax) rules. The parser uses those rules to translate text into an abstract syntax tree. This approach is explicit: the syntax is known ahead of time, and any text that follows those rules can be parsed unambiguously.
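The rule-driven approach might look like this minimal recursive-descent parser for a toy grammar (the grammar and names here are illustrative, not from any particular language):

```python
import re

# Grammar, written out explicitly:
#   expr := term (('+' | '-') term)*
#   term := NUMBER | '(' expr ')'
def parse(text):
    tokens = re.findall(r"\d+|[+\-()]", text)
    pos = 0

    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] in "+-":
            op = tokens[pos]
            pos += 1
            node = (op, node, term())  # build the syntax tree
        return node

    def term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return node
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

print(parse("1+(2-3)"))  # -> ('+', 1, ('-', 2, 3))
```

Every rule is written down by a programmer before any input is seen, and input that violates the rules is rejected outright rather than guessed at.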

When you train an LLM, you don't write any language rules. Instead, you provide examples of written text. This approach is implicit: the syntax is not known ahead of time. The core feature is that there is no built-in distinction between correct and incorrect text. The LLM is liberated from the syntax rules of language. The benefit is that it can work with ambiguity. The limitation is that it can't decide which interpretation of that ambiguity is correct. It can only guess which interpretation is most likely, based on the text it was trained on.



