
The set of all possible written text is infinitely complex. That's not what is being modeled, though: LLMs model text that was written intentionally by humans. That's less complicated, but it's not simple enough for language rules to completely define it. Natural language is "context-sensitive", so the meaning of a written language statement is dependent on more than the language itself. That's why we can't just parse natural language like we can programming languages.

An LLM is a model of written text. It doesn't know anything about language rules. In fact, it doesn't follow any rules whatsoever. It only follows the model, which tells you what text is most likely to come next.
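To make "tells you what text is most likely to come next" concrete, here is a toy sketch of that idea using a bigram word model (real LLMs use neural networks over subword tokens, and the training text here is made up, but the principle is the same: no grammar rules anywhere, only frequencies learned from examples):

```python
from collections import Counter, defaultdict

# Count, for each word in the training text, which word follows it
# and how often. The "model" is nothing but these counts.
training_text = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    counts[prev][nxt] += 1

def most_likely_next(word):
    # Return the most frequent continuation seen in training.
    return counts[word].most_common(1)[0][0]

print(most_likely_next("the"))  # -> cat  ("cat" follows "the" most often)
```

The model never decides whether "the cat" is grammatical; it only reports what was most likely in the text it was trained on.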



The set of all strings over an alphabet is infinite in cardinality. I don’t think I would say that it is infinitely complex. I don’t know how you are defining complexity, but the shortest program that recognizes the language “all strings (over this alphabet)” is pretty short.

A program that reliably distinguishes human-written strings from arbitrary ones would be substantially longer than one that recognizes the language of all strings over a particular alphabet.

If you want to generate strings instead of recognizing them, a program that enumerates all possible strings over a given alphabet can also be pretty short.
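As a concrete illustration of "pretty short", here is one way such an enumerator might look in Python (the function name is made up for this sketch):

```python
from itertools import count, islice, product

# Enumerate every string over a given alphabet, in order of
# increasing length. The set is countably infinite, yet the
# generator that produces it is only a few lines long.
def all_strings(alphabet):
    for length in count(0):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)

print(list(islice(all_strings("ab"), 7)))
# -> ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The recognizer is even shorter: accept any string whose characters all come from the alphabet.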

Not sure what you mean by complexity.

I don’t know what you mean by “it doesn’t follow any rules at all”.


When you write a parser, you build it out of language (syntax) rules. The parser uses those rules to translate text into an abstract syntax tree. This approach is explicit: the syntax is known ahead of time, and any text that follows those rules can be parsed unambiguously.
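The rule-driven approach might look like this minimal recursive-descent parser for a toy grammar (the grammar and names here are illustrative, not from any particular language):

```python
import re

# Grammar, written out explicitly:
#   expr := term (('+' | '-') term)*
#   term := NUMBER | '(' expr ')'
def parse(text):
    tokens = re.findall(r"\d+|[+\-()]", text)
    pos = 0

    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] in "+-":
            op = tokens[pos]
            pos += 1
            node = (op, node, term())  # build the syntax tree
        return node

    def term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return node
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

print(parse("1+(2-3)"))  # -> ('+', 1, ('-', 2, 3))
```

Every rule is written down by a programmer before any input is seen, and input that violates the rules is rejected outright rather than guessed at.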

When you train an LLM, you don't write any language rules. Instead, you provide examples of written text. This approach is implicit: the syntax is not known ahead of time. The core feature is that there is no built-in distinction between correct and incorrect text. The LLM is liberated from the syntax rules of language. The benefit is that it can work with ambiguity. The limitation is that it can't decide which interpretation of that ambiguity is correct. It can only guess which interpretation is most likely, based on the text it was trained on.



