The token system used by large language models like GPT-4 is designed to be comprehensive enough to represent virtually any text, including words that have never been written before. This is separate from training the neural net and is a deliberate design choice.
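For instance, with the open-source tiktoken library (cl100k_base is the encoding GPT-4 uses), any string round-trips through the token vocabulary, even a word I just made up; a minimal sketch:

    # Sketch: any string, even an invented word, encodes to known token ids
    # and decodes back losslessly. "flumbertastic" is a made-up example word.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's token vocabulary

    for word in ["understand", "flumbertastic"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(word, "->", pieces)               # exact split depends on the vocab
        assert enc.decode(ids) == word          # lossless round trip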
The training process teaches LLMs how to compose these tokens into replies to our queries. The training data does not contain obscured words or sentences with strange spacing, yet the LLM is still able to compose the tokens correctly from varied input that never existed in the training data.
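You can see why directly: text with strange spacing that never appeared in training still decomposes into tokens the model has seen. A sketch, again with tiktoken:

    # Sketch: obscured spacing still maps onto tokens from the vocabulary,
    # so the model can read input that never appeared verbatim in training.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    weird = "t h e r a p i s t"
    print([enc.decode([i]) for i in enc.encode(weird)])
    # -> short, mostly single-character tokens, each one in the vocabulary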
Yes and no. I know that there is no word "understand" in the dictionary, only "under" and "stand", but other than that it is just a large table of probabilities of seeing tokens in specific contexts.
1) If I had been talking about a therapist, the sentence would look like "police caught _the_ therapist"
2) How often do the police even catch therapists? Come on, it looks like the training set was just heavily censored. No intelligence, just a broken n-gram database (where n = the length of the articles in the training set, see https://news.ycombinator.com/item?id=38458683).
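For concreteness, an n-gram model really is just a table of next-token probabilities keyed on the preceding context. A toy bigram (n = 2) version, purely as illustration:

    # Toy bigram model: a literal "table with probabilities to see tokens
    # in specific context", at n = 2. Corpus is an invented example.
    from collections import Counter, defaultdict

    corpus = "the police caught the thief and the police caught the suspect".split()

    table = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        table[prev][nxt] += 1

    def next_token_probs(prev):
        total = sum(table[prev].values())
        return {tok: c / total for tok, c in table[prev].items()}

    print(next_token_probs("the"))
    # -> {'police': 0.5, 'thief': 0.25, 'suspect': 0.25}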
>Yes and no. I know that there is no word "understand" in the dictionary, only "under" and "stand", but other than that it is just a large table of probabilities of seeing tokens in specific contexts.
The training data is more important than the dictionary, because the dictionary is designed to be able to form every possible combination of words and sentences. It is not limited to specific words; it builds words and sentences from building blocks.
1. That parsing is valid, though unlikely. The choice it made is not incorrect, and thus not a sign of a lack of intelligence.
2. Not often. But if you ask ChatGPT to reinterpret the phrase in another way that is grammatically correct, it will find "the rapist" (see the sketch after this list). That shows definitively that there is no censorship of the word.
3. I actually didn't see the alternative reading myself for some reason. "Therapist" jumped out at me and I didn't see what you were talking about for a good couple of minutes. So unless you want to think of me (a human) as not "intelligent", clearly it's not a factor here.
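If you want to reproduce the experiment from point 2 yourself, here is a minimal sketch using the official openai Python package; the model name and prompt are my own illustrative choices, and it assumes you have an API key set:

    # Sketch: ask the model for an alternative, grammatically valid reading.
    # Assumes OPENAI_API_KEY is set; model and prompt are illustrative only.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": 'Give another grammatically valid way to segment '
                       'the letters in "police caught therapist".',
        }],
    )
    print(resp.choices[0].message.content)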
The training process teaches LLMs how to compose these tokens into replies to our queries. The training data does not contain obscured words or sentences with strange spacing, yet the LLM is still able to compose the tokens correctly from varied input that never existed in the training data.
It is intelligence.