The token system used by large language models like GPT-4 is designed to be comprehensive enough to represent virtually any text, including words that have never been written before. This is separate from training the neural net and is a deliberate design choice.
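For instance, with the open-source tiktoken library (cl100k_base is the encoding GPT-4 uses), any string round-trips through the token vocabulary, even a word I just made up; a minimal sketch:

    # Sketch: any string, even an invented word, encodes to known token ids
    # and decodes back losslessly. "flumbertastic" is a made-up example word.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's token vocabulary

    for word in ["understand", "flumbertastic"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(word, "->", pieces)               # exact split depends on the vocab
        assert enc.decode(ids) == word          # lossless round trip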
The training process teaches LLMs how to compose these tokens into replies to our queries. The training data does not contain obscured words or sentences with strange spacing, yet the LLM is still able to compose the tokens correctly from varied input that never existed in the training data.
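You can see why directly: text with strange spacing that never appeared in training still decomposes into tokens the model has seen. A sketch, again with tiktoken:

    # Sketch: obscured spacing still maps onto tokens from the vocabulary,
    # so the model can read input that never appeared verbatim in training.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    weird = "t h e r a p i s t"
    print([enc.decode([i]) for i in enc.encode(weird)])
    # -> short, mostly single-character tokens, each one in the vocabulary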
Yes and no. I know that there is no word "understand" in the dictionary, only "under" and "stand", but other than that it is just a large table of probabilities of seeing tokens in specific contexts.
1) If I had been talking about a therapist, the sentence would look like "police caught _the_ therapist"
2) How often do the police even catch therapists? Come on, it looks like the training set was just heavily censored. No intelligence, just a broken n-gram database (where n = the length of the articles in the training set, see https://news.ycombinator.com/item?id=38458683).
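For concreteness, an n-gram model really is just a table of next-token probabilities keyed on the preceding context. A toy bigram (n = 2) version, purely as illustration:

    # Toy bigram model: a literal "table with probabilities to see tokens
    # in specific context", at n = 2. Corpus is an invented example.
    from collections import Counter, defaultdict

    corpus = "the police caught the thief and the police caught the suspect".split()

    table = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        table[prev][nxt] += 1

    def next_token_probs(prev):
        total = sum(table[prev].values())
        return {tok: c / total for tok, c in table[prev].items()}

    print(next_token_probs("the"))
    # -> {'police': 0.5, 'thief': 0.25, 'suspect': 0.25}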
>Yes and no. I know that there is no word "understand" in the dictionary, only "under" and "stand", but other than that it is just a large table of probabilities of seeing tokens in specific contexts.
The training data is more important than the dictionary, because the dictionary is designed to be able to form every possible combination of words and sentences. It is not limited to specific words; it builds words and sentences from building blocks.
1. That parsing is valid, though unlikely. The choice it made is not incorrect, and thus not a sign of a lack of intelligence.
2. Not often. But if you ask ChatGPT to reinterpret the phrase in another way that is grammatically correct, it will find "the rapist" (see the sketch after this list). That shows definitively that there is no censorship of the word.
3. I actually didn't see the alternative reading myself for some reason. "Therapist" jumped out at me and I didn't see what you were talking about for a good couple of minutes. So unless you want to think of me (a human) as not "intelligent", clearly it's not a factor here.
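If you want to reproduce the experiment from point 2 yourself, here is a minimal sketch using the official openai Python package; the model name and prompt are my own illustrative choices, and it assumes you have an API key set:

    # Sketch: ask the model for an alternative, grammatically valid reading.
    # Assumes OPENAI_API_KEY is set; model and prompt are illustrative only.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": 'Give another grammatically valid way to segment '
                       'the letters in "police caught therapist".',
        }],
    )
    print(resp.choices[0].message.content)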
The training process teaches LLMs how to compose these tokens into replies to our queries. The training data does not contain obscured words or sentences with strange spacing, yet the LLM is still able to compose the tokens correctly from varied input that never existed in the training data.
It is intelligence.