So first off, people who _can't_ read or write have some specific disability (blindness, a developmental condition, etc.). That's not a reasonable comparison for LLMs/AI (especially since text is the main modality of an LLM).
I'm assuming you meant to ask about people who haven't _learned_ to read or write, but would otherwise be capable.
Is your argument, then, that a person who hasn't learned to read or write can model language as accurately as one who has?
Wouldn't you say that someone who has read a whole ton of books would maybe be a bit better at language modelling?
Also, perhaps most importantly: GPT (and pretty much any LLM I've talked to) does know the alphabet and its rules. It knows. Ask it to recite the alphabet. Ask it about any kind of grammatical or lexical rule. It knows all of it. It can also chop up a word from tokens into letters to spell it correctly; it knows those rules too. Now ask it about Chinese and Japanese characters, ask it about any of the rules related to those writing systems and languages. It knows all the rules.
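To make the token-vs-letter relationship concrete, here's a minimal sketch assuming OpenAI's tiktoken package is installed (the example word and the vocabulary name are just illustrative choices, not anything from the discussion above):

```python
# Sketch: compare how a word looks as tokens vs. as letters.
# Assumes `pip install tiktoken`; word and vocab are arbitrary examples.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era vocabulary

word = "strawberry"
token_ids = enc.encode(word)

# Token boundaries generally do NOT line up with single letters...
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)  # a few multi-letter chunks; exact split depends on the vocab

# ...but every token decodes to an exact character string, so the
# letter-level information is fully recoverable from the tokens.
print(list("".join(pieces)))  # ['s', 't', 'r', 'a', 'w', ...]
```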
This, to me, shows that the problem is mainly that it's incapable of reasoning and putting things together logically, not so much that it's trained on something that doesn't _quite_ look like letters as we know them. Sure, it might be slightly harder to do, but it's not actually hard, especially not compared to the other things we expect LLMs to be good at. And certainly not compared to the other things we expect people to be good at if they're considered "language experts".
If (smart/dedicated) humans can easily learn the Chinese and Japanese writing systems, or the Latin and Cyrillic alphabets, then why can't LLMs learn how tokens relate to the Latin alphabet?
Remember that tokens were specifically designed to be easier and more regular to parse (encode/decode) than the encodings used in human languages ...
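That regularity is easy to see in practice: the token-to-text mapping is a fixed, deterministic lookup table, and encode/decode round-trips exactly. A minimal sketch, again assuming tiktoken and an arbitrary example sentence:

```python
# Sketch: the token -> text mapping is a fixed table lookup, and
# encoding/decoding round-trips exactly. Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokens map to byte strings by a fixed table lookup."
ids = enc.encode(text)

# Each token ID corresponds to one fixed byte string: no ambiguity,
# no context-dependent spelling rules.
chunks = [enc.decode_single_token_bytes(t) for t in ids]
print(chunks)

# Concatenating the chunks reproduces the original text exactly.
assert b"".join(chunks).decode("utf-8") == text
```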