These models operate on tokens, not characters. It’s true that training budgets could be spent on exhaustively enumerating how many of each letter are in every word in every language, but it’s just not useful enough to be worth it.
It’s more like asking a human for the Fourier components of how they pronounce “strawberry”. I mean the audio waves are right there, why don’t you know?
Although the vast majority of tokens are 4+ characters, you’re seriously saying that each individual character of the English alphabet didn’t make the cut? What about 0-9?
Each character made the cut, but the word "strawberry" is a single token, and that single token is what the model gets as input. When humans read text, they see each individual character in the word "strawberry" every time it appears. LLMs don't see individual characters when they process input text containing the word "strawberry". They can only learn the spelling if some text explicitly maps "strawberry" to the sequence of characters s t r a w b e r r y. My guess is that there aren't enough such mappings in the training dataset for the model to learn it well.
The fact that the word ends up being one token doesn’t mean the model can’t track individual characters in it. The model transforms the token into a vector (of several thousand dimensions), and I’m pretty sure there are dimensions corresponding to things like “1st character is ‘a’”, “1st character is ‘b’”, “2nd character is ‘a’”, and so on.
Which character sits in the 1st/2nd/3rd position is part of the semantic space, in the generic sense of the word. I ran experiments that seem to roughly support this hypothesis; see below.
I got 0.863 (1st character) / 0.559 (2nd) / 0.447 (3rd) accuracy from the Qwen 3 8B model’s embeddings. Note that the code is hacky and might be wrong in places, and in reality the transformer knows more than this, because here I use only the embedding layer.
However, it does show that there are very clear signals about a token’s characters in its embedding vector.
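Roughly, the probe looks like this (a minimal sketch, not the exact code I ran; the Qwen/Qwen3-8B repo name, the token filtering, and the logistic-regression probe are all just illustrative choices):

    # Minimal sketch of a character probe over input embeddings only.
    # Requires: pip install torch transformers scikit-learn
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    MODEL = "Qwen/Qwen3-8B"  # large; any causal LM demonstrates the same idea
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
    emb = model.get_input_embeddings().weight.detach().float().numpy()  # (vocab, dim)

    POS = 0  # which character to probe: 0 = 1st, 1 = 2nd, 2 = 3rd
    xs, ys = [], []
    for tid in range(len(tok)):
        s = tok.decode([tid]).strip().lower()
        if len(s) >= 3 and s.isascii() and s.isalpha():
            xs.append(emb[tid])
            ys.append(s[POS])

    # A plain linear probe: slow but simple, and enough to show the signal.
    x_tr, x_te, y_tr, y_te = train_test_split(xs, ys, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=2000).fit(x_tr, y_tr)
    print(f"accuracy predicting character #{POS + 1}: {probe.score(x_te, y_te):.3f}")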
Thank you! I guess that if there's enough spelling-related text in the dataset, a model is forced to learn some information about token composition in order to predict such texts.
I wonder if it would help to explicitly insert this info into the embedding vector, similar to how we encode word-position info. For example, allocate the first 20 vector elements to represent the ASCII codes of the token's characters (in some normalized way), as sketched below.
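Something like this, maybe (purely illustrative; the function names and the choice to overwrite the first dimensions rather than concatenate are made up for the sketch):

    # Reserve the first n_slots embedding dimensions for normalized ASCII
    # codes of the token's characters, loosely analogous to how positional
    # information gets injected. All names here are made up.
    import torch

    def char_features(token_str: str, n_slots: int = 20) -> torch.Tensor:
        """First n_slots characters -> code points clamped to ASCII, scaled to [0, 1]."""
        feats = torch.zeros(n_slots)
        for i, ch in enumerate(token_str[:n_slots]):
            feats[i] = min(ord(ch), 127) / 127.0
        return feats

    def augment_embedding(learned_vec: torch.Tensor, token_str: str,
                          n_slots: int = 20) -> torch.Tensor:
        """Overwrite the first n_slots dimensions of a learned token embedding
        with explicit character features."""
        out = learned_vec.clone()
        out[:n_slots] = char_features(token_str, n_slots)
        return out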
I got 3 tokens: st, raw, and berry. My point still stands: processing "berry" as a single token does not allow the model to learn its spelling directly, the way human readers do. It still has to rely on an explicit mapping of the word "berry" to b e r r y explained in some text in the training dataset. If that explanation is not present in the training data, it cannot learn the spelling - in principle.
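(For reference, this kind of check is a few lines with the Hugging Face tokenizer API; the Qwen/Qwen3-8B repo name is just one choice, and the exact split varies by tokenizer.)

    # Check how a given tokenizer splits the word.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
    ids = tok.encode("strawberry", add_special_tokens=False)
    print(tok.convert_ids_to_tokens(ids))  # e.g. ['st', 'raw', 'berry'], depending on the tokenizer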
Exactly. If “st” is 123, “raw” is 456, “berry” is 789, and “r” is 17… it makes little sense to ask the model to count the [17]’s in [123, 456, 789]: it demands an awareness of an abstraction that does not exist.
To the extent the knowledge is there, it comes from data in the training corpus, not from direct examination of the text or tokens in the prompt.
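As a toy illustration (assuming a Hugging Face tokenizer; the exact IDs and split depend on which one you load):

    # Counting the "r" token ID inside the ID sequence for "strawberry"
    # says nothing about how many r's the word contains.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
    word_ids = tok.encode("strawberry", add_special_tokens=False)
    r_id = tok.encode("r", add_special_tokens=False)[0]

    print("strawberry".count("r"))  # 3: the character-level answer being asked for
    print(word_ids.count(r_id))     # typically 0: the only "count" visible at the ID level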
Every human learns that mapping, too: when you hear the sound "strawberry", you don't hear the double r, yet you still know the answer.