These models operate on tokens, not characters. It’s true that training budgets could be spent on exhaustively enumerating how many of each letter are in every word in every language, but it’s just not useful enough to be worth it.
It’s more like asking a human for the Fourier components of how they pronounce “strawberry”. I mean the audio waves are right there, why don’t you know?
Although the vast majority of tokens are 4+ characters, you’re seriously saying that each individual character of the English alphabet didn’t make the cut? What about 0-9?
Each character made the cut, but the word "strawberry" is a single token, and that single token is what the model gets as input. When humans read text, they see each individual character in the word "strawberry" every time it appears. LLMs don't see individual characters when they process input text containing the word "strawberry". They can only learn the spelling if some text explicitly maps "strawberry" to the sequence of characters s t r a w b e r r y. My guess is that there aren't enough such mappings in the training dataset for the model to learn it well.
The fact that the word ends up being one token doesn’t mean the model can’t track individual characters in it. The model transforms the token into a vector (of several thousand dimensions), and I’m pretty sure there are dimensions corresponding to things like “1st character is ‘a’”, “1st character is ‘b’”, “2nd character is ‘a’”, and so on.
Which character sits in the 1st/2nd/3rd position is part of the semantic space, in the generic sense of the word. I ran experiments that seem to roughly support this hypothesis; see below.
I got 0.863 (1st character) / 0.559 (2nd) / 0.447 (3rd) accuracy from the Qwen 3 8B model’s embeddings. Note that the code is hacky and might be wrong in places, and in reality the transformer knows more than this, because here I use only the embedding layer.
However, it does show that there are very clear signals about a token’s characters in its embedding vector.
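Roughly, the probe looks like this (a minimal sketch, not the exact code I ran; the Qwen/Qwen3-8B repo name, the token filtering, and the logistic-regression probe are all just illustrative choices):

    # Minimal sketch of a character probe over input embeddings only.
    # Requires: pip install torch transformers scikit-learn
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    MODEL = "Qwen/Qwen3-8B"  # large; any causal LM demonstrates the same idea
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
    emb = model.get_input_embeddings().weight.detach().float().numpy()  # (vocab, dim)

    POS = 0  # which character to probe: 0 = 1st, 1 = 2nd, 2 = 3rd
    xs, ys = [], []
    for tid in range(len(tok)):
        s = tok.decode([tid]).strip().lower()
        if len(s) >= 3 and s.isascii() and s.isalpha():
            xs.append(emb[tid])
            ys.append(s[POS])

    # A plain linear probe: slow but simple, and enough to show the signal.
    x_tr, x_te, y_tr, y_te = train_test_split(xs, ys, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=2000).fit(x_tr, y_tr)
    print(f"accuracy predicting character #{POS + 1}: {probe.score(x_te, y_te):.3f}")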
Thank you! I guess that if there's enough spelling-related text in the dataset, a model is forced to learn some information about token composition in order to predict such texts.
I wonder if it would help to explicitly insert this info into the embedding vector, similar to how we encode word-position info. For example, allocate the first 20 vector elements to represent the ASCII codes of the token's characters (in some normalized way), as sketched below.
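Something like this, maybe (purely illustrative; the function names and the choice to overwrite the first dimensions rather than concatenate are made up for the sketch):

    # Reserve the first n_slots embedding dimensions for normalized ASCII
    # codes of the token's characters, loosely analogous to how positional
    # information gets injected. All names here are made up.
    import torch

    def char_features(token_str: str, n_slots: int = 20) -> torch.Tensor:
        """First n_slots characters -> code points clamped to ASCII, scaled to [0, 1]."""
        feats = torch.zeros(n_slots)
        for i, ch in enumerate(token_str[:n_slots]):
            feats[i] = min(ord(ch), 127) / 127.0
        return feats

    def augment_embedding(learned_vec: torch.Tensor, token_str: str,
                          n_slots: int = 20) -> torch.Tensor:
        """Overwrite the first n_slots dimensions of a learned token embedding
        with explicit character features."""
        out = learned_vec.clone()
        out[:n_slots] = char_features(token_str, n_slots)
        return out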
I got 3 tokens: st, raw, and berry. My point still stands: processing "berry" as a single token does not allow the model to learn its spelling directly, the way human readers do. It still has to rely on an explicit mapping of the word "berry" to b e r r y explained in some text in the training dataset. If that explanation is not present in the training data, it cannot learn the spelling - in principle.
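(For reference, this kind of check is a few lines with the Hugging Face tokenizer API; the Qwen/Qwen3-8B repo name is just one choice, and the exact split varies by tokenizer.)

    # Check how a given tokenizer splits the word.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
    ids = tok.encode("strawberry", add_special_tokens=False)
    print(tok.convert_ids_to_tokens(ids))  # e.g. ['st', 'raw', 'berry'], depending on the tokenizer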
Exactly. If “st” is 123, “raw” is 456, “berry” is 789, and “r” is 17… it makes little sense to ask the model to count the [17]’s in [123, 456, 789]: it demands an awareness of an abstraction that does not exist.
To the extent the knowledge is there, it comes from data in the training corpus, not from direct examination of the text or tokens in the prompt.
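As a toy illustration (assuming a Hugging Face tokenizer; the exact IDs and split depend on which one you load):

    # Counting the "r" token ID inside the ID sequence for "strawberry"
    # says nothing about how many r's the word contains.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
    word_ids = tok.encode("strawberry", add_special_tokens=False)
    r_id = tok.encode("r", add_special_tokens=False)[0]

    print("strawberry".count("r"))  # 3: the character-level answer being asked for
    print(word_ids.count(r_id))     # typically 0: the only "count" visible at the ID level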
Every human learns that mapping, too: when you hear the sound "strawberry", you don't hear the double r, yet you still know the answer.