
They are trained on many billions of tokens of text dealing with character-level input; they would be rather dumb if they couldn't learn it anyway.

Every human learns this: when you hear the word "strawberry" spoken aloud, you don't hear the double r, yet you still know the answer.



These models operate on tokens, not characters. It’s true that training budgets could be spent on exhaustively enumerating how many of each letter are in every word in every language, but it’s just not useful enough to be worth it.

It’s more like asking a human for the Fourier components of how they pronounce “strawberry”. I mean the audio waves are right there, why don’t you know?


Although the vast majority of tokens are 4+ characters, you're seriously saying that the individual characters of the English alphabet didn't make the cut? What about 0-9?


Each character made the cut, but the word "strawberry" is a single token, and that single token is what the model gets as input. When humans read text, they see each individual character in the word "strawberry" every time they see that word. LLMs don't see individual characters when they process input text containing the word "strawberry". They can only learn the spelling if some text explicitly maps "strawberry" to the sequence of characters s t r a w b e r r y. My guess is there are not enough such mappings in the training dataset for the model to learn it well.


The fact that the word ends up being one token doesn't mean the model can't track individual characters in it. The model transforms the token into a vector (of multiple-thousand dimensionality), and I'm pretty sure there are dimensions corresponding to things like "1st character is 'a'", "1st is 'b'", "2nd is 'a'", etc.

So tokens aren't as important as they might seem.


No, the vector is in a semantic embedding space. That's the magic.

So "the sky is blue" converts to the tokens [1820, 13180, 374, 6437]

And "le ciel est bleu" converts to the tokens [273, 12088, 301, 1826, 12704, 84]

Then the embeddings vectors created from these are very similar, despite the letters having very little in common.
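
For illustration, a minimal sketch of the tokenization side of this, using the tiktoken library (the cl100k_base encoding is an assumption for illustration; exact IDs depend on the tokenizer):

    # Sketch: tokenizers map text to integer IDs, not characters.
    # Requires: pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding, for illustration

    for text in ["the sky is blue", "le ciel est bleu", "strawberry"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{text!r} -> {ids} -> {pieces}")

This only shows the tokenizer side: the model's input is a short list of IDs with no visible character structure. The embedding-similarity claim itself would need the model's embedding matrix.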


Which character is in 1st/2nd/3rd place is part of the semantic space, in the generic meaning of the word. I ran experiments which seem to ~support my hypothesis below.


Is there any evidence to support your hypothesis?


Good question! I did a small experiment: trained a small logistic regression from embedding vectors to the 1st/2nd/3rd character of the token: https://chatgpt.com/share/6871061a-7948-8007-ab53-5b0b697e90...

I got 0.863 (1st) / 0.559 (2nd) / 0.447 (3rd) accuracy for Qwen 3 8B model embeddings. Note the code is hacky and might be wrong in places, and in reality the transformer knows more than this, because here I use only the embedding layer. Still, it shows there are very clear signals about a token's characters in its embedding vector.
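
For reference, a minimal sketch of that kind of probe (a reconstruction, not the code from the linked chat; the model name is a placeholder and the token filtering is an assumption):

    # Sketch: probe whether input embeddings encode a token's first character.
    # Requires: pip install transformers torch scikit-learn
    import numpy as np
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    model_name = "Qwen/Qwen3-8B"  # placeholder; any model with an embedding matrix works
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    emb = model.get_input_embeddings().weight.detach().float().numpy()  # (vocab, dim)

    # Build (embedding vector, first character) pairs for plain alphabetic tokens.
    X, y = [], []
    for tid in range(emb.shape[0]):
        s = tok.decode([tid]).strip().lower()
        if s.isalpha():
            X.append(emb[tid])
            y.append(s[0])

    X_tr, X_te, y_tr, y_te = train_test_split(np.array(X), np.array(y), test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # slow but simple
    print("first-character probe accuracy:", clf.score(X_te, y_te))

The same loop with y built from s[1] or s[2] (for tokens long enough) gives the 2nd- and 3rd-character probes.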


Thank you! I guess if there's enough spelling-related text in the dataset, a model is forced to learn some info about token composition in order to predict such texts.

I wonder if it would help to explicitly insert this info into an embedding vector, similar to how we encode word-position info. For example, allocate the first 20 vector elements to represent the ASCII codes of the token's characters (in some normalized way).
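
A tiny sketch of that idea (purely illustrative; the 20-slot layout and the normalization are this suggestion, not something existing models do):

    import numpy as np

    def char_features(token_text: str, n_slots: int = 20) -> np.ndarray:
        """Normalized ASCII codes of the token's first n_slots characters, zero-padded."""
        feats = np.zeros(n_slots, dtype=np.float32)
        for i, ch in enumerate(token_text[:n_slots]):
            feats[i] = min(ord(ch), 127) / 127.0  # clamp non-ASCII, scale to [0, 1]
        return feats

    def augment_embedding(emb_vec: np.ndarray, token_text: str) -> np.ndarray:
        """Overwrite the first 20 dimensions with character info, as suggested above."""
        out = emb_vec.copy()
        out[:20] = char_features(token_text)
        return out

    print(char_features("berry")[:6])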


Ok, bonus content #2.

I took the Qwen3 1.7B model and did the same, but rather than using the embedding vector I used the vector after the 1st/etc. layer; below are the accuracies for the 1st position:

- embeddings: 0.855

- 1st: 0.913

- 2nd: 0.870

- 3rd: 0.671

- 16th: 0.676

- 20th: 0.683

And now mega bonus content: the same but with prefix "count letters in ":

- 1st: 0.922

- 2nd: 0.924

- 3rd: 0.920

- 16th: 0.877

- 20th: 0.895

And for 2nd letter:

- embeddings: 0.686

- 1st: 0.679

- 2nd: 0.682

- 3rd: 0.674

- 16th: 0.572
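
A rough sketch of how those per-layer numbers could be produced (a reconstruction under assumptions: placeholder model name, last-position hidden state, same logistic-regression probe as before):

    # Sketch: probe hidden states after a given layer instead of the raw embedding.
    # Requires: pip install transformers torch
    import torch
    from transformers import AutoTokenizer, AutoModel

    model_name = "Qwen/Qwen3-1.7B"  # placeholder
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

    def token_vector(token_text: str, layer: int, prefix: str = "") -> torch.Tensor:
        """Hidden state at the last position after `layer` (0 = embedding output)."""
        ids = tok(prefix + token_text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        return out.hidden_states[layer][0, -1]

    v = token_vector("berry", layer=3, prefix="count letters in ")
    print(v.shape)
    # These vectors would then feed the same logistic-regression probe as the embeddings.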


One way here is to use one-hot encoding in the first (token length * alphabet length) dimensions.

But to be frank I don't think it's really needed; I bet the model learns everything it really needs by itself. If I had time I would've tried it though :)
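
A minimal sketch of that one-hot layout (illustrative only; the maximum token length and the alphabet are assumptions):

    import numpy as np

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    MAX_LEN = 8  # assumed cap on token length for the reserved block

    def one_hot_chars(token_text: str) -> np.ndarray:
        """One-hot encode the first MAX_LEN characters into MAX_LEN * len(ALPHABET) dims."""
        vec = np.zeros(MAX_LEN * len(ALPHABET), dtype=np.float32)
        for i, ch in enumerate(token_text.lower()[:MAX_LEN]):
            j = ALPHABET.find(ch)
            if j >= 0:
                vec[i * len(ALPHABET) + j] = 1.0
        return vec

    print(one_hot_chars("berry").reshape(MAX_LEN, len(ALPHABET)).sum(axis=1))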

Bonus content, accuracies for other models (notice DeepSeek!):

- Qwen3-32B: 0.873 / 0.585 / 0.467

- Qwen3-235B-A22B: 0.857 / 0.607 / 0.502

- DeepSeek-V3: 0.869 / 0.738 / 0.624


> the word "strawberry" is a single token, and that single token is what the model gets as input.

This is incorrect.

strawberry is actually 4 tokens (at least for GPT, but most LLMs are similar).

See https://platform.openai.com/tokenizer


I got 3 tokens: st, raw, and berry. My point still stands: processing "berry" as a single token does not allow the model to learn its spelling directly, the way human readers do. It still has to rely on an explicit mapping of the word "berry" to b e r r y explained in some text in the training dataset. If that explanation is not present in the training data, it cannot learn the spelling - in principle.


Exactly. If “st” is 123, “raw” is 456, “berry” is 789, and “r” is 17… it makes little sense to ask the models to count the [17]’s in [123, 456, 789]: it demands an awareness of the abstraction that does not exist.

To the extent the knowledge is there it’s from data in the input corpus, not direct examination of the text or tokens in the prompt.
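
A small sketch of the difference (tiktoken and the cl100k_base encoding are assumptions for illustration):

    # Counting characters needs the decoded text, not the token IDs.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                         # a few subword IDs; none of them "is" the letter r
    print(enc.decode(ids).count("r"))  # 3, but only after mapping IDs back to characters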


So much for generalized intelligence, I guess.


Is a human who never learned how to read not generally intelligent?



