
This reasoning is interesting, but what is stopping an LLM from simply knowing the number of r's _inside_ one token?

Even if strawberry is decomposed as "straw-berry", the required logic to calculate 1+2 seems perfectly within reach.

Also, the LLM could associate a sequence of separate characters with each token. Most LLMs can spell out words perfectly fine.

Am I missing something?



The problem is not the addition; it's that the LLM has no way to know how many r's a token might have, because the LLM receives each token as an atomic entity.

For example, according to https://platform.openai.com/tokenizer, "strawberry" would be tokenized by the GPT-4o tokenizer as "st", "raw", "berry" (tokens don't have to make sense because they are based on byte-pair encoding, which boils down to n-gram frequency statistics, i.e., it doesn't use morphology, syllables, semantics, or anything like that).

Those tokens are then converted to integer IDs using a dictionary, say maybe "st" is token ID 4463, "raw" is 2168 and "berry" is 487 (made-up numbers).

Then when you give the model the word "strawberry", it is tokenized and the input the LLM receives is [4463, 2168, 487]. Nothing else. That's the kind of input it always gets (also during training). So the model has no way to know how those IDs map to characters.
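
To see this pipeline concretely, here is a minimal sketch using OpenAI's tiktoken library (assuming it is installed, and using the o200k_base encoding, which is the one GPT-4o uses). The exact split and IDs it prints depend on the encoding, so the "st"/"raw"/"berry" split and the IDs above are only illustrative:

    import tiktoken

    # o200k_base is the encoding used by GPT-4o.
    enc = tiktoken.get_encoding("o200k_base")

    ids = enc.encode("strawberry")
    print(ids)                             # a short list of integer IDs
    print([enc.decode([i]) for i in ids])  # the character chunk behind each ID

    # The model only ever sees `ids`; the chunks printed above are what *we*
    # recover by looking the IDs up in the tokenizer's vocabulary.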

As some other comments in the thread are saying, it's actually somewhat impressive that LLMs can get character counts right at least sometimes, but this is probably just because they get the answer from the training set. If the training set contains a website where some human wrote "the word strawberry has 3 r's", the model could use that to get the question right, just like if you ask it for the capital of France, it will know the answer because many websites say that it's Paris.

Maybe, just maybe, if the model has both "the word straw has 1 r" and "the word berry has 2 r's" in the training set, it might be able to add them up and give the right answer for "strawberry", because it notices that it's being asked about [4463, 2168, 487] and it knows about [4463, 2168] and [487]. I'm not sure, but it's at least plausible that a good LLM could do that. But there is no way it can count characters within tokens; it just doesn't see them.
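
To make that "add up per-token counts" idea concrete, here is a toy sketch reusing the made-up IDs from above (the IDs and vocabulary entries are purely hypothetical): the addition is trivial for anything that has a character-level view of each token, and that lookup table is exactly what the model is never given.

    # Hypothetical vocabulary entries, reusing the made-up IDs from this comment.
    vocab = {4463: "st", 2168: "raw", 487: "berry"}

    prompt_ids = [4463, 2168, 487]  # what the model actually receives for "strawberry"

    # Easy for us, because we can look each ID up; the model only ever gets the IDs.
    print(sum(vocab[i].count("r") for i in prompt_ids))  # 0 + 1 + 2 = 3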


Tokenization does not remove information from the input [1]. All the information required for character counting is still present in the input following tokenization. The reasons you give for why counting characters is hard could be applied to essentially all other forms of question answering. I.e., to answer questions of type X in general, the LLM has to generalize from questions of type X in the training corpus to questions of type X with novel surface forms that it sees at test time.

[1] Tokenizers can remove information if designed to do so, but they don't in these simple scenarios.


As far as I know, that's not the case. The tokenizer takes a bunch of characters, like "berry", identifies it as a token, and what the LLM gets is the token ID. It doesn't have access to the information about which letters that token is composed of. Here is an explanation by OpenAI themselves: https://help.openai.com/en/articles/4936856-what-are-tokens-... - as you can see, "Models take the prompt, convert the input into a list of tokens, processes the prompt, and convert the predicted tokens back to the words we see in the response". And the tokens are basically IDs, without any internal structure - there are examples there.

If I'm missing something and you have a source for the claim that character information is present in the input after tokenization, please provide it. I have never implemented an LLM or fiddled with them at low level so I might be missing some detail, but from everything I have read, I'm pretty sure it doesn't work that way.


A sequence of tokens can be converted back to the sequence of tokenized characters without loss of information. E.g., how do you think text is rendered for the user based on the sequences of tokens generated by the LLM? Different tokenization schemes arrange that information differently and may make it (hand-waving here) harder or easier for the model to reason about details like raw character counts that are affected by tokenization. If the training set included sufficiently many examples of character-counting Q/A pairs, an LLM would have no trouble learning this task.
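
A minimal sketch of that reversibility claim, again assuming tiktoken with the o200k_base encoding; the point is only that detokenization recovers the exact characters, not that the model performs this lookup internally:

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    ids = enc.encode("strawberry")

    # Round trip: decoding the IDs gives back the exact original characters,
    # so tokenization destroyed no information.
    text = enc.decode(ids)
    assert text == "strawberry"

    print(text.count("r"))  # 3 -- recoverable from the tokens, given the vocabulary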


Thank you for taking the time to write this response. Unfortunately, even though I agree that tokenization makes it pretty hard for the LLM to count characters, I'm still not convinced that it is a fundamental problem for doing so. I think the lack of (or limited amount of) symbolic processing is an even more important factor.

> But there is no way it can count characters in tokens, it just doesn't see them.

If that is the case, then how can most LLMs (tested with ChatGPT and Llama 3) spell out words correctly?


Might that also be the answer to why it says "2"? There are probably plenty of sources where people say there are two R's in "berry", but nobody bothers to write that there is one R in "raw".


The fact that any of those tasks work at all, let alone so well, despite tokenization is quite remarkable indeed.

You should ask why it is that any of those tasks work, rather than ask why counting letters doesn't work.

Also, LLMs screw up many of those tasks more often than you'd expect. I don't trust LLMs with any kind of numeracy whatsoever.


It doesn't see "straw" or "berry". It sees a vector which happens to represent the word strawberry and is translated from and to English on the way in and out. It never sees the letters; 'strawberry' is represented by a number, or a group of numbers. Try to count the Rs in "21009873628" - you can't.


I'm aware of this. The network could, and apparently does, associate single characters with words. It can associate "red" with "rose", and might associate "r" with "straw", and it might even associate some kind of embedding of "two r's" with "berry".


Yes, you are missing that the tokens aren't words; they are 2-3 letter groups, or chunks of arbitrary size depending on the model.


Nope, I'm not missing that particular fact. I'm aware that sentences (and words) are split into tokens, which are then mapped to vectors.

I don't understand how most LLMs can spell out words though, nor do I understand what is causing the failure to count characters in words. I was not convinced by the comment I was responding to.



