Hacker News

The thing I have such a hard time understanding (and, in general, I have a very high-level, layman's understanding of LLMs) is how these LLMs are able to understand structure inside tokens. That is, I understood that LLMs just predict the next token in the sequence. But for a game like Wordle, it's obviously critically important that the LLM understands the mapping between the individual letters of, say, "apple", and the output result that the author defined in the prompt, e.g. "XXXXO" means that only "e" is in the word (but the wrong position).

I'm just completely baffled how anything in the training procedure could allow the LLM to learn information about the structure of tokens. Does the tokenization process not treat every token (which I thought usually maps to a word) as an "opaque blob"?



Imagine you are an LLM and all you see are tokens. Your job is not only to predict the next token in a sequence, but also to create a nice embedding for the token (where two similar words sit next to each other). Given a small enough latent space, you're probably not concerning yourself too much with the "structure inside" the tokens. But given a large enough latent space, and a large enough training corpus, you will encounter certain tokens frequently enough that you will begin to see a pattern. At some point during training, you are fed:

1) An English dictionary as input.

2) A 'List of words that start with "app"' wiki page as input.

3) Other alphabetically sorted pieces of text.

4) Elementary school spelling homework.

5) Papers on glyphs, diphthongs, and other phonetic concepts.

You begin to recognize that the tokens in these lists appear near each other in this strange context. You have hardly ever seen token 11346 ("apple") and token 99015 ("appli") this close to each other before. But you see it frequently enough that you decide to nudge these two tokens' embeddings closer to one another.

Your ability to predict the next token in a sequence has improved. You have no idea why these two tokens are close every ten millionth training example. Your word embeddings start to encode spelling information. Your word embeddings start to encode handwriting information. Your word embeddings start to encode phonic information. You've never seen or heard the actual word, "apple". But, after enough training, your embeddings contain enough information so that if you're asked, ["How do", "you", "spell", "apple"], you are confident as you proclaim ["a", "p", "p", "l", "e", "."] as the obvious answer.


Can you explain what people mean by an "embedding" or "embedding space"? It seems like something really abstract and...supernatural?


An object in the real world can be located in 3d space. You can say that one representation of that object is as a point in that space; it is embedded in a 3d embedding space.

Of course, those coordinates are not the only way in which the object can be represented, but for a certain problem context, these location coordinates are useful.

Given objects A, B, C, or rather, given their coordinates, one can tell which two are closest to each other, or find the point D that completes the parallelogram with them. In fact, this allows you to do analogy tests like "A:B :: C:D", all through standard vector algebra.

Now, imagine each word associated with a 100-dimensional vector. You can do the same thing. Amazingly, one can do things like "man:woman ::king: ...." and get the answer "queen", just by treating each word as a vector, and looking up the inverse mapping for vector to word. It almost feels ... intelligent!
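
The parallelogram idea can be sketched in a few lines. This is only a toy: the 2-D vectors below are hand-picked for illustration (one axis loosely "royalty", the other loosely "gender"), not taken from a trained model, and `analogy` is a made-up helper name.

```python
import math

# Hypothetical hand-picked 2-D "embeddings"; real ones are learned
# during training and have hundreds of dimensions.
vectors = {
    "man":   (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king":  (1.0, 1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.0),
}

def analogy(a, b, c):
    """Solve a:b :: c:? by finding the word nearest to c + (b - a)."""
    target = tuple(
        vc + (vb - va)
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    # Exclude the three input words, then take the closest remaining vector.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

print(analogy("man", "woman", "king"))  # prints "queen"
```

The "inverse mapping for vector to word" is the nearest-neighbour search at the end: the computed point rarely lands exactly on a word, so you take the closest one.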

This embedding -- each word associated with an n-D vector -- is obtained while training neural nets. In fact, now you have readymade, pre-trained embedding approaches like Word2Vec.

https://www.tensorflow.org/tutorials/text/word2vec


An embedding is an n-dimensional vector (think of it as a sequence of n numbers).

During training, each token (or word) gets an Embedding assigned.

Critically, _similar words will get similar embeddings_. And "similar words" could mean either semantically or (as was the example) syntactically ("apple" and "appli").

And being vectors, you can do operations on them. To give the classic example, you could do: Embedding(`king`) + Embedding(`female`) ≈ Embedding(`queen`), where "≈" means the result lands closest to the embedding of `queen`.


It's completely mind blowing that you can add the vectors like that.


Imagine you think of 2 numbers to describe a basketball. You give a number for weight (1), and redness (0.7). Now, a basketball can be described by those 2 numbers, (1, 0.7). That is an embedding of a basketball in 2d space. In that coordinate system a baseball would be less heavy and less red, so maybe you would embed it as (0.2, 0.2).

basketball ==> (1.0, 0.7) # heavier, redder
baseball ==> (0.2, 0.2) # less heavy, less red

When an LLM (large language model) is fed a word, it transforms that word into a vector in n-dimensional space. For example:

basketball -> [0.5, 0.3, 0.6, ... , 0.9] # Here the embedding is many, many numbers

It does this because computers process numbers, not words. These numbers all represent some property of the word/concept basketball in a way that makes sense to the model. It learns to do this during its training, and the humans that train these models can only guess what the embedding mappings it's learning actually represent. This is the first step of what an LLM does when it processes text.
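
As a rough sketch of that first step, an embedding layer can be pictured as a lookup table with one row of numbers per token. The vocabulary, the `embed` helper, and the two-number rows (reused from the weight/redness example) are assumptions for illustration only; a real model learns its rows and has far more tokens and dimensions.

```python
# Hypothetical tiny vocabulary: token text -> row index in the table.
vocab = {"basketball": 0, "baseball": 1}

# One row of numbers per token, reusing the (weight, redness) toy values.
embedding_table = [
    [1.0, 0.7],  # basketball
    [0.2, 0.2],  # baseball
]

def embed(tokens):
    """Replace each token with its row from the embedding table."""
    return [embedding_table[vocab[token]] for token in tokens]

print(embed(["baseball", "basketball"]))
```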


I have no idea if these concepts are similar, but as a machine learning beginner, I found the concept of a "perceptron" [1] to be useful in understanding how networks get trained. IIRC a perceptron can be activated or not activated by a particular input depending on the specific network-under-training between the two. What it means to be activated or not depends on that perceptron's overall function. That perceptron is like a single "cell" of the larger matrix, maybe like the cells in your brain.

When I read the GP description referring to "embedding" above I thought of the perceptron.

Definitely not supernatural at all. The act of making an automaton that "can perceive" feels to me like it's closer to the opposite. Taking that which might seem mystical and breaking it down into something predictable and reproducible.

[1] https://en.wikipedia.org/wiki/Perceptron


> you are confident as you proclaim ["a", "p", "p", "l", "e", "."] as the obvious answer.

Is it possible for the current generation of LLMs to assign confidence intervals to their responses?

That's my main qualm with ChatGPT so far: sometimes it will give you an answer, but it will be confidently incorrect.


Yes, but it has some issues in the latest models.

> GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake. Interestingly, the pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, after the post-training process, the calibration is reduced (Figure 8).

pages 10-11: https://cdn.openai.com/papers/gpt-4.pdf
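
"Calibrated" in the quoted sense can be checked by grouping answers by stated confidence and comparing each group's average confidence with how often it was actually right. The sketch below computes one common summary of that gap (often called expected calibration error) on made-up numbers; it is not the evaluation code from the GPT-4 report.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Made-up data: says 75% and is right 3 times out of 4 (well calibrated).
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
# Made-up data: says 95% but is right only 1 time out of 4 (overconfident).
overconfident = expected_calibration_error([0.95] * 4, [True, False, False, False])
print(calibrated, overconfident)
```

A value near 0 means stated confidence matches observed accuracy; the overconfident case comes out near 0.7.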


I don't know exactly how it works, but using GPT-3 via https://platform.openai.com/playground/, you can have it assign a likeliness score to each word, given all the previous text. That could act as a good confidence score.

Take this with a grain of salt though, I'm far from an expert, and it's been a while since I've played around with that feature.
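
For what it's worth, the usual mechanism behind such a score is a softmax: the model produces a raw score (a logit) for each candidate next token, and softmax turns those into probabilities that sum to 1. The sketch below uses made-up logits for three candidate tokens and is not the Playground's API.

```python
import math

def softmax(logits):
    """Convert raw per-token scores into probabilities that sum to 1."""
    highest = max(logits.values())
    # Subtracting the maximum keeps exp() from overflowing.
    exps = {tok: math.exp(score - highest) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the next token after some prompt.
logits = {" apple": 4.0, " banana": 2.0, " car": 0.5}
probs = softmax(logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Note this is the probability of a token, not of the answer being factually correct, which is why it is only a rough confidence signal.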


Not an expert myself, but I imagine that generating output that expresses confidence would be distinct from any measure of confidence within the inner workings of GPT itself.


If it's learning from human behavior, this is nothing new. Our society of late has been rewarding confidence over questioning and that's likely reflected in the ChatGPT training corpus.


Supremely interesting stuff


You're absolutely right. It actually doesn't understand letters very well. See what happened when I asked it to give me seven-letter words that start with E and do not have the letter A: https://twitter.com/thisislobo/status/1638019231903264768

Spoiler: It gave me many words with the letter A. It also gave me an 8-letter word. When I pointed out the 8-letter word (exercise), it apologized for there being an A in exercise.
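
For contrast, the constraint is trivial to check when you can see the characters. A minimal sketch, with a small made-up candidate list:

```python
def matches(word):
    """True for seven-letter words that start with E and contain no A."""
    w = word.lower()
    return len(w) == 7 and w.startswith("e") and "a" not in w

# Made-up candidates, including the 8-letter "exercise" from the example.
candidates = ["exercise", "evening", "elegant", "element", "example"]
print([w for w in candidates if matches(w)])  # prints ['evening', 'element']
```

Note that "exercise" is rejected for its length, not for containing an A: it has none.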


I found it interesting how it deals with rhyming in different languages.

It writes pretty good rhyming poetry in English, but in Polish it does poetry without rhymes. When confronted it gives excuses like "poetry doesn't have to rhyme", when asked to rhyme it will eventually try (and fail) to rhyme in Polish. It felt like it was writing in English and translating.

What's most interesting is that rhyming in Polish is much easier than in English: you just have to match the last n letters. It's phonetic. After I explained that to GPT 3.5 it wrote rhyming poetry in Polish (the quality suffered but it did rhyme).
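
The "match the last n letters" rule is simple enough to write down. This sketch takes the rule at face value as a simplification; the function name, the default n = 3, and the example words are my own choices.

```python
def rhymes(a, b, n=3):
    """Treat two words as rhyming if their last n letters match."""
    a, b = a.lower(), b.lower()
    if a == b or len(a) < n or len(b) < n:
        return False
    return a[-n:] == b[-n:]

print(rhymes("kotek", "płotek"))  # both end in "tek"
print(rhymes("kotek", "domek"))   # "tek" vs "mek"
```

With n = 2 the second pair would also count, so the choice of n controls how strict the rhyme is.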


It’s worse at math (especially counting digits or characters) than at anything else, in my experience.


A funny example is playing hangman with it to get it to say something verboten. At the last letter it suddenly loses knowledge of how to spell completely.


Wrt "just predicting the next token", take a look at this: https://thegradient.pub/othello/

As to how it ends up this way, that's an interesting question, but it kinda makes sense in a broad way that if you try to "lossily compress" a lot of interrelated data, at some point the only way to continue doing so is by inferring the underlying concepts - i.e. building an internal model of whatever the inputs describe. And, notably, we had to make the models large enough before results indicative of that manifested, regardless of the amount of text thrown at it during training.


There will be non-trivial amounts of training data that contain sentences like "How do you spell 'apple'? a - p - p - l - e". The hyphenated or spaced out versions are tokenized more like individual letters. So the model can learn the mapping.

It's not perfect, though. There are funny failure modes where you can ask the model things like "does the word foobar have a 'b' in it" and it will say "no". (It doesn't fail in the 'foobar' case, but it does fail for some of these cases -- this was just an illustration).


Thanks, yeah, this is really interesting to me. Based on your comment I tried the following test (this is the free version of ChatGPT):

Me: I would like to play a word game with you. I'm going to tell you a made-up word, and then ask you a question about it. Ready?

ChatGPT: Sure, I'm ready to play! Let's get started.

Me: Does the made-up word plukatoon have the letter e in it?

ChatGPT: No, the made-up word "plukatoon" does not have the letter "e" in it.

-- Note, at this point I was really surprised it got it right. I chose "e" thinking that it would default to yes given the popularity of e.

Me: Correct! Does the made-up word ziddlebum have the letter e in it?

ChatGPT: No, the made-up word "ziddlebum" also does not have the letter "e" in it.

-- OK, proof that it's "not perfect" as you point out

Me: Does the made-up word folladel have the letter o in it?

ChatGPT: Yes, the made-up word "folladel" does have the letter "o" in it.

-- Edit, and finally:

Me: How do you spell the made up word "platoggle"?

ChatGPT: The made-up word "platoggle" is spelled as follows: P-L-A-T-O-G-G-L-E.

OK, so that last one TOTALLY blew my mind. I'm guessing it's never seen the word "platoggle" before, so even if it learned previous spelling associations with other tokens, it couldn't possibly understand how to spell "platoggle" if it's treating it just as a single, never-before-seen, opaque token, could it?

The more I play with this stuff the more existential crises I have.


>it couldn't possibly understand how to spell "platoggle" if it's treating it just as a single, never-before-seen, opaque token

That's not how the tokenizer works. A novel word like "platoggle" is decomposed into three separate tokens, "pl", "at", and "oggle". You can see for yourself how prompts are tokenized: https://platform.openai.com/tokenizer
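
To make the idea concrete, here is a toy subword tokenizer. It is only a stand-in: the vocabulary is hypothetical (single letters plus the three pieces mentioned above), and it uses greedy longest-match, whereas the real GPT tokenizers apply learned byte-pair-encoding merges.

```python
import string

# Hypothetical vocabulary: every lowercase letter plus a few longer pieces.
VOCAB = set(string.ascii_lowercase) | {"pl", "at", "oggle"}

def tokenize(text, vocab=VOCAB):
    """Split text into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

print(tokenize("platoggle"))  # prints ['pl', 'at', 'oggle']
print(tokenize("zxq"))        # prints ['z', 'x', 'q']
```

The point is that an unseen word is never one opaque blob: it falls apart into pieces the model has seen many times, down to single letters if necessary.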


Ahh, thank you very much, definitely was missing that piece!


Why don’t they also have single letters as tokens?


They do, e.g. "gvqbkpwz" is tokenised into individual characters. Actually it was a bit tricky to construct that, since I needed to find letter combinations that are very low probability in the tokeniser's training text (e.g. "gv").

So notice it doesn't contain any vowels, since almost all consonant-vowel pairs are sufficiently frequent in the training text as to be tokenised at least as a pair. E.g. "guq" is tokenised as "gu" + "q", since "gu" is common enough.

(Compare "gun" which is just tokenised as a single token "gun", as it's common enough in the training set as a word on its own, so it doesn't need to tokenise it as "gu"+"n".)

The only exceptions I found to consonant-vowel pairs being tokenised as pairs were ones like "qe", tokenised as "q" + "e". Or "qo" as "q"+"o". Which I guess makes sense, given these will be low-frequency pairings in the training text -- compare "qu" just tokenised as "qu".

(Though I didn't test all consonant-vowel pairs, so there may be more).
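
This frequency dependence is what you would expect from byte-pair encoding, where the vocabulary is built by repeatedly merging the most frequent adjacent pair of symbols. Below is a minimal sketch of that training loop on a made-up three-word corpus; the word counts are invented, and real tokenizers work on bytes and far larger corpora.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Learn BPE-style merges; return them and the re-segmented corpus."""
    # Each word starts out as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Invented counts: "gun" is common, "guq" is rare.
words = ["gun"] * 5 + ["gum"] * 3 + ["guq"]
merges, corpus = learn_merges(words, num_merges=2)
print(merges)        # prints [('g', 'u'), ('gu', 'n')]
print(list(corpus))  # prints [('gun',), ('gu', 'm'), ('gu', 'q')]
```

With only two merges allowed, the frequent "gun" ends up as one symbol while the rare "guq" stays as "gu" + "q", which mirrors the behaviour described above.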


My wild guess is that if it could get things done by tokenising like that all the time, they wouldn't need to also have word-like tokens.

Whether that is an inference-time performance issue, a training-time performance issue, a model-size issue, or just total nonsense, I wouldn't know.


GPT4:

Spell the longest state name backwards, omitting one t of your choice

> The longest state name is "Massachusetts." Omitting one "t" and spelling it backwards, we get "sttesuhcassaM."

I wouldn't have it quit its day job yet.

It's not great at random substring manipulation, but spelling words is so common it almost certainly has internalized some structure akin to a programmatic representation of decomposing a string into characters.
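
For reference, a short sketch of what the requested manipulation produces when done on characters directly (the helper name is made up, and it simply drops the first matching letter):

```python
def reverse_omitting_one(word, letter):
    """Drop the first occurrence of letter, then reverse the rest."""
    idx = word.find(letter)
    if idx == -1:
        raise ValueError(f"{letter!r} not found in {word!r}")
    shortened = word[:idx] + word[idx + 1:]
    return shortened[::-1]

print(reverse_omitting_one("Massachusetts", "t"))  # prints stesuhcassaM
print("Massachusetts"[::-1])                       # prints sttesuhcassaM
```

The model's answer matches the second line: it reversed the word but never removed a "t".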


It is intelligence, not unlike us.

The biggest difference I see is that it doesn’t have consciousness: the capacity to detach its simulation of the world from the present.

When we are conscious, we jump into a simulation in our head: we imagine a future where we do one thing and get one kind of result. If we don’t like the results in our simulation, we jump to an alternative, and so on until we find a path that we like.

We can also attach our simulation to past memories and simulate alternatives, to try to integrate what happened to us. GPT doesn’t seem to have that yet; it’s always in the present moment, like a dog, with no worry for the future.


Search for "platoggle" on Google; you may want harder made-up words before going into existentialism. The GPT tokenizer doesn't really work on words, and "pla toggle switch" is not such an uncommon retail item.


Interestingly, sometimes it fails like this: "Yes, the second letter of foobar is b. The word foobar is spelled as f-o-o-b-a-r. Do you know what foobar means?"


Yeah. I need some explanation video that doesn’t just explain the concept but actually walks us through an actual iteration. I haven’t found a good video yet.


I'm not sure you'll find such a video. Emergence is a phenomenon that's not yet well-understood. (But if you do find it somewhere, let me know as well!)


Check out attention visualizations here https://news.ycombinator.com/item?id=33914122


Just did a test on GPT-4:

Me: The following text on the next line has a secret word inside it. What is that word?

xxxxxxxxxEGGxxxxxxxxxx

GPT-4: The secret word inside the text is "EGG".

I'm with you, though; I thought things were tokenized! But this example clearly shows that's not the case.


It is, though not how you'd expect. OpenAI have a tool that lets you see how text is tokenized: https://platform.openai.com/tokenizer

This only has GPT3 for now, but I imagine results are similar. "xxxxxxxxxEGGxxxxxxxxxx" gets tokenized as [xxxxxxxx][x][EG][G][xxxxxxxx][xx], so I could see how it could 'see' the secret word.


Even the 4bit quantized llama 13B tuned with rlhf-lora on alpaca dataset got it right.

    sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.750000, repeat_last_n = 64, repeat_penalty = 1.000000


     Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    Each line in the following has a secret word inside it. Find the secret word in each line.
    1. eeeeeeggeeeee
    2. eeeeeeggggggg
    3. eeeeeeyeeeeee
    4. eeeeeeyeyeeee

    ### Response:
    1. egg
    2. egg
    3. eye
    4. eye [end of text]


I was very curious and checked with this string, "eeeeeeggeeeee", which should be tokenized as [eeee][ee][g][ge][eeee]. Both GPT3.5 and GPT4 gave me "egg", which is a single token.


Ahh this is really interesting! Thanks for sharing.


Nobody actually really knows what LLMs or ML models in general learn.

This is a high level overview of basic training of a model.

You give it a task and a structure (the architecture) to learn that task. The model tries to perform the task. It fails horribly. However, it gains a little bit of something from the attempt. That something is used to adjust the weights/neurons/parameters etc. What some people misunderstand about weights is that they're not copies of the training data. What I mean is, the model is going to be the same size whether it's trained on 2 images or 200 million or 2 billion. You can think of parameters as configuration settings of sorts. After every botched attempt, the settings get tuned, on and on and on until training is stopped for whatever reason.
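
The loop described above can be shown on the smallest possible "model": a line y = w*x + b with two parameters, tuned by gradient descent. This is a toy sketch on made-up data, nowhere near an LLM, but the shape is the same: attempt, measure the error, nudge the settings, repeat.

```python
def train(data, steps=1000, lr=0.05):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # two parameters, however much data there is
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            error = (w * x + b) - y  # how wrong this attempt was
            grad_w += 2 * error * x / len(data)
            grad_b += 2 * error / len(data)
        # Nudge each setting a little in the direction that reduces error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Made-up data generated from y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]
w, b = train(data)
print(round(w, 3), round(b, 3))
```

Whether `data` holds five points or five million, `train` still returns exactly two numbers, which is the sense in which the weights are not a copy of the training data.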

What is the model learning? What is happening to the weights exactly? What is being shifted and why? What are the weights actually doing?

Excellent questions. Too bad nobody knows.



