
Not just that: the LLM doesn't output individual tokens, but a weighted distribution over candidate tokens. The most probable (“best”) token has the highest weight, but there may be many alternatives, including JSON symbols like quote characters.

The “temperature” setting adjusts how likely it is that a token other than the top-rated option gets chosen. That prevents repetitive output.

Forcing an LLM to obey a grammar is mostly about filtering the list before the token choice is made. There may still be a random element controlled by the temperature!

A more advanced feature not commonly used is to also enable back-tracking if the AI gets stuck and can’t produce a valid output.
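A minimal sketch of that filtering step, assuming we already have the model's logits and a (hypothetical) list of token ids the grammar allows next:

    import math
    import random

    def sample_constrained(logits, allowed_ids, temperature=0.8):
        """Keep only tokens the grammar permits, then sample from what's left.
        Assumes allowed_ids is non-empty."""
        # Filter: everything the grammar forbids effectively gets probability 0.
        filtered = {tid: logits[tid] for tid in allowed_ids}
        # Temperature scaling: higher temperature flattens the distribution,
        # so lower-ranked (but still legal) tokens get picked more often.
        scaled = {tid: l / temperature for tid, l in filtered.items()}
        # Softmax over the surviving tokens.
        m = max(scaled.values())
        weights = {tid: math.exp(v - m) for tid, v in scaled.items()}
        total = sum(weights.values())
        # Weighted random choice -- the remaining "random element".
        r, acc = random.random() * total, 0.0
        for tid, w in weights.items():
            acc += w
            if acc >= r:
                return tid
        return tid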




> A more advanced feature not commonly used is to also enable back-tracking if the AI gets stuck and can’t produce a valid output.

Technically that part is mandatory if you don't just want it to produce an output, but to produce an output that correctly matches the temperature (i.e. one that you could have gotten by randomly sampling the LLM until you got a correct one). Randomly picking a next token that isn't grammatically invalid works, but it oversamples paths where most of the options are invalid. The ultimate example of this is that it can get stuck at a branch with probability 0.

From a probabilistic standpoint what you'd need to do is not just make it backtrack but make it keep generating until it generates a grammatically correct output in one go.

Maybe there is something clever that can be done to avoid regenerating from the start? What you'd need to achieve is that a token that has an x% probability of leading to an incorrect output also has an x% probability of being erased.
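For reference, a sketch of the "regenerate from scratch" approach, with hypothetical generate_one() and is_valid() standing in for the unconstrained sampler and the grammar check:

    def rejection_sample(generate_one, is_valid, max_tries=1000):
        """Unbiased with respect to the model's own distribution, but
        potentially very expensive: throw away whole outputs until one
        happens to satisfy the grammar."""
        for _ in range(max_tries):
            text = generate_one()   # one full, unconstrained sample
            if is_valid(text):
                return text
        return None  # grammar-valid outputs may simply be too improbable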


The way LLMs work is they output probabilities for every _token_, so you don't really need to backtrack; you can just always pick a token that matches the provided grammar.

That said, you might want to do something like (backtracking) beam-search which uses various heuristics to simultaneously explore multiple different paths because the semantic information may not be front-loaded, i.e. let's say we had a grammar that had a key "healthy" with values "very_unhealthy" or "moderately_healthy." For broccoli, the LLM might intend to say "very_healthy" and choose "very" but then be pigeonholed into saying "very_unhealthy" because it's the only valid completion according to the grammar.
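A toy illustration of that pigeonholing, with made-up values; once the model has committed to the prefix "very", the grammar leaves only one completion:

    # Hypothetical grammar: the only values allowed for the "healthy" key.
    ALLOWED_VALUES = ["very_unhealthy", "moderately_healthy"]

    def legal_completions(prefix):
        """Completions that keep the value inside the grammar."""
        return [v[len(prefix):] for v in ALLOWED_VALUES if v.startswith(prefix)]

    print(legal_completions("very"))  # ['_unhealthy'] -- "very_healthy" was never an option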

That said, there are a lot of shortcuts you can take to make this fairly efficient thanks to the autoregressive nature of (most modern) LLMs. You only need to regenerate / recompute from where you want to backtrack from.
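A rough sketch of that shortcut, assuming a hypothetical model.step(token, cache) -> (logits, cache) interface that extends a KV cache by one token; backtracking then just means restoring an earlier snapshot instead of recomputing the prefix:

    def generate_with_backtracking(model, prompt, choose, max_len=128):
        """choose(logits, banned) is a hypothetical sampler returning a
        grammar-legal token not in `banned`, or None if it is stuck.
        Sketch only; assumes a non-empty prompt and that we never have
        to backtrack past the prompt itself."""
        tokens, caches, logit_hist = list(prompt), [], []
        cache, logits = None, None
        for t in tokens:                              # prefill the prompt once
            logits, cache = model.step(t, cache)
            caches.append(cache); logit_hist.append(logits)
        banned = [set() for _ in range(max_len)]
        while len(tokens) < max_len:
            pos = len(tokens)
            tok = choose(logits, banned[pos])
            if tok is None:                           # dead end: rewind one step
                banned[pos].clear()
                banned[pos - 1].add(tokens.pop())     # don't retry the same choice
                caches.pop(); logit_hist.pop()
                cache, logits = caches[-1], logit_hist[-1]
                continue
            logits, cache = model.step(tok, cache)    # only the new token is computed
            tokens.append(tok); caches.append(cache); logit_hist.append(logits)
        return tokens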


Whether or not backtracking is needed is really down to the grammar's ambiguity.

The auto-regressive nature of LLMs is actually something that counts against them, at least as some tell it. Although, really, the root problem is that autoregressive generation precludes planning ahead while also lacking any iterative refinement stage.

Backtracking, look-ahead, early failure pruning and staged generation are all very useful for fitting both concepts (refinement and planning ahead) in an auto-regressive generation framework.


This is what Google DeepMind is working on: treating the output of LLMs as a tree to be searched instead of just linearly outputting tokens in a "greedy" manner and hoping for the best.

Apparently GPT-4 gets a lot of its quality from generating many alternatives (16?) and then picking the best one, but this takes 16x as much compute.

A clever tree search (which itself could be a neural net!) could improve the efficiency of this many-fold while simultaneously improving the quality by a huge factor as well.


Arguably a '1 token at a time' model is itself a tree search, so it's more of a perspective than anything. It's really when you start pruning this tree that the distinction becomes interesting. And of course treating the tree as an explicit object may allow the model to do interesting stuff like jumping to a different branch entirely (deletions, insertions, etc.).

Generating 16 alternatives and picking the best one only makes sense to me if your standard for picking one is orthogonal to the model itself; if you just pick the one that your model deems the most likely, you've just figured out a very crude and expensive way to lower the temperature.
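A sketch of that distinction, with a hypothetical external reward(text) judge that is not the generator's own likelihood:

    def best_of_n(generate, reward, n=16):
        """Sample n candidates and keep the one an external judge likes best.
        If `reward` were just the generator's own log-probability, this would
        amount to a very expensive way of lowering the temperature."""
        candidates = [generate() for _ in range(n)]
        return max(candidates, key=reward)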


That is stretching "arguably" too far. If you are taking 1 sample path, you are not in any meaningful sense searching a tree. In the context of sampling a probability distribution, which is what LLMs do in effect, there is extra depth to this. Any random response need not be representative of what the model "thinks". And, maybe counter-intuitively to some, the most likely generation might actually be unrepresentative as well.

Drawing lots of samples and then marginalizing (as a kind of vote) is methodologically more principled where appropriate. Constraining generation according to some gating function, continually redrawing samples, can be used to significantly reduce error rates at the cost of longer generation times.
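A minimal sketch of that kind of marginalization, assuming a hypothetical generate() that returns one sampled final answer:

    from collections import Counter

    def majority_vote(generate, n=20):
        """Draw many samples and return the most common answer, rather than
        trusting any single draw (or even the single most likely one)."""
        answers = [generate() for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]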

LLMs are not being used to their full potential because it is too costly to do so.


Isn’t that the whole point of using RL with these things, that the chain of likeliest tokens, one by one, doesn’t lead to the best overall generation by the model (according to the model itself)? I believe that is one reason RLHF uses RL and not supervised learning; credit assignment for a good sentence to each token is not trivial, after all.


When we talk about tree search we allow for backtracking, so if a node has 3 children, generally all 3 will be explored, or at least a subsample of the children will be. In LLM sampling you generally pick a single token/child and then just go on with that until the end of the generation.

If DeepMind is indeed applying something similar to AlphaZero to language modelling, one would expect they would generate multiple "rollouts" from the current context, then use some kind of function/network to predict which next token will lead to the best final generation, and then output that token. How to do all of that using a sensible amount of compute is what remains to be seen.
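Very roughly, and purely as a sketch of the idea (the rollout and value functions here are hypothetical stand-ins for the rollout policy and the learned evaluator):

    def pick_next_token(context, candidate_tokens, rollout, value, n_rollouts=4):
        """For each candidate next token, complete a few rollouts, score the
        finished generations, and emit the token whose rollouts score best."""
        def avg_score(tok):
            return sum(value(rollout(context, tok)) for _ in range(n_rollouts)) / n_rollouts
        return max(candidate_tokens, key=avg_score)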


Talking about efficiency: LLMs are often more efficient running batches, i.e. several lines at a time. Which means we can at some point branch new lines and run them in parallel; that will be more efficient than running them one after another. Moreover, with some tricks we can share the 'history' instead of recomputing it. This requires going deep into the model, though.


What tricks are you thinking about? Sharing the history still means you need to save the state of the autoregressive transformer, which is usually prohibitively large?


I'm talking about inference. What we need to save is the keys; we need all of them to compute the next tokens. We don't need the queries. But we can exploit the fact that each next token depends only on the previous ones, and the same holds for whatever comes out of each transformer block. Let's call it 'history'. It's a 2d array [prev_size, embed_size]. Typically that's 1024x512 = 0.5M, maybe more depending on the model, but it still looks affordable. prev_size here grows from 0 to max_prompt_size as we do inference. The idea is that we don't need to recompute it every time; we just add one element as we compute each next token. And if we want to try several alternative tokens, we can put them in one batch, and they will have the same 'history'. We need just a copy, or better, a reference. This way branching is almost free, as opposed to the 'normal' way where everything is recomputed for each alternative token.
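A bare-bones sketch of that sharing (data structures only, no actual model): branches reference the parent's cached 'history' instead of copying or recomputing it, so forking is nearly free.

    class Branch:
        """One candidate continuation. The prefix 'history' (e.g. cached keys,
        one entry per previous token) is shared by reference with the parent;
        only entries for newly generated tokens are owned by the branch."""
        def __init__(self, shared_history, tokens):
            self.shared_history = shared_history   # not copied
            self.own_history = []                  # entries for this branch's tokens
            self.tokens = list(tokens)

        def full_history(self):
            # What attention over the whole sequence would see.
            return self.shared_history + self.own_history

    # Forking alternative next tokens: all branches point at the same prefix
    # and can be stacked into one batch for the next forward pass.
    prefix_history = ["h0", "h1", "h2"]            # placeholder cached entries
    branches = [Branch(prefix_history, ["tok_a"]), Branch(prefix_history, ["tok_b"])]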


This isn't true; it's a telephone-game version of "it's a mixture of experts model" that was used to explain the impossible claim that "it's a 1 trillion parameter model" in fall '22.


Well, if the LLM suggests "moves", and an Expert Model judges the whole output, then combining the two with a tree search suspiciously resembles the AlphaGo idea.


It’s not true.


Apparently it's both. There's a bunch of experts, and then those output many alternatives, of which you see the "best" one as selected by a final quality-check neural net.


I can’t say this strongly enough: it’s not true. You’re just the latest victim.


I understand that the people who claim this don't provide any evidence. But do you have any pointers for the claim that it is not true?


Alas, no, though I'm going to think out loud a bit. I've had to go from making a comment like this once a month to twice a week, so I'm curious what pops out as helpful to point to.

Forgive opinionated language, it's more concise and is more clear to you what exactly I can give evidence of:

- December 22: proto-AI influencers are latching onto GPT4 rumors as a source of engagement. Bunch of people start repeating "RUMORS say GPT4 has ONE TRILLION parameters" Altman laughs, most people laugh, it's not quite so big a community yet.

This percolates, but you kinda ignore it: it's to non-tech people and it's unfalsifiable.

- Feb 23: GPT3.5 API announcement, run out of news, and GPT4 stuff circulates again. MS Euro executive throws gas on the fire by confirming its release 1.5 weeks earlier. These claims circulate in coverage of what GPT4 might be. However, the circulation is 99.99% in non-tech circles still.

- Mar 23: GPT4 comes out, by now "Chinchilla scaling laws" went from something 10% of tech following AI knows about, to maybe 0.1%. OpenAI releases ~0 information on # of parameters, training, or runtime details, just a visualization of a Chinchilla-fit scaling curve and that they were able to predict the model's abilities in advance based on scaling laws.

- Apr 23: GPT4 release content is old now, people needing content venture into claiming details about the model from leaks -- it's just the same trillion-parameter thing.

- May 23: Tech substacks begin offering a perspective on AI. They're new and don't know enough to know Altman laughed it off...and that it would be absurd for 100 other reasons. It comes up. A particularly famous blog handwaves about "mixture of experts" to explain how the trillion-parameter number could make sense despite the most basic reason why it wouldn't (Chinchilla scaling) and the most factual reason it isn't (Altman laughing it off). "Altman was just parsing the idea closely to hide details, it was a showman stunt!"

- Jun 23: The tech community interested in AI outstrips the sober-minded/experienced with LLMs by 1000:1, and this sounds plausible, and it's unfalsifiable. There is no proof it _isn't_ true, and it could be true, and it's a comfortable way to "understand" without putting in the work to understand. People start laundering it to HN in subdiscussions. I see it once the whole month.

- end of July 23: I've seen it every week in July, twice this week.

This is the first time I've seen the mixture of experts simplified to "it generates 16 answers and picks one" ---

which is a thing!

Except that's top-K.

And it's a _completely independent claim_ from the original misunderstandings, and it is a misunderstanding of the misunderstandings that shores up the weak points of the misunderstandings.

Yet, the claim would only make sense if the misunderstandings were true at face value, weak points and all: generating 16 outputs from the same model has existed for a very, very long time. I only got in on this in 2019, but it's been around since then, and I'm almost certain someone with formal ML training will pop in and say "1965 bro"


Wait, so it was never even confirmed or actually leaked by OpenAI that they're using a MoE model? That was just invented by some blog? I've seen it mentioned everywhere as though it's true.

I think it's likely they're using a technique that is similar to or a descendant of the Tree of Thought technique, because in Karpathy's talk, where he was not allowed to discuss GPT-4's architecture and so had to discuss only information in the public domain about other models, he pretty strongly indicated that the direction of research he thought people should pursue was ToT. In the past, Karpathy has communicated basically as much as he can to try and educate people about how these models are made and how to do it yourself; he has put up one of the best YouTube tutorials on making an LLM. I suspect that he personally probably does not agree with OpenAI's level of secrecy, but at minimum he shares a lot more information publicly than most OAI employees.


We already do tree searches: see beam search and “best of” search. Arguable if it is a “clever” tree search, but it's not entirely unguided either, since you prune your tree based on factors like perplexity, which is a measure of how probable/plausible the model rates a branch as it stands so far.

In beam search you might keep the top n branches at each token generation step. "Best of" is in a sense the same, but you take many regular sampling steps at a time before pruning.
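A compact sketch of beam search over log-probabilities, assuming a hypothetical next_logprobs(seq) that returns {token: logprob} for the next position:

    def beam_search(next_logprobs, beam_width=3, max_len=20, eos="</s>"):
        """Keep the `beam_width` best-scoring partial sequences at each step."""
        beams = [([], 0.0)]                        # (sequence, total logprob)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == eos:         # finished beams carry over
                    candidates.append((seq, score))
                    continue
                for tok, lp in next_logprobs(seq).items():
                    candidates.append((seq + [tok], score + lp))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
            if all(seq and seq[-1] == eos for seq, _ in beams):
                break
        return beams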


> Maybe there is something clever that can be done to avoid regenerating from the start? What you'd need to achieve is that a token that has a x% probability of leading to an incorrect output also has x% probability to be erased.

Like giving the LLM a backspace token? There is a paper related to this:

https://news.ycombinator.com/item?id=36425375


I mean you're going to need to include a probability to backtrack one way or another, but simply having a backtrack character seems more like a trick to make fitting the model easier than a way to make constraining it more accurate.

Simply having the probability to backtrack does turn the whole generation process into an ergodic Markov chain though, so you might be able to use something like MCMC to make it work. Technically those only start sampling the true distribution eventually, but picking the first or nth full output might be good enough for all practical purposes. Especially at low temperatures, where there aren't many reasonable options in the first place.



