
Not just that: the LLM doesn't output individual tokens, but a weighted distribution over candidate tokens. The most probable (“best”) token has the highest weight, but there may be many alternatives, including JSON symbols like quote characters.

The “temperature” setting adjusts how likely it is that a token other than the top-rated option gets chosen. That prevents repetitive output.

Forcing an LLM to obey a grammar is mostly about filtering the list before the token choice is made. There may still be a random element controlled by the temperature!

A more advanced feature not commonly used is to also enable back-tracking if the AI gets stuck and can’t produce a valid output.
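A minimal sketch of that filtering step, assuming we already have the model's logits and a (hypothetical) list of token ids the grammar allows next:

    import math
    import random

    def sample_constrained(logits, allowed_ids, temperature=0.8):
        """Keep only tokens the grammar permits, then sample from what's left.
        Assumes allowed_ids is non-empty."""
        # Filter: everything the grammar forbids effectively gets probability 0.
        filtered = {tid: logits[tid] for tid in allowed_ids}
        # Temperature scaling: higher temperature flattens the distribution,
        # so lower-ranked (but still legal) tokens get picked more often.
        scaled = {tid: l / temperature for tid, l in filtered.items()}
        # Softmax over the surviving tokens.
        m = max(scaled.values())
        weights = {tid: math.exp(v - m) for tid, v in scaled.items()}
        total = sum(weights.values())
        # Weighted random choice -- the remaining "random element".
        r, acc = random.random() * total, 0.0
        for tid, w in weights.items():
            acc += w
            if acc >= r:
                return tid
        return tid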




> A more advanced feature not commonly used is to also enable back-tracking if the AI gets stuck and can’t produce a valid output.

Technically that part is mandatory if you don't just want it to produce an output, but to produce an output that correctly matches the temperature (i.e. one that you could have gotten by randomly sampling the LLM until you got a correct one). Randomly picking a next token that isn't grammatically invalid works, but it oversamples paths where most of the options are invalid. The ultimate example of this is that it can get stuck at a branch with probability 0.

From a probabilistic standpoint what you'd need to do is not just make it backtrack but make it keep generating until it generates a grammatically correct output in one go.

Maybe there is something clever that can be done to avoid regenerating from the start? What you'd need to achieve is that a token that has an x% probability of leading to an incorrect output also has an x% probability of being erased.
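For reference, a sketch of the "regenerate from scratch" approach, with hypothetical generate_one() and is_valid() standing in for the unconstrained sampler and the grammar check:

    def rejection_sample(generate_one, is_valid, max_tries=1000):
        """Unbiased with respect to the model's own distribution, but
        potentially very expensive: throw away whole outputs until one
        happens to satisfy the grammar."""
        for _ in range(max_tries):
            text = generate_one()   # one full, unconstrained sample
            if is_valid(text):
                return text
        return None  # grammar-valid outputs may simply be too improbable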


The way LLMs work is they output probabilities for every _token_, so you don't really need to backtrack; you can just always pick a token that matches the provided grammar.

That said, you might want to do something like (backtracking) beam-search which uses various heuristics to simultaneously explore multiple different paths because the semantic information may not be front-loaded, i.e. let's say we had a grammar that had a key "healthy" with values "very_unhealthy" or "moderately_healthy." For broccoli, the LLM might intend to say "very_healthy" and choose "very" but then be pigeonholed into saying "very_unhealthy" because it's the only valid completion according to the grammar.
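A toy illustration of that pigeonholing, with made-up values; once the model has committed to the prefix "very", the grammar leaves only one completion:

    # Hypothetical grammar: the only values allowed for the "healthy" key.
    ALLOWED_VALUES = ["very_unhealthy", "moderately_healthy"]

    def legal_completions(prefix):
        """Completions that keep the value inside the grammar."""
        return [v[len(prefix):] for v in ALLOWED_VALUES if v.startswith(prefix)]

    print(legal_completions("very"))  # ['_unhealthy'] -- "very_healthy" was never an option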

That said, there are a lot of shortcuts you can take to make this fairly efficient thanks to the autoregressive nature of (most modern) LLMs. You only need to regenerate / recompute from where you want to backtrack from.
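A rough sketch of that shortcut, assuming a hypothetical model.step(token, cache) -> (logits, cache) interface that extends a KV cache by one token; backtracking then just means restoring an earlier snapshot instead of recomputing the prefix:

    def generate_with_backtracking(model, prompt, choose, max_len=128):
        """choose(logits, banned) is a hypothetical sampler returning a
        grammar-legal token not in `banned`, or None if it is stuck.
        Sketch only; assumes a non-empty prompt and that we never have
        to backtrack past the prompt itself."""
        tokens, caches, logit_hist = list(prompt), [], []
        cache, logits = None, None
        for t in tokens:                              # prefill the prompt once
            logits, cache = model.step(t, cache)
            caches.append(cache); logit_hist.append(logits)
        banned = [set() for _ in range(max_len)]
        while len(tokens) < max_len:
            pos = len(tokens)
            tok = choose(logits, banned[pos])
            if tok is None:                           # dead end: rewind one step
                banned[pos].clear()
                banned[pos - 1].add(tokens.pop())     # don't retry the same choice
                caches.pop(); logit_hist.pop()
                cache, logits = caches[-1], logit_hist[-1]
                continue
            logits, cache = model.step(tok, cache)    # only the new token is computed
            tokens.append(tok); caches.append(cache); logit_hist.append(logits)
        return tokens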


Whether or not backtracking is needed is really down to the grammar's ambiguity.

The auto-regressive nature of LLMs is actually something that counts against them, at least as some tell it. Although, really, the root problem is that autoregressive generation precludes planning ahead while also lacking any iterative refinement stage.

Backtracking, look-ahead, early failure pruning and staged generation are all very useful for fitting both concepts (refinement and planning ahead) in an auto-regressive generation framework.


This is what Google DeepMind is working on: treating the output of LLMs as a tree to be searched instead of just linearly outputting tokens in a "greedy" manner and hoping for the best.

Apparently GPT-4 gets a lot of its quality from generating many alternatives (16?) and then picking the best one, but this takes 16x as much compute.

A clever tree search (which itself could be a neural net!) could improve the efficiency of this many-fold while simultaneously improving the quality by a huge factor as well.


Arguably a '1 token at a time' model is itself a tree search, so it's more of a perspective than anything. It's really when you start pruning this tree that the distinction becomes interesting. And of course treating the tree as an explicit object may allow the model to do interesting stuff like jumping to a different branch entirely (deletions, insertions, etc.).

Generating 16 alternatives and picking the best one only makes sense to me if your standard for picking one is orthogonal to the model itself; if you just pick the one that your model deems the most likely, you've just figured out a very crude and expensive way to lower the temperature.
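A sketch of that distinction, with a hypothetical external reward(text) judge that is not the generator's own likelihood:

    def best_of_n(generate, reward, n=16):
        """Sample n candidates and keep the one an external judge likes best.
        If `reward` were just the generator's own log-probability, this would
        amount to a very expensive way of lowering the temperature."""
        candidates = [generate() for _ in range(n)]
        return max(candidates, key=reward)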


That is stretching "arguably" too far. If you are taking 1 sample path, you are not in any meaningful sense searching a tree. In the context of sampling a probability distribution, which is what LLMs do in effect, there is extra depth to this. Any random response need not be representative of what the model "thinks". And, maybe counter-intuitively to some, the most likely generation might actually be unrepresentative as well.

Drawing lots of samples and then marginalizing (as a kind of vote) is methodologically more principled where appropriate. Constraining generation according to some gating function, continually redrawing samples, can be used to significantly reduce error rates at the cost of longer generation times.
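A minimal sketch of that kind of marginalization, assuming a hypothetical generate() that returns one sampled final answer:

    from collections import Counter

    def majority_vote(generate, n=20):
        """Draw many samples and return the most common answer, rather than
        trusting any single draw (or even the single most likely one)."""
        answers = [generate() for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]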

LLMs are not being used to their full potential because it is too costly to do so.


Isn’t that the whole point of using RL with these things, that the chain of likeliest tokens, one by one, doesn’t lead to the best overall generation by the model (according to the model itself)? I believe that is one reason RLHF uses RL and not supervised learning; credit assignment for a good sentence to each token is not trivial, after all.


When we talk about tree search we allow for backtracking, so if a node has 3 children, generally all 3 will be explored, or at least a subsample of the children will be. In LLM sampling you generally pick a single token/child and then just go on with that until the end of the generation.

If DeepMind is indeed applying something similar to AlphaZero to language modelling, one would expect they would generate multiple "rollouts" from the current context, then use some kind of function/network to predict which next token will lead to the best final generation, and then output that token. How to do all of that using a sensible amount of compute is what remains to be seen.
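Very roughly, and purely as a sketch of the idea (the rollout and value functions here are hypothetical stand-ins for the rollout policy and the learned evaluator):

    def pick_next_token(context, candidate_tokens, rollout, value, n_rollouts=4):
        """For each candidate next token, complete a few rollouts, score the
        finished generations, and emit the token whose rollouts score best."""
        def avg_score(tok):
            return sum(value(rollout(context, tok)) for _ in range(n_rollouts)) / n_rollouts
        return max(candidate_tokens, key=avg_score)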


Talking about efficiency: LLMs are often more efficient running batches, i.e. several lines at a time. Which means we can at some point branch new lines and run them in parallel; that will be more efficient than running them one after another. Moreover, with some tricks we can share the 'history' instead of recomputing it. This requires going deep into the model, though.


What tricks are you thinking about? Sharing the history still means you need to save the state of the autoregressive transformer, which is usually prohibitively large?


I'm talking about inference. What we need to save is the keys; we need all of them to compute the next tokens. We don't need the queries. But we can exploit the fact that each next token depends only on the previous ones, and the same holds for whatever comes out of each transformer block. Let's call it 'history'. It's a 2d array [prev_size, embed_size]. Typically that's 1024x512 = 0.5M, maybe more depending on the model, but it still looks affordable. prev_size here grows from 0 to max_prompt_size as we do inference. The idea is that we don't need to recompute it every time; we just add one element as we compute each next token. And if we want to try several alternative tokens, we can put them in one batch, and they will have the same 'history'. We need just a copy, or better, a reference. This way branching is almost free, as opposed to the 'normal' way where everything is recomputed for each alternative token.
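A bare-bones sketch of that sharing (data structures only, no actual model): branches reference the parent's cached 'history' instead of copying or recomputing it, so forking is nearly free.

    class Branch:
        """One candidate continuation. The prefix 'history' (e.g. cached keys,
        one entry per previous token) is shared by reference with the parent;
        only entries for newly generated tokens are owned by the branch."""
        def __init__(self, shared_history, tokens):
            self.shared_history = shared_history   # not copied
            self.own_history = []                  # entries for this branch's tokens
            self.tokens = list(tokens)

        def full_history(self):
            # What attention over the whole sequence would see.
            return self.shared_history + self.own_history

    # Forking alternative next tokens: all branches point at the same prefix
    # and can be stacked into one batch for the next forward pass.
    prefix_history = ["h0", "h1", "h2"]            # placeholder cached entries
    branches = [Branch(prefix_history, ["tok_a"]), Branch(prefix_history, ["tok_b"])]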


This isn't true; it's a telephone-game version of "it's a mixture of experts model" that was used to explain the impossible claim that "it's a 1 trillion parameter model" in fall '22.


Well, if the LLM suggests "moves", and an Expert Model judges the whole output, then combining the two with a tree search suspiciously resembles the AlphaGo idea.


It’s not true.


Apparently it's both. There's a bunch of experts, and then those output many alternatives, of which you see the "best" one as selected by a final quality-check neural net.


I can’t say this strongly enough: it’s not true. You’re just the latest victim.


I understand that the people who claim this don't provide any evidence. But do you have any pointers for the claim that it is not true?


Alas, no, though I'm going to think out loud a bit. I've had to go from making a comment like this once a month to twice a week, so I'm curious what pops out as helpful to point to.

Forgive opinionated language, it's more concise and is more clear to you what exactly I can give evidence of:

- December 22: proto-AI influencers are latching onto GPT4 rumors as a source of engagement. Bunch of people start repeating "RUMORS say GPT4 has ONE TRILLION parameters" Altman laughs, most people laugh, it's not quite so big a community yet.

This percolates, but you kinda ignore it: it's to non-tech people and it's unfalsifiable.

- Feb 23: GPT3.5 API announcement, run out of news, and GPT4 stuff circulates again. MS Euro executive throws gas on the fire by confirming its release 1.5 weeks earlier. These claims circulate in coverage of what GPT4 might be. However, the circulation is 99.99% in non-tech circles still.

- Mar 23: GPT4 comes out, by now "Chinchilla scaling laws" went from something 10% of tech following AI knows about, to maybe 0.1%. OpenAI releases ~0 information on # of parameters, training, or runtime details, just a visualization of a Chinchilla-fit scaling curve and that they were able to predict the model's abilities in advance based on scaling laws.

- Apr 23: GPT4 release content is old now, people needing content venture into claiming details about the model from leaks -- it's just the same trillion-parameter thing.

- May 23: Tech substacks begin offering a perspective on AI. They're new and don't know enough to know Altman laughed it off...and that it would be absurd for 100 other reasons. It comes up. A particularly famous blog handwaves about "mixture of experts" to explain how the trillion-parameter number could make sense despite the most basic reason why it wouldn't (Chinchilla scaling) and the most factual reason it isn't (Altman laughing it off). "Altman was just parsing the idea closely to hide details, it was a showman stunt!"

- Jun 23: The tech community interested in AI outstrips the sober-minded/experienced with LLMs by 1000:1, and this sounds plausible, and it's unfalsifiable. There is no proof it _isn't_ true, and it could be true, and it's a comfortable way to "understand" without putting in the work to understand. People start laundering it to HN in subdiscussions. I see it once the whole month.

- end of July 23: I've seen it every week in July, twice this week.

This is the first time I've seen the mixture of experts simplified to "it generates 16 answers and picks one" ---

which is a thing!

Except that's top-K.

And it's a _completely independent claim_ from the original misunderstandings, and it is a misunderstanding of the misunderstandings that shores up the weak points of the misunderstandings.

Yet, the claim would only make sense if the misunderstandings were true at face value, weak points and all: generating 16 outputs from the same model has existed for a very, very long time. I only got in on this in 2019, but it's been around since then, and I'm almost certain someone with formal ML training will pop in and say "1965 bro"


Wait, so it was never even confirmed or actually leaked by OpenAI that they're using a MoE model? That was just invented by some blog? I've seen it mentioned everywhere as though it's true.

I think it's likely they're using a technique that is similar to or a descendant of the Tree of Thought technique, because in Karpathy's talk, where he was not allowed to discuss GPT-4's architecture and so had to discuss only information in the public domain about other models, he pretty strongly indicated that the direction of research he thought people should pursue was ToT. In the past, Karpathy has communicated basically as much as he can to try and educate people about how these models are made and how to do it yourself; he has put up one of the best YouTube tutorials on making an LLM. I suspect that he personally probably does not agree with OpenAI's level of secrecy, but at minimum he shares a lot more information publicly than most OAI employees.


We already do tree searches: see beam search and “best of” search. Arguable if it is a “clever” tree search, but it's not entirely unguided either, since you prune your tree based on factors like perplexity, which is a measure of how probable/plausible the model rates a branch as it stands so far.

In beam search you might keep the top n branches at each token generation step. "Best of" is in a sense the same, but you take many regular sampling steps at a time before pruning.
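A compact sketch of beam search over log-probabilities, assuming a hypothetical next_logprobs(seq) that returns {token: logprob} for the next position:

    def beam_search(next_logprobs, beam_width=3, max_len=20, eos="</s>"):
        """Keep the `beam_width` best-scoring partial sequences at each step."""
        beams = [([], 0.0)]                        # (sequence, total logprob)
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq and seq[-1] == eos:         # finished beams carry over
                    candidates.append((seq, score))
                    continue
                for tok, lp in next_logprobs(seq).items():
                    candidates.append((seq + [tok], score + lp))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
            if all(seq and seq[-1] == eos for seq, _ in beams):
                break
        return beams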


> Maybe there is something clever that can be done to avoid regenerating from the start? What you'd need to achieve is that a token that has a x% probability of leading to an incorrect output also has x% probability to be erased.

Like giving the LLM a backspace token? There is a paper related to this:

https://news.ycombinator.com/item?id=36425375


I mean you're going to need to include a probability to backtrack one way or another, but simply having a backtrack character seems more like a trick to make fitting the model easier than a way to make constraining it more accurate.

Simply having the probability to backtrack does turn the whole generation process into an ergodic Markov chain though, so you might be able to use something like MCMC to make it work. Technically those only start sampling the true distribution eventually, but picking the first or nth full output might be good enough for all practical purposes. Especially at low temperatures, where there aren't many reasonable options in the first place.



