> Maybe there is something clever that can be done to avoid regenerating from the start? What you'd need to achieve is that a token that has an x% probability of leading to an incorrect output also has an x% probability of being erased.
Like giving the LLM a backspace token? There is a paper related to this: https://news.ycombinator.com/item?id=36425375
I mean, one way or another you're going to need to include some probability of backtracking, but simply adding a backtrack token seems more like a trick to make the model easier to fit than a way to make constrained generation more accurate.

Having a probability of backtracking does turn the whole generation process into an ergodic Markov chain, though, so you might be able to use something like MCMC to make it work. Technically MCMC chains only sample the target distribution asymptotically, but picking the first or nth full output might be good enough for all practical purposes, especially at low temperatures where there aren't many reasonable options in the first place.
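A minimal sketch of what that MCMC could look like. Everything here is a hypothetical stand-in: `next_token_dist` plays the role of an LLM's next-token distribution, and `constraint_weight` plays the role of the correctness check. The proposal backtracks to a random position and regenerates the suffix from the model itself, so the model-probability terms cancel in the Metropolis-Hastings ratio and acceptance depends only on the constraint weights:

```python
import random

VOCAB = ["a", "b", "c", "<eos>"]

def next_token_dist(prefix):
    # Hypothetical stand-in for an LLM's next-token probabilities.
    if len(prefix) >= 5:
        return {"<eos>": 1.0}
    return {"a": 0.4, "b": 0.3, "c": 0.2, "<eos>": 0.1}

def sample_from(dist, rng):
    r = rng.random()
    acc = 0.0
    for tok, p in dist.items():
        acc += p
        if r <= acc:
            return tok
    return tok  # guard against floating-point rounding

def generate_suffix(prefix, rng):
    # Roll the model forward from a prefix until it emits <eos>.
    seq = list(prefix)
    while not seq or seq[-1] != "<eos>":
        seq.append(sample_from(next_token_dist(seq), rng))
    return seq

def constraint_weight(seq):
    # Soft constraint standing in for "the output is correct":
    # here we strongly prefer outputs containing at least one "b".
    return 1.0 if "b" in seq else 1e-3

def mcmc_step(seq, rng):
    # Proposal: backtrack to a uniformly random position and regenerate
    # the suffix from the model. Since the suffix is drawn from the model,
    # p(proposal) * q(seq | proposal) / (p(seq) * q(proposal | seq))
    # reduces to the ratio of constraint weights.
    i = rng.randrange(len(seq))
    proposal = generate_suffix(seq[:i], rng)
    accept = min(1.0, constraint_weight(proposal) / constraint_weight(seq))
    return proposal if rng.random() < accept else seq

rng = random.Random(0)
seq = generate_suffix([], rng)
for _ in range(200):
    seq = mcmc_step(seq, rng)
print(seq)
```

After a couple hundred steps the chain spends almost all of its time on constraint-satisfying outputs, which matches the intuition above: tokens that tend to lead to bad outputs tend to get erased.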