
> with most computations it would have to carry out the whole thing in one go

Is there a way to allow models to say "let me think about this some more"? With language models like GPT-3 you emit one token per inference iteration, with its previous output fed back in as input/state. Can models opt out of providing a token, but still update state? That would allow it to break up the computation into discrete steps.
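As a minimal sketch of what "opt out of providing a token, but still update state" could look like at decode time (toy GRU decoder with random weights; the PAUSE token id is a made-up assumption, not anything GPT-3 actually has):

    import torch
    import torch.nn as nn

    VOCAB, HIDDEN, PAUSE_ID = 1000, 128, 0   # PAUSE_ID is a hypothetical special token

    embed = nn.Embedding(VOCAB, HIDDEN)
    cell = nn.GRUCell(HIDDEN, HIDDEN)
    to_logits = nn.Linear(HIDDEN, VOCAB)

    def decode(prompt_ids, max_steps=50):
        h = torch.zeros(1, HIDDEN)
        for t in prompt_ids:                       # ingest the prompt
            h = cell(embed(torch.tensor([t])), h)
        out, last = [], prompt_ids[-1]
        for _ in range(max_steps):
            h = cell(embed(torch.tensor([last])), h)
            nxt = int(to_logits(h).argmax(dim=-1))
            if nxt != PAUSE_ID:                    # PAUSE updates state, emits nothing
                out.append(nxt)
            last = nxt                             # either way, feed it back in as input
        return out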



Here it is: https://arxiv.org/abs/1611.06188

The RNN outputs a "confidence" bit which can guide computation to perform more steps and gain more confidence in the result. Essentially, the RNN asks "let me think about that some more".

But a separate ablation study found that if you just drop the confidence bit altogether and let the RNN compute some more every time (e.g., always perform 4 computation steps on a single input for 1 output), you get the same or better results without the extra complexity of training.
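
Roughly, an inner loop gated by such a confidence bit could look like this (PyTorch sketch; the GRU cell, threshold, and step cap are my own assumptions, not the paper's exact formulation - and the ablation above amounts to replacing the confidence check with a fixed 4 inner steps):

    import torch
    import torch.nn as nn

    class PonderingCell(nn.Module):
        # Repeats its state update on one input until a learned "confidence"
        # output crosses a threshold (or a step cap is hit).
        def __init__(self, in_dim, hid_dim, max_ponder=10, threshold=0.9):
            super().__init__()
            self.cell = nn.GRUCell(in_dim, hid_dim)
            self.confidence = nn.Linear(hid_dim, 1)
            self.max_ponder, self.threshold = max_ponder, threshold

        def forward(self, x, h):                   # assumes batch size 1
            for _ in range(self.max_ponder):
                h = self.cell(x, h)
                if torch.sigmoid(self.confidence(h)).item() > self.threshold:
                    break                          # "confident enough, stop thinking"
            return h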

There is also a Microsoft Research paper, which I can't find right now, about variable computation for image classification, where some of the later layers have a "confidence" bit - if a lower layer is confident enough, its output will be used for classification; otherwise the output of that layer will be passed on for further transformation by the upper layers.
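
That early-exit idea, very roughly (hypothetical layer sizes, number of heads, and threshold; not the paper's actual architecture):

    import torch
    import torch.nn as nn

    class EarlyExitClassifier(nn.Module):
        # If an intermediate head is already confident, use its prediction
        # and skip the remaining layers.
        def __init__(self, dim=256, num_classes=10, threshold=0.9):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(4))
            self.heads = nn.ModuleList(
                nn.Linear(dim, num_classes) for _ in range(4))
            self.threshold = threshold

        def forward(self, x):                      # assumes batch size 1
            for block, head in zip(self.blocks, self.heads):
                x = block(x)
                probs = head(x).softmax(dim=-1)
                if probs.max().item() > self.threshold:
                    break                          # lower layer is confident: exit early
            return probs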


> But a separate ablation study found that if you just drop the confidence bit altogether and let the RNN compute some more every time (e.g., always perform 4 computation steps on a single input for 1 output), you get the same or better results without the extra complexity of training.

Do they say what happens if you do both? Perhaps the “benefit from more computation per cycle” phenomenon and the “benefit from signalling relative computation resource allocation” one are different.

I guess I’ll have to try and read the paper, but I’m new to the literature and am clueless about the current state of research.


I believe GPT-3 has a transformer-based architecture, so it doesn't recursively ingest its own output in each iteration. I believe attention-based transformer models have enough complexity to be able to learn what you are talking about on their own.


GPT-3's transformers only recur some finite amount. Attention does a lot compared to a bog standard RNN, and probably if the numbers were tokenized it would be enough for most reasonable computations, but eventually you definitely would hit a cap. That's probably a good thing, of course. The network and training are Turing complete together, but it would suck if the network itself could fail to terminate.


Thank you for pointing out the difference. I went and reread about transformers; previously I thought they were a kind of RNN. (I am not an ML engineer.)


That would be neat. You could give it backspace and "let me think more" tokens that would signal the inference program to run it again on the prompt plus its own output. That way it could generate "thoughts thoughts thoughts [THINKMORE] thoughts thoughts thoughts [THINKMORE] [BACKSPACE] x8 (the real output would go here)".

It would of course have to be penalized in some way for [THINKMORE]ing to avoid infinite processing time. It would have to learn to reason about the point at which diminishing returns kick in from continuing to [THINKMORE] vs. recording its best answer. The penalization function would have to take into account the remaining tokens that would fit in the transformer prompt.
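
At decode time a crude version could look like this (everything here is assumed: the special token ids, the penalty schedule, and the `model(context) -> logits` interface):

    import torch

    THINKMORE, BACKSPACE, CONTEXT_LEN = 50257, 50258, 2048   # hypothetical ids and limit

    def decode(model, prompt_ids, max_new=256):
        context, output = list(prompt_ids), []
        for _ in range(max_new):
            logits = model(torch.tensor([context]))[0, -1]    # logits for the last position
            # penalize thinking more heavily as the remaining context shrinks
            budget_left = CONTEXT_LEN - len(context)
            logits[THINKMORE] -= 10.0 / max(budget_left, 1)
            tok = int(logits.argmax())
            context.append(tok)               # thoughts stay in the running context...
            if tok == BACKSPACE and output:
                output.pop()                  # erase the last visible token
            elif tok not in (THINKMORE, BACKSPACE):
                output.append(tok)            # ...but only "real" tokens are emitted
        return output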


I think it would work, but backprop would be computed in a different way every time. I'm not an expert, so there may be sneaky ways around it, but I'm pretty sure you'd lose out on a long history of little efficiency improvements when you could just make it more recurrent instead.



