Beam search is well known. I mean strategies like beam search, but ones we don't know about.
I can imagine some: for example, something like beam search, but where every option is scored by a smaller model. Of course one can say "but we see every token as it streams", to which I might say: are you sure? Perhaps they generate a hundred entire responses in the time it takes for one token to be shown, and just "stream" those tokens slowly to make the pace feel more human.
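To make the "beam search, but scored by a smaller model" idea concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: `propose` stands in for the large model suggesting next tokens, `score` stands in for the smaller scoring model, and the toy vocabulary and scoring rule are made up. It is not a claim about how any real system decodes.

```python
def beam_search(propose, score, start, steps, beam_width):
    """Keep the `beam_width` best partial sequences at every step.

    propose(prefix) -> iterable of candidate next tokens (hypothetical big model)
    score(sequence) -> float, higher is better (hypothetical smaller scorer)
    """
    beams = [tuple(start)]
    for _ in range(steps):
        # Extend every surviving beam with every proposed token.
        candidates = [beam + (tok,) for beam in beams for tok in propose(beam)]
        if not candidates:
            break
        # Rank all extensions with the scorer and keep only the best few.
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams


# Toy stand-ins, purely illustrative.
VOCAB = ("a", "b", "c")


def toy_propose(prefix):
    return VOCAB


def toy_score(seq):
    # The made-up scorer likes "b" and mildly dislikes "c".
    return seq.count("b") - 0.1 * seq.count("c")


best = beam_search(toy_propose, toy_score, start=(), steps=3, beam_width=2)
print(best[0])  # ('b', 'b', 'b')
```

The point of the sketch is only that the scorer is a separate, swappable function: the model that proposes tokens and the model that ranks them do not have to be the same one.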
interesting. but there should be physical limits we can use to put bounds on that speculation. for example, hardware throughput (FLOP/s) has an upper bound, so you can make latency estimates for 1B/10B/100B-parameter models. that would put reasonable bounds on statements like "a hundred entire responses in the time it takes for one token to be shown"
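A rough back-of-envelope version of that bound, as a sketch only. The assumptions are mine, not measured: a dense model costs about 2 FLOPs per parameter per generated token, weights are stored in 2 bytes each and must be read once per decoding step, and the accelerator is a hypothetical single device with 1e15 FLOP/s peak compute and 3e12 bytes/s memory bandwidth. It ignores attention cost, KV cache traffic, and multi-device parallelism.

```python
PEAK_FLOPS = 1e15  # assumed peak compute of one hypothetical device, FLOP/s
PEAK_BW = 3e12     # assumed memory bandwidth of that device, bytes/s


def step_latency_s(params, flops_per_s=PEAK_FLOPS, bytes_per_s=PEAK_BW,
                   bytes_per_param=2.0):
    """Lower bound on seconds for one decoding step of a single sequence."""
    compute_s = 2.0 * params / flops_per_s             # ~2 FLOPs per parameter
    memory_s = params * bytes_per_param / bytes_per_s  # read every weight once
    return max(compute_s, memory_s)


def batch_lower_bound_s(params, n_responses, tokens_per_response,
                        flops_per_s=PEAK_FLOPS, bytes_per_s=PEAK_BW,
                        bytes_per_param=2.0):
    """Lower bound on seconds to generate n_responses in one batch."""
    # Tokens within a response are sequential; each step re-reads the weights,
    # but that read is shared across the whole batch.
    sequential_s = tokens_per_response * (params * bytes_per_param / bytes_per_s)
    # Total arithmetic cannot be shared: every token of every response costs FLOPs.
    compute_s = 2.0 * params * n_responses * tokens_per_response / flops_per_s
    return max(sequential_s, compute_s)


for params in (1e9, 1e10, 1e11):
    per_token = step_latency_s(params)
    hundred = batch_lower_bound_s(params, n_responses=100, tokens_per_response=500)
    print(f"{params:.0e} params: >= {per_token:.2e} s/token, "
          f">= {hundred:.2f} s for 100 x 500-token responses")
```

Under these assumed numbers a 100B model needs at least about 0.067 s per token on one device, and a hundred 500-token responses take at least roughly 33 s, so "a hundred entire responses per displayed token" would need far more hardware per request or a much smaller model. Change the constants to whatever hardware you think is plausible; the structure of the bound stays the same.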
are you referring to beam search? something else?