
This is interesting for its breadth, but they are abstracting over a huge set of gradient-climbing behaviors that I'm not sure all belong to the same class. Also, this has been known forever (since before I was born) in game theory / repeated games as the "exploration vs. exploitation" trade-off.
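For readers unfamiliar with the trade-off: the textbook illustration is a multi-armed bandit solved with an epsilon-greedy policy, where each step either explores a random arm or exploits the best estimate so far. A minimal sketch (the arm means, epsilon, and noise model here are all illustrative choices, not from the article):

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Epsilon-greedy bandit: with probability epsilon pull a random
    arm (explore); otherwise pull the arm with the highest estimated
    mean reward (exploit)."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    estimates = [0.0] * n
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: uniform over arms
        else:
            arm = max(range(n), key=lambda i: estimates[i])  # exploit
        reward = true_means[arm] + rng.gauss(0, 1)  # noisy observed reward
        counts[arm] += 1
        # incremental running mean of observed rewards for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, total / steps

estimates, avg_reward = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```

With enough steps the estimates converge toward the true arm means, and the policy spends most of its pulls on the best arm while still occasionally sampling the others.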

But it's not surprising to me that a mathematically simple, yet provably convergent algorithm appears over and over. For example, reward maximization in repeated Bayesian games looks a lot like "run and tumble", once you add this critical step (from OP):

> When a “tumble” occurs, rather than sampling a new direction from a uniformly random distribution, we sample according to the distribution of expected rewards
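A minimal sketch of that modified tumble step. The quote only says the new direction is sampled "according to the distribution of expected rewards"; the softmax weighting and temperature parameter below are my assumptions about one reasonable way to turn reward estimates into a sampling distribution:

```python
import math
import random

def tumble(expected_rewards, temperature=1.0, rng=random):
    """Reward-weighted 'tumble': pick a new heading index with
    probability proportional to exp(reward / temperature), instead of
    uniformly at random. (Softmax weighting is an assumption; the
    source only says to sample 'according to the distribution of
    expected rewards'.)"""
    weights = [math.exp(r / temperature) for r in expected_rewards]
    total = sum(weights)
    probs = [w / total for w in weights]
    # rng.choices returns a list of k samples; take the single draw
    return rng.choices(range(len(expected_rewards)), weights=probs, k=1)[0]
```

Directions with higher expected reward are chosen more often, but every direction keeps nonzero probability, so the agent still explores while biasing its tumbles toward what has paid off.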

You may also want to read the book Algorithms to Live By.



