
Sure, but I think the key difference is that the optimization problem in RL has a particularly intractable form. In a supervised image problem, you spit out a classification probability and the loss function is cross-entropy or something similar, which is smooth and differentiable, so you can do gradient descent over it no problem; for any sort of X->Y problem with a differentiable loss, you train a differentiable or convex model and minimize/maximize that loss. In an RL problem, you might get back only a 0/1 reward, or you might have to make many decisions in a row before any loss signal arrives at all. How do you maximize/minimize over an entire series of discrete actions with only a global loss?
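
For concreteness, here's a minimal sketch of the contrast, assuming PyTorch (my choice; the model and all shapes are toy placeholders, not from anything real):

    import torch
    import torch.nn.functional as F

    model = torch.nn.Linear(10, 3)               # toy classifier: 10 features -> 3 classes
    x, y = torch.randn(4, 10), torch.tensor([0, 2, 1, 0])

    # Supervised case: cross-entropy is smooth in the parameters, so autograd
    # hands back a gradient and SGD just works.
    loss = F.cross_entropy(model(x), y)
    loss.backward()                              # every parameter now has a .grad

    # RL-flavored case: sample a discrete action and receive a 0/1 reward.
    # The reward is a plain number produced outside the graph, so there is
    # nothing for autograd to differentiate through.
    probs = F.softmax(model(x[:1]), dim=-1)
    action = torch.multinomial(probs, 1)         # discrete, non-differentiable step
    reward = 1.0 if action.item() == 0 else 0.0  # arrives with no gradient attached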


I get the impression this won't be sufficient, but my first thought for such problems would be to consider Long Short-Term Memory (LSTM) networks, whose defining feature is the ability to learn long-term dependencies (remembering information for long periods is practically their default behavior).
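
If it helps make that concrete, a minimal usage sketch, assuming PyTorch; all sizes here are arbitrary:

    import torch

    lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
    seq = torch.randn(1, 100, 8)    # one sequence, 100 timesteps of 8 features
    out, (h, c) = lstm(seq)         # the cell state c is what lets an LSTM
                                    # carry information across long spans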

But I can also appreciate, from what I'm reading here, that RL puts front and center the actions/decisions taken to effect an outcome, something that isn't as easy to handle in a supervised setting.


LSTMs still need a differentiable loss, because you have to backpropagate a gradient through a long unrolled series of RNN timesteps. Conceptually, there's not much difference between an RNN which takes 10 inputs one step at a time and a single big 10-layer feedforward network which takes all 10 inputs at once. If you can't define the loss for the feedforward NN, you can't define it for the RNN either, and so you can't learn the parameters of the LSTM units.
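
To make the unrolling point concrete, a minimal sketch assuming PyTorch (a plain RNN cell stands in for the LSTM; all names and sizes are illustrative):

    import torch

    cell = torch.nn.RNNCell(4, 8)            # the same weights are applied at every step
    readout = torch.nn.Linear(8, 1)
    xs = torch.randn(10, 1, 4)               # 10 timesteps of input
    h = torch.zeros(1, 8)
    for x in xs:                             # manual unroll: to autograd this is
        h = cell(x, h)                       # just a 10-layer feedforward graph
    loss = (readout(h) - 1.0).pow(2).mean()  # differentiable loss at the end...
    loss.backward()                          # ...so the gradient flows back through all 10 steps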

An example here would be a char-RNN. It predicts one character at a time, as a log probability over possible characters, and the loss is the log probability it assigned to the character that actually appeared. Nice and differentiable, so you can take the char-RNN unrolled over 10 timesteps and, at each timestep, calculate the gradient to optimize the loss. This also gives you a generative model: sample a character based on the predicted probabilities. Now take the same char-RNN and redefine the loss as 'whether the user pushed upvote or downvote on the entire 10-character string generated'; you have the unrolled RNN which generated the full string, and you backpropagate... what? What is the gradient for each LSTM parameter, telling it how it should be tweaked to slightly increase/decrease the loss?
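
Here's that example as a sketch, assuming PyTorch (a GRU cell stands in for the LSTM, and the 27-character vocab and all names are made up):

    import torch
    import torch.nn.functional as F

    V, H = 27, 32
    embed = torch.nn.Embedding(V, H)
    cell = torch.nn.GRUCell(H, H)
    head = torch.nn.Linear(H, V)

    # Training: at each of the 10 steps the loss is the (negative log)
    # probability assigned to the actual next character -- differentiable,
    # so backprop through the unrolled steps is well-defined.
    text = torch.randint(V, (11,))           # dummy string of 11 character ids
    h = torch.zeros(1, H)
    loss = 0.0
    for t in range(10):
        h = cell(embed(text[t]).unsqueeze(0), h)
        loss = loss + F.cross_entropy(head(h), text[t + 1].unsqueeze(0))
    loss.backward()                          # a concrete gradient for every parameter

    # Generation + upvote: sample 10 characters, then receive one +1/-1 for
    # the whole string. That scalar lives outside the graph; there is no
    # defined gradient to push back through the discrete samples.
    h = torch.zeros(1, H)
    tok = text[:1]
    for t in range(10):
        h = cell(embed(tok), h)
        tok = torch.multinomial(F.softmax(head(h), dim=-1), 1).squeeze(1)
    upvote = 1.0                             # user feedback: just a number, nothing to .backward()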


I see. This is a good example showing the limitations of neural networks and LSTMs, thanks.





