
Sure, but I think the key difference is that the optimization problem in RL has a particularly intractable form. In a supervised image problem, you spit out a classification probability and the loss function is cross-entropy or something similar, which is smooth and differentiable, so you can do gradient descent over it no problem; for any sort of X->Y problem with a differentiable loss, you train a differentiable or convex model and minimize/maximize that loss. In an RL problem, you might get back only a 0/1 reward, or you might have to make many decisions in a row before any loss signal arrives at all. How do you maximize/minimize over an entire series of discrete actions with only a global loss?
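
For concreteness, here's a minimal sketch of the contrast, assuming PyTorch (my choice; the model and all shapes are toy placeholders, not from anything real):

    import torch
    import torch.nn.functional as F

    model = torch.nn.Linear(10, 3)               # toy classifier: 10 features -> 3 classes
    x, y = torch.randn(4, 10), torch.tensor([0, 2, 1, 0])

    # Supervised case: cross-entropy is smooth in the parameters, so autograd
    # hands back a gradient and SGD just works.
    loss = F.cross_entropy(model(x), y)
    loss.backward()                              # every parameter now has a .grad

    # RL-flavored case: sample a discrete action and receive a 0/1 reward.
    # The reward is a plain number produced outside the graph, so there is
    # nothing for autograd to differentiate through.
    probs = F.softmax(model(x[:1]), dim=-1)
    action = torch.multinomial(probs, 1)         # discrete, non-differentiable step
    reward = 1.0 if action.item() == 0 else 0.0  # arrives with no gradient attached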


I get the impression this won't be sufficient, but my first thought for such problems would be to consider Long Short-Term Memory (LSTM) networks, whose defining feature is the ability to learn long-term dependencies (remembering information for long periods is practically their default behavior).
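
If it helps make that concrete, a minimal usage sketch, assuming PyTorch; all sizes here are arbitrary:

    import torch

    lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
    seq = torch.randn(1, 100, 8)    # one sequence, 100 timesteps of 8 features
    out, (h, c) = lstm(seq)         # the cell state c is what lets an LSTM
                                    # carry information across long spans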

But I can also appreciate, from what I'm reading here, that RL puts front and center the actions/decisions taken to effect an outcome, something that isn't as easy to handle in a supervised setting.


LSTMs still need a differentiable loss, because you have to backpropagate a gradient through a long unrolled series of RNN timesteps. Conceptually, there's not much difference between an RNN which takes 10 inputs one step at a time and a single big 10-layer feedforward network which takes all 10 inputs at once. If you can't define the loss for the feedforward NN, you can't define it for the RNN either, and so you can't learn the parameters of the LSTM units.
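
To make the unrolling point concrete, a minimal sketch assuming PyTorch (a plain RNN cell stands in for the LSTM; all names and sizes are illustrative):

    import torch

    cell = torch.nn.RNNCell(4, 8)            # the same weights are applied at every step
    readout = torch.nn.Linear(8, 1)
    xs = torch.randn(10, 1, 4)               # 10 timesteps of input
    h = torch.zeros(1, 8)
    for x in xs:                             # manual unroll: to autograd this is
        h = cell(x, h)                       # just a 10-layer feedforward graph
    loss = (readout(h) - 1.0).pow(2).mean()  # differentiable loss at the end...
    loss.backward()                          # ...so the gradient flows back through all 10 steps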

An example here would be a char-RNN. It predicts one character at a time, as a log probability over possible characters, and the loss is the log probability it assigned to the character that actually appeared. Nice and differentiable, so you can take the char-RNN unrolled over 10 timesteps and, at each timestep, calculate the gradient to optimize the loss. This also gives you a generative model: sample a character based on the predicted probabilities. Now take the same char-RNN and redefine the loss as 'whether the user pushed upvote or downvote on the entire 10-character string generated'; you have the unrolled RNN which generated the full string, and you backpropagate... what? What is the gradient for each LSTM parameter, telling it how it should be tweaked to slightly increase/decrease the loss?
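
Here's that example as a sketch, assuming PyTorch (a GRU cell stands in for the LSTM, and the 27-character vocab and all names are made up):

    import torch
    import torch.nn.functional as F

    V, H = 27, 32
    embed = torch.nn.Embedding(V, H)
    cell = torch.nn.GRUCell(H, H)
    head = torch.nn.Linear(H, V)

    # Training: at each of the 10 steps the loss is the (negative log)
    # probability assigned to the actual next character -- differentiable,
    # so backprop through the unrolled steps is well-defined.
    text = torch.randint(V, (11,))           # dummy string of 11 character ids
    h = torch.zeros(1, H)
    loss = 0.0
    for t in range(10):
        h = cell(embed(text[t]).unsqueeze(0), h)
        loss = loss + F.cross_entropy(head(h), text[t + 1].unsqueeze(0))
    loss.backward()                          # a concrete gradient for every parameter

    # Generation + upvote: sample 10 characters, then receive one +1/-1 for
    # the whole string. That scalar lives outside the graph; there is no
    # defined gradient to push back through the discrete samples.
    h = torch.zeros(1, H)
    tok = text[:1]
    for t in range(10):
        h = cell(embed(tok), h)
        tok = torch.multinomial(F.softmax(head(h), dim=-1), 1).squeeze(1)
    upvote = 1.0                             # user feedback: just a number, nothing to .backward()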


I see. This is a good example showing the limitations of neural networks and LSTMs, thanks.





