If such labels are collected and used to retrain the model, then yes. But these models are not learning online.





ChatGPT came out with a chat box and thumbs-up / thumbs-down icons (or whatever they were) for rating responses; surely that created a feedback loop for learning, like machine learning systems have done for years now?

Really? Isn't that the point of RL as R1 used it?

Provide a cost function (vs labels) and have it argue itself to greatness as measured by that cost function? (Rough sketch below.)

I believe that's what GP meant by "respond", not telling GPT they were wrong.
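To make the "cost function vs labels" idea concrete, here's a rough sketch in Python: a toy REINFORCE-style loop, not R1's actual recipe. The three canned answers and their reward values are made up; the point is only that nothing in the loop needs a labelled "correct" answer, just something that can score outputs.

    # Toy sketch, not R1's actual training code: a REINFORCE-style update
    # driven by a reward function instead of labelled examples. The "policy"
    # is just a softmax over 3 canned answers; the reward values are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    logits = np.zeros(3)                  # policy parameters
    reward = np.array([0.1, 1.0, 0.3])    # stand-in scores per answer

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    for step in range(2000):
        probs = softmax(logits)
        a = rng.choice(3, p=probs)        # sample an "answer"
        grad = -probs                     # d log pi(a)/d logits = one_hot(a) - probs
        grad[a] += 1.0
        logits += 0.1 * reward[a] * grad  # push up whatever scored well

    print(softmax(logits))                # probability mass ends up on answer 1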


That is still inference. It is using a model produced by the RL process. The RL process is what used the cost function to add another layer of refinement to the model. Any online/continual learning would have to be performed by a different algorithm than classical LLM training or RL fine-tuning. You can think of RL as a revision, but it still happens offline. Online/continual learning is still a very difficult problem in ML.
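For what it's worth, the offline loop being described looks roughly like this (every name here is an illustrative stand-in, and the version bump stands in for a real retraining run):

    # Sketch of the offline loop described above (everything is a stand-in):
    # the deployed model is frozen at inference time; thumbs-up/down only gets
    # logged, and a later, separate training run folds it back in.
    frozen_weights = {"version": 1}       # stand-in for the deployed model
    feedback_log = []                     # where thumbs up/down lands

    def infer(prompt):
        # serving path: reads the weights, never updates them
        return f"answer to {prompt!r} from model v{frozen_weights['version']}"

    def record_feedback(prompt, answer, thumbs_up):
        feedback_log.append((prompt, answer, thumbs_up))   # no learning happens here

    def offline_retrain():
        # runs later as a batch job and ships a *new* model version;
        # this is the "collect labels and retrain" step
        frozen_weights["version"] += 1
        feedback_log.clear()

    ans = infer("2+2?")
    record_feedback("2+2?", ans, thumbs_up=True)
    print(infer("2+2?"))      # still v1: feedback alone changed nothing
    offline_retrain()
    print(infer("2+2?"))      # v2 only exists after the offline job ran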

Yes, that makes sense. We're both talking about offline learning.


