That is still inference: it is using a model produced by the RL process. The RL process is what used the cost function to layer another round of training onto the model. Any online/continual learning would have to use a different algorithm than classical LLM training or RL. You can think of RL as a revision step, but it still happens offline; online/continual learning remains a very difficult problem in ML.
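To make the distinction concrete, here is a toy sketch of that offline-then-frozen split. Everything in it is made up for illustration (a two-response "policy", a hand-written cost function, a REINFORCE-style update); it is not how any real LLM RL pipeline is implemented, just the shape of the idea: the cost function is only consulted during the offline training phase, never at inference time.

```python
import math
import random

random.seed(0)

# A toy "policy": unnormalized log-preferences over two canned responses.
logits = {"good_answer": 0.0, "bad_answer": 0.0}

def softmax(ls):
    m = max(ls.values())
    exps = {k: math.exp(v - m) for k, v in ls.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def cost(action):
    # Stand-in cost function (lower is better): no labels, just a score.
    return 0.0 if action == "good_answer" else 1.0

# --- Offline "RL" phase: the cost function drives weight updates. ---
lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    action = random.choices(list(probs), weights=probs.values())[0]
    reward = -cost(action)  # reward is just negated cost
    # REINFORCE-style update: nudge log-probs in the direction of reward.
    for a in logits:
        grad = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += lr * reward * grad

# --- Inference phase: weights are frozen; the cost function is never called. ---
def infer():
    probs = softmax(logits)
    return max(probs, key=probs.get)

print(infer())  # the trained policy now prefers "good_answer"
```

An online/continual learner would keep running the update loop while serving requests; here, as in standard LLM deployment, training and serving are two separate phases.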
Provide a cost function (vs labels) and have it argue itself to greatness as measured by that cost function?
I believe that's what GP meant by "respond": providing a cost function, not telling GPT it was wrong.