You don't need honest user feedback to grade a model's replies, because you can judge any message in a conversation using hindsight.
Just ask an LLM to judge whether a response was useful while letting it see the messages that came after it. The judge model has privileged information: maybe five messages later it turns out the reply was a bad idea.
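Here is a minimal sketch of what that could look like, assuming an OpenAI-compatible chat API; the judge model name, the rubric prompt, and the five-message lookahead are all placeholder assumptions, not a fixed recipe:

```python
# Hindsight judging sketch: score one reply using the messages that followed it.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading one assistant reply inside a conversation.
You can see the messages that came AFTER it (hindsight the assistant did not have).
Did the reply turn out to be useful? Answer with a single score from 1 to 5."""


def judge_with_hindsight(conversation: list[dict], reply_index: int,
                         lookahead: int = 5) -> str:
    """Score conversation[reply_index] using up to `lookahead` later messages."""
    history = conversation[: reply_index + 1]  # what the model saw, plus its reply
    future = conversation[reply_index + 1 : reply_index + 1 + lookahead]

    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    hindsight = "\n".join(f"{m['role']}: {m['content']}" for m in future)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content":
                f"Conversation so far:\n{transcript}\n\n"
                f"What happened afterwards:\n{hindsight}\n\n"
                f"Score the last assistant reply above."},
        ],
    )
    return response.choices[0].message.content
```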
You can also pull in related conversations from the same user. The idea is to extend the context so the judge can grade better. Sometimes the user tests the LLM's ideas in the real world and comes back with feedback in a later conversation; that is real-world testing, something R1 can't do.
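A hedged sketch of that context extension, assuming conversations are stored as dicts with user_id, started_at, and messages fields (all assumed names); the output can be fed straight into judge_with_hindsight above:

```python
# Extend the judge's context with the same user's nearby conversations,
# so follow-ups like "I tried your suggestion and..." become visible.
from datetime import timedelta


def extended_context(all_conversations: list[dict], target: dict,
                     window: timedelta = timedelta(days=7)) -> list[dict]:
    """Return target's messages plus messages from the same user's
    other conversations within the time window."""
    related = [
        c for c in all_conversations
        if c["user_id"] == target["user_id"]
        and abs(c["started_at"] - target["started_at"]) <= window
        and c is not target
    ]
    # Flatten the related conversations, oldest first, into extra hindsight.
    extra = [m for c in sorted(related, key=lambda c: c["started_at"])
             for m in c["messages"]]
    return target["messages"] + extra
```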
Tesla uses the same method to flag the seconds leading up to a surprising event. It works because the system has hindsight: it uses the environment itself to learn which moments were important.