scarmig 7 days ago | on: An analysis of DeepSeek's R1-Zero and R1

Probably it's something like "give feedback that's on average slightly more correct than incorrect," though you'd get more signal from perfect feedback. That said, I suspect the signal is very weak even today and probably not too useful except for learning about human stylistic preferences.
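A rough way to see why feedback that is only slightly more correct than incorrect can still carry some signal, and why perfect feedback carries more: the sketch below assumes each piece of feedback is an independent binary judgment that is right with probability `p`. The independence assumption, the function name `majority_correct_prob`, and the value `p = 0.55` are all illustrative choices of mine, not anything stated in the thread or known about how the feedback is actually used.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """Exact probability that a majority of n raters is correct.

    Assumption (hypothetical model): each rater is independently
    correct with probability p. n must be odd so ties cannot occur.
    """
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    # Sum the binomial probabilities of getting more than n/2 correct votes.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    # Compare barely-better-than-chance feedback with perfect feedback.
    for n in (1, 11, 101):
        noisy = majority_correct_prob(0.55, n)
        perfect = majority_correct_prob(1.0, n)
        print(f"n={n:4d}  p=0.55 -> {noisy:.3f}   p=1.00 -> {perfect:.3f}")
```

Under this toy model, the noisy signal only becomes reliable after aggregating many judgments, while perfect feedback is reliable from a single one. If real feedback errors are correlated rather than independent, the aggregate would presumably improve more slowly than this suggests.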