nlpnerd 7 days ago \| on: An analysis of DeepSeek's R1-Zero and R1

That is a slight exaggeration, or rather an extrapolation, on the author's part. What happened was that RL training led to some emergent behavior in R1-Zero (chain-of-thought and reflection) without the model being explicitly prompted or trained for it. I don't see what is so domain-specific about that, though.