it seems to work and seems very scalable. "reasoning" helps to counter biases: answers become longer, i.e. the system uses more tokens, which means more time to answer a question. longer answers likely allow better differentiation of answers from each other in the "answer space"
https://newsletter.languagemodels.co/i/155812052/large-scale...
also from the posted article:
"""
The R1-Zero training process is capable of creating its own internal domain specific language (“DSL”) in token space via RL optimization.
This makes intuitive sense, as language itself is effectively a reasoning DSL.
"""