it seems to work and seems very scalable. "reasoning" helps to counter biases: answers become longer, i.e. the system uses more tokens, which means more time to answer a question. longer answers likely allow better differentiation of answers from each other in the "answer space"
https://newsletter.languagemodels.co/i/155812052/large-scale...
also from the posted article:
"""
The R1-Zero training process is capable of creating its own internal domain specific language (“DSL”) in token space via RL optimization.
This makes intuitive sense, as language itself is effectively a reasoning DSL.
"""