In the case of DeepSeek-R1, the authors used a set of rule-based reward functions built for different data types. For example, the paper mentions using sandboxed environments to execute generated code against a suite of tests in order to evaluate correctness. Other reward functions scored syntax and formatting.
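The paper doesn't publish the actual reward code, but a minimal sketch of this kind of rule-based reward might look like the following. The tag layout, the helper names, and the 50/50 weighting at the end are my own illustrative assumptions, and the subprocess here is only a stand-in for a proper sandbox:

```python
import os
import re
import subprocess
import sys
import tempfile


def format_reward(completion: str) -> float:
    """1.0 if the completion follows a <think>...</think><answer>...</answer>
    layout, else 0.0 (a simple rule-based formatting check)."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0


def code_correctness_reward(code: str, tests: str, timeout: float = 5.0) -> float:
    """Run the generated code plus a test suite in a subprocess and return
    1.0 only if every assertion passes. NOTE: a bare subprocess is not a
    real sandbox; a production system would isolate execution properly
    (containers, VMs, etc.)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,  # kill non-terminating programs
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)


if __name__ == "__main__":
    completion = (
        "<think>Add the two inputs.</think>"
        "<answer>def add(a, b):\n    return a + b</answer>"
    )
    code = completion.split("<answer>")[1].split("</answer>")[0]
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    # Hypothetical combination: an equal-weighted sum of the two signals.
    reward = 0.5 * format_reward(completion) + 0.5 * code_correctness_reward(code, tests)
    print(f"reward = {reward}")
```

The appeal of this setup is that the reward is computed, not learned: the tests either pass or they don't, so there's no reward model for the policy to exploit.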
In general, the use of externally verifiable sources of truth (like simulators) is referred to as "grounding," and there has been quite a bit of research on it over the years if you're interested in digging deeper. I've always found it a compelling research direction.