The claim is that this removes the human bottleneck (aka SFT or supervised fine ...

aithrowawaycomm · 2025-01-29T19:35:00 1738179300

It's also extremely hard to nail down in much of mathematics or computer science!

- is such-and-such theorem deep or shallow?

- is this definition/axiom useful? (there's a big difference between doing compass-straightedge proofs vs. wondering about the parallel postulate)

- more generally, discovering theorems is generally not amenable to verifiable rewards, except in domains where simpler deterministic tools exist (in which case LLMs can likely help reduce the amount of brute forcing)

- is this a good mathematical / software model of a given real-world system?

- is the flexibility of dynamic/gradual typing worth the risk of type errors? is static typing more or less confusing for developers?

- what features should be part of a programming language's syntax? should we opt for lean-and-extensible or batteries-included?

- are we prematurely optimizing this function?

- will this program's memory needs play nicely with Rust's memory model? What architectural decisions do we need to make now to avoid headaches 6 months down the line?

Davidzheng · 2025-01-29T21:19:57 1738185597

Not clear to me that theorem discovery is not amenable to verifiable rewards. I think most important theorems probably are recovered automatically by asking AI systems to proof increasing complicated human conjectures. Along the way I expect emergent behaviors of creating conjectures and recognizing important self-breakthroughs. Much like regret emergence

youoy · 2025-01-29T21:34:52 1738186492

Theorems discovery is amenable to verifiable rewards. But is meaningful theorems discovery too? Is the ability to discern between meaningful theorems and bad ones an emergent behaviour? You can check for yourself examples of automatic proofs, and the huge amount of intermediate theorems that they can generate which are not very meaningful.

janalsncm · 2025-01-30T17:48:55 1738259335

Unless you can quantify what you mean by “meaningful” then it won’t be possible. It can’t read your mind.

janalsncm · 2025-01-30T17:47:04 1738259224

For questions with a correct answer, you don’t need to verify the reasoning process. RL training will discover it. That’s R1-Zero.

The point of R1 was to fix problems with the reasoning tokens and expand to subjective domains like creative writing.

nextos · 2025-01-29T18:44:46 1738176286

IMHO, there are strategies that could extend this approach to many other domains.

I was discussing this idea (along with a small prototype) with a prominent symbolic AI researcher who also agrees, and thinks that with the emergence of RL as a viable training method for LLMs, it might be possible to pursue neuro-symbolic learning at a large scale.

Current systems are impressive, but reasoning is too fragile to trust them. They fall into obvious logical and statistical fallacies that are evident to a layperson.

gadtfly · 2025-01-29T18:42:53 1738176173

Reasoning transfers across domains.

Philpax · 2025-01-29T19:40:15 1738179615

See https://www.interconnects.ai/p/why-reasoning-models-will-gen... for more information.

Onavo · 2025-01-29T18:12:08 1738174328

By verifiable do they mean it in the complexity theory P/NP sense of the word?

calebkaiser · 2025-01-29T20:56:26 1738184186

In the case of DeepSeek-R1, they used a series of heuristic reward functions that were built for different data types. The paper mentions the use of sandboxed environments to execute generated code against a suite of tests, for example, to evaluate it for correctness. The reward functions also evaluated syntax and formatting.

In general, the use of externally verifiable sources of truth (like simulators) is referred to as "grounding" and there has been quite a bit of research around it over the years, if you're interested in digging deeper. I've always found it super compelling as a research direction.

ks2048 · 2025-01-29T18:27:19 1738175239

I think it just means that you can objectively score an answer as being correct or not. (e.g. if the generated program passes some tests; a discovered proof is valid, etc).

logicchains · 2025-01-29T18:28:47 1738175327

As in there's an objective truth that can be determined by a computer. E.g. whether code compiles, whether a unit test passes, whether the answer given to a mathematical question like 3+5 is correct. Many other fields have no objective truth (like art or creative writing), or objective truth requires measurement of the physical world (although if the world can be simulated accurately enough for the problem class at hand, then sufficient training data can still be generated by a computer).

bpfrh · 2025-01-29T18:35:51 1738175751

Isn't "code compiles" an insufficient criteria?

e.g you would need to prove that for all inputs the code produces the correct output which would in turn make the problem way more complex

pressbuttons · 2025-01-29T18:44:56 1738176296

Not if the problem as written is "does this code compile", which is still a useful stepping stone for some workflows. Yours is certainly a more useful query in most cases but repositioning or re-scoping the original question can still lead to a net win.

logicchains · 2025-01-29T18:48:25 1738176505

It's not a sufficient criteria by itself, but where no better criteria is possible it would still produce better results in reinforcement learning than if the model has no reward for producing correctly compiling code vs code that failed to compile.

drdeca · 2025-01-29T19:34:58 1738179298

The other replies have said what was meant, but I don’t think they’ve explicitly addressed whether or not that is the sense used in the idea of NP.

I would say… it is at least somewhat similar.

A problem in NP might be of the form “For this value of X, does there exist a Y such that q(X,Y)?” for some predicate q and value X, and where when the answer is “yes”, the answer of “yes” can be verified by being given a value Y, and evaluating q(X,Y). (Specifically in the case of 3SAT, X would be a 3CNF formula, Y would be an assignment of values to the variables in the formula, and q(X,Y) would be “the formula X when evaluated with variable assignments Y, results in 'true’.”.)

This is sort of like the task of “Given requirements X that can be checked automatically, produce code Y which satisfies those requirements”, except that in this case the question is specifically asking for Y, not just asking whether such a Y exists, but.. well, often in practice when one wants a solution to a problem in NP, one actually wants the witness, not just whether there exists such a Y, right?

So, I would say there is a substantial similarity, but also a difference.

HarHarVeryFunny · 2025-01-29T21:39:34 1738186774

For some reasoning data (e.g. you talking out loud as you figure something out, mistakes and all) to be useful for RL training, the conclusion to your reasoning needs to be correct/verified, else that's not the kind of reasoning you want to learn!

Some types of reasoning output, such as solving a math problem or writing a computer program can be automatically verified (e.g. respectively by a symbolic solver, or by compiling and running the program), but in the general case it's hard for a computer to verify whether a chain of reasoning is correct and arrived at a valid answer or not, although LLM-as-judge should work some of the time.

pertymcpert · 2025-01-29T18:27:53 1738175273

They mean that the solutions can be verified to be correct in a binary sense. E.g. a coding solution passes all the unit tests vs writing poetry.

sgt101 · 2025-01-29T20:11:16 1738181476

There's a big difference. The membership of these classes is determined in the worst case - so if there is no polynomial time solution in the worst case then it's NP.

For this problem we don't care if it's possible that sometimes there are things that aren't verifiable, or the answers aren't exact, we just need training signal.

wildermuthn · 2025-01-30T14:58:23 1738249103

This feels quite close to the definition of the singularity; if an LLM can become both the Generator and the Discriminator (to use a GAN analogy), then we have takeoff.

nycdatasci · 2025-01-31T04:23:41 1738297421

Really? What about drug development? Protein folding?