RLHF seems to me more like a PR play than anything else, but insofar as it does anything useful, adding a second LLM to influence the human that's influencing the LLM doesn't solve any of the fundamental problems of either system. If anything it muddies the waters further, because we have already seen that humans are probably too credulous of the information presented to them by these models. If you want adversarial learning, there are far more efficient ways to do it. If you want human auditing, the best case here is that the second LLM doesn't influence the human's decisions at all (because any influence reduces the degree to which this is independent feedback).
This is kind of what I was thinking. I don't get it. It seems like CriticGPT was maybe trained using RM/RL with PPO as well? So there are going to be mistakes in what CriticGPT pushes back on, which may make the labeler doubt themselves?
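(For anyone trying to parse the "RM/RL with PPO" bit: below is a toy, purely illustrative sketch of that kind of loop, where a reward model scores a sampled completion and a clipped PPO-style update nudges the policy. All the names, shapes, and the omitted value/KL terms are my simplifications, not anything from the paper.)

    # Toy illustration only: one clipped-PPO-style update driven by a reward
    # model's score. Names (policy, reward_model, etc.) are hypothetical
    # stand-ins; real pipelines also use a value function and a KL penalty.
    import torch
    import torch.nn.functional as F

    vocab, hidden = 100, 32
    policy = torch.nn.Sequential(torch.nn.Embedding(vocab, hidden),
                                 torch.nn.Linear(hidden, vocab))
    ref_policy = torch.nn.Sequential(torch.nn.Embedding(vocab, hidden),
                                     torch.nn.Linear(hidden, vocab))
    ref_policy.load_state_dict(policy.state_dict())   # frozen "old" policy
    reward_model = torch.nn.Sequential(torch.nn.Embedding(vocab, hidden),
                                       torch.nn.Linear(hidden, 1))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

    tokens = torch.randint(0, vocab, (8,))             # a "sampled completion"
    reward = reward_model(tokens).mean().detach()      # scalar score from the RM

    # log-probs of the sampled tokens under the new and the frozen policy
    logp_new = F.log_softmax(policy(tokens), dim=-1).gather(-1, tokens.unsqueeze(-1)).sum()
    logp_old = F.log_softmax(ref_policy(tokens), dim=-1).gather(-1, tokens.unsqueeze(-1)).sum().detach()

    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 0.8, 1.2) * reward
    loss = -torch.min(ratio * reward, clipped)          # PPO clipped surrogate
    opt.zero_grad(); loss.backward(); opt.step()

The point of the sketch is just that the critic model's own training is noisy in the same way, so its pushback inherits the same failure modes the labeler is supposed to be catching.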
This is not adversarial learning. It's really about augmenting the ability of humans to determine if a snippet of code is correct and write proper critiques of incorrect code.
Any system that helps you more accurately label data with good critiques should help the model. I'm not sure how you come to your conclusion. Do you have some data indicating that, even with improved accuracy, some LLM bias would lead to a worse-trained model? I haven't seen that data or assertion elsewhere, but that's the only thing I can gather you might be referring to.
Well, first of all, the stated purpose of RLHF isn't to "improve model accuracy" in the first place (and what we mean by accuracy here is pretty fraught by itself, as it could mean at least three different things). They initially pitched it as a "safety" measure (and I think that if it wasn't immediately obvious how nonsensical a claim that is, it should at least be apparent now that the company has shed nearly the entire subset of its members who claimed to care about "AI safety" that this is not a priority).
The idea of RLHF as a mechanism for tuning models based on the principle that humans might have some hard-to-capture insight that could steer them independently of the way they're normally trained is the best steelman for its value I could come up with. That aim is directly subverted by trying to use another language model to influence the human rater, so from my perspective it really brings us back to square one on what the fuck RLHF is supposed to be doing.
Really, a lot of this comes down to what these models do versus how they are being advertised. A generative language model produces plausible prose that follows from the prompt it receives. From that, the claim that it should write working code is actually quite a bit stronger than the claim that it should write true facts, because plausible autocompletion will learn to mimic syntactic constraints but has very little to do with whether something is true, or with whatever proxy or heuristic we apply in place of "true" when assessing information (supported by evidence, perhaps; logically sound, perhaps; the distinction between "plausible" and "true" is in many ways the whole point of every human epistemology). Like if you ask something trained on all human writing whether the Axis or the Allies won WWII, the answer will depend on whether you phrased the question in a way that sounds like something Philip K. Dick would write. This isn't even incorrect behavior by the standards of the model, but people want to use these things as some kind of oracle, or to replace Google search, or whatever, which is a misconception about what the thing does, and one that's very profitable for the people selling it.
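(A concrete way to see the "plausible vs. true" point: the same factual question, framed differently, shifts the next-token distribution. Rough sketch using GPT-2 via the transformers library; the prompts and candidate words are just illustrative, the exact numbers depend on the model, and only the first subword of each candidate is checked.)

    # Rough sketch: prompt framing shifts a language model's next-token
    # probabilities, independent of what is actually true.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def next_token_probs(prompt, candidates):
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        # probability of the first subword of each candidate as the next token
        return {c: probs[tok.encode(" " + c)[0]].item() for c in candidates}

    print(next_token_probs("World War II was won by the", ["Allies", "Axis"]))
    print(next_token_probs("In the grim alternate-history novel, World War II was won by the",
                           ["Allies", "Axis"]))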
RLHF worked well for Midjourney, but I think that's because it is outsourcing something that is ultimately completely subjective and very subtle, like human visual aesthetic choice, which can't be "wrong".
I tried to understand the paper and I can't really make sense of it for "code".
It seems like this would inherit a subtler version of all the problems from expert systems.
A press release like this does feel rather AI-bubbly. Not quite "Why the Future Doesn't Need Us" level, but I think we are getting close.