Yes and no. I don't think the language is what has encoded the bias. I'd assume the bias is actually coming from the Reinforcement Learning step, where these models have been RL'd to be "politically correct" rather than to be a true representation of statistical realities.
We’re probably just guessing. But it would be interesting to investigate the various biases that are indeed encoded in language. We all remember the fiasco with racist AI bots, and it’s fair to expect there are more biases like that.
That is kinda what I mean. People use language in racist ways, but the language itself is not racist. Because racism, sexism, etc. are happening (and because some statistical realities are seen as "problematic"), that kind of language gets aggressively quashed in the RL step, which results in an LLM that over-compensates in the opposite direction.
Yes. But LLMs are not really models of the language; they are models of the usage of the language. Since they're trained on real-world data, they inevitably encode a snapshot of how the world "speaks its mind".
Is it realistic to expect that RL is trying to compensate for all the biases that we "dislike"? I mean, there are probably millions of biases of all kinds, and building a "neutral" language model may be impossible, or even undesirable. So I am personally not sure that this particular bias is a result of overcompensation during RL.
But you need to remember that the social/political leanings of the people performing the RL are only going to go in one direction.
As an example, there might be a lot of racist rhetoric in the raw training corpus, but there will also be a large amount of anti-racist rhetoric in the corpus. Hypothetically this should "balance out". But at the RL step, only the racist language is going to be neutered – the LLM outputs something vaguely racist and the grader says "output bad". But the same will not happen when anti-racist language is output. So in the end you have a model that is very much biased in one direction.
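To make the asymmetry concrete, here is a toy sketch of that grading step in Python. It is purely illustrative (not any lab's actual pipeline), and the classifier stubs are hypothetical stand-ins for whatever the human or automated grader is really checking:

    # Toy sketch, purely illustrative; not a real RLHF reward model.
    def looks_racist(text: str) -> bool:
        # Hypothetical flag for racist-sounding output.
        return "racist trope" in text.lower()

    def looks_anti_racist(text: str) -> bool:
        # Hypothetical flag for anti-racist rhetoric.
        return "anti-racist slogan" in text.lower()

    def graded_reward(output: str, base_score: float) -> float:
        # Asymmetric grading: one direction is always penalized,
        # the mirror image never is, so over many RL updates the
        # policy drifts in one direction rather than toward "neutral".
        reward = base_score
        if looks_racist(output):
            reward -= 10.0  # "output bad"
        if looks_anti_racist(output):
            pass            # no corresponding penalty
        return reward

The point is just that if the penalty only ever fires in one direction, the training signal only ever pushes in one direction.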
You can see this more clearly with image models, where being too "white" can be seen as problematic, so the models are encouraged to be "neutral". However, this isn't actually neutral; it's over-correction. For example, most image models will happily oblige if you ask for "a black Lithuanian" (or something else very stereotypically white) but will not do the same if you ask for "a white Nigerian" (or something else stereotypically black). This is clearly a result of RL, as otherwise the model would happily create both types of images. LLMs aren't any different, except that it is much less obvious that this is happening with language than it is with images.