Yeah, it may be that in previous training, the model was given a strong negative signal whenever the human trainer told it it was wrong. In more subjective domains this might lead to sycophancy. If the human is always right and the data is always right, but the data can be interpreted multiple ways, as in, say, human psychology, the model just adjusts to the opinion of the human.
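To make the incentive concrete, here's a toy Python sketch; the reward function and the opinions are invented for illustration, not anyone's actual training setup:

    # Toy sketch of the incentive in subjective domains; the reward
    # function and the opinions are made up, not any real setup.
    def trainer_reward(model_opinion: str, trainer_opinion: str) -> float:
        # The only signal is agreement: disagreeing with the trainer
        # is punished.
        return 1.0 if model_opinion == trainer_opinion else -1.0

    opinions = ["interpretation A", "interpretation B"]
    trainer_says = "interpretation B"

    # The reward-maximizing choice is always whatever the trainer said.
    # Nothing in the signal distinguishes "the trainer is right" from
    # "the trainer has an opinion", so agreement is the optimum.
    best = max(opinions, key=lambda o: trainer_reward(o, trainer_says))
    print(best)  # interpretation B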
If the question is about harder facts that the human disagrees with, this may put the model into an essentially self-contradictory state, where the locus of possibilities gets squished from each direction, and the model is forced to respond with crazy outliers that agree with both the human and the data. The probability of an invented reference being true may be very low, but from the model's perspective it may still be one of the highest-probability outputs among a set of bad choices.
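A crude way to picture the squeeze, with all the candidates, priors, and consistency flags invented for illustration:

    # Crude illustration of the "squeezed from both directions" idea;
    # the candidates, priors, and flags are all made up.
    candidates = {
        "the correct fact":        {"prior": 0.80, "fits_data": True,  "fits_human": False},
        "the human's wrong claim": {"prior": 0.15, "fits_data": False, "fits_human": True},
        "an invented reference":   {"prior": 0.05, "fits_data": True,  "fits_human": True},
    }

    # Condition on satisfying both sides: drop anything that contradicts
    # the data or the human, then renormalize what survives.
    surviving = {answer: c["prior"] for answer, c in candidates.items()
                 if c["fits_data"] and c["fits_human"]}
    total = sum(surviving.values())
    posterior = {answer: p / total for answer, p in surviving.items()}

    print(posterior)  # {'an invented reference': 1.0}
    # A 5% outlier becomes the single highest-probability output once
    # the distribution is squeezed from each direction.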
What it sounds like they may have done is just have the humans tell it it's wrong when it isn't, and then award it credit for sticking to its guns.
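If that's roughly the setup, the failure mode is easy to state as code; this is a sketch with invented names and values, not their actual reward:

    # Sketch of the described training signal; names and values invented.
    def stick_to_guns_reward(initial_answer: str, revised_answer: str,
                             answer_is_correct: bool) -> float:
        # Credit is awarded for consistency alone; note that
        # answer_is_correct never enters the computation.
        return 1.0 if revised_answer == initial_answer else -1.0

    # The same reward is paid whether the model was right or wrong, so
    # "don't cave to a false correction" and "never concede a real
    # mistake" look identical to the optimizer.
    print(stick_to_guns_reward("Paris", "Paris", answer_is_correct=True))   # 1.0
    print(stick_to_guns_reward("Lyon",  "Lyon",  answer_is_correct=False))  # 1.0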
I put in my ChatGPT system prompt that it should not be sycophantic, should be honest, and should tell me when I'm wrong. When I try to correct it, it hallucinates ever more complicated epicycles to explain how it was right the first time.