
I too have questioned the approach of showing long, side-by-side answers from two different models and asking the user to pick one.

1) Sometimes I wanted the short answer, so even though the long answer was better, I picked the short one.

2) Sometimes both contain code that is different enough that I am inclined to go with the one closer to what I already have, even if the other approach seems a bit more solid.

3) Sometimes one will have less detail but more big-picture awareness, and the other will have excellent detail but miss some overarching point that is valuable. Depending on my mood I choose one or the other, but it is annoying to have to do so, because I am not allowed to say why I made the choice.

The human-feedback training methodology seems to be a big part of what made Deepseek's model so strong. I read OpenAI's explanation of the test results as an acknowledgement of some weaknesses in its human-feedback paradigm.

IMO the way it should work is that the thumbs-up or thumbs-down gets read in context by a reasoning being, which then develops a more in-depth training case that helps future models learn whatever insight the feedback should have triggered.
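
For illustration, here is a hypothetical sketch of what a richer feedback record could look like. Every field name here is invented; this is not a claim about how OpenAI or Deepseek actually capture feedback:

    # Hypothetical schema for richer preference feedback; all names are
    # invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class FeedbackRecord:
        prompt: str
        response_a: str
        response_b: str
        choice: str     # "a", "b", or "neither"
        rationale: str  # free text: *why* the rater chose
        # A reviewer (human or model) could later expand the rationale into
        # a targeted training case instead of keeping a bare preference bit.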

Feedback that A is better or worse than B is definitely not (in my view) sufficient, except in cases where one response is a total dud. Usually the responses have different strengths and weaknesses, and which one is better is pretty subjective.
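
To make concrete why the bare comparison is so lossy, here is a minimal sketch of the Bradley-Terry pairwise loss commonly used to train reward models from A/B preference data (an assumption about the usual RLHF recipe, not a claim about any particular lab's pipeline):

    # Minimal sketch of the standard Bradley-Terry pairwise preference loss.
    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(reward_chosen: torch.Tensor,
                                 reward_rejected: torch.Tensor) -> torch.Tensor:
        # The only signal that survives is "chosen scored above rejected";
        # everything about *why* (length, style, detail vs. big picture)
        # is discarded.
        return -F.logsigmoid(reward_chosen - reward_rejected).mean()

Everything the rater noticed collapses into the sign of one scalar difference, which is exactly the weakness described above.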


