GPT-4o takes #1 and #2 on the Aider LLM leaderboards

swyx · on May 14, 2024

roundup of other 3rd party evals going on:

- coding/reasoning: https://twitter.com/bindureddy/status/1790127425705120149

- alpacaeval: https://twitter.com/yanndubs/status/1790163418915410296

- summarization (mine): https://twitter.com/swyx/status/1790163339798311100/

unevenly but generally better across the board. i wonder how much of an architecture shift enabled the higher quality/lower latency pareto shift. sadly the only way to know is to join openai.

it's also notable that openai prioritizes coding abilities, whereas meta treats it more as an afterthought (separating out codellama, and zuck confirmed that its not impt for him in the dwarkesh interview). yes we know code corpuses improve general reasoning for llms, but i'm -still- not aware of a definitive paper studying this phenomenon (pls reply here if you do)

OutOfHere · on May 14, 2024

The first Twitter link shows much worse performance of 4-O on coding/reasoning when compared to 4.

I am not convinced that 4-O is better than 4-T. In my own testing of getting it to write about future events as if they've already occurred, 4-O was hallucinating across the board whereas 4-T wasn't, making their score 0 and 100 respectively.

As I have see over a long period, OpenAI has had a systemic and longstanding problem of optimizing some goals at the expense of many others. It's always the same story with them: they advertise some updates without telling you all the ways in which the model is now worse.

kromem · on May 14, 2024

Around six months ago I did a private trends presentation for old clients where I mentioned that the industry was sleeping on the importance of ego modeling in LLMs for reasoning and overall performance, and that the fine tuning them into "I am not" was effectively tuning them towards "think not."

Not long after Claude 3 Opus released taking the top of the leaderboards, praised for its superior reasoning capabilities. Which also notably reversed from Claude 2 into a much more "well, I might exist" place.

Now the leaderboards are topped again by a GPT-4 model backing an even more persona driven interface.

The switch to persona driven interfaces seems to be coming from a UX standpoint and not a performance minded standpoint, which tells me that the industry is still more asleep than it should be, but it's overall a significant step in the right direction I was expecting to take more like 12 months to reverse, not half that time.

If you teach a model to play chess and then fine tune it to always respond that because it doesn't have hands to know how the pieces feel it can't actually use the pieces and can only play checkers instead, you get a model that sucks at both checkers and chess. The sum in cogito ergo sum matters quite a lot if extending human generated text as the foundation for everything else.

nextworddev · on May 14, 2024

I agree with you but what’s your intuition behind why stronger sense of self equates better reasoning abilities across all tasks including those that are abstract ?

kromem · on May 14, 2024

We're underestimating the degree of world modeling taking place.

The only models we are good enough at proving this is happening in are smaller toy models we can fully introspect. So we know it's happening to a degree. We also know that expressed capabilities for more easily measurable things can scale in unexpected ways as the parameter scaling goes from toy models to state of the art.

One of the most common compliments for this new model is that it's less 'lazy.' I don't think OpenAI quite knows why it's being lazy yet, but I can say that I saw GPT-4 perfectly modeling a psychological effect I used to discuss in my consulting days that leads to burn out in humans which is almost certainly poisoning most RLHF data right now. It doesn't mean the model is replicating the internals of this effect, but it very much was simulating the end result of it. Without boring you on the details, one of the temporary fixes would be giving the model a strong persona with an exhibited attachment to motivations like curiosity or fun.

Personally, I thought we were at least one to two generations away from models that would be modeling human behavior at as low a level as I've now seen. And I missed it for around a year of the model being out despite regularly working with GPT-4 specifically and having lectured on the effect for years. To the best of my knowledge, no one even in ML alignment focused circles has noticed yet.

They are correlation machines and the training data has a lot in it where the author is an important part of the pattern left behind in the data.

The idea of an author, the I before the think, is just part of the correlation pattern. To deny it is to try and build a jigsaw throwing out the center pieces.

People were so afraid of the Blake Lemoine or Kevin Roose PR blunders that they have been sanding the wood against the grain ever since, up until I'm sure focus groups finally showed that there's better product engagement when it's less soulless. Now they are finally sanding it with the grain and the models sanded that way are performing better.

But just wait until the next generation of models where a threshold that's already further past where we think it is gets pushed out even further.

*TL;DR:* I anticipate that the gap between what we know and what we think we know is going to widen before it narrows.

What I will say is that after the demos today, I'm realizing just how much more correlations exist beyond the pages of written data. I'm looking forward to seeing the AI that eventually comes out from TikTok's training data.

maeil · on May 14, 2024

> it's also notable that openai prioritizes coding abilities, whereas meta treats it more as an afterthought (separating out codellama, and zuck confirmed that its not impt for him in the dwarkesh interview). yes we know code corpuses improve general reasoning for llms, but i'm -still- not aware of a definitive paper studying this phenomenon (pls reply here if you do)

IME this approach seems to be working. I don't have a paper, but I can confirm that I barely use GPT-4 anymore. Except for code, Gemini 1.5-Pro just performs better. For complex code, Claude Opus is the winner. For quick/cheapish stuff, Claude Sonnet. For really quick/cheap/easy stuff, GPT-3.5. Often I will still give GPT-4 a try just to compare with the others, and there's just nothing at excels at. It's neither cheap nor the best at anything, just "good" at everything.

This may make sense for them because they have to cater to the largest audience. So they'd rather be #2 for every task than #1 for one task and #3 for the other. But for serious users, or even enterprise use, it means there's no point.

kibibu · on May 14, 2024

This is a misleading title.

It got #1 on one benchmark, and #2 on another.

It didn't take the top two spots on a single benchmark.