I have to admit I'm kind of surprised by the SWE-bench results. At the highest level of performance, o3-mini's CodeForces score is, well, high. I've honestly never really sat down to understand how Elo works; all I know is that it scored better than o1, which is allegedly better than ~90% of all competitors on CodeForces. So, you know, o3-mini is pretty good at CodeForces.
But its SWE-bench scores aren't meaningfully better than Claude's: 49.3 vs Claude's 49.0 on the public leaderboard (might be higher now due to recent updates?)
My immediate thought: CodeForces (and competitive programming in general) is a poor proxy for performance on general software engineering tasks. Besides that, for all the work put into OpenAI's most recent model, it still has a hard time living up to an LLM Anthropic initially released some time ago, at least according to this benchmark.
Mind you, the GitHub issues that the problems in SWE-bench were based on have been around long enough that it's pretty much a given they've all found their way into the training data of most modern LLMs, so I'm really surprised that o3-mini isn't meaningfully better than Sonnet.
I'm not that surprised. Codeforces problems require relatively little vocabulary or domain knowledge.
Real software has a lot more complexity and constraints, as well as ambiguity. Claude scores nowhere near as high on Codeforces, but crushes o1 on webarena: https://web.lmarena.ai/leaderboard
I also ran a holdout test myself for o3-mini: I asked it to implement a function I need for Python 2.5. Claude and o1 get it; o3-mini incorrectly believes some functions are available in that version that aren't. If I correct it, its revised solution is very hacky (technically works, but I would take Claude's solution over it).
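(For context: the comment doesn't say which function was requested, so the snippet below is a hypothetical illustration of the kind of version mistake being described: a few standard-library features that only arrived after Python 2.5, next to their 2.5-safe spellings.)

    # Hypothetical examples only -- the original comment doesn't name the function.
    # These are standard-library features a model might assume exist on Python 2.5
    # but which only arrived later, with 2.5-safe alternatives.

    # str.format() was added in 2.6; on 2.5 you use %-formatting:
    def describe(name, count):
        # not 2.5-safe: "{0}: {1}".format(name, count)
        return "%s: %d" % (name, count)

    # the json module was added in 2.6; on 2.5 the usual fallback is the
    # third-party simplejson package (if it's installed at all):
    try:
        import json                  # 2.6+
    except ImportError:
        import simplejson as json    # common 2.5-era workaround

    # the next() builtin was added in 2.6; on 2.5 you call the iterator method:
    def first(iterable):
        return iter(iterable).next()   # Python 2 spelling of "get next item"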
> My immediate thought: CodeForces (and competitive programming in general) is a poor proxy for performance on general software engineering tasks
Yep. A general software engineering task has a lot of information encoded in it that is either already known to a human or is contextually understood by a human.
A competitive programming task often has to provide all the context, as it's not based on an existing product, codebase, technology, or paradigm known to the user.