I have to admit I'm kind of surprised by the SWE-bench results. At the highest level of performance, o3-mini's CodeForces score is, well, high. I've honestly never really sat down to understand how Elo works; all I know is that it scored better than o1, which is allegedly better than ~90% of all competitors on CodeForces. So, you know, o3-mini is pretty good at CodeForces.
But its SWE-bench scores aren't meaningfully better than Claude's: 49.3 vs Claude's 49.0 on the public leaderboard (might be higher now due to recent updates?)
My immediate thought: CodeForces (and competitive programming in general) is a poor proxy for performance on general software engineering tasks. Besides that, for all the work put into OpenAI's most recent model, it still has a hard time living up to an LLM Anthropic initially released some time ago, at least according to this benchmark.
Mind you, the GitHub issues that the problems in SWE-bench were based on have been around long enough that it's pretty much a given they've all found their way into the training data of most modern LLMs, so I'm really surprised that o3-mini isn't meaningfully better than Sonnet.
I'm not that surprised. Codeforces problems require relatively little vocabulary or domain knowledge.
Real software has a lot more complexity and constraints, as well as ambiguity. Claude scores nowhere near as high on Codeforces, but crushes o1 on webarena: https://web.lmarena.ai/leaderboard
I also ran a holdout test myself for o3-mini: I asked it to implement a function I need for Python 2.5. Claude and o1 get it; o3-mini incorrectly believes some functions are available in that version that aren't. If I correct it, its revised solution is very hacky (technically works, but I would take Claude's solution over it).
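(For context: the comment doesn't say which function was requested, so the snippet below is a hypothetical illustration of the kind of version mistake being described: a few standard-library features that only arrived after Python 2.5, next to their 2.5-safe spellings.)

    # Hypothetical examples only -- the original comment doesn't name the function.
    # These are standard-library features a model might assume exist on Python 2.5
    # but which only arrived later, with 2.5-safe alternatives.

    # str.format() was added in 2.6; on 2.5 you use %-formatting:
    def describe(name, count):
        # not 2.5-safe: "{0}: {1}".format(name, count)
        return "%s: %d" % (name, count)

    # the json module was added in 2.6; on 2.5 the usual fallback is the
    # third-party simplejson package (if it's installed at all):
    try:
        import json                  # 2.6+
    except ImportError:
        import simplejson as json    # common 2.5-era workaround

    # the next() builtin was added in 2.6; on 2.5 you call the iterator method:
    def first(iterable):
        return iter(iterable).next()   # Python 2 spelling of "get next item"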
> My immediate thought: CodeForces (and competitive programming in general) is a poor proxy for performance on general software engineering tasks
Yep. A general software engineering task has a lot of information encoded in it that is either already known to a human or is contextually understood by a human.
A competitive programming task often has to provide all the context, as it's not based on an existing product, codebase, technology, or paradigm known to the user.