I'm not that surprised. Codeforces problems require relatively little vocabulary knowledge.
Real software has a lot more complexity and constraints, as well as ambiguity. Claude scores nowhere near as high on Codeforces, but crushes o1 on webarena: https://web.lmarena.ai/leaderboard
I also ran a held-out test myself on o3-mini: I asked it to implement a function I need for Python 2.5. Claude and o1 get it; o3-mini-high incorrectly believes some functions are available in this version that aren't. If I correct it, its revised solution is very hacky (it technically works, but I would take Claude's solution over it).