almost as well as o3? kind of like gemini 2.5? I dug deeper and surprise surpris...

orbital-decay · 2025-06-28T11:08:56

Not everything that's written is worth reading, let alone drawing conclusions from. That benchmark produces a different tree each time the author runs it, which should tell you something about its reliability. It also groups grok-3-beta with gpt-4.5-preview in the GPT family, making the former appear to be trained on the latter. That doesn't make sense if you check the release dates. And it previously placed gpt-4.5-preview in a completely different branch from 4o (which does make some sense, but the placement has since changed).

EQBench, another "slop benchmark" from the same author, is equally dubious, as is most of his work, e.g. the antislop sampler, which tries to solve an NLP task in a programmatic manner.