I agree that these benchmarks don't mean as much anymore, since it's highly likely they were already present in the training set. But I also believe these tools will be significantly better within a few research cycles.
A significant share of bugs boil down to either 'a stupid mistake I didn't notice' or 'weird behaviour with a fix already described on SO, in the docs, or in a forum post'. Present-day LLMs are much better positioned than humans to solve these.
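To be concrete about the first category, here's an illustrative sketch (a hypothetical example, not from any real bug report): Python's classic mutable-default-argument mistake, whose fix is described in the official docs and countless SO answers, and which is exactly the kind of thing an LLM flags immediately.

    # "Stupid mistake I didn't notice": a mutable default argument.
    # The default list is created once, at function definition time,
    # so it is shared across every call that doesn't pass its own list.
    def add_tag(tag, tags=[]):   # buggy: tags persists between calls
        tags.append(tag)
        return tags

    print(add_tag("a"))  # ['a']
    print(add_tag("b"))  # ['a', 'b']  <- state leaked from the first call

    # The well-documented fix: default to None and create the list inside.
    def add_tag_fixed(tag, tags=None):
        if tags is None:
            tags = []
        tags.append(tag)
        return tags

    print(add_tag_fixed("a"))  # ['a']
    print(add_tag_fixed("b"))  # ['b']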