I had in mind the datasets of Easy2Hard-Bench that the study tested against: math competitions, math word problems, programming, chess puzzles, science QA, and commonsense reasoning.

The last problem like this that I asked an LLM to solve myself was to find the tax and base price of items on an invoice, given the total price and the tax rates. I couldn't make sense of the answer, but asking the LLM follow-up questions made me realize that I had framed the problem badly and, more to the point, that I didn't know how to ask. (Though the process also triggered a surprising ability of my own to dredge up and actually apply basic algebra.) I'm sure the issue is that I'm still learning what to ask and how to ask it.
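For what it's worth, here is roughly the algebra I ended up dredging up, as a minimal sketch. It assumes the simplest framing, which may not match every invoice: each item has a single tax rate and the total is tax-inclusive, so total = base * (1 + rate). The function name `split_total` and the sample figures are made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def split_total(total: str, rate: str) -> tuple[Decimal, Decimal]:
    """Split a tax-inclusive total into (base, tax).

    Assumes total = base * (1 + rate), i.e. one tax rate applied to the base.
    """
    total_d = Decimal(total)
    rate_d = Decimal(rate)
    # Solve total = base * (1 + rate) for base, then round to cents.
    base = (total_d / (1 + rate_d)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # Whatever is left of the total after the base is the tax.
    tax = total_d - base
    return base, tax


base, tax = split_total("107.00", "0.07")
print(base, tax)  # 100.00 7.00
```

Computing the tax as the remainder, rather than as base * rate rounded separately, keeps base and tax summing exactly to the total after rounding.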