
As a point of interest and for comparison: Gemini 2.5 Pro can generate a Python program that, when run, outputs the complete correct solution, but it can't solve the problem in one shot if asked directly.
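
For concreteness, here's a minimal sketch of the two prompting modes being compared, assuming the google-generativeai Python client; the API key, the "gemini-2.5-pro" model id string, and the puzzle text are all placeholders, since the thread doesn't specify the actual problem:

  import google.generativeai as genai

  genai.configure(api_key="YOUR_API_KEY")          # placeholder key
  model = genai.GenerativeModel("gemini-2.5-pro")  # model id is an assumption

  puzzle = "..."  # the actual test problem isn't given in the thread

  # Mode 1: ask for the final answer directly (the "one-shot" attempt).
  direct = model.generate_content(
      "Solve this and reply with only the final answer:\n" + puzzle)

  # Mode 2: ask for a Python program that prints the answer when run.
  program = model.generate_content(
      "Write a Python program that prints the solution to:\n" + puzzle)

  print(direct.text)   # compare against the known solution
  print(program.text)  # run the returned code separately and check its output

The point of the comparison is that the same model can fail in mode 1 while succeeding in mode 2.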

This is just a for-fun test to get a sense of how models are progressing; it highlights the jagged nature of their intelligence and capabilities. None of the big AI labs is testing for such a basic problem type, which makes it an interesting check.

I think it's still interesting to see how Grok 4 performs, even if we don't use this test to draw broader conclusions about its capabilities.


