IMO, it’s just a small-scale example of “training to the tests”: “count the ‘r’s in strawberry” became such a popular test that it made the news whenever a powerful model, advertised as the smartest model ever, couldn’t answer such a simple question correctly.
Treating this as an indicator of improved intelligence seems like a mistake (or wishful thinking).
If done at scale, they are kinda crowdsourcing the test set from the entire internet and the personal and business world. It will get harder and harder to pinpoint weaknesses, at least for the general public. It probably has little to do with intelligence (at least fluid intelligence as defined by Chollet et al.) - but I guess it is a sound tactic if the strategy is "fake it till you make it". And we might be surprised by how far that can go...