I've always been skeptical of benchmarking because of the memorization problem. To get around it, I recently made up my own (simple) date-reasoning benchmark, and found that GPT-4 Turbo actually outperformed GPT-4: https://open.substack.com/pub/talcai/p/making-up-a-new-llm-b...
I like the test, but do you take multiple samples per question? IMO, for a proper benchmark you should ask the same question 10+ times and show a confidence interval; otherwise you can't tell real ability from a lucky guess.
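To make that concrete, here's a rough sketch of the kind of interval I mean -- a Wilson score interval over repeated runs of a single question (the wilson_ci helper is just illustrative, not something from the article):

    import math

    def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
        """95% Wilson score interval for a pass rate over repeated runs."""
        p = successes / trials
        denom = 1 + z**2 / trials
        center = (p + z**2 / (2 * trials)) / denom
        margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
        return (max(0.0, center - margin), min(1.0, center + margin))

    # e.g. the model answered 7 of 10 repeated runs correctly
    low, high = wilson_ci(7, 10)
    print(f"pass rate 0.70, 95% CI [{low:.2f}, {high:.2f}]")

With only 10 runs the interval is wide (roughly [0.40, 0.89] here), which is exactly the point: a single run tells you very little.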
Ahh, good suggestion -- I should clarify this in the article. I tried to compensate with volume: I used a set of 200 questions for the testing. I was also running at temperature 0, so a single question would get the same answer across repeated runs.
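Schematically, the loop is just something like this (using the OpenAI Python client; the question format and the substring-match grading are simplified stand-ins, not my exact harness):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(question: str, model: str = "gpt-4-turbo") -> str:
        """One completion; temperature=0 means greedy decoding, so repeated
        calls usually (though not strictly guaranteed to) return the same text."""
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            temperature=0,
        )
        return resp.choices[0].message.content

    def run_benchmark(questions: list[tuple[str, str]]) -> float:
        """questions: (prompt, expected_answer) pairs; returns the pass rate."""
        correct = sum(expected in ask(prompt) for prompt, expected in questions)
        return correct / len(questions)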