I've always been skeptical of benchmarking because of the memorization problem. To get around it, I recently made up my own (simple) date-reasoning benchmark, and found that GPT-4 Turbo actually outperformed GPT-4: https://open.substack.com/pub/talcai/p/making-up-a-new-llm-b...
I like the test, but do you take multiple samples per question? IMO, for a proper benchmark you should ask the same question 10+ times and show a confidence interval; otherwise you can't tell real ability from a lucky guess.
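To make that concrete, here's a rough sketch of the kind of interval I mean -- a Wilson score interval over repeated runs of a single question (the wilson_ci helper is just illustrative, not something from the article):

    import math

    def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
        """95% Wilson score interval for a pass rate over repeated runs."""
        p = successes / trials
        denom = 1 + z**2 / trials
        center = (p + z**2 / (2 * trials)) / denom
        margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
        return (max(0.0, center - margin), min(1.0, center + margin))

    # e.g. the model answered 7 of 10 repeated runs correctly
    low, high = wilson_ci(7, 10)
    print(f"pass rate 0.70, 95% CI [{low:.2f}, {high:.2f}]")

With only 10 runs the interval is wide (roughly [0.40, 0.89] here), which is exactly the point: a single run tells you very little.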
Ahh, good suggestion -- I should clarify this in the article. I tried to compensate with volume: I used a set of 200 questions for the testing. I was also running at temperature 0, so a single question would get the same answer across repeated runs.
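Schematically, the loop is just something like this (using the OpenAI Python client; the question format and the substring-match grading are simplified stand-ins, not my exact harness):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(question: str, model: str = "gpt-4-turbo") -> str:
        """One completion; temperature=0 means greedy decoding, so repeated
        calls usually (though not strictly guaranteed to) return the same text."""
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            temperature=0,
        )
        return resp.choices[0].message.content

    def run_benchmark(questions: list[tuple[str, str]]) -> float:
        """questions: (prompt, expected_answer) pairs; returns the pass rate."""
        correct = sum(expected in ask(prompt) for prompt, expected in questions)
        return correct / len(questions)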