
I've always been skeptical of benchmarking because of the memorization problem. I recently made up my own (simple) date reasoning benchmark to test this, and found that GPT-4 Turbo actually outperformed GPT-4: https://open.substack.com/pub/talcai/p/making-up-a-new-llm-b...


I like the test, but do you take multiple samples/runs per result? IMO, for a proper benchmark you should ask the same question 10+ times and report a confidence interval; otherwise you can't tell whether a result is a fluke or a lucky guess.
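A minimal sketch of the idea: run each question several times and report a 95% confidence interval on accuracy instead of a single pass/fail. `ask_model` here is a hypothetical stand-in for a real API call (simulated with a fixed 70% success rate), and the Wilson score interval is one common choice for a binomial proportion.

```python
import math
import random

def ask_model(question: str) -> bool:
    # Placeholder for an actual model call: pretend the model
    # answers correctly 70% of the time.
    return random.random() < 0.7

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return (max(0.0, center - half), min(1.0, center + half))

def benchmark(question: str, runs: int = 10) -> tuple[float, float]:
    # Ask the same question `runs` times and return the accuracy interval.
    correct = sum(ask_model(question) for _ in range(runs))
    return wilson_interval(correct, runs)
```

With only 10 runs the interval is wide; more runs (or more questions) tighten it, which is the whole point of reporting it.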


Ahh, good suggestion -- I should clarify this in the article. I tried to compensate with volume: I used a set of 200 questions for testing. I was also using temperature 0, so running a single question multiple times would give the same answer.



