Huh, that one got it wrong for me too. I don't have the patience to try it 10 times each to see whether it was a coincidence, but it is absolutely true that not all implementations of LLMs produce the same outputs. It is in fact common for subtle bugs to creep in that make the outputs worse but not catastrophically bad, and therefore go unnoticed. So I wouldn't trust any implementation but the original for benchmarking, or even general use, unless I had tested it extensively.
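A minimal sketch of the kind of spot-check I mean, assuming you can dump per-token logprobs (under greedy decoding, same prompt) from both the reference implementation and the one under test. The function name and data shape here are my own invention, not any particular library's API:

```python
def compare_logprobs(ref_steps, test_steps):
    """Compare two implementations' per-token logprobs for the same
    greedy-decoded prompt. Each argument is a list (one entry per
    decoding step) of dicts mapping token -> logprob.

    Returns (argmax_mismatches, max_abs_logprob_diff):
      - argmax_mismatches: steps where the two implementations would
        pick a different next token (a visible divergence)
      - max_abs_logprob_diff: largest drift on shared tokens (a subtle
        divergence that may not change any output but hints at a bug)
    """
    mismatches = 0
    max_diff = 0.0
    for ref, test in zip(ref_steps, test_steps):
        # Do the two implementations even agree on the greedy token?
        if max(ref, key=ref.get) != max(test, key=test.get):
            mismatches += 1
        # How far do the logprobs drift on tokens both report?
        for tok in ref.keys() & test.keys():
            max_diff = max(max_diff, abs(ref[tok] - test[tok]))
    return mismatches, max_diff


# Toy placeholder data standing in for real dumped logprobs:
ref_steps = [{"cat": -0.10, "dog": -2.40}, {"sat": -0.05, "ran": -3.10}]
test_steps = [{"cat": -0.11, "dog": -2.35}, {"sat": -0.05, "ran": -3.12}]
print(compare_logprobs(ref_steps, test_steps))  # (0, 0.05...)
```

Small numerical drift is expected (different kernels, different reduction orders), but a growing gap or frequent argmax flips on fixed prompts is exactly the kind of "worse but not catastrophic" bug that slips through casual use.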