
Huh, that one got it wrong for me too. I don't have the patience to try it 10 times each to see if it was a coincidence, but it is absolutely true that not all implementations of LLMs produce the same outputs. It is in fact common for subtle bugs to creep in that make the outputs worse, but not catastrophically bad, and therefore go unnoticed. So I wouldn't trust any implementation but the original for benchmarking, or even general use, unless I had tested it extensively.
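A toy sketch of how a subtle implementation difference can silently change outputs: suppose one inference stack keeps logits in float32 while another rounds them to float16. When two candidate tokens score nearly the same, the rounding alone can flip which one is picked under greedy decoding. (This is a hypothetical illustration, not a claim about any specific implementation.)

```python
import numpy as np

# Hypothetical logits for four tokens; indices 1 and 2 are near-tied,
# with index 2 ahead by a hair (0.0001).
logits = np.array([1.0, 10.0, 10.0001, 3.0], dtype=np.float32)

# "Implementation A": greedy decoding over float32 logits.
top_f32 = int(np.argmax(logits))            # picks index 2

# "Implementation B": same logits, but rounded to float16 first.
# float16 spacing near 10 is ~0.0078, so 10.0001 rounds down to 10.0,
# the tie is lost, and argmax falls back to the first tied index.
top_f16 = int(np.argmax(logits.astype(np.float16)))  # picks index 1

print(top_f32, top_f16)
```

The model isn't "wrong" in either case; the error is below the noise floor of any single answer, which is exactly why this kind of bug survives casual testing.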


Same. With the recommended settings, it got it right. I regenerated a bunch of times, and it did suggest Cathy once or twice.

R1 70b also got it right just as many times for me.



