
There are several reasons why responses from the same model might vary:

- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition

- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much)

- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat

- not-quite-deterministic GPU acceleration
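
To make the temperature point concrete, here's a rough sketch in plain NumPy with made-up logits - not any particular model's actual sampler:

    import numpy as np

    def sample_next_token(logits, temperature=1.0, rng=None):
        """Sample a next-token index from raw logits at a given temperature."""
        if rng is None:
            rng = np.random.default_rng()
        if temperature == 0.0:
            # Greedy decoding: always take the single most likely token.
            return int(np.argmax(logits))
        # Dividing logits by the temperature before softmax: higher temperature
        # flattens the distribution ("creative"), lower temperature sharpens it.
        scaled = np.asarray(logits, dtype=np.float64) / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    # Hypothetical logits over a toy 5-token vocabulary.
    logits = [2.0, 1.5, 0.3, -1.0, -2.0]
    print(sample_next_token(logits, temperature=0.0))  # always token 0
    print(sample_next_token(logits, temperature=0.8))  # varies run to run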

Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and with no additions to the benchmark prompt beyond necessary formatting and things like end-of-turn tokens. They are also usually multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.
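
For example, a benchmark-style run with Hugging Face transformers would look roughly like the sketch below (the model name is just a placeholder, and this is a minimal illustration, not any harness's actual code):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; use whatever you're evaluating
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)  # full-precision weights

    # Only the benchmark prompt plus whatever formatting the task requires.
    prompt = "Question: What is the capital of France?\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")

    # do_sample=False means greedy decoding, i.e. effectively temperature zero.
    output = model.generate(**inputs, do_sample=False, max_new_tokens=8)
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))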

Of course a benchmark still can't tell you everything - real-world performance can be very different.



AFAIK the batch your query lands in can also matter[1].

Though I imagine this should be a smaller effect than, say, different quantization levels.

[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
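
For intuition on why reduction order matters at all: floating-point addition isn't associative, so summing the same numbers grouped differently (which is what different batch shapes and kernel configurations end up doing) can change the low bits, and occasionally that's enough to flip a near-tie between two candidate tokens. A toy NumPy illustration, not a reproduction of the linked post's analysis:

    import numpy as np

    a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1e-3)
    print((a + b) + c)  # 0.001
    print(a + (b + c))  # 0.0 -- c is swallowed when added to the huge value first

    # The same effect at scale: identical values summed with a different
    # reduction tree (as different batch shapes cause) often differ in the last bits.
    x = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
    print(x.sum(), x.reshape(100, 1000).sum(axis=1).sum())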


Thanks, this is a good checklist.



