There are several reasons why responses from the same model might vary:
- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition
- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much)
- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat
- not-quite-deterministic GPU acceleration
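To make the temperature point concrete, here's a minimal sketch of what a sampler typically does with the model's raw next-token logits. The function name and the logit values are made up for illustration; real samplers usually layer things like top-p/top-k on top of this.

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_token(logits: np.ndarray, temperature: float = 0.7) -> int:
    """Sample a token id from raw logits, with temperature scaling."""
    # Dividing by temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = logits / temperature
    # Softmax (shifted by the max for numerical stability).
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Random draw: two runs with the same logits can pick different tokens.
    return int(rng.choice(len(probs), p=probs))

# Made-up logits for a tiny 5-token vocabulary.
logits = np.array([2.1, 1.9, 0.3, -1.0, -2.5])
print([sample_next_token(logits) for _ in range(10)])  # varies run to run
```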
Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and with no additions to the benchmark prompt beyond necessary formatting and things like end-of-turn tokens. They're also usually multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.
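Temperature zero just means taking the argmax instead of sampling, so given identical logits the output is fully reproducible. A self-contained sketch, using the same made-up logits as above:

```python
import numpy as np

def greedy_next_token(logits: np.ndarray) -> int:
    """Temperature-zero decoding: always take the single most likely token."""
    return int(np.argmax(logits))

# Same hypothetical logits as before: the choice never changes run to run.
logits = np.array([2.1, 1.9, 0.3, -1.0, -2.5])
print(greedy_next_token(logits))  # -> 0, every time
```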
Of course a benchmark still can't tell you everything - real-world performance can be very different.
- "temperature" - intentional random sampling from the most likely next tokens to improve "creativity" and help avoid repetition
- quantization - running models with lower numeric precision (saves on both memory and compute, without impacting accuracy too much)
- differences in/existence of a system prompt, especially when using something end-user-oriented like Qwen Chat
- not-quite-deterministic GPU acceleration
Benchmarks are usually run at temperature zero (always take the most likely next token), with the full-precision weights, and no additions to the benchmark prompt except necessary formatting and stuff like end-of-turn tokens. They also usually are multiple-choice or otherwise expect very short responses, which leaves less room for run-to-run variance.
Of course a benchmark still can't tell you everything - real-world performance can be very different.