LLMs are non-deterministic, so I think benchmarks should report averages over N runs rather than single-shot results.
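To make that concrete, here is a rough sketch of what reporting over N runs could look like. The `run_benchmark_once` function is a hypothetical stand-in that simulates a noisy score with a random number generator; in practice it would call the model on the benchmark. The names, the 70% accuracy, and N = 10 are all assumptions for illustration, not anything from a real benchmark.

```python
import random
import statistics


def run_benchmark_once(rng, n_questions=100, p_correct=0.7):
    """Hypothetical stand-in for one benchmark run.

    Simulates a model that answers each question correctly with
    probability p_correct, and returns the fraction answered correctly.
    """
    correct = sum(rng.random() < p_correct for _ in range(n_questions))
    return correct / n_questions


def summarize_runs(scores):
    """Return the mean, sample standard deviation, and standard error."""
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two data points.
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    stderr = stdev / len(scores) ** 0.5
    return {"n": len(scores), "mean": mean, "stdev": stdev, "stderr": stderr}


rng = random.Random(0)  # fixed seed so the simulation itself is repeatable
scores = [run_benchmark_once(rng) for _ in range(10)]
summary = summarize_runs(scores)

print(f"single shot: {scores[0]:.3f}")
print(
    f"mean of {summary['n']} runs: {summary['mean']:.3f} "
    f"(stdev {summary['stdev']:.3f}, stderr {summary['stderr']:.3f})"
)
```

Reporting the spread alongside the mean also shows whether a gap between two models is larger than the run-to-run noise.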