alrocar 3 months ago | on: Claude Opus 4.1

Just ran the LLM to SQL benchmark on opus-4.1 and it didn't top the previous version :thinking: => https://llm-benchmark.tinybird.live/

epolanski 3 months ago

How does it perform when you run it multiple times?

LLMs are non-deterministic, so I think benchmarks should report averages over N runs rather than single-shot results.
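A rough sketch of what averaging over N runs could look like. Everything here is an assumption for illustration: `summarize_runs` and `fake_run` are hypothetical names, `fake_run` just simulates a noisy score, and none of this is the linked benchmark's actual code.

```python
import random
import statistics
from typing import Callable, Dict


def summarize_runs(run_once: Callable[[], float], n: int = 10) -> Dict[str, float]:
    """Call a benchmark n times and report the mean and spread of its scores."""
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    scores = [run_once() for _ in range(n)]
    stdev = statistics.stdev(scores)  # sample standard deviation
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "stdev": stdev,
        "stderr": stdev / n ** 0.5,  # standard error of the mean
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical stand-in for one benchmark run: a made-up score with
# run-to-run noise. A real harness would query the model and grade the SQL.
_rng = random.Random(0)


def fake_run() -> float:
    return _rng.gauss(0.75, 0.03)


if __name__ == "__main__":
    print(summarize_runs(fake_run, n=20))
```

If two models' means differ by less than a couple of standard errors, a single-shot ranking between them probably isn't telling you much.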
