
Just ran the LLM-to-SQL benchmark on opus-4.1 and it didn't top the previous version :thinking: => https://llm-benchmark.tinybird.live/


How does it perform when you run it multiple times?

LLMs are non-deterministic, so I think benchmarks should report averages over N runs rather than single-shot experiments.
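
To make the "average of N runs" idea concrete, here is a minimal sketch. It is not how the linked benchmark works; `summarize_runs` and `make_fake_model` are hypothetical names, and the fake model just draws noisy scores so the example runs without any API. In practice `run_once` would be whatever executes one full benchmark pass and returns a score.

```python
import random
import statistics
from typing import Callable, Dict


def summarize_runs(run_once: Callable[[], float], n: int) -> Dict[str, float]:
    """Call run_once n times and report the mean and spread of the scores."""
    if n < 2:
        raise ValueError("need at least 2 runs to estimate spread")
    scores = [run_once() for _ in range(n)]
    stdev = statistics.stdev(scores)  # sample standard deviation
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "stdev": stdev,
        "stderr": stdev / n ** 0.5,  # standard error of the mean
        "min": min(scores),
        "max": max(scores),
    }


def make_fake_model(true_score: float, noise: float, seed: int) -> Callable[[], float]:
    """Hypothetical stand-in for a real benchmark run: a noisy score in [0, 1]."""
    rng = random.Random(seed)

    def run_once() -> float:
        return min(1.0, max(0.0, rng.gauss(true_score, noise)))

    return run_once


if __name__ == "__main__":
    # Two made-up models whose underlying scores are close together.
    for name, model in [
        ("model_a", make_fake_model(0.80, 0.03, seed=1)),
        ("model_b", make_fake_model(0.79, 0.03, seed=2)),
    ]:
        s = summarize_runs(model, n=10)
        print(f"{name}: mean={s['mean']:.3f} +/- {s['stderr']:.3f} "
              f"(min={s['min']:.3f}, max={s['max']:.3f})")
```

If the gap between two models is smaller than the run-to-run spread, a single-shot ranking doesn't tell you much.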



