
I think there's a fundamental limit to how well benchmarks can capture real-world utility. The best option would be something more like a user survey.


That's Chatbot Arena: https://lmarena.ai/leaderboard


And unfortunately revealed to be largely a vibe check these days, as the whole Llama 4 debacle showed. But why should we be surprised, really, when it's much easier for users to tell whether replies sound human and conversational and _appear_ knowledgeable than to actually catch the model being wrong? The Arena worked well in the early ChatGPT days… but now?



