38% on osworld vs 22% for Claude. That seems like a jump | Hacker News

usaar333 3 months ago | on: Operator research preview

38% on OSWorld vs 22% for Claude. That seems like a jump.

achierius 3 months ago

But of course, after all the benchmark issues we've had thus far -- memorization, conflicts of interest, and just plainly low-quality questions -- I think it's fair to be suspicious of the extent to which these numbers will actually map to usability in the real world.
