Hacker News new | past | comments | ask | show | jobs | submit login

38% on osworld vs 22% for Claude. That seems like a jump



But of course, after all the benchmark issues we've had thus far -- memorization, conflicts of interest, and just plainly low-quality questions -- I think it's fair to be suspicious of the extent to which these numbers will actually map to usability in the real world.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: