
OpenAI is merely matching SOTA in browser tasks compared to existing browser-use agents. It is a big improvement over Claude Computer Use, but it is more of the same in the specific domain of browser tasks when compared against browser-use agents (which can use the DOM, browser-specific APIs, and so on).

The truth is that while 87% on WebVoyager is impressive, most of the tasks are quite simple. I've played with some browser-use agents that are SOTA, and they can still get easily confused by more complex tasks or unfamiliar interfaces.

You can see some of the examples in OpenAI's blog post. In some instances they had to write the prompts quite carefully to get the thing to work. Needing to iterate until the prompt is just right negates a lot of the value of delegating a one-off task to an agent.




Well, that's fair. I wasn't saying that this was necessarily at a level of competence to be useful, simply that it seemed to be a lot better than Claude.


Yeah, and Browser Use already has 89% on WebVoyager: https://browser-use.com/posts/sota-technical-report


> OpenAI is merely matching SOTA in browser tasks as compared to existing browser-use agents.

No. It's not matching them; it's clearly exceeding them. The previous post provided the numbers.


Those numbers are not the full story. Note that GP specifically says: "Big jumps in benchmarks from _Claude's Computer Use_ though." Claude Computer Use was not SOTA for browser tasks at the time of its release (and still is not).

In WebArena, Operator scores 58.1%; the previous SOTA for browser-use agents is 57.1%. In WebVoyager, Operator scores 87.0%, exactly the same as the previous SOTA for browser-use agents.

See here for details: https://openai.com/index/computer-using-agent/


Those two were two different models (Kura and jace.ai), and one model being SOTA at one benchmark doesn't make it SOTA overall. Moreover, both are specific to browser use, so they don't operate on raw pixels alone but can read the HTML/DOM, unlike general computer-use models, which rely only on raw screenshots.
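
To make the distinction concrete, here's a minimal sketch using Playwright of what each kind of agent gets to observe. The function and variable names are mine, purely for illustration; this isn't how any of the named products actually work internally:

    # pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    def observe(url: str):
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)

            # A general computer-use model sees only this: raw pixels.
            screenshot = page.screenshot()  # PNG bytes

            # A browser-use agent can additionally read structured page
            # state, which makes locating and acting on elements far easier.
            html = page.content()  # serialized DOM as an HTML string
            links = [(a.inner_text(), a.get_attribute("href"))
                     for a in page.query_selector_all("a")]

            browser.close()
            return screenshot, html, links

The pixel-only agent has to infer where the links are from the screenshot; the DOM-reading agent gets them as structured data for free.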


I think I hit all those points in my previous post, except for the fact that they're two different models, as you've noted. That said, neither of them seems to report a score for the other benchmark.



