Sharing ideas early is not a bad thing and is very much encouraged by YC; we are gauging interest in collaboration on the topic. Our company has already open sourced almost our entire computer use stack: https://github.com/agentsea
We are in the process of collecting the data right now, which is fairly involved, and we are going to open that platform up to others shortly as well.
It's not my intention to be dismissive; the project idea seems really cool. I'm just curious: why not wait until the source code is ready before posting it on HN?
Because gauging interest early and finding other people interested in building is a good idea and frankly very much in line with YC thinking. We have already open sourced an enormous amount of code and datasets for computer use: https://github.com/agentsea
We are working to apply the ideas of R1 to computer use. The primary struggle is creating reliable neural reward models since hard-verification rewards are not available at scale in GUI interactions.
Our team is currently deep in the weeds of collecting reasoning annotation data for GUI interfaces to train a reliable reward model.
We would love all thoughts, feedback, and collaborations!
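To make the reward-model part concrete, here is a minimal sketch, assuming a pairwise-preference setup over annotated GUI reasoning steps. The architecture, the placeholder encoder, and all names are illustrative only, not our actual training code:

    # Minimal sketch (illustrative, not the real stack): a neural reward model
    # that scores a tokenized (observation, reasoning, action) step, trained
    # with a pairwise preference loss over "better vs. worse" annotations.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GUIRewardModel(nn.Module):
        def __init__(self, vocab_size: int = 32000, dim: int = 768):
            super().__init__()
            # Placeholder encoder; in practice this would be a pretrained VLM/text encoder.
            self.encoder = nn.EmbeddingBag(vocab_size, dim)
            self.head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

        def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
            return self.head(self.encoder(token_ids, offsets)).squeeze(-1)

    def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry style objective: push the chosen step's score above the rejected one's.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

The reward model's scores are then what stand in for the hard-verification rewards that math/code RL pipelines get for free.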
Training a base model just for computer use seems like overkill: a normal reasoning model like o3 for planning plus a vision model like gemini-flash is good enough[1] without being trained specifically for computer use.
But if you still want to try this path, Google has made the ScreenQA dataset (built on RICO) available[2], along with bounding boxes.
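For what it's worth, the split being described usually ends up looking something like the loop below. This is a hedged sketch only; take_screenshot, plan_next_step, locate_element, and execute_click are hypothetical stand-ins for calls to a reasoning model (planning) and a vision model (grounding), not a real API:

    # Hedged sketch of the "reasoning model for planning + vision model for
    # grounding" architecture. All helper callables are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Step:
        description: str   # e.g. "click the Search button"
        target: str        # element the vision model should locate
        done: bool = False

    def run_task(task, take_screenshot, plan_next_step, locate_element, execute_click, max_steps=50):
        history = []
        for _ in range(max_steps):
            screenshot = take_screenshot()
            step = plan_next_step(task, history, screenshot)  # reasoning model (e.g. o3)
            if step.done:
                break
            x, y = locate_element(screenshot, step.target)    # vision model (e.g. gemini-flash) -> pixel coords
            execute_click(x, y)
            history.append(step)
        return history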
They aren't reliable enough, and these agents aren't much use without reliability. We initially started that way and tried just about everything before focusing on training a base model.
Free advice (though worth less than free, because A) it's unsolicited and B) it's saying "don't do it")
TL;DR:
- Turns out that if you do UXR, even if computer use is 100% successful in action execution and there's no latency, people don't use it. (Interesting to me: the core demo was buying airline tickets, and so is OpenAI's. No one would defer to a computer on that, for humanist / design reasons.)
- You're not going to be able to out-do model companies on building models; they have too much funding.
- Try writing GUI-based integration tests. Then imagine an LLM, miraculously, always chooses the right route. Does the UX look good?
- Note the reasoning models are worse at tool calling. It's very, very, VERY stark when you have Claude next to o1/4o. OpenAI also owns up to this in the o3-mini paper, though it's not under a blaring red line headline or phrased that straightforwardly.
- Why is that? You're fighting against the current when you're trying to teach the next token predictor to throw a bunch of text out there to <think>, then generate perfectly correct JSON/python/whatever given N tools.
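To illustrate what "throw a bunch of text out there to <think>, then generate perfectly correct JSON" means in practice (the tool name and arguments below are made up for the example):

    # Illustration of the failure surface: free-form reasoning followed by a
    # strictly valid tool call. The tool and its arguments are hypothetical.
    import json

    model_output = """<think>
    The user asked for tomorrow's weather, so I should call the weather tool
    with the city they mentioned.
    </think>
    {"tool": "get_weather", "arguments": {"city": "Berlin", "date": "tomorrow"}}"""

    # Only the JSON tail is machine-checkable; one stray token there breaks the
    # call, while nothing inside <think> is verified at all.
    tool_call = json.loads(model_output.split("</think>")[-1].strip())
    print(tool_call["tool"], tool_call["arguments"])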
LLM computer use is like that inflatable autopilot from Airplane! It makes no sense for something that can interface directly with the underlying system to have to interact with a GUI that's only there because we are apes who like brightly coloured rectangles and clicking things.
If you had a real super-AGI it could directly interact with any computer program by injecting itself into the process.
It is possible for a human to inject features into notepad.exe without having access to the source code (I learned that from +HCU cracking tutorials almost 30 years ago...); it should be possible for an AI to do so.
> Turns out that if you do UXR, even if computer use is 100% successful in action execution and there's no latency, people don't use it. (Interesting to me: the core demo was buying airline tickets, and so is OpenAI's. No one would defer to a computer on that, for humanist / design reasons.)
I would never buy plane tickets that way. We built it because there are tons of things we couldn't automate, and this was the only way to do it.
> - You're not going to be able to out-do model companies on building models; they have too much funding.
There are plenty of people edging out the model companies all over the place. We aren't powerless against them.
> Note the reasoning models are worse at tool calling. It's very, very, VERY stark when you have Claude next to o1/4o. OpenAI also owns up to this in the o3-mini paper, though it's not under a blaring red line headline or phrased that straightforwardly. Why is that? You're fighting against the current when you're trying to teach the next token predictor to throw a bunch of text out there to <think>, then generate perfectly correct JSON/python/whatever given N tools.
The reason why they are bad at tool calling is they aren't trained on it. The current reasoning models require hard-verification reward models; we don't have those for tool calling, whereas that data is easy to get for math/code. Reasoning will improve tool calling; OpenAI just talked about it recently as being the answer to autonomous agents.
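To spell out the asymmetry: for math/code you can write a hard verifier in a few lines, which is exactly what R1-style RL leans on, while a GUI action has no equivalent exact check. A rough sketch (the regex and reward values are illustrative, not any lab's actual recipe):

    # Hard-verification reward for a math answer: exact, cheap, and scalable.
    import re

    def math_reward(model_output: str, ground_truth: str) -> float:
        match = re.search(r"\\boxed\{([^}]*)\}", model_output)
        return 1.0 if match and match.group(1).strip() == ground_truth.strip() else 0.0

    # There is no analogous exact-match check for "click Submit, then confirm
    # the dialog", which is why a learned (neural) reward model is needed for
    # GUI steps.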
> The reason why they are bad at tool calling is they aren't trained on it.
Yes they are.*
> OpenAI just talked about it recently as being the answer to autonomous agents
You nailed it.
It's key to autonomous agents, OpenAI says so out loud, and yet, we're 6 months into reasoning models and performance is regressing, which OpenAI also says out loud but in the fine print of a model card.
I know this is a flaming hot take because current thing is reasoning. But it's completely mistaken that reasoning models help with tool-use, both in theory and practice, which puts them in quite a situation.
I'm sure they'll figure it out, but I'm also sure on a long enough timeline the LLM is a computer.
* I have a bad habit of dismissing without evidence that which is asserted without evidence, and of calling it 'not even wrong' in the Pauli sense. It's a cheap way to avoid confrontation, but it makes me look petulant. We can observe, inter alia, that every release after September 2024 (i.e. after o1-mini and o1-preview) can call tools (i.e. o1, o3-mini).
Please cite the source for o1 being trained heavily on tool usage.
Reasoning will obviously improve agentic workflows; reasoning is the main problem we see today with autonomous agents, and it seems likely to be the _last_ issue we have with them.
- It's going to be difficult to find a source that says they trained heavily on tool use. They don't talk about training at all anymore! :( And saying you trained heavily on tool use would imply your model isn't trained well for other stuff, which they'd want to avoid.
- Nothing implies they didn't do the exact same tool training they have been doing. Both model cards mention tool use.
- Both o1 and o3-mini have bog-standard tool calling via API. I'm not sure what gives us the sense that they weren't trained on it.
- The claim shifted significantly:
From: "The reason why they are bad at tool calling is they aren't trained on it"
To: "[I don't think you can find a source that says] o1 [was] trained heavily on tool usage."
Are people concerned about the privacy implications of computer use at all? This is why I haven’t been using Claude computer use personally. Somehow the idea of sending everything I do on my computer to a random third party seems creepy. There are a lot of AI applications (Rewind comes to mind) where I simply cannot accept the idea of sharing my screen.
I share this feeling every time a popup asks me to accept cookies for a website and its 1243 "trusted partners", which, in this context, feels like a Genghis Khan-scale harem rather than any sane business relationship.
That is why open source and locally hostable models are so important. The privacy considerations are paramount, not just the ability to have unlimited token generation.
We have more than a quarter of a century of normalized zero privacy, and this is obviously anti-Chinese-AI-company propaganda, because the reality is that real-time bidding (RTB) is already so bad that the security concerns are simply not new, so why should people even care?
Google, Meta, Microsoft and other RTB firms send RTB data about people in the U.S. to Russia, China, and anyone else who signs up... many people don't see the difference. In fact, for many people, the CCP having your data is far less of a risk vector than the thousands of others who get your data every single time you hit a webpage, visit the local store, etc.
Google, Meta, Microsoft, etc. are selling your data thousands of times a day, and companies aggregate that to sell even sensitive information, including categories that are sensitive for national security.
* you can always choose to self-host a small model, although it probably doesn't work as well
* it's not a "random third party". You know to whom the data is being sent, and at least according to the service agreements, most services don't use your data for training. If you don't trust Claude, you could use the AWS-hosted version, or GPT/DeepSeek hosted on Azure. And if you think Amazon/Microsoft are not trustworthy and may misuse your data in these cloud services (not some random consumer-facing service where you are the product), you might as well give up your digital life.
You can run AIs locally. I have a laptop that can run DeepSeek's distilled reasoning (<= 70B) versions of Qwen and Llama locally, and they are a blast to play with even on an airplane without an internet connection.
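As a concrete example of what "locally" can look like (this assumes Ollama is installed and the distilled model has already been pulled, e.g. with ollama pull deepseek-r1:70b, and uses the official ollama Python package; swap in a smaller tag if the hardware can't fit 70B):

    # Hedged example: chat with a locally running distilled reasoning model,
    # entirely offline. Assumes the Ollama daemon is running and the model
    # tag has already been pulled.
    import ollama

    response = ollama.chat(
        model="deepseek-r1:70b",
        messages=[{"role": "user", "content": "Outline a 3-step plan to rename 100 files."}],
    )
    print(response["message"]["content"])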
It does not appear that this aims to use an API, but rather to train/finetune their own model using insights from R1 (a [mostly] pure RL approach to bootstrapping reasoning in an LLM).
Outside of tech circles? No, not really. The past decade has shown that if anything goes out the window first, it'll be privacy, whenever that helps with speed, convenience, or money.
I’m concerned. Not just about privacy: AI request forgery will also be a thing.
Running local models doesn’t protect you from prompt injection attacks or hallucinations.
There are some startups building capability APIs to limit that, but most websites/apps either don’t have the resources or aren’t willing to expose those capabilities.
And as some others have mentioned, users have a track record of giving up privacy for convenience. I’m not convinced educating non-technical users about the risks involved will ward them off.
> There are some startups building capability APIs to limit that, but most websites/apps either don’t have the resources or aren’t willing to expose those capabilities.
Care to elaborate with example(s) of startups doing this?
What does your perception layer look like: are you using raw screenshots? GUI snapshots? In some earlier experiments I've found that vision is very difficult for these, and snapshots are incomplete.
I wonder how good R1 is at counting pixels from a screenshot. What enabled Claude and OpenAI's CUA to develop computer use was being able to precisely give x-y coordinates of a click location.
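For context on what "precisely give x-y coordinates" ends up meaning on the execution side, here is a rough sketch (the normalized values are made up, and pyautogui is just one way to issue the click):

    # Hedged sketch: a model that outputs normalized (0-1) click positions must
    # be mapped onto the actual screen resolution before the click lands on the
    # intended widget; being even slightly off hits the wrong element.
    import pyautogui

    norm_x, norm_y = 0.42, 0.87            # hypothetical model output for "the Submit button"
    width, height = pyautogui.size()       # real screen resolution
    pyautogui.click(int(norm_x * width), int(norm_y * height))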
Also, how big of a gain is reasoning for computer use? I feel like reasoning unlocks a lot when there is a single complex question, but not so much when taking actions in a long-term plan.
This is the type of post some VP at my company sees and starts telling people that R1 can use a computer, and then I have to be like "well actually" to 25 people.
Computer use is pretty exciting stuff in general though, good luck
People have tons of workflows that involve a lot of clicks and typing in response to data that are too difficult or one-off to automate with fragile macros.
But if my computer can quickly realize that I'm deleting every odd-numbered page of a PDF, or renaming every file to add a prefix, or following each link on a website and saving an image... and then just instantly automate the next 100 times... that's going to be huge!
> But if my computer can quickly realize that I'm deleting every odd-numbered page of a PDF, or renaming every file to add a prefix, or following each link on a website and saving an image... and then just instantly automate the next 100 times... that's going to be huge!
The first two tasks could be easily done by asking ChatGPT to write a script for you. Scraping a website can be a bit more tricky. Still, I don't see why you have to rely on "computer use" for these tasks -- there are much more efficient and reliable approaches.
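For example, roughly the kind of script ChatGPT would hand back for the first two (file names and paths are placeholders; the PDF part assumes pypdf is installed):

    # Keep only the even-numbered pages of a PDF (i.e. delete the odd ones),
    # then add a prefix to every .jpg in a folder. Paths are placeholders.
    from pathlib import Path
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("input.pdf")
    writer = PdfWriter()
    for i, page in enumerate(reader.pages, start=1):
        if i % 2 == 0:
            writer.add_page(page)
    with open("output.pdf", "wb") as f:
        writer.write(f)

    for path in Path("photos").glob("*.jpg"):
        path.rename(path.with_name("trip-" + path.name))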
Those are just simple examples. Most of the clicking I do on my computer doesn't have a command-line equivalent. Nor do I want to have to type out a request to ChatGPT, even if there is one.
There's a gigantic area of productivity improvement around repetitive actions that aren't easily scriptable or for which no scripting interface exists, but where an AI assistant that interfaces with your screen, pointer, and keyboard would be a huge help.
That's about manually setting up agents, that run on a server, that seem to interact largely with the web (from the examples).
I'm talking about not manually setting up anything -- I'm talking about an AI that simply observes the repetitive actions you're taking on your computer, infers patterns from them, and then offers to take over and finish the job.
There was something like this on Macs in the mid-1990s that would watch you work and suggest timed automations - there was a bit of a developer panic the first time it told someone "I see you launch <popular desktop game> around 4:30pm every Friday, would you like to do that automatically?"
As a start, I want to see if the agent can figure out how I play a clicker-style game such as AdVenture Capitalist on the computer. I think I have a certain style of playing. I still don't understand how an agent can somehow figure out valid gameplay (earth, moon, mars, events), much less play the game in my own style.
I think we should start with something simple, repeatable, and that does little to no harm if/when things go wrong.
Sorry to be a party-pooper, but does it really make sense to add a citation when you don't have fully working code yet, let alone a paper about it?