At multiple points in the paper, they describe the agent as "embodied", but it seems like they don't really justify their use of that term. Embodiment is a loaded word with a lot of associations in neighboring fields. I think it's a little wild that they're claiming this aspect in their abstract and conclusion and not really touching on it substantively in the body of the paper.
Off the cuff, I think any system that has a perceptual connection to an "environment" separate from itself and can induce changes in that environment can be generally considered "embodied". The abilities to cognitively distinguish self from environment and to understand one's effect on that environment are the crucial elements. I think an explicitly spatial notion of embodiment, typical of humans, is insufficiently general.
As someone in CV/ML with a background in cognitive science, and partial to the work of Lakoff and others with regard to the role of embodiment, it does irk me how loose papers in vision can be with the term. In this case specifically, it's rather hard to justify. I'm guessing it may be influenced by other papers in vision that use agents in a simulated 3D environment. In those papers saying that there is some kind of embodiment is iffy, but depending on the quality of the simulation, maybe defensible.
To be clear, I mean not just "consciousness", which has a bunch of problems of poor definition and epistemic inaccessibility, but "cognition" more broadly.
In this "OS" situation, aside from not even simulating a virtual body, it's not clear that the inputs available to the agent are especially tied to the agent's ongoing actions. By comparison, what you see is a function of how you move your body, head, eyes and eyelids, and "seeing" involves saccades that let you take in the multiple important parts of a scene. So I think this doesn't even have an instantiation which acts _analogously_ to being embodied.
It's going to be really interesting to see how work like this impacts the Robotic Process Automation or "RPA" product space.
Many of the enterprise products in this category have used computer vision to help reduce brittleness, but for all of the improvements, these tools have remained highly error prone, not to mention extremely expensive.
The explosion of interest in the category across a broader community of researchers and developers seems like both a boon for the RPA space, and a major threat of disruption.
As an aside: I'm stunned at how willing my own org is to train an RPA process that can break so easily, versus scripting the same actions in PowerShell against a REST interface in ServiceNow (our ticketing system). They scream and cry about proper authentication methods for ServiceNow (OAuth, not username + password), and then they're happy to let an RPA process clunk around replaying something recorded. They shot down the script that works, and are investing hundreds of hours in an RPA expert to do it with far fewer scripted validation steps. Just ugh.
Wait…are they using RPA to automate interactions with ServiceNow? i.e. they’re controlling the ServiceNow UI? If so, that’s pretty terrible.
This kind of project boggles my mind. RPA should be used for situations where there is no API available, and from what I understand of the ServiceNow product, it has all kinds of APIs for automation use cases.
Our ServiceNow team literally believes use of the REST API is a security risk, because you can insert "bad data" into the incident table and others. I showed them different ways you can break the web GUI to do things like create a ticket without a customer, or set a ticket to On Hold without an On Hold Reason, etc. I actually built a dashboard called the "Gallery of Broken Tickets" (they didn't know how to make reports and dashboards) and presented it to them to show the bad data that already exists in `incident`. I'm never getting hired onto that team.
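For reference, the kind of REST call they're afraid of is a handful of lines. A minimal sketch in Python using ServiceNow's standard Table API (the script I mentioned was PowerShell via Invoke-RestMethod, but the shape is the same; the instance URL, credentials, and sys_id here are placeholders, not real values):

```python
# Minimal sketch (not our actual script): create a well-formed incident via
# ServiceNow's Table API instead of clicking through the GUI.
# Instance URL, credentials, and caller sys_id are placeholders.
import requests

INSTANCE = "https://example-dev.service-now.com"  # hypothetical dev instance

payload = {
    "caller_id": "PLACEHOLDER_SYS_ID",  # populated here, unlike the GUI flow that can be tricked into skipping it
    "short_description": "Printer on floor 3 is offline",
    "urgency": "2",
}

resp = requests.post(
    f"{INSTANCE}/api/now/table/incident",
    json=payload,
    auth=("api_user", "api_password"),  # basic auth for brevity; OAuth is the right answer, as noted above
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()  # fail loudly instead of silently creating bad data
print(resp.json()["result"]["number"])  # e.g. INC0012345
```

The point being: every validation step you'd want (required fields, allowed state transitions) is a line of explicit code here, versus an opaque recording.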
But yes, RPA is big at our org right now. In ServiceNow too.
Add another layer of AI on top of RPA and the mess will be harder, or even impossible, to ever untangle. If it works with high accuracy it will still be a great thing, but a complete black box, and an expensive one to maintain too.
I suspect they’ll co-exist, but I don’t think full replacement is likely in many cases. Existing RPA products will continue to exist if for no other reason than that they’re already deeply embedded in many environments. Over time, those same products will just get easier to use.
One of the core features of RPA products is centralized visibility and management of the lifecycle of automations. This will remain even if the entire interaction layer changes.
Something like the OS-Copilot layer lowers the barrier to entry, though. I could easily see orgs having multiple choices for this kind of automation in the future, and that's where I see the opportunity for disruption.
Most likely the RPA products will add all of the necessary buzzwords to keep people in buying roles interested and more likely to keep renewing.
I would assume that any RPA tool without a large multimodal (language) model agent framework running it is basically already obsolete.
I hope we get more open-source options like LLaVA or CogVLM.
The paper makes it seem like there may be some way to control the mouse and read screen captures, but doesn't give any details. There is a GitHub link though, so maybe it's in there.
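The repo presumably spells this out; as a guess, the simplest instantiation is a screenshot-in, mouse-action-out loop. A minimal sketch using pyautogui (a common Python library for this kind of thing; no confirmation the paper actually uses it):

```python
# Hypothetical observe/act loop: screenshot in, mouse/keyboard action out.
# pyautogui is an assumption for illustration; the paper may use something else.
import pyautogui

def observe():
    # Full-screen capture as a PIL Image; this would be what a VLM "sees".
    return pyautogui.screenshot()

def act(action):
    # Execute one model-proposed action, e.g. {"type": "click", "x": 100, "y": 200}.
    if action["type"] == "click":
        pyautogui.moveTo(action["x"], action["y"], duration=0.2)
        pyautogui.click()
    elif action["type"] == "type":
        pyautogui.write(action["text"], interval=0.05)

frame = observe()                           # would be sent to the model
act({"type": "click", "x": 100, "y": 200})  # stand-in for the model's chosen action
```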
In my wildest speculation about the gossip of recent goings-on, I imagine a board of individuals saturated in the technology, presented with the latest leading-edge AI and tasked with improving its own code. As the board and co. sat back and watched, the machine made improvements rapidly; rapidly enough that the engineers had to admit they could no longer follow the code changes. They watched as the AI redeveloped the underlying language, then the kernel, and started planning fabrication technology and how to fund it (mere trillions). The benchmarks were smashed and they hit stop. What do you do next?
Just to be clear, you’re saying that Sam Altman is currently in an Ex Machina-esque relationship with his own wannabe Rehoboam, and that his recent trillion dollar manufacturing proposal was really the first acts of a nascent AGI?
Because if so, you’re right on. The question is whether Sam knows it.
I recall reading a roadmap for camera sensors (this might be folklore; it was a long time ago). The takeaway was that they already had the technology for the 100 MP sensor, and the roadmap showed how they would maximise profit using a 20-year curve working up to the release of the final sensor, by which time a new method would meet that curve and continue their dominance. If I had AGI, I would let it out in very small drips.
Just because we have the ability to move individual atoms around doesn't mean it's affordable to do so. It probably would have been pretty expensive (and low-yield, cumbersome, and error-prone) to make a 100 MP camera in 2004.
Fair, it may well have been a roadmap based on some kind of theoretical limit on the technology. Also, I am fairly old, so 10 MP was probably more likely!
You can hire human captcha farms for sub-pennies already. Systems like reCAPTCHA are behaviour/profile-based*; the actual captcha bit is almost a smokescreen.
*At certain thresholds, even when you solve the captcha correctly, it will falsely claim you got it wrong. Particularly if you’re not logged into a Google account.
I asked this when UFO was trending, but does this actually work? I haven't been able to try many of them. I want this, but so far everything I've seen has been terrible. So, who has tried it on non-trivial, non-cherry-picked stuff?
I hope it's better at interacting with the OS than ChatGPT is. If I ask ChatGPT to write me a PowerShell/batch script to do something, it works about 25% of the time.
ChatGPT and Copilot both frequently hallucinate entire plausible-sounding APIs or properties/methods that simply don’t exist. Often this takes a lot of time to debug/resolve. They also often omit proper input validation, error handling, and logging even when prompted to include it.
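For concreteness, a hypothetical sketch of the scaffolding that tends to go missing; this isn't taken from any particular model output, it's just what "input validation, error handling, and logging" look like around even a trivial task:

```python
# Hypothetical example of the scaffolding LLM-generated scripts often skip:
# argument validation, explicit error handling, and logging.
import logging
import sys
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

def count_lines(path_str: str) -> int:
    path = Path(path_str)
    if not path.is_file():  # validate input before acting on it
        raise FileNotFoundError(f"not a file: {path}")
    with path.open("r", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    try:
        n = count_lines(sys.argv[1])
        log.info("counted %d lines", n)
    except IndexError:
        log.error("usage: count_lines.py <file>")
        sys.exit(2)
    except OSError as exc:  # surface the failure instead of swallowing it
        log.error("failed: %s", exc)
        sys.exit(1)
```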
I want to do a structured efficiency study of programming tasks, human vs. human-plus-AI, from problem statement through production-ready code. But my org doesn’t have enough devs to make it statistically significant, nor the spare human capacity to spend on duplicative tasks. I assume there must be some studies out there; anyone have a reference?
This is the endgame that’ll put many out of their jobs. Not that it’s bad. I personally think humanity will unfortunately need greater and greater efficiency in the coming decades.
This is not an LLM, so it doesn't make much sense to compare it with one. A more appropriate comparison would be something along the lines of GPT-4-based OS-Copilot vs. GPT-4 with CoT, or vs. GPT-4 zero-shot.