At multiple points in the paper, they describe the agent as "embodied", but it seems like they don't really justify their use of that term. Embodiment is a loaded word with a lot of associations in neighboring fields. I think it's a little wild that they're claiming this aspect in their abstract and conclusion and not really touching on it substantively in the body of the paper.
Off the cuff, I think any system that has a perceptual connection to an "environment" separate from itself and can induce changes in that environment can be generally considered "embodied". The abilities to cognitively distinguish self from environment and to understand one's effect on that environment are the crucial elements. I think an explicitly spatial notion of embodiment, typical of humans, is insufficiently general.
As someone in CV/ML with a background in cognitive science, and partial to the work of Lakoff and others with regard to the role of embodiment, it does irk me how loose papers in vision can be with the term. In this case specifically, it's rather hard to justify. I'm guessing it may be influenced by other papers in vision that use agents in a simulated 3D environment. In those papers saying that there is some kind of embodiment is iffy, but depending on the quality of the simulation, maybe defensible.
To be clear, I mean not just "consciousness", which has a bunch of problems of poor definition and epistemic inaccessibility, but "cognition" more broadly.
In this "OS" situation, aside from not even simulating a virtual body, it's not clear that the inputs available to the agent are especially tied to the agent's ongoing actions. By comparison, what you see is a function of how you move your body, head, eyes and eyelids, and "seeing" involves saccades that let you take in the multiple important parts of a scene. So I think this doesn't even have an instantiation which acts _analogously_ to being embodied.
It's going to be really interesting to see how work like this impacts the Robotic Process Automation or "RPA" product space.
Many of the enterprise products in this category have used computer vision to help reduce brittleness, but for all of the improvements, these tools have remained highly error prone, not to mention extremely expensive.
The explosion of interest in the category across a broader community of researchers and developers seems like both a boon for the RPA space, and a major threat of disruption.
As an aside: I'm stunned at how willing my own org is to train an RPA process that can break so easily, versus scripting the same actions in PowerShell against a REST interface in ServiceNow (our ticketing system). They scream and cry about proper authentication methods for ServiceNow (OAuth, not username + password), and then they're happy to let an RPA process clunk around replaying something recorded. They shot down the script that works, and are investing hundreds of hours in an RPA expert to do it with far fewer scripted validation steps. Just ugh.
Wait…are they using RPA to automate interactions with ServiceNow? i.e. they’re controlling the ServiceNow UI? If so, that’s pretty terrible.
This kind of project boggles my mind. RPA should be used for situations where there is no API available, and from what I understand of the ServiceNow product, it has all kinds of APIs for automation use cases.
Our ServiceNow team literally believes use of the REST API is a security risk, because you can insert "bad data" into the incident table and others. I showed them different ways you can break the web GUI to do things like create a ticket without a customer, or set a ticket to On Hold without an On Hold Reason, etc. I actually built a dashboard called the "Gallery of Broken Tickets" (they didn't know how to make reports and dashboards) and presented it to them to show the bad data that already exists in `incident`. I'm never getting hired onto that team.
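For reference, the kind of REST call they're afraid of is a handful of lines. A minimal sketch in Python using ServiceNow's standard Table API (the script I mentioned was PowerShell via Invoke-RestMethod, but the shape is the same; the instance URL, credentials, and sys_id here are placeholders, not real values):

```python
# Minimal sketch (not our actual script): create a well-formed incident via
# ServiceNow's Table API instead of clicking through the GUI.
# Instance URL, credentials, and caller sys_id are placeholders.
import requests

INSTANCE = "https://example-dev.service-now.com"  # hypothetical dev instance

payload = {
    "caller_id": "PLACEHOLDER_SYS_ID",  # populated here, unlike the GUI flow that can be tricked into skipping it
    "short_description": "Printer on floor 3 is offline",
    "urgency": "2",
}

resp = requests.post(
    f"{INSTANCE}/api/now/table/incident",
    json=payload,
    auth=("api_user", "api_password"),  # basic auth for brevity; OAuth is the right answer, as noted above
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()  # fail loudly instead of silently creating bad data
print(resp.json()["result"]["number"])  # e.g. INC0012345
```

The point being: every validation step you'd want (required fields, allowed state transitions) is a line of explicit code here, versus an opaque recording.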
But yes, RPA is big at our org right now. In ServiceNow too.
Add another layer of AI on top of RPA and the mess will be harder, or even impossible, to ever untangle. If it works with high accuracy it will still be a great thing, but a complete black box, and an expensive one to maintain too.
I suspect they’ll co-exist, but I don’t think full replacement is likely in many cases. Existing RPA products will continue to exist if for no other reason than that they’re already deeply embedded in many environments. Over time, those same products will just get easier to use.
One of the core features of RPA products is centralized visibility and management of the lifecycle of automations. This will remain even if the entire interaction layer changes.
Something like the OS-Copilot layer lowers the barrier to entry, though. I could easily see orgs having multiple choices for this kind of automation in the future, and that's where I see the opportunity for disruption.
Most likely the RPA products will add all of the necessary buzzwords to keep people in buying roles interested and more likely to keep renewing.
I would assume that any RPA tool without a large multimodal (language) model agent framework running it is basically already obsolete.
I hope we get more open-source options like LLaVA or CogVLM.
The paper makes it seem like there may be some way to control the mouse and read screen captures, but doesn't give any details. There is a GitHub link though, so maybe it's in there.
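The repo presumably spells this out; as a guess, the simplest instantiation is a screenshot-in, mouse-action-out loop. A minimal sketch using pyautogui (a common Python library for this kind of thing; no confirmation the paper actually uses it):

```python
# Hypothetical observe/act loop: screenshot in, mouse/keyboard action out.
# pyautogui is an assumption for illustration; the paper may use something else.
import pyautogui

def observe():
    # Full-screen capture as a PIL Image; this would be what a VLM "sees".
    return pyautogui.screenshot()

def act(action):
    # Execute one model-proposed action, e.g. {"type": "click", "x": 100, "y": 200}.
    if action["type"] == "click":
        pyautogui.moveTo(action["x"], action["y"], duration=0.2)
        pyautogui.click()
    elif action["type"] == "type":
        pyautogui.write(action["text"], interval=0.05)

frame = observe()                           # would be sent to the model
act({"type": "click", "x": 100, "y": 200})  # stand-in for the model's chosen action
```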
In my wildest speculation about the gossip of recent goings-on, I imagine a board of individuals saturated in the technology, presented with the latest leading-edge AI and tasked with improving its own code. As the board and co. sat back and watched, the machine made improvements rapidly; rapidly enough that the engineers had to admit they could no longer follow the code changes. They watched as the AI redeveloped the underlying language, then the kernel, and started planning fabrication technology and how to fund it (mere trillions). The benchmarks were smashed and they hit stop. What do you do next?
Just to be clear, you’re saying that Sam Altman is currently in an Ex Machina-esque relationship with his own wannabe Rehoboam, and that his recent trillion dollar manufacturing proposal was really the first acts of a nascent AGI?
Because if so, you’re right on. The question is whether Sam knows it.
I recall reading a roadmap for camera sensors (this might be folklore; it was a long time ago). The takeaway was that they already had the technology for the 100 MP sensor, and the roadmap showed how they would maximise profit using a 20-year curve working up to the release of the final sensor, by which time a new method would meet that curve and continue their dominance. If I had AGI, I would let it out in very small drips.
Just because we have the ability to move individual atoms around doesn't mean it's affordable to do so. It probably would have been pretty expensive (and low-yield, cumbersome, and error-prone) to make a 100 MP camera in 2004.
Fair, it may well have been a roadmap based on some kind of theoretical limit on the technology. Also, I am fairly old, so 10 MP was probably more likely!
You can hire human captcha farms for sub-pennies already. Systems like reCAPTCHA are behaviour/profile-based*; the actual captcha bit is almost a smokescreen.
*At certain thresholds, even when you solve the captcha correctly, it will falsely claim you got it wrong. Particularly if you’re not logged into a Google account.
I asked this when UFO was trending, but does this actually work? I haven't been able to try many of them. I want this, but so far everything I've seen has been terrible. So, who has tried it on non-trivial, non-cherry-picked stuff?
I hope it's better at interacting with the OS than ChatGPT is. If I ask ChatGPT to write me a PowerShell/batch script to do something, it works about 25% of the time.
ChatGPT and Copilot both frequently hallucinate entire plausible-sounding APIs or properties/methods that simply don’t exist. Often this takes a lot of time to debug/resolve. They also often omit proper input validation, error handling, and logging even when prompted to include it.
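For concreteness, a hypothetical sketch of the scaffolding that tends to go missing; this isn't taken from any particular model output, it's just what "input validation, error handling, and logging" look like around even a trivial task:

```python
# Hypothetical example of the scaffolding LLM-generated scripts often skip:
# argument validation, explicit error handling, and logging.
import logging
import sys
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__)

def count_lines(path_str: str) -> int:
    path = Path(path_str)
    if not path.is_file():  # validate input before acting on it
        raise FileNotFoundError(f"not a file: {path}")
    with path.open("r", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    try:
        n = count_lines(sys.argv[1])
        log.info("counted %d lines", n)
    except IndexError:
        log.error("usage: count_lines.py <file>")
        sys.exit(2)
    except OSError as exc:  # surface the failure instead of swallowing it
        log.error("failed: %s", exc)
        sys.exit(1)
```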
I want to do a structured efficiency study of programming tasks, human vs. human-plus-AI, from problem statement through production-ready code. But my org doesn’t have enough devs to make it statistically significant, nor the spare human capacity to spend on duplicative tasks. I assume there must be some studies out there; anyone have a reference?
This is the endgame that’ll put many out of their jobs. Not that it’s bad. I personally think humanity will unfortunately need greater and greater efficiency in the coming decades.
This is not an LLM, so it doesn't make much sense to compare it with one. A more appropriate comparison would be something along the lines of GPT-4-based OS-Copilot vs. GPT-4 with CoT, or vs. GPT-4 zero-shot.