It's going to be really interesting to see how work like this impacts the Robotic Process Automation or "RPA" product space.
Many of the enterprise products in this category have used computer vision to help reduce brittleness, but for all of the improvements, these tools have remained highly error-prone, not to mention extremely expensive.
The explosion of interest in the category across a broader community of researchers and developers seems like both a boon for the RPA space, and a major threat of disruption.
As an aside: I'm stunned by how willing my own org is to train an RPA process that can break so easily, vs. scripting the same actions in PowerShell and running them against the REST interface in ServiceNow (our ticketing system). They scream and cry about proper authentication methods for ServiceNow (OAuth, not username + password), and then they're happy to let an RPA process clunk around replaying something recorded. They shot down the script that works, and are investing hundreds of hours in an RPA expert to do it with far fewer scripted validation steps. Just ugh.
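For anyone curious what the scripted route looks like, here is a rough sketch in Python rather than PowerShell. It assumes the standard ServiceNow Table API path (`/api/now/table/incident`) and an OAuth bearer token obtained elsewhere. The function name, the instance name, and the choice of validated fields are hypothetical, and it only builds the request without sending it.

```python
import json
import urllib.request


def build_incident_request(instance, token, caller_id, short_description):
    """Build (but do not send) a POST that would create an incident.

    Assumptions: `instance` is the ServiceNow subdomain, `token` is an
    OAuth access token obtained elsewhere, and the two validated fields
    are illustrative rather than a complete set of checks.
    """
    # Scripted validation steps: reject bad input before it reaches the table.
    if not caller_id:
        raise ValueError("caller_id is required")
    if not short_description or not short_description.strip():
        raise ValueError("short_description is required")

    body = json.dumps(
        {"caller_id": caller_id, "short_description": short_description.strip()}
    ).encode("utf-8")

    return urllib.request.Request(
        url=f"https://{instance}.service-now.com/api/now/table/incident",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # OAuth, not username + password
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
```

Sending it would be a single `urllib.request.urlopen(req)` call. The point is that each validation step is an explicit line in the script that can be reviewed and tested, which a recorded click path doesn't give you.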
Wait…are they using RPA to automate interactions with ServiceNow? i.e. they’re controlling the ServiceNow UI? If so, that’s pretty terrible.
This kind of project baffles me. RPA should be used for situations where there is no API available, and from what I understand of the ServiceNow product, it has all kinds of APIs for automation use cases.
Our ServiceNow team literally believes use of the REST API is a security risk, because you can insert "bad data" into the incident table and others. I showed them different ways you can break the web GUI to do things like creating a ticket without a customer, or setting a ticket to On Hold without an On Hold Reason, etc. I actually created a dashboard called the "Gallery of Broken Tickets" because they didn't know how to make reports and dashboards, and then I presented this to them to show them bad data that currently exists in `incident`. I'm never getting hired to that team.
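The kind of check behind a "Gallery of Broken Tickets" can be sketched in a few lines of Python. This is only an illustration, not the actual dashboard: it assumes incident records have already been exported as dictionaries, that the caller lives in `caller_id`, that the On Hold state is stored as `3`, and that the reason lives in `hold_reason`. Those field names and values may differ per instance.

```python
ON_HOLD_STATE = "3"  # assumption: the On Hold state value in this instance


def find_broken_tickets(records):
    """Return (ticket number, list of problems) for each inconsistent record."""
    broken = []
    for rec in records:
        problems = []
        # A ticket without a customer.
        if not rec.get("caller_id"):
            problems.append("missing caller")
        # A ticket set to On Hold without an On Hold Reason.
        if str(rec.get("state")) == ON_HOLD_STATE and not rec.get("hold_reason"):
            problems.append("on hold without reason")
        if problems:
            broken.append((rec.get("number"), problems))
    return broken
```

The same checks apply no matter how the record was created, whether through the web GUI, the REST API, or an RPA bot, so the entry path isn't what decides whether bad data gets in.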
But yes, RPA is big at our org right now. In ServiceNow too.
Add another layer of AI on top of RPA and the mess will be harder, or even impossible, to untangle. If it works with high accuracy it will still be a great thing, but a complete black box and an expensive one to maintain too.
I suspect they’ll co-exist, but I don’t think full replacement is likely in many cases. Existing RPA products will continue to exist if for no other reason than that they’re already deeply embedded in many environments. Over time, those same products will just get easier to use.
One of the core features of RPA products is centralized visibility and management of the lifecycle of automations. This will remain even if the entire interaction layer changes.
Something like the OS-Copilot layer lowers the barrier to entry though and I could easily see there being multiple choices for orgs looking to do this kind of automation in the future, and this is where I see the opportunity for disruption.
Most likely the RPA products will add all of the necessary buzzwords to keep people in buying roles interested and more likely to keep renewing.
I would assume that any RPA tool without a large multimodal (language) agent framework running it is basically already obsolete.
I hope we get more open source options like Llava or CogVLM.
The paper makes it seem like there may be some way to control the mouse and read screen captures, but doesn't give any details. There is a GitHub link though so maybe it's in there.