Sorry didnt see this earlier. If you're interested reach out to me (Kosta Welke) on linkedin. Or write me an email, you can find me on Octominds About page.
Yes and no. Getting a VLM to work on the web would definitely be great, but it comes with its own problems, mainly around developing and acting on bounding boxes. We have vision as a default fallback for Stagehand, but we've found that the screenshot sent to the VLM often has to have pre-labeled elements on it. More notably, the screenshot with everything prelabeled leads to a cluttered and unusable image to process. Not pre-labeling runs the risk of missing important elements. I imagine a happy medium where the DOM+a11y tree can be used for candidate generation to a VLM.
Solely depending on a VLM is indeed reminiscent of how humans interact with the web, but when a model thrives with more data, why restrict the data sent to the model?
Thanks so much! Yes, a lot of antibots are able to detect Playwright based on browser config. Generally, antibots are a good thing -- I think in the future, as web agents become more popular, I'd imagine a fruitful partnership to prevent misuse if it's coming from a trusted web agent v. an unknown one
Its not at a stage I'd be comfortable to put it on GitHub yet, maybe in a few months.
And I think you misunderstood my comment, I didn't describe my project, but extrapolated from the parents desire and my motivations for my project.
Mine is actually pretty close to stagehand, at least I could very well use it. It's basically a web UI to configure browser tasks like open webpage x, iterate over "item type", with LLM integration to determine what the CSS selector for that would be.
On next execution it would attempt to use the previously determined CSS selector instead of the LLM integration. On failures, it'd raise a notification with an admin tasks to verify new selectors/fix the script
But it's a lot of code to put together as a generic UI - as I want these tasks to be repeatable without restarting from the beginning etc
Still very much in the PoC stage without any tests, barely working persistence etc
Yes^ this is what we suggest. Stagehand is meant to execute isolated tasks on browsers; we support using custom contexts (cookies) with the following command:
Yes! These are both phenomenal projects, and kudos to their authors as well. Stagehand is different in that it makes fine-grained control a first-class citizen. Often times, you want to control the exact steps a web agent takes. Our experience using other tools was that the only control you have over these steps in other tools is in the natural language prompt.
However with Stagehand, because it's an extension of Playwright, it allows you to confirm each step of the underlying agent's workflow, making it the most customizable option for engineers who want/need that
Thanks so much! Crawlspace is pretty sick too, as is Integuru. A lot of people have different takes here on the level of automation to leave up to the user. As a developer building for developers, I wanted to meet in the middle and build off an existing incumbent that most people are likely familiar with already