You might consider the Accessibility Tree and its semantics. Plain divs are basically filtered out so you're left with interactive objects and some structural/layout cues.
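For a rough idea of what that buys you, here's a minimal sketch using Playwright's accessibility snapshot (the URL is just a placeholder; any CDP-capable client can produce a similar tree):

```ts
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com"); // placeholder URL

// snapshot() returns a tree of { role, name, children } nodes; role-less
// wrapper divs don't show up, so the output is mostly interactive/semantic nodes.
const tree = await page.accessibility.snapshot();
console.dir(tree, { depth: null });

await browser.close();
```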
I've been trying (albeit not very hard) to build an accessibility library and toolset that can be exposed via an MCP server. I think it has the potential to be much more ergonomic for generalized computer-use agents than stuff like Playwright or the classic screenshot approach. Low-latency computer use is another thing that I'd like to solve.
The issue is that the macOS and Windows accessibility APIs are opaque and I have no idea what I'm doing, so I'm forced to vibe code it all, which is not turning out too well... :-)
I suffer from mild carpal tunnel, so I want to build a really low-latency computer-use agent that can do anything on my computer without me having to learn the Talon voice syntax or some other traditional accessibility software like macOS dictation.
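For the MCP side, the shape I have in mind is roughly this (a sketch with the official TypeScript SDK; `readAccessibilityTree` is the hypothetical platform-specific part I'm struggling with):

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical wrapper over the macOS/Windows accessibility APIs.
declare function readAccessibilityTree(appName?: string): Promise<unknown>;

const server = new McpServer({ name: "a11y-tools", version: "0.1.0" });

server.tool(
  "get_ax_tree",
  "Return the accessibility tree of the focused (or named) application",
  { appName: z.string().optional() },
  async ({ appName }) => {
    const tree = await readAccessibilityTree(appName);
    return { content: [{ type: "text", text: JSON.stringify(tree) }] };
  }
);

await server.connect(new StdioServerTransport());
```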
My guess is that this is for impatient people; people who think that the prescribed use cases are somehow necessary for their "workflows"; people who subscribe to terms like "cognitive friction" within the context of these use cases; people who are...sort of lazy.
That's a really good question. Maybe it's because laziness is associated with a lack of intellect? And certain technologies, like AI and other software, are meant to augment our intellect.
These fancy words carry an air of intellect and productivity. Putting them to use probably makes people feel like they're getting things done, and so they never feel lazy.
I mean, what else do you use to run things in the browser?
PouchDB. Hypercore (Pear). It’s nice to be able to spin up JS versions of things and have them “just work” on the most widely deployed platform in the world.
TensorFlow.js was awesome for years, with things like BlazeFace, Ready Player Me avatars, Hallway Tile, and other models working in real time at the edge, before ChatGPT was even conceived. What’s your solution, transpile Go into Wasm?
Agents can work in people’s browsers as well as in Node.js around the world. Being inside a browser gives a great sandbox, and it’s private on the person’s own machine too.
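Concretely, that era of client-side inference looked something like this (a sketch using the `@tensorflow-models/blazeface` package; assumes a `<video>` element already streaming the webcam):

```ts
import "@tensorflow/tfjs"; // registers the WebGL/CPU backends
import * as blazeface from "@tensorflow-models/blazeface";

const video = document.querySelector("video")!;
const model = await blazeface.load();

// All inference happens in the browser; nothing leaves the machine.
const faces = await model.estimateFaces(video, /* returnTensors */ false);
for (const face of faces) {
  console.log("face at", face.topLeft, "to", face.bottomRight);
}
```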
> what else do you use to run things in the browser?
I do my best to run as little in the browser as possible. Everything is an order of magnitude simpler and faster to build if you do the bulk of things on a server in a language of your choice and render to the browser as necessary.
For those of us forced to be in the JS ecosystem, finally having a runtime that Just Works has been great.
Bun has replaced a massive number of tools and dependencies in our stack and really counteracted the tooling explosion that we were forced into with Node.
In our case, it's not so much being forced to use Bun, but rather that Bun is in real terms infinitely more convenient than lower-level languages. Firstly, even the most novice of novices tends to have a passing familiarity with JS/TS, whereas this is not true for C/Zig/Rust/etc., so it's easier for people to contribute to our projects. Bun also provides so many things for free, statically, and cross-platform. You want a TCP server? A WebSocket server? An SQLite database? You want to include static assets? You want to generate static assets at compile time? Etc.? Bun provides it.
Attempting to replicate even a modicum of this in lower-level languages can be a real struggle. Rust is definitely the least-worst in this respect, because there has been a concerted effort by the community to provide stable packages that do most things, but Rust is a complicated and unapproachable language. Use other low-level languages like C or Zig and you immediately run into issues with libraries and static linking. And even if you find a library, its documentation is either lacklustre or outright missing (looking at you, libuv and libxev, respectively).
Consider the amount of manual setup and third-party build-system finagling needed just to: 1) run a TCP server; 2) fetch data over HTTP; 3) do both of these on a single event loop (no separate threads); 4) use SQLite for storage; and 5) have all of this produce a single self-contained executable. I cannot overstate how trivial this is with Bun.
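To make that concrete, here's roughly what items 1–5 look like in a single Bun file (a sketch, not production code; the port, schema, and upstream URL are made up):

```ts
import { Database } from "bun:sqlite";

// 4) SQLite, bundled with the runtime — no native module builds.
const db = new Database("app.sqlite");
db.run("CREATE TABLE IF NOT EXISTS hits (path TEXT, at INTEGER)");
const insertHit = db.query("INSERT INTO hits (path, at) VALUES (?, ?)");

// 1–3) HTTP + WebSocket server and outbound fetch, all on one event loop.
// (For a raw TCP socket server, Bun.listen() works similarly.)
Bun.serve({
  port: 3000, // made-up port
  async fetch(req, server) {
    const url = new URL(req.url);
    if (url.pathname === "/ws" && server.upgrade(req)) return; // WebSocket upgrade
    insertHit.run(url.pathname, Date.now());
    const upstream = await fetch("https://example.com"); // placeholder outbound call
    return new Response(`ok (upstream said ${upstream.status})`);
  },
  websocket: {
    message(ws, msg) {
      ws.send(`echo: ${msg}`);
    },
  },
});

// 5) `bun build --compile ./server.ts --outfile server` produces one executable.
```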
> you still have to figure out how to concretely receive a response back
Isn't that handled by whatever Tool API you're using? There's usually a `function_call_output` or `tool_result` message type. I haven't had a need for a separate protocol just to send responses.
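For example, with Anthropic-style tool use (OpenAI's `function_call_output` item is the analogous shape), the round trip is just two messages; the weather tool here is made up:

```ts
// The model asks for a tool call...
const assistantTurn = {
  role: "assistant",
  content: [
    { type: "tool_use", id: "toolu_123", name: "get_weather", input: { city: "Oslo" } },
  ],
};

// ...you run it yourself, then hand the result back as the next user message.
const toolResultTurn = {
  role: "user",
  content: [
    { type: "tool_result", tool_use_id: "toolu_123", content: "4°C, light rain" },
  ],
};
```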
I'm unaware of the GitHub MCP "exploit", but given the overall state of LLM/MCP security FUD, there's probably some self-promotional blog post from a security company about an LLM doing something stupid with GitHub data that the owner of the LLM-using system didn't intend.
For example, let's say I create an application that lets you chat with my open-source repo. I set up my LLM with a GitHub tool. I don't want to think about OAuth and getting a token from the end user, so I give it a PAT generated from my own account. Being even lazier, I just use a PAT I already had lying around, and it unfortunately has read/write access to SSH keys. The user can now add their SSH key to my account and do malicious things.
Oh no, MCP is super vulnerable, please buy my LLM security product.
If you give the LLM a tool, and you give the LLM input from a user, the user has access to that tool. That shrimple.
Also currently on the front page. It's mainly that this tool hits the trifecta of having privileged access, untrusted inputs, and the ability to exfiltrate data. Most tools only hit one or two of those, so attacks need to be more sophisticated to coordinate all three.
I haven't looked at MCP payloads properly to compare, but the raw OpenAPI spec is often overly verbose and eats context space pretty quickly.
It's really trivial to have the LLM first filter it down to the sections it cares about and then condense those sections, though.
Wrap that process in a small tool, give it to the LLM along with a `fetch` tool that handles credentials based on the URL, and agent capabilities explode pretty rapidly.
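A hypothetical sketch of those two tools (everything here — `callLLM`, the credential map, the tool shape — is made up for illustration):

```ts
type Tool = { name: string; run(input: any): Promise<string> };

// Stand-in for whatever model call you already have.
declare function callLLM(prompt: string): Promise<string>;

// Tool 1: shrink a raw OpenAPI spec down to the parts relevant to a topic.
const condenseSpec: Tool = {
  name: "condense_openapi",
  async run({ specUrl, topic }: { specUrl: string; topic: string }) {
    const spec = await (await fetch(specUrl)).text();
    return callLLM(
      `Keep only the paths and schemas relevant to "${topic}", condensed:\n${spec}`
    );
  },
};

// Tool 2: a fetch tool that attaches credentials based on the target host.
const credentials: Record<string, string> = {
  "api.github.com": process.env.GITHUB_TOKEN ?? "",
};

const fetchTool: Tool = {
  name: "fetch",
  async run({ url, method = "GET", body }: { url: string; method?: string; body?: string }) {
    const host = new URL(url).host;
    const headers: Record<string, string> = credentials[host]
      ? { Authorization: `Bearer ${credentials[host]}` }
      : {};
    const res = await fetch(url, { method, headers, body });
    return res.text();
  },
};
```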
No, algebraic effects are a generalization that supports more cases than Lisp's condition system, since continuations are multi-shot. The closest thing is `call/cc` from Scheme.
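To see why multi-shot matters: the toy handler below resumes the same suspended computation once per candidate answer, something a Lisp condition handler (which resumes at most once within its dynamic extent) can't do. JS generators are one-shot, so this sketch fakes multi-shot resumption by replaying a log of earlier answers:

```ts
type Eff = { op: "ask" };

function* program(): Generator<Eff, string, number> {
  const x = yield { op: "ask" }; // "perform" an effect
  return `got ${x}`;
}

// Run `program`, resuming every pending `ask` once per candidate answer —
// i.e. invoking the captured continuation more than once.
function handleAskWith(answers: number[]): string[] {
  const results: string[] = [];
  const run = (replay: number[]) => {
    const it = program();
    let step = it.next();
    for (const v of replay) step = it.next(v); // replay earlier answers
    if (step.done) {
      results.push(step.value);
      return;
    }
    for (const a of answers) run([...replay, a]); // resume once per answer
  };
  run([]);
  return results;
}

console.log(handleAskWith([1, 2])); // ["got 1", "got 2"]
```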
Sometimes drawing these parallels hurts more than not having them in the first place.
What a thought-terminating way to approach an idea. Effects are not simply renamed conditions, and we have a whole article here describing them in more detail than that one sentence, so you can see some of the differences for yourself.
Does having access to Chromium internals give you any superpowers over connecting via the Chrome DevTools Protocol?