LLMs have access to the same tools --- they run on a computer.
The problem here is the basic implementation of LLMs. It is non-deterministic (i.e. probabilistic), which makes it inherently inadequate and unreliable for *a lot* of what people have come to expect from a computer.
You can try to obscure the problem but you can't totally eliminate it without redesigning LLMs. At this time, the only real cure is to verify everything --- which nullifies a lot of the incentive to use LLMs in the first place.
> LLMs have access to the same tools --- they run on a computer.
That doesn't give them access to anything. Tool access, if it is provided at all, comes from the harness that runs the model or from downstream software, either wired up for specific tools or through common standard interfaces like MCP that let the user supply definitions for tools external to the harness. Otherwise LLMs have no tools at all.
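As a rough illustration (a toy sketch with made-up names like `fake_model` and `word_count`, not any real SDK or the MCP wire format): the harness owns the tool table and only executes calls it chose to expose, so the model "has" exactly the tools the harness hands it.

```python
import json

# Toy sketch: the harness, not the model, decides which tools exist. The model
# only sees the definitions the harness passes in, and a "tool call" only does
# anything if the harness chooses to execute it.

TOOLS = {"word_count": lambda text: len(text.split())}   # exposed by the harness
TOOL_DEFS = [{"name": "word_count", "parameters": {"text": "string"}}]

def fake_model(prompt: str, tool_defs) -> str:
    # Stand-in for the LLM: a real model just emits text, which may encode a
    # tool call in whatever format the harness asked for.
    return json.dumps({"tool": "word_count", "args": {"text": prompt}})

def harness(prompt: str) -> str:
    call = json.loads(fake_model(prompt, TOOL_DEFS))
    if call.get("tool") in TOOLS:           # the harness gatekeeps execution
        return str(TOOLS[call["tool"]](**call["args"]))
    return "tool not available"

print(harness("no harness, no tools"))  # "4"
```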
> The problem here is the basic implementation of LLMs. It is non-deterministic (i.e. probabilistic), which makes it inherently inadequate and unreliable for a lot of what people have come to expect from a computer.
LLMs, run with the usual software, are deterministic [ignoring hardware errors and cosmic-ray bit flips, which if considered make all software non-deterministic] (having only pseudorandomness if non-zero temperature is used) but hard to predict. That said, because implementations can allow interference between separate queries processed in a batch, and the end user doesn't know what other queries share that batch, typical hosted models are non-deterministic when considered from the perspective of the known input being only what one user sends.
But your problem is probably actually that the results of untested combinations of configuration and input are not analytically predictable because of complexity, not that they are non-deterministic.
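To make the batching point concrete, here's a minimal Python sketch (a toy stand-in, not anything from an actual inference stack): floating-point addition isn't associative, so the same values reduced in a different order can give a different result, and batch composition is one thing that changes reduction order inside real inference kernels.

```python
# Floating-point addition is not associative: summing the same numbers in a
# different order can change the result. In inference stacks, batch size and
# composition can change the order of reductions inside kernels, so the same
# single-user prompt can yield different logits across requests even though
# every individual operation is deterministic.
a = [1e16, 1.0, -1e16]   # one reduction order
b = [1e16, -1e16, 1.0]   # same numbers, different order
print(sum(a))            # 0.0  (1e16 + 1.0 rounds back to 1e16, then cancels)
print(sum(b))            # 1.0  (the big terms cancel first, so the 1.0 survives)
```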
LLMs absolutely do not have access to the same tools unless they're explicitly given access to them. Running on a computer means nothing.
It sounds like you don't like LLMs! In that case, you may be more interested in our REST API. All the same functions, but designed for edge computing, where dependency bloat is a real issue: https://tinyfn.io/edge
They're building a moat with data. They're building their own datasets of trusted sources, using their own teams of physicians and researchers. They've got hundreds of thousands of physicians asking millions of questions every day. None of the labs have this sort of data coming in or this sort of focus on such a valuable niche.
> They're building their own datasets of trusted sources, using their own teams of physicians and researchers.
Oh, so they are not just helping with search but also curating data.
> They've got hundreds of thousands of physicians asking millions of questions every day. None of the labs have this sort of data coming in or this sort of focus on such a valuable niche
I don't take this too seriously because lots of physicians use ChatGPT already.
Saying they aren't pioneering is very different from saying they aren't a major player in the space. There are only like 5-7 players with a foundational model that they can serve at scale, and xAI is one of them.
This is an interesting read, and while I support being nice to every_thing_ in principle, most of the research into this actually shows that being mean yields better results.
I've read the blurbs from previous years about doing one-shots with threats of death, etc. - but I've never seen that for long, many-prompt sessions.
I wonder - if you hired a programmer for a day, trapped them in a cage, and then threatened them, maybe it would be more productive for a while. I mean, if I were writing that book, I could see how they would do great work for a bit.
This looks pretty cool. I keep seeing people (and am myself) using Claude Code for more and more _non-dev_ work: managing different aspects of life, work, etc. Anthropic has built the best harness right now. Building out the UI makes sense to get genpop adoption.
Yeah, the harness quality matters a lot. We're seeing the same pattern at Gobii - started building browser-native agents and quickly realized most of the interesting workflows aren't "code this feature" but "navigate this nightmare enterprise SaaS and do the thing I actually need done." The gap between what devs use Claude Code for vs. what everyone else needs is mostly just the interface.
I read it. I agree this is out of touch, not because the things it's saying are wrong, but because the things it's saying have been true for almost a year now. They are not "getting worse"; they "have been bad". I am staggered to find this article qualifies as "news".
If you're going to write about something that's been true and discussed widely online for a year+, at least have the awareness/integrity to not brand it as "this new thing is happening".
The models have gotten very good, but I'd rather have an obviously broken pile of crap that I can spot immediately than something that is deep-fried with RL to always succeed but has subtle problems that someone will lgtm :( I guess it's not much different with human-written code, but the models seem to have weirdly inhuman failures - like, you would just skim some code because you just can't believe that anyone could do it wrong, and it turns out to be wrong.
Well, for some reason it doesn't let me respond to the child comments :(
The problem (which should be obvious) is that with real-valued a/b you can't construct an exhaustive input/output set. A test case can only prove the presence of a bug, not its absence.
Another category of problems that you can't just test for, and instead have to prove correct, is concurrency problems.
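A tiny Python sketch of that point, using a hypothetical add_numbers like the one in the sibling comment: a finite test suite can only vouch for the pairs it actually contains, and behaviour outside those pairs can still surprise a caller who is thinking in terms of ideal reals.

```python
# Hypothetical add_numbers with a small, green test suite. The suite proves
# these specific cases work; it says nothing about the untested (a, b) pairs,
# where binary floating point diverges from real-number arithmetic.

def add_numbers(a: float, b: float) -> float:
    return a + b

# Finite, hand-picked test set: all passing.
assert add_numbers(1.0, 2.0) == 3.0
assert add_numbers(-1.5, 1.5) == 0.0

# An input the suite never covered: a caller treating a and b as reals would
# call this a bug, and no finite set of passing cases could have ruled it out.
assert add_numbers(0.1, 0.2) != 0.3   # actually 0.30000000000000004
```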
Of course you can. You can write test cases for anything.
Even an add_numbers function can have bugs, e.g. you have to ensure the inputs are numbers. Most coding agents would catch this in loosely-typed languages.
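For instance, a minimal sketch (a hypothetical add_numbers, not code from any product discussed here) of the kind of input-validation bug and test meant above:

```python
from numbers import Number

def add_numbers(a, b):
    # Guard against non-numeric inputs; without it, "2" + 3 would fail with a
    # confusing error deeper down (or silently concatenate in other languages).
    if not isinstance(a, Number) or not isinstance(b, Number):
        raise TypeError("add_numbers expects numeric inputs")
    return a + b

# The happy path and the validation path -- the sort of tests a coding agent
# can generate and check automatically.
assert add_numbers(2, 3) == 5
try:
    add_numbers("2", 3)
except TypeError:
    pass  # the guard caught the bad input, as intended
else:
    raise AssertionError("expected TypeError for a non-numeric input")
```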