LLMs have access to the same tools --- they run on a computer.
The problem here is the basic implementation of LLMs. It is non-deterministic (i.e. probabilistic), which makes it inherently inadequate and unreliable for *a lot* of what people have come to expect from a computer.
You can try to obscure the problem but you can't totally eliminate it without redesigning LLMs. At this time, the only real cure is to verify everything --- which nullifies a lot of the incentive to use LLMs in the first place.
> LLMs have access to the same tools --- they run on a computer.
That doesn't give them access to anything. Tool access, if it is provided at all, comes from the harness that runs the model or from downstream software, either wired up for specific tools or through common standard interfaces like MCP that let the user supply definitions for tools external to the harness. Otherwise LLMs have no tools at all.
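As a rough illustration (a toy sketch with made-up names like `fake_model` and `word_count`, not any real SDK or the MCP wire format): the harness owns the tool table and only executes calls it chose to expose, so the model "has" exactly the tools the harness hands it.

```python
import json

# Toy sketch: the harness, not the model, decides which tools exist. The model
# only sees the definitions the harness passes in, and a "tool call" only does
# anything if the harness chooses to execute it.

TOOLS = {"word_count": lambda text: len(text.split())}   # exposed by the harness
TOOL_DEFS = [{"name": "word_count", "parameters": {"text": "string"}}]

def fake_model(prompt: str, tool_defs) -> str:
    # Stand-in for the LLM: a real model just emits text, which may encode a
    # tool call in whatever format the harness asked for.
    return json.dumps({"tool": "word_count", "args": {"text": prompt}})

def harness(prompt: str) -> str:
    call = json.loads(fake_model(prompt, TOOL_DEFS))
    if call.get("tool") in TOOLS:           # the harness gatekeeps execution
        return str(TOOLS[call["tool"]](**call["args"]))
    return "tool not available"

print(harness("no harness, no tools"))  # "4"
```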
> The problem here is the basic implementation of LLMs. It is non-deterministic (i.e. probabilistic), which makes it inherently inadequate and unreliable for a lot of what people have come to expect from a computer.
LLMs, run with the usual software, are deterministic [ignoring hardware errors and cosmic-ray bit flips, which if considered make all software non-deterministic] (having only pseudorandomness if non-zero temperature is used) but hard to predict. That said, because implementations can allow interference between separate queries processed in a batch, and the end user doesn't know what other queries share that batch, typical hosted models are non-deterministic when considered from the perspective of the known input being only what one user sends.
But your problem is probably actually that the results of untested combinations of configuration and input are not analytically predictable because of complexity, not that they are non-deterministic.
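To make the batching point concrete, here's a minimal Python sketch (a toy stand-in, not anything from an actual inference stack): floating-point addition isn't associative, so the same values reduced in a different order can give a different result, and batch composition is one thing that changes reduction order inside real inference kernels.

```python
# Floating-point addition is not associative: summing the same numbers in a
# different order can change the result. In inference stacks, batch size and
# composition can change the order of reductions inside kernels, so the same
# single-user prompt can yield different logits across requests even though
# every individual operation is deterministic.
a = [1e16, 1.0, -1e16]   # one reduction order
b = [1e16, -1e16, 1.0]   # same numbers, different order
print(sum(a))            # 0.0  (1e16 + 1.0 rounds back to 1e16, then cancels)
print(sum(b))            # 1.0  (the big terms cancel first, so the 1.0 survives)
```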
LLMs absolutely do not have access to the same tools unless they're explicitly given access to them. Running on a computer means nothing.
It sounds like you don't like LLMs! In that case, you may be more interested in our REST API. All the same functions, but designed for edge computing, where dependency bloat is a real issue: https://tinyfn.io/edge
They're building a moat with data. They're building their own datasets of trusted sources, using their own teams of physicians and researchers. They've got hundreds of thousands of physicians asking millions of questions every day. None of the labs have this sort of data coming in or this sort of focus on such a valuable niche.
> They're building their own datasets of trusted sources, using their own teams of physicians and researchers.
Oh, so they are not just helping with search but also curating data.
> They've got hundreds of thousands of physicians asking millions of questions every day. None of the labs have this sort of data coming in or this sort of focus on such a valuable niche
I don't take this too seriously because lots of physicians use ChatGPT already.
Saying they aren't pioneering is very different from saying they aren't a major player in the space. There are only like 5-7 players with a foundational model that they can serve at scale, and xAI is one of them.
This is an interesting read, and while I support being nice to every_thing_ in principle, most of the research into this actually shows that being mean yields better results.
I've read the blurbs from previous years about doing one-shots with threats of death, etc. - but I've never seen that for long, many-prompt sessions.
I wonder - if you hired a programmer for a day, trapped them in a cage, and then threatened them, maybe it would be more productive for a while. I mean, if I were writing that book, I could see how they would do great work for a bit.
This looks pretty cool. I keep seeing people (and am myself) using Claude Code for more and more _non-dev_ work: managing different aspects of life, work, etc. Anthropic has built the best harness right now. Building out the UI makes sense to get genpop adoption.
Yeah, the harness quality matters a lot. We're seeing the same pattern at Gobii - started building browser-native agents and quickly realized most of the interesting workflows aren't "code this feature" but "navigate this nightmare enterprise SaaS and do the thing I actually need done." The gap between what devs use Claude Code for vs. what everyone else needs is mostly just the interface.
I read it. I agree this is out of touch, not because the things it's saying are wrong, but because the things it's saying have been true for almost a year now. They are not "getting worse"; they "have been bad". I am staggered to find this article qualifies as "news".
If you're going to write about something that's been true and discussed widely online for a year+, at least have the awareness/integrity to not brand it as "this new thing is happening".
The models have gotten very good, but I'd rather have an obviously broken pile of crap that I can spot immediately than something that is deep-fried with RL to always succeed but has subtle problems that someone will lgtm :( I guess it's not much different with human-written code, but the models seem to have weirdly inhuman failures - like, you would just skim some code because you just can't believe that anyone could do it wrong, and it turns out to be wrong.
Well, for some reason it doesn't let me respond to the child comments :(
The problem (which should be obvious) is that with real-valued a/b you can't construct an exhaustive input/output set. A test case can only prove the presence of a bug, not its absence.
Another category of problems that you can't just test for, and instead have to prove correct, is concurrency problems.
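A tiny Python sketch of that point, using a hypothetical add_numbers like the one in the sibling comment: a finite test suite can only vouch for the pairs it actually contains, and behaviour outside those pairs can still surprise a caller who is thinking in terms of ideal reals.

```python
# Hypothetical add_numbers with a small, green test suite. The suite proves
# these specific cases work; it says nothing about the untested (a, b) pairs,
# where binary floating point diverges from real-number arithmetic.

def add_numbers(a: float, b: float) -> float:
    return a + b

# Finite, hand-picked test set: all passing.
assert add_numbers(1.0, 2.0) == 3.0
assert add_numbers(-1.5, 1.5) == 0.0

# An input the suite never covered: a caller treating a and b as reals would
# call this a bug, and no finite set of passing cases could have ruled it out.
assert add_numbers(0.1, 0.2) != 0.3   # actually 0.30000000000000004
```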
Of course you can. You can write test cases for anything.
Even an add_numbers function can have bugs, e.g. you have to ensure the inputs are numbers. Most coding agents would catch this in loosely-typed languages.
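For instance, a minimal sketch (a hypothetical add_numbers, not code from any product discussed here) of the kind of input-validation bug and test meant above:

```python
from numbers import Number

def add_numbers(a, b):
    # Guard against non-numeric inputs; without it, "2" + 3 would fail with a
    # confusing error deeper down (or silently concatenate in other languages).
    if not isinstance(a, Number) or not isinstance(b, Number):
        raise TypeError("add_numbers expects numeric inputs")
    return a + b

# The happy path and the validation path -- the sort of tests a coding agent
# can generate and check automatically.
assert add_numbers(2, 3) == 5
try:
    add_numbers("2", 3)
except TypeError:
    pass  # the guard caught the bad input, as intended
else:
    raise AssertionError("expected TypeError for a non-numeric input")
```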