Fundamentally, we are at a point in time where models are already very capable, but not very reliable.

This is a very interesting finding about how to improve capability.

I don't see reliability expressly addressed here, but my assumption is that these alloys will be less rather than more reliable - stronger, but more brittle, to extend the alloy metaphor.

Unfortunately, for many if not most B2B use cases this reliability is the primary constraint! Would love to see similar ideas in the reliability space.



How are you defining reliability here?


Great question. For me reliability is variance in performance and capability is average performance.

In practice, high variance shows up on the downside as failures on basic things that a minimally competent human would essentially never get wrong. In agents it's exacerbated by the compounding effect of repeated calls, but even in basic workflows it can be annoying.
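
To make that split concrete, here's a minimal sketch (the run_task stub, pass rate, and trial counts are all made up for illustration) treating capability as the mean pass rate and reliability as how much that rate swings across repeated runs:

  import random
  import statistics

  def run_task(model, task) -> bool:
      # Stand-in for one real task execution; returns pass/fail.
      return random.random() < model["pass_prob"]

  def evaluate(model, tasks, trials=20):
      # Capability = average pass rate across trials.
      # Reliability = spread of that rate between trials (lower = better).
      per_trial_rates = []
      for _ in range(trials):
          passes = sum(run_task(model, t) for t in tasks)
          per_trial_rates.append(passes / len(tasks))
      return statistics.mean(per_trial_rates), statistics.pstdev(per_trial_rates)

  model = {"pass_prob": 0.8}  # hypothetical model that passes 80% of tasks
  capability, spread = evaluate(model, tasks=list(range(50)))
  print(f"capability={capability:.2f}, spread={spread:.3f}")

Two models with the same capability number can have very different spreads, and it's the spread that determines how often the "a human would never get this wrong" failures show up.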


I don’t think variance is relevant to this application, which is essentially a search function. As long as the models find the answer 1 time in 100, it doesn’t matter that it took them 100 tries - that’s just a cost optimization problem.
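
As a rough illustration of that cost framing (assuming independent attempts with a fixed per-try success rate, which is itself a simplification):

  # With per-attempt success probability p and k independent tries:
  #   P(at least one success) = 1 - (1 - p)**k
  #   Expected tries until the first success = 1 / p
  p = 0.01  # the "finds the answer 1 in 100" case
  for k in (10, 100, 460):
      print(k, round(1 - (1 - p) ** k, 3))
  # -> 10 0.096, 100 0.634, 460 0.99
  print("expected tries:", 1 / p)  # 100.0

So at 1% per try you'd expect ~100 attempts on average and need roughly 460 to be 99% sure of a hit - purely a spend/latency question, as you say.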

That being said, I think variance implicitly improves in this context, because this is the same as the poll averaging Nate Silver does - as long as the models are truly independent, averaging improves results across the board (i.e. both the average and the variance). However, if the models start converging on the same datasets and techniques, this benefit will degrade, just as polling suffers from pollster herding and the other problems that industry creates for itself.
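
A quick way to see why the independence assumption carries all the weight: for n estimates with equal variance sigma^2 and pairwise correlation rho, the variance of their average is sigma^2 * (1 + (n - 1) * rho) / n (the standard result for averaging correlated estimators - the numbers below are illustrative only):

  def ensemble_variance(sigma2, n, rho):
      # Variance of the mean of n estimates, each with variance sigma2
      # and pairwise correlation rho. rho = 0 is the fully independent case.
      return sigma2 * (1 + (n - 1) * rho) / n

  for rho in (0.0, 0.3, 0.9):
      print(rho, ensemble_variance(sigma2=1.0, n=5, rho=rho))
  # rho=0.0 -> 0.20  (full 1/n reduction)
  # rho=0.3 -> 0.44
  # rho=0.9 -> 0.92  (herded models: averaging barely helps)

With rho near 1 you're essentially polling the same model five times, which is the LLM analogue of pollster herding.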



