Fundamentally, we are at a point in time where models are already very capable, but not very reliable.

This is a very interesting finding about how to improve capability.

I don't see reliability expressly addressed here, but my assumption is that these alloys will be less rather than more reliable - stronger, but more brittle, to extend the alloy metaphor.

Unfortunately, for many if not most B2B use cases this reliability is the primary constraint! Would love to see similar ideas in the reliability space.



How are you defining reliability here?


Great question. For me reliability is variance in performance and capability is average performance.

In practice, high variance shows up on the downside as failures on basic things that a minimally competent human would essentially never get wrong. In agents it's exacerbated by the compounding effect of repeated calls, but even in basic workflows it can be annoying.
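
To make that split concrete, here's a minimal sketch (the run_task stub, pass rate, and trial counts are all made up for illustration) treating capability as the mean pass rate and reliability as how much that rate swings across repeated runs:

  import random
  import statistics

  def run_task(model, task) -> bool:
      # Stand-in for one real task execution; returns pass/fail.
      return random.random() < model["pass_prob"]

  def evaluate(model, tasks, trials=20):
      # Capability = average pass rate across trials.
      # Reliability = spread of that rate between trials (lower = better).
      per_trial_rates = []
      for _ in range(trials):
          passes = sum(run_task(model, t) for t in tasks)
          per_trial_rates.append(passes / len(tasks))
      return statistics.mean(per_trial_rates), statistics.pstdev(per_trial_rates)

  model = {"pass_prob": 0.8}  # hypothetical model that passes 80% of tasks
  capability, spread = evaluate(model, tasks=list(range(50)))
  print(f"capability={capability:.2f}, spread={spread:.3f}")

Two models with the same capability number can have very different spreads, and it's the spread that determines how often the "a human would never get this wrong" failures show up.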


I don’t think variance is relevant to this application, which is essentially a search function. As long as the models find the answer 1 time in 100, it doesn’t matter that it took them 100 tries - that’s just a cost optimization problem.
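
As a rough illustration of that cost framing (assuming independent attempts with a fixed per-try success rate, which is itself a simplification):

  # With per-attempt success probability p and k independent tries:
  #   P(at least one success) = 1 - (1 - p)**k
  #   Expected tries until the first success = 1 / p
  p = 0.01  # the "finds the answer 1 in 100" case
  for k in (10, 100, 460):
      print(k, round(1 - (1 - p) ** k, 3))
  # -> 10 0.096, 100 0.634, 460 0.99
  print("expected tries:", 1 / p)  # 100.0

So at 1% per try you'd expect ~100 attempts on average and need roughly 460 to be 99% sure of a hit - purely a spend/latency question, as you say.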

That being said, I think variance implicitly improves in this context, because this is the same as the poll averaging Nate Silver does - as long as the models are truly independent, averaging improves results across the board (i.e. both the average and the variance). However, if the models start converging on the same datasets and techniques, this benefit will degrade, just as polling suffers from pollster herding and the other problems that industry creates for itself.
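
A quick way to see why the independence assumption carries all the weight: for n estimates with equal variance sigma^2 and pairwise correlation rho, the variance of their average is sigma^2 * (1 + (n - 1) * rho) / n (the standard result for averaging correlated estimators - the numbers below are illustrative only):

  def ensemble_variance(sigma2, n, rho):
      # Variance of the mean of n estimates, each with variance sigma2
      # and pairwise correlation rho. rho = 0 is the fully independent case.
      return sigma2 * (1 + (n - 1) * rho) / n

  for rho in (0.0, 0.3, 0.9):
      print(rho, ensemble_variance(sigma2=1.0, n=5, rho=rho))
  # rho=0.0 -> 0.20  (full 1/n reduction)
  # rho=0.3 -> 0.44
  # rho=0.9 -> 0.92  (herded models: averaging barely helps)

With rho near 1 you're essentially polling the same model five times, which is the LLM analogue of pollster herding.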



