I think you're probably right. As an example of this challenge, I've noticed that engineers without an ML background often lack the mental models for thinking about how to test ML models (i.e., statistical evaluation rather than the pass/fail test cases used to test code).
The way I look at this is that Plexe can be useful even if it doesn't solve this fundamental problem. When a team doesn't have ML expertise, their choices are A) don't use ML, B) acquire ML expertise, or C) use ChatGPT as your predictor. Option C suffers from the same problem you mentioned, in addition to latency/scalability/cost issues and the model not being trained on your data, etc. So something like Plexe could be an improvement on option C by at least addressing those latter pain points.
Plus: we can keep throwing more compute at the agentic model building process, doing more analysis, more planning, more evaluation, more testing, etc. It still won't solve the problem you bring up, but hopefully it gets us closer to the point of "good enough to not matter" :)
Would love to hear your thoughts on this.