I think there are indeed many challenges when evaluating Compound AI Systems (http://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems...)
But evals in complex systems are the best we have at the moment. It’s a “best-practice” just like all the forms of testing in the “test pyramid” (https://martinfowler.com/articles/practical-test-pyramid.htm...)
Nothing is a silver bullet. Just hard won, ideally automated, integrated quality and verification checks, built deep into the system and SDLC.
I think there are indeed many challenges when evaluating Compound AI Systems (http://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems...)
But evals in complex systems are the best we have at the moment. It’s a “best-practice” just like all the forms of testing in the “test pyramid” (https://martinfowler.com/articles/practical-test-pyramid.htm...)
Nothing is a silver bullet. Just hard won, ideally automated, integrated quality and verification checks, built deep into the system and SDLC.