That's a valid concern. I think, just like with any other software, you need to write tests against the AI model that continuously check whether your prompts are working as intended. Basic unit tests would work well in this case.
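Something like this minimal sketch, for example (assuming the OpenAI Python client and pytest; the model name, prompt, and expected labels are just placeholders):

```python
# Hypothetical example: a pytest check that a classification prompt
# still returns one of the expected labels.
import pytest
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CLASSIFY_PROMPT = (
    "Classify the sentiment of the following review as exactly one word: "
    "positive, negative, or neutral.\n\nReview: {review}"
)

@pytest.mark.parametrize("review,expected", [
    ("Absolutely loved it, would buy again!", "positive"),
    ("Broke after two days, total waste of money.", "negative"),
])
def test_sentiment_prompt(review, expected):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        temperature=0,        # reduce nondeterminism for the test
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(review=review)}],
    )
    answer = response.choices[0].message.content.strip().lower()
    assert expected in answer
```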
The difference compared to ordinary unit tests is that the breakage is outside of your control and updating the tests is tedious. Also, testing means making real API calls rather than using stubs, so it needs additional infrastructure.
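In practice that usually means gating the live tests behind a flag so the normal offline suite stays fast and free. A sketch, assuming pytest (the env var name is made up):

```python
# test_llm_live.py — keep the live-API prompt tests in their own suite.
import os
import pytest

# Skip the whole module unless live-LLM testing is explicitly enabled.
pytestmark = pytest.mark.skipif(
    os.environ.get("RUN_LLM_TESTS") != "1",
    reason="set RUN_LLM_TESTS=1 (and an API key) to run live prompt tests",
)
```

Then the live prompt tests only run in a scheduled job with real credentials, while everything else keeps using stubs.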
That's all true, but things already break sometimes just because some package gets bumped. I'd still like to know if my app is basically broken because the LLM has changed somehow.
You're right that fine-tuning would probably help minimize the risk, but it can probably never be zero. New tests will also be needed as customers find new edge cases that break our assumptions.
Testing LLM prompts is a new paradigm that we'll have to learn to deal with.