We do content generation and need a high level of consistency and quality.
Our prompts get very complex, and we have around 3,000 lines of code that does nothing but build prompts based on users' options (using dozens of helpers in other files).
They aren't going to get more interchangeable, because of their subjective nature.
Give five humans the same task and, even if that task is only a little bit complex, you'll get wildly different results. The more complex it gets, the more the results diverge.
It's the same with LLMs. Most of our prompt changes are more significant, but in one recent case it was as simple as changing the word "should" to "must" to get similar behavior between two different models.
One of them basically ignored things we said it "should" do and never performed the behavior we wanted, whereas the other did it often, despite these being minor version differences of the same model.
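To make the idea concrete, here's a toy sketch of what option-driven prompt building with a "should"/"must" toggle might look like. All names here are hypothetical, invented for illustration; this is not our actual code, just a minimal example of the pattern.

```python
from dataclasses import dataclass

@dataclass
class Options:
    """Hypothetical user-facing options that drive prompt construction."""
    tone: str = "neutral"
    cite_sources: bool = False
    strict: bool = False  # some models need "must" where others obey "should"

def directive(text: str, strict: bool) -> str:
    # Tiny wording changes like "should" vs "must" can flip model behavior,
    # so we keep the verb choice in one place instead of hardcoding it.
    verb = "must" if strict else "should"
    return f"You {verb} {text}"

def build_prompt(opts: Options) -> str:
    # Assemble the prompt piece by piece from the selected options.
    parts = [f"Write in a {opts.tone} tone."]
    if opts.cite_sources:
        parts.append(directive("cite a source for every claim.", opts.strict))
    return "\n".join(parts)

print(build_prompt(Options(cite_sources=True, strict=True)))
# The real builders are far larger, but this is the shape: pure functions
# that turn an options object into prompt text, tuned per model.
```

In practice the value of centralizing this is that a per-model "strict" flag can be flipped without hunting through thousands of lines of prompt-assembly code.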