The most useful feature of LLMs is how much output you get from so little signal. Just yesterday I built a fairly advanced script from my phone on the bus ride home with ChatGPT, which was an absolute pleasure. I think multi-prompt conversations don't get nearly as much attention as they should in LLM evaluations.
I suppose multi-prompt conversations are just a variation on few-shot prompting. I do agree, though, that they don't play a big enough role in evals, or in how many people think about these tools. So many capable engineers I know nope out of GPT because the first answer isn't satisfactory, instead of continuing the dialog.
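To make "continuing the dialog" concrete, here's a rough sketch of a multi-turn exchange with the OpenAI Python client. The model name and prompts are placeholders, not anything from the thread; the point is just that the follow-up request carries the whole prior exchange, so the second answer builds on the first instead of starting over.

```python
# Minimal multi-turn sketch with the OpenAI Python client.
# Model name and prompts are placeholders; adapt to whatever you actually use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {"role": "user", "content": "Write a script that renames photos by their EXIF date."}
]

first = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(first.choices[0].message.content)

# Instead of giving up on an unsatisfactory first answer, keep the context:
# append the model's reply plus a follow-up request and ask again.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Good start, but make it recursive and skip files it already renamed."})

second = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(second.choices[0].message.content)
```

That accumulated message list is also what distinguishes a genuine multi-turn conversation from few-shot prompting, where the earlier turns are fixed examples written up front rather than the model's own previous answers.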