Are you suggesting that there's a correlation between which input formats actually yield the best performance when fed to an LLM, and the sequence of tokens that same LLM outputs when asked which input formats perform best? Why would that be?
Why wouldn't that be? We've had several generations of LLMs since ChatGPT took the world by storm; current models are very much aware of the LLMs that came before them, as well as the surrounding discussion of how best to use them. That advice was itself produced by people empirically testing earlier models, and it's baked into the training data, so when you ask a model about input formats, it largely echoes a consensus that was grounded in measured performance.
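Whether the self-report and the measurement actually line up is an empirical question, and it's cheap to check. Here's a minimal sketch of one way to do it, assuming the OpenAI Python client; the model name, the single toy task, and the crude scoring are illustrative placeholders, not a real benchmark:

```python
# Sketch: compare what a model *says* about input formats with how it
# actually *performs* on the same task in each format.
# Assumes the OpenAI Python client and an API key in the environment;
# the model name and the one-item "benchmark" are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # hypothetical choice; any chat model works

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

# 1) Self-report: what does the model claim is the best input format?
claim = ask(
    "Which input format (JSON, Markdown, or plain text) do you handle "
    "best? Answer with one word."
)

# 2) Measurement: pose the same extraction task in each format and score it.
tasks = {
    "JSON": 'Extract the age from this record: {"name": "Ada", "age": 36}',
    "Markdown": (
        "Extract the age from this record:\n"
        "| name | age |\n|------|-----|\n| Ada  | 36  |"
    ),
    "plain text": "Extract the age from this record: name is Ada, age is 36",
}
scores = {fmt: int("36" in ask(prompt)) for fmt, prompt in tasks.items()}

print("model claims:", claim)
print("measured scores:", scores)  # compare the claim to per-format accuracy
```

With many tasks and repeated trials instead of one toy item, this would give you an actual correlation to argue about rather than intuitions.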