Unless I misread this paper, their argument is entirely hypothetical. Meaning the LLM is still a black box and they can only hypothesise about what is going on internally by looking at the output(s) and guessing at what it would take to get there.
There's nothing wrong with a hypothesis or that process, but it means we still don't know whether models are doing this or not.
Maybe I mixed up that paper with another, but the one I meant to post shows that you can read something like a world model out of the activations of the layers.
There was a paper showing that a model trained on Othello moves builds an internal model of the board, models the skill level of its opponent, and more.
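For context, "reading a world model from the activations" is usually done with a probing classifier: you collect the hidden states of one layer while the model plays, then train a small probe to predict the board state from those activations. Rough sketch below of what that looks like (my own illustration with made-up names and dimensions, not the paper's actual code):

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 512    # width of the layer being probed (assumed)
NUM_SQUARES = 64    # Othello board squares
NUM_STATES = 3      # empty / black / white per square

# Linear probe: maps one layer's activation vector to a state prediction per square
probe = nn.Linear(HIDDEN_DIM, NUM_SQUARES * NUM_STATES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_probe(activations, board_labels, epochs=10):
    """activations: (N, HIDDEN_DIM) hidden states collected from one layer.
    board_labels: (N, NUM_SQUARES) integer state (0/1/2) of each square."""
    for _ in range(epochs):
        logits = probe(activations).view(-1, NUM_SQUARES, NUM_STATES)
        loss = loss_fn(logits.permute(0, 2, 1), board_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return probe

# If the probe decodes the board well above chance, the layer's activations
# must carry board-state information -- that's the "world model" being read out.
```

The point being: this is evidence taken from inside the network, not just from guessing at outputs.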