
Unless I misread this paper, their argument is entirely hypothetical. Meaning that the LLM is still a black box and they can only hypothesise about what is going on internally by viewing the output(s) and guessing at what it would take to get there.

There's nothing wrong with a hypothesis or that process, but it means we still don't know whether models actually do this or not.



Maybe I mixed up that paper with another, but the one I meant to post shows that you can read something like a world model from the activations of the layers.

There was a paper showing that a model trained on Othello moves creates a model of the board, models the skill level of its opponent, and more.
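For what it's worth, the "reading a world model from the activations" part usually means training a small probe on a layer's hidden states. Here's a minimal sketch of that idea with entirely synthetic activations (no real Othello model involved; the encoding direction and noise levels are made up for illustration):

```python
import numpy as np

# Sketch of a linear "probe": given hidden activations from a model,
# train a linear readout that recovers one board-cell's state.
# The activations below are synthetic stand-ins, not real model states.
rng = np.random.default_rng(0)

n_samples, d_hidden = 1000, 64
# Ground-truth cell state: 0 = empty, 1 = occupied (binary for simplicity)
cell_state = rng.integers(0, 2, size=n_samples)

# Pretend the model linearly encodes the cell state along some direction
# in activation space, plus noise -- the situation a probe can detect.
direction = rng.normal(size=d_hidden)
acts = rng.normal(size=(n_samples, d_hidden)) + np.outer(cell_state, direction)

# Fit a least-squares linear probe: acts @ w ~ cell_state
w, *_ = np.linalg.lstsq(acts, cell_state.astype(float), rcond=None)
preds = (acts @ w) > 0.5
accuracy = (preds == cell_state).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

If a simple probe like this decodes the board far above chance from real activations (and held-out games), that's the evidence that the representation exists inside the network rather than just in the outputs.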



