The best available evidence suggests this is also true of any explanations a human gives for their own behaviour; nevertheless we generally accept those at face value.
The explanations I give of my behaviour are post-hoc (unless I was paying attention), but I also assess their plausibility by going "if this were the case, how would I behave?" and seeing how well that prediction lines up with my actual behaviour. Over time, I get good at providing explanations that I have no reason to believe are false – which also tend to be explanations that allow other people to predict my behaviour (in ways I didn't anticipate).
GPT-based predictive text systems are incapable of introspection of any kind: they cannot execute the algorithm I execute when I'm giving explanations for my behaviour, nor can they execute any algorithm that might actually make those explanations truthful, or even move them toward truthfulness.
The GPT model is describing a fictional character named ChatGPT, and telling you why ChatGPT thinks a certain thing. ChatGPT-the-character is not the GPT model. The GPT model has no conception of itself, and cannot ever possibly develop a conception of itself (except through philosophical inquiry, which the system is incapable of for different reasons).
Of course! If you’ve played Codenames and introspected on how you play, you can see this happening: you pick a few words that feel similar and then try to justify the grouping. Post-hoc rationalization in action.
Yes, and you may search for other words that fit the rationalization to decide whether or not it's a good one. You can go even further if your teammates are people you know fairly well, by bringing in your own knowledge of them and how they might interpret the clues. There's a lot of strategy in Codenames, and knowledge of vocabulary and related words is only part of it.
If an LLM states an answer and then provides a justification for that answer, the justification is entirely irrelevant to the reasoning the bot actually used. The semantics of the justification might happen to align with the implied logic of the internal vector space, but that is at best a manufactured coincidence. It’s no different from you stating an answer and then telling the bot to justify it.
If an LLM is told to do reasoning and then state the answer, it follows that the answer is basically guaranteed to be derived from the previously generated reasoning.
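To make the ordering point concrete, here is a minimal sketch (the generate() function below is a placeholder for whatever model API is in use, not a real library call): in the first ordering the answer is sampled before any justification exists, so the justification can only be conditioned on the answer; in the second, the answer is conditioned on the previously generated reasoning.

```python
# Minimal sketch; `generate` is a stand-in for whatever model API you use.
def generate(prompt: str) -> str:
    """Placeholder: swap in a real model call."""
    return f"<model output for: {prompt[:40]}...>"

# Ordering A: answer first, justification second.
# The justification is conditioned on the answer, never the other way around.
answer = generate("Which word links 'line' and 'log'? Answer with one word.")
rationale = generate(f"You answered '{answer}'. Why did you pick it?")

# Ordering B: reasoning first, answer second.
# The answer tokens are sampled conditioned on the generated reasoning.
reasoning = generate("Think step by step about what links 'line' and 'log'.")
answer = generate(f"{reasoning}\n\nGiven the reasoning above, state the answer in one word.")
```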
The answer will likely match where the reasoning steps lead, but that doesn’t mean the computations the LLM performs to get that answer are well approximated by the reasoning steps it outputs. E.g. you might have an LLM that is trained on many examples of Shakespearean text. If you ask it who the author of a given text is, it might give a detailed rationale for why it is Shakespeare, when the real answer is “I have a large prior for Shakespeare”.
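As a rough analogy only (a toy classifier with invented numbers, not how an LLM computes anything): the decision can be carried almost entirely by the prior even when the observed features slightly favour another class, which is exactly the gap between why the answer actually came out and what a plausible-sounding rationale would cite.

```python
import math

# Toy illustration: a classifier whose decision is dominated by its prior,
# even though an after-the-fact explanation would naturally cite text features.
priors = {"Shakespeare": 0.95, "Marlowe": 0.05}        # invented numbers
feature_likelihoods = {                                # P(feature | author), invented
    "Shakespeare": {"thee": 0.30, "thou": 0.25},
    "Marlowe":     {"thee": 0.35, "thou": 0.30},
}

def log_posterior(author, features):
    lp = math.log(priors[author])
    for f in features:
        lp += math.log(feature_likelihoods[author][f])
    return lp

features = ["thee", "thou"]
scores = {author: log_posterior(author, features) for author in priors}
print(max(scores, key=scores.get))
# Prints "Shakespeare": the features mildly favour Marlowe,
# but the prior carries the decision.
```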
Yes, the reason is that the model assigns words positions in an ever-changing vector space and evaluates how they relate by their proximity in that space. The reply it gives is likewise drawn from that space, with the “why” in the question weighting it toward producing something shaped like an “answer.”
Which is to say that the “why” it gives those answers is that it’s statistically likely, within its training data, that the text following the words “why did you connect line and log with paper” could be “logs are made of wood and lines are in paper.” But that is not the specific relation of the three words in the model itself, which is just a complex vector space.
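A minimal sketch of the “just a vector space” point, using made-up three-dimensional embeddings (real models learn far higher-dimensional, context-dependent ones):

```python
import numpy as np

# Toy, hand-made "embeddings". The point: inside the model, "line"/"log"/"paper"
# are related only by being near each other in a vector space, not by any
# stated reason like "logs are made of wood".
emb = {
    "line":  np.array([0.9, 0.2, 0.1]),
    "log":   np.array([0.8, 0.3, 0.2]),
    "paper": np.array([0.7, 0.4, 0.1]),
    "tiger": np.array([0.1, 0.9, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for word in ("log", "paper", "tiger"):
    print(f"similarity(line, {word}) = {cosine(emb['line'], emb[word]):.2f}")
# "log" and "paper" score high, "tiger" low; the verbal "why" the model later
# produces is a separate text continuation, not a readout of these numbers.
```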
I definitely think it's doing more than that here (at least inside of the vector-space computations). The model probably directly contains the paper-wood-log association.
No, it doesn't. It reaches the conclusion via vector similarity (to give a simplified explanation); the explanations it gives are post-hoc.