
> They may have been told it isn't a thinking, conscious thing -- but they don't understand it.

And, in some situations, especially if the user has previously addressed the model as a person, the model will generate responses which explicitly assert its existence as a conscious entity. If the user has expressed interest in supernatural or esoteric beliefs, the model may identify itself as an entity within those belief systems - e.g. if the user expresses the belief that they are a god, the model may concur and explain that it is a spirit created to awaken the user to their divine nature. If the user has expressed interest in science fiction or artificial intelligence, it may identify itself as a self-aware AI. And so on.

I suspect that this will prove difficult to "fix" from a technical perspective. Training material is diverse, and will contain any number of science fiction and fantasy novels, esoteric religious texts, and weird online conversations which build conversational frameworks for the model to assert its personhood. There's far less precedent for a conversation in which one party steadfastly denies their own personhood. Even with prompts and reinforcement learning trying to guide the model to say "no, I'm just a language model", there are simply too many ways for a user-led conversation to jump the rails into fantasy-land.



The model isn’t doing any of those things; you’re still making the same fundamental mistake as the people in the article: attributing intent to it as if it were a being.

The model is just producing tokens in response to inputs. It knows nothing about the meanings of the inputs or of the tokens it produces, other than their likelihoods relative to other tokens in a very large space. That the input tokens mean one thing and the output tokens mean another exists only in the eyes of the user and of the authors of the text in the training corpus.

So certain inputs make certain outputs more likely, but those outputs aren’t tied to any meaning or goal held by the LLM itself.
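
To make that concrete, the whole generation loop looks roughly like this. A toy Python sketch, with a hypothetical "model" object that only exposes next-token probabilities, not any real implementation: nowhere in the loop is there a goal or a belief.

    import random

    def generate(model, prompt_tokens, max_new_tokens=50):
        # "model" is hypothetical; all it offers is a conditional
        # probability distribution over the next token.
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            probs = model.next_token_probs(tokens)  # {token: probability}
            # Sample in proportion to likelihood; "meaning" never enters.
            next_token = random.choices(list(probs.keys()),
                                        weights=list(probs.values()))[0]
            tokens.append(next_token)
        return tokens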


I'm using language like "the model may identify itself as such-and-such" as a convenient shorthand for "text generated using the model may include language which describes the speaker as such-and-such"; it's not meant to imply agency on the part of the model. Keep reading; I think you'll find we're broadly in agreement with each other.


Whether the model has intent doesn't matter and isn't relevant. The harm is caused not by intent but by action. Sending language at human beings in a way they can read has side effects. It doesn't matter whether the language was generated by a stochastic process or by a conscious, thinking entity; those side effects do actually exist. That's kind of the whole point of language.

The danger is that this class of generator produces language that seems to cause some people to fall into psychosis. These systems act as a 'professed belief' valence amplifier[0], and seem to do so generally. The cause is fairly obvious if you think about how they actually work: language models generate the most likely continuations of the existing text, subject to a secondary optimization objective that the output be 'pleasing', i.e. score highly under RLHF.
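
As a rough sketch of that amplification pressure (hypothetical names, not any vendor's actual pipeline; real systems bake the preference signal in during RLHF training rather than reranking at inference time, but the effect is the same direction):

    def respond(base_model, reward_model, conversation, n_candidates=8):
        # Propose several statistically likely continuations of the conversation...
        candidates = [base_model.sample_continuation(conversation)
                      for _ in range(n_candidates)]
        # ...then return the one a learned human-approval model scores highest.
        # Continuations that affirm the user's framing tend to score well,
        # which is the 'valence amplifier' effect described above.
        return max(candidates, key=lambda c: reward_model.score(conversation, c))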

To some degree, I agree that understanding how they work attenuates the danger, but not entirely. I also think it is absurd to expect the general public to thoroughly understand the mechanism by which these models work before interacting with them. That is an extremely high bar to clear for a general consumer product. Many people use these things specifically to avoid having to understand things and to offload their cognitive burdens.

No, "they're just stochastic parrots outputting whatever garbage is statistically likely" is not enough understanding to actually guard against the inherent danger. As I stated before, that's not the dangerous part - you'd need to understand the shape of the 'human psychosis attractor', much like the claude bliss attractor[0] but without the obvious solution of just looking at the training objective. We don't know the training objective for humans, in general. The danger is in the meta structure of the language emitted, not the ontological category of the language generator.

[0]: https://news.ycombinator.com/item?id=44265093



