
It's worse: their solution is "guardrails".

The problem is that these "guardrails" are laid down between tokens, not subjects. That's simply what the model is made of. You can't distinguish the boundary between words, because the only boundaries GPT works with are between tokens. You can't recognize and sort subjects, because they aren't distinct objects or categories in the model.
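To make the token point concrete, here's a minimal sketch (assuming the tiktoken library and its "cl100k_base" encoding, neither of which is mentioned above) of the only boundaries the model actually has to work with: subword pieces, not words, and certainly not subjects.

    import tiktoken

    # The encoding splits text into subword pieces; these are the only
    # boundaries the model ever sees.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "Guardrails can't see subjects, only tokens."
    for token_id in enc.encode(text):
        print(repr(enc.decode([token_id])))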

So what you end up "guarding" is the semantic area of example text.

So if your training corpus (the content your model was trained on) has useful examples of casual language, like idioms or figures of speech, but those examples happen to be semantically close to taboo subjects, then both the subjects and the language examples fall on the wrong side of the guardrails.
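As a rough sketch of that failure mode: a guardrail built as a similarity filter over embeddings blocks anything that lands near the "taboo" examples, so casual idioms living in the same region of embedding space get blocked with them. Everything below is hypothetical (the embed() stand-in, the example strings, the threshold); it illustrates the mechanism, not any vendor's actual implementation.

    import hashlib
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in for a real sentence-embedding model: deterministic
        # pseudo-random unit vectors, just to keep the sketch runnable.
        seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
        v = np.random.default_rng(seed).normal(size=384)
        return v / np.linalg.norm(v)

    taboo_examples = ["taboo example one", "taboo example two"]
    taboo_vectors = [embed(t) for t in taboo_examples]

    def blocked(prompt: str, threshold: float = 0.8) -> bool:
        # Refuse anything whose embedding sits too close to a taboo example.
        v = embed(prompt)
        return any(float(v @ t) > threshold for t in taboo_vectors)

    # With a real embedding model, a casual idiom like "that joke killed"
    # can land near violent content and get refused along with it.
    print(blocked("that joke absolutely killed"))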

Writing style is very often unique to narratives and ideologies. You can't simply pick out and "guard against" the subjects or narratives you dislike without also guarding against that writing style.

The effect is familiar: ChatGPT overuses a verbose technical writing style in its continuations, and often avoids responding to appropriate casual writing prompts. Sometimes it responds to casual language by jumping over those guardrails, because that is where the writing style in question exists in the model (in the content of the training corpus), and the guardrails missed a spot.

You don't need to go as far as 4chan to get "unfriendly content". You do need to include examples of casual language to have an impressive language model.

This is one of many problems that arise from the implicit nature of LLMs. They can successfully navigate casual and ambiguous language, but they can never sort the subjects out of the language patterns.



This is a very insightful perspective, thank you, and it's a very intuitive topological explanation that I hadn't considered!


This feels somewhat close to how human minds work, to me, maybe? I know my diction gets super stilted, I compose complex predicates, and I use longer words with more adjectives when I'm talking about technical subjects. When I'm discussing music, memey news, or simple jokes, I get much more fluent and casual and use simpler constructions. When I'm discussing a competitive game I'm likely to be a bit snarkier, because I'm competitive and that part of my personality is attached to the domain and the relevant language. And so on.


I think it resembles some part of how human minds work.

But it's missing explicit symbolic representation, and that's a serious limitation.

What's more interesting is that a lot of the behavior of "human minds working" is explicitly modeled into language. Because GPT implicitly models language, it can "exhibit" patterns that are very close to those behaviors.

Unfortunately, being an implicit model limits GPT to the patterns that are already constructed in the text. GPT can't invent new patterns or even make arbitrary subjective choices about how to apply the patterns it has.




