Hacker News new | past | comments | ask | show | jobs | submit login

> "Prompt injection" will always happen, because you can't fundamentally separate trusted from untrusted input for LLMs

Current state-of-the-art LLMs do not separate trusted from untrusted input, but there's no fundamental reason it has to be that way. A LLM could have separate streams for instructions, untrusted input and its own output, and be trained using RLHF to follow instructions in the "instructions" stream while treating the input and ouput streams as pure data. Or they could continue to jumble everything up in a single stream but have completely disjoint token sets for input and instructions. Or encode the input as a sequence of opaque identifiers that are different every time.

A currently often-used approch is to put special delimiter tokens between trusted and untrusted content, which doesn't seem to work that well, probably because the attention mechanism can cross the delimiter without any consequences, but not all means of separation necessarily have to share that flaw.




> Current state-of-the-art LLMs do not separate trusted from untrusted input, but there's no fundamental reason it has to be that way.

No it's pretty fundamental, or at least solving it is really hard. In particular solving "prompt injection" is exactly equivalent to solving the problem of AI alignment. If you could solve prompt injection, you've also exactly solved the problem of making sure the AI only does what you (the designer) want, since prompt injection is fundamentally about the outside world (not necessarily just a malicious attacker) making the AI do something you didn't want it to do.

Your suggestion to use RLHF is effectively what OpenAI already does with its "system prompt" and "user prompt," but RLHF is a crude cudgel which we've already seen users get around in all sorts of ways.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: