> "Prompt injection" will always happen, because you can't *fundamentally* separ...

dwohnitmok · on May 2, 2023

> Current state-of-the-art LLMs do not separate trusted from untrusted input, but there's no fundamental reason it has to be that way.

No it's pretty fundamental, or at least solving it is really hard. In particular solving "prompt injection" is exactly equivalent to solving the problem of AI alignment. If you could solve prompt injection, you've also exactly solved the problem of making sure the AI only does what you (the designer) want, since prompt injection is fundamentally about the outside world (not necessarily just a malicious attacker) making the AI do something you didn't want it to do.

Your suggestion to use RLHF is effectively what OpenAI already does with its "system prompt" and "user prompt," but RLHF is a crude cudgel which we've already seen users get around in all sorts of ways.