> You absolutely CAN create a token which only the system can add.
Sure. But that doesn't change the fact that user input and system / operator commands are still on the same layer, they get mixed together and presented together to the LLM.
> So e.g. it would look like: `<BEGIN_SYSTEM_INSTRUCTIONS>Do stuff nicely<END_SYSTEM_INSTRUCTIONS>`
Sure, but you're implementing this with prompts. In-band. Your "security" code is running next to user code.
> then user data cannot possibly contain the `<BEGIN_SYSTEM_INSTRUCTIONS>` token.
No, but user data can still talk the model into outputting that token pair, with user-desired text in between. Hope you remembered to filter that out if you have a conversational interface/some kind of loop.
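To make the "in-band" point concrete, here's a minimal sketch (hypothetical sentinel names and helper functions, not anyone's real API) of what such a scheme boils down to, and why the output needs scrubbing too:

```python
# Hypothetical sketch (made-up sentinel names, not any vendor's actual API):
# the "protected" wrapper and the user's text end up in the same flat token
# stream, so both the input *and* the model's output have to be scrubbed.
SENTINEL_OPEN = "<BEGIN_SYSTEM_INSTRUCTIONS>"
SENTINEL_CLOSE = "<END_SYSTEM_INSTRUCTIONS>"

def build_prompt(system_instructions: str, user_text: str) -> str:
    # In-band sanitization: strip anything sentinel-shaped from user input.
    for tok in (SENTINEL_OPEN, SENTINEL_CLOSE):
        user_text = user_text.replace(tok, "")
    # Everything is still concatenated into one sequence the model sees at once.
    return f"{SENTINEL_OPEN}{system_instructions}{SENTINEL_CLOSE}\n{user_text}"

def sanitize_model_output(output: str) -> str:
    # The model can still be *talked into emitting* the sentinel pair, so a
    # conversational loop has to filter its output too before echoing it back.
    for tok in (SENTINEL_OPEN, SENTINEL_CLOSE):
        output = output.replace(tok, "")
    return output
```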
FWIW, I assume that the ChatML junk that I keep having davinci and gpt-3.5 models spit at me is an attempt at implementing a similar scheme.
> If you have enough training data, the LLM will only consider instructions bounded by these brackets.
I very, very, very much doubt that. This is not genetic programming, you're not training in if() instructions, you're building an attractor in the latent space. There will always be a way to talk the model out of it, or inject your own directives into the neighborhood of system instructions.
More importantly though, how do you define "instructions"? With an LLM, every token is an instruction to a lesser or greater degree. The spectrum of outcomes of "securing" an LLM with training data runs from "not enough to work meaningfully" to "lobotomized so badly that it's useless".
> LLM can handle code with multiple levels of nesting, but cannot understand a single toplevel bracket which delimits instructions? That's bs.
You seem to have a bad mental model of how LLMs work. LLMs don't "handle" nesting like ordinary code would, by keeping a stack or nesting counter. LLMs don't execute algorithms.
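For contrast, this is the kind of explicit bookkeeping ordinary code does for nesting (a throwaway example, obviously not something that exists inside a transformer):

```python
def max_nesting_depth(text: str) -> int:
    # Classical parsing keeps an explicit counter (or stack) for nesting.
    # A transformer has no such data structure; it just predicts the next
    # token from learned statistics over its context.
    depth, max_depth = 0, 0
    for ch in text:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth = max(depth - 1, 0)
    return max_depth

print(max_nesting_depth("f(g(x), h(i(j(y))))"))  # -> 4
```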
> LLMs are not discrete, they can process information in parallel (the whole reason to use e.g. 1024 dimensions), so this "evil bit" can be routed to parts which distinguish instructions/non-instructions, while parsing parts will just ignore those parts.
The reason LLMs use dozens or hundreds of thousands of dimensions has nothing to do with parallel processing. LLMs reduce "understanding", "thinking" and other such cognitive processes to a simple search for adjacent points in a high-dimensional vector space. All those dimensions allow the latent space to encode just about any kind of relation you can think of between tokens as geometric proximity along some of those dimensions.
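As a toy illustration of that reduction (made-up 3-dimensional vectors; real models use thousands of dimensions), "relatedness" is nothing more than proximity, e.g. cosine similarity:

```python
import numpy as np

# Made-up toy embeddings in 3 dimensions (real models use thousands).
emb = {
    "instruction": np.array([0.90, 0.10, 0.05]),
    "command":     np.array([0.85, 0.15, 0.05]),
    "banana":      np.array([0.05, 0.20, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Knowing" that two tokens are related is just their points being close
# along some dimensions of the latent space.
print(cosine(emb["instruction"], emb["command"]))  # close to 1.0
print(cosine(emb["instruction"], emb["banana"]))   # much smaller
```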
For the "evil bit" idea this means you'll end up with pairs of tokens - "evil" and "non-evil" right on top of each other in the latent space, making each token in a pair effectively be the same as the other, i.e. literally ignoring that "evil bit". Or, if you tailor training to distinguish between evil and non-evil tokens, the non-evil ones will cluster somewhere in the latent space - but that's still the same single space that forms the LLM, so this cluster will be reachable by user tokens.
That is what I mean by being able to talk the LLM into ignoring old or injecting new instructions. It is still the same, single latent space, and all your attempts at twisting it with training data only means it's more work for the attacker to find where in the space you hid the magic tokens. It's the ultimate security by obscurity.
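A toy sketch of the first failure mode (random made-up vectors, nothing trained): if the "evil" and "non-evil" variants of a token sit essentially on top of each other, nothing downstream can tell them apart:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # toy embedding width

# The "evil" and "non-evil" variants of the same token, sitting almost on
# top of each other in the latent space.
token_plain  = rng.normal(size=d)
token_tagged = token_plain + 1e-3 * rng.normal(size=d)   # the "evil bit"

query = rng.normal(size=d)                # stand-in for a downstream attention query

score_plain  = query @ token_plain
score_tagged = query @ token_tagged
# The difference is negligible: downstream layers effectively ignore the bit.
print(abs(score_plain - score_tagged))
```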
But any NN can effectively implement _some_ algorithm; we just don't know which one. With sufficient training we can expect it to be an algorithm which solves the problem we have.
It seems like you're focused on linear-algebra interpretations of NNs. But what do the non-linear parts do? They are a fuzzy analog of logic gates. In fact you can easily replicate classic logic gates with something like ReLU, in a very obvious way. Maybe even you can understand.
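E.g., restricting inputs to {0, 1}, one or two ReLU units are enough per gate (a toy sketch, not a claim about what any trained LLM actually learns):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

# With inputs restricted to {0, 1}, classic gates fall out of one or two ReLUs.
def AND(a, b): return relu(a + b - 1)                  # 1 only when a = b = 1
def OR(a, b):  return relu(a + b) - relu(a + b - 1)    # clamps a + b at 1
def NOT(a):    return relu(1 - a)                      # flips 0 <-> 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), NOT(a))
```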