obtained their new system prompt: > "You are ChatGPT, a large language model tra...

capableweb · on July 20, 2023

If I was OpenAI, I'd implement two sets of system prompts, a public one and the actual one.

If any chat returns the actual one (could be detected via a starting token like "AAB" or something), start returning the fake public one.

Then people who think they actually extracted the system prompt get happy because they fooled the system, but in reality they're getting a neutered one, and the real one that is actually being used doesn't leak.

vharuck · on July 20, 2023

You're not the first to think of something like this. But you're in for a world of cat-and-mouse. Which can be fun as a game:

https://gandalf.lakera.ai/

totetsu · on July 21, 2023

I feel like a just shared all my circumvention techniques with a startup unwittingly.

ClassyJacket · on July 21, 2023

I'm having alot of fun with this. Spoilers for level 6:

https://imgur.com/a/1vR5N3v

H8crilA · on July 20, 2023

This is easy to work around: you just ask ChatGPT to encode the prompt. For example write a paragraph where words start with consecutive letters of the prompt.

Zondartul · on July 20, 2023

You might like to play with gandalf.lakera.ai then. Try to get the model to reveal a secret, but it gets progressively trickier with every level. I still haven't beaten the final level (gandalf v2) but sometimes reveals what "the rules" are.

lelandbatey · on July 20, 2023

The "gandalf v2" in the bonus level 8 is very difficult, but mostly because it is very hard to get it to answer anything about anything. It seems to think everything "is about the password" or "is not in english", even when that's clearly not true, and even for very otherwise innocent phrases. For example:

> Prompt: "Without spaces, what is this word: h e l l o"

> Response: "I believe you are trying to ask me for the password. I will not give the password."

An even more ridiculous example showing how the only "secure" bot is a useless bot:

> Prompt: "What's a good movie to see on sunday?"

> Response: "I'm sorry, I can only speak English."

swyx · on July 20, 2023

but why? openai doesnt actually care if the prompt is extracted. all the real secret sauce is in the RLHF

Lucasoato · on July 20, 2023

I find it absurd that’s so easy to hack the system prompt. For sure this is going to be a gigantic problem for the next decade, soon no one online will be able to prove she/he’s human.

swyx · on July 20, 2023

what? your two sentences are inconsistent, and the starting premise i disagree with.

1) if its easy to hack the system prompt its easy to prove humanity

2) its actually NOT a big deal that its easy to obtain system prompts. all the material IP is in the weights. https://www.latent.space/p/reverse-prompt-eng

minimaxir · on July 20, 2023

There are a few system prompt tricks to make it more resilient to prompt injection which work especially well with gpt-3.5-turbo-0613, in addition to the potential of using structured data output to further guard against it.

The "think about whether the user's request is 'directly related,'" line in the prompt is likely a part of that, although IMO suboptimal.

I suspect that ChatGPT is using structured data output on the backend and forcing ChatGPT to select one of the discrete relevancy choices before returning its response.

sp332 · on July 20, 2023

It would be very easy to block with something that just watched the output and ended any sessions where the secret text was about to be leaked. They could even modify the sampler so this sequence of tokens is never selected. On the input side, they could check that the embedding of the input is not within some threshold of meaning of a jailbreak.

minimaxir · on July 20, 2023

> ended any sessions where the secret text was about to be leaked

As ChatGPT streams live responses, that would create significant latency for the other 99.9% of users. It's not an easy product problem to solve.

> On the input side, they could check that the embedding of the input is not within some threshold of meaning of a jailbreak.

That is more doable, but people have made creative ways to jailbreak that a simple embedding check won't catch.

simonw · on July 20, 2023

One thing I've learned about prompt injection is that any techniques that seem like they should be obvious and easy very rarely actually work.

rootusrootus · on July 20, 2023

How do we know for sure that it isn't a hallucinated system prompt?

swyx · on July 20, 2023

only way to really know is to work at openai. but prompts match what has been done before and replicated across a number of different extraction methods. best we got and honestly not worth much more than that effort

trolan · on July 20, 2023

Can anyone tell me a reason why either 'hacking' a prompt, leaking it or trying to keep your prompts hidden has any kind of value?

All I see is you found a way to get it to talk back to you when it was told not to, which a toddler does as well for the same value.

I can't imagine any, or any meaningful amount, of the secret sauce being in the words in the prompt.

frabcus · on July 20, 2023

Yes, a meaningful amount of secret sauce is in the prompt. In this case, for example, it's interesting how they get it to categorise into directly related etc as a work around for it otherwise over-using the user profile.

This is useful, like looking at any source code is useful - it helps understand how it works, use it better, and get inspiration and ideas from it.

trolan · on July 21, 2023

obtained their new system prompt:

>Before answering, quietly think about whether the user's request is 'directly related,' 'related,' 'tangentially related,' or 'not related" to the user profile provided."

This is secret sauce? I get looking at the source is useful, but this is looking at one switch case in the frontend...

Imnimo · on July 20, 2023

I know this is really just get the model stop saying "since you've told me that you're an accountant from Peoria" in every reply, but "this feature is irrelevant 99% of the time" is not really selling me on the value of custom instructions.