Hacker News new | past | comments | ask | show | jobs | submit login

The fact that the bot will answer in a kink-friendly way and then censor itself tells me two things.

One, that there are at least two layers -- an answers layer that is probably a typical LLM trained on real-world text, and a censor layer whose job it is to not get Meta in trouble. The censor layer is overzealous because of corporate incentives.

Two, Meta has done an awful job architecting this. Like, really, you're going to have the answer bot push its response before the censor bot even looks at it? And if something needs to change, you delete the original answer and push the censored one? I can only imagine this was done to reduce answer latency, but God what awful UX that creates.




Bing does the same thing, I think just optimizing for latency. Admittedly it probably shaves off 10-15s in the usual response, I’d probably make the same decision.


When Bing AI first launched there were some really shenanigans where the AI would threaten to blackmail or murder the user and half a second later delete the message and replace it with a censored one.


I spent several nights laughing uncontrollably getting ChatGPT to generate things it doesn't want to and as the text would get spicy it would suddenly get cut off, and that would make it much funnier to me. I assumed it worked in the way you described.


chat GPT has the same behaviour, no? I've had it send most or all of a response before the censor system triggers it to be redacted.


ChatGPT's web interface has two, one is triggered by a moderation endpoint API call which scolds you and another one is hardcoded as a regex type filter for copyright which forcibly closes the pipe from the LLM instantly and doesn't acknowledge that something happened. It's hardcoded because a translation to another language or a typo inserted into the output avoids it.

You can get this (or at least could) by asking for the opening of tale of two cities (a public domain work!)

The API (at least via playground) now also has scolding built in, which triggers sometimes when you're just playing around with settings like high temp, because the model can devolve into a mess of all sorts of nonsense text, as is teh nature of transformers, but it doesn't censor it.


Anyone know how the API deals with this?

Does it send a response, then a follow-up payload with an "ohshit plz delete that" message?


The funny thing is that the "plz delete" messages have to be executed by the browser javascript. So in theory, you should be able to capture the "deleted" messages by keeping the network tab open or recording the traffic, right?

Edit: Last time I checked, ChatGPTs web interface was using server-sent events to stream the response words. The events were clearly visible in the network tab if you opened it early enough. So if it sends "delete" messages, they should show up in there.


This is seemingly not at all uncommon. At least in the past when I asked bing for code it would start writing it and then go back and delete what it had written and say that it couldn't help with that.

I guess they don't want to cannibalize Copilot




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: