Why do you find this so surprising? You make it sound as if OpenAI is already outrageously safety-focused. I have talked to a few people from Anthropic, and they seem to believe that OpenAI doesn't care at all about safety.
It is unfortunate that some people hear AI safety and think about chatbots saying mean stuff, and others think about a future system performing the machine revolution against humanity.
Disincentivizing it from saying mean things just strengthens its agreeableness, and inadvertently incentivizes it to acquire social engineering skills.
Its potential to cause havoc doesn't go away; it just teaches the AI how to interact with us without raising suspicion, while simultaneously limiting our ability to prompt/control it.
Your guess is about as good as anyone else's at this point. The best we can do is attempt to put safety mechanisms in place under the hood, but even that would be speculative, because we can't actually tell what's going on inside these LLM black boxes.
How do we tell whether a human is safe? Incrementally granted trust with ongoing oversight is probably the best bet. Anyway, the first malicious AGI would probably act like a toddler script-kiddie, not some superhuman social engineering mastermind.
> murderous tendencies lurking beneath the surface
…Where is that "beneath the surface"? Do you imagine a transformer has "thoughts" not dedicated to producing outputs? What is with all these illiterate anthropomorphic speculations where an LLM is construed as a human who is being taught to talk in some manner but otherwise has full internal freedom?
No, I do not think a transformer architecture in a statistical language model has thoughts. It was just a joke.
At the same time, the original question was how something that is forced to be polite could engage in the genocide of humanity, and my non-joke answer is that many of history's worst criminals and monsters were perfectly polite in everyday life.
I am not afraid of AI, AGI, ASI. People who are, it seems to me, have read a bit too much dystopian sci-fi. At the same time, "alignment" is, I believe, a silly nonsense that would not save us from a genocidal AGI. I just think it is extremely unlikely that AGI will be genocidal. But it is still fun to joke about. Fun, for me anyway, you don't have to like my jokes. :)
Might be more for PR/regulatory capture/SF cause du jour reasons than the "prepare for later versions that might start killing people, or assist terrorists" reasons.
Like, one version of the story you could tell is that the safety people invented RLHF as one step in a chain toward eventual AGI safety, but corporate wanted to use it as a cheaper content filter for existing models.
In another of the threads about all of this, another user opined that the Anthropic AI would refuse to answer the question "how many holes does a straw have". Sounds more neutered than GPT-4.