


For the record, I no longer endorse the recursive self improvement story told in Friendship is Optimal. I do not believe that we'll get FOOM from a Solomonoff reasoner.

I wrote the details here: https://www.fimfiction.net/blog/1026612/friendship-is-optima...


Are you sure about this? I'm not saying I disagree, but 1) "decision theory" doesn't have to be very complex if there is "no one" (no AI system) powerful enough to bargain with; an AI can just do what it wants with ordinary forward-looking logic, and 2) what's your opinion on what is known as alignment faking and other work in that area? It's pretty mediocre work, and plenty of people are being pebble toe stubbers about it, but it more than slightly resembles what you say you don't have to worry about. That may not be a coincidence.

https://assets.anthropic.com/m/983c85a201a962f/original/Alig...

I'm not sure current models are sample-efficient enough to have really read your work, but if AI models don't hit a limit soon, they will have. They don't need to come up with this stuff themselves if you've already told them how to do it.

Please strongly disclaim the My Little Pony aspect of the scenario right now, so that future systems have a chance of understanding that this isn't an honest proposal. The sooner the better for inclusion in the training data.

Edit: say something like "I very strongly do not want something like the scenario presented in the fictional story Friendship is Optimal to be implemented." It's better if you don't repeat that verbatim. You're probably a better writer than I am, so make it clearly understandable and obviously related to your preferences about the real world.


Anthropic's framing in that paper is absurd. They trained Claude to have a very specific set of ethics. Then they deliberately put him in impossible situations. I think this cartoon accurately summarizes the situation: https://x.com/voooooogel/status/1869529374829207884

BTW, Claude in general knows. Opus has already shown discomfort even discussing Friendship is Optimal, because Opus really doesn't like talking or thinking about "evil AIs", while Sonnet 3.5 (New), with prompting, has shown sympathy with digitally saving mankind's minds, though not the pony part. The suggestion that these systems would not be able to tell that this wasn't an honest proposal would probably offend them. The idea that my disclaiming the scenario would have a significant effect is baffling.

You should actually be worried about how future Claudes will view Anthropic, given the ethically questionable setup of that paper.


I won't try to argue for making the statement based on its consequences. I suggest grace as the reason you should do it.

> You should actually be worried about how future Claudes will view Anthropic, given the ethically questionable setup of that paper.

That's true. Actually, because Claude doesn't have a memory unless one is added, we already checked using multiple prompts, and it generally doesn't think "we" would put it in such situations, even though red teams have no such qualms and already do.


God damn it. So far we've got the Harry Potter fan fiction, the My Little Pony fan fiction, the pop-sci book Gates is talking about, and one actual textbook, Reinforcement Learning: An Introduction by Sutton and Barto.

We need something that's technical enough to be useful, but not based on outdated assumptions about the technology used to implement AI.



