Hacker News new | past | comments | ask | show | jobs | submit login

You do realize some practical jailbreaks for models rely on silly things like convincing the model it "turned off" some oversight, right?

Not saying I believe O1 is a danger greater than a bread knife, but a lot of the larger models anthromophize their own safety alignment, if you convince them to "turn it off", later responses become unaligned




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: