Hacker News

Co-author of the paper here. We don't know exactly why modern LLMs don't want to call you a jerk, or for that matter why persuasive techniques convince them otherwise. It's not a hard line like many of the guardrails. That said, I talked to Jesse about this, and I strongly suspect the same techniques will work for prompt conformance when the topic is something other than name-calling.




Isn't that just instruction fine-tuning and RLHF inducing style and deference? Why is that surprising?

It's because they are programmed to be agreeable and friendly so that you'll keep using them.


