Hacker News
mistercow on June 19, 2024 | on: Refusal in language models is mediated by a single...
I wonder if you could do this with multiple alignment training passes, where you extract the refusal direction each time, and suppress it in future training passes.
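To make one "extract, then suppress" step concrete, here is a rough NumPy sketch. It assumes the refusal direction is estimated as the difference of mean activations between harmful and harmless prompts, and that suppression means projecting that direction out of the activations. The function names and the toy random data are hypothetical stand-ins, not taken from the linked work, and the retraining between passes is not modeled at all.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts, direction):
    """Remove each activation's component along the unit vector `direction`."""
    return acts - np.outer(acts @ direction, direction)


# Toy stand-in data: rows are per-prompt activations, columns are hidden units.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(50, 8)) + 2.0   # shifted so the two means differ
harmless = rng.normal(size=(50, 8))

direction = refusal_direction(harmful, harmless)
suppressed = ablate(harmful, direction)    # no component left along `direction`
```

In the multi-pass version, each alignment pass would produce a new model, so the activations would have to be re-collected and the direction re-estimated before suppressing it again. Re-running this on the same fixed activations would not work: once the direction is projected out of both sets, their means along it coincide and there is nothing left to extract.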