mistercow on June 19, 2024 \| on: Refusal in language models is mediated by a single...

I wonder if you could do this with multiple alignment training passes, where you extract the refusal direction each time and suppress it in future training passes.