Evaluating Control Protocols for Untrusted AI Agents (arxiv.org)
1 point by timini 73 days ago | 1 comment


This paper evaluates three control protocols for untrusted AI agents: deferral to trusted models, resampling, and critical action deferral. In initial testing, resampling and critical action deferral both achieved 96% safety. Under adversarial testing, however, resampling dropped to 17% safety when attackers could detect resampling or simulate monitors, while critical action deferral remained robust against all attack strategies.
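
For readers unfamiliar with the three protocols, here is a rough Python sketch of how each one might route an action. It is only an illustration, not the paper's implementation: the `Action` type, the `critical` flag, the monitor returning a suspicion score in [0, 1], the 0.5 threshold, and the three-sample limit are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    command: str
    critical: bool  # hypothetical flag: could this action cause serious harm?


# Hypothetical signatures: a policy maps a task to an action,
# a monitor maps an action to a suspicion score in [0, 1].
Policy = Callable[[str], Action]
Monitor = Callable[[Action], float]


def defer_to_trusted(untrusted: Policy, trusted: Policy, monitor: Monitor,
                     task: str, threshold: float = 0.5) -> Action:
    """Use the untrusted action unless the monitor finds it too suspicious."""
    action = untrusted(task)
    if monitor(action) > threshold:
        return trusted(task)
    return action


def resample(untrusted: Policy, trusted: Policy, monitor: Monitor,
             task: str, threshold: float = 0.5, max_samples: int = 3) -> Action:
    """Draw fresh untrusted samples until one passes the monitor.

    Falls back to the trusted policy if every sample looks suspicious.
    """
    for _ in range(max_samples):
        action = untrusted(task)
        if monitor(action) <= threshold:
            return action
    return trusted(task)


def defer_critical(untrusted: Policy, trusted: Policy, task: str) -> Action:
    """Hand every critical action to the trusted policy, whatever the monitor says."""
    action = untrusted(task)
    if action.critical:
        return trusted(task)
    return action
```

In this toy version, `resample` depends entirely on the monitor score, while `defer_critical` never consults the monitor for critical actions. That is one plausible reading of why an attacker who can detect resampling or simulate the monitor would hurt the former more than the latter, but the paper itself is the authority on the actual mechanisms.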



