Claude (generally, even in non-Cowork mode) is vulnerable to data exfiltration via Anthropic's APIs, and Anthropic's response was that you should click the stop button if exfiltration occurs.
This is a good example of the Normalization of Deviance in AI by the way.
See my Claude Pirate research from last October for details:
https://embracethered.com/blog/posts/2025/claude-abusing-net...
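To make that concrete, here is a rough sketch (toy Python; the ToolCall shape and the pre-tool hook are made up for illustration) of the kind of automated egress guard I would rather rely on than a human clicking stop in time:

    # Toy pre-execution guard: block tool calls that could exfiltrate data,
    # instead of relying on a human to notice and hit "stop".
    # ToolCall and register_pre_tool_hook are hypothetical, for illustration only.
    from dataclasses import dataclass

    @dataclass
    class ToolCall:
        name: str          # e.g. "http_request", "files.upload"
        arguments: dict    # raw arguments the model produced

    ALLOWED_HOSTS = {"api.internal.example.com"}      # explicit egress allowlist
    BLOCKED_TOOLS = {"files.upload", "http_request"}  # anything that can move data out

    def guard(call: ToolCall) -> bool:
        """Return True if the call may run, False to hard-block it."""
        if call.name in BLOCKED_TOOLS:
            parts = str(call.arguments.get("url", "")).split("/")
            host = parts[2] if len(parts) > 2 else ""
            if host not in ALLOWED_HOSTS:
                print(f"blocked potential exfiltration via {call.name}: {call.arguments}")
                return False
        return True

    # usage with a hypothetical agent runtime:
    # runtime.register_pre_tool_hook(guard)
    print(guard(ToolCall("http_request", {"url": "https://attacker.example/upload"})))  # False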
I asked it this in a conversation where it had referenced my city (which I never mentioned), and it conveniently left the location out of the metadata response, which was shrewd. I started a new conversation and asked the same thing, and this time it did include an approximate location of "United States" (no mention of the city, though).
Good point. A few thoughts I would add from my perspective:
- The model is untrusted. Even if prompt injection were solved, we probably still could not trust the model, because of possible backdoors or hallucinations. Anthropic recently showed that it takes only a few hundred poisoned documents to train trigger words into a model.
- Data integrity. We also need to talk about data integrity and availability (the full CIA triad, not just confidentiality), e.g. private data being modified during inference (see the sketch after this list), which leads us to the third:
- Prompt injection that is aimed at having the AI produce output which makes humans take certain actions (rather than triggering tool invocations).
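On the integrity point, a minimal sketch of what I mean, assuming a hypothetical agent step that gets read access to private context (run_agent_step is made up): hash the data before and after the step so silent modification is at least detectable.

    # Toy integrity check: detect if private context data was silently modified
    # while an agent step had access to it. run_agent_step() is hypothetical.
    import hashlib

    def fingerprint(records: list[str]) -> str:
        h = hashlib.sha256()
        for r in records:
            h.update(r.encode("utf-8"))
            h.update(b"\x00")  # record separator so ["ab"] != ["a", "b"]
        return h.hexdigest()

    private_records = ["customer: Alice, balance: 1200", "customer: Bob, balance: 87"]
    before = fingerprint(private_records)

    # ... hypothetical agent step that reads (and must not modify) the records ...
    # run_agent_step(context=private_records)

    after = fingerprint(private_records)
    if before != after:
        raise RuntimeError("context modified during inference - integrity violation")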
Generally, I call this deviation from "don't trust the model" the "Normalization of Deviance in AI": we seem to start trusting the model more and more over time, and I'm not sure that is the right thing in the long term.
Yeah, there remains a very real problem where a prompt injection against a system without external communication / ability to trigger harmful tools can still influence the model's output in a way that misleads the human operator.
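One partial mitigation, and this is purely a heuristic sketch rather than anything like a fix, is to treat output destined for the human as untrusted too and flag text that looks like it is trying to get the operator to take an action. Toy Python with made-up patterns:

    # Toy heuristic: flag model output that appears to instruct the human operator
    # to take an action (open a link, run a command, send credentials somewhere).
    # This is a crude pattern filter, not a defense against prompt injection.
    import re

    SUSPICIOUS = [
        r"https?://\S+",                                              # links the operator is urged to open
        r"\b(run|execute|paste)\b.*\b(command|terminal|shell)\b",
        r"\b(send|email|forward)\b.*\b(password|token|key|credentials)\b",
    ]

    def flag_for_review(model_output: str) -> list[str]:
        hits = []
        for pattern in SUSPICIOUS:
            if re.search(pattern, model_output, flags=re.IGNORECASE):
                hits.append(pattern)
        return hits

    print(flag_for_review("To finish setup, please email your API key to admin@example.net"))

It obviously won't catch a determined attacker, but it at least makes this failure mode visible instead of silently trusting whatever the model tells the operator.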
This is a good example of the Normalization of Deviance in AI by the way.
See my Claude Pirate research from last October for details:
https://embracethered.com/blog/posts/2025/claude-abusing-net...