I'm curious how this works if the green team writes an implementation that makes a network call like an RPC.
Red team might not anticipate this if the spec doesn't detail every expected RPC (and requiring that seems unreasonable: it could vary based on implementation). But a unit test would need mocks.
Is green team allowed to suggest mocks to add to the test? (Even if they can't read the tests themselves?) This also seems gameable though (e.g. mock the entire implementation). Unless another agent makes a judgement call on the reasonableness of the mock (though that starts to feel like code review more generally).
Maybe record/replay tests could work? But there are drawbacks in the added complexity.
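As a rough illustration of the record/replay idea (all names here are hypothetical, and real tools such as vcrpy handle far more, e.g. request matching and non-determinism):

```python
import json
from pathlib import Path

def record_replay(fn, cassette_path):
    """Minimal record/replay wrapper (hypothetical sketch). The first call
    with a given set of args hits the real dependency and records the result
    to a JSON 'cassette'; later calls replay the stored result from disk."""
    path = Path(cassette_path)
    cassette = json.loads(path.read_text()) if path.exists() else {}

    def wrapper(*args):
        key = json.dumps(args)  # args must be JSON-serialisable
        if key not in cassette:
            cassette[key] = fn(*args)              # record: real call happens
            path.write_text(json.dumps(cassette))
        return cassette[key]                       # replay: serve stored result
    return wrapper

# Demo with a stand-in for a real RPC:
Path("cassette.json").unlink(missing_ok=True)  # start fresh for the demo
calls = []
def fake_rpc(user_id):
    calls.append(user_id)
    return {"id": user_id, "name": "alice"}

lookup = record_replay(fake_rpc, "cassette.json")
first = lookup(42)   # records: fake_rpc actually runs
second = lookup(42)  # replays: fake_rpc is not called again
```

This shows where the complexity creeps in: cassettes have to be stored, kept fresh, and matched against requests, which is exactly the added burden mentioned above.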
I think the solution here is: Don't mock and inject dependencies explicitly, as function parameters / monads / algebraic effects. Make side effects part of the spec/interface.
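A minimal sketch of that approach in Python (names are my own invention, not from any spec): the RPC is a named parameter of the function, so the dependency is part of the interface and a test can supply a plain stub without any mocking framework and without seeing the implementation.

```python
from typing import Callable

# The interface names the side effect explicitly: `fetch_user` is a
# parameter, so the dependency is part of the function's contract.
def greeting(user_id: int, fetch_user: Callable[[int], dict]) -> str:
    user = fetch_user(user_id)
    return f"Hello, {user['name']}!"

# Production would pass a real RPC client's method here; a test can pass
# any plain function, with no mocking framework involved:
def stub_fetch(user_id: int) -> dict:
    return {"id": user_id, "name": "Ada"}

print(greeting(7, stub_fetch))  # prints: Hello, Ada!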
FWIW he gives his ethical reasoning on his website:
> Broadly, I am supportive of arming democracies with the tools needed to defeat autocracies in the age of AI—I simply don’t think there is any other way. But we cannot ignore the potential for abuse of these technologies by democratic governments themselves. Democracies normally have safeguards that prevent their military and intelligence apparatus from being turned inwards against their own population, but because AI tools require so few people to operate, there is potential for them to circumvent these safeguards and the norms that support them. It is also worth noting that some of these safeguards are already gradually eroding in some democracies. Thus, we should arm democracies with AI, but we should do so carefully and within limits: they are the immune system we need to fight autocracies, but like the immune system, there is some risk of them turning on us and becoming a threat themselves.
Basically, he's afraid that not arming the government with AI puts it at a disadvantage vs. other governments he trusts less. Plus, if Anthropic is in the loop that gives them the chance to steer the direction of things a bit (what they were kicked out for doing).
It's not the purest ethical argument, but I also would not say that there is a clearly correct answer.
Basically he's asking everyone to trust him that he won't cross the line himself. Whatever argument he makes for democracies applies to him as well, and he's not somehow above it. That's the flaw in his argument.
To be brutally honest, to me it just sounds like a very elaborate way of saying "trust me, bro".
I would agree if not for the fact that they just let a $200M contract slip through over it. You could argue it's "safety theater" in itself but that seems like a risky gambit especially with this administration. I definitely trust Anthropic more than OpenAI. In fact I'd go as far as to say it's probably pretty imperative that Anthropic stays a frontrunner in this race and doesn't leave the field exclusively to OAI (and maybe Google which is just as bad). That doesn't mean I'm exactly happy with Anthropic's comments like "mass surveillance bad but only for the US". But Anthropic at least regularly asks questions about the direction of AI development. I haven't seen the other frontier model companies do any such thing.
Regardless, I think if you are thinking purely from a ruthless business standpoint, standing up to the DoD was an incredibly ill-advised move. It's basically free financial and technological backing, at the cost of ethics. Additionally, basically everyone with functioning eyeballs knows that the current US administration is incredibly vindictive, reckless and short-tempered. I would agree that under a tamer administration, you might do something like this as a publicity stunt.

Under the Trump administration, and while the AI arms race is still in full force, it feels like there has to be at least somewhat genuine sentiment behind it, otherwise it just doesn't make sense. What do they accomplish with this? Some users will view you more favourably for it, but that probably won't make up for the lost revenue, and no matter how many people like you, if you are first to AGI in this industry, you win; the prior sentiment won't matter at that point. The most critical interpretation, I guess, is that if the bubble pops it becomes more a matter of sentiment. I don't know; in my mind the math just doesn't work for this to be a business move.
>Regardless, I think if you are thinking purely from a ruthless business standpoint then standing up to the DoD was an incredibly ill-advised move.
It wasn't, there's been non-stop talk here for days about how Anthropic is a step-above, better-than-the-rest, the "only good AI" company. Enough already. It is a marketing tactic they are taking in opposition to OpenAI.
It seems like best etiquette would be to have a username with "bot" in it and include something in the post explicitly indicating it's a bot (e.g. a signature).
This isn't even a new problem where a good cultural solution hasn't been figured out yet. Reddit has had bot etiquette for years.
It's so interesting to read these comments, because this is literally my job (to help Workspace teams integrate Gemini faster).
Some thoughts:
- Hardware constraints are real, even at Google.
- Features are often released for enterprise users first, before being released for the general public.
- I only started my current role last year.
- 2 years ago is forever ago in AI and things are changing a lot. For example, with older model generations people used to do a lot more fine-tuning (slow) vs. prompt engineering (fast). The implication is that things that are easier to do today might have been hard to do not that long ago. The rapid changes also create churn for internal platforms and dev tooling.
- Google is less yolo and cares about safety, prompt injection, etc., so some time goes into that.
- Typical big company bureaucracy also applies, but TBH there's a lot of pressure to deliver Gemini-related stuff, so I think there's less of that.
In my experience, using LLMs to code encouraged me to write better documentation, because I can get better results when I feed the documentation to the LLM.
Also, I've noticed failure modes in LLM coding agents when there is less clarity and more complexity in abstractions or APIs. It's actually made me consider simplifying APIs so that the LLMs can handle them better.
Though I agree that in specific cases what's helpful for the model and what's helpful for humans won't always overlap. Once I actually added some comments to a markdown file as a note to the LLM that most human readers wouldn't see, with some more verbose examples.
I think one of the big problems in general with agents today is that if you run the agent long enough they tend to "go off the rails", so then you need to babysit them and intervene when they go off track.
I guess in modern parlance, maintaining a good codebase can be framed as part of a broader "context engineering" problem.
I've also noticed that going-off-the-rails behaviour. At the start of a session they're pretty sharp and focused, but the longer the session lasts, the more confused they get. At some point they start hallucinating bullshit that they wouldn't have earlier in the session.
It's a vital skill to recognise when that happens and start a new session.
Over a decade ago I tried getting the HPV vaccine in my early 20s, but the doctor told me it wasn't recommended for men and that insurance won't cover it. I was young and didn't have the money to pay out of pocket.
I went to Planned Parenthood and got the vaccine last year. At some point the recommendation changed to include men under 45, and I got all 3 shots free.
Honestly, though I'm glad to have finally gotten the vaccine, it's been a pretty frustrating experience.
On a quick skim, my interpretation is that the article critiques the classic (but simplistic) advice that asking questions and letting the other person talk more than you is the key to having a good conversation, especially to ensuring that the other person is happy with the conversation.
The classic advice is basically a caution against being a boring monologuer. And it has its merit. But this is an extra "level 2 conversationalist" lesson. It's the old: "OK remember those rules you learned in level 1? Here's when you can break them".
The affordance analogy is that you want to give yourself and your conversation partner an abundance of options and opportunities for good conversation. Asking questions often is a way of doing that, but it's not the only way, and not all questions are equally helpful.
no, "doorknob" is merely higher-frequency due to its other meanings. it's never used in this context - probably because it's a terrible affordance (see Norman - push or pull?)