Hacker News | jsnell's comments

Where are you getting that from?

The article is crystal clear that these uses are not permitted by the current or any past contract, and the DoW wants to remove those exceptions.

> Two such use cases have never been included in our contracts with the Department of War, and we believe they should not be included now

It also links to DoW's official memo from January 9th, which confirms that DoW is changing their contract language going forward to remove restrictions. A pretty clear indication that the current language has some.


I think it largely hinges on what they mean by "included": does that mean it was specifically excluded by the terms of the contract, or does it mean that it's not expressly permitted? I doubt the DoD is used to defense contractors thinking they have the right to dictate policy regarding the use of their products, and it's equally possible that Anthropic isn't used to customers demanding full control over products (as evidenced by how many chatbots will arbitrarily refuse to engage with certain requests, especially erotic or politically incorrect subject matter). Sometimes both parties have valid cases in a contract disagreement.

>A pretty clear indication that the current language has some.

Or alternatively that there is some disagreement between the DoD and Anthropic as to how the contract is to be interpreted and that the DoD is removing the ambiguity in future contracts.


I hope not, and that they'll instead spin out WB, for it to be gobbled up again. Anything done three times is tradition, and breaking it just wouldn't do.

It is basically impossible for AI software improvements to devalue the AI compute investments.

It's the other way around, software improvements make the hardware more valuable. Let's say that one unit of compute can generate one unit of value. As the software improves on any of the principal axes (cheaper cost for same quality, or new capabilities that you could previously not get for any price), that same unit of compute will produce more value.

What would threaten those compute investments? Basically order of magnitude improvements in the hardware, but that kind of thing will take longer to happen than the projected lifetime of the hardware. (Or the demand for AI evaporating, but that tends to be an issue of faith that is hard to have a useful discussion on.)
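To make the argument above concrete, here's a toy sketch (all numbers invented for illustration): a fixed pool of compute produces value proportional to software quality, so software improvements raise, not lower, the value of the hardware.

```python
# Toy model: a fixed pool of deployed compute, with value per unit of
# compute set by software quality. All numbers are hypothetical.
COMPUTE_UNITS = 100  # units of already-purchased hardware

def value_produced(value_per_unit: float) -> float:
    """Total value the fixed hardware pool generates per period."""
    return COMPUTE_UNITS * value_per_unit

baseline = value_produced(1.0)  # today's software: 1 unit of value per unit of compute
improved = value_produced(1.5)  # same hardware after a 1.5x software improvement

# The hardware didn't change, but what it produces did.
assert improved > baseline
```

Under this (admittedly simplistic) model, the only way the hardware loses value is if value_per_unit falls, which software progress doesn't do.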


That assumes that all existing LLM investments, divided by all existing LLM usage, are already net valuable as a baseline. If they aren't yet, software improvements may or may not push those investments over the valuable threshold.

That's an interesting take.

It does assume that more intelligence is both possible and useful -- that's probably not unlikely.


The abstract of the article is very short, and seems to answer both of your questions pretty clearly.

This is what is special about them:

> a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now;

I.e. these are problems of some practical interest, not just performative/competitive maths.

And this is what is known about the solutions:

> the answers are known to the authors of the questions but will remain encrypted for a short time.

I.e. a solution is known, but is guaranteed to not be in the training set for any AI.


> I.e. a solution is known, but is guaranteed to not be in the training set for any AI.

Not a mathematician and obviously you guys understand this better than I do. One thing I can't understand is how they're going to judge if a solution was AI written or human written. I mean, a human could also potentially solve the problem and pass it off as AI? You might say why would a human want to do that? Normal mathematicians might not want to do that. But mathematicians hired by Anthropic or OpenAI might want to do that to pass it off as AI achievements?


Well, I think the paper answers that too. These problems are intended as a tool for honest researchers to use for exploring the capabilities of current AI models, in a reasonably fair way. They're specifically not intended as a rigorous benchmark to be treated adversarially.

Of course a math expert could solve the problems themselves and lie by saying that an AI model did it. In the same way, somebody with enough money could secretly film a movie and then claim that it was made by AI. That's outside the scope of what this paper is trying to address.

The point is not to score models based on how many of the problems they can solve. The point is to look at the models' responses and see how good they are at tackling the problem. And that's why the authors say that ideally, people solving these problems with AI would post complete chat transcripts (or the equivalent) so that readers can assess how much of the intellectual contribution actually came from AI.


> these are problems of some practical interest, not just performative/competitive maths.

FrontierMath did this a year ago. Where is the novelty here?

> a solution is known, but is guaranteed to not be in the training set for any AI.

Wrong, as the questions were posed to commercial AI models and they can solve them.

This paper violates basic benchmarking principles.


> Wrong, as the questions were posed to commercial AI models and they can solve them.

Why does this matter? As far as I can tell, since the solutions are not public, this only affects the time constant (i.e. the problems were known for longer than a week). It doesn't seem that I should care about that.


Because the companies have the data and can solve them: once the question has been provided to a company with the necessary manpower, one can no longer guarantee that the solution is not known, and not contained in the training sample.


What the OP was pointing out is two typical tells for lazy ChatGPT-generated text, right in the intro: the em-dash, and "it's not just X, it's Y".

Of course that kind of heuristic can have false positives, and not every accusation of AI-written content on HN is correct. But given how much stuff Gregg has written over the years, it's easy to spot-check a few previous posts. This clearly isn't his normal style of writing.

Once we know this blog was generated by a chatbot, why would the reader care about any of it? Was there a Mia, or did the prompt ask for a humanizing anecdote? Basically, show us the prompt rather than the slop.



I'm not sure the volume here is particularly different to past examples. I think the main difference is that there was no custom harness, tooling or fine-tuning. It's just the out of the box capabilities for a generally available model and a generic agent.


But it's not failing 50% of the time. Their status page[0] shows about 99.6% availability for both the API and Claude Code. And specifically for the vulnerability finding use case that the article was about and you're dismissing as "not worth much", why in the world would you need continuous checks to produce value?

[0] https://status.claude.com/


Did you actually look at these?

> https://github.com/jyn514/saltwater

This is just a frontend. It uses Cranelift as the backend. It's missing some fairly basic language features like bitfields and variadic functions. And if I'm reading the documentation right, it requires all the source code to be in a single file...

> https://github.com/ClementTsang/rustcc

This will compile basically no real-world code. The only supported data type is "int".

> https://github.com/maekawatoshiki/rucc

This is just a frontend. It uses LLVM as the backend.


"Couldn't stick to the ABI ... despite CPU manuals being available" is a bizarre interpretation. What the article describes is the generated code being too large. That's an optimization problem, not a "couldn't follow the documentation" problem.

And it's a bit of a nasty optimization problem, because the result is all or nothing. Implementing enough optimizations to get from 60kB to 33kB is useless, all the rewards come from getting to 32kB.
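That all-or-nothing payoff can be sketched as a step function (the 32kB limit and sizes here are just the figures quoted from the article, used illustratively):

```python
# The reward for shrinking generated code is a step function, not a
# gradient: partial progress toward the size limit earns nothing.
LIMIT = 32 * 1024  # bytes available on the target

def payoff(size_bytes: int) -> int:
    """1 if the binary fits in the limit, 0 otherwise."""
    return 1 if size_bytes <= LIMIT else 0

print(payoff(60 * 1024))  # 0: the naive output
print(payoff(33 * 1024))  # 0: a big improvement, still no reward
print(payoff(32 * 1024))  # 1: only now does the optimization pay off
```

An optimizer (human or AI) gets no feedback signal from the intermediate improvements, which is what makes this kind of target nasty.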

