
Did anyone else notice that o3-mini's SWE bench dropped from 61% in the leaked System Card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks?

Am I missing something?



I think this is with and without "tools." They explain it in the system card:

> We evaluate SWE-bench in two settings:

> *• Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.

> *• o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
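If it helps, the Agentless pass@1 rule they describe works out to roughly this (a hypothetical sketch of the averaging, not OpenAI's actual harness; the names are made up for illustration):

    # Sketch of the Agentless pass@1 rule quoted above (hypothetical).
    # Each instance gets several attempts; average pass/fail over the
    # attempts that produced a valid (non-empty) patch; an instance with
    # no valid patch on any attempt counts as 0.
    def pass_at_1(instances):
        # instances: list of attempt lists, each attempt = (valid_patch, passed)
        per_instance = []
        for attempts in instances:
            valid = [passed for valid_patch, passed in attempts if valid_patch]
            per_instance.append(sum(valid) / len(valid) if valid else 0.0)
        return sum(per_instance) / len(per_instance)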


So am I to understand that they used their internal tooling scaffold on the o3-mini (tools) results only? Because if so, I really don't like that.

While it's nonetheless impressive that they scored 61% on SWE-bench with o3-mini combined with their tool scaffolding, the Agentless comparison with other models is less impressive: 40% vs 35% against o1-mini, per the graph on page 28 of their system card PDF (https://cdn.openai.com/o3-mini-system-card.pdf).

It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still show a performance improvement, but it would look less exciting and more incremental.

Of course the real improvement is cost, but still, it kind of rubs me the wrong way.


YC usually says “a startup is the point in your life where tricks stop working”.

Sam Altman is somehow finding this out now, the hard way.

Most paying customers will find out within minutes whether the models can serve their use case; a benchmark isn't going to change that, except as media manipulation (and even that doesn't work all that well, since journalists don't really know what they are saying and readers can tell).


My guess is this cheap mini-model is coming out now because DeepSeek very recently shook the stock market with its low price and relatively good performance.


o3 mini has been coming for a while, and iirc was "a couple of weeks" away a few weeks ago before R1 hit the news.


Makes sense. Thanks for the correction.


The caption on the graph explains.

> including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card.

I have no idea what an "internal tools scaffold" is, but the graph on the card they link to directly specifies "o3-mini (tools)", whereas the blog post is talking about the others.


I'm guessing an "internal tools scaffold" is something like Goose: https://github.com/block/goose

Instead of just generating a patch (copilot style), it generates the patch, applies the patch, runs the code, and then iterates based on the execution output.
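Something like this loop, roughly (a hypothetical sketch; model.generate_patch and the git/pytest calls stand in for whatever the real scaffold actually uses):

    # Hypothetical edit-run-iterate loop; model.generate_patch is a stand-in
    # for the LLM call, and git apply / pytest stand in for the real tooling.
    import subprocess

    def agentic_loop(model, repo_dir, task, max_iters=10):
        feedback = ""
        for _ in range(max_iters):
            patch = model.generate_patch(task, feedback)      # propose an edit
            subprocess.run(["git", "apply"], input=patch,
                           text=True, cwd=repo_dir)           # apply it
            result = subprocess.run(["pytest", "-x"], cwd=repo_dir,
                                    capture_output=True, text=True)
            if result.returncode == 0:
                return patch                                  # tests pass, done
            feedback = result.stdout + result.stderr          # iterate on errors
        return None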


Maybe they found a need to quantize it further for release, or lobotomise it with more "alignment".


> lobotomise

Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.

Why do people try to meme as if AI is different? It has unexpected outputs sometimes, getting it to not do that is 50% "more alignment" and 50% "hallucinate less".

Just today I saw someone get the Amazon bot to roleplay furry erotica. Funny, sure, but it's still obviously a bug that a *sales bot* would do that.

And given these models do actually get stuff wrong, is it really incorrect for them to refuse to help with things that might be dangerous if the user isn't already skilled, like Claude in this story about DIY fusion? https://www.corememory.com/p/a-young-man-used-ai-to-build-a-...


If somebody wants their Amazon bot to role play as an erotic furry, that’s up to them, right? Who cares. It is working as intended if it keeps them going back to the site and buying things I guess.

I don’t know why somebody would want that, seems annoying. But I also don’t expect people to explain why they do this kind of stuff.


It's still a bug. It's not really working as intended; it doesn't sell anything by doing that.

A very funny bug, but a bug nonetheless.

And given this was shared via screenshots, it was done for a laugh.


Who determines who gets access to what information? The OpenAI board? Sam? What qualifies as dangerous information? Maybe it’s dangerous to allow the model to answer questions about a person. What happens when limiting information becomes a service you can sell? For the right price anything can become too dangerous for the average person to know about.


> What qualifies as dangerous information?

The reports are public, and if you don't feel like reading them because they're too long and thorough in their explanations of what and why, you can always put them into an AI and ask it to summarise them for you.

OpenAI is allowed to unilaterally limit the capability of their own models, just like any other software company can unilaterally limit the performance of their own software.

And they still are even when they're just blatantly wrong or just lazy; it's not like people complain about Google "lobotomising" their web browsers for no longer supporting Flash or Java applets.


They are implying the release was rushed and they had to reduce the functionality of the model in order to make sure it did not teach people how to make dirty bombs.


The problem is that they don't make the LLM better at instruction following; they just make it unable to produce furry erotica even if Amazon wants it to.


> Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.

Isn’t that exactly what VCs want?


I doubt it.

The advice I've always been given in (admittedly: small) business startup sessions was "focus on quality rather than price because someone will always undercut you on price".

The models are in a constant race on both price and quality, but right now they're so cheap that paying for the best makes sense for any "creative" task (like writing software, even if only to reduce the number of bugs the human code reviewer needs to fix), while price sensitivity only matters for grunt-work classification tasks (such as "based on comments, what is the public response to this policy?").


Or the number was never real to begin with.



