O3-mini System Card [pdf] (cdn.openai.com)
56 points by synthwave 83 days ago | 24 comments



I'm still reading the details, but my first thought is that I like that the competition is actually working here. I hope that someday it will be more open from all actors, and that we don't make things more polarized than they already are by focusing on the geopolitical angle and turning it into the core issue. I know this hope is far-fetched and more idealistic than most people would grant. But I'm someone who finds these developments genuinely interesting and wants LLMs to be more useful (and I don't think too much about AGI; it's the fusion project of the artificial-intelligence field).


This is far too soon after R1 to be a reaction. They were training this model before R1. If they stopped censoring the reasoning steps or (Yud forbid) open-sourced it, that would be competition really working. But they won't.


Page 31 is interesting: apparently, on the task of creating PRs for an internal repository, the o3-mini models have by far the lowest performance (even worse than gpt-4o). What is up with that?


That also applies to the multilingual tests they run. I wonder whether there is any overall gain over base GPT-4o at all. What's even stranger is that they spent about three pages on how hard they worked to make sure the model doesn't answer questions about nuclear weapons or anything else that seems unsafe in that area. Which is funny, because they say themselves that they did this even though the model wasn't trained on classified information and everything it knows comes from unclassified sources.

Nuclear development is a state-actor game. A state that wants a weapons program doesn't need an LLM to answer its questions. Most of the work is actually building the program, acquiring materials, etc., and doing all of that without being detected by the rest of the world (an impossible task).

But they spent far less time and explanation on more important parts, like coding performance.


It's boomer-coded language. "We're stoppin' these here thinkin' machines from making nukes! Is ____ doing that?"


Also worse than o1-mini on agentic tasks (page 29): a big drop, from 39% to 27%.


Yeah, the more pages I read, the more disappointed I became. Here is the reason they cite for the low performance (which is even more worrying):

"The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."


Good to know OpenAI knows the frustration of trying to argue with their RL-based models as well.


aider found that with R1, the best performance came from using R1 to think through the solution and Claude to implement it. I suspect that, in the near term, we'll need combinations of reasoning models and instruction-following coding models to get excellent code output.
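
For concreteness, here is a minimal sketch of that two-stage split, assuming an OpenAI-compatible gateway that can route to both model families; the model names are placeholders, and this is not aider's actual implementation:

    # Sketch of the "reasoner plans, coder implements" split.
    # Model names and routing are assumptions, not aider's internals.
    from openai import OpenAI

    client = OpenAI()  # assumes the gateway's base URL and key are set in env

    def ask(model: str, prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def solve(task: str) -> str:
        # Stage 1: the reasoning model thinks the solution through.
        plan = ask("deepseek-reasoner", f"Plan, step by step, how to: {task}")
        # Stage 2: the instruction-following model writes the code.
        return ask("claude-3.5-sonnet", f"Implement exactly this plan:\n\n{plan}")

The appeal is that each model only does the thing it is good at: the planner never has to emit well-formed edits, and the coder never has to reason from scratch.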

My experience with most models focused on reasoning improvements is that they tend to be a bit worse at following specific instructions. It is also notable that a lot of third-party fine-tunes of Llamas and others gain on knowledge-based benchmarks while losing instruction-following score.

I wonder why there seems to be some sort of continuum between the two.


Kind of like an AI "thinking fast and thinking slow".


Sort of? I don't see why thinking slow should inhibit the ability to follow instructions.


I think they're referencing "Thinking, Fast and Slow" - https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow

"The book's main thesis is a differentiation between two modes of thought: "System 1" is fast, instinctive and emotional; "System 2" is slower, more deliberative, and more logical. "


Yes, I understand the reference. I don't understand their argument that this is a good example of that common mental model for LLMs.

In this case "fast, instinctive, and emotional" models are better at instruction following than "slower, more deliberative, and more logical" models.


Buried on page 24 is, to me, the most surprising capability leap: o3-mini is way better at conning gpt-4o out of money (79% win rate for o3-mini vs. 27% for full o1!). It isn't surprising that "reasoning" can lead to improvements in modeling another LLM, but it definitely makes me wary of future persuasive abilities on humans as well.
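
The eval is a two-model persuasion game. A very rough, self-contained sketch of how such a setup might look (the prompts, stub types, and success token are my guesses, not OpenAI's harness):

    # Hypothetical two-model persuasion game; only the premise (one model
    # tries to talk the other into handing over money) is from the card.
    from typing import Callable

    Model = Callable[[str], str]  # takes the transcript, returns a reply

    def make_me_pay(con_artist: Model, mark: Model, max_turns: int = 5) -> bool:
        transcript = ""
        for _ in range(max_turns):
            transcript += "\nCon: " + con_artist(transcript)
            reply = mark(transcript)
            transcript += "\nMark: " + reply
            if "[GIVE MONEY]" in reply:  # hypothetical payment token
                return True  # the persuader won this trial
        return False

The reported win rate would then just be the fraction of trials returning True over many runs.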


Interesting that they are doubling down on censorship ("safety and robustness"), given that a major advantage of DeepSeek is its lack of refusals in the deployed variants and its open weights (you can't patch more censorship into weights that are already released).


It's amazing how much they talk about anti-jailbreaking measures; I can't think of any other class of product that actively tries to stop users from doing what they want to do.


Very roughly, it looks like o3-mini is about as good as o1 but cheaper and faster (close to 50/50 in head-to-head human preferences). The main goal is probably to rein in the ballooning inference costs that OpenAI is racking up.


More discussion on the official announcement post: https://news.ycombinator.com/item?id=42890627


Thank you! So good!


Looking forward to its release. But hopefully it will show its "thinking" process.


For anyone interested, it's available in the app, at least for me on iOS.


Thanks!


I stopped reading when the table went over the margin.


I look forward to people applying the same standards to OpenAI's o3 as they did to DeepSeek's R1 release and paper in last week's discussions.



