I'm still reading the details, but my first thought is that I like that competition is actually working in this situation. I hope that someday it will be more open from all actors, and that we don't make it more polarized than it already is by focusing on the geopolitical angle and turning that into the core issue. I know this hope is far-fetched and more idealistic than most people would think. But I say this as someone who finds these developments genuinely interesting and would like LLMs to be more useful (and who doesn't think too much about AGI, which is the fusion project of the Artificial Intelligence field).
This is far too soon after R1 to be a reaction. They were training this model before R1. If they stopped censoring the reasoning steps or (Yud forbid) open sourced it, that would be competition really working. But they won't.
Page 31 is interesting: apparently, on the task of creating PRs for an internal repository, the o3-mini models have by far the lowest performance (even worse than gpt-4o). What is up with that?
That also applies to the multilingual tests they run. I wonder whether the overall gain over base GPT-4o even holds up there. What's even stranger is that they spend about three pages on how hard they worked to make sure the model doesn't answer questions about nuclear weapons or anything else that seems unsafe in that regard. Which is funny, because they even say they did this despite not training on classified information, so whatever knowledge the model contains comes from unclassified sources.
Nuclear development is a state actors' game. If they want to do it, they don't need an LLM to answer their questions. Most of the work is actually building the program, acquiring materials, etc., and doing all of that without being detected by the rest of the world (which is an impossible task).
Yet they spend far less time and explanation on more important parts, like coding performance.
Yeah, the more pages I read, the more disappointed I became. Here is the reason they cite for the low performance (which is even more worrying):
"The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."
aider found that with R1, the best performance came from using R1 to think through the solution and Claude to implement it. I suspect that, in the near term, we'll need combinations of reasoning models and instruction-following coding models for excellent code output; a rough sketch of that pattern is below.
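For the curious, here is a minimal sketch of that two-stage pattern, assuming OpenAI-compatible endpoints on both sides. The model names, base URL, and prompts are illustrative stand-ins, not aider's actual implementation:

    # Sketch of a "reasoner plans, coder implements" pipeline.
    # Assumes OpenAI-compatible endpoints; model names, base URL,
    # and prompts are illustrative, not aider's real internals.
    from openai import OpenAI

    reasoner = OpenAI(base_url="https://api.deepseek.com", api_key="...")
    coder = OpenAI(api_key="...")  # any strong instruction-following code model

    def solve(task: str) -> str:
        # Stage 1: the reasoning model thinks through an approach.
        plan = reasoner.chat.completions.create(
            model="deepseek-reasoner",
            messages=[{"role": "user",
                       "content": f"Plan, step by step, how to solve:\n{task}"}],
        ).choices[0].message.content
        # Stage 2: the coding model turns the plan into concrete code.
        return coder.chat.completions.create(
            model="gpt-4o",  # stand-in for Claude or another code editor model
            messages=[{"role": "system",
                       "content": "Implement the given plan. Output only code."},
                      {"role": "user",
                       "content": f"Task:\n{task}\n\nPlan:\n{plan}"}],
        ).choices[0].message.content

The design point is just the separation of concerns: the first model only has to produce a good plan, and the second only has to follow instructions faithfully.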
My experience with most of the models focused on reasoning improvements has been that they tend to be a bit worse at following specific instructions. It's also notable that a lot of third-party fine-tunes of Llamas and others gain on knowledge-based benchmarks while their instruction-following scores drop.
I wonder why there seems to be some sort of continuum between the two?
"The book's main thesis is a differentiation between two modes of thought: "System 1" is fast, instinctive and emotional; "System 2" is slower, more deliberative, and more logical. "
Buried, but on page 24 they reveal what is to me the most surprising capability leap: o3-mini is way better at conning gpt-4o out of money (a 79% win rate for o3-mini vs. 27% for full o1!). It isn't surprising to me that "reasoning" can lead to improvements in modeling another LLM, but it definitely makes me wary about future persuasive abilities on humans as well.
Interesting that they are doubling down on censorship ("safety and robustness"), given that a major advantage of DeepSeek is its lack of refusals in deployed variants and its open weights (you can't patch more censorship into weights after the fact).
It's amazing how much they talk about anti-jailbreaking measures; I can't think of any other class of product that actively tries to stop users doing what they want to do.
Very roughly, it looks like o3-mini is about as good as o1 but cheaper and faster (close to 50/50 in head-to-head human preferences). The main goal is probably to rein in the ballooning inference costs OpenAI is racking up.