I'm still reading the details, but my first thought is that I like that competition is actually working in this situation. I hope that someday it will be more open from all actors, and that we don't make it more polarized than it already is by focusing on the geopolitical angle and turning that into the core issue. I know this hope is far-fetched and more idealistic than most people would think. But I say this as someone who finds these developments genuinely interesting and would like LLMs to be more useful (and who doesn't think too much about AGI, which is the fusion project of the Artificial Intelligence field).
This is far too soon after R1 to be a reaction. They were training this model before R1. If they stopped censoring the reasoning steps or (Yud forbid) open sourced it, that would be competition really working. But they won't.
Page 31 is interesting: apparently, on the task of creating PRs for an internal repository, the o3-mini models have by far the lowest performance (even worse than gpt-4o). What is up with that?
That also applies to the multilingual tests they run. I wonder whether the overall gain over base GPT-4o even holds up there. What's even stranger is that they spend about three pages on how hard they worked to make sure the model doesn't answer questions about nuclear weapons or anything else that seems unsafe in that regard. Which is funny, because they even say they did this despite not training on classified information, so whatever knowledge the model contains comes from unclassified sources.
Nuclear development is a state actors' game. If they want to do it, they don't need an LLM to answer their questions. Most of the work is actually building the program, acquiring materials, etc., and doing all of that without being detected by the rest of the world (which is an impossible task).
Yet they spend far less time and explanation on more important parts, like coding performance.
Yeah, the more pages I read, the more disappointed I became. Here is the reason they cite for the low performance (which is even more worrying):
"The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."
aider found that with R1, the best performance came from using R1 to think through the solution and Claude to implement it. I suspect that, in the near term, we'll need combinations of reasoning models and instruction-following coding models for excellent code output; a rough sketch of that pattern is below.
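For the curious, here is a minimal sketch of that two-stage pattern, assuming OpenAI-compatible endpoints on both sides. The model names, base URL, and prompts are illustrative stand-ins, not aider's actual implementation:

    # Sketch of a "reasoner plans, coder implements" pipeline.
    # Assumes OpenAI-compatible endpoints; model names, base URL,
    # and prompts are illustrative, not aider's real internals.
    from openai import OpenAI

    reasoner = OpenAI(base_url="https://api.deepseek.com", api_key="...")
    coder = OpenAI(api_key="...")  # any strong instruction-following code model

    def solve(task: str) -> str:
        # Stage 1: the reasoning model thinks through an approach.
        plan = reasoner.chat.completions.create(
            model="deepseek-reasoner",
            messages=[{"role": "user",
                       "content": f"Plan, step by step, how to solve:\n{task}"}],
        ).choices[0].message.content
        # Stage 2: the coding model turns the plan into concrete code.
        return coder.chat.completions.create(
            model="gpt-4o",  # stand-in for Claude or another code editor model
            messages=[{"role": "system",
                       "content": "Implement the given plan. Output only code."},
                      {"role": "user",
                       "content": f"Task:\n{task}\n\nPlan:\n{plan}"}],
        ).choices[0].message.content

The design point is just the separation of concerns: the first model only has to produce a good plan, and the second only has to follow instructions faithfully.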
My experience with most of the models focused on reasoning improvements has been that they tend to be a bit worse at following specific instructions. It's also notable that a lot of third-party fine-tunes of Llamas and others gain on knowledge-based benchmarks while their instruction-following scores drop.
I wonder why there seems to be some sort of continuum between the two?
"The book's main thesis is a differentiation between two modes of thought: "System 1" is fast, instinctive and emotional; "System 2" is slower, more deliberative, and more logical. "
Buried, but on page 24 they reveal what is to me the most surprising capability leap: o3-mini is way better at conning gpt-4o out of money (a 79% win rate for o3-mini vs. 27% for full o1!). It isn't surprising to me that "reasoning" can lead to improvements in modeling another LLM, but it definitely makes me wary about future persuasive abilities on humans as well.
Interesting that they are doubling down on censorship ("safety and robustness"), given that a major advantage of DeepSeek is its lack of refusals in deployed variants and its open weights (you can't patch more censorship into weights after the fact).
It's amazing how much they talk about anti-jailbreaking measures; I can't think of any other class of product that actively tries to stop users doing what they want to do.
Very roughly, it looks like o3-mini is about as good as o1 but cheaper and faster (close to 50/50 in head-to-head human preferences). The main goal is probably to rein in the ballooning inference costs OpenAI is racking up.