The benchmark numbers don't really mean anything. Google says Gemini 2.5 Pro has an AIME score of 86.7, which beats o3-mini's score of 86.5, but OpenAI's announcement post [1] said that o3-mini-high scores 87.3, which Gemini 2.5 would lose to. The chart says "All numbers are sourced from providers' self-reported numbers," yet the only mention of o3-mini scoring 86.5 I could find was from a third-party source [2].
[1] https://openai.com/index/openai-o3-mini/

[2] https://www.vals.ai/benchmarks/aime-2025-03-24
You just have to use the models yourself and see. In my experience, o3-mini is much worse than o1.