Qwen2-Math (qwenlm.github.io)
128 points by limoce 9 months ago | 38 comments



The sample solution for Balkan MO 2023 seems... questionable?

The problem involves players removing stones in turns and asks which player wins with perfect play; the listed answer definitely doesn't cover all the possible types of strategies.

The answer it gives may well be right; in fact I bet it is correct (the second player), but is the Qwen team presenting the solution, reasoning included, as correct? And is that reasoning actually correct?


Using natural-language models for math problems is a bad idea for exactly this reason: judging the correctness of proofs written in English is subjective, whereas proofs in Lean are objective and easy to machine-check. DeepMind's AlphaProof was so remarkable precisely because it spoke Lean, so its proofs were verifiable. Why waste time on anything else?
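(For anyone who hasn't seen Lean, a machine-checkable statement can be as small as this; a toy illustration, not output from either model:)

    -- Lean 4: the kernel either accepts this proof term or rejects it;
    -- no human judgement of the prose is involved.
    theorem my_add_comm (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b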


The real answer in this case is that AlphaProof's training pipeline is massively more complex than the Qwen team's. Sadly. Agreed that integrating Lean makes a ton of sense; it's just probably an order of magnitude easier to do it inside DeepMind than elsewhere right now.


I can think of two reasons:

1) It's much easier to state the problem in natural language, and basically all the mathematical knowledge we have is not in the form of Lean proofs

2) It can be applied to a much broader range of domains. Math is kinda unique in that you can verify something is 100% correct; in most other cases that's impossible (as is getting such a signal for RL)


1) You could train a different language model to translate between Lean and English. Use AlphaProof to do the hard work of theorem proving; use the translation model for interpretability and as a user interface. In fact, such a system could be ideal for formalizing a hundred years of English-language proofs.

2) This model is already specialized for math; applying it to other domains is out of scope. And as you point out, speaking Lean (and thus being verifiable) gives you an RL reward signal that's way more precise and readily available than piddly RLHF from human reviewers. Gyms like this are where AGI will happen.

(P.S. if anyone wants to work with me on reproducing AlphaProof, hit me up on Discord.)


Hi! I've been working on theorem proving systems for some time now. I would love to help out with an AlphaProof reproduction, but I can't reach you on Discord for some reason!


Ack! Try again: I forgot to update my account name since Discord got rid of # tags. I also put my email up as a fallback.


1) Then more math should get formalised in Lean.

2) How is a solution by LLMs supposed to be verified without such a formalisation?


I was going to write the same thing. I checked the first three problems and all the solutions are partial at best. Now, don't get me wrong, this is still impressive. But presenting the problems with the implication that Qwen solves them correctly, when it doesn't, does not really inspire trust.


Same with the Lusophon Mathematical Olympiad 2023 example.

It makes a bunch of specious claims about parity. (Adding 4 to a number changes its parity? It doesn't: k + 4 ≡ k (mod 2). Therefore the parity of each colour changes on every turn? Therefore the parity will always be the same as it was initially?) And then it concludes that since the parity is right, 2022 of each colour must be a reachable state.

Which, as you say, is quite possibly the correct answer, but it's really weird to put it out there as an example with no comment on the reasoning.


This matches my experience doing RLHF for math for LLMs. So many times the model gives a partial answer: it may find the minimal case, for example, but it doesn't say why that case is minimal, even when the prompt specifically asks for the why.


It's complete nonsense. "The total number of moves in the game is equal to the number of stones initially in the pile, which is 5000."

Similarly, on the Martian question, "If we transform 1 red and 1 green Martian, we get 4 blue Martians. This changes the parity of red and green Martians from even to odd, and the parity of blue Martians from even to odd." is complete nonsense too.

"the sum of the digits of a number k modulo 2 is equivalent to k mod 2" - 12?

Basically all of the "solutions" are regurgitated techniques from math competitions, applied completely incorrectly but with a lot of confidence.



Nobody runs these things at full precision; a 4-bit quant of a 70B-range model fits into 64 GB, though you would need more for long context.


A simple equation to approximate it is `memory_in_gb = parameters_in_billions * (bits/8)`

So at 32 bit full precision, 70 * (32 / 8) ~= 280GB

fp16, 70 * (16 / 8) ~= 140GB

8 bit, 70 * (8 / 8) ~= 70GB

4 bit, 70 * (4 / 8) ~= 35GB

However, in things like llama.cpp the quants are sometimes mixed, so some of the weights are Q5, some Q4, etc.; in that case you usually want to take the higher number.
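The rule of thumb above in a few lines of Python, weights only (no KV cache or activation overhead; the outputs match the numbers listed):

    # Rough memory needed for the model weights alone (no KV cache / activations).
    def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * bits_per_weight / 8

    for bits in (32, 16, 8, 4):
        print(f"70B at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
    # prints ~280, ~140, ~70, ~35 GB respectively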


Well, that, and you also need a fair bit more space for the KV cache, which can be a bit unpredictable. Models without GQA, flash attention, or 4-bit cache support are really terrible in that regard, plus it depends on context length. I haven't found a good rule of thumb for that yet.
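The closest thing I know of is the textbook formula for the cache itself (a sketch; the 80-layer, 128-head-dim, 8-KV-head figures below are typical 70B-class values I'm assuming, not anything Qwen-specific):

    # KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem,
    # per token of context, per sequence in the batch.
    def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    print(kv_cache_gb(80, 8, 128, 32768))   # with GQA (8 KV heads):  ~10.7 GB at fp16
    print(kv_cache_gb(80, 64, 128, 32768))  # without GQA (64 heads): ~86 GB at fp16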


How much Elo on Chatbot Arena does a 4-bit quant lose vs full fp16/bf16?


I think this is the only known benchmark that actually compares different quants: https://oobabooga.github.io/benchmark.html

It's not really all that consistent, but larger models can be compressed more without as much loss.


The first solution (IMO 2002) is completely wrong. It shows that 1, 2, or 3 cubes are not sufficient, and provides an obstruction that doesn't rule out 4 cubes, but it does not prove that there actually are 4 cubes summing to the given number. That part is much harder (and I don't know the true answer).


I think it’s about the same level of difficulty as showing you need at least four, but yeah I agree that what is written is insufficient. One solution is that you can write

2002 = 10^3 + 10^3 + 1^3 + 1^3

Then, multiply through by 2002^2001, which is itself a cube since 2001 is divisible by 3.
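If anyone wants to sanity-check that construction, Python's big ints make it a three-liner (the name c is mine):

    # 2001 = 3 * 667, so 2002^2001 = (2002^667)^3 is itself a cube; write c = 2002^667.
    c = 2002 ** 667
    # 2002 = 10^3 + 10^3 + 1^3 + 1^3, so multiplying through by c^3 gives four cubes.
    assert (10 * c) ** 3 + (10 * c) ** 3 + c ** 3 + c ** 3 == 2002 ** 2002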


Actually, I concede. I just didn't spot that.


These solutions aren't perfect, but imagine how many more people can become mathematicians now that the price of an elite, IMO-medal-winning tutor can be quantified in Hugging Face hosting costs!


I don't think this is a useful tool for becoming a mathematician. Becoming one does not necessarily involve solving these kinds of puzzle exercises.

It's useful for honing your creative thinking and learning how to approach a mathematical problem, but it won't make you a mathematician.


A mathematics tutor that doesn't know how to write proofs, though. Or, to be most charitable, not how human mathematicians write proofs. I'm not talking about using Lean or similar, but about the common parlance and rigor used outside of theorem provers.


I see that they do some decontamination of the datasets, in the hope that the models won't just recite answers from the training data. But in the recent interview with Subbarao Kambhampati on MLST (https://www.youtube.com/watch?v=y1WnHpedi2A) they explain that models fail as soon as one slightly rephrases the test problems (indicating that they are indeed mostly reciting). I expect this to be the case with this model too.


There's a lot of bunko 3rd-time-rephrased "wisdom" about AI, and enough interest in it that if you have the right title, you can get away with repeating it.

I'm a bit surprised that an HN reader, who presumably has first-hand experience with them, isn't sure whether it's just a hashmap lookup.


This is just not true. There are plenty of private tests with problems that are not in the training set. GPQA is an excellent example.


It is obvious that all of these problems are still way too hard for it, although it sometimes has ideas. It flawlessly demonstrates how to simplify 2002^2002 mod 9. I recall that there was once a scandalous university exam for future math teachers in Germany that asked for tasks like that, and everyone failed the test. With Qwen2 at hand this might not have happened.
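(For reference, that simplification is the standard residue argument; this is my restatement, not the model's wording:)

    2002 ≡ 4 (mod 9) and 4^3 = 64 ≡ 1 (mod 9);
    since 2002 = 3*667 + 1, 2002^2002 ≡ 4^2002 ≡ 4 (mod 9).
    Quick check in Python: pow(2002, 2002, 9) == 4.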


Kind of surprised this was released in English first, given it was produced by a Chinese group (Alibaba Cloud). I wonder why that is.


Maybe the training material is also mostly English.


I mean, I think 90% of AI researchers in China use English as the first choice for publication ... so not surprising?


The solution for IMO 2022 is barely a 1/7 solution. It just says '... might not satisfy the inequality for all y' without a proof. That was the whole point of the question.


> This model mainly supports English. We will release bilingual (English and Chinese) math models soon

The irony.


Qwen2 has been quite good, but still can't compare 9.9 and 9.11


Where could I try this?



It appears that no inference provider currently supports the 72B version.




