Qwen2-Math (qwenlm.github.io)
128 points by limoce 9 months ago | 38 comments



The sample solution for Balkan MO 2023 seems... questionable?

The problem involves players removing stones in turns and asks which player wins with perfect play; the listed answer definitely doesn't cover all the possible types of strategies.

The answer it gives may well be right; in fact I bet it is correct (the second player), but is the Qwen team presenting the solution, reasoning included, as correct? And is that reasoning actually correct?


Using natural-language models for math problems is a bad idea for exactly this reason: judging the correctness of proofs written in English is subjective, whereas proofs in Lean are objective and easy to machine-check. DeepMind's AlphaProof was so remarkable precisely because it spoke Lean, so its proofs were verifiable. Why waste time on anything else?
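(For anyone who hasn't seen Lean, a machine-checkable statement can be as small as this; a toy illustration, not output from either model:)

    -- Lean 4: the kernel either accepts this proof term or rejects it;
    -- no human judgement of the prose is involved.
    theorem my_add_comm (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b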


The real answer in this case is that AlphaProof's training pipeline is massively more complex than the Qwen team's. Sadly. Agreed that integrating Lean makes a ton of sense; it's just probably an order of magnitude easier to do it inside DeepMind than elsewhere right now.


I can think of two reasons:

1) It's much easier to state the problem in natural language, and basically all the mathematical knowledge we have is not in the form of Lean proofs

2) It can be applied to a much broader range of domains. Math is kinda unique in that you can verify something is 100% correct; in most other cases that's impossible (as is getting such a signal for RL)


1) You could train a different language model to translate between Lean and English. Use AlphaProof to do the hard work of theorem proving; use the translation model for interpretability and as a user interface. In fact, such a system could be ideal for formalizing a hundred years of English-language proofs.

2) This model is already specialized for math; applying it to other domains is out of scope. And as you point out, speaking Lean (and thus being verifiable) gives you an RL reward signal that's way more precise and readily available than piddly RLHF from human reviewers. Gyms like this are where AGI will happen.

(P.S. if anyone wants to work with me on reproducing AlphaProof, hit me up on Discord.)


Hi! I've been working on theorem proving systems for some time now. I would love to help out with an AlphaProof reproduction, but I can't reach you on Discord for some reason!


Ack! Try again: I forgot to update my account name since Discord got rid of # tags. I also put my email up as a fallback.


1) Then more math should get formalised in Lean.

2) How is a solution by LLMs supposed to be verified without such a formalisation?


I was going to write the same thing. I checked the first three problems and all the solutions are partial at best. Now, don't get me wrong, this is still impressive. But presenting the problems with the implication that Qwen solves them correctly, when it doesn't, does not really inspire trust.


Same with the Lusophon Mathematical Olympiad 2023 example.

It makes a bunch of specious claims about parity. (Adding 4 to a number changes its parity? It doesn't: k + 4 ≡ k (mod 2). Therefore the parity of each colour changes on every turn? Therefore the parity will always be the same as it was initially?) And then it concludes that since the parity is right, 2022 of each colour must be a reachable state.

Which, as you say, is quite possibly the correct answer, but it's really weird to put it out there as an example with no comment on the reasoning.


This matches my experience doing RLHF for math for LLMs. So many times the model gives a partial answer: it may find the minimal case, for example, but it doesn't say why that case is minimal, even when the prompt specifically asks for the why.


It's complete nonsense. "The total number of moves in the game is equal to the number of stones initially in the pile, which is 5000."

Similarly, on the Martian question, "If we transform 1 red and 1 green Martian, we get 4 blue Martians. This changes the parity of red and green Martians from even to odd, and the parity of blue Martians from even to odd." is complete nonsense too.

"the sum of the digits of a number k modulo 2 is equivalent to k mod 2" - 12?

Basically all of the "solutions" are regurgitated techniques from math competitions, applied completely incorrectly but with a lot of confidence.



Nobody runs these things at full precision; a 4-bit quant of a 70B-range model fits into 64 GB, though you would need more for long context.


A simple equation to approximate it is `memory_in_gb = parameters_in_billions * (bits/8)`

So at 32 bit full precision, 70 * (32 / 8) ~= 280GB

fp16, 70 * (16 / 8) ~= 140GB

8 bit, 70 * (8 / 8) ~= 70GB

4 bit, 70 * (4 / 8) ~= 35GB

However, in things like llama.cpp the quants are sometimes mixed, so some of the weights are Q5, some Q4, etc.; in that case you usually want to take the higher number.
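The rule of thumb above in a few lines of Python, weights only (no KV cache or activation overhead; the outputs match the numbers listed):

    # Rough memory needed for the model weights alone (no KV cache / activations).
    def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * bits_per_weight / 8

    for bits in (32, 16, 8, 4):
        print(f"70B at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
    # prints ~280, ~140, ~70, ~35 GB respectively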


Well, that, and you also need a fair bit more space for the KV cache, which can be a bit unpredictable. Models without GQA, flash attention, or 4-bit cache support are really terrible in that regard, plus it depends on context length. I haven't found a good rule of thumb for that yet.
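The closest thing I know of is the textbook formula for the cache itself (a sketch; the 80-layer, 128-head-dim, 8-KV-head figures below are typical 70B-class values I'm assuming, not anything Qwen-specific):

    # KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem,
    # per token of context, per sequence in the batch.
    def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    print(kv_cache_gb(80, 8, 128, 32768))   # with GQA (8 KV heads):  ~10.7 GB at fp16
    print(kv_cache_gb(80, 64, 128, 32768))  # without GQA (64 heads): ~86 GB at fp16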


How much Elo on Chatbot Arena does a 4-bit quant lose vs full fp16/bf16?


I think this is the only known benchmark that actually compares different quants: https://oobabooga.github.io/benchmark.html

It's not really all that consistent, but larger models can be compressed more without as much loss.


The first solution (IMO 2002) is completely wrong. It shows that 1, 2, or 3 cubes are not sufficient, and provides an obstruction that doesn't rule out 4 cubes, but it does not prove that there actually are 4 cubes summing to the given number. That part is much harder (and I don't know the true answer).


I think it’s about the same level of difficulty as showing you need at least four, but yeah I agree that what is written is insufficient. One solution is that you can write

2002 = 10^3 + 10^3 + 1^3 + 1^3

Then, multiply through by 2002^2001, which is itself a cube since 2001 is divisible by 3.
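If anyone wants to sanity-check that construction, Python's big ints make it a three-liner (the name c is mine):

    # 2001 = 3 * 667, so 2002^2001 = (2002^667)^3 is itself a cube; write c = 2002^667.
    c = 2002 ** 667
    # 2002 = 10^3 + 10^3 + 1^3 + 1^3, so multiplying through by c^3 gives four cubes.
    assert (10 * c) ** 3 + (10 * c) ** 3 + c ** 3 + c ** 3 == 2002 ** 2002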


Actually, I concede. I just didn't spot that.


These solutions aren't perfect, but imagine how many more people can become mathematicians now that the price of an elite, IMO-medal-winning tutor can be quantified in Hugging Face hosting costs!


I don't think this is a useful tool for becoming a mathematician. Becoming one does not necessarily involve solving these kinds of puzzle exercises.

It's useful for honing your creative thinking and learning how to approach a mathematical problem, but it won't make you a mathematician.


A mathematics tutor that doesn't know how to write proofs, though. Or, to be most charitable, not how human mathematicians write proofs. I'm not talking about using Lean or similar, but about the common parlance and rigor used outside of theorem provers.


I see that they do some decontamination of the datasets, in the hope that the models won't just recite answers from the training data. But in the recent interview with Subbarao Kambhampati on MLST (https://www.youtube.com/watch?v=y1WnHpedi2A) they explain that models fail as soon as one slightly rephrases the test problems (indicating that they are indeed mostly reciting). I expect this to be the case with this model too.


There's a lot of bunko 3rd-time-rephrased "wisdom" about AI, and enough interest in it that if you have the right title, you can get away with repeating it.

I'm a bit surprised that an HN reader, who presumably has first-hand experience with them, isn't sure whether it's just a hashmap lookup.


This is just not true. There are plenty of private tests with problems that are not in the training set. GPQA is an excellent example.


It is obvious that all of these problems are still way too hard for it, although it sometimes has ideas. It flawlessly demonstrates how to simplify 2002^2002 mod 9. I recall that there was once a scandalous university exam for future math teachers in Germany that asked for tasks like that, and everyone failed the test. With Qwen2 at hand this might not have happened.
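(For reference, that simplification is the standard residue argument; this is my restatement, not the model's wording:)

    2002 ≡ 4 (mod 9) and 4^3 = 64 ≡ 1 (mod 9);
    since 2002 = 3*667 + 1, 2002^2002 ≡ 4^2002 ≡ 4 (mod 9).
    Quick check in Python: pow(2002, 2002, 9) == 4.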


Kind of surprised this was released in English first, given it was produced by a Chinese group (Alibaba Cloud). I wonder why that is.


Maybe the training material is also mostly English.


I mean, I think 90% of AI researchers in China use English as the first choice for publication ... so not surprising?


The solution for IMO 2022 is barely a 1/7 solution. It just says '... might not satisfy the inequality for all y' without a proof. That was the whole point of the question.


> This model mainly supports English. We will release bilingual (English and Chinese) math models soon

The irony.


Qwen2 has been quite good, but still can't compare 9.9 and 9.11


Where could I try this?



It appears that no inference provider currently supports the 72B version.




