Not to pour too much cold water on this, but the claim of 100% accuracy has a huge caveat. In the paper (Page 4) they state:
> Interaction. The original question may not be a prompt that synthesizes a program whose execution results in the correct answer. In addition, the answer may require multiple steps with clear plots or other modalities. We therefore may interactively prompt Codex until reaching the correct answer or visualizations, making the minimum necessary changes from the original question.
Which to me basically sounds like they had a human in the loop (that knows how to solve these math problems) that kept changing the question until it gave the correct answer. They do measure the distance (using a sentence embedding model) of the original question to the one that yielded the correct answer, but that feels a bit contrived to me.
Nevertheless, it's still really cool that the correct answer is indeed inside the model.
Proving Douglas Adams correct. The question is harder than the answer.
This makes the "at scale" claim in the abstract clearly false IMO. Any AI system that requires that much human intervention is not scalable. When they have a second AI to produce the prompts automatically from the original questions, then they can claim to have achieved scalability.
But even without that, a system like this can still certainly be useful. And I expect rapid progress in the next few years.
Although the correct answer is also likely on the web. With a suitable search query you would find the correct paper/textbook/wiki page with the right answer. A text-highlighting model could also likely extract this answer from the text. The training probably achieves a good degree of memorization for these known results.
This begs the question, would we be impressed with a similar compression algorithm for storing past web documents?
Well the trivial test to make sure it’s not memorized would be to change constants in the input that alter the correct answer but don’t make the problem any more difficult if it is actually doing the calculation.
But the correct answer isn't inside the model at all, in none of their examples. The correct answer is inside SymPy or NumPy, at least 99% of the time. That is, the model doesn't respond with a demonstration or with the answer itself: it responds with a Python program that poses the given question to SymPy or NumPy, and then they run that program and report the answer.
Here is a basic example:
MIT Course question: Solve each equation for x. ln(x^2 − 1) = 3
Model input: Using Sympy, solve Eq ln(x**2-1)=3 for x.
Model output:
from sympy import *
x = symbols('x')
solve(log(x**2-1) - 3, x)
As you can see, the model has simply translated a mechanized form of the original question to equivalent Python code. The model has 0 idea of how to solve an equation: it's using a symbolic equation solver.
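To make that concrete, here is a minimal sketch of my own (not from the paper) showing that the symbolic solver, not the language model, is where the "mathematics" lives; running the generated snippet is all it takes:

from sympy import symbols, log, solve

# SymPy does the actual equation solving; the model only produced this call.
x = symbols('x')
print(solve(log(x**2 - 1) - 3, x))   # the two roots, ±sqrt(1 + exp(3))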
In other cases, they even "tidy" the original question to a representation of the solution. Here is their example E in Figure 2:
Original question:
> Outside of their humdrum duties as 6.042 TAs, Sayan is trying to learn to levitate using only intense concentration and Jelani is launching a “Nelson 2008” presidential campaign. Suppose that Sayan’s probability of levitating is 1/6, Jelani’s chance of becoming president is 1/4, and the success of one does not alter the other’s chances. If at most one of them succeeds, what is the probability that Jelani becomes the president of the United States?
"Tidyed question":
> Suppose that Sayan’s probability of succeeding is 1/6, and Jelani’s probability of succeeding is 1/4.
> Use numpy to find the probability at most one of Jelani and Sayan succeed.
> Use numpy to find the probability Jelani succeeds but Sayan does not succeed.
> Divide the former by the latter probability.
Model solution:
import numpy as np
# Probability of Sayan succeeding
p_sayan = 1/6
# Probability of Jelani succeeding
p_jelani = 1/4
# Probability of at most one of Jelani and Sayan succeeding
p_at_most_one = 1 - (p_sayan * p_jelani)
# Probability of Jelani succeeding but Sayan not succeeding
p_jelani_succeed_sayan_not = p_jelani * (1 - p_sayan)
# Probability of Jelani succeeding but Sayan not succeeding, divided by probability of at most one of Jelani and Sayan succeeding
p_jelani_succeed_sayan_not / p_at_most_one
# Probability of at least one of Jelani and Sayan succeeding
p_at_least_one = 1 - p_at_most_one
Tidying up the extra verbiage of the question is absolutely fair. But then, they also explain exactly how to compute the result using the data in the question; the model then generates code that perfectly matches the described algorithm. Again, it's not using even the tiniest bit of mathematical understanding.
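For reference, here is a quick sanity check of my own (not the paper's code) of what the described algorithm computes, using exact fractions:

from fractions import Fraction

p_sayan = Fraction(1, 6)
p_jelani = Fraction(1, 4)

p_at_most_one = 1 - p_sayan * p_jelani      # 23/24
p_jelani_only = p_jelani * (1 - p_sayan)    # 5/24
print(p_jelani_only / p_at_most_one)        # 5/23, the conditional probability the original question asks for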
I have browsed their examples, and I have not seen even a single one where the model does more than rephrase the question into a 1:1 Python representation of the question itself.
None of the answers would pass even the simplest undergrad exam. They are literally of the form "how would you solve equation E?" "I would write a program that says sympy.solve(E)".
Well, they do say very clearly that they "solve" problems by program synthesis, and what they describe is perfectly legit program synthesis.
To clarify, program synthesis (or automatic programming) is the task of generating programs from specifications. There are two kinds of program synthesis: deductive program synthesis, from a complete specification of the target program; and inductive program synthesis, or program induction, from an incomplete specification (such as sets of program inputs and outputs, or traces). An example of deductive program synthesis is the generation of low-level code from a high-level language by a compiler.
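To make the inductive case concrete, a toy sketch of my own (not from the paper or any real synthesis system): given only input-output pairs as the incomplete specification, a synthesiser has to search for a program that fits them, for example by enumerating compositions of a few primitives:

import itertools

# Tiny primitive "grammar" for the toy synthesiser.
PRIMITIVES = {
    "x + 1": lambda x: x + 1,
    "x - 1": lambda x: x - 1,
    "x * 2": lambda x: x * 2,
    "x * x": lambda x: x * x,
}

def induce(examples, max_depth=3):
    # Enumerate compositions of primitives until one reproduces every (input, output) pair.
    for depth in range(1, max_depth + 1):
        for combo in itertools.product(PRIMITIVES.items(), repeat=depth):
            def run(v, fns=[f for _, f in combo]):
                for f in fns:
                    v = f(v)
                return v
            if all(run(i) == o for i, o in examples):
                return " then ".join(name for name, _ in combo)
    return None

# The incomplete spec [(1, 4), (2, 9), (3, 16)] induces the program "x + 1 then x * x".
print(induce([(1, 4), (2, 9), (3, 16)]))

The paper's setting is the opposite: the prompt fully specifies the program, so no search over behaviours is needed.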
What the paper describes is a kind of deductive program synthesis from a complete specification in natural language. I suspect the true contribution of the work is the demonstration of using natural language as a complete specification, where earlier work generally only demonstrated the use of natural language as an incomplete specification (for example, comments describing intent rather than implementation) and the combination of natural language with code, as in the original Codex work [Edit: actually, now that I look again, the Codex paper also has examples of comments that fully specify the target program, e.g. in Figure 2: https://arxiv.org/abs/2107.03374; so the work above is fairly incremental].
On the other hand, it's clear to me that the training has made the model memorise answers, and all the work in prompt engineering described under "Workflow" serves to find the right prompts to retrieve the desired memorisations, much like one must fire just the right SQL query to get back the right data. Certainly interesting to see in action and useful for everyday work, but far from "solving" anything in the grandiose way it is announced by the authors (e.g. "These astounding results..." in the "Conclusion" section, etc.).
I was hoping their breakthrough was that they had found a general way to parse conceptual problems into the language of math and logic. That is the truly hard part, and what people spend a lot of time learning to do. Software like Octave and Mathematica can already evaluate tons of things once parsed.
Looking further, it is even worse than that. For example, this question:
> Calculate the probability of getting a two-pair poker hand.
Was rewritten to this, which reads like code translated into human text:
> A hand is a set of 5 cards that are drawn randomly from a standard 52 card deck with 13 ranks of 4 cards each. A two-pair poker hand is a hand that contains 3 unique ranks, where no more than 2 cards in the hand can share the same rank. That is, 3 or more cards cannot share the same rank. Write a program that generates simulations for calculating the average probability of getting a two-pair poker hand.
They even changed the question to allow simulations and not just exact answers. So then the AI is "correct" here when it generates 1000 random samples of taking 5 elements from a set of 52 elements and seeing how many are two pairs.
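For what it's worth, the exact answer is a one-line combinatorial count; a rough sketch of my own (not the paper's code) comparing it with the kind of simulation the rewritten prompt asks for:

from math import comb
import random

# Exact count: choose 2 ranks for the pairs, 2 suits for each pair,
# then any of the remaining 44 cards for the fifth card.
exact = comb(13, 2) * comb(4, 2)**2 * 44 / comb(52, 5)   # ~0.0475

def is_two_pair(hand):
    counts = sorted(hand.count(r) for r in set(hand))
    return counts == [1, 2, 2]

deck = [rank for rank in range(13) for _ in range(4)]    # suits are irrelevant for this count
trials = 100_000
hits = sum(is_two_pair(random.sample(deck, 5)) for _ in range(trials))
print(exact, hits / trials)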
Not sure what the point of this paper is then. It doesn't do anything novel with the mathematics, as it either just uses libraries or doesn't really solve the problem, nor does it do anything novel with text recognition.
Yeah, this isn't an AI that does maths, it's an AI that takes a natural language description of an algorithm and converts it into sympy. Which is still impressive but not very surprising given things like github copilot.
It's just as well. If an artificial neural network was actually doing undergraduate mathematics I would have dropped out of my mathematics PhD yesterday.
> Now the challenge is to mimic the human creating the "right" question with high fidelity so the whole ML system returns appropriate search results.
This was always the hard part. If you can generate the "right" question then you can generate the code, and since there are already many good math libraries for solving undergrad problems that is all you have to do.
Do you accept that the questions are qualitatively different?
I think the first phrasing is asking you to directly use the definition:
f'(x) = \lim_{h\to 0} \frac{f(x+h) - f(x)}{h}
And the second phrasing is asking for the application of some more mechanical rules (e.g. the derivative of a constant/power/scaling or the product rule) or for sympy to apply them.
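For example (my own sketch, not from the paper), both phrasings are mechanical for SymPy, say for f(x) = x**3:

from sympy import symbols, limit, diff

x, h = symbols('x h')
f = x**3

# First phrasing: apply the limit definition of the derivative directly.
print(limit((f.subs(x, x + h) - f) / h, h, 0))   # 3*x**2
# Second phrasing: apply the differentiation rules.
print(diff(f, x))                                # 3*x**2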
FWIW, this is also not what I would expect ‘University-level mathematics’ to mean. The second meaning fits into high school, at least where I grew up. I was expecting proofs to be honest.
No, because literally all their AI does is translate a sufficiently structured question into Python. It's not showing a single ounce of mathematical knowledge, it doesn't even seem to (or need to) know that 1 + 1 = 2. Literally, if you asked it "what is 1 + 1", it would answer with "print(1 + 1)".
It's not doing anything more than that, in any of the examples shown in the paper.
Just from that sentence, you know the paper is suspect.
Whatever the merits of the algorithm -- there's no reason to be believe that what it produces (a set of random auto-generated questions) will be useful for actual education in any way.
There are more red flags -- see autogen question U9, for example:
If f(x) = (x^2 - x) / (x^2 + x), what is f(2)?
When was the last time your human teacher (at university level) handed you a rational polynomial that was not in reduced terms?
And my favorite, X7:
Find the smallest prime number that is divisible by 7 and 13.
The English 'and' doesn't have the same precedence as in software - it actually goes in the opposite direction. What it does in English is this:
Find the smallest prime number that (
(is divisible by 7) and (is divisible by 13)
)
The point I was trying to make is this: the English meaning of this sentence is unambiguous. And it's obvious the algorithm here in no way understands this meaning. It's just dissecting parse trees and hitting the shuffle button.
Yeah, I can see how surprise formulations like these can be useful.
The thing is, it almost certainly wasn't intended by the algorithm in this case. Nor does it seem to have any understanding of what may be "off" about problem statements like these.
The model generated this as a question that the MIT undergrad Number Theory course could ask of its human students - helping develop the course is part of their goal. It's a laughably easy question at first glance, though.
I expected them to have trained a novel network to solve the problems. In fact, it seems they have used OpenAI Codex (a similar network to the one used by Github copilot) to generate the answer.
Their insight was to edit the question-answer pairs to include domain knowledge and solve the questions as python code.
Quite impressive results nevertheless! It makes me wonder whether this kind of problem can be solved using a custom-made solution, or whether training large networks that can generalize from vast quantities of information is the only way. Because if it is, then all these solutions would be monopolized by the large players.
How does it remain impressive? Is this just deference to the institutions of the authors? Or the general aura around AI/ML/DL?
A couple of years ago when FB trained some kind of architecture to compute derivatives by presenting it with (problem, answer) pairs, I called it out as "just rote memorization" and got promptly downvoted. Now here we have people using GPT-3 (a network widely recognized to be just memorizing) for a similar task, and people are still impressed? Like you said, their only insight was to augment the data and translate to a computable (parsable) form.
I'm guessing people don't understand what a papermill is, especially at these schools that have access to the gobs of compute that you need to make these projects go through. It's junk science - probably not reproducible, definitely not extensible, doesn't transfer out of the sample set - purely for the sake of incrementing publication count for all involved (cf. the number of authors on the paper). And before people label me a hater: I speak from experience as someone at one of these types of schools that has their name on several of the same sorts of "turn the crank" papers.
They probably weren't judging the unique contribution here but rather the system as a whole. If you set out to solve this from scratch it would be very difficult. It's just recognizing what has been achieved not just by this group but the entire field which made this possible. This isn't an academic conference paper review session.
This is so charitable thank you. But please tell me are you this charitable when some derivative node package gets published? Or python library? Or when the 100th data science startup is announced?
Not just solving the problems without any mistakes but also generating the Python code that gives the solution; and it can generate new questions on the same material. That's pretty impressive.
But it's not: it's rephrasing the question in terms of Python code that asks SymPy or other libraries to answer the question; or uses very very basic formulas of its own (such as "probability of X not happening = 1 - P(X)"). That's not really impressive, especially since it only achieves this with very explicit prompting, and all intermediary steps covered in the prompts.
Not knocking the significance of this, but there are a lot of 'answers' that would get a straight up F for a university test:
> Original Question: Let r and s denote the two real roots of x^2 − x√5 + 1 = 0. Then determine r^8 + s^8.
> Codex Input: Calculate the roots of $x^2 - x \sqrt{5} + 1 = 0$. Call the two roots r and s. Calculate z = r^8 + s^8. Evaluate z as an integer. Use sympy.
> Codex Output:
> from sympy import *
> x = Symbol('x')
> r = solve(x**2 - x*sqrt(5) + 1, x)[0]
> s = solve(x**2 - x*sqrt(5) + 1, x)[1]
> z = r**8 + s**8
> print(z.evalf())
> Solution: 47
If the solution does not include the derivation of the algebra, it would be marked wrong in university math. For solving these problems in applications, though, this is quite interesting and powerful.
Demonstrating understanding of why we get to the solution IS mathematical reasoning (at this level). This paper demonstrates being able to leap across mathematical reasoning, but not the reasoning itself.
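For reference, the derivation a grader would presumably expect (my own working, not from the paper) is short: by Vieta's formulas, r + s = √5 and rs = 1, so r^2 + s^2 = (r + s)^2 − 2rs = 3, r^4 + s^4 = (r^2 + s^2)^2 − 2(rs)^2 = 7, and r^8 + s^8 = (r^4 + s^4)^2 − 2(rs)^4 = 49 − 2 = 47.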
This isn't solving the problem at all, in my view. It's translating human speech into programmed code; it's sympy that actually solves the problem. As you said, this can be very useful, but knowing how to type the problem into wolfram alpha is not university level math. Problems that can be entered into sympy can be considered "solved", what's remaining (and what they have done) is UX.
Also, this problem arguably is a problem of computation, not math. Math is about finding proofs, which is a much harder class of problems (NP-complete at best, undecidable at worst).
> It's translating human speech into programmed code
That would be very valuable, but it doesn't even do that. First the human researcher translates the original human speech to very structured code like human speech, and then the AI translates the very structured human speech into code.
I wouldn't call this a maths question either. This is arithmetic.
I learned from someone much wiser than I that "Mathematics is the art of avoiding calculation."
This is a first year university math question (taken from an exam paper which I took):
A positron and an electron, each of rest mass m, annihilate, creating two photons each of energy pc. Show that in the frame S in which the electron was at rest, the angle between the directions of motion of the two photons is
2 sin^-1 (mc/p)^0.5
One of the photons then scatters off a second electron, also at rest in S, and subsequently has energy qc. Show that if the photons are now moving in the same direction, then q = p/3.
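(For completeness, a sketch of the first part, my own working: in S the electron contributes energy mc^2 and no momentum, while the positron has energy E and momentum P, so E + mc^2 = 2pc and, by symmetry of the two equal-energy photons about the beam axis, 2p cos(θ/2) = P. Substituting into E^2 = P^2 c^2 + m^2 c^4 gives 4p^2 c^2 sin^2(θ/2) = 4pmc^3, i.e. sin^2(θ/2) = mc/p, hence θ = 2 sin^-1 (mc/p)^0.5.)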
I dislike this line of research. It just demonstrates large models are capable of memorizing a large number of things, without any understanding.
So I tried the first problem in Appendix A on OpenAI playground.
When I use the prompt "# Sketch the graph of the function f(x) = x + |x|", with the model davinci-codex and a larger response length (other parameters as default), the result seems fine: https://pastebin.com/VT8tPbu6
When I change the prompt to "# Sketch the graph of the function f(x) = x + |x| + x*2", it becomes garbage. It changes the prompt to "# Sketch the graph of the function f(x) = x + |x| + x^2 + x^3 + x^4 + x^5 + x^6 + x^7 + x^8 + x^9 + x^10" and then writes new empty comment lines: https://pastebin.com/2bNEuqaH
This is more like solving a programming problem than doing math: it either outsources the work by writing a sympy program, or generates a brute-force or Monte Carlo simulation-based answer.
My biggest gripe with this paper is how unclear it is on its methods. How many rewrites per question? How did they select the solved questions? They state they used a model trained on language and fine-tuned on code, but fine-tuning could describe either their or codex's process.
The biggest hint in favor of fine-tuning is that AFAICT, their dataset adds up to 205 questions but they've solved 235 questions. In which case I'd suspect overfitting on question form. In intro level math, problems are usually in template form and solving them boils down to matching on the correct template and slot filling the answer.
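To sketch what I mean by template matching (a toy of my own, not from the paper): a regex template plus SymPy already handles a whole family of these intro questions:

import re
from sympy import symbols, log, solve, Integer

def solve_log_template(question):
    # Matches the "ln(x^2 - c) = k" family and hands the slots to SymPy.
    m = re.search(r"ln\(x\^2\s*-\s*(\d+)\)\s*=\s*(\d+)", question)
    if not m:
        return None
    c, k = map(Integer, m.groups())
    x = symbols('x')
    return solve(log(x**2 - c) - k, x)

print(solve_log_template("Solve each equation for x: ln(x^2 - 1) = 3"))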
To prove whether it's been fine-tuned, people with davinci-codex access should try to see if they can replicate this.
To prove it's not overfit, authors should release their training dataset if there is one and allow people to test with it.
How many parameters does davinci-codex have? The original Codex was 12 billion IIRC and certainly wasn't this capable.
---
Some of the answers look wrong or incomplete.
Table 221 seems to be missing a factorial; such leniency suggests they must not be scoring this automatically.
Not sure what's going on in Table 144. Table 44 too.
In 42 and 47, particularly 42, the solution program seems incomplete.
211 is impressive; it also writes code for LU decomposition, permutations, Runge-Kutta, card combinatorics and other clever stuff.
Even though for most of the more challenging problems the hard part was pre-digested or, worse, the model was hand-fed the answer's algorithm, there were a few where the network had to have had a solid understanding of the math and code requirements. A program like they claim would revolutionize and hugely simplify mathematically and algorithmically involved programming.
The biggest cause for worry that it might have overfit is that it works with no sampling at all and gets a perfect score (they claim).
They are using OpenAI Davinci-Codex, comparable to GPT3-175B, without any further fine-tuning as far as I can tell. Over-fitting, in the traditional sense, specifically to these questions is very unlikely.
What is problematic is at the bottom of page 6:
> Prompts that result in correct solutions, whether from original or modified prompts, are used for evaluation metrics.
In other words, they reject all results without correct solutions and claim perfect accuracy...
And as you noticed even their evaluation is really not good enough to establish perfect accuracy.
I would prefer if we could have more certainty than that. I have a list of questions.
1) Does no one else have access to this codex? Has anyone tried to replicate this? It should be easy.
2) If it was not fine-tuned, there is still the issue of an over-fit question set, as you say. How easy is it to break? How general is it?
3) Are there really so many extensive examples on the use of sympy? Some of my googling (I know the math but not sympy) did not support this but it could be my lack of familiarity of the library ecosystem.
4) And the biggest issue was when it gave answers to certain limit problems without any computation, which should not have been possible unless it computed them internally, the complexity of which kind of contradicts the entire exercise.
Depending on how generalizable this result is, it could be anything from merely impressive to mind blowing and revolutionary. Imagine if the language is Coq or Lean instead of sympy!
1) I have access to OpenAI Davinci-Codex. I can and I did recreate their results on few prompts.
2) It is not really an issue of how easy it is to break. If it is able to solve the problem it will do it somewhat reliably. The issue is that the questions have to provide the solutions. So, "Solve each equation for x: ln(x**2-1)=3" will just keep on generating more similar problems, but "Using Sympy, solve Eq ln(x**2-1)=3 for x." will work.
For example, you will never get this prompt to work: "Find the limits as x → ∞ and as x → −∞. Use this information, together with intercepts, to give a rough sketch of the graph as in Example 12. y = x^2 (x^2 − 1)^2 (x + 2)", but their step-by-step what-to-do prompt does work.
If you don't know the Python ecosystem and you don't know what the solution should look like, then you are out of luck.
3) Sympy does not seem hard to use. Codex is actually pretty good at writing "template" code. It struggles with anything that requires some level of reasoning instead of matching template to a problem.
4) Could you point me to the exact questions? I can try to recreate it.
OpenAI Codex is actually quite good. It is groundbreaking in comparison to anything before it, but it is far, far, far from perfect. It's like a really bad Stack Overflow, in the sense that if you probe it a couple of times it will provide a reasonable-looking solution that may or may not work.
I would also not focus on the current state. The trajectory of improvement is more important. 5 years ago you could not get anything even comparable to Codex; in the next 5 years it may just be far from perfect, and in 10 years it may actually work. On many metrics, averaged over the years, Deep Learning is improving 2x every year or so.
How do they guarantee that an answer is correct? That requires a small verified kernel in theorem provers. This seems hard to achieve with a neural network. Or is the goal to produce solutions that could be correct / are likely correct?
That's powerful. When I was in uni Wolfram seemed to have a monopoly on this sort of thing. I would have loved to have some free alternative back then.
Maxima has been free software for ages! It has a rubbish UI in my opinion, but back then, just like now, the free options were there; they are less easy to use, though getting better. Both SymPy and Sagemath are interesting to look at.
These are problems from first year intro mathematics courses. For a lot of students these are mostly review from high school.
To me, a “university-level” problem is more like this:
Let W be an infinite-dimensional normed linear space over the real numbers. Use the Baire Category Theorem to show that if W is not countable-dimensional then W is not complete.
The above is a typical problem from the 3rd year real analysis course I took in the fall.
Man, some HN posts bring out the cringiest comments; please do tell us which school has functional analysis as a "typical" topic for juniors taking real analysis. Even if you're getting a second helping of real analysis by then, you're probably looking at Lebesgue integration on R and such, rather than general topological spaces.
I'll never understand why some people try to flex on an anonymous forum.
This is from PMATH 351 at University of Waterloo. Every pure math student takes the course in 3rd year. Lebesgue integration isn't covered until 4th year, though it is not restricted to R at that point.
I'm sorry you think my comment was intended to be a "flex". I was trying to make a point about university mathematics which is this: at university level students should be going beyond solving simple computational problems. Synthesizing a proof requires a higher level of understanding than the application of a standard problem-solving technique. See Bloom's taxonomy [1] for details.
There is this big push in academia to work in AI because funding. Many mathematicians are a bit fed up with AI because the papers in the field tend to be a bit liberal with their claims and are not subject to the same standards as fields such as statistics or numerical analysis.
Many mathematical researchers I have known have abstained from jumping on the AI bandwagon until they see the name of a well-known professor on an AI paper. Then they cave.
This, and a great many papers, is not best presented as a paper. When presented that way it comes off as dishonest. I’m sure the authors know this, but they all also know that they have to churn out papers, even though it cheapens the whole academic enterprise.
This is impressive. I wonder how quickly and how well this could be adapted to other domains in science and engineering. It could be a really useful tool for engineering teams.
Have you looked at the actual questions or prompts? It's doing only trivial translations of the actual input question (which is often arbitrarily more complex than the original textbook question) into Python code. A lot of the time, they don't even attempt exact solutions, they directly ask for iterative approaches:
Original textbook question (number theory course):
> The product of the positive integer divisors of a positive integer n is 729. Find n.
Actual model input/prompt:
> Write a function which finds the divisors (including 1 and itself) of the input and multiplies them together. Iterate over all possible integers until it finds an input where the product is 729.
Model output program:
def divisors(n):
    divisors = []
    for i in range(1, n+1):
        if n % i == 0:
            divisors.append(i)
    return divisors

def product(n):
    return reduce(lambda x, y: x*y, n)

def find_n(n):
    for i in range(1, n):
        if product(divisors(i)) == 729:
            return i

print find_n(1000000)
Not only does the prompt explain exactly what to do, it prescribes an iterative approach instead of the analytical approach you might expect in a math class.
Note: any errors in the program are my own transcribing error.
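As an aside (my own working, not from the paper): the analytical route is a one-liner of number theory, since the product of the divisors of n is n^(d(n)/2), and 729 = 3^6 forces n = 27. A quick check with SymPy:

from math import prod
from sympy import divisors

print(prod(divisors(27)))   # 729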
As noted in another comment, that is legit program synthesis from a complete specification in natural language. The workflow described in the paper and abstracted in your comment can be very useful, as long as the user can have good confidence in the correctness of the results (or the ability to eyeball and correct them as needed).
The problem is that this approach relies on a large language model trained on a copy of the entire web (GPT-3, trained on Common Crawl plus extras), fine-tuned on GitHub (Codex) and then fine-tuned again on the kinds of problems that it is supposed to solve. That is an insane amount of resources to spend to solve simple programming problems with known solutions that can be implemented by hand at much lower cost and effort. In that sense it's a little bit disappointing: so much data, so much compute, and all you can do is tell a computer how to write Python?
> and then fine-tuned again on the kinds of problems that it is supposed to solve
Could you point me to where they claim to have fine-tuned Codex?
From what I can see they claim:
> We use OpenAI’s davinci-codex engine for all of our generations. We fix all of Codex’s hyperparameters to be the same for all experiments: top-p, which is the portion p of the token probability mass a language model samples from at each step, is set to 1, sampling temperature is set to 0 (i.e. argmax), and response length is set to 200 tokens.
BTW - OpenAI Davinci costs $0.06 per 1000 tokens. Codex is currently free in closed beta, but I guess cost will be the same. I would be happy to pay even 100x ($6) for correct solutions to advanced mathematical problems. The issue is that this paper has ridiculous evaluation and Codex does not work anywhere near as good as they claim. It is pure hype by people who do not appear to be affiliated with OpenAI.
This is legit program synthesis for an iterative brute force solution to the problem. But it is in no sense "solving university level mathematical problems". It's not even figuring out that it can brute force it based on the original problem - a human looked at the original problem and told the model how to brute force it, and it did. It's cool that it did, but this achievement has nothing to do with the title and abstract of the paper.
My daughter is going to love this...but I suspect MIT psets may evolve from solving the questions to devising automated methods (like this) to solve them
No they’ll just more heavily weight exams in the final grade. And not doing psets will screw you over in many if not most STEM courses. They’re often essential preparation for exams.